23 November 2011
Averaging this gives 18.4 MB/s as the average rate, with spikes in the 12-hour average to above 80 MB/s. (Individual file transmission rates across the network (excluding overhead) have been seen at over 110 MB/s, which sits well with the 1 Gbps (~125 MB/s) NIC limit on the disk servers in question.)
Now we know that, of the StoRM, dCache, DPM and Castor systems within the UK, Castor tends to have the longest interaction overhead for transfers. The overhead for RAL-RAL transfers over the last week varied between 14 and 196 seconds, with an average of 47 seconds and a standard deviation of 24 seconds.
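As a rough illustration of how per-transfer figures roll up into averages like these, here is a minimal sketch using made-up records (this is not the actual monitoring code, and the record format is an assumption):

```python
# Minimal sketch: given per-transfer records of (bytes moved, transfer time,
# SRM overhead), compute the overall rate and the overhead statistics.
from statistics import mean, pstdev

# Hypothetical per-transfer records: (bytes, transfer_seconds, overhead_seconds)
transfers = [
    (2_000_000_000, 25.0, 31.0),
    (500_000_000, 6.0, 14.0),
    (3_500_000_000, 40.0, 47.0),
]

total_bytes = sum(b for b, _, _ in transfers)
total_time = sum(t + o for _, t, o in transfers)
print("Average rate including overhead: %.1f MB/s" % (total_bytes / total_time / 1e6))

overheads = [o for _, _, o in transfers]
print("Overhead: mean %.0f s, std dev %.0f s" % (mean(overheads), pstdev(overheads)))
```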
12 November 2011
Storage is popular
11 November 2011
RAL T1 copes with ATLAS spike of transfers.
At ~10pm on 10 November 2011 (UTC), ATLAS went from running almost empty to almost full on the FTS channels involving RAL that are controlled by the RAL FTS server. This can be seen in the plot of the number of active transfers:
This was caused by ATLAS suddenly submitting many transfers into the ATLAS FTS, which can be seen in the "Ready" queue:
This led to a high transfer rate, as shown here:
And is also seen in our own internal network monitoring:
The FTS rate is for transfers going through the RAL FTS only (i.e. it does not include puts by the CERN FTS, gets from other T1s, or the chaotic background of dq2-get, dq2-put and lcg-cp traffic not covered in these plots). Hopefully this means our current FTS settings can cope with the start of these ATLAS data transfer spikes. We have seen from previous backlogs that these large spikes lead to a temporary backlog (for a typical-sized spike) which clears well within a day.
25 October 2011
"Georgina's " Travels
~9 months after Georgina's birth she has:
1497 unique children in 79 Houses in a total of 265 rooms.
( Further analysis is hard to describe in an anthropomorphous world since data set replicas would have to involve cloning in Dave and Georgina's world.)
Taking this into account, the distribution of the number of replicas across the 1497 datasets is as follows:
556/1497 datasets only have one copy.
Maximum number of copies of any "data" dataset is 20.
Maximum number of copies of any "group" dataset is 3.
Maximum number of copies of any "user" dataset is 6.
What is of concern to me is that 279 of the 991 user- or group-derived datasets have only one unique copy on the grid.
12 October 2011
Rumours of my storage have been somewhat exaggerated?
/* Convert all kibibyte values in the database to byte values */
UPDATE vmgr_tape_denmap
SET native_capacity = native_capacity * 1024;
UPDATE vmgr_tape_pool
SET capacity = capacity * 1024;
27 September 2011
SRM speedup update
14 September 2011
Bringing SeIUCCR to people
The main idea is that grids extend the types of research people can do, because we enable managing and processing large volumes of data, so we are in a better position to cope with the famous "data deluge." Some people will be happy with the friendly front end in the NGS portal but we also demonstrated moving data from RAL to Glasgow (hooray for dteam) and to QMUL with, respectively, lcg-rep and FTS.
If you are a "normal" researcher (ie not a particle physicist :-)) you normally don't want to "waste time" learning grid data management, but the entry-level tools are actually quite easy to get into, no worse than anything else you are using to move data. And the advanced tools are there if and when you eventually get to the stage where you need them, and they are not that hard to learn: a good way to get started is to go to GridPP and click the large friendly HELP button. NGS also has tutorials (and if you want more tutorials, let us know).
It is worth mentioning that we like holding hands: one thing we have found in GridPP is that new users like to contact their local grid experts - which is also the point of having campus champions. We should have a study at the coming AHM, which makes it even easier to get started. You have no excuse. Resistance is futile.
05 September 2011
New GridPP DPM Tools release: now part of DPM.
I'm happy to announce the release of the next version of the GridPP DPM toolkit, which now includes some tools for complete integrity checking of a disk filesystem against the DPNS database.
The tools can also checksum the files as they go, although this takes a lot longer.
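For illustration, the idea behind such an integrity check looks roughly like the sketch below. This is not the dpm-contrib-admintools code itself; the function names and the shape of the DPNS listing are assumptions:

```python
# Sketch: compare the files on a disk-server filesystem against the replicas
# the DPNS database says should be there, optionally verifying adler32 checksums.
import os
import zlib

def adler32_of(path, chunk=1024 * 1024):
    """Adler32 checksum of a file, read in chunks."""
    value = 1
    with open(path, "rb") as f:
        while True:
            data = f.read(chunk)
            if not data:
                break
            value = zlib.adler32(data, value)
    return "%08x" % (value & 0xFFFFFFFF)

def check_filesystem(mountpoint, dpns_replicas, verify_checksums=False):
    """dpns_replicas: dict of on-disk path -> expected checksum (or None).
    Returns (dark files on disk but not in DPNS, files in DPNS but missing
    on disk, files whose checksum does not match)."""
    on_disk = set()
    for root, _, files in os.walk(mountpoint):
        for name in files:
            on_disk.add(os.path.join(root, name))

    dark = on_disk - set(dpns_replicas)
    missing = set(dpns_replicas) - on_disk
    bad_checksum = []
    if verify_checksums:
        for path in on_disk & set(dpns_replicas):
            expected = dpns_replicas[path]
            if expected and adler32_of(path) != expected:
                bad_checksum.append(path)
    return dark, missing, bad_checksum
```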
The bigger change is that the tools are now provided in the DPM repository, as the dpm-contrib-admintools package. Due to packaging constraints, this RPM installs the tools to /usr/bin, so make sure it is earlier in your path than the old /opt/lcg/bin path...
Richards would like to encourage other groups with useful DPM tools to contribute them to the repo.
04 August 2011
Et DONA ferentis?
dig www.gridpp.ac.uk. A IN
18 July 2011
Storage accounting in OSG and OGF
<StorageElementRecord xmlns:urwg="http://www.gridforum.org/2003/ur-wg">
<RecordIdentity urwg:createTime="2011-07-17T21:18:07Z" urwg:recordId="head01.aglt2.org:544527.26"/>
<UniqueID>AGLT2_SE:Pool:umfs18_3</UniqueID>
<MeasurementType>raw</MeasurementType>
<StorageType>disk</StorageType>
<TotalSpace>25993562993750</TotalSpace>
<FreeSpace>6130300894785</FreeSpace>
<UsedSpace>19863262098965</UsedSpace>
<Timestamp>2011-07-17T21:18:02Z</Timestamp>
<ProbeName>dcache-storage:head01.aglt2.org</ProbeName>
<SiteName>AGLT2_SE</SiteName>
<Grid>OSG</Grid>
</StorageElementRecord>
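Records like this are easy to consume downstream. Here is a small sketch of how one might parse such a record and report how full the pool is (illustrative only, not the actual probe or collector code):

```python
# Parse a StorageElementRecord and report the pool occupancy.
import xml.etree.ElementTree as ET

record_xml = """<StorageElementRecord xmlns:urwg="http://www.gridforum.org/2003/ur-wg">
  <UniqueID>AGLT2_SE:Pool:umfs18_3</UniqueID>
  <TotalSpace>25993562993750</TotalSpace>
  <FreeSpace>6130300894785</FreeSpace>
  <UsedSpace>19863262098965</UsedSpace>
</StorageElementRecord>"""

rec = ET.fromstring(record_xml)
total = int(rec.findtext("TotalSpace"))
used = int(rec.findtext("UsedSpace"))
print("%s is %.1f%% full (%.1f TB of %.1f TB)"
      % (rec.findtext("UniqueID"), 100.0 * used / total, used / 1e12, total / 1e12))
```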
23 June 2011
A Little or a lot. How many transfers should an SRM be handling??
Firstly, transfers into RAL (over the last four weeks):
Source | Transfers into RAL
---|---
TOTAL | 1966315
CA | 12191
CERN | 140494
DE | 27543
ES | 15085
FR | 31555
IT | 15514
ND | 10891
NL | 23515
TW | 6505
UK | 1636099
US | 46923
The numbers of transfers where RAL is the source are:
Destination | Transfers from RAL
---|---
TOTAL | 872941
CA | 39604
CERN | 150286
DE | 66500
ES | 17848
FR | 78635
IT | 57309
ND | 19585
NL | 22602
TW | 37437
UK | 303770
US | 79365
(NB there is a small amount of double counting, as the 59299 RAL-RAL transfers appear in the "UK" value of both tables.) The average file size was 287 MB and the average copy took 80.43 seconds (roughly 3.6 MB/s per transfer). Converting the totals to a daily rate over the four weeks, and comparing with other ATLAS Tier-1s, gives approximately:
100k per day at RAL for ATLAS.
320k per day at BNL for ATLAS.
140k per day at FZK for ATLAS.
150k per day at IN2P3 for ATLAS.
Now the SRM also has to handle files being written into it from the WNs at a site. The number of completed jobs for a selection of T1s is:
18k per day at RAL for ATLAS.
50k per day at BNL for ATLAS.
27k per day at FZK for ATLAS.
15k per day at IN2P3 for ATLAS.
Now each job on average produces two output files, meaning that for RAL roughly 35k of its ~135k daily SRM transfers (~1/4) come from its worker nodes.
UK T2s do approximately 80k transfers per day for ATLAS ( and complete ~50k jobs per day).
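As a back-of-the-envelope check of the figures above (all numbers are taken straight from the tables and text, assuming a 28-day window):

```python
# Rough check: daily SRM transfer rate at RAL and the fraction coming from WNs.
DAYS = 28
into_ral = 1_966_315      # transfers with RAL as destination (four weeks)
out_of_ral = 872_941      # transfers with RAL as source (four weeks)
per_day = (into_ral + out_of_ral) / DAYS
print("SRM transfers per day at RAL: ~%.0fk" % (per_day / 1000))        # ~101k

jobs_per_day = 18_000     # completed ATLAS jobs per day at RAL
files_per_job = 2         # average output files per job
wn_transfers = jobs_per_day * files_per_job
print("Fraction of SRM load from worker nodes: %.2f"
      % (wn_transfers / (per_day + wn_transfers)))                      # ~0.26
```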
14 June 2011
FTS overhead factors and how to try and improve rates.
1- Speed up the data transfer phase by changing network and host settings on the disk servers. Mainly this has followed the advice of the good people at LBNL who work on:
http://fasterdata.es.net/
2- The other area was to look at the overhead in the SRM and its communication with the FTS service.
So we had a tinker with the number of files and the number of threads on an FTS channel, and got some improvement in overall rate on some channels. But as part of improving single-file transfer rates (part of our study to help the ATLAS VO SONAR test results) we started to look into the overhead of the prepare-to-get and prepare-to-put steps in the source and destination SRMs.
We have seen in the past that synchronous (rather than asynchronous) getTURL calls were quicker, but what we also noticed is that, within an FTS transfer, the sum of the time to prepareToGet and prepareToPut varied greatly between channels. There is a strong correlation between this time and the SRM involved at each end of the transfer. Transfers with CASTOR as the destination SRM (prepareToPut) were regularly taking over 30 s to prepare (and CASTOR regularly took 20 s to prepare as a source site). Hence we started to look into ways of reducing the effective "prepare to transfer" overhead for each file.
Looking at new improvements and options in the FTS, we discovered (or rather were pointed at) the following decoupling of the SRM preparation phase and the transfer phase:
https://twiki.cern.ch/twiki/bin/view/EGEE/FtsRelease22
Now it was pointed out to me by my friendly SRM developer that there is a timeout (of 180 seconds) which will fail a transfer if that much time elapses between the end of the prepare phase and the start of the actual transfer on the disk server. Therefore we wanted to try this new functionality on transfers which:
1- Have a large ratio of preparation time to transmission time (i.e. either CASTOR as the destination or the source).
2- Mostly take less than 180 seconds per transfer (either small files or fast connections).
Looking at the value of ({Preparation Time} + {Transmission Time}) / {Transmission Time}, we got the following values.
Channel ratios for ATLAS, (CMS) and {LHCb}
<T2Ds-UKT2s>=2.2
<T2Ds-RAL>=7.5
<*-RAL>=3.1 (1.2)
<*-UKT2s>=6.7 (1.01)
<"slow transfer sites">=1.38 (1.02)
This showed that UKT2s-RAL transfers for ATLAS met these criteria, so we have now turned this on (it seems to add ~1.5 seconds to each transfer, so you may only want to set this Boolean to true on channels where you intend to change the ratio), and we have set the ratio of SRM prepares to transfers to 2.5 for all UKT2s-RAL channels. No problem with transfers timing out has been seen, and we have been able to reduce the number of concurrent file transfers without reducing the overall throughput.
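A sketch of the selection logic described above (the record format and the thresholds are assumptions, and this is not the FTS code itself):

```python
# Flag channels where decoupling the SRM prepare phase from the transfer
# phase looks worthwhile: a large prepare/transfer ratio, and transfers that
# mostly finish well inside the 180 s prepare-to-transfer timeout.
TIMEOUT = 180.0  # seconds allowed between end of prepare and start of transfer

def prep_ratio(prep_s, transfer_s):
    return (prep_s + transfer_s) / transfer_s

def decoupling_candidate(samples, min_ratio=2.0):
    """samples: list of (prepare_seconds, transfer_seconds) for one channel."""
    ratios = [prep_ratio(p, t) for p, t in samples]
    mean_ratio = sum(ratios) / len(ratios)
    mostly_fast = sum(1 for _, t in samples if t < TIMEOUT) / len(samples) > 0.5
    return mean_ratio >= min_ratio and mostly_fast

# e.g. a channel with ~30-40 s of SRM prepare per 12-20 s transfer:
print(decoupling_candidate([(35.0, 15.0), (30.0, 12.0), (40.0, 20.0)]))  # True
```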
13 June 2011
Dave is ageing ( but not forgotten), Hello to Georgina
Georgina/Dave numbers are:
973/103 Luminosity blocks=> 9 times more blocks.
12,140,770/1,101,123 events => 11.02 times the events.
203.6/31.8 Hz event rate=> 6.4 times the rate.
16hrs31'46"/9hrs36'51" Beam time=> 1.7 times greater than Dave.
2.01e4/7.72e-3 of integrated luminosity=> 2.6M times the data.
16219.5/5282.5TB of all RAW datasets=> 3 times the volume.
15200/3831 files of all RAW datasets=> 4 times the number of files (so are the files smaller?).
0.541/3.127TB for the MinBias subset=> 0.32 times the volume.
977/1779 files for the MinBias subset=> 0.55 times the number of files.
So it appears from this comparison that the average file size is 3.5 times smaller for the MinBias subset...
For those interested: if my ATLAS "DNA" is 2.3.7.3623, then Georgina's is 3.3.3.3.5.5.89.
Of course, what you really want to know (or not) is where in the world Georgina and her relations are, and what her birthday calendar looks like. My avatar is interested to find out whether, since Georgina is so much bigger than I am, she will have more children in more rooms, and how long they will last...
31 May 2011
Decimation of children.
I now only have 11.76 TB of unique data including myself ( 84996 unique files in 334 datasets).
In total there are only 732 datasets now.
The number of replicas is dramatically reducing, but some children are still popular.
# Replicas | # Datasets
---|---
1 | 236
2 | 38
3 | 10
4 | 24
5 | 5
6 | 6
7 | 3
8 | 1
9 | 1
10 | 2
12 | 1
16 | 1
20 | 1
22 | 1
24 | 1
26 | 1
27 | 1
28 | 1
I.e. only 98 of the 334 datasets now have more than one copy.
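A quick check of that number from the replica distribution table above (the values are simply copied from the table; nothing new is measured here):

```python
# Verify that 98 of the 334 datasets have more than one copy.
replica_counts = {1: 236, 2: 38, 3: 10, 4: 24, 5: 5, 6: 6, 7: 3, 8: 1, 9: 1,
                  10: 2, 12: 1, 16: 1, 20: 1, 22: 1, 24: 1, 26: 1, 27: 1, 28: 1}

total_datasets = sum(replica_counts.values())                                  # 334
with_extra_copies = sum(n for reps, n in replica_counts.items() if reps > 1)   # 98
print(total_datasets, with_extra_copies)
```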
My birthday calendar has also changed. The new birthday calendar looks like this:
This shows that, other than the datasets produced within the first 12 weeks of life, at least 659 out of 718 datasets from the last year have had all copies deleted. It also shows that there has been a hive of activity, with new datasets produced (a re-processing) in time for the important "Moriond" conference. It also shows that the majority of datasets are only useful for less than one year. (But it is noticeable that the files produced first seem to be the longest lived.)
27 May 2011
At last - accurate tape accounting?
The code accounts for data compression as data goes to tape, and estimates the free space on a tape by assuming that other data going to the same tape will compress at the same ratio. Also, as requested by ATLAS, there is accounting for "unreachable" data, i.e. data which can't be read because the tape is currently disabled, or free space which can't be used because the tape is read-only.
All the difficult stuff should now be complete: the restructuring of the internal objects to make the code more maintainable, and the nearline (aka tape) accounting. Online accounting will stay as it is for now.
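A minimal sketch of the free-space estimate described above (illustrative only, not the actual accounting code; the function and its arguments are assumptions):

```python
# Estimate free space on a tape by assuming data still to be written will
# compress at the same ratio as the data already on it.
def estimated_free_bytes(nominal_capacity, logical_written, physical_used):
    """nominal_capacity, physical_used: bytes of raw tape media;
    logical_written: bytes of user data already stored on the tape."""
    if physical_used == 0:
        return nominal_capacity  # nothing written yet, no ratio to go on
    compression_ratio = logical_written / physical_used
    return (nominal_capacity - physical_used) * compression_ratio

# A tape that has squeezed 3 TB of user data into 2 TB of media so far:
print("%.2f TB free" % (estimated_free_bytes(5e12, 3e12, 2e12) / 1e12))  # 4.50 TB
```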
11 April 2011
Summary of GridPP storage workshop at GridPP26
The full agenda is here (scroll down to "end of main meeting" :-), with presentations attached: http://www.gridpp.ac.uk/gridpp26/
05 April 2011
GridPP26 Pictures
30 March 2011
Happy Birthday!!
Of these 131 rooms, 41 have only 1 resident, but the top four rooms have 561, 508, 472 and 192 residents!
09 March 2011
How to build a Data Grid
15 February 2011
ATLAS Tape Store file sizes
Minimum file size stored for both DATA and MC is 20 bytes
Maximum file size stored is: [12963033040, 11943444017] bytes or [13 GB, 11.9 GB].
Average file size is: [1791770811, 637492224] bytes or [1.79 GB, 637 MB].
The median file sizes for [DATA, MC] are [2141069172, 602938704] bytes ([2.1 GB, 602 MB]).
Number of files stored is [282929, 672707] [DATA, MC] files, for a total size of [506943923754228, 428845482035182] bytes or [507 TB, 428 TB].
[37, 5687] files in [DATA, MC] are zero-sized (but we don't have to worry about these, as the tape system does not copy zero-size files to tape).
They are less of a worry than the [538, 537] 20-byte files which have been migrated to tape (these are zero-byte log files which were tarred and gzipped before being written into Castor).
The modal file sizes are [26860316, 0] bytes, with [286, 5687] files of those sizes. The 26860316-byte files are most likely failed transfers; the next modal file size, with 537 entries, is 20 bytes, but those are just the test files mentioned above. The first genuine modal file sizes, with 13 files each, are 19492 and 19532 bytes.
Meanwhile, [254040, 626556] files have a unique file size (which equates to only [89.8, 93.1] percent of files having a unique file size, so checksums are important!!).
It could be worse though: one VO successfully stored a file that is one byte in size (the header and footer on the tape file, plus the compression, actually increased the size of the file as stored on tape...).
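For what it's worth, the statistics above are straightforward to reproduce from a plain list of file sizes. A sketch (the input format is an assumption; this is not the query actually run against the name server):

```python
# Summarise a list of file sizes (bytes): min, max, mean, median, mode,
# zero-sized count and the fraction of files with a unique size.
from statistics import mean, median
from collections import Counter

def size_report(sizes):
    counts = Counter(sizes)
    modal_size, modal_count = counts.most_common(1)[0]
    unique = sum(1 for c in counts.values() if c == 1)
    return {
        "min": min(sizes),
        "max": max(sizes),
        "mean": mean(sizes),
        "median": median(sizes),
        "mode": (modal_size, modal_count),
        "zero_sized": sizes.count(0),
        "unique_size_fraction": unique / len(sizes),
    }

print(size_report([0, 20, 20, 26860316, 26860316, 2141069172]))
```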
07 February 2011
Get yer clouds here
26 January 2011
Dirk gets mentioned in Nature
So at least one person other than my avatar is aware of my existence. One of my children is mentioned in the article (even though the majority of the article is about me and ALL my children; can't be having favourites amongst them now, can I??).
http://www.nature.com/news/2011/110119/full/469282a.html
A blog post about the article can be seen here:
http://blogs.nature.com/news/thegreatbeyond/2011/01/travelling_the_petabyte_highwa_1.html
ATLAS have also now changed the way they send my children out. Interestingly, I am now in 70 of 120 houses. The breakdown of where these rooms are is as follows:
9 rooms at BNL-OSG2.
8 rooms at CERN-PROD.
6 rooms at IN2P3-CC.
4 rooms at SLACXRD, LRZ-LMU, INFN-MILANO-ATLASC and AGLT2.
3 rooms at UKI-NORTHGRID-SHEF-HEP, UKI-LT2-QMUL, TRIUMF-LCG2, SWT2, RU-PROTVINO-IHEP, RAL-LCG2, PRAGUELCG2, NDGF-T1, MWT2, INFN-NAPOLI-ATLAS and DESY-HH.
2 rooms at WUPPERTALPROD, UNI-FREIBURG, UKI-SCOTGRID-GLASGOW, UKI-NORTHGRID-MAN-HEP, TW-FTT, TOKYO-LCG2, NIKHEF-ELPROD, NET2, MPPMU, LIP-COIMBRA, INFN-T1, GRIF-LAL, FZK-LCG2 and DESY-ZN.
1 room at AUSTRALIA-ATLAS, WISC, WEIZMANN-LCG2, UPENN, UNICPH-NBI, UKI-SOUTHGRID-RALPP, UKI-SOUTHGRID-OX-HEP, UKI-SOUTHGRID-BHAM-HEP, UKI-NORTHGRID-LIV-HEP, UKI-NORTHGRID-LANCS-HEP, UKI-LT2-RHUL, TAIWAN-LCG2, SMU, SFU-LCG2, SARA-MATRIX, RU-PNPI, RRC-KI, RO-07-NIPNE, PIC, NCG-INGRID-PT, JINR-LCG2, INFN-ROMA3, INFN-ROMA1, IN2P3-LPSC, IN2P3-LAPP, IN2P3-CPPM, IL-TAU-HEP, ILLINOISHEP, IFIC-LCG2, IFAE, HEPHY-UIBK, GRIF-LPNHE, GRIF-IRFU, GOEGRID, CSCS-LCG2, CA-SCINET-T2, CA-ALBERTA-WESTGRID-T2 and BEIJING-LCG2.
This is in total 139 of the 781 rooms that I have. The numbers and types of rooms are:
56 rooms of type DATADISK.
26 rooms of type LOCALGROUPDISK.
17 rooms of type SCRATCHDISK.
7 rooms of type USERDISK.
5 rooms of type PHYS-SM and PERF-JETS.
4 rooms of type PERF-FLAVTAG.
3 rooms of type PERF-MUONS, PERF-EGAMMA, MCDISK, DATATAPE and CALIBDISK.
1 room of type TZERO, PHYS-HIGGS, PHYS-BEAUTY and EOSDATADISK.
11 January 2011
Who cares about TCP anyway....
So as part of my work to look at how to speed up individual transfers, I thought I would go back and look to see what the effect of changing some of our favourite TCP window settings would be. These are documented at http://fasterdata.es.net/TCP-tuning/
Our CMS instance of Castor is nice since CMS have a separate disk pool for incoming WAN transfers, another for outgoing WAN transfers, and another for internal transfers between the WNs and the SE. This is a great feature, as it means the disk servers in WanIn and WanOut will never have hundreds of local connections (a worry I have about setting the TCP buffers too high), so we experimented to see what the effect of changing our TCP settings would be.
I decided to study intercontinental transfers, as these have large RTTs and are the most likely to benefit from tweaking. Our settings before the change were a 64 kB default and a 1 MB maximum window size.
This led to a maximum rate per transfer of ~60 MB/s and an average of ~7.0 MB/s.
This appears to be hardware dependent across the different generations of kit.
We changed the settings to 128 kB and 4 MB. This led to an increase to a ~90 MB/s maximum data transfer rate per transfer and an average of ~11 MB/s, so roughly a 50% increase in performance. This might not seem a lot, since we doubled and quadrupled our settings; however, further analysis improves matters: changing TCP settings only helps transfers where the settings at RAL were the bottleneck.
For channels where the settings at the source site are already the limiting factor, these changes have a limited effect. However, looking at transfers from FNAL to RAL for CMS, we see a much greater improvement.
Before the tweak the maximum file transfer rate was ~20 MB/s with an average of 6.2 MB/s; after the TCP tweak these increased to 50 MB/s and 12.9 MB/s respectively.
Another set of sites where the changes dramatically helped were transfers from the US Tier-2s to RAL (over the production network rather than the OPN). Before the tweaks these transfers peaked at 10 MB/s and averaged 4.9 MB/s. After the tweaks, these values were 40 MB/s and 10.8 MB/s respectively.
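A rough bandwidth-delay-product calculation shows why the window change matters most on long paths (the RTT values below are illustrative guesses, not measurements): a single TCP stream is limited to roughly the window size divided by the round-trip time.

```python
# Upper bound on a single TCP stream: window_bytes / RTT.
def max_rate_mb_s(window_bytes, rtt_ms):
    return window_bytes / (rtt_ms / 1000.0) / 1e6

for rtt in (20, 90):  # e.g. intra-Europe vs transatlantic round-trip times
    print("RTT %3d ms: 1 MB window -> %5.1f MB/s, 4 MB window -> %5.1f MB/s"
          % (rtt, max_rate_mb_s(1 << 20, rtt), max_rate_mb_s(1 << 22, rtt)))
```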
Now putting all these values into a spreadsheet and looking at other values we get:
Solid Line is Peak. Dotted line is average.
Green is total transfers.
Red is transfers from FNAL.
Blue is transfers to US T2 sites.
Tests on a pre-production system at RAL also show that the effects of these changes on LAN transfers are acceptable.