23 June 2011

A little or a lot: how many transfers should an SRM be handling?

The new ATLAS dashboard (version 2.0) now allows for better analysis of data flows. For the RAL T1 ATLAS endpoint of CASTOR, the breakdown of the number of successful transfers from across the world, both to and from RAL, is as follows.
Firstly, into RAL (over the last four weeks):












Unsurprisingly, the majority of transfers are from within the UK (due to the UK Tier 2s). However, 16.8% of transfers in are from outside the UK. (3%, or 59k, of transfers are internal RAL-RAL transfers.)

The numbers of transfers for which RAL is the source are:












(NB: there is a small amount of double counting, as the 59299 RAL-RAL transfers appear in both sets of figures in the "UK" values.) The average file size was 287 MB and the average file took 80.43 seconds to copy.
In terms of transfers per day, a selection of T1s handle:
100k per day at RAL for ATLAS.
320k per day at BNL for ATLAS.
140k per day at FZK for ATLAS.
150k per day at IN2P3 for ATLAS.

The SRM also has to handle files being written into it from the worker nodes (WNs) at a site. The number of completed jobs per day for a selection of T1s is:
18k per day at RAL for ATLAS.
50k per day at BNL for ATLAS.
27k per day at FZK for ATLAS.
15k per day at IN2P3 for ATLAS.

Each job produces two output files on average, meaning that for RAL ~35k of its ~135k daily SRM transfers (~1/4) come from its worker nodes.
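The same back-of-envelope estimate can be repeated for the other T1s. This is only a rough sketch: it assumes the per-day transfer figures above exclude worker-node writes, and that the two-output-files-per-job average holds at every site. The dictionary and function names are made up for illustration.

```python
# Figures quoted in this post, in thousands per day, ATLAS only.
DASHBOARD_TRANSFERS_K = {"RAL": 100, "BNL": 320, "FZK": 140, "IN2P3": 150}
COMPLETED_JOBS_K = {"RAL": 18, "BNL": 50, "FZK": 27, "IN2P3": 15}

# Average number of output files per job quoted in the post; assumed here
# to apply equally to every site.
OUTPUT_FILES_PER_JOB = 2


def worker_node_share(site):
    """Fraction of a site's SRM transfers that are worker-node writes."""
    wn_writes_k = COMPLETED_JOBS_K[site] * OUTPUT_FILES_PER_JOB
    total_k = DASHBOARD_TRANSFERS_K[site] + wn_writes_k
    return wn_writes_k / total_k


if __name__ == "__main__":
    for site in DASHBOARD_TRANSFERS_K:
        share = worker_node_share(site)
        print(f"{site}: {share:.0%} of SRM transfers come from worker nodes")
```

For RAL this uses 36k/136k rather than the rounded 35k/135k, which still comes out at roughly a quarter.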

UK T2s do approximately 80k transfers per day for ATLAS (and complete ~50k jobs per day).

14 June 2011

FTS overhead factors and how to try and improve rates.

Within the UK we have been trying to speed up transfer rates. This has been a twofold approach.
1- Speed up the data transfer phase of the file by changing network and host settings on a disk server. Mainly this has meant following the advice in the work of the good people at LBNL on:


2- The other area was to look at the overhead in the SRM and its communication with the FTS service.

So we had a tinker with the number of files and number of threads on an FTS channel and got some improvement in overall rate for some channels. But as part of improving single-file transfer rates (part of our study to help the ATLAS VO SONAR test results), we started to look into the overhead of prepare-to-get and prepare-to-put in the source and destination SRMs.

We have seen in the past that synchronous (rather than asynchronous) getTURL was quicker, but what we also noticed was that, within an FTS transfer, the sum of the time taken by preparetoGET and preparetoPUT varied greatly between channels. There is a strong correlation between this amount of time and the SRM involved at each end of the transfer. Transfers which involved CASTOR as the destination SRM (preparetoPUT) were regularly taking over 30s to prepare (and regularly taking 20s to prepare as a source site). Hence we started to look into a way of reducing the effective overhead of "prepare to transfer" for each file.
Looking at new improvements and options in the FTS, we discovered (or rather were pointed at) the following decoupling of the SRM preparation phase and the transfer phase:


Now it was pointed out to me by my friendly SRM developer that there is a timeout (of 180 seconds) which will fail a transfer if this time elapses between the end of the prepare phase and the start of the actual transfer on the disk server. Therefore we wanted to try this new functionality on transfers which:
1- Had a large ratio of preparation time to transmission time (i.e. either CASTOR as a destination or source).
2- Had the majority of transfer times per transfer less than 180 seconds (either small files or fast connections).

Looking at the value of ({Preparation Time} + {Transmission Time}) / {Transmission Time}, we got the following values.
Channel ratios for ATLAS, (CMS) and {LHCb}:
<UKT2s-RAL>=15.1 (2.7)
<RAL-UKT2s>=5.5 (1.9)
<T1s-RAL>=4.3 (1.2) {8.6}
<*-RAL>=3.1 (1.2)
<*-UKT2s>=6.7 (1.01)
<"slow transfer sites">=1.38 (1.02)
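As a rough illustration of how these ratios and the two selection criteria fit together, here is a minimal sketch. The function names and the `min_ratio` threshold are hypothetical (no threshold is stated here); only the ratio definition and the 180 second timeout come from the text above.

```python
# Timeout between the end of the prepare phase and the start of the
# transfer, as quoted in the post.
SRM_TIMEOUT_S = 180.0


def overhead_ratio(preparation_s, transmission_s):
    """(preparation + transmission) / transmission for one file."""
    if transmission_s <= 0:
        raise ValueError("transmission time must be positive")
    return (preparation_s + transmission_s) / transmission_s


def preparation_per_transmission(ratio):
    """Invert the ratio: seconds of preparation per second of transmission."""
    return ratio - 1.0


def is_candidate(preparation_s, transmission_s, min_ratio=2.0):
    """Check both criteria for a typical transfer on a channel.

    min_ratio is an illustrative cut-off, not a value from the post.
    """
    high_overhead = overhead_ratio(preparation_s, transmission_s) >= min_ratio
    fits_timeout = transmission_s < SRM_TIMEOUT_S
    return high_overhead and fits_timeout


if __name__ == "__main__":
    # Hypothetical file: 30 s of SRM preparation, 10 s on the wire.
    print(overhead_ratio(30.0, 10.0))          # 4.0
    print(is_candidate(30.0, 10.0))            # True
    # A slow transfer fails the timeout criterion whatever its overhead.
    print(is_candidate(30.0, 600.0))           # False
    # The UKT2s-RAL ATLAS ratio of 15.1 expressed as prep per unit transfer.
    print(round(preparation_per_transmission(15.1), 1))
```

Read this way, the UKT2s-RAL ATLAS value of 15.1 means a typical file spends about 14 times longer in preparation than in actual transmission.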

This showed that UKT2s-RAL transfers for ATLAS met these criteria, so we have now turned this on (which seems to add ~1.5 seconds to each transfer, so you might only want to set this Boolean to true for channels where you intend to change the ratio), and we have set the ratio of SRM prepares to transfers to 2.5 for all UKT2s-to-RAL channels. No problem with jobs timing out has been seen, and we have been able to reduce the number of concurrent file transfers without reducing the overall throughput.
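A toy steady-state model gives a feel for why fewer transfer slots can sustain the same throughput once preparation is decoupled. It is not how the FTS works internally. It assumes that a ratio of 2.5 means 2.5 files in SRM preparation per transfer slot, that every file has the same preparation and transmission time, and that the ~1.5 s cost is paid on every transfer. All names and the example timings are hypothetical.

```python
def coupled_rate(slots, prep_s, transfer_s):
    """Files per second when each slot prepares and then transfers a file."""
    return slots / (prep_s + transfer_s)


def decoupled_rate(slots, prep_s, transfer_s, prepare_ratio=2.5, extra_s=1.5):
    """Files per second when preparation runs ahead of the transfer slots.

    The rate is capped by whichever stage is slower: the transfer slots
    (each file costs transfer_s plus the extra per-transfer overhead) or
    the prepare stage (prepare_ratio * slots files in preparation at once).
    """
    transfer_limit = slots / (transfer_s + extra_s)
    prepare_limit = prepare_ratio * slots / prep_s
    return min(transfer_limit, prepare_limit)


if __name__ == "__main__":
    # Hypothetical channel: 30 s preparation, 10 s transmission per file.
    print(coupled_rate(10, 30.0, 10.0))    # ten coupled slots
    print(decoupled_rate(10, 30.0, 10.0))  # ten decoupled slots
    print(decoupled_rate(3, 30.0, 10.0))   # three decoupled slots
```

With these made-up timings, three decoupled slots match the rate of ten coupled ones. The model says nothing about the 180 second timeout, which constrains how far ahead of the transfers the prepares can safely run.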

13 June 2011

Dave is ageing (but not forgotten); hello to Georgina

Well, I am not actually dead, but my importance is receding. They say a week is a long time in politics; well, one day at the LHC is not like all the others. I was one of the early runs from 2010. My 2011 compatriots are now a lot larger. Take a comparison between me and my new friend "Georgina".

Georgina/Dave numbers are:
973/103 luminosity blocks => about 9 times the blocks.
12,140,770/1,101,123 events => 11.03 times the events.
203.6/31.8 Hz event rate => 6.4 times the rate.
16hrs31'46"/9hrs36'51" beam time => 1.7 times greater than Dave.
2.01e4/7.72e-3 of integrated luminosity => 2.6M times the data.
16219.5/5282.5 TB of all RAW datasets => 3 times the volume.
15200/3831 files of all RAW datasets => 4 times the number of files (so the files are smaller?).
0.541/3.127 TB for the MinBias subset => 0.17 times the volume.
977/1779 files for the MinBias subset => 0.55 times the number of files.
So it appears for this comparison that the file size is about 3.2 times smaller for the MinBias subset...
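For anyone who wants to check the arithmetic, the ratios can be recomputed directly from the raw figures listed above. This is just a sketch; the dictionary keys are made-up labels and the units are taken as quoted.

```python
def hms_to_seconds(hours, minutes, seconds):
    """Convert a beam time given as h, m, s into seconds."""
    return hours * 3600 + minutes * 60 + seconds


# Raw figures as listed in this post.
GEORGINA = {
    "lumi_blocks": 973,
    "events": 12_140_770,
    "event_rate_hz": 203.6,
    "beam_time_s": hms_to_seconds(16, 31, 46),
    "raw_volume_tb": 16219.5,
    "raw_files": 15200,
    "minbias_volume_tb": 0.541,
    "minbias_files": 977,
}
DAVE = {
    "lumi_blocks": 103,
    "events": 1_101_123,
    "event_rate_hz": 31.8,
    "beam_time_s": hms_to_seconds(9, 36, 51),
    "raw_volume_tb": 5282.5,
    "raw_files": 3831,
    "minbias_volume_tb": 3.127,
    "minbias_files": 1779,
}


def ratio(key):
    """Georgina's value divided by Dave's for one quantity."""
    return GEORGINA[key] / DAVE[key]


def mean_file_size(run, subset):
    """Average file size (same unit as the volume) for 'raw' or 'minbias'."""
    return run[f"{subset}_volume_tb"] / run[f"{subset}_files"]


if __name__ == "__main__":
    for key in GEORGINA:
        print(f"{key}: Georgina/Dave = {ratio(key):.2f}")
    shrink = mean_file_size(DAVE, "minbias") / mean_file_size(GEORGINA, "minbias")
    print(f"MinBias files are {shrink:.2f} times smaller for Georgina")
```

From the listed volumes and file counts, the MinBias volume ratio comes out near 0.17 and the mean MinBias file size is roughly 3.2 times smaller for Georgina.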

For those interested, if my ATLAS "DNA" is; then Georgina's is
Of course, what you really want to know (or not) is where in the world Georgina and her relations are, and what her birthday calendar looks like. My avatar is interested to find out whether, since Georgina is so much bigger than I am, she will have more children in more rooms, and how long they will last...