14 June 2011

FTS overhead factors and how to try and improve rates.

Within the UK we have been trying to speed up transfer rates, using a twofold approach.
1- Speed up the data transfer phase of each file by changing network and host settings on the disk servers. This has mainly meant following the advice of the good people at LBNL at:

http://fasterdata.es.net/

2- Look at the overhead in the SRM and its communication with the FTS service.

So we had a tinker with the number of files and number of threads on an FTS channel and got some improvement in overall rate for some channels. But as part of improving single-file transfer rates (part of our study to help the ATLAS VO SONAR test results), we started to look into the overhead of prepare to get and prepare to put at the source and destination SRMs.

We have seen in the past that synchronous (rather than asynchronous) getTURL was quicker, but what we noticed this time was that, within an FTS transfer, the sum of the time to preparetoGET and preparetoPUT varied greatly between channels. There is a strong correlation between this time and the SRM involved at each end of the transfer. In particular, transfers with CASTOR as the destination SRM (preparetoPUT) were regularly taking over 30s to prepare (and regularly 20s to prepare with CASTOR as the source). Hence we started to look for a way of reducing the effective "prepare to transfer" overhead for each file.
Looking at new improvements and options in the FTS, we discovered (or rather, were pointed at) the following decoupling of the SRM preparation phase and the transfer phase:

https://twiki.cern.ch/twiki/bin/view/EGEE/FtsRelease22

Now it was pointed out to me by my friendly SRM developer that there is a timeout (of 180 seconds) which will fail a transfer if this time elapses between the end of the prepare phase and the start of the actual transfer on the disk server. Therefore we wanted to try this new functionality on transfers which:
1- Had a large ratio of preparation time to transmission time (i.e. either CASTOR as a destination or source).
2- Had the majority of transfer times per transfer under 180 seconds (either small files or fast connections).
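
As a rough illustration, the two criteria above can be written as a small check over per-file timings for a channel. This is only a sketch: the function name, the input format (two lists of seconds per file) and the `min_prep_to_trans` threshold are assumptions made for the example, not anything taken from FTS itself. Only the 180 second timeout comes from the text.

```python
PREPARE_TIMEOUT_S = 180.0  # timeout between end of prepare and start of transfer


def is_decoupling_candidate(prep_times, trans_times,
                            min_prep_to_trans=1.0,
                            timeout_s=PREPARE_TIMEOUT_S):
    """Return True if a channel meets both criteria.

    prep_times / trans_times: per-file preparation and transmission
    times in seconds (hypothetical input format).
    min_prep_to_trans: hypothetical threshold for what counts as a
    "large" amount of preparation time relative to transmission time.
    """
    if not prep_times or len(prep_times) != len(trans_times):
        raise ValueError("need matching, non-empty timing lists")
    total_trans = sum(trans_times)
    if total_trans <= 0:
        raise ValueError("total transmission time must be positive")

    # Criterion 1: preparation time is large compared with transmission time.
    prep_heavy = sum(prep_times) / total_trans >= min_prep_to_trans

    # Criterion 2: the majority of transfers finish within the timeout.
    fast = sum(1 for t in trans_times if t < timeout_s)
    mostly_fast = fast > len(trans_times) / 2

    return prep_heavy and mostly_fast
```

For example, a channel with three files that each take about 30s to prepare and under 10s to transmit would pass, while one where most files take 200s to transmit would not.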

Looking at the value of ({Preparation Time} + {Transmission Time}) / {Transmission Time},
we got the following values.
Channel ratios for ATLAS, (CMS) and {LHCb}
<UKT2s-RAL>=15.1 (2.7)
<RAL-UKT2s>=5.5 (1.9)
<T1s-RAL>=4.3 (1.2) {8.6}
<T1s-UKT2s>=2.1
<T2Ds-UKT2s>=2.2
<T2Ds-RAL>=7.5
<RAL-RAL>=19.9
<*-RAL>=3.1 (1.2)
<*-UKT2s>=6.7 (1.01)
<"slow transfer sites">=1.38 (1.02)
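
To help read these numbers: the ratio is 1 + (preparation time / transmission time), so a ratio R means that roughly 1 - 1/R of the total time per file goes on preparation. The sketch below just does that arithmetic. The function names are made up for the example, and the conversion is only exact if the ratio was computed from total times rather than averaged per file, which is an assumption here.

```python
def overhead_ratio(prep_s, trans_s):
    """(preparation + transmission) / transmission, as defined above."""
    if trans_s <= 0:
        raise ValueError("transmission time must be positive")
    return (prep_s + trans_s) / trans_s


def prep_fraction(ratio):
    """Fraction of total time spent preparing, given the ratio above."""
    if ratio < 1:
        raise ValueError("ratio cannot be below 1")
    return 1.0 - 1.0 / ratio


# A file that takes 30s to prepare and 10s to transmit:
example = overhead_ratio(30.0, 10.0)

# Share of time spent preparing for two of the channels listed above:
ukt2s_to_ral_atlas = prep_fraction(15.1)
any_to_ukt2s_cms = prep_fraction(1.01)
```

On that reading, ATLAS UKT2s-RAL transfers (ratio 15.1) spend over 90% of their time in preparation, while CMS *-UKT2s transfers (ratio 1.01) spend about 1%.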

These values showed that UKT2s-RAL transfers for ATLAS met these criteria, so we have now turned this on (which seems to add ~1.5 seconds to each transfer, so you might only want to set this Boolean to true for channels where you intend to change the ratio), and we have now set the ratio of SRM prepares to transfers to 2.5 for all UKT2s to RAL channels. No problem with jobs timing out has been seen, and we have been able to reduce the number of concurrent file transfers without reducing the overall throughput.

