Low transfer rates to BNL - in the range of 10kB/s-500kB/s - were a problem that affected Birmingham, Oxford and ECDF for a few months, and they started affecting Manchester too after I upgraded the pool servers to SL6/EMI-2.
Transfers to BNL were the only ones with this problem: transfers to other T1s and transfers from BNL had healthy rates, and other DPM sites didn't have the problem. On top of it all, the perfSONAR monitoring boxes were reporting good rates too.
Some simpler gridftp transfers showed healthy rates too, so it seemed the problem might be in FTS. In fact Wahid confirmed that FTS uses more streams. So I started to test transfers with an increasing number of streams, and it turned out that up to 2 streams transfers were fine, with 3 streams they were wobbling, and from 4 upwards the rates were terrible.
Looking at netstat in continuous mode seemed to confirm this: out of 9 streams only 2, and seldom 3, had a Send-Q value different from 0. Send-Q is the send buffer of the connection; if it is 0 there are no data queued, and the documentation says that's a good thing, but to me an empty buffer on 7 out of 9 streams looked like those streams were simply not being used. So I tentatively labelled the streams with a non-zero Send-Q as active, and reasoned that if only 2 streams out of 9 were active there was packet loss somewhere.
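For reference, one convenient way to keep an eye on the Send-Q column while a transfer is running is to rerun the netstat command shown further down under watch (the dcd pattern matches the BNL dCache door hosts):
watch -n 2 'netstat -tape | grep dcd'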
To be more systematic and replicate the sonar tests with simpler gridftp transfers I did the following. I found out which files the sonar tests use and wrote a script, which accepts the number of streams as a parameter, to copy them. The file names only differ by a number, so it could all go in a simple loop. For each file I redirect STDOUT and STDERR to a logfile with a timestamp extension that I could then grep.
cat bnl-transfers.sh
#!/bin/bash
# Copy the 5 sonar files to BNL using the number of streams given as the first argument.
for a in `seq 1 5`
do
    nst=$1
    timestamp=`date +%y%m%d%H%M%S`
    postfix="$a.${nst}st"
    logfile="$HOME/test.$postfix.$timestamp"
    srcfile="srm://bohr3226.tier2.hep.manchester.ac.uk/dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/step09/RAW/UKI-NORTHGRID-MAN-HEP_DATADISK/step09.50000030.sonar_1.sonar.RAW.UKI-NORTHGRID-MAN-HEP_DATADISK/step09.50000030.sonar_1.sonar.RAW.UKI-NORTHGRID-MAN-HEP_DATADISK._lb000${a}._0001.data_1286528503"
    dstfile="srm://dcsrm.usatlas.bnl.gov:8443/srm/managerv2?SFN=/pnfs/usatlas.bnl.gov/BNLT0D1/user/aforti/sonar-manual.$postfix"
    # Remove any leftover copy at BNL, then run the timed copy in the background.
    lcg-del -l $dstfile > $logfile 2>&1
    sleep 2
    (time lcg-cp --verbose -n $nst $srcfile $dstfile) >> $logfile 2>&1 &
done
I then ran the following
./bnl-transfers.sh 9; ./bnl-transfers.sh 2; ./bnl-transfers.sh 1
and the rates from gridftp told a clearer story:
9 streams:
47054848 bytes 511.15 KB/sec avg 522.27 KB/sec inst
30539776 bytes 498.73 KB/sec avg 513.71 KB/sec inst
23461888 bytes 383.14 KB/sec avg 386.98 KB/sec inst
30277632 bytes 495.28 KB/sec avg 498.26 KB/sec inst
29491200 bytes 480.80 KB/sec avg 507.73 KB/sec inst
2 streams:
1777729536 bytes 28934.40 KB/sec avg 31978.57 KB/sec inst
1776025600 bytes 28858.57 KB/sec avg 25437.87 KB/sec inst
1261230486 bytes 41055.68 KB/sec avg 41055.68 KB/sec inst
1354288154 bytes 44084.90 KB/sec avg 44084.90 KB/sec inst
2000000000 bytes 32071.02 KB/sec avg 23708.53 KB/sec inst
1 stream:
977272832 bytes 31812.27 KB/sec avg 31812.27 KB/sec inst
515768320 bytes 16789.33 KB/sec avg 16789.33 KB/sec inst
741832146 bytes 24148.18 KB/sec avg 24148.18 KB/sec inst
348258304 bytes 11336.53 KB/sec avg 11336.53 KB/sec inst
612237312 bytes 19996.25 KB/sec avg 19996.25 KB/sec inst
I then repeated the tests every few hours, ten times in total, and the result was always the same.
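Since every run writes its rates to its own logfile, collecting them afterwards is a simple grep. A sketch, assuming the lcg-cp --verbose progress lines contain "KB/sec" as in the output quoted above and using the logfile naming from the script:
grep 'KB/sec' $HOME/test.*.9st.*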
After discussion on the GridPP storage mailing list with other sites with similar but not identical setups, we reduced the possibilities to:
1) Some ports being blocked when the number of streams increases, which might cause a continuous loss of data and the TCP window size remaining stuck at 4k, as observed by ECDF.
2) The TCP sysctl settings applied. Most sites have applied the sysctl settings suggested on this page
http://fasterdata.es.net/host-tuning/linux/
and indeed they gave better rates than the much smaller settings we had previously; they also worked for Manchester before the upgrade, but for some reason not anymore after it. They are characterised by a very large max TCP buffer size, a very small min buffer size and a relatively small 87k default buffer size (the values currently in use can be checked as shown below).
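To see what a host is actually running compared with the fasterdata recommendations, the relevant kernel parameters can be queried directly with sysctl (standard keys, nothing site specific):
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem net.core.rmem_max net.core.wmem_max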
Liverpool, which doesn't have this problem, sent their sysctl settings, which have a similarly large max and small min but a default that is huge compared to the fasterdata value:
net.ipv4.tcp_rmem = 8192 1048576 8388608
net.ipv4.tcp_wmem = 8192 1048576 8388608
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
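For completeness, a minimal sketch of how settings like these can be rolled out on a pool node, assuming they are kept in /etc/sysctl.conf so they survive a reboot:
# append the Liverpool-style buffer settings and load them immediately
cat >> /etc/sysctl.conf <<'EOF'
net.ipv4.tcp_rmem = 8192 1048576 8388608
net.ipv4.tcp_wmem = 8192 1048576 8388608
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
EOF
sysctl -p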
After the change the transfers with 9 streams magically started to go at the same rate as the transfers with 1-2 streams.
9 streams:
2000000000 bytes 22398.22 KB/sec avg 7929.77 KB/sec inst
2000000000 bytes 19054.88 KB/sec avg 9210.10 KB/sec inst
2000000000 bytes 18408.34 KB/sec avg 11112.14 KB/sec inst
2000000000 bytes 20844.45 KB/sec avg 12711.16 KB/sec inst
2000000000 bytes 46613.96 KB/sec avg 7777.58 KB/sec inst
2 streams:
2000000000 bytes 31913.81 KB/sec avg 27794.67 KB/sec inst
2000000000 bytes 25398.24 KB/sec avg 17900.99 KB/sec inst
2000000000 bytes 13359.27 KB/sec avg 8529.12 KB/sec inst
2000000000 bytes 15738.32 KB/sec avg 6519.57 KB/sec inst
2000000000 bytes 43306.54 KB/sec avg 36966.53 KB/sec inst
1 stream:
2000000000 bytes 22790.26 KB/sec avg 16394.90 KB/sec inst
2000000000 bytes 21229.62 KB/sec avg 18154.65 KB/sec inst
2000000000 bytes 18067.76 KB/sec avg 5438.63 KB/sec inst
2000000000 bytes 19280.60 KB/sec avg 4531.59 KB/sec inst
2000000000 bytes 20387.53 KB/sec avg 10513.79 KB/sec inst
We still don't have an explanation of why a setup with a default buffer size of 1M works. All the network tuning sites claim that value should be kept small to avoid hurting small transfers, and the values reported are always between 65k and 87k, never larger. And some sites, like QMUL, work fine with these small default values. Still, going back to netstat after the change, all the streams have their Send-Q filled with data and only occasionally do 1 or 2 streams have empty buffers, which looks like a much healthier picture.
netstat -tape | head -2; netstat -tape | grep dcd
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 2364766 se10.tier2.hep.manche:35220 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1339838 se10.tier2.hep.manche:35219 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1580279 se10.tier2.hep.manche:35218 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1496295 se10.tier2.hep.manche:35214 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1361558 se10.tier2.hep.manche:35212 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1515119 se10.tier2.hep.manche:35213 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1401968 se10.tier2.hep.manche:35217 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1222623 se10.tier2.hep.manche:35215 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1255781 se10.tier2.hep.manche:35216 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
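A quick, hypothetical one-liner to count how many of those streams have data queued in their send buffer at a given instant (Send-Q is the third column of the netstat output):
netstat -tape | grep dcd | awk '$3 > 0' | wc -l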
The change worked well for the other sites too. For example, Oxford before:
2000000000 bytes 210.55 KB/sec avg 172.16 KB/sec inst
2000000000 bytes 237.41 KB/sec avg 38.85 KB/sec inst
2000000000 bytes 209.80 KB/sec avg 39.54 KB/sec inst
2000000000 bytes 206.65 KB/sec avg 30.98 KB/sec inst
2000000000 bytes 263.14 KB/sec avg 144.02 KB/sec inst
and after:
2000000000 bytes 76593.14 KB/sec avg 76593.14 KB/sec inst
2000000000 bytes 50468.35 KB/sec avg 29142.64 KB/sec inst
2000000000 bytes 45316.12 KB/sec avg 2982.70 KB/sec inst
2000000000 bytes 25631.56 KB/sec avg 12115.36 KB/sec inst
2000000000 bytes 18548.20 KB/sec avg 7176.38 KB/sec inst
We are all wondering whether it is worth spending time learning why only a few sites had this problem and why a 1M default buffer size is better than 87k. But I suspect that, since the transfers now work, we will only know if we stumble upon the answer. A possible explanation of why a larger default value is recommended almost nowhere is that the suggestions from sites like fasterdata are tailored for web servers, whose small transfers are the few kB of a web page, while our small transfers are normally log files of a few MB. Also, I don't think web servers use multi-stream transfers, and the fasterdata suggestion was working fine for us with a very limited number of streams.