27 February 2013

ss a tool to debug sockets

In the effort to explain why a 1M default buffer size works better than the more canonic 87k set also by the system and suggested in every network optimization site, as I wrote in this post about sonar tests to BNL, I tried the ss command suggested by John Green. Below are two results with 87k and 1M default buffer size respectively there are two things that jump to the eye for me

1) cwnd doesn't go beyond 30 in the first test, while in the second test cwnd is at least 556 and up to 1783. In my wikipedia knowledge of how this all really works it means the TCP window is not scaling with the first settings and it is with the second.

2) the memory field is completely different in the first case we have bytes in memory in some state in one (more often two streams, sometimes three) and nothing else. In the second case we have memory fields equally filled. This mirrors what observed with netstat and the Send-Q field.

Different state of the connection might report additional fields. For example when the connection is in CLOSE_WAIT state an ato (ack timeout) parameter appears. Also in CLOSE_WAIT the memory is fields are different r and f are uniformly filled instead of w and f.


Which confirms that with the fasterdata settings the TCP window doesn't scale (at least beyond a very small value such as 30).

Some interesting links I found to explain the fields

mem r,w,f,t values
cwnd or slow start and congestion control
wscale value
rto retransmission timeout

87k settings

ss -timeo|grep -A1 192.12.15.235|grep -v 192.12
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:292 rtt:92/0.75 cwnd:31 send 3.9Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:292 rtt:92/0.75 cwnd:30 send 3.8Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:307 rtt:104.125/0.75 cwnd:30 send 3.3Mbps rcv_space:14600
mem:(r0,w106648,f73576,t0) ts sack htcp wscale:7,9 rto:307 rtt:104/0.75 cwnd:21 send 2.3Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:307 rtt:104.125/0.75 cwnd:30 send 3.3Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:307 rtt:104/0.75 cwnd:30 send 3.3Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:292 rtt:92.125/0.75 cwnd:29 send 3.6Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:291 rtt:91.875/0.75 cwnd:30 send 3.8Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:292 rtt:92/0.75 cwnd:28 send 3.5Mbps rcv_space:14600

1MB settings

ss -timeo|grep -A1 192.12.15.226|grep -v 192.12
     mem:(r0,w1620945,f267311,t0) ts sack htcp wscale:9,8 rto:312 rtt:112.5/26 cwnd:621 send 63.9Mbps rcv_space:14600
     mem:(r0,w5074901,f213035,t0) ts sack htcp wscale:9,8 rto:320 rtt:120.5/34.5 cwnd:638 send 61.3Mbps rcv_space:14600
     mem:(r0,w269841,f495,t0) ts sack htcp wscale:9,8 rto:333 rtt:130.625/18 cwnd:356 send 31.6Mbps rcv_space:14600
     mem:(r0,w269841,f495,t0) ts sack htcp wscale:9,8 rto:319 rtt:119.5/18.75 cwnd:345 send 33.4Mbps rcv_space:14600
     mem:(r0,w2944464,f266800,t0) ts sack htcp wscale:9,8 rto:317 rtt:117.125/34.5 cwnd:1236 send 122.2Mbps rcv_space:14600
     mem:(r0,w269841,f495,t0) ts sack htcp wscale:9,8 rto:319 rtt:119.75/18.75 cwnd:320 send 31.0Mbps rcv_space:14600
     mem:(r0,w5621967,f239409,t0) ts sack htcp wscale:9,8 rto:313 rtt:113.125/24.25 cwnd:624 send 63.9Mbps rcv_space:14600
     mem:(r0,w269841,f495,t0) ts sack htcp wscale:9,8 rto:322 rtt:122.25/17.25 cwnd:318 send 30.1Mbps rcv_space:14600
     mem:(r0,w2943432,f800312,t0) ts sack htcp wscale:9,8 rto:314 rtt:114.375/21.75 cwnd:655 send 66.3Mbps rcv_space:14600


Connections in different states (additional fields)


ss -timeo|grep -A1 192.12.15
ESTAB      0      1587665     195.194.108.50:43684        192.12.15.233:41628    timer:(on,299ms,0) uid:19536 ino:20107501 sk:ffff88010b09cb00
     mem:(r0,w1620945,f267311,t0) ts sack htcp wscale:9,8 rto:312 rtt:112.5/26 cwnd:621 send 63.9Mbps rcv_space:14600
[...]

ss -timeo|grep -A1 192.12.15
CLOSE-WAIT 1      0          195.194.108.50:43684        192.12.15.233:41628    uid:19536 ino:20107501 sk:ffff88010b09cb00
     mem:(r4352,w0,f3840,t0) ts sack htcp wscale:9,8 rto:300 rtt:100.5/9 ato:40 cwnd:1924 send 221.8Mbps rcv_space:14600
[....]
 

21 February 2013

BNL rate improvements in practice

Thanks to the recent investigative work by Alessandra Forti, the TCP tuning settings on the UKI-SOUTHGRID-OX-HEP DPM disk pool servers have been changed to increase the default sizes of the IPv4 windows to a very large 1MB from the much smaller previous value of ~64KB (which had been taken from the fasterdata.es.net recommendations). The change made an immediate improvement in the results of our artificial testing, but a day later we can now see the pay off in speed of real transfers:

Transfers from Oxford to BNL are the red line that jumps from virtually zero straight up to about 40MB/s, making BNL now one of our fastest destination sites.

We're still not sure what the underlying cause of the apparent problem with automatic window scaling for multi-stream gridftp transfers is, but it appears safe to say that this change has completely removed the practical effects.

20 February 2013

Sonar test to BNL

Low transfer rates to BNL - in the range of 10kB/s-500kB/s - was a problem that affected Birmingham, Oxford and ECDF for few months and it was affecting Manchester too since I've upgraded to SL6/EMI-2 the pool servers.

Transfers to BNL were the only transfers with this problem, other transfers to other T1s and the trasnfers from BNL had healthy rates; and other sites with DPM didn't have this problem. On top of it all the perfsonar monitoring boxes were reporting good rates too.
  
Some simpler gridftp trasnfers showed healthy rates too so it seemed the problem might be in FTS. Infact Wahid confirmed that FTS uses more streams. So I started to test transfers with an increasing number of streams and it turned out that up to 2 streams transfers were fine, using 3 streams transfers were wobbling and from 4 upwards the rates were terrible.

Looking at netstat in continuous mode this seemed to be confirmed by the fact that out of 9 streams only 2 and seldom 3 had a Send-Q value different from 0. Send-Q is the stream buffer if it is 0 it means there are no data queued the documentation says that's a good thing but to me it looked that if you have buffer empty on 7 out of 9 streams those streams are not used. So I tentatively labelled the streams with Send-Q values different from 0 as active and thought that if only 2 streams were active out of 9 there was packet loss somewhere.

 To be more systematic and replicate the sonar tests in the simpler gridftp transfers I did the following

I found out which files were used by the sonar tests and wrote a script which accepts the number of streams as a parameter to copy them. File names only differ by a number so it could all go in a simple loop. For each file I redirect STDOUT and STDERR to a logfile with a a timestamp extension I could then grep.

cat bnl-transfers.sh
#!/bin/bash

for a in `seq 1 5`
  do
 
  nst=$1
  timestamp=`date +%y%m%d%H%M%S`
  postfix="$a.${nst}st"
  logfile="$HOME/test.$postfix.$timestamp"

  srcfile="srm://bohr3226.tier2.hep.manchester.ac.uk/dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/step09/RAW/UKI-NORTHGRID-MAN-HEP_DATADISK/step09.50000030.sonar_1.sonar.RAW.UKI-NORTHGRID-MAN-HEP_DATADISK/step09.50000030.sonar_1.sonar.RAW.UKI-NORTHGRID-MAN-HEP_DATADISK._lb000${a}._0001.data_1286528503"
  dstfile="srm://dcsrm.usatlas.bnl.gov:8443/srm/managerv2?SFN=/pnfs/usatlas.bnl.gov/BNLT0D1/user/aforti/sonar-manual.$postfix"
  lcg-del -l $dstfile > $logfile 2>&1
  sleep 2
  (time lcg-cp --verbose -n $nst $srcfile $dstfile) >> $logfile 2>&1 &

done


I then run the following

./bnl-transfers.sh 9;  ./bnl-transfers.sh 2; ./bnl-transfers.sh 1 

and the rates from gridftp told a clearer story

9 streams:
     47054848 bytes    511.15 KB/sec avg    522.27 KB/sec inst
     30539776 bytes    498.73 KB/sec avg    513.71 KB/sec inst
     23461888 bytes    383.14 KB/sec avg    386.98 KB/sec inst
     30277632 bytes    495.28 KB/sec avg    498.26 KB/sec inst
     29491200 bytes    480.80 KB/sec avg    507.73 KB/sec inst

2 streams:

   1777729536 bytes  28934.40 KB/sec avg  31978.57 KB/sec inst
   1776025600 bytes  28858.57 KB/sec avg  25437.87 KB/sec inst
   1261230486 bytes  41055.68 KB/sec avg  41055.68 KB/sec inst
   1354288154 bytes  44084.90 KB/sec avg  44084.90 KB/sec inst
   2000000000 bytes  32071.02 KB/sec avg  23708.53 KB/sec inst

1 stream:

    977272832 bytes  31812.27 KB/sec avg  31812.27 KB/sec inst
    515768320 bytes  16789.33 KB/sec avg  16789.33 KB/sec inst
    741832146 bytes  24148.18 KB/sec avg  24148.18 KB/sec inst
    348258304 bytes  11336.53 KB/sec avg  11336.53 KB/sec inst
    612237312 bytes  19996.25 KB/sec avg  19996.25 KB/sec inst 


I repeated then the tests every few hours for 10 times and the result was always the same.

After discussion on the GridPP storage mailing list with other sites with similar but not identical setup we reduced the possibilities to

1) Some ports being blocked when the number of streams increases which might cause a continuous loss of data and the TCP window size remaining stuck at 4k as observed by ECDF.

2) Tcp sysctl settings applied. Most sites have applied the sysctl  settings suggested on this page
http://fasterdata.es.net/host-tuning/linux/ and indeed they gave better rates than the much smaller settings we had previously and they worked for Manchester before the upgrade but for some reason not anymore after. They are characterised by a very large max TCP buffer size and very small min buffer size and a relatively small 87k default buffer size.

Liverpool, which doesn't have this problem sent their sysctl settings which were characterised by similar large max and small min and a huge, compared to the fasterdata value, default.

net.ipv4.tcp_rmem = 8192 1048576 8388608
net.ipv4.tcp_wmem = 8192 1048576 8388608
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608 


after the change the transfers with 9 streams magically started to go at the same rate as the transfers with 1-2 streams

9 streams

   2000000000 bytes  22398.22 KB/sec avg   7929.77 KB/sec inst
   2000000000 bytes  19054.88 KB/sec avg   9210.10 KB/sec inst
   2000000000 bytes  18408.34 KB/sec avg  11112.14 KB/sec inst
   2000000000 bytes  20844.45 KB/sec avg  12711.16 KB/sec inst
   2000000000 bytes  46613.96 KB/sec avg   7777.58 KB/sec inst

2 streams
   2000000000 bytes  31913.81 KB/sec avg  27794.67 KB/sec inst
   2000000000 bytes  25398.24 KB/sec avg  17900.99 KB/sec inst
   2000000000 bytes  13359.27 KB/sec avg   8529.12 KB/sec inst
   2000000000 bytes  15738.32 KB/sec avg   6519.57 KB/sec inst
   2000000000 bytes  43306.54 KB/sec avg  36966.53 KB/sec inst

1 stream

   2000000000 bytes  22790.26 KB/sec avg  16394.90 KB/sec inst
   2000000000 bytes  21229.62 KB/sec avg  18154.65 KB/sec inst
   2000000000 bytes  18067.76 KB/sec avg   5438.63 KB/sec inst
   2000000000 bytes  19280.60 KB/sec avg   4531.59 KB/sec inst
   2000000000 bytes  20387.53 KB/sec avg  10513.79 KB/sec inst


We still don't have an explanation of why a setup with an initial buffer size of 1M works. All the network sites claim that that value should be kept small to avoid hurting small size transfers and the values reported are always between 65k and 87k never larger. And some sites like QMUL work fine with these small initial values. Still going back to netstat after the change all streams Send-Q values are filled with data and only occasionally 1 or 2 streams have empty buffers which looks a much healthier picture.

netstat -tape |head -2;netstat -tape |grep dcd
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address               Foreign Address             State     
tcp        0 2364766 se10.tier2.hep.manche:35220 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp        0 1339838 se10.tier2.hep.manche:35219 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp        0 1580279 se10.tier2.hep.manche:35218 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp        0 1496295 se10.tier2.hep.manche:35214 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp        0 1361558 se10.tier2.hep.manche:35212 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp        0 1515119 se10.tier2.hep.manche:35213 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp        0 1401968 se10.tier2.hep.manche:35217 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp        0 1222623 se10.tier2.hep.manche:35215 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp        0 1255781 se10.tier2.hep.manche:35216 dcdoor14.usatlas.bnl.:21431 ESTABLISHED

 

The change applied worked well also for the other sites. For example Oxford before 

   2000000000 bytes    210.55 KB/sec avg    172.16 KB/sec inst
   2000000000 bytes    237.41 KB/sec avg     38.85 KB/sec inst
   2000000000 bytes    209.80 KB/sec avg     39.54 KB/sec inst
   2000000000 bytes    206.65 KB/sec avg     30.98 KB/sec inst
   2000000000 bytes    263.14 KB/sec avg    144.02 KB/sec inst


 and after

  2000000000 bytes  76593.14 KB/sec avg  76593.14 KB/sec inst
   2000000000 bytes  50468.35 KB/sec avg  29142.64 KB/sec inst
   2000000000 bytes  45316.12 KB/sec avg   2982.70 KB/sec inst
   2000000000 bytes  25631.56 KB/sec avg  12115.36 KB/sec inst
   2000000000 bytes  18548.20 KB/sec avg   7176.38 KB/sec inst

 

We are all wondering if it is worth to spend time learning why only few sites had this problem and why 1M initial buffer size is better for than 87k. But I suspect that since the transfers now work we will know only if we stumble upon the answer. A possible explanation of why a larger inital value is not reccomended almost anywhere is that sites like fasterdata suggestions are tailored for WEB sites whose small transfers are few kB of a WEB page while our small transfers are normally few MBs log files. Also I don't think WEB servers use multi-stream transfers and the fasterdata suggestion was working fine with a very limited number of streams.