23 December 2013
It's nice to see that plans and issues for DPM sites are similar around the world. Here are my musings:
The fact that the Australian and Taiwanese sites see approximately the same percentage of dark files after ATLAS Rucio renamed their sites as I have seen in the UK (8-9%) is encouraging: the UK is not especially bad. (We now just need to work with ATLAS on how to efficiently clean up these files and delete the empty directories, and on how to reduce dark data creation in the future.) It will be interesting to see whether other VOs have a lower or higher percentage of dark data.
It was also good to hear that a Puppet deployment of DPM (rather than YAIM) is almost ready for use. Now that I have a better understanding of the development cycle of the individual components, I am less worried about the move away from a single product release.
11 December 2013
Off to Edinburgh for DPM Workshop 2013
Friday has me going to Edinburgh for the latest DPM workshop. It'll be good to meet and discuss all the new(ish) features DPM has to offer (and make my own suggestions...)
Meeting agenda should be available at the following indico meeting page:
http://indico.cern.ch/conferenceDisplay.py?confId=273864
22 October 2013
DPM Load Issue Mitigation for Sites seeing load brownouts on individual disk servers.
For a while, it has been known that sites can experience load issues on their storage, especially for those sites running a lot of ATLAS analysis.
For DPM sites, this is caused by a combination of poor dataset-level file distribution (something which a storage system cannot guarantee without knowledge inaccessible to it a priori, especially with the new Rucio naming scheme) and the lack of any way to consistently throttle access rates, or the number of active transfers, per disk server.
At Glasgow, we found the following mixed approaches seemed to significantly ameliorate the issue:
- Firstly, enabling xrootd support in DPM (this is now entirely doable via YAIM or Puppet, and it's required for the ATLAS and CMS xrootd federations to work anyway), and setting the queues in AGIS to use xrootd direct I/O (rather than rfcp or xrdcp) for remote file access. A majority of analysis jobs access only fractions of the files they rely on, so this reduces the total amount of data a job needs to move, and also spreads the load caused by the data accesses over a longer time period, reducing the peak IO required of the disk server. Getting your queue settings changed requires you to have a friendly ATLAS person with the relevant permissions available.
- Secondly, changing queue settings in your batch system to limit the rate at which analysis pilots can start. The cause of IO brownouts on disk servers seems to be the simultaneous start of a large number of analysis jobs (all of which immediately attempt to get/access files from the local SE); limiting the rate of pilot starts smears this peak IO over a longer time, again smoothing load out.
With our torque/maui system, we accomplish this by setting the MAXIPROC value for the atlaspil group to a small number (10 in our case). MAXIPROC sets the maximum number of jobs in a class which can be eligible for starting in any given scheduling iteration - essentially it means that maui will not start more than 10 atlaspil jobs every (60 second) scheduling iteration, in our case.
i.e.: GROUPCFG[atlaspil] {other config variables elided} MAXIPROC=10
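For illustration, a minimal maui.cfg fragment along these lines (a sketch only: RMPOLLINTERVAL is shown just to make the 60-second scheduling iteration explicit and may well differ at your site, and the other GROUPCFG limits are elided as above):
RMPOLLINTERVAL 00:01:00
GROUPCFG[atlaspil] MAXIPROC=10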
With these two changes, we rarely see issues with load spiking and brownouts, despite increasing our maximum fraction of Analysis jobs from ATLAS significantly since the fixes were installed. Evidence suggests that both changes are needed for robust effects on your load, however.
30 June 2013
Big data...
Big data is a big buzzword these days, but it's also a real "problem" in some practical sense: data can be difficult to move, to store, to process, to preserve - when you have lots of it. Big data is also an opportunity, though.
Well, we've had our big data workshop. A proper writeup will be, er, written up, but here are a few personal notes to whet your appetite (hopefully). You can of course also go and view the presentations at the workshop - all speakers have made their presentations available.
- Lots of research communities have "big data" - in fact, it's hard to think of one that doesn't. Perhaps it's like email - after email became widely used at universities, researchers could communicate rapidly with each other, thus extending remote collaborations - but it's also a double edged sword in the sense that we spend much of our time processing email. Big data tools, I think, enable each research community to do more, but perhaps at a cost.
- The cost is not always obvious beforehand - many speakers mentioned the complexity of software, and maintaining software for your data which makes use of hardware advances - and, er, runs correctly - is nontrivial.
- Likewise, adapting existing code to process "big data" - to what extent is it like adapting applications and algorithms to run parallel, distributed, on the grid, or in the cloud?
- There are clearly opportunities for sharing ideas and innovations - the LHC grid may be finding events not inconsistent with the existence of Higgs-like particles (or, to the less careful, discovering the Higgs), thanks to a global network of data transfer, storage, and computing (of which GridPP is a part) crunching data on the order of hundred(s) of petabytes. But this work doesn't rely on visualisation to the extent that astronomy does. And in the humanities, artists have found new ways of visualising data.
- And whoever mentioned software complexity had a point - the computing evolved along with the building of the collider, so we had time to test it. The last thing a researcher wants is to debug the infrastructure (well, actually, a few quite like that, but most would rather just get on with the research they're supposed to be doing.)
- Data policies - open data, sharing - making data sets usable - and giving academic credit to researchers for doing this. There is more to big data in research than the ability to store it. Human stuff, too. Policy. Security. Identity management. That sort of stuff.
- I think there is a gap between hardware/tools on one side and communities on the other - and the bridge is the infrastructure provider. But there is sometimes more to it than that - some communities find it harder to share than others. A cultural change may be needed.
- And making use of big data tools - as with clouds, grids - sometimes it's easier running stuff locally, if you can, or it feels more secure. Make use of tools that make it easy. Learn from others.
06 June 2013
Demonstrating 100 Gb/s
One of the interesting things, from a networkingdataological perspective, at this year's TNC is the demonstration of a 100 Gb/s link across the Atlantic; the conference itself is also connected at 100 Gb/s. 100 Gb/s is here!
We also had a music performance with a local (Maastricht) band and one musician in Edinburgh, and they were playing together. This can of course only happen if the latency is very low - 10-20ms. First time I have seen this in practice, very impressive.
26 April 2013
My new sister Eve
So I have a new friend, Eve. Eve is my latest sister, born in 2012, whose first home is the same as mine.
Initial info on Eve, to compare with mine and Georgina's, follows. What surprises me is the number of files that have no replicas at all and so are at risk if a house or room gets destroyed.
DataSet Name | Dave | Georgina | Eve
"DNA" Number | 2.3.7.3623 | 3.3.3.3.5.5.89 | 3.53.1361
Number of Countries | 17 | 17 |
Number of "Houses" | 59 | 79 | 63
Type of Rooms: DATADISK | 28 | 49 | 50
Type of Rooms: LGD | 37 | 61 | 14
Type of Rooms: PERF+PHYS | 32 | 58 | 22
Type of Rooms: TAPE | 6 | 12 | 9
Type of Rooms: USERDISK | 1 | 9 | 5
Type of Rooms: CERN | 3 | 6 | 5
Type of Rooms: SCRATCH | 0 | 12 | 14
Type of Rooms: CALIB | 0 | 5 | 7
Total number of people (including clones) | 1166 | 1594 | 642
Number of unique people | 894 | 1120 | 299
Number of "people" of type ^user | 137 | 470 | 64
Number of unique "people" of type ^user | 132 | 429 | 55
Number of "people" of type ^data | 725 | 1048 | 538
Number of unique "people" of type ^data | 532 | 631 | 212
Number of "people" of type ^group | 31 | 70 | 40
Number of unique "people" of type ^group | 26 | 54 | 32
Number of "people" of type ^valid | 1 | 0 | 0
Number of unique "people" of type ^valid | 1 | 0 | 0
Datasets that have 1 copy | 691 | 811 | 178
Datasets that have 2 copies | 143 | 197 | 63
Datasets that have 3 copies | 52 | 68 | 25
Datasets that have 4 copies | 71 | 23 | 7
Datasets that have 5 copies | 1 | 7 | 21
Datasets that have 6 copies | 0 | 5 | 2
Datasets that have 7 copies | 0 | 2 | 1
Datasets that have 8 copies | 0 | 1 | 1
Datasets that have 12 copies | 0 | 0 | 6
Datasets that have 13 copies | 0 | 0 | 3
Number of files that have 1 copy | 53644 | 141663 | 9509
Number of files that have 2 copies | 12904 | 36863 | 7705
Number of files that have 3 copies | 1860 | 3708 | 4520
Number of files that have 4 copies | 7 | 2092 | 110
Number of files that have 5 copies | 7 | 572 | 1091
Number of files that have 6 copies | 0 | 487 | 20
Number of files that have 7 copies | 0 | 142 | 5
Number of files that have 8 copies | 0 | 73 | 1
Number of files that have 12 copies | 0 | 0 | 6
Number of files that have 13 copies | 0 | 0 | 5
Total number of files on the grid | 85095 | 242241 | 22973
Total number of unique files on the grid | 68422 | 185600 | 44674
Data Volume (TB) on the grid that has 1 copy | 8.8 | 27.7 | 1.7
Data Volume (TB) on the grid that has 2 copies | 6.8 | 21.3 | 4.7
Data Volume (TB) on the grid that has 3 copies | 0.3 | 5.6 | 7.2
Data Volume (TB) on the grid that has 4 copies | 0.011 | 1.5 | 0.12
Data Volume (TB) on the grid that has 5 copies | 0.028 | 1.4 | 0.47
Data Volume (TB) on the grid that has 6 copies | 0 | 0.36 | 0.039
Data Volume (TB) on the grid that has 7 copies | 0 | 0.13 | < 1GB
Data Volume (TB) on the grid that has 8 copies | 0 | 0.12 | < 1GB
Data Volume (TB) on the grid that has 12 copies | 0 | 0 | < 1GB
Data Volume (TB) on the grid that has 13 copies | 0 | 0 | < 1GB
Total Volume of data on the grid (TB) | 23.7 | 104 | 35.8
Total Volume of unique data on the grid (TB) | 16 | 58 | 14.2
27 February 2013
ss: a tool to debug sockets
In an effort to explain why a 1M default buffer size works better than the more canonical 87k also set by the system and suggested on every network optimisation site, as I wrote in this post about sonar tests to BNL, I tried the ss command suggested by John Green. Below are two sets of results, with 87k and 1M default buffer sizes respectively; two things jump out at me:
1) cwnd doesn't go beyond 30 in the first test, while in the second test it is at least 556 and up to 1783. With my Wikipedia-level knowledge of how this all really works, this means the TCP window is not scaling with the first settings but is with the second.
2) The memory field is completely different: in the first case we have bytes in memory in some state in one stream (more often two, sometimes three) and nothing else, while in the second case the memory fields are filled fairly evenly across all the streams. This mirrors what was observed with netstat and the Send-Q field.
Different connection states may report additional fields. For example, when the connection is in the CLOSE_WAIT state an ato (ack timeout) parameter appears. Also in CLOSE_WAIT the memory fields are different: r and f are uniformly filled instead of w and f.
This confirms that with the fasterdata settings the TCP window doesn't scale (at least not beyond a very small value such as 30).
Some interesting links I found to explain the fields:
- mem: the r, w, f, t values
- cwnd, or slow start and congestion control
- wscale value
- rto: retransmission timeout
87k settings
ss -timeo|grep -A1 192.12.15.235|grep -v 192.12
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:292 rtt:92/0.75 cwnd:31 send 3.9Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:292 rtt:92/0.75 cwnd:30 send 3.8Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:307 rtt:104.125/0.75 cwnd:30 send 3.3Mbps rcv_space:14600
mem:(r0,w106648,f73576,t0) ts sack htcp wscale:7,9 rto:307 rtt:104/0.75 cwnd:21 send 2.3Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:307 rtt:104.125/0.75 cwnd:30 send 3.3Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:307 rtt:104/0.75 cwnd:30 send 3.3Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:292 rtt:92.125/0.75 cwnd:29 send 3.6Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:291 rtt:91.875/0.75 cwnd:30 send 3.8Mbps rcv_space:14600
mem:(r0,w0,f0,t0) ts sack htcp wscale:7,9 rto:292 rtt:92/0.75 cwnd:28 send 3.5Mbps rcv_space:14600
1MB settings
ss -timeo|grep -A1 192.12.15.226|grep -v 192.12
mem:(r0,w1620945,f267311,t0) ts sack htcp wscale:9,8 rto:312 rtt:112.5/26 cwnd:621 send 63.9Mbps rcv_space:14600
mem:(r0,w5074901,f213035,t0) ts sack htcp wscale:9,8 rto:320 rtt:120.5/34.5 cwnd:638 send 61.3Mbps rcv_space:14600
mem:(r0,w269841,f495,t0) ts sack htcp wscale:9,8 rto:333 rtt:130.625/18 cwnd:356 send 31.6Mbps rcv_space:14600
mem:(r0,w269841,f495,t0) ts sack htcp wscale:9,8 rto:319 rtt:119.5/18.75 cwnd:345 send 33.4Mbps rcv_space:14600
mem:(r0,w2944464,f266800,t0) ts sack htcp wscale:9,8 rto:317 rtt:117.125/34.5 cwnd:1236 send 122.2Mbps rcv_space:14600
mem:(r0,w269841,f495,t0) ts sack htcp wscale:9,8 rto:319 rtt:119.75/18.75 cwnd:320 send 31.0Mbps rcv_space:14600
mem:(r0,w5621967,f239409,t0) ts sack htcp wscale:9,8 rto:313 rtt:113.125/24.25 cwnd:624 send 63.9Mbps rcv_space:14600
mem:(r0,w269841,f495,t0) ts sack htcp wscale:9,8 rto:322 rtt:122.25/17.25 cwnd:318 send 30.1Mbps rcv_space:14600
mem:(r0,w2943432,f800312,t0) ts sack htcp wscale:9,8 rto:314 rtt:114.375/21.75 cwnd:655 send 66.3Mbps rcv_space:14600
Connections in different states (additional fields)
ss -timeo|grep -A1 192.12.15
ESTAB 0 1587665 195.194.108.50:43684 192.12.15.233:41628 timer:(on,299ms,0) uid:19536 ino:20107501 sk:ffff88010b09cb00
mem:(r0,w1620945,f267311,t0) ts sack htcp wscale:9,8 rto:312 rtt:112.5/26 cwnd:621 send 63.9Mbps rcv_space:14600
[...]
ss -timeo|grep -A1 192.12.15
CLOSE-WAIT 1 0 195.194.108.50:43684 192.12.15.233:41628 uid:19536 ino:20107501 sk:ffff88010b09cb00
mem:(r4352,w0,f3840,t0) ts sack htcp wscale:9,8 rto:300 rtt:100.5/9 ato:40 cwnd:1924 send 221.8Mbps rcv_space:14600
[....]
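For anyone wanting to repeat this kind of check, here is a minimal sketch of a loop to watch how cwnd evolves for a given peer during a transfer (the IP is the destination from the 1MB test above; adjust the peer and interval to taste):
while true; do ss -timeo | grep -A1 192.12.15.226 | grep -oE 'cwnd:[0-9]+'; sleep 5; done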
21 February 2013
BNL rate improvements in practice
Thanks to the recent investigative work by Alessandra Forti, the TCP tuning settings on the UKI-SOUTHGRID-OX-HEP DPM disk pool servers have been changed to increase the default sizes of the IPv4 windows to a very large 1MB from the much smaller previous value of ~64KB (which had been taken from the fasterdata.es.net recommendations). The change made an immediate improvement in the results of our artificial testing, but a day later we can now see the pay off in speed of real transfers:
Transfers from Oxford to BNL are the red line that jumps from virtually zero straight up to about 40MB/s, making BNL now one of our fastest destination sites.
We're still not sure what the underlying cause of the apparent problem with automatic window scaling for multi-stream gridftp transfers is, but it appears safe to say that this change has completely removed the practical effects.
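For reference, a quick way to check what defaults a pool node is currently running with is to query the kernel directly; the middle field of tcp_rmem/tcp_wmem is the default that was raised to 1MB (the example output shows the values discussed in the post below, not necessarily what you will see):
sysctl net.ipv4.tcp_rmem net.ipv4.tcp_wmem
# e.g. net.ipv4.tcp_rmem = 8192 1048576 8388608
#      net.ipv4.tcp_wmem = 8192 1048576 8388608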
20 February 2013
Sonar test to BNL
Low transfer rates to BNL - in the range of 10kB/s-500kB/s - were a problem that affected Birmingham, Oxford and ECDF for a few months, and it started affecting Manchester too once I upgraded the pool servers to SL6/EMI-2.
Transfers to BNL were the only ones with this problem: transfers to other T1s and transfers from BNL had healthy rates, and other sites with DPM didn't have this problem. On top of it all, the perfSONAR monitoring boxes were reporting good rates too.
Some simpler gridftp transfers showed healthy rates too, so it seemed the problem might be in FTS. In fact Wahid confirmed that FTS uses more streams, so I started to test transfers with an increasing number of streams. It turned out that with up to 2 streams transfers were fine, with 3 streams they were wobbling, and from 4 upwards the rates were terrible.
Looking at netstat in continuous mode seemed to confirm this: out of 9 streams only 2, and seldom 3, had a Send-Q value different from 0. Send-Q is the stream's send buffer; if it is 0 there are no data queued, which the documentation says is a good thing, but to me an empty buffer on 7 out of 9 streams looked like those streams simply were not being used. So I tentatively labelled the streams with non-zero Send-Q values as active, and thought that if only 2 streams out of 9 were active there must be packet loss somewhere.
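For the record, "continuous mode" here just means something along the lines of the following sketch (not necessarily the exact invocation); the dcdoor pattern matches the BNL door hosts seen in the netstat output further down.
netstat -c -tape | grep dcdoor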
To be more systematic and to replicate the sonar tests with the simpler gridftp transfers, I did the following.
I found out which files were used by the sonar tests and wrote a script, taking the number of streams as a parameter, to copy them. The file names only differ by a number, so it could all go in a simple loop. For each file I redirect STDOUT and STDERR to a logfile with a timestamp extension that I can then grep.
cat bnl-transfers.sh
#!/bin/bash
# Copy the five sonar test files to BNL, each with $1 parallel streams,
# logging each transfer (and its timing) to a timestamped file in $HOME.
for a in `seq 1 5`
do
  nst=$1
  timestamp=`date +%y%m%d%H%M%S`
  postfix="$a.${nst}st"
  logfile="$HOME/test.$postfix.$timestamp"
  srcfile="srm://bohr3226.tier2.hep.manchester.ac.uk/dpm/tier2.hep.manchester.ac.uk/home/atlas/atlasdatadisk/step09/RAW/UKI-NORTHGRID-MAN-HEP_DATADISK/step09.50000030.sonar_1.sonar.RAW.UKI-NORTHGRID-MAN-HEP_DATADISK/step09.50000030.sonar_1.sonar.RAW.UKI-NORTHGRID-MAN-HEP_DATADISK._lb000${a}._0001.data_1286528503"
  dstfile="srm://dcsrm.usatlas.bnl.gov:8443/srm/managerv2?SFN=/pnfs/usatlas.bnl.gov/BNLT0D1/user/aforti/sonar-manual.$postfix"
  # remove any previous copy at the destination, then copy in the background
  lcg-del -l $dstfile > $logfile 2>&1
  sleep 2
  (time lcg-cp --verbose -n $nst $srcfile $dstfile) >> $logfile 2>&1 &
done
I then ran the following:
./bnl-transfers.sh 9; ./bnl-transfers.sh 2; ./bnl-transfers.sh 1
and the rates from gridftp told a clearer story
9 streams:
47054848 bytes 511.15 KB/sec avg 522.27 KB/sec inst
30539776 bytes 498.73 KB/sec avg 513.71 KB/sec inst
23461888 bytes 383.14 KB/sec avg 386.98 KB/sec inst
30277632 bytes 495.28 KB/sec avg 498.26 KB/sec inst
29491200 bytes 480.80 KB/sec avg 507.73 KB/sec inst
2 streams:
1777729536 bytes 28934.40 KB/sec avg 31978.57 KB/sec inst
1776025600 bytes 28858.57 KB/sec avg 25437.87 KB/sec inst
1261230486 bytes 41055.68 KB/sec avg 41055.68 KB/sec inst
1354288154 bytes 44084.90 KB/sec avg 44084.90 KB/sec inst
2000000000 bytes 32071.02 KB/sec avg 23708.53 KB/sec inst
1 stream:
977272832 bytes 31812.27 KB/sec avg 31812.27 KB/sec inst
515768320 bytes 16789.33 KB/sec avg 16789.33 KB/sec inst
741832146 bytes 24148.18 KB/sec avg 24148.18 KB/sec inst
348258304 bytes 11336.53 KB/sec avg 11336.53 KB/sec inst
612237312 bytes 19996.25 KB/sec avg 19996.25 KB/sec inst
I then repeated the tests every few hours, ten times in all, and the result was always the same.
After discussion on the GridPP storage mailing list with other sites with similar but not identical setups, we narrowed the possibilities down to:
1) Some ports being blocked as the number of streams increases, which might cause a continuous loss of data and leave the TCP window size stuck at 4k, as observed by ECDF.
2) The TCP sysctl settings applied. Most sites have applied the sysctl settings suggested at http://fasterdata.es.net/host-tuning/linux/ - and indeed they gave better rates than the much smaller settings we had previously, and they worked for Manchester before the upgrade, but for some reason not any more after it. They are characterised by a very large max TCP buffer size, a very small min buffer size, and a relatively small 87k default buffer size.
Liverpool, which doesn't have this problem, sent their sysctl settings, which are characterised by a similarly large max and small min but a huge default compared with the fasterdata value:
net.ipv4.tcp_rmem = 8192 1048576 8388608
net.ipv4.tcp_wmem = 8192 1048576 8388608
net.core.rmem_max = 8388608
net.core.wmem_max = 8388608
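To apply settings like these, a minimal sketch (assuming root on each pool node) is to put the four lines above into /etc/sysctl.conf and reload, or to set them on the fly:
sysctl -p
sysctl -w net.ipv4.tcp_rmem="8192 1048576 8388608"
sysctl -w net.ipv4.tcp_wmem="8192 1048576 8388608"
sysctl -w net.core.rmem_max=8388608
sysctl -w net.core.wmem_max=8388608
Only connections opened after the change pick up the new defaults.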
After the change, the transfers with 9 streams magically started to go at the same rate as the transfers with 1-2 streams:
9 streams
2000000000 bytes 22398.22 KB/sec avg 7929.77 KB/sec inst
2000000000 bytes 19054.88 KB/sec avg 9210.10 KB/sec inst
2000000000 bytes 18408.34 KB/sec avg 11112.14 KB/sec inst
2000000000 bytes 20844.45 KB/sec avg 12711.16 KB/sec inst
2000000000 bytes 46613.96 KB/sec avg 7777.58 KB/sec inst
2 streams
2000000000 bytes 31913.81 KB/sec avg 27794.67 KB/sec inst
2000000000 bytes 25398.24 KB/sec avg 17900.99 KB/sec inst
2000000000 bytes 13359.27 KB/sec avg 8529.12 KB/sec inst
2000000000 bytes 15738.32 KB/sec avg 6519.57 KB/sec inst
2000000000 bytes 43306.54 KB/sec avg 36966.53 KB/sec inst
1 stream
2000000000 bytes 22790.26 KB/sec avg 16394.90 KB/sec inst
2000000000 bytes 21229.62 KB/sec avg 18154.65 KB/sec inst
2000000000 bytes 18067.76 KB/sec avg 5438.63 KB/sec inst
2000000000 bytes 19280.60 KB/sec avg 4531.59 KB/sec inst
2000000000 bytes 20387.53 KB/sec avg 10513.79 KB/sec inst
We still don't have an explanation of why a setup with an initial (default) buffer size of 1M works. All the networking sites claim that value should be kept small to avoid hurting small transfers, and the values suggested are always between 65k and 87k, never larger. And some sites, like QMUL, work fine with these small initial values. Still, going back to netstat after the change, all the streams' Send-Q values are filled with data and only occasionally do 1 or 2 streams have empty buffers, which looks a much healthier picture:
netstat -tape |head -2;netstat -tape |grep dcd
Active Internet connections (servers and established)
Proto Recv-Q Send-Q Local Address Foreign Address State
tcp 0 2364766 se10.tier2.hep.manche:35220 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1339838 se10.tier2.hep.manche:35219 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1580279 se10.tier2.hep.manche:35218 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1496295 se10.tier2.hep.manche:35214 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1361558 se10.tier2.hep.manche:35212 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1515119 se10.tier2.hep.manche:35213 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1401968 se10.tier2.hep.manche:35217 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1222623 se10.tier2.hep.manche:35215 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
tcp 0 1255781 se10.tier2.hep.manche:35216 dcdoor14.usatlas.bnl.:21431 ESTABLISHED
The change also worked well for the other sites. For example, Oxford before:
2000000000 bytes 210.55 KB/sec avg 172.16 KB/sec inst
2000000000 bytes 237.41 KB/sec avg 38.85 KB/sec inst
2000000000 bytes 209.80 KB/sec avg 39.54 KB/sec inst
2000000000 bytes 206.65 KB/sec avg 30.98 KB/sec inst
2000000000 bytes 263.14 KB/sec avg 144.02 KB/sec inst
and after:
2000000000 bytes 76593.14 KB/sec avg 76593.14 KB/sec inst
2000000000 bytes 50468.35 KB/sec avg 29142.64 KB/sec inst
2000000000 bytes 45316.12 KB/sec avg 2982.70 KB/sec inst
2000000000 bytes 25631.56 KB/sec avg 12115.36 KB/sec inst
2000000000 bytes 18548.20 KB/sec avg 7176.38 KB/sec inst
We are all wondering whether it is worth spending time learning why only a few sites had this problem and why a 1M initial buffer size is better than 87k. But I suspect that, since the transfers now work, we will only find out if we stumble upon the answer. A possible explanation of why a larger initial value is recommended almost nowhere is that suggestions from sites like fasterdata are tailored for web servers, whose small transfers are the few kB of a web page, while our small transfers are normally log files of a few MB. Also, I don't think web servers use multi-stream transfers, and the fasterdata suggestion was working fine with a very limited number of streams.