23 November 2011

The best rate we can expect from ATLAS's SONAR tests involving RAL should be that of internal transfers from one space token at RAL to another space token at RAL. The SONAR plot for large files (over 1 GB) for the last six months is:

Averaging this leads to:

This gives an average rate of 18.4MB/s, with spikes in the 12-hour average to above 80MB/s. (Individual file transmission rates across the network, excluding overhead, have been seen at over 110MB/s.) This relates well to the 1Gbps NIC limit on the disk servers in question.

Now we know that, of the StoRM, dCache, DPM and Castor systems within the UK, Castor tends to have the longest interaction overhead for transfers. Over the last week the overhead for RAL-RAL transfers has varied between 14 and 196 seconds, with an average of 47 seconds and a standard deviation of 24 seconds.
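As a rough illustration of why the overhead matters, here is a small sketch of the effective rate seen by one transfer when a fixed overhead is added to the time spent on the wire. The 1 GB file size is our assumption for illustration (the plots cover files over 1 GB); the 110MB/s wire rate and the 14, 47 and 196 second overheads are the figures quoted above.

```python
def effective_rate_mb_s(size_mb, wire_rate_mb_s, overhead_s):
    """Effective rate of one transfer: size over (overhead + time on the wire)."""
    transmission_s = size_mb / wire_rate_mb_s
    return size_mb / (overhead_s + transmission_s)

# Assumed 1 GB (1000 MB) file; wire rate and overheads as quoted in the post.
for overhead_s in (14, 47, 196):
    rate = effective_rate_mb_s(1000, 110, overhead_s)
    print(f"overhead {overhead_s:3d} s -> {rate:5.1f} MB/s")
```

With the average overhead, a 1 GB file comes out a little under 18 MB/s, which is in the same region as the 18.4MB/s average seen in the plots.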

12 November 2011

Storage is popular

Storage is popular: why, only this morning GridPP storage received an offer of marriage from a woman from Belarus (via our generic contact list). I imagine they will stick wheels on the rack of disk servers so they can push it down the aisle. We need a health and safety risk assessment. Do they have doorsteps in churches? Do they have power near the altar or should we bring an extension? If they have raised floors, can we lay the cables under the floor? And what about cooling?

Back to our more normal storage management, it is worth noting that our friends in WLCG have kicked off a TEG working group on storage. TEG, since you ask, means Technical Evolution Group - the evolution being presumably the way to move forward without rocking the boat too much, i.e. without disrupting services. The group's role is to look at the current state, successes and issues, and how to then move forward - looking ahead about five years. In good and very capable hands with chairs Daniele Bonacorsi from INFN and our very own Wahid Bhimji from Edinburgh, the group membership is notable for being inclusive in the sense of having WLCG experiments, sites, middleware providers, and storage admins involved. Although the work focuses on the needs of WLCG, it will also be interesting to compare with some of the wider data management activities.

11 November 2011

RAL T1 copes with ATLAS spike of transfers.

Following recent issues at the RAL T1, we were worried not just about the overall load on our SRM caused by ATLAS using the RAL FTS, but also about the rate at which they put load on the system.
At ~10pm on the 10th of November 2011 (UTC), ATLAS went from running almost empty to almost full on the FTS channels involving RAL that are controlled by the RAL FTS server. This can be seen in the plot of the number of active transfers:

This was caused by ATLAS suddenly putting many transfers into the ATLAS FTS, which can be seen in the "Ready" queue:

This led to a high transfer rate, as shown here:
It is also seen in our own internal network monitoring:

The FTS rate covers only transfers going through the RAL FTS (i.e. it does not include puts by the CERN FTS, gets from other T1s, or the chaotic background of dq2-gets, dq2-puts and lcg-cps, none of which are covered in these plots). Hopefully this means our current FTS settings can cope with the start of these ATLAS data transfer spikes. We have seen from previous backlogs that a spike of typical size leads to a temporary backlog which clears well within a day.

25 October 2011

"Georgina's" Travels

So I received various postcards from Georgina's family from their new homes around the world.
~9 months after Georgina's birth she has:
1497 unique children in 79 Houses in a total of 265 rooms.
(Further analysis is hard to describe in an anthropomorphic world, since dataset replicas would have to involve cloning in Dave and Georgina's world.)
Taking this into account, the distribution of the number of replicas across the 1497 datasets is as follows:
556/1497 datasets only have one copy.
Maximum number of copies of any "data" dataset is 20.
Maximum number of copies of any "group" dataset is 3.
Maximum number of copies of any "user" dataset is 6.
What concerns me is that 279/991 user- or group-derived datasets have only one copy on the grid.

12 October 2011

Rumours of my storage have been somewhat exaggerated?

It has been reported that RAL's tape capacity has grown by some factor. The most likely explanation, I deduce, is that at least one of the backend databases has been upgraded from 2.1.10-0 to 2.1.10-1:

/* Convert all kibibyte values in the database to byte values */
UPDATE vmgr_tape_denmap
   SET native_capacity = native_capacity * 1024;
UPDATE vmgr_tape_pool
   SET capacity = capacity * 1024;
As you can see, the internal accounting numbers are multiplied by a factor of 1024, which obviously confuses the CIP. The new CIP (2.2.0) has code to deal with this, but we can backport it to the current one. The caveat is that not all CASTOR instances may have been updated; we will check that.
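To make the factor-of-1024 confusion concrete, here is a minimal sketch of how a consumer of these numbers could normalise them to bytes depending on which side of the upgrade a database is. This is not the CIP's actual code: the helper names are hypothetical, and how the version would be discovered for each CASTOR instance is left open.

```python
def parse_version(text):
    """Turn '2.1.10-1' into the tuple (2, 1, 10, 1) so versions compare correctly."""
    return tuple(int(part) for part in text.replace("-", ".").split("."))

# Per the SQL above, the upgrade to 2.1.10-1 converts stored values to bytes.
BYTES_SINCE = parse_version("2.1.10-1")

def capacity_in_bytes(raw_value, db_version):
    """Hypothetical helper: normalise a stored capacity to bytes."""
    if parse_version(db_version) >= BYTES_SINCE:
        return raw_value          # already in bytes
    return raw_value * 1024       # still in kibibytes

print(capacity_in_bytes(1000, "2.1.10-0"))      # value stored as kibibytes
print(capacity_in_bytes(1024000, "2.1.10-1"))   # same capacity stored as bytes
```

A reader that always multiplies by 1024, regardless of version, would report the upgraded database as 1024 times larger than it is, which is the growth that was reported.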

27 September 2011

SRM speedup update

Or should that be speedupdate? If you remember the Jamboree last year in Amsterdam, one of the suggestions was to decrease the negotiation overheads in SRM by making more efficient use of the socket. Our very own Paul Millar from DESY has come up with a demonstrator using a Lua shell which is able to reach SRMs by calling the functions in the API, like S2 does, but with perhaps a simpler language to learn, and you're in a shell.

What Paul demonstrated was the speedup obtained step by step: first calling each function individually, then turning off GSI delegation, and finally reusing the socket using HTTP KeepAlive. You'd not be surprised to see a big improvement - but of course the server must support KeepAlive.

Combined with the immediate return on srmGet when a file does not need staging, this could again speed up multiple file accesses (and of course you can still submit multiple file requests in a single SRM request.)
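The socket reuse idea can be tried out without any SRM at all. The sketch below is plain HTTP against a throwaway local server from the Python standard library (so no GSI, no SRM, and nothing to do with Paul's Lua code); it simply counts how many client connections the server sees for the same number of requests, with and without KeepAlive. All names in it are ours.

```python
import http.client
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer

class Handler(BaseHTTPRequestHandler):
    protocol_version = "HTTP/1.1"   # HTTP/1.1 is needed for persistent connections
    connections = set()             # distinct (host, port) client sockets seen

    def do_GET(self):
        Handler.connections.add(self.client_address)
        body = b"ok"
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):   # keep the demo quiet
        pass

def count_connections(n_requests, reuse):
    """Send n_requests GETs and return how many connections the server saw."""
    Handler.connections = set()
    server = HTTPServer(("127.0.0.1", 0), Handler)   # port 0: pick any free port
    threading.Thread(target=server.serve_forever, daemon=True).start()
    host, port = server.server_address
    try:
        if reuse:
            conn = http.client.HTTPConnection(host, port)
            for _ in range(n_requests):
                conn.request("GET", "/")
                conn.getresponse().read()   # drain before reusing the socket
            conn.close()
        else:
            for _ in range(n_requests):
                conn = http.client.HTTPConnection(host, port)
                conn.request("GET", "/")
                conn.getresponse().read()
                conn.close()
    finally:
        server.shutdown()
        server.server_close()
    return len(Handler.connections)

if __name__ == "__main__":
    print("new socket per request:", count_connections(5, reuse=False))
    print("socket reused:         ", count_connections(5, reuse=True))
```

Every connection saved is a TCP handshake saved, and in the real SRM case a security handshake as well, which is where the bulk of the negotiation overhead would be expected to go.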

Paul has published the code: you can find the dCache Lua SRM interface on the dCache web site.

14 September 2011

Bringing SeIUCCR to people

Am at the SeIUCCR (pronounced "succor" - no, not "sucker") summer school at the Coseners House in Abingdon and the doors are open out to the garden and we can well believe it is a summer school. Last time I lectured in a summer school (cloud security) I had made the presentation a bit too easy, so this time (data management) I included some hairy stuff. While it was basically about uploading data to the grid and moving it around, the presentation covered the NGS and GridPP, i.e. Globus and gLite, and we also (once) queried the information system directly (which was the aforementioned hairy part). But, like the 2-sphere, no talk can be hairy everywhere. Oh, and all the demos worked, despite being live.

The main idea is that grids extend the types of research people can do, because we enable managing and processing large volumes of data, so we are in a better position to cope with the famous "data deluge." Some people will be happy with the friendly front end in the NGS portal but we also demonstrated moving data from RAL to Glasgow (hooray for dteam) and to QMUL with, respectively, lcg-rep and FTS.

If you are a "normal" researcher (i.e. not a particle physicist :-)) you normally don't want to "waste time" learning grid data management, but the entry level tools are actually quite easy to get into, no worse than anything else you are using to move data. And the advanced tools are there if and when you eventually get to the stage where you need them, and not that hard to learn: a good way to get started is to go to GridPP and click the large friendly HELP button. NGS also has tutorials (and if you want more tutorials, let us know.)

It is worth mentioning that we like holding hands: one thing we have found in GridPP is that new users like to contact their local grid experts - which is also the point of having campus champions. We should have a study at the coming AHM. Makes it even easier to get started. You have no excuse. Resistance is futile.

05 September 2011

New GridPP DPM Tools release: now part of DPM.

I'm happy to announce the release of the next version of the GridPP DPM toolkit, which now includes some tools for complete integrity checking of a disk filesystem against the DPNS database.
It should also be able to checksum the files, although this takes a lot longer.
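For anyone wondering what such a check amounts to, here is a minimal sketch of the idea. It is not the toolkit's code: it assumes you have already obtained two {path: checksum} mappings, one from walking the filesystem and one from the namespace database, and the use of Adler-32 here is an assumption for illustration.

```python
import zlib

def adler32_of(path, chunk_size=1 << 20):
    """Adler-32 of a file, read in chunks, as an 8-digit hex string."""
    value = 1   # Adler-32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return f"{value & 0xffffffff:08x}"

def compare(on_disk, in_database):
    """Compare two {path: checksum} mappings and report the differences."""
    disk_paths, db_paths = set(on_disk), set(in_database)
    return {
        # on disk but unknown to the database
        "orphans": sorted(disk_paths - db_paths),
        # in the database but not on disk
        "missing": sorted(db_paths - disk_paths),
        # present in both, but the checksums disagree
        "mismatched": sorted(p for p in disk_paths & db_paths
                             if on_disk[p] != in_database[p]),
    }

# Made-up paths and checksums, purely to show the shape of the result.
disk = {"/fs1/a": "024d0127", "/fs1/b": "00000001", "/fs1/c": "11111111"}
db = {"/fs1/a": "024d0127", "/fs1/b": "deadbeef", "/fs1/d": "22222222"}
print(compare(disk, db))
```

Comparing names alone only needs a directory walk; filling in the checksums means reading every byte of every file, which is why that part takes a lot longer.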

The bigger change is that the tools are now provided in the DPM repository, as the dpm-contrib-admintools package. Due to packaging constraints, this RPM installs the tools to /usr/bin, so make sure it is earlier in your path than the old /opt/lcg/bin path...
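If you are not sure which copy you will pick up, a quick check of the order of your PATH is enough. A small sketch (the function name is ours):

```python
import os

def first_match(path_value, candidates):
    """Return whichever candidate directory appears first in a PATH string."""
    for entry in path_value.split(os.pathsep):
        entry = entry.rstrip("/")   # treat /usr/bin/ and /usr/bin alike
        if entry in candidates:
            return entry
    return None

winner = first_match(os.environ.get("PATH", ""), {"/usr/bin", "/opt/lcg/bin"})
print("tools will be picked up from:", winner)
```

If this prints /opt/lcg/bin, you are still running the old tools.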

Richards would like to encourage other groups with useful DPM tools to contribute them to the repo.

04 August 2011

Et DONA ferentis?

If you've been reading papers in the past few years you would have seen DOIs in the references. (I mean academic papers, not newspapers.) The idea is that it saves you from writing

Journal of Theoretical and Applied Irrelevance, 53 Vol. 3 (1), (2008), pp.312-322.

when instead you can write a simpler string, a handle, which identifies the data.

For this to work, you will need a handle resolution service. Think of DNS. If you did it "by hand" you would resolve www.gridpp.ac.uk with:
dig www.gridpp.ac.uk. A IN
to an IP address, and then maybe telnet to port 80 or something. Or think of the GUID-to-SURL mapping in LFC that we all know and love (or the SURL-to-TURL mapping). Similarly, handles have to resolve into something that is the stuff you're looking for. Doesn't have to be a paper, it could also be data, or even particular versions of data.

Enter the Handle System. Patrice Lyons and Bob Kahn - one of the fathers of the Internet - from CNRI, are proposing to establish a global handle system, morally equivalent to ICANN, to manage the uniqueness and persistence of handles. This will be the Digital Object Numbering Authority, or DONA.

Of course, like DNS or GUIDs, there is no assurance that the data you're looking for is actually there. In fact, persistence is meaningful even for temporary objects, in the sense of the handle being associated with the object forever, even if the object itself doesn't live forever.
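The two properties, uniqueness and persistence, are easy to state in code. The toy registry below lives in memory and has nothing to do with the real Handle System protocol; the class, its methods, and the example handle and location are all made up for illustration.

```python
class ToyHandleRegistry:
    """In-memory illustration of unique, persistent handles."""

    def __init__(self):
        # handle -> current location, or None once the object has gone away
        self._locations = {}

    def register(self, handle, location):
        """Assign a handle exactly once (uniqueness)."""
        if handle in self._locations:
            raise ValueError(f"handle already assigned: {handle}")
        self._locations[handle] = location

    def withdraw(self, handle):
        """The object is gone, but the handle stays assigned (persistence)."""
        if handle not in self._locations:
            raise KeyError(handle)
        self._locations[handle] = None

    def resolve(self, handle):
        """Return the current location, or None if the object no longer exists."""
        return self._locations[handle]

registry = ToyHandleRegistry()
registry.register("10.9999/example-dataset.v1", "srm://se.example.org/data/v1")
print(registry.resolve("10.9999/example-dataset.v1"))
registry.withdraw("10.9999/example-dataset.v1")
print(registry.resolve("10.9999/example-dataset.v1"))   # None: handle known, object gone
```

Note that withdrawing the object does not free the handle for reuse: registering the same handle again is an error, temporary object or not.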

Sounds simple? Well, apart from those temporary objects, the system may need to be able to deal with modifiable objects, versions, replicas, and (possibly) part handles. Then there is typing it in again from a printed representation. And what is the object? Is it the object as a sequence of bits, or is it the "curation object" which goes in as a Word97 document, say, and is referenced later as PDF?

We might even have a GFAL-type library which knows how to resolve handles into data, so the application doesn't have to know. Meanwhile, the handles are coming: apart from the publishers' DOIs in the papers, you can see the entertainment industry have also picked it up with EIDR.

18 July 2011

Storage accounting in OSG and OGF

Groups like UR are getting around to discussing storage records. OSG already create storage records: they have XML-formatted records for both the transfer and the file history. (With thanks to Steve Timm from FNAL.)
<StorageElementRecord xmlns:urwg="http://www.gridforum.org/2003/ur-wg">
<RecordIdentity urwg:createTime="2011-07-17T21:18:07Z" urwg:recordId="head01.aglt2.org:544527.26"/>
Over in GLUE-land, the GLUE group insist that using the GLUE schema to publish accounting data - and indeed to use GLUE data for anything other than resource selection - "cannot be done." Unfortunately the chairs didn't make it to OGF, but next steps will include work on the XML rendering of GLUE 2.0, along with the implementations.
Meanwhile, back home in GridPP-land, we use GLUE 1.3 for dynamic data. The question is still mainly about the accuracy (and freshness) of the information published: e.g. how temporary copies on disk, files being "deleted" from tape, etc. should affect the published dynamic data. As we now have "accurate" tape accounting, the information provider should be updated soon.
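For the curious, the OSG record fragment quoted above is easy to pick apart with the Python standard library. In this sketch the closing tag has been added by us so that the fragment parses on its own; real records carry more elements than the two lines shown here.

```python
import xml.etree.ElementTree as ET

URWG = "http://www.gridforum.org/2003/ur-wg"

# The two lines quoted in the post, plus a closing tag added so that it parses.
fragment = """
<StorageElementRecord xmlns:urwg="http://www.gridforum.org/2003/ur-wg">
<RecordIdentity urwg:createTime="2011-07-17T21:18:07Z" urwg:recordId="head01.aglt2.org:544527.26"/>
</StorageElementRecord>
"""

def record_identity(xml_text):
    """Return the namespaced attributes of the RecordIdentity element."""
    root = ET.fromstring(xml_text.strip())
    identity = root.find("RecordIdentity")
    # ElementTree spells namespaced attribute names as {namespace-uri}local-name.
    return {
        "createTime": identity.get(f"{{{URWG}}}createTime"),
        "recordId": identity.get(f"{{{URWG}}}recordId"),
    }

print(record_identity(fragment))
```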

23 June 2011

A Little or a lot. How many transfers should an SRM be handling??

The new ATLAS dashboard (version 2.0) now allows for better analysis of data flows. For the RAL T1 ATLAS endpoint of Castor, the breakdown of the number of successful transfers from across the world, both to and from RAL, is as follows.
Firstly into RAL (over the last four weeks):












Unsurprisingly, the majority of transfers are from within the UK (due to the UK Tier 2s). However, 16.8% of transfers in are from outside the UK. (3%, or 59k, of the transfers are internal RAL-RAL transfers.)

The numbers of transfers when RAL is the source are:












(NB. There is a small amount of double counting, as the 59299 RAL-RAL transfers appear in both sets of figures in the "UK" values.) The average file size was 287 MB and the average file took 80.43 seconds to copy.
The number of transfers for a selection of T1s is:
100k per day at RAL for ATLAS.
320k per day at BNL for ATLAS.
140k per day at FZK for ATLAS.
150k per day at IN2P3 for ATLAS.

Now the SRM also has to handle files being written into it from the WNs at a site. The number of completed jobs for a selection of T1s is:
18k per day at RAL for ATLAS.
50k per day at BNL for ATLAS.
27k per day at FZK for ATLAS.
15k per day at IN2P3 for ATLAS.

Now each job on average produces two output files, meaning that for RAL ~35/135 of its SRM transfers (~1/4) come from its worker nodes.
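The same back-of-envelope sum can be done for the other T1s. The sketch below uses the per-day figures quoted above; applying the "two output files per job" average to every site, and counting each output file as one SRM write, are our assumptions.

```python
def wn_share(transfers_per_day, jobs_per_day, outputs_per_job=2):
    """Fraction of SRM transfers that are job outputs written from worker nodes."""
    wn_writes = jobs_per_day * outputs_per_job
    return wn_writes / (transfers_per_day + wn_writes)

# (transfers per day, completed jobs per day), as quoted in the post.
sites = {
    "RAL": (100_000, 18_000),
    "BNL": (320_000, 50_000),
    "FZK": (140_000, 27_000),
    "IN2P3": (150_000, 15_000),
}

for name, (transfers, jobs) in sites.items():
    print(f"{name:6s} {wn_share(transfers, jobs):.0%} of SRM transfers from WNs")

# Average rate implied by the quoted 287 MB in 80.43 seconds.
print(f"average per-file rate: {287 / 80.43:.2f} MB/s")
```

For RAL this reproduces the roughly one quarter quoted above.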

UK T2s do approximately 80k transfers per day for ATLAS (and complete ~50k jobs per day).

14 June 2011

FTS overhead factors and how to try and improve rates.

Within the UK we have been trying to speed up transfer rates. This has been a twofold approach.
1- Speed up the data transfer phase of the file by changing network and host settings on a disk server. Mainly this has meant following the advice in the work of the good people at LBNL on:


2- The other area was to look at the overhead in the SRM and its communication with the FTS service.

So we had a tinker with the number of files and number of threads on an FTS channel and got some improvement in overall rate for some channels. But as part of improving single file transfer rates (as part of our study to help the ATLAS VO SONAR test results) we started to look into the overhead of prepare-to-get and prepare-to-put in the source and destination SRMs.

We have seen in the past that synchronous (rather than asynchronous) getTURL was quicker, but what we did notice was that within an FTS transfer the sum of the time to preparetoGET and preparetoPUT varied greatly between channels. There is a strong correlation between this amount of time and the SRM involved at each end of the transfer. What we noticed was that transfers which involved CASTOR as the destination SRM (preparetoPUT) were regularly taking over 30s to prepare (and regularly taking 20s to prepare as a source site). Hence we started to look into a way of reducing the effective overhead of "prepare to transfer" for each file.
Looking at new improvements and options in the FTS, we discovered (or rather were pointed at) the following decoupling of the SRM preparation phase and the transfer phase:


Now it was pointed out to me by my friendly SRM developer that there is a timeout (of 180 seconds) which will fail a transfer if this time elapses between the end of the prepare phase and the start of the actual transfer on the disk server. Therefore we wanted to try this new functionality on transfers which:
1- Had a large ratio of preparation time to transmission time (i.e. either CASTOR as a destination or source).
2- Where the majority of transfer times per transfer were less than 180 seconds (either small files or fast connections).

Looking at the value of ({Preparation Time} + {Transmission Time}) / {Transmission Time},
we got the following values.
Channel ratios for ATLAS, (CMS) and {LHCb}
<UKT2s-RAL>=15.1 (2.7)
<RAL-UKT2s>=5.5 (1.9)
<T1s-RAL>=4.3 (1.2) {8.6}
<*-RAL>=3.1 (1.2)
<*-UKT2s>=6.7 (1.01)
<"slow transfer sites">=1.38 (1.02)

This showed that UKT2s-RAL transfers for ATLAS met these criteria, so we have now turned this on (which seems to add ~1.5 seconds to each transfer, so you might only want to set this Boolean to true for channels where you intend to change the ratio), and we have now set the ratio of SRM prepares to transfers to 2.5 for all UKT2s to RAL channels. No problem with jobs timing out has been seen, and we have been able to reduce the number of concurrent file transfers without reducing the overall throughput.
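The selection of channels boils down to two small checks, sketched below. The 180 second timeout is the one quoted above; the threshold of 2.0 on the ratio and the example timings are arbitrary values of ours, purely for illustration.

```python
SRM_TIMEOUT_S = 180   # timeout between end of prepare and start of transfer

def overhead_ratio(preparation_s, transmission_s):
    """(preparation + transmission) / transmission, as in the table above."""
    return (preparation_s + transmission_s) / transmission_s

def is_candidate(preparation_s, transmission_s, min_ratio=2.0):
    """True if a channel looks worth decoupling prepare from transfer.

    min_ratio is an illustrative threshold, not a value from the FTS.
    """
    fits_in_timeout = transmission_s < SRM_TIMEOUT_S
    return fits_in_timeout and overhead_ratio(preparation_s, transmission_s) >= min_ratio

# Example: 30 s to prepare, 10 s on the wire (made-up timings).
print(overhead_ratio(30, 10))    # 4.0
print(is_candidate(30, 10))      # True: mostly overhead, and short transfers
print(is_candidate(30, 600))     # False: transfers outlast the timeout
```

A ratio close to 1 means the channel spends nearly all its time moving data, so there is little to gain from overlapping the prepare phase with transfers.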

13 June 2011

Dave is ageing (but not forgotten), Hello to Georgina

Well I am not actually dead, but my importance is receding. They say a week is a long time in politics, well 1 day in the LHC is not like all the others. I was one of the early runs from 2010. My 2011 compatriots are now a lot larger. Take a comparison between me and my new friend "Georgina"

Georgina/Dave numbers are:
973/103 Luminosity blocks => 9 times more blocks.
12,140,770/1,101,123 events => 11.02 times the events.
203.6/31.8 Hz event rate => 6.4 times the rate.
16hrs31'46"/9hrs36'51" Beam time => 1.7 times greater than Dave.
2.01e4/7.72e-3 of integrated luminosity => 2.6M times the data.
16219.5/5282.5TB of all RAW datasets => 3 times the volume.
15200/3831 files of all RAW datasets => 4 times the number of files.
0.541/3.127TB for the MinBias subset => 0.17 times the volume.
977/1779 files for the MinBias subset => 0.55 times the number of files.
So it appears for this comparison that the file size is about 3.2 times smaller for the MinBias subset....

For those of interest if my ATLAS "DNA" is; then Georgina's is
Of course what you really want to know (or not) is where in the world Georgina and her relations are, and what her birthday calendar looks like. My avatar is interested to find out whether, since Georgina is so much bigger than I am, she will have more children in more rooms, and how long they will last...

31 May 2011

Decimation of children.

The number of children I have is drastically reducing....
I now only have 11.76 TB of unique data including myself (84996 unique files in 334 datasets).
In total there are only 732 datasets now.
The number of replicas is dramatically reducing, but some children are still popular.
# Reps |# Datasets
1 |236
2 | 38
3 | 10
4 | 24
5 | 5
6 | 6
7 | 3
8 | 1
9 | 1
10 | 2
12 | 1
16 | 1
20 | 1
22 | 1
24 | 1
26 | 1
27 | 1
28 | 1
I.e. now only 98/334 actually have replicas.
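The table above can be cross-checked against the totals quoted in the text with a few lines of Python (the numbers are copied straight from the table):

```python
# number of copies -> number of datasets with that many copies
replica_counts = {
    1: 236, 2: 38, 3: 10, 4: 24, 5: 5, 6: 6, 7: 3, 8: 1, 9: 1, 10: 2,
    12: 1, 16: 1, 20: 1, 22: 1, 24: 1, 26: 1, 27: 1, 28: 1,
}

unique_datasets = sum(replica_counts.values())
replicated = unique_datasets - replica_counts[1]
total_copies = sum(copies * n for copies, n in replica_counts.items())

print(unique_datasets)   # 334 unique datasets
print(replicated)        # 98 with more than one copy
print(total_copies)      # 732 dataset copies in total
```

So the 732 datasets "in total" are the 334 unique datasets counted once per copy.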

My birthday calendar has also changed. The new birthday calendar looks like:

This shows that, other than the datasets produced within the first 12 weeks of life, at least 659 out of 718 datasets from the last year have had all copies deleted. It also shows that there has been a hive of activity, with new datasets produced (a re-processing) in time for the important "Moriond" conference. It also shows that the majority of datasets are only useful for less than one year. (But it is noticeable that the files first produced seem to be the most long-lived.)

27 May 2011

At last - accurate tape accounting?

So it seems we now have accurate tape accounting - CMS have looked at the new numbers generated by the new CIP code and declared that they match their expectations.

The code accounts for data compression as it goes to tape - and estimates the free space on a tape by assuming that other data going to the same tape will compress in the same ratio. Also, as requested by ATLAS, there is also accounting for the "unreachable" data, i.e. data which can't be read because the tape is currently disabled, or free space which can't be used because the tape is read-only.
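In outline, the estimate works something like the sketch below. This is not the CIP code: the function names, the status labels, and the choice to count free space on a disabled tape as unusable are all assumptions made for illustration.

```python
def estimated_free_bytes(native_capacity, bytes_on_tape, logical_bytes_written):
    """Estimate how much more (uncompressed) data will fit on a tape.

    Assumes future data compresses in the same ratio as what is already there.
    """
    if bytes_on_tape == 0:
        return native_capacity   # nothing written yet: no ratio to extrapolate
    ratio = logical_bytes_written / bytes_on_tape
    return (native_capacity - bytes_on_tape) * ratio

def split_by_status(status, used, free):
    """Split a tape's used and free bytes into reachable and unreachable parts."""
    if status == "disabled":
        # Data cannot be read; counting the free space as unusable is our assumption.
        return {"used": 0, "used_unreachable": used, "free": 0, "free_unusable": free}
    if status == "readonly":
        # Data is readable, but nothing more can be written.
        return {"used": used, "used_unreachable": 0, "free": 0, "free_unusable": free}
    return {"used": used, "used_unreachable": 0, "free": free, "free_unusable": 0}

# 600 units of data took 400 units of a 1000-unit tape (ratio 1.5).
free = estimated_free_bytes(1000, 400, 600)
print(free)
print(split_by_status("readonly", 600, free))
```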

All the difficult stuff should now be complete: the restructuring of the internal objects to make the code more maintainable, and the nearline (aka tape) accounting. Online accounting will stay as it is for now.

11 April 2011

Summary of GridPP storage workshop at GridPP26

The full agenda is here (scroll down to "end of main meeting" :-), with presentations attached: http://www.gridpp.ac.uk/gridpp26/

Thanks again to all the sysadmins for the interesting discussions, very technical and with a high level of expertise - and I think I was the lone voice in mentioning the words "project planning" and "deliverables" and "metrics," as it should be :-)

And as you've seen, it was Dave's birthday - a glass or two was raised in The Prince George (we somehow have lots of vegetarians in the group.)

A special thanks again to our friends from Dell, not just for sponsoring GridPP26, but also for working closely with the T2s (specifically Edinburgh, Sussex, and QMUL), for being genuinely interested in the data grid we have built, and in particular for a couple of outstanding technical presentations - it is rare that you get the ever-multitasking sysadmins to close their laptops!

Oh, and as you've no doubt spotted, the graph is planar, so we conclude it does not contain a copy of K3,3 or K5. You have mathematicians writing summaries, you get planar graphs :-)

05 April 2011

GridPP26 Pictures

Last time I took pictures, I put them on the grid, and challenged people to read them.

This time I put them on the cloud: see if that works. Pictures are mostly from the dinner sponsored by Dell.

30 March 2011

Happy Birthday!!

Thank you for all the messages of congratulations on my acquirement of 31.536 M seconds of existence. After the winter break my "mother" has started again and will soon start to give me more siblings, so I am going to be getting less attention than I did 11 months ago. Plus my mother has changed how she sends my children around the world, so I expect that my children will make fewer trips to other houses. However, I do still have descendants in 131 of the 705 rooms that ATLAS have.
Of these 131 rooms, 41 have only 1 resident, but the top four rooms have 561, 508, 472 and 192 residents!

09 March 2011

How to build a Data Grid

Went yesterday to a collaboration meeting for OOI; their oceanographic data grid is a collaboration with IC and QMUL. What is interesting is their completely different approach to building a "data grid" from how we've done it.

Building data grids was the subject of our (GridPP storage and data management) presentation at AHM last year (how to build an infrastructure that'll cope with LHC data); and an extended version will be presented at ISGC in a few weeks (more infrastructure focus, less LHC.)

While ours was essentially communication, policies, and trust, theirs is a very computersciencey approach - a message based infrastructure which promotes data to a "first class citizen" and uses formal methods (via Scribble) to implement distributed systems with "guaranteed correct behaviour." Interestingly they have about the same data rates from their sensor networks as we have in T1. Their data stream will compress to 700 MB/s.

We have thought about using formal methods before (mostly in proposals that weren't funded :-)), so it will be interesting to compare their approach to existing data grids like ESG or WLCG. Furthermore, some of their tools, like Scribble, may well find uses in some of our other projects.

(BTW, I am calling these grids "data grids," but not as much in the SRB/iRODS sense. As someone pointed out in the session, the emphasis in (any) grid processing sensor or instrument data is on the data. Computation can in principle be redone but sensor data can never be recaptured.)

15 February 2011

ATLAS Tape Store file sizes

Here is the profile of the files that are stored by ATLAS at RAL.

Minimum file size stored for both DATA and MC is 20 bytes

Maximum file size stored is:[12963033040,11943444017] or [13GB,11.9GB]

Average file size is: [1791770811,637492224] or [1.79GB,637MB]
The median filesize for [DATA,MC] are [2141069172,602938704] ([2.1GB,602MB] ) .

Number of files stored is: [282929,672707] of [DATA,MC] files for a total size of [506943923754228,428845482035182] or [507TB,428TB] in total.

[37,5687] files in [DATA,MC] are zero sized (but we don't have to worry about them as the tape system does not copy 0 size files to tape).

However these are better than the [538,537] 20 byte files which have been migrated to tape (these are 0B sized log files which have then been "tar"ed and "gzip"ed before being written into Castor).

The modal file sizes are [26860316,0] bytes, with [286,5687] files of this size.
These are most likely failed transfers. The next modal file size, with 537 entries, is 20 bytes, but these are just test files. The first genuine modal file sizes jointly have 13 files each, with sizes of 19492 and 19532 bytes.

Whereas [254040,626556] files have a unique file size (this equates to only [89.8,93.1] percent of files having a unique file size, so checksums are important!!)
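The averages and percentages above follow directly from the counts; a short sketch using the figures quoted in this post:

```python
stats = {
    "DATA": {"files": 282929, "total_bytes": 506943923754228, "unique_sizes": 254040},
    "MC":   {"files": 672707, "total_bytes": 428845482035182, "unique_sizes": 626556},
}

def summarise(entry):
    """Return (mean file size in bytes, percentage of files with a unique size)."""
    mean_bytes = entry["total_bytes"] / entry["files"]
    unique_pct = 100 * entry["unique_sizes"] / entry["files"]
    return mean_bytes, unique_pct

for name, entry in stats.items():
    mean_bytes, unique_pct = summarise(entry)
    print(f"{name:4s} mean {mean_bytes / 1e6:8.1f} MB, unique size {unique_pct:.1f}%")
```

The remaining 7 to 10 percent of files share their size with at least one other file, so size alone cannot tell you that the right file arrived; that takes a checksum.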

Could be worse though: one VO successfully stored a file that is one byte in size (in fact the header and footer on the tape file, plus compression, actually increased the size of the file as stored on tape......)

07 February 2011

Get yer clouds here

At the risk of, er, promoting one of my own presentations, can I remind you to not forget to remember to join the NGS surgery this coming Wednesday, 9th, at the usual time just after the storage meeting, 10:30ish-11:30. The subject will be cloud storage in an NGI context, looking at the hows and whys and whats and whatnots, with room for discussion, too (possibly.) You can EVO in as usual if you don't have AG.

26 January 2011

Dirk gets mentioned in Nature

So at least one person other than my avatar is aware of my existence. One of my children is mentioned in the article (even though the majority of the article is about me and ALL my children; can't be having favourites amongst them now, can I??)


An interesting point to note is that only 0.02% of the total data collected by ATLAS is represented; that's to say, if I were people and ATLAS were all the people in the world, then I would represent 1.2 million people.

ATLAS have also now changed the way they send my children out. Interestingly, I am now in 70/120 houses. The breakdown of where these rooms are is as follows:

9 rooms at BNL-OSG2.
8 rooms at CERN-PROD.
6 rooms at IN2P3-CC.

This is in total 139 of the 781 rooms that ATLAS have.
The number and type of rooms are:

56 rooms of type DATADISK.
26 rooms of type LOCALGROUPDISK.
17 rooms of type SCRATCHDISK.
7 rooms of type USERDISK.
5 rooms of type PHYS-SM and PERF-JETS.
4 rooms of type PERF-FLAVTAG.

11 January 2011

Who cares about TCP anyway....

Don't worry I haven't injured myself and need a cut sterilizing, I mean window sizes!!!
So as part of my work to look at how to speed up individual transfers, I thought I would go back and look to see what the effect of changing some of our favourite TCP window settings would be. These are documented at http://fasterdata.es.net/TCP-tuning/

Our CMS instance of Castor is nice, since CMS have separate disk pools for incoming WAN transfers, for outgoing WAN transfers, and for internal transfers between WNs and the SE. This is a great feature as it means the disk servers in WanIn and WanOut will never have 100s of local connections (a worry I have about setting TCP settings too high), so we experimented to see what the effect of changing our TCP settings would be.

I decided to study the international transfers, as these are the large-RTT transfers and the most likely to benefit from tweaking. Our settings before the change were 64kB for the default and 1MB for the maximum window size.
This led to a maximum transfer rate per transfer of ~60MB/s and an average of ~7.0MB/s.
This appears to be hardware dependent across the different generations of kit.
We changed the settings to 128kB and 4MB. This led to an increase to ~90MB/s maximum data transfer rate per transfer and an average transfer rate of ~11MB/s, so roughly a 50% increase in performance. This might not seem a lot since we doubled and quadrupled our settings... However, further analysis improves matters: changing TCP settings is only going to help with transfers where the settings at RAL were the bottleneck.
For channels where the settings at the source site are already the limiting factor, these changes would have a limited effect. However, looking at transfers from FNAL to RAL for CMS we see a much greater improvement.

Before the tweak the maximum file transfer rate was ~20MB/s with an average of 6.2MB/s. After the TCP tweak these increased to 50MB/s and 12.9MB/s respectively.

Another set of sites where the changes helped dramatically was the US Tier 2s transferring to RAL (over the production network rather than the OPN). Before the tweaks the transfers peaked at 10MB/s and averaged 4.9MB/s. After the tweaks, these values were 40MB/s and 10.8MB/s respectively.
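The reason the window size matters more on long paths is that a single TCP stream can have at most one window of data in flight per round trip. A minimal sketch of that upper bound follows; the 100 ms round-trip time is an assumed figure for a transatlantic path, as no RTT is quoted in this post, and the real rate also depends on the window at the other end and on packet loss.

```python
def max_rate_mb_s(window_bytes, rtt_s):
    """Upper bound on single-stream TCP throughput: one window per round trip."""
    return window_bytes / rtt_s / 1e6

RTT_S = 0.100   # assumed round-trip time; not a measured value

for label, window in (("1MB", 1 * 1024**2), ("4MB", 4 * 1024**2)):
    print(f"max window {label}: at most {max_rate_mb_s(window, RTT_S):.1f} MB/s per stream")
```

Quadrupling the maximum window quadruples this ceiling, but a transfer only gets faster if that ceiling was what was holding it back.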

Now putting all these values into a spreadsheet and looking at other values we get:

Solid Line is Peak. Dotted line is average.
Green is total transfers.
Red is transfers from FNAL.
Blue is transfers to US T2 sites.
Tests on a pre-production system at RAL also show that the effects of these changes on LAN transfers are acceptable.