GridPP storage news: 2010

14 December 2010

My Birthday Calendar...

You know that feeling of trying to get a birthday card for a relative, can you imagine the trouble I have when I have so many children! Here is my current calendar with all the days I have a relatives' birthday marked (and the number of children on each day!)

What this calendar does not take into account are early datasets and defunct reprocessing datasets which have been deleted. I would love know what makes October 3rd and June 9th so popular??? Of course some of my children are alot bigger in terms of size and files ( and importance) which is lost in this plot.

Almost Nine months old.. But still going strong.

Its been a busy year for me,
I was born nearly nine months ago now and seem to have been kept spreading.
My avatar has been remiss in reporting my exploits so I am forcing him to give you an update.
I am now only in 19 countries. These are
Australia, Austria, Canada, China, Czech republic, Denmark, France, Germany, India, Israel, Italy, Japan, Netherlands, Portugal, Spain, Russia, Taiwan, UK and the USA.
I am at 68 physical sites (134 ATLAS endpoints) .
I am in a total of 2555 datasets ( but only 857 are unique).
The top 10 popular datasets have on averaged 20.7 copies on the grid. ( Ignoring these datasets leads to an average number of copies for all off Dave's datasets is 2.7.)
In total these unique datasets now are 33.4TB ( slightly more than the original 3TB that I was to start with!!!) So an increase in ~ a factor of 11. However my modest 1779 number of files has now increased to 698749 files; (an increase factor of 392).
My next post will show my birthday calendar. ( hoping to see if there is clustering around the time before conferences...)

08 December 2010

So who's winning the transfer rate race for the Tier2 sites in the UK

Well its not really a race but I thought it would be interested to see how the rates to and from the Tier2 with RAL.
So the clear winner is ( and probably always going to be) RALPPD. This should seem obvious since they are co-located so various factors (small rtt, large bandwidth link for examples;) which would lead to a high rates. This lead to the rate plot as shown below:

Here you can see the rate has got to over 250MB/s
A close second I thought was going to be Glasgow, (they have a longer rtt and lower bandwidth pipe so you would expect them to fair worse) :

This shows that they have got to 140MB/s
Sometimes it is unfair to compare these numbers since the number of concurrent transfers on the FTS channels varies,. Glasgow actually have more concurrent transfers ( ~140MB/s with 42 concurrent transfers compare to ~250MB/s with fewer concurrent transfers.) It is because Glasgow had the most concurrent transfers that I thought it would have the highest rates:
But the peak rates I see for the sites appears to be the following (in the last month at least). If a site thinks they have seen better rates then comment on this Post!!!
( rtt time between SEs is shown in brackets after each record. Well its actually the rtt to the closest router that traceroute can resolve.)

RALPP 250 MB/s ( .2ms)
Imperial 250 MB/s (5.0ms)
Manchester 200MB/s (8.0ms)
Glasgow 140Mb/s (10.9ms)
Lancaster 100MB/s (6.4ms)
QMUL 100MB/s (5.7ms)
Birmingham 90MB/s (8.8ms)
Brunel 90MB/s (6.4ms)
Oxford 90MB/s (8.9ms)
Sheffiled 80Mb/s (9.1ms)
Liverpool 75MB/s (9.3ms)
RHUL 40MB/s (8.2ms)
Cambridge 30MB/s (10.3ms)
ECDF 20MB/s (12.9ms)
Bristol 17MB/s (4.0ms)
UCL 15MB/s (7.3ms)
Durham 12 MB/s (14.8ms)

For some sites the limiting factor seems to be the link (ie transfers are running at line speed.) For other sites the limiting factor is the number of concurrent transfers currently set in FTS. Some thing to tweak further....( Something really interesting is that the top two sites are the only dCache we have, but this could just be coincidence since they are also the shortest and close to second shortest rtt times of any site.

ATLAS have also started their sonar T2-T2 mesh of testing inter cloud T2 transfers. this made me think of the work I had done (but not reported) about work I had done in looking in splitting STAR-T2 channels at RAL into a slow medium and fast channels. the rough split is western European sites in fast channel, north America in the medium channel and south America/Asia-pacific in slow channel ( going from rtt time) . This would be an initial split and then some tweaking if some sites were slower than their rtt would suggest.
Interesting to see if ATLAS list of slow transfers for UK sites match mine.
my list of slow sites would be:

Australia-ATLAS

BEIJING-LCG2

CBPF

EELA-UTFSM

LCG_KNU

MA-01-CNRST

NCP-LCG2

SDU-LCG2

TOKYO-LCG2

UNIANDES

TW-FTT

TR-10-ULAKBIM

INDIACMS-TIFR

Medium sites ( North America)would be:

Canadian

CA-ALBERTA-WESTGRID-T2

CA-SCINET-T2

CA-VICTORIA-WESTGRID-T2

SFU-LCG2

VICTORIA-LCG2

American

UST3

BUATLAS

UMICH

IUT

UTA

UIUC

UCTP

STU

UCT2

SMU

AGLT2

WIS

UMFS

SWT2UTA

SWT2CPN

Plus I am working with Glasgow to see how much of their 6Gbps can be used; but more on that in my next post.

03 November 2010

Confederated conference confusion

Summaries of the data management sessions at last week's OGF, as well as more CHEP discussion, have appeared in the minutes of this week's storage meeting. Meanwhile, there is a HEPiX meeting this week, which will (probably) be discussed at the storage meeting next week - particularly if we can find someone who went to it. Share and Enjoy.

24 October 2010

A computing banquet

Delayed post from Thursday....
So another day, another 100 talks, followed by another 100 food courses.

Just to provoke a suitable tirade from Sam, I will describe that morning's SSD plenary as an "informative and thorough summary on the unquestionable advantages of SSDs" (not really). More interesting was a talk from David South on long term data preservation for experiments. A very worthwhile idea I think, and I hope it can be supported.
In parallel talks, there was an update on Hammercloud developments - now available for LHCb, CMS as well ATLAS and apparanty in the future other VOs will be able to "plugin". Also coming in the future are more advanced/configurable statistics.
Andreas Peters outlined the cern "disk pool project" EOS, and, as mentioned by Sam, the obvious questions followed from Brian B and others, "why yet another filsystem/ storage manager? Are dCache. HDFS etc.etc. etc not worth adapting..." But to look at the positive side (as they are certainly going to carry on with this anyway) if something good develops then it maybe something worth T1s or T2s trying out.
In the same session there was another tool presented that we might use, a flexible benchmark that allows you to trace any application and then "play it back": copying the disk calls.Potentially very useful for testing out new kit - and, though it hasn't yet been packaged for general consumption, we'll defiantely be following up with the developer for a preview.

Friday was the stream leaders summaries, which you don't need since you have had ours ;-).
Overall I've been very impressed both with the organisation of the conference and of the quality of the talks...

And, as my temple visits should have placated the travel gods to guide me home through typhoons, strikes and whatever else is out there; we should be back to provide a more digested summary in next wednesdays storage meeting.

20 October 2010

CHEP: or, how I learned to stop worrying and love the rain.

So, day three of CHEP was a half day, so not so much to report as Wahid did yesterday.

The plenaries were not directly interesting from a storage perspective, but I should ebtion them for other qualities.
First, Kate Keahey told us all why clouds (and public clouds, federated clouds - "sky computing") were awesome. I guess I'm just a cynic, as I still don't see how they're significantly better than Condor pools (plus flocking, plus VM universe). Also, the data flow problem is decidedly unsolved for analysis-class jobs in this context.

Secondly, Lucas Taylor impressed on us how important it was to talk to the media (and, more importantly by far, the public). Apparently, the most significant source of hits on CERN webpages is Twitter! Considering also that the LHC is only 1/3 as popular as Barack Obama on YouTube, it does seem that the right approach can really bring in public interest, and this can only be a good thing.

Finally, Peter Malzcher told us about the FAIR project, which is to be the next big accelerator at GSI. Honestly, it looks awesome, but the 6MW cooling solution for the cluster looks terrifying.

Since I was presenting today, I only have notes from the session I was scheduled for.

The first two talks, both on virtualisation, confirmed that io can be an issue for many-VM hosts. The solution of the day appears to be iSCSI.
Then some dangerous radical told everyone ~~to throw their shoes in the machinery~~ that MLC flash isn't all it's cracked up to be in SSDs.
More upsets followed when Yves Kemp showed that pNFS/NFS4.1 is much better than dCap in almost all possible cases. It is, however, possible that dCap's problem is simply too much readahead.
Finally, Dirk Duellmann gave us an update from CERN storage. Essentially, they're pretty stable at the front-end, growing storage at 15PB/y. Additionally, they're trialling EOS for disk pool filesystems. EOS, as Jeff Templon got Dirk to admit under cross-examination, is basically Hadoop over xrootd protocol, with a better namespace.
Despite agreeing at Amsterdam that reinventing the wheel in private projects was Bad... (CERN could have chosen to patch Hadoop, or even Ceph, instead).

Tch.

CHEPping part 2

Having got my talk out of the way (more on that later), I am now free to blog my view on activities so far here in Taiwan. I will avoid telling you about the driving rain, puppet shows and million-mini-course dinners, sticking instead to the hard storage facts.
My highlight/snippets on Monday/Tuesday activities:
- Lots on many core - but the valid question was asked, can IO keep up with this?
- Partick told us the plans (as they are now) for Data management middleware in EMI. Storm is in the plans (though was somewhat absent from the session to provide an update on their status.)
- Oliver told us the roadmap for DPM:(immediate news is that DPM 1.8.0 is in certification including a 3rd party rfcp to allow it to be used for draining.)
- Ricardo gave a nice talk on the DPM work on NFS4.1 which has reached the stage of a prototype.
For the slides on the later talks see this session:
http://117.103.105.177/MaKaC/sessionDisplay.py?sessionId=33&slotId=0&confId=3#2010-10-19

- My talk went OK with many questions, including those (interested in) doing similar benchmarking work. Hopefully we can get some common ideas towards providing something useful for sites to test and tune.
- Unfortunately I was talking at the same time as Illija's talk on the ATLAS root improvements which among other things outline that some of the further improvements in ROOT 5.26 would not be available in the current ATLAS reprocessing due to some other bugs which, thanks to connections made during the talk, may get fixed. Also up at the same time (!
) was Philippe Canal's talk with more detail on the ROOT changes as well as CMS's experiences in implementing them (http://117.103.105.177/MaKaC/contributionDisplay.py?contribId=150&sessionId=46&confId=3)

Other news - we had a very productive meeting with the DPM team, which should see us soon getting hold of the prerelease NFS4.1 interface for testing (around the next month) and also (probably before that) we'll be testing the "3rd party rfcp" mentioned above to tune it for fastest possible drains (yeah!) . We also talked about creating a central repository to collect together any DPM related nagios probes that people are using (before consolodating / adding new ones)

Packed days (so many sessions at once that even with Sam and I covering different sessions we are still missing half the stuff) - and we are only half way through! So standby for more info, if I don't get lost in the electronics markets or washed away by a typhoon.

18 October 2010

CHEP 2010: Episode 4: A New Hope

The story so far:

The evil Empire of CERN has succeeded in paralyzing the world's data networks by distributing vast quantities of 'event data' from their Death Star in Geneva.

However, at this very moment, a band of resistance fighters are congregating on the forest ~~moon~~ island of Taiwan to lead the fight back...

Ahem. So, Wahid and I are currently in Taipei for CHEP2010. Despite the jetlag encouraging me to write paragraphs like the above, we're seeing lots of interesting things.
Tellingly, the inaugural speech was given by the Vice President of Taiwan, and he mentioned how important Science was to Taiwanese success. Unlike France, I suspect the UK would find it hard to get Nick Clegg to turn up in similar circumstances.
Back to physics, where Ian Bird, and Roger Jones sequentially told us how successful we'd all been over the year, and Craig Lee told us how awesome cloud computing will be when it is public. Just like the Grid (and Condor clusters) before it, eh?
Finally, we had a discussion of many-core scaling for LHC VOs by Sverre Jarp. This is an area of significance in data provision, and the challenge of scaling io is something we're still looking at how best to address.

Of the parallel talks I attended, the most interesting was the CERNVMFS talk - it's still impressive how well it works.
Other interesting things: talks on EMI release processes (they have QA metrics!), posters on FTS over scp, Amazon ECC for CMS (too expensive), L-Grid webportal, and the ATLAS consistency service.

More from Wahid tomorrow.

14 September 2010

Aran Fawddwy

Welcome to Wales, to Cardiff actually, but if there are (real) mountains in Cardiff then I haven't seen them yet. The environment is certainly impressive, in Cardiff's city hall. As science conferences go, we're not used to such impressive surroundings.

Currently in the data management session where we have just heard from the UKQCD collaboration. It is interesting that their data grid is the stuff that we (GridPP) run, and the stuff the other guys run - ILDG is a collaboration of five countries, and they don't all run the same thing, but as long as they interoperate. As usual, GridFTP is the workhorse moving data back and forth, but even then data volumes are such that "truckftp" is sometimes quicker.

Oh, if you're around, don't forget to stop by the GridPP/NGS stand and say hi.

31 August 2010

Taming XFS on SL5

Sites (including Liverpool) running DPM on pool nodes running SL5 with XFS file systems have been experiencing very high (up to multiple 100s Load Average and close to 100% CPU IO WAIT) load when a number of analysis jobs were accessing data simultaneously with rfcp.

The exact same hardware and file systems under SL4 had shown no excessive load, and the SL5 systems had shown no problems under system stress testing/burn-in. Also, the problem was occurring from a relatively small number of parallel transfers (about 5 or more on Liverpool's systems were enough to show an increased load compared to SL4).

Some admins have found that using ext4 at least alleviates the problem although apparently it still occurs under enough load. Migrating production servers with TBs of live data from one FS to another isn't hard but would be a drawn out process for many sites.

The fundamental problem for either FS appears to be IOPS overload on the arrays rather than sheer throughput, although why this is occurring so much under SL5 and not under SL4 is still a bit of a mystery. There may be changes in controller drivers, XFS, kernel block access, DPM access patterns or default parameters.

When faced with an IOPS overload (that's resulting well below the theoretical throughput of the array) one solution is to make each IO operation access more bits from the storage device so that you need to make fewer but larger read requests.

This leads to the actual fix (we have been doing this by default on our 3ware systems but we just assumed the Areca defaults were already optimal).

blockdev --setra 16384 /dev/$RAIDDEVICE

This sets the block device read ahead to (16384/2)kB (8MB). We have previously (on 3ware controllers) had to do this to get the full throughput from the controller. The default on our Areca 1280MLs is 128 (64kB read ahead). So when lots of parallel transfers are occurring our arrays have been thrashing spindles pulling off small 64kB chunks from each different file. These files are usually many hundreds or thousands of MB where reading MBs at a time would be much more efficient.

The mystery for us is more why the SL4 systems *don't* overload rather than why SL5 does, as the SL4 systems use the exact same default values.

Here is a ganglia plot of our pool nodes under about as much load as we can put on them at the moment. Note that previously our SL5 nodes would have LAs in the 10s or 100s under this load or less.

http://hep.ph.liv.ac.uk/~jbland/xfs-fix.html

Any time the systems go above 1LA now is when they're also having data written at a high rate. On that note we also hadn't configured our Arecas to have their block max sector size aligned with the RAID chunk size with

echo "64" > /sys/block/$RAIDDEVICE/queue/max_sectors_kb

although we don't think this had any bearing on the overloading and might not be necessary.

We expect the tweak to also work for systems running ext4 as the underlying hardware access would still be a bottle neck, just at a different level of access.

Note that this 'fix' doesn't fix the even more fundamental problem as pointed out by others that DPM doesn't rate limit connections to pool nodes. All this fix does is (hopefully) push the current limit where overload occurs above the point that our WNs can pull data.

There is also a concern that using a big read ahead may affect small random (RFIO) access although the sites can tune this parameter very quickly to get optimum access. 8MB is slightly arbitrary but 64kB is certainly too small for any sensible access I can envisage to LHC data. Most access is via full file copy (rfcp) reads at the moment.

26 August 2010

Climbing Everest

The slides should appear soon on the web page - the mountain themed programme labelled us Everest, the second highest mountain on the agenda.
Apart from the lovely fresh air and hills and outdoorsy activities, GridPP25 was also an opportunity to persuade the experiments (and not just the LHC ones) to give us feedback and discuss future directions - we'll try to collate this and follow up. We are also working on developing "services" which seem to be useful, eg checking integrity of files, or consistency between catalogues and the storage elements. And of course for us to meet face to face and catch up over a be coffee

18 August 2010

Where Dave Lives and who shares his home....

Title of this blog post is all about whee I am situated within the RALLCG2 site and where my children our as well. I apparently also want to discuss the profile of "files" across a "disk server" as my avatar likes to put it, I prefer to think of this "Storage Element" that he talks of as my home and these "disk servers" as rooms inside my home.

I am made of 1779 files. ( ~3TB if you recall) I am spread across 8/795tapes in the RAL DATATAPE store ( although the pool of tapes for real data is actually only 229 tapes. in total there are currently tapes being used by atlas, so I take up 1/1572 datasets but ~1/130 of the volume (~3TB of the ~380TB) stored on DATATAPE at RAL and correspond to ~1/130 of the files (1779 out of ~230000). In this tape world I am deliberately kept to as small subset of tapes to allow for expedient recall.

However when it comes to being on disk I want to be spread out as much as possible so as not to cause "hot disking" . However, spreading me across many rooms means that if a single room is down, then this increases the chance that I can not be fully examined. In this disk world; of my 3TB is part of the 700TB in ATLASDATADISK at RAL and is 1 in 25k datasets and 1779 files in ~1.5 Million. In this world my average filesize at ~1.7GB per file is a lot larger than the average 450MB filesize of all the other DATADISK files. (Filesize distribution is not linear but that is a discussion for another day.) I am spread across 38 out of 71 roomsa which existed in my space token when I was created. (ther are now an additional 10 rooms and this will continue to increase in the near term.).

Looking at a random DATADISK server for every file:

1in20 datasets represented on this server are log datasets and that 1 in 11 files are log files and corresponds to 1 in 10200GB of the space used in the room.
1in2.7 datasets represented on this server are AOD datasets and that 1 in 5.1 files are AOD files and corresponds to 1 in 8.25GB of the space used in the room.
1in4.5 datasets represented on this server are ESD datasets and that 1 in 3.9 files are ESD files and corresponds to 1 in 2.32GB of the space used in the room.
1in8.3 datasets represented on this server are TAG datasets and that 1 in 8.5 files are TAG files and corresponds to 1 in 3430GB of the space used in the room.
1in47 datasets represented on this server are RAW datasets and that 1 in 17 files are RAW files and corresponds to 1 in 10.8GB of the space used in the room.
1in5.4 datasets represented on this server are DESD datasets and that 1 in 5.1 files are DESD files and corresponds to 1 in3.67 GB of the space used in the room.
1in200 datasets represented on this server are HIST datasets and that 1 in 46 files are HIST files and corresponds to 1 in735 GB of the space used in the room.
1in50 datasets represented on this server are NTUP datasets and that 1 in 16 file are NTUP files and corresponds to 1 in 130GB of the space used in the room.

Similar study has been done for a MCDISK server:
1 in 4.8 datasets represented in this room are log datasets and that 1 in 2.5 files are log files and corresponds to 1 in 18 GBof the space used in the room.
1 in 3.1 datasets represented in this room are AOD datasets and that 1 in 5.7 files are AOD files and corresponds to 1 in 2.1GB of the space used in the room.
1 in 28 datasets represented in this room are ESD datasets and that 1 in 13.6 files are ESD files and corresponds to 1 in 3.2GB of the space used in the room.
1 in 4.3 datasets represented in this room are TAG datasets and that 1 in 14.6 files are TAG files and corresponds to 1 in 2000GB of the space used in the room.
1 in 560 datasets represented in this room are DAOD datasets and that 1 in 49 files are DAOD files and corresponds to 1 in 2200GB of the space used in the room.
1 in 950 datasets represented in this room are DESD datasets and that 1 in 11000 files are DESD files and corresponds to 1 in 600GB of the space used in the room.
1 in 18 datasets represented in this room are HITS datasets and that 1 in 6.3 files are HITS files and corresponds to 1 in 25GB of the space used in the room.
1 in 114 datasets represented in this room are NTUP datasets and that 1 in 71 files are NTUP files and corresponds to 1 in 46GB of the space used in the room.
1 in114 datasets represented in this room are RDO datasets and that 1 in 63 files are RDO files and corresponds to 1 in 11GB of the space used in the room.
1 in 8 datasets represented in this room are EVNT datasets and that 1 in 13 files are EVNT files and corresponds to 1 in 100GB of the space used in the room.

As a sample this MCDISK server represents 1/47 of the space used in MCDISK at RAL and ~ 1/60 of all files in MCDISK. This room was add recently so any disparity might be due this server being filled with newer rather than older files ( which would be a good sign as it shows ATLAS are increasing file size.) Average filesize on this server is 211MB per file. Discounting log files this increases to 330MB per file. ( since log files average size is 29MB)

One area my avatar is interested in is to know that if one of these rooms were lost then how many of the files that were stored in that room could be found in anothe house and how many would be permanently lost.

For the "room" in the DATADISK Space Token, there are no files that are not located in another other house. ( This will not be the case all the time but is a good sign that the ATLAS model of replication is working.)

For the "room" in the MCDISK Space Token the following is the case:
886 out 2800 datasets that are present are not complete elsewhere. Of these 886, 583 are log datasets (consisting of 21632 files.)
Including log datasets there would be potentially 36283 files in 583 of the 886 datasets with a capacity of 640GB of lost data. ( avergae file size is 18MB).
Ignoring log datasets this drops to 14651 files in 303 datasets with a capacity of 2.86 TB of lost data.
The files on this diskserver whcih are elsewhere are form 1914 datasets, consist of 18309 files, and fill a capacity of 8104GB.

14 August 2010

This quarter’s deliverable.

Apologies for the late reporting but on 13/6/10, Cyrus Irving Bhimji went live.He is a mini-Thumper weighing it at just over 10lb from day one.Here he is deeply contemplating the future of grid storage.

Since then he has been doing well - as this weeks performance data clearly illustrates.

09 August 2010

DPM 1.7.4 and the case of the changing python module.

Since we've just had our first GridPP DPM Toolkit user hit this problem, I thought the time was right to blog about it.

Between DPM 1.7.3 and DPM 1.7.4, there is one mostly-invisible change that only hits people using the api (like, for example, the GridPP DPM Toolkit). Basically, in an attempt to clean up code and make maintenance easier, the api modules have been renamed and split by the programming language that they support.
This means that older versions of the GridPP DPM Toolkit can't interface with DPM 1.7.4 and above, as they don't know the name of the module to import. The symptom of this is the "Failed to import DPM API", in the case where following the instructions provided doesn't help at all.

Luckily, we wise men at GridPP Storage already solved this problem.
http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/
contains two versions of the GridPP DPM Toolkit RPM for the latest release - one suffixed "DPM173-1" and one suffixed "DPM174-1". Fairly obviously, the difference is the version of DPM they work against.

Future releases of the Toolkit will only support the DPM 1.7.4 modules (and may come relicensed, thanks to the concerns of one of our frequent correspondants, who shall remain nameless).

08 July 2010

Be vewy vewy qwuiet...

Boy, it sure is quiet here in T1 HQ. We're only about five people (and perhaps five of the quieter ones :-). Everybody else is out at IC for the WLCG workshop.

I managed to have a chat with some humanities folks earlier this week about archives and storage, they're in London for Digital Humanities this week. The key point is to make data available: for them it's about making sense of files, interpreting the contents, for the WLCG workshop it is about making the "physical" file available to the job which will analyse it. It is almost as if humanities have solved the transfer problem and HEP the semantic one - although I suspect humanities haven't really "solved" the transfer problem, they just have something which is good enough (many of the humanities datasets I saw are tiny, less than a hundred megs, and they mail CDs to people sometimes.) And HEP haven't really "solved" the semantics problem either, there was a working group looking at curation last year. Interesting to get different perspectives - we can learn from each other. This is another reason why it's good to have shared e-infrastructures.

21 June 2010

CDMI reference implementation available

CDMI is the SNIA Cloud Data Management Interface, an implementation of DaaS (Data as a Service). SNIA have today - at the OGF29 in Chicago - announced the availability of a reference implementation, open source (BSD licence), written in java. We just saw a version for a (virtual) iPad. Source code is available after a registration.

Not uncontroversial

Very lively session for the Grid Storage Management community group.

We covered the new charter, agreed with the provision that we replace "EGEE" with something appropriate. We had a quick introduction to the protocol, an introduction which caused a lot more discussion than such introductions normally do.

Much of the time was spent discussing the WLCG data management jamboree. Which in a sense is outside the scope of the group, because the jamboree focused on data analysis, and SRM was designed for transfers and pre-staging and suchlike, completely different use cases.

Normally we have presentations from users, particularly those outside HEP, but since we had run out of time, those discussions had to be relegated to lunch or coffee breaks.

Slightly tricky with both experts and newbies in the room, giving introductions to SRM and also discussing technical issues. But this is how OGF works, and it is a Good Thing™ - it ensures that the discussions are open and exposes our work to others and let others provide input.

20 June 2010

Too good to be true?

A grid filesystem with: transparent replication and partial replication, striping, POSIX interface and semantics, checksumming. Open source - GPL - and, unlike some grid "open source projects" we can mention, you can actually download the source. As fast as ext4 for linux kernel build. Planning NFSv4 and/or WebDAV interfaces.

This is the promise of XtreemFS, the filesystem part (but independent part) of XtreemOS, an EU funded project. More on this later in our weekly meetings.

17 June 2010

Have you heard this one before...

Sunny Amsterdam. Narrow streets, canals. Friendly locals, and a bicycle with your name on in it. Wonderful place for a WLCG data management jamboree.

The super brief summary of yesterday is that some people are pushing for a more network centric data model. They point to video streaming, although others point out that video streaming is very different from HEP analysis. (More in the next couple of GridPP storage meetings.)

Today is more on technology, some known, some less so. One particular piece I would like to highlight is NFS4.1 which is still progressing and is now said to be "wonderful." :-)

There are lots of discussions which sound oddly familiar. For example, use of P2P networks have been suggested before (by Owen, back in EDG) and it's now coming up again. But of course technology moves on and middleware matures, so revisiting the questions and the proposed solutions will hopefully be useful.

Oh, and Happy to J "birthday" T.

27 May 2010

Filesystems for the Future: ext4 vs btrfs vs xfs (pt1)

One of the regular mandates on the storage group is to maintain recommendations for disk configuration for optimal performance. Filesystem choice is a key component of this, and the last time we did a filesystem performance shootout, XFS was the clear winner.

It's been a while since that test, though, so I'm embarking on another shootout, this time comparing the current XFS champion against the new filesystems that have emerged since: ext4 and btrfs.

Of course, btrfs is still "experimental" and ext4 is only present in the SL5 kernel as a "technology preview", so in the interests of pushing the crystal ball into the distant future, I performed the following tests on an install of the RHEL6 beta. This should be something like what SL6 looks like... whenever that happens.

For now, what I'm going to present are iozone performance metrics. For my next post, I'll be investigating the performance of gridftp and other transfer protocols (and hopefully via FTS).

So. As XFS was the champion last time, I generated graphs for the ratio of ext4, btrfs (with defaults) and btrfs (with compression on, and internal checksumming off, and with just internal checksumming off) to xfs performance on the same physical hardware. Values > 1 indicate performance surpassing XFS, values < 1 performance worse than XFS. Colours indicate the size of file written (from 2GB to 16GB) in KB*.

XFS is still the winner, therefore, on pure performance, except for the case of btrfs without internal btrfs checksum calculation, where btrfs regains some edge. I'm not certain how important we should consider filesystem-level per-file checksum functionality, since there is already a layer of checksum-verification present in the data management infrastructure as a whole. (However, note that turning on compression seems to totally ruin btrfs performance for files of this size - I assume that the cpu overhead is simply too great to overcome the file reading advantages.) A further caveat should be noted: these tests are necessarily against an unstable release of btrfs, and may not reflect its final performance. (Indeed, tests by IBM showed significant variation in btrfs benchmarking behaviour with version changes.)

^*Whilst data for smaller files is measured, there are more significant caching effects, so the comparison should be against fsynced writes for more accurate metrics for a loaded system with full cache. We expect to pick up such effects with later load tests against the filesystems, when time permits.

14 May 2010

And hotter:

So I forgotten about some of my children ( well more precisely they are progeny since they do not come directly from me but are my descendants.) So some had gone further round the world ans even more descendants have been produced.
I now have 671 Ursula's', 151 Dirks',162 Valery's'; they also have 46 long lost cousins I did not know from cousin Gavin ( well that's what I call him, ATLAS call him group owned datasets.

One problem I am having is that my children have now travelled miles around the world. ( I myself have now been cloned and reside in the main site in the USA.

In total, I now have children in Switzerland, UK, USA, Canada, Austria, Czech Republic, Ireland, Italy, France, Germany, Netherlands, Norway, Poland, Portugal, Romania, Russia, Spain, Sweden, Turkey, Australia, China, Japan and Taiwan.

My Avatar has been counting to calculate how much infomation has been produced by me.
If you remember , I was 1779 files and ~3.1TB in size. I now have299078 unique child files (taking a volume of 21.44TB). Taking into consideration replication, this increases to ~815k files and 88.9TB.

12 May 2010

Usage really hotting up

Whoa! I turn my back for a moment an I suddenly get analysed massively (and not in the Higgsian sense.) they say a week is a long time in politics, its seems it an eternity on the grid. My Friendly masters holding up the world have created a new tool so that I can now easily see how busy my children and I have been.

My usage is now as follows:
I now have 384 children all beginning with user* (now known as Ursula's)
These 384 children have been produced by 129 unique ATLAS users
Of these:
60 only have 1 child
17 have 2 children
12 have 3 children
16 have 4 children
6 have 5 children
7 have 6 children
3 have 7 children
2 have 8 children
3 have 9 children
1 has 14 children
1 has 16 children
1 has produced 24 children!

I now have 61 children all beginning with data* (now known as Dirk's)
I now have 27 children all beginning with valid* (now known as Valery's)

27 April 2010

Hurray! Hurray! I've been reprocessed!

Further news regarding jobs that run on me. I have now been reprocessed at RAL!!
5036 jobs. Click job number to see details.
States: running:35 holding:1 finished:3989 failed:1011
Users (6): A:24 B:828 C:446 D:1944 E:15 F:1779
Releases (3): A1:828 A2:4134 A3:74
Processing types (3): pathena 1298 reprocessing:1779 validation:1959
Job types (2): managed:3738 user:1298
Transformations (2): Reco.py:3738 runAthena:1298
Sites (11): CERN:24 IFIC:1274 CERN-RELEASE:76 RAL:3410 Brunel:32 QMUL:20 UCL:45 LANCS:59 GLASGOW:66 CAM:10 RALPPP:20

Most of these jobs within the UK are only Validation jobs (a mix of sites which fail or succeed). Really strange since the dataset location has not changed ( still only at CERN,RAL and IFIC).
Large number of reprocessing jobs have been completed at RAL as you would expect.

Derived datasets now at RAL are multiplying like rabbits; including *sub* datasets there are 197 children of Dave. Ignoring the subs, there are 65 parents. Into total there are 2444 files associated with Dave!

These look like:
data10_7TeV..physics_MinBias.merge.AOD.*
data10_7TeV..physics_MinBias.merge.DESD_MBIAS.*
data10_7TeV..physics_MinBias.merge.DESDM_EGAMMA.*
data10_7TeV..physics_MinBias.merge.DESD_MET.*
data10_7TeV..physics_MinBias.merge.DESD_PHOJET.*
data10_7TeV..physics_MinBias.merge.DESD_SGLEL.*
data10_7TeV..physics_MinBias.merge.DESD_SGLMU.*
data10_7TeV..physics_MinBias.merge.ESD.*
data10_7TeV..physics_MinBias.merge.HIST.*
data10_7TeV..physics_MinBias.merge.log.*
data10_7TeV..physics_MinBias.merge.NTUP_MUONCALIB.*
data10_7TeV..physics_MinBias.merge.NTUP_TRKVALID.*
data10_7TeV..physics_MinBias.merge.TAG.*
data10_7TeV..physics_MinBias.merge.TAG_COMM.*
data10_7TeV..physics_MinBias.recon.ESD.*
data10_7TeV..physics_MinBias.recon.HIST.*
data10_7TeV..physics_MinBias.recon.log.*
data10_7TeV..physics_MinBias.recon.NTUP_TRKVALID.*
data10_7TeV..physics_MinBias.recon.TAG_COMM.*
valid1..physics_MinBias.recon.AOD.*
valid1..physics_MinBias.recon.DESD_MBIAS.*
valid1..physics_MinBias.recon.DESDM_EGAMMA.*
valid1..physics_MinBias.recon.DESD_MET.*
valid1..physics_MinBias.recon.DESD_SGLMU.*
valid1..physics_MinBias.recon.ESD.*
valid1..physics_MinBias.recon.HIST.*
valid1..physics_MinBias.recon.log.*
valid1..physics_MinBias.recon.NTUP_MUONCALIB.*
valid1..physics_MinBias.recon.NTUP_TRIG.*
valid1..physics_MinBias.recon.NTUP_TRKVALID.*
valid1..physics_MinBias.recon.TAG_COMM.*

I also had children copied to LOCALGROUPDISK at
UKI-NORTHGRID-LANCS-HEP_LOCALGROUPDISK
and
UKI-LT2-QMUL_LOCALGROUPDISK

Plus 17 Users have put betweeen 1-3 datsets each (totaling 24 datasests) into SCRATCHDISK space tokens across 6 T2
sites within the UK ( number of SCRATCHDISK datsets at these six sites are 1,1,4,6,10.

16 April 2010

Been on my holidays and plans for the future.

Been cloned to spanish LOCALGROUPDISK.
6120 jobs. Click job number to see details.
States: finished:504 failed:5616
Users (3): A:1890 B:1114 C:3116
Releases (3): 1:47 2:1114 3:4959
Processing types (2): ganga:2670 pathena:3450
Job types (1): user:6120
Transformations (1): 1:6120
Sites (3): ANALY_CERN:1823 ANALY_FZK:67 ANALY_IFIC:4230

( Not sure how the jobs in Germany worked since according to dq2-ls I am only at RAL and IFIC. Also want to find out from where I was copied from when copied to IFIC; Ie was it direct from CERN or RAL; or did I go via PIC. IF I did go via PIC, how long was I there before being deleted? )

I expect to be reprocessed soon so it will be intersting to see how I spread and to see older versions of my children get deleted.

07 April 2010

A Busy week for Dave the Dataset

I am only eight days old and already I am prolific.
I now have 51 descendant datasets.
Only some of these have been copied to RAL:
Those are
data10_7TeV.X.physics_MinBias.merge.AOD.f235_m427
data10_7TeV.X.physics_MinBias.merge.AOD.f236_m427
data10_7TeV.X.physics_MinBias.merge.AOD.f239_m427
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f239_m429
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f239_m429
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f239_m429
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f239_m429
data10_7TeV.X.physics_MinBias.merge.RAW
data10_7TeV.X.physics_MinBias.merge.TAG_COMM.f235_m426
data10_7TeV.X.physics_MinBias.merge.TAG_COMM.f236_m426
data10_7TeV.X.physics_MinBias.merge.TAG.f235_m427
data10_7TeV.X.physics_MinBias.merge.TAG.f236_m427
data10_7TeV.X.physics_MinBias.merge.TAG.f239_m427

As you can see this is a wild range of file types.
Volumes contained in each dataset in terms of size and number of events varies greatly. of the ~65k events in my RAW form, only 1or 2 events have survived into some child datasets.

Of the 10 T2s currently associated with RAL data distribution of real data for ATLAS , some of my children have gone to 8 of them. Those children which have been distributed are being cloned into two copies and sent to different T2s following the general ATLAS model and the Shares decided for the UK.
A break in the initial data model is expected and ESD will be sent to T2s. Let us see how long it takes for this to happen.....

I and my children have also being analysed by jobs on WNs in various countries and by multiple users.
For the three AOD datasets:

The first incarnation was analysed by:
2 Users spread over
2 sites over
2 releases of which
0/11 were analyzed at UK sites all of which were analyzed by
pathena

The second incarnation was analysed by:
29 Users spread over
20 sites over
8 releases of which
94/1032 jobs were analyzed at ANALY_RALPP and ANALY_SHEF all using
pathena

The 3rd incarnation has so far been analysed by:
10 Users spread over
11 sites over
3 releases of which
9/184 jobs were analyzed at ANALY_OX using both
pathena and ganga

31 March 2010

The Fall and Rise of Dave the Dataset

Hello, my full name is data10_7TeV.%$%$%$%$%.RAW but you can call me Dave. I am a dataset within ATLAS. Here I will be blogging my history and that of all the dataset replicas and children datasets that the physicists produce from me.

I came about from data taking at the LHC on the 30th March 2010 from the ATLAS detector.
I initially have 1779 files containing 675757 events. I was born a good 3.13TB
By the end of my first day I have already been copied so that I exist in two copies on disk and two sets of tape. This should result on my continual survival so as to avoid loss.
So I am now secure in my own existance; lets see if any one care to read me or move to different sites.

30 March 2010

Analysing a node chock full of analysis.

As Wahid's previous post notes, we've been doing some testing and benchmarking of the performance of data access under various hardware and data constraints (particularly: SSDs vs HDDs for local storage, "reordered" AODs vs "unordered" AODs, and there are more dimensions to be added).

Although this is a little preliminary, I took some blktrace traces of the activity on a node with an SSD (an Intel X25 G2) mounted on /tmp, and a node with a standard partition of the system HDD as /tmp, whilst they coped with being filled full of HammerCloud-delivered muon analysis jobs. Each trace was a little over an hour of activity, starting with each HammerCloud test's start time.

Using seekwatcher, you can get a quick summary plot of the activity of the filesystem during the trace.

In the following plots, node300 is the one with the HDD, and node305 is the one with the SDD.

Firstly, under stress from analysis of the old AODs, not reordered:

Node 300 (above)

Node 305 (above)

As you can see, the seek rates for the HDD node hit the maximum expected seeks per second for a 7200 rpm device (around 120 seeks per second), whilst the seeks on the SSD peak at around 2 to 2.5 times that. The HDD's seek rate is a significant limit on the efficiency of jobs under this kind of load.

Now, for the same analysis, against reordered AODs. Again, node300 first, then node305.

Notice that the seek rate for both the SSD and the HDD peak below 120 seeks per second, and the sustained seek rate for both of them is around half that. (This is with both nodes loaded completely with analysis work).
So, reordering your datasets definitely improves their performance with regard to seek ordering...

26 March 2010

Testing times

Data analysis at grid sites is hard on poor disk servers. This is part because of the "random" access pattern seen on accessing jobs. Recently LHC experiments have been "reordering" their files to match more the way they might be expected to be accessed.
Initially the access pattern on these new files looks more promising as these plots showed.
But those tests read the data in the new order so are sure to see improvements. Also, as the plots hint at, any improvement is very dependent on access method, file size, network config and a host of other factors.

So recently we have been trying accessing these datasets with real ATLAS analysis type jobs at Glasgow. Initial indications look like the improvement will not be quite as much as hoped but tests are ongoing so we'll report back.

04 March 2010

Checksumming and Integrity: The Challenge

One key focus of the Storage group as whole at the moment is the thorny issue of data integrity and consistency across the Grid. This turns out to be a somewhat complicated, multifaceted problem (the full breakdown is on the wiki here), and one which already has fractions of it solved by some of the VOs.

ATLAS, for example, has some scripts managed by Cedric Serfon which do the checking of data catalogue consistency correctly, between ATLAS's DDM system, the LFC and the local site SE. They don't, however, do file checksum checks, and therefore there is potential for files to be correctly placed, but corrupt (although this would be detected by ATLAS jobs when they run against the file, since they do perform checksums on transferred files before using them).

The Storage group has an integrity checker which does checksum and catalogue consistency checks between LFC and the local SE (in fact, it can be run remotely against any DPM), but it's much slower than the ATLAS code (mainly because of the checksums).

Currently, the plan is to split effort between improving VO specific scripts (adding checksums), and enhancing our own script - one issue of key importance is that the big VOs will always be able to write specific scripts for their own data management infrastructures than we will, but the small VOs deserve help too (perhaps more so than the big ones), and all these tools need to be interoperable. One aspect of this that we'll be talking about a little more in a future blog post is standardisation of input and output formats - we're planning on standardising on SynCat, or a slightly-derived version of SynCat, as a dump and input specifier format.

This post exists primarily as an informational post, to let people know what's going on. More detail will follow in later blog entries. If anyone wants to volunteer their SE to be checked, however, we're always interested...

01 March 2010

A Phew Good Files

The storage support guys finished integrity checking of 5K ATLAS files held at Lancaster and found no bad files.

This, of course, is a Good Thing™.

The next step is to check more files, and to figure out how implementations cache checksums. Er, the next two steps are to check more files and document handling checksums, and do it for more experiments. Errr, the next three steps are to check more files, document checksum handling, add more experiments, and integrate toolkits more with experiments and data management tools.

There have been some reports of corrupted files but corruptions can happen for more than one reason, and the problem is not always at the site. The Storage Inquisition investigation is ongoing.