14 August 2010

This quarter’s deliverable.


Apologies for the late reporting, but on 13/6/10 Cyrus Irving Bhimji went live. He is a mini-Thumper, weighing in at just over 10lb from day one. Here he is deeply contemplating the future of grid storage.

Since then he has been doing well - as this week's performance data clearly illustrates.

09 August 2010

DPM 1.7.4 and the case of the changing python module.

Since we've just had our first GridPP DPM Toolkit user hit this problem, I thought the time was right to blog about it.

Between DPM 1.7.3 and DPM 1.7.4 there is one mostly-invisible change that only hits people using the API (like, for example, the GridPP DPM Toolkit). Basically, in an attempt to clean up code and make maintenance easier, the API modules have been renamed and split by the programming language that they support.
This means that older versions of the GridPP DPM Toolkit can't interface with DPM 1.7.4 and above, as they don't know the name of the module to import. The symptom of this is the "Failed to import DPM API" error, in a case where following the instructions provided doesn't help at all.
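
If you want to check which module the python API on a given machine actually provides before pointing the Toolkit at it, a quick test along these lines does the job (a minimal sketch - DPM_API_MODULE is a placeholder, since the module name is exactly what changed between 1.7.3 and 1.7.4, so substitute whatever your DPM release actually ships):

# Replace DPM_API_MODULE with the python module name your DPM release provides.
python -c "import DPM_API_MODULE" 2>/dev/null \
  && echo "DPM python API module found" \
  || echo "Failed to import DPM API"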

Luckily, we wise men at GridPP Storage already solved this problem.
http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/
contains two versions of the GridPP DPM Toolkit RPM for the latest release - one suffixed "DPM173-1" and one suffixed "DPM174-1". Fairly obviously, the difference is the version of DPM they work against.

Future releases of the Toolkit will only support the DPM 1.7.4 modules (and may come relicensed, thanks to the concerns of one of our frequent correspondents, who shall remain nameless).

08 July 2010

Be vewy vewy qwuiet...

Boy, it sure is quiet here in T1 HQ. We're only about five people (and perhaps five of the quieter ones :-). Everybody else is out at IC for the WLCG workshop.

I managed to have a chat with some humanities folks earlier this week about archives and storage; they're in London for Digital Humanities this week. The key point is to make data available: for them it's about making sense of files and interpreting the contents, while for the WLCG workshop it is about making the "physical" file available to the job which will analyse it. It is almost as if the humanities have solved the transfer problem and HEP the semantic one - although I suspect the humanities haven't really "solved" the transfer problem, they just have something which is good enough (many of the humanities datasets I saw are tiny, less than a hundred megs, and they sometimes mail CDs to people). And HEP haven't really "solved" the semantics problem either; there was a working group looking at curation last year. It is interesting to get different perspectives - we can learn from each other. This is another reason why it's good to have shared e-infrastructures.

21 June 2010

CDMI reference implementation available

CDMI is the SNIA Cloud Data Management Interface, a standard interface for DaaS (Data as a Service). SNIA have today - at OGF29 in Chicago - announced the availability of a reference implementation: open source (BSD licence), written in Java. We just saw a version for a (virtual) iPad. Source code is available after registration.

Not uncontroversial

Very lively session for the Grid Storage Management community group.

We covered the new charter, which was agreed with the proviso that we replace "EGEE" with something appropriate. We had a quick introduction to the protocol - an introduction which caused a lot more discussion than such introductions normally do.

Much of the time was spent discussing the WLCG data management jamboree, which in a sense is outside the scope of the group: the jamboree focused on data analysis, whereas SRM was designed for transfers, pre-staging and suchlike - completely different use cases.

Normally we have presentations from users, particularly those outside HEP, but since we had run out of time, those discussions had to be relegated to lunch or coffee breaks.

Slightly tricky with both experts and newbies in the room, giving introductions to SRM and also discussing technical issues. But this is how OGF works, and it is a Good Thing™ - it ensures that the discussions are open, exposes our work to others and lets others provide input.

20 June 2010

Too good to be true?

A grid filesystem with: transparent replication and partial replication, striping, a POSIX interface and semantics, and checksumming. Open source - GPL - and, unlike some grid "open source projects" we can mention, you can actually download the source. As fast as ext4 for a Linux kernel build. NFSv4 and/or WebDAV interfaces are planned.

This is the promise of XtreemFS, the filesystem part (and an independent part) of XtreemOS, an EU-funded project. More on this later in our weekly meetings.

17 June 2010

Have you heard this one before...

Sunny Amsterdam. Narrow streets, canals. Friendly locals, and a bicycle with your name on it. A wonderful place for a WLCG data management jamboree.

The super brief summary of yesterday is that some people are pushing for a more network centric data model. They point to video streaming, although others point out that video streaming is very different from HEP analysis. (More in the next couple of GridPP storage meetings.)

Today is more on technology, some known, some less so. One particular piece I would like to highlight is NFS 4.1, which is still progressing and is now said to be "wonderful." :-)

There are lots of discussions which sound oddly familiar. For example, the use of P2P networks has been suggested before (by Owen, back in EDG) and it's now coming up again. But of course technology moves on and middleware matures, so revisiting the questions and the proposed solutions will hopefully be useful.

Oh, and Happy to J "birthday" T.

27 May 2010

Filesystems for the Future: ext4 vs btrfs vs xfs (pt1)

One of the regular mandates on the storage group is to maintain recommendations for disk configuration for optimal performance. Filesystem choice is a key component of this, and the last time we did a filesystem performance shootout, XFS was the clear winner.

It's been a while since that test, though, so I'm embarking on another shootout, this time comparing the current XFS champion against the new filesystems that have emerged since: ext4 and btrfs.
Of course, btrfs is still "experimental" and ext4 is only present in the SL5 kernel as a "technology preview", so in the interests of pushing the crystal ball into the distant future, I performed the following tests on an install of the RHEL6 beta. This should be something like what SL6 looks like... whenever that happens.

For now, what I'm going to present are iozone performance metrics. For my next post, I'll be investigating the performance of gridftp and other transfer protocols (and hopefully via FTS).
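
For the record, the throughput numbers feeding the graphs below come from iozone runs of roughly this shape (a sketch only - the test selection, file size, record size and target path here are illustrative assumptions, not the exact specification used):

# Write/rewrite, read/reread and random read/write on the filesystem under test,
# with fsync included in the timings (-e) and results dumped in spreadsheet form.
iozone -e -i 0 -i 1 -i 2 -s 2g -r 1024k -f /mnt/testfs/iozone.tmp -Rb /tmp/iozone-results.xls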

So. As XFS was the champion last time, I generated graphs of the ratio to XFS performance, on the same physical hardware, for ext4, btrfs with defaults, btrfs with compression on and internal checksumming off, and btrfs with just internal checksumming off. Values > 1 indicate performance surpassing XFS, values < 1 performance worse than XFS. Colours indicate the size of file written (from 2GB to 16GB, labelled in KB*).

[Graphs: iozone throughput for ext4 and the btrfs variants, plotted as a ratio to XFS, coloured by file size.]

XFS is still the winner, therefore, on pure performance, except for the case of btrfs without internal btrfs checksum calculation, where btrfs regains some edge. I'm not certain how important we should consider filesystem-level per-file checksum functionality, since there is already a layer of checksum-verification present in the data management infrastructure as a whole. (However, note that turning on compression seems to totally ruin btrfs performance for files of this size - I assume that the cpu overhead is simply too great to overcome the file reading advantages.) A further caveat should be noted: these tests are necessarily against an unstable release of btrfs, and may not reflect its final performance. (Indeed, tests by IBM showed significant variation in btrfs benchmarking behaviour with version changes.)


*Whilst data for smaller files was also measured, caching effects are more significant there, so for more accurate metrics for a loaded system with a full cache the comparison should be against fsynced writes. We expect to pick up such effects with later load tests against the filesystems, when time permits.

14 May 2010

And hotter:

So I had forgotten about some of my children (well, more precisely they are progeny, since they do not come directly from me but are my descendants). Some have gone further round the world and even more descendants have been produced.
I now have 671 Ursulas, 151 Dirks and 162 Valerys; they also have 46 long-lost cousins I did not know about, from cousin Gavin (well, that's what I call him; ATLAS calls him group-owned datasets).

One problem I am having is that my children have now travelled miles around the world. (I myself have now been cloned and reside at the main site in the USA.)

In total, I now have children in Switzerland, UK, USA, Canada, Austria, Czech Republic, Ireland, Italy, France, Germany, Netherlands, Norway, Poland, Portugal, Romania, Russia, Spain, Sweden, Turkey, Australia, China, Japan and Taiwan.

My Avatar has been counting to calculate how much information has been produced by me.
If you remember, I was 1779 files and ~3.1TB in size. I now have 299078 unique child files (taking up a volume of 21.44TB). Taking replication into consideration, this increases to ~815k files and 88.9TB.

12 May 2010

Usage really hotting up

Whoa! I turn my back for a moment and I suddenly get analysed massively (and not in the Higgsian sense). They say a week is a long time in politics; it seems it's an eternity on the grid. My friendly masters holding up the world have created a new tool so that I can now easily see how busy my children and I have been.

My usage is now as follows:
I now have 384 children all beginning with user* (now known as Ursulas)
These 384 children have been produced by 129 unique ATLAS users
Of these:
60 only have 1 child
17 have 2 children
12 have 3 children
16 have 4 children
6 have 5 children
7 have 6 children
3 have 7 children
2 have 8 children
3 have 9 children
1 has 14 children
1 has 16 children
1 has produced 24 children!

I now have 61 children all beginning with data* (now known as Dirks)
I now have 27 children all beginning with valid* (now known as Valerys)

27 April 2010

Hurray! Hurray! I've been reprocessed!

Further news regarding jobs that run on me. I have now been reprocessed at RAL!!
5036 jobs.
States: running:35 holding:1 finished:3989 failed:1011
Users (6): A:24 B:828 C:446 D:1944 E:15 F:1779
Releases (3): A1:828 A2:4134 A3:74
Processing types (3): pathena:1298 reprocessing:1779 validation:1959
Job types (2): managed:3738 user:1298
Transformations (2): Reco.py:3738 runAthena:1298
Sites (11): CERN:24 IFIC:1274 CERN-RELEASE:76 RAL:3410 Brunel:32 QMUL:20 UCL:45 LANCS:59 GLASGOW:66 CAM:10 RALPPP:20

Most of these jobs within the UK are only validation jobs (a mix of sites which fail or succeed). This is really strange, since the dataset location has not changed (still only at CERN, RAL and IFIC).
A large number of reprocessing jobs have been completed at RAL, as you would expect.

Derived datasets now at RAL are multiplying like rabbits; including *sub* datasets there are 197 children of Dave. Ignoring the subs, there are 65 parents. In total there are 2444 files associated with Dave!

These look like:
data10_7TeV..physics_MinBias.merge.AOD.*
data10_7TeV..physics_MinBias.merge.DESD_MBIAS.*
data10_7TeV..physics_MinBias.merge.DESDM_EGAMMA.*
data10_7TeV..physics_MinBias.merge.DESD_MET.*
data10_7TeV..physics_MinBias.merge.DESD_PHOJET.*
data10_7TeV..physics_MinBias.merge.DESD_SGLEL.*
data10_7TeV..physics_MinBias.merge.DESD_SGLMU.*
data10_7TeV..physics_MinBias.merge.ESD.*
data10_7TeV..physics_MinBias.merge.HIST.*
data10_7TeV..physics_MinBias.merge.log.*
data10_7TeV..physics_MinBias.merge.NTUP_MUONCALIB.*
data10_7TeV..physics_MinBias.merge.NTUP_TRKVALID.*
data10_7TeV..physics_MinBias.merge.TAG.*
data10_7TeV..physics_MinBias.merge.TAG_COMM.*
data10_7TeV..physics_MinBias.recon.ESD.*
data10_7TeV..physics_MinBias.recon.HIST.*
data10_7TeV..physics_MinBias.recon.log.*
data10_7TeV..physics_MinBias.recon.NTUP_TRKVALID.*
data10_7TeV..physics_MinBias.recon.TAG_COMM.*
valid1..physics_MinBias.recon.AOD.*
valid1..physics_MinBias.recon.DESD_MBIAS.*
valid1..physics_MinBias.recon.DESDM_EGAMMA.*
valid1..physics_MinBias.recon.DESD_MET.*
valid1..physics_MinBias.recon.DESD_SGLMU.*
valid1..physics_MinBias.recon.ESD.*
valid1..physics_MinBias.recon.HIST.*
valid1..physics_MinBias.recon.log.*
valid1..physics_MinBias.recon.NTUP_MUONCALIB.*
valid1..physics_MinBias.recon.NTUP_TRIG.*
valid1..physics_MinBias.recon.NTUP_TRKVALID.*
valid1..physics_MinBias.recon.TAG_COMM.*

I also had children copied to LOCALGROUPDISK at
UKI-NORTHGRID-LANCS-HEP_LOCALGROUPDISK
and
UKI-LT2-QMUL_LOCALGROUPDISK


Plus, 17 users have put between 1 and 3 datasets each (totalling 24 datasets) into SCRATCHDISK space tokens across 6 T2 sites within the UK (the number of SCRATCHDISK datasets at these six sites being 1, 1, 4, 6, 10).

16 April 2010

Been on my holidays and plans for the future.

Been cloned to Spanish LOCALGROUPDISK.
6120 jobs.
States: finished:504 failed:5616
Users (3): A:1890 B:1114 C:3116
Releases (3): 1:47 2:1114 3:4959
Processing types (2): ganga:2670 pathena:3450
Job types (1): user:6120
Transformations (1): 1:6120
Sites (3): ANALY_CERN:1823 ANALY_FZK:67 ANALY_IFIC:4230

(Not sure how the jobs in Germany worked, since according to dq2-ls I am only at RAL and IFIC. I also want to find out where I was copied from when I was copied to IFIC; i.e. was it direct from CERN or RAL, or did I go via PIC? If I did go via PIC, how long was I there before being deleted?)

I expect to be reprocessed soon, so it will be interesting to see how I spread and to see older versions of my children get deleted.

07 April 2010

A Busy week for Dave the Dataset

I am only eight days old and already I am prolific.
I now have 51 descendant datasets.
Only some of these have been copied to RAL:
Those are
data10_7TeV.X.physics_MinBias.merge.AOD.f235_m427
data10_7TeV.X.physics_MinBias.merge.AOD.f236_m427
data10_7TeV.X.physics_MinBias.merge.AOD.f239_m427
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f239_m429
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f239_m429
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f239_m429
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f239_m429
data10_7TeV.X.physics_MinBias.merge.RAW
data10_7TeV.X.physics_MinBias.merge.TAG_COMM.f235_m426
data10_7TeV.X.physics_MinBias.merge.TAG_COMM.f236_m426
data10_7TeV.X.physics_MinBias.merge.TAG.f235_m427
data10_7TeV.X.physics_MinBias.merge.TAG.f236_m427
data10_7TeV.X.physics_MinBias.merge.TAG.f239_m427

As you can see, this is a wide range of file types.
The volume contained in each dataset, in terms of both size and number of events, varies greatly. Of the ~65k events in my RAW form, only 1 or 2 events have survived into some child datasets.


Of the 10 T2s currently associated with RAL for the distribution of real ATLAS data, some of my children have gone to 8 of them. Those children which have been distributed are cloned into two copies and sent to different T2s, following the general ATLAS model and the shares decided for the UK.
A break in the initial data model is expected and ESD will be sent to T2s. Let us see how long it takes for this to happen.....

My children and I have also been analysed by jobs on WNs in various countries and by multiple users.
For the three AOD datasets:

The first incarnation was analysed by:
2 Users spread over
2 sites over
2 releases of which
0/11 were analyzed at UK sites all of which were analyzed by
pathena

The second incarnation was analysed by:
29 Users spread over
20 sites over
8 releases of which
94/1032 jobs were analyzed at ANALY_RALPP and ANALY_SHEF all using
pathena

The 3rd incarnation has so far been analysed by:
10 Users spread over
11 sites over
3 releases of which
9/184 jobs were analyzed at ANALY_OX using both
pathena and ganga

31 March 2010

The Fall and Rise of Dave the Dataset

Hello, my full name is data10_7TeV.%$%$%$%$%.RAW but you can call me Dave. I am a dataset within ATLAS. Here I will be blogging my history and that of all the dataset replicas and children datasets that the physicists produce from me.

I came about from data taking with the ATLAS detector at the LHC on the 30th March 2010.
I initially have 1779 files containing 675757 events. I was born a good 3.13TB in size.
By the end of my first day I have already been copied, so that I exist in two copies on disk and two sets on tape. This should ensure my continual survival and avoid loss.
So I am now secure in my own existence; let's see if anyone cares to read me or move me to different sites.

30 March 2010

Analysing a node chock full of analysis.

As Wahid's previous post notes, we've been doing some testing and benchmarking of the performance of data access under various hardware and data constraints (particularly: SSDs vs HDDs for local storage, "reordered" AODs vs "unordered" AODs, and there are more dimensions to be added).
Although this is a little preliminary, I took some blktrace traces of the activity on a node with an SSD (an Intel X25 G2) mounted on /tmp, and a node with a standard partition of the system HDD as /tmp, whilst they coped with being filled full of HammerCloud-delivered muon analysis jobs. Each trace was a little over an hour of activity, starting with each HammerCloud test's start time.
Using seekwatcher, you can get a quick summary plot of the activity of the filesystem during the trace.
In the following plots, node300 is the one with the HDD, and node305 is the one with the SSD.
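
For anyone wanting to reproduce this sort of plot, the recipe is roughly as follows (a sketch, assuming the /tmp disk under test is /dev/sdb and that blktrace and seekwatcher are installed - adjust the device and filenames to suit):

# Record block-layer activity on the disk for an hour of HammerCloud load...
blktrace -d /dev/sdb -o hammercloud-trace -w 3600
# ...then turn the trace into a summary plot of throughput and seeks per second.
seekwatcher -t hammercloud-trace -o hammercloud-trace.png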

Firstly, under stress from analysis of the old AODs, not reordered:

[seekwatcher plot for node300 (HDD)]
[seekwatcher plot for node305 (SSD)]

As you can see, the seek rates for the HDD node hit the maximum expected seeks per second for a 7200 rpm device (around 120 seeks per second), whilst the seeks on the SSD peak at around 2 to 2.5 times that. The HDD's seek rate is a significant limit on the efficiency of jobs under this kind of load.

Now, for the same analysis, against reordered AODs. Again, node300 first, then node305.

[seekwatcher plot for node300 (HDD)]
[seekwatcher plot for node305 (SSD)]

Notice that the seek rates for both the SSD and the HDD peak below 120 seeks per second, and the sustained seek rate for both of them is around half that. (This is with both nodes loaded completely with analysis work.)
So, reordering your datasets definitely improves their performance with regard to seek ordering...

26 March 2010

Testing times

Data analysis at grid sites is hard on poor disk servers. This is in part because of the "random" access pattern produced by the accessing jobs. Recently the LHC experiments have been "reordering" their files to better match the way they are expected to be accessed.
Initially the access pattern on these new files looks more promising, as these plots showed.
But those tests read the data in the new order, so they are bound to see improvements. Also, as the plots hint, any improvement is very dependent on access method, file size, network config and a host of other factors.

So recently we have been trying out access to these datasets with real ATLAS analysis-type jobs at Glasgow. Initial indications are that the improvement will not be quite as much as hoped, but tests are ongoing, so we'll report back.

04 March 2010

Checksumming and Integrity: The Challenge

One key focus of the Storage group as a whole at the moment is the thorny issue of data integrity and consistency across the Grid. This turns out to be a somewhat complicated, multifaceted problem (the full breakdown is on the wiki here), and one which already has fractions of it solved by some of the VOs.
ATLAS, for example, has some scripts managed by Cedric Serfon which correctly check data catalogue consistency between ATLAS's DDM system, the LFC and the local site SE. They don't, however, do file checksum checks, so there is potential for files to be correctly placed but corrupt (although this would be detected by ATLAS jobs when they run against the file, since they do checksum transferred files before using them).
The Storage group has an integrity checker which does checksum and catalogue consistency checks between LFC and the local SE (in fact, it can be run remotely against any DPM), but it's much slower than the ATLAS code (mainly because of the checksums).
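
At its core, the per-file part of any such checker boils down to something like this (a toy sketch: the path and catalogue value are made up, and real tools would pull both from the SE and the LFC/DPNS and use whichever checksum type the catalogue stores - typically adler32 rather than md5):

# Recompute the checksum of a replica on the disk server and compare it with
# the value recorded in the catalogue (both values here are illustrative).
CATALOGUE_SUM="d41d8cd98f00b204e9800998ecf8427e"
LOCAL_SUM=$(md5sum /gridstorage01/atlas/some/replica | awk '{print $1}')
[ "$LOCAL_SUM" = "$CATALOGUE_SUM" ] && echo "OK" || echo "MISMATCH: possible corruption"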

Currently, the plan is to split effort between improving VO-specific scripts (adding checksums) and enhancing our own script. One issue of key importance is that the big VOs will always be better placed than we are to write scripts specific to their own data management infrastructures, but the small VOs deserve help too (perhaps more so than the big ones), and all these tools need to be interoperable. One aspect of this that we'll be talking about a little more in a future blog post is standardisation of input and output formats - we're planning on standardising on SynCat, or a slightly-derived version of SynCat, as a dump and input specifier format.

This post exists primarily as an informational post, to let people know what's going on. More detail will follow in later blog entries. If anyone wants to volunteer their SE to be checked, however, we're always interested...

01 March 2010

A Phew Good Files

The storage support guys finished integrity checking of 5K ATLAS files held at Lancaster and found no bad files.


This, of course, is a Good Thing™.


The next step is to check more files, and to figure out how implementations cache checksums. Er, the next two steps are to check more files and document handling checksums, and do it for more experiments. Errr, the next three steps are to check more files, document checksum handling, add more experiments, and integrate toolkits more with experiments and data management tools.


There have been some reports of corrupted files but corruptions can happen for more than one reason, and the problem is not always at the site. The Storage Inquisition investigation is ongoing.

22 December 2009

T2 storage Ready for ATLAS Data Taking.. Or are we??

Been a busy couple of months really, what with helping the Tier2 sites prepare their storage for data taking.... The good news is the sites have done really well.
Of the three largest LHC VOs, most work has been done with ATLAS (since they have the hungriest need for space and the most complex site administration of Tier2 space).

All sites now have the space tokens for ATLAS that they require.
http://www.hep.lancs.ac.uk/~love/ukdata/

The ATLAS people have also been able to see what space is available to them and adjust their usage accordingly.
http://atlddm02.cern.ch/dq2/accounting/cloud_view/UKSITES/30/
http://atladcops.cern.ch:8000/drmon/crmon_TiersInfo.html

Almost all sites had their SE/SRMs either upgraded or in the process of upgrade/decommissioning ready for data taking in '09, and all should be ready for '10.
Sites were very good at making the changes needed by ATLAS's evolving requirements for space token distribution.
Sites have also been really good in working with ATLAS, via the ATLAS "HammerCloud" tests, to improve their storage.
Some issues still remain (draining on DPM, limiting gridFTP connections, the lost-disk-server process, data management by the VOs, etc.) but these challenges/opportunities will make our lives "interesting" over the coming months..

So that covers some of the known knowns.

The known unknowns (how user analysis of real data affects T2 storage) are also going to come about over the next few months, but I feel the GridPP Storage team, the ATLAS-UK support team and the site admins are all ready to face whatever the LHC community throws at us.

Unknown unknowns: we will deal with them when they come at us....

09 December 2009

When it's a pain to drain

Some experiences rejigging filesystems at ECDF today. I'm not sure I'm recommending this approach, but some of it may be useful as a dpm-drain alternative in certain circumstances.

The problem was that some data had been copied in with a limited lifetime but was in fact not OK to delete. Using dpm-drain would delete those files, so instead I marked the filesystem RDONLY and then did:

dpm-list-disk --server=pool1.glite.ecdf.ed.ac.uk --fs=/gridstorage010 > Stor10Files

I edited this file to replace "Replica:" with dpm-replicate (and deleted the number at the end). (Warning: if these files are in a space token you should also specify the space token in this command.)
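
The editing can be scripted, of course; a rough sketch, assuming each relevant line of the listing looks like "Replica: <sfn> <number>" - check your own dpm-list-disk output before trusting this:

# Turn the listing into a script of dpm-replicate commands, dropping the trailing number.
grep '^Replica:' Stor10Files | awk '{print "dpm-replicate", $2}' > replicate.sh
sh replicate.sh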

Unfortunately I had to abort this part-way through, which left me in a bit of a pickle, not knowing which files had been duplicated and could be deleted.
While you could probably figure out a way of doing this using dpm-disk-to-dpns and dpm-dpns-to-disk, I instead opted for the database query:

select GROUP_CONCAT(cns_db.Cns_file_replica.sfn), cns_db.Cns_file_replica.setname, count(*) from cns_db.Cns_file_replica where cns_db.Cns_file_replica.sfn LIKE '%gridstorage%' group by cns_db.Cns_file_replica.fileid INTO OUTFILE '/tmp/Stor10Query2.txt';

This gave me a list of physical file names and the number of copies (and the space token), which I could grep for a list of those with more than one copy:
grep "," /tmp/Stor10Query2.txt | cut -d ',' -f 1 > filestodelete

I could then edit this filestodelete file to add dpm-delreplica to each line and source it to delete the files. I also made a new list of files to replicate in the same way as above. Finally, I repeated the query to check that all the files had 2 replicas before deleting all the originals.
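
That hand-editing can also be scripted (same caveat as above about checking the file's format first):

# Prefix each surplus physical file name with dpm-delreplica and run the result.
sed 's/^/dpm-delreplica /' filestodelete > delreplicas.sh
sh delreplicas.sh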

Obviously this is a bit of a palaver and not the ideal approach, for many reasons: there is no check that the replicas are identical, and the replicas made are still volatile, so I'll probably just encounter the same problem again down the line. But if you really can't use dpm-drain for some reason, there is at least an alternative.