GridPP storage news

08 December 2010

So who's winning the transfer rate race for the Tier2 sites in the UK?

Well, it's not really a race, but I thought it would be interesting to see the transfer rates between each Tier2 and RAL.
So the clear winner is (and is probably always going to be) RALPPD. This should seem obvious since they are co-located, so various factors (a small rtt and a large-bandwidth link, for example) lead to high rates. This leads to the rate plot shown below:

Here you can see the rate has reached over 250MB/s.
A close second, I thought, was going to be Glasgow (they have a longer rtt and a lower-bandwidth pipe, so you would expect them to fare worse):

This shows that they have reached 140MB/s.
Sometimes it is unfair to compare these numbers, since the number of concurrent transfers on the FTS channels varies. Glasgow actually has more concurrent transfers (~140MB/s with 42 concurrent transfers, compared to ~250MB/s with fewer concurrent transfers). It is because Glasgow had the most concurrent transfers that I thought it would have the highest rates.
But the peak rates I see for the sites appear to be the following (in the last month at least). If a site thinks it has seen better rates then comment on this post!
(The rtt between SEs is shown in brackets after each record. Well, it's actually the rtt to the closest router that traceroute can resolve.)

RALPP 250MB/s (0.2ms)
Imperial 250MB/s (5.0ms)
Manchester 200MB/s (8.0ms)
Glasgow 140MB/s (10.9ms)
Lancaster 100MB/s (6.4ms)
QMUL 100MB/s (5.7ms)
Birmingham 90MB/s (8.8ms)
Brunel 90MB/s (6.4ms)
Oxford 90MB/s (8.9ms)
Sheffield 80MB/s (9.1ms)
Liverpool 75MB/s (9.3ms)
RHUL 40MB/s (8.2ms)
Cambridge 30MB/s (10.3ms)
ECDF 20MB/s (12.9ms)
Bristol 17MB/s (4.0ms)
UCL 15MB/s (7.3ms)
Durham 12MB/s (14.8ms)

For some sites the limiting factor seems to be the link (i.e. transfers are running at line speed). For other sites the limiting factor is the number of concurrent transfers currently set in FTS. Something to tweak further... (Something really interesting is that the top two sites are the only dCache sites we have, but this could just be coincidence, since they also have the shortest and close to second-shortest rtt of any site.)
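To put the concurrency point in numbers, here is a rough sketch using only the Glasgow figures quoted above (140MB/s peak, 42 concurrent transfers, 10.9ms rtt). The function names are made up for illustration, and the even share per transfer is an assumption; real FTS transfers will not split the rate evenly.

```python
def per_transfer_rate(total_mb_s, concurrent):
    """Average rate of one transfer, assuming the total is shared evenly."""
    return total_mb_s / concurrent


def bandwidth_delay_product_mb(rate_mb_s, rtt_ms):
    """Data in flight (MB) needed to sustain rate_mb_s over a path with rtt_ms."""
    return rate_mb_s * rtt_ms / 1000.0


# Figures quoted in the post for Glasgow.
glasgow_each = per_transfer_rate(140, 42)
glasgow_in_flight = bandwidth_delay_product_mb(140, 10.9)

print(f"Glasgow: {glasgow_each:.2f} MB/s per transfer")
print(f"Glasgow: {glasgow_in_flight:.2f} MB in flight at peak rate")
```

The second figure is the usual bandwidth-delay product: the longer the rtt, the more data has to be in flight (summed over all concurrent streams) to hold the same rate.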

ATLAS have also started their sonar T2-T2 mesh for testing inter-cloud T2 transfers. This made me think of the work I had done (but not reported) on splitting the STAR-T2 channels at RAL into slow, medium and fast channels. The rough split is western European sites in the fast channel, North America in the medium channel and South America/Asia-Pacific in the slow channel (going by rtt). This would be an initial split, followed by some tweaking if some sites were slower than their rtt would suggest.
It will be interesting to see if the ATLAS list of slow transfers for UK sites matches mine.
My list of slow sites would be:

Australia-ATLAS

BEIJING-LCG2

CBPF

EELA-UTFSM

LCG_KNU

MA-01-CNRST

NCP-LCG2

SDU-LCG2

TOKYO-LCG2

UNIANDES

TW-FTT

TR-10-ULAKBIM

INDIACMS-TIFR

Medium sites (North America) would be:

Canadian

CA-ALBERTA-WESTGRID-T2

CA-SCINET-T2

CA-VICTORIA-WESTGRID-T2

SFU-LCG2

VICTORIA-LCG2

American

UST3

BUATLAS

UMICH

IUT

UTA

UIUC

UCTP

STU

UCT2

SMU

AGLT2

WIS

UMFS

SWT2UTA

SWT2CPN
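The rtt-based split described above, with manual tweaks for sites that behave worse than their rtt suggests, could be sketched roughly as follows. The thresholds (50ms and 150ms) and the placeholder site names are assumptions for illustration only; they are not the values used at RAL.

```python
# Hypothetical rtt thresholds in ms; tune against measured transfer rates.
FAST_MAX_RTT_MS = 50.0
MEDIUM_MAX_RTT_MS = 150.0


def classify(rtt_ms):
    """Initial channel choice based purely on round trip time."""
    if rtt_ms < FAST_MAX_RTT_MS:
        return "fast"
    if rtt_ms < MEDIUM_MAX_RTT_MS:
        return "medium"
    return "slow"


def assign_channels(site_rtts, overrides=None):
    """Map each site to a channel; overrides win over the rtt-based guess."""
    overrides = overrides or {}
    return {site: overrides.get(site, classify(rtt))
            for site, rtt in site_rtts.items()}


# Placeholder sites and rtts, not measurements.
example = assign_channels(
    {"SITE-A": 20.0, "SITE-B": 100.0, "SITE-C": 300.0, "SITE-D": 30.0},
    overrides={"SITE-D": "slow"},  # slower than its rtt would suggest
)
print(example)
```

Keeping the overrides separate from the thresholds means the initial split can be regenerated from fresh rtt measurements without losing the manual tweaks.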

Plus I am working with Glasgow to see how much of their 6Gbps can be used; but more on that in my next post.

03 November 2010

Confederated conference confusion

Summaries of the data management sessions at last week's OGF, as well as more CHEP discussion, have appeared in the minutes of this week's storage meeting. Meanwhile, there is a HEPiX meeting this week, which will (probably) be discussed at the storage meeting next week - particularly if we can find someone who went to it. Share and Enjoy.

24 October 2010

A computing banquet

Delayed post from Thursday....
So another day, another 100 talks, followed by another 100 food courses.

Just to provoke a suitable tirade from Sam, I will describe that morning's SSD plenary as an "informative and thorough summary on the unquestionable advantages of SSDs" (not really). More interesting was a talk from David South on long term data preservation for experiments. A very worthwhile idea I think, and I hope it can be supported.
In parallel talks, there was an update on Hammercloud developments - now available for LHCb and CMS as well as ATLAS, and apparently in the future other VOs will be able to "plug in". Also coming in the future are more advanced/configurable statistics.
Andreas Peters outlined the CERN "disk pool project" EOS and, as mentioned by Sam, the obvious questions followed from Brian B and others: "why yet another filesystem/storage manager? Are dCache, HDFS etc. etc. etc. not worth adapting..." But to look at the positive side (as they are certainly going to carry on with this anyway), if something good develops then it may be something worth T1s or T2s trying out.
In the same session there was another tool presented that we might use, a flexible benchmark that allows you to trace any application and then "play it back", copying the disk calls. Potentially very useful for testing out new kit - and, though it hasn't yet been packaged for general consumption, we'll definitely be following up with the developer for a preview.

Friday was the stream leaders summaries, which you don't need since you have had ours ;-).
Overall I've been very impressed both with the organisation of the conference and of the quality of the talks...

And, as my temple visits should have placated the travel gods to guide me home through typhoons, strikes and whatever else is out there, we should be back to provide a more digested summary in next Wednesday's storage meeting.

20 October 2010

CHEP: or, how I learned to stop worrying and love the rain.

So, day three of CHEP was a half day, so not so much to report as Wahid did yesterday.

The plenaries were not directly interesting from a storage perspective, but I should mention them for other qualities.
First, Kate Keahey told us all why clouds (and public clouds, federated clouds - "sky computing") were awesome. I guess I'm just a cynic, as I still don't see how they're significantly better than Condor pools (plus flocking, plus VM universe). Also, the data flow problem is decidedly unsolved for analysis-class jobs in this context.

Secondly, Lucas Taylor impressed on us how important it was to talk to the media (and, more importantly by far, the public). Apparently, the most significant source of hits on CERN webpages is Twitter! Considering also that the LHC is only 1/3 as popular as Barack Obama on YouTube, it does seem that the right approach can really bring in public interest, and this can only be a good thing.

Finally, Peter Malzacher told us about the FAIR project, which is to be the next big accelerator at GSI. Honestly, it looks awesome, but the 6MW cooling solution for the cluster looks terrifying.

Since I was presenting today, I only have notes from the session I was scheduled for.

The first two talks, both on virtualisation, confirmed that io can be an issue for many-VM hosts. The solution of the day appears to be iSCSI.
Then some dangerous radical told everyone ~~to throw their shoes in the machinery~~ that MLC flash isn't all it's cracked up to be in SSDs.
More upsets followed when Yves Kemp showed that pNFS/NFS4.1 is much better than dCap in almost all possible cases. It is, however, possible that dCap's problem is simply too much readahead.
Finally, Dirk Duellmann gave us an update from CERN storage. Essentially, they're pretty stable at the front-end, growing storage at 15PB/y. Additionally, they're trialling EOS for disk pool filesystems. EOS, as Jeff Templon got Dirk to admit under cross-examination, is basically Hadoop over xrootd protocol, with a better namespace.
Despite agreeing at Amsterdam that reinventing the wheel in private projects was Bad... (CERN could have chosen to patch Hadoop, or even Ceph, instead).

Tch.

CHEPping part 2

Having got my talk out of the way (more on that later), I am now free to blog my view on activities so far here in Taiwan. I will avoid telling you about the driving rain, puppet shows and million-mini-course dinners, sticking instead to the hard storage facts.
My highlight/snippets on Monday/Tuesday activities:
- Lots on many core - but the valid question was asked, can IO keep up with this?
- Patrick told us the plans (as they are now) for data management middleware in EMI. StoRM is in the plans (though was somewhat absent from the session to provide an update on their status).
- Oliver told us the roadmap for DPM (immediate news is that DPM 1.8.0 is in certification, including a 3rd party rfcp to allow it to be used for draining).
- Ricardo gave a nice talk on the DPM work on NFS4.1 which has reached the stage of a prototype.
For the slides on the latter talks see this session:
http://117.103.105.177/MaKaC/sessionDisplay.py?sessionId=33&slotId=0&confId=3#2010-10-19

- My talk went OK with many questions, including those (interested in) doing similar benchmarking work. Hopefully we can get some common ideas towards providing something useful for sites to test and tune.
- Unfortunately I was talking at the same time as Ilija's talk on the ATLAS ROOT improvements, which among other things outlined that some of the further improvements in ROOT 5.26 would not be available in the current ATLAS reprocessing due to some other bugs which, thanks to connections made during the talk, may get fixed. Also up at the same time (!) was Philippe Canal's talk with more detail on the ROOT changes as well as CMS's experiences in implementing them (http://117.103.105.177/MaKaC/contributionDisplay.py?contribId=150&sessionId=46&confId=3)

Other news - we had a very productive meeting with the DPM team, which should see us soon getting hold of the prerelease NFS4.1 interface for testing (around the next month) and also (probably before that) we'll be testing the "3rd party rfcp" mentioned above to tune it for fastest possible drains (yeah!). We also talked about creating a central repository to collect together any DPM related nagios probes that people are using (before consolidating / adding new ones).

Packed days (so many sessions at once that even with Sam and me covering different sessions we are still missing half the stuff) - and we are only half way through! So stand by for more info, if I don't get lost in the electronics markets or washed away by a typhoon.

18 October 2010

CHEP 2010: Episode 4: A New Hope

The story so far:

The evil Empire of CERN has succeeded in paralyzing the world's data networks by distributing vast quantities of 'event data' from their Death Star in Geneva.

However, at this very moment, a band of resistance fighters are congregating on the forest ~~moon~~ island of Taiwan to lead the fight back...

Ahem. So, Wahid and I are currently in Taipei for CHEP2010. Despite the jetlag encouraging me to write paragraphs like the above, we're seeing lots of interesting things.
Tellingly, the inaugural speech was given by the Vice President of Taiwan, and he mentioned how important Science was to Taiwanese success. Unlike France, I suspect the UK would find it hard to get Nick Clegg to turn up in similar circumstances.
Back to physics, where Ian Bird, and Roger Jones sequentially told us how successful we'd all been over the year, and Craig Lee told us how awesome cloud computing will be when it is public. Just like the Grid (and Condor clusters) before it, eh?
Finally, we had a discussion of many-core scaling for LHC VOs by Sverre Jarp. This is an area of significance in data provision, and the challenge of scaling io is something we're still looking at how best to address.

Of the parallel talks I attended, the most interesting was the CERNVMFS talk - it's still impressive how well it works.
Other interesting things: talks on EMI release processes (they have QA metrics!), posters on FTS over scp, Amazon EC2 for CMS (too expensive), L-Grid webportal, and the ATLAS consistency service.

More from Wahid tomorrow.

14 September 2010

Aran Fawddwy

Welcome to Wales, to Cardiff actually, but if there are (real) mountains in Cardiff then I haven't seen them yet. The environment is certainly impressive, in Cardiff's city hall. As science conferences go, we're not used to such impressive surroundings.

Currently in the data management session where we have just heard from the UKQCD collaboration. It is interesting that their data grid is the stuff that we (GridPP) run, and the stuff the other guys run - ILDG is a collaboration of five countries, and they don't all run the same thing, but that is fine as long as they interoperate. As usual, GridFTP is the workhorse moving data back and forth, but even then data volumes are such that "truckftp" is sometimes quicker.

Oh, if you're around, don't forget to stop by the GridPP/NGS stand and say hi.

31 August 2010

Taming XFS on SL5

Sites (including Liverpool) running DPM on pool nodes running SL5 with XFS file systems have been experiencing very high (up to multiple 100s Load Average and close to 100% CPU IO WAIT) load when a number of analysis jobs were accessing data simultaneously with rfcp.

The exact same hardware and file systems under SL4 had shown no excessive load, and the SL5 systems had shown no problems under system stress testing/burn-in. Also, the problem was occurring from a relatively small number of parallel transfers (about 5 or more on Liverpool's systems were enough to show an increased load compared to SL4).

Some admins have found that using ext4 at least alleviates the problem although apparently it still occurs under enough load. Migrating production servers with TBs of live data from one FS to another isn't hard but would be a drawn out process for many sites.

The fundamental problem for either FS appears to be IOPS overload on the arrays rather than sheer throughput, although why this is occurring so much under SL5 and not under SL4 is still a bit of a mystery. There may be changes in controller drivers, XFS, kernel block access, DPM access patterns or default parameters.

When faced with an IOPS overload (one that results in throughput well below the theoretical maximum of the array), one solution is to make each IO operation fetch more data from the storage device, so that you make fewer but larger read requests.

This leads to the actual fix (we have been doing this by default on our 3ware systems but we just assumed the Areca defaults were already optimal).

blockdev --setra 16384 /dev/$RAIDDEVICE

This sets the block device read ahead to (16384/2)kB (8MB). We have previously (on 3ware controllers) had to do this to get the full throughput from the controller. The default on our Areca 1280MLs is 128 (64kB read ahead). So when lots of parallel transfers are occurring our arrays have been thrashing spindles pulling off small 64kB chunks from each different file. These files are usually many hundreds or thousands of MB where reading MBs at a time would be much more efficient.
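The arithmetic behind those numbers can be sketched as below. blockdev --setra counts 512-byte sectors, which gives the 64kB and 8MB figures above. The request count is an idealised assumption (every request fetches exactly one read-ahead window, nothing is cached), and the 1024MB file is just an example size.

```python
import math

SECTOR_BYTES = 512  # blockdev --setra takes a count of 512-byte sectors


def readahead_kb(sectors):
    """Read-ahead window in kB for a given --setra value."""
    return sectors * SECTOR_BYTES // 1024


def reads_needed(file_mb, sectors):
    """Idealised number of requests to read a file one window at a time."""
    window_bytes = sectors * SECTOR_BYTES
    return math.ceil(file_mb * 1024 * 1024 / window_bytes)


for setra in (128, 16384):
    print(f"setra={setra}: {readahead_kb(setra)} kB window, "
          f"{reads_needed(1024, setra)} requests for a 1024 MB file")
```

Under these assumptions the larger window cuts the number of requests per file by a factor of 128, which is the IOPS saving the fix relies on.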

The mystery for us is more why the SL4 systems *don't* overload rather than why SL5 does, as the SL4 systems use the exact same default values.

Here is a ganglia plot of our pool nodes under about as much load as we can put on them at the moment. Note that previously our SL5 nodes would have LAs in the 10s or 100s under this load or less.

http://hep.ph.liv.ac.uk/~jbland/xfs-fix.html

Any time the systems go above 1LA now is when they're also having data written at a high rate. On that note we also hadn't configured our Arecas to have their block max sector size aligned with the RAID chunk size with

echo "64" > /sys/block/$RAIDDEVICE/queue/max_sectors_kb

although we don't think this had any bearing on the overloading and might not be necessary.

We expect the tweak to also work for systems running ext4 as the underlying hardware access would still be a bottleneck, just at a different level of access.

Note that this 'fix' doesn't fix the even more fundamental problem as pointed out by others that DPM doesn't rate limit connections to pool nodes. All this fix does is (hopefully) push the current limit where overload occurs above the point that our WNs can pull data.

There is also a concern that using a big read ahead may affect small random (RFIO) access although the sites can tune this parameter very quickly to get optimum access. 8MB is slightly arbitrary but 64kB is certainly too small for any sensible access I can envisage to LHC data. Most access is via full file copy (rfcp) reads at the moment.

26 August 2010

Climbing Everest

The slides should appear soon on the web page - the mountain themed programme labelled us Everest, the second highest mountain on the agenda.
Apart from the lovely fresh air and hills and outdoorsy activities, GridPP25 was also an opportunity to persuade the experiments (and not just the LHC ones) to give us feedback and discuss future directions - we'll try to collate this and follow up. We are also working on developing "services" which seem to be useful, e.g. checking integrity of files, or consistency between catalogues and the storage elements. And of course it was a chance for us to meet face to face and catch up over a coffee.

18 August 2010

Where Dave Lives and who shares his home....

The title of this blog post is all about where I am situated within the RALLCG2 site, and where my children are as well. I apparently also want to discuss the profile of "files" across a "disk server", as my avatar likes to put it; I prefer to think of this "Storage Element" that he talks of as my home and these "disk servers" as rooms inside my home.

I am made of 1779 files (~3TB, if you recall). I am spread across 8 of the 795 tapes in the RAL DATATAPE store (although the pool of tapes for real data is actually only 229 tapes). Counting everything on the tapes currently being used by ATLAS, I take up 1 of 1572 datasets but ~1/130 of the volume (~3TB of the ~380TB) stored on DATATAPE at RAL, and correspond to ~1/130 of the files (1779 out of ~230000). In this tape world I am deliberately kept to as small a subset of tapes as possible to allow for expedient recall.

However, when it comes to being on disk I want to be spread out as much as possible so as not to cause "hot disking". Spreading me across many rooms does mean that if a single room is down, the chance that I cannot be fully examined increases. In this disk world, my 3TB is part of the 700TB in ATLASDATADISK at RAL; I am 1 in 25k datasets and 1779 files in ~1.5 million. In this world my average filesize, at ~1.7GB per file, is a lot larger than the average 450MB filesize of all the other DATADISK files. (Filesize distribution is not linear, but that is a discussion for another day.) I am spread across 38 of the 71 rooms which existed in my space token when I was created. (There are now an additional 10 rooms, and this will continue to increase in the near term.)
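The "1 in N" figures above are simple ratios, and can be rechecked from the quoted totals with a few lines. This is only a sketch: it assumes 1TB = 1000GB and takes the rounded totals from the text at face value.

```python
def one_in(part, whole):
    """Express part/whole as the N in '1 in N'."""
    return whole / part


files = 1779
volume_gb = 3000.0  # ~3TB, assuming 1TB = 1000GB

avg_file_gb = volume_gb / files
tape_files = one_in(files, 230000)   # share of the files on DATATAPE
tape_volume = one_in(3, 380)         # share of the ~380TB on DATATAPE
disk_volume = one_in(3, 700)         # share of the 700TB in ATLASDATADISK

print(f"average file size: {avg_file_gb:.2f} GB")
print(f"tape: 1 in {tape_files:.0f} files, 1 in {tape_volume:.0f} of the volume")
print(f"disk: 1 in {disk_volume:.0f} of the volume")
```

The tape ratios both come out close to the ~1/130 quoted, and the average file size lands at the ~1.7GB given above.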

Looking at a random DATADISK server for every file:

1 in 20 datasets represented on this server are log datasets and that 1 in 11 files are log files and corresponds to 1 in 10200GB of the space used in the room.
1 in 2.7 datasets represented on this server are AOD datasets and that 1 in 5.1 files are AOD files and corresponds to 1 in 8.25GB of the space used in the room.
1 in 4.5 datasets represented on this server are ESD datasets and that 1 in 3.9 files are ESD files and corresponds to 1 in 2.32GB of the space used in the room.
1 in 8.3 datasets represented on this server are TAG datasets and that 1 in 8.5 files are TAG files and corresponds to 1 in 3430GB of the space used in the room.
1 in 47 datasets represented on this server are RAW datasets and that 1 in 17 files are RAW files and corresponds to 1 in 10.8GB of the space used in the room.
1 in 5.4 datasets represented on this server are DESD datasets and that 1 in 5.1 files are DESD files and corresponds to 1 in 3.67GB of the space used in the room.
1 in 200 datasets represented on this server are HIST datasets and that 1 in 46 files are HIST files and corresponds to 1 in 735GB of the space used in the room.
1 in 50 datasets represented on this server are NTUP datasets and that 1 in 16 files are NTUP files and corresponds to 1 in 130GB of the space used in the room.

Similar study has been done for a MCDISK server:
1 in 4.8 datasets represented in this room are log datasets and that 1 in 2.5 files are log files and corresponds to 1 in 18GB of the space used in the room.
1 in 3.1 datasets represented in this room are AOD datasets and that 1 in 5.7 files are AOD files and corresponds to 1 in 2.1GB of the space used in the room.
1 in 28 datasets represented in this room are ESD datasets and that 1 in 13.6 files are ESD files and corresponds to 1 in 3.2GB of the space used in the room.
1 in 4.3 datasets represented in this room are TAG datasets and that 1 in 14.6 files are TAG files and corresponds to 1 in 2000GB of the space used in the room.
1 in 560 datasets represented in this room are DAOD datasets and that 1 in 49 files are DAOD files and corresponds to 1 in 2200GB of the space used in the room.
1 in 950 datasets represented in this room are DESD datasets and that 1 in 11000 files are DESD files and corresponds to 1 in 600GB of the space used in the room.
1 in 18 datasets represented in this room are HITS datasets and that 1 in 6.3 files are HITS files and corresponds to 1 in 25GB of the space used in the room.
1 in 114 datasets represented in this room are NTUP datasets and that 1 in 71 files are NTUP files and corresponds to 1 in 46GB of the space used in the room.
1 in 114 datasets represented in this room are RDO datasets and that 1 in 63 files are RDO files and corresponds to 1 in 11GB of the space used in the room.
1 in 8 datasets represented in this room are EVNT datasets and that 1 in 13 files are EVNT files and corresponds to 1 in 100GB of the space used in the room.

As a sample, this MCDISK server represents 1/47 of the space used in MCDISK at RAL and ~1/60 of all files in MCDISK. This room was added recently, so any disparity might be due to this server being filled with newer rather than older files (which would be a good sign, as it shows ATLAS are increasing file size). Average filesize on this server is 211MB per file. Discounting log files this increases to 330MB per file (since the average log file size is 29MB).
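The step from 211MB to ~330MB follows from a weighted average, given that 1 in 2.5 files on this server are log files. A small sketch of that check, using only the numbers quoted above (the function name is made up):

```python
def average_without(overall_avg, excluded_fraction, excluded_avg):
    """Average size of the remaining files once one class is removed.

    overall_avg = f * excluded_avg + (1 - f) * remaining_avg,
    solved for remaining_avg.
    """
    remaining_fraction = 1.0 - excluded_fraction
    return (overall_avg - excluded_fraction * excluded_avg) / remaining_fraction


log_fraction = 1 / 2.5  # 1 in 2.5 files are log files
non_log_avg_mb = average_without(211.0, log_fraction, 29.0)
print(f"average non-log file size: {non_log_avg_mb:.0f} MB")
```

This gives roughly 332MB, in line with the ~330MB quoted.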

One thing my avatar is interested in knowing is, if one of these rooms were lost, how many of the files stored in that room could be found in another house and how many would be permanently lost.

For the "room" in the DATADISK Space Token, there are no files that are not located in another house. (This will not be the case all the time but is a good sign that the ATLAS model of replication is working.)

For the "room" in the MCDISK Space Token the following is the case:
886 out of 2800 datasets that are present are not complete elsewhere. Of these 886, 583 are log datasets (consisting of 21632 files).
Including log datasets there would be potentially 36283 files in 583 of the 886 datasets with a capacity of 640GB of lost data. (Average file size is 18MB.)
Ignoring log datasets this drops to 14651 files in 303 datasets with a capacity of 2.86TB of lost data.
The files on this disk server which are elsewhere are from 1914 datasets, consist of 18309 files, and fill a capacity of 8104GB.

14 August 2010

This quarter’s deliverable.

Apologies for the late reporting, but on 13/6/10 Cyrus Irving Bhimji went live. He is a mini-Thumper weighing in at just over 10lb from day one. Here he is deeply contemplating the future of grid storage.

Since then he has been doing well - as this week's performance data clearly illustrates.

09 August 2010

DPM 1.7.4 and the case of the changing python module.

Since we've just had our first GridPP DPM Toolkit user hit this problem, I thought the time was right to blog about it.

Between DPM 1.7.3 and DPM 1.7.4, there is one mostly-invisible change that only hits people using the API (like, for example, the GridPP DPM Toolkit). Basically, in an attempt to clean up code and make maintenance easier, the API modules have been renamed and split by the programming language that they support.
This means that older versions of the GridPP DPM Toolkit can't interface with DPM 1.7.4 and above, as they don't know the name of the module to import. The symptom of this is the "Failed to import DPM API" error, which persists even after following the instructions provided.

Luckily, we wise men at GridPP Storage already solved this problem.
http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/
contains two versions of the GridPP DPM Toolkit RPM for the latest release - one suffixed "DPM173-1" and one suffixed "DPM174-1". Fairly obviously, the difference is the version of DPM they work against.

Future releases of the Toolkit will only support the DPM 1.7.4 modules (and may come relicensed, thanks to the concerns of one of our frequent correspondents, who shall remain nameless).
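For anyone with their own scripts that need to survive a module rename like this, one general pattern is to try a list of candidate module names in order. This is a minimal sketch and not how the Toolkit itself handles it; the two names in DPM_MODULE_CANDIDATES are placeholders, so check the names actually shipped by your DPM release.

```python
import importlib


def import_first(candidates):
    """Return the first module in candidates that imports cleanly."""
    errors = []
    for name in candidates:
        try:
            return importlib.import_module(name)
        except ImportError as exc:
            errors.append(f"{name}: {exc}")
    raise ImportError("Failed to import DPM API; tried " + "; ".join(errors))


# Placeholder names only: substitute the real post-rename and pre-rename
# module names for your installation.
DPM_MODULE_CANDIDATES = ["dpm_new_module_name", "dpm_old_module_name"]

# Typical use in a script (left commented out as the names are placeholders):
# dpm = import_first(DPM_MODULE_CANDIDATES)
```

Listing every name that was tried in the error message makes it easier to tell a missing package apart from a renamed one.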

08 July 2010

Be vewy vewy qwuiet...

Boy, it sure is quiet here in T1 HQ. We're only about five people (and perhaps five of the quieter ones :-). Everybody else is out at IC for the WLCG workshop.

I managed to have a chat with some humanities folks earlier this week about archives and storage, they're in London for Digital Humanities this week. The key point is to make data available: for them it's about making sense of files, interpreting the contents, for the WLCG workshop it is about making the "physical" file available to the job which will analyse it. It is almost as if humanities have solved the transfer problem and HEP the semantic one - although I suspect humanities haven't really "solved" the transfer problem, they just have something which is good enough (many of the humanities datasets I saw are tiny, less than a hundred megs, and they mail CDs to people sometimes.) And HEP haven't really "solved" the semantics problem either, there was a working group looking at curation last year. Interesting to get different perspectives - we can learn from each other. This is another reason why it's good to have shared e-infrastructures.

21 June 2010

CDMI reference implementation available

CDMI is the SNIA Cloud Data Management Interface, an implementation of DaaS (Data as a Service). SNIA have today - at OGF29 in Chicago - announced the availability of a reference implementation, open source (BSD licence), written in Java. We just saw a version for a (virtual) iPad. Source code is available after registration.

Not uncontroversial

Very lively session for the Grid Storage Management community group.

We covered the new charter, agreed with the provision that we replace "EGEE" with something appropriate. We had a quick introduction to the protocol, an introduction which caused a lot more discussion than such introductions normally do.

Much of the time was spent discussing the WLCG data management jamboree. Which in a sense is outside the scope of the group, because the jamboree focused on data analysis, and SRM was designed for transfers and pre-staging and suchlike, completely different use cases.

Normally we have presentations from users, particularly those outside HEP, but since we had run out of time, those discussions had to be relegated to lunch or coffee breaks.

Slightly tricky with both experts and newbies in the room, giving introductions to SRM and also discussing technical issues. But this is how OGF works, and it is a Good Thing™ - it ensures that the discussions are open, exposes our work to others and lets others provide input.

20 June 2010

Too good to be true?

A grid filesystem with: transparent replication and partial replication, striping, POSIX interface and semantics, checksumming. Open source - GPL - and, unlike some grid "open source projects" we can mention, you can actually download the source. As fast as ext4 for linux kernel build. Planning NFSv4 and/or WebDAV interfaces.

This is the promise of XtreemFS, the filesystem part (but independent part) of XtreemOS, an EU funded project. More on this later in our weekly meetings.

17 June 2010

Have you heard this one before...

Sunny Amsterdam. Narrow streets, canals. Friendly locals, and a bicycle with your name on it. Wonderful place for a WLCG data management jamboree.

The super brief summary of yesterday is that some people are pushing for a more network centric data model. They point to video streaming, although others point out that video streaming is very different from HEP analysis. (More in the next couple of GridPP storage meetings.)

Today is more on technology, some known, some less so. One particular piece I would like to highlight is NFS4.1 which is still progressing and is now said to be "wonderful." :-)

There are lots of discussions which sound oddly familiar. For example, use of P2P networks has been suggested before (by Owen, back in EDG) and it's now coming up again. But of course technology moves on and middleware matures, so revisiting the questions and the proposed solutions will hopefully be useful.

Oh, and Happy to J "birthday" T.

27 May 2010

Filesystems for the Future: ext4 vs btrfs vs xfs (pt1)

One of the regular mandates on the storage group is to maintain recommendations for disk configuration for optimal performance. Filesystem choice is a key component of this, and the last time we did a filesystem performance shootout, XFS was the clear winner.

It's been a while since that test, though, so I'm embarking on another shootout, this time comparing the current XFS champion against the new filesystems that have emerged since: ext4 and btrfs.

Of course, btrfs is still "experimental" and ext4 is only present in the SL5 kernel as a "technology preview", so in the interests of pushing the crystal ball into the distant future, I performed the following tests on an install of the RHEL6 beta. This should be something like what SL6 looks like... whenever that happens.

For now, what I'm going to present are iozone performance metrics. For my next post, I'll be investigating the performance of gridftp and other transfer protocols (and hopefully via FTS).

So. As XFS was the champion last time, I generated graphs for the ratio of ext4, btrfs (with defaults) and btrfs (with compression on, and internal checksumming off, and with just internal checksumming off) to xfs performance on the same physical hardware. Values > 1 indicate performance surpassing XFS, values < 1 performance worse than XFS. Colours indicate the size of file written (from 2GB to 16GB) in KB*.
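For anyone wanting to produce the same kind of ratio from their own iozone runs, the normalisation is just a per-file-size division by the XFS result. A rough sketch follows; the throughput numbers in it are placeholders, not the measurements behind the graphs in this post.

```python
def ratios_to_baseline(results, baseline="xfs"):
    """Divide each filesystem's throughput by the baseline's, per file size."""
    base = results[baseline]
    return {
        fs: {size: rate / base[size] for size, rate in rates.items()}
        for fs, rates in results.items()
        if fs != baseline
    }


# Placeholder throughputs in MB/s, keyed by file size.
example_results = {
    "xfs": {"2GB": 200.0, "16GB": 100.0},
    "ext4": {"2GB": 180.0, "16GB": 110.0},
}

for fs, by_size in ratios_to_baseline(example_results).items():
    for size, ratio in by_size.items():
        verdict = "faster than" if ratio > 1 else "not faster than"
        print(f"{fs} at {size}: {ratio:.2f} ({verdict} xfs)")
```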

XFS is still the winner, therefore, on pure performance, except for the case of btrfs without internal btrfs checksum calculation, where btrfs regains some edge. I'm not certain how important we should consider filesystem-level per-file checksum functionality, since there is already a layer of checksum-verification present in the data management infrastructure as a whole. (However, note that turning on compression seems to totally ruin btrfs performance for files of this size - I assume that the cpu overhead is simply too great to overcome the file reading advantages.) A further caveat should be noted: these tests are necessarily against an unstable release of btrfs, and may not reflect its final performance. (Indeed, tests by IBM showed significant variation in btrfs benchmarking behaviour with version changes.)

^*Whilst data for smaller files is measured, there are more significant caching effects, so the comparison should be against fsynced writes for more accurate metrics for a loaded system with full cache. We expect to pick up such effects with later load tests against the filesystems, when time permits.

14 May 2010

And hotter:

So I had forgotten about some of my children (well, more precisely they are progeny, since they do not come directly from me but are my descendants). Some had gone further round the world, and even more descendants have been produced.
I now have 671 Ursulas, 151 Dirks and 162 Valerys; they also have 46 long-lost cousins I did not know about, from cousin Gavin (well, that's what I call him; ATLAS call him group owned datasets).

One problem I am having is that my children have now travelled miles around the world. (I myself have now been cloned and reside in the main site in the USA.)

In total, I now have children in Switzerland, UK, USA, Canada, Austria, Czech Republic, Ireland, Italy, France, Germany, Netherlands, Norway, Poland, Portugal, Romania, Russia, Spain, Sweden, Turkey, Australia, China, Japan and Taiwan.

My Avatar has been counting to calculate how much information has been produced by me.
If you remember, I was 1779 files and ~3.1TB in size. I now have 299078 unique child files (taking a volume of 21.44TB). Taking into consideration replication, this increases to ~815k files and 88.9TB.

12 May 2010

Usage really hotting up

Whoa! I turn my back for a moment and I suddenly get analysed massively (and not in the Higgsian sense). They say a week is a long time in politics; it seems it's an eternity on the grid. My Friendly masters holding up the world have created a new tool so that I can now easily see how busy my children and I have been.

My usage is now as follows:
I now have 384 children all beginning with user* (now known as Ursulas)
These 384 children have been produced by 129 unique ATLAS users
Of these:
60 only have 1 child
17 have 2 children
12 have 3 children
16 have 4 children
6 have 5 children
7 have 6 children
3 have 7 children
2 have 8 children
3 have 9 children
1 has 14 children
1 has 16 children
1 has produced 24 children!
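
As a quick sanity check that the breakdown above adds up to the quoted totals, here is a small sketch (the variable names are just for illustration):

```python
# Key: number of child datasets per user; value: how many users produced
# that many children. Figures copied from the list above.
users_by_children = {
    1: 60, 2: 17, 3: 12, 4: 16, 5: 6, 6: 7,
    7: 3, 8: 2, 9: 3, 14: 1, 16: 1, 24: 1,
}

total_users = sum(users_by_children.values())
total_children = sum(n * users for n, users in users_by_children.items())
mean_children = total_children / total_users

print(total_users, total_children, round(mean_children, 2))
```

The two totals come out as 129 users and 384 children, matching the headline numbers, so nearly half of the users account for just one child each.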

I now have 61 children all beginning with data* (now known as Dirks)
I now have 27 children all beginning with valid* (now known as Valerys)

27 April 2010

Hurray! Hurray! I've been reprocessed!

Further news regarding jobs that run on me. I have now been reprocessed at RAL!!
5036 jobs.
States: running:35 holding:1 finished:3989 failed:1011
Users (6): A:24 B:828 C:446 D:1944 E:15 F:1779
Releases (3): A1:828 A2:4134 A3:74
Processing types (3): pathena:1298 reprocessing:1779 validation:1959
Job types (2): managed:3738 user:1298
Transformations (2): Reco.py:3738 runAthena:1298
Sites (11): CERN:24 IFIC:1274 CERN-RELEASE:76 RAL:3410 Brunel:32 QMUL:20 UCL:45 LANCS:59 GLASGOW:66 CAM:10 RALPP:20
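
Each of the breakdowns above should add up to the same 5036 jobs. A minimal sketch of how one might check that from the pasted summary lines follows; the parser is my own illustration and assumes every entry is written as key:count.

```python
def parse_summary(line):
    """Split a 'Label (n): key:count key:count ...' line into (label, counts)."""
    label, _, rest = line.partition(":")
    counts = {}
    for item in rest.split():
        # rpartition so that only the final colon separates key from count
        key, _, value = item.rpartition(":")
        counts[key] = int(value)
    return label.strip(), counts


states_line = "States: running:35 holding:1 finished:3989 failed:1011"
users_line = "Users (6): A:24 B:828 C:446 D:1944 E:15 F:1779"

for line in (states_line, users_line):
    label, counts = parse_summary(line)
    print(label, sum(counts.values()))

# Fraction of all jobs that ended in the failed state.
_, states = parse_summary(states_line)
failed_fraction = states["failed"] / sum(states.values())
print(f"failed fraction: {failed_fraction:.3f}")
```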

Most of these jobs within the UK are only validation jobs (a mix of sites which fail or succeed). Really strange, since the dataset location has not changed (still only at CERN, RAL and IFIC).
A large number of reprocessing jobs have been completed at RAL, as you would expect.

Derived datasets now at RAL are multiplying like rabbits; including *sub* datasets there are 197 children of Dave. Ignoring the subs, there are 65 parents. In total there are 2444 files associated with Dave!

These look like:
data10_7TeV..physics_MinBias.merge.AOD.*
data10_7TeV..physics_MinBias.merge.DESD_MBIAS.*
data10_7TeV..physics_MinBias.merge.DESDM_EGAMMA.*
data10_7TeV..physics_MinBias.merge.DESD_MET.*
data10_7TeV..physics_MinBias.merge.DESD_PHOJET.*
data10_7TeV..physics_MinBias.merge.DESD_SGLEL.*
data10_7TeV..physics_MinBias.merge.DESD_SGLMU.*
data10_7TeV..physics_MinBias.merge.ESD.*
data10_7TeV..physics_MinBias.merge.HIST.*
data10_7TeV..physics_MinBias.merge.log.*
data10_7TeV..physics_MinBias.merge.NTUP_MUONCALIB.*
data10_7TeV..physics_MinBias.merge.NTUP_TRKVALID.*
data10_7TeV..physics_MinBias.merge.TAG.*
data10_7TeV..physics_MinBias.merge.TAG_COMM.*
data10_7TeV..physics_MinBias.recon.ESD.*
data10_7TeV..physics_MinBias.recon.HIST.*
data10_7TeV..physics_MinBias.recon.log.*
data10_7TeV..physics_MinBias.recon.NTUP_TRKVALID.*
data10_7TeV..physics_MinBias.recon.TAG_COMM.*
valid1..physics_MinBias.recon.AOD.*
valid1..physics_MinBias.recon.DESD_MBIAS.*
valid1..physics_MinBias.recon.DESDM_EGAMMA.*
valid1..physics_MinBias.recon.DESD_MET.*
valid1..physics_MinBias.recon.DESD_SGLMU.*
valid1..physics_MinBias.recon.ESD.*
valid1..physics_MinBias.recon.HIST.*
valid1..physics_MinBias.recon.log.*
valid1..physics_MinBias.recon.NTUP_MUONCALIB.*
valid1..physics_MinBias.recon.NTUP_TRIG.*
valid1..physics_MinBias.recon.NTUP_TRKVALID.*
valid1..physics_MinBias.recon.TAG_COMM.*

I also had children copied to LOCALGROUPDISK at
UKI-NORTHGRID-LANCS-HEP_LOCALGROUPDISK
and
UKI-LT2-QMUL_LOCALGROUPDISK

Plus 17 users have put between 1 and 3 datasets each (totalling 24 datasets) into SCRATCHDISK space tokens across 6 T2 sites within the UK (the numbers of SCRATCHDISK datasets at these six sites are 1,1,4,6,10).

16 April 2010

Been on my holidays and plans for the future.

Been cloned to a Spanish LOCALGROUPDISK.
6120 jobs.
States: finished:504 failed:5616
Users (3): A:1890 B:1114 C:3116
Releases (3): 1:47 2:1114 3:4959
Processing types (2): ganga:2670 pathena:3450
Job types (1): user:6120
Transformations (1): 1:6120
Sites (3): ANALY_CERN:1823 ANALY_FZK:67 ANALY_IFIC:4230

(Not sure how the jobs in Germany worked, since according to dq2-ls I am only at RAL and IFIC. I also want to find out where I was copied from when I was copied to IFIC, i.e. was it direct from CERN or RAL, or did I go via PIC? If I did go via PIC, how long was I there before being deleted?)

I expect to be reprocessed soon, so it will be interesting to see how I spread and to see older versions of my children get deleted.

07 April 2010

A Busy week for Dave the Dataset

I am only eight days old and already I am prolific.
I now have 51 descendant datasets.
Only some of these have been copied to RAL:
Those are
data10_7TeV.X.physics_MinBias.merge.AOD.f235_m427
data10_7TeV.X.physics_MinBias.merge.AOD.f236_m427
data10_7TeV.X.physics_MinBias.merge.AOD.f239_m427
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESD_MBIAS.f239_m429
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESDM_EGAMMA.f239_m429
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESD_PHOJET.f239_m429
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f235_m428
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f236_m428
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f236_m429
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f239_m428
data10_7TeV.X.physics_MinBias.merge.DESD_SGLEL.f239_m429
data10_7TeV.X.physics_MinBias.merge.RAW
data10_7TeV.X.physics_MinBias.merge.TAG_COMM.f235_m426
data10_7TeV.X.physics_MinBias.merge.TAG_COMM.f236_m426
data10_7TeV.X.physics_MinBias.merge.TAG.f235_m427
data10_7TeV.X.physics_MinBias.merge.TAG.f236_m427
data10_7TeV.X.physics_MinBias.merge.TAG.f239_m427

As you can see, this is a wide range of file types.
The volumes contained in each dataset, in terms of size and number of events, vary greatly. Of the ~65k events in my RAW form, only 1 or 2 events have survived into some child datasets.

Of the 10 T2s currently associated with RAL for the distribution of real ATLAS data, some of my children have gone to 8 of them. Those children which have been distributed are being cloned into two copies and sent to different T2s, following the general ATLAS model and the shares decided for the UK.
A break in the initial data model is expected and ESD will be sent to T2s. Let us see how long it takes for this to happen.....

My children and I have also been analysed by jobs on WNs in various countries and by multiple users.
For the three AOD datasets:

The first incarnation was analysed by:
2 Users spread over
2 sites over
2 releases of which
0/11 were analyzed at UK sites all of which were analyzed by
pathena

The second incarnation was analysed by:
29 Users spread over
20 sites over
8 releases of which
94/1032 jobs were analyzed at ANALY_RALPP and ANALY_SHEF all using
pathena

The 3rd incarnation has so far been analysed by:
10 Users spread over
11 sites over
3 releases of which
9/184 jobs were analyzed at ANALY_OX using both
pathena and ganga

31 March 2010

The Fall and Rise of Dave the Dataset

Hello, my full name is data10_7TeV.%$%$%$%$%.RAW but you can call me Dave. I am a dataset within ATLAS. Here I will be blogging my history and that of all the dataset replicas and children datasets that the physicists produce from me.

I came about from data taking at the LHC on the 30th March 2010 from the ATLAS detector.
I initially have 1779 files containing 675757 events. I was born a good 3.13TB.
By the end of my first day I have already been copied so that I exist in two copies on disk and two sets of tape. This should ensure my continued survival and avoid loss.
So I am now secure in my own existence; let's see if anyone cares to read me or move me to different sites.

30 March 2010

Analysing a node chock full of analysis.

As Wahid's previous post notes, we've been doing some testing and benchmarking of the performance of data access under various hardware and data constraints (particularly: SSDs vs HDDs for local storage, "reordered" AODs vs "unordered" AODs, and there are more dimensions to be added).

Although this is a little preliminary, I took some blktrace traces of the activity on a node with an SSD (an Intel X25 G2) mounted on /tmp, and a node with a standard partition of the system HDD as /tmp, whilst they coped with being filled full of HammerCloud-delivered muon analysis jobs. Each trace was a little over an hour of activity, starting with each HammerCloud test's start time.

Using seekwatcher, you can get a quick summary plot of the activity of the filesystem during the trace.

In the following plots, node300 is the one with the HDD, and node305 is the one with the SSD.

Firstly, under stress from analysis of the old AODs, not reordered:

Node 300 (above)

Node 305 (above)

As you can see, the seek rates for the HDD node hit the maximum expected seeks per second for a 7200 rpm device (around 120 seeks per second), whilst the seeks on the SSD peak at around 2 to 2.5 times that. The HDD's seek rate is a significant limit on the efficiency of jobs under this kind of load.
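
As a rough back-of-envelope check on that figure, the sketch below estimates the random seek rate of a spinning disk from its rotation speed. The average seek time of 4.2 ms is an assumed, typical-looking value for a 7200 rpm drive, not something measured on node300, and the function name is only illustrative.

```python
def max_random_seeks_per_second(rpm, avg_seek_ms):
    """Estimate random seeks per second for a spinning disk.

    Each random access costs the average head seek time plus, on average,
    half a revolution of rotational latency.
    """
    rotational_latency_ms = 0.5 * 60_000.0 / rpm  # half a turn, in ms
    service_time_ms = avg_seek_ms + rotational_latency_ms
    return 1000.0 / service_time_ms


# 7200 rpm gives about 4.17 ms of rotational latency on its own;
# the 4.2 ms seek time is an assumption.
print(round(max_random_seeks_per_second(7200, 4.2)))
```

With those assumptions each access costs a little over 8 ms, which lands at roughly 120 seeks per second. An SSD has no heads or platters, so neither term applies to it.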

Now, for the same analysis, against reordered AODs. Again, node300 first, then node305.

Notice that the seek rate for both the SSD and the HDD peaks below 120 seeks per second, and the sustained seek rate for both of them is around half that. (This is with both nodes loaded completely with analysis work).
So, reordering your datasets definitely improves their performance with regard to seek ordering...

26 March 2010

Testing times

Data analysis at grid sites is hard on poor disk servers. This is in part because of the "random" access pattern seen when jobs access the data. Recently LHC experiments have been "reordering" their files to better match the way they might be expected to be accessed.
Initially the access pattern on these new files looks more promising as these plots showed.
But those tests read the data in the new order so are sure to see improvements. Also, as the plots hint at, any improvement is very dependent on access method, file size, network config and a host of other factors.

So recently we have been trying to access these datasets with real ATLAS analysis type jobs at Glasgow. Initial indications look like the improvement will not be quite as much as hoped, but tests are ongoing so we'll report back.

04 March 2010

Checksumming and Integrity: The Challenge

One key focus of the Storage group as a whole at the moment is the thorny issue of data integrity and consistency across the Grid. This turns out to be a somewhat complicated, multifaceted problem (the full breakdown is on the wiki here), and one which already has fractions of it solved by some of the VOs.

ATLAS, for example, has some scripts managed by Cedric Serfon which do the checking of data catalogue consistency correctly, between ATLAS's DDM system, the LFC and the local site SE. They don't, however, do file checksum checks, and therefore there is potential for files to be correctly placed, but corrupt (although this would be detected by ATLAS jobs when they run against the file, since they do perform checksums on transferred files before using them).

The Storage group has an integrity checker which does checksum and catalogue consistency checks between LFC and the local SE (in fact, it can be run remotely against any DPM), but it's much slower than the ATLAS code (mainly because of the checksums).
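
To make the two kinds of check concrete, here is a minimal sketch, not the actual integrity checker. It assumes Adler-32 as the checksum algorithm (the post does not say which one is used) and treats the catalogue and the SE as simple name-to-checksum maps; all names are hypothetical.

```python
import zlib


def adler32_of_file(path, chunk_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits.

    The file is read in chunks, which is why checksumming is the slow part:
    every byte of every file has to be read back from disk.
    """
    value = 1  # Adler-32 starting value
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")


def compare_catalogue_with_se(catalogue, storage):
    """Compare two {file name: checksum} maps.

    Returns three sorted lists: files in the catalogue but not on storage,
    files on storage but not in the catalogue, and files present in both
    whose checksums differ.
    """
    lost = sorted(set(catalogue) - set(storage))
    dark = sorted(set(storage) - set(catalogue))
    corrupt = sorted(
        name for name in set(catalogue) & set(storage)
        if catalogue[name] != storage[name]
    )
    return lost, dark, corrupt


# Hypothetical entries, purely for illustration.
catalogue = {"fileA": "11e60398", "fileB": "0a1b2c3d"}
storage = {"fileB": "ffffffff", "fileC": "00000001"}
print(compare_catalogue_with_se(catalogue, storage))
```

The catalogue comparison only needs two listings, so it is cheap; the checksum step is what makes a full integrity check slow.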

Currently, the plan is to split effort between improving VO specific scripts (adding checksums), and enhancing our own script - one issue of key importance is that the big VOs will always be better able to write specific scripts for their own data management infrastructures than we will, but the small VOs deserve help too (perhaps more so than the big ones), and all these tools need to be interoperable. One aspect of this that we'll be talking about a little more in a future blog post is standardisation of input and output formats - we're planning on standardising on SynCat, or a slightly-derived version of SynCat, as a dump and input specifier format.

This post exists primarily as an informational post, to let people know what's going on. More detail will follow in later blog entries. If anyone wants to volunteer their SE to be checked, however, we're always interested...

01 March 2010

A Phew Good Files

The storage support guys finished integrity checking of 5K ATLAS files held at Lancaster and found no bad files.

This, of course, is a Good Thing™.

The next step is to check more files, and to figure out how implementations cache checksums. Er, the next two steps are to check more files and document handling checksums, and do it for more experiments. Errr, the next three steps are to check more files, document checksum handling, add more experiments, and integrate toolkits more with experiments and data management tools.

There have been some reports of corrupted files but corruptions can happen for more than one reason, and the problem is not always at the site. The Storage Inquisition investigation is ongoing.

22 December 2009

T2 storage Ready for ATLAS Data Taking.. Or are we??

Been a busy couple of Months really; what with helping the Tier2 sites to prepare their storage for data taking.... Good news is the sites have done really well.
Of the three largest LHC VOs, most work has been done with ATLAS (since they have the hungriest need for space and the greatest complexity for site administration of Tier2 space).

All sites now have the space tokens for atlas that they require.
http://www.hep.lancs.ac.uk/~love/ukdata/

The ATLAS people have also been ready to see what space is available to them and adjust their usage to this.
http://atlddm02.cern.ch/dq2/accounting/cloud_view/UKSITES/30/
http://atladcops.cern.ch:8000/drmon/crmon_TiersInfo.html

Almost all sites had their SE/SRMs in the process of upgrade/decommissioning ready for data taking in '09, and all should be ready for '10.
Sites were very good at making the changes required by ATLAS's changing needs for space token distribution.
Sites have also been really good in working with ATLAS via atlas "hammercloud" tests to improve their storage.
Some issues still remain (Draining on DPM, limiting gridFTP connections etc, lost disk server process, data management by the VOs etc) but these challenges/opportunities will make our lives "interesting" over the coming months..

So that covers some of the known knowns.

The known unknowns (how user analysis of real data affects T2 storage) are also going to come about over the next few months, but I feel the GRIDPP-Storage team, the atlas-uk support team and the site admins are all ready to face what the LHC community throw at us.

Unknown unknowns; we will deal with them when they come at us....

09 December 2009

When it's a pain to drain

Some experiences rejigging filesystems at ECDF today. Not sure I am recommending this approach, but some of it may be useful as a dpm-drain alternative in certain circumstances.

Problem was that some data had been copied in with a limited lifetime but was in fact not OK to delete. Using dpm-drain would delete those so instead I marked the filesystem RDONLY and then did:

dpm-list-disk --server=pool1.glite.ecdf.ed.ac.uk --fs=/gridstorage010 > Stor10Files

I edited this file to replace Replica: with dpm-replicate (and delete the number at the end). (Warning: If these files are in a spacetoken you should also specify the spacetoken in this command)
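
If you would rather script that edit than do it by hand, something like the sketch below would work. It assumes each interesting line of the listing looks like "Replica: <file name> <number>", which is my reading of the description above rather than a documented format, and the example path is hypothetical. It adds no spacetoken option, so the warning above still applies.

```python
def to_replicate_commands(listing_lines, command="dpm-replicate"):
    """Turn 'Replica: <file name> <number>' lines into one command per file.

    Lines that do not start with 'Replica:' are skipped. The leading
    'Replica:' is replaced by the command name and the trailing number
    is dropped, mirroring the manual edit.
    """
    commands = []
    for line in listing_lines:
        fields = line.split()
        if len(fields) < 3 or fields[0] != "Replica:":
            continue
        commands.append(f"{command} {' '.join(fields[1:-1])}")
    return commands


# Hypothetical listing line, for illustration only.
sample = ["Replica: pool1.example.org:/gridstorage010/atlas/file1 12345"]
for cmd in to_replicate_commands(sample):
    print(cmd)
```

Writing the commands to a file and keeping that file around also gives a record of what was attempted if the run has to be aborted.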

Unfortunately I had to abort this part way through which left me in a bit of a pickle not knowing what files had been duplicated and could be deleted.
While you could probably figure out a way of doing this using dpm-disk-to-dpns and dpm-dpns-to-disk I instead opted for the database query

select GROUP_CONCAT(cns_db.Cns_file_replica.sfn), cns_db.Cns_file_replica.setname, count(*) from cns_db.Cns_file_replica where cns_db.Cns_file_replica.sfn LIKE '%gridstorage%' group by cns_db.Cns_file_replica.fileid INTO outfile '/tmp/Stor10Query2.txt';

This gave me a list of physical file names and the number of copies (and the spacetoken), which I could grep for a list of those with more than one copy.
grep "," /tmp/Stor10Query2.txt | cut -d ',' -f 1 > filestodelete
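
The same filtering can be sketched in Python, which is a little stricter than grepping for a comma anywhere on the line. It assumes MySQL's default outfile layout (tab-separated columns) and the default comma separator of GROUP_CONCAT; the paths in the example are hypothetical. Note that GROUP_CONCAT output is truncated at group_concat_max_len, so very long replica lists may be cut short.

```python
def files_with_extra_replicas(outfile_lines):
    """Return the first listed replica of every file with more than one copy.

    Each line is expected to hold three tab-separated columns: a
    comma-separated list of replica names, the spacetoken, and the count.
    Only the first column is inspected for commas.
    """
    first_replicas = []
    for line in outfile_lines:
        fields = line.rstrip("\n").split("\t")
        replicas = fields[0].split(",")
        if len(replicas) > 1:
            first_replicas.append(replicas[0])
    return first_replicas


# Hypothetical rows, for illustration only.
rows = [
    "serverA:/gridstorage010/f1,serverB:/gridstorage003/f1\t\\N\t2\n",
    "serverA:/gridstorage010/f2\t\\N\t1\n",
]
print(files_with_extra_replicas(rows))
```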

I could then edit this filestodelete to add dpm-delreplica to each line and sourced it to delete the files. I also made a new list of files to replicate in the same way as above. Finally I repeated the query to check all the files had 2 replicas before deleting all the originals.

Obviously this is a bit of a palaver and not the ideal approach for many reasons, including that there is no check that the replicas are identical, and that the replicas made are still volatile, so I'll probably just encounter the same problem again down the line. But if you really can't use dpm-drain for some reason - there is at least an alternative.

24 November 2009

Storage workshop discussion

If you have followed the weeklies, you will have noticed we're discussing having another storage workshop. The previous one was thought extremely useful, and we want to create a forum for storage admins to come together and share their experiences with Real Data(tm).
Interestingly, we now have (or are close to getting!) more experience with tech previously not used by us. For example, does it improve performance having your DPM db on SSD? Is Hadoop a good option for making use of storage space on WNs?
We already have a rough agenda. There should be lots of sysadmin-friendly coffee-aided pow-wows. Maybe also some projectplanny stuff, like the implications for us of the end of EGEE, the NGI, GridPP4, and suchlike.
Tentatively, think Edinburgh in February.

23 November 2009

100% uptime for DPM

(and anything else with a MySQL backend).

This weekend, with the ramp up of jobs through the Grid as a result of some minor events happening in Geneva, we were informed of a narrow period during which jobs failed accessing Glasgow's DPM.

There were no problems with the DPM, and it was working according to spec. However, the period was correlated with the 15 minutes or so that the MySQL backend takes to dump a copy of itself as backup, every night.

So, in the interests of improving uptime for DPMs to >99%, we enabled binary logging on the MySQL backend (and advise that other DPM sites do so as well, disk space permitting).

Binary logging (which is enabled by adding the string "log-bin" on its own line to /etc/my.cnf, and restarting the service) enables (amongst other things, including "proper" uptothesecond backups) a MySQL-hosted InnoDB database to be dumped without interrupting service at all, thus removing any short period of dropped communication.
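
If you want to check whether a given my.cnf already has this switched on, a small sketch follows. It is a deliberately simple line-based check rather than a full option-file parser (it does not follow include directives), and it assumes the option belongs in the [mysqld] section, which is where server options are read from.

```python
def binary_logging_enabled(my_cnf_text):
    """Return True if a log-bin option appears in the [mysqld] section."""
    section = None
    for raw in my_cnf_text.splitlines():
        line = raw.strip()
        if not line or line.startswith(("#", ";")):
            continue  # skip blank lines and comments
        if line.startswith("[") and line.endswith("]"):
            section = line[1:-1].strip().lower()
            continue
        # Options may be bare ("log-bin") or have a value ("log-bin=name");
        # MySQL treats dashes and underscores in option names alike.
        option = line.split("=", 1)[0].strip().lower().replace("_", "-")
        if section == "mysqld" and option == "log-bin":
            return True
    return False


sample_cnf = """
[mysqld]
datadir=/var/lib/mysql
log-bin
"""
print(binary_logging_enabled(sample_cnf))
```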

(Now any downtime is purely your fault, not MySQL's.)

12 November 2009

Nearly there

The new CASTOR information provider is nearly back, the host is finally back up, but given that it's somewhat late in the day we better not switch the information system back till tomorrow. (We are currently running CIP 1.X, without nearline accounting.)

Meanwhile we will of course work on a resilienter infrastructure. We also did that before, it's just that the machine died before we could complete the resilientification.

We do apologise for the inconvenience caused by this incredibly exploding information provider host. I don't know exactly what happened to it, but given that it took a skilled admin nearly three days to get it back, it must have toasted itself fairly thoroughly.

While we're on the subject, a new release is under way for the other CASTOR sites - the current one has a few RAL-isms inside, to get it out before the deadline.

When this is done, work can start on GLUE 2.0. Hey ho.

10 November 2009

Kaboom

Well it seems we lost the new CASTOR information provider (CIP) this morning and the BDII was reset to the old one - the physical host it lived on (the new one) decided to kick the bucket. One of the consequences is that nearline accounting is lost, all nearline numbers are now zero (obviously not 44444 or 99999, that would be silly...:-)).
Before you ask, the new CIP doesn't run on the old host because it was compiled for 64 bit on SLC5, and the old host is 32 bit SL4.
We're still working on getting it back, but are currently short of machines that can run it, even virtual ones. If you have any particular problems, do get in touch with the helpdesk and we'll see what we can do.

30 September 2009

CIP update update

We are OK: problems in deployment that had not been caught in testing appear to be due to different versions of lcg-utils (used for all the tests) behaving subtly differently. So I could run tests as dteam prior to release and they'd work, but the very same tests would fail on the NGS CE after release, even if they'd also run as dteam. Those were finally fixed this morning.

29 September 2009

CIP deployment

As some of you may have noticed, the new CASTOR information provider (version 2.0.3) went live as of 13.00 or thereabouts today.

This one is smarter than the previous one: it automatically picks up certain relevant changes to CASTOR. It has nearline (tape) accounting as requested by CMS. It is more resilient against internal errors. It is easier to configure. It also has an experimental bugfix for the ILC bug (it works for me on dteam). It has improved compliance with WLCG Installed Capacity (up to a point, it is still not fully compliant.)

Apart from a few initial wobbles and adjustments which were fixed fairly quickly (but still needed to filter through the system), real VOs should be working.

ops was trickier, because they have access to everything in hairy ways, so we were coming up red on the SAM tests for a while. This appears to be sorted out for the SE tests, but still causes the CE tests to fail. Which is odd, because the failing CE tests consist of jobs that run the same data tests as the SE tests, which work. I talked to Stephen Burke who suggested a workaround which is now filtering through the information system.

We're leaving it at-risk till tomorrow - and the services are working. On the whole, apart from the ops tests with lcg-utils, I think it went rather well: the CIP is up against two extremely complex software infrastructures, CASTOR on one side, and the grid data management on the other, and the CIP itself has a complex task trying to manage all this information.

Any Qs, let me know.

28 September 2009

Replicating like HOT cakes

As mentioned on the storage list, the newest versions of the GridPP DPM Tools (documented at http://www.gridpp.ac.uk/wiki/DPM-admin-tools) contain a tool to replicate files within a spacetoken (such as the ATLASHOTDISK).

At Edinburgh this is running in cron

DPNS_HOST=srm.glite.ecdf.ed.ac.uk
DPM_HOST=srm.glite.ecdf.ed.ac.uk
PYTHONPATH=/opt/lcg/lib64/python
0 1 * * * root /opt/lcg/bin/dpm-sql-spacetoken-replicate-hotfiles --st ATLASHOTDISK >> /var/log/dpmrephotfiles.log 2>&1

Some issues observed are :
* Takes quite a long time to run the first time. Because of all the dpm-replicate calls on the ~1000 files that ATLAS stuck in there it took around 4 hours just for 1 extra copy. Since then though only the odd file has come in - so it doesn't have much to do.
* The replicas are always on different filesystems - but not always on a different disk server. This obviously depends on how many servers you have for that pool (compared to the nreps you want), as well as how many filesystems there are on each server. The replica creation could be more directed, but perhaps it should be the default behaviour of the built in command to use a different server if it can.

Intended future enhancements of this tool include:
* List in a clear way the physical duplicates in the ST.
* Remove excess duplicates.
* Automatic replications of a list of "hotfiles"

Other suggestions welcome.

20 August 2009

GridPP DPM toolkit v2.5.2 released

Hello all,

Another month, another toolkit release.
This one, relative to the last announced release (2.5.0) has a slightly improved functionality for dpm-sql-list-hotfiles and adds a -o (or --ordered) option to dpm-list-disk.
The -o option returns a sorted list of the files in the space selected, descending by filesize. As this uses the dpm API, the process currently needs to pull the entire filelist before sorting it, so, unlike the normal mode, you get all the files output in one go (after a pause of some minutes while all the records are acquired + sorted).
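
The difference between the two modes can be illustrated with a small sketch. This is not the toolkit's code; the record layout (name, size) and the function name are assumptions made for the example.

```python
def list_files(records, ordered=False):
    """Yield (name, size) records, optionally sorted by size, largest first.

    Without ordering, each record is passed on as soon as it arrives.
    With ordering, sorted() has to consume every record before the first
    one can be yielded, hence the pause before any output appears.
    """
    if not ordered:
        yield from records
        return
    yield from sorted(records, key=lambda rec: rec[1], reverse=True)


# Hypothetical file records, for illustration only.
files = [("small.root", 10), ("big.root", 3000), ("medium.root", 200)]
for name, size in list_files(iter(files), ordered=True):
    print(size, name)
```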

There's also a new release of the Gridpp-DPM-monitor package, which includes some bug fixes and the new user-level accounting plot functionality. This should work fine, but if anyone has any problems, contact me as normal.

All rpms at the usual place:
http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/

31 July 2009

GridPP DPM toolkit v2.5.0 released

Hello everyone,

I've just released version 2.5.0 of the GridPP DPM toolkit.
The main feature of this release is the addition of the tool
dpm-sql-list-hotfiles
which should be called as
dpm-sql-list-hotfiles --days N --num M
to return the top M "most popular" files over the past N days.
Caveat: the query involved in calculating the file temperature is a little more intensive than the average queries implemented in the tool kit. You may see a small load spike in your DPM when this executes, so don't run it a lot in a short space of time if your DPM is doing something important.
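
To illustrate the idea of "top M over the past N days", here is a minimal in-memory sketch. It is not how the tool works internally (the tool queries the DPM database), and it assumes that popularity simply means the number of recorded accesses; the record layout and names are hypothetical.

```python
from collections import Counter
from datetime import datetime, timedelta


def hottest_files(accesses, days, num, now=None):
    """Return the num most-accessed files over the past days days.

    accesses is an iterable of (file name, access time) pairs.
    The result is a list of (file name, access count), hottest first.
    """
    now = now or datetime.now()
    cutoff = now - timedelta(days=days)
    counts = Counter(name for name, when in accesses if when >= cutoff)
    return counts.most_common(num)


# Hypothetical access records, for illustration only.
reference = datetime(2009, 7, 31)
log = [
    ("fileA", datetime(2009, 7, 30)),
    ("fileA", datetime(2009, 7, 29)),
    ("fileB", datetime(2009, 7, 30)),
    ("fileC", datetime(2009, 6, 1)),
]
print(hottest_files(log, days=7, num=2, now=reference))
```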

As always, downloads are possible from:
http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/

15 July 2009

Storage workshop and STEP report

The storage workshop writeup is available! Apart from notes from the workshop, it also contains Brian's storage STEP report. Follow the link and read all about it. Comments are welcome.

03 July 2009

GridPP Storage workshop

The GridPP storage workshop was a success. Techies from the Tier 2s got together to show and tell, to discuss issues, and to hear about new stuff. We also had speakers from Tier 1 talking about IPMI and verifying disk arrays, and from the National Grid Service talking about procurement and SANs.
We talked about STEP (storage) experiences, and had a dense programme with a mix of content for both newbies and oldbies, and we gazed into the crystal ball to see what's coming next.
All presentations will be available on the web on the hepsysman website shortly (if they aren't already). There will also be a writeup of the workshop.

24 June 2009

GridPP DPM toolkit v2.3.9 released

Ladies and gentlemen,

I am proud to announce the release of the gridpp-dpm-tools package,
version 2.3.9.
It is available at the usual place:
http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/gridpp-dpm-tools-2.3.9-6.noarch.rpm

This version is an extra-special "GridPP Milestone" release, in that
it includes a mechanism for writing user-level storage consumption
data to a database, so that you can make nice graphs of them later.
(The graphmaking functionality exists as scripts now, and will be
released as a modification to the dpm monitoring rpm as soon as I make
the necessary changes + package it.)
You can enable user-level accounting on your DPM by following the
instructions just newly added to the GridPP wiki for the toolkit. (
http://www.gridpp.ac.uk/wiki/DPM-admin-tools#Installation )

Comments, bug reports, &c all (happily?) accepted. In particular, if
the user-level accounting doesn't work for you, I'd like to know about
it, since it's been happy here at Glasgow for the last couple of days.

23 June 2009

Workshop agenda now available

As the title indicates, the agenda for the GridPP storage workshop is now uploaded to the web site, along with the agenda for hepsysman.

15 June 2009

Storage workshop planning

With now less than 0.05 years to go before the storage workshop, the agenda and other planning is continuing apace.
We would like to ask sites to give site reports (currently 20 mins each, incl. Qs) about their (own) storage infrastructure: we'd like to hear about their storage setup (as opposed to computing and other irrelevant stuff :-P) as well as their STEP experiences. This is partly so we can discuss the implications, but also for the benefit of folks who will be attending the storage workshop only. We will get feedback from ATLAS on STEP, ie the users' perspective.
Brian will present our experiences with BeStMan and Hadoop; there will be an introduction to the storage schema, to the DPM toolkit, to SRM testing, to user level accounting, and from the Tier 1 a talk on disk arrays scheduling, and hopefully room for discussion. So lots of things to look forward to!

08 June 2009

GridPP DPM toolkit v2.3.6 released

This is a bug-fix release for 2.3.5, which had some annoying whitespace inconsistencies introduced into the dpm-sql* functions. Thanks to Stephen Childs for noticing them.

(The direct link for download is:
http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/gridpp-dpm-tools-2.3.6-5.noarch.rpm
and the documentation is still at
https://www.gridpp.ac.uk/wiki/DPM-admin-tools
and has been slightly updated for this release.)

02 June 2009

SRM protocol status

The Grid Storage Management working group (GSM-WG) in OGF exists to standardise the SRM protocol. Why standardise? We need this to ensure the process stays open, and can benefit other communities than WLCG.
The SRM document is GFD.129, and we now have an experiences document available for public comments. You are invited to read this document and submit your comments - thanks!
You can even do so anonymously!

01 June 2009

Summary of DESY workshop

Getting the SRM implementers back together was very useful, and long overdue. We agreed of course to not change anything :-)

We needed to review how WLCG clients make use of the protocol; there are cases where they do not make the most efficient use of the protocol, thus causing a high load on the server. Is the estimated wait time used properly?
Differences between implementations may need documenting, e.g. whether an implementation supports "hard" pinning.
We reviewed the implementations' support for areas of the protocol, whether it was fully or partially supported (or not at all), to find a "core" which MUST be universally supported, and whether the implementers thought the feature desirable, given their specialist knowledge of the underlying storage system.
Security and the use of proxies were discussed.

There was one person who was involved with SNIA, and users from WLCG.
This is the summary, for the full report attend the next GridPP storage meeting.

28 April 2009

GridPP DPM toolkit v2.3.5 released

The inaugural release of the DPM toolkit under my aegis has just happened. This release contains some bug fixes (I've attempted to improve the intelligence of the SQL-based tools when trying to acquire the right username/password from configuration tools), and is deliberately missing dpm-listspaces as this is now provided by DPM itself.

This is also my first try building RPMs in OSX, so can people tell me if this is horribly broken for them? :)

(The direct link for download is:
http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/gridpp-dpm-tools-2.3.5-5.noarch.rpm
and the documentation is still at
https://www.gridpp.ac.uk/wiki/DPM-admin-tools
and has been slightly updated for this release.)

27 April 2009

Storage calendar

It's that time of year and I'm writing reports again. It shows that Greig has left: the number of blog entries has dropped dramatically since the end of March... yes, I am still trying to persuade the rest of the group to blog about the storage work they're doing. Just because it's quiet doesn't mean they're not beavering away.

At the last storage meeting we had a look at the coming storage meetings - not our own but the ones outside GridPP. There were storage talks at ISGC and CHEP, and we looked at some of those. The next pre-GDB or GDB is supposed to be about storage, although the agenda was a bit bare last I looked. There will be a workshop at DESY focusing on WLCG's usage of SRM, with the developers from both sides, so to speak. Preparations are ongoing for the next OGF - mainly documents that need writing; we still need an "experiences" document describing interoperation issues at the API level. There's a HEPiX coming up (agenda) in Sweden - usually we have an interest in the filesystem part as well as the site management. Then there is a storage meeting 2-3 July at RAL, following hepsysman on 0-1 July.

26 March 2009

More on CHEP

Right I meant to write more about stuff that's going on here but the network is somewhat unreliable (authentication times out and reauthentication is not always possible). Anyway, I am making copious notes and will be making a full report at the next storage meeting - Wednesday 8 April.

If I were to summarise the workshop from a high-level data storage/management perspective, I'd say it's about stability, scaling/performance, data access (specifically xrootd and http), long term support, catalogue synchronisation, interoperation, information systems, authorisation and ACLs, testing, configuration, complexity vs capabilities.

More details in the next meeting(s).

22 March 2009

WLCG workshop part I

Lots of presentations and talks at the WLCG workshop. As usual much of the work is done in the coffee breaks.
From the storage perspective, there was talk about "SRM changes" which was news to me (experiments require (a) stability, and (b) change, you see). Upon closer investigation, it turns out to be about implementing the rest of the SRM MoU. One outstanding question is how these changes are implemented without impacting users (in a bad way).
Fair bit of talk about xrootd support. xrootd is considered a Good Thing(tm), but the DPM implementation is rather old (2 years). It is possible it can benefit from the new CASTOR implementation for 2.1.8.
Some talk about SRM performance. The dCache folks as usual have good suggestions; Gerd from NDGF suggests using SSL instead of GSI. I believe srmPrepareToGet should be synchronous when files are on disk; this should lead to a large performance increase. Talking to other data management people, we believe the clients should do the Right Thing(tm), so no changes required. Of course the server should be free to treat any request asynchronously if it feels it needs to do this, e.g. to manage load.
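To make the synchronous/asynchronous point concrete, here is a rough sketch of what a well-behaved client could look like. It is not real SRM client code: Reply, fetch_turl and the prepare/status callables are names made up for illustration, standing in for the prepare-to-get request and its status poll. The idea is only that a file already on disk comes back straight away, and otherwise the client sleeps for the wait time the server suggests instead of hammering it.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Reply:
    """Hypothetical stand-in for a server reply; not the real SRM message."""
    turl: Optional[str]          # transfer URL, set once the file is ready
    estimated_wait: float = 0.0  # seconds the server suggests waiting


def fetch_turl(prepare: Callable[[], Reply],
               status: Callable[[], Reply],
               sleep: Callable[[float], None],
               max_polls: int = 10) -> str:
    """Issue one request, then poll no faster than the server asks.

    prepare and status are assumed callables wrapping the request and
    the status check; sleep is injected so the loop can be tested.
    """
    reply = prepare()
    polls = 0
    # If the file was on disk the server answered synchronously and
    # this loop is skipped entirely: no polling, no extra load.
    while reply.turl is None:
        if polls >= max_polls:
            raise TimeoutError("file not ready after %d polls" % max_polls)
        # Honour the estimated wait time, with a one second floor.
        sleep(max(reply.estimated_wait, 1.0))
        reply = status()
        polls += 1
    return reply.turl
```

In real use sleep would simply be time.sleep; passing it in just makes the behaviour easy to check.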
Talked to Brian Bockelman from U Nebraska; they have good experiences with (recent versions of) Hadoop, using BeStMan as the SRM interface.
More later...

08 March 2009

GridPP DPM toolkit v2.3.0 released

I've added a new command line tool to the DPM toolkit: dpm-delreplica. This is just a wrapper round the dpm_delreplica call in the python API and does exactly what it says on the tin. It arose after the guys at Oxford noted that there wasn't an easy way using existing tools to delete a replica of a file - it was either all or nothing.

One thing to note is that the tool will let you delete the last replica of a file, which then leaves a dangling entry in the DPM namespace that you can successfully do e.g. dpns-ls on, but cannot actually retrieve. As with all of these tools, I try to make each one as simple and self contained as possible (the Unix way) so I've not added any special checking to make sure that a replica isn't the last one. You have been warned.
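For anyone who does want a safety net, the check is easy to put in a wrapper of your own. The sketch below is not part of the toolkit and makes no real DPM calls: delete_replica_safely is a made-up name, the list of replicas is assumed to have been looked up already by the caller, and delete_fn is a placeholder for whatever actually performs the deletion (for example a call into the python API).

```python
def delete_replica_safely(replicas, target, delete_fn):
    """Delete one replica, refusing if it is the only one left.

    replicas  -- list of replica names for a single file (assumed to be
                 supplied by the caller; how it is obtained is not shown)
    target    -- the replica to remove
    delete_fn -- hypothetical callable that performs the real deletion
    Returns the list of replicas remaining after the deletion.
    """
    if target not in replicas:
        raise ValueError("%s is not a replica of this file" % target)
    if len(replicas) == 1:
        # Deleting it would leave a dangling namespace entry.
        raise RuntimeError("refusing to delete the last replica: %s" % target)
    delete_fn(target)
    return [r for r in replicas if r != target]
```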

The tool has been tested in a couple of places and seems to work fine. As always, feedback is welcome.

Cheers,
Greig

13 February 2009

Gie's a job!

[Or for those who aren't Scottish: "Give me a job!" ;) ]

I know a few people read this blog so it seems like a good place for some advertising...

A new post-doctoral research associate position has opened up within the particle physics group at Edinburgh University to work on distributed storage management for the GridPP project. This has come about as I am leaving GridPP to move onto other things (physics analysis for LHCb, if you must know) so the project needs a replacement. It will be an exciting time for whoever gets the job since this year we will actually start to see data from the LHC experiments (fingers crossed)! Plus, Edinburgh is a great place to live and work.

All of the details about the position and the online application form can be found here. If you would like any more details please do get in touch.

In addition, the particle physics group in Edinburgh is advertising another job titled "Scientific Programmer". This system administrator position has two main responsibilities: first, the organisation and support of the group's computing needs, and second, assisting in the day-to-day operations of the Edinburgh Tier-2 grid services. All details can be found here. Again, get in touch if you have questions.

Cheers,
Greig

Update: You can get a full listing of the jobs available within the particle physics group at Edinburgh here:

http://www.ph.ed.ac.uk/particle-physics-experiment/positions.html

There's even an advanced fellowship position available if anyone is interested in doing some physics!

29 January 2009

GridPP DPM toolkit v2.2.0 released

Hot on the heels of v2.1.0 comes v2.2.0. This one contains a couple of new tools that have been created to allow sites to have a greater understanding of what is happening with their space tokens. These are:

* dpm-sql-spacetoken-usage

This displays information like:

* dpm-sql-spacetoken-list-files

Unfortunately, I have had to use some SQL to directly query the DB as the API doesn't support this functionality. I'm hoping that the small DB schema change in v1.7.0 of DPM doesn't break these tools too much... These tools were born out of some discussion which has taken place over the past couple of days on our gridpp-storage mailing list (anyone can join!). Thanks to those who tested out the initial releases of the tools.

I have also made another change to dpns-du. I have added another new switch (-s, --summary) which presents a summary of the total size under a target directory rather than the default behaviour, which displays the summary for every sub-dir under the target.
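The difference between the two modes can be shown with a toy version in plain Python. This is not the dpns-du source: the du function and its (path, size) input are made up for illustration, and the per-directory mode is assumed to behave like ordinary du, crediting each file to every directory between it and the target.

```python
import posixpath
from collections import defaultdict


def du(files, target, summary=False):
    """Total up file sizes under target.

    files   -- iterable of (absolute_path, size_in_bytes) pairs
    summary -- if True, return only the grand total for target;
               otherwise return a total for every directory under it
    """
    target = posixpath.normpath(target)
    prefix = target if target.endswith("/") else target + "/"
    totals = defaultdict(int)
    for path, size in files:
        if not path.startswith(prefix):
            continue  # file lives outside the target directory
        if summary:
            totals[target] += size
            continue
        # Walk upwards from the file's directory to the target,
        # adding the size to each directory on the way.
        directory = posixpath.dirname(path)
        while True:
            totals[directory] += size
            if directory == target:
                break
            directory = posixpath.dirname(directory)
    return dict(totals)
```

With summary=True the result has a single entry for the target; without it there is one entry per directory, which is where the long listings come from on a deep namespace.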

You can get it from the usual place (although it will take a day for the yum repodata to update). Again, the release notes on the wiki will be updated at some stage...

23 January 2009

GridPP DPM toolkit v2.1.0 released

I've built a new release of the DPM admin toolkit. This one contains a couple of new tools that have been created by Sam Skipsey. They both present the user with a breakdown of the storage used in the DPM per user/group. One tool uses the DPM python API to do this (and is correspondingly slow) while the other directly talks to the DPM database using the python SQL module. Fingers crossed that this should present the same numbers as are calculated by the GridppDpmMonitor.
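The reason the database route is so much quicker is that the whole breakdown is a single grouped query rather than a walk over every file. The sketch below shows the shape of it only. It is not Sam's code, the table and column names (files, grp, size) are invented and bear no relation to the real DPM schema, and sqlite3 is used purely so the example runs on its own without a DPM database.

```python
import sqlite3


def demo_connection():
    """Build a throwaway in-memory table with an invented schema."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE files (name TEXT, grp TEXT, size INTEGER)")
    conn.executemany(
        "INSERT INTO files VALUES (?, ?, ?)",
        [("f1", "atlas", 100), ("f2", "atlas", 200), ("f3", "lhcb", 50)],
    )
    return conn


def usage_by_group(conn):
    """Return (group, total_bytes) rows, largest consumer first."""
    cursor = conn.execute(
        "SELECT grp, SUM(size) AS total FROM files "
        "GROUP BY grp ORDER BY total DESC"
    )
    return cursor.fetchall()
```

Calling usage_by_group(demo_connection()) gives one row per group; doing the same through an API means one call per file, hence the slowness.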

There is also a new switch for dpns-du which stops directories of zero size being printed to stdout. Yes Winnie, this one's for you.

You can get it from the usual place. Some of the release notes have to be updated for the new tools as I haven't got round to that yet...

05 January 2009

The evolution of storage in 2008

I've been running my WLCG storage version monitoring system for 1 year so I thought now would be a prudent time to have a quick review of the changes in the storage infrastructure over the past year. The above image shows the count of each different version of the SRM2.2 storage middleware that is deployed on the Grid each day. Over the course of the year the number of deployed SRM2.2 endpoints increased steadily from ~100 to >250.

The pie charts below show the breakdown (as of today) for the different versions of DPM, dCache and CASTOR that are running on the Grid. There are also 20 instances of StoRM out there, but StoRM does not appear to return versioning information from an srmPing operation so it's not possible to tell what version is deployed.

DPM clearly dominates in terms of number of running instances. Hopefully CERN doesn't do something crazy like drop support for it! It's interesting to see that there are still many old versions of the software running at sites. Perhaps this is an indication of the success of SRM in that all of these different implementations are still talking to each other.
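For the curious, the daily counting described above boils down to a tally of version strings. The following is only an illustration of that step, not the monitoring code itself: tally_versions and its input format are made up, and the ping results are assumed to have been collected already, with None for endpoints that return no version (as StoRM does).

```python
from collections import Counter


def tally_versions(ping_results):
    """Count endpoints per (implementation, version).

    ping_results -- mapping of endpoint name to a tuple of
                    (implementation, version string or None)
    """
    counts = Counter()
    for implementation, version in ping_results.values():
        # Endpoints that report no version are grouped as "unknown".
        counts[(implementation, version or "unknown")] += 1
    return counts
```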

01 December 2008

RFIO testing at Liverpool

The guys at Liverpool have been doing some very interesting performance testing of different RFIO READAHEAD buffer settings for the RFIO clients on their worker nodes. Using real ATLAS analysis jobs they have seen significant performance improvements when using large (i.e. 128MB) buffer sizes, both in terms of the total amount of data transferred during any job and the CPU efficiency.

http://northgrid-tech.blogspot.com/2008/12/rfio-tuning-for-atlas-analysis-jobs.html