As we come to the arbitrary cycle end of the common major period of time coinciding with the middle of the northern hemisphere winter, I thought I should give an update on how things have been going for me.
I (Dave) currently have relatives in 110 rooms (out of the 748 which ATLAS own).
There are 988 unique children, among them:
723 Dirks
138 Ursulas
35 Gavins
1 Valery
1 Calibration
What is worrying is how many do not have a clone in another room; if that room gets destroyed, then that child is lost. Of the 988 children:
1 has 4 clones
8 have 3 clones
53 have 2 clones
143 have 1 clone
693 are unique to the single room they live in.
These unique children comprise 48,743 files with a total size of 7 TB; these are the numbers at risk of permanent loss. Thankfully, 17,173 files (~9.074 TB) are safely replicated.
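For the curious, a minimal sketch of how such a clone-multiplicity tally can be produced from a mapping of children to the rooms holding them (the catalogue below is entirely made up for illustration; names and rooms are not real):

```python
from collections import Counter

# Hypothetical catalogue: each child (dataset) mapped to the rooms (sites)
# that hold a copy of it. Invented names, purely illustrative.
catalogue = {
    "Dirk-0001": ["room-A", "room-B", "room-C"],
    "Dirk-0002": ["room-A"],
    "Ursula-0001": ["room-B", "room-D"],
    "Valery-0001": ["room-C"],
}

# Number of clones = number of rooms beyond the first copy.
clone_counts = Counter(len(rooms) - 1 for rooms in catalogue.values())

for clones, children in sorted(clone_counts.items()):
    label = "are unique to a single room" if clones == 0 else f"have {clones} clone(s)"
    print(f"{children} {label}")
```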
The newest children were born on 15/11/12, so users are still finding me and/or my children interesting.
20 December 2012
Large rate for transfers in 2012 to the UK
2012 was a good year for FTS transfers within the UK. A brief look shows that the Tier 2s and the RAL Tier 1 ingested 27.66 PB of data (20.23 PB being ATLAS files). This represents 41.77 million successful file transfers (38.94 M files for ATLAS). These figures are for the production FTS servers involved within the WLCG, and so do not include files transferred by test servers or direct user client tools.
As can be seen below, the major amounts of data transfer have been in the last eight months.
Examples of individual sites' FTS rates are shown below:
19 December 2012
Storage/Data management overview from ATLAS Jamboree
I luckily had the opportunity to go to the ATLAS jamboree at CERN in December, a link to which can be seen here:
http://indico.cern.ch/conferenceOtherViews.py?view=standard&confId=196649
Part of my time was spent giving the report on the FTS3 tests we in the UK have been doing for CMS and ATLAS, but other points I picked up which might be of interest (or not!) are that:
- ATLAS plan to reduce the need for space tokens as much as possible.
- Plan to rename all files (over 120 PB!) for the new RUCIO system.
- WebDAV and xrootd being required for T2s.
- Storage on cloud still being looked into.
- Plan for SRM to not be used at disk only sites.
- How to rename files on tape for RUCIO (and how tape families will work in new system) are still under investigation.
- Plan for xrootd and WebDAV usage in read-only mode to start with, but intend to test read/write for both LAN and WAN access (and both full and sparse reads of the files using xrootd).
- Known SL6 xrootd problem for DPM storage systems causing "headaches"!
- ATLAS plan a full dress rehearsal of usage of Federated Access by Xrootd (FAX) for January 21st. It will be interesting to see if we can get any further sites in the UK involved.
- ATLASHOTDISK as a space token should be able to go away "soon" (if site has CVMFS).
It's the season to go "Golly!"
As with every activity when the end of the year is nigh, it is useful to look back. And forward.
It's been a good year for GridPP's storage and data management group (and indeed for GridPP itself), and in a sense it's easy to be a victim of our own success: researchers just expect the infrastructure to be there. For example, some of the public reporting of the Higgs events seemed to gloss over how it was done, and the fact that finding the needle of ~400 Higgs events in a very large haystack was a global effort - no doubt to keep things simple... RAL currently holds about 10 PB of LHC data on tape, and around 6 PB on disk. What we store is not the impressive bit, though - what we move and analyse is much more important. RAL routinely transfers 3 GB/s for a single one of the experiments (usually ATLAS or CMS). QMUL alone reported having processed 24 PB over the past year. So we do "big data."
In addition to providing a large part of WLCG, GridPP is also supporting non-LHC research. The catch, though, is that they usually have to use the same grid middleware. While at first this seems like a hurdle, it is the way to tap into these large computing resources - many research case studies show how it's done.
So "well done" sung to the otherwise relatively unsung heroes and heroines who are keeping the infrastructure running and available. Let's continue to do big data well - one of our challenges for the coming year will be to see how wider research can benefit even more - and maybe how we can get better at telling people about it!
11 December 2012
2nd DPM Community Workshop
The DPM workshop (http://indico.cern.ch/conferenceTimeTable.py?confId=214478#20121203.detailed)
was a very worthwhile meeting and quite well attended in person. It could have done with more people there from the UK - but there were several UK contributions via Vidyo.
On the morning of the first day, Ricardo and Oliver laid out the work so far and the roadmap. It was impressive to see that DMLite has become a reality, with a number of plugins added since the last workshop (and most of this workshop was indeed devoted to DMLite). It was also good to see things like the overloading of disk servers being addressed in the roadmap. We then saw the priorities from other big users, among which it was interesting to see ASGC keen on NFS v4 (which I am not sure we need with xrootd and http) - they are also hosting the next DPM workshop in March, collocated with ISGC.
Sam, in the Admin Toolkit talk, described plans for a rebalancing tool, which should probably use DMLite and liaise with Ricardo's plans in this area.
The Globus Online talk brought up interesting questions about gridftp-only transfers, as did the ATLAS presentation, which talked more explicitly about "ATLAS plans to migrate to a world without SRM" than I have heard before, and asked for both performant gridftp transfers and a du solution for directories. The first is being worked on by the DPM team; the last just needs a decision, and ideally a common solution across storage types. CMS seemed happier with DPM in their presentation than they do in normal operations, and ALICE seemed happier too now that xrootd is performing well on DPM.
The afternoon was devoted to DMLite and its many interfaces - e.g. S3 and HDFS. Can't wait to play with those in the new year…. Martin presented some interesting things on remote I/O performance - certainly xrootd and http offer a much more pleasant direct-access experience than rfio (but only with TTreeCache on, which is not guaranteed for ATLAS users, so we may still be copying to scratch until we can get that switched on by default). We also saw that the WebDAV and xrootd implementations are in good shape (and starting to be used - see for example my FAX (ATLAS xrootd) talk - but I think they still need to be tested more widely in production before we know all the bugs). Oliver presented that DPM too was looking towards a possible post-SRM landscape and ensuring that its protocols were able to work performantly without it.
Tuesday consisted of a lot of demos showing that DMLite was easy to set up, configure and even develop for (if you missed this then some of the same material was covered in the "Webinar" that should be available to view on the DPM webpages). In addition, Ricardo showed how they configure nodes with Puppet, and there was a discussion around using that instead of YAIM. In the end we decided on "Option 3" - that YAIM would be used to generate a Puppet manifest that would be run with a local version of Puppet on the node, so that admins could carry on with YAIM for the time being (but it would allow a transition in the future).
The "Collaboration Discussion" that followed was the first meeting of those partners that have offered to help carry DPM forward post-EMI. It was extremely positive in that there seems enough manpower - with core effort at CERN and strong support from ASGC. It seemed like something workable could be formed with tasks divided effectively so there is no need for anyone to fear anymore.
This will be my last blog of the year (possibly forever, as I dislike blogging), so Happy Holidays Storage Fans!
12 November 2012
Storage news for HEPSYSMAN Nov '12
Today saw a return for me to Lancaster for the latest HEPSYSMAN meeting. The agenda can be found here:
https://indico.cern.ch/conferenceDisplay.py?confId=211206
We started with a productive discussion regarding people's experiences with upgrading from gLite to EMI. The good news was that storage wasn't really mentioned until I brought it up, so it's good that storage problems aren't at the top of people's problem lists. Auto-update of middleware and updating of YUM remains an issue. A large worry seems to be that the dpm-tools appear not to be getting through ETICS into deployment (or this might have been my naive interpretation of a talk). Something to look into anyway...
19 September 2012
WAD
We had an interesting experience with the CASTOR upgrade to 2.1.12: the link between the storage area (SA) and the tape pool disappeared in the upgrade. In GLUE speak, the SA is a storage space of sorts, which may be shared between collaborators - we use it to publish dynamic usage data.
In CASTOR, we have used the "service class" as the SA; there is then a many-to-many link to disk pools and tape pools, something like this:
The dynamic data of each pool then gets shared accordingly between all the SvcClasses, which is (was) the Right Thing™. Now that the second association link has gone away, we're wondering how to keep publishing data correctly in the short term - and the upgrade got postponed by a week amidst much scratching of heads.
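As an illustration only (the real CASTOR information provider is more involved, the names below are invented, and dividing each pool's figures evenly between the service classes that reference it is just one plausible sharing policy), the apportioning might look something like this:

```python
# Hypothetical many-to-many mapping from service classes (SAs) to disk pools.
svcclass_to_pools = {
    "atlasTape": ["diskPool1", "diskPool2"],
    "atlasDisk": ["diskPool2", "diskPool3"],
}

# Dynamic usage per pool (bytes), as the dynamic information provider sees it.
pool_used = {"diskPool1": 40e12, "diskPool2": 100e12, "diskPool3": 10e12}

# Count how many service classes reference each pool.
pool_share_count = {}
for pools in svcclass_to_pools.values():
    for p in pools:
        pool_share_count[p] = pool_share_count.get(p, 0) + 1

# Share each pool's usage between the service classes that use it,
# then publish one figure per SA.
sa_used = {
    sc: sum(pool_used[p] / pool_share_count[p] for p in pools)
    for sc, pools in svcclass_to_pools.items()
}
print(sa_used)
```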
The information provider may just have enough information (in its config files) to restore the link, but it'd be a bit hairy to code - we're still working on that - and it may just be better to rethink what the SA should be (which we will). We also tried a supermassive query which examined disk copies of files from tape pools to see which disk pools they were on, and then linked those with service classes - which was quite enlightening, as we discovered those disk copies were all over the place, not just where they were supposed to be...
In the interest of getting it working, we decided to just remember and adjust which data publishes where - meanwhile, we shall then rethink what the SA should be in the future.
14 September 2012
Large Read/Write rates allow T2 storage to fill quickly.
In the last six months, the UK T2s have ingested on average nearly 590 MB/s (peaking at 2.0 GB/s). As a source for transfers, the T2s have averaged 230 MB/s with a peak of 1 GB/s. Purely for ATLAS, the write rate has averaged 473 MB/s.
That equates to a volume of 6.58 PB of data over the last 23 weeks. Now ATLAS "only" have access to 7.9 PB of storage capacity, so the time it takes ATLAS to fill its available storage is about 28 weeks. N.B. this week the average rate has been double that, at 953 MB/s, so it would only take ~3 months to fill from empty. Also remember that this fill rate assumes that files are only transferred into the storage from across the WAN. Since many files are actually created at the T2s (either MC production or derived data), the fill rate will be much quicker.
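The back-of-the-envelope arithmetic behind those fill-time estimates, for anyone who wants to rerun it with their own site's numbers:

```python
# Rough fill-time estimate from the figures quoted above.
ingested_pb = 6.58        # PB written over the observation window
weeks = 23                # length of the window
capacity_pb = 7.9         # ATLAS-accessible T2 capacity

fill_weeks = capacity_pb / (ingested_pb / weeks)
print(f"~{fill_weeks:.0f} weeks to fill from empty at the average rate")

# This week's doubled rate (953 MB/s), for comparison.
pb_per_week = 953e6 * 86400 * 7 / 1e15
print(f"~{capacity_pb / pb_per_week:.0f} weeks (~3 months) at 953 MB/s")
```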
11 September 2012
The net for UK to ROW ATLAS sites.
So I was trying to find out which SEs were close (in a routing sense) to other SEs for ATLAS from RAL. I ended up producing a network diagram for the net. Not surprisingly, SEs within a European country all seem to follow a similar route. However, traffic to the US seems to take multiple different routes.
10 August 2012
FTS error morsel of analysis.
Metric numbers never really tell the true story.
I was looking into FTS transfer failures for the ATLAS experiment (transfers involving ATLAS; I wasn't doing the analysis for ATLAS) to see how well we are transferring data within the UK.
So the overall figure is 93.22%. With two retries allowed, on average 99.993% of transfers complete without having to worry about retrying from within the VO's framework. Sounds good until you find out that last month ATLAS transferred nearly 6.5M files to and from UK sites, so ~45k files would have to be retried from the VO framework. In total there were 6,417,312 unique successful transfers (2.4M solely within the UK) and 467,001 failures associated with them.
Good Write/Read/Delete rates for ATLAS at UK T1
It has been a busy period for the ATLASSCRATCHDISK space token at the UK Tier 1.
This was in response to recovering files for a Tier 2 which had lost a disk server and wanted replacement files.
We wrote ~35 TB/day for 2 days, and deleted 80 TB (240,000 files) in a day.
Data was copied out to the T2. Purely from RAL the rate looked like:
The following shows the transfers to the T2 in question (including other Source Sites but being dominated by transfers from the RAL Tier 1).
02 August 2012
DPM-XROOTD and Federated redirection: volume 1
Historically, one of the weaknesses in DPM as an SE, from the perspective of some of the LHC VOs, was its lack of xrootd support. (While, technically, DPM has supported "xrootd" for some time, the release of xrootd involved has always lagged significantly behind the curve, meaning that DPMs supporting the protocol often couldn't actually provide the functionality expected of them.)
Partly as a result of the recent enthusiasm for federated storage (a concept whereby storage endpoints become part of a redirection hierarchy, so that requests against files not present locally can be passed up the chain, until a (hopefully close) endpoint with the file can be found to serve the request), and the particular enthusiasm of ATLAS and CMS (thanks to their experiments in the US) for xrootd as the mechanism for this, DPM's xrootd support has recently improved significantly.
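A toy sketch of that redirection idea (nothing to do with the actual xrootd implementation; the class, names and hierarchy are invented purely to illustrate the concept of passing a request up the chain):

```python
class Redirector:
    """Toy model of a federation node: serve locally if possible, else ask the parent."""
    def __init__(self, name, local_files, parent=None):
        self.name = name
        self.local_files = set(local_files)
        self.parent = parent

    def locate(self, filename):
        if filename in self.local_files:
            return self.name                      # served from this endpoint
        if self.parent is not None:
            return self.parent.locate(filename)   # pass the request up the chain
        raise FileNotFoundError(filename)

# Example hierarchy: a site SE below a regional redirector with a wider view.
regional = Redirector("regional-redirector", {"data/file2.root"})
site = Redirector("site-SE", {"data/file1.root"}, parent=regional)

print(site.locate("data/file2.root"))  # -> regional-redirector
```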
At present, the package is still beta (in particular, the YAIM module is not released yet, so hand configuration is more reliable), but it's been tested on the development SE here at Glasgow (svr025), with some success.
The current release of the dpm-xrootd package still needs to be obtained from an unusual location (instructions here: https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Xroot/Setup ), but has the advantage that it will work with DPM 1.8.2 disk nodes, as long as the head node is DPM 1.8.3.
As 1.8.3 is EMI-only, this effectively allows you to test the protocol with gLite disk nodes for the first time.
I've recently set this release up on the production SE at Glasgow (svr018, which is precisely EMI 1.8.3 head, with a mix of gLite and EMI disk nodes). Some thoughts follow:
1) it is not safe to install dpm-xrootd from the marked repository if your glite-SE_dpm_disk release is less than 1.8.2. One of the dependencies of the package is the 1.8.2 release of dpm-lib, but without the rest of the packages from 1.8.2 being installed, this will simply break gridftp and rfiod.
Update your node to 1.8.2 and then pull in dpm-xrootd.
2) the configuration described in the link above is identical for all disk pool nodes. This means it is much less painful than it might be - test with one disk node then mirror across the others.
3) It appears that, for some reason, the dpm-xrootd package does not like SL5.5 and glite-SE_dpm_disk - several of our disk pools are on this SL release, and the xrootd service refused to start on them. Updating (yum update) to SL5.7 fixes this, by means currently not fully understood.
4) the provision of a certificate with a valid ATLAS VOMS role for the LFC lookup is provided as an exercise for the reader. This is a requirement of the xrootd redirection framework, not dpm specifically, and I hope it will go away soon, since it's extremely silly.
With those caveats in mind, things seem to work fairly well, although this is all in the testing phase for ATLAS (and Europe) for the moment.
06 June 2012
Lessons learned?
I have a file which is wobbly. It is a good file, but it wobbled, and the letters got in the wrong places.
It is a file I was reading at home, on my home PC, from my home hard drive (no cloud stuff) and suddenly discovered had a lot of NUL characters in it, 3062 to be precise. Ewan suggested in the storage meeting that this might have been caused by an XFS crash - the PC is on a UPS but occasionally slightly experimental X clients lock it up and on a few occasions "security" features I foolishly left running had prevented me from logging in from my laptop and shutting it down cleanly.
Anyway, this file is part of a directory I synchronise to work, and sure enough, the file on my work PC was equally corrupted. And the corrupted file was backed up. Backups are not so helpful when the file you back up is corrupted... (the timestamp said December 2001). The file had been migrated from another, older hard drive.
Now it so happens that this file was not precious - it was an RFC, so I could just get a fresh one. Comparing them, there seemed to be more changes... (and the NULs confused some diff tools) - the file seemed to have changed mysteriously in about eight places; in one place 749 bytes were missing.
Back in the day, there were computer viruses which would alter files on your hard drive... I don't believe that's what happened here - I invoke Hanlon's razor, or a variant of it: "Never attribute to malice that which can be adequately explained by 'stuff happens'."
It may be worth investigating my backups from back then - I was still using optical writeable CDs for my home PC backups back then, and I still have them - whether I can read them is another question. But it seems I need to write a utility to background checksum my files for me.
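Something like the following minimal sketch would do as a starting point for that background checksumming utility (Python; the database path is a made-up choice): record a checksum per file, then on later runs flag anything whose content changed while its modification time did not - the signature of silent corruption rather than a deliberate edit.

```python
import hashlib, json, os, sys

DB = os.path.expanduser("~/.file_checksums.json")  # hypothetical location for the record

def sha256(path, chunk=1 << 20):
    h = hashlib.sha256()
    with open(path, "rb") as f:
        for block in iter(lambda: f.read(chunk), b""):
            h.update(block)
    return h.hexdigest()

def scan(top):
    old = json.load(open(DB)) if os.path.exists(DB) else {}
    new = {}
    for root, _, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            new[path] = {"sum": sha256(path), "mtime": os.path.getmtime(path)}
            prev = old.get(path)
            # Same mtime but different checksum => silent corruption candidate.
            if prev and prev["mtime"] == new[path]["mtime"] and prev["sum"] != new[path]["sum"]:
                print("POSSIBLE CORRUPTION:", path)
    json.dump(new, open(DB, "w"))

if __name__ == "__main__":
    scan(sys.argv[1] if len(sys.argv) > 1 else ".")
```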
For my precious data at home that I don't copy to work machines (when it's not work related), I have a stronger system, based on unison and multiple mutually independent backups, but the question is still whether I would have noticed the one change that should not propagate to the backup? Worth a test...
29 May 2012
CHEP Digested
Apologies for not blogging live from CHEP / WLCG meeting but it was busy for me with talks and splinter meetings. So please find below my somewhat jet-lagged digest of the week:
WLCG meeting:
News (to me) from the first day was that there will be a new Tier 0, in Hungary (!). The current plan is to build a beefy network and split jobs and storage without care; given the not irrelevant expected latency, that didn't seem like the most obviously best plan. Sunday was somewhat disappointing: little was planned for the TEG session. The chairs were explicitly told that no talk was expected of them - only to find on the day that it was - and so the session ended up rather regurgitating some of the conclusions and reiterating some of the same discussions. Apparently the TEGs are over now - despite their apparent zombie state, I hope that we can make something useful building on what was discussed, outside any process, rather than waiting for what may or may not be officially formed in their wake.
On a non-storage note, I did ask one clarification of Romain on glexec: the requirement is for sites to provide fine-grained traceability, not necessarily to install glexec, though the group did not know of any other current way to satisfy the requirement. There was also some discussion on whether the requirement amounted to requiring identity switching, though it seemed fairly clear that it need not. If one can think of another way to satisfy the real requirement then one can use it.
CHEP day 1:
Rene Brun gave a kind of testimonial speech - which was a highlight of the week (because he is a legend). Later in the day he asked a question in my talk on ATLAS ROOT I/O - along the lines that he had previously seen faster rates in reading ATLAS files with pure ROOT, so why was the ATLAS software so much slower (the reasons are transient-to-persistent conversion as well as some reconstruction of objects). Afterwards he came up to me and said he was "very happy" that we were looking at ROOT I/O (which made my week really). Other than my talk (which otherwise went well enough), the "Event Processing" session saw a description from CMS of their plans to make their framework properly parallel. A complete rewrite like this is possibly a better approach than the current ATLAS incremental attempts (as also described in this session by Peter V G) - though it's all somewhat pointless unless big, currently sequential (and possibly parallelizable) parts like tracking are addressed.
CHEP day 2:
Sam blogged a bit about the plenaries. The parallel sessions got off to a good start (;-)) with my GridPP talk on analysing I/O bottlenecks: the most useful comment was perhaps that by Dirk on I/O testing at CERN (see the splinter meeting comment below). There was then a talk regarding load balancing for dCache, which seemed a fairly complicated algorithm but, if it works, perhaps worth adopting in DPM. Then a talk on xrootd from (of course) Brian B, describing both ATLAS and CMS work. To be honest I found the use cases less compelling than I have done previously, but there is still lots of good work on understanding these and it is worth supporting future development (see again splinter meetings below). The poster session was, as Sam indicated, excellent - though there were way, way too many posters to mention. The work on DPM, both in DMLite and WebDAV, is very promising, but the proof will be in the production pudding that we are testing in the UK (see also my and Sam's CHEP paper of course).
Back in the parallel sessions, the HammerCloud update showed some interesting new features and correlations between outages, aimed at reducing the functional testing load. CMS are now using HC properly for their testing.
CHEP day 3:
In terms of the ROOT plenary talk - I would add to Sam's comments that the asynchronous prefetching does need some more work (we have tested it), but at least it is in there (see also the ROOT I/O splinter meeting comments below). I also noted that they offer different compression schemes now, which I haven't explored. The data preservation talk is indeed interesting; Sam gave the link to the report. Of the two models of ensuring one can still run on the data - maintaining an old OS environment or validating a new one - I find the latter most interesting, but I really wonder whether experiments will preserve manpower on old experiments to check and keep up such a validation.
Andreas Peters's talk in the next session was the plenary most relevant to storage. As Sam suggested, it was indeed excellently wide-ranging and not too biased. Some messages: storage is still hard and getting harder, with management, tuning and performance issues; LHC storage is large in terms of volume but not number of objects; storage interfaces are split between complex/rich ones such as POSIX and reduced ones such as S3; and we need to be flexible, to both profit from standards and community projects but not be tied to any particular technology.
CHEP day 4:
The morning I mostly spent in splinter meetings on Data Federations and ROOT I/O (see below). In the afternoon there was a talk from Jeff Templon on the NIKHEF tests with WebDAV and proxy caches, which is independent of any middleware implementation - interesting stuff, though somewhat of a prototype, and it should be integrated with other work. There was also some work in Italy on HTTP access which needs further testing but shows such things are possible with StoRM. After coffee and many, many more posters (!), Paul M showed that dCache is pluggable beyond pluggable (including potentially interesting work with HDFS (and Hadoop for log processing)). He also kept reassuring us that it will be supported in the future.
Some Splinter Meetings / Discussions:
- Possibilities for using DESY grid lab for controlled DPM tests.
- Interest in testing dCache using similar infrastructure as we presented for DPM.
- The ATLAS xrootd federation is pushing into the EU, with some redirectors installed at CERN and some sites in Germany and (we volunteered) the UK (including testing the newly emerging DPM xrootd server).
- DPM support. Certainly there will be some drop in CERN support post-EMI. There are lots more discussions to be had, but it seemed likely that there would be some decent level of support from CERN, provided some could also be found from the community of regions/users.
- Other DPM news: Chris volunteered for DMLite on Lustre; Sam and I for both the xrootd and WebDAV stuff.
- ROOT I/O - agreement to allow TTreeCache to be set in the environment (see the sketch after this list). More discussion on basket optimisation (some requirements from CMS make it more complicated). Interest in having monitoring internal to ROOT, switched on in .rootrc: a first pass at a list of variables to be collected was constructed.
- I/O benchmarking - Dirk at CERN has a suite that both provides a mechanism for submitting tests and includes some tests that are similar to the ones we are using (but not identical). We will form a working group to standardise the tests and share tools.
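For reference, a minimal PyROOT sketch of what turning TTreeCache on by hand looks like today (the file URL and tree name are placeholders, not real samples); the point of the agreement above is that an environment switch would save every user from having to remember to do this themselves:

```python
import ROOT

# Placeholder file/tree names - substitute a real file and tree.
f = ROOT.TFile.Open("root://some.se.example/atlas/sample.root")
tree = f.Get("CollectionTree")

tree.SetCacheSize(30 * 1024 * 1024)   # 30 MB read cache
tree.AddBranchToCache("*", True)      # cache all branches (and sub-branches)
ROOT.TTreeCache.SetLearnEntries(100)  # entries used to learn the access pattern

for i in range(tree.GetEntries()):
    tree.GetEntry(i)                  # reads now go through the cache
```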
24 May 2012
Day 3 of CHEP - Data Plenaries
The third day of CHEP is always a half-day, with space in the afternoon for the tours.
With that in mind, there were only 6 plenary talks to attend, but 4 of those were of relevance to Storage.
First up, Fons Rademakers gave the ROOT overview talk, pointing to all the other ROOT development talks distributed across the CHEP schedule. In ROOT's I/O system, there are many changes planned, some of which reflect the need for more parallelism in the workflow for the experiments. Hence, parallel merges are being improved (removing some locking issues that still remained), and ROOT is moving to a new threading model where there can be dedicated "IO helper threads" as part of the process space. Hopefully, this will even out IO load for ROOT-based analysis and improve performance.
Another improvement aimed at performance is the addition of asynchronous prefetching to the IO pipeline, which should reduce latencies for streamed data - while I'm still on the fence about I/O streaming vs staging, prefetching is another "load smearing" technique which might improve the seekiness on target disk volumes enough to make me happy.
The next interesting talk was this year's iteration of the always interesting (and a tiny bit Cassandra-ish) DPHEP talk on Data Preservation. There was far too much interesting stuff in this talk to summarise - I instead encourage the interested to read the latest report from the DPHEP group, out only a few days ago, at : http://arxiv.org/abs/1205.4667
In the second session, two more interesting talks with storage relevance followed.
First, Jacek Becla gave an interesting and wide-ranging talk on analysis with very large datasets, discussing the scaling problems of manipulating that much data (beginning with the statement "Storing petabytes is easy. It is what you do with them that matters"). One of the most interesting notes was that indexes on large datasets can be worse for performance, once you get above a critical size - the time and I/O needed to update the indices impairs total performance more than the gain; and the inherently random access that seeking from an index produces on the storage system is very bad for throughput with a sufficiently large file to seek in. Even SSDs don't totally remove the malus from the extremely high seeks that Jacek shows.
Second, Andreas Joachim Peters gave a talk on the Past and Future of very large filesystems, which was actually a good overview, and avoided promoting EOS too much! Andreas made a good case for non-POSIX filesystems for archives, and for taking an agile approach to filesystem selection.
23 May 2012
Some notes from CHEP - day 2
So, I'm sure that when Wahid writes his blog entries from CHEP, you'll hear about lots of other interesting talks, so as before I'm just going to cover the pieces I found interesting.
The plenaries today started with a review of the analysis techniques employed by the various experiments by Markus Klute, emphasising the large data volumes required for good statistical analyses. More interesting perhaps for its comparison was Ian Fisk's talk covering the new computing models in development, in the context of LHCONE. Ian's counter to the "why can Netflix do data streaming when we can't?" question was:
(that is, the major difference between us and everyone with a CDN is that we have 3 orders of magnitude more data in a single replica - it's much more expensive to replicate 20PB across 100 locations than 12TB!).
The very best talk of the day was Oxana Smirnova's talk on the future role of the grid. Oxana expressed the most important (and most ignored within WLCG) lesson: if you make a system that is designed for a clique, then only that clique will care. In the context of the EGI/WLCG grid, this is particularly important due to the historical tendency of developers to invent incompatible "standards" [the countless different transfer protocols for storage, the various job submission languages etc] rather than all working together to support a common one (which may already exist). This is why the now-solid HTTP(s)/WebDAV support in the EMI Data Management tools is so important (and why the developing NFS4.1/pNFS functionality is equally so): no-one outside of HEP cares about xrootd, but everyone can use HTTP. I really do suggest that everyone enable HTTP as a transport mechanism on their DPM or dCache instance if they're up to date (DPM will be moving, in the next few revisions, to using HTTP as the default internal transport in any case).
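To illustrate why this matters: once an SE speaks plain HTTPS/WebDAV, fetching a replica needs nothing HEP-specific at all. A hedged sketch with a made-up endpoint and paths, assuming a standard (unencrypted) grid user certificate and the usual CA directory found on grid hosts:

```python
import requests

# Hypothetical DPM WebDAV endpoint and file path - substitute your own SE.
url = "https://se01.example.ac.uk/dpm/example.ac.uk/home/atlas/user/somefile.root"

resp = requests.get(
    url,
    # Client certificate and key (the key must be unencrypted for requests to use it).
    cert=("/home/user/.globus/usercert.pem", "/home/user/.globus/userkey.pem"),
    verify="/etc/grid-security/certificates",  # directory of trusted CAs
)
resp.raise_for_status()
with open("somefile.root", "wb") as out:
    out.write(resp.content)
```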
A lot of the remaining plenary time was spent in talking about how the future will be different to the past (in the context of the changing pressures on technology), but little was new to anyone who follows the industry. One interesting tid-bit from the Cloudera talk was the news that HDFS can now support High Availability via multiple metadata servers, which gives potentially higher performance for metadata operations as well.
Out of the plenary tracks, the most interesting in the session I was in was Jakob Blomer's update on CVMFS. We're now deep in the 2.1.x releases, which have much better locking behaviour on the clients; the big improvements on the CVMFS server are coming in the next minor version, and include the transition from redirfs (unsupported, no kernels above SL5) to aufs (supported in all modern kernels) for the overlay filesystem. This also gives a small performance boost to the publishing process when you push a new release into the repository.
Of the posters, there were several of interest - the UK was, of course, well represented, and Mark's IPv6 poster, Alessandra's network improvements poster, Chris's Lustre poster and my CVMFS for local VOs poster all got some attention. In the wider-ranging set of posters, the Data Management group at CERN were well represented - the poster on HTTP/WebDAV for federations got some attention (it does what xrootd can do, but with an actual protocol that the universe cares about, and the implementation that was worked on for the poster even supports geographical selection of the closest replica by IP), as did Ricardo's DPM status presentation (which, amongst other things, showcased the new HDFS backend for DMLite). With several hundred posters and only an hour to look at them, it was hard to pick out the other interesting examples quickly, but some titles of interest included "Data transfer test with a 100Gb network" (spoiler: it works!), and a flotilla of "experiment infrastructure" posters of which the best *title* goes to "No file left behind: monitoring transfer latencies in PhEDEx".
22 May 2012
CHEP Day 1 - some notes
So, Wahid and I are amongst the many people at CHEP in New York this week, so it behooves us to give some updates on what's going on.
The conference proper started with a long series of plenary talks; the usual Welcome speech, the traditional Keynote on HEP and update on the LHC experience (basically: we're still hopeful we'll get enough data for an independent Higgs discovery from ATLAS and CMS; but there's a lot more interesting stuff that's going on that's not Higgs related - more constraints on the various free constants in the Standard Model, some additional particle discoveries).
The first "CHEP" talk in the plenary session was given by Rene Brun, who used his privilege of being the guy up for retirement to control the podium for twice his allocated time; luckily, he used the time to give an insightful discussion of the historical changes that have occurred in the HEP computing models over the past decades (driven by increased data, the shifting of computational power, and the switch from Fortran (etc) to C++), and some thoughts on what would need to form the basis of the future computing models. Rene seemed keen on a pull model for data, with distributed parallelism (probably not at the event level) - this seems to be much more friendly to a MapReduce style implementation than the current methods floating around.
There was also a talk by the Dell rep, Forrest Norrod, on the technology situation. There was little here that would surprise anyone; CPU manufacturers are finding that even more cores doesn't work because of memory bandwidth and chip real-estate issues, so expect more on-die specialised cores (GPUs etc) or even FPGAs. The most interesting bit was the assertion that Dell (like HP before them) are looking at providing an ARM based compute node for data centres. After a lunch that we had to buy ourselves, the parallel sessions started.
The Distributed Processing track began with the usual traditional talks - Pablo Saiz gave an overview of AliEn for ALICE, they're still heavily based on xrootd and bittorrent for data movement, which sets them apart from the other experiments (although, of course, in the US, xrootd is closer to being standard); Vincent Garonne gave an update on ATLAS data management, including an exciting look at Rucio, the next-gen DQ2 implementation; the UK's own Stuart Wakefield gave the CMS Workload Management talk, of which the most relevant data management implication was that CMS are moving from direct streaming of job output to a remote SE (which is horribly inefficient, potentially, as there's no restriction on the destination SE's distance from the site where the job runs!) to an ATLAS-style (although Stuart didn't use that description ;) ) managed data staging process where the job output is cached on the local SE then transferred to its final destination out-of-band by FTS.
Philippe Charpentier's LHCb data talk was interesting primarily because of the discussion on "common standards" that it provoked in the questions - LHCb are considering popularity-based replica management, but they currently use their own statistics, rather than the CERN Experiment Support group's popularity service.
Speaking of Experiment Support, the final talk before coffee saw Maria Girone give the talk on the aforementioned Common Solutions strategy, which includes HammerCloud and the Experiment Dashboards as well as the Popularity Service - the most comment came from the final slides, however, where Maria discussed the potential for a Common Analysis Service framework (wrapping, say, a PanDA server so that the generic interfaces allow CMS or ATLAS to use it). There was some slightly pointed comment from LHCb that this was lovely, but they reminded people of the original "shared infrastructure" that LHCb/ATLAS used, until it just became LHCb's...
After that: coffee, where Rob Fay and I were pounced on by a rep from NexSan, who still have the most terrifyingly dense storage I've seen (60 drives per 4U, in slide-out drive "drawers"), and were keen to emphasise their green credentials. As always, of course, it's the cost and the vendor-awareness issues that stop us from practically buying this kind of kit, but it's always nice to see it around.
The second and final parallel session of the day saw talks by the UK's own Andy Washbrook and our own Wahid Bhimji, but I didn't make those, as I went to the session ended by Yves Kemp (as Patrick Fuhrmann's proxy) giving the results of DESY's testing of NFS4.1 (and pNFS) as a mountable filesystem for dCache. Generally, the results look good - it will be interesting to see how DPM's implementation compares when it is stable - and Yves and I discussed the implications (we hope that) this might have on protocol use by experiments. (We're both in favour of people using sensible non-proprietary protocols, like NFS and HTTP, rather than weird protocols that no-one else has heard of; and the benefits of kernel-level support for NFS4.1 as a POSIX fs are seen in the better caching performance for users, for example).
Today will see, amongst other things, the first poster session - with 1 hour allocated to see the posters, and several hundred posters all up, we'll have just 15 seconds to appreciate each one; I'll see what the blurry mass resolves into when I do my second CHEP update tomorrow!
The first "CHEP" talk in the plenary session was given by Rene Brun, who used his privilege of being the guy up for retirement to control the podium for twice his allocated time; luckily, he used the time to give an insightful discussion of the historical changes that have occurred in the HEP computing models over the past decades (driven by increased data, the shifting of computational power, and the switch from Fortran (etc) to C++), and some thoughts on what would need to form the basis of the future computing models. Rene seemed keen on a pull model for data, with distributed parallelism (probably not at the event level) - this seems to be much more friendly to a MapReduce style implementation than the current methods floating around.
There was also a talk by the Dell rep, Forrest Norrod, on the technology situation. There was little here that would surprise anyone; CPU manufacturers are finding that even more cores doesn't work because of memory bandwidth and chip real-estate issues, so expect more on-die specialised cores (GPUs etc) or even FPGAs. The most interesting bit was the assertion that Dell (like HP before them) are looking at providing an ARM based compute node for data centres. After a lunch that we had to buy ourselves, the parallel sessions started.
The Distributed Processing track began with the usual traditional talks - Pablo Saiz gave an overview of AliEn for ALICE, they're still heavily based on xrootd and bittorrent for data movement, which sets them apart from the other experiments (although, of course, in the US, xrootd is closer to being standard); Vincent Garonne gave an update on ATLAS data management, including an exciting look at Rucio, the next-gen DQ2 implementation; the UK's own Stuart Wakefield gave the CMS Workload Management talk, of which the most relevant data management implication was that CMS are moving from direct streaming of job output to a remote SE (which is horribly inefficient, potentially, as there's no restriction on the destination SE's distance from the site where the job runs!) to an ATLAS-style (although Stuart didn't use that description ;) ) managed data staging process where the job output is cached on the local SE then transferred to its final destination out-of-band by FTS.
Philippe Charpentier's LHCb data talk was interesting primarily because of the discussion on "common standards" that it provoked in the questions - LHCb are considering popularity-based replica management, but they currently use their own statistics, rather than the CERN Experiment Support group's popularity service.
Speaking of Experiment Support, the final talk before coffee saw Maria Girone give the talk on the aforementioned Common Solutions strategy, which includes HammerCloud and the Experiment Dashboards as well as the Popularity Service - the most comment came from the final slides, however, where Maria discussed the potential for a Common Analysis Service framework (wrapping, say, a PanDA server so that the generic interfaces allow CMS or ATLAS to use it). There was some slightly pointed comment from LHCb that this was lovely, but they reminded people of the original "shared infrastructure" that LHCb/ATLAS used, until it just became LHCb's...
After that: coffee, where Rob Fay and I were pounced on by a rep from NexSan, who still have the most terrifyingly dense storage I've seen (60 drives per 4U, in slide-out drive "drawers"), and were keen to emphasise their green credentials. As always, of course, it's the cost and the vendor-awareness issues that stop us from practically buying this kind of kit, but it's always nice to see it around.
The second and final parallel session of the day saw talks by the UK's own Andy Washbrook and our own Wahid Bhimji, but I didn't make those, as I went to the session ended by Yves Kemp (as Patrick Fuhrmann's proxy) giving the results of DESY's testing of NFS4.1 (and pNFS) as a mountable filesystem for dCache. Generally, the results look good - it will be interesting to see how DPM's implementation compares when it is stable - and Yves and I discussed the implications (we hope that) this might have on protocol use by experiments. (We're both in favour of people using sensible non-proprietary protocols, like NFS and HTTP, rather than weird protocols that no-one else has heard of; and the benefits of kernel-level support for NFS4.1 as a POSIX fs are seen in the better caching performance for users, for example).
Today will see, amongst other things, the first poster session - with 1 hour allocated to see the posters, and several hundred posters all up, we'll have just 15 seconds to appreciate each one; I'll see what the blurry mass resolve to when I do my second CHEP update tomorrow!
09 March 2012
Busy afternoon for ATLAS Storage at the RAL T1
Lunchtime on the 8th March 2012 was a busy time for the disk servers in the ATLAS instance of CASTOR at the Tier1 at RAL. Servers were delivering approximately 45Gbps as a source:
FTS-controlled WAN transfers, as controlled by the RAL FTS, were negligible, since the majority of "WAN" transfers via the RAL FTS system were actually intra-CASTOR transfers.
"WAN Rate"
"Intra-CASTOR" rate shown below shows that of the 4-5GB/s only 2.5Gbps is internal FTS traffic. ( ~5% of traffic.)
However this is only ~1/3 of the outbound traffic over the WAN. The other traffic is controlled by other FTS servers.
ATLAS aggregate this into the following ATLAS specific plot:
This shows the outbound traffic was actually 625MB/s (~5Gbps)
ATLAS have a graph showing the amount of data processed. This peaked at over 8 TB in one hour.
However 8TB/hour is approximately 25Gbps
So what was using the remaining bandwidth...
Well, we also need to include traffic from disk servers to tape servers (which are LAN transfers that neither ATLAS nor FTS know about). This rate can be seen here:
This shows there is an extra 600 MB/s (~5 Gbps) of rate that would otherwise be unaccounted for. With the FTS + WN + tape network loads, the known bandwidth usage is 35 Gbps. As a rough estimate, this explains how the majority of the bandwidth was used.
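Pulling the pieces quoted above together, the rough accounting looks like this (figures exactly as quoted in the text, in Gbps):

```python
# Rough bandwidth accounting for the ~45 Gbps being served by the disk servers.
components_gbps = {
    "FTS-controlled WAN (outbound)": 5,   # ~625 MB/s
    "Worker-node processing (LAN)": 25,   # from the ~8 TB/hour figure
    "Disk-to-tape migration (LAN)": 5,    # ~600 MB/s
}
known = sum(components_gbps.values())
print(f"accounted for: {known} Gbps of ~45 Gbps observed")
```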