GridPP storage news: May 2012

29 May 2012

CHEP Digested

Apologies for not blogging live from the CHEP / WLCG meetings, but the week was busy for me with talks and splinter meetings. So please find below my somewhat jet-lagged digest of the week:

WLCG meeting:

News (to me) from the first day was that there will be a new Tier 0, in Hungary (!). The current plan is to build a beefy network and split jobs and storage across it without regard to location. Given the non-negligible expected latency, that didn't seem like the most obviously best plan.

Sunday was somewhat disappointing. Little was planned for the TEG session. The chairs were explicitly told that no talk was expected from them, only to find on the day that one was; the session therefore ended up regurgitating some of the conclusions and reiterating some of the same discussions. Apparently the TEGs are over now, despite their apparent zombie state. I hope that we can make something useful by building on what was discussed, outside any process, rather than waiting for what may or may not be officially formed in their wake.

On a non-storage note, I did ask Romain for one clarification on glexec: the requirement is for sites to provide fine-grained traceability, not necessarily to install glexec, though the group did not know of any other current way to satisfy the requirement. There was also some discussion on whether the requirement amounted to requiring identity switching, though it seemed fairly clear that it need not. If one can think of another way to satisfy the real requirement then one can use it.

CHEP day 1:

Rene Brun gave a kind of testimonial speech - which was a highlight of the week (because he is a legend). Later in the day he asked a question in my talk on ATLAS ROOT I/O - along the lines that he had previously seen faster rates in reading ATLAS files with pure ROOT, so why was the ATLAS software so much slower (the reasons are Transient->Persistent conversion as well as some reconstruction of objects). Afterwards he came up to me and said he was "very happy" that we were looking at ROOT I/O (which made my week really).
Other than my talk (which otherwise went well enough), the "Event Processing" session saw a description from CMS of their plans to make their framework properly parallel. A complete rewrite like this is possibly a better approach than the current ATLAS incremental attempts (as also described in this session by Peter V G) - though it's all somewhat pointless unless big, currently sequential (and possibly parallelizable) parts like tracking are addressed.

CHEP day 2:

Sam blogged a bit about the plenaries. The parallel sessions got off to a good start (;-)) with my GridPP talk on analysing I/O bottlenecks: the most useful comment was perhaps that by Dirk on I/O testing at CERN (see splinter meeting comment below). There was then a talk regarding load balancing for dCache, which seemed a fairly complicated algorithm but, if it works, is perhaps worth adopting in DPM. Then a talk on xrootd from (of course) Brian B, but describing both ATLAS and CMS work. To be honest I found the use cases less compelling than I have done previously, but there is still lots of good work on understanding these, and future development is worth supporting (see again splinter meetings below).

The poster session was, as Sam indicated, excellent - though there were far too many posters to mention. The work on DPM both in DM-LITE and WebDAV is very promising, but the proof will be in the production pudding that we are testing in the UK (see also my and Sam's CHEP paper of course).
Back in the parallel sessions, the HammerCloud update showed some interesting new features and correlations between outages, aimed at reducing the functional testing load. CMS are now using HC properly for their testing.

CHEP day 3:

In terms of the ROOT plenary talk - I would comment on Sam's comments that the asynchronous prefetching does need some more work (we have tested it) but at least it is in there (see also ROOT I/O splinter meeting comments below). I also noted that they offer different compression schemes now which I haven't explored.
The data preservation work is indeed interesting; Sam gave the link to the report. There are two models for ensuring one can still run on the data: maintaining an old OS environment or validating a new one. I find the latter most interesting, but really I wonder whether experiments will preserve manpower on old experiments to check and keep up such a validation.

Andreas Peters's talk in the next session was the most relevant plenary to storage. As Sam suggested, it was indeed excellently wide-ranging and not too biased. Some messages: storage is still hard, and getting harder, with management, tuning and performance issues. LHC storage is large in terms of volume but not number of objects. Storage interfaces are split between complex / rich ones such as POSIX and reduced ones such as S3. We need to be flexible, both to profit from standards / community projects and to avoid being tied to any particular technology.

CHEP day 4:

The morning I mostly spent in splinter meetings on Data Federations and ROOT I/O (see below). In the afternoon there was a talk from Jeff Templon on the NIKHEF tests with WebDAV and proxy caches, which is independent of any middleware implementation. Interesting stuff, though somewhat of a prototype, and it should be integrated with other work. There was also some work in Italy on HTTP access which needs further testing but shows such things are possible with StoRM.
After coffee and many many more posters (!), Paul M showed that dCache is pluggable beyond pluggable (including potentially interesting work with HDFS (and Hadoop for log processing)). He also kept reassuring us that it will be supported in the future.

Some Splinter Meetings / Discussions:

Possibilities for using DESY grid lab for controlled DPM tests.
Interest in testing dCache using similar infrastructure as we presented for DPM.
ATLAS xrootd federation pushing into the EU, with some redirectors installed at CERN and some sites in Germany and (we volunteered) the UK (including testing the new emerging DPM xrootd server).
DPM support. Certainly there will be some drop in CERN support post EMI. Lots more discussions to be had, but it seemed optimistic that there would be some decent level of support from CERN, providing some could also be found from the community of regions / users.
Other DPM news: Chris volunteered for DM-LITE on Lustre; Sam and I for both xrootd and WebDAV stuff.
ROOT I/O - Agreement to allow TTreeCache to be set in the environment. More discussion on optimising baskets (some requirements from CMS make it more complicated). Interest in having monitoring internal to ROOT, switched on in .rootrc: a first pass at a list of variables to be collected was constructed.
I/O benchmarking - Dirk at CERN has a suite that provides both a mechanism for submitting tests and some tests of its own that are similar to the ones we are using (but not identical). We will form a working group to standardise the tests and share tools.

24 May 2012

Day 3 of CHEP - Data Plenaries

The third day of CHEP is always a half-day, with space in the afternoon for the tours.
With that in mind, there were only 6 plenary talks to attend, but 4 of those were of relevance to Storage.

First up, Fons Rademakers gave the ROOT overview talk, pointing to all the other ROOT development talks distributed across the CHEP schedule. In ROOT's I/O system, there are many changes planned, some of which reflect the need for more parallelism in the workflow for the experiments. Hence, parallel merges are being improved (removing some locking issues that still remained), and ROOT is moving to a new threading model where there can be dedicated "IO helper threads" as part of the process space. Hopefully, this will even out IO load for ROOT-based analysis and improve performance.
Another improvement aimed at performance is the addition of asynchronous prefetching to the IO pipeline, which should reduce latencies for streamed data - while I'm still on the fence about I/O streaming vs staging, prefetching is another "load smearing" technique which might improve the seekiness on target disk volumes enough to make me happy.

The next interesting talk was this year's iteration of the always engaging (and a tiny bit Cassandra-ish) DPHEP talk on Data Preservation. There was far too much good material in this talk to summarise - I instead encourage the interested to read the latest report from the DPHEP group, out only a few days ago, at: http://arxiv.org/abs/1205.4667

In the second session, two more interesting talks with storage relevance followed.
First, Jacek Becla gave an interesting and wide-ranging talk on analysis with very large datasets, discussing the scaling problems of manipulating that much data (beginning with the statement "Storing petabytes is easy. It is what you do with them that matters"). One of the most interesting notes was that indexes on large datasets can be worse for performance, once you get above a critical size - the time and I/O needed to update the indices impairs total performance more than the gain; and the inherently random access that seeking from an index produces on the storage system is very bad for throughput with a sufficiently large file to seek in. Even SSDs don't totally remove the malus from the extremely high seeks that Jacek shows.
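
The random-access half of that argument can be illustrated with a rough cost model. This is only a sketch: the 8 ms seek time, 100 MB/s sequential throughput, 1 TB file and 1 kB record size are assumptions chosen for illustration, not figures from Jacek's talk, and the function names are hypothetical. It also ignores the cost of keeping the index up to date, which only makes things worse for the index.

```python
# Rough cost model: full sequential scan vs. index-driven random reads.
# All hardware numbers below are illustrative assumptions.

def scan_time_s(dataset_bytes, throughput_bytes_per_s):
    """Time to read the whole dataset sequentially."""
    return dataset_bytes / throughput_bytes_per_s


def indexed_time_s(n_lookups, seek_s, record_bytes, throughput_bytes_per_s):
    """Time to fetch n_lookups records, paying one seek per record."""
    per_lookup = seek_s + record_bytes / throughput_bytes_per_s
    return n_lookups * per_lookup


def break_even_lookups(dataset_bytes, seek_s, record_bytes, throughput_bytes_per_s):
    """Number of indexed lookups at which a full scan becomes cheaper."""
    per_lookup = seek_s + record_bytes / throughput_bytes_per_s
    return scan_time_s(dataset_bytes, throughput_bytes_per_s) / per_lookup


# Assumed values: 1 TB file, 1 kB records, 8 ms seek, 100 MB/s throughput.
DATASET_BYTES = 10**12
RECORD_BYTES = 10**3
SEEK_S = 0.008
THROUGHPUT = 100 * 10**6

n_records = DATASET_BYTES // RECORD_BYTES
crossover = break_even_lookups(DATASET_BYTES, SEEK_S, RECORD_BYTES, THROUGHPUT)

print(f"full scan: {scan_time_s(DATASET_BYTES, THROUGHPUT):.0f} s")
print(f"break-even at {crossover:.0f} lookups "
      f"({100 * crossover / n_records:.3f}% of all records)")
```

Under these assumed numbers, a query that selects more than a small fraction of a percent of the records through the index is already slower than reading the whole file sequentially, which is the effect described above.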

Second, Andreas Joachim Peters gave a talk on the Past and Future of very large filesystems, which was actually a good overview, and avoided promoting EOS too much! Andreas made a good case for non-POSIX filesystems for archives, and for taking an agile approach to filesystem selection.

23 May 2012

Some notes from CHEP - day 2

So, I'm sure that when Wahid writes his blog entries from CHEP, you'll hear about lots of other interesting talks, so as before I'm just going to cover the pieces I found interesting.

The plenaries today started with a review of the analysis techniques employed by the various experiments by Markus Klute, emphasising the large data volumes required for good statistical analyses. More interesting perhaps for its comparison was Ian Fisk's talk covering the new computing models in development, in the context of LHCONE. Ian's counter to the "why can Netflix do data streaming when we can't?" question was:

(that is, the major difference between us and everyone with a CDN is that we have 3 orders of magnitude more data in a single replica - it's much more expensive to replicate 20PB across 100 locations than 12TB!).
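
The scale of that difference is easy to check with quick arithmetic. The sketch below uses only the sizes quoted above (20PB, 12TB, 100 locations); treating the units as decimal (1 PB = 1000 TB) is an assumption made for simplicity.

```python
# Back-of-envelope check of the replication comparison quoted above.
# Decimal units are assumed: 1 TB = 10**12 bytes, 1 PB = 10**15 bytes.
TB = 10**12
PB = 10**15

hep_replica = 20 * PB   # one replica of the HEP data
cdn_replica = 12 * TB   # one replica of the CDN catalogue
locations = 100

ratio = hep_replica / cdn_replica
hep_total = hep_replica * locations
cdn_total = cdn_replica * locations

print(f"single-replica ratio: about {ratio:.0f}x")
print(f"HEP data at {locations} locations: {hep_total / PB:.0f} PB")
print(f"CDN data at {locations} locations: {cdn_total / PB:.1f} PB")
```

The ratio comes out between 1000 and 10000, consistent with the "3 orders of magnitude" in the text, and full replication of the larger dataset would need exabyte-scale storage.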
The very best talk of the day was Oxana Smirnova's talk on the future role of the grid. Oxana expressed the most important (and most ignored within WLCG) lesson: if you make a system that is designed for a clique, then only that clique will care. In the context of the EGI/WLCG grid, this is particularly important due to the historical tendency of developers to invent incompatible "standards" [the countless different transfer protocols for storage, the various job submission languages etc] rather than all working together to support a common one (which may already exist). This is why the now-solid HTTP(s)/WebDAV support in the EMI Data Management tools is so important (and why the developing NFS4.1/pNFS functionality is equally so): no-one outside of HEP cares about xrootd, but everyone can use HTTP. I really do suggest that everyone enable HTTP as a transport mechanism on their DPM or dCache instance if they're up to date (DPM will be moving, in the next few revisions, to using HTTP as the default internal transport in any case).

A lot of the remaining plenary time was spent in talking about how the future will be different to the past (in the context of the changing pressures on technology), but little was new to anyone who follows the industry. One interesting titbit from the Cloudera talk was the news that HDFS can now support High Availability via multiple metadata servers, which gives potentially higher performance for metadata operations as well.

Out of the plenary tracks, the most interesting in the session I was in was Jakob Blomer's update on CVMFS. We're now deep in the 2.1.x releases, which have much better locking behaviour on the clients; the big improvements on the CVMFS server are coming in the next minor version, and include the transition from redirfs (unsupported, no kernels above SL5) to aufs (supported in all modern kernels) for the overlay filesystem. This also gives a small performance boost to the publishing process when you push a new release into the repository.

Of the posters, there were several of interest - the UK was, of course, well represented, and Mark's IPv6 poster, Alessandra's Network improvements poster, Chris's Lustre and my CVMFS for local VOs poster all got some attention. In the wider ranging set of posters, the Data Management group at CERN were well represented - the poster on HTTP/WebDAV for federations got some attention (it does what xrootd can do, but with an actual protocol that the universe cares about, and the implementation that was worked on for the poster even supports geographical selection of the closest replica by IP), as did Ricardo's DPM status presentation (which, amongst other things, showcased the new HDFS backend for DMLite). With several hundred posters and only an hour to look at them, it was hard to pick up the other interesting examples quickly, but some titles of interest included the "Data transfer test with a 100Gb network" (spoiler: it works!), and a flotilla of "experiment infrastructure" posters of which the best *title* goes to "No file left behind: monitoring transfer latencies in PhEDEx".

22 May 2012

CHEP Day 1 - some notes

So, Wahid and I are amongst the many people at CHEP in New York this week, so it behooves us to give some updates on what's going on. The conference proper started with a long series of plenary talks; the usual Welcome speech, the traditional Keynote on HEP and update on the LHC experience (basically: we're still hopeful we'll get enough data for an independent Higgs discovery from ATLAS and CMS; but there's a lot more interesting stuff that's going on that's not Higgs related - more constraints on the various free constants in the Standard Model, some additional particle discoveries).

The first "CHEP" talk in the plenary session was given by Rene Brun, who used his privilege of being the guy up for retirement to control the podium for twice his allocated time; luckily, he used the time to give an insightful discussion of the historical changes that have occurred in the HEP computing models over the past decades (driven by increased data, the shifting of computational power, and the switch from Fortran (etc) to C++), and some thoughts on what would need to form the basis of the future computing models. Rene seemed keen on a pull model for data, with distributed parallelism (probably not at the event level) - this seems to be much more friendly to a MapReduce style implementation than the current methods floating around.
There was also a talk by the Dell rep, Forrest Norrod, on the technology situation. There was little here that would surprise anyone; CPU manufacturers are finding that even more cores doesn't work because of memory bandwidth and chip real-estate issues, so expect more on-die specialised cores (GPUs etc) or even FPGAs. The most interesting bit was the assertion that Dell (like HP before them) are looking at providing an ARM based compute node for data centres. After a lunch that we had to buy ourselves, the parallel sessions started.

The Distributed Processing track began with the usual traditional talks - Pablo Saiz gave an overview of AliEn for ALICE, they're still heavily based on xrootd and bittorrent for data movement, which sets them apart from the other experiments (although, of course, in the US, xrootd is closer to being standard); Vincent Garonne gave an update on ATLAS data management, including an exciting look at Rucio, the next-gen DQ2 implementation; the UK's own Stuart Wakefield gave the CMS Workload Management talk, of which the most relevant data management implication was that CMS are moving from direct streaming of job output to a remote SE (which is horribly inefficient, potentially, as there's no restriction on the destination SE's distance from the site where the job runs!) to an ATLAS-style (although Stuart didn't use that description ;) ) managed data staging process where the job output is cached on the local SE then transferred to its final destination out-of-band by FTS.
Philippe Charpentier's LHCb data talk was interesting primarily because of the discussion on "common standards" that it provoked in the questions - LHCb are considering popularity-based replica management, but they currently use their own statistics, rather than the CERN Experiment Support group's popularity service.
Speaking of Experiment Support, the final talk before coffee saw Maria Girone give the talk on the aforementioned Common Solutions strategy, which includes HammerCloud and the Experiment Dashboards as well as the Popularity Service - most of the comments came in response to the final slides, however, where Maria discussed the potential for a Common Analysis Service framework (wrapping, say, a PanDA server so that the generic interfaces allow CMS or ATLAS to use it). There was some slightly pointed comment from LHCb that this was lovely, but they reminded people of the original "shared infrastructure" that LHCb/ATLAS used, until it just became LHCb's...

After that: coffee, where Rob Fay and I were pounced on by a rep from NexSan, who still have the most terrifyingly dense storage I've seen (60 drives per 4U, in slide-out drive "drawers"), and were keen to emphasise their green credentials. As always, of course, it's the cost and the vendor-awareness issues that stop us from practically buying this kind of kit, but it's always nice to see it around.

The second and final parallel session of the day saw talks by the UK's own Andy Washbrook and our own Wahid Bhimji, but I didn't make those, as I went to the session ended by Yves Kemp (as Patrick Fuhrmann's proxy) giving the results of DESY's testing of NFS4.1 (and pNFS) as a mountable filesystem for dCache. Generally, the results look good - it will be interesting to see how DPM's implementation compares when it is stable - and Yves and I discussed the implications (we hope that) this might have on protocol use by experiments. (We're both in favour of people using sensible non-proprietary protocols, like NFS and HTTP, rather than weird protocols that no-one else has heard of; and the benefits of kernel-level support for NFS4.1 as a POSIX fs are seen in the better caching performance for users, for example).

Today will see, amongst other things, the first poster session - with 1 hour allocated to see the posters, and several hundred posters all up, we'll have just 15 seconds to appreciate each one; I'll see what the blurry mass resolves to when I do my second CHEP update tomorrow!
