25 February 2015

ATLAS MC deletion campaign from tape complete for the RAL Tier1.

So ATLAS  have just finished a deletion campaign of  monte carlo data from our tape system at the RAL Tier1.

The good news is that previous seen issues of transfers failing due to a timeout (due to misplacing an "I'm done" UDP packet ) seems to have been solved.

ATLAS deleted 1.325 PB of data allowing for our tape system to recover and re-use (when repacking as completed,) approximately ~250 Tapes. ATLAS delete in total 1739588 files.  The deletion campaign took 17 days, but we have seen deletion rates at least a factor of four higher capable from the CASTOR system; so the VO should be able to increase their deletion request rate.

What is also of interest (which I am now looking into;)  is that ATLAS asked us to delete 211 files which they thought we had but we did not.

Also now may be a good time to provide ATLAS with a list of all the files we have in our tape system to find out which files  we have which ATLAS have "forgotten" about.

03 February 2015

Ceph stress testing at the RAL Tier 1

Of some interest to the wider community, the RAL Tier 1 site have been exploring the CEPH object store as a storage solution (some of the aspects of which involve grid interfaces being developed at RAL, Glasgow and CERN).

They've recently performed some interesting performance benchmarks, which Alastair Dewhurst reported on their own blog:


Distributed Erasure Coding backed by DIRAC File Catalogue

So, last year, I wrote a blog post on the background of Erasure Coding as a technique, and trailed an article on our own initial work on implementing such a thing on top of the DIRAC File Catalogue.

This article is a brief description of the work we did (a poster detailing this work will also be at the CHEP2015 conference).

Obviously, there are two elements to the initial implementation of any file transformation tool for an existing catalogue: choosing the encoding engine, and working out how to plumb it into the catalogue.

There are, arguably, two popular, fast, implementations of general erasure coding libraries in use at the moment:
zfec, which backs the Least Authority File System's implementation, and has a nice python api
jerasure, which has seen use in several projects, including backing Ceph's erasure coded pools.

As DIRAC is a mostly Python project, we selected zfec as our backend library, which also seems to have been somewhat fortuitous on legal grounds, as jerasure has recently been withdrawn from public availability due to patent challenges in the USA (while this is not a relevant threat in the UK, as we don't have software patents, it makes one somewhat nervous about using it as a library in a new project).

Rather than performing erasure coding as a stream, we perform the EC mapping of a file on disk, which is possibly a little slower, but is also safer and easier to perform.

Interfacing to DIRAC had a few teething problems. Setting up a DIRAC client appropriately was a little more finicky than we expected, and the Dirac File Catalogue implementation had some issues we needed to work around. For example, SEs known to the DFC are assumed good - there's no way of marking an SE as bad, or of telling how usable it is without trying it.

The implementation of the DFC Erasure Coding tool, therefore, also includes a tool which evaluates the health of the SEs available to the VO, and removes unresponsive SEs from its list of potential endpoints for transfers.

As far as the actual implementation for adding files is concerned, it's as simple as creating a directory (with the original filename) in the DFC, and uploading the encoded chunks within it, making sure to upload chunks across the set of SEs known to the DFC to support the VO you're part of.
We use the DFC's metadata store to store information about each chunk as a check for reconstruction. We were interested to discover that adding new metadata names to the DFC also makes them available for all files in the DFC, rather than simply for the files you add them for. We're not sure if this is an intended feature or not.

One of the benefits of any kind of data striping, including EC, is that we can retrieve chunks in parallel from the remote store. Our EC implementation allows the use of parallel transfer via the DFC methods when getting remote files, however, in our initial tests, we didn't see particular performance improvements. (Our test instance, using the Imperial test DIRAC instance, didn't have many SEs available to it, though, so it is hard to evaluate the scaling potential.)

The source code for the original implementation is available from: https://github.com/ptodev/Distributed-Resilient-Storage
(There's a fork by me, which has some attempts to clean up the code and possibly add additional features.)