03 February 2015

Distributed Erasure Coding backed by DIRAC File Catalogue

So, last year, I wrote a blog post on the background of Erasure Coding as a technique, and trailed an article on our own initial work on implementing such a thing on top of the DIRAC File Catalogue.

This article is a brief description of the work we did (a poster detailing this work will also be at the CHEP2015 conference).

Obviously, there are two elements to the initial implementation of any file transformation tool for an existing catalogue: choosing the encoding engine, and working out how to plumb it into the catalogue.

There are, arguably, two popular, fast, implementations of general erasure coding libraries in use at the moment:
zfec, which backs the Least Authority File System's implementation, and has a nice python api
jerasure, which has seen use in several projects, including backing Ceph's erasure coded pools.

As DIRAC is a mostly Python project, we selected zfec as our backend library, which also seems to have been somewhat fortuitous on legal grounds, as jerasure has recently been withdrawn from public availability due to patent challenges in the USA (while this is not a relevant threat in the UK, as we don't have software patents, it makes one somewhat nervous about using it as a library in a new project).

Rather than performing erasure coding as a stream, we perform the EC mapping of a file on disk, which is possibly a little slower, but is also safer and easier to perform.

Interfacing to DIRAC had a few teething problems. Setting up a DIRAC client appropriately was a little more finicky than we expected, and the Dirac File Catalogue implementation had some issues we needed to work around. For example, SEs known to the DFC are assumed good - there's no way of marking an SE as bad, or of telling how usable it is without trying it.

The implementation of the DFC Erasure Coding tool, therefore, also includes a tool which evaluates the health of the SEs available to the VO, and removes unresponsive SEs from its list of potential endpoints for transfers.

As far as the actual implementation for adding files is concerned, it's as simple as creating a directory (with the original filename) in the DFC, and uploading the encoded chunks within it, making sure to upload chunks across the set of SEs known to the DFC to support the VO you're part of.
We use the DFC's metadata store to store information about each chunk as a check for reconstruction. We were interested to discover that adding new metadata names to the DFC also makes them available for all files in the DFC, rather than simply for the files you add them for. We're not sure if this is an intended feature or not.

One of the benefits of any kind of data striping, including EC, is that we can retrieve chunks in parallel from the remote store. Our EC implementation allows the use of parallel transfer via the DFC methods when getting remote files, however, in our initial tests, we didn't see particular performance improvements. (Our test instance, using the Imperial test DIRAC instance, didn't have many SEs available to it, though, so it is hard to evaluate the scaling potential.)

The source code for the original implementation is available from: https://github.com/ptodev/Distributed-Resilient-Storage
(There's a fork by me, which has some attempts to clean up the code and possibly add additional features.)

No comments: