10 August 2012

A morsel of FTS error analysis.

Metric numbers never really tell the true story.
I was looking into FTS transfer failures for the ATLAS experiment (that is, transfers involving ATLAS; I wasn't doing the analysis for ATLAS itself) to see how well we are transferring data within the UK.
The overall per-attempt success figure is 93.22%. With two retries allowed, on average 99.993% of transfers complete without the VO having to worry about retrying from within its own framework. That sounds good until you find out that last month ATLAS transferred nearly 6.5M files to and from UK sites, so ~45k files would still have to be retried from the VO framework. In total there were 6,417,312 unique successful transfers (2.4M solely within the UK) and 467,001 failures associated with them.
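For what it's worth, here is a rough sketch of where numbers like these come from, assuming each attempt succeeds independently with the same probability. Real failures are of course correlated (dead disk servers, full space tokens, and so on), so treat the output as an illustration rather than an exact reproduction of the figures above.

    # Naive sketch of the retry arithmetic, assuming independent attempts.
    successes = 6417312   # unique successful transfers last month
    failures = 467001     # failed attempts associated with them

    p = float(successes) / (successes + failures)
    print("per-attempt success rate: %.2f%%" % (100 * p))   # ~93.22%

    def success_within(attempts):
        """Chance a transfer completes within `attempts` tries (independence assumed)."""
        return 1 - (1 - p) ** attempts

    for n in (1, 2, 3):   # 3 attempts = first try plus two FTS-level retries
        print("within %d attempt(s): %.3f%%" % (n, 100 * success_within(n)))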



Good Write/Read/Delete rates for ATLAS at UK T1

We had a busy period for the ATLASSCRATCHDISK space token at the UK Tier 1.

This was in response to recovering files for a Tier 2 which had lost a disk server and needed replacement copies.

We wrote ~35 TB/day for two days, and deleted 80 TB (240,000 files) in a day.
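To put those raw figures into more familiar units, a quick back-of-the-envelope conversion (assuming the activity was spread evenly over the day, which it certainly was not):

    # Back-of-the-envelope conversion of the quoted figures into average rates.
    TB = 1e12  # decimal terabytes, for simplicity

    write_bytes_per_s = 35 * TB / 86400.0
    print("~35 TB/day is roughly %.1f Gbit/s sustained" % (write_bytes_per_s * 8 / 1e9))  # ~3.2

    deleted_bytes, deleted_files = 80 * TB, 240000
    print("80 TB over 240k files is ~%.2f GB per file" % (deleted_bytes / deleted_files / 1e9))  # ~0.33
    print("240k deletions/day is ~%.1f files/s" % (deleted_files / 86400.0))  # ~2.8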



Data was copied out to the T2. Purely from RAL, the rate looked like the RALLCG2-UKISOUTHGRIDCAMHEP channel plot; the following shows the transfers to the T2 in question (including other source sites, but dominated by transfers from the RAL Tier 1).

02 August 2012

DPM-XROOTD and Federated redirection: volume 1

Historically, one of the weaknesses in DPM as an SE, from the perspective of some of the LHC VOs, was its lack of proper xrootd support. (While, technically, DPM has supported "xrootd" for some time, the release of xrootd involved has always lagged significantly behind the curve, meaning that DPMs supporting the protocol often couldn't actually provide the functionality expected of them.)

Partly as a result of the recent enthusiasm for federated storage (a concept whereby storage endpoints become part of a redirection hierarchy, so that requests against files not present locally can be passed up the chain, until a (hopefully close) endpoint with the file can be found to serve the request), and the particular enthusiasm of ATLAS and CMS (thanks to their experiments in the US) for xrootd as the mechanism for this, DPM's xrootd support has recently improved significantly.
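To make the redirection idea a bit more concrete: from the client's point of view, a federated read just means pointing at the top of the hierarchy instead of at a particular SE, and letting the redirectors find a copy. A hypothetical illustration follows (the redirector hostname and file path are invented):

    # Illustration only: reading a file through an xrootd redirector rather than
    # a specific SE. The redirector host and the file path are made-up examples.
    import subprocess

    redirector = "root://some-atlas-redirector.example.org/"   # hypothetical federation entry point
    lfn = "/atlas/some/dataset/file.root"                       # hypothetical file path

    # xrdcp asks the redirector, which walks the hierarchy until an endpoint
    # (ideally a nearby one) that actually holds the file agrees to serve it.
    subprocess.check_call(["xrdcp", redirector + lfn, "/tmp/file.root"])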

At present, the package is still beta (in particular, the YAIM module is not released yet, so hand configuration is more reliable), but it's been tested on the development SE here at Glasgow (svr025), with some success.

The current release of the dpm-xrootd package still needs to be obtained from an unusual location (instructions here: https://svnweb.cern.ch/trac/lcgdm/wiki/Dpm/Xroot/Setup ), but has the advantage that it will work with DPM 1.8.2 disk nodes, as long as the head node is DPM 1.8.3.
As 1.8.3 is EMI-only, this effectively allows you to test the protocol with gLite disk nodes for the first time.

I've recently set this release up on the production SE at Glasgow (svr018, which has an EMI 1.8.3 head node with a mix of gLite and EMI disk nodes). Some thoughts follow:

1) It is not safe to install dpm-xrootd from the repository mentioned in the instructions above if your glite-SE_dpm_disk release is less than 1.8.2. One of the dependencies of the package is the 1.8.2 release of dpm-lib, but without the rest of the 1.8.2 packages installed, this will simply break gridftp and rfiod.
Update your node to 1.8.2 and then pull in dpm-xrootd (a quick pre-flight check is sketched after this list).

2) The configuration described in the link above is identical for all disk pool nodes, which makes this much less painful than it might be - test with one disk node, then mirror the configuration across the others.

3) It appears that, for some reason, dpm-xrootd does not like SL5.5 with glite-SE_dpm_disk - several of our disk pools are on that SL release, and the xrootd service refused to start on them. Updating (yum update) to SL5.7 fixes this, by means not yet fully understood.

4) Providing a certificate with a valid ATLAS VOMS role for the LFC lookup is left as an exercise for the reader. This is a requirement of the xrootd redirection framework, not of DPM specifically, and I hope it will go away soon, since it's extremely silly.
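As a footnote to point 1, this is the kind of pre-flight check I have in mind. It's only a sketch, and the package name it queries (dpm-libs) is an assumption - adjust it to whatever your disk nodes actually carry.

    # Pre-flight check for point 1: refuse to install dpm-xrootd if the DPM
    # libraries on this disk node are older than 1.8.2. The package name queried
    # (dpm-libs) is an assumption - adjust it to match your node.
    import subprocess

    def installed_version(package):
        """Return the installed RPM version string, or None if it isn't installed."""
        try:
            out = subprocess.check_output(["rpm", "-q", "--qf", "%{VERSION}", package])
        except subprocess.CalledProcessError:
            return None
        return out.decode().strip()

    version = installed_version("dpm-libs")
    if version is None:
        raise SystemExit("no DPM libraries found on this node")
    if tuple(int(x) for x in version.split(".")[:3]) < (1, 8, 2):
        raise SystemExit("DPM %s installed: update to 1.8.2 first, otherwise "
                         "dpm-xrootd's dependencies will break gridftp and rfiod" % version)
    print("DPM %s installed: OK to pull in dpm-xrootd" % version)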

With those caveats in mind, things seem to work fairly well, although this is all in the testing phase for ATLAS (and Europe) for the moment.