30 September 2009

CIP update update

We are OK: the deployment problems that had not been caught in testing appear to be due to different versions of lcg-utils (used for all the tests) behaving subtly differently. So I could run tests as dteam prior to release and they'd work, but the very same tests would fail on the NGS CE after release, even though they also ran as dteam. Those problems were finally fixed this morning.
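If you want to check for the same effect at your end, it boils down to confirming which lcg-utils build each node carries and then running the same sort of copy as dteam from both. A rough sketch - the SRM endpoint and path here are placeholders, not the actual test targets:

# which lcg-utils / GFAL builds is this node running?
rpm -qa | grep -i -e 'lcg.util' -e gfal

# the same style of copy, run as dteam from the UI and from the CE's worker nodes
lcg-cp -v --vo dteam srm://some-srm.example.ac.uk/castor/example.ac.uk/dteam/testfile file:/tmp/testfile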

29 September 2009

CIP deployment

As some of you may have noticed, the new CASTOR information provider (version 2.0.3) went live as of 13.00 or thereabouts today.

This one is smarter than the previous one:
* It automatically picks up certain relevant changes to CASTOR.
* It has nearline (tape) accounting, as requested by CMS (see the query sketch after this list).
* It is more resilient against internal errors.
* It is easier to configure.
* It has an experimental bugfix for the ILC bug (it works for me on dteam).
* It has improved compliance with WLCG Installed Capacity (up to a point; it is still not fully compliant).
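For the curious, the nearline figures are published through the GLUE 1.3 storage area (GlueSA) objects, so you can see what the CIP is advertising with a plain LDAP query. A minimal sketch, assuming a top-level BDII - the BDII host below is a placeholder and the SE name is only an example, so substitute the ones you care about:

ldapsearch -x -LLL -H ldap://your-top-bdii.example.ac.uk:2170 -b o=grid \
  '(&(objectClass=GlueSA)(GlueChunkKey=GlueSEUniqueID=srm-cms.gridpp.rl.ac.uk))' \
  GlueSATotalNearlineSize GlueSAUsedNearlineSize GlueSAAccessControlBaseRule

The nearline sizes should come out non-zero for the tape-backed service classes once the new CIP is publishing them.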

Apart from a few initial wobbles and adjustments which were fixed fairly quickly (but still needed to filter through the system), real VOs should be working.

The ops VO was trickier, because it has access to everything in hairy ways, so we were coming up red on the SAM tests for a while. This appears to be sorted out for the SE tests, but the CE tests still fail, which is odd because the failing CE tests consist of jobs that run the same data tests as the SE tests, and those work. I talked to Stephen Burke, who suggested a workaround which is now filtering through the information system.

We're leaving it at risk till tomorrow - but the services are working. On the whole, apart from the ops tests with lcg-utils, I think it went rather well: the CIP is up against two extremely complex software infrastructures, CASTOR on one side and grid data management on the other, and the CIP itself has a complex task in trying to manage all this information.

Any Qs, let me know.

28 September 2009

Replicating like HOT cakes

As mentioned on the storage list, the newest versions of the GridPP DPM Tools (documented at http://www.gridpp.ac.uk/wiki/DPM-admin-tools) contain a tool to replicate files within a spacetoken (such as the ATLASHOTDISK).

At Edinburgh this is running from cron:

DPNS_HOST=srm.glite.ecdf.ed.ac.uk
DPM_HOST=srm.glite.ecdf.ed.ac.uk
PYTHONPATH=/opt/lcg/lib64/python
0 1 * * * root /opt/lcg/bin/dpm-sql-spacetoken-replicate-hotfiles --st ATLASHOTDISK >> /var/log/dpmrephotfiles.log 2>&1
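If you want to try it by hand before letting cron take over, the same invocation should work interactively with the same environment - the hostnames here are Edinburgh's, so substitute your own DPM head node:

export DPNS_HOST=srm.glite.ecdf.ed.ac.uk
export DPM_HOST=srm.glite.ecdf.ed.ac.uk
export PYTHONPATH=/opt/lcg/lib64/python
/opt/lcg/bin/dpm-sql-spacetoken-replicate-hotfiles --st ATLASHOTDISK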

Some issues observed are:
* It takes quite a long time to run the first time: because of all the dpm-replicate calls on the ~1000 files that ATLAS had already put in there, it took around 4 hours just to make one extra copy of each. Since then only the odd file has come in, so it doesn't have much to do.
* The replicas always land on different filesystems, but not always on different disk servers. This obviously depends on how many servers you have in that pool (compared to the nreps you want), as well as how many filesystems each server has. The replica creation could be more directed, though perhaps the built-in command should default to using a different server when it can.

Intended future enhancements of this tool include:
* List in a clear way the physical duplicates in the ST.
* Remove excess duplicates.
* Automatic replication of a list of "hotfiles".

Other suggestions welcome.