24 January 2007

OK, so now the dCache workshop is finished and we're onto the WLCG Tier-1/2 workshop at CERN. Again, I'm a bit late in commenting on the proceedings, but at least I got there in the end.

Day 1: In the morning there were LHC experiment presentations. Again, SRM v2.2 was stated as being a critical service for the success of WLCG. Alice still plan to do things their own way and require xrootd for data access. Now that dCache and (soon) DPM support xrootd as an access protocol, our Tier-2 sites may have to think about enabling these access doors in order to support Alice more fully. At the moment it's not clear what we will do.

There was a data management BOF in the afternoon where details about FTS, lcg_utils etc. were presented. A fair bit of time was spent discussing how work plans and priorities are set for feature requests - not very interesting or relevant to DM.

Day 2: Storage issues took up a greater part of the agenda on the second day. There was a "SRM v2.2 deployment and issues" session. Details are still needed on how sites will configure dCache and DPM to set up the required Tape0Disk1 storage classes. It will be the mandate of a newly created WLCG working group to understand the issues and create documentation for sites.

In the afternoon there was an interesting DPM BOF. A number of sites (including GridPP) reported their experiences of DPM and highlighted issues with the software. The problems that every site reported were so similar that I rushed through the GridPP presentation; I didn't want to bombard the audience with information they had already heard from each of the other sites. The main issues can be summarised as follows:

1. Incompatibility of the DPM and CASTOR rfio libraries. This is currently preventing many sites from using rfio to access the data on DPMs from the WNs. This is part of the reason some VOs continue to gridftp the files to the WNs before the analysis starts.

2. Scalability of DPM. The majority of DPM sites are small (~10 TB). How will DPM scale to hundreds of TB spread across many more nodes? Also, how will DPM handle an entire batch farm all trying to use rfio to access the files? Work by GridPP that has started to look at this issue was presented during the session.

3. Performance of DPM. Many sites would like to see DPM implement a more advanced system for load balancing. Currently a round-robin system is used to choose which filesystem in a pool to write to. Many people suggested the use of a queuing system instead (there's a toy sketch of the difference after this list). For those interested, dCache already has this stuff built in ;-)

4. Admin and management tools. Everyone would like to have a tool to check that the namespace and the pool filesystems are in sync (a rough sketch of such a check also follows the list). The developers plan to release something very soon for this. Also, everyone wanted a more fine-grained and flexible pool draining mechanism, i.e., migrate just the ATLAS files from one pool to another, not the entire contents of the pool.

5. Storage classes and space tokens. It looks like the developers are recommending that sites set up space tokens that map particular VO roles to particular spaces in the DPM pools. It was suggested that this could be a way to implement quotas in DPM, since the VO would reserve X TB of space. In this way, the site would just have to set up generic pools and let the space tokens control who could write to the pool and how much they could write; VO ACLs would then probably not be required to control write access to the pools (a toy model of this idea is sketched below). Of course, DPM is also used by smaller VOs who might not care about storage classes, SRM 2.2...
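To make point 3 a bit more concrete, here's a toy Python sketch of the difference between round-robin filesystem selection and a simple least-loaded policy. The filesystem names and load figures are made up, and this is not DPM (or dCache) code - just an illustration of the two policies.

```python
import itertools

# Hypothetical filesystems in a pool, with an invented "load" figure
# (e.g. number of active rfio transfers). Purely illustrative data.
filesystems = [
    {"name": "disk01:/storage", "active_transfers": 12},
    {"name": "disk02:/storage", "active_transfers": 3},
    {"name": "disk03:/storage", "active_transfers": 7},
]

# Policy 1: plain round-robin. Each new file goes to the next filesystem
# in the cycle, regardless of how busy that filesystem is.
round_robin = itertools.cycle(filesystems)

def choose_round_robin():
    return next(round_robin)["name"]

# Policy 2: pick the least-loaded filesystem. A queuing approach could go
# further and hold requests back once every filesystem is above a threshold.
def choose_least_loaded():
    return min(filesystems, key=lambda fs: fs["active_transfers"])["name"]

if __name__ == "__main__":
    print("round-robin picks:", [choose_round_robin() for _ in range(4)])
    print("least-loaded pick:", choose_least_loaded())
```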
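And for point 4, here is roughly the kind of consistency check people were asking for: compare what the namespace thinks is on disk with what is actually there, and flag lost replicas and dark data. The namespace dump, paths and mount point below are invented; the real DPM namespace lives in a database, so an actual tool would query that rather than a hard-coded dict.

```python
import os

# Hypothetical namespace dump: logical file name -> physical replica path.
# Stands in for a query against the real namespace database.
namespace_replicas = {
    "/dpm/example.ac.uk/home/atlas/file1": "/storage/atlas/2007-01/file1.1",
    "/dpm/example.ac.uk/home/atlas/file2": "/storage/atlas/2007-01/file2.1",
}
pool_mount_points = ["/storage"]  # invented mount point for the pool filesystems

def files_on_disk(mount_points):
    """Walk the pool filesystems and collect every physical file found."""
    found = set()
    for mount in mount_points:
        for dirpath, _dirnames, filenames in os.walk(mount):
            for name in filenames:
                found.add(os.path.join(dirpath, name))
    return found

def check_consistency():
    on_disk = files_on_disk(pool_mount_points)
    expected = set(namespace_replicas.values())

    lost = expected - on_disk   # namespace entries whose replica is missing
    dark = on_disk - expected   # files on disk the namespace knows nothing about

    for path in sorted(lost):
        print("LOST: replica missing on disk:", path)
    for path in sorted(dark):
        print("DARK: file not registered in namespace:", path)

if __name__ == "__main__":
    check_consistency()
```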
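Finally, a toy model of the quota-via-space-token idea from point 5. The token names, roles and sizes are all invented, and real space reservation would of course go through the SRM v2.2 interface rather than a Python dict; this just shows the accounting logic that would let a site run generic pools and leave "who can write, and how much" to the tokens.

```python
# Invented reservations: space token -> reserved size, usage and allowed VO role.
reservations = {
    "ATLAS_PROD": {"reserved_tb": 20.0, "used_tb": 0.0,
                   "allowed_role": "atlas/Role=production"},
    "ATLAS_USER": {"reserved_tb": 5.0, "used_tb": 0.0,
                   "allowed_role": "atlas"},
}

def can_write(space_token, vo_role, size_tb):
    """Accept a write only if the role matches the token and space remains."""
    res = reservations.get(space_token)
    if res is None or res["allowed_role"] != vo_role:
        return False
    return res["used_tb"] + size_tb <= res["reserved_tb"]

def record_write(space_token, size_tb):
    reservations[space_token]["used_tb"] += size_tb

if __name__ == "__main__":
    if can_write("ATLAS_PROD", "atlas/Role=production", 2.0):
        record_write("ATLAS_PROD", 2.0)
        print("write accepted; ATLAS_PROD now at",
              reservations["ATLAS_PROD"]["used_tb"], "TB used")
```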

Again, the developers welcomed input from sites about performance issues. They would like access to a production DPM (since they don't have one) and would always appreciate sites sending in log files when there is a problem.

Phew, that was a lot of information. I'm sure there is still some more to come, as well as a report on our rfio tests. I'll leave that till later, or for Graeme to do.
