24 March 2008

Grid storage not working?

Well, going by what I heard last week at LHCb software week, I'm afraid the answer seems to be that it isn't. The majority of the week focussed on all the cool new changes to the core LHCb software and improvements to the HLT, but there was an interesting session on Wednesday afternoon covering CCRC and more general LHCb computing operations. The point was made in 3 (yes, 3!) separate talks that LHCb continue to be plagued by storage problems that prevent their production and reconstruction jobs from completing successfully. The main issue is the instability of the local POSIX-like protocols used to open files remotely on the grid SE from jobs running on the site WNs. From my understanding, the problems fall broadly into two categories:

1. Many of the servers in use have been configured such that if a job holds a file open for longer than (say) one day, the connection is dropped, causing the entire job to fail.

2. Sites have been running the POSIX-like access services on the same hosts that provide the SRM. This isn't wrong, but it is definitely not recommended because of the load it puts on the system. Anyway, the real problem comes when the SRM has to be restarted for some reason (most likely an upgrade): the site(s) appear to have simply restarted all services on the node, which again dropped any open file connections and caused the jobs to fail. I thought this was basic knowledge that everyone had, but apparently I was wrong.

LHCb seem to be particularly vulnerable as they have long-running reconstruction jobs (>33 hours), resulting in low job efficiency when the above problems rear their ugly heads. I would be interested in comments from other experiments on these observations. Anyway, the upshot of this is that LHCb are now considering copying data files to local disk before starting their reconstruction jobs. This won't be possible for user analysis jobs, which access events from a large number of files: copying all of them locally isn't particularly efficient, nor do you know a priori how much local space the WN has available.
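As a rough illustration, the copy-then-read approach might look something like the ROOT snippet below. This is just a sketch: the URL and scratch path are made up, and it assumes a ROOT build that provides the static TFile::Cp helper plus whichever remote protocol plugin the site offers.

    // Sketch: stage the input onto WN scratch before the long reconstruction job,
    // so no connection to the SE has to stay open for 30+ hours.
    #include "TFile.h"

    TFile* openInput(const char* remoteUrl, const char* localPath = "/tmp/input.dst")
    {
       // Copy the file from the SE to local disk (URL and path are illustrative).
       if (TFile::Cp(remoteUrl, localPath)) {
          return TFile::Open(localPath);   // plain local POSIX read from here on
       }
       // If the copy fails (or scratch space is short), fall back to remote access.
       return TFile::Open(remoteUrl);
    }

Of course, this is exactly the step that needs the a-priori knowledge of free scratch space mentioned above.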

xrootd was also proposed as an alternative solution. Certainly dCache, CASTOR and DPM all now provide an implementation of the xrootd protocol in addition to native dcap/rfio, so getting it deployed at sites would be relatively trivial (some places already have it available for ALICE). I don't know enough about xrootd to comment, but I'm sure if properly configured it would be able to deal with case 1 above. Case 2 is a different matter entirely... It should be noted (perhaps celebrated?) that none of the above problems have to do with SRM2.2.
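For what it's worth, from the job's side the choice of protocol largely comes down to the URL handed to ROOT's TFile::Open, which picks the matching plugin (TDCacheFile, TRFIOFile, or TXNetFile for xrootd) from the prefix. Something like the following, with made-up hostnames and paths:

    #include "TFile.h"

    void openByProtocol()
    {
       // The same job-side call, just different URL schemes (hosts/paths are examples).
       TFile* viaDcap  = TFile::Open("dcap://dcache.example.org/pnfs/lhcb/data/run1234.dst");
       TFile* viaRfio  = TFile::Open("rfio://castor.example.org//castor/cern.ch/lhcb/run1234.dst");
       TFile* viaXroot = TFile::Open("root://xrootd.example.org//lhcb/run1234.dst");
       // ... read events from whichever handle the site actually supports ...
    }

So a switch to xrootd would mostly be a catalogue/configuration change rather than a change to the jobs themselves.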

Of course, LHCb only require disk at Tier-1s, so none of this applies to Tier-2 sites. Also, they reported that they saw no problems at RAL: well done guys!

In addition, the computing team have completed a large part of the stripping that the physics planning group have asked for (but this isn't really storage related).

3 comments:

Greig said...

I should add that we have tested POSIX access to storage (see our contribution to CHEP last year), but this was from the point of view of studying the performance of the server as the number of simultaneous clients increased. All of the clients only had files open for a short time. What was not tested (nor do I know of anyone who has done this in a controlled way) is the performance when client connections stay open for many hours.

Also, I know that ATLAS have been copying all data to the WN local disk but have recently become more interested in POSIX access to disk again. It will be interesting to see what problems they face.

Brian Bockelman said...

You know, most of the OSG sites don't allow 24-hour jobs... call me unimpressed that LHCb don't have frameworks that know how to retry connections, or at least break apart jobs (HEP is still embarrassingly parallel, right?). It reminds me of CMS's current inability to pre-stage files - the limitation is there, but there's no technical reason why it can't be avoided.

Locally, the biggest headache used to be dCache's irritating inability to turn off tape staging at non-tape sites, meaning that an access to a non-existent file resulted in 24 hours of nothing happening. This was mostly resolved by the SRM stager work I did a while back.

Greig said...

LHCb can split jobs OK; it's just that the reconstruction jobs they are running really do take 24 hours to complete.

I'm not sure about having a framework for retrying connections since these jobs are basically just using ROOT's TDCacheFile to read the data. Surely that's where the retry logic should be placed. Are CMS doing something different?
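(For illustration, a retry around the open itself might look something like the sketch below; the retry count and wait time are arbitrary choices, and a connection dropped mid-read would still need handling wherever the read actually fails.)

    // A minimal sketch of retrying the open via ROOT's generic TFile::Open,
    // which hands dcap:// URLs to TDCacheFile behind the scenes.
    #include "TFile.h"
    #include "TSystem.h"

    TFile* openWithRetry(const char* url, int maxAttempts = 3, int waitSeconds = 60)
    {
       for (int attempt = 1; attempt <= maxAttempts; ++attempt) {
          TFile* f = TFile::Open(url);
          if (f && !f->IsZombie()) return f;     // opened cleanly
          delete f;                              // discard a failed/zombie handle
          if (attempt < maxAttempts) gSystem->Sleep(waitSeconds * 1000);  // milliseconds
       }
       return nullptr;   // let the caller decide whether to fail the job
    }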