26 June 2007

DPM 1.6.5-1 in PPS

v1.6.5-1 of DPM is now in pre-production. The relevant savannah page is here:

https://savannah.cern.ch/patch/index.php?1179

This release involves various bug fixes. What is interesting is that it will now be possible to set ACLs on DPM pools, rather than just limiting a pool to either a single VO or all VOs. This should make sites happy. The previous posting on this version of DPM mentioned that gridftpv2 would be used, but the release notes don't mention this, so we will have to wait and see.

Also out in PPS is the use of v1.3 of the GLUE schema. This is really good news since GLUE 1.3 will allow SEs to properly publish information about SRM2.2 storage spaces (e.g. Edinburgh has 3TB of ATLAS_AOD space).

https://savannah.cern.ch/patch/index.php?980

21 June 2007

DPM 1.5.10 -> 1.6.4 upgrade path broken in YAIM

As reported at yesterday's storage meeting, the upgrade path from DPM 1.5.10 to 1.6.4 is broken in YAIM 3.0.1-15. The different versions of DPM require database schema upgrades in order to handle all of the SRM2.2 stuff (space reservation etc). YAIM should contain appropriate scripts to perform these upgrades, but it appears that the appropriate code has been removed, meaning that it is no longer possible to move from schema versions 2.[12].0 in v1.5.10 of DPM to schemas 3.[01].0 in v1.6.4. We stumbled upon this bug when I asked Cambridge to upgrade to the latest DPM in an attempt to resolve the intermittent SAM failures that they were experiencing. A fairly detailed report of what was required to solve the problem can be found in this ticket:

https://gus.fzk.de/pages/ticket_details.php?ticket=23569
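
The version constraint described above can be sketched as a small check. This is only an illustration and not YAIM's actual logic: the function name is hypothetical, and the only facts taken from this post are the schema version patterns 2.[12].0 (DPM 1.5.10) and 3.[01].0 (DPM 1.6.4).

```python
import re

# Schema version patterns named in the post; everything else is illustrative.
OLD_SCHEMAS = re.compile(r"2\.[12]\.0")  # shipped with DPM 1.5.10
NEW_SCHEMAS = re.compile(r"3\.[01]\.0")  # expected by DPM 1.6.4


def needs_schema_upgrade(schema_version: str) -> bool:
    """Return True if the database is still on a 1.5.10-era schema."""
    if NEW_SCHEMAS.fullmatch(schema_version):
        return False
    if OLD_SCHEMAS.fullmatch(schema_version):
        return True
    # Anything else is outside the upgrade path discussed here.
    raise ValueError(f"unrecognised schema version: {schema_version}")
```

A site admin could run a check like this before upgrading, so that a database still on a 2.x schema is caught before the new daemons are started.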

It should be noted that for some reason (a bug in a YAIM script?) the Cambridge DPM was missing two tables from the dpm_db database. These were dpm_fs and dpm_getfilereq (I think). This severely hindered the upgrade: the schema upgrade itself was successful, but then the DPM wouldn't start. What worked in the end was a restore of the database backup, then an upgrade to DPM 1.6.3, and then on to the newer DPM (I'm keeping a close eye on the SAM tests...). Sites should be aware that they may need to follow the steps detailed in this link while performing the database upgrade.

https://twiki.cern.ch/twiki/bin/view/LCG/DpmSrmv2Support
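
Given the missing tables at Cambridge, it may be worth checking the database before starting. Here is a minimal sketch: the function name is hypothetical, the two table names are the ones I quoted above (and wasn't sure of), and a real dpm_db holds more tables than these.

```python
# Tables reported as missing at Cambridge in this post (names uncertain);
# a real dpm_db contains more tables than these two.
EXPECTED_TABLES = {"dpm_fs", "dpm_getfilereq"}


def missing_tables(present):
    """Return the expected tables that are absent from `present`, sorted."""
    return sorted(EXPECTED_TABLES - set(present))
```

Assuming a MySQL backend, the input could be the table names returned by `SHOW TABLES` on dpm_db. An empty result means the tables named here are all present.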

After the installation, the srmv2.2 daemon was running and the SRM2.2 information was being published by the BDII. This is all good. If you end up using YAIM 3.0.1-16, it should not be necessary to manually install the host certificates for the edguser.

In summary, the 1.5.10 to 1.6.4 upgrade was a lot of work. Thanks to Santanu for giving me access to the machine. This problem raises issues about sites keeping up to date with the latest releases of middleware. Although there were problems with the configuration of 1.6.4, v1.6.3 has been stable in production for a while now. I'm not really sure why some sites hadn't upgraded to that. It would be great if every site could publish the version of the middleware that they are using. In fact, such a feature may be coming very soon. Just watch this space.

08 June 2007

Anyone for a DPM filesystem?

Looks like someone at CERN is developing a mechanism to enable DPM servers to be mounted. This DPMfs could be used as a simple DPM browser, presenting the namespace in a more user-friendly form than the DM command line utilities. DPMfs is implemented using the FUSE kernel module interface. The file system calls are forwarded to a daemon, which communicates with the DPM servers using the rfio and dpns APIs and sends the answer back to the kernel.

It's in development and not officially supported:

https://twiki.cern.ch/twiki/bin/view/LCG/DPMfs

06 June 2007

DPM 1.6.5 Coming...

It's not here yet, but DPM 1.6.5 has been tagged for release as part of gLite 3.1. The goodies in this release are:

- remove expired spaces
- avoid crash in dpm_errmsg/Cns_errmsg when supplied buffer is too small (GGUS ticket 21767)
- correct processing of rfio_access on DPM TURLs (Atlas)
- return DPM version in otherInfo field of srmPing response
- dpm-shutdown: take "server" into account
- add methods ping and getifcevers in LFC/DPM
- fixed bug #25830: add ACLs on disk pools
- dpm-qryconf: add option --group to display groupnames instead of gids
- dpm-qryconf: add option --proto to display supported protocols
- fixed bug #25810: dpm-qryconf: add option --si to display sizes in power of 10
- implement recursive srmLs and srmRmdir
- DPM-DSI plug-in for the GT4 gridftp2 server

The gridftp v2 server looks to be rather an interesting development.

The patch has all the details.
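
To see why the --si option in the list above matters, here is a rough sketch of the difference between sizes in powers of 10 and powers of 2. It does not reproduce dpm-qryconf's real output format; the function name and the formatting are my own assumptions.

```python
def format_size(num_bytes: int, si: bool = False) -> str:
    """Format a byte count using powers of 10 (si=True) or powers of 2."""
    base = 1000 if si else 1024
    if si:
        units = ["B", "kB", "MB", "GB", "TB", "PB"]
    else:
        units = ["B", "KiB", "MiB", "GiB", "TiB", "PiB"]
    value = float(num_bytes)
    for unit in units:
        # Stop at the first unit where the value fits, or at the largest unit.
        if value < base or unit == units[-1]:
            return f"{value:.2f} {unit}"
        value /= base


print(format_size(3 * 10**12, si=True))  # 3.00 TB
print(format_size(3 * 10**12))           # 2.73 TiB
```

The same pool shows up as roughly 9% smaller in binary units at the terabyte scale, which is the sort of discrepancy that causes confusion when comparing against what a site thinks it has deployed.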

SAM test failures explained

Here's the story: The past couple of weeks have been pretty bad for SAM. There have been at least 3 big problems with the service due to backend database issues, moving to new hardware, etc. In amongst all of this, the certificate of the user who runs the SAM tests expired (I don't know what happened to the CA warning a month before). It was decided to implement a quick fix by using a different user's certificate to submit the tests. This was OK for a while, until the ops replica management tests tried to create a new ops/generated/YYYY-MM-DD directory early on Saturday morning. This was fine for dCache sites, but DPM sites suffered because the DPM did not map the new certificate DN + VOMS attributes to a virtual gid that had permission to create these generated directories. This was the source of the "permission denied" errors that were being reported by lcg-cr. Once sites updated the ACLs on the ops/generated directories, the new certificate DN + VOMS attributes were authorised to write a new directory and the tests started to pass again.
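
The failure mode can be illustrated with a toy model. This is not DPM's code: the attribute strings, the gid numbers and the function name are all invented, and the only idea taken from the story above is that a VOMS attribute maps to a virtual gid which must appear in the directory's ACL.

```python
# Toy model only: attribute strings and gid numbers are invented.
VIRTUAL_GIDS = {
    "/ops": 101,               # attributes of the original certificate
    "/ops/Role=example": 102,  # attributes of the replacement certificate
}


def may_write(voms_attribute, write_acl):
    """True if the attribute's virtual gid appears in the directory's write ACL."""
    return VIRTUAL_GIDS[voms_attribute] in write_acl


generated_acl = {101}  # ACL on ops/generated before the change

old_cert = may_write("/ops", generated_acl)               # allowed
new_cert = may_write("/ops/Role=example", generated_acl)  # "permission denied"

generated_acl.add(102)  # the site updates the ACL
after_fix = may_write("/ops/Role=example", generated_acl)  # allowed again
```

Nothing is wrong with the new certificate itself; it simply maps to a gid that the existing directory ACL has never heard of.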

As an aside, the initial errors pointed to a permissions problem on the LFC, but this was a red herring. This is another example of the poor error messages that are reported by grid middleware.

04 June 2007

DPM sites failing SAM due to change in ops VOMS role

The SAM people changed the VOMS role of the certificate being used to run the ops SAM tests. This led to the majority of DPM sites on the grid failing the replica management tests over the weekend. Why they made this change (with no announcement) on a Friday is unknown. Graeme's got some information here:

http://scotgrid.blogspot.com/2007/06/sam-tests-changed-voms-role-without.html

All UK DPM sites were failing with the exception of RHUL and Brunel (well done Duncan). All of these sites should run the script that was posted to LCG-ROLLOUT as this will alter the ACLs on the generated directories appropriately.

The other annoying thing is that this wouldn't have happened if all sites were running DPM 1.6.4 (which supports secondary groups). The problem is that this release is broken (due to 2 different problems) meaning that no one is running it!