06 June 2007

SAM test failures explained

Here's the story: the past couple of weeks have been pretty bad for SAM. There have been at least three big problems with the service, caused by backend database issues, the move to new hardware, etc. In amongst all of this, the certificate of the user who runs the SAM tests expired (I don't know what happened to the CA warning a month before). It was decided to implement a quick fix by using a different user's certificate to submit the tests. This was OK for a while, until the ops replica management tests tried to create a new ops/generated/YYYY-MM-DD directory early on Saturday morning. This was fine for dCache sites, but DPM sites suffered because the DPM did not map the new certificate DN + VOMS attributes to a virtual gid with permission to create directories under ops/generated. This was the source of the "permission denied" errors that were being reported by lcg-cr. Once sites updated the ACLs on the ops/generated directories, the new certificate DN + VOMS attributes were authorised to create a new directory and the tests started to pass again.
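To make the failure mode concrete, here is a toy Python model of what happened. It is only a sketch under my own assumptions, not DPM code: the class names, the credential strings and the gid numbering are all made up for illustration. The idea is simply that each set of credentials maps to its own virtual gid, and a directory is writable only by its owning gid or by gids added to its ACL.

```python
from dataclasses import dataclass, field


@dataclass
class Directory:
    """Toy directory: an owning gid plus extra gids granted write access via an ACL."""
    path: str
    owner_gid: int
    acl_write_gids: set = field(default_factory=set)

    def can_write(self, gid: int) -> bool:
        return gid == self.owner_gid or gid in self.acl_write_gids


class VirtualGidMap:
    """Hypothetical mapping from a credential string to a virtual gid."""

    def __init__(self):
        self._gids = {}
        self._next = 101  # arbitrary starting value for the illustration

    def gid_for(self, credentials: str) -> int:
        # Unknown credentials get a fresh gid with no permissions anywhere yet.
        if credentials not in self._gids:
            self._gids[credentials] = self._next
            self._next += 1
        return self._gids[credentials]


def create_subdir(parent: Directory, gid: int, name: str) -> str:
    """Return the new path, or raise if the gid may not write to the parent."""
    if not parent.can_write(gid):
        raise PermissionError(f"permission denied: {parent.path}/{name}")
    return f"{parent.path}/{name}"


gids = VirtualGidMap()
old_gid = gids.gid_for("old-test-credentials")  # placeholder, not a real DN
new_gid = gids.gid_for("new-test-credentials")  # placeholder, not a real DN
generated = Directory("ops/generated", owner_gid=old_gid)

# The replacement certificate maps to a different gid, so the create fails.
try:
    create_subdir(generated, new_gid, "YYYY-MM-DD")
except PermissionError as err:
    print(err)

# The site admin updates the ACL, after which the same request succeeds.
generated.acl_write_gids.add(new_gid)
print(create_subdir(generated, new_gid, "YYYY-MM-DD"))
```

In this model, existing directories keep working for the old credentials, which is why nothing broke until the tests needed to create a brand new dated directory.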

As an aside, the initial errors pointed to a permissions problem on the LFC, but this was a red herring. It is yet another example of the poor error messages reported by grid middleware.
