23 August 2007

DPM vulnerability

Another security hole in the DPM gridftp server has been found and subsequently patched.

All details of the update can be found here.

The security advisory issued by the GSVG can be found here.

All DPM sites should use YAIM (or their method of choice) to upgrade to the latest version (DPM-gridftp-server-1.6.5-6) ASAP. Depending on how regularly you have been updating, there may also be new rpms available for other components of the DPM (all of these are on 1.6.5-5).
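
A quick, hedged sketch of how you might check whether the upgrade has taken effect: query the installed rpm versions and compare the gridftp server against 1.6.5-6. The package list is an assumption based on the versions quoted above, and Python is just being used as a convenient wrapper around rpm.

    import subprocess

    # Package names are assumptions based on the versions quoted in this post;
    # adjust for the components actually installed at your site.
    PACKAGES = ["DPM-gridftp-server", "DPM-server-mysql", "DPM-client"]

    def installed_version(pkg):
        """Return 'version-release' for an installed rpm, or None if it is absent."""
        proc = subprocess.run(
            ["rpm", "-q", "--qf", "%{VERSION}-%{RELEASE}", pkg],
            capture_output=True, text=True)
        return proc.stdout.strip() if proc.returncode == 0 else None

    if __name__ == "__main__":
        for pkg in PACKAGES:
            ver = installed_version(pkg)
            if pkg == "DPM-gridftp-server" and ver != "1.6.5-6":
                print("%s is %s -- expected 1.6.5-6, upgrade needed" % (pkg, ver))
            else:
                print("%s: %s" % (pkg, ver or "not installed"))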

sgm and prod pool accounts

I've been a bit confused of late about the best way to deal with sgm and prod pool accounts on the SEs, in particular dCache. As an example, Lancaster have run into the problem where a user with an atlassgm proxy has copied files into the dCache and has correspondingly been mapped to atlassgm:atlas (not atlassgm001 etc., just plain old sgm). Non-sgm users have then tried to remove these files from the dCache and have been denied, since they are simple atlas001:atlas users and the default dCache file permissions do not allow group write access. This raises a few issues:

1. Why is atlassgm being used to write files into the dCache in the first place?

2. Why are non-sgm users trying to remove files that were placed into the dCache by a (presumably privileged) sgm user?

3. When will dCache have ACLs on the namespace to allow different groups of users access to a bunch of files?

The answer to the 3rd point is that ACLs will be available some time next year when we (finally) get Chimera, the namespace replacement for PNFS. ACLs come as a plugin to Chimera.

The interim solution appears to be to map all atlas users to atlas001:atlas, but this obviously undermines the security and traceability that pool accounts are partly there to provide. Since DPM supports namespace ACLs, we should be OK with supporting sgm and prod pool accounts there. Of course, this requires that everyone has appropriately configured ACLs, which isn't necessarily the case, as we've experienced before.
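
To make the Lancaster example concrete, here is a toy model (not dCache code, just the standard POSIX logic it follows) of why a plain atlas001:atlas user cannot touch a file written under the atlassgm mapping when there is no group-write bit. All names and modes are illustrative.

    from collections import namedtuple

    FileEntry = namedtuple("FileEntry", "owner group mode")   # mode as an octal int
    User = namedtuple("User", "name groups")

    def may_write(user, entry):
        """Simplified POSIX check: owner bits, then group bits, then 'other' bits."""
        if user.name == entry.owner:
            return bool(entry.mode & 0o200)
        if entry.group in user.groups:
            return bool(entry.mode & 0o020)
        return bool(entry.mode & 0o002)

    # A file written via an atlassgm proxy, with a default 644-style mode.
    f = FileEntry(owner="atlassgm", group="atlas", mode=0o644)
    print(may_write(User("atlassgm", {"atlas"}), f))   # True  - the sgm user can write
    print(may_write(User("atlas001", {"atlas"}), f))   # False - same group, but no group write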

Comments welcome below.

22 August 2007

Storage accounting - new and improved


Over the past week or so we have made some improvements to the storage "accounting" portal (many thanks to Dave Kant). The new features are:

1. "Used storage per site" graphs are now generated (see image). This shows the breakdown of resources per site, which is good when looking at the ROC or Tier-2 views.

2. "Available storage per VO" graphs are generated in addition to the "Used" plots that we've always had. This comes with the usual caveats of available storage being shared among multiple VOs.

3. There is a Tier-2 hierarchical tree, so that you can easily pick out the Tier-2s of interest.

4. A few minor tweaks and bug fixes.

Current issues are tracked in Savannah.

The page is occasionally slow to load as the server is also used by the GOC to provide RB monitoring of the production grid. Options for improving the speed are being looked at.

15 August 2007

CE-sft-lcg-rm-free released!

A new SAM test is now in production. It does a BDII lookup to check that there is sufficient space on the SE before attempting to run the standard replica management tests. This is good news for sites whose SEs fill up with important experiment data: if the test finds that there is no free space, then the RM tests don't run. Of course, this requires that the information being published into the BDII is correct in the first place. I'll need to check whether the system could be abused by sites that publish 0 free space by default, thereby bypassing the RM tests and any failures that could occur. I suppose that GStat already reports sites as being in a WARNING status when they have no free space.
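
For illustration, here is a rough sketch of the kind of BDII lookup involved: ask the information system how much space an SE is advertising and skip the RM tests if it claims zero. The BDII host, SE name and LDAP filter are assumptions for the sake of the example, not the actual SAM implementation.

    import subprocess

    BDII = "lcg-bdii.cern.ch:2170"          # top-level BDII (illustrative choice)
    SE = "se.example.ac.uk"                 # hypothetical SE hostname

    def advertised_free_space_kb(se):
        """Sum GlueSAStateAvailableSpace (kB) over the storage areas of one SE."""
        out = subprocess.run(
            ["ldapsearch", "-x", "-LLL", "-H", "ldap://" + BDII, "-b", "o=grid",
             "(&(objectClass=GlueSA)(GlueChunkKey=GlueSEUniqueID=%s))" % se,
             "GlueSAStateAvailableSpace"],
            capture_output=True, text=True).stdout
        return sum(int(line.split(":")[1]) for line in out.splitlines()
                   if line.startswith("GlueSAStateAvailableSpace:"))

    if __name__ == "__main__":
        free = advertised_free_space_kb(SE)
        print("skip RM tests" if free == 0 else "run RM tests (%d kB advertised)" % free)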

See the related post here.

14 August 2007

CLOSE_WAIT strikes again


Multiple DPM sites are reporting instabilities in the DPM service. The symptom is massive resource usage by multiple dpm.ftpd processes on the disk servers (running v1.6.5 of DPM). These have been forked by the main gridftp server process to deal with client requests. Digging a little further, we find that the processes are responsible for many CLOSE_WAIT TCP connections between the DPM and the RAL FTS server. It also happens that all of the dpm.ftpd processes are owned by the atlassgm user, but I think this is only because ATLAS are the main (only?) VO using FTS to transfer data at the moment.

CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection.

Durham, Cambridge, Brunel and Glasgow have all seen this effect. The problem is so bad at Durham that they have written a cron job that kills off the offending dpm.ftpd processes at regular intervals. Glasgow haven't been hit too badly, but then they do have 8GB of RAM on each of their 9 disk servers!
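
For anyone wanting to check their own disk servers, here is a small diagnostic sketch (not Durham's actual cron job) that counts how many CLOSE_WAIT connections each dpm.ftpd process is holding. It just parses netstat output, and needs root to see other users' processes.

    import subprocess
    from collections import Counter

    def close_wait_per_process(process_name="dpm.ftpd"):
        """Count CLOSE_WAIT TCP connections per matching process, via 'netstat -tnp'."""
        out = subprocess.run(["netstat", "-tnp"], capture_output=True, text=True).stdout
        counts = Counter()
        for line in out.splitlines():
            if "CLOSE_WAIT" in line and process_name in line:
                pid_prog = line.split()[-1]          # last column is "PID/Program name"
                counts[pid_prog] += 1
        return counts

    if __name__ == "__main__":
        for pid_prog, n in close_wait_per_process().most_common():
            print("%-20s %d connections in CLOSE_WAIT" % (pid_prog, n))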

The DPM and FTS developers have been informed. From emails I have seen it appears that the DPM side is at fault, although the root cause is still not understood. This situation is very reminiscent of the CLOSE_WAIT issues that we were seeing with dCache at the end of last year.

Also see here.

DPM and xrootd

Following on from dCache, DPM is also developing an xrootd interface to the namespace. xrootd is the protocol (developed by SLAC) that provides POSIX access to their Scalla storage system, whose other component is the olbd clustering server.

DPM now has a usable xrootd interface. This will sit alongside the rfiod and gridftp servers. Currently, the server has some limitations (as described by A. Peters at CERN):

* The xrootd server runs as a single 'DPM' identity; all file reads and writes are done on behalf of this identity. However, it can be restricted to read-only mode.

* There is no support for certificate/proxy mapping.

* Every file open induces a delay of 1s, as the interface is implemented as an asynchronous olbd Xmi plugin with polling.

On a short timescale, the certificate support in xrootd will be fixed and VOMS roles added (currently certificate authentication is broken for certain CAs). After that, the DPM interface can be simplified to use certificates/VOMS proxies and run as a simple xrootd OFS plugin without the need for an olbd setup.

So it seems that xrootd is soon going to be available across the Grid. I'm sure that ALICE (and maybe some others...) will be very interested.

06 August 2007

dCache on SL4

As part of our planned upgrade to SL4 at Manchester, we've been looking at getting dCache running.
The biggest stumbling block is the lack of a glite-SE_dcache* profile; luckily, it seems that all of the needed components apart from dcache-server are in the glite-WN profile. Even the GSIFtp Door appears to work.

05 August 2007

SRMv2.2 directory creation

Just discovered that automatic directory creation doesn't happen with SRMv2.2; directories are created automatically when using SRMv1.
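
One possible workaround, sketched under the assumption that the FNAL SRM client tools are installed, is to create the target directory explicitly with srmmkdir before doing an SRMv2.2 copy. The endpoint and path below are hypothetical.

    import subprocess

    # Hypothetical SRMv2.2 SURL; substitute your own SE and path.
    TARGET_DIR = "srm://se.example.ac.uk:8443/dpm/example.ac.uk/home/dteam/newdir"

    def ensure_remote_dir(srm_url):
        """Try to create the remote directory; a failure may just mean it already exists."""
        proc = subprocess.run(["srmmkdir", srm_url], capture_output=True, text=True)
        if proc.returncode != 0:
            print("srmmkdir returned %d: %s" % (proc.returncode, proc.stderr.strip()))

    if __name__ == "__main__":
        ensure_remote_dir(TARGET_DIR)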

02 August 2007

Annoyed

I re-ran YAIM yesterday on the test DPM I've got at Edinburgh as it turned out we were not publishing the correct site name. Annoyingly, this completely broke information publishing as the BDII couldn't find the correct schema files (again). I had to re-create the symbolic link from /opt/glue/schemas/ldap to /opt/glue/schemas/openldap2.0 and then double check that all was well with the /opt/bdii/etc/schemas files. A restart of the BDII then sorted things out.
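
For the record, a minimal sketch of that manual fix (re-create the symlink, then bounce the BDII). The init script name is an assumption; check what YAIM installed on your box.

    import os
    import subprocess

    LINK = "/opt/glue/schemas/ldap"
    TARGET = "/opt/glue/schemas/openldap2.0"

    def fix_schema_link():
        """Point the schema symlink back at the openldap2.0 directory."""
        if os.path.islink(LINK) and os.readlink(LINK) == TARGET:
            return  # already pointing at the right place
        if os.path.lexists(LINK):
            os.remove(LINK)
        os.symlink(TARGET, LINK)

    if __name__ == "__main__":
        fix_schema_link()
        subprocess.run(["/sbin/service", "bdii", "restart"])  # assumed service name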

It's not really fair to blame YAIM here since I'm running the SL3 build of DPM on SL4, which isn't really supported. Well, I'm hoping that's the source of the trouble.

01 August 2007

Non-improvement of SAM tests

For a while I have been pushing for the creation of a SAM test that only probes the SRM and does not depend on any higher level services (like the LFC or BDII). This would be good as it would prevent sites being marked as unavailable when in fact their SRM is up and running.

Unfortunately, the SAM people have decided to postpone the creation of a pure-SRM test. I don't really understand their concerns. I thought using srmcp with a static (but nightly updated) list of SRM endpoints would have been sufficient. I guess they have some reservations about using the FNAL srmcp client, since it isn't lcg-utils/GFAL, which are the official storage access methods.
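
For what it's worth, the sort of probe I had in mind is sketched below: loop over a static list of SRM endpoints and attempt an srmcp of a small local file to each, with no LFC or BDII involved. The endpoint list and target paths are hypothetical, and it assumes the FNAL srmcp client is on the PATH.

    import subprocess

    # Hypothetical static list of SURLs to probe (in practice, nightly updated).
    ENDPOINTS = [
        "srm://se1.example.ac.uk:8443/dpm/example.ac.uk/home/dteam/sam-test",
        "srm://se2.example.ac.uk:8443/pnfs/example.ac.uk/data/dteam/sam-test",
    ]

    def probe(surl, local_file="/tmp/srm-test-file"):
        """Return True if srmcp manages to copy local_file to the SURL."""
        # Some srmcp versions are fussy about the file URL form (file://// vs file:///).
        proc = subprocess.run(["srmcp", "file:///" + local_file.lstrip("/"), surl],
                              capture_output=True, text=True)
        return proc.returncode == 0

    if __name__ == "__main__":
        for surl in ENDPOINTS:
            print("%-70s %s" % (surl, "OK" if probe(surl) else "FAILED"))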

https://savannah.cern.ch/bugs/?25249