29 May 2007

Update on the classic SE

As reported at last week's WLCG operations meeting, the classic SE is now completely frozen (not that it affects any GridPP sites, since we are all SRM-ified). There are a few points to note about this:

Are there things not in the classic SE that people want added? Any such features should already be in DPM, so moving to DPM is an acceptable upgrade.

Are there features in the classic SE that are not available in DPM or dCache? The obvious answer is real POSIX mounting of the file system, probably via NFS. Both DPM and dCache are looking at supporting NFSv4, and both projects have indicated that it might be available by the end of the year or sooner.

Will the classic SE be included in the upcoming gLite release? The answer is yes, it will appear in gLite 3.1. Once there is a DPM with NFSv4 support then this will be re-evaluated.

Storage security service challenge

WLCG are asking each ROC to run a security service challenge against their sites. Someone (probably Alessandra) will submit a job to each site which will attempt to write a file to the local SE, read it back, copy it to a remote SE, delete the file... Once complete, the submitter will issue a GGUS ticket against the site, asking them to provide information on which operations were performed on the file. You can see an example of what is expected here:

https://gus.fzk.de/pages/ticket_details.php?ticket=22012

The aim of this testing is to determine if SEs record sufficient information for tracing user operations and also to check that site admins are able to gather that information. I am currently putting together some scripts that will perform the querying and parsing of DPM/dCache databases and log files in order to gather the information.

In addition to going through the SE files, it is likely that sites will have to parse the PBS logs on the CE to determine the UI that was originally used for the job submission.
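
As a rough idea of the log-file half of that job, here is a minimal sketch that searches a set of log files for the name of the challenge file and prints every matching line with its file name and line number. The function name trace_file is made up for illustration, and the log paths in the usage comment are assumptions that will differ between DPM, dCache and PBS installations. It does not touch the databases at all.

```sh
#!/bin/sh
# Sketch only: trace_file is a hypothetical helper, not part of any
# DPM or dCache tooling.

# Print every line mentioning a file name, prefixed with the log file
# it came from and the line number.
#   $1     : fixed string to look for (e.g. the test file name)
#   $2 ... : log files to search
trace_file() {
    pattern="$1"
    shift
    # -F: treat the pattern as a plain string, not a regular expression
    # -H: always print the file name, -n: print the line number
    grep -H -n -F -- "$pattern" "$@"
}

# Usage (paths are assumptions, adjust to your own SE and CE):
# trace_file challenge-file-name /var/log/dpm/log /var/log/dpns/log
```

The timestamps and DNs on the matching lines can then be lined up by hand against the operations listed in the ticket.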

22 May 2007

DPM 1.6.4 and SL4

I have just come across a problem with running DPM v1.6.4 on SL4. Well, it's not actually a problem with DPM itself, but rather a problem with the BDII that it is now using as an information provider. SL4 comes with openldap v2.2 which appears to have stricter schema checking than openldap v2.0 (which comes with SL3). This causes problems like this:

$ ldapsearch -LLL -x -H ldap://wn4.epcc.ed.ac.uk:2170 -b mds-vo-name=resource,o=grid
Invalid DN syntax (34)
Additional information: invalid DN

This means that your SE can't publish anything about itself. This can be resolved by adding this block of code

attributetype ( 1.3.6.1.4.1.3536.2.6.1.4.0.1
NAME 'Mds-Vo-name'
DESC 'Locally unique VO name'
EQUALITY caseIgnoreMatch
ORDERING caseIgnoreOrderingMatch
SUBSTR caseIgnoreSubstringsMatch
SYNTAX 1.3.6.1.4.1.1466.115.121.1.44
SINGLE-VALUE
)

to /opt/glue/schema/ldap/Glue-CORE.schema and then restarting the ldap and bdii processes. This is covered by this bug:

https://savannah.cern.ch/bugs/index.php?15532
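
If you have several SL4 nodes to patch, the edit can be scripted. The following is a minimal sketch, assuming the schema file lives at the path given above. The function name add_mds_vo_name is hypothetical. It only appends the definition if the attribute is not already defined, so it should be safe to run twice. It does not restart anything, so the ldap and bdii processes still need restarting afterwards.

```sh
#!/bin/sh
# Sketch only: add_mds_vo_name is a hypothetical helper.

# Append the Mds-Vo-name attribute type to a schema file, unless the
# file already defines it.
#   $1 : schema file to patch
add_mds_vo_name() {
    schema_file="$1"
    # Skip the edit if the attribute is already defined.
    if grep -qi "NAME 'Mds-Vo-name'" "$schema_file"; then
        echo "already present: $schema_file"
        return 0
    fi
    cat >> "$schema_file" <<'EOF'

attributetype ( 1.3.6.1.4.1.3536.2.6.1.4.0.1
    NAME 'Mds-Vo-name'
    DESC 'Locally unique VO name'
    EQUALITY caseIgnoreMatch
    ORDERING caseIgnoreOrderingMatch
    SUBSTR caseIgnoreSubstringsMatch
    SYNTAX 1.3.6.1.4.1.1466.115.121.1.44
    SINGLE-VALUE
    )
EOF
    echo "added to: $schema_file"
}

# Usage (take a copy of the file first):
# add_mds_vo_name /opt/glue/schema/ldap/Glue-CORE.schema
```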

17 May 2007

DPM 1.6.4 released (with a few problems)

DPM v1.6.4 was released into production this week. First of all, there are a few points to be aware of:

1. This release requires an update of the v1.6.3 DB schema. **YAIM will take care of this for you**. It is not necessary to run the DB migration script by hand.

2. Two new YAIM variables, DPM_DB and DPNS_DB, are introduced.

3. After the reconfiguration, DPM will use the BDII as an information provider instead of Globus MDS. By default the BDII runs on port 2170 whereas globus-mds was on 2135. You need to change the site-info.def variable to this (so that the site BDII looks in the right place):

BDII_SE_URL="ldap://$DPM_HOST:2170/mds-vo-name=resource,o=grid"

4. YAIM does some tweaking of the /etc/sysctl.conf values. The old values are copied to /etc/sysctl.conf.orig if you want to reinstate them.
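
Points 3 and 4 are easy to check after the reconfiguration. Below is a minimal sketch, assuming ldapsearch and diff are installed on the node. The function names are hypothetical. The first check queries the BDII on port 2170 with the same base DN that the site BDII will use, and the second shows what YAIM changed in /etc/sysctl.conf.

```sh
#!/bin/sh
# Sketch only: all function names here are hypothetical.

# Build the LDAP URL of the BDII on an SE (port 2170 unless told otherwise).
#   $1 : host name, $2 : optional port
bdii_url() {
    printf 'ldap://%s:%s' "$1" "${2:-2170}"
}

# Query the BDII on the SE using the base DN from BDII_SE_URL.
#   $1 : host name of the DPM head node
check_bdii() {
    ldapsearch -LLL -x -H "$(bdii_url "$1")" -b mds-vo-name=resource,o=grid
}

# Show the difference between the saved sysctl.conf and the current one.
# The file arguments are optional and default to the paths named above.
show_sysctl_changes() {
    diff -u "${1:-/etc/sysctl.conf.orig}" "${2:-/etc/sysctl.conf}"
}

# Usage:
# check_bdii "$DPM_HOST"
# show_sysctl_changes
```

If check_bdii returns nothing, or an error, the site BDII will have nothing to pick up either.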

However, once the release was announced, a couple of problems soon reared their heads:

a) Sites were recommended not to upgrade due to a problem left over from the build

http://glite.web.cern.ch/glite/packages/R3.0/updates.asp

For sites that had already upgraded, the fix was this:

mkdir -p /home/glbuild/GLITE_3_0_3_RC1_DATA/stage/etc
ln -s /opt/lcg/etc/lcgdm-mapfile \
/home/glbuild/GLITE_3_0_3_RC1_DATA/stage/etc

b) With the latest update the info provider of the DPM machines has changed from MDS to BDII. However the YAIM ( -15) coming with the update does not configure edguser's certificate.

The fix was to perform these steps manually:

mkdir -p ~edguser/.globus
chown edguser:edguser ~edguser/.globus
cp /etc/grid-security/hostcert.pem ~edguser/.globus/usercert.pem
cp /etc/grid-security/hostkey.pem ~edguser/.globus/userkey.pem
chown edguser:edguser /home/edguser/.globus/user*
chmod 400 /home/edguser/.globus/userkey.pem

Obviously the certification testing isn't quite as water-tight as we would hope.

10 May 2007

Manchester Tier2 dcache goes resilient II

Yesterday we completed the scheduled downtime, and now dcache02 is up and resilient. It's still chewing through the list of files and making copies of them; going by past experience, it will probably finish somewhere around lunchtime tomorrow. It's so nice to know we're not in the dark ages of dcache-1.6.6 any more. Of course, there are still small niggles to iron out and we've yet to really throw a big load at it, but it's looking a lot, lot better.

09 May 2007

DPM 1.6.4-3 on PPS

DPM v1.6.4-3 is now available on the PPS. I would imagine that it will move into production in the next couple of weeks. This version requires a schema change to the dpm_db (3.0.0 -> 3.1.0). YAIM will take care of this for you, although a DB backup is recommended beforehand.
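
For the backup, something along these lines should do. This is a minimal sketch, assuming a MySQL backend, a root MySQL account and the name cns_db for the DPNS database (only dpm_db is named above, so check your own setup). The function names are hypothetical.

```sh
#!/bin/sh
# Sketch only: dump_name and backup_dpm_dbs are hypothetical helpers.

# Build a dump file name from a label and a date stamp.
#   $1 : label, $2 : date stamp
dump_name() {
    printf '%s-%s.sql' "$1" "$2"
}

# Dump both databases into one dated file in the current directory.
# ASSUMPTION: the DPNS database is called cns_db.
backup_dpm_dbs() {
    stamp=$(date +%Y%m%d)
    # -p makes mysqldump prompt for the password
    mysqldump -u root -p --databases dpm_db cns_db \
        > "$(dump_name dpm-backup "$stamp")"
}

# Usage (run on the DPM head node before reconfiguring with YAIM):
# backup_dpm_dbs
```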

We have now moved to YAIM 3.0.1-15, so the installation and configuration steps now look like:

$ /opt/glite/bin/yaim -i -s /opt/glite/yaim/etc/site-info.def -n glite-SE_dpm_mysql

$ /opt/glite/bin/yaim -c -s /opt/glite/yaim/etc/site-info.def -n SE_dpm_mysql

https://savannah.cern.ch/patch/index.php?1121

The information provider plugin is still the *old* one (which does not account for the used space properly). Therefore you will need to install Graeme's new one by hand (again).

http://www.gridpp.ac.uk/wiki/DPM_Information_Publishing#Beta_Release_Plugin

With this version of DPM there is a BDII process on port 2170 that is used to provide the information about the DPM. This replaces globus-mds, which ran on port 2135, as the information provider.

This version of YAIM includes some /etc/sysctl.conf tweaks in the config_DPM_disk function. This is nice (since it could lead to some optimisations) but I think sites should be warned about this beforehand and be allowed to turn off these changes:

https://gus.fzk.de/pages/ticket_details.php?ticket=21713

Anyway, I upgraded from v1.6.3 to v1.6.4 today (on SL4 32bit). No problems so far, but I will let you know if anything comes up.

ZFS performance on RAID

http://milek.blogspot.com/2007/04/hw-raid-vs-zfs-software-raid-part-iii.html

04 May 2007

Manchester Tier2 dcache goes resilient

We're half way through the combined upgrade from dcache-1.6.6-vanilla to dcache1.7.0-with-replica-manager. So far only one of the two head-nodes has been upgraded, but so far so good. The other is scheduled for upgrade next week, and I appear to have scheduled the queue shutdown at 8am on bank-holiday Monday! Documentation will obviously follow, including cfengine snippets for those people that love it.

01 May 2007

Storage meeting: Wednesday 2nd May

Video of DPM and SRM Presentations at HEPiX

Starring Mr Steve Traylen, including Video download! See Steve's Blog for the links.