16 December 2007

Anyone for some dCache monitoring?


The above plots come from some new dCache monitoring that I have set up to study the behaviour of the Edinburgh production storage (srm.epcc.ed.ac.uk). This uses Brian Bockelman's GraphTool and some associated scripts to query the dCache billing database. You can find the full set of plots here (I know, it's a strange hostname for a monitoring service, but it's all that was available):

http://wn3.epcc.ed.ac.uk/billing/xml/

GraphTool is written in Python and uses matplotlib to generate the plots. CherryPy is used for the web interface. The monitoring can't just be installed as an rpm: you need to have PostgreSQL 8.2 available, create a new view in the billing database, set up Apache mod_rewrite, ensure you have the correct compilers installed, and so on. None of these steps should be a problem for anyone, though.
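To give a flavour of the kind of aggregation that sits behind plots like these, here is a minimal Python sketch that sums transferred bytes per day and protocol. It is not GraphTool's code and it does not touch the real billing database: the record layout (timestamp, protocol, bytes), the function name bytes_per_day and the sample values are all made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def bytes_per_day(records):
    """Sum transferred bytes per (day, protocol).

    `records` is an iterable of (timestamp, protocol, nbytes) tuples.
    This layout is an assumption for illustration only; it is not the
    dCache billing schema.
    """
    totals = defaultdict(int)
    for timestamp, protocol, nbytes in records:
        day = timestamp.date().isoformat()
        totals[(day, protocol)] += nbytes
    return dict(totals)


# Made-up sample transfers, not real billing data.
sample_records = [
    (datetime(2007, 12, 15, 9, 30), "gsiftp", 1000000),
    (datetime(2007, 12, 15, 17, 5), "gsiftp", 2500000),
    (datetime(2007, 12, 15, 11, 0), "dcap", 400000),
    (datetime(2007, 12, 16, 8, 45), "dcap", 600000),
]
daily = bytes_per_day(sample_records)
```

A dictionary like this could then be handed to matplotlib (or any other plotting tool) to draw one bar per day and protocol.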

I think you will agree that the monitoring presents some really useful views of what the dCache is actually doing. It's still a work in progress, but let me know when you want to set it up and I should be able to help.

It should be possible to do something similar for DPM in the coming weeks.

10 December 2007

DPM on SL4

Time to break out the champagne: it looks like DPM will be officially released in production on SL4 next Wednesday.

"Based on the feedback from PPS sites, we think that the following
patches can go to production next Wednesday:

# available on the linked GT-PPS ticket(s)

#1349 glite-LFC_mysql metapackage for SLC4 - 3.1.0 PPS Update 10
#1350 glite-SE_dpm_disk metapackage for SLC4 - 3.1.0 PPS Update 10
#1352 glite-SE_dpm_mysql metapackage for SLC4 - 3.1.0 PPS Update 10
#1541 glite-LFC_oracle metapackage for SLC4 - 3.1.0 PPS Update 10
#1370 R3.1/SLC4/i386 DPM/LFC 1.6.7-1 - 3.1.0 PPS Update 10"


Of course, some sites have been running SL3 DPM on SL4 for over a year and others have been running the development SL4 DPM in production for months. One warning: make sure the information publishing is working. I've had a few problems with that in the past (in fact, today I was battling with an incompatible version of perl-LDAP from the DAG repository).

06 December 2007

Storage as seen by SAM



We all need more monitoring, don't we? I knocked up these plots showing the storage SAM test results for the ops VO at GridPP sites over the past month. I am only looking at the SE and SRM tests here, where the result for each day is calculated as the number of successes divided by the total number of tests. The darker green the square, the higher the availability. I think it's clear which sites are having problems.

http://www.gridpp.ac.uk/wiki/GridPP_storage_available_monitoring
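The daily availability figure described above (successes over total tests) is simple enough to sketch in a few lines of Python. This is only an illustration and not the scripts used to make the plots: the (site, day, passed) tuple layout, the function name daily_availability and the site names are assumptions.

```python
from collections import defaultdict


def daily_availability(results):
    """Return {(site, day): fraction of tests that passed}.

    `results` is an iterable of (site, day, passed) tuples. This layout
    is an assumption for illustration, not the SAM data format.
    """
    passed = defaultdict(int)
    total = defaultdict(int)
    for site, day, ok in results:
        total[(site, day)] += 1
        if ok:
            passed[(site, day)] += 1
    # Every key in `total` has at least one test, so no division by zero.
    return {key: passed[key] / total[key] for key in total}


# Made-up sample data: four tests at one site, two at another.
sample = [
    ("SITE-A", "2007-12-01", True),
    ("SITE-A", "2007-12-01", True),
    ("SITE-A", "2007-12-01", False),
    ("SITE-A", "2007-12-01", True),
    ("SITE-B", "2007-12-01", False),
    ("SITE-B", "2007-12-01", False),
]
availability = daily_availability(sample)
```

Each fraction then maps onto a colour scale, with values near 1 drawn as the darkest green.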

We always hear that storage is really unreliable for the experiments, so I was actually quite surprised at the amount of green on the first plot. However, since these results are only for the short-duration ops tests, I don't think they truly reflect the view that the experiments have of storage when they are performing bulk data transfer across the WAN or a large amount of local access from the compute farm.

These plots were generated thanks to some great Python scripts/tools from Brian Bockelman (and others, I think) from OSG. Brian's also got some interesting monitoring tools for dCache sites which I'm having a look at. It would be great if we could use something similar in GridPP.

04 December 2007

dCache 1.8.0-X

A new patch to dCache 1.8.0 was released on Friday (1.8.0-6). In addition, there is now a 1.8.0 dcap client. All rpms can be found here:

http://www.dcache.org/downloads/1.8.0/index.shtml

Sites (apart from Lancaster!) should wait until all of the Tier-1s have upgraded, as there are still some bugs being worked out.

dCache admin scripts

Last week I finally got a chance to have another look at some of the dCache administration scripts that are in the sysadmin wiki [1]. There is a Jython interface to the dCache admin interface, but I find it difficult to use. As an alternative, the guys at IN2P3 have written a Python module that creates a dCache admin door object which you can then use in your own Python scripts to talk to the dCache [2]. One thing that I did was use the rc cleaner script [3] to clean up all of the requests (there were hundreds!) that were stuck in the Suspended state. You can see how the load on the machine running postgres dropped after removing the entries. Highly recommended.

I also wrote a little script to get information from the LoginBroker in order to print out how many doors are active in the dCache. This is essential information for sites that have many doors (e.g. Manchester) but find the dCache web page on port 2288 difficult to use. I'll put it in the SVN repository soon.

[1] http://www.sysadmin.hep.ac.uk/wiki/DCache
[2] http://www.sysadmin.hep.ac.uk/wiki/DCache_python_interface
[3] http://www.sysadmin.hep.ac.uk/wiki/DCache_rc_cleaner