29 January 2008

CASTOR dynamic information publishing

Folks, we now have dynamic data published for CASTOR. Yes, it's something we were supposed to do since the EGEE conference in Pisa(!), but only now has it percolated up through all the other tasks to become a high priority - otherwise the focus has been on CASTOR itself.
For details (and current limitations) of the implementation, please refer to: https://www.gridpp.ac.uk/wiki/RAL_Tier1_CASTOR_Accounting

23 January 2008

CCRC confusion

So, which Tier-2s are involved in the February CCRC exercise? Does anyone know? What about the date that they will get involved? Do they need to have SRM2.2, or not? Some sources suggest they do, others suggest they don't. If anyone knows the answers to these questions, please email me. I think I'll start attending the daily meetings to find out what is going on.

Also, it turns out that although the 1.8.0-12 release of dCache was made sometime last week, it turns out that this is also the CCRC branch of dCache. Do you spot the difference? Good, then we can continue. This explains the weird naming convention that the dCache developers are using for all CCRC related releases, namely 1.8.0-12p1, 1.8.0-12p2 for patch versions 1 and 2 of the 1.8.0-12 branch. Hope that clears things up.

Banned by SAM

Looks like the SAM people weren't very happy with the monitoring that I was running to summarise the status of storage resources on the Grid. In fact, they were so unhappy that they decided to block the IP address that I was using and switch off access to the /sqldb/ query path on their server! As such, the monitoring hasn't been working for the past few days.

I'll admit that I was putting a fair bit of load on their database since I was using a multithreaded application to request information about 6 different tests for ~200 SEs over the past 30 days, but I think this is a bit harsh. Another part of the problem is that the script seemed to trip up when processing information on a couple of sites (and it was always the same sites), leading to it continually requesting the same information. I'd love to debug why this was, but this is somewhat difficult when I can't access the DB.

I'm currently jumping through some hoops to try and get access to the new JSP endpoint. Hopefully this will be granted soon and I can update the application to deal with the new XML schema.

16 January 2008

Monitoring SRM2.2 deployment

Using a combination of the WLCG information system, srmPing from the StoRM SRM client, a multithreaded python script, GraphTool and Apache, I've set up some daily monitoring of all of the SRM v2.2 endpoints that are available on the Grid. This will help track deployment of the new software, allowing us to see which versions are running in production.


Using the cool GraphTool features, you can drill down into the data by typing dCache, CASTOR or DPM in the SRM_flavour box. You can also look at a particular countries endpoints by putting something like .uk in the endpoint box (for all UK sites). Hopefully poor old wn3.epcc can handle all this monitoring that it's doing!

One thing to note is that there is quite a variety of dCache 1.8.0-X out there. What are the developers playing at?

A slightly annoying feature is that srmPing requires a valid proxy, so I'll need to come up with a good way of creating a long-lived one.

Global dCache monitoring

It's been a while since my last post - been busy with some things that I will post about shortly. As you know, I've been running the GridPP SAM storage monitoring for the past few weeks. It looks like the dCache team got wind of this and have asked if I could set up something similar to summarise the SAM test results for all dCache sites in the WLCG information system. This is now done and the results can be seen above and at this new page in the dCache wiki:


I don't seem to be able to get SAM results for the US Tier-2 sites, so I'll need to investigate this. The above link contains other useful information that I'll talk about shortly.

03 January 2008

dCache PostgreSQL monitoring

Happy New Year everyone!

Following a recommendation from the dCache developers, I set up some PostGreSQL monitoring using pgFouine. It is very simple to install and configure. The plots are generated automatically by the pgfouine tool in addition to giving a full breakdown of what queries took the longest time. This should be useful to let you understand what your database (and hence dCache) is doing. You can also use it to analyse the output of the VACUUM (FULL) ANALYSE command which should give you an idea as to how large the FSM should be set to (which is an important configuration parameter).

PS Fouine is French for stone marten, which is something like a weasel.