24 November 2009

Storage workshop discussion

If you have followed the weeklies, you will have noticed that we're discussing holding another storage workshop. The previous one was considered extremely useful, and we want to create a forum for storage admins to come together and share their experiences with Real Data(tm).
Interestingly, we now have (or are close to getting!) experience with technology we haven't used before. For example, does putting your DPM database on an SSD improve performance? Is Hadoop a good option for making use of the storage space on WNs?
We already have a rough agenda. There should be lots of sysadmin-friendly coffee-aided pow-wows. Maybe also some projectplanny stuff, like the implications for us of the end of EGEE, the NGI, GridPP4, and suchlike.
Tentatively, think Edinburgh in February.

23 November 2009

100% uptime for DPM

(and anything else with a MySQL backend).

This weekend, with the ramp-up of jobs through the Grid as a result of some minor events happening in Geneva, we were informed of a narrow window during which jobs failed when accessing Glasgow's DPM.

There were no problems with the DPM itself; it was working according to spec. However, the period correlated with the 15 minutes or so that the MySQL backend takes every night to dump a copy of itself as a backup.

So, in the interests of improving uptime for DPMs to >99%, we enabled binary logging on the MySQL backend (and advise that other DPM sites do so as well, disk space permitting).

Binary logging (enabled by adding the string "log-bin" on its own line to /etc/my.cnf and restarting the service) allows, amongst other things (including "proper" up-to-the-second backups), a MySQL-hosted InnoDB database to be dumped without interrupting service at all, thus removing that short period of dropped communication.
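For reference, here's roughly what that looks like (the config section placement, the backup path and the exact mysqldump options below are illustrative rather than prescriptive, and assume the DPM tables are InnoDB):

    # /etc/my.cnf -- add to the [mysqld] section
    [mysqld]
    log-bin

    # restart MySQL so it picks up the change
    service mysqld restart

    # nightly dump: --single-transaction takes a consistent snapshot of the
    # InnoDB tables without locking them out from the DPM daemons, while
    # --flush-logs and --master-data=2 (which need log-bin enabled) record
    # the binary log position so later transactions can be replayed on top
    # of the dump if a restore is ever needed
    mysqldump --single-transaction --flush-logs --master-data=2 \
        --all-databases > /var/backups/dpm-backup-$(date +%F).sql

The dump runs against a consistent read view, so the daemons carry on talking to the database while it happens.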

(Now any downtime is purely your fault, not MySQL's.)

12 November 2009

Nearly there

The new CASTOR information provider is nearly back: the host is finally up again, but given that it's somewhat late in the day we'd better not switch the information system back until tomorrow. (We are currently running CIP 1.X, without nearline accounting.)

Meanwhile we will, of course, work on a more resilient infrastructure. We were doing that before; it's just that the machine died before we could complete the resilientification.

We do apologise for the inconvenience caused by this incredibly exploding information provider host. I don't know exactly what happened to it, but given that it took a skilled admin nearly three days to get it back, it must have toasted itself fairly thoroughly.

While we're on the subject, a new release is under way for the other CASTOR sites; the current one contains a few RAL-isms that were left in so it could go out before the deadline.

When this is done, work can start on GLUE 2.0. Hey ho.

10 November 2009

Kaboom

Well, it seems we lost the new CASTOR information provider (CIP) this morning and the BDII was reset to the old one: the physical host the new CIP lived on decided to kick the bucket. One of the consequences is that nearline accounting is lost; all nearline numbers are now zero (obviously not 44444 or 99999, that would be silly...:-)).
Before you ask, the new CIP doesn't run on the old host because it was compiled for 64-bit SLC5, and the old host is 32-bit SL4.
We're still working on getting it back, but are currently short of machines that can run it, even virtual ones. If you have any particular problems, do get in touch with the helpdesk and we'll see what we can do.