16 April 2007

Repository got lost

Recently the Edinburgh dCache has been failing. The usageInfo page was reporting

[99] Repository got lost

for all of the pools. This has been seen before, but only now do I understand why.

The dCache developers have added a process that runs in the background and periodically tries to touch a file on each of the pools. If this process fails, something is regarded as being wrong and the above message is generated. This could happen if there was a problem with the filesystem or a disk was slow to respond for some reason.

Edinburgh was being hit with this issue due to some of the disks pools being completely full, i.e., df was reporting 0 free space, while dCache still thought there was a small amount of space available. This mismatch seems to arise from the presence of the small control files on each dCache pool (these contain metadata information). Each file may take up an entire block on the disk without actually using up all of the space. I'm still trying to find out if dCache performs a stat() call on these files. It should also be noted that dCache has to read each of these control files at pool startup, so a full pool takes longer to come online than one that is empty.

There also appears to be a bug in this background process since all of the Edinburgh disk pools were reporting the error, even though some of them were empty. In the meantime, I have set the full pools to readonly and this appears to have prevented the problem reoccurring.

No comments: