14 August 2007

CLOSE_WAIT strikes again


Multiple DPM sites are reporting instabilities in the DPM service. The symptoms are massive resource usage by multiple dpm.ftpd processes on the disk servers (running v1.6.5 of DPM). These have been forked by the main gridftp server process to deal with client requests. Digging a little further, we find that these processes are responsible for many CLOSE_WAIT TCP connections between the DPM and the RAL FTS server. It also happens that all of the dpm.ftpd processes are owned by the atlassgm user, but I think this is only because ATLAS are the main (only?) VO using FTS to transfer data at the moment.

CLOSE_WAIT means that the local end of the connection has received a FIN from the other end, but the OS is waiting for the program at the local end to actually close its connection.
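To make that concrete, here is a minimal sketch in Python that reproduces the state on the loopback interface. It has nothing to do with the DPM or FTS code; the function name `demonstrate_close_wait` is made up for illustration. The "server" side accepts a connection, the client closes it (sending a FIN), and the server sees end-of-file but has not yet called close(). For that interval the server's socket is in CLOSE_WAIT.

```python
import socket


def demonstrate_close_wait():
    """Return True if the accepting side saw the peer's FIN (EOF on recv)."""
    # Listen on an ephemeral loopback port (port 0 lets the OS pick one).
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    port = listener.getsockname()[1]

    client = socket.create_connection(("127.0.0.1", port))
    server_side, _ = listener.accept()

    # The remote end closes first: this sends a FIN to server_side.
    client.close()

    # recv() returning b"" means the FIN has arrived.
    server_side.settimeout(5)
    fin_seen = server_side.recv(1024) == b""

    # From here until the next line runs, server_side sits in CLOSE_WAIT.
    # A program that never reaches this close() leaks the socket in that state.
    server_side.close()
    listener.close()
    return fin_seen
```

If the final close() on the accepted socket were never reached, the connection would stay in CLOSE_WAIT for as long as the process lived, which matches the symptom described above.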

Durham, Cambridge, Brunel and Glasgow have all seen this effect. The problem is so bad at Durham that they have written a cron job that kills off the offending dpm.ftpd processes at regular intervals. Glasgow haven't been hit too badly, but then they do have 8GB of RAM on each of their 9 disk servers!
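For sites that want to see how bad things are before reaching for a kill script, the following is a rough sketch (not Durham's cron job, which I have not seen) that counts CLOSE_WAIT sockets per remote host. It assumes the Linux /proc/net/tcp text format, where the fourth column holds the state and 08 is CLOSE_WAIT, and it assumes a little-endian host such as x86 for decoding the IPv4 address. The function names and the sample data are invented for illustration.

```python
import socket
import struct
from collections import Counter

CLOSE_WAIT = "08"  # state code used in /proc/net/tcp on Linux


def decode_ipv4(hex_addr):
    """Turn 'XXXXXXXX:PPPP' from /proc/net/tcp into (dotted_ip, port).

    Assumes a little-endian host, where the address bytes appear reversed.
    """
    ip_hex, port_hex = hex_addr.split(":")
    ip = socket.inet_ntoa(struct.pack("<I", int(ip_hex, 16)))
    return ip, int(port_hex, 16)


def count_close_wait(proc_net_tcp_text):
    """Count CLOSE_WAIT connections per remote IP address."""
    counts = Counter()
    # Skip the header line, then read: sl, local_address, rem_address, st, ...
    for line in proc_net_tcp_text.splitlines()[1:]:
        fields = line.split()
        if len(fields) < 4:
            continue
        if fields[3] == CLOSE_WAIT:
            remote_ip, _ = decode_ipv4(fields[2])
            counts[remote_ip] += 1
    return counts


# Made-up sample in /proc/net/tcp layout (trailing columns omitted).
SAMPLE = """\
  sl  local_address rem_address   st
   0: 0100000A:0AFB 0200000A:C350 08
   1: 0100000A:0AFB 0200000A:C351 08
   2: 0100000A:0AFB 0200000A:C352 01
   3: 0100000A:0AFB 0300000A:C353 08
"""
```

On a real disk server the input would come from `open("/proc/net/tcp").read()`. A large count against a single remote address, such as the FTS server, would point to the problem described here.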

The DPM and FTS developers have been informed. From the emails I have seen, it appears that the DPM side is at fault, although the root cause is still not understood. This situation is very reminiscent of the CLOSE_WAIT issues that we were seeing with dCache at the end of last year.

Also see here.

1 comment:

Greig A Cowan said...

Savannah bug has been submitted:

http://savannah.cern.ch/bugs/?28922