22 October 2013

DPM Load Issue Mitigation for Sites Seeing Load Brownouts on Individual Disk Servers

It has been known for a while that sites can experience load problems on their storage, especially sites running a lot of ATLAS analysis.

For DPM sites, this is caused by a combination of poor distribution of a dataset's files across disk servers (something a storage system cannot guard against without knowledge that is inaccessible to it a priori, especially with the new Rucio naming scheme) and the lack of any way to consistently throttle access rates, or the number of active transfers, per disk server.

At Glasgow, we found that the following combination of approaches significantly ameliorated the issue:


  • Firstly, enable xrootd support in DPM (this is now entirely doable via YAIM or Puppet, and it is required anyway for the ATLAS and CMS xrootd federations to work), and set your queues in AGIS to use xrootd direct I/O (rather than rfcp or xrdcp) for file access. Most analysis jobs read only a fraction of each file they rely on, so direct I/O reduces the total amount of data a job needs to move, and also spreads the load from data access over a longer period, reducing the peak I/O demanded of any one disk server. Getting your queue settings changed requires a friendly ATLAS person with the relevant AGIS permissions; a minimal sketch of the YAIM side is given after this list.
  • Secondly, change queue settings in your batch system to limit the rate at which analysis pilots can start. The cause of I/O brownouts on disk servers seems to be the simultaneous start of a large number of analysis jobs, all of which immediately attempt to fetch or access files from the local SE; limiting the rate of pilot starts smears this peak I/O over a longer time, again smoothing the load out.

    With our torque/maui system, we accomplish this by setting the MAXIPROC value for the atlaspil group to a small number (10 in our case). MAXIPROC limits the number of a group's jobs that can be eligible to start in any given scheduling iteration; in practice this means maui will start no more than 10 atlaspil jobs per (60 second) scheduling iteration. A fuller maui.cfg sketch follows after this list.
    i.e.:          GROUPCFG[atlaspil] {other config variables elided} MAXIPROC=10
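
    For reference, here is a minimal sketch of the YAIM route for the first change. The package, variable, and node-type names are as we recall them from the DPM xrootd setup notes of the time; treat them as illustrative and check the documentation for your DPM release before using them.

        # On the head node and every disk server:
        yum install dpm-xrootd

        # In site-info.def -- the same shared key (a 32-64 character
        # alphanumeric string) must be configured on the head node and
        # on all disk servers:
        DPM_XROOTD_SHAREDKEY="changeThisToA32to64CharacterKey0123456789"

        # Re-run YAIM with your usual node types, e.g.:
        /opt/glite/yaim/bin/yaim -c -s site-info.def -n se_dpm_mysql   # head node
        /opt/glite/yaim/bin/yaim -c -s site-info.def -n se_dpm_disk    # disk servers

        # Quick sanity check from a worker node (hypothetical host and path):
        xrdcp root://dpm.example.ac.uk//dpm/example.ac.uk/home/atlas/testfile /tmp/testfile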

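    For context, here is roughly how the second change looks in maui.cfg, together with the commands we use to confirm it has taken effect. This is a sketch assuming a stock torque/maui install; RMPOLLINTERVAL is maui's scheduling-iteration interval, and the values shown are the ones mentioned above.

        # maui.cfg -- MAXIPROC is the change being described; the scheduling
        # iteration length is set by RMPOLLINTERVAL (60 seconds here):
        RMPOLLINTERVAL      00:01:00
        GROUPCFG[atlaspil]  MAXIPROC=10

        # maui reads its configuration at startup, so restart it and check:
        service maui restart
        showconfig | grep -i atlaspil
        diagnose -g        # per-group usage and limits
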
With these two changes we rarely see load spikes or brownouts, despite having significantly increased the maximum fraction of ATLAS analysis jobs we run since the fixes were put in place. Our evidence suggests, however, that both changes are needed for a robust effect on load.
