15 June 2018

DPM Workshop 2018 Report

CESNET hosted the 2018 DPM Workshop in Prague, 31st May to 1st June.

As always, the Workshop was built around the announcement of a new DPM release - 1.10.x - and the promotion of the aims and roadmap of the DPM core development team that the release represents.

Since the 1.9.x series, the focus of DPM development has been on the next-generation "DOME" codebase. The 1.10 release, therefore, shows performance improvements in managing requests for all supported transfer protocols - GridFTP, Xroot, HTTP - but only when the DOME adapter is managing them.
(DOME itself is an HTTP/WebDAV-based management protocol, implemented as a plugin to xrootd, and directly implementing the dmlite API as an adapter.)

By contrast, the old lcgdm code paths are increasingly obsolete in 1.10 - the most significant work done on the SRM daemon, which is supported via these paths, was the fix to CentOS 7 SOAP handling*.

As a consequence, it was suggested that SRM (and the rest of the lcgdm legacy codebase for DPM) be marked as "unsupported" from 1 June 2019 - a year after this workshop. There was some lively debate about the consequences of this, and two presentations (from ATLAS and CMS) covering the possibility of using SRM-less storage. [In short: this is probably not a problem for those experiments.]
The main concern was about historical dependencies on SRM - both in our transfer orchestration infrastructure, for which non-SRM transfers are less tested, and in historical file catalogues, which may have "srm://" paths embedded in them.
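To illustrate the catalogue concern, here is a minimal Python sketch of how a site might detect "srm://" entries and rewrite them to another protocol. It is an illustration under assumptions, not anything presented at the workshop: the function names, the example host, the "SFN" query form of the long SRM URL, and the choice of "davs" as the target scheme (with the same path exposed on that endpoint) are all hypothetical and would need checking against a real catalogue and storage element.

```python
from urllib.parse import urlsplit, parse_qs


def srm_to_host_path(url):
    """Return (host, path) for an srm:// URL, or None for anything else."""
    parts = urlsplit(url)
    if parts.scheme != "srm":
        return None
    # Assumed long form: the file path is carried in an SFN query parameter.
    # Assumed short form: the file path is simply the URL path.
    sfn = parse_qs(parts.query).get("SFN")
    path = sfn[0] if sfn else parts.path
    return parts.hostname, path


def rewrite_catalogue_entry(url, scheme="davs"):
    """Rewrite an srm:// entry to the given scheme; leave other entries alone.

    Assumes the target endpoint exposes the same path on the default port
    for that scheme, which is a site-specific assumption.
    """
    parsed = srm_to_host_path(url)
    if parsed is None:
        return url
    host, path = parsed
    return f"{scheme}://{host}{path}"
```

A dry run over a catalogue dump, comparing the number of rewritten entries against the number left untouched, would give a first estimate of how much historical data still depends on SRM paths.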

As an additional point, there was a discussion of longstanding configuration "issues" with Xrootd redirection into the CMS AAA hierarchy, as discovered and amended by Andrea Sartirana at the end of 2017.


Other presentations from the contributing regions focused largely on testing other new features of DPM in 1.9.x: the distributed site approach (using DOME to securely manage pool nodes at locations geographically remote from the head), and the new "volatile pool" model for lightweight caching in DPM.

For Italy, Alessandra Doria reported on the "distributed" DPM configuration across Roma, Napoli and LNF (Frascati), currently implemented as a testbed. This is an interesting implementation of both distributed DPM and the volatile pools - each site has both a local permanent storage pool and a volatile cache pool, enabling the global namespace across the entire distributed DPM to be transparent (as remote files from other sites are cached in the volatile pool).

For Belle 2, Silvio Pardi reported on some investigations and tests of volatile pools for caching of data for analysis.

We also presented, from the UK, work on implementing the old ARGUS-DPM bridge for DMLITE/DOME. This API bridge allows the ARGUS service - the component of the WLCG site framework which centralises authentication and authorisation decisions - to make user and group ban status available to a local DPM. (DPM, especially in the DMLITE and DOME eras, does not perform account mapping in the way that compute elements do, so the most useful part of ARGUS's functionality is the binary authorisation component. As site ARGUS instances are federated with the general WLCG ARGUS instance to allow "federated user banning" as a security feature, the ability to set storage policies via the same framework is useful.)

*CentOS 7 gSOAP libraries changed behaviour such that they handle connection timeouts poorly, resulting in spurious errors being sent to clients of an SRM when they reopen a connection. A work-around was developed at the CGI-GSOAP level, and deployed initially at UK sites which had noticed the issue.
