27 June 2018

CMS dark data volume after migration to new SE at Tier1 RAL

CMS have recently moved all of their disk-only storage from CASTOR to CEPH at the RAL-LCG2 Tier1. (They are still using CASTOR for tape services.) CMS had been using approximately 2PB of disk-only CASTOR space on dedicated servers. What is of interest is that, after CMS had moved their data and deleted the old data from CASTOR, there was still ~100TB of dark data left on the SE. We have now been able to remove this data and have started the process of decommissioning the hardware. (Some hardware may be re-provisioned for other tasks, but most is creaking with old age.)

ATLAS are nearly at the same point, and I wonder how much dark data will be left for them... (Capacity is 3.65PB at the moment, so my guess is ~150TB.)

When all the dark data is removed and the files are removed from the namespace, we can also clean up the dark directory structure, which is no longer needed. I leave it to the reader to guess how many dark directories CMS have left...
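
For the curious: finding dark data essentially comes down to comparing a dump of the namespace/catalogue against what is physically on the storage. The following is a minimal sketch of that comparison in Python, assuming plain-text dumps with one path per line; the file names and formats are illustrative assumptions, not our actual tooling.

    #!/usr/bin/env python3
    # Minimal sketch: compare a namespace/catalogue dump against a storage dump
    # to find "dark" entries (on disk but unknown to the catalogue).
    # File names and formats here are illustrative, not our production tooling.

    def load_paths(filename):
        """Read one path per line, ignoring blanks and comments."""
        with open(filename) as f:
            return {line.strip() for line in f
                    if line.strip() and not line.startswith("#")}

    namespace = load_paths("namespace_dump.txt")  # what the catalogue thinks exists
    on_disk = load_paths("storage_dump.txt")      # what the SE actually holds

    dark = sorted(on_disk - namespace)  # physical files with no catalogue entry
    lost = sorted(namespace - on_disk)  # catalogue entries with no physical file

    print("dark files: %d, lost files: %d" % (len(dark), len(lost)))
    for path in dark:
        print(path)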

15 June 2018

DPM Workshop 2018 Report

CESNET hosted the 2018 DPM Workshop in Prague, 31st May to 1st June.

As always, the Workshop was built around the announcement of a new DPM release - 1.10.x - and promotion of the aims and roadmap of the DPM core development team that it represents.

Since the 1.9.x series, the focus of DPM development has been on the next-generation "DOME" codebase. The 1.10 release, therefore, shows performance improvements in managing requests for all supported transfer protocols - GridFTP, Xroot, HTTP - but only when the DOME adapter is managing them.
(DOME itself is an http/WebDAV based management protocol, implemented as a plugin to xrootd, and directly implementing the dmlite API as an adapter.)
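
For a flavour of what plain HTTP/WebDAV interaction with a DPM looks like from the client side, a standard WebDAV PROPFIND is enough to list a directory. The endpoint and path below are invented, and a real instance would require X.509/VOMS or token authentication, which is omitted here; this is a sketch of the protocol, not a recipe for any particular site.

    import requests
    import xml.etree.ElementTree as ET

    # Sketch only: list a directory over WebDAV with a PROPFIND request.
    # The endpoint and path are hypothetical, and real DPM instances require
    # X.509/VOMS or token authentication, omitted here for brevity.
    url = "https://dpm.example.ac.uk/dpm/example.ac.uk/home/dteam/"

    response = requests.request(
        "PROPFIND", url,
        headers={"Depth": "1"},  # immediate children only
        verify="/etc/grid-security/certificates",  # typical CA directory on grid hosts
    )
    response.raise_for_status()

    # A 207 Multi-Status reply carries one <D:href> element per entry.
    tree = ET.fromstring(response.content)
    for href in tree.iter("{DAV:}href"):
        print(href.text)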

By contrast, the old lcgdm code paths are increasingly obsolete in 1.10 - the most significant work done on the SRM daemon supported via these paths was the fix to CentOS 7 SOAP handling*.

As a consequence of this, a suggestion was floated that SRM (and the rest of the lcgdm legacy codebase for DPM) be marked as "unsupported" from 1 June 2019 - a year after this workshop. There was some lively debate about the consequences of this, and two presentations (from ATLAS and CMS) covering the possibility of using SRM-less storage. [In short: this is probably not a problem for those experiments.]
There was some significant concern about historical dependencies on SRM - both for our transfer orchestration infrastructure, for which non-SRM transfers are less well tested, and for historical file catalogues, which may have "srm://" paths embedded in them.
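
One mitigation for catalogue entries with embedded "srm://" paths is a bulk rewrite to a non-SRM protocol. The sketch below shows the kind of URL translation involved; the host name, ports and the davs:// target are purely illustrative of what a site-specific mapping might look like, not a prescription.

    from urllib.parse import urlparse

    # Illustrative only: translate an SRM URL into a WebDAV (davs://) URL.
    # The host name, ports and path layout are hypothetical; a real mapping
    # would be driven by each site's published storage endpoints.

    def srm_to_davs(srm_url, davs_port=443):
        """Rewrite srm://host:8446/srm/managerv2?SFN=/dpm/... to davs://host/dpm/..."""
        parsed = urlparse(srm_url)
        # SRM v2 URLs often carry the real path in the SFN= query parameter.
        if parsed.query.startswith("SFN="):
            path = parsed.query[len("SFN="):]
        else:
            path = parsed.path
        return "davs://%s:%d%s" % (parsed.hostname, davs_port, path)

    print(srm_to_davs(
        "srm://dpm.example.ac.uk:8446/srm/managerv2?SFN=/dpm/example.ac.uk/home/atlas/file.root"
    ))
    # -> davs://dpm.example.ac.uk:443/dpm/example.ac.uk/home/atlas/file.root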

As an additional point, there was a discussion of longstanding configuration "issues" with Xrootd redirection into the CMS AAA hierarchy, as discovered and amended by Andrea Sartirana at the end of 2017.


Other presentations from the contributing regions focused largely on testing the new features introduced in the 1.9.x series: the distributed site approach (using DOME to securely manage pool nodes at locations geographically remote from the head node), and the new "volatile pool" model for lightweight caching in DPM.

For Italy, Alessandra Doria reported on the "distributed" DPM configuration across Roma, Napoli and LNF (Frascati), currently implemented as a testbed. This is an interesting implementation of both distributed DPM and the volatile pools - each site has a local permanent storage pool plus a volatile cache pool, enabling the global namespace across the entire distributed DPM to be transparent (as remote files are cached in the volatile pool from other sites).
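
As a rough mental model of what the volatile pool is doing in that setup, the read path behaves like a read-through cache: serve from the local permanent pool if the file lives there, otherwise fetch it from the remote site into the volatile pool and serve it from the cache on subsequent reads. The sketch below is a deliberately simplified model of that behaviour, not DPM code; all names in it are invented.

    # Simplified model of the "permanent pool + volatile cache pool" read path.
    # This is not DPM code: the dictionaries and paths below are invented purely
    # to show the read-through caching behaviour.

    permanent_pool = {"/dteam/local_file.root": b"data whose primary copy is local"}
    volatile_pool = {}  # cache for files whose primary copy lives at another site
    remote_sites = {"/dteam/roma_file.root": b"data held in a pool at another site"}

    def read(path):
        """Return file content, caching remote files in the volatile pool."""
        if path in permanent_pool:     # primary copy is local
            return permanent_pool[path]
        if path not in volatile_pool:  # cache miss: pull from the remote site
            volatile_pool[path] = remote_sites[path]
        return volatile_pool[path]     # served from the local volatile pool

    read("/dteam/roma_file.root")  # first read populates the volatile pool
    read("/dteam/roma_file.root")  # second read is served from the local cache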

For Belle 2, Silvio Pardi reported on some investigations and tests of volatile pools for caching of data for analysis.

We also presented, from the UK, work on implementing the old ARGUS-DPM bridge for DMLITE/DOME. This API bridge allows the ARGUS service - the component of the WLCG site framework which centralises authentication and authorisation decisions - to make user and group ban status available to a local DPM. (DPM, especially in the DMLITE and DOME eras, does not perform account mapping in the way that compute elements do, so the most useful part of ARGUS's functionality is the binary authorisation component. As site ARGUS instances are federated with the general WLCG ARGUS instance to allow "federated user banning" as a security feature, the ability to set storage policies via the same framework is useful.)
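
To make the shape of that concrete (without reproducing the bridge itself), the flow boils down to periodically pulling the set of banned identities from the authorisation side and recording a binary banned/not-banned flag on the storage side. The sketch below is purely conceptual: the plain text files stand in for the real ARGUS PEP and dmlite/DOME interfaces, which are not shown here.

    # Conceptual sketch of the ARGUS-to-DPM ban flow only. The text files stand
    # in for the real interfaces (ARGUS PEP on one side, dmlite/DOME user
    # records on the other), which this sketch does not reproduce.

    def read_banned_dns(argus_export="banned_dns.txt"):
        """Hypothetical stand-in for querying ARGUS: one banned DN per line."""
        try:
            with open(argus_export) as f:
                return {line.strip() for line in f if line.strip()}
        except FileNotFoundError:
            return set()

    def write_dpm_banlist(banned, banlist="dpm_banned_users.txt"):
        """Hypothetical stand-in for updating DPM's per-user ban flags."""
        with open(banlist, "w") as f:
            for dn in sorted(banned):
                f.write(dn + "\n")

    # The only decision DPM needs from ARGUS is binary: banned or not banned.
    write_dpm_banlist(read_banned_dns())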

*The CentOS 7 gSOAP libraries changed behaviour such that they handle connection timeouts poorly, resulting in spurious errors being sent to clients of an SRM when they reopen a connection. A work-around was developed at the CGI-GSOAP level, and deployed initially at UK sites which had noticed the issue.