15 March 2019

Some success in testing sites with no local production data storage for ATLAS VO.

For a while we have had a few  of the smaller sites (T3s) in the UK running for ATLAS with out any storage at the site.We recently tried to run Birmingham as a completely diskless site with it using storage at Manchester.  This was mostly successful; the saturation of the WAN connection at Manchester which was always considered a worrying possibility was seen. This has helped inform ATLAS's opinions on how to implement diskless sites which was then presented at the ATLAS Site Jamboree this month.

We intend to try using XCache at Birmingham instead to see that is an alternative approach which might succeed. WE are also looking into using the ARC Control tower to pre-place data for ARC-CEs. Main issue is how this conflicts with VO wish for last minute payload changes within pilot jobs.

I would also just remind why (IMHO) we are looking into optimising ATLASDATADISK storage.

From a small site perspective, storage requires a substantial amount of effort to maintain. This effort compared to the volume of storage provided could be efficiently used in other activities. Below is a plot of the percentage of current ATLASDATADISK provided by each site.  The VO also benefits with not using smaller sites as it has fewer logical endpoints to track.
 

This plot shows that if the 10 smaller sites (of which 5 are in the UK) allows for ATLAS to use 99% of space form only 88% of sites. ATLASSCRATCHDISK and ATLASLOCALGROUPDISK usage/requirement also needs to be taken into consideration when deciding if a site should become a fully diskless or caching/buffering site.

08 March 2019

ATLAS Jamboree 2019 view from the offiste perspective.

I didn't go in person to the ATLAS Jamboree this year held at CERN. For those who are allowed to view I suggest looking at: https://indico.cern.ch/event/770307/

But I did join for some via vidyo!
Here is my musings about the talk givens I saw. ( Shame I couldn't get involved in coffee/dinner  discussions which are often the most fruitful moments of these meetings):

Even before the main meeting started, there is an interesting talk regarding HPC data access in US an ANL.


In particular, I  like the thought of globus usage and incorporating rucio into DTNs at the sites.  Similar to what was discussed at other sites at the rucio community workshop last week.

In the preview talk, I picked out the switch to Fast Sim rather than Full sim will increase output rate by a factor of 10. A good reminder that user workflow changes could drastically alter computing requirements.
From the main meeting , the following meetings will be of interest on a data storage:

Data Organization and Management Activities: Third Party Copy and (storage) Quality of Service
TPC: details on DPM
DOMA ACCESS: Caches 
DDM Ops overview
Diskless and lightweight sites: consolidation of storage
Data Carousel
Networking - best practice for sites, and evolution
WLCG evolution strategy
 
One thing it di was cause me to think what if; (and I stress the if is me not ATLAS musing,)  ATLAS  wanted to read 1PB of data a day from Tape at RAL and then distribute it across the world?

 
 

06 March 2019

Rucio 2nd Community Workshop roundup

There was an interesting workshop for members of the rucio community last week:
https://indico.cern.ch/event/773489/timetable/#all.detailed

Here is the summary from the workshop:
Summary
● Presentation from 25 communities, 66 attendees!
● Many different use-cases have been presented
○ Please join us on Slack and the rucio-users mailing list for follow-ups!
● Documentation!
○ Examples
○ Operations documentation
○ Easy way for communities to contribute to the documentation
○ Documentation/Support on Monitoring (Setup, Interpretation, Knowledge)
○ Recommendations on data layout/scheme → Very difficult decision for new communities
● Databases
○ Larger-Scale evaluation of non-Oracle databases would be very beneficial for the community
● Drop of Python 2.6 support for Rucio clients
● LTS/Gold release model
○ Will propose a release model with LTS/gold releases with Security/Critical bug fixes
● Archive support
● Metadata support
○ Existing metadata features (generic metadata) need more evaluation/support
○ More documentation/examples needed
● Additional authentication methods needed
○ OpenID, Edugain, …
● Interfacing/Integration with DIRAC
○ Many communities interested
○ Possibility for joint effort?

Here is a list of my summary of these and other snippets from the talks I find interesting/thought provoking. I 'll point out I was not at the meeting so tone of talks n=may have been lost on me.

Network talk from GEANT:
Lots of 100Gb links!



DUNE talk:
Replacing SAM ( THE product from where I started my data management journey...)
Unfortunately not able to use significant amount of storage at RAL Echo
– Dynafed WebDAV interface can’t handle files larger than 5GB
– Latest Davix can do direct upload of large files (as a multi-part upload), but not third-party
transfers
– Maybe use the S3 interface directly instead?
• Recent work done on improving Rucio S3 URL signing


CMS talk:
Some concerns called out by review panel which we will have to address or solve:
▪ Automated and centralized consistency checking — person is assigned
▪ No verification that files are on tape before they are allowed to be deleted from buffer
★ FTS has agreed to address this


BelleII talk:
My thoughts:
using a BelleII specific version of DIRAC
They are evaluating rucio
naming schema is "interesting"
power users seem a GoodIdea (TM)
BelleII thoughts:
• Looking ahead, the main challenge is achieving balance between conflicting requirements:
• bringing Rucio into Belle II operations quickly enough to avoid duplication of development effort
• supporting the old system for a running experiment


FTS talk:
I didn't know about multi hop support:
Multi-hop transfers support – Transfers from A->C, but also A->B->C


XENONnT talk:
This is a dark matter experiment with some familiar  HEP sites using rucio.


ICE cube experiment :
Interesting data movement issues when base a south pole!
●Raw data is ~1 TB/day, sent via cargo ship; 1 shipment/year
●Filtered data is ~80 GB/day, sent via satellite; daily transfer


CTA talk:
CTA is an experiment for cosmic rays as well as CERN's news tape system!


Rucio DB at CERN  talk:
rucio DB numbers for ATLAS are "impressive" (1014M DIDs)


SKA talk:
RAL members get a thank you specifically.


NSLSII talk:
Similar needs as Diamond Light source DLS , possible collaboration?
Another site which does both HEP and photonics.
Has tested using globus endpoints.


XDC talk:
rucio and dynafed and storage all in one!


CTA (tape system) talk:
Initial deployments: predict a need of 70PB of disk pace just in the disk cache ! (am I reading this slide correctly?)


LCLSII talk:
Linear version of NSLSII based at SLAC, similar to BNL> need to use FTS to be tested. prod system need in the next year.


LSST talk:
Using docker release of rucio
Nice set of setup tests. Things look promising!
– FTS has proven its efficiency for data transfers, either standalone or paired with Rucio
– Rucio makes data management easier in a multi-site context, and tasks can be highly automated
– These features could prove beneficial to LSST
● Evaluation is still ongoing
– discussions with the LSST DM team at NCSA are taking place


Dynafed talk:
Dynafed as a Storage Element is work in progress
– Not be the design purpose of Dynafed


RAL/IRIS talk:
I would be interested to hear how this tlak went down with th epeople present



ARC at Nordugrid talk:
I still think ARC control tower (ACT)  are the future. rucio integration with volatile rse is nice.