28 March 2015

EUDAT and GridPP

EUDAT2020 (the H2020 follow-up project to EUDAT) just finished its kick-off meeting at CSC. Might be useful to jot down a few thoughts on similarities and differences and such before it is too late.

Both EUDAT and GridPP are - as far as this blog is concerned - data e- (or cyber-) infrastructures. In both, the infrastructure is distributed across sites, sites provide storage capacity for users, there is a common authentication and authorisation scheme, there are data discovery mechanisms, and both use GOCDB for service availability.

  • EUDAT will be using CDMI as its storage interface - just like EGI does - and CDMI is in many ways fairly SRM-like. We have previously done work comparing the two.
  • EUDAT will also be doing HTTP "federations" (i.e. automatic failover when a replica is missing; this is confusingly referred to as "federation" by some people).
  • Interoperation with EGI is useful/possible/thought-about (delete as applicable). EUDAT's B2STAGE will be interfacing to EGI - there is already a mailing list for discussions.
  • GridPP's (or WLCG's) metadata management is probably a bit too confusing at the moment, since there is no single file catalogue.
  • B2ACCESS is the authentication and authorisation infrastructure in EUDAT; it could interoperate with GridPP via SARoNGS (ask us at OGF44 where we will also look at AARC's relation to GridPP and EUDAT). Jos tells us that KIT also have a SARoNGS type service.
  • Referencing a file is done with a persistent identifier, rather like the LFN (Logical Filename) GridPP used to have.
  • "Easy" access via WebDAV is an option for both projects. GlobusOnline is an option (sometimes) for both projects. In fact, B2STAGE is currently using GO, but will also be using FTS.
Using FTS is particularly interesting because it should then be possible to transfer files between EUDAT and GridPP. The differences between the projects are mainly that
  • GridPP is more mature - has had 14-15 years now to build its infrastructure; EUDAT is of course a much younger project (but then again, EUDAT is not exactly starting from scratch)
  • EUDAT is doing more "dynamic data", where the data might change later; it is also looking at more support for the data lifecycle.
  • EUDAT and GridPP have distinct user communities, to a first approximation at least.
  • The middleware is different; GridPP does of course offer compute, whereas EUDAT will offer simpler server-side workflows. GridPP services are more integrated, whereas in EUDAT the B2 services are more separated (but will be unified by the discovery/lookup service and by B2ACCESS).
  • Authorisation mechanisms will be very different (but might hopefully interface to each other; there are plans for this in B2ACCESS).
There is some overlap between data sites in WLCG and those in EUDAT. This could lead to some interesting collaborations and cross-pollinations. Come to OGF44 and the EGI conference and talk to us about it.

20 March 2015

ISGC 2015 Review and Musings

The 2015 ISGC conference is coming to a close, so I thought I would jot down some musings regarding some of the talks I have seen (and presented) over the last week. Not surprisingly, since the G and C are grids and clouds, a lot of talks were regarding compute; however, there were various talks on storage and data management (especially dCache). But the most interesting talk was about a new technology which sees a CPU and network interface incorporated into an individual HDD. This can be seen here:
http://indico3.twgrid.org/indico/contributionDisplay.py?sessionId=26&contribId=80&confId=593

There were also many site discussions from the various Asian countries represented, in which network setup and storage were of particular interest (including the use of InfiniBand between Singapore, Seattle and Australia - it makes the distance our European dataflows have to travel seem trivial). My perfSONAR talk seemed to be well received.

It was also interesting to listen to some of the Humanities and Arts themed talks. (First time I have ever heard post-modernism used at a conference!!) Their data volumes may well be smaller than the WLCG VOs', but they are still complex and use interesting visualisation methods.

09 March 2015

Some thoughts on data in the cloud gathered at CloudScape VII

Some if-not-quite-live then certainly not at all dead notes from #CloudScapeVII on data in the cloud.
  • How to establish trust in the cloud data centre?  Clouds can run pretty good security, which you'd otherwise only get in the large data centre.
    • Clouds can build trust by disclosing processes and practices - Rüdiger Dorn, Microsoft
    • Clarify responsibilities
    • "35% of security breaches are due to stupid things" - like leaving memory sticks on a train or sending CDs by post... - Giorgio Aprile, AON
    • Difficult to inculcate good (security) practice in many end users
  • "Opportunity to make big data available in cloud" - Robert Jenkins, CloudSigma
    • Model assumes that end users pay for the ongoing use of data
    • Democratise data
  • Data protection
    • Kuan Hon from QMUL instilled the fear of data protection in everyone who provides data storage. The new data protection stuff doesn't seem to take clouds into account - lots of scary implications. [Good thing we are not storing personal data on the grid...]
    • Protection relies on legal frameworks - sign a contract saying you won't reveal the data - rather than technology (encrypting it to prevent you revealing the data)
  • Joe Baguley from VMware talked about the abstractions: where RAID abstracted hard drives from storage, we now do lots more abstractions with hypervisors, containers, software-defined-X, etc.
    • Layers can optimise, so can get excellent performance
    • Stack can be hard to debug when something doesn't work so well...
    • Generally more benefits than drawbacks, so a Good Thing™
    • Overall, speed up data → analysis → app → data → analysis → app → ... cycle
  • "What's hot in the cloud" - panel of John Higgings (DigitalEurope), Joe Baguley (vmware), David Bernstein (Cloud Strategy Partners), Monique Morrow (CISCO)
    • Big data is also fast data (support for more Vs), lots of opportunities for in memory processing
    • Data - use case for predictive analysis and pattern recognition (and in general machine learning)
    • devops needed to break down barriers [as we know quite well from the grid where we have tb-support née dteam]
    • Disruptive technological advances to, er, disrupt?
    • Many end users are using clouds without knowing it - like people using Facebook.
Hope I've done it some justice. As always, lots of very interesting things in CloudScape, even for those of us who have been providing "cloud" services (in some sense) for a while. Also good of course to catch up with old friends and meet new ones.

06 March 2015

Storage accounting revisited?

One of the basic features of containers - a thing which can contain something - is that you can see how full it is. If your container happens to be a grid storage element, monitoring information is available in gstat and in our status dashboard. The BDII information system publishes data, and so does the SRM (the storage element control interface), and the larger experiments at least track how much they write.

So what happens if all these measures don't agree?  We had a ticket against RAL querying why the BDII published different values from what the experiment thought they had written. It turned out to be partly because someone was attempting to count used space by space token, which leads to quite the wrong results:
Leaving aside whether these should be the correct mappings for ATLAS, the space tokens on the left do not map one-to-one to the actual storage areas (SAs) in the middle (and in general there are SAs without space tokens pointing to them). Note also that the SAs split the accounting data of the disk pools (online storage) so that the sum of the values is the same -- to avoid double counting.

The other reason for the discrepancy was the treatment of read-only servers: these are published as used space by the SRM, but not by the BDII. This is because the BDII is required to be compliant with the installed capacity agreement from a working group from 2008. The document says on p.33,
TotalOnlineSize (in GB=10⁹) is the total online [..] size available at a given moment (it SHOULD not [sic] include broken disk servers, draining pools, etc.)
RAL uses read-only disk pools essentially like draining disk pools (unlike tapes, where a read-only tape is perfectly readable), so read-only disk pools do not count in the total -- they do, however, count as "reserved" as specified in the same document (the GLUE schema probably intended reserved to be more like SRM's reserved, but the WLCG document interprets the field as "allocated somewhere").

Interestingly, RAL does not comply with the installed capacity document in publishing UsedOnlineSize for tape areas. The document specifies
UsedOnlineSize (in GB=10⁹ bytes) is the space occupied by available and accessible files that are not candidates for garbage collection.
It then kind of contradicts itself in the same paragraph, saying
For CASTOR, since all files in T1D0 are candidates for garbage collection, it has been agreed that in this case UsedOnlineSize is equal to [..] TotalOnlineSize.
If we published like this, the used online size would always equal the total size, and the free size would always be zero (because the document also requires that used and free sum to total -- which doesn't always make sense either, but that is a different story.)
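
Since all of this ends up in the BDII, it is straightforward to check a site's published numbers yourself. Below is a rough sketch of the sort of query involved, assuming the Python ldap3 module and GLUE 1.3 attributes; the hostname and base DN are placeholders, not a real endpoint.

# Sketch only: query a site BDII for storage area sizes and sanity-check them.
# The host and base DN are placeholders; attribute names are GLUE 1.3.
from ldap3 import Server, Connection, ALL

server = Server("site-bdii.example.org", port=2170, get_info=ALL)
conn = Connection(server, auto_bind=True)
conn.search("mds-vo-name=EXAMPLE-SITE,o=grid",
            "(objectClass=GlueSA)",
            attributes=["GlueSALocalID", "GlueSATotalOnlineSize",
                        "GlueSAUsedOnlineSize", "GlueSAFreeOnlineSize"])
for sa in conn.entries:
    total = int(str(sa.GlueSATotalOnlineSize))
    used = int(str(sa.GlueSAUsedOnlineSize))
    free = int(str(sa.GlueSAFreeOnlineSize))
    note = "" if used + free == total else "  <-- used + free != total"
    print("%-30s total=%8d used=%8d free=%8d%s"
          % (str(sa.GlueSALocalID), total, used, free, note))

Comparing that per-SA output with what the experiment's own accounting thinks it wrote is usually where the interesting discrepancies show up.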

OK, so what might we have learnt today about storage accounting?

  1. Storage accounting is always tricky: there are all sorts of funny boundary cases, like candidates for deletion, temporary replicas, scratch space, etc.
  2. Aggregating accounting data across sites only makes sense if they all publish in the same way: they use the same attributes for the same types of values, etc. However, the supported storage elements all vary somewhat in how they treat storage internally.
  3. Before making use of the numbers, it is useful to have some sort of understanding of how they are generated (what do space tokens do? if numbers are the same for two SAs, is it because they are counting the same area twice, or because they split it 50/50? Implementers should document this and keep the documentation up to date!)
  4. These agreements should probably be reviewed from time to time - what is the use of publishing information if it does not tell people what they want to know?
  5. Storage accounting is non-trivial... getting it right vs useful vs achievable is a bit of a balancing act.

25 February 2015

ATLAS MC deletion campaign from tape complete for the RAL Tier1.

So ATLAS have just finished a deletion campaign of Monte Carlo data from our tape system at the RAL Tier1.

The good news is that the previously seen issue of transfers failing due to a timeout (caused by a misplaced "I'm done" UDP packet) seems to have been solved.

ATLAS deleted 1.325 PB of data, allowing our tape system to recover and re-use (once repacking has completed) approximately 250 tapes. In total ATLAS deleted 1739588 files. The deletion campaign took 17 days, but we have seen the CASTOR system sustain deletion rates at least a factor of four higher, so the VO should be able to increase their deletion request rate.
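
For a sense of scale, the headline figures above work out to the following average rates (simple arithmetic on the numbers quoted, nothing more):

# Average rates over the campaign, from the figures quoted above.
files, days, petabytes = 1739588, 17.0, 1.325
print("%.0f files/day (about %.1f Hz)" % (files / days, files / (days * 86400)))
print("%.0f TB/day" % (petabytes * 1000 / days))
# roughly 102000 files/day (~1.2 Hz) and ~78 TB/day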

What is also of interest (and which I am now looking into) is that ATLAS asked us to delete 211 files which they thought we had but which we did not.

Now may also be a good time to provide ATLAS with a list of all the files we have in our tape system, to find out which files we hold that ATLAS have "forgotten" about.

03 February 2015

Ceph stress testing at the RAL Tier 1

Of some interest to the wider community, the RAL Tier 1 site has been exploring the Ceph object store as a storage solution (some aspects of which involve grid interfaces being developed at RAL, Glasgow and CERN).

They've recently performed some interesting performance benchmarks, which Alastair Dewhurst reported on their own blog:

http://www.gridpp.rl.ac.uk/blog/2015/01/22/stress-test-of-ceph-cloud-cluster/

Distributed Erasure Coding backed by DIRAC File Catalogue

So, last year, I wrote a blog post on the background of Erasure Coding as a technique, and trailed an article on our own initial work on implementing such a thing on top of the DIRAC File Catalogue.

This article is a brief description of the work we did (a poster detailing this work will also be at the CHEP2015 conference).

Obviously, there are two elements to the initial implementation of any file transformation tool for an existing catalogue: choosing the encoding engine, and working out how to plumb it into the catalogue.

There are, arguably, two popular, fast implementations of general erasure coding libraries in use at the moment:
  • zfec, which backs the Least Authority File System's implementation, and has a nice Python API, and
  • jerasure, which has seen use in several projects, including backing Ceph's erasure-coded pools.
As DIRAC is a mostly Python project, we selected zfec as our backend library, which also seems to have been somewhat fortuitous on legal grounds, as jerasure has recently been withdrawn from public availability due to patent challenges in the USA (while this is not a relevant threat in the UK, as we don't have software patents, it makes one somewhat nervous about using it as a library in a new project).
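
For the curious, the basic k-of-m round trip with zfec looks roughly like the sketch below; the call signatures are written from memory, so treat them as assumptions and check the zfec documentation rather than taking this as gospel (the input filename is obviously a placeholder too).

# Rough sketch of a zfec encode/decode round trip: any k of the m chunks
# are enough to rebuild the data. Signatures from memory - verify before use.
import zfec

k, m = 3, 5
data = open("some_local_file", "rb").read()    # placeholder input
data += b"\x00" * ((-len(data)) % k)           # pad so the length divides by k
size = len(data) // k
primary = [data[i * size:(i + 1) * size] for i in range(k)]

chunks = zfec.Encoder(k, m).encode(primary, list(range(m)))  # all m chunks

# pretend chunks 1 and 2 were lost; any k survivors (plus their numbers) will do
recovered = zfec.Decoder(k, m).decode([chunks[0], chunks[3], chunks[4]], [0, 3, 4])
assert b"".join(recovered) == data

In the real tool the chunks are then uploaded to separate SEs and registered in the DFC, as described below.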

Rather than performing erasure coding as a stream, we perform the EC mapping of a file on disk, which is possibly a little slower, but is also safer and easier to perform.

Interfacing to DIRAC had a few teething problems. Setting up a DIRAC client appropriately was a little more finicky than we expected, and the Dirac File Catalogue implementation had some issues we needed to work around. For example, SEs known to the DFC are assumed good - there's no way of marking an SE as bad, or of telling how usable it is without trying it.

The implementation of the DFC Erasure Coding tool, therefore, also includes a tool which evaluates the health of the SEs available to the VO, and removes unresponsive SEs from its list of potential endpoints for transfers.

As far as the actual implementation for adding files is concerned, it's as simple as creating a directory (with the original filename) in the DFC, and uploading the encoded chunks within it, making sure to upload chunks across the set of SEs known to the DFC to support the VO you're part of.
We use the DFC's metadata store to store information about each chunk as a check for reconstruction. We were interested to discover that adding new metadata names to the DFC also makes them available for all files in the DFC, rather than simply for the files you add them for. We're not sure if this is an intended feature or not.

One of the benefits of any kind of data striping, including EC, is that we can retrieve chunks in parallel from the remote store. Our EC implementation allows the use of parallel transfer via the DFC methods when getting remote files, however, in our initial tests, we didn't see particular performance improvements. (Our test instance, using the Imperial test DIRAC instance, didn't have many SEs available to it, though, so it is hard to evaluate the scaling potential.)



The source code for the original implementation is available from: https://github.com/ptodev/Distributed-Resilient-Storage
(There's a fork by me, which has some attempts to clean up the code and possibly add additional features.)

29 December 2014

Yet another exercise in data recovery?

Just before the Christmas break, my main drive on my main PC - at home - seemed to start to fail (the kernel put it in read-only mode). Good thing we have backups, eh? They are all on portable hard drives, usually encrypted, and maintained with unison. No, they are not "in the cloud."

Surprisingly, much of my data is WORM, so what if there are differences between the backups? Was it due to those USB3 errors (caused a kernel panic, it did), a hardware fault, or that fsck which seemed to discover a problem, or has the file actually changed? (And a big "boo, hiss" to applications that modify files just by opening them - yes, you know who you are.) In my case, I would prefer to re-checksum them all and compare against at least four of the backups. So I need a tool.

My Christmas programming challenge for this year (one should always have one) is then to create a new program to compare my backups. Probably there is one floating around out there, but my scheme - the naming scheme, when I do level zeros, increments, masters, replicas - is probably odd enough that it is useful having a bespoke tool.
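
The core of such a tool is not complicated; something along the lines of the sketch below (paths and the choice of SHA-256 are just illustrative - the fiddly part is teaching it my naming scheme for level zeros, increments, masters and replicas):

# Minimal sketch: checksum every file under two backup roots and report
# anything that differs or is missing on one side.
import hashlib, os, sys

def checksums(root):
    sums = {}
    for dirpath, _, files in os.walk(root):
        for name in files:
            path = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(path, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            sums[os.path.relpath(path, root)] = h.hexdigest()
    return sums

def compare(root_a, root_b):
    a, b = checksums(root_a), checksums(root_b)
    for rel in sorted(set(a) | set(b)):
        if rel not in a or rel not in b:
            print("MISSING %s" % rel)
        elif a[rel] != b[rel]:
            print("DIFFERS %s" % rel)

if __name__ == "__main__":
    compare(sys.argv[1], sys.argv[2])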

On the grid we tend to checksum files as they are transferred. Preservation tools can be asked to "wake up" data every so often and re-check them. Ideally the backup check should quietly validate the checksums in the background as long as the backup drive is mounted.

15 December 2014

Data gateway with dynamic identity - part 1

This doesn't look like GridPP stuff at first, but bear with me...

The grid works by linking sites across the world, providing a sufficiently high level of infrastructure security using such things as IGTF. The EUDAT project is a data infrastructure project but has users who are unable/unwilling (delete as applicable) to use certificates themselves to authenticate. Thus projects use portals as a "friendly" front end.

So the question is, how do we get data through the portal? Yes, it's a reverse proxy, or gateway. Using Apache mod_proxy, this is easy to set up, but it is limited to using a single credential for the onward connection.
Look at these (powerpoint) slides: in the top left slide, the user connects (e.g. with a browser) to the portal using some sort of lightweight security - either site-local if the portal is within the site, or federated web authentication in general. Based on this, the portal (top right) generates a key pair and obtains a certificate specific to the user - with the user's (distinguished) name and authorisation attributes. It then (bottom left) connects and sends the data back to the user's browser, or possibly, if the browser is capable of understanding the remote protocol, redirects the browser (with suitable onward authentication) to the remote data source.
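
As a toy illustration of the idea (and only that - this is not the EUDAT prototype, and the backend URL, certificate paths and REMOTE_USER mechanism are all assumptions), a gateway that chooses a different client credential per authenticated user could look something like:

# Toy per-user gateway: forward the request onward using a client certificate
# chosen according to who the already-authenticated web user is.
from wsgiref.simple_server import make_server
import requests

BACKEND = "https://data.example.org"            # hypothetical onward data service
CERTS = {                                        # hypothetical per-user credentials
    "alice": ("/tmp/alice.crt", "/tmp/alice.key"),
    "bob":   ("/tmp/bob.crt", "/tmp/bob.key"),
}

def gateway(environ, start_response):
    user = environ.get("REMOTE_USER")            # set by the portal's own login step
    cert = CERTS.get(user)
    if cert is None:
        start_response("403 Forbidden", [("Content-Type", "text/plain")])
        return [b"no credential for this user\n"]
    # the onward connection authenticates as *this user*, not as the portal
    resp = requests.get(BACKEND + environ.get("PATH_INFO", "/"), cert=cert, stream=True)
    start_response("%d %s" % (resp.status_code, resp.reason or "OK"),
                   [("Content-Type", resp.headers.get("Content-Type", "application/octet-stream"))])
    return resp.iter_content(chunk_size=65536)

if __name__ == "__main__":
    make_server("localhost", 8080, gateway).serve_forever()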

We are not aware of anyone having done this before - reverse proxy with identity hooks. If the reader knows any, please comment on this post!

So in EUDAT we investigated a few options, including adding hooks to mod_proxy, but built a cheap and cheerful prototype by bringing the neglected ReverseProxy module up to Apache 2.2 and adding hooks into it.

How is this relevant to GridPP, I hear you cry?  Well, WLCG uses non-browser protocols extensively for data movement, such as GridFTP and xroot, so you need to translate if the user "only" has a browser (or soonish, you should be able to use WebDAV to some systems, but you still need to authenticate with a certificate.)  If this were hooked up to a MyProxy used as a Keystore or certification authority, you could have a lightweight authentication to the portal.

08 December 2014

Ruminations from the ATLAS Computing Jamboree '14

SO..... I have just spent the last 2.5 days at the ATLAS Facilities and Shifters Jamboree at CERN.
The shifters Jamboree was useful to attend since it allowed me to better comprehend the operational shifter's view of issues seen on services that I help keep in working order. The facilities Jamboree helped to highlight the planned changes (near term and further out) to computing operations and service requirements for Run 2 of the LHC.
A subset of highlights are:

Analysis jobs have been shown to handle 40MB/s (we had better make sure our internal network and disk servers can handle this when using direct I/O).

Planned increase in analysing data from the disk cache in front of our tape system rather than the disk only pool.

Increase in the amount (and types) of data that can be moved to tape. (The VO will be able to give a hint of the expected lifetime on tape. In general ATLAS expect to delete data from tape at a scale not seen before.)

Possibly using a web-enabled object store to allow storage and viewing of log files.

Event selection analysis as a method of data analysis at the sub-file level.

I also know what the tabs in bigpanda now do!!! (but that will be another blog ...)

05 December 2014

Where have all my children gone....

Dave here,
So higher powers decided to change their policy on keeping clones of my children; now we have:
631 of my children are unique and only live in one room; 124 have a twin, 33 are triplets, and there are two sets of quads. Hence my children are now much more vulnerable to a room being destroyed or damaged. However, it does mean there are now only 72404 files and 13.4TB of unique data on the GRID.
Of my children, there are 675 Dirks, 14 Gavins and 101 Ursulas.

These are located in 81 rooms across the following 45 Houses:
AGLT2
AUSTRALIA-ATLAS
BNL-OSG2
CERN-PROD
CSCS-LCG2
DESY-HH
FZK-LCG2
GRIF-IRFU
GRIF-LAL
GRIF-LPNHE
IFIC-LCG2
IN2P3-CC
IN2P3-LAPP
IN2P3-LPSC
INFN-MILANO-ATLASC
INFN-NAPOLI-ATLAS
INFN-ROMA1
INFN-T1
JINR-LCG2
LIP-COIMBRA
MPPMU
MWT2
NCG-INGRID-PT
NDGF-T1
NET2
NIKHEF-ELPROD
PIC
PRAGUELCG2
RAL-LCG2 ( I Live Here!!)
RU-PROTVINO-IHEP
SARA-MATRIX
SLACXRD
SMU
SWT2
TAIWAN-LCG2
TECHNION-HEP
TOKYO-LCG2
TR-10-ULAKBIM
TRIUMF-LCG2
UKI-LT2-RHUL
UKI-NORTHGRID-MAN-HEP
UKI-SOUTHGRID-OX-HEP
UNI-FREIBURG
WEIZMANN-LCG2
WUPPERTALPROD

Which corresponds to Australia, Canada, Czech Republic, France, Germany, Israel, Italy, Japan, Netherlands, Portugal, Russia, Spain, Switzerland, Turkey, UK and USA.


01 December 2014

Good Year for FTS Transfers (my first legitimate use of EB)

During this year, the WLCG sites running the File Transfer Service (FTS)  upgraded to FTS3.
We have also reduced the number of sites running the service. This has led to the RAL service being used more heavily.
A total of 0.224EB (or 224 PBytes) of data was moved using the WLCG FTS services (604M files).
This splits down by VO as:
131PB/550M files for ATLAS (92M failed transfers).  66PB/199M files were by the UK FTS.
85PB/48M files for CMS (10M failed  transfers).  25PB/14M files were by the UK FTS.
8PB/6M files for all other VOs (6.7M failed transfers). 250TB/1M files were by the UK FTS.

(Of course these figures ignore files created and stored at sites as the output of worker node jobs, and also ignore the "chaotic" transfer of files via other data transfer mechanisms.)

18 November 2014

Towards an open (data) science culture

Last week we celebrated the 50th anniversary of Atlas computing at Chilton, where RAL is located. (The anniversary was actually earlier; we just celebrated it now.)

While much of the event was about the computing and had lots of really interesting talks (which should appear on the Chilton site), let's highlight a data talk by Professor Jeremy Frey. If you remember the faster-than-light neutrinos, Jeremy praised CERN for making the data available early, even with caveats and doubts about the preliminary results. The idea is to get your data out, so that people can have a look at it and comment. Even if the preliminary results are wrong and neutrinos are not faster than light, what matters is that the data comes out and people can look at it. And most importantly, that publishing it will not negatively impact people's careers. On the contrary, Jeremy is absolutely right to point out that it should be good for people's careers to make data available (with suitable caveats).

But what would an "open science" data model look like?  Suddenly you would get a lot more data flying around, instead of (or in addition to) preprints and random emails and word of mouth. Perhaps it will work a bit like open source, which is supposed to be "given enough eyes, all bugs are shallow."  With open source, you sometimes see code which isn't quite ready for production, but at least you can look at the code and figure out whether it will work, and maybe adapt it.

While we are on the subject of open stuff, the code that simulates science and analyses data is also important. Please consider signing the SSI petition.

30 September 2014

Data format descriptions

The highlight of the data area working group meetings at the Open Grid Forum at Imperial recently was the Data Format Description Language. The idea is that if you have a formatted or structured input from a sensor, or a scientific event, and it's not already in one of the formatted, er, formats like (say) OPeNDAP or HDF5, you can use DFDL to describe it and then build a parser which, er, parses records of the format. For example, one use is to validate records before ingesting them into an archive or big data processing facility.
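
To see what problem this solves, consider the hand-rolled equivalent you would otherwise write for a little fixed-layout binary sensor record (the field layout here is invented purely for illustration; DFDL lets you declare this sort of thing in a schema instead of coding it up each time):

# Ad-hoc parser for an invented fixed-layout sensor record - the sort of
# one-off code a DFDL description (plus a generic parser) replaces.
import struct

RECORD = struct.Struct("<IHf")   # sensor id (uint32), status (uint16), reading (float32)

def parse(blob):
    records = []
    for offset in range(0, len(blob), RECORD.size):
        sensor_id, status, reading = RECORD.unpack_from(blob, offset)
        if status > 1:           # validate before ingesting into the archive
            raise ValueError("bad status flag at offset %d" % offset)
        records.append((sensor_id, status, reading))
    return records

print(parse(RECORD.pack(42, 1, 20.5) + RECORD.pack(43, 0, 19.75)))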

Led by Steve Hanson from IBM, we had an interactive tutorial building a DFDL description for a sensor; the interactive tool looks and feels a bit like Eclipse but is called the Integration Toolkit.
And for those eager for more, the appearance of DFDL v1.0 is imminent.

25 September 2014

Erasure-coding: how it can help *you*.

While some of the mechanisms for data access and placement in the WLCG/EGI grids are increasingly modern, there are underlying assumptions that are rooted in somewhat older design decisions.

Particularly relevant to this article: on 'The Grid', we tend to increase the resilience of our data against loss by making complete additional copies (either one on tape and one on disk, or additional copies on disk at different physical locations). Similarly, our concepts of data placement all operate at the 'file' level - if you want data to be available somewhere, you access a complete copy from one place or another (or potentially get multiple copies from different places, and the first one to arrive wins).
However, if we allow our concept of data to drop below the file level, we can develop some significant improvements.

Now, some of this is trivial: breaking a file into N chunks and distributing it across multiple devices to 'parallelise' access is called 'striping', and your average RAID controller has been doing it for decades (this is 'RAID0', the simplest RAID mode). Slightly more recently, the 'distributed' class of filesystems (Lustre, GPFS, HDFS et al) have allowed striping of files across multiple servers, to maximise performance across the network connections as well.

Striping, of course, increases the fragility of the data distributed. Rather than being dependent on the failure probability of a single disk (for single-machine striping) or a single server (for SANs), you are now dependent on the probability of any one of a set of entities in the stripe failing (a partial file is usually useless). Since the whole stripe survives only if every device survives, the survival probabilities multiply, and the chance of losing the stripe grows quickly with the number of devices (assuming their failure modes are independent).

So, we need some way to make our stripes more robust to the failure of components. Luckily, the topic of how to encode data to make it resilient against partial losses (or 'erasures'), via 'erasure codes', is an extremely well developed field indeed.
Essentially, the concept is this: take your N chunks that you have split your data into. Design a function such that, when fed N values, it will output an additional M values, such that each of those M values can be independently used to reconstruct a missing value from the original set of N. (The analogy used by the inventors of the Reed-Solomon code, the most widely used erasure-code family, is of overspecifying a polynomial by more samples than its order - you can always reconstruct an order-N polynomial from any N of the N+M samples you have.)
In fact, most erasure-codes will actually do better than that - as well as allowing the reconstruction of data known to be missing, they can also detect and correct data that is bad. The efficiency for this is half that for data reconstruction - you need 2 resilient values for every 1 unknown bad value you need to detect and fix.
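
The simplest member of the family is the single-parity (M=1) case familiar from RAID5: one extra XOR chunk lets you rebuild any one missing chunk. A toy illustration follows (real Reed-Solomon codes generalise this to arbitrary M, with rather more mathematics):

# Single-parity erasure coding in miniature: N data chunks plus one XOR
# parity chunk; any one missing chunk can be rebuilt from the others.
def xor(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

chunks = [b"AAAA", b"BBBB", b"CCCC"]       # N equal-length data chunks
parity = b"\x00" * len(chunks[0])
for c in chunks:
    parity = xor(parity, c)                # the single extra (M=1) chunk

lost = 1                                   # pretend chunk 1 was erased
rebuilt = parity
for i, c in enumerate(chunks):
    if i != lost:
        rebuilt = xor(rebuilt, c)
print(rebuilt == chunks[lost])             # True: the chunk is recovered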

If we decide how many devices we would expect to fail, we can use an erasure code to 'preprocess' our stripes, writing out N+M chunk stripes.

(The M=1 and M=2 implementations of this approach are called 'RAID5' and 'RAID6' when applied to disk controllers, but the general formulation has almost no limits on M.)

So, how do we apply this approach to Grid storage?

Well, Grid data stores already have a large degree of abstraction and indirection. We use LFCs (or other file catalogues) already to allow a single catalogue entry to tie together multiple replicas of the underlying data in different locations. It is relatively trivial to write a tool that (rather than simply copying a file to a Grid endpoint + registering it in an LFC) splits & encodes data into appropriate chunks, and then stripes them across available endpoints, storing the locations and scheme in the LFC metadata for the record.
Once we've done that, retrieving the files is a simple process, and we are able to perform other optimisations, such as getting all the available chunks in parallel, or healing our stripes on the fly (detecting errors when we download data for use).
Importantly, we do all this while also reducing the lower bound for resiliency substantially from 1 full additional copy of the data to M chunks, chosen based on the failure rate of our underlying endpoints.
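
To put some (made-up but plausible) numbers on that trade-off, compare keeping one full extra replica with an 8+2 erasure-coded stripe, assuming each device independently fails with probability p over some period:

# Toy comparison: two full copies lose data only if both devices fail;
# an N+M stripe loses data once more than M of its N+M devices fail.
from math import comb   # Python 3.8+

p, n, m = 0.005, 8, 2
replica_loss = p ** 2
ec_loss = sum(comb(n + m, k) * p**k * (1 - p)**(n + m - k)
              for k in range(m + 1, n + m + 1))
print("2 full copies: P(loss) = %.2e, space overhead = 100%%" % replica_loss)
print("%d+%d EC stripe: P(loss) = %.2e, space overhead = %.0f%%"
      % (n, m, ec_loss, 100.0 * m / n))

With these invented numbers the stripe is both cheaper and slightly more resilient; with much higher device failure rates the comparison can flip, which is exactly why M wants choosing against the real failure rate of the underlying endpoints.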

This past summer, one of our summer projects was based around developing just such a suite of wrappers for Grid data management (albeit using the DIRAC file catalogue, rather than the LFC).
We're very happy with Paulin's work on this, and a later post will demonstrate how it works and what we're planning on doing next.

22 August 2014

Lambda station

So what did CMS say at GridPP33?  Having looked ahead to the future, they came up with more speculative suggestions. Like FNAL's Lambda Station in the past, one suggestion was to look again at scheduling the network for transfers - what we might nowadays call network-as-a-service (well, near enough): since we schedule transfers, it would indeed make sense to integrate networks more closely with the pre-allocation at the endpoints (where you'd bringOnline() at the source and schedule the transfer to avoid saturating the channel). Phoebus is a related approach from Internet2.

21 August 2014

Updated data models from experiments

At the GridPP meeting in Ambleside, ATLAS announced having lifetimes on their files: not quite like the SRM implementation, where a file could be given a finite lifetime when created, but more like a timer which counts from the last access. Unlike SRM, when the file has not been accessed for the set length of time, it will be automatically deleted. Also notable is that files can now belong to multiple datasets, and they are set with automatic replication policies (well, basically how many replicas at T1s are required). Now with extra AOD visualisation goodness.

Also interesting updates from LHCb: they are continuing to use SRM to stage files from tape, but could be looking into FTS3 for this. Also discussed the DIRAC integrity checking with Sam over breakfast. In order to confuse the enemy, they are not using their own git but code from various places: both LHCb and DIRAC have their own repositories, and some code is marked as "abandonware", so determining which code is being used in practice requires asking. This correspondent would have naïvely assumed that whatever comes out of git is being used... perhaps that's just for high energy physics...

CMS to speak later.

08 August 2014

ARGUS user suspension with DPM

Many grid services that need to authenticate their users do so with LCAS/LCMAPS plugins, making integration with a site central authentication server such as ARGUS relatively straightforward. With the ARGUS client LCAS/LCMAPS plugins configured, all authentication decisions are referred to the central service at the time they're made. When the site ARGUS is configured to use the EGI/NGI emergency user suspension policies, any centrally suspended user DN will be automatically blocked from accessing the site's services.

However, DPM does its own authentication and maintains its own list of banned DNs, so rather than referring each decision to the site ARGUS, we need a specific tool to update DPM's view based on the site ARGUS server. Just to complicate matters further, DPM's packages live in the Fedora EPEL repository, which means that they cannot depend on the ARGUS client libraries, which do not live there.

The solution is the very small 'dpm-argus' package which is available from the EMI3 repositories for both SL5 and SL6; a package dependency bug has prevented its installation in the past, but this has been fixed as of EMI3 Update 19. It should be installed on the DPM head node (if installing manually rather than with yum, you'll also need the argus-pep-api-c package from EMI) and contains two files, the 'dpns-arguspoll' binary, and its manual page.

Running the tool is simple - it needs a 'resource string' to identify itself to the ARGUS server (for normal purposes it doesn't actually matter what it is) and the URL for the site ARGUS:
dpns-arguspoll my_resource_id https://argus.example.org:8154/authz
when run, it will iterate over the DNs known to the DPM, check each one against the ARGUS server, and update the DPM banning state accordingly. All that remains is to run it periodically. At Oxford we have an '/etc/cron.hourly/dpm-argus' script that simply looks like this:
#!/bin/sh
# Sync DPM's internal user banning states from argus

export DPNS_HOST=t2se01.physics.ox.ac.uk
dpns-arguspoll dpm_argleflargle https://t2argus04.physics.ox.ac.uk:8154/authz 2>/dev/null
And that's it. If you want to be able to see the current list of DNs that your DPM server considers to be banned, then you can query the head node database directly:
echo "SELECT username from Cns_userinfo WHERE banned = 1;" | mysql -u dpminfo -p cns_db
At the moment that should show you my test DN, and probably nothing else.

23 July 2014

IPv6 and XrootD 4

Xrootd version 4 has recently been released. As QMUL is involved in IPv6 testing, and as this new release now supports IPv6, I thought I ought to test it.  So,  what does this involve?

  1. Set up a dual stack virtual machine - our deployment system now makes this relatively easy. 
  2. Install xrootd. QMUL is a StoRM/Lustre site, and has an existing xrootd server that is part of ATLAS's FAX (Federated ATLAS storage systems using XRootD), so it's just a matter of configuring a new machine to export our POSIX storage in much the same way. In fact, I've done it slightly differently as I'm also testing ARGUS authentication, but that's something for another blog post.
  3. Test it - the difficult bit...
I decided to test it using CERN's dual stack lxplus machine: lxplus-ipv6.cern.ch.

First, I tested that I'd got FAX set up correctly:

setupATLAS
localSetupFAX
voms-proxy-init --voms atlas
testFAX


All 3 tests were successful, so I've got FAX working, next configure it to use my test machine:

export STORAGEPREFIX=root://xrootd02.esc.qmul.ac.uk:1094/
testFAX


Which also gave 3 successful tests out of 3. Finally, to prove that downloading files works, and that it isn't just redirection that works, I tested a file that should only be at QMUL:

xrdcp -d 1 root://xrootd02.esc.qmul.ac.uk:1094//atlas/rucio/user/ivukotic:user.ivukotic.xrootd.uki-lt2-qmul-1M -> /dev/null 

All of these reported that they were successful. Were they using IPv6 though? Well, looking at Xrootd's logs, it certainly thinks so - at least for some connections, though some still seem to be using IPv4:

140723 16:03:47 18291 XrootdXeq: cwalker.19073:26@lxplus0063.cern.ch pub IPv6 login as atlas027
140723 16:04:01 18271 XrootdXeq: cwalker.20147:27@lxplus0063.cern.ch pub IPv4 login as atlas027
140723 16:04:29 23892 XrootdXeq: cwalker.20189:26@lxplus0063.cern.ch pub IPv6 login as atlas027
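
Incidentally, a quick way to check what address families a client could have used is to ask the resolver what the server advertises - a dual-stack host should return both an IPv6 and an IPv4 address. A small Python check against the test machine above (entries may repeat for different socket types):

# List the addresses the xrootd test host resolves to; a dual-stack machine
# should show both an IPv6 and an IPv4 entry.
import socket
for family, _, _, _, sockaddr in socket.getaddrinfo("xrootd02.esc.qmul.ac.uk", 1094):
    print("IPv6" if family == socket.AF_INET6 else "IPv4", sockaddr[0])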


Progress!!!

30 June 2014

Thank you for making a simple compliance test very happy

Rob and I had a look at the gstat tests for RAL's CASTOR. For a good while now we have had a number of errors/warnings raised. They did not affect production: so what are they?

Each error message has a bit of text associated with it, saying typically "something is incompatible with something else" - like an "access control base rule" (ACBR) being incorrect, or what is published for tape not being consistent with the type of Storage Element (SE). The ACBR error arises from legacy attributes being published alongside the modern ones, and the latter complaint is about CASTOR presenting itself as a tape store (via a particular SE).

So what is going on?  Well, the (only) way to find out is to locate the test script and find out what exactly it is querying. In this case, it is a python script running LDAP queries, and luckily it can be found in CERN's source code repositories. (How did we find it in this repository? Why, by using a search engine, of course.)

Ah, splendid, so by checking the Documentation™ (also known as "source code" to some), we discover that it needs all ACBRs to be "correct" (not just one for each area) and the legacy ones need an extra slash on the VO value, and an SE with no tape pools should call itself "disk" even if it sits on a tape store.

So it's essentially test driven development: to make the final warnings go away, we need to read the code that is validating it, to engineer the LDIF to make the validation errors go away.