29 December 2014

Yet another exercise in data recovery?

Just before the Christmas break, my main drive on my main PC - at home - seemed to start to fail (the kernel put it in read-only mode). Good thing we have backups, eh? They are all on portable hard drives, usually encrypted, and maintained with unison. No, they are not "in the cloud."

Surprisingly much of my data is WORM (write once, read many), so what if there are differences between the backups? Was it due to those USB3 errors (caused a kernel panic, it did), a hardware fault, or that fsck which seemed to discover a problem, or has the file actually changed? (And a big "boo, hiss" to applications that modify files just by opening them - yes, you know who you are.) In my case, I would prefer to re-checksum them all and compare against at least four of the backups. So I need a tool.

My Christmas programming challenge for this year (one should always have one) is then to create a new program to compare my backups. Probably there is one floating around out there, but my scheme - the naming scheme, when I do level zeros, increments, masters, replicas - is probably odd enough that it is useful having a bespoke tool.

On the grid we tend to checksum files as they are transferred. Preservation tools can be asked to "wake up" data every so often and re-check them. Ideally the backup check should quietly validate the checksums in the background as long as the backup drive is mounted.

15 December 2014

Data gateway with dynamic identity - part 1

This doesn't look like GridPP stuff at first, but bear with me...

The grid works by linking sites across the world, by providing a sufficiently high level of infrastructure security using such things as IGTF. The EUDAT project is a data infrastructure project but has users who are unable/unwilling (delete as applicable) to use certificates themselves to authenticate. Thus projects use portals as a "friendly" front end.

So the question is, how do we get data through the proxy?  Yes, it's a reverse proxy, or gateway. Using Apache mod_proxy, this is easy to set up, but is limited to using a single credential for the onward connection.
Look at these (powerpoint) slides: in the top left slide, the user connects (e.g. with a browser) to the portal using some sort of lightweight security - either site-local if the portal is within the site, or federated web authentication in general. Based on this, the portal (top right) generates a key pair and obtains a certificate specific to the user - with the user's (distinguished) name and authorisation attributes. It then (bottom left) connects and sends the data back to the user's browser, or possibly, if the browser is capable of understanding the remote protocol, redirects the browser (with suitable onward authentication) to the remote data source.
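The key difference from a stock reverse proxy is the hook that selects the onward credential per authenticated user rather than using one fixed credential. A purely hypothetical sketch of that hook (everything here - the function names, the dict-based keystore standing in for a MyProxy-style service - is illustrative, not the actual EUDAT/ReverseProxy code):

```python
# Hypothetical per-user credential hook for an identity-aware gateway.
# A plain mod_proxy setup would effectively return one fixed credential;
# here the gateway derives one per authenticated portal user.

def onward_credential(session, keystore):
    """Pick the outbound credential for a proxied request."""
    user = session.get("federated_id")
    if user is None:
        raise PermissionError("no authenticated portal session")
    # Reuse a cached short-lived credential if we already issued one...
    cred = keystore.get(user)
    if cred is None:
        # ...otherwise generate a key pair and obtain a user-specific
        # certificate carrying the user's DN and authorisation attributes.
        cred = {"dn": "/DC=example/CN=%s" % user,
                "attrs": session.get("attrs", [])}
        keystore[user] = cred
    return cred
```

The real implementation would of course do the key generation and certificate request against a proper keystore or online CA; the point is just where the hook sits in the request path.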

We are not aware of anyone having done this before - a reverse proxy with identity hooks. If the reader knows of any, please comment on this post!

So in EUDAT we investigated a few options, including adding hooks to mod_proxy, but built a cheap and cheerful prototype by bringing the neglected ReverseProxy module up to Apache 2.2 and adding hooks into it.

How is this relevant to GridPP, I hear you cry?  Well, WLCG uses non-browser protocols extensively for data movement, such as GridFTP and xroot, so you need to translate if the user "only" has a browser (or soonish, you should be able to use WebDAV to some systems, but you still need to authenticate with a certificate.)  If this were hooked up to a MyProxy used as a Keystore or certification authority, you could have a lightweight authentication to the portal.

08 December 2014

Ruminations from the ATLAS Computing Jamboree '14

SO..... I have just spent the last 2.5 days at the ATLAS Facilities and Shifters Jamboree at CERN.
The shifters Jamboree was useful to attend since it allowed me to better comprehend the operational shifters' view of issues seen on services that I help keep in working order. The facilities Jamboree helped to highlight the planned changes (near term and further out) for computing operations and service requirements for Run 2 of the LHC.
A subset of highlights are:

Analysis jobs have been shown to handle 40MB/s (we had better make sure our internal network and disk servers can handle this when using direct I/O).

Planned increase in analysing data from the disk cache in front of our tape system rather than the disk only pool.

Increase in the amount (and types) of data that can be moved to tape. (The VO will be able to give a hint of the expected lifetime on tape. In general ATLAS expect to delete data from tape at a scale not seen before.)

Possibly using a web-enabled object store to allow storage and viewing of log files.

Event selection analysis as a method of data analysis at the sub-file level.

I also know what the tabs in bigpanda now do!!! (but that will be another blog ...)

05 December 2014

Where have all my children gone....

Dave here,
So higher powers decided to change their policy on keeping clones of my children. Now we have:
631 of my children are unique and live in only one room; 124 have a twin, 33 are triplets and there are two sets of quads. Hence my children are now much more vulnerable to a room being destroyed or damaged. However, it does mean there are now only 72404 files and 13.4TB of unique data on the GRID.
Of my children, there are 675 Dirks, 14 Gavins and 101 Ursulas.

These are located in 81 rooms across the following 45 Houses:
RAL-LCG2 ( I Live Here!!)

Which corresponds to Australia, Canada, Czech Republic, France, Germany, Israel, Italy, Japan, Netherlands, Portugal, Russia, Spain, Switzerland, Turkey, UK and USA

01 December 2014

Good Year for FTS Transfers (my first legitimate use of EB)

During this year, the WLCG sites running the File Transfer Service (FTS) upgraded to FTS3.
We have also reduced the number of sites running the service, which has led to the RAL service being used more heavily.
A total of 0.224EB (or 224PB) of data was moved using WLCG FTS services (604M files).
This breaks down by VO as:
131PB/550M files for ATLAS (92M failed transfers); 66PB/199M files went via the UK FTS.
85PB/48M files for CMS (10M failed transfers); 25PB/14M files went via the UK FTS.
8PB/6M files for all other VOs (6.7M failed transfers); 250TB/1M files went via the UK FTS.

(Of course these figures ignore files created and stored at sites from the output of Worker Node jobs, and also ignore the "chaotic" transfer of files via other data transfer mechanisms.)

18 November 2014

Towards an open (data) science culture

Last week we celebrated the 50th anniversary of ATLAS computing at Chilton where RAL is located. (The anniversary was actually earlier, we just celebrated it now.)

While much of the event was about the computing and had lots of really interesting talks (which should appear on the Chilton site), let's highlight a data talk by Professor Jeremy Frey. If you remember the faster-than-light neutrinos, Jeremy praised CERN for making the data available early, even with caveats and doubts about the preliminary results. The idea is to get your data out, so people can have a look at it and comment. Even if the preliminary results are wrong and neutrinos are not faster than light, what matters is that the data comes out and people can look at it. And most importantly, that it will not negatively impact people's careers to publish it. On the contrary, Jeremy is absolutely right to point out that it should be good for people's careers to make data available (with suitable caveats).

But what would an "open science" data model look like?  Suddenly you would get a lot more data flying around, instead of (or in addition to) preprints and random emails and word of mouth. Perhaps it will work a bit like open source, whose maxim is that "given enough eyeballs, all bugs are shallow."  With open source, you sometimes see code which isn't quite ready for production, but at least you can look at the code and figure out whether it will work, and maybe adapt it.

While we are on the subject of open stuff, the code that simulates science and analyses data is also important. Please consider signing the SSI petition.

30 September 2014

Data format descriptions

The highlight of the data area working group meetings at the Open Grid Forum at Imperial recently was the Data Format Description Language (DFDL). The idea is that if you have a formatted or structured input from a sensor, or a scientific event, and it's not already in one of the formatted, er, formats like (say) OPeNDAP or HDF5, you can use DFDL to describe it and then build a parser which, er, parses records of the format. For example, one use is to validate records before ingesting them into an archive or big data processing facility.

Led by Steve Hanson from IBM, we had an interactive tutorial building a DFDL description for a sensor: the interactive tool looks and feels a bit like Eclipse but is called the Integration Toolkit.
And for those eager for more, the appearance of DFDL v1.0 is imminent.
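DFDL itself expresses format descriptions as annotations on XML Schema; purely to illustrate the underlying idea (this is not DFDL syntax, and the record layout is invented), a declarative description of a binary sensor record plus a generic parser driven by it might look like:

```python
# The description is data, not code: a generic parser walks it, so the
# same parser can validate and ingest any format described this way.
import struct

# Hypothetical sensor record layout, big-endian throughout.
SENSOR_RECORD = [
    ("timestamp", ">I"),   # 4-byte unsigned seconds since epoch
    ("sensor_id", ">H"),   # 2-byte unsigned sensor identifier
    ("reading",   ">f"),   # 4-byte IEEE float
]

def parse_record(description, data):
    """Parse one record according to the field description, rejecting
    truncated input - the kind of validation you'd want before archiving."""
    fields = {}
    offset = 0
    for name, fmt in description:
        size = struct.calcsize(fmt)
        if offset + size > len(data):
            raise ValueError("record too short at field %r" % name)
        (fields[name],) = struct.unpack_from(fmt, data, offset)
        offset += size
    return fields
```

The attraction of the DFDL approach is exactly this separation: the format lives in a description document, and the parser is generated (or generic) rather than hand-written per format.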

25 September 2014

Erasure-coding: how it can help *you*.

While some of the mechanisms for data access and placement in the WLCG/EGI grids are increasingly modern, there are underlying assumptions that are rooted in somewhat older design decisions.

Particularly relevant to this article: on 'The Grid', we tend to increase the resilience of our data against loss by making complete additional copies (either one on tape and one on disk, or additional copies on disk at different physical locations). Similarly, our concepts of data placement are all located at the 'file' level - if you want data to be available somewhere, you access a complete copy from one place or another (or potentially get multiple copies from different places, and the first one to arrive wins).
However, if we allow our concept of data to drop below the file level, we can develop some significant improvements.

Now, some of this is trivial: breaking a file into N chunks and distributing it across multiple devices to 'parallelise' access is called 'striping', and your average RAID controller has been doing it for decades (this is 'RAID0', the simplest RAID mode). Slightly more recently, the 'distributed' class of filesystems (Lustre, GPFS, HDFS et al) have allowed striping of files across multiple servers, to maximise performance across the network connections as well.

Striping, of course, increases the fragility of the distributed data. Rather than depending on the failure probability of a single disk (for single-machine striping) or a single server (for SANs), you now depend on the probability of any one of a set of entities in the stripe failing (a partial file is usually useless). Assuming their failure modes are independent, this probability scales roughly linearly with the number of devices in the stripe.
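As a sanity check on that scaling: assuming independent failures with per-device probability p, the chance of losing at least one member of an N-device stripe is 1 - (1-p)^N, which for small p is roughly N times worse than a single device.

```python
# Probability that a stripe of n devices loses data, assuming
# independent per-device failure probability p.
def stripe_loss_probability(p, n):
    """P(at least one of n devices fails) = 1 - (1 - p)^n."""
    return 1 - (1 - p) ** n

# With a 2% per-period failure rate, a single device vs an 8-device stripe:
single = stripe_loss_probability(0.02, 1)   # ~0.02
stripe = stripe_loss_probability(0.02, 8)   # ~0.149, roughly 8x worse
```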

So, we need some way to make our stripes more robust to the failure of components. Luckily, the topic of how to encode data to make it resilient against partial losses (or 'erasures'), via 'erasure codes', is an extremely well developed field indeed.
Essentially, the concept is this: take the N chunks that you have split your data into. Design a function which, when fed N values, outputs an additional M values, such that each of those M values can be independently used to reconstruct a missing value from the original set of N. (The analogy used by the inventors of the Reed-Solomon code, the most widely used erasure-code family, is of overspecifying a polynomial by more samples than its order - you can always reconstruct an order-N polynomial from any N of the N+M samples you have.)
In fact, most erasure-codes will actually do better than that - as well as allowing the reconstruction of data known to be missing, they can also detect and correct data that is bad. The efficiency for this is half that for data reconstruction - you need 2 resilient values for every 1 unknown bad value you need to detect and fix.

If we decide how many devices we would expect to fail, we can use an erasure code to 'preprocess' our stripes, writing out N+M chunk stripes.

(The M=1 and M=2 implementations of this approach are called 'RAID5' and 'RAID6' when applied to disk controllers, but the general formulation has almost no limits on M.)
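As a toy illustration of the M=1 (RAID5-style) case described above - using plain XOR parity in Python rather than anything grid-specific - any single missing chunk of a stripe can be rebuilt from the survivors:

```python
# XOR parity: the simplest erasure code. Real systems use Reed-Solomon
# codes, which generalise this to arbitrary M, but the reconstruction
# idea is the same.
def encode(chunks):
    """Return the chunks plus one XOR parity chunk (equal-length chunks)."""
    parity = bytes(len(chunks[0]))
    for c in chunks:
        parity = bytes(a ^ b for a, b in zip(parity, c))
    return chunks + [parity]

def reconstruct(stripe, missing_index):
    """Rebuild the single chunk at missing_index by XOR-ing the survivors."""
    survivors = [c for i, c in enumerate(stripe) if i != missing_index]
    rebuilt = bytes(len(survivors[0]))
    for c in survivors:
        rebuilt = bytes(a ^ b for a, b in zip(rebuilt, c))
    return rebuilt
```

Note the storage cost: for N data chunks we store N+1, rather than 2N for a full replica - which is exactly the resiliency saving argued for below.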

So, how do we apply this approach to Grid storage?

Well, Grid data stores already have a large degree of abstraction and indirection. We use LFCs (or other file catalogues) already to allow a single catalogue entry to tie together multiple replicas of the underlying data in different locations. It is relatively trivial to write a tool that (rather than simply copying a file to a Grid endpoint + registering it in an LFC) splits & encodes data into appropriate chunks, and then stripes them across available endpoints, storing the locations and scheme in the LFC metadata for the record.
Once we've done that, retrieving the files is a simple process, and we are able to perform other optimisations, such as getting all the available chunks in parallel, or healing our stripes on the fly (detecting errors when we download data for use).
Importantly, we do all this while also reducing the lower bound for resiliency substantially from 1 full additional copy of the data to M chunks, chosen based on the failure rate of our underlying endpoints.

This past summer, one of our summer projects was based around developing just such a suite of wrappers for Grid data management (albeit using the DIRAC file catalogue, rather than the LFC).
We're very happy with Paulin's work on this, and a later post will demonstrate how it works and what we're planning on doing next.

22 August 2014

Lambda station

So what did CMS say at GridPP33?  Having looked ahead to the future, they came up with more speculative suggestions. Like FNAL's Lambda Station in the past, one suggestion was to look again at scheduling networks for transfers - what we might nowadays call network-as-a-service (well, near enough): since we schedule transfers, it would indeed make sense to integrate networks more closely with the pre-allocation at the endpoints (where you'd bringOnline() at the source and schedule the transfer to avoid saturating the channel). Phoebus is a related approach from Internet2.

21 August 2014

Updated data models from experiments

At the GridPP meeting in Ambleside, ATLAS announced having lifetimes on their files: not quite like the SRM implementation, where a file could be given a finite lifetime when created, but more like a timer which counts after each access. Unlike SRM, when a file has not been accessed for the set length of time, it will be automatically deleted. Also notable is that files can now belong to multiple datasets, and they are set with automatic replication policies (well, basically how many replicas at T1s are required.) Now with extra AOD visualisation goodness.

Also interesting updates from LHCb: they are continuing to use SRM to stage files from tape, but could be looking into FTS3 for this. Also discussed DIRAC integrity checking with Sam over breakfast. In order to confuse the enemy they are not using a single Git repository but code from various places: both LHCb and DIRAC have their own repositories, and some code is marked as "abandonware," so determining which code is being used in practice requires asking. This correspondent would have naïvely assumed that whatever comes out of git is what's being used... perhaps that's just for high energy physics...

CMS to speak later.

08 August 2014

ARGUS user suspension with DPM

Many grid services that need to authenticate their users do so with LCAS/LCMAPS plugins, making integration with a site central authentication server such as ARGUS relatively straightforward. With the ARGUS client LCAS/LCMAPS plugins configured, all authentication decisions are referred to the central service at the time they're made. When the site ARGUS is configured to use the EGI/NGI emergency user suspension policies, any centrally suspended user DN will be automatically blocked from accessing the site's services.

However, DPM does its own authentication and maintains its own list of banned DNs, so rather than referring each decision to the site ARGUS, we need a specific tool to update DPM's view based on the site ARGUS server. Just to complicate matters further, DPM's packages live in the Fedora EPEL repository, which means that they cannot depend on the ARGUS client libraries, which do not live there.

The solution is the very small 'dpm-argus' package which is available from the EMI3 repositories for both SL5 and SL6; a package dependency bug has prevented its installation in the past, but this has been fixed as of EMI3 Update 19. It should be installed on the DPM head node (if installing manually rather than with yum, you'll also need the argus-pep-api-c package from EMI) and contains two files, the 'dpns-arguspoll' binary, and its manual page.

Running the tool is simple - it needs a 'resource string' to identify itself to the ARGUS server (for normal purposes it doesn't actually matter what it is) and the URL for the site ARGUS:
dpns-arguspoll my_resource_id https://argus.example.org:8154/authz
When run, it will iterate over the DNs known to the DPM, check each one against the ARGUS server, and update the DPM banning state accordingly. All that remains is to run it periodically. At Oxford we have an '/etc/cron.hourly/dpm-argus' script that simply looks like this:
#!/bin/sh
# Sync DPM's internal user banning states from argus

export DPNS_HOST=t2se01.physics.ox.ac.uk
dpns-arguspoll dpm_argleflargle https://t2argus04.physics.ox.ac.uk:8154/authz 2>/dev/null
And that's it. If you want to be able to see the current list of DNs that your DPM server considers to be banned, then you can query the head node database directly:
echo "SELECT username from Cns_userinfo WHERE banned = 1;" | mysql -u dpminfo -p cns_db
At the moment that should show you my test DN, and probably nothing else.

23 July 2014

IPv6 and XrootD 4

Xrootd version 4 has recently been released. As QMUL is involved in IPv6 testing, and as this new release now supports IPv6, I thought I ought to test it.  So,  what does this involve?

  1. Set up a dual stack virtual machine - our deployment system now makes this relatively easy. 
  2. Install xrootd. QMUL is a StoRM/Lustre site, and has an existing xrootd server that is part of Atlas's  FAX (Federated ATLAS storage systems using XRootD), so it's just a matter of configuring a new machine to export our posix storage in much the same way.  In fact, I've done it slightly differently as I'm also testing ARGUS authentication, but that's something for another blog post. 
  3. Test it - the difficult bit...
I decided to test it using CERN's dual stack lxplus machine: lxplus-ipv6.cern.ch.

First, I tested that I'd got FAX set up correctly:

voms-proxy-init --voms atlas

All 3 tests were successful, so I've got FAX working, next configure it to use my test machine:

export STORAGEPREFIX=root://xrootd02.esc.qmul.ac.uk:1094/

Which also gave 3 successful tests out of 3. Finally, to prove that downloading files works, and that it isn't just redirection that works, I tested a file that should only be at QMUL:

xrdcp -d 1 root://xrootd02.esc.qmul.ac.uk:1094//atlas/rucio/user/ivukotic:user.ivukotic.xrootd.uki-lt2-qmul-1M -> /dev/null 

All of these reported that they were successful. Were they using IPv6 though? Well, looking at Xrootd's logs, it certainly thinks so - at least for some connections, though some still seem to be using IPv4:

140723 16:03:47 18291 XrootdXeq: cwalker.19073:26@lxplus0063.cern.ch pub IPv6 login as atlas027
140723 16:04:01 18271 XrootdXeq: cwalker.20147:27@lxplus0063.cern.ch pub IPv4 login as atlas027
140723 16:04:29 23892 XrootdXeq: cwalker.20189:26@lxplus0063.cern.ch pub IPv6 login as atlas027
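For a rough tally of how many logins use each protocol family, log lines of the form above can simply be counted (a quick sketch; the regular expression assumes the log format shown and nothing more):

```python
# Tally IPv4 vs IPv6 logins from xrootd log lines like:
#   ... XrootdXeq: user@host pub IPv6 login as atlas027
import re
from collections import Counter

LOGIN_RE = re.compile(r"XrootdXeq: \S+ \S+ (IPv[46]) login")

def count_login_families(lines):
    """Return a Counter mapping 'IPv4'/'IPv6' to the number of logins seen."""
    counts = Counter()
    for line in lines:
        m = LOGIN_RE.search(line)
        if m:
            counts[m.group(1)] += 1
    return counts
```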


30 June 2014

Thank you for making a simple compliance test very happy

Rob and I had a look at the gstat tests for RAL's CASTOR. For a good while now we have had a number of errors/warnings raised. They did not affect production: so what are they?

Each error message has a bit of text associated with it, saying typically "something is incompatible with something else" - like an "access control base rule" (ACBR) being incorrect, or the tape published being inconsistent with the type of Storage Element (SE). The ACBR error arises from legacy attributes being published alongside the modern ones, and the latter complains about CASTOR presenting itself as a tape store (via a particular SE).

So what is going on?  Well, the (only) way to find out is to locate the test script and find out what exactly it is querying. In this case, it is a python script running LDAP queries, and luckily it can be found in CERN's source code repositories. (How did we find it in this repository? Why, by using a search engine, of course.)

Ah, splendid, so by checking the Documentation™ (also known as "source code" to some), we discover that it needs all ACBRs to be "correct" (not just one for each area) and the legacy ones need an extra slash on the VO value, and an SE with no tape pools should call itself "disk" even if it sits on a tape store.

So it's essentially test driven development: to make the final warnings go away, we need to read the code that is validating it, to engineer the LDIF to make the validation errors go away.

09 June 2014

How much of a small file problem do we have...An update

So, as an update to my previous post "How much of a small file problem do we have...", I decided to look at a single part of the namespace within the storage element at the Tier1 rather than a single disk server. (The WLCG VOs know this as a scope or family etc.)
When analysing for ATLAS (if you remember, this was the VO I was personally most worried about due to the large number of small files), I obtained the following numbers:

Total number of files:      3670322
Total number of log files:  1090250
Volume of log files:        4.254TB
Volume of all files:        590.731TB
The log files represent ~29.7% of the files within the scope, so perhaps the disk server I picked was enriched with log files compared to the average.
What is worrying is that this 30% of files is responsible for only 0.7% of the disk space used (4.254TB out of a total 590.731TB).
The mean file size of the log files is 3.9MB and the median is 2.3MB. The log file sizes range from 6kB to 10GB, so some processes within the VO do seem able to create large log files. If one were to remove the log files from the space, the mean file size would increase from 161MB to 227MB, and the median file size would increase from 22.87MB to 45.63MB.
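The scan behind these numbers can be sketched as follows (a rough sketch: here files are classified by a hypothetical ".log" suffix, whereas the real classification would follow the VO's naming conventions):

```python
# Given (path, size) pairs from a namespace dump or directory walk,
# split into "log" and "other" files and compute per-class summary stats.
import statistics

def size_stats(paths_and_sizes, is_log=lambda p: p.endswith(".log")):
    """Return {class: {count, volume, mean, median}} for each file class."""
    classes = {"log": [], "other": []}
    for path, size in paths_and_sizes:
        classes["log" if is_log(path) else "other"].append(size)
    report = {}
    for name, sizes in classes.items():
        if sizes:
            report[name] = {
                "count": len(sizes),
                "volume": sum(sizes),
                "mean": statistics.mean(sizes),
                "median": statistics.median(sizes),
            }
    return report
```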

07 May 2014

Public research, open data

RAL hosted a meeting for research councils, other public bodies, and industry participants, on open data, organised with the Big Innovation Centre (we will have a link once the presentations have been uploaded).

As you know, research councils in the UK have data policies which say

  • Publicly funded data must be made public
  • Data can be embargoed - even if publicly funded, it will be protected for a period of time, to enable you to get your results, write your papers, achieve world domination. You know, usual stuff.
  • Data should be usable.
  • The people who produced the data should be credited for the work - in other words, the data should be cited, as you would cite a publication with results that you use or refer to.
All of these are quite challenging (of which more anon), but interestingly some of the other data publishers even had to train (external) people to use their data. Would you say data is open not just when it is usable, but also when it is actually being used? That certainly makes the policies even more challenging. The next step beyond that would be that the data actually has a measurable economic impact.

You might ask: so what use is the high energy physics (HEP) data, older data, or LHC data such as that held by GridPP, to the general public?  But that is the wrong question, because you don't know what use it is till someone's got it and looked at it. If we can't see an application of the data today - someone else might see it, or we might see one tomorrow.  And the applications of HEP tend to come after some time: when neutrons were discovered, no one knew what they were good for; today they are used in almost all areas of science. Accelerators used in the early days of physics have led to the ones we use today in physics, but also to the ones used in healthcare. What good will come of the LHC data?  Who knows. HEP has the potential to have a huge impact - if you're patient...

24 April 2014

How much of a small file problem do we have...

Here at the Tier1 at RAL-LCG2, we have been draining disk servers with a fury (achieving over 800MB/s on a 10G NIC machine). Well, we get that rate on some servers with large files; machines with small files achieve a lower rate. But how many small files do we have, and is there a VO dependency? So I decided to look at our three largest LCG VOs.
In tabular form, here is the analysis so far:

sub section                   LHCb      CMS   ATLAS (all)   ATLAS (non-log)   ATLAS (log)
# Files                      16305    14717        396887            181799        215088
Size (TB)                   37.565   39.599        37.564            35.501         2.062
# Files > 10GB                   1       24            75                75             0
# Files > 1GB                 8526    11902          9683              9657            26
# Files < 100MB               4434     2330         3E+06            134137         3E+06
# Files < 10MB                2200      569        265464             68792        196672
# Files < 1MB                 1429      294         85190             20587         64603
# Files < 100kB                243       91          6693              2124          4569
# Files < 10kB                   6       13           635               156           479
Ave Filesize (GB)             2.30     2.69        0.0946             0.195       0.00959
% space used by files >1GB   96.71    79.73         64.56

Now what I find interesting is how similar the LHCb and CMS values are to each other, even though they are vastly different VOs. What worries me is that over 50% of ATLAS files are less than 10MB. Now just to find a Tier2 to do a similar analysis, to see if it is just a T1 issue.....

01 April 2014

Dell OpenManage for disk servers

As we've been telling everyone who'll listen, we at Oxford are big fans of the Dell 12-bay disk servers for grid storage (previously R510 units, now R720xd ones). A few people have now bought them and asked about monitoring them.

Dell's tools all go by the general 'OpenManage' branding, which covers a great range of things, including various general purpose GUI tools. However, for the disk servers, we generally go for a minimal command-line install.

Dell have the necessary bits available in a YUM-able repository as described on the Dell Linux wiki. Our setup simply involves:
  • Installing the repository file,
  • yum install srvadmin-storageservices srvadmin-omcommon,
  • service dataeng start
  • and finally logging out and back in again, or otherwise picking up the PATH variable change from the newly installed srvadmin-path.sh script in /etc/profile.d
At that point, you should be able to query the state of your array with the 'omreport' tool, for example:
# omreport storage vdisk controller=0
List of Virtual Disks on Controller PERC H710P Mini (Embedded)

Controller PERC H710P Mini (Embedded)
ID                            : 0
Status                        : Ok
Name                          : VDos
State                         : Ready
Hot Spare Policy violated     : Not Assigned
Encrypted                     : No
Layout                        : RAID-6
Size                          : 100.00 GB (107374182400 bytes)
Associated Fluid Cache State  : Not Applicable
Device Name                   : /dev/sda
Bus Protocol                  : SATA
Media                         : HDD
Read Policy                   : Adaptive Read Ahead
Write Policy                  : Write Back
Cache Policy                  : Not Applicable
Stripe Element Size           : 64 KB
Disk Cache Policy             : Enabled
We also have a rough and ready Nagios plugin which simply checks that each physical disk reports as 'OK' and 'Online' and complains if anything else is reported.
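A sketch of such a plugin (not our actual script): it parses `omreport storage pdisk controller=0` output, assuming each physical disk prints `Status : ...` and `State : ...` lines in the same style as the vdisk report above, and follows the Nagios convention of exit code 0 for OK and 2 for CRITICAL.

```python
# Rough Nagios-style check for Dell PERC physical disks via omreport.
import subprocess
import sys

def check_pdisks(report_text):
    """Return a list of complaints; an empty list means all disks look healthy."""
    problems = []
    for line in report_text.splitlines():
        if ":" not in line:
            continue
        key, value = (part.strip() for part in line.split(":", 1))
        if key == "Status" and value != "Ok":
            problems.append("status is %s" % value)
        elif key == "State" and value not in ("Online", "Ready"):
            problems.append("state is %s" % value)
    return problems

def main():
    """Run omreport and report in Nagios style; returns the exit code."""
    out = subprocess.run(
        ["omreport", "storage", "pdisk", "controller=0"],
        capture_output=True, text=True, check=True,
    ).stdout
    problems = check_pdisks(out)
    if problems:
        print("CRITICAL: " + "; ".join(problems))
        return 2
    print("OK: all physical disks report Ok/Online")
    return 0

# A Nagios wrapper would simply call: sys.exit(main())
```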

31 March 2014

Highlights of ISGC 2014

ISGC 2014 is over. Lots of interesting discussions - on the infrastructure end, ASGC developing fanless machine room, interest in (and results on) CEPH and GLUSTER, dCache tutorial, and an hour of code with the DIRAC tutorial.

All countries and regions presented overviews of their work in e-/cyber-Infrastructure.

Interestingly, although this wasn't a HEP conference, practically everyone is doing >0 on LHC, so the LHC really is binding countries and researchers (well, at least physicists and infrastructurists) and e-Infrastructures together (and NRENs). When one day someone sits down to tally up the benefits and impact of the LHC, this ought to be one of the top ones: the ability to work together, to (mostly) be able to move data to each other, and to trust each other's CAs.

Regarding the DIRAC tutorial, I was there and went through as much as I could ("I am not doing that to my private key"). Something to play with a bit more when I have time - an hour (of code) is not much time; there are always compromises between getting stuff done realistically and cheating in tutorials, but it's fine as long as there's something you can take away and play with later. As regards the key shenanigans, DIRAC say they will be working with EGI on SSO, so that's promising. Got the T-shirt, too. "Interware," though?

On the security side, OSG have been interfacing to DigiCert, following the planned termination of the ESNET CA. Once again grids have demands that are not seen in the commercial world, such as the need for bulk certificates (particularly cost effective ones - something a traditional Classic IGTF can do fairly well.) Other security questions (techie acronym alert, until end of paragraph) include how Argus and XACML compare for implementing security policies, and the EMI STS - CERN looking at linking with ADFS. And Malaysia are trialling an online CA based on a FIPS level three token with a Raspberry π.

EGI federated cloud got mentioned quite a few times - KISTI interested in offering IaaS, Australia interested in joining, the Philippines providing resources. EGI have a strategy for engagement. Interesting the extent to which they are driving the uptake of CDMI.

I should mention Shaun gave a talk on "federated" access to data, comparing the protocols - which I missed - the talk, I mean - being in another session, but I understand it was well received and there was a lot of interest.

Software development - interesting experiences from the dCache team and building user communities with (for) DIRAC. How are people taught to develop code? The closing session was by Adam Lyon from Fermilab who talked about the lessons learned - the HEP vision of big data being different from the industry one. And yet HEP needs a culture shift to move away from the not-invented-here.

ISGC really had a great mix of Asian and European countries, as well as the US and Australia. This post was just a quick look through my notes; there'll be much more to pick up and ponder over the coming months. And I haven't even mentioned the actual science stuff ...

Storage thoughts from GRIDPP32

Last week saw me successfully talk about the planned CEPH installation at the RAL Tier1. Here is a list of other thoughts which came up from GRIDPP32:

ATLAS and CMS plans for Run 2 of the LHC seem to involve an increased churn rate of data at their Tier2s, which will require a higher deletion rate. We will also need to look at making sure dark data is discovered and deleted in a more timely manner.

A method needs to be created for discovering and deleting empty directories which are no longer needed. As an example, at the Tier1 there are currently 1071 ATLAS users, each of whom can create up to 131072 sub-directories, which can end up as dark directories under ATLAS's new RUCIO namespace convention.

To help with deletion, some of the bulk tools the site admins can use are impressive (but also possibly hazardous). One small typo when deleting may lead to huge unintentional data loss!!!

Data rates of over 30Gbps WAN traffic shown by Imperial College are impressive (and make me want to compare all UK sites to see what rates have been recorded via the WLCG monitoring pages).

Wahid Bhimji's storage talk also got me thinking again that, with the rise of the WLCG VOs' FAX/AAA systems and their relative increase in usage, perhaps it is time to re-investigate WAN tuning not only of WNs at sites but also of the XROOT proxy servers used by the VOs. In addition, I am still worried about monitoring and controlling the number of xrootd connections per disk server in each of the types of SE which we have deployed on the WLCG.
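For reference, WAN tuning of WNs and proxy servers usually starts from the kernel TCP buffer settings. The figures below are the commonly quoted starting points for ~10Gb paths (not site-validated values - every site needs to measure and adjust):

```
# /etc/sysctl.conf fragment - typical starting point for 10Gb WAN paths
net.core.rmem_max = 67108864
net.core.wmem_max = 67108864
net.ipv4.tcp_rmem = 4096 87380 33554432
net.ipv4.tcp_wmem = 4096 65536 33554432
net.ipv4.tcp_congestion_control = htcp
net.core.netdev_max_backlog = 30000
```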

I was also interested to see his work using DAV and its possible usefulness for smaller VOs.

27 March 2014

dCache workshop at (with) ISGC 2014

Shaun and I took part in the dCache workshop. Starting with a VM with a dCache RPM, the challenge was to set it up with two pools, NFS4, and WebDAV. A second VM got to access the data, mainly via NFS or HTTP(S) - security ranged from IP address to X.509 certificates. The overall impression was that it was pretty easy to get set up and configure the interfaces and get it to do something useful: dCache is not "an SRM" or "an NFS server" but rather storage middleware which provides a wide range of interfaces to storage. One of the things the dCache team is looking into is the cloud interface, via CDMI. This particular interface is not ready (as of March 2014) for production, but it's something we may want to look into and test with the EGI FC's version, Stoxy.
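For flavour, the layout file for the workshop exercise looked roughly like this (section and property names are from memory and vary between dCache releases, so treat it purely as an illustration of the style, not a working configuration):

```
# /etc/dcache/layouts/single.conf - illustrative only
[dCacheDomain]
[dCacheDomain/poolmanager]
[dCacheDomain/pnfsmanager]

[poolsDomain]
[poolsDomain/pool]
pool.name = pool1
pool.path = /srv/dcache/pool1

[poolsDomain/pool]
pool.name = pool2
pool.path = /srv/dcache/pool2

[doorsDomain]
[doorsDomain/webdav]
[doorsDomain/nfs]
nfs.version = 4.1
```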

05 March 2014

Some thoughts on "normal" HTTP clients and Grid authorisation

In thinking-out-loud mode. Grid clients use certificates: generally this enhances security as you get mutual authentication. To present authorisation attributes, these either have to be carried with the credential or queried via a callout (or cached locally). Access control is generally performed at the resource.

For authorisation attributes we tend to use VOMS, using attribute certificates. These are embedded inside the Globus proxy certificate, which is a temporary client certificate created and signed by the user certificate, and "decorated" with the authorisation stuff - this makes sense: it separates authentication from authorisation. Globus proxies, however, tend not to work with "standard" HTTP clients, like browsers (which is not HTTP's fault, but a feature of the secure sockets.)

VOMS is nice because you get group membership and can choose to optionally assert rôles. The user selection is often missing in many authorisation schemes which either present all your attributes or none (or give you no choice at all.)

So how would we get grid authorisation working with "standard" HTTP clients?  One way is to do what SARoNGS does: get a temporary certificate and decorate that instead. The client doesn't manage the certificate directly, but grants access to it, as GlobusOnline does: you give GO access to your MyProxy server, either by giving it your MyProxy username/password (!) or by using OAuth.

If, instead, you want to have your own certificate in the browser (or other user agent), then authorisation could be done in one of two ways: you can have the resource call out to an external authorisation server, saying "I am resource W and I've got user X trying to do action Y on item Z" and the authorisation server must then look up the actual authorisation attributes of user X and take a decision.  XACML could work here, with (in XACMLese) the resource being the PEP, the authorisation server the PDP, and the authorisation database (here, a VOMS database) being the PIP. VOMS also supports a SAML format, allegedly, but if it does, it's rarely seen in the wild.
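To make the PEP/PDP/PIP split concrete, here is a toy, in-memory version of that decision flow (all DNs, policies and attribute data below are made up; a real PDP would speak XACML over the wire and the PIP would be the actual VOMS database):

```python
# Toy PDP: decides requests of the form
# "resource W reports user X wants to do action Y on item Z".

# Stand-in for the PIP (e.g. a VOMS database) - hypothetical data.
VOMS_ATTRIBUTES = {
    "/C=UK/CN=alice": {"groups": ["/atlas"], "roles": ["production"]},
    "/C=UK/CN=bob":   {"groups": ["/atlas"], "roles": []},
}

# Stand-in policy: (action, item prefix) -> required role or group.
POLICY = {
    ("write", "/atlas/prod/"): {"role": "production"},
    ("read",  "/atlas/"):      {"group": "/atlas"},
}

def pdp_decide(user_dn, action, item):
    """Look up the user's attributes and evaluate the policy."""
    attrs = VOMS_ATTRIBUTES.get(user_dn)
    if attrs is None:
        return "Deny"
    for (pol_action, prefix), need in POLICY.items():
        if action == pol_action and item.startswith(prefix):
            if "role" in need and need["role"] in attrs["roles"]:
                return "Permit"
            if "group" in need and need["group"] in attrs["groups"]:
                return "Permit"
    return "Deny"
```

The PEP at the resource would simply forward (user, action, item) and enforce whatever comes back - the point being that the resource never needs to understand VOMS itself.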

Or, you could use OAuth directly. If you do an HTTP GET on a protected URL, presenting a client certificate, the user agent would be directed to the authorisation server, to which it would authenticate using the user certificate. The authorisation server would then need hooks to (a) find the relevant authorisation attributes in VOMS, and (b) take the decision based on the requested URL. The catch is that the OAuth client (the user agent) would need to present a client id to the authorisation server - a shared secret. Also the resource would need a means of validating the Access Token, which is generally opaque. Hm. It's easy to see that something much like OAuth could work, but it would obviously be better to use an existing protocol.

There are other things one could try, taking a more pure SAML approach, using some of the SAML and web services federation stuff.

Somebody may of course already have done this, but it would be interesting to do some experiments and maybe summarise the state of the art.

25 February 2014

Big data picture

Not as in ((big data) picture) but (big (data picture)), if that makes sense.

I find myself in Edinburgh - it's been far too long since I last was here, I am embarrassed to say.

We are looking at data movement for EUDAT and PRACE, and by a natural extension (being at EPCC), GridPP and DiRAC. The main common data mover is GridFTP: useful because we can (more or less) all move data with GridFTP, it gets great performance, we know how to tune and monitor it, and it supports third party copying. We also need to see how to bridge in GlobusOnline, with the new delegated credentials. In fact both Contrail and NCSA developed OAuth-delegated certificates (and while the original CILogon work was OAuth1, the new stuff is OAuth2.)

One use case is data sharing (the link is to a nice little video Adam Carter from EPCC showed in the introduction). You might argue that users are not jumping up and down screaming for interdisciplinary collaborations, yet if they were possible they might happen! When data policies require data be made available, as a researcher producing data you really have no choice: your data must be shareable with other communities.

21 February 2014

TF-Storage meeting

A quick summary of the Terena TF-Storage meeting earlier this month. Having been on the mailing list for ages, it was good to attend in person - and to catch up with friends from SWITCH and Cybera.

Now there was a lot of talk about cloudy storage, particularly OpenStack's SWIFT and CINDER, as, respectively, object and block stores. At some point when I have a spare moment (haha) I will see if I can get them running in the cloud. I asked about CDMI support for SWIFT but it has not been touched in a while - it'd be a good thing to have, though (so we can use it with other stuff). Also software defined networking (SDN) got attention; it has been talked about for a while but seems to be maturing. OpenStack's SDN component used to be called Quantum and is now called Neutron (thanks to Joe from Cybera for the link). There was talk about OpenStack and CEPH, with work by Maciej Brzeźniak from PSNC being presented.

There's an interesting difference between people doing very clever things with their erasure codes and secrecy schemes, and the rest of us who tend to just replicate, replicate, replicate. If you look at the WLCG stuff, we tend to not do clever things - the middleware stack is already complicated enough - but just create enough replicas, and control the process fairly rigidly.
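The trade-off between the two camps is easy to put numbers on (illustrative arithmetic only):

```python
def replication_overhead(copies):
    """Raw storage used per byte of data with n full replicas."""
    return float(copies)

def erasure_overhead(k, m):
    """Raw storage used per byte of data with a (k + m) erasure code:
    k data shards plus m parity shards, tolerating loss of any m shards."""
    return (k + m) / k
```

Three-fold replication costs 3x raw storage to survive two losses; a (10+4) erasure code survives the loss of any four shards for only 1.4x - the price being the encoding cleverness (and repair traffic) that we on the WLCG have so far mostly avoided.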

There was a discussion about identity management, of course, which mostly reiterated stuff we did for the grid about ten years ago - which led to VOMS and suchlike.

The report triggered a discussion as to whether the grid is a distributed object store.  It kind of is. 

03 February 2014

Setting up an IPv6 only DPM

As part of the general IPv6 testing work, we've just installed a small, single (virtual) node DPM at Oxford that's exclusively available over IPv6. While many client tools will prefer IPv6 to IPv4 given the choice,
some things will prefer IPv4, even if they could work over IPv6, and others might not be able to work over IPv6 at all. Having a dedicated IPv6 only testing target such as this simplifies tests - if something works at all, you know it's definitely doing it over IPv6.

The process was fairly straightforward, with a few minor catches:
  • In the YAIM config, DPM_DB_HOST is set to localhost rather than the FQDN - MySQL is IPv4 only, and if you have it try to use the machine's full name, it will try to look up an IPv4 address, and fail when there isn't one.
  • The setting 'BDII_IPV6_SUPPORT=yes' is enabled to make the DPM's node BDII listen on IPv6. This is also required on the site BDII if you want it to do the same, and seems to be completely harmless when set on v4 only nodes. In any case the site BDII will need some degree of IPv6 capability so that it can connect to the DPM server.
  • YAIM requires the 'hostname -f' command to return the machine's fully qualified domain name, which it will only do if the name is properly resolvable. Unfortunately, the default behaviour only attempts to look up an IPv4 address record, and so fails. It's possible to fix this cleanly by adding 'options inet6' as a line in /etc/resolv.conf, e.g:
    search physics.ox.ac.uk
    nameserver 2001:630:441:905::fa
    options inet6
  • Socket binding. For reasons that are better explained here, /etc/gai.conf needs to be set to something like:
    label ::/0 0
    label ::ffff:0:0/96 1
    precedence ::/0 40
    precedence ::ffff:0:0/96 10
    to get services that don't explicitly bind to IPv6 addresses as well as IPv4 to bind to both by default.
And then YAIM it as per normal.
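As a quick sanity check of what the resolver actually hands back for a given host - and in what order of preference - plain Python is enough (nothing DPM-specific here):

```python
import socket

def address_families(host, port=80):
    """Return the address families (AF_INET, AF_INET6) that getaddrinfo
    offers for a host, in the order the resolver prefers them."""
    infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    seen = []
    for family, _, _, _, _ in infos:
        name = socket.AddressFamily(family).name
        if name not in seen:
            seen.append(name)
    return seen
```

On a correctly configured IPv6-only node, a lookup of the DPM's name should return AF_INET6 only; if AF_INET sneaks in first on a dual-stack client, that is your gai.conf precedence at work.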

In addition to getting the DPM itself running, there are some sundry support services that are needed or helpful for any IPv6 only system (since it won't be able to use services that are only accessible via
IPv4). In the Oxford case, I've installed:
  • A dual stack DNS resolver to proxy DNS requests to the University's DNS servers,
  • A squid proxy to enable access to IPv4-only web services (like the EMI software repositories),
  • A dual stack site BDII. Advertising the DPM server requires the site BDII to be able to connect to it to pick up its information. That means an IPv6 capable site BDII.
The final product is named 't2dpm1-v6.physics.ox.ac.uk', and it (currently) offers support for the ops, dteam and atlas VOs, and should be accessible from any IPv6 capable grid client system.

It's been a while, but my family dynamics are changing....

Due to housing and room capacity, ATLAS decided to reduce the number of centrally controlled clones. So here is an update on where Georgina, Eve and I are now living after ATLAS's change of policy:

New Values

DataSet Name                                 Dave   G'gina      Eve
"DNA" Number
Number of "Houses"                             49       70       54
Type of Rooms: DATADISK                        17       37       33
Type of Rooms: LGD                             32       58       24
Type of Rooms: PERF+PHYS                       29       56       24
Type of Rooms: TAPE                             7       12        9
Type of Rooms: USERDISK                         0        1        5
Type of Rooms: CERN                             8       10       10
Type of Rooms: SCRATCH                          3        6        1
Type of Rooms: CALIB                            0        4        7
Total number of people (including clones)    1090     1392      471
Number of unique people                       876     1019      293
Number of "people" of type ^user              136      368       97
Number of unique "people" of type ^user       131      340       83
Number of "people" of type ^data              919      950      352
Number of unique "people" of type ^data       719      616      189
Number of "people" of type ^group              34       74       22
Number of unique "people" of type ^group       25       63       21
Number of "people" of type ^valid               1        0        0
Number of unique "people" of type ^valid        1        0        0
Datasets that have 1 copy                     696      763      184
Datasets that have 2 copies                   146      184       62
Datasets that have 3 copies                    34       44       33
Datasets that have 4 copies                     0       16        9
Datasets that have 5 copies                     0        7        4
Datasets that have 6 copies                     0        5        0
Datasets that have 7 copies                     0        0        0
Datasets that have 8 copies                     0        0        1
Datasets that have 12 copies                    0        0        0
Datasets that have 13 copies                    0        0        0
Number of files that have 1 copy            56029   134672    14260
Number of files that have 2 copies           8602    35191     6582
Number of files that have 3 copies           1502     1751     5924
Number of files that have 4 copies              0     1879       75
Number of files that have 5 copies              0      607      868
Number of files that have 6 copies              0      306        0
Number of files that have 7 copies              0        0        0
Number of files that have 8 copies              0        0        1
Number of files that have 12 copies             0        0        0
Number of files that have 13 copies             0        0        0
Total number of files on the grid           77739   222694    49844
Total number of unique files                66133   174406    27710
Data Volume (TB) that has 1 copy            9.175   30.389    4.127
Data Volume (TB) that has 2 copies          6.361    21.62     3.294
Data Volume (TB) that has 3 copies          0.223    2.452     7.812
Data Volume (TB) that has 4 copies              0    1.263     0.14
Data Volume (TB) that has 5 copies              0    1.408     0.17
Data Volume (TB) that has 6 copies              0    0.379        0
Data Volume (TB) that has 7 copies              0        0        0
Data Volume (TB) that has 8 copies              0        0    0.001
Data Volume (TB) that has 12 copies             0        0        0
Data Volume (TB) that has 13 copies             0        0        0
Total Volume of data on the grid (TB)       22.57   95.351    35.57
Total Volume of unique data (TB)            15.76   57.511    15.54

The differences in values from my last update are:

DataSet Name                                   D'       G'       E'
"DNA" Number
Number of "Houses"                            -10       -9       -9
Type of Rooms: DATADISK                       -11      -12      -17
Type of Rooms: LGD                             -5       -3       10
Type of Rooms: PERF+PHYS                       -3       -2        2
Type of Rooms: TAPE                             1        0        0
Type of Rooms: USERDISK                        -1       -8        0
Type of Rooms: CERN                             5        4        5
Type of Rooms: SCRATCH                          3       -6      -13
Type of Rooms: CALIB                            0       -1        0
Total number of people (including clones)     -76     -202     -171
Number of unique people                       -18     -101       -6
Number of "people" of type ^user               -1     -102       33
Number of unique "people" of type ^user        -1      -89       28
Number of "people" of type ^data              194      -98     -186
Number of unique "people" of type ^data       187      -15      -23
Number of "people" of type ^group               3        4      -18
Number of unique "people" of type ^group       -1        9      -11
Number of "people" of type ^valid               0        0        0
Number of unique "people" of type ^valid        0        0        0
Datasets that have 1 copy                       5      -48        6
Datasets that have 2 copies                     3      -13       -1
Datasets that have 3 copies                   -18      -24        8
Datasets that have 4 copies                   -71       -7        2
Datasets that have 5 copies                    -1        0      -17
Datasets that have 6 copies                     0        0       -2
Datasets that have 7 copies                     0       -2       -1
Datasets that have 8 copies                     0       -1        0
Datasets that have 12 copies                    0        0       -6
Datasets that have 13 copies                    0        0       -3
Number of files that have 1 copy             2385    -6991     4751
Number of files that have 2 copies          -4302    -1672    -1123
Number of files that have 3 copies           -358    -1957     1404
Number of files that have 4 copies             -7     -213      -35
Number of files that have 5 copies             -7       35     -223
Number of files that have 6 copies              0     -181      -20
Number of files that have 7 copies              0     -142       -5
Number of files that have 8 copies              0      -73        0
Number of files that have 12 copies             0        0       -6
Number of files that have 13 copies             0        0       -5
Total number of files on the grid           -7356   -19547    26871
Total number of unique files                -2289   -11194   -16964
Data Volume (TB) that has 1 copy            0.375    2.689    2.427
Data Volume (TB) that has 2 copies         -0.439     0.32   -1.406
Data Volume (TB) that has 3 copies         -0.077   -3.148    0.612
Data Volume (TB) that has 4 copies         -0.011   -0.237     0.02
Data Volume (TB) that has 5 copies         -0.028    0.008     -0.3
Data Volume (TB) that has 6 copies              0    0.019   -0.039
Data Volume (TB) that has 7 copies              0    -0.13   -0.001
Data Volume (TB) that has 8 copies              0    -0.12        0
Data Volume (TB) that has 12 copies             0        0   -0.001
Data Volume (TB) that has 13 copies             0        0   -0.001
Total Volume of data on the grid (TB)      -1.134   -8.649   -0.231
Total Volume of unique data (TB)           -0.241   -0.489    1.344

The unique children across all three families (44TB / 205k files) exist nowhere else in the world, and are at risk from a single disk failure in a room.