29 December 2014

Yet another exercise in data recovery?

Just before the Christmas break, my main drive on my main PC - at home - seemed to start to fail (the kernel put it in read-only mode). Good thing we have backups, eh? They are all on portable hard drives, usually encrypted, and maintained with unison. No, they are not "in the cloud."

Surprisingly, much of my data is WORM (write once, read many), so what if there are differences between the backups? Was it due to those USB3 errors (caused a kernel panic, it did), a hardware fault, or that fsck which seemed to discover a problem, or has the file actually changed? (And a big "boo, hiss" to applications that modify files just by opening them - yes, you know who you are.) In my case, I would prefer to re-checksum them all and compare against at least four of the backups. So I need a tool.

My Christmas programming challenge for this year (one should always have one) is then to create a new program to compare my backups. Probably there is one floating around out there, but my scheme - the naming scheme, when I do level zeros, increments, masters, replicas - is probably odd enough that it is useful having a bespoke tool.

On the grid we tend to checksum files as they are transferred. Preservation tools can be asked to "wake up" data every so often and re-check them. Ideally the backup check should quietly validate the checksums in the background as long as the backup drive is mounted.
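The core of such a tool is simple enough to sketch. Here is a minimal illustration of the idea (not the actual Christmas-challenge program, which doesn't exist yet): checksum every file under a backup root, then compare the resulting tables across replicas and report any file whose digests disagree.

```python
import hashlib
from pathlib import Path

def checksum_tree(root):
    """Walk a backup tree and return {relative path: SHA-256 hex digest}."""
    sums = {}
    root = Path(root)
    for path in root.rglob("*"):
        if path.is_file():
            h = hashlib.sha256()
            with open(path, "rb") as f:
                # Read in 1 MiB blocks so large files don't fill memory.
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            sums[str(path.relative_to(root))] = h.hexdigest()
    return sums

def compare_backups(replicas):
    """Given {replica name: checksum table}, yield each file whose
    checksum differs (or is missing) between any pair of replicas."""
    all_files = set().union(*(s.keys() for s in replicas.values()))
    for name in sorted(all_files):
        digests = {r: s.get(name) for r, s in replicas.items()}
        if len(set(digests.values())) > 1:
            yield name, digests
```

Run against four mounted backup drives, anything this reports is a candidate for the USB3/fsck/sneaky-application investigation above; the background-validation version would just rerun `checksum_tree` slowly while the drive is mounted.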

15 December 2014

Data gateway with dynamic identity - part 1

This doesn't look like GridPP stuff at first, but bear with me...

The grid works by linking sites across the world, providing a sufficiently high level of infrastructure security using such things as IGTF. The EUDAT project is a data infrastructure project, but has users who are unable/unwilling (delete as applicable) to use certificates themselves to authenticate. Such projects therefore use portals as a "friendly" front end.

So the question is: how do we get data through the portal? Yes, it's a reverse proxy, or gateway. Using Apache mod_proxy this is easy to set up, but it is limited to using a single credential for the onward connection.
Look at these (PowerPoint) slides: in the top-left slide, the user connects (e.g. with a browser) to the portal using some sort of lightweight security - either site-local if the portal is within the site, or federated web authentication in general. Based on this, the portal (top right) generates a key pair and obtains a certificate specific to the user - with the user's (distinguished) name and authorisation attributes. It then (bottom left) connects and sends the data back to the user's browser or, if the browser is capable of understanding the remote protocol, redirects the browser (with suitable onward authentication) to the remote data source.

We are not aware of anyone having done this before - a reverse proxy with identity hooks. If the reader knows of any, please comment on this post!

So in EUDAT we investigated a few options, including adding hooks to mod_proxy, but built a cheap and cheerful prototype by bringing the neglected ReverseProxy module up to Apache 2.2 and adding hooks into it.

How is this relevant to GridPP, I hear you cry? Well, WLCG uses non-browser protocols extensively for data movement, such as GridFTP and xroot, so you need to translate if the user "only" has a browser (or, soonish, you should be able to use WebDAV to some systems, but you still need to authenticate with a certificate). If this were hooked up to a MyProxy server used as a keystore or certification authority, you could have lightweight authentication to the portal.

08 December 2014

Ruminations from the ATLAS Computing Jamboree '14

SO..... I have just spent the last 2.5 days at the ATLAS Facilities and Shifters Jamboree at CERN.
The shifters Jamboree was useful to attend since it allowed me to better comprehend the operational shifters' view of issues seen on services that I help keep in working order. The facilities Jamboree helped to highlight the planned changes (near-term and further out) to computing operations and service requirements for Run 2 of the LHC.
A subset of the highlights:

Analysis jobs have been shown to handle 40 MB/s (we had better make sure our internal network and disk servers can handle this when using direct I/O).

A planned increase in analysing data from the disk cache in front of our tape system rather than from the disk-only pool.

An increase in the amount (and types) of data that can be moved to tape. (VOs will be able to give a hint of the expected lifetime on tape. In general, ATLAS expects to delete data from tape at a scale not seen before.)

Possibly using a web-enabled object store to allow storage and viewing of log files.

Event-selection analysis as a method of data analysis at the sub-file level.

I also know what the tabs in bigpanda now do!!! (but that will be another blog ...)

05 December 2014

Where have all my children gone....

Dave here,
So the higher powers decided to change their policy on keeping clones of my children; now we have:
631 of my children are unique and live in only one room; 124 have a twin, 33 are triplets and there are two sets of quads. Hence my children are now much more vulnerable to a room being destroyed or damaged. However, it does mean there are now only 72,404 files and 13.4 TB of unique data on the GRID.
Of my children, there are 675 Dirks, 14 Gavins and 101 Ursulas.

These are located in 81 rooms across the following 45 Houses:
RAL-LCG2 ( I Live Here!!)

This corresponds to Australia, Canada, Czech Republic, France, Germany, Israel, Italy, Japan, the Netherlands, Portugal, Russia, Spain, Switzerland, Turkey, the UK and the USA.

01 December 2014

Good Year for FTS Transfers. (My first legitimate use of EB.)

During this year, the WLCG sites running the File Transfer Service (FTS) upgraded to FTS3.
We have also reduced the number of sites running the service. This has led to the RAL service being used more heavily.
A total of 0.224 EB (or 224 PB) of data was moved using the WLCG FTS services (604M files).
This splits down by VO as:
131 PB/550M files for ATLAS (92M failed transfers); 66 PB/199M files were via the UK FTS.
85 PB/48M files for CMS (10M failed transfers); 25 PB/14M files were via the UK FTS.
8 PB/6M files for all other VOs (6.7M failed transfers); 250 TB/1M files were via the UK FTS.

(Of course these figures ignore files created and stored at sites as the output of worker-node jobs, and also ignore the "chaotic" transfer of files via other data-transfer mechanisms.)
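The per-VO figures above can be checked against the headline totals with a few lines of arithmetic:

```python
# Per-VO figures from the post: (PB moved, millions of files).
atlas = (131, 550)
cms   = (85, 48)
other = (8, 6)

total_pb = atlas[0] + cms[0] + other[0]
total_files_m = atlas[1] + cms[1] + other[1]

print(total_pb)       # 224 PB, i.e. 0.224 EB
print(total_files_m)  # 604 million files
```

Both sums match the stated 0.224 EB and 604M files, so the headline numbers are consistent with the per-VO breakdown.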