GridPP storage news

30 June 2015

Musings on data confidentiality

Recently I was asked whether STFC should store classified data, such as Secret data (being a gov't facility, all our data is already Official).

If you look at a "normal" data centre like those run by the big cloud providers, they are typically set up to ensure data confidentiality. They have special personnel who are authorised to enter the data centre, and they have all sorts of physical security measures. If they store Secret data they will need clearance.

We have security measures, but we also take visitors round our data centre and if they are monitored all the time it is more for their own health and safety than because we don't trust them. They can take pictures if they like. Of course we would very much like them to not press any buttons but that's also why there's someone with them. We have students who come and work with us also in the data centre, and leave feeling they have made real contributions.

The three basic data security goals are confidentiality, integrity, and availability, and all three are of course important. A "conventional" data centre would probably prioritise confidentiality first, then integrity, and finally availability: it is better that data is temporarily inaccessible than leaked. RAL's data centre on the other hand is different: for us integrity is top - we spend a lot of time checksumming files at rest and in flight, and comparing lists of files with other lists, data volumes with data volumes. Availability is also highly important as science data is collected, transmitted, and processed around the clock. And then in a sense confidentiality is last: for example, hardly anything is encrypted in flight because it would just slow transfers down. Of course we still need to protect scientists' data because "there is definitely a Nobel prize in there!" but our data is not national security, nor even personal/medical data. Yes, of course we protect the science data, but there is something to be said for openness too - making open data available, and showing the public some of the good stuff we do. And it would be quite costly to protect against a "highly capable threat," money which is better spent making things go faster. Leave other data centres to guard the national secrets.
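
To make "checksumming files at rest and in flight" a bit more concrete, here is a minimal sketch of the idea in Python. It assumes Adler-32 as the checksum (one that is commonly seen on the grid; whether any particular service uses it is an assumption here), and the function names are made up for illustration rather than taken from any of our tools.

```python
import zlib


def adler32_of_file(path, block_size=1024 * 1024):
    """Return the Adler-32 checksum of a file as 8 lowercase hex digits.

    The file is read block by block so that large files need not fit in memory.
    """
    value = 1  # Adler-32 is defined to start from 1
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            value = zlib.adler32(block, value)
    return f"{value & 0xFFFFFFFF:08x}"


def transfer_is_intact(source_path, destination_path):
    """Hypothetical 'in flight' check: both ends must yield the same checksum."""
    return adler32_of_file(source_path) == adler32_of_file(destination_path)
```

The same function covers the "at rest" case: store the checksum when the file is written, recompute it later, and compare.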

24 June 2015

The firewall did it

Now that we have sort of mostly finished setting up the DiRAC data transfers to RAL, we look at the weeks it took and wonder (a) was it worth it and (b) why did it take weeks?

While initially we only back up data from the DiRAC sites - initially Durham - into RAL Tier 1, the reason we set them up as a grid VO is so we can have the grid tools drive the data transfers. The thinking is that although there is an overhead in setting it up and getting it working, the tools that moved nearly a quarter of an exabyte last year will then move the data with the highest possible efficiency. Initially we are going to let it run as fast as it can until ~~someone complains~~ we hit a reasonable target/limit - 3-400 megabytes per second.

[Edit: updated the image as I had inadvertently put a link in to a 'live' image rather than the snapshot]

The green stuff in the plot is primarily DiRAC data coming in at some 250 MB/s; the spike is not related to DiRAC (this would be a case where the most prominent feature in the plot is of no interest to the discussion...a good way to capture readers, perhaps?)
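
For a feel of what these rates mean in practice, here is a back-of-the-envelope sketch. The 250, 300 and 400 MB/s figures are the ones mentioned in this post; the volume of one petabyte is only an illustrative assumption, as are the decimal units (1 MB = 10^6 bytes, 1 PB = 10^15 bytes).

```python
SECONDS_PER_DAY = 86400
MB = 10**6   # assumption: decimal megabytes
PB = 10**15  # assumption: decimal petabytes


def transfer_days(volume_bytes, rate_bytes_per_second):
    """Days needed to move a volume at a constant sustained rate."""
    return volume_bytes / rate_bytes_per_second / SECONDS_PER_DAY


for rate_mb_s in (250, 300, 400):
    days = transfer_days(1 * PB, rate_mb_s * MB)
    print(f"{rate_mb_s} MB/s: {days:.1f} days per PB")
# 250 MB/s: 46.3 days per PB
# 300 MB/s: 38.6 days per PB
# 400 MB/s: 28.9 days per PB
```

Real transfers will not sustain a constant rate, so these are lower bounds on the elapsed time.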

The advantage of having them griddified is also that in the future if we decide to do more stuff, like move the data elsewhere or start doing analysis, it's all ready to go.

So why does it take time to set up? Part of it is all the technical things that need setting up - VOs, local accounts, mailing lists, certificates, gridmap files, monitoring; none of them too onerous but they all take some time to fill in a form and process, they may have changed since the last time we did it, they take time to debug if they aren't working properly, and in the worst case scenario only one person knows how or is authorised to do it and is on leave/off sick/busy.

Then there are the processes: since access rights are to some very high end computing and storage systems, there are processes for reviewing authorisations, proposals, permissions, allocations and quotas, etc. These, too, take time, particularly if a panel review is involved.

Finally there's putting all the pieces together to see if it works. And when it doesn't, is it the VO's fault - they may be new to the business and do something strange - or is there something wrong with the infrastructure - not unlikely if something new is set up for them. In our case it didn't work, and it turns out that GridFTP as the data movement protocol now uses UDP and the Durham firewall blocked UDP. With firewalls there is a tradeoff between the efficiency of the transfer (less firewall is better) and the security they provide (more firewall is better). It needs both "control" ports where services are listening all the time and "data" ports which are ephemeral so need to be opened in a known port range.
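
When debugging this sort of thing, a quick reachability check of the control and data ports can save some head-scratching. The sketch below is only that, a sketch: the hostname and the data port range are made up, 2811 is the conventional GridFTP control port, and the check is TCP only, so it would not by itself have caught our blocked UDP traffic.

```python
import socket


def tcp_port_open(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


def probe_ports(host, ports, timeout=2.0):
    """Map each port to whether it accepted a TCP connection."""
    return {port: tcp_port_open(host, port, timeout) for port in ports}


if __name__ == "__main__":
    host = "gridftp.example.org"          # hypothetical endpoint
    control_ports = [2811]                # conventional GridFTP control port
    data_ports = range(50000, 50011)      # hypothetical ephemeral data range
    for port, is_open in probe_ports(host, [*control_ports, *data_ports]).items():
        print(port, "open" if is_open else "closed or filtered")
```

Note that an ephemeral data port only answers while a transfer is actually listening on it, so "closed" on a data port does not necessarily mean the firewall is at fault.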

20 May 2015

A view from a room at WLCG/CHEP 2015

It is very handy to have both CHEP 2015 and the WLCG 2015 workshop at the same venue as I don't have to change venues! Here are some thoughts I had from the meeting:
From WLCG:
Monitoring and Security issues were my main take away moments from the first day of the WLCG; (looking forward to getting restricted CMS credentials so that I can see their monitoring pages.)
LHC VOs talked about current Run2 improvements and plans for HL-LHC
Many new sites supporting ALICE and plan to expand EOS usage....
ATLAS keep referring to T2 storage as custodial, but they know this is not what we normally mean as "custodial"
LHCb showed a nice slide of the processing workflow for data IO (a 3GB RAW file ends up also producing ~5GB on disk and 5GB on tape; they merge their data files).
Long term future is computer centres will become solely data centres possibly....?
OSG are changing CA and so all users will get a new DN. I can't help but think about ownership of all their old data: will it survive the change?

Interesting talk on hardware. Each component is really only made by 3-4 companies globally.... and our procurement is minuscule.

From CHEP:
Hopefully my posters went down well. My highlights/points of interest were:

RSE within Rucio for ATLAS can be used to make sure you have more than one replica. Should be really useful for localgroupdisk and also allows for quota.

Intensity Frontier plenary regarding Computing at Fermilab for neutrino experiments using small amounts of staff (made me reminisce about SAM system for data management...)

Data preservation talks for ATLAS and CDF/D0 were interesting.
CMS prepared to use network circuits for data transfer, expecting end of Run2 possibly, definitely Run3.

Extension to the perfSONAR system to allow ad hoc, on-demand tests between sites (i.e. akin to refactoring the NDT/NPAD suites but not requiring the special WEB100 kernel).

Interesting to see that the mean read/write rate for BNL for the ATLAS experiment is ~70TB/Yr per disk drive. I wonder what other rates are....

Some Posters of interest were:
A173 A191 A317 A339 B358 A359 B20 B214 B114 B254 B284 B292 B358 B362 B408 B441

18 May 2015

Mind The Gap

One of the features of modern data science - whether from big instruments, lots of data sources, or somewhere else - is that generally researchers need to collaborate to be able to manage the data. No single institute is able to cope with everything. Thus, many researchers use e-Infrastructures (or cyberinfrastructures to our North American friends), to connect resources and institutes together, but also to enable further collaborations with other researchers.

The next problem then arises when you have two different infrastructures which were not built to talk to each other. Here's where interoperation and standards come in.

One of the things we have talked about for a while but never got round to doing was to bridge (the) two national infrastructures for physics, GridPP and DiRAC (not to be confused with DIRAC nor with DIRAC). Now we will be moving a few petabytes from the latter to the former, initially to back up the data. Which is tricky when there are no common identities, no common data transfer protocols, no common data (replica) catalogues, accounting information, metadata catalogues, etc.

So we're going to bridge the gap without hopefully too much effort on either side, initially by making DiRAC sites look like a Tier2-(very-)lite, with essentially only a GridFTP endpoint and a VO for DiRAC. We will then start to move data across with FTS and see what happens. (Using the analogy above, we are bringing the ends closer to each other rather than increase the voltage :-))

11 May 2015

Notes from JANET/JISC Networkshop 43

These are notes from ~~JANET~~ JISC's Networkshop, now the 43rd, but seen from the GridPP perspective. The workshop took place 0-1-2 April but this post should be timed to appear after the election.

"Big data" started and closed the workshop; the start being Prof Chris Lintott, BBC Sky at Night, er, superstar, talking about Galaxy Zoo: there are too many galaxies out there, and machines can achieve only 85% accuracy in the classification. Core contributors are the kind of people who read science articles in the news, and they contribute because they want to help out. Zooniverse is similar to the grid in a few respects: a single registration lets you contribute to multiple projects (your correspondent asked about using social media to register people, so people could talk about their contributions on social media), and they have unified support for projects (what we would call VOs)

At the other end, a presentation from the Met Office where machines are achieving high accuracy thanks to investments in the computing and data infrastructure - and of course in the people who develop and program the models, some of whom have spent decades at the Met Office developing them. While our stuff tends to be more parallel and high throughput processing of events, MO's climate and weather is more about supercomputing. Similarities are more in the data area where managing and processing increasing volumes is essential. This is also where the networkshop comes in, support for accessing and moving large volumes of science data. They are also using STFC's JASMIN/CEMS. In fact JASMIN (in a separate presentation) are using similar networkological tools, such as perfSONAR and fasterdata.

Sandwiched in between was loads of great stuff:

HP are using SDN also for security purposes. Would be useful to understand. Or interesting. Or both.
A product called "Nutanix" delivering software defined storage for clouds - basically the storage is managed on what we would call worker nodes with a VM dedicated to managing the storage; it replicates blocks across the cluster, and locally using SSDs as cache.
IPv6 was discussed, with our very own Dave Kelsey presenting.
In coffee break discussions with people, WLCG is ahead of the curve being increasingly network-centric. Still very controlled experiment models, but networks are used a lot to move and access data.
Fair bit of moved-stuff-to-the-cloud reports. JANET's (excuse me, JISC's) agreement with Azure, AWS considered helpful.
Similarly, JISC's data centre offers hosting. Different use from ours, but wonder if we should look into moving data to/from our data centres to theirs? Sometimes it is useful to support users, e.g. users of GO or FTS by testing out data transfers between sites, e.g. when the data centres need to run specific end points, like Globus Connect, SRM, GridFTP, etc.
Lots of identity management stuff, which was the main reason your correspondent was there. Also for AARC and EUDAT (more on that later).
And of course talking to people to find out what they're doing and see if we can usefully do stuff together.

Speaking of sandwiched, we were certainly also made welcome at Exeter, with the local staff welcoming us, colour-coded (= orange) students supporting us, and lots of great food, including of course pasties.

28 March 2015

EUDAT and GridPP

EUDAT2020 (the H2020 follow-up project to EUDAT) just finished its kick-off meeting at CSC. Might be useful to jot down a few thoughts on similarities and differences and such before it is too late.

Both EUDAT and GridPP are - as far as this blog is concerned - data e- (or cyber-) infrastructures. The infrastructure is distributed across sites, sites provide storage capacity or users, there is a common authentication and authorisation scheme, there are data discovery mechanisms, both use GOCDB for service availability.

EUDAT will be using CDMI as its storage interface - just like EGI does - and CDMI is in many ways fairly SRM-like. We have previously done work comparing the two.
EUDAT will also be doing HTTP "federations" (i.e. automatic failover when a replica is missing; this is confusingly referred to as "federation" by some people).
Interoperation with EGI is useful/possible/thought-about (delete as applicable). EUDAT's B2STAGE will be interfacing to EGI - there is already a mailing list for discussions.
GridPP's (or WLCG's) metadata management is probably a bit too confusing at the moment since there is no single file catalogue
B2ACCESS is the authentication and authorisation infrastructure in EUDAT; it could interoperate with GridPP via SARoNGS (ask us at OGF44 where we will also look at AARC's relation to GridPP and EUDAT). Jos tells us that KIT also have a SARoNGS type service.
Referencing a file is done with a persistent identifier, rather like the LFN (Logical Filename) GridPP used to have.
"Easy" access via WebDAV is an option for both projects. GlobusOnline is an option (sometimes) for both projects. In fact, B2STAGE is currently using GO, but will also be using FTS.

Using FTS is particularly interesting because it should then be possible to transfer files between EUDAT and GridPP. The differences between the projects are mainly that

GridPP is more mature - has had 14-15 years now to build its infrastructure; EUDAT is of course a much younger project (but then again, EUDAT is not exactly starting from scratch)
EUDAT is doing more "dynamic data" where the data might change later. Also looking at more support for the lifecycle.
EUDAT and GridPP have distinct user communities, to a first approximation at least.
The middleware is different; GridPP does of course offer compute where EUDAT will offer simpler server-side workflows. GridPP services are more integrated, where in EUDAT the B2 services are more separated (but will be unified by the discovery/lookup service and by B2ACCESS)
Authorisation mechanisms will be very different (but might hopefully interface to each other; there are plans for this in B2ACCESS).

There is some overlap between data sites in WLCG and those in EUDAT. This could lead to some interesting collaborations and cross-pollinations. Come to OGF44 and the EGI conference and talk to us about it.

20 March 2015

ISGC 2015 Review and Musings..

The 2015 ISGC Conference is coming to a close, so I thought I would jot down some musings regarding some of the talks I have seen (and presented) over the last week. Not surprisingly, since the G and C are grids and clouds, a lot of talks were regarding compute; however, there were various talks on storage and data management (especially dCache). But the most interesting talk was regarding new technology which sees a CPU and network interface incorporated into an individual HDD. This can be seen here:
http://indico3.twgrid.org/indico/contributionDisplay.py?sessionId=26&contribId=80&confId=593

There were also many site discussions from the various Asian countries represented, of which network setup and storage were of particular interest (also including using InfiniBand between Singapore, Seattle and Australia). It makes the distance our European dataflows have to travel seem trivial. My perfSONAR talk seemed to be well received.

It was also interesting to listen to some of the Humanities and Arts themed talks. (First time I have ever heard postmodernism used at a conference!!) Their data volume may well be smaller than that of the WLCG VOs, but it is still complex and uses interesting visualisation methods.

09 March 2015

Some thoughts on data in the cloud gathered at CloudScape VII

Some if-not-quite-live then certainly not at all dead notes from #CloudScapeVII on data in the cloud.

How to establish trust in the cloud data centre? Clouds can run pretty good security, which you'd otherwise only get in the large data centre.

Clouds can build trust by disclosing processes and practices - Rüdiger Dorn, Microsoft
Clarify responsibilities
"35% of security breaches are due to stupid things" - like leaving memory sticks on a train or sending CDs by post... - Giorgio Aprile, AON
Difficulty to inculcate good (security) practice in many end users

"Opportunity to make big data available in cloud" - Robert Jenkins, CloudSigma

Model assumes that end users pay for the ongoing use of data
Democratise data

Data protection

Kuan Hon from QMUL instilled the fear of data protection in everyone that provides data storage. The new data protection stuff doesn't seem to take clouds into account - lots of scary implications. [Good thing we are not storing personal data on the grid...]
Protection relies on legal frameworks - sign a contract saying you won't reveal the data - rather than technology (encrypt it to prevent you from revealing the data)

Joe Baguley from vmware talked about the abstractions: where RAID abstracted harddrives from storage, we now do lots more abstractions with hypervisors, containers, software-defined-X, etc.

Layers can optimise, so can get excellent performance
Stack can be hard to debug when something doesn't work so well...
Generally more benefits than drawbacks, so a Good Thing™
Overall, speed up data → analysis → app → data → analysis → app → ... cycle

"What's hot in the cloud" - panel of John Higgings (DigitalEurope), Joe Baguley (vmware), David Bernstein (Cloud Strategy Partners), Monique Morrow (CISCO)

Big data is also fast data (support for more Vs), lots of opportunities for in memory processing
Data - use case for predictive analysis and pattern recognition (and in general machine learning)
devops needed to break down barriers [as we know quite well from the grid where we have tb-support née dteam]
Disruptive technological advances to, er, disrupt?
Many end users are using clouds without knowing it - like people using Facebook.

Hope I've done it some justice. As always, lots of very interesting things in CloudScape even for those of us who have been providing "cloud" services (in some sense) for a while. Also good of course to catch up with old friends and meeting new ones.

06 March 2015

Storage accounting revisited?

One of the basic features of containers - a thing which can contain something - is that you can see how full it is. If your container happens to be a grid storage element, monitoring information is available in gstat and in our status dashboard. The BDII information system publishes data, and so does the SRM (the storage element control interface), and the larger experiments at least track how much they write.

So what happens if all these measures don't agree? We had a ticket against RAL querying why the BDII published different values from what the experiment thought they had written. It turned out to be partly because someone was attempting to count used space by space token, which leads to quite the wrong results:

Leaving aside whether these should be the correct mappings for ATLAS, the space tokens on the left do not map one-to-one to the actual storage areas (SAs) in the middle (and in general there are SAs without space tokens pointing to them). Note also that the SAs split the accounting data of the disk pools (online storage) so that the sum of the values is the same -- to avoid double counting.
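
To see why counting by space token goes wrong, consider this toy sketch. All names and numbers are invented for illustration (they are not RAL's real tokens, storage areas or values): two tokens point at the same SA, and one SA has no token at all.

```python
# Used space per storage area, in TB (invented numbers).
used_tb_by_storage_area = {
    "sa_disk_1": 400,
    "sa_disk_2": 250,
    "sa_tape_cache": 50,   # no space token points here
}

# Space token -> storage area (invented mapping, deliberately not one-to-one).
storage_area_of_token = {
    "TOKEN_A": "sa_disk_1",
    "TOKEN_B": "sa_disk_1",
    "TOKEN_C": "sa_disk_2",
}


def used_by_storage_area(areas):
    """Each storage area is counted exactly once."""
    return sum(areas.values())


def used_by_space_token(tokens, areas):
    """Naive count: an SA is counted once per token pointing at it."""
    return sum(areas[area] for area in tokens.values())


print(used_by_storage_area(used_tb_by_storage_area))                        # 700
print(used_by_space_token(storage_area_of_token, used_tb_by_storage_area))  # 1050
```

In this toy case the token-based total both double counts sa_disk_1 and misses sa_tape_cache entirely, so it can be wrong in either direction.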

The other reason for the discrepancy was the treatment of read-only servers: these are published as used space by the SRM, but not by the BDII. This is because the BDII is required to be compliant with the installed capacity agreement from a working group from 2008. The document says on p.33,

TotalOnlineSize (in GB=10⁹) is the total online [..] size available at a given moment (it SHOULD not [sic] include broken disk servers, draining pools, etc.)

RAL uses read only disk pools essentially like draining disk pools (unlike tapes, where a read only tape is perfectly readable), so read only disk pools do not count in the total -- they do, however, count as "reserved" as specified in the same document (the GLUE schema probably intended reserved to be more like SRM's reserved, but the WLCG document interprets the field as "allocated somewhere").

Interestingly, RAL does not comply with the installed capacity document in publishing UsedOnlineSize for tape areas. The document specifies

UsedOnlineSize (in GB=10⁹ bytes) is the space occupied by available and accessible files that are not candidates for garbage collection.

It then kind of contradicts itself in the same paragraph, saying

For CASTOR, since all files in T1D0 are candidates for garbage collection, it has been agreed that in this case UsedOnlineSize is equal to [..] TotalOnlineSize.

If we published like this, the used online size would always equal the total size, and the free size would always be zero (because the document also requires that used and free sum to total -- which doesn't always make sense either, but that is a different story.)

OK, so what might we have learnt today about storage accounting?

Storage accounting is always tricky: there are all sorts of funny boundary cases, like candidates for deletion, temporary replicas, scratch space, etc.
Aggregating accounting data across sites only makes sense if they all publish in the same way: they use the same attributes for the same types of values, etc. However, the supported storage elements all vary somewhat in how they treat storage internally.
Before making use of the numbers, it is useful to have some sort of understanding of how they are generated (what do space tokens do? if numbers are the same for two SAs, is it because they are counting the same area twice, or because they split it 50/50? Implementers should document this and keep the documentation up to date!)
There should probably be a time to review these agreements - what is the use of publishing information if it does not tell people what they want to know?
Storage accounting is non-trivial... getting it right vs useful vs achievable is a bit of a balancing act.

25 February 2015

ATLAS MC deletion campaign from tape complete for the RAL Tier1.

So ATLAS have just finished a deletion campaign of monte carlo data from our tape system at the RAL Tier1.

The good news is that the previously seen issue of transfers failing due to a timeout (due to misplacing an "I'm done" UDP packet) seems to have been solved.

ATLAS deleted 1.325 PB of data, allowing our tape system to recover and re-use (when repacking has completed) approximately 250 tapes. ATLAS deleted 1739588 files in total. The deletion campaign took 17 days, but we have seen the CASTOR system sustain deletion rates at least a factor of four higher, so the VO should be able to increase their deletion request rate.
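
For those who like their numbers pre-digested, here is a small sketch that turns the figures above into averages. It assumes decimal units (1 PB = 10^15 bytes) and a perfectly even deletion rate over the 17 days, neither of which is guaranteed; the function name is made up for this post.

```python
SECONDS_PER_DAY = 86400


def campaign_summary(deleted_bytes, deleted_files, days):
    """Average figures for a deletion campaign, assuming a constant rate."""
    seconds = days * SECONDS_PER_DAY
    return {
        "mean_file_size_mb": deleted_bytes / deleted_files / 1e6,
        "files_per_second": deleted_files / seconds,
        "files_per_day": deleted_files / days,
    }


summary = campaign_summary(deleted_bytes=1.325e15, deleted_files=1_739_588, days=17)
for name, value in summary.items():
    print(f"{name}: {value:,.2f}")

# If the request rate were four times higher, the same campaign would take:
print(f"days at 4x the rate: {17 / 4:.2f}")
```

This works out at roughly 760 MB per file and a little over one file per second on average, so a factor of four would bring the campaign down to a bit over four days.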

What is also of interest (and which I am now looking into) is that ATLAS asked us to delete 211 files which they thought we had but we did not.

Also now may be a good time to provide ATLAS with a list of all the files we have in our tape system to find out which files we have which ATLAS have "forgotten" about.

03 February 2015

Ceph stress testing at the RAL Tier 1

Of some interest to the wider community, the RAL Tier 1 site have been exploring the Ceph object store as a storage solution (some aspects of which involve grid interfaces being developed at RAL, Glasgow and CERN).

They've recently performed some interesting performance benchmarks, which Alastair Dewhurst reported on their own blog:

http://www.gridpp.rl.ac.uk/blog/2015/01/22/stress-test-of-ceph-cloud-cluster/

Distributed Erasure Coding backed by DIRAC File Catalogue

So, last year, I wrote a blog post on the background of Erasure Coding as a technique, and trailed an article on our own initial work on implementing such a thing on top of the DIRAC File Catalogue.

This article is a brief description of the work we did (a poster detailing this work will also be at the CHEP2015 conference).

Obviously, there are two elements to the initial implementation of any file transformation tool for an existing catalogue: choosing the encoding engine, and working out how to plumb it into the catalogue.

There are, arguably, two popular, fast, implementations of general erasure coding libraries in use at the moment:
zfec, which backs the Least Authority File System's implementation, and has a nice python api
and
jerasure, which has seen use in several projects, including backing Ceph's erasure coded pools.

As DIRAC is a mostly Python project, we selected zfec as our backend library, which also seems to have been somewhat fortuitous on legal grounds, as jerasure has recently been withdrawn from public availability due to patent challenges in the USA (while this is not a relevant threat in the UK, as we don't have software patents, it makes one somewhat nervous about using it as a library in a new project).

Rather than performing erasure coding as a stream, we perform the EC mapping of a file on disk, which is possibly a little slower, but is also safer and easier to perform.
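
To illustrate the general idea of mapping a whole file into chunks that survive a loss, here is a deliberately simplified sketch. It is not zfec and not our tool: it swaps in the simplest possible erasure code, k data chunks plus a single XOR parity chunk, which can recover from the loss of any one chunk only. All function names are made up for illustration.

```python
from functools import reduce


def xor_bytes(a, b):
    """Bytewise XOR of two equal-length byte strings."""
    return bytes(x ^ y for x, y in zip(a, b))


def encode(data, k):
    """Split data into k equal-size chunks (zero padded) plus one parity chunk.

    Returns (chunks, original_length); the length is needed to strip padding.
    """
    chunk_size = -(-len(data) // k)  # ceiling division
    padded = data.ljust(chunk_size * k, b"\0")
    chunks = [padded[i * chunk_size:(i + 1) * chunk_size] for i in range(k)]
    parity = reduce(xor_bytes, chunks)
    return chunks + [parity], len(data)


def decode(chunks, original_length):
    """Rebuild the data from k+1 chunks, at most one of which may be None."""
    missing = [i for i, chunk in enumerate(chunks) if chunk is None]
    if len(missing) > 1:
        raise ValueError("single parity can only recover one lost chunk")
    chunks = list(chunks)
    if missing:
        # XOR of all surviving chunks (data and parity) equals the lost one.
        survivors = [chunk for chunk in chunks if chunk is not None]
        chunks[missing[0]] = reduce(xor_bytes, survivors)
    return b"".join(chunks[:-1])[:original_length]


shares, length = encode(b"some file contents read from disk", k=4)
shares[2] = None  # pretend one chunk is unavailable
assert decode(shares, length) == b"some file contents read from disk"
```

A proper library such as zfec generalises this so that any k of the stored chunks suffice, tolerating the loss of more than one.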

Interfacing to DIRAC had a few teething problems. Setting up a DIRAC client appropriately was a little more finicky than we expected, and the DIRAC File Catalogue implementation had some issues we needed to work around. For example, SEs known to the DFC are assumed good - there's no way of marking an SE as bad, or of telling how usable it is without trying it.

The implementation of the DFC Erasure Coding tool, therefore, also includes a tool which evaluates the health of the SEs available to the VO, and removes unresponsive SEs from its list of potential endpoints for transfers.

As far as the actual implementation for adding files is concerned, it's as simple as creating a directory (with the original filename) in the DFC, and uploading the encoded chunks within it, making sure to upload chunks across the set of SEs known to the DFC to support the VO you're part of.
We use the DFC's metadata store to store information about each chunk as a check for reconstruction. We were interested to discover that adding new metadata names to the DFC also makes them available for all files in the DFC, rather than simply for the files you add them for. We're not sure if this is an intended feature or not.
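
The placement logic described above can be sketched roughly as follows. This is not the code from the repository and uses no DIRAC calls: the SE names, the chunk names and the `is_responsive` probe are all hypothetical, and round-robin placement is simply one plausible way of spreading chunks across SEs.

```python
def usable_storage_elements(storage_elements, is_responsive):
    """Keep only the SEs that pass a health probe supplied by the caller."""
    return [se for se in storage_elements if is_responsive(se)]


def place_chunks(chunk_names, storage_elements):
    """Assign chunks to SEs round-robin so they are spread as evenly as possible."""
    if not storage_elements:
        raise ValueError("no usable storage elements")
    return {
        chunk: storage_elements[index % len(storage_elements)]
        for index, chunk in enumerate(chunk_names)
    }


# Hypothetical example: one SE fails its probe and is skipped.
known_ses = ["SE-A", "SE-B", "SE-C"]
healthy = usable_storage_elements(known_ses, lambda se: se != "SE-B")
chunks = [f"myfile/chunk_{i}" for i in range(5)]
for chunk, se in place_chunks(chunks, healthy).items():
    print(chunk, "->", se)
```

With few SEs available, several chunks inevitably land on the same SE, which limits how much resilience the encoding can actually deliver.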

One of the benefits of any kind of data striping, including EC, is that we can retrieve chunks in parallel from the remote store. Our EC implementation allows the use of parallel transfer via the DFC methods when getting remote files, however, in our initial tests, we didn't see particular performance improvements. (Our test instance, using the Imperial test DIRAC instance, didn't have many SEs available to it, though, so it is hard to evaluate the scaling potential.)

The source code for the original implementation is available from: https://github.com/ptodev/Distributed-Resilient-Storage
(There's a fork by me, which has some attempts to clean up the code and possibly add additional features.)

29 December 2014

Yet another exercise in data recovery?

Just before the Christmas break, my main drive on my main PC - at home - seemed to start to fail (the kernel put it in read-only mode). Good thing we have backups, eh? They are all on portable hard drives, usually encrypted, and maintained with unison. No, they are not "in the cloud."

Surprisingly much of my data is WORM so what if there are differences between the backups? Was it due to those USB3 errors (caused a kernel panic, it did), hardware fault, or that fsck which seemed to discover a problem, or has the file actually changed? (And a big "boo, hiss" to applications that modify files just by opening them - yes, you know who you are.) In my case, I would prefer to re-checksum them all and compare against at least four of the backups. So I need a tool.

My Christmas programming challenge for this year (one should always have one) is then to create a new program to compare my backups. Probably there is one floating around out there, but my scheme - the naming scheme, when I do level zeros, increments, masters, replicas - is probably odd enough that it is useful having a bespoke tool.

On the grid we tend to checksum files as they are transferred. Preservation tools can be asked to "wake up" data every so often and re-check them. Ideally the backup check should quietly validate the checksums in the background as long as the backup drive is mounted.
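
As a starting point for such a tool, here is a generic sketch (not my bespoke one, which has to understand my naming scheme): checksum every file under each backup root and report the paths on which the backups disagree or which are missing from some of them. SHA-256 and the function names are choices made for this sketch only.

```python
import hashlib
import os


def sha256_of_file(path, block_size=1024 * 1024):
    """Hex SHA-256 of a file, read block by block."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for block in iter(lambda: handle.read(block_size), b""):
            digest.update(block)
    return digest.hexdigest()


def manifest(root):
    """Map each file path (relative to root) to its checksum."""
    result = {}
    for directory, _subdirs, files in os.walk(root):
        for name in files:
            full_path = os.path.join(directory, name)
            result[os.path.relpath(full_path, root)] = sha256_of_file(full_path)
    return result


def disagreements(manifests):
    """Given {backup_name: manifest}, return the paths where backups differ.

    The value for each such path maps backup name to checksum, or to None
    where the file is missing from that backup.
    """
    all_paths = set().union(*[m.keys() for m in manifests.values()])
    report = {}
    for path in sorted(all_paths):
        seen = {name: m.get(path) for name, m in manifests.items()}
        if len(set(seen.values())) > 1:
            report[path] = seen
    return report
```

With three or more backups, a file whose checksum differs on only one of them points at that copy as the likely bad one, which is why comparing against several backups is worth the extra reading time.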

15 December 2014

Data gateway with dynamic identity - part 1

This doesn't look like GridPP stuff at first, but bear with me...

The grid works by linking sites across the world, by providing a sufficiently high level of infrastructure security using such things as IGTF. The EUDAT project is a data infrastructure project but has users who are unable/unwilling (delete as applicable) to use certificates themselves to authenticate. Thus projects use portals as a "friendly" front end.

So the question is, how do we get data through the proxy? Yes, it's a reverse proxy, or gateway. Using Apache mod_proxy, this is easy to set up, but is limited to using a single credential for the onward connection.

Look at these (powerpoint) slides: in the top left slide, the user connects (e.g. with a browser) to the portal using some sort of lightweight security - either site-local if the portal is within the site, or federated web authentication in general. Based on this, the portal (top right) generates a key pair and obtains a certificate specific to the user - with the user's (distinguished) name and authorisation attributes. It then (bottom left) connects and sends the data back to the user's browser, or possibly, if the browser is capable of understanding the remote protocol, redirects the browser (with suitable onward authentication) to the remote data source.

We are not aware of anyone having done this before - reverse proxy with identity hooks. If the reader knows any, please comment on this post!

So in EUDAT we investigated a few options, including adding hooks to mod_proxy, but built a cheap and cheerful prototype by bringing the neglected ReverseProxy module up to Apache 2.2 and adding hooks into it.

How is this relevant to GridPP, I hear you cry? Well, WLCG uses non-browser protocols extensively for data movement, such as GridFTP and xroot, so you need to translate if the user "only" has a browser (or soonish, you should be able to use WebDAV to some systems, but you still need to authenticate with a certificate.) If this were hooked up to a MyProxy used as a Keystore or certification authority, you could have a lightweight authentication to the portal.

08 December 2014

Ruminations from the ATLAS Computing Jamboree '14

SO..... I have just spent the last 2.5 days at the ATLAS Facilities and Shifters Jamboree at CERN.
The shifters Jamboree was useful to attend since it allowed me to better comprehend the operational shifter's view of issues seen on services that I help keep in working order. The facilities Jamboree helped to highlight the planned changes (near term and further) for computer operations and service requirement for Run2 of the LHC.
A subset of highlights are:

Analysis jobs have been shown to handle 40MB/s (we had better make sure our internal network and disk servers can handle this when using direct I/O).

Planned increase in analysing data from the disk cache in front of our tape system rather than the disk only pool.

Increase in the amount (and types) of data that can be moved to tape. (The VO will be able to give a hint of the expected lifetime on tape. In general ATLAS expects to delete data from tape at a scale not seen before.)

Possibly using a web-enabled object store to allow storage and viewing of log files.

Event selection analysis as a method of data analysis on the sub file level.

I also know what the tabs in bigpanda now do!!! (but that will be another blog ...)

05 December 2014

Where have all my children gone....

Dave here,
So higher powers decided to change their policy on keeping clones of my children. Now we have:
631 children who are unique and only live in one room; 124 who have a twin, 33 triplets and two sets of quads. Hence my children are now much more vulnerable to a room being destroyed or damaged. However, it does mean there are now only 72404 files and 13.4TB of unique data on the GRID.
Of my children, there are 675 Dirks, 14 Gavins and 101 Ursulas.

These are located in 81 rooms across the following 45 Houses:
AGLT2
AUSTRALIA-ATLAS
BNL-OSG2
CERN-PROD
CSCS-LCG2
DESY-HH
FZK-LCG2
GRIF-IRFU
GRIF-LAL
GRIF-LPNHE
IFIC-LCG2
IN2P3-CC
IN2P3-LAPP
IN2P3-LPSC
INFN-MILANO-ATLASC
INFN-NAPOLI-ATLAS
INFN-ROMA1
INFN-T1
JINR-LCG2
LIP-COIMBRA
MPPMU
MWT2
NCG-INGRID-PT
NDGF-T1
NET2
NIKHEF-ELPROD
PIC
PRAGUELCG2
RAL-LCG2 ( I Live Here!!)
RU-PROTVINO-IHEP
SARA-MATRIX
SLACXRD
SMU
SWT2
TAIWAN-LCG2
TECHNION-HEP
TOKYO-LCG2
TR-10-ULAKBIM
TRIUMF-LCG2
UKI-LT2-RHUL
UKI-NORTHGRID-MAN-HEP
UKI-SOUTHGRID-OX-HEP
UNI-FREIBURG
WEIZMANN-LCG2
WUPPERTALPROD

Which corresponds to Australia, Canada, Czech Republic, France, Germany, Israel, Italy, Japan, Netherlands, Portugal, Russia, Spain, Switzerland, Turkey, UK and USA.

01 December 2014

Good Year for FTS Transfers. (My first legitimate use of EB.)

During this year, the WLCG sites running the File Transfer Service (FTS) upgraded to FTS3.
We have also reduced the number of sites running the service. This has led to the RAL service being used more heavily.
A total of 0.224EB (or 224 PBytes) of data was moved using WLCG FTS services (604M files).
This breaks down by VO as follows:
131PB/550M files for ATLAS (92M failed transfers). 66PB/199M files were by the UK FTS.
85PB/48M files for CMS (10M failed transfers). 25PB/14M files were by the UK FTS.
8PB/6M files for all other VOs (6.7M failed transfers). 250TB/1M files were by the UK FTS.
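
As a quick sanity check on the arithmetic, here is a small sketch that adds up the per-VO figures quoted above and works out the UK share. It uses only the numbers in this post and assumes decimal units (1EB = 1000PB); the variable names are made up for illustration.

```python
# Figures as quoted in the post: (data volume in PB, files in millions).
totals = {
    "ATLAS": (131, 550),
    "CMS": (85, 48),
    "other VOs": (8, 6),
}
uk = {
    "ATLAS": (66, 199),
    "CMS": (25, 14),
    "other VOs": (0.25, 1),  # 250TB expressed in PB
}

total_pb = sum(pb for pb, _ in totals.values())
total_mfiles = sum(files for _, files in totals.values())
uk_pb = sum(pb for pb, _ in uk.values())
uk_mfiles = sum(files for _, files in uk.values())

print(f"All FTS: {total_pb} PB ({total_pb / 1000} EB), {total_mfiles}M files")
print(f"UK FTS:  {uk_pb} PB, {uk_mfiles}M files")
print(f"UK share: {uk_pb / total_pb:.0%} of volume, "
      f"{uk_mfiles / total_mfiles:.0%} of files")
```

The per-VO volumes and file counts do add up to the quoted 224PB and 604M files.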

(Of course these figures ignore files created and stored at sites from the output of Worker Node jobs, and also ignore the "chaotic" transfer of files via other data transfer mechanisms.)

18 November 2014

Towards an open (data) science culture

Last week we celebrated the 50th anniversary of ATLAS computing at Chilton where RAL is located. (The anniversary was actually earlier, we just celebrated it now.)

While much of the event was about the computing and had lots of really interesting talks (which should appear on the Chilton site), let's highlight a data talk by Professor Jeremy Frey. If you remember the faster than light neutrinos, Jeremy praised CERN for making the data available early, even with caveats and doubts about the preliminary results. The idea is to get your data out, so people can have a look at it and comment. Even if the preliminary results are wrong and neutrinos are not faster than light, what matters is that the data comes out and people can look at it. And most importantly, that publishing it will not negatively impact people's careers. On the contrary, Jeremy is absolutely right to point out that it should be good for people's careers to make data available (with suitable caveats).

But what would an "open science" data model look like? Suddenly you would get a lot more data flying around, instead of (or in addition to) preprints and random emails and word of mouth. Perhaps it will work a bit like open source, which is supposed to be "given enough eyes, all bugs are shallow." With open source, you sometimes see code which isn't quite ready for production, but at least you can look at the code and figure out whether it will work, and maybe adapt it.

While we are on the subject of open stuff, the code that simulates science and analyses data is also important. Please consider signing the SSI petition.

30 September 2014

Data format descriptions

The highlight of the data area working group meetings at the Open Grid Forum at Imperial recently was the Data Format Description Language. The idea is that if you have a formatted or structured input from a sensor, or a scientific event, and it's not already in one of the formatted, er, formats like (say) OPeNDAP or HDF5, you can use DFDL to describe it and then build a parser which, er, parses records of the format. For example, one use is to validate records before ingesting them into an archive or big data processing facility.

Led by Steve Hanson from IBM, we had an interactive tutorial building a DFDL description for a sensor. The interactive tool looks and feels a bit like Eclipse but is called Integration Toolkit.

And for those eager for more, the appearance of DFDL v1.0 is imminent.

25 September 2014

Erasure-coding: how it can help you.

While some of the mechanisms for data access and placement in the WLCG/EGI grids are increasingly modern, there are underlying assumptions that are rooted in somewhat older design decisions.

Particularly relevantly to this article: on 'The Grid', we tend to increase the resilience of our data against loss by making complete additional copies (either one on tape and one on disk, or additional copies on disk at different physical locations). Similarly, our concepts of data placement are all located at the 'file' level - if you want data to be available somewhere, you access a complete copy from one place or another (or potentially get multiple copies from different places, and the first one to arrive wins).
However, if we allow our concept of data to drop below the file level, we can develop some significant improvements.

Now, some of this is trivial: breaking a file into N chunks and distributing it across multiple devices to 'parallelise' access is called 'striping', and your average RAID controller has been doing it for decades (this is 'RAID0', the simplest RAID mode). Slightly more recently, the 'distributed' class of filesystems (Lustre, GPFS, HDFS et al) have allowed striping of files across multiple servers, to maximise performance across the network connections as well.

Striping, of course, increases the fragility of the data distributed. Rather than being dependent on the failure probability of a single disk (for single-machine striping) or a single server (for SANs), you are now dependent on the probability of any one of a set of entities in the stripe failing (a partial file is usually useless). While the individual failure probabilities are small, this probability scales roughly in proportion to the number of devices in the stripe, assuming their failure modes are independent.
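
To put a number on that: for independent devices that each fail with probability p, the chance that at least one of N fails is 1 - (1 - p)^N, which stays close to N times p while p is small. A minimal sketch, with a per-device failure probability picked purely for illustration:

```python
def stripe_loss_probability(p_device: float, n_devices: int) -> float:
    """Probability that at least one of n independent devices fails."""
    return 1.0 - (1.0 - p_device) ** n_devices


p = 0.02  # assumed per-device failure probability, purely illustrative
for n in (1, 2, 4, 8, 16):
    exact = stripe_loss_probability(p, n)
    print(f"N={n:2d}  exact={exact:.4f}  approx N*p={n * p:.4f}")
```

The N times p estimate slightly overstates the risk, and the gap widens as the stripe gets wider.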

So, we need some way to make our stripes more robust to the failure of components. Luckily, the topic of how to encode data to make it resilient against partial losses (or 'erasures'), via 'erasure codes', is an extremely well developed field indeed.
Essentially, the concept is this: take the N chunks that you have split your data into. Design a function that, when fed N values, outputs an additional M values, such that each of those M values can be independently used to reconstruct a missing value from the original set of N. (The analogy used by the inventors of the Reed-Solomon code, the most widely used erasure-code family, is of overspecifying a polynomial by taking more samples than it has coefficients - you can always reconstruct a polynomial with N coefficients from any N of the N+M samples you have.)
In fact, most erasure-codes will actually do better than that - as well as allowing the reconstruction of data known to be missing, they can also detect and correct data that is bad. The efficiency for this is half that for data reconstruction - you need 2 redundant values for every 1 unknown bad value you need to detect and fix.
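
The polynomial picture can be sketched in a few lines. This is a toy, not a production Reed-Solomon implementation: it treats the N data values as samples of a polynomial at x = 0..N-1, generates M extra samples, and rebuilds the data from any N survivors by Lagrange interpolation. It assumes arithmetic modulo the small prime 257 (real codes typically work in GF(2^8) so every value fits in a byte), it only handles erasures (not silent corruption), and it needs Python 3.8+ for the modular inverse via pow(). All function names are made up for illustration.

```python
P = 257  # small prime field, large enough to hold one byte per value


def interpolate(points, x):
    """Evaluate at x the unique polynomial through `points`, working mod P."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % P
                den = den * (xi - xj) % P
        # pow(den, -1, P) is the modular inverse of den
        total = (total + yi * num * pow(den, -1, P)) % P
    return total


def encode(data, m):
    """Treat data[i] as the sample at x=i and append m extra samples."""
    n = len(data)
    points = list(enumerate(data))
    return list(data) + [interpolate(points, x) for x in range(n, n + m)]


def decode(survivors, n):
    """Rebuild the n original values from any n surviving (x, y) samples."""
    if len(survivors) < n:
        raise ValueError("too many erasures to recover the data")
    points = survivors[:n]
    return [interpolate(points, x) for x in range(n)]


data = list(b"GRID")            # N = 4 values
stripe = encode(data, 2)        # N + M = 6 values
lost = {1, 3}                   # pretend two chunks have gone missing
survivors = [(x, y) for x, y in enumerate(stripe) if x not in lost]
print(bytes(decode(survivors, len(data))))
```

Any two of the six values can be lost and the original four are still recoverable, which is the N+M property described above.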

If we decide how many devices we would expect to fail, we can use an erasure code to 'preprocess' our stripes, writing out N+M chunk stripes.

(The M=1 and M=2 implementations of this approach are called 'RAID5' and 'RAID6' when applied to disk controllers, but the general formulation has almost no limits on M.)

So, how do we apply this approach to Grid storage?

Well, Grid data stores already have a large degree of abstraction and indirection. We use LFCs (or other file catalogues) already to allow a single catalogue entry to tie together multiple replicas of the underlying data in different locations. It is relatively trivial to write a tool that (rather than simply copying a file to a Grid endpoint + registering it in an LFC) splits & encodes data into appropriate chunks, and then stripes them across available endpoints, storing the locations and scheme in the LFC metadata for the record.
Once we've done that, retrieving the files is a simple process, and we are able to perform other optimisations, such as getting all the available chunks in parallel, or healing our stripes on the fly (detecting errors when we download data for use).
Importantly, we do all this while also substantially reducing the minimum storage overhead needed for resilience, from 1 full additional copy of the data to M extra chunks, with M chosen based on the failure rate of our underlying endpoints.
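
To make the workflow concrete, here is a deliberately simplified sketch of such a tool. Everything in it is a stand-in: the endpoints and the catalogue are plain Python dictionaries rather than real Grid storage or an LFC, the names put, get and the site labels are hypothetical, and the encoding is the simplest possible one (a single XOR parity chunk, i.e. M=1) rather than a general erasure code.

```python
from functools import reduce


def split(data: bytes, n: int):
    """Split data into n equal-length chunks, zero-padding the last one."""
    size = -(-len(data) // n)  # ceiling division
    padded = data.ljust(size * n, b"\0")
    return [padded[i * size:(i + 1) * size] for i in range(n)]


def xor_all(chunks):
    """Byte-wise XOR of equal-length chunks."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*chunks))


def put(name, data, n, endpoints, catalogue):
    """Stripe n data chunks plus one XOR parity chunk over the endpoints."""
    chunks = split(data, n)
    chunks.append(xor_all(chunks))
    sites = sorted(endpoints)
    placement = []
    for i, chunk in enumerate(chunks):
        site = sites[i % len(sites)]  # round-robin placement
        endpoints[site][f"{name}.{i}"] = chunk
        placement.append(site)
    # Record the scheme and chunk locations alongside the catalogue entry.
    catalogue[name] = {"n": n, "m": 1, "length": len(data),
                       "placement": placement}


def get(name, endpoints, catalogue):
    """Fetch the chunks, healing at most one missing chunk from the rest."""
    meta = catalogue[name]
    chunks = [endpoints[site].get(f"{name}.{i}")
              for i, site in enumerate(meta["placement"])]
    missing = [i for i, c in enumerate(chunks) if c is None]
    if len(missing) > 1:
        raise IOError("more chunks lost than the code can repair")
    if missing:
        chunks[missing[0]] = xor_all([c for c in chunks if c is not None])
    return b"".join(chunks[:meta["n"]])[:meta["length"]]


endpoints = {"site-a": {}, "site-b": {}, "site-c": {}, "site-d": {}}
catalogue = {}
put("run42.dat", b"some event data", 3, endpoints, catalogue)
endpoints["site-b"].clear()  # simulate losing one endpoint
print(get("run42.dat", endpoints, catalogue))
```

In this example the file survives the loss of any one endpoint at a cost of one extra chunk (a third of the file size) rather than a full second copy. Note that this only holds when there are at least as many endpoints as chunks, so that no endpoint holds two chunks of the same stripe.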

This past summer, one of our summer projects was based around developing just such a suite of wrappers for Grid data management (albeit using the DIRAC file catalogue, rather than the LFC).
We're very happy with Paulin's work on this, and a later post will demonstrate how it works and what we're planning on doing next.