GridPP storage news

25 September 2015

Milestone Passed with Last Two Years of Transfers Through FTS

Over the Last Two years; the FTS system (as used by the WLCG VOs and others) have moved ~0.5EB of data (over a Billion files). What is an EB ? Well its 1000 PBytes or 10^6 TBytes or 10^9 GBytes or 10^12MBytes or 10^15kBytes or 10^18Bytes. This can be seen form the monitoring page:

24 September 2015

Storage system consistency checking required a new methoid to dump nameserver entries from Castor database ( who ever said blog post titles hsould be short and pithy...)

So as to partly fulfil a (re)new request to consistency check our SEs for various WLCG VOs; and for data management of smaller VOs; we realise that the RAL Tier1 needed an improved method to acquire a list per VO if all the files we have in our castor storage system.
We first tried to naively just use the castor "nsfind" command (similar to your normal find command) on our storage system. However we soon realised this caused problems with our production system.
So................. We decided to setup a client host and backup offline database to query. (This also means that we now are acquiring and storing monthly dumps of the storage nameservers for longitudinal analysis (but that is the story of a future post.).

From this we tried creating a dump for one particular VO; and it took 22 days to complete. (this lead to issues as we had wished to dump the database weekly. He improved matters by two methods.
1-Deleted ~5million old empty directories (wish the VO would do this themselves.)
2- Gave the problem to our ORACLE DBAs to look at and try to improve nsfind or come up with another method. (The DBAs gave us a new script; which when run on a smaller part of the namespace reduced the complete time from 8.5 hours to 100 minutes.)

The additional benefit from deleting of empty directories reduced the overall dump from 22 days to 3.5 hours. Hopefully this should mean that we can provide regular (time delayed) dumps of the fileslist of our storage.

What can a lossy fibre in the middle of you r network infrastructure do to your network rate.......

Just a quick post regarding how we noticed packet errors on one of a balanced pair of links within our network. By taking the link down (at ~1700 Wednesday) and re-seating the end connections lead to a mark improvement in rate and reduction in packet loss.

Here is a plot of the number of packets on the link:

Packets on the link

This clearly shows the period with no packets on the link corresponding to when the link was down.

Now look at the packet errors and discards:

Error/Discards on the link

This clearly shows packet errors/discards before the change; and none after the change. The packet "loss" rate was ~ 6k/240k or ~14k/560k which equates to ~2.5 %!!The really interest comes when looking at the data rate through the combined link during this period:

Rate

The data rate through the link pair increases by ~10x when the link is removed (form ~0.6Gbps to 6Gbps.) Also note that there is no dramatic loss when the secondary link is re-nabled (at ~1130 on Thursday.)

31 August 2015

Community knowledge: Upgrading an (SL5) DPM disk node to xrootd4.x from xrootd3.x in place.

While developers try to avoid breaking backwards compatibility between versions, sometimes it is necessary. One such situation occurred for the xrootd protocol, as the long-awaited Xrootd 4 series was released earlier this year, bringing (amongst other changes) true IPv6 support.

Unfortunately, due to the changes involved, support for xrootd4 in DPM was not immediately available (and the various tools which provide assistance and conversion for paths for VOs similarly needed porting). As a result, most sites did not move from xrootd3 at the time.

Because of the complexity of the release process (with some packages built for xrootd4 being available from different dates, and multiple repositories involved), the DPM devs published a blog entry in February concerning special instructions for managing Xrootd4 transition.

Much of the complexity of that blog entry is no longer relevant, however, as all of the dependant packages are now available - but many sites still have systems running Xrootd 3 services, including Glasgow.

So, I took a look at the process for moving from Xrootd3 to 4, on a single disk server. (Xrootd3 and 4 based DPM disk servers can co-exist with each other, and a head node with either release, so there's no need to move them all at once.) We predominantly support ATLAS at Glasgow, so the instructions here are focussed on making sure that ATLAS support works.
[NOTE: this is not sufficient to upgrade a head node to xrootd4, which would require a few additional changes, and I have not tested this yet.]

yum update emi-dpm_disk dpm-xrootd xrootd-server-atlas-n2n-plugin dmlite-plugins-adapter

(
Upgrading the emi-dpm_mysql package just pulls in the core dpm/dmlite functionality, as xrootd is an optional protocol.
dpm-xrootd pulls in updates to xrootd and the dmlite interface to it (which is what we want)
xrootd-server-atlas-n2n-plugin is needed for translation of ATLAS VO surl paths into xrootd paths.
dmlite-plugins-adapter updates the adapter library for dmlite, which is used to allow xrootd to get authorisation/authentication from dpm, for example. For some reason, none of the above packages seem to update it automatically, but without it at a new enough release, dpm-xrootd stuff will not be able to properly talk to emi-dpm_mysql stuff.
)

Specifically, you'll need to ensure that:

dmlite-plugins-adapter >= 0.7.0
xrootd-server-atlas-n2n-plugin >= 0.2

You should also check that the dpm-xrootd package pulls in vomsxrd (>= 0.3) as a dependancy - if it doesn't you need to make sure that the WLCG repo is properly enabled.

You'll also need to open
/etc/sysconfig/xrootd
and add
-k fifo
to the contents of any and all variables with names ending _OPTIONS.

Once this is all done, you can happily restart the xrootd services on the node, and all should be well. (tail -f /var/log/xrootd/disk/xrootd.log can help to spot any issues if they do appear).

24 August 2015

Castor Rebalancer a success at RAL... sort of...

We recently started using the re-balance feature of Castor Storage at RAL-LCG2

And this looked like good news to allow us to keep number of files and free space balanced across diskservers within a pool. However; a couple of days after we turned this feature on; our production team noticed a vast increase in the number of bad incomplete replicate files being produced. (Good news is that the original files still exist; so there is no loss of data. However we thought it might be good idea to effectively turn off re-balancing with a tweak to the settings on our stagerDB/ transfer management system within Castor. (I have since learned a lot more about the usage and output from our "printdbconfig" and "modifydbconfig" commands!) We have been making changes to various settings but the main settings of current interest for this an other issues have been:

CLASS          KEY                                     VALUE
--------------------------------------------------------------------
D2dCopy       MaxNbRetries                      0
Draining        MaxNbFilesScheduled         200
Draining        MaxNbSchedD2dPerDrain 200
Migration      MaxNbMounts                     7
Rebalancing MaxNbFilesScheduled         5
Rebalancing Sensitivity                            100

These current settings seem to have stopped the creation of new problematic files, now "just" need to work out why exactly it seems to have fixed it and see if we can re-enable re-balancing.

30 June 2015

Musings on data confidentiality

Recently I was asked whether STFC should store classified data, such as Secret data (being a gov't facility, all our data is already Official).

If you look at a "normal" data centre like those run by the big cloud providers, they are typically set up to ensure data confidentiality. They have special personnel who are authorised to enter the data centre, and they have all sorts of physical security measures. If they store Secret data they will need clearance.

We have security measures, but we also take visitors round our data centre and if they are monitored all the time it is more for their own health and safety than because we don't trust them. They can take pictures if they like. Of course we would very much like them to not press any buttons but that's also why there's someone with them.We have students who come and work with us also in the data centre, and leave feeling they have made real contributions.

The three basic data security goals are confidentiality, integrity, and availability, and all three are of course important. A "conventional" data centre would probably prioritise confidentiality first, then integrity, and finally availability: it is better that data is temporarily inaccessible than leaked. RAL's data centre on the other hand is different: for us integrity is top - we spend a lot of time checksumming files at rest and in flight, and comparing lists of files with other lists, data volumes with data volumes. Availability is also highly important as science data is collected, transmitted, and processed around the clock. And then in a sense confidentiality is last: for example, hardly anything is encrypted in flight because it would just slow transfers down. Of course we still need to protect scientists' data because "there is definitely a Nobel prize in there!" but our data is not national security, nor even personal/medical data. Yes, of course we protect the science data, but there something to be said for openness too - making open data available, and showing the public some of the good stuff we do. And it would be quite costly to protect against a "highly capable threat," money which is better spent making things go faster. Leave other data centres to guard the national secrets.

24 June 2015

The firewall did it

Now that we have sort of mostly finished setting up the DiRAC data transfers to RAL, we look at the weeks it took and wonder (a) was it worth it and (b) why did it take weeks?

While initially we only back up data from the DiRAC sites - initially Durham - into RAL Tier 1, the reason we set them up as a grid VO is so we can have the grid tools drive the data transfers. The thinking is that although there is an overhead in setting it up and getting it working, the tools that moved nearly a quarter of an exabyte last year will then move the data with the highest possible efficiency. Initially we are going to let it run as fast as it can until ~~someone complains~~ we hit a reasonable target/limit - 3-400 megabytes per second.

[Edit: updated the image as I had inadvertently put a link in to a 'live' image rather than the snapshot]

The green stuff in the plot is primarily DiRAC data coming in at some 250 MB/s; the spike is not related to DiRAC (this would be a case where the most prominent feature in the plot is of no interest to the discussion...a good way to capture readers, perhaps?)

The advantage of having them griddified is also that in the future if we decide to do more stuff, like move the data elsewhere or start doing analysis, it's all ready to go.

So why does it take time to set up? Part of it is all the technical things that need setting up - VOs, local accounts, mailing lists, certificates, gridmap files, monitoring; none of them too onerous but they all take some time to fill in a form and process, they may have changed since the last time we did it, they take time to debug if they aren't working properly, and in the worst case scenario only one person knows how or is authorised to do it and is on leave/off sick/busy.

Then there are the processes: since access rights are to some very high end computing and storage systems, there are processes for reviewing authorisations, proposals, permissions, allocations and quotas, etc. These, too, take time, particularly if a panel review is involved.

Finally there's putting all the pieces together to see if it works. And when it doesn't, is it the VO's fault - they may be new to the business and do something strange - or is there something wrong with the infrastructure - not unlikely if something new is set up for them. In our case it didn't work, and it turns out that GridFTP as the data movement protocol now uses UDP and the Durham firewall blocked UDP. With firewalls there is a tradeoff between the efficiency of the transfer (less firewall is better) and the security they provide (more firewall is better). It needs both "control" ports where services are listening all the time and "data" ports which are ephemeral so need to be opened in a known port range.

20 May 2015

A view from a room at WLCG/CHEP 2015

It is very handy to have both CHEP 2015 and the WLCG 2015 workshop at the same venue as I don't have to change venues! Here are some thoughts I had from the meeting:
From WLCG:
Monitoring and Security issues were my main take away moments from the first day of the WLCG; (looking forward to getting restricted CMS credentials so that I can see their monitoring pages.)
LHC VOs talked about current Run2 improvements and plans for HL-LHC
Many new sites supporting ALICE and plan to expand EOS usage....
ATLAS keep referring to T2 storage as custodial, but they know this is not what we normally mean as "custodial"
LHCb should a nice slide of the processing workflow for data IO ( a 3GB RAW file ends up also producing ~5GB on disk and 5GB on tape; (they merge their data files.)
Long term future is computer centres will become solely data centres possibly....?
OSG are changing CA and so all users will get a new DN. I can't help bit think about ownership of all their old data, will it survive the change?

Interesting talk on hardware. Each component is really only made by 3-4 companies globally.... and our procurement is minuscule.

From CHEP:
Hopefully my posters went down well. My highlights/points of interest were:

RSE within rucio for atlas can be used to make sure you have more than one replicas. Should be really useful for localgroupdisk and also allows for quota.

Intensity Frontier plenary regarding Computing at Femilab for neutrino experiments using small amounts of staff (made me reminisce about SAM system for data management...)

Data preservation talks for ATLAS CDF/DO interesting.
cms prepared to use network circuits for data transfer, expecting end run2 possibly, definitely run3.

Extension to perfSONAR system to allowe adhoc on demand tests between sites. ( I.e akin to refactoring the NDT/NPAD suites but not requiring the special WEB100 kernel.

Interesting to see that the mean read/write rate for BNL for ATLAS experiment is ~70TB/Yr per disk drive. I wander what other rates are....

Some Posters of interest were:
A173 A191 A317 A339 B358 A359 B20 B214 B114 B254 B284 B292 B358 B362 B408 B441

18 May 2015

Mind The Gap

One of the features of modern data science - whether from big instruments, lots of data sources, or somewhere else - is that generally researchers need to collaborate to be able to manage the data. No single institute is able to cope with everything. Thus, many researchers use e-Infrastructures (or cyberinfrastructures to our North American friends), to connect resources and institutes together, but also to enable further collaborations with other researchers.

The next problem then arises when you have two different infrastructures which were not built to talk to each other. Here's where interoperation and standards come in.

One of the things we have talked about for a while but never got round to doing was to bridge (the) two national infrastructures for physics, GridPP and DiRAC (not to be confused with DIRAC nor with DIRAC). Now we will be moving a few petabytes from the latter to the former, initially to back up the data. Which is tricky when there are no common identities, no common data transfer protocols, no common data (replica) catalogues, accounting information, metadata catalogues, etc.

So we're going to bridge the gap without hopefully too much effort on either side, initially by making DiRAC sites look like a Tier2-(very-)lite, with essentially only a GridFTP endpoint and a VO for DiRAC. We will then start to move data across with FTS and see what happens. (Using the analogy above, we are bringing the ends closer to each other rather than increase the voltage :-))

11 May 2015

Notes from JANET/JISC Networkshop 43

These are notes from ~~JANET~~ JISC's Networkshop, now the 43rd, but seen from the GridPP perspective. The workshop took place 0-1-2 April but this post should be timed to appear after the election.

"Big data" started and closed the workshop; the start being Prof Chris Lintott, BBC Sky at Night, er, superstar, talking about Galaxy Zoo: there are too many galaxies out there, and machines can achieve only 85% accuracy in the classification. Core contributors are the kind of people who read science articles in the news, and they contribute because they want to help out. Zooniverse is similar to the grid in a few respects: a single registration lets you contribute to multiple projects (your correspondent asked about using social media to register people, so people could talk about their contributions on social media), and they have unified support for projects (what we would call VOs)

At the other end, a presentation from the Met Office where machines are achieving high accuracy thanks to to investments in the computing and data infrastructure - and of course in the people who develop and program the models, some of whom have spent decades at the Met Office developing them. While our stuff tends to be more parallel and high throughput processing of events, MO's climate and weather is more about supercomputing. Similarities are more in the data area where managing and processing increasing volumes is essential. This is also where the networkshop comes in, support for accessing and moving large volumes of science data. They are also using STFC's JASMIN/CEMS. In fact JASMIN (in a separate presentation) are using similar networkological tools, such as perfsonar and fasterdata.

Sandwiched in between was loads of great stuff:

HP are using SDN also for security purposes. Would be useful to understand. Or interesting. Or both.
A product called "Nutanix" delivering software defined storage for clouds - basically the storage is managed on what we would call worker nodes with a VM dedicated to managing the storage; it replicates blocks across the cluster, and locally using SSDs as cache.
IPv6 was discussed, with our very own Dave Kelsey presenting.
In coffee break discussions with people, WLCG is ahead of the curve being increasingly network-centric. Still very controlled experiment models, but networks are used a lot to move and access data.
Fair bit of moved-stuff-to-the-cloud reports. JANET's (excuse me, JISC's) agreement with Azure, AWS considered helpful.
Similarly, JISC's data centre offers hosting. Different use from ours, but wonder if we should look into moving data to/from our data centres to theirs? Sometimes it is useful to support users, e.g. users of GO or FTS by testing out data transfers between sites, e.g. when the data centres need to run specific end points, like Globus Connect, SRM, GridFTP, etc.
Lots of identity management stuff, which was the main reason your correspondent was there. Also for AARC and EUDAT (more on that later).
And of course talking to people to find out what they're doing and see if we can usefully do stuff together.

Speaking of sandwiched, we were certainly also made welcome at Exeter, with the local staff welcoming us, colour-coded (= orange) students supporting us, and lots of great food, including of course pasties.

28 March 2015

EUDAT and GridPP

EUDAT2020 (the H2020 follow-up project to EUDAT) just finished its kick-off meeting at CSC. Might be useful to jot down a few thoughts on similarities and differences and such before it is too late.

Both EUDAT and GridPP are - as far as this blog is concerned - data e- (or cyber-) infrastructures. The infrastructure is distributed across sites, sites provide storage capacity or users, there is a common authentication and authorisation scheme, there are data discovery mechanisms, both use GOCDB for service availability.

EUDAT will be using CDMI as its storage interface - just like EGI does - and CDMI is in many ways fairly SRM-like. We have previously done work comparing the two.
EUDAT will also be doing HTTP "federations" (i.e. automatic failover when a replica is missing; this is confusingly referred to as "federation" by some people).
Interoperation with EGI is useful/possible/thought-about (delete as applicable). EUDAT's B2STAGE will be interfacing to EGI - there is already a mailing list for discussions.
GridPP's (or WLCG's) metadata management is probably a bit too confusing at the moment since there is no single file catalogue
B2ACCESS is the authentication and authorisation infrastructure in EUDAT; it could interoperate with GridPP via SARoNGS (ask us at OGF44 where we will also look at AARC's relation to GridPP and EUDAT). Jos tells us that KIT also have a SARoNGS type service.
Referencing a file is done with a persistent identifier, rather like the LFN (Logical Filename) GridPP used to have.
"Easy" access via WebDAV is an option for both projects. GlobusOnline is an option (sometimes) for both projects. In fact, B2STAGE is currently using GO, but will also be using FTS.

Using FTS is particularly interesting because it should then be possible to transfer files between EUDAT and GridPP. The differences between the projects are mainly that

GridPP is more mature - has had 14-15 years now to build its infrastructure; EUDAT is of course a much younger project (but then again, EUDAT is not exactly starting from scratch)
EUDAT is doing more "dynamic data" where the data might change later. Also looking at more support for the lifecycle.
EUDAT and GridPP have distinct user communities, to a first approximation at least.
The middleware is different; GridPP does of course offer compute where EUDAT will offer simpler server-side workflows. GridPP services are more integrated, where in EUDAT the B2 services are more separated (but will be unified by the discovery/lookup service and by B2ACCESS)
Authorisation mechanisms will be very different (but might hopefully interface to each other; there are plans for this in B2ACCESS).

There is some overlap between data sites in WLCG and those in EUDAT. This could lead to some interesting collaborations and cross-pollinations. Come to OGF44 and the EGI conference and talk to us about it.

20 March 2015

ISGC 2015 Review and Musings..

The 2015 ISGC Conference is coming to a close; so I thought I would jot down some musings regarding some of the talks I have seen (and presented.) over the last week. Not surprisingly; since the G and C are grids and clouds, a lot of talks were regrading compute, however there were various talks on storage and data management (especially dCache). But most interesting talk was regarding new technology which sees a cpu and network interface incorporated into an individual HDD. this can be seen here:
http://indico3.twgrid.org/indico/contributionDisplay.py?sessionId=26&contribId=80&confId=593

There were also many site discussion from the various asian countries represented, of which network setup and storage was on particular interest (also including using infiniband between Singapore Seattle and Australia.) My perfSONAR talk seem to be well received. It makes the distance our european dataflows have to travel seem trivial.

It was also interesting to listen to some of the Humanities and Arts themed talks. (First time I have ever heard post- modernism used at a conference!!) Their data volume may well be smaller than WLCG VOS; but still complex and uses interesting visualisation methods.

09 March 2015

Some thoughts on data in the cloud gathered at CloudScape VII

Some if-not-quite-live then certainly not at all dead notes from #CloudScapeVII on data in the cloud.

How to establish trust in the cloud data centre? Clouds can run pretty good security, which you'd otherwise only get in the large data centre.

Clouds can build trust by disclosing processes and practices - Rüdiger Dorn, Microsoft
Clarify responsibilities
"35% of security breaches are due to stupid things" - like leaving memory sticks on a train or sending CDs by post... - Giorgio Aprile, AON
Difficulty to inculcate good (security) practice in many end users

"Opportunity to make big data available in cloud" - Robert Jenkins, CloudSigma

Model assumes that end users pay for the ongoing use of data
Democratise data

Data protection

Kuan Hon from QMUL instilled the fear of data protection in everyone that provides data storage. The new data protection stuff doesn't seem to take clouds into accounts - lots of scary implications. [Good thing we are not storing personal data on the grid...]
Protection relies on legal frameworks - sign a contract saying you won't reveal the data - rather than technology (encrypt it to preventing your revealing the data)

Joe Baguley from vmware talked about the abstractions: where RAID abstracted harddrives from storage, we now do lots more abstractions with hypervisors, containers, software-defined-X, etc.

Layers can optimise, so can get excellent performance
Stack can be hard to debug when something doesn't work so well...
Generally more benefits than drawbacks, so a Good Thing™
Overall, speed up data → analysis → app → data → analysis → app → ... cycle

"What's hot in the cloud" - panel of John Higgings (DigitalEurope), Joe Baguley (vmware), David Bernstein (Cloud Strategy Partners), Monique Morrow (CISCO)

Big data is also fast data (support for more Vs), lots of opportunities for in memory processing
Data - use case for predictive analysis and pattern recognition (and in general machine learning)
devops needed to break down barriers [as we know quite well from the grid where we have tb-support née dteam]
Disruptive technological advances to, er, disrupt?
Many end users are using clouds without knowing it -like people using facebook.

Hope I've done it some justice. As always, lots of very interesting things in CloudScape even for those of us who have been providing "cloud" services (in some sense) for a while. Also good of course to catch up with old friends and meeting new ones.

06 March 2015

Storage accounting revisited?

One of the basic features of containers - a thing which can contain something - is that you can see how full it is. If your container happens to be a grid storage element, monitoring information is available in gstat and in our status dashboard. The BDII information system publishes data, and so does the SRM (the storage element control interface), and the larger experiments at least track how much they write.

So what happens if all these measures don't agree? We had a ticket against RAL querying why the BDII published different values from what the experiment thought they had written. It turned out to be partly because someone was attempting to count used space by space token, which leads to quite the wrong results:

Leaving aside whether these should be the correct mappings for ATLAS, the space tokens on the left do not map one-to-one to the actual storage areas (SAs) in the middle (and in general there are SAs without space tokens pointing to them). Note also that the SAs split the accounting data of the disk pools (online storage) so that the sum of the values are the same -- to avoid double counting.

The other reason for the discrepancy was the treatment of read-only servers: these are published as used space by the SRM, but not by the BDII. This is because the BDII is required to be compliant with the installed capacity agreement from a working group from 2008. The document says on p.33,

TotalOnlineSize (in GB=10⁹) is the total online [..] size available at a given moment (it SHOULD not [sic] include broken disk servers, draining pools, etc.)

RAL uses read only disk pools essentially like draining disk pools (unlike tapes, where a read only tape is perfectly readable), so read only disk pools do not count in the total -- they do, however, count as "reserved" as specified in the same document (the GLUE schema probably intended reserved to be more like SRM's reserved, but the WLCG document interprets the field as "allocated somewhere."

Interestingly, RAL does not comply with the installed capacity document in publishing UseddOnlineSize for tape areas. The document specifies

UsedOnlineSize (in GB=10⁹ bytes) is the space occupied by available and accessible files that are not candidates for garbage collection.

It then kind of contradicts itself in the same paragraph, saying

For CASTOR, since all files in T1D0 are candidates for garbage collection, it has been agreed that in this case UsedOnlineSize is equal to [..] TotalOnlineSize.

If we published like this, the used online size would always equal the total size, and the free size would always be zero (because the document also requires that used and free sum to total -- which doesn't always make sense either, but that is a different story.)

OK, so what might we have learnt today about storage accounting?

Storage accounting is always tricky: there are all sorts of funny boundary cases, like candidates for deletion, temporary replicas, scratch space, etc.
Aggregating accounting data across sites only makes sense if they all publish in the same way: they use the same attributes for the same types of values, etc. However, the supported storage elements all vary somewhat in how they treat storage internally.
Before making use of the numbers, it is useful to have some sort of understanding of how they are generated (what do space tokens do? if numbers are the same for two SAs, is it because they are counting the same area twice, or because they split it 50/50? Implementers should document this and keep the documentation up to date!)
There should probably be a time to review these agreements - what is the use of publishing information if it does not tell people what they want to know?
Storage accounting is non-trivial... getting it right vs useful vs achievable is a bit of a balancing act.

25 February 2015

ATLAS MC deletion campaign from tape complete for the RAL Tier1.

So ATLAS have just finished a deletion campaign of monte carlo data from our tape system at the RAL Tier1.

The good news is that previous seen issues of transfers failing due to a timeout (due to misplacing an "I'm done" UDP packet ) seems to have been solved.

ATLAS deleted 1.325 PB of data allowing for our tape system to recover and re-use (when repacking as completed,) approximately ~250 Tapes. ATLAS delete in total 1739588 files. The deletion campaign took 17 days, but we have seen deletion rates at least a factor of four higher capable from the CASTOR system; so the VO should be able to increase their deletion request rate.

What is also of interest (which I am now looking into;) is that ATLAS asked us to delete 211 files which they thought we had but we did not.

Also now may be a good time to provide ATLAS with a list of all the files we have in our tape system to find out which files we have which ATLAS have "forgotten" about.

03 February 2015

Ceph stress testing at the RAL Tier 1

Of some interest to the wider community, the RAL Tier 1 site have been exploring the CEPH object store as a storage solution (some of the aspects of which involve grid interfaces being developed at RAL, Glasgow and CERN).

They've recently performed some interesting performance benchmarks, which Alastair Dewhurst reported on their own blog:

http://www.gridpp.rl.ac.uk/blog/2015/01/22/stress-test-of-ceph-cloud-cluster/

Distributed Erasure Coding backed by DIRAC File Catalogue

So, last year, I wrote a blog post on the background of Erasure Coding as a technique, and trailed an article on our own initial work on implementing such a thing on top of the DIRAC File Catalogue.

This article is a brief description of the work we did (a poster detailing this work will also be at the CHEP2015 conference).

Obviously, there are two elements to the initial implementation of any file transformation tool for an existing catalogue: choosing the encoding engine, and working out how to plumb it into the catalogue.

There are, arguably, two popular, fast, implementations of general erasure coding libraries in use at the moment:
zfec, which backs the Least Authority File System's implementation, and has a nice python api
and
jerasure, which has seen use in several projects, including backing Ceph's erasure coded pools.

As DIRAC is a mostly Python project, we selected zfec as our backend library, which also seems to have been somewhat fortuitous on legal grounds, as jerasure has recently been withdrawn from public availability due to patent challenges in the USA (while this is not a relevant threat in the UK, as we don't have software patents, it makes one somewhat nervous about using it as a library in a new project).

Rather than performing erasure coding as a stream, we perform the EC mapping of a file on disk, which is possibly a little slower, but is also safer and easier to perform.

Interfacing to DIRAC had a few teething problems. Setting up a DIRAC client appropriately was a little more finicky than we expected, and the Dirac File Catalogue implementation had some issues we needed to work around. For example, SEs known to the DFC are assumed good - there's no way of marking an SE as bad, or of telling how usable it is without trying it.

The implementation of the DFC Erasure Coding tool, therefore, also includes a tool which evaluates the health of the SEs available to the VO, and removes unresponsive SEs from its list of potential endpoints for transfers.

As far as the actual implementation for adding files is concerned, it's as simple as creating a directory (with the original filename) in the DFC, and uploading the encoded chunks within it, making sure to upload chunks across the set of SEs known to the DFC to support the VO you're part of.
We use the DFC's metadata store to store information about each chunk as a check for reconstruction. We were interested to discover that adding new metadata names to the DFC also makes them available for all files in the DFC, rather than simply for the files you add them for. We're not sure if this is an intended feature or not.

One of the benefits of any kind of data striping, including EC, is that we can retrieve chunks in parallel from the remote store. Our EC implementation allows the use of parallel transfer via the DFC methods when getting remote files, however, in our initial tests, we didn't see particular performance improvements. (Our test instance, using the Imperial test DIRAC instance, didn't have many SEs available to it, though, so it is hard to evaluate the scaling potential.)

The source code for the original implementation is available from: https://github.com/ptodev/Distributed-Resilient-Storage
(There's a fork by me, which has some attempts to clean up the code and possibly add additional features.)

29 December 2014

Yet another exercise in data recovery?

Just before the Christmas break, my main drive on my main PC - at home - seemed to start to fail (the kernel put it in read-only mode). Good thing we have backups, eh? They are all on portable hard drives, usually encrypted, and maintained with unison. No, they are not "in the cloud."

Surprisingly much of my data is WORM so what if there are differences between the backups? Was it due to those USB3 errors (caused a kernel panic, it did), hardware fault, or that fsck which seemed to discover a problem, or has the file actually changed? (And a big "boo, hiss" to applications that modify files just by opening them - yes, you know who you are.) In my case, I would prefer to re-checksum them all and compare against at least four of the backups. So I need a tool.

My Christmas programming challenge for this year (one should always have one) is then to create a new program to compare my backups. Probably there is one floating around out there, but my scheme - the naming scheme, when I do level zeros, increments, masters, replicas - is probably odd enough that it is useful having a bespoke tool.

On the grid we tend to checksum files as they are transferred. Preservation tools can be asked to "wake up" data every so often and re-check them. Ideally the backup check should quietly validate the checksums in the background as long as the backup drive is mounted.

15 December 2014

Data gateway with dynamic identity - part 1

This doesn't look like GridPP stuff at first, but bear with me...

The grid works by linking sites across the world, by providing a sufficiently high level of infrastructure security using such things as IGTF. The EUDAT project is a data infrastructure project but has users who are unable/unwilling (delete as applicable) to use certificates themselves to authenticate. Thus projects use portals as a "friendly" front end.

So the question is, how do we get data through the proxy? Yes, it's a reverse proxy, or gateway. Using Apache mod_proxy, this is easy to set up, but is limited to using a single credential for the onward connection.

Look at these (powerpoint) slides: in the top left slide, the user connects (e.g. with a browser) to the portal using some sort of lightweight security - either site-local if the portal is within the site, or federated web authentication in general. Based on this, the portal (top right) generates a key pair and obtains a certificate specific to the user - with the user's (distinguished) name and authorisation attributes. It then (bottom left) connects and sends the data back to the user's browser, or possibly, if the browser is capable of understanding the remote protocol, redirects the browser (with suitable onward authentication) to the remote data source.

We are not aware of anyone having done this before - reverse proxy with identity hooks. If the reader knows any, please comment on this post!

So in EUDAT we investigated a few options, including adding hooks to mod_proxy, but built a cheap and cheerful prototype by bringing the neglected ReverseProxy module up to Apache 2.2 and adding hooks into it.

How is this relevant to GridPP, I hear you cry? Well, WLCG uses non-browser protocols extensively for data movement, such as GridFTP and xroot, so you need to translate if the user "only" has a browser (or soonish, you should be able to use WebDAV to some systems, but you still need to authenticate with a certificate.) If this were hooked up to a MyProxy used as a Keystore or certification authority, you could have a lightweight authentication to the portal.

08 December 2014

Ruminations from the ATLAS Computing Jamboree '14

SO..... I have just spent the last 2.5 days at the ATLAS Facilities and Shifters Jamboree at CERN.
The shifters Jamboree was useful to attend since it allowed me to better comprehend the operational shifter's view of issues seen on services that I help keep in working order. The facilities Jamboree helped to highlight the planned changes (near term and further) for computer operations and service requirement for Run2 of the LHC.
A subset of highlights are:

Analysis jobs have been shown to handle 40MB/s (we better make make sure our internal network and disk servers can handle this with using direct I/O.

Planned increase in analysing data from the disk cache in front of our tape system rather than the disk only pool.

Increase in amount (and types) of data the can be moved to tape. (VO will be able to give a hint to expected lifetime on tape. In general ATLAS expect to delete data from tape at a scale not seen before.)

Possibly using an web enabled object store to allow storage and viewing of log files.

Event selection analysis as a method of data analysis on the sub file level.

I also know what the tabs in bigpanda now do!!! (but that will be another blog ...)

05 December 2014

Where have all my children gone....

Dave here,
So higher powers decided to change they policy on keeping clones of my children, now we have:
631 of my children are unique and only live in one room ;124 have a twin, 33 triplets and two sets of quads. Hence now my children are much more vulnerable to a room being destroyed or damaged. However it does mean there are now only 72404 files and 13.4TB of unique data on the GRID.
Of my children; there are 675 Dirks', 14 Gavins' and 101 Ursulas'.

These are located in 81 rooms across the following 45 Houses:
AGLT2
AUSTRALIA-ATLAS
BNL-OSG2
CERN-PROD
CSCS-LCG2
DESY-HH
FZK-LCG2
GRIF-IRFU
GRIF-LAL
GRIF-LPNHE
IFIC-LCG2
IN2P3-CC
IN2P3-LAPP
IN2P3-LPSC
INFN-MILANO-ATLASC
INFN-NAPOLI-ATLAS
INFN-ROMA1
INFN-T1
JINR-LCG2
LIP-COIMBRA
MPPMU
MWT2
NCG-INGRID-PT
NDGF-T1
NET2
NIKHEF-ELPROD
PIC
PRAGUELCG2
RAL-LCG2 ( I Live Here!!)
RU-PROTVINO-IHEP
SARA-MATRIX
SLACXRD
SMU
SWT2
TAIWAN-LCG2
TECHNION-HEP
TOKYO-LCG2
TR-10-ULAKBIM
TRIUMF-LCG2
UKI-LT2-RHUL
UKI-NORTHGRID-MAN-HEP
UKI-SOUTHGRID-OX-HEP
UNI-FREIBURG
WEIZMANN-LCG2
WUPPERTALPROD

Which corresponds to Australia, Canada, Czech Repiblic, France, Germany, Israel, Italy, France, Japan, Netherlands, Portugal, Russia, Spain, Switzerland, Turkey, UK and USA

01 December 2014

Good Year for FTS Transfers.( My first legitimate use of EB.)

During this year, the WLCG sites running the File Transfer Service (FTS) upgraded to FTS3.
We have also reduced the number of sites running the service. This has led RAL service to be used more heavily.
A total of 0.224EB ( or 224 PBytes) of Data was moved using WLCG FTS services ( (604M files).
This is split down by VO by:
131PB/550M files for ATLAS (92M failed transfers). 66PB/199M files were by the UK FTS.
85PB/48M files for CMS (10M failed transfers). 25PB/14M files were by the UK FTS.
8PB/6M files for all other VOs (6.7M failed transfers). 250TB/1M files were by the UK FTS.

(Of course these figures ignore file created and stored at sites from the output of Worker Node jobs and also ignores the "chaotic" data transfer of files via other data transfer mechanisms.)

18 November 2014

Towards an open (data) science culture

Last week we celebrated the 50th anniversary of ATLAS computing at Chilton where RAL is located. (The anniversary was actually earlier, we just celebrated it now.)

While much of the event was about the computing and had lots of really interesting talks (which should appear on the Chilton site), let's highlight a data talk by Professor Jeremy Frey. If you remember the faster than light neutrinos, Jeremy praised CERN for making the data available early, even with caveats and doubts about the preliminary results. The idea is to get your data out, so it people can have a look at it and comment. Even if the preliminary results are wrong and neutrinos are not faster than light, what matters is that the data comes out and people can look at it. And most importantly, that it will not negatively impact people's careers for publishing it.On the contrary, Jeremy is absolutely right to point out that it should be good for people's careers to make data available (with suitable caveats).

But what would an "open science" data model look like? Suddenly you would get a lot more data flying around, instead of (or in addition to) preprints and random emails and word of mouth. Perhaps it will work a bit like open source, which is supposed to be "given enough eyes, all bugs are shallow." With open source, you sometimes see code which isn't quite ready for production, but at least you can look at the code and figure out whether it will work, and maybe adapt it.

While we are on the subject of open stuff, the code that simulates science and analyses data is also important. Please consider signing the SSI petition.

30 September 2014

Data format descriptions

The highlight of the data area working groups meetings at the Open Grid Forum at Imperial recently was the Data Format Description Language . The idea is that if you have a formatted or structured input from a sensor, or a scientific event, and it's not already in one of the formatted, er, formats like (say) OpeNDAP or HDF5, you can use DFDL to describe it and then build a parser which, er, parses records of the format. For example, one use is to validate records before ingesting them into an archive or big data processing facility.

Led by Steve Hanson from IBM, we had an interactive tutorial building a DFDL description for a sensor: the interactive tool looks and feels a bit like Eclipse but is called Integration Toolkit:

And for those eager for more, the appearance of DFDL v1.0 is imminent.

25 September 2014

Erasure-coding: how it can help you.

While some of the mechanisms for data access and placement in the WLCG/EGI grids are increasingly modern, there are underlying assumptions that are rooted in somewhat older design decisions.

Particularly relevantly to this article: on 'The Grid', we tend to increase the resilience of our data against loss by making complete additional copies (either one on tape and one on disk, or additional copies on disk at different physical locations). Similarly, our concepts of data placement are all located at the 'file' level - if you want data to be available somewhere, you access a complete copy from one place or another (or potentially get multiple copies from different places, and the first one to arrive wins).
However, if we allow our concept of data to drop below the file level, we can develop some significant improvements.

Now, some of this is trivial: breaking a file into N chunks and distributing it across multiple devices to 'parallelise' access is called 'striping', and your average RAID controller has been doing it for decades (this is 'RAID0', the simplest RAID mode). Slightly more recently, the 'distributed' class of filesystems (Lustre, GPFS, HDFS et al) have allowed striping of files across multiple servers, to maximise performance across the network connections as well.

Striping, of course, increases the fragility of the data distributed. Rather than being dependent on the failure probability of a single disk (for single-machine striping) or a single server (for SANs), you are now dependent on the probability of any one of a set of entities in the stripe failing (a partial file is usually useless). This probability is likely to scale roughly multiplicatively with the number of devices in the stripe, assuming their failure modes are independent.

So, we need some way to make our stripes more robust to the failure of components. Luckily, the topic of how to encode data to make it resilient against partial losses (or 'erasures'), via 'erasure codes', is an extremely well developed field indeed.
Essentially, the concept is this: take your N chunks that you have split your data into. Design a function such that, when fed N values, will output an additional M values, such that each of those M values can be independently used to reconstruct a missing value from the original set of N. (The analogy used by the inventors of the Reed-Solomon code, the most widely used erasure-code family, is of overspecifying a polynomial by more samples than its order - you can always reconstruct an order N polynomial with any N of the M samples you have.)
In fact, most erasure-codes will actually do better than that - as well as allowing the reconstruction of data known to be missing, they can also detect and correct data that is bad. The efficiency for this is half that for data reconstruction - you need 2 resilient values for every 1 unknown bad value you need to detect and fix.

If we decide how many devices we would expect to fail, we can use an erasure code to 'preprocess' our stripes, writing out N+M chunk stripes.

(The M=1 and M=2 implementations of this approach are called 'RAID5' and 'RAID6' when applied to disk controllers, but the general formulation has almost no limits on M.)

So, how do we apply this approach to Grid storage?

Well, Grid data stores already have a large degree of abstraction and indirection. We use LFCs (or other file catalogues) already to allow a single catalogue entry to tie together multiple replicas of the underlying data in different locations. It is relatively trivial to write a tool that (rather than simply copying a file to a Grid endpoint + registering it in an LFC) splits & encodes data into appropriate chunks, and then stripes them across available endpoints, storing the locations and scheme in the LFC metadata for the record.
Once we've done that, retrieving the files is a simple process, and we are able to perform other optimisations, such as getting all the available chunks in parallel, or healing our stripes on the fly (detecting errors when we download data for use).
Importantly, we do all this while also reducing the lower bound for resiliency substantially from 1 full additional copy of the data to M chunks, chosen based on the failure rate of our underlying endpoints.

This past summer, one of our summer projects was based around developing just such a suite of wrappers for Grid data management (albeit using the DIRAC file catalogue, rather than the LFC).
We're very happy with Paulin's work on this, and a later post will demonstrate how it works and what we're planning on doing next.

22 August 2014

Lambda station

So what did CMS say at GridPP33? Having looked ahead to the future, they came up with more speculative suggestions. Like FNAL's Lambda Station in the past, one suggestion was to look again at scheduling network for transfers, what we might nowadays call network-as-a-service (well, near enough): since we schedule transfers, it would indeed make sense to integrate networks more closely with the pre-allocation at the endpoints (where you'd bringOnline() at the source and schedule the transfer to avoid saturating the channel.) Phoebus is a related approach from Internet2.

21 August 2014

Updated data models from experiments

At the GridPP meeting in Ambleside ATLAS announced having lifetime on their files: not quite like the SRM implementation where a file could have a finite when created, but more like a timer which counts after each access. Unlike SRM, deletion when the file has been not accessed for the set length of time, the file will be automatically deleted. Also notable is that files can now belong to multiple datasets, and they are set with automatic replication policies (well, basically how many replicas at T1s are required.) Now with extra AOD visualisation goodness.

Also interesting updates from LHCb, they are continuing to use SRM to stage files from tape, but could be looking into FTS3 for this. Also discussed the DIRAC integrity checking with Sam over breakfast. In order to confuse the enemy they are not using their own GIT but code from various places: both LHCb and DIRAC have their own repositories, and some code is marked as "abandonware," so determining which code is being used in practice requires asking. This correspondent would have naïvely assumed that whatever comes out of git is being used... perhaps that just for high energy physics...

CMS to speak later.

08 August 2014

ARGUS user suspension with DPM

Many grid services that need to authenticate their users do so with LCAS/LCMAPS plugins, making integration with a site central authentication server such as ARGUS relatively straightforward. With the ARGUS client LCAS/LCMAPS plugins configured, all authentication decisions are referred to the central service at the time they're made. When the site ARGUS is configured to use the EGI/NGI emergency user suspension policies, any centrally suspended user DN will be automatically blocked from accessing the site's services.

However, DPM does it's own authentication and maintains its own list of banned DNs, so rather than referring each decision to the site ARGUS, we need a specific tool to update DPM's view based on the site ARGUS server. Just to complicate matters further, DPM's packages live in the Fedora EPEL repository, which means that they cannot depend on the ARGUS client libraries, which do not.

The solution is the very small 'dpm-argus' package which is available from the EMI3 repositories for both SL5 and SL6; a package dependency bug has prevented its installation in the past, but this has been fixed as of EMI3 Update 19. It should be installed on the DPM head node (if installing manually rather than with yum, you'll also need the argus-pep-api-c package from EMI) and contains two files, the 'dpns-arguspoll' binary, and its manual page.

Running the tool is simple - it needs a 'resource string' to identify itself to the ARGUS server (for normal purposes it doesn't actually matter what it is) and the URL for the site ARGUS:

dpns-arguspoll my_resource_id https://argus.example.org:8154/authz

when run, it will iterate over the DNs known to the DPM, check each one against the ARGUS server, and update the DPM banning state accordingly. All that remains is to run it periodically. At Oxford we have an '/etc/cron.hourly/dpm-argus' script that simply looks like this:

#!/bin/sh
# Sync DPM's internal user banning states from argus

export DPNS_HOST=t2se01.physics.ox.ac.uk
dpns-arguspoll dpm_argleflargle https://t2argus04.physics.ox.ac.uk:8154/authz 2>/dev/null

And that's it. If you want to be able to see the current list of DNs that your DPM server considers to be banned, then you can query the head node database directly:

echo "SELECT username from Cns_userinfo WHERE banned = 1;" | mysql -u dpminfo -p cns_db

At the moment that should show you my test DN, and probably nothing else.

23 July 2014

IPv6 and XrootD 4

Xrootd version 4 has recently been released. As QMUL is involved in IPv6 testing, and as this new release now supports IPv6, I thought I ought to test it. So, what does this involve?

Set up a dual stack virtual machine - our deployment system now makes this relatively easy.
Install xrootd. QMUL is a StoRM/Lustre site, and has an existing xrootd server that is part of Atlas's FAX (Federated ATLAS storage systems using XRootD), so it's just a matter of configuring a new machine to export our posix storage in much the same way. In fact, I've done it slightly differently as I'm also testing ARGUS authentication, but that's something for another blog post.
Test it - the difficult bit...

I decided to test it using CERN's dual stack lxplus machine: lxplus-ipv6.cern.ch.

First, I tested that I'd got FAX set up correctly:


 setupATLAS

 localSetupFAX

 voms-proxy-init --voms atlas

testFAX

All 3 tests were successful, so I've got FAX working, next configure it to use my test machine:


export STORAGEPREFIX=root://xrootd02.esc.qmul.ac.uk:1094/

testFAX

Which also gave 3 successful tests out of 3. Finally, to prove that downloading files works, and that it isn't just redirection that works, I tested a file that should only be at QMUL:


 xrdcp -d 1 root://xrootd02.esc.qmul.ac.uk:1094//atlas/rucio/user/ivukotic:user.ivukotic.xrootd.uki-lt2-qmul-1M -> /dev/null

All of these reported that they were successful. Were they using IPv6 though? Well looking at Xrootd's logs, it certainly thinks so - at least for some connections, though some still seem to be using IPv4:


140723 16:03:47 18291 XrootdXeq: cwalker.19073:26@lxplus0063.cern.ch pub IPv6 login as atlas027

140723 16:04:01 18271 XrootdXeq: cwalker.20147:27@lxplus0063.cern.ch pub IPv4 login as atlas027

140723 16:04:29 23892 XrootdXeq: cwalker.20189:26@lxplus0063.cern.ch pub IPv6 login as atlas027

Progress!!!

30 June 2014

Thank you for making a simple compliance test very happy

Rob and I had a look at the gstat tests for RAL's CASTOR. For a good while now we have had a number of errors/warnings raised. They did not affect production: so what are they?

Each error message has a bit of text associated with it, saying typically "something is incompatible with something else" - like an "access control base rule" (ACBR) is incorrect, or tape published not consistent with type of Storage Element (SE). The ACBR error arises due to legacy attributes being published alongside the modern ones, and the latter complains about CASTOR presenting itself as tape store (via a particular SE)

So what is going on? Well, the (only) way to find out is to locate the test script and find out what exactly it is querying. In this case, it is a python script running LDAP queries, and luckily it can be found in CERN's source code repositories. (How did we find it in this repository? Why, by using a search engine, of course.)

Ah, splendid, so by checking the Documentation™ (also known as "source code" to some), we discover that it needs all ACBRs to be "correct" (not just one for each area) and the legacy ones need an extra slash on the VO value, and an SE with no tape pools should call itself "disk" even if it sits on a tape store.

So it's essentially test driven development: to make the final warnings go away, we need to read the code that is validating it, to engineer the LDIF to make the validation errors go away.

09 June 2014

How much of a small file problem do we have...An update

So as an update to my previous post "How much of a small file problem do we have..."; I decided to have a look at a single part of the namespace within the storage element at the tier1 rather than a single disk server. (The WLCG VOs know this as a scope or family etc.)

When analysing for ATLAS ( if you remember this was the VO I was personally mostly worried about due to the large number of small files; I achieved the following numbers:

Total number of files 3670322

Total number of log files 109025

Volume of log files 4.254TB

Volume of all files 590.731TB

The log files represent ~29.7% of the files within the scope, so perhaps the disk server I picked was enriched with log files compared to the average.

What is worrying is that this 30% of files is only reponsible for 0.7% of the disk space used ( 4.254TB out of a total 590.731TB).

The mean filesize of the log files is 3.9MB and the median filesize is 2.3MB. Also the log files size varies from 6kB to 10GB; so some processes within the VO do seem to be able to create large log files. If one were to remove the log files from the space; then the files mean size would increase from 161MB to 227MB ; and the median filesize would increase from 22.87MB to 45.63MB.

07 May 2014

Public research, open data

RAL hosted a meeting for research councils, other public bodies, and industry participants, on open data, organised with the Big Innovation Centre (we will have a link once the presentations have been uploaded).

As you know, research councils in the UK have data policies which say

Publicly funded data must be made public
Data can be embargoed - even if publicly funded, it will be protected for a period of time, to enable you to get your results, write your papers, achieve world domination. You know, usual stuff.
Data should be usable.
The people who produced the data should be credited for the work - in other words, the data should be cited, as you would cite a publication with results that you use or refer to.

All of these are quite challenging (of this more anon), but interestingly some of the other data publishers had to even train (external) people to use their data. Would you say data is open not just when it is usable, but also actually being used? Certainly makes the policies even more challenging. The next step beyond that would be that the data actually has a measurable economic impact.

You might ask: so what use is the high energy physics (HEP) data, older data, or LHC data such as that held by GridPP, to the general public? But that is the wrong question, because you don't know what use it is till someone's got it and looked at it. If we can't see an application of the data today - someone else might see it, or we might see one tomorrow. And the applications of HEP tend to come after some time: when neutrons were discovered, no one knew what they were good for; today they are used in almost all areas of science. Accelerators used in the early days of physics have led to the ones we use today in physics, but also to the ones used in healthcare. What good will come of the LHC data? Who knows. HEP has the potential to have a huge impact - if you're patient...

24 April 2014

How much of a small file problem do we have...

Here at the Tier1 at RAL-LCG2; we have been draining disk servers with a fury (achieving over 800MB/s on a 10G NIC machine.) Well we get that rate on some servers with large files; but machines with small files achieve a lower rate, but how many small files do we have and is there a VO dependency... So I decided to look at our three largest LCG VOs.
In tabula form; here is the analysis so far:

VO	LHCb	CMS	ATLAS	ATLAS	ATLAS
sub section	All	All	All	non-Log files	Log files
# Files	16305	14717	396887	181799	215088
Size (TB)	37.565	39.599	37.564	35.501	2.062
# Files > 10 GB	1	24	75	75	0
# Files > 1GB	8526	11902	9683	9657	26
# Files < 100MB	4434	2330	3E+06	134137	3E+06
# Files < 10MB	2200	569	265464	68792	196672
# Files < 1MB	1429	294	85190	20587	64603
# Files < 100kB	243	91	6693	2124	4569
# Files < 10kB	6	13	635	156	479
Ave Filesize (GB)	2.30	2.69	0.0946	0.195	0.00959
% space used by files > 1GB	96.71	79.73	64.56

Now what I find interesting is how similar values LHCb and CMS are with each other, even though they are vastly different VOs. What worries me is that over 50% of ATLAS files are less than 10MB. Now just to find a tier2 to do a similar analysis to see if it just a T1 issue.....

01 April 2014

Dell OpenManage for disk servers

As we've been telling everyone who'll listen, we at Oxford are big fans of the Dell 12-bay disk servers for grid storage (previously R510 units, now R720xd ones). A few people have now bought them and asked about monitoring them.

Dell's tools all go by the general 'OpenManage' branding, which covers a great range of things, including various general purpose GUI tools. However, for the disk servers, we generally go for a minimal command-line install.

Dell have the necessary bits available in a YUM-able repository as described on the Dell Linux wiki. Our setup simple involves:

Installing the repository file,
yum install srvadmin-storageservices srvadmin-omcommon,
service dataeng start
and finally logging out and back in again, or otherwise picking up the PATH variable change from the newly installed srvadmin-path.sh script in /etc/profile.d

At that point, you should be able to query the state of your array with the 'omreport' tool, for example:

# omreport storage vdisk controller=0
List of Virtual Disks on Controller PERC H710P Mini (Embedded)

Controller PERC H710P Mini (Embedded)
ID                            : 0
Status                        : Ok
Name                          : VDos
State                         : Ready
Hot Spare Policy violated     : Not Assigned
Encrypted                     : No
Layout                        : RAID-6
Size                          : 100.00 GB (107374182400 bytes)
Associated Fluid Cache State  : Not Applicable
Device Name                   : /dev/sda
Bus Protocol                  : SATA
Media                         : HDD
Read Policy                   : Adaptive Read Ahead
Write Policy                  : Write Back
Cache Policy                  : Not Applicable
Stripe Element Size           : 64 KB
Disk Cache Policy             : Enabled

We also have a rough and ready Nagios plugin which simply checks that each physical disk reports as 'OK' and 'Online' and complains if anything else is reported.