25 February 2015

ATLAS MC deletion campaign from tape complete for the RAL Tier1.

So ATLAS have just finished a deletion campaign of Monte Carlo data from our tape system at the RAL Tier1.

The good news is that the previously seen issue of transfers failing due to a timeout (caused by a misplaced "I'm done" UDP packet) seems to have been solved.

ATLAS deleted 1.325 PB of data, allowing our tape system to recover and re-use (once repacking is completed) approximately 250 tapes. In total ATLAS deleted 1739588 files. The deletion campaign took 17 days, but we have seen the CASTOR system capable of deletion rates at least a factor of four higher; so the VO should be able to increase their deletion request rate.

What is also of interest (and which I am now looking into) is that ATLAS asked us to delete 211 files which they thought we had but we did not.

Also, now may be a good time to provide ATLAS with a list of all the files we have in our tape system, to find out which files we have which ATLAS have "forgotten" about.

03 February 2015

Ceph stress testing at the RAL Tier 1

Of some interest to the wider community, the RAL Tier 1 site has been exploring the Ceph object store as a storage solution (some aspects of which involve grid interfaces being developed at RAL, Glasgow and CERN).

They've recently performed some interesting performance benchmarks, which Alastair Dewhurst reported on their own blog:

http://www.gridpp.rl.ac.uk/blog/2015/01/22/stress-test-of-ceph-cloud-cluster/

Distributed Erasure Coding backed by DIRAC File Catalogue

So, last year, I wrote a blog post on the background of Erasure Coding as a technique, and trailed an article on our own initial work on implementing such a thing on top of the DIRAC File Catalogue.

This article is a brief description of the work we did (a poster detailing this work will also be at the CHEP2015 conference).

Obviously, there are two elements to the initial implementation of any file transformation tool for an existing catalogue: choosing the encoding engine, and working out how to plumb it into the catalogue.

There are, arguably, two popular, fast implementations of general erasure coding libraries in use at the moment:
zfec, which backs the Least Authority File System's implementation, and has a nice Python API,
and
jerasure, which has seen use in several projects, including backing Ceph's erasure coded pools.

As DIRAC is mostly a Python project, we selected zfec as our backend library. This also seems to have been somewhat fortuitous on legal grounds, as jerasure has recently been withdrawn from public availability due to patent challenges in the USA. (While this is not a relevant threat in the UK, as we don't have software patents, it makes one somewhat nervous about using it as a library in a new project.)

Rather than performing erasure coding as a stream, we perform the EC mapping of a file on disk, which is possibly a little slower, but is also safer and easier to perform.

Interfacing to DIRAC had a few teething problems. Setting up a DIRAC client appropriately was a little more finicky than we expected, and the DIRAC File Catalogue implementation had some issues we needed to work around. For example, SEs known to the DFC are assumed good - there's no way of marking an SE as bad, or of telling how usable it is without trying it.

The implementation of the DFC Erasure Coding tool, therefore, also includes a tool which evaluates the health of the SEs available to the VO, and removes unresponsive SEs from its list of potential endpoints for transfers.

As far as the actual implementation for adding files is concerned, it's as simple as creating a directory (with the original filename) in the DFC, and uploading the encoded chunks within it, making sure to upload chunks across the set of SEs known to the DFC to support the VO you're part of.
We use the DFC's metadata store to store information about each chunk as a check for reconstruction. We were interested to discover that adding new metadata names to the DFC also makes them available for all files in the DFC, rather than simply for the files you add them for. We're not sure if this is an intended feature or not.
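As an illustration of that layout (plain Python only; the chunk splitting shown and the metadata field names are hypothetical stand-ins for the tool's actual schema, and no DIRAC API is used here):

```python
import hashlib
import os

def split_into_chunks(path, n_chunks):
    """Split a file into n_chunks roughly equal pieces, returning
    (chunk_bytes, metadata) pairs mirroring the layout described above:
    one 'directory' named after the file, one entry per chunk, with
    per-chunk metadata kept as a reconstruction check."""
    size = os.path.getsize(path)
    chunk_size = -(-size // n_chunks)  # ceiling division
    chunks = []
    with open(path, "rb") as f:
        for i in range(n_chunks):
            data = f.read(chunk_size)
            meta = {
                "chunk_index": i,      # position in the stripe
                "chunk_checksum": hashlib.md5(data).hexdigest(),
                "original_size": size, # needed to trim padding on rebuild
            }
            chunks.append((data, meta))
    return chunks
```

In the real tool each chunk would be uploaded to a different SE and the metadata stored in the DFC's metadata store.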

One of the benefits of any kind of data striping, including EC, is that we can retrieve chunks in parallel from the remote store. Our EC implementation allows the use of parallel transfer via the DFC methods when getting remote files; however, in our initial tests, we didn't see particular performance improvements. (Our test instance, using the Imperial test DIRAC instance, didn't have many SEs available to it, though, so it is hard to evaluate the scaling potential.)
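The parallel-retrieval idea can be sketched with a thread pool (here `fetch_chunk` and the in-memory `chunk_store` are stand-ins for the real DFC/SE download calls, not the actual DIRAC API):

```python
from concurrent.futures import ThreadPoolExecutor

# In-memory 'store' standing in for remote SEs.
chunk_store = {"f.0": b"spam", "f.1": b"eggs"}

def fetch_chunk(chunk_name):
    # Stand-in for a DFC/SE download; returns the chunk's bytes.
    return chunk_store[chunk_name]

def get_chunks_parallel(chunk_names, max_workers=4):
    """Download all chunks of a stripe concurrently, preserving order
    so the file can be reassembled (or decoded) afterwards."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_chunk, chunk_names))
```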



The source code for the original implementation is available from: https://github.com/ptodev/Distributed-Resilient-Storage
(There's a fork by me, which has some attempts to clean up the code and possibly add additional features.)

29 December 2014

Yet another exercise in data recovery?

Just before the Christmas break, my main drive on my main PC - at home - seemed to start to fail (the kernel put it in read-only mode). Good thing we have backups, eh? They are all on portable hard drives, usually encrypted, and maintained with unison. No, they are not "in the cloud."

Surprisingly much of my data is WORM so what if there are differences between the backups? Was it due to those USB3 errors (caused a kernel panic, it did), hardware fault, or that fsck which seemed to discover a problem, or has the file actually changed? (And a big "boo, hiss" to applications that modify files just by opening them - yes, you know who you are.) In my case, I would prefer to re-checksum them all and compare against at least four of the backups. So I need a tool.

My Christmas programming challenge for this year (one should always have one) is then to create a new program to compare my backups. Probably there is one floating around out there, but my scheme - the naming scheme, when I do level zeros, increments, masters, replicas - is probably odd enough that it is useful having a bespoke tool.

On the grid we tend to checksum files as they are transferred. Preservation tools can be asked to "wake up" data every so often and re-check them. Ideally the backup check should quietly validate the checksums in the background as long as the backup drive is mounted.
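A minimal sketch of the kind of comparison tool in question (stdlib Python only; the real version would also need to understand the backup naming scheme for level zeros, increments, masters and replicas):

```python
import hashlib
import os

def checksum_tree(root):
    """Map each file's path (relative to root) to its SHA-256 checksum."""
    sums = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            full = os.path.join(dirpath, name)
            h = hashlib.sha256()
            with open(full, "rb") as f:
                for block in iter(lambda: f.read(1 << 20), b""):
                    h.update(block)
            sums[os.path.relpath(full, root)] = h.hexdigest()
    return sums

def compare_backups(primary, replica):
    """Report files that differ, or that exist on only one side."""
    a, b = checksum_tree(primary), checksum_tree(replica)
    differ = {p for p in a.keys() & b.keys() if a[p] != b[p]}
    return differ, a.keys() - b.keys(), b.keys() - a.keys()
```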

15 December 2014

Data gateway with dynamic identity - part 1

This doesn't look like GridPP stuff at first, but bear with me...

The grid works by linking sites across the world, providing a sufficiently high level of infrastructure security using such things as IGTF. The EUDAT project is a data infrastructure project but has users who are unable/unwilling (delete as applicable) to use certificates themselves to authenticate. Thus projects use portals as a "friendly" front end.

So the question is, how do we get data through the proxy?  Yes, it's a reverse proxy, or gateway. Using Apache mod_proxy, this is easy to set up, but is limited to using a single credential for the onward connection.
Look at these (powerpoint) slides: in the top left slide, the user connects (e.g. with a browser) to the portal using some sort of lightweight security - either site-local if the portal is within the site, or federated web authentication in general. Based on this, the portal (top right) generates a key pair and obtains a certificate specific to the user - with the user's (distinguished) name and authorisation attributes. It then (bottom left) connects and sends the data back to the user's browser, or possibly, if the browser is capable of understanding the remote protocol, redirects the browser (with suitable onward authentication) to the remote data source.
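For reference, a bare mod_proxy reverse-proxy stanza for the simple case might look like this (hostnames and paths illustrative); note the single, fixed certificate used for every onward TLS connection, which is exactly the limitation described above:

```apache
<Location "/data/">
    ProxyPass        "https://storage.example.org/data/"
    ProxyPassReverse "https://storage.example.org/data/"
</Location>
# One credential for ALL proxied connections, whoever the user is:
SSLProxyEngine on
SSLProxyMachineCertificateFile "/etc/grid-security/portal-proxy-cert.pem"
```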

We are not aware of anyone having done this before - reverse proxy with identity hooks. If the reader knows any, please comment on this post!

So in EUDAT we investigated a few options, including adding hooks to mod_proxy, but built a cheap and cheerful prototype by bringing the neglected ReverseProxy module up to Apache 2.2 and adding hooks into it.

How is this relevant to GridPP, I hear you cry? Well, WLCG uses non-browser protocols extensively for data movement, such as GridFTP and xroot, so you need to translate if the user "only" has a browser (or, soonish, you should be able to use WebDAV to some systems, but you still need to authenticate with a certificate). If this were hooked up to a MyProxy used as a keystore or certification authority, you could have lightweight authentication to the portal.

08 December 2014

Ruminations from the ATLAS Computing Jamboree '14

SO..... I have just spent the last 2.5 days at the ATLAS Facilities and Shifters Jamboree at CERN.
The shifters Jamboree was useful to attend since it allowed me to better comprehend the operational shifter's view of issues seen on services that I help keep in working order. The facilities Jamboree helped to highlight the planned changes (near term and further out) for computing operations and service requirements for Run 2 of the LHC.
A subset of highlights are:

Analysis jobs have been shown to handle 40MB/s (we had better make sure our internal network and disk servers can handle this when using direct I/O).

A planned increase in analysing data from the disk cache in front of our tape system, rather than from the disk-only pool.

An increase in the amount (and types) of data that can be moved to tape. (VOs will be able to give a hint of the expected lifetime on tape. In general ATLAS expect to delete data from tape at a scale not seen before.)

Possibly using a web-enabled object store to allow storage and viewing of log files.

Event selection analysis as a method of data analysis at the sub-file level.

I also know what the tabs in bigpanda now do!!! (but that will be another blog ...)

05 December 2014

Where have all my children gone....

Dave here,
So the higher powers decided to change their policy on keeping clones of my children; now we have:
631 of my children are unique and live in only one room; 124 have a twin, 33 are triplets, and there are two sets of quads. Hence my children are now much more vulnerable to a room being destroyed or damaged. However, it does mean there are now only 72404 files and 13.4TB of unique data on the GRID.
Of my children, there are 675 Dirks, 14 Gavins and 101 Ursulas.

These are located in 81 rooms across the following 45 Houses:
AGLT2
AUSTRALIA-ATLAS
BNL-OSG2
CERN-PROD
CSCS-LCG2
DESY-HH
FZK-LCG2
GRIF-IRFU
GRIF-LAL
GRIF-LPNHE
IFIC-LCG2
IN2P3-CC
IN2P3-LAPP
IN2P3-LPSC
INFN-MILANO-ATLASC
INFN-NAPOLI-ATLAS
INFN-ROMA1
INFN-T1
JINR-LCG2
LIP-COIMBRA
MPPMU
MWT2
NCG-INGRID-PT
NDGF-T1
NET2
NIKHEF-ELPROD
PIC
PRAGUELCG2
RAL-LCG2 ( I Live Here!!)
RU-PROTVINO-IHEP
SARA-MATRIX
SLACXRD
SMU
SWT2
TAIWAN-LCG2
TECHNION-HEP
TOKYO-LCG2
TR-10-ULAKBIM
TRIUMF-LCG2
UKI-LT2-RHUL
UKI-NORTHGRID-MAN-HEP
UKI-SOUTHGRID-OX-HEP
UNI-FREIBURG
WEIZMANN-LCG2
WUPPERTALPROD

Which corresponds to Australia, Canada, Czech Republic, France, Germany, Israel, Italy, Japan, Netherlands, Portugal, Russia, Spain, Switzerland, Taiwan, Turkey, UK and USA.


01 December 2014

Good Year for FTS Transfers. (My first legitimate use of EB.)

During this year, the WLCG sites running the File Transfer Service (FTS)  upgraded to FTS3.
We have also reduced the number of sites running the service, which has led to the RAL service being used more heavily.
A total of 0.224EB (or 224 PBytes) of data was moved using the WLCG FTS services (604M files).
This splits down by VO as:
131PB/550M files for ATLAS (92M failed transfers).  66PB/199M files were by the UK FTS.
85PB/48M files for CMS (10M failed  transfers).  25PB/14M files were by the UK FTS.
8PB/6M files for all other VOs (6.7M failed transfers). 250TB/1M files were by the UK FTS.

(Of course these figures ignore files created and stored at sites from the output of worker node jobs, and also ignore the "chaotic" transfer of files via other data transfer mechanisms.)

18 November 2014

Towards an open (data) science culture

Last week we celebrated the 50th anniversary of Atlas computing at Chilton, where RAL is located. (The anniversary was actually earlier; we just celebrated it now.)

While much of the event was about the computing and had lots of really interesting talks (which should appear on the Chilton site), let's highlight a data talk by Professor Jeremy Frey. If you remember the faster-than-light neutrinos, Jeremy praised CERN for making the data available early, even with caveats and doubts about the preliminary results. The idea is to get your data out, so people can have a look at it and comment. Even if the preliminary results are wrong and neutrinos are not faster than light, what matters is that the data comes out and people can look at it. And most importantly, that it will not negatively impact people's careers to publish it. On the contrary, Jeremy is absolutely right to point out that it should be good for people's careers to make data available (with suitable caveats).

But what would an "open science" data model look like? Suddenly you would get a lot more data flying around, instead of (or in addition to) preprints and random emails and word of mouth. Perhaps it will work a bit like open source, where "given enough eyeballs, all bugs are shallow." With open source, you sometimes see code which isn't quite ready for production, but at least you can look at the code and figure out whether it will work, and maybe adapt it.

While we are on the subject of open stuff, the code that simulates science and analyses data is also important. Please consider signing the SSI petition.

30 September 2014

Data format descriptions

The highlight of the data area working group meetings at the Open Grid Forum at Imperial recently was the Data Format Description Language (DFDL). The idea is that if you have a formatted or structured input from a sensor, or a scientific event, and it's not already in one of the formatted, er, formats like (say) OPeNDAP or HDF5, you can use DFDL to describe it and then build a parser which, er, parses records of the format. For example, one use is to validate records before ingesting them into an archive or big data processing facility.

Led by Steve Hanson from IBM, we had an interactive tutorial building a DFDL description for a sensor; the interactive tool looks and feels a bit like Eclipse but is called the Integration Toolkit.
And for those eager for more, the appearance of DFDL v1.0 is imminent.

25 September 2014

Erasure-coding: how it can help *you*.

While some of the mechanisms for data access and placement in the WLCG/EGI grids are increasingly modern, there are underlying assumptions that are rooted in somewhat older design decisions.

Particularly relevant to this article: on 'The Grid', we tend to increase the resilience of our data against loss by making complete additional copies (either one on tape and one on disk, or additional copies on disk at different physical locations). Similarly, our concepts of data placement all operate at the 'file' level - if you want data to be available somewhere, you access a complete copy from one place or another (or potentially get multiple copies from different places, and the first one to arrive wins).
However, if we allow our concept of data to drop below the file level, we can develop some significant improvements.

Now, some of this is trivial: breaking a file into N chunks and distributing it across multiple devices to 'parallelise' access is called 'striping', and your average RAID controller has been doing it for decades (this is 'RAID0', the simplest RAID mode). Slightly more recently, the 'distributed' class of filesystems (Lustre, GPFS, HDFS et al) have allowed striping of files across multiple servers, to maximise performance across the network connections as well.

Striping, of course, increases the fragility of the data distributed. Rather than being dependent on the failure probability of a single disk (for single-machine striping) or a single server (for SANs), you are now dependent on the probability of any one of a set of entities in the stripe failing (a partial file is usually useless). Assuming the failure modes are independent and the individual failure probabilities small, this probability grows roughly linearly with the number of devices in the stripe.

So, we need some way to make our stripes more robust to the failure of components. Luckily, the topic of how to encode data to make it resilient against partial losses (or 'erasures'), via 'erasure codes', is an extremely well developed field indeed.
Essentially, the concept is this: take the N chunks that you have split your data into. Design a function which, when fed those N values, will output an additional M values, such that each of those M values can be independently used to reconstruct a missing value from the original set of N. (The analogy used by the inventors of the Reed-Solomon code, the most widely used erasure-code family, is of overspecifying a polynomial by more samples than its order - you can always reconstruct an order-N polynomial from any N of the N+M samples you have.)
In fact, most erasure codes will actually do better than that - as well as allowing the reconstruction of data known to be missing, they can also detect and correct data that is bad. The efficiency of this is half that of data reconstruction - you need 2 resilient values for every 1 unknown bad value you need to detect and fix.

If we decide how many devices we would expect to fail, we can use an erasure code to 'preprocess' our stripes, writing out N+M chunk stripes.

(The M=1 and M=2 implementations of this approach are called 'RAID5' and 'RAID6' when applied to disk controllers, but the general formulation has almost no limits on M.)
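The M=1 (RAID5-style) case is simple enough to sketch in a few lines of Python. This is plain XOR parity rather than a general Reed-Solomon code, but the recovery property is the same: any one missing chunk in the stripe can be rebuilt from the survivors.

```python
def xor_blocks(blocks):
    """XOR equal-length byte blocks together."""
    out = bytearray(len(blocks[0]))
    for block in blocks:
        for i, byte in enumerate(block):
            out[i] ^= byte
    return bytes(out)

def encode(chunks):
    """N data chunks -> an N+1 chunk stripe (M=1 parity)."""
    return chunks + [xor_blocks(chunks)]

def recover(stripe, missing_index):
    """Rebuild one lost chunk: it is the XOR of all the others."""
    survivors = [c for i, c in enumerate(stripe) if i != missing_index]
    return xor_blocks(survivors)
```

General M needs a proper erasure-coding library (zfec, jerasure), but the shape of the operation is the same.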

So, how do we apply this approach to Grid storage?

Well, Grid data stores already have a large degree of abstraction and indirection. We use LFCs (or other file catalogues) already to allow a single catalogue entry to tie together multiple replicas of the underlying data in different locations. It is relatively trivial to write a tool that (rather than simply copying a file to a Grid endpoint + registering it in an LFC) splits & encodes data into appropriate chunks, and then stripes them across available endpoints, storing the locations and scheme in the LFC metadata for the record.
Once we've done that, retrieving the files is a simple process, and we are able to perform other optimisations, such as getting all the available chunks in parallel, or healing our stripes on the fly (detecting errors when we download data for use).
Importantly, we do all this while also reducing the lower bound for resiliency substantially from 1 full additional copy of the data to M chunks, chosen based on the failure rate of our underlying endpoints.

This past summer, one of our summer projects was based around developing just such a suite of wrappers for Grid data management (albeit using the DIRAC file catalogue, rather than the LFC).
We're very happy with Paulin's work on this, and a later post will demonstrate how it works and what we're planning on doing next.

22 August 2014

Lambda station

So what did CMS say at GridPP33? Having looked ahead to the future, they came up with more speculative suggestions. Like FNAL's Lambda Station in the past, one suggestion was to look again at scheduling networks for transfers - what we might nowadays call network-as-a-service (well, near enough): since we schedule transfers, it would indeed make sense to integrate networks more closely with the pre-allocation at the endpoints (where you'd bringOnline() at the source and schedule the transfer to avoid saturating the channel). Phoebus is a related approach from Internet2.

21 August 2014

Updated data models from experiments

At the GridPP meeting in Ambleside, ATLAS announced having lifetimes on their files: not quite like the SRM implementation, where a file could be given a finite lifetime when created, but more like a timer which resets after each access. Unlike SRM, when the file has not been accessed for the set length of time, it will be automatically deleted. Also notable is that files can now belong to multiple datasets, and they are set with automatic replication policies (well, basically how many replicas at T1s are required). Now with extra AOD visualisation goodness.
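The access-reset timer as described might be sketched like this (a minimal illustration, not ATLAS's actual implementation):

```python
import time

class ManagedFile:
    """A file whose lifetime timer resets on each access; once it has
    gone unaccessed for `lifetime` seconds it becomes eligible for
    automatic deletion."""
    def __init__(self, lifetime, now=None):
        self.lifetime = lifetime
        self.last_access = now if now is not None else time.time()

    def access(self, now=None):
        # Every access restarts the countdown.
        self.last_access = now if now is not None else time.time()

    def expired(self, now=None):
        now = now if now is not None else time.time()
        return now - self.last_access > self.lifetime
```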

There were also interesting updates from LHCb: they are continuing to use SRM to stage files from tape, but could be looking into FTS3 for this. Also discussed DIRAC integrity checking with Sam over breakfast. In order to confuse the enemy, they are not using a single Git repository of their own but code from various places: both LHCb and DIRAC have their own repositories, and some code is marked as "abandonware", so determining which code is being used in practice requires asking. This correspondent would have naïvely assumed that whatever comes out of Git is being used... perhaps that's just for high energy physics...

CMS to speak later.

08 August 2014

ARGUS user suspension with DPM

Many grid services that need to authenticate their users do so with LCAS/LCMAPS plugins, making integration with a site central authentication server such as ARGUS relatively straightforward. With the ARGUS client LCAS/LCMAPS plugins configured, all authentication decisions are referred to the central service at the time they're made. When the site ARGUS is configured to use the EGI/NGI emergency user suspension policies, any centrally suspended user DN will be automatically blocked from accessing the site's services.

However, DPM does its own authentication and maintains its own list of banned DNs, so rather than referring each decision to the site ARGUS, we need a specific tool to update DPM's view based on the site ARGUS server. Just to complicate matters further, DPM's packages live in the Fedora EPEL repository, which means that they cannot depend on the ARGUS client libraries, which do not live there.

The solution is the very small 'dpm-argus' package which is available from the EMI3 repositories for both SL5 and SL6; a package dependency bug has prevented its installation in the past, but this has been fixed as of EMI3 Update 19. It should be installed on the DPM head node (if installing manually rather than with yum, you'll also need the argus-pep-api-c package from EMI) and contains two files, the 'dpns-arguspoll' binary, and its manual page.

Running the tool is simple - it needs a 'resource string' to identify itself to the ARGUS server (for normal purposes it doesn't actually matter what it is) and the URL for the site ARGUS:
dpns-arguspoll my_resource_id https://argus.example.org:8154/authz
When run, it will iterate over the DNs known to the DPM, check each one against the ARGUS server, and update the DPM banning state accordingly. All that remains is to run it periodically. At Oxford we have an '/etc/cron.hourly/dpm-argus' script that simply looks like this:
#!/bin/sh
# Sync DPM's internal user banning states from argus

export DPNS_HOST=t2se01.physics.ox.ac.uk
dpns-arguspoll dpm_argleflargle https://t2argus04.physics.ox.ac.uk:8154/authz 2>/dev/null
And that's it. If you want to be able to see the current list of DNs that your DPM server considers to be banned, then you can query the head node database directly:
echo "SELECT username from Cns_userinfo WHERE banned = 1;" | mysql -u dpminfo -p cns_db
At the moment that should show you my test DN, and probably nothing else.

23 July 2014

IPv6 and XrootD 4

Xrootd version 4 has recently been released. As QMUL is involved in IPv6 testing, and as this new release now supports IPv6, I thought I ought to test it.  So,  what does this involve?

  1. Set up a dual stack virtual machine - our deployment system now makes this relatively easy. 
  2. Install xrootd. QMUL is a StoRM/Lustre site, and has an existing xrootd server that is part of Atlas's  FAX (Federated ATLAS storage systems using XRootD), so it's just a matter of configuring a new machine to export our posix storage in much the same way.  In fact, I've done it slightly differently as I'm also testing ARGUS authentication, but that's something for another blog post. 
  3. Test it - the difficult bit...
I decided to test it using CERN's dual stack lxplus machine: lxplus-ipv6.cern.ch.

First, I tested that I'd got FAX set up correctly:

setupATLAS
localSetupFAX
voms-proxy-init --voms atlas
testFAX


All 3 tests were successful, so I've got FAX working, next configure it to use my test machine:

export STORAGEPREFIX=root://xrootd02.esc.qmul.ac.uk:1094/
testFAX


Which also gave 3 successful tests out of 3. Finally, to prove that downloading files works, and that it isn't just redirection that works, I tested a file that should only be at QMUL:

xrdcp -d 1 root://xrootd02.esc.qmul.ac.uk:1094//atlas/rucio/user/ivukotic:user.ivukotic.xrootd.uki-lt2-qmul-1M -> /dev/null 

All of these reported that they were successful. Were they using IPv6 though? Well, looking at Xrootd's logs, it certainly thinks so - at least for some connections, though some still seem to be using IPv4:

140723 16:03:47 18291 XrootdXeq: cwalker.19073:26@lxplus0063.cern.ch pub IPv6 login as atlas027
140723 16:04:01 18271 XrootdXeq: cwalker.20147:27@lxplus0063.cern.ch pub IPv4 login as atlas027
140723 16:04:29 23892 XrootdXeq: cwalker.20189:26@lxplus0063.cern.ch pub IPv6 login as atlas027


Progress!!!

30 June 2014

Thank you for making a simple compliance test very happy

Rob and I had a look at the gstat tests for RAL's CASTOR. For a good while now we have had a number of errors/warnings raised. They did not affect production: so what are they?

Each error message has a bit of text associated with it, saying typically "something is incompatible with something else" - like an "access control base rule" (ACBR) being incorrect, or the tape published being inconsistent with the type of Storage Element (SE). The ACBR error arises from legacy attributes being published alongside the modern ones, and the latter complains about CASTOR presenting itself as a tape store (via a particular SE).

So what is going on?  Well, the (only) way to find out is to locate the test script and find out what exactly it is querying. In this case, it is a python script running LDAP queries, and luckily it can be found in CERN's source code repositories. (How did we find it in this repository? Why, by using a search engine, of course.)

Ah, splendid: so by checking the Documentation™ (also known as "source code" to some), we discover that it needs all ACBRs to be "correct" (not just one for each area), that the legacy ones need an extra slash on the VO value, and that an SE with no tape pools should call itself "disk" even if it sits on a tape store.

So it's essentially test-driven development: to make the final warnings go away, we need to read the code that is validating the information, and engineer the LDIF to make the validation errors go away.

09 June 2014

How much of a small file problem do we have...An update

So, as an update to my previous post "How much of a small file problem do we have...", I decided to look at a single part of the namespace within the storage element at the Tier1, rather than a single disk server. (The WLCG VOs know this as a scope or family etc.)
When analysing for ATLAS (if you remember, this was the VO I was personally most worried about due to the large number of small files), I achieved the following numbers:


Total number of files:     3670322
Total number of log files: 109025
Volume of log files:       4.254TB
Volume of all files:       590.731TB
The log files represent ~29.7% of the files within the scope, so perhaps the disk server I picked was enriched with log files compared to the average.
What is worrying is that this ~30% of files is responsible for only 0.7% of the disk space used (4.254TB out of a total 590.731TB).
The mean filesize of the log files is 3.9MB and the median filesize is 2.3MB. The log file sizes vary from 6kB to 10GB, so some processes within the VO do seem to be able to create large log files. If one were to remove the log files from the space, then the mean filesize would increase from 161MB to 227MB, and the median filesize would increase from 22.87MB to 45.63MB.
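The kind of statistics quoted above can be pulled out of a namespace dump with a few lines of Python. This is a sketch: it assumes a simple list of (name, size-in-bytes) pairs and identifies log files by name, both of which are assumptions about the dump format.

```python
import statistics

def size_stats(files):
    """files: list of (filename, size_bytes) pairs. Returns counts,
    volumes, and mean/median sizes, splitting out log files by name."""
    sizes = [s for _, s in files]
    log_sizes = [s for name, s in files if ".log" in name]
    return {
        "n_files": len(sizes),
        "n_log_files": len(log_sizes),
        "total_bytes": sum(sizes),
        "log_bytes": sum(log_sizes),
        "mean": statistics.mean(sizes),
        "median": statistics.median(sizes),
    }
```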


07 May 2014

Public research, open data

RAL hosted a meeting for research councils, other public bodies, and industry participants, on open data, organised with the Big Innovation Centre (we will have a link once the presentations have been uploaded).

As you know, research councils in the UK have data policies which say

  • Publicly funded data must be made public
  • Data can be embargoed - even if publicly funded, it will be protected for a period of time, to enable you to get your results, write your papers, achieve world domination. You know, usual stuff.
  • Data should be usable.
  • The people who produced the data should be credited for the work - in other words, the data should be cited, as you would cite a publication with results that you use or refer to.
All of these are quite challenging (of which more anon), but interestingly some of the other data publishers even had to train (external) people to use their data. Would you say data is open not just when it is usable, but also when it is actually being used? That certainly makes the policies even more challenging. The next step beyond that would be that the data actually has a measurable economic impact.

You might ask: so what use is the high energy physics (HEP) data, older data, or LHC data such as that held by GridPP, to the general public?  But that is the wrong question, because you don't know what use it is till someone's got it and looked at it. If we can't see an application of the data today - someone else might see it, or we might see one tomorrow.  And the applications of HEP tend to come after some time: when neutrons were discovered, no one knew what they were good for; today they are used in almost all areas of science. Accelerators used in the early days of physics have led to the ones we use today in physics, but also to the ones used in healthcare. What good will come of the LHC data?  Who knows. HEP has the potential to have a huge impact - if you're patient...

24 April 2014

How much of a small file problem do we have...

Here at the Tier1 at RAL-LCG2, we have been draining disk servers with a fury (achieving over 800MB/s on a machine with a 10G NIC). Well, we get that rate on some servers with large files; machines with small files achieve a lower rate. But how many small files do we have, and is there a VO dependency? So I decided to look at our three largest LCG VOs.
In tabular form, here is the analysis so far:


VO                           LHCb     CMS      ATLAS    ATLAS          ATLAS
Sub-section                  All      All      All      non-log files  Log files
# Files                      16305    14717    396887   181799         215088
Size (TB)                    37.565   39.599   37.564   35.501         2.062
# Files > 10GB               1        24       75       75             0
# Files > 1GB                8526     11902    9683     9657           26
# Files < 100MB              4434     2330     3E+06    134137         3E+06
# Files < 10MB               2200     569      265464   68792          196672
# Files < 1MB                1429     294      85190    20587          64603
# Files < 100kB              243      91       6693     2124           4569
# Files < 10kB               6        13       635      156            479
Ave filesize (GB)            2.30     2.69     0.0946   0.195          0.00959
% space used by files > 1GB  96.71    79.73    64.56





Now what I find interesting is how similar the LHCb and CMS values are to each other, even though they are vastly different VOs. What worries me is that over 50% of ATLAS files are less than 10MB. Now just to find a Tier2 to do a similar analysis, to see if it is just a T1 issue.....

01 April 2014

Dell OpenManage for disk servers

As we've been telling everyone who'll listen, we at Oxford are big fans of the Dell 12-bay disk servers for grid storage (previously R510 units, now R720xd ones). A few people have now bought them and asked about monitoring them.

Dell's tools all go by the general 'OpenManage' branding, which covers a great range of things, including various general purpose GUI tools. However, for the disk servers, we generally go for a minimal command-line install.

Dell have the necessary bits available in a YUM-able repository as described on the Dell Linux wiki. Our setup simply involves:
  • Installing the repository file,
  • yum install srvadmin-storageservices srvadmin-omcommon,
  • service dataeng start
  • and finally logging out and back in again, or otherwise picking up the PATH variable change from the newly installed srvadmin-path.sh script in /etc/profile.d
At that point, you should be able to query the state of your array with the 'omreport' tool, for example:
# omreport storage vdisk controller=0
List of Virtual Disks on Controller PERC H710P Mini (Embedded)

Controller PERC H710P Mini (Embedded)
ID                            : 0
Status                        : Ok
Name                          : VDos
State                         : Ready
Hot Spare Policy violated     : Not Assigned
Encrypted                     : No
Layout                        : RAID-6
Size                          : 100.00 GB (107374182400 bytes)
Associated Fluid Cache State  : Not Applicable
Device Name                   : /dev/sda
Bus Protocol                  : SATA
Media                         : HDD
Read Policy                   : Adaptive Read Ahead
Write Policy                  : Write Back
Cache Policy                  : Not Applicable
Stripe Element Size           : 64 KB
Disk Cache Policy             : Enabled
We also have a rough and ready Nagios plugin which simply checks that each physical disk reports as 'OK' and 'Online' and complains if anything else is reported.
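That check essentially amounts to parsing omreport output; a rough sketch of the idea in Python (the exact Status/State strings may vary between OpenManage versions, and a real Nagios plugin would also emit the usual OK/CRITICAL exit codes):

```python
import re

OK_STATES = {("Ok", "Online")}

def check_pdisks(omreport_output):
    """Parse Status/State pairs from omreport-style output and return
    (all_ok, problem_pairs), in the spirit of a Nagios check."""
    statuses = re.findall(r"^Status\s*:\s*(\S+)", omreport_output, re.M)
    states = re.findall(r"^State\s*:\s*(\S+)", omreport_output, re.M)
    problems = [pair for pair in zip(statuses, states)
                if pair not in OK_STATES]
    return (len(problems) == 0, problems)
```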