31 August 2010

Taming XFS on SL5

Sites (including Liverpool) running DPM pool nodes on SL5 with XFS file systems have been experiencing very high load (load averages up in the multiple hundreds and close to 100% CPU I/O wait) when a number of analysis jobs access data simultaneously with rfcp.

The exact same hardware and file systems had shown no excessive load under SL4, and the SL5 systems had shown no problems under stress testing/burn-in. The problem also appears with a relatively small number of parallel transfers: about five or more on Liverpool's systems were enough to show an increased load compared to SL4.

Some admins have found that using ext4 at least alleviates the problem, although apparently it still occurs under enough load. Migrating production servers with TBs of live data from one file system to another isn't hard, but it would be a drawn-out process for many sites.

The fundamental problem for either file system appears to be IOPS overload on the arrays rather than sheer throughput, although why this occurs so readily under SL5 and not under SL4 is still a bit of a mystery. There may be changes in controller drivers, XFS, kernel block access, DPM access patterns or default parameters.

When faced with an IOPS overload (one that happens well below the theoretical throughput of the array), one solution is to make each I/O operation pull more data from the storage device, so that fewer but larger read requests are needed.

This leads to the actual fix (we had been doing this by default on our 3ware systems, but we had just assumed the Areca defaults were already optimal):
blockdev --setra 16384 /dev/$RAIDDEVICE

This sets the block device read-ahead to 16384 sectors, i.e. (16384/2)kB = 8MB. We have previously had to do this on 3ware controllers to get the full throughput from the controller. The default on our Areca 1280MLs is 128 (a 64kB read-ahead), so when lots of parallel transfers occur our arrays have been thrashing spindles, pulling small 64kB chunks from each of the different files. These files are usually many hundreds or thousands of MB in size, where reading MBs at a time would be much more efficient.
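For anyone wanting to try the same thing, here is a minimal sketch of how we check and apply the setting. $RAIDDEVICE is just a placeholder for your device name, and persisting the change via /etc/rc.local is only one option; the value set with blockdev does not survive a reboot, so adapt this to however your site manages boot-time configuration.

# check the current read-ahead, reported in 512-byte sectors
blockdev --getra /dev/$RAIDDEVICE

# set an 8MB read-ahead on the RAID device
blockdev --setra 16384 /dev/$RAIDDEVICE

# one way to make it persistent is to re-apply it at boot, e.g. from /etc/rc.local
echo "blockdev --setra 16384 /dev/$RAIDDEVICE" >> /etc/rc.local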

The mystery for us is more why the SL4 systems *don't* overload rather than why the SL5 systems do, as the SL4 systems use exactly the same default values.

Here is a ganglia plot of our pool nodes under about as much load as we can put on them at the moment. Note that previously our SL5 nodes would have had load averages in the 10s or 100s under this load or less.

http://hep.ph.liv.ac.uk/~jbland/xfs-fix.html

Any time the systems now go above a load average of 1 is when they are also having data written to them at a high rate. On that note, we also hadn't configured our Arecas to have their block maximum sector size aligned with the RAID chunk size, using

echo "64" > /sys/block/$RAIDDEVICE/queue/max_sectors_kb

although we don't think this had any bearing on the overloading, and it might not be necessary.
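For reference, the current and hardware-maximum request sizes can be read back from sysfs before changing anything. Again, $RAIDDEVICE is just the device name used above, and 64kB is our RAID chunk size rather than a universal value; check your own controller's limits first.

# current and hardware-maximum I/O request sizes, in kB
cat /sys/block/$RAIDDEVICE/queue/max_sectors_kb
cat /sys/block/$RAIDDEVICE/queue/max_hw_sectors_kb

# align the maximum request size with a 64kB RAID chunk size
echo "64" > /sys/block/$RAIDDEVICE/queue/max_sectors_kb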
 
We expect the tweak to also work for systems running ext4, as the underlying hardware access would still be a bottleneck, just at a different level of access.

Note that this 'fix' doesn't address the even more fundamental problem, pointed out by others, that DPM doesn't rate-limit connections to pool nodes. All this fix does is (hopefully) push the point at which overload occurs above the rate at which our WNs can pull data.

There is also a concern that using a big read-ahead may affect small random (RFIO) access, although sites can tune this parameter very quickly to find the optimum. 8MB is slightly arbitrary, but 64kB is certainly too small for any sensible access I can envisage to LHC data. Most access is via full-file copy (rfcp) reads at the moment.

26 August 2010

Climbing Everest

The slides should appear soon on the web page; the mountain-themed programme labelled us Everest, the second-highest mountain on the agenda.
Apart from the lovely fresh air, hills and outdoorsy activities, GridPP25 was also an opportunity to persuade the experiments (and not just the LHC ones) to give us feedback and discuss future directions; we'll try to collate this and follow up. We are also working on developing "services" which seem to be useful, e.g. checking the integrity of files, or consistency between the catalogues and the storage elements. And of course it was a chance for us to meet face to face and catch up over a coffee.

18 August 2010

Where Dave Lives and who shares his home....

The title of this blog post is all about where I am situated within the RAL-LCG2 site and where my children are as well. I apparently also want to discuss the profile of "files" across a "disk server", as my avatar likes to put it; I prefer to think of this "Storage Element" that he talks of as my home, and these "disk servers" as rooms inside my home.

I am made of 1779 files (~3TB, if you recall). I am spread across 8 of the 795 tapes currently being used by ATLAS in the RAL DATATAPE store (although the pool of tapes for real data is actually only 229 tapes). I make up 1 of the 1572 datasets but ~1/130 of the volume (~3TB of the ~380TB) stored on DATATAPE at RAL, and correspond to ~1/130 of the files (1779 out of ~230000). In this tape world I am deliberately kept to a small subset of tapes to allow for expedient recall.

However, when it comes to being on disk I want to be spread out as much as possible so as not to cause "hot disking". On the other hand, spreading me across many rooms means that if a single room is down, the chance that I cannot be fully examined increases. In this disk world, my 3TB is part of the 700TB in ATLASDATADISK at RAL; I am 1 dataset in 25k, and my 1779 files are among ~1.5 million. In this world my average file size of ~1.7GB is a lot larger than the 450MB average of all the other DATADISK files. (The file size distribution is not linear, but that is a discussion for another day.) I am spread across 38 of the 71 rooms which existed in my space token when I was created. (There are now an additional 10 rooms, and this will continue to increase in the near term.)

Looking at a random DATADISK server, here is the breakdown by file type:

1 in 20 datasets represented on this server are log datasets, 1 in 11 files are log files, and they account for 1 GB in every 10200 GB of the space used in the room.
1 in 2.7 datasets represented on this server are AOD datasets, 1 in 5.1 files are AOD files, and they account for 1 GB in every 8.25 GB of the space used in the room.
1 in 4.5 datasets represented on this server are ESD datasets, 1 in 3.9 files are ESD files, and they account for 1 GB in every 2.32 GB of the space used in the room.
1 in 8.3 datasets represented on this server are TAG datasets, 1 in 8.5 files are TAG files, and they account for 1 GB in every 3430 GB of the space used in the room.
1 in 47 datasets represented on this server are RAW datasets, 1 in 17 files are RAW files, and they account for 1 GB in every 10.8 GB of the space used in the room.
1 in 5.4 datasets represented on this server are DESD datasets, 1 in 5.1 files are DESD files, and they account for 1 GB in every 3.67 GB of the space used in the room.
1 in 200 datasets represented on this server are HIST datasets, 1 in 46 files are HIST files, and they account for 1 GB in every 735 GB of the space used in the room.
1 in 50 datasets represented on this server are NTUP datasets, 1 in 16 files are NTUP files, and they account for 1 GB in every 130 GB of the space used in the room.


A similar study has been done for an MCDISK server:
1 in 4.8 datasets represented in this room are log datasets, 1 in 2.5 files are log files, and they account for 1 GB in every 18 GB of the space used in the room.
1 in 3.1 datasets represented in this room are AOD datasets, 1 in 5.7 files are AOD files, and they account for 1 GB in every 2.1 GB of the space used in the room.
1 in 28 datasets represented in this room are ESD datasets, 1 in 13.6 files are ESD files, and they account for 1 GB in every 3.2 GB of the space used in the room.
1 in 4.3 datasets represented in this room are TAG datasets, 1 in 14.6 files are TAG files, and they account for 1 GB in every 2000 GB of the space used in the room.
1 in 560 datasets represented in this room are DAOD datasets, 1 in 49 files are DAOD files, and they account for 1 GB in every 2200 GB of the space used in the room.
1 in 950 datasets represented in this room are DESD datasets, 1 in 11000 files are DESD files, and they account for 1 GB in every 600 GB of the space used in the room.
1 in 18 datasets represented in this room are HITS datasets, 1 in 6.3 files are HITS files, and they account for 1 GB in every 25 GB of the space used in the room.
1 in 114 datasets represented in this room are NTUP datasets, 1 in 71 files are NTUP files, and they account for 1 GB in every 46 GB of the space used in the room.
1 in 114 datasets represented in this room are RDO datasets, 1 in 63 files are RDO files, and they account for 1 GB in every 11 GB of the space used in the room.
1 in 8 datasets represented in this room are EVNT datasets, 1 in 13 files are EVNT files, and they account for 1 GB in every 100 GB of the space used in the room.

As a sample, this MCDISK server represents 1/47 of the space used in MCDISK at RAL and ~1/60 of all files in MCDISK. This room was added recently, so any disparity might be due to this server being filled with newer rather than older files (which would be a good sign, as it shows ATLAS is increasing file size). The average file size on this server is 211MB per file; discounting log files this increases to 330MB per file (since the average log file size is 29MB).


One area my avatar is interested in is knowing, if one of these rooms were lost, how many of the files stored in that room could be found in another house and how many would be permanently lost.


For the "room" in the DATADISK Space Token, there are no files that are not located in another other house. ( This will not be the case all the time but is a good sign that the ATLAS model of replication is working.)

For the "room" in the MCDISK Space Token the following is the case:
886 out 2800 datasets that are present are not complete elsewhere. Of these 886, 583 are log datasets (consisting of 21632 files.)
Including log datasets there would be potentially 36283 files in 583 of the 886 datasets with a capacity of 640GB of lost data. ( avergae file size is 18MB).
Ignoring log datasets this drops to 14651 files in 303 datasets with a capacity of 2.86 TB of lost data.
The files on this diskserver whcih are elsewhere are form 1914 datasets, consist of 18309 files, and fill a capacity of 8104GB.

14 August 2010

This quarter’s deliverable.


Apologies for the late reporting, but on 13/6/10 Cyrus Irving Bhimji went live. He is a mini-Thumper, weighing in at just over 10lb from day one. Here he is deeply contemplating the future of grid storage.

Since then he has been doing well - as this week's performance data clearly illustrates.

09 August 2010

DPM 1.7.4 and the case of the changing python module.

Since we've just had our first GridPP DPM Toolkit user hit this problem, I thought the time was right to blog about it.

Between DPM 1.7.3 and DPM 1.7.4 there is one mostly-invisible change that only hits people using the API (like, for example, the GridPP DPM Toolkit). Basically, in an attempt to clean up the code and make maintenance easier, the API modules have been renamed and split by the programming language that they support.
This means that older versions of the GridPP DPM Toolkit can't interface with DPM 1.7.4 and above, as they don't know the name of the module to import. The symptom of this is the "Failed to import DPM API" error, in the case where following the instructions provided doesn't help at all.
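If you want to check which flavour of the python bindings a node actually provides before upgrading the Toolkit, a quick test from the command line will do. The module names below are purely illustrative examples, not the definitive ones; substitute whichever old and new names your DPM release documents.

# a clean exit from the import means that module name exists on this node
python -c "import dpm" && echo "old-style DPM module found"
python -c "import dpm2" && echo "new-style DPM module found"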

Luckily, we wise men at GridPP Storage already solved this problem.
http://www.sysadmin.hep.ac.uk/rpms/fabric-management/RPMS.storage/
contains two versions of the GridPP DPM Toolkit RPM for the latest release - one suffixed "DPM173-1" and one suffixed "DPM174-1". Fairly obviously, the difference is the version of DPM they work against.

Future releases of the Toolkit will only support the DPM 1.7.4 modules (and may come relicensed, thanks to the concerns of one of our frequent correspondents, who shall remain nameless).