24 April 2014

How much of a small file problem do we have...

Here at the Tier1 at RAL-LCG2; we have been draining disk servers with a fury (achieving over 800MB/s on a 10G NIC machine.) Well we get that rate on some servers with large files;  but machines with small files achieve a lower rate, but how many small files do we have and is there a VO dependency... So I decided to look at our three largest LCG VOs.
In tabula form; here is the analysis so far:

sub section All All All non-Log files Log files
# Files 16305 14717 396887 181799 215088
Size (TB) 37.565 39.599 37.564 35.501 2.062
# Files >  10 GB 1 24 75 75 0
# Files >     1GB 8526 11902 9683 9657 26
# Files < 100MB 4434 2330 3E+06 134137 3E+06
# Files <  10MB 2200 569 265464 68792 196672
# Files <    1MB 1429 294 85190 20587 64603
# Files <  100kB 243 91 6693 2124 4569
# Files <    10kB 6 13 635 156 479
Ave Filesize (GB) 2.30 2.69 0.0946 0.195 0.00959
% space used by files > 1GB 96.71 79.73 64.56

Now what I find interesting is how similar values LHCb and CMS are with each other, even though they are vastly different VOs. What worries me is that over 50% of ATLAS files are less than 10MB. Now just to find a tier2 to do a similar analysis to see if it just a T1 issue.....

01 April 2014

Dell OpenManage for disk servers

As we've been telling everyone who'll listen, we at Oxford are big fans of the Dell 12-bay disk servers for grid storage (previously R510 units, now R720xd ones). A few people have now bought them and asked about monitoring them.

Dell's tools all go by the general 'OpenManage' branding, which covers a great range of things, including various general purpose GUI tools. However, for the disk servers, we generally go for a minimal command-line install.

Dell have the necessary bits available in a YUM-able repository as described on the Dell Linux wiki. Our setup simple involves:
  • Installing the repository file,
  • yum install srvadmin-storageservices srvadmin-omcommon,
  • service dataeng start
  • and finally logging out and back in again, or otherwise picking up the PATH variable change from the newly installed srvadmin-path.sh script in /etc/profile.d
At that point, you should be able to query the state of your array with the 'omreport' tool, for example:
# omreport storage vdisk controller=0
List of Virtual Disks on Controller PERC H710P Mini (Embedded)

Controller PERC H710P Mini (Embedded)
ID                            : 0
Status                        : Ok
Name                          : VDos
State                         : Ready
Hot Spare Policy violated     : Not Assigned
Encrypted                     : No
Layout                        : RAID-6
Size                          : 100.00 GB (107374182400 bytes)
Associated Fluid Cache State  : Not Applicable
Device Name                   : /dev/sda
Bus Protocol                  : SATA
Media                         : HDD
Read Policy                   : Adaptive Read Ahead
Write Policy                  : Write Back
Cache Policy                  : Not Applicable
Stripe Element Size           : 64 KB
Disk Cache Policy             : Enabled
We also have a rough and ready Nagios plugin which simply checks that each physical disk reports as 'OK' and 'Online' and complains if anything else is reported.