Showing posts with label filesystems. Show all posts

26 January 2017

File system tests

Since there is interest in filesystem tests, I have put the script I used for the ZFS/Ext4/XFS tests on a web server. If you test file systems for Grid storage purposes, feel free to give it a try.
It can of course also be used by anyone else, but note that this test does not do any random reads/writes.

In general, it would be good to run this test (or any other test) under 3 different scenarios when using raid systems:

  1. normal working raid system
  2. degraded raid system
  3. rebuild of raid system


The script needs 2 areas:

  1. one where you have files that are read during the read tests, and
  2. one where you want to write files to. This write area should have no compression since the writes come from /dev/zero.

By default, it does reads over all specified files, writes to large files, and writes to small files.
For the reads, it first does a sequential read of all files, and in a second pass it reads the same set of files in parallel.
For the writes, it does a sequential write first, and in a second pass it writes in parallel to different files. The same applies to the writing of both large and small files.

After each read/write pass the caches are flushed, and each write issues a file sync after the file is written, to make sure the measured time really reflects writing the file to disk.
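The parallel write pass with a per-file sync described above can be sketched roughly as follows. This is a minimal illustration, not the actual script: the directory, file count, and sizes here are placeholders (the real test writes multi-GB files to the configured write area).

```shell
#!/bin/bash
# Minimal sketch of the parallel write pass with a per-file sync.
# Sizes and paths are illustrative only.
OUTDIR=$(mktemp -d)
NWRITERS=4

time (
  for i in $(seq 1 $NWRITERS); do
    # each writer streams zeros to its own file and syncs before finishing
    ( dd if=/dev/zero of="$OUTDIR/testfile-$i" bs=1M count=2 2>/dev/null && sync ) &
  done
  wait   # the measured time covers all parallel writers, including the syncs
)

ls -l "$OUTDIR"
```

The sync inside each writer is the important part: without it, dd may return while the data is still only in the page cache.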


The script needs 3 parameters:

  1. the location of a text file that contains the file names, including absolute paths, of all files that you want to include in the read tests
  2. a name used as a description for your test, which can be used to distinguish between different tests (e.g. ZFS-raidz2-12disks or ZFS-raidz3-12disks)
  3. the absolute path to an area where the write test can write its files; this area should have no compression enabled

The parameters inside the script, like the number of parallel reads/writes and the file sizes, can easily be configured. By default, about 5TB of space is needed for the write tests.

The script itself can be downloaded here.

11 April 2016

Setting up of a ZFS based storage server

Since it was previously found that ZFS performs well in our use case, even better than the hardware raid, new storage servers at our site to be used within GridPP will use ZFS as their storage file system in the future.
In this post, I will show how a server for that purpose can easily be set up. The previous posts, which also mention details about the hardware used, can be found here, here, and here.

This storage server will be used for LHC data storage, which mostly consists of GB-sized files. When these data files are used as input for user jobs, typically the whole file is copied to the local node where the user job runs. That means the configuration needs to handle large sequential reads and writes, but not small random block access.

The typical hardware configuration of  the storage servers we have is:
  • Server with PERC H700 and/or H800 hardware raid controller
  • 36 disk slots available
    • on some servers available through 3 external PowerVault MD-devices (3x12 disks)
    • on some servers available through 2 external PowerVault MD-devices (2x12 disks) and 12 internal storage disks
  • 10Gbps network interface
  • Dual-CPU (8 or 12 physical cores on each)
  • between 12GB and 64GB of RAM 
In this blog post, as before, I will describe the ZFS setup based on a machine with 12 internal disks (2TB disks on the H700) and 24 external disks (17x8TB + 7x2TB on the H800). The machine is already set up with SL6 and has the typical GridPP software (DPM clients, xrootd, httpd, ...) installed.

Preparing the disks

Since neither raid controller supports JBOD, single-disk raid0 devices have to be created first. To find out which disks are available and can be used, omreport can be used:


[root@pool7 ~]# omreport storage pdisk controller=0|grep -E "^ID|Capacity"
ID                              : 0:0:0
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:1
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:2
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:3
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:4
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:5
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:6
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:7
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:8
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:9
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:10
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:11
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:12
Capacity                        : 278.88 GB (299439751168 bytes)
ID                              : 0:0:13
Capacity                        : 278.88 GB (299439751168 bytes)


The disks 0:0:12 and 0:0:13 are the system disks in a mirrored configuration and shouldn't be touched. The disks 0:0:0 to 0:0:11 can each be converted to a single-disk raid0 using omconfig:
for i in $(seq 0 11); 
do 
  omconfig storage controller controller=0 action=createvdisk raid=r0 size=max pdisk=0:0:$i; 
done

The same procedure has to be repeated for the second controller.

After that, the disks are available to the system. To find out which are the 2TB disks and which are the 8TB disks, lsblk can be used:

[root@pool7 ~]# lsblk |grep disk
sda      8:0    0 278.9G  0 disk 
sdb      8:16   0   1.8T  0 disk 
sdc      8:32   0   1.8T  0 disk 
sdd      8:48   0   1.8T  0 disk 
sde      8:64   0   1.8T  0 disk 
sdf      8:80   0   1.8T  0 disk 
sdg      8:96   0   1.8T  0 disk 
sdh      8:112  0   1.8T  0 disk 
sdi      8:128  0   1.8T  0 disk 
sdj      8:144  0   1.8T  0 disk 
sdk      8:160  0   1.8T  0 disk 
sdl      8:176  0   1.8T  0 disk 
sdm      8:192  0   1.8T  0 disk 
sdn      8:208  0   7.3T  0 disk 
sdo      8:224  0   7.3T  0 disk 
sdp      8:240  0   7.3T  0 disk 
sdq     65:0    0   7.3T  0 disk 
sdr     65:16   0   7.3T  0 disk 
sds     65:32   0   7.3T  0 disk 
sdt     65:48   0   7.3T  0 disk 
sdu     65:64   0   7.3T  0 disk 
sdv     65:80   0   7.3T  0 disk 
sdw     65:96   0   7.3T  0 disk 
sdx     65:112  0   7.3T  0 disk 
sdy     65:128  0   7.3T  0 disk 
sdz     65:144  0   7.3T  0 disk 
sdaa    65:160  0   7.3T  0 disk 
sdab    65:176  0   1.8T  0 disk 
sdac    65:192  0   7.3T  0 disk 
sdad    65:208  0   1.8T  0 disk 
sdae    65:224  0   1.8T  0 disk 
sdaf    65:240  0   7.3T  0 disk 
sdag    66:0    0   1.8T  0 disk 
sdah    66:16   0   1.8T  0 disk 
sdai    66:32   0   7.3T  0 disk 
sdaj    66:48   0   1.8T  0 disk 
sdak    66:64   0   1.8T  0 disk 

/dev/sda is the system disk and shouldn't be touched, but all the other disks can be used for the storage setup.


ZFS installation

The current version of ZFS can be downloaded from the ZFS on Linux web page. Depending on the distribution used, there are also instructions on how to install ZFS through the package manager. In the worst case, one can download the source code and compile it on one's own system.

Since we use SL, which is RH based, we can follow the instructions provided on that page.
After installing ZFS through yum, the module needs to be loaded using modprobe zfs to continue without a reboot.
To have ZFS-based file systems automounted at system start, unfortunately the SELinux configuration also needs to be changed. In the SELinux config file, in our case at /etc/sysconfig/selinux, we need to change "SELINUX=enforcing" to at least "SELINUX=permissive".
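The edit itself is a one-line substitution; a sketch using sed, run here against a throwaway copy of the config file rather than the real one:

```shell
#!/bin/bash
# Sketch of the SELinux change, applied to a throwaway copy. On a real
# server the file is /etc/sysconfig/selinux (edit as root; the change
# takes effect at the next reboot, or use setenforce 0 for the running system).
CFG=$(mktemp)
echo "SELINUX=enforcing" > "$CFG"

sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' "$CFG"

grep '^SELINUX=' "$CFG"    # prints: SELINUX=permissive
```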

This will probably be needed as long as ZFS is not part of the RH distribution and is not recognized by SELinux as a valid file system. More about this issue can be found here.


ZFS storage setup

Now that the ZFS driver is installed and the disks are prepared, we can continue with setting up the storage pools.
In this example, we create 2 different storage pools, one for the 2TB disks and one for the 8TB disks, as a good compromise between possible IOPS and available space. For ZFS, it doesn't matter whether the disks within one storage pool are connected through the same controller or through different ones, as is the case here for the 2TB disks. For the configuration, we decided to use raidz2, which has 2 parity disks, similar to raid6. Also, one disk of each kind will be used as a hot spare.
To do so, we need to find all disks of a given kind in the system and create a storage pool for them using zpool create:
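It is worth sanity-checking the expected pool sizes before creating them. A rough estimate of raidz2 usable capacity is (disks minus 2 parity) times the per-disk size, with the hot spare excluded; the per-disk sizes below (1.81TiB and 7.28TiB) are approximations of the lsblk values, and actual usable space ends up a few percent lower due to filesystem overhead:

```shell
#!/bin/bash
# Back-of-the-envelope raidz2 capacity estimate: (disks - 2 parity) * disk size.
# Hot spares are excluded; real usable space is a few percent lower.
raidz2_estimate() {  # $1 = disks in the vdev, $2 = per-disk size in TiB
  awk -v n="$1" -v s="$2" 'BEGIN { printf "%.1f", (n - 2) * s }'
}

echo "tank-2TB: ~$(raidz2_estimate 18 1.81) TiB usable"   # zfs list later reports 28.0T
echo "tank-8TB: ~$(raidz2_estimate 16 7.28) TiB usable"   # zfs list later reports 97.7T
```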

[root@pool7 ~]# lsblk |grep 1.8T
sdb      8:16   0   1.8T  0 disk 
sdc      8:32   0   1.8T  0 disk 
sdd      8:48   0   1.8T  0 disk 
sde      8:64   0   1.8T  0 disk 
sdf      8:80   0   1.8T  0 disk 
sdg      8:96   0   1.8T  0 disk 
sdh      8:112  0   1.8T  0 disk 
sdi      8:128  0   1.8T  0 disk 
sdj      8:144  0   1.8T  0 disk 
sdk      8:160  0   1.8T  0 disk 
sdl      8:176  0   1.8T  0 disk 
sdm      8:192  0   1.8T  0 disk 
sdab    65:176  0   1.8T  0 disk 
sdad    65:208  0   1.8T  0 disk 
sdae    65:224  0   1.8T  0 disk 
sdag    66:0    0   1.8T  0 disk 
sdah    66:16   0   1.8T  0 disk 
sdaj    66:48   0   1.8T  0 disk 
sdak    66:64   0   1.8T  0 disk 

[root@pool7 ~]# zpool create -f tank-2TB raidz2 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdab sdad sdae sdag sdah sdaj spare sdak



[root@pool7 ~]# lsblk |grep 7.3T
sdn       8:208  0   7.3T  0 disk 
sdo       8:224  0   7.3T  0 disk 
sdp       8:240  0   7.3T  0 disk 
sdq      65:0    0   7.3T  0 disk 
sdr      65:16   0   7.3T  0 disk 
sds      65:32   0   7.3T  0 disk 
sdt      65:48   0   7.3T  0 disk 
sdu      65:64   0   7.3T  0 disk 
sdv      65:80   0   7.3T  0 disk 
sdw      65:96   0   7.3T  0 disk 
sdx      65:112  0   7.3T  0 disk 
sdy      65:128  0   7.3T  0 disk 
sdz      65:144  0   7.3T  0 disk 
sdaa     65:160  0   7.3T  0 disk 
sdac     65:192  0   7.3T  0 disk 
sdaf     65:240  0   7.3T  0 disk 
sdai     66:32   0   7.3T  0 disk 
[root@pool7 ~]# zpool create -f tank-8TB raidz2 sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx sdy sdz sdaa sdac sdaf spare sdai


After the zpool create commands, the storage is set up in a raid configuration, a file system is created on top of it, and it is mounted under /tank-2TB and /tank-8TB. No additional commands are needed, and everything is available within seconds.
At this point the system looks like:

[root@pool7 ~]# mount|grep zfs
tank-2TB on /tank-2TB type zfs (rw)
tank-8TB on /tank-8TB type zfs (rw)


[root@pool7 ~]# zpool status
  pool: tank-2TB
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank-2TB    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdi     ONLINE       0     0     0
            sdj     ONLINE       0     0     0
            sdk     ONLINE       0     0     0
            sdl     ONLINE       0     0     0
            sdm     ONLINE       0     0     0
            sdab    ONLINE       0     0     0
            sdad    ONLINE       0     0     0
            sdae    ONLINE       0     0     0
            sdag    ONLINE       0     0     0
            sdah    ONLINE       0     0     0
            sdaj    ONLINE       0     0     0
        spares
          sdak      AVAIL   

errors: No known data errors

  pool: tank-8TB
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank-8TB    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdn     ONLINE       0     0     0
            sdo     ONLINE       0     0     0
            sdp     ONLINE       0     0     0
            sdq     ONLINE       0     0     0
            sdr     ONLINE       0     0     0
            sds     ONLINE       0     0     0
            sdt     ONLINE       0     0     0
            sdu     ONLINE       0     0     0
            sdv     ONLINE       0     0     0
            sdw     ONLINE       0     0     0
            sdx     ONLINE       0     0     0
            sdy     ONLINE       0     0     0
            sdz     ONLINE       0     0     0
            sdaa    ONLINE       0     0     0
            sdac    ONLINE       0     0     0
            sdaf    ONLINE       0     0     0
        spares
          sdai      AVAIL   

errors: No known data errors


[root@pool7 ~]# zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-2TB  32.5T   153K  32.5T         -     0%     0%  1.00x  ONLINE  -
tank-8TB   116T   153K   116T         -     0%     0%  1.00x  ONLINE  -
[root@pool7 ~]# 
[root@pool7 ~]# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
tank-2TB   120K  28.0T  40.0K  /tank-2TB
tank-8TB   117K  97.7T  39.1K  /tank-8TB
[root@pool7 ~]# 
[root@pool7 ~]# df -h|grep tank
tank-2TB         28T     0   28T   0% /tank-2TB
tank-8TB         98T     0   98T   0% /tank-8TB

Setting additional filesystem properties

Since lz4 is available as a compression algorithm with very small performance impact, we can enable compression on our storage. This will probably not have a large effect on LHC data, but could lower the storage space used by non-LHC experiments that will be supported in the near future.
In addition, the storage of xattr will be changed to behave similarly to ext4.
Also, since we have a spare configured in our pools, we need to activate automatic replacement in failure cases, making it a true hot spare. Another interesting feature of ZFS is that the pool size can grow if the disks are replaced by new disks with a larger capacity. To have an effect, this needs to be done for all disks within one vdev, but it can be done one by one over time.

[root@pool7 ~]# zfs set compression=lz4 tank-2TB
[root@pool7 ~]# zfs set compression=lz4 tank-8TB

[root@pool7 ~]# zpool set autoreplace=on tank-2TB
[root@pool7 ~]# zpool set autoreplace=on tank-8TB

[root@pool7 ~]# zpool set autoexpand=on tank-2TB
[root@pool7 ~]# zpool set autoexpand=on tank-8TB

[root@pool7 ~]# zfs set relatime=on tank-2TB
[root@pool7 ~]# zfs set relatime=on tank-8TB

[root@pool7 ~]# zfs set xattr=sa tank-2TB
[root@pool7 ~]# zfs set xattr=sa tank-8TB

Changing disk identification

Identifying disks by letters, like sdb or sdc, is easy to handle and convenient for setting up a pool. However, the order in which disks are identified can change on a reboot, and will change if the disks are rearranged on the server, for example after replacing one of the external MD devices.
While in such cases ZFS should still be able to identify the disks belonging to the same pool and import the pool, it is better to use the disk IDs to identify disks. To change this behaviour, we only need to export the pools and import them using the disk IDs:
[root@pool7 ~]# zpool export -a
[root@pool7 ~]# zpool import -d /dev/disk/by-id tank-8TB
[root@pool7 ~]# zpool import -d /dev/disk/by-id tank-2TB

Making the space available to DPM

Traditionally, the available space on a large vdev was divided into smaller parts by creating partitions with fdisk. In ZFS, however, this can be done directly on top of the pool just created. All properties set on the top-level ZFS file system are inherited by the new file systems too, so there is no need to set the compression property or others again, unless one wants different values than before.
A new file system is created using zfs create:

[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage01
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage02
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage03
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage04
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage05
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage06
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage07
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage08
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage09
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage10
[root@pool7 ~]# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank-2TB                 144K  28.0T  40.0K  /tank-2TB
tank-8TB                90.0T  7.67T  41.7K  /tank-8TB
tank-8TB/gridstorage01     9T  16.7T  39.1K  /tank-8TB/gridstorage01
tank-8TB/gridstorage02     9T  16.7T  39.1K  /tank-8TB/gridstorage02
tank-8TB/gridstorage03     9T  16.7T  39.1K  /tank-8TB/gridstorage03
tank-8TB/gridstorage04     9T  16.7T  39.1K  /tank-8TB/gridstorage04
tank-8TB/gridstorage05     9T  16.7T  39.1K  /tank-8TB/gridstorage05
tank-8TB/gridstorage06     9T  16.7T  39.1K  /tank-8TB/gridstorage06
tank-8TB/gridstorage07     9T  16.7T  39.1K  /tank-8TB/gridstorage07
tank-8TB/gridstorage08     9T  16.7T  39.1K  /tank-8TB/gridstorage08
tank-8TB/gridstorage09     9T  16.7T  39.1K  /tank-8TB/gridstorage09
tank-8TB/gridstorage10     9T  16.7T  39.1K  /tank-8TB/gridstorage10

[root@pool7 ~]# zfs create -o refreservation=7.66T tank-8TB/gridstorage11
[root@pool7 ~]# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank-2TB                 144K  28.0T  40.0K  /tank-2TB
tank-8TB                97.7T  9.91G  41.7K  /tank-8TB
tank-8TB/gridstorage01     9T  9.01T  39.1K  /tank-8TB/gridstorage01
tank-8TB/gridstorage02     9T  9.01T  39.1K  /tank-8TB/gridstorage02
tank-8TB/gridstorage03     9T  9.01T  39.1K  /tank-8TB/gridstorage03
tank-8TB/gridstorage04     9T  9.01T  39.1K  /tank-8TB/gridstorage04
tank-8TB/gridstorage05     9T  9.01T  39.1K  /tank-8TB/gridstorage05
tank-8TB/gridstorage06     9T  9.01T  39.1K  /tank-8TB/gridstorage06
tank-8TB/gridstorage07     9T  9.01T  39.1K  /tank-8TB/gridstorage07
tank-8TB/gridstorage08     9T  9.01T  39.1K  /tank-8TB/gridstorage08
tank-8TB/gridstorage09     9T  9.01T  39.1K  /tank-8TB/gridstorage09
tank-8TB/gridstorage10     9T  9.01T  39.1K  /tank-8TB/gridstorage10
tank-8TB/gridstorage11  7.66T  7.67T  39.1K  /tank-8TB/gridstorage11

Here a new property is set for each of the new file systems, refreservation, which reserves the specified space for that particular file system, making sure this space is guaranteed. This differs from setting a quota, which only imposes an upper limit. However, to make sure the specified space is not exceeded in our case, a quota of the same size should also be set. For the last file system in each pool, a larger amount could be specified, which ensures that all the space that can't be used by the other file systems due to their quota limits will be used here.
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage01
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage02
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage03
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage04
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage01
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage02
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage03
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage04
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage05
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage06
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage07
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage08
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage09
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage10
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage11

After that, the storage setup is finished and the newly created file systems are already mounted; they only need to be made available to the DPM user:
[root@pool7 ~]# chown -R dpmmgr:users /tank-2TB
[root@pool7 ~]# chown -R dpmmgr:users /tank-8TB

That was the last step needed on the storage server; the new file systems can now be added to the DPM head node like any other file system.

ZFS configuration options

To customize the ZFS behaviour, 2 main config files are available: /etc/sysconfig/zfs and /etc/zfs/zed.d/zed.rc. I will not go into detail about these 2 files here, but if you want to set up your own ZFS-based storage, have a look at them. The options within are mostly self-explanatory; for example, you can specify where to send email about disk problems and under which circumstances.




As a final section, I want to mention 2 very useful commands: zpool history and zpool iostat.
With the first command, one can display all commands that were run against a zpool since its creation, together with a time stamp. This can be very useful for error analysis and also for repeating a configuration on another server.
[root@pool6 ~]# zpool history tank-2TB
History for 'tank-2TB':
2016-04-05.11:27:43 zpool create -f tank-2TB raidz2 sdd sdf sdg sdi sdj sdl sdm sdz sdaa sdab sdac sdad sdae sdaf sdag sdah sdai sdaj spare sdak
2016-04-05.11:40:32 zfs create -o refreservation=7TB tank-2TB/gridstorage01
2016-04-05.11:40:34 zfs create -o refreservation=7TB tank-2TB/gridstorage02
2016-04-05.11:40:39 zfs create -o refreservation=7TB tank-2TB/gridstorage03
2016-04-05.11:41:57 zfs create -o refreservation=6.97T tank-2TB/gridstorage04
2016-04-05.12:02:12 zpool set autoreplace=on tank-2TB
2016-04-05.12:02:17 zpool set autoexpand=on tank-2TB
2016-04-05.12:38:11 zpool export tank-2TB
2016-04-05.12:38:33 zpool import -d /dev/disk/by-id tank-2TB
2016-04-06.13:41:37 zfs set compression=lz4 tank-2TB
2016-04-07.14:36:37 zfs set relatime=on tank-2TB
2016-04-07.14:36:42 zfs set xattr=sa tank-2TB
2016-04-11.11:28:08 zpool scrub tank-2TB
2016-04-11.14:12:41 zfs set refquota=7T tank-2TB/gridstorage01
2016-04-11.14:12:43 zfs set refquota=7T tank-2TB/gridstorage02
2016-04-11.14:12:48 zfs set refquota=7T tank-2TB/gridstorage03
2016-04-11.14:12:59 zfs set refquota=7T tank-2TB/gridstorage04

The second command, zpool iostat, displays the current I/O on the pool, separately for read/write operations and bandwidth. Information can be displayed for a given pool, but also for each disk within the pool.
The following example is taken from a server configured with 3 raidz2 vdevs while it was being drained using the dpm-drain command on the head node, with threads=1 and one drain per file system, resulting in 5 parallel drain commands running:

[root@pool5 ~]# zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        6.03T  54.0T    110     24  13.5M  1.33M
tank        6.03T  54.0T  4.74K      0   602M      0
tank        6.03T  54.0T  4.63K      0   589M      0
tank        6.03T  54.0T  4.62K      0   587M      0
tank        6.03T  54.0T  4.42K      0   561M      0
tank        6.03T  54.0T  5.27K      0   669M      0
tank        6.03T  54.0T  4.51K      0   573M      0
tank        6.03T  54.0T  4.47K      0   568M      0
tank        6.03T  54.0T  4.47K      0   568M      0
tank        6.03T  54.0T  4.26K      0   542M      0
tank        6.03T  54.0T  4.56K      0   579M      0
tank        6.03T  54.0T  4.82K      0   613M      0
tank        6.03T  54.0T  4.60K      0   585M      0
tank        6.03T  54.0T  4.73K      0   601M      0
tank        6.03T  54.0T  4.20K      0   533M      0
tank        6.03T  54.0T  4.52K      0   574M      0
tank        6.03T  54.0T  3.72K      0   473M      0
tank        6.03T  54.0T  3.80K      0   484M      0
tank        6.03T  54.0T  4.46K      0   567M      0
tank        6.03T  54.0T  5.16K      0   655M      0
tank        6.03T  54.0T  5.25K      0   667M      0

[root@pool5 ~]# zpool iostat -v 1
                                               capacity     operations    bandwidth
pool                                        alloc   free   read  write   read  write
------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                        5.87T  54.1T  4.77K      0   606M      0
  raidz2                                    1.96T  18.0T  1.60K      0   202M      0
    scsi-36a4badb044e936001e55b2111ca79173      -      -    329      0  21.2M      0
    scsi-36a4badb044e936001e55b2461fc3cb38      -      -    354      0  20.8M      0
    scsi-36a4badb044e936001e55b25520b22e10      -      -    337      0  21.3M      0
    scsi-36a4badb044e936001e55b2622171ff0b      -      -    333      0  21.3M      0
    scsi-36a4badb044e936001e55b26e2232640f      -      -    334      0  21.0M      0
    scsi-36a4badb044e936001e55b27d230ce0f6      -      -    333      0  21.2M      0
    scsi-36a4badb044e936001e55b293245ae11b      -      -    335      0  21.1M      0
    scsi-36a4badb044e936001e55b2b426603fbe      -      -    335      0  21.3M      0
    scsi-36a4badb044e936001e55b2c4274ec795      -      -    338      0  20.9M      0
    scsi-36a4badb044e936001e55b2d128122551      -      -    318      0  20.8M      0
    scsi-36a4badb044e936001e55b2f42a2e3006      -      -    342      0  21.3M      0
  raidz2                                    1.96T  18.0T  1.59K      0   203M      0
    scsi-36a4badb044e936001e830de6afdb1d8f      -      -    332      0  21.6M      0
    scsi-36a4badb044e936001e55b3082b59c1da      -      -    310      0  21.2M      0
    scsi-36a4badb044e936001e55b3142c0ac749      -      -    311      0  21.5M      0
    scsi-36a4badb044e936001e55b31f2cbeb648      -      -    319      0  21.8M      0
    scsi-36a4badb044e936001e55b44e3ecc77ea      -      -    313      0  21.7M      0
    scsi-36a4badb044e936001e55b33b2e6172a4      -      -    213      0  21.8M      0
    scsi-36a4badb044e936001e55b34c2f70184c      -      -    307      0  21.4M      0
    scsi-36a4badb044e936001e55b358301ee6a2      -      -    319      0  21.8M      0
    scsi-36a4badb044e936001e55b36530e4cb2a      -      -    331      0  21.8M      0
    scsi-36a4badb044e936001e55b3793218970b      -      -    325      0  21.8M      0
    scsi-36a4badb044e936001e55b38532cf8c68      -      -    324      0  21.8M      0
  raidz2                                    1.96T  18.0T  1.58K      0   201M      0
    scsi-36a4badb044e936001e55b39033741ccf      -      -    342      0  21.8M      0
    scsi-36a4badb044e936001e55b39b3421403b      -      -    323      0  21.4M      0
    scsi-36a4badb044e936001e55b3de3822569a      -      -    335      0  21.8M      0
    scsi-36a4badb044e936001e55b3eb38df509e      -      -    328      0  21.7M      0
    scsi-36a4badb044e936001e55b3f839a79c83      -      -    301      0  21.5M      0
    scsi-36a4badb044e936001e55b4023a46ae2a      -      -    325      0  21.7M      0
    scsi-36a4badb044e936001e55b40f3b0100cb      -      -    314      0  21.5M      0
    scsi-36a4badb044e936001e55b41d3bdd5a86      -      -    335      0  21.5M      0
    scsi-36a4badb044e936001e55b42b3cb239cd      -      -    324      0  21.8M      0
    scsi-36a4badb044e936001e55b4363d55d784      -      -    322      0  21.7M      0
    scsi-36a4badb044e936001e55b4413e09a9f1      -      -    331      0  21.8M      0
------------------------------------------  -----  -----  -----  -----  -----  -----


One important point to keep in mind is that all the properties set with zfs or zpool are stored within the file system, not within the OS config files! That means that if the OS gets upgraded, one can do a zpool import and all properties, like mount point, quota, reservations, compression, and history, will instantly be available again. There is no need to manually touch any system config file, like /etc/fstab, to make the storage available. This is also true for other properties, like NFS sharing, but since it's not needed in our case I haven't described that. To get an idea of what properties are available and what else one can do with ZFS, zfs get all and zpool get all are useful.

04 April 2016

ZFS vs Hardware Raid System, Part III

Since it was found in a previous post that the read/write rate varied a lot between the different controllers, the read/write tests needed to be redone.
In the previous test, 10GB files were used, since that is on the order of the file size used as GridPP storage behind DPM. However, since the machine used for the tests has 24GB of RAM, which is larger than the file size, the files could still have been in the cache.

For the following test, the same machine was used as in the above mentioned post, but with some changes in the configuration:

  1. Both raid controllers, a PERC H700 and a PERC H800, have been reset before doing any  tests.
  2. The element size defined in the controllers was 8KB in the previous test; now 64KB is used.
  3. The test file size was increased to 30GB to be larger than the total RAM in the machine.
  4. All write and read tests were repeated 10 times to see how large the variation in the measured rates is.
  5. On the H700 11x2TB disks are used as raid6/raidz2 + 1 hotspare, mounted under /tank-2TB.
  6. On the H800 16x8TB disks are used as raid6/raidz2 + 1 hotspare, mounted under /tank-8TB.
The controller cache was again set to "write through" instead of the default "write back".
All reads and writes were performed 10 times on 10 different files to reduce the possibility that anything is left over in memory or the controller cache. The results are then averaged per read/write operation and controller. dd was used to generate/read the files with commands like:

time (dd if=/dev/zero of=/tank-2TB/test30G-$i bs=1M count=30720 && sync)
time (dd if=/tank-2TB/test30G-$i of=/dev/null bs=1M && sync)

The averaged results are given as the value reported by time, because it includes the sync operation and so makes sure everything is written to disk, while dd reports only on its own process, whose data may not be physically on disk yet but still cached in memory. The minimum and maximum values are also reported to give an idea of the range over all 10 trials.
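The MB/s figures below follow directly from the 30GB (30720MB) file size and the averaged elapsed time; a small helper reproducing the calculation:

```shell
#!/bin/bash
# rate = 30720MB / elapsed seconds, for the 30GB test files used above.
rate_mbs() {  # $1 = averaged elapsed time in seconds
  awk -v t="$1" 'BEGIN { printf "%.0f", 30720 / t }'
}

echo "56s  -> $(rate_mbs 56) MB/s"    # ZFS write on the H700
echo "305s -> $(rate_mbs 305) MB/s"   # hardware raid write on the H700
```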

H700

ZFS write:           56s  (549MB/s)  (min: 52s,  max: 59s)
Hardware raid write: 305s (101MB/s)  (min: 265s, max: 342s)

ZFS read:            74s  (415MB/s)  (min: 56s,  max: 83s)
Hardware raid read:  156s (197MB/s)  (min: 147s, max: 159s)

H800

ZFS write:           28s  (1097MB/s)  (min: 28s,  max: 30s)
Hardware raid write: 147s (209MB/s)   (min: 125s, max: 154s)

ZFS read:            30s  (1024MB/s)  (min: 30s,  max: 34s)
Hardware raid read:  29s  (1059MB/s)  (min: 29s,  max: 31s)


In conclusion, the H800 performs better than the H700 in both configurations, while ZFS clearly outperforms the hardware raid configuration. Therefore, all new installations at the Edinburgh site will use ZFS for the administration of the GridPP storage space. In the next blog post, I will show how to set up the ZFS storage part for GridPP usage.

23 March 2016

ZFS vs Hardware Raid System, Part II

This post will focus on other differences between a ZFS based software raid and a hardware raid system that could be important for usage as a GridPP storage backend. In a later post, the differences in read/write rates will be tested more intensively.
For the current tests, the system configuration is the same as described previously.

First test: what happens if we just take a disk out of the running raid system...
In both cases, the raid gets rebuilt using the hot spare provided in the initial configuration. However, the times needed to fully restore redundancy are very different:

ZFS based raid recovery time: 3min
Hardware based raid recovery time: 9h:2min

For both systems, only the test files from the previous read/write tests were on disk; the hardware raid was re-initialized to remove the corrupted filesystem left by the failure test, and the test files were then recreated. Neither system was doing anything else during the recovery period.

The large difference is due to the fact that ZFS is raid system, volume manager, and file system all in one. ZFS knows about the on-disk structure and which data was on the broken disk, so it only needs to restore what was actually used - in the above test case just 1.78GB.
The hardware raid system, on the other hand, knows nothing about the filesystem or the space really used on a disk, so it needs to restore the whole disk even if it is, as in our test case, nearly empty - that's 2TB to restore while only about 2GB is actually useful data! This difference will become even more important in the future as the capacity of a single disk grows.
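The scale of the difference matches a back-of-the-envelope estimate. A quick sketch (the ~65MB/s effective rebuild rate is my assumption, not a measured value):

```shell
# Rough rebuild-time estimate: the hardware raid must copy the whole 2TB disk,
# while a ZFS resilver only copies the ~1.78GB actually allocated.
awk 'BEGIN {
    rate = 65                          # MB/s, assumed effective rebuild rate
    printf "hardware: %.1f h\n",   2000 * 1024 / rate / 3600
    printf "zfs:      %.1f min\n", 1.78 * 1024 / rate / 60
}'
```

The measured 9h:2min is close to the full-disk estimate; the ZFS resilver's extra couple of minutes on top of the small data copy is per-rebuild overhead.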

ZFS is now also used on one of our real production systems behind DPM. The zpool on that machine, consisting of 3 x raidz2 vdevs with 11 x 2TB disks each, also had a genuinely failed disk that needed to be replaced. There are about 5TB of data on that zpool, and restoring full redundancy took about 3h while the machine was still in production and used by jobs to request or store data. This is again much faster than what a hardware raid based system would need, even if that machine were doing nothing else in the meantime than restoring full redundancy.



Another large difference between the two systems is the time needed to set up a raid configuration.
For the hardware based raid, one first needs to create the configuration in the controller, initialize the raid, then create partitions on the virtual disk, and lastly format the partitions to put a file system on them. When all that is finished, the newly formatted partitions need to be added to /etc/fstab so they are mounted when the system starts, mount points need to be created, and then the partitions can be mounted. That is a very long process which takes a lot of time before the system can be used.
To give an idea, formatting a single 8TB partition with ext4 took
on the H700: about 1h:11min
on the H800: about 34min

In our configuration, 24x2TB on H800 and 12x2TB on H700, the formatting alone would take about 6h! (if done one partition after another)
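A rough cross-check of that estimate (the partition counts are my assumption, derived from the usable capacity after raid6 plus hotspare):

```shell
# ~5 x 8TB partitions on the H800 (24x2TB) at 34 min each,
# ~2 x 8TB partitions on the H700 (12x2TB) at 1h:11min each
total=$((5 * 34 + 2 * 71))
echo "${total} min total when formatted one after another"
```

which lands in the region of the quoted ~6h once the remaining partial partitions and raid initialization are included.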

For ZFS, we still need to create a raid0 for each disk separately on the H700 and H800 since neither controller supports JBOD. Once this is done, however, it is very easy and fast to create a production-ready raid system. A single command does everything: 
zpool create NAME /dev/... /dev/... ..... spare /dev/....
After that single command, the raid system is created, formatted for use, and mounted under /NAME (the mount point can be changed via options when the zpool is created, or later). There is no need to edit /etc/fstab, and the whole setup takes less than 10s.
To have single 8TB "partitions", one can create additional ZFS filesystems in this pool and set a quota on each, like
zfs create -o refquota=8TB NAME/partition1
After that command, a new ZFS filesystem is created with a quota that makes sure tools like "df" only see 8TB available for usage, and it is mounted under /NAME/partition1 - again no need to edit /etc/fstab, and the setup takes just a second or two. 




Another important consideration is what happens to the data if parts of the system fail. 
In our case, we have 12 internal disks on a H700 controller and 2 MD1200 devices with 12 disks each connected through a H800 controller, and 17 of the 2TB disks used so far in the MD devices need to be replaced by 8TB disks. Different setups are possible, and it is interesting to see what happens in each case if a controller or one of the MD devices fails.
The below mentioned scenarios assume that the system is used as a DPM server in production.

Possibility 1: 1 raid6/raidz2 for all 8TB disks, 1 raid6/raidz2 for all 2TB disks on the H800, and 1 raid6/raidz2 for all 2TB disks on the H700

What happens if the H700 controller fails and can't be replaced (soon)?

Hardware raid system
The 2 raid systems on the H800 would still be available. The associated filesystems could be drained in DPM, and the disks on the H700 controller could then be swapped with the disks in the MD devices holding the drained file systems. However, whether the data on the disks remains usable depends a lot on the compatibility of the on-disk format used by different raid controllers. And even if the raid could be recreated on a different controller without any data loss (which is probably not possible), it would appear to the OS as a different disk, and all the mount points would need to be changed to make the data available again.

ZFS based raid system
Again, the 2 raid systems associated with the H800 would still be available and the pool of 8TB disks could be drained. The disks on the failed H700 could then simply be swapped with the 8TB disks, and the zpool would be available again - mounted under the same directories as before, no matter in which bays the disks sit or on which controller.

What happens if one of the MD devices fails?

Hardware raid system
Since one MD device has only 12 slots and we need to put in 17 x 8TB disks, one raid system consists of disks from both MD devices. If either MD device fails, the data on all 8TB disks is lost. If the MD device with the remaining 2TB disks fails, it depends on whether the raid on the 2TB disks can be recognized by the H700 controller. If so, the data on the disks attached to the H700 could be drained and at least the data from the 2TB disks in the failed MD device restored, although with some manual configuration changes in the OS since DPM needs the data mounted in the previously used directories.

If anyone has experience with using disks from a raid created on one controller on another controller of a different kind, I would appreciate hearing about it in the comments section.


ZFS based raid system
If either MD device fails, the pool with the internal disks on the H700 could be drained, and the disks then swapped with those from the failed MD device. All data would be available again and nothing would be lost.

Possibility 2: one raidz2 for all 8TB disks (pool1) and one raidz2 for all 2TB disks (pool2)

This setup is only possible when using a ZFS based raid since it's not supported by hardware raid controllers.

If the H700 fails, the data on pool1 could be drained and the 8TB disks replaced with the 2TB disks that were connected to the H700. All data would then be available again.
It is similar when one of the MD devices fails. If the MD device with only 8TB disks fails, pool2 can be drained and the disks on the H700 replaced with the 8TB disks from the failed MD device. 
If the MD device with both 2TB and 8TB disks fails, it is a bit more complicated, but all data could be restored in a 2-step process: 
1) Replace the 2TB disks on the H700 with the 8TB disks from the failed MD device, which makes pool1 available again so it can be drained.
2) Put the original 2TB disks that were connected to the H700 back in, and replace the 8TB disks in the working MD device with the remaining 2TB disks from the failed MD device, which makes pool2 available again.
In any case, no data would be lost.




22 March 2016

ZFS vs Hardware Raid

Since we need to upgrade our storage space, and since our machines contain 2 raid controllers, one for the internal disks and one for the external disks, we tested whether a software raid could be used instead of a traditional hardware based raid.
As ZFS is the most advanced system in that respect, ZFS on Linux was tested for this purpose and proved to be a good choice here too.
This post describes the general read/write and failure tests; a later post will cover additional tests such as rebuilding the raid after a disk failure, different failure scenarios, and setup and format times.
Please, use the comment section if you would like to have other tests done too.


Hardware test configuration:

  1. DELL PowerEdge R510 
  2. 12x2TB SAS (6Gbps) internal storage on a PERC H700 controller
  3. 2 external MD1200 devices with 12x2TB SAS (6Gbps) on a PERC H800 controller
  4. 24GB RAM
  5. 2 x Intel Xeon E5620 (2.4GHz)
  6. for all settings in the raid controllers the defaults were used for all tests, except for the cache which was set to "write through"
ZFS test system configuration:
  1. SL6 OS
  2. ZFS based on the latest version available in the repository 
  3. no ZFS compression used
  4. 1xraidz2 + hotspare for all the disks on H700  (zpool tank)
  5. 1xraidz2 + hotspare for all the disks on H800  (zpool tank800)
  6. in both raid controllers each disk is defined as a single raid0 since they unfortunately don't support JBOD
Hardware raid test system configuration:
  1. same machine with same disks, controllers, and OS used as for the ZFS test configuration
  2. 1xraid6 + hotspare for all the disks on H700
  3. 1xraid6 + hotspare for all the disks on H800 
  4. space was divided into 8TB partitions and formatted with ext4

Read/Write speed test



  • time (dd if=/dev/zero of=/tank800/test10G bs=1M count=10240 && sync)
  • time (dd if=/tank800/test10G of=/dev/null bs=1M && sync)
  • the first number in each result is the rate reported by "dd"
  • the elapsed time and the second rate (in parentheses) come from "time"
  • write test was done first for both controllers, and then the read tests


  • H700 results

    ZFS based:
    write: 236MB/s, 1min:02 (165MB/s)
    read:  399MB/s, 0min:27 (379MB/s)

    Hardware raid based:
    write: 233MB/s, 1min:10 (146MB/s)
    read:    1.2GB/s, 0min:18 (1138MB/s)

    H800 results

    ZFS based:
    write: 619MB/s, 0min:23 (445MB/s)
    read:  2.0GB/s, 0min:05 (2048MB/s)

    Hardware raid based:
    write: 223MB/s, 1min:13 (140MB/s)
    read:  150MB/s, 1min:12 (142MB/s)

    H700 and H800 mixed

    • 6 disks from each controller were used together in a combined raid configuration
    • this kind of configuration is not possible for a hardware based raid
    ZFS result:
    write: 723MB/s, 0min:37 (277MB/s)
    read:  577MB/s, 0min:18 (568MB/s)

    Conclusion

    • ZFS rates for H800 based raid much better than hardware raid based system
    • the large difference between ZFS and hardware raid based reads needs more investigation
      • however, when the same tests were repeated 2 more times, the results were of the same order
    • H800 has a much better performance than H700 when using ZFS, but not for the hardware raid configuration

    Failure Test

    Here it was tested what happens when a 100GB file (test.tar) is copied (with cp and rsync) from the H800 based raid to the H700 based raid and the system fails during the copy, simulated by a cold reboot through the remote console.

    ZFS result:

    [root@pool6 ~]# ls -lah /tank 
    total 46G
    drwxr-xr-x.  2 root root    5 Mar 19 20:11 .
    dr-xr-xr-x. 26 root root 4.0K Mar 19 20:17 ..
    -rw-r--r--.  1 root root  16G Mar 19 19:07 test10G
    -rw-r--r--.  1 root root  13G Mar 19 20:12 test.tar
    -rw-------.  1 root root  18G Mar 19 20:06 .test.tar.EM379W

    [root@pool6 ~]# df -h /tank
    Filesystem      Size  Used Avail Use% Mounted on
    tank             16T   46G   16T   1% /tank
    [root@pool6 ~]# du -sch /tank
    46G     /tank
    46G     total

    [root@pool6 ~]# rm /tank/*test.tar*
    rm: remove regular file `/tank/test.tar'? y
    rm: remove regular file `/tank/.test.tar.EM379W'? y
    [root@pool6 ~]# du -sch /tank
    17G     /tank
    17G     total

    [root@pool6 ~]# ls -la /tank
    total 16778239
    drwxr-xr-x.  2 root root           3 Mar 19 20:21 .
    dr-xr-xr-x. 26 root root        4096 Mar 19 20:17 ..
    -rw-r--r--.  1 root root 17179869184 Mar 19 19:07 test10G
    • everything consistent
    • no file check needed at reboot
    • no problems at all occurred 

    Hardware raid based result:

    [root@pool7 gridstorage02]# ls -lhrt
    total 1.9G
    drwx------    2 root   root    16K Jun 26  2012 lost+found
    drwxrwx---   91 dpmmgr dpmmgr 4.0K Feb  4  2013 ildg
    -rw-r--r--    1 root   root      0 Mar  6  2013 thisisgridstor2
    drwxrwx---   98 dpmmgr dpmmgr 4.0K Aug  8  2013 lhcb
    drwxrwx---  609 dpmmgr dpmmgr  20K Aug 27  2014 cms
    drwxrwx---    6 dpmmgr dpmmgr 4.0K Nov 23  2014 ops
    drwxrwx---    6 dpmmgr dpmmgr 4.0K Mar 13 12:18 ilc
    drwxrwx---    9 dpmmgr dpmmgr 4.0K Mar 13 23:04 lsst
    drwxrwx---  138 dpmmgr dpmmgr 4.0K Mar 14 10:23 dteam
    drwxrwx--- 1288 dpmmgr dpmmgr  36K Mar 15 00:00 atlas
    -rw-r--r--    1 root   root   1.9G Mar 18 17:11 test.tar

    [root@pool7 gridstorage02]# df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sdb2             8.1T  214M  8.1T   1% /mnt/gridstorage02

    [root@pool7 gridstorage02]# du . -sch
    1.9G    .
    1.9G    total

    [root@pool7 gridstorage02]# rm test.tar 
    rm: remove regular file `test.tar'? y

    [root@pool7 gridstorage02]# du . -sch
    41M     .
    41M     total

    [root@pool7 gridstorage02]# df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sdb2             8.1T -1.7G  8.1T   0% /mnt/gridstorage02

    • Hardware raid based tests were done first, on a machine that was previously used as a dpm client; therefore the directory structure was still present, but empty
    • during the reboot a file system check was done
    • "df"  reports a different number for the used space than "du" and "ls"
    • after removing the file, the used space reported by "df" is negative
    • file system is not consistent anymore

    Conclusion here:

    • for the planned extension (17x2TB exchanged for 8TB disks), the new disks should be placed in the MD devices and managed by the H800 using ZFS
    • second zpool can be used for all remaining 2TB disks (on H700 and H800 together)
    • ZFS seems to handle system failures better 
    To be continued...

    01 March 2010

    A Phew Good Files

    The storage support guys finished integrity checking of 5K ATLAS files held at Lancaster and found no bad files.


    This, of course, is a Good Thing™.


    The next step is to check more files, and to figure out how implementations cache checksums. Er, the next two steps are to check more files and document handling checksums, and do it for more experiments. Errr, the next three steps are to check more files, document checksum handling, add more experiments, and integrate toolkits more with experiments and data management tools.
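    An integrity check of this kind boils down to recomputing each file's checksum and comparing it with the one recorded when the file was written. A minimal sketch (md5 is used here for illustration; grid storage implementations typically store adler32, and the real toolkit differs):

```shell
# Create a sample file and record its checksum (stand-in for the catalogue)
echo "some grid data" > /tmp/testfile
md5sum /tmp/testfile > /tmp/stored_checksums.txt

# Recompute and compare; any mismatch indicates possible corruption
while read -r stored path; do
    actual=$(md5sum "$path" | awk '{print $1}')
    if [ "$stored" != "$actual" ]; then
        echo "BAD: $path"
    fi
done < /tmp/stored_checksums.txt
```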


    There have been some reports of corrupted files, but corruption can happen for more than one reason, and the problem is not always at the site. The Storage Inquisition investigation is ongoing.

    06 October 2007

    Interview with ZFS techies - mentions LHC

    These guys (Jeff Bonwick and Bill Moore) are techies, so it's not a sales pitch at all. Running time is about 48 mins, but you can do something else and just listen to it (like in a meeting :-)

    http://www.podtech.net/scobleshow/technology/1619/talking-storage-systems-with-suns-zfs-team

    Suitable for everyone, covers a lot of software engineering ("software is only as good as its test suite") but also obviously filesystems and "zee eff ess" and storage.

    Also mentions CERN's LHC, and the LHC Atlas detector in particular.
    And work done by CERN to examine silent data corruption (at around 33 mins into the interview).

    09 May 2007

    ZFS performance on RAID

    http://milek.blogspot.com/2007/04/hw-raid-vs-zfs-software-raid-part-iii.html