22 March 2016

ZFS vs Hardware Raid

Since we need to upgrade our storage space, and since each of our machines has 2 raid controllers, one for the internal disks and one for the external disks, we tested whether a software raid could be used instead of a traditional hardware based raid.
ZFS is the most advanced system in that respect, so ZFS on Linux was tested for this purpose, and it proved to be a good choice here too.
This post describes the general read/write and failure tests; a later post will describe additional tests like rebuilding of the raid if a disk fails, different failure scenarios, and setup and format times.
Please, use the comment section if you would like to have other tests done too.

Hardware test configuration:

  1. DELL PowerEdge R510 
  2. 12x2TB SAS (6Gbps) internal storage on a PERC H700 controller
  3. 2 external MD1200 devices with 12x2TB SAS (6Gbps) on a PERC H800 controller
  4. 24GB RAM
  5. 2 x Intel Xeon E5620 (2.4GHz)
  6. all raid controller settings were left at their defaults for all tests, except for the cache, which was set to "write through"
ZFS test system configuration:
  1. SL6 OS
  2. ZFS based on the latest version available in the repository 
  3. no ZFS compression used
  4. 1xraidz2 + hotspare for all the disks on H700  (zpool tank)
  5. 1xraidz2 + hotspare for all the disks on H800  (zpool tank800)
  6. since the controllers unfortunately don't support JBOD, each disk is defined as a single-disk raid0 on both of them
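Given the single-disk raid0 workaround, the two pools can be created roughly as follows (a sketch; the device names are placeholders and must be checked with lsblk on the actual machine):

```shell
# raidz2 pool plus hot spare from the 12 internal disks on the H700;
# /dev/sdb ... /dev/sdm are placeholder names for the single-disk raid0 volumes
zpool create tank raidz2 /dev/sd{b,c,d,e,f,g,h,i,j,k,l} spare /dev/sdm

# the H800 pool (tank800) is created the same way from the 24 external disks
```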
Hardware raid test system configuration:
  1. same machine with same disks, controllers, and OS used as for the ZFS test configuration
  2. 1xraid6 + hotspare for all the disks on H700
  3. 1xraid6 + hotspare for all the disks on H800 
  4. space was divided into 8TB partitions and formatted with ext4
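The ext4 partitions were presumably set up along these lines (a sketch; the device name, partition size, and label are assumptions):

```shell
# GPT label on the hardware raid6 virtual disk, 8TB partitions, each formatted
# with ext4; /dev/sdb and the label are placeholders
parted /dev/sdb mklabel gpt
parted /dev/sdb mkpart primary ext4 0% 8TB
mkfs.ext4 -L gridstorage02 /dev/sdb1
```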

Read/Write speed test

  • time (dd if=/dev/zero of=/tank800/test10G bs=1M count=10240 && sync)
  • time (dd if=/tank800/test10G of=/dev/null bs=1M && sync)
  • the first number in each result is the rate reported by "dd"
  • the time and the second number come from "time", i.e. they include the final sync
  • write test was done first for both controllers, and then the read tests
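The rate printed by dd only covers the time until dd itself finishes, while time also includes the final sync, which is why the second number is lower. The effective rate can be recomputed from the file size and the elapsed time, e.g. for the H700 ZFS write test (10240MB in 1min:02):

```shell
# effective throughput = data written / total elapsed time (including sync)
awk 'BEGIN { printf "%.0f MB/s\n", 10240 / 62 }'
```

which gives the 165MB/s quoted in the results.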

  • H700 results

    ZFS based:
    write: 236MB/s, 1min:02 (165MB/s)
    read:  399MB/s, 0min:27 (379MB/s)

    Hardware raid based:
    write: 233MB/s, 1min:10 (146MB/s)
    read:  1.2GB/s, 0min:18 (1138MB/s)

    H800 results

    ZFS based:
    write: 619MB/s, 0min:23 (445MB/s)
    read:  2.0GB/s, 0min:05 (2048MB/s)

    Hardware raid based:
    write: 223MB/s, 1min:13 (140MB/s)
    read:  150MB/s, 1min:12 (142MB/s)

    H700 and H800 mixed

    • 6 disks from each controller were used together in a combined raid configuration
    • this kind of configuration is not possible for a hardware based raid
    ZFS result:
    write: 723MB/s, 0min:37 (277MB/s)
    read:  577MB/s, 0min:18 (568MB/s)


    • ZFS on the H800 based raid performs much better than the hardware raid based system
    • the large difference between ZFS and hardware raid based reads needs more investigation
      • however, repeating the same tests 2 more times gave results of the same order
    • with ZFS the H800 performs much better than the H700, but not in the hardware raid configuration

    Failure Test

    Here we tested what happens if the system fails while a 100GB file (test.tar) is being copied (with cp and rsync) from the H800 based raid to the H700 based raid; the failure was simulated by a cold reboot through the remote console.
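The procedure was essentially the following (a sketch; the BMC address and password are placeholders, and using ipmitool for the cold reboot is an assumption — any remote console that can power cycle the machine works):

```shell
# start copying the 100GB file between the two pools ...
cp /tank800/test.tar /tank/ &

# ... and, while the copy is still running, force a cold reboot,
# e.g. from another host via the management interface:
ipmitool -H <bmc-address> -U root -P <password> chassis power cycle
```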

    ZFS result:

    [root@pool6 ~]# ls -lah /tank
    total 46G
    drwxr-xr-x.  2 root root    5 Mar 19 20:11 .
    dr-xr-xr-x. 26 root root 4.0K Mar 19 20:17 ..
    -rw-r--r--.  1 root root  16G Mar 19 19:07 test10G
    -rw-r--r--.  1 root root  13G Mar 19 20:12 test.tar
    -rw-------.  1 root root  18G Mar 19 20:06 .test.tar.EM379W

    [root@pool6 ~]# df -h /tank
    Filesystem      Size  Used Avail Use% Mounted on
    tank             16T   46G   16T   1% /tank
    [root@pool6 ~]# du -sch /tank
    46G     /tank
    46G     total

    [root@pool6 ~]# rm /tank/*test.tar*
    rm: remove regular file `/tank/test.tar'? y
    rm: remove regular file `/tank/.test.tar.EM379W'? y
    [root@pool6 ~]# du -sch /tank
    17G     /tank
    17G     total

    [root@pool6 ~]# ls -la /tank
    total 16778239
    drwxr-xr-x.  2 root root           3 Mar 19 20:21 .
    dr-xr-xr-x. 26 root root        4096 Mar 19 20:17 ..
    -rw-r--r--.  1 root root 17179869184 Mar 19 19:07 test10G
    • everything consistent
    • no file check needed at reboot
    • no problems at all occurred 
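On top of that, ZFS can verify the pool explicitly: a scrub re-reads every block and checks it against its checksum, so consistency after the crash can be confirmed rather than assumed:

```shell
zpool status tank     # pool state and error counters
zpool scrub tank      # re-read and verify all data in the background
zpool status -v tank  # inspect the scrub result once it has finished
```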

    Hardware raid based result:

    [root@pool7 gridstorage02]# ls -lhrt
    total 1.9G
    drwx------    2 root   root    16K Jun 26  2012 lost+found
    drwxrwx---   91 dpmmgr dpmmgr 4.0K Feb  4  2013 ildg
    -rw-r--r--    1 root   root      0 Mar  6  2013 thisisgridstor2
    drwxrwx---   98 dpmmgr dpmmgr 4.0K Aug  8  2013 lhcb
    drwxrwx---  609 dpmmgr dpmmgr  20K Aug 27  2014 cms
    drwxrwx---    6 dpmmgr dpmmgr 4.0K Nov 23  2014 ops
    drwxrwx---    6 dpmmgr dpmmgr 4.0K Mar 13 12:18 ilc
    drwxrwx---    9 dpmmgr dpmmgr 4.0K Mar 13 23:04 lsst
    drwxrwx---  138 dpmmgr dpmmgr 4.0K Mar 14 10:23 dteam
    drwxrwx--- 1288 dpmmgr dpmmgr  36K Mar 15 00:00 atlas
    -rw-r--r--    1 root   root   1.9G Mar 18 17:11 test.tar

    [root@pool7 gridstorage02]# df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sdb2             8.1T  214M  8.1T   1% /mnt/gridstorage02

    [root@pool7 gridstorage02]# du . -sch
    1.9G    .
    1.9G    total

    [root@pool7 gridstorage02]# rm test.tar 
    rm: remove regular file `test.tar'? y

    [root@pool7 gridstorage02]# du . -sch
    41M     .
    41M     total

    [root@pool7 gridstorage02]# df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sdb2             8.1T -1.7G  8.1T   0% /mnt/gridstorage02

    • the hardware raid based tests were done first, on a machine that was previously used as a DPM client; therefore the directory structure was still present, but empty
    • during the reboot a file system check was done
    • "df"  reports a different number for the used space than "du" and "ls"
    • after removing the file, the used space reported by "df" is negative
    • file system is not consistent anymore
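The negative value reported by df points to stale free-block accounting in the superblock; the usual repair is a forced fsck on the unmounted filesystem (device name taken from the df output above):

```shell
umount /mnt/gridstorage02
e2fsck -f /dev/sdb2                 # forced check, repairs block accounting
mount /dev/sdb2 /mnt/gridstorage02
```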

    Conclusion here:

    • for the planned extension (17x2TB exchanged for 8TB disks), the new disks should be placed in the MD devices and managed by the H800 using ZFS
    • a second zpool can be used for all remaining 2TB disks (on H700 and H800 together)
    • ZFS seems to handle system failures better 
    To be continued...

    1 comment:

    wazoox said...

    You really should run tests with files much bigger than RAM size, else caching will get in the way and make the results irrelevant. You have 24G of RAM, you should run your tests with 48G files.

    You may also run "echo 3 > /proc/sys/vm/drop_caches" to empty the file cache between runs for more consistent results.

    Another point to take into account is the IO scheduler. Most distributions use cfq (Completely Fair Queuing) as the default; unfortunately it's most of the time a poor choice for a server, particularly when using hardware RAID. Use the "noop" scheduler for perfectly fair tests: run "echo noop > /sys/block/<device>/queue/scheduler" for all drives.

    Last, you may need to adjust the IO queue length and read-ahead. The default values are quite correct for old ATA drives with small caches, but very suboptimal for RAID arrays. Most RAID controllers need much longer queues than the default 128 (512 or 1024): echo 1024 > /sys/block/<device>/queue/nr_requests
    And most hardware RAID controllers have large caches that give better results for sequential IO with big read-ahead values: echo 8192 > /sys/block/<device>/queue/read_ahead_kb
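The commenter's three settings can be applied in one loop (a sketch; the sd* glob is an assumption and should be restricted to the actual raid member devices):

```shell
# apply the suggested IO settings to every sd* device (needs root)
for dev in /sys/block/sd*; do
    echo noop > "$dev/queue/scheduler"      # bypass cfq request reordering
    echo 1024 > "$dev/queue/nr_requests"    # longer queue for the RAID controller
    echo 8192 > "$dev/queue/read_ahead_kb"  # larger read-ahead for sequential IO
done
```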