23 March 2016

ZFS vs Hardware Raid System, Part II

This post focuses on other differences between a ZFS-based software raid and a hardware raid system that could be important for use as a GridPP storage backend. In a later post, the differences in the read/write rates will be tested more intensively.
For the current tests, the system configuration is the same as described previously.

The first test is what happens if we simply pull a disk out of the running raid system.
In both cases, the raid gets rebuilt using the hot spare that was provided in the initial configuration. However, the times needed to fully restore redundancy are very different:

ZFS based raid recovery time: 3min
Hardware based raid recovery time: 9h:2min

For both systems, only the test files from the previous read/write tests were on disk. The hardware raid was re-initialized to remove the corrupted filesystem left over from the failure test, and then the test files were recreated. Both systems were not doing anything else during the recovery period.

The large difference is due to the fact that ZFS is a raid system, volume manager, and file system all in one. ZFS knows about the structure on disk and the data that was on the broken disk. Therefore it only needs to restore what was actually used, in the above test case just 1.78GB.
The hardware raid system, on the other hand, knows nothing about the filesystem and the space really used on a disk. It therefore needs to restore the whole disk even if the disk is nearly empty, as in our test case: that's 2TB to restore while only about 2GB are actually useful data! This difference will become even more important in the future as the capacity of a single disk gets larger.
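To put a rough number on the last point, here is a small back-of-the-envelope sketch. It takes the 9h:2min hardware rebuild of a 2TB disk from above, derives the average rebuild rate, and projects it onto an 8TB disk. The decimal units (1TB = 1e12 bytes) and the constant rebuild rate are assumptions made for illustration, not measurements.

```python
# Figures quoted in the post. Decimal units (1 TB = 1e12 bytes) are an assumption.
HW_REBUILD_SECONDS = 9 * 3600 + 2 * 60   # 9h 2min for the hardware raid
DISK_BYTES = 2e12                        # the whole 2TB disk gets rebuilt


def rebuild_rate_mb_s(num_bytes, seconds):
    """Average rebuild throughput in MB/s."""
    return num_bytes / seconds / 1e6


def projected_rebuild_hours(disk_bytes, rate_mb_s):
    """Hours needed to rebuild a full disk at a constant rate (assumption)."""
    return disk_bytes / (rate_mb_s * 1e6) / 3600


hw_rate = rebuild_rate_mb_s(DISK_BYTES, HW_REBUILD_SECONDS)
hours_8tb = projected_rebuild_hours(8e12, hw_rate)

print(f"hardware raid rebuild rate: {hw_rate:.1f} MB/s")
print(f"projected full rebuild of an 8TB disk: {hours_8tb:.1f} h")
```

Under these assumptions the hardware raid rebuilds at roughly 61.5 MB/s, so an 8TB disk would need about 36 hours regardless of how much data is actually stored on it.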

ZFS is now also used on one of our real production systems behind DPM. The zpool on that machine, which consists of 3 x raidz2 vdevs with 11 x 2TB disks each, also had a real disk failure, and the disk needed to be replaced. There are about 5TB of data on that zpool, and restoring full redundancy took about 3h while the machine was still in production and used by jobs to request or store data. This is again much faster than what a hardware raid based system would need, even if that machine were doing nothing else in the meantime than restoring full redundancy.
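For context, the nominal capacity of such a pool can be estimated with the short sketch below. It only uses the fact that each raidz2 vdev gives up two disks' worth of space to parity; ZFS metadata overhead and the TB/TiB difference are ignored, so the result is an approximation.

```python
def raidz2_usable_tb(vdevs, disks_per_vdev, disk_tb):
    """Nominal usable capacity: every raidz2 vdev loses two disks' worth to parity."""
    return vdevs * (disks_per_vdev - 2) * disk_tb


# Production pool from the post: 3 x raidz2 with 11 x 2TB disks each, about 5TB of data
usable = raidz2_usable_tb(vdevs=3, disks_per_vdev=11, disk_tb=2)
used_fraction = 5 / usable

print(f"nominal usable capacity: {usable} TB, used: {used_fraction:.0%}")
```

With only about 9% of the nominal 54TB in use, ZFS has correspondingly little to reconstruct, whereas a hardware raid would still rewrite the complete 2TB replacement disk.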



Another large difference between both systems is the time needed to set up a raid configuration.
For the hardware based raid, one first needs to create the configuration in the controller, initialize the raid, then create partitions on the virtual disk, and lastly format the partitions to put a file system on them. When all that is finished, the newly formatted partitions need to be added to /etc/fstab so that they are mounted when the system starts, mount points need to be created, and then the partitions can be mounted. That's a very long process which takes up a lot of time before the system can be used.
To give an idea of the time involved, formatting a single 8TB partition with ext4 took:
on H700: about 1h:11min
on H800: about 34min

In our configuration (24 x 2TB on the H800 and 12 x 2TB on the H700), the formatting alone would take about 6h if done one partition after another!
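The 6h figure can be reproduced approximately with the sketch below. Two things in it are assumptions and not taken from the post: that the disks are grouped into raid6 sets of 12 (so 12 x 2TB gives 20TB usable), and that the ext4 formatting time scales linearly with capacity.

```python
# Measured above: minutes to format one 8TB ext4 partition on each controller
FORMAT_MINUTES_PER_8TB = {"H700": 71, "H800": 34}

# Assumption: raid6 groups of 12 disks, i.e. 10 x 2TB = 20TB usable per group
USABLE_TB = {"H700": 20, "H800": 40}


def total_format_hours(minutes_per_8tb, usable_tb):
    """Sequential formatting time in hours, assuming time scales linearly with capacity."""
    total_minutes = sum(
        minutes_per_8tb[ctrl] / 8 * usable_tb[ctrl] for ctrl in usable_tb
    )
    return total_minutes / 60


hours = total_format_hours(FORMAT_MINUTES_PER_8TB, USABLE_TB)
print(f"estimated sequential formatting time: {hours:.1f} h")
```

Under these assumptions the estimate comes out at a little under 6 hours, in line with the figure quoted above.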

For ZFS, we still need to create a separate raid0 for each disk in the H700 and H800, since neither controller supports JBOD. However, once this is done, it's very easy and fast to create a production ready raid system. There is one single command which does everything:
zpool create NAME /dev/... /dev/... ..... spare /dev/....
After that single command, the raid system is created, formatted, and mounted under /NAME. The mount point can be changed using options when the zpool is created (or later). There is no need to edit /etc/fstab, and the whole setup takes less than 10s.
To have single 8TB "partitions", one can create additional ZFS file systems in this pool and set a quota on each of them, like
zfs create -o refquota=8TB NAME/partition1
After that command, a new ZFS is created, a quota is placed on it which makes sure that tools like "df" only see 8TB available for usage, and it's mounted under /NAME/partition1 - again no need to edit /etc/fstab and the setup takes just a second or two. 
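When several such "partitions" are needed, the commands can be generated rather than typed by hand. The following is a minimal sketch that only builds and prints the command lines and does not execute anything. The device names (/dev/sdb and so on), the raidz2 vdev type, and the helper function names are placeholders chosen for illustration; the argument order follows the `zpool create` and `zfs create -o property=value` syntax.

```python
import shlex


def zpool_create_cmd(pool, vdev_type, disks, spares):
    """Build a 'zpool create' command line with optional hot spares."""
    cmd = ["zpool", "create", pool, vdev_type, *disks]
    if spares:
        cmd += ["spare", *spares]
    return shlex.join(cmd)


def zfs_quota_cmds(pool, count, refquota):
    """Build one 'zfs create' command per file system, each limited by refquota."""
    return [
        shlex.join(
            ["zfs", "create", "-o", f"refquota={refquota}", f"{pool}/partition{i}"]
        )
        for i in range(1, count + 1)
    ]


# Hypothetical device names; replace them with the real ones on the system
pool_cmd = zpool_create_cmd(
    "NAME", "raidz2",
    ["/dev/sdb", "/dev/sdc", "/dev/sdd", "/dev/sde"],
    ["/dev/sdf"],
)
fs_cmds = zfs_quota_cmds("NAME", 2, "8TB")

print(pool_cmd)
for line in fs_cmds:
    print(line)
```

Printing the commands first makes it easy to review them before running anything as root.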




Another important consideration is what happens to the data if parts of the system fail.
In our case, we have 12 internal disks on a H700 controller and 2 MD1200 devices with 12 disks each that are connected through a H800 controller, and 17 of the 2TB disks we used so far in the MD devices need to be replaced by 8TB disks. Different setups are possible, and it's interesting to see what happens in each case if a controller or one of the MD devices fails.
The scenarios below assume that the system is used as a DPM server in production.

Possibility 1: 1 raid6/raidz2 for all 8TB disks, 1 raid6/raidz2 for all 2TB disks on the H800, and 1 raid6/raidz2 for all 2TB disks on the H700

What happens if the H700 controller fails and can't be replaced (soon)?

Hardware raid system
The 2 raid systems on the H800 would still be available. The associated filesystems could be drained in DPM, and the disks on the H700 controller could be swapped with the disks in the MD devices associated with the drained file systems. However, whether the data on the disks would be usable depends a lot on the compatibility of the on-disk format used by different raid controllers. Also, even if it were possible to recreate the raid on a different controller without any data loss (which is probably not possible), it would appear to the OS as a different disk and all the mount points would need to be changed to make the data available again.

ZFS based raid system
Again, the 2 raid systems associated with the H800 would still be available and the pool of 8TB disks could be drained. Then the disks on the failed H700 could simply be swapped with the 8TB disks, and the zpool would be available again, mounted under the same directories as before, no matter in which bays the disks are or on which controller.

What happens if one of the MD devices fails?

Hardware raid system
Since one MD device has only 12 bays and we need to put in 17 x 8TB disks, one raid system consists of disks from both MD devices. If either MD device fails, the data on all 8TB disks will be lost. If the MD device with the remaining 2TB disks fails, the outcome depends on whether the raid on the 2TB disks can be recognized by the H700 controller. If so, then the data on the disks associated with the H700 could be drained and at least the data from the 2TB disks on the failed MD device could be restored, although with some manual configuration changes in the OS since DPM needs the data mounted in the previously used directories.

If anyone has experience with using disks from a raid created on one controller on another controller of a different kind, I would appreciate reading about it in the comments section.


ZFS based raid system
If any MD device fails, the pool with the internal disks on the H700 could be drained, and then the disks swapped with the disks on the failed MD device. All data would be available again and no data would be lost.

Possibility 2: 1 raidz2 for all 8TB disks (pool1) and 1 raidz2 for all 2TB disks (pool2)

This setup is only possible when using a ZFS based raid since it's not supported by hardware raid controllers.

If the H700 fails, then the data on pool1 could be drained and the 8TB disks be replaced with the 2 TB disks which were connected to the H700. Then all data would be available again.
It's similar when one of the MD devices fails. If the MD device with only 8TB disks fails, then pool2 can be drained and the disks on the H700 be replaced with the 8TB disks from the failed MD device. 
If the MD device with 2TB and 8TB disks fails, it's a bit more complicated but all data could be restored in a 2 step process: 
1) Replace the 2TB disks on the H700 with the 8TB disks from the failed MD device, which makes pool1 available again so that it can be drained.
2) Put the original 2TB disks that were connected to the H700 back in and replace 8TB disks on the working MD device with the remaining 2TB disks from the failed MD device, which makes pool2 available again.
In any case, no data would be lost.
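The two-step recovery can be checked with a bit of bookkeeping. The sketch below models each enclosure as 12 bays and tests whether every disk of a pool sits in a working enclosure after each step. The enclosure labels MD_A and MD_B, the function name, and the assumption that all bays are populated (12 + 5 disks of 8TB, 12 + 7 disks of 2TB, hot spares not treated separately) are chosen for illustration only.

```python
BAYS_PER_ENCLOSURE = 12
# pool1 = all 8TB disks, pool2 = all 2TB disks (12 on the H700 + 7 left in the MD devices)
POOL_DISKS = {"8TB": 17, "2TB": 19}


def pool_complete(layout, failed, size):
    """True if every disk of the given size sits in a working enclosure."""
    for enclosure, disks in layout.items():
        if sum(disks.values()) > BAYS_PER_ENCLOSURE:
            raise ValueError(f"{enclosure} holds more disks than it has bays")
    present = sum(
        disks.get(size, 0) for enclosure, disks in layout.items() if enclosure != failed
    )
    return present == POOL_DISKS[size]


# Normal operation: MD_A holds only 8TB disks, MD_B holds the mixed set
normal = {"H700": {"2TB": 12}, "MD_A": {"8TB": 12}, "MD_B": {"8TB": 5, "2TB": 7}}
# Step 1 after MD_B failed: its five 8TB disks replace 2TB disks in the H700 bays
step1 = {"H700": {"8TB": 5}, "MD_A": {"8TB": 12}, "MD_B": {}}
# Step 2: 2TB disks back in the H700, seven 2TB disks from MD_B replace 8TB disks in MD_A
step2 = {"H700": {"2TB": 12}, "MD_A": {"8TB": 5, "2TB": 7}, "MD_B": {}}

print("pool1 complete after step 1:", pool_complete(step1, "MD_B", "8TB"))
print("pool2 complete after step 2:", pool_complete(step2, "MD_B", "2TB"))
```

The check is deliberately strict: it requires all disks of a pool to be present, although a raidz2 would stay importable with up to two disks missing.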



