31 March 2016

Some thoughts on data in academic environments vs industry - Part 2/2

Now, if we in the academic world are so good at managing "big data," how (if at all) does it impact industry and society as a whole? Obviously we rely on the storage and networking industry for the "fabric," and more generally we collaborate widely with industry partners in "big data" projects such as those funded by Horizon 2020 (e.g. ESiWACE and SAGE) to solve the "next generation" of problems. And of course there are data specialists in industry with even bigger data volumes than ours.

So the question here is how we (=academic data centres) can engage with industry more widely, specifically with those who could benefit from the expertise we have developed.

If we have something that can be commercialised, we can of course spin out companies: universities and research councils can do this fairly easily. Indeed, many former collaborators are now CxOs and co-founders of startups such as SIXSQ and StreamVibe. There are also patents and knowledge exchange. And STFC has the Hartree Centre, which focuses on solving problems - including "big data" ones - for industry. STFC also has an innovation hub.

For more collaborative, exploratory work, let me highlight the opportunity of working with the Connected Digital Economy Catapult and the Big Innovation Centre. As with our own experiments, we need to bring experts from both sides - experiment and infrastructure - together to make the whole thing work. CDEC and BIC have the ability to tap into business requirements; as an example, we have previously investigated and designed a "trusted data hub" to enable companies that don't want to share data openly to share it in a controlled way within a trusted platform. By connecting academic expertise and innovation with industry requirements, we can increase the impact of our work and together design systems which improve people's lives. The CDEC and BIC wear the suits, so we don't have to!

29 March 2016

Deletion of tape-backed data for the ATLAS VO at the RAL Tier1

The RAL Tier1 is just about to migrate all the tape-backed data that we store for the ATLAS collaboration onto higher-density tapes (the tapes themselves are the same size, but the amount that can be written per tape with the new drives is higher). Before we started, we asked ATLAS if they could find any data to delete before the migration, so as not to carry gaps over onto the new tapes.
This they did successfully.
They deleted 1.58 million files, corresponding to 1.48 PB of data, over a 5-day period, as can be seen in the following plots (deletion rate when busy of 20k files per hour):

Now to just get ATLAS to delete the remaining log files from DATADISK... (which requires moving to our new Ceph storage system first)...

24 March 2016

Some thoughts on data in academic environments vs industry, part 1 (of 2)

I was asked today for my opinion on the difference between big data in academia and industry. As I see it, we have volume (data collections on the order of tens of PBs, e.g. WLCG, climate); characteristically, velocity is an order of magnitude greater than volume, as all data is moved around and replicated (FTS alone was recently estimated to have moved nearly a quarter of an exabyte in a year, and Globus advertise their (presumably estimated) data transfer volumes on their home page). Most science data is copied and replicated, as large-scale data science is a global endeavour, requiring the collaboration of research centres across the world.

But we (science) have less variety. Physics events are physics events, and even with different types like raw, AOD, ESD, etc., there is a manageable collection of formats. Different communities have different formats, but as a rule, science is fairly consistent.

Bandwidth into sites is measured in tens of Gb/s (10, 40, 60, that sort of thing), and the 16 PB Panasas system for JASMIN can shift something like 2.5 Tb/s (think of it as a million times faster than your home Internet connection).

Moreover, for some things like the WLCG, data models are very regimented, ensuring that processing happens in an orderly rather than a chaotic fashion. We have security strong enough to let us run services outside the firewall (as otherwise we'd slow the transfers down a lot, and/or melt the firewall).

And the expected evolution is more of the same - towards 100PB for data collections and 1EB for data transfers. More data, more bandwidth, perhaps also more diversity within research disciplines. When will we get there? If growth is truly exponential, it could be too soon :-) - we'd get more data before the technology is ready... Even with sub-exponential growth, we may have scalability issues - sure, we could store an exabyte today in a finite footprint, but could we afford it?

"Big Science" security goals are different - data is rarely personal, but may be commercially sensitive (e.g. MX of a protein) or embargoed for research reasons. Integrity is king, availability is the prince, and the third "classic" data security goal, confidentiality, not quite a pauper, but a baronet, maybe?! It's an oversimplified view but these are often the priorities. Big Science requires public funding so the data that comes out of it must be shared to maximise the benefit.

There's also the human side. With the AoD (Archer-on-Duty) looking after our storage systems, it does not seem likely we will ever have ten times more people for an order of magnitude more data. But then, comparing with when data was at the order-of-1PB scale, we haven't needed ten times more people than we had then either.

Agree? Disagree? Did I forget something?

23 March 2016

ZFS vs Hardware Raid System, Part II

This post focuses on other differences between a ZFS based software raid and a hardware raid system that could be important for use as a GridPP storage backend. In a later post, the differences in read/write rates will be tested more intensively.
For the current tests, the system configuration is the same as described previously.

The first test is: what happens if we just take a disk out of the current raid system?
In both cases, the raid gets rebuilt using the hot spare that was provided in the initial configuration. However, the times needed to fully restore redundancy are very different:

ZFS based raid recovery time: 3min
Hardware based raid recovery time: 9h:2min

For both systems, only the test files from the previous read/write tests were on disk; the hardware raid was re-initialized to remove the filesystem corrupted in the failure test, and the test files were then recreated. Neither system was doing anything else during the recovery period.

The large difference is due to the fact that ZFS is raid system, volume manager, and file system all in one. ZFS knows about the structure on disk and the data that was on the broken disk, so it only needs to restore what was actually in use - in the above test case just 1.78GB.
The hardware raid system, on the other hand, knows nothing about the filesystem or the space really used on a disk, so it has to restore the whole disk even if, as in our test case, it is nearly empty - that's 2TB to restore while only about 2GB is actually useful data! This difference will become even more important in the future as the capacity of a single disk grows.

ZFS is now also used on one of our real production systems behind DPM. The zpool on that machine, consisting of 3 x raidz2 vdevs with 11 x 2TB disks each, also had a real failed disk which needed to be replaced. There are about 5TB of data on that zpool, and restoring full redundancy took about 3h while the machine was still in production and used by jobs to request or store data. This is again much faster than a hardware raid based system would manage, even if that machine were doing nothing else in the meantime but restoring redundancy.
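
For reference, the manual part of such a replacement is a single command; a rough sketch, assuming the pool is called tank as in our test setup and the failed and replacement disks appear as /dev/sdo and /dev/sdq (hypothetical device names):

zpool replace tank /dev/sdo /dev/sdq   # resilver onto the new disk (not needed if a hot spare takes over automatically)
zpool status tank                      # shows the resilver progress - only allocated data is copied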

Another large difference between the two systems is the time needed to set up a raid configuration.
For the hardware based raid, one first needs to create the configuration in the controller, initialize the raid, then create partitions on the virtual disk, and finally format the partitions to put a file system on them. When all that is finished, the newly formatted partitions need to be added to /etc/fstab so they are mounted when the system starts, mount points need to be created, and then the partitions can be mounted. That's a very long process which takes up a lot of time before the system can be used.
To give an idea, formatting a single 8TB partition with ext4 took
on the H700: about 1h:11min
on the H800: about 34min

In our configuration, 24x2TB on the H800 and 12x2TB on the H700, the formatting alone would take about 6h (if done one partition after another)!
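
For comparison, here is a sketch of the manual steps on the hardware raid side for a single 8TB partition (the device name and mount point are made up for the example):

# carve an 8TB partition out of the virtual disk presented by the controller (GPT is needed above 2TB)
parted -s /dev/sdc mklabel gpt
parted -s /dev/sdc mkpart primary 0% 8TB
# format it - this is the step that takes between half an hour and more than an hour
mkfs.ext4 /dev/sdc1
# create the mount point, make the mount persistent, and mount it
mkdir -p /mnt/gridstorage01
echo "/dev/sdc1 /mnt/gridstorage01 ext4 defaults 0 2" >> /etc/fstab
mount /mnt/gridstorage01

Repeating this for every 8TB partition is where the roughly 6h quoted above comes from.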

For ZFS, we still need to create a raid0 for each disk separately in the H700 and H800, since neither controller supports JBOD. However, once this is done, it's very easy and fast to create a production ready raid system. There is a single command which does everything:
zpool create NAME raidz2 /dev/... /dev/... ..... spare /dev/....
After that single command, the raid system is created, formatted for use, and mounted under /NAME (the mount point can be changed via options when the zpool is created, or later). There is no need to edit /etc/fstab, and the whole setup takes less than 10s.
To have single 8TB "partitions", one can create further zfs filesystems within this pool and set a quota on each, like
zfs create -o refquota=8TB NAME/partition1
After that command, a new ZFS filesystem is created with a quota which makes sure that tools like "df" only see 8TB available for use, and it is mounted under /NAME/partition1 - again no need to edit /etc/fstab, and the setup takes just a second or two.
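
Putting the two together for something like the H700 layout from our tests (one raidz2 of 11 data disks plus a hot spare; the device names below are of course made up):

# create the pool: redundancy, filesystem and mount under /tank in one step
zpool create tank raidz2 /dev/sdb /dev/sdc /dev/sdd /dev/sde /dev/sdf /dev/sdg \
    /dev/sdh /dev/sdi /dev/sdj /dev/sdk /dev/sdl spare /dev/sdm
# an 8TB "partition" as a child filesystem with a quota, mounted under /tank/partition1
zfs create -o refquota=8TB tank/partition1
# check what was created
zfs list -r tank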

Another important consideration is what happens to the data if parts of the system fail.
In our case, we have 12 internal disks on a H700 controller and 2 MD1200 devices with 12 disks each connected through a H800 controller, and 17 of the 2TB disks used so far in the MD devices are to be replaced by 8TB disks. Different setups are possible, and it's interesting to see what happens in each case if a controller or one of the MD devices fails.
The scenarios below assume that the system is used as a DPM server in production.

Possibility 1: 1 raid6/raidz2 for all 8TB disks, 1 raid6/raidz2 for all 2TB disks on the H800, and 1 raid6/raidz2 for all 2TB disks on the H700

What happens if the H700 controller fails and can't be replaced (soon)?

Hardware raid system
The 2 raid systems on the H800 would still be available. The associated filesystems could be drained in DPM, and the disks on the H700 controller could then be swapped with the disks in the MD devices associated with the drained file systems. However, whether the data on those disks would be usable depends a lot on the compatibility of the on-disk format used by different raid controllers. And even if the raid could be recreated on a different controller without any data loss (which is probably not possible), it would appear to the OS as a different disk, and all the mount points would need to be changed to make the data available again.

ZFS based raid system
Again, the 2 raid systems associated with the H800 would still be available, and the pool of 8TB disks could be drained. Then the disks on the failed H700 could simply be swapped with the 8TB disks, and the zpool would be available again - mounted under the same directories as before, no matter in which bays the disks sit or which controller they are on.
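
This works because ZFS identifies pool members by labels written on the disks themselves rather than by controller, bay, or device path; roughly, assuming the pool that was on the failed H700 is called pool700 (a hypothetical name):

# the pool could not be exported cleanly before the controller died, so the import may need to be forced
zpool import -f pool700
zpool status pool700    # all vdevs should come back ONLINE, mounted under the same directories as before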

What happens if one of the MD devices fails?

Hardware raid system
Since one MD device has only 12 slots and we need to put in 17 x 8TB disks, one raid system consists of disks from both MD devices. If either MD device fails, the data on all the 8TB disks is lost. If the MD device holding the remaining 2TB disks fails, it depends on whether the raid on those 2TB disks can be recognized by the H700 controller. If so, the data on the disks associated with the H700 could be drained and at least the data from the 2TB disks of the failed MD device could be restored, although with some manual configuration changes in the OS, since DPM needs the data mounted in the previously used directories.

If anyone has experience of using disks from a raid created on one controller in another controller of a different kind, I would appreciate it if you could share it in the comments section.


ZFS based raid system
If either MD device fails, the pool with the internal disks on the H700 could be drained, and those disks then swapped with the disks from the failed MD device. All data would be available again and none would be lost.

Possibility 2: one raidz2 for all 8TB disks (pool1) and one raidz2 for all 2TB disks (pool2)

This setup is only possible when using a ZFS based raid since it's not supported by hardware raid controllers.

If the H700 fails, then pool1 could be drained and its 8TB disks replaced with the 2TB disks which were connected to the H700; then all data would be available again.
It's similar when one of the MD devices fails. If the MD device with only 8TB disks fails, then pool2 can be drained and the disks on the H700 replaced with the 8TB disks from the failed MD device.
If the MD device with both 2TB and 8TB disks fails, it's a bit more complicated, but all data could be restored in a 2-step process:
1) Replace the 2TB disks on the H700 with the 8TB disks from the failed MD device; this makes pool1 available again, and it can then be drained.
2) Put the original 2TB disks that were connected to the H700 back in, and replace the 8TB disks on the working MD device with the remaining 2TB disks from the failed MD device, which makes pool2 available again.
In either case, no data would be lost.

22 March 2016

ZFS vs Hardware Raid

Due to the need to upgrade our storage space, and the fact that our machines have 2 raid controllers - one for the internal disks and one for the external disks - we tested the possibility of using a software raid instead of a traditional hardware based raid.
Since ZFS is the most advanced system in this respect, ZFS on Linux was tested for the purpose, and it proved to be a good choice here too.
This post describes the general read/write and failure tests; a later post will cover additional tests, such as rebuilding the raid when a disk fails, different failure scenarios, and setup and format times.
Please use the comments section if you would like other tests done too.


Hardware test configuration:

  1. DELL PowerEdge R510 
  2. 12x2TB SAS (6Gbps) internal storage on a PERC H700 controller
  3. 2 external MD1200 devices with 12x2TB SAS (6Gbps) on a PERC H800 controller
  4. 24GB RAM
  5. 2 x Intel Xeon E5620 (2.4GHz)
  6. for all settings in the raid controllers, the defaults were used for all tests, except for the cache, which was set to "write through"
ZFS test system configuration:
  1. SL6 OS
  2. ZFS based on the latest version available in the repository 
  3. no ZFS compression used
  4. 1xraidz2 + hotspare for all the disks on H700  (zpool tank)
  5. 1xraidz2 + hotspare for all the disks on H800  (zpool tank800)
  6. in both raid controllers each disk is defined as a single raid0 since they don't support JBOD, unfortunately
Hardware raid test system configuration:
  1. same machine with same disks, controllers, and OS used as for the ZFS test configuration
  2. 1xraid6 + hotspare for all the disks on H700
  3. 1xraid6 + hotspare for all the disks on H800 
  4. space was divided into 8TB partitions and formatted with ext4

Read/Write speed test



  • time (dd if=/dev/zero of=/tank800/test10G bs=1M count=10240 && sync)
  • time (dd if=/tank800/test10G of=/dev/null bs=1M && sync)
  • the first number in the results is given by "dd"
  • the time and the second number (in brackets) are derived from "time": total size divided by wall-clock time
  • the write test was done first for both controllers, and then the read tests
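
  As a quick sanity check, the bracketed figure is simply the file size divided by the wall-clock time reported by "time"; for example, for the first ZFS write on the H700 below:

  # 10240 MB written in 1min:02 = 62 seconds
  echo $((10240 / 62))    # prints 165, i.e. ~165MB/s, matching the number in brackets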


  • H700 results

    ZFS based:
    write: 236MB/s, 1min:02 (165MB/s)
    read:  399MB/s, 0min:27 (379MB/s)

    Hardware raid based:
    write: 233MB/s, 1min:10 (146MB/s)
    read:    1.2GB/s, 0min:18 (1138MB/s)

    H800 results

    ZFS based:
    write: 619MB/s, 0min:23 (445MB/s)
    read:  2.0GB/s, 0min:05 (2048MB/s)

    Hardware raid based:
    write: 223MB/s, 1min:13 (140MB/s)
    read:  150MB/s, 1min:12 (142MB/s)

    H700 and H800 mixed

    • 6 disks from each controller were used together in a combined raid configuration
    • this kind of configuration is not possible for a hardware based raid
    ZFS result:
    write: 723MB/s, 0min:37 (277MB/s)
    read:  577MB/s, 0min:18 (568MB/s)

    Conclusion

    • ZFS rates for the H800 based raid are much better than for the hardware raid based system
    • the large difference between ZFS and hardware raid based reads needs more investigation
      • repeating the same tests 2 more times gave results of the same order, however
    • the H800 shows much better performance than the H700 when using ZFS, but not in the hardware raid configuration

    Failure Test

    Here we tested what happens if a 100GB file (test.tar) is copied (with cp and rsync) from the H800 based raid to the H700 based raid and the system fails during the copy, simulated by a cold reboot through the remote console.

    ZFS result:

    [root@pool6 ~]# ls -lah /tank
    total 46G
    drwxr-xr-x.  2 root root    5 Mar 19 20:11 .
    dr-xr-xr-x. 26 root root 4.0K Mar 19 20:17 ..
    -rw-r--r--.  1 root root  16G Mar 19 19:07 test10G
    -rw-r--r--.  1 root root  13G Mar 19 20:12 test.tar
    -rw-------.  1 root root  18G Mar 19 20:06 .test.tar.EM379W

    [root@pool6 ~]# df -h /tank
    Filesystem      Size  Used Avail Use% Mounted on
    tank             16T   46G   16T   1% /tank
    [root@pool6 ~]# du -sch /tank
    46G     /tank
    46G     total

    [root@pool6 ~]# rm /tank/*test.tar*
    rm: remove regular file `/tank/test.tar'? y
    rm: remove regular file `/tank/.test.tar.EM379W'? y
    [root@pool6 ~]# du -sch /tank
    17G     /tank
    17G     total

    [root@pool6 ~]# ls -la /tank
    total 16778239
    drwxr-xr-x.  2 root root           3 Mar 19 20:21 .
    dr-xr-xr-x. 26 root root        4096 Mar 19 20:17 ..
    -rw-r--r--.  1 root root 17179869184 Mar 19 19:07 test10G
    • everything consistent
    • no file check needed at reboot
    • no problems at all occurred 
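
    To double-check consistency after such a crash, one could additionally ask ZFS to verify every block in the pool against its checksums; a rough sketch (not part of the test above, pool name as in the tests):

    zpool scrub tank     # re-read all data in the pool and verify the checksums
    zpool status tank    # shows the scrub progress and reports any errors found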

    Hardware raid based result:

    [root@pool7 gridstorage02]# ls -lhrt
    total 1.9G
    drwx------    2 root   root    16K Jun 26  2012 lost+found
    drwxrwx---   91 dpmmgr dpmmgr 4.0K Feb  4  2013 ildg
    -rw-r--r--    1 root   root      0 Mar  6  2013 thisisgridstor2
    drwxrwx---   98 dpmmgr dpmmgr 4.0K Aug  8  2013 lhcb
    drwxrwx---  609 dpmmgr dpmmgr  20K Aug 27  2014 cms
    drwxrwx---    6 dpmmgr dpmmgr 4.0K Nov 23  2014 ops
    drwxrwx---    6 dpmmgr dpmmgr 4.0K Mar 13 12:18 ilc
    drwxrwx---    9 dpmmgr dpmmgr 4.0K Mar 13 23:04 lsst
    drwxrwx---  138 dpmmgr dpmmgr 4.0K Mar 14 10:23 dteam
    drwxrwx--- 1288 dpmmgr dpmmgr  36K Mar 15 00:00 atlas
    -rw-r--r--    1 root   root   1.9G Mar 18 17:11 test.tar

    [root@pool7 gridstorage02]# df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sdb2             8.1T  214M  8.1T   1% /mnt/gridstorage02

    [root@pool7 gridstorage02]# du . -sch
    1.9G    .
    1.9G    total

    [root@pool7 gridstorage02]# rm test.tar 
    rm: remove regular file `test.tar'? y

    [root@pool7 gridstorage02]# du . -sch
    41M     .
    41M     total

    [root@pool7 gridstorage02]# df -h .
    Filesystem            Size  Used Avail Use% Mounted on
    /dev/sdb2             8.1T -1.7G  8.1T   0% /mnt/gridstorage02

    • the hardware raid based tests were done first, on a machine that was previously used as a dpm client, so the directory structure was left behind, but empty
    • during the reboot a file system check was done
    • "df" reports a different number for the used space than "du" and "ls"
    • after removing the file, the used space reported by "df" is negative
    • the file system is no longer consistent
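
    Getting out of this state normally means an offline filesystem check; something along these lines (device and mount point as in the output above) would be the usual next step, though it cannot bring back data that never made it to disk:

    umount /mnt/gridstorage02
    fsck.ext4 -f -y /dev/sdb2    # force a full check and repair of the ext4 filesystem
    mount /mnt/gridstorage02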

    Conclusion here:

    • for the planned extension (17 x 2TB disks exchanged for 8TB disks), the new disks should be placed in the MD devices and managed by the H800 using ZFS
    • a second zpool can be used for all the remaining 2TB disks (on the H700 and H800 together)
    • ZFS seems to handle system failures better
    To be continued...