29 December 2016

2016 - nice and boring?

We like GridPP services to be "nice and boring": running smoothly, with uneventful upgrades.

One cannot accuse the year 2016 of having been N&B, with plenty of interesting and extraordinary events (also) in science. Computing and physics also had their share of the seemingly extraordinary number of notable people we lost, such as Tomlinson and Minsky in computing; and arguably both Kibble and Rubin should have won Nobel prizes in physics (or, to be precise, shared the prizes that were awarded).

In 2016 we continued to deliver services for the LHC despite changing requirements, and also to support the much smaller non-LHC communities. With GridPP as a data e-infrastructure (and "data transfer zone"), we have also revisited connecting GridPP to other data infrastructures, and will continue this work in 2017.

GridFTP continues to be the popular and efficient workhorse of data transfers; xroot is also popular but mainly in high energy physics. Tier 2 sites are set to become SRM-less. Accounting will need more work in 2017; hopefully we can do this with the GLUE group and the EGI accounting experts. GridPP also looks forward to contributing to the EPSRC-funded Pathfinder pilot project which should eventually enable connecting DiRAC, eMedLab, and GridPP. So, perhaps, not N&B either.

21 December 2016

Comparative Datamanagementology

GridPP was well represented at the cloud workshop at Crick. (The slides and official writeup still have not appeared as of this blog post)

The general theme was hybrids, so it is natural to wonder whether it is useful to move data between infrastructures and what is the best way to do it. In the past we have connected, for example, NGS and GridPP (through SRB, using GridFTP), but there was not a strong need for it in the user community. However, with today's "UKT0" activities and more multidisciplinary e-infrastructures, perhaps the need for moving data across infrastructures will grow stronger.

xroot may be a bit too specialised, as it is almost exclusively used by HEP; GridFTP, on the other hand, is widely used by users of Globus and is the workhorse behind WAN transfers. (As an aside, we heard at the AARC meeting that Globus is pondering moving away from certificates towards a more OIDC-based approach - which would be new, as GridFTP has always required client certificate authentication.)

The big question is whether moving data between infrastructures is useful at all - will users make use of it? It is tempting to just upload the data to some remote storage service and share links to it with collaborators. Providing "approved" infrastructure for data sharing helps users avoid the pitfalls of inappropriate data management, but they still need tools to move the data efficiently, and to manage permissions. For example, EUDAT's B2STAGE was specifically designed to move data into and out of EUDAT (as EUDAT does not offer compute).

So far we have focused on whether it is possible at all to move data between the infrastructures, the idea being to offer users the ability to do so. The next step is efficiency and performance, as we saw with DiRAC where we had to tar the files up in order to make the transfer of small files more efficient, and to preserve ownerships, permissions, and timestamps. 
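As a rough illustration of the tarring approach (a generic sketch using ssh, not the GridFTP/FTS machinery actually used with DiRAC; host names and paths are placeholders), streaming a tar archive preserves ownerships, permissions and timestamps while avoiding per-file transfer overhead:

# On the source side, stream many small files as one tar archive to the destination;
# -p preserves permissions, --numeric-owner keeps ownership by uid/gid
tar -cpf - --numeric-owner -C /dirac/project/smallfiles . \
    | ssh gridpp-datanode.example.ac.uk \
      "tar -xpf - --numeric-owner -C /gridpp/incoming/smallfiles"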

16 December 2016

XRDCP and Checksums with DPM

To check whether a file transfer was successful, xrdcp is able to calculate a checksum at the destination. While this works well for plain xrootd installations, it does not currently work when used together with DPM. The reason seems to be that xrootd is only used on the disk servers and does not use the redirector component that would translate between logical and physical file names; this translation is done within DPM.
(If the reason it is not working is different, please add a comment.)

If a checksum is needed to verify a successful copy of the data, one way is to copy the file from the origin to the disk server first and then transfer it back to the origin, calculating the checksum on what was transferred back. That always works, but it is not very efficient since it can involve a lot of additional network traffic, especially at sites with a small number of storage servers but a large amount of compute resources, or when transferring files to distant sites. Some experiments implement this method as a fallback if xrdcp fails to return a checksum.
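A rough sketch of that fallback from the client side (assuming the xrdadler32 utility from the xrootd client packages is available; host and paths are placeholders):

SRC=/data/local/file.root
DEST=root://dpm-head.example.ac.uk//dpm/example.ac.uk/home/atlas/file.root

xrdcp "$SRC" "$DEST"                 # copy the file to the storage
xrdcp "$DEST" /tmp/file.check.root   # transfer it back again
# compare the adler32 checksums of the original and the returned copy
A=$(xrdadler32 "$SRC" | awk '{print $1}')
B=$(xrdadler32 /tmp/file.check.root | awk '{print $1}')
[ "$A" = "$B" ] && echo "checksums match" || echo "checksum mismatch"
rm -f /tmp/file.check.root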

While the DPM developers work on a built-in solution, there is another method that can be used in the meantime to calculate the checksum without any additional network traffic.
Xrootd provides a very flexible configuration interface; in particular, it allows an external program to be specified to calculate the checksum. This can be any executable, including a shell script.
To do so, add the following option to the xrootd config file on the disk servers:

xrootd.chksum adler32 /PATH/TO/SCRIPT.sh


where "adler32" specifies the used checksum algorithm and "/PATH/TO/SCRIPT.sh" specifies which script is used to calculate the checksum and where it is.
(make sure the script is executable)
Xrootd will also automatically pass the logical file name as a parameter to the script

In the script it is then possible to do the logical-to-physical file name lookup and calculate the checksum. To be able to do so, the DPM tools (found in dpm-contrib-admintools when using the EPEL repository) need to be installed at least on the DPM head node, and the machines running the script need a way to contact the head node to perform the lookup.
An example script that can be adapted to your own configuration can be found here.
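For illustration only, a minimal sketch of what such a helper might look like; the lfn_to_pfn_lookup command below is a hypothetical placeholder for whatever lookup tool your DPM installation provides, and xrdadler32 is assumed to be available from the xrootd client packages:

#!/bin/bash
# Called by xrootd as: SCRIPT.sh <logical file name>
LFN="$1"
# Hypothetical helper: translate the LFN into the physical path on this
# disk server, e.g. by querying the DPM head node (replace with your tool)
PFN=$(lfn_to_pfn_lookup "$LFN")
# xrdadler32 prints "<checksum> <file>"; xrootd expects just the checksum
xrdadler32 "$PFN" | awk '{print $1}'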


28 September 2016

Co-evolving data nodes

Bing! A mail comes in from our friends in the States saying: look! Here's someone in New Zealand who has set up an iRODS node to GridFTP data to/from their site. It is a very detailed document, yet it looks a lot like the DiRAC/GridPP data node document. They have independently solved many of the same problems we have solved.

The basic idea is to have a node outside your institute/organisation which can be used to transfer data to/from your datastore/cluster. With a GridFTP endpoint, you could move data with FTS (as we do with DiRAC), people can use Globus (used by STFC's facilities, for example), or data can be moved to/from other e-infrastructures (such as EUDAT's B2STAGE) or EGI. Regardless of the underlying storage, there will be common topics like security, monitoring, performance, how to (or not to) firewall it, how to make it discoverable, etc. It could be the data node in a Science DMZ.

The suggestion is that we (= GridPP, DiRAC, and in fact anyone else who is willing and able) contribute to a detailed writeup which can be published as an OGF document (open-access publishing for free, and fitting because GridFTP is an OGF protocol), either as community practice or as experiences - and then produce a less detailed paper which could be submitted to a conference or published in a journal.

22 September 2016

Upgrading and Expanding Lustre Storage (part4)

With a 1.5PB Lustre file system set up, we now need to transfer our data from the old Lustre system, conveniently also 1.5PB in size, before we can put the new one into production.

Migration of Data:

It was found that it was not possible to mount both Lustre 1.8 and 2.8 on the same client, so migration of data had to be done via rsync between two clients mounting the different Lustre file systems. Setting up an rsync daemon on the clients was found to be an order of magnitude quicker than using rsync over ssh for transferring data between the two clients. Hard links, ACLs and extended attributes are preserved by using the "-HAX" options when transferring data. Up to a dozen clients were used over the course of about six weeks to transfer 1.5PB of data between the old and new Lustre file systems. After the initial transfer the old and new systems were kept in sync with repeated rsync runs, remembering to use the "--delete" option to remove files that no longer existed on the live Lustre system. MD5 checksums were compared for a small random selection of files. The final transfer from the old to the new Lustre took about a day, during which the file system was unavailable to external users. Then all clients were updated to the new Lustre version.
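A rough sketch of the rsync daemon approach (module name, paths and host names are placeholders, not our actual configuration):

# On a client mounting the OLD Lustre: export it via an rsync daemon
cat > /etc/rsyncd.conf <<'EOF'
[oldlustre]
    path = /mnt/lustre_old
    uid = root
    gid = root
    read only = true
EOF
rsync --daemon

# On a client mounting the NEW Lustre: pull the data, preserving hard links,
# ACLs and extended attributes (-H -A -X on top of -a); on repeat runs
# --delete removes files that have disappeared from the live system
rsync -aHAX --delete rsync://old-client.example.ac.uk/oldlustre/ /mnt/lustre_new/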

Real World Experience:

With the new Lustre system put into production we then recommissioned the old system to create a 3PB Lustre file system. The grid cluster has about 4000 job slots in over 200 Lustre client compute nodes. The actual cluster is shown below. Note that the compute nodes fill the bottom 12U of every rack, where the air is cooler, and the storage sits above them in the next 24U.
Real-world performance over half a year, March to September 2016, is shown below. When all job slots are running grid "analysis" workloads, requiring access to data stored in Lustre, no slowdown in job efficiency was observed. An average of 4.8Gb/s is seen for reading data from Lustre and 1.6Gb/s for writing to Lustre (which is always done through StoRM).
However, in one case a local user simultaneously ran more than 1500 jobs, each accessing a very large number of small files (in this case bioinformatics data) on Lustre, and a slowdown in performance was observed. Once the user was limited to no more than 500 jobs, no further issues were seen. Accessing small files on the Lustre file system is expected to be inefficient [1] and should be avoided or limited where possible. A future Lustre enhancement is planned that will enable small files to be stored on the MDS, which should improve small-file performance [1].

The Queen Mary Grid site's major workload is for the ATLAS experiment, which keeps detailed statistics of site usage. We are responsible for processing about 2.5% of all ATLAS data internationally and about 20% of the data processed in the UK. Remote data transfer statistics are shown below. Over the last six months ATLAS has transferred 2.39PB of data into the cluster (top left plot); the weekly totals, with a maximum for one week of 340TB (an average of 4Gb/s), are also shown. The bottom plot shows that 2.3PB has been sent from Queen Mary to other grid sites around the world.

Future Plans:
  • Double the Storage of the cluster to 6PB in 2018.
  • Consider an upgrade to Lustre 2.9, which has bug LU1482 fixed and also provides additional functionality such as user and group ID mapping, which would allow the storage to be used in different clusters. However, Lustre 2.9 is SL/CentOS 7 only.
  • Upgrade OSS servers to SL/CentOS 7 from SL6. 
  • Examine the use of ZFS in place of hardware RAID, which might help mitigate the very long RAID rebuild times after replacement of a failed hard drive.
Conclusions:

Over the past four blog posts we have described a successful major upgrade of Lustre, including the specification, installation, configuration, migration of data, and operation of the hardware and software.

21 September 2016

Upgrading and Expanding Lustre Storage (part3)

In this post we will describe how we went about benchmarking and optimising our Lustre file system.

Performance Tuning:

A number of optimisations were made to improve the performance of the Lustre OSSs. To test these optimisations the IOzone [6] benchmarking program was used. IOzone performs a variety of read and write tests and is able to operate on a single server or on multiple clients at the same time.

First it is useful to have an estimate of the possible performance before undertaking benchmarking. The typical maximum sustained throughput of a single disk is quoted as approximately 200MB/s. For a 16-disk RAID 6 array the maximum sustained throughput of a single server is therefore expected to be 2.8GB/s (excluding the two parity disks). For a Lustre system made up of 20 Dell R730XDs, with 16 disks in each, this should scale to 56GB/s. However, each server is only connected with a 10Gb/s ethernet connection, so the maximum sustained throughput obtainable is 25GB/s.

To test a single server, IOzone was run with 12 threads (equal to the number of CPU cores), each transferring a 24GB file in chunks of 1024kB (iozone -e -+u -t 12 -r 1024k -s 24g -i0 -i1 -i 5 -i 8). As well as the standard sequential read and write tests, results were obtained for stride reads and for mixed workloads, which read and write a file with accesses made to random locations within it. The values were chosen to match the expected workload (i.e. the reading of large, gigabyte-sized files), to reduce caching effects, and to match the 1024k buffer size used in Lustre network transfers.
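For reference, the flags in that command have the following meanings (as documented by IOzone; the thread count and file size should be adjusted to your own hardware):

#   -e               include flush (fsync) in the timing
#   -+u              report CPU utilisation
#   -t 12            throughput mode with 12 threads (one per CPU core)
#   -r 1024k         1024kB record size, matching the Lustre transfer buffer
#   -s 24g           24GB file per thread, large enough to defeat caching
#   -i0 -i1 -i5 -i8  write/rewrite, read/reread, stride read, mixed workload
iozone -e -+u -t 12 -r 1024k -s 24g -i0 -i1 -i 5 -i 8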

Using the BeeGFS Tips and Recommendations for Storage Server Tuning [7] as a reference, we applied different sets of optimisations to the storage servers.

Optimisation 1
echo deadline > /sys/block/sdb/queue/scheduler
echo 4096 > /sys/block/sdb/queue/nr_requests
echo 4096 > /sys/block/sdb/queue/read_ahead_kb
echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/defrag

Optimisation 2 (or optimisation 3, whose values are given in brackets below), used in conjunction with optimisation 1, tunes the Linux file system caching which Lustre uses to help improve performance.
echo 5(1) > /proc/sys/vm/dirty_background_ratio
echo 10(75) > /proc/sys/vm/dirty_ratio
echo 262144 > /proc/sys/vm/min_free_kbytes
echo 50 > /proc/sys/vm/vfs_cache_pressure
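One possible way (an assumption on our part, not taken from the tuning guide) to make the vm settings of optimisation 1+3 persistent across reboots is via /etc/sysctl.conf, with the per-device settings reapplied at boot:

# Persist the virtual memory settings (values shown are those of optimisation 1+3)
cat >> /etc/sysctl.conf <<'EOF'
vm.dirty_background_ratio = 1
vm.dirty_ratio = 75
vm.min_free_kbytes = 262144
vm.vfs_cache_pressure = 50
EOF
sysctl -p

# The per-block-device settings live under /sys and are lost on reboot, so
# reapply them from /etc/rc.local (or a udev rule) for each data disk, e.g.:
echo deadline > /sys/block/sdb/queue/scheduler
echo 4096 > /sys/block/sdb/queue/nr_requests
echo 4096 > /sys/block/sdb/queue/read_ahead_kb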

To reduce RAID alignment complications the file system was made directly on the storage device (e.g. /dev/sdb), taking into account the RAID configuration (block size, stripe size and width). Lustre uses the EXT4 file system, although it is possible to use ZFS instead.

The results of six different IOzone tests on a single server with different optimisations are shown in the figure below (top). They clearly show the benefit of applying OS-level optimisations to improve file system performance. As optimisation 1+3 shows the highest throughput, it has been applied to the Lustre file system.

The single-server tests were carried out on each of the 20 R730XD servers as a cross check of performance and as a check for hardware issues; all servers were found to produce similar performance. A cross check of the single-server benchmark, for optimisation 1 only and with the storage servers limited to 2GB of RAM to remove caching effects, was also performed, and the results were found to be consistent with those presented here.

A near-complete 1.5PB Lustre file system with 20 Dell R730XD servers was created, with up to 24 client nodes dedicated to the benchmark tests. Lustre is set up such that individual files remain on a single OSS (i.e. there is no striping of files across OSSs). The well-known Lustre client tunings were included by default [1].
echo 256 > /proc/fs/lustre/osc/*/max_pages_per_rpc
echo 1024 > /proc/fs/lustre/osc/*/max_dirty_mb

For Lustre benchmarking with multiple clients, IOzone is run with the "-+m filename" option to specify the client nodes (iozone -+m iozone_client_list_file -+h [IP of master IOzone node] -e -+u -t 10 -r 1024k -s 24g -i0 -i1 -i 5 -i 8). The figure above (bottom) shows the benchmark results for different numbers of clients. Each client has a 10Gb/s network connection, so this sets the upper limit on storage performance until there are more than 20 clients (black solid line). As the number of clients increases, the performance first increases and then falls off for all but the initial write test. The maximum performance of the storage is seen with 18 clients. The anomalous reread result for 18 clients is reproducible and may be due to client-side caching effects. With 24 clients the mixed-workload performance is below that for 8 clients. The fall-off in performance for large numbers of active clients is probably due to contention for resources when seeking data on the file system, which would matter less for the initial write tests.
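For completeness, a sketch of the client list file used with "-+m" (three fields per line: client name, working directory on that client, and path to the iozone binary; the host names, paths and master IP below are placeholders):

cat > iozone_client_list_file <<'EOF'
client01.example.ac.uk /mnt/lustre_1/iozone /usr/bin/iozone
client02.example.ac.uk /mnt/lustre_1/iozone /usr/bin/iozone
client03.example.ac.uk /mnt/lustre_1/iozone /usr/bin/iozone
EOF
iozone -+m iozone_client_list_file -+h 10.0.0.100 \
       -e -+u -t 10 -r 1024k -s 24g -i0 -i1 -i 5 -i 8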

If we assume that a typical data analysis job uses 5MB/s and there is a maximum of 4000 job slots, then the complete Lustre system would need a throughput of 20GB/s for our cluster. The read performance measured for the benchmark Lustre system is of the order of 15-20GB/s. The performance of the full Lustre file system, including 20 R730XDs and 70 R510s, is expected to be at least double that of the benchmarked system. If the real-world workload is dominated by read-type workflows, as expected, then the full Lustre system should be able to provide the 20GB/s required.

NOTE: A number of network optimisations were deployed in production based on recommendations from the ESnet Fasterdata web site [8], for both data transfers within the cluster and those done over the WAN by StoRM; these have not been benchmarked.

For the final part of this story we will discuss the real world Lustre system we have had in production for over 6 months.

[6] IOzone: 

[7] BeeGFS Tips and Recommendations for Storage Server Tuning: http://www.beegfs.com/wiki/StorageServerTuning

[8] ESnet Fasterdata Knowledge Base: 

20 September 2016

Upgrading and Expanding Lustre Storage (part2)

In the last post I introduced Lustre and our history of its use at the Queen Mary Grid site, then discussed the motivation for and benefits of upgrading. In this post I will describe our hardware setup and the most important software configuration options.

Hardware Choice and Setup:

In order to reduce costs the existing Lustre OSS/OSTs, made up of 70 Dell R510s with 12 two- or three-TB hard disks in RAID 6, were reused, providing 1.5PB of usable storage. An additional 20 Dell R730XDs, with 16 six-TB disks in RAID 6, were also purchased, providing a further 1.5PB of usable storage and matching the size of the existing Lustre file system. The Dell R730XDs have two Intel E5-2609 V3 processors and 64GB of RAM; Lustre is a "light" user of CPU resources on the OSS/OST, and the E5-2609 is one of the cheapest CPUs available. Savings were also made by not using failover OSS/OST hardware, which reduced costs by 40%!

However, the new MDS/MDT was set up in a resilient, automatic-failover configuration using two Dell R630s connected to an MD3400 disk array. The Dell R630s have two Intel E5-2637 V3 processors and 256GB of RAM; the disk array has twelve 600GB 15K SAS disks in RAID 10. Only one MDS/MDT is used in the cluster, and the hardware was specified as high as affordable. The automatic failover was configured using the Corosync, Cman, Fence-agents and Red Hat resource group manager (rgmanager) packages. Lustre itself has protection against the MDT being mounted by more than one MDS at a time.
All servers (storage, compute and service nodes) are connected to one of seven top of rack Dell S4810 network switches with a single 10Gb SFP+ Ethernet connection, which in turn are connected with multiple 40Gb QSFP+ connections to a distributed core switch made up of two Dell Z9000s in a Virtual Lan Trunk (VLT) configuration (figure 1).
As a result of design choices and several years of evolution in hardware the network connections from storage and compute servers are mixed in the top of rack switches. This has the advantage of balancing power and network IO [4] but at the expense of a more complicated hardware layout. 
Figure 1. Schematic of the Queen Mary Grid Cluster hardware layout.
Software Setup:

The Lustre software was installed on a server with a standard SL6 OS configuration. A patch was applied to Lustre because of a bug, LU1482 [1], causing incorrect interaction between Access Control Lists (ACLs) and extended attribute permissions. Working extended attributes are required by StoRM, as they are used to store a checksum for every file which, after every GridFTP transfer, is compared between source and destination. This bug is to be fixed in the forthcoming 2.9 release of Lustre.
The Lustre manual [1] describes in detail how to setup and configure a Lustre system.
The MDT is formatted and mounted on the MDS using the commands below. On the MDS, add the "acl" option when mounting the MDT to ensure ACL and extended attribute support. For simplicity we install the Lustre ManaGement Server (MGS) on the MDS; the MGS will not be discussed further.

[root@mds05 ~]# mkfs.lustre --fsname=lustre_1 --mgs --mdt --servicenode=10.0.0.5 --servicenode=10.0.0.6 --index=0 /dev/mapper/mpathb
[root@mds05 ~]# cat /etc/fstab 
...
/dev/mapper/mpathb  /mnt/mdt lustre rw,noauto,acl,errors=remount-ro,user_xattr  0 0

On the OSS you need to specify each of the MGS/MDS nodes when you configure a Lustre OST. Once each file system has been mounted it becomes visible to Lustre.

[root@sn100 ~]# mkfs.lustre --fsname=lustre_1 --mgsnode=mds05@tcp0 --mgsnode=mds06@tcp0 --ost --index=0 /dev/sdb
[root@sn100 ~]# cat /etc/fstab 
...
/dev/sdb                /mnt/sdb                lustre  defaults        0 0

Lustre clients need to know about both MDS/MGS nodes when mounting Lustre in order to be able to fail over. On clients, Lustre is mounted as a standard POSIX file system of type lustre.

[root@cn200 ~]# cat /etc/fstab 
... 
mds05@tcp0:mds06@tcp0:/lustre_1 /mnt/lustre_1    lustre  flock,user_xattr,_netdev 0 0

The file system mounted on a client appears like any normal file system, just bigger!

[~]$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
mds05@tcp0:mds06@tcp0:/lustre_1  2.9P  2.1P  710T  75% /mnt/lustre_1

StoRM is used for remote data management for all Virtual Organisations (VOs) supported by the site and supports SRM, HTTP(S) and GridFTP. Most data is transferred via GridFTP, and three GridFTP nodes were found to be needed to provide the capacity to fully utilise the 20Gb/s WAN link. A standalone, read-only installation of XRootD is also deployed and is remotely usable by all site-supported VOs using standard Grid authentication.



19 September 2016

Upgrading and Expanding Lustre Storage (part1)


At the Queen Mary Grid site we are now running a Lustre file system of over 3PB using the most recent release (2.8). Lustre is an open source, POSIX compatible, clustered file system presented to the Grid using the StoRM Storage Resource Manager. Over the next few posts I would like to describe the recent major upgrade of the Lustre file system. I will: 
  • Introduce Lustre and our history of use at the Queen Mary Grid site and then discuss the motivation and benefits of upgrading; 
  • Describe our hardware setup and the most important software configuration options; 
  • Go into the testing and performance tuning of the file system as seen on the file server and the Lustre client; 
  • Finally I will outline the data migration procedure and real world performance we have seen.

Introduction:

The Queen Mary WLCG Tier-2 site has successfully operated a reliable, high-performance, efficient, budget-oriented storage solution, utilising Lustre [1], StoRM [2] and xrootd [3], since 2010 [4,5].
Lustre is an open-source (GPL), POSIX-compliant, parallel file system used in over half of the world's Top 500 supercomputers. Lustre is made up of three components: one or more Metadata Servers (MDS) connected to one or more Metadata Targets (MDT), which store the namespace metadata such as filenames, directories and access permissions; one or more Object Storage Servers (OSS) connected to one or more Object Storage Targets (OST), which store the actual files; and clients that access the data over the network using POSIX file system mounts. The network is typically either Ethernet or InfiniBand.
StoRM (STOrage Resource Manager) is a scalable, file-system-independent storage manager service (SRM). It supports standard access and transfer protocols such as HTTP(S), WebDAV and GridFTP, and is designed to work on top of any POSIX file system with Access Control List (ACL) support, such as Lustre.
Previously, the Lustre file system at Queen Mary has undergone an expansion from 300TB to 1.5PB, an upgrade of Lustre from version 1.6 to 1.8.x, a network upgrade from multiple 1Gb to 10Gb Ethernet, and a migration of the MDS and MDT to new hardware. This upgrade involves new hardware, a complete reinstallation of the OS and Lustre software on every storage server (MDS/OSS), and a migration of data from the old Lustre to the new.

Motivation for Upgrade:

Last year it was decided that a major software and hardware upgrade was required. This was driven by several needs: to upgrade the operating system (OS) from Scientific Linux (SL) 5 to a supported OS such as SL6 or CentOS 7; to use a supported Lustre version compatible with SL6 or CentOS 7; to take advantage of new software developments providing improved performance and reliability; to migrate to a new MDS/MDT with hardware under warranty; and to double the storage capacity to over 3PB, allowing for a further doubling before 2020.
Consideration was given to other open-source file systems such as Ceph and GlusterFS. However, it was decided early on that local knowledge and experience with Lustre, its maturity, reliability and performance, clear long-term development and support from Intel and others, and its POSIX support made Lustre the obvious choice.
It is possible to buy a commercially supported solution, but this was beyond the budget available. Therefore the specification, installation, configuration and operation of the hardware and software had to be done by the site team.

next post: Hardware Choices and Software Setup

Some Useful References:

[1] Lustre:

[2] StoRM:

[3] XrootD:

[4] CHEP2012:
C.J. Walker, D.P. Traynor and A.J. Martin, "Scalable Petascale Storage for HEP using Lustre", Journal of Physics: Conference Series 396 (2012) 042063

[5] CHEP2014:
C.J. Walker, D.P. Traynor, D.T. Rand, T.S. Froy and S.L. Lloyd, "Optimising network transfers to and from Queen Mary University of London, a large WLCG Tier-2 grid site", Journal of Physics: Conference Series 513 (2014) 062048

07 September 2016

(Cloud) storage plugfest

A quick note to advertise the coming SNIA/CloudWatch cloud storage "plugfest" - physically in Santa Clara, CA, US, but remote attendance is possible: www.cloudplugfest.org. 19-22 Sep (19-21 is the SNIA storage developer event itself.)

These events are often very interesting, bringing together different components - and standards - and making them work together. Submit your work on any or all of CDMI, OCCI, CIMI, TOSCA, XACML/SAML/X.509.

20 April 2016

ZFS compression for LHC experiments data

An interesting feature of ZFS is its support for transparent compression. Unlike typical file compression, ZFS compression works on the record/block size that it writes (which in ZFS is variable, depending on the data and the file size itself). Since a fast compression/decompression algorithm is needed to keep the overhead low compared to file access without compression, one cannot expect compression results similar to, for example, bzip at its highest compression level. Also, the data files of the LHC experiments are ROOT files, which already store data in a compressed format.

Therefore, I was not expecting much benefit from enabling compression on our servers, but since the newly implemented LZ4 algorithm has nearly no overhead even for non-compressible data, it shouldn't hurt to enable it - especially since our storage servers have dual CPUs with 12 cores each, which are idle most of the time.
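For reference, enabling compression and checking the resulting ratio only takes the standard commands (the same ones used in the setup post further down the page):

zfs set compression=lz4 tank-2TB     # enable lz4 on an existing pool/file system
zfs get compressratio tank-2TB       # achieved compression ratio so far
zpool list tank-2TB                  # overall allocation and capacity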

After enabling the default lz4 compression on four machines that had already been migrated to ZFS and copying data onto them, the first compression results look like this:


NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-2TB  32.5T  8.73T  23.8T         -    15%    26%  1.00x  ONLINE  -
tank-8TB   116T  24.0T  92.0T         -    10%    20%  1.00x  ONLINE  -

NAME                    PROPERTY       VALUE  SOURCE
tank-2TB                compressratio  1.03x  -
tank-2TB/gridstorage01  compressratio  1.03x  -
tank-2TB/gridstorage02  compressratio  1.03x  -
tank-2TB/gridstorage03  compressratio  1.03x  -
tank-2TB/gridstorage04  compressratio  1.03x  -
tank-8TB                compressratio  1.03x  -
tank-8TB/gridstorage01  compressratio  1.03x  -
tank-8TB/gridstorage02  compressratio  1.03x  -
tank-8TB/gridstorage03  compressratio  1.03x  -
tank-8TB/gridstorage04  compressratio  1.03x  -
tank-8TB/gridstorage05  compressratio  1.04x  -
tank-8TB/gridstorage06  compressratio  1.03x  -
tank-8TB/gridstorage07  compressratio  1.03x  -
tank-8TB/gridstorage08  compressratio  1.03x  -
tank-8TB/gridstorage09  compressratio  1.03x  -
tank-8TB/gridstorage10  compressratio  1.03x  -
tank-8TB/gridstorage11  compressratio  1.03x  -



NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-2TB  32.5T  8.45T  24.0T         -    11%    26%  1.00x  ONLINE  -
tank-8TB   116T  24.1T  91.9T         -     7%    20%  1.00x  ONLINE  -

NAME                    PROPERTY       VALUE  SOURCE
tank-2TB                compressratio  1.03x  -
tank-2TB/gridstorage01  compressratio  1.03x  -
tank-2TB/gridstorage02  compressratio  1.03x  -
tank-2TB/gridstorage03  compressratio  1.03x  -
tank-2TB/gridstorage04  compressratio  1.04x  -
tank-8TB                compressratio  1.03x  -
tank-8TB/gridstorage01  compressratio  1.03x  -
tank-8TB/gridstorage02  compressratio  1.03x  -
tank-8TB/gridstorage03  compressratio  1.03x  -
tank-8TB/gridstorage04  compressratio  1.03x  -
tank-8TB/gridstorage05  compressratio  1.03x  -
tank-8TB/gridstorage06  compressratio  1.03x  -
tank-8TB/gridstorage07  compressratio  1.03x  -
tank-8TB/gridstorage08  compressratio  1.03x  -
tank-8TB/gridstorage09  compressratio  1.03x  -
tank-8TB/gridstorage10  compressratio  1.03x  -
tank-8TB/gridstorage11  compressratio  1.03x  -


NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-4TB   127T  9.05T   118T         -     3%     7%  1.00x  ONLINE  -

NAME                    PROPERTY       VALUE  SOURCE
tank-4TB                compressratio  1.03x  -
tank-4TB/gridstorage01  compressratio  1.03x  -
tank-4TB/gridstorage02  compressratio  1.03x  -
tank-4TB/gridstorage03  compressratio  1.04x  -
tank-4TB/gridstorage04  compressratio  1.02x  -
tank-4TB/gridstorage05  compressratio  1.03x  -
tank-4TB/gridstorage06  compressratio  1.03x  -
tank-4TB/gridstorage07  compressratio  1.03x  -
tank-4TB/gridstorage08  compressratio  1.03x  -
tank-4TB/gridstorage09  compressratio  1.03x  -
tank-4TB/gridstorage10  compressratio  1.04x  -
tank-4TB/gridstorage11  compressratio  1.03x  -
tank-4TB/gridstorage12  compressratio  1.02x  -
tank-4TB/gridstorage13  compressratio  1.03x  -
tank-4TB/gridstorage14  compressratio  1.03x  -



NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-2TB  63.5T  15.4T  48.1T         -    11%    24%  1.00x  ONLINE  -

NAME                    PROPERTY       VALUE  SOURCE
tank-2TB                compressratio  1.03x  -
tank-2TB/gridstorage01  compressratio  1.03x  -
tank-2TB/gridstorage02  compressratio  1.04x  -
tank-2TB/gridstorage03  compressratio  1.03x  -
tank-2TB/gridstorage04  compressratio  1.03x  -
tank-2TB/gridstorage05  compressratio  1.03x  -
tank-2TB/gridstorage06  compressratio  1.03x  -
tank-2TB/gridstorage07  compressratio  1.03x  -


Although there is not much data stored on each of the machines so far, this shows we can still reduce the used disk space by a few percent - 2-4% here, depending on the file system and the data on it.
We have a bit more than 1PB of disk storage in total at our site, and the servers with 2TB disks provide about 50TB of usable storage each. If we can get 4% compression for all the data, that would mean gaining nearly the space provided by one of the 2TB-disk servers for free (4% of 1PB is about 40TB), without the cost of a new machine, power, or extra disks! And that's just with the default settings, while the compression level could also be tuned in ZFS...
This saving could be even bigger considering that in the future sites will also store more non-LHC data, for example for LSST, which may use a different and possibly uncompressed file format.
Another positive aspect of compression is that it reduces disk I/O, since fewer data blocks need to be read from disk.

It will be interesting to see what the compression ratio will be after all our servers have been switched over to ZFS.





11 April 2016

Setting up of a ZFS based storage server

As it was previously found that ZFS performs well in our use case - even better than the hardware RAID - new storage servers at our site to be used within GridPP will use ZFS as the storage file system in the future.
In this post, I will show how a server for that purpose can easily be set up. The previous posts, which also give details about the hardware used, can be found here, here, and here.

This storage server will be used for LHC data storage, which mostly consists of GB-sized files. When these data files are used as input for user jobs, the whole file is typically copied over to the local node where the user job runs. That means the configuration needs to handle large sequential reads and writes, but not small random block accesses.

The typical hardware configuration of  the storage servers we have is:
  • Server with PERC H700 and/or H800 hardware raid controller
  • 36 disk slots available
    • on some servers available through 3 external PowerVault MD-devices (3x12 disks)
    • on some servers available through 2 external PowerVault MD-devices (2x12 disks) and 12 internal storage disks
  • 10Gbps network interface
  • Dual-CPU (8 or 12 physical cores on each)
  • between 12GB and 64GB of RAM 
In this blog post, as before, I will describe the ZFS setup based on a machine with 12 internal disks (2TB disks on the H700) and 24 external disks (17x8TB + 7x2TB on the H800). The machine is already set up with SL6 and has the typical GridPP software (DPM clients, xrootd, httpd, ...) installed.

Preparing the disks

Since neither RAID controller supports JBOD, single-disk raid0 devices have to be created first. To find out which disks are available and can be used, omreport can be used:


[root@pool7 ~]# omreport storage pdisk controller=0|grep -E "^ID|Capacity"
ID                              : 0:0:0
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:1
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:2
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:3
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:4
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:5
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:6
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:7
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:8
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:9
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:10
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:11
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:12
Capacity                        : 278.88 GB (299439751168 bytes)
ID                              : 0:0:13
Capacity                        : 278.88 GB (299439751168 bytes)


The disks 0:0:12 and 0:0:13 are the system disks in a mirrored configuration and shouldn't be touched. The disks 0:0:0 to 0:0:11 can be converted to single raid0 using omconfig:
for i in $(seq 0 11); 
do 
  omconfig storage controller controller=0 action=createvdisk raid=r0 size=max pdisk=0:0:$i; 
done

The same procedure has to be repeated for the second controller.
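A possible way to do that (the controller ID 1 below is an assumption - check the output of omreport storage controller for the real ID, and exclude any system disks it reports):

for id in $(omreport storage pdisk controller=1 | awk '/^ID/ {print $3}');
do
  omconfig storage controller controller=1 action=createvdisk raid=r0 size=max pdisk=$id;
done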

After that, the disks are available to the system; to find out which are the 2TB and which are the 8TB disks, lsblk can be used:

[root@pool7 ~]# lsblk |grep disk
sda      8:0    0 278.9G  0 disk 
sdb      8:16   0   1.8T  0 disk 
sdc      8:32   0   1.8T  0 disk 
sdd      8:48   0   1.8T  0 disk 
sde      8:64   0   1.8T  0 disk 
sdf      8:80   0   1.8T  0 disk 
sdg      8:96   0   1.8T  0 disk 
sdh      8:112  0   1.8T  0 disk 
sdi      8:128  0   1.8T  0 disk 
sdj      8:144  0   1.8T  0 disk 
sdk      8:160  0   1.8T  0 disk 
sdl      8:176  0   1.8T  0 disk 
sdm      8:192  0   1.8T  0 disk 
sdn      8:208  0   7.3T  0 disk 
sdo      8:224  0   7.3T  0 disk 
sdp      8:240  0   7.3T  0 disk 
sdq     65:0    0   7.3T  0 disk 
sdr     65:16   0   7.3T  0 disk 
sds     65:32   0   7.3T  0 disk 
sdt     65:48   0   7.3T  0 disk 
sdu     65:64   0   7.3T  0 disk 
sdv     65:80   0   7.3T  0 disk 
sdw     65:96   0   7.3T  0 disk 
sdx     65:112  0   7.3T  0 disk 
sdy     65:128  0   7.3T  0 disk 
sdz     65:144  0   7.3T  0 disk 
sdaa    65:160  0   7.3T  0 disk 
sdab    65:176  0   1.8T  0 disk 
sdac    65:192  0   7.3T  0 disk 
sdad    65:208  0   1.8T  0 disk 
sdae    65:224  0   1.8T  0 disk 
sdaf    65:240  0   7.3T  0 disk 
sdag    66:0    0   1.8T  0 disk 
sdah    66:16   0   1.8T  0 disk 
sdai    66:32   0   7.3T  0 disk 
sdaj    66:48   0   1.8T  0 disk 
sdak    66:64   0   1.8T  0 disk 

/dev/sda is the system disk and shouldn't be touched, but all the other disks can be used for the storage setup.


ZFS installation

The current version of ZFS can be downloaded from the ZFS on Linux web page. Depending on the distribution used, there are also instructions on how to install ZFS through the package manager. In the worst case, one can download the source code and compile it on one's own system.

Since we use SL, which is RH based, we can follow the instructions provided on that page.
After installing ZFS through yum, the module needs to be loaded using modprobe zfs to continue without a reboot.
To have ZFS-based file systems automounted at system start, unfortunately the SELinux options also need to be changed: in the SELinux config file, in our case /etc/sysconfig/selinux, we need to change "SELINUX=enforcing" to at least "SELINUX=permissive".
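For example (a sketch; adjust the path if your SELinux config lives elsewhere):

# switch SELinux to permissive so ZFS file systems can be automounted at boot
sed -i 's/^SELINUX=enforcing/SELINUX=permissive/' /etc/sysconfig/selinux
setenforce 0    # apply immediately without a reboot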

This will probably be needed for as long as ZFS is not part of the RH distribution and is not recognised by SELinux as a valid file system. More about this issue can be found here.


ZFS storage setup

Now that we have the ZFS driver installed and the disks prepared, we can continue with setting up the storage pools.
In this example, we create two different storage pools - one for the 2TB disks and one for the 8TB disks - as a good compromise between possible IOPS and available space. For ZFS it doesn't matter whether the disks within one storage pool are connected through the same controller or through different ones, as in our case for the 2TB disks. For the configuration, we decided to use raidz2, which has two redundancy disks similar to raid6. Also, one disk of each kind will be used as a hot spare.
To do so, we need to find all the disks of a given kind in the system and create a storage pool for them using zpool create:

[root@pool7 ~]# lsblk |grep 1.8T
sdb      8:16   0   1.8T  0 disk 
sdc      8:32   0   1.8T  0 disk 
sdd      8:48   0   1.8T  0 disk 
sde      8:64   0   1.8T  0 disk 
sdf      8:80   0   1.8T  0 disk 
sdg      8:96   0   1.8T  0 disk 
sdh      8:112  0   1.8T  0 disk 
sdi      8:128  0   1.8T  0 disk 
sdj      8:144  0   1.8T  0 disk 
sdk      8:160  0   1.8T  0 disk 
sdl      8:176  0   1.8T  0 disk 
sdm      8:192  0   1.8T  0 disk 
sdab    65:176  0   1.8T  0 disk 
sdad    65:208  0   1.8T  0 disk 
sdae    65:224  0   1.8T  0 disk 
sdag    66:0    0   1.8T  0 disk 
sdah    66:16   0   1.8T  0 disk 
sdaj    66:48   0   1.8T  0 disk 
sdak    66:64   0   1.8T  0 disk 

[root@pool7 ~]# zpool create -f tank-2TB raidz2 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdab sdad sdae sdag sdah sdaj spare sdak



[root@pool7 ~]# lsblk |grep 7.3T
sdn       8:208  0   7.3T  0 disk 
sdo       8:224  0   7.3T  0 disk 
sdp       8:240  0   7.3T  0 disk 
sdq      65:0    0   7.3T  0 disk 
sdr      65:16   0   7.3T  0 disk 
sds      65:32   0   7.3T  0 disk 
sdt      65:48   0   7.3T  0 disk 
sdu      65:64   0   7.3T  0 disk 
sdv      65:80   0   7.3T  0 disk 
sdw      65:96   0   7.3T  0 disk 
sdx      65:112  0   7.3T  0 disk 
sdy      65:128  0   7.3T  0 disk 
sdz      65:144  0   7.3T  0 disk 
sdaa     65:160  0   7.3T  0 disk 
sdac     65:192  0   7.3T  0 disk 
sdaf     65:240  0   7.3T  0 disk 
sdai     66:32   0   7.3T  0 disk 
[root@pool7 ~]# zpool create -f tank-8TB raidz2 sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx sdy sdz sdaa sdac sdaf spare sdai


After the zpool create commands, the storage is set up in a RAID configuration, a file system is created on top of it, and it is mounted under /tank-2TB and /tank-8TB. No additional commands are needed and everything is available within seconds.
At this point the system looks like this:

[root@pool7 ~]# mount|grep zfs
tank-2TB on /tank-2TB type zfs (rw)
tank-8TB on /tank-8TB type zfs (rw)


[root@pool7 ~]# zpool status
  pool: tank-2TB
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank-2TB    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdi     ONLINE       0     0     0
            sdj     ONLINE       0     0     0
            sdk     ONLINE       0     0     0
            sdl     ONLINE       0     0     0
            sdm     ONLINE       0     0     0
            sdab    ONLINE       0     0     0
            sdad    ONLINE       0     0     0
            sdae    ONLINE       0     0     0
            sdag    ONLINE       0     0     0
            sdah    ONLINE       0     0     0
            sdaj    ONLINE       0     0     0
        spares
          sdak      AVAIL   

errors: No known data errors

  pool: tank-8TB
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank-8TB    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdn     ONLINE       0     0     0
            sdo     ONLINE       0     0     0
            sdp     ONLINE       0     0     0
            sdq     ONLINE       0     0     0
            sdr     ONLINE       0     0     0
            sds     ONLINE       0     0     0
            sdt     ONLINE       0     0     0
            sdu     ONLINE       0     0     0
            sdv     ONLINE       0     0     0
            sdw     ONLINE       0     0     0
            sdx     ONLINE       0     0     0
            sdy     ONLINE       0     0     0
            sdz     ONLINE       0     0     0
            sdaa    ONLINE       0     0     0
            sdac    ONLINE       0     0     0
            sdaf    ONLINE       0     0     0
        spares
          sdai      AVAIL   

errors: No known data errors


[root@pool7 ~]# zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-2TB  32.5T   153K  32.5T         -     0%     0%  1.00x  ONLINE  -
tank-8TB   116T   153K   116T         -     0%     0%  1.00x  ONLINE  -
[root@pool7 ~]# 
[root@pool7 ~]# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
tank-2TB   120K  28.0T  40.0K  /tank-2TB
tank-8TB   117K  97.7T  39.1K  /tank-8TB
[root@pool7 ~]# 
[root@pool7 ~]# df -h|grep tank
tank-2TB         28T     0   28T   0% /tank-2TB
tank-8TB         98T     0   98T   0% /tank-8TB

Setting additional filesystem properties

Since lz4 is available as a compression algorithm with a very small impact on performance, we can enable compression on our storage. This probably does not have a large impact for LHC data, but could lower the storage space needed for non-LHC experiments that will be supported in the near future.
In addition, the storage of xattrs will be changed to behave similarly to ext4.
Also, since we have a spare configured in each of our pools, we need to activate automatic replacement in failure cases, making it a hot spare. Another interesting feature of ZFS is the ability to grow the pool size when disks are replaced by new disks with a larger capacity; this only takes effect once all disks within one vdev have been replaced, but it can be done one by one over time.

[root@pool7 ~]# zfs set compression=lz4 tank-2TB
[root@pool7 ~]# zfs set compression=lz4 tank-8TB

[root@pool7 ~]# zpool set autoreplace=on tank-2TB
[root@pool7 ~]# zpool set autoreplace=on tank-8TB

[root@pool7 ~]# zpool set autoexpand=on tank-2TB
[root@pool7 ~]# zpool set autoexpand=on tank-8TB

[root@pool7 ~]# zfs set relatime=on tank-2TB
[root@pool7 ~]# zfs set relatime=on tank-8TB

[root@pool7 ~]# zfs set xattr=sa tank-2TB
[root@pool7 ~]# zfs set xattr=sa tank-8TB

Changing disk identification

Identifying disks by letters, like sdb or sdc, is easy to handle and fine for setting up a pool. However, the order in which disks are identified can change on a reboot, and will change if the disks have to be rearranged on the server, for example after replacing one of the external MD devices.
While in such cases ZFS should still be able to identify the disks belonging to the same pool and import it, it is better to identify disks by their disk IDs. To change this behaviour, we only need to export the pool and import it again using the disk IDs:
[root@pool7 ~]# zpool export -a
[root@pool7 ~]# zpool import -d /dev/disk/by-id tank-8TB
[root@pool7 ~]# zpool import -d /dev/disk/by-id tank-2TB

Making the space available to DPM

Traditionally, the available space on a large block device was divided into smaller parts by creating partitions with fdisk. In ZFS, however, this can be done directly on top of the just-created pool by creating additional file systems. All properties set on the top-level ZFS file system are inherited by the new file systems, so there is no need to set the compression or other properties again, unless one wants different properties than before.
A new file system is created using zfs create:

[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage01
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage02
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage03
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage04
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage05
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage06
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage07
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage08
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage09
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage10
[root@pool7 ~]# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank-2TB                 144K  28.0T  40.0K  /tank-2TB
tank-8TB                90.0T  7.67T  41.7K  /tank-8TB
tank-8TB/gridstorage01     9T  16.7T  39.1K  /tank-8TB/gridstorage01
tank-8TB/gridstorage02     9T  16.7T  39.1K  /tank-8TB/gridstorage02
tank-8TB/gridstorage03     9T  16.7T  39.1K  /tank-8TB/gridstorage03
tank-8TB/gridstorage04     9T  16.7T  39.1K  /tank-8TB/gridstorage04
tank-8TB/gridstorage05     9T  16.7T  39.1K  /tank-8TB/gridstorage05
tank-8TB/gridstorage06     9T  16.7T  39.1K  /tank-8TB/gridstorage06
tank-8TB/gridstorage07     9T  16.7T  39.1K  /tank-8TB/gridstorage07
tank-8TB/gridstorage08     9T  16.7T  39.1K  /tank-8TB/gridstorage08
tank-8TB/gridstorage09     9T  16.7T  39.1K  /tank-8TB/gridstorage09
tank-8TB/gridstorage10     9T  16.7T  39.1K  /tank-8TB/gridstorage10

[root@pool7 ~]# zfs create -o refreservation=7.66T tank-8TB/gridstorage11
[root@pool7 ~]# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank-2TB                 144K  28.0T  40.0K  /tank-2TB
tank-8TB                97.7T  9.91G  41.7K  /tank-8TB
tank-8TB/gridstorage01     9T  9.01T  39.1K  /tank-8TB/gridstorage01
tank-8TB/gridstorage02     9T  9.01T  39.1K  /tank-8TB/gridstorage02
tank-8TB/gridstorage03     9T  9.01T  39.1K  /tank-8TB/gridstorage03
tank-8TB/gridstorage04     9T  9.01T  39.1K  /tank-8TB/gridstorage04
tank-8TB/gridstorage05     9T  9.01T  39.1K  /tank-8TB/gridstorage05
tank-8TB/gridstorage06     9T  9.01T  39.1K  /tank-8TB/gridstorage06
tank-8TB/gridstorage07     9T  9.01T  39.1K  /tank-8TB/gridstorage07
tank-8TB/gridstorage08     9T  9.01T  39.1K  /tank-8TB/gridstorage08
tank-8TB/gridstorage09     9T  9.01T  39.1K  /tank-8TB/gridstorage09
tank-8TB/gridstorage10     9T  9.01T  39.1K  /tank-8TB/gridstorage10
tank-8TB/gridstorage11  7.66T  7.67T  39.1K  /tank-8TB/gridstorage11

Here a new property is set for each of the new file systems - refreservation - which reserves the specified space for that particular file system, making sure the space is guaranteed. This is different from setting a quota, which only imposes an upper limit on the space. However, to make sure the specified space is not exceeded in our case, a quota of the same size should also be specified. For the last file system in each pool, a larger quota could be specified to make sure that any space which can't be used by the other file systems due to their quota limits can still be used here.
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage01
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage02
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage03
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage04
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage01
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage02
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage03
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage04
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage05
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage06
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage07
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage08
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage09
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage10
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage11

After that, the storage setup is finished; the newly created file systems are already mounted and just need to be made available to the DPM user:
[root@pool7 ~]# chown -R dpmmgr:users /tank-2TB
[root@pool7 ~]# chown -R dpmmgr:users /tank-8TB

That was the last step needed on the storage server; the new file systems can now be added on the DPM head node like any other file system before.

ZFS configuration options

To customise the ZFS behaviour, two main config files are available - /etc/sysconfig/zfs and /etc/zfs/zed.d/zed.rc. I will not go into detail about these two files here, but if you want to set up your own ZFS-based storage then have a look here. The options within are mainly self-explanatory; for example, you can specify where to send email about disk problems and under which circumstances.




As a final section, I want to mention two very useful commands - zpool history and zpool iostat.
With the first, one can display all the commands that were run against a zpool since its creation, together with a timestamp. This can be very useful for error analysis and also for repeating a configuration on another server.
[root@pool6 ~]# zpool history tank-2TB
History for 'tank-2TB':
2016-04-05.11:27:43 zpool create -f tank-2TB raidz2 sdd sdf sdg sdi sdj sdl sdm sdz sdaa sdab sdac sdad sdae sdaf sdag sdah sdai sdaj spare sdak
2016-04-05.11:40:32 zfs create -o refreservation=7TB tank-2TB/gridstorage01
2016-04-05.11:40:34 zfs create -o refreservation=7TB tank-2TB/gridstorage02
2016-04-05.11:40:39 zfs create -o refreservation=7TB tank-2TB/gridstorage03
2016-04-05.11:41:57 zfs create -o refreservation=6.97T tank-2TB/gridstorage04
2016-04-05.12:02:12 zpool set autoreplace=on tank-2TB
2016-04-05.12:02:17 zpool set autoexpand=on tank-2TB
2016-04-05.12:38:11 zpool export tank-2TB
2016-04-05.12:38:33 zpool import -d /dev/disk/by-id tank-2TB
2016-04-06.13:41:37 zfs set compression=lz4 tank-2TB
2016-04-07.14:36:37 zfs set relatime=on tank-2TB
2016-04-07.14:36:42 zfs set xattr=sa tank-2TB
2016-04-11.11:28:08 zpool scrub tank-2TB
2016-04-11.14:12:41 zfs set refquota=7T tank-2TB/gridstorage01
2016-04-11.14:12:43 zfs set refquota=7T tank-2TB/gridstorage02
2016-04-11.14:12:48 zfs set refquota=7T tank-2TB/gridstorage03
2016-04-11.14:12:59 zfs set refquota=7T tank-2TB/gridstorage04

The second command, zpool iostat, displays the current I/O on the pool, separately for read/write operations and for bandwidth. Information can be displayed for a given pool as a whole or for each disk within the pool.
The following example is taken from a server configured with 3 raidz2 vdevs while it was being drained using the dpm-drain command on the head node, with threads=1 and one drain per file system, resulting in 5 parallel drain commands running:

[root@pool5 ~]# zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        6.03T  54.0T    110     24  13.5M  1.33M
tank        6.03T  54.0T  4.74K      0   602M      0
tank        6.03T  54.0T  4.63K      0   589M      0
tank        6.03T  54.0T  4.62K      0   587M      0
tank        6.03T  54.0T  4.42K      0   561M      0
tank        6.03T  54.0T  5.27K      0   669M      0
tank        6.03T  54.0T  4.51K      0   573M      0
tank        6.03T  54.0T  4.47K      0   568M      0
tank        6.03T  54.0T  4.47K      0   568M      0
tank        6.03T  54.0T  4.26K      0   542M      0
tank        6.03T  54.0T  4.56K      0   579M      0
tank        6.03T  54.0T  4.82K      0   613M      0
tank        6.03T  54.0T  4.60K      0   585M      0
tank        6.03T  54.0T  4.73K      0   601M      0
tank        6.03T  54.0T  4.20K      0   533M      0
tank        6.03T  54.0T  4.52K      0   574M      0
tank        6.03T  54.0T  3.72K      0   473M      0
tank        6.03T  54.0T  3.80K      0   484M      0
tank        6.03T  54.0T  4.46K      0   567M      0
tank        6.03T  54.0T  5.16K      0   655M      0
tank        6.03T  54.0T  5.25K      0   667M      0

[root@pool5 ~]# zpool iostat -v 1
                                               capacity     operations    bandwidth
pool                                        alloc   free   read  write   read  write
------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                        5.87T  54.1T  4.77K      0   606M      0
  raidz2                                    1.96T  18.0T  1.60K      0   202M      0
    scsi-36a4badb044e936001e55b2111ca79173      -      -    329      0  21.2M      0
    scsi-36a4badb044e936001e55b2461fc3cb38      -      -    354      0  20.8M      0
    scsi-36a4badb044e936001e55b25520b22e10      -      -    337      0  21.3M      0
    scsi-36a4badb044e936001e55b2622171ff0b      -      -    333      0  21.3M      0
    scsi-36a4badb044e936001e55b26e2232640f      -      -    334      0  21.0M      0
    scsi-36a4badb044e936001e55b27d230ce0f6      -      -    333      0  21.2M      0
    scsi-36a4badb044e936001e55b293245ae11b      -      -    335      0  21.1M      0
    scsi-36a4badb044e936001e55b2b426603fbe      -      -    335      0  21.3M      0
    scsi-36a4badb044e936001e55b2c4274ec795      -      -    338      0  20.9M      0
    scsi-36a4badb044e936001e55b2d128122551      -      -    318      0  20.8M      0
    scsi-36a4badb044e936001e55b2f42a2e3006      -      -    342      0  21.3M      0
  raidz2                                    1.96T  18.0T  1.59K      0   203M      0
    scsi-36a4badb044e936001e830de6afdb1d8f      -      -    332      0  21.6M      0
    scsi-36a4badb044e936001e55b3082b59c1da      -      -    310      0  21.2M      0
    scsi-36a4badb044e936001e55b3142c0ac749      -      -    311      0  21.5M      0
    scsi-36a4badb044e936001e55b31f2cbeb648      -      -    319      0  21.8M      0
    scsi-36a4badb044e936001e55b44e3ecc77ea      -      -    313      0  21.7M      0
    scsi-36a4badb044e936001e55b33b2e6172a4      -      -    213      0  21.8M      0
    scsi-36a4badb044e936001e55b34c2f70184c      -      -    307      0  21.4M      0
    scsi-36a4badb044e936001e55b358301ee6a2      -      -    319      0  21.8M      0
    scsi-36a4badb044e936001e55b36530e4cb2a      -      -    331      0  21.8M      0
    scsi-36a4badb044e936001e55b3793218970b      -      -    325      0  21.8M      0
    scsi-36a4badb044e936001e55b38532cf8c68      -      -    324      0  21.8M      0
  raidz2                                    1.96T  18.0T  1.58K      0   201M      0
    scsi-36a4badb044e936001e55b39033741ccf      -      -    342      0  21.8M      0
    scsi-36a4badb044e936001e55b39b3421403b      -      -    323      0  21.4M      0
    scsi-36a4badb044e936001e55b3de3822569a      -      -    335      0  21.8M      0
    scsi-36a4badb044e936001e55b3eb38df509e      -      -    328      0  21.7M      0
    scsi-36a4badb044e936001e55b3f839a79c83      -      -    301      0  21.5M      0
    scsi-36a4badb044e936001e55b4023a46ae2a      -      -    325      0  21.7M      0
    scsi-36a4badb044e936001e55b40f3b0100cb      -      -    314      0  21.5M      0
    scsi-36a4badb044e936001e55b41d3bdd5a86      -      -    335      0  21.5M      0
    scsi-36a4badb044e936001e55b42b3cb239cd      -      -    324      0  21.8M      0
    scsi-36a4badb044e936001e55b4363d55d784      -      -    322      0  21.7M      0
    scsi-36a4badb044e936001e55b4413e09a9f1      -      -    331      0  21.8M      0
------------------------------------------  -----  -----  -----  -----  -----  -----


One important point to keep in mind is that all the properties set with zfs or zpool are stored within the file system and not in OS config files! That means that if the OS gets upgraded, one can simply do a zpool import and all properties - such as mount point, quota, reservations, compression and history - will instantly be available again. There is no need to manually touch any system config file, like /etc/fstab, to make the storage available. This is also true for other properties, like NFS sharing, but since it is not needed in our case I have not described it. To get an idea of what properties are available and what else one can do with ZFS, zfs get all and zpool get all are useful.