21 September 2016

Upgrading and Expanding Lustre Storage (part 3)

In this post we will describe how we went about benchmarking and optimising our Lustre file system.

Performance Tuning:

A number of optimisations were made to improve the performance of the Lustre OSSs. To test these optimisations the IOzone [6] benchmarking program was used. IOzone performs a variety of read and write tests and can operate on a single server or across multiple clients at the same time.

First it is useful to have an estimate of possible performance before undertaking benchmarking. The typical maximum sustained throughput of a single disk is quoted at approximately 200MB/s. For a 16 disk RAID 6 array, with two of the disks used for parity, the maximum sustained throughput for a single server is expected to be around 2.8GB/s (14 data disks x 200MB/s). For a Lustre system made up of 20 Dell R730XDs, with 16 disks in each, this should scale to 56GB/s. However, each server is only connected with a 10Gb/s ethernet connection, so the maximum sustained throughput obtainable is 25GB/s.
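
As a quick sanity check, the arithmetic behind these estimates can be written as a few lines of shell; the per-disk rate, disk counts and link speed are simply the assumptions stated above.
# Back-of-envelope throughput estimate (values assumed in the text above)
DISK_MBS=200               # sustained MB/s per disk
DATA_DISKS=$((16 - 2))     # 16-disk RAID 6, two parity disks
SERVERS=20
NIC_GBITS=10               # 10Gb/s ethernet per server
echo "Per server (disks):    $((DISK_MBS * DATA_DISKS)) MB/s"              # 2800 MB/s ~ 2.8GB/s
echo "All servers (disks):   $((DISK_MBS * DATA_DISKS * SERVERS)) MB/s"    # 56000 MB/s ~ 56GB/s
echo "All servers (network): $((NIC_GBITS * SERVERS * 1000 / 8)) MB/s"     # 25000 MB/s ~ 25GB/s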

To test a single server, IOzone was run with 12 threads (equal to the number of CPU cores), each transferring a 24GB file in chunks of 1024kB (iozone -e -+u -t 12 -r 1024k -s 24g -i0 -i1 -i 5 -i 8). As well as the standard sequential read and write tests, results were obtained for stride reads and for mixed workloads, which read and write a file with accesses made to random locations within the file. The values were chosen to match the expected workload (i.e. the reading of large, gigabyte-sized files), to reduce caching effects, and to match the 1024kB buffer size used in Lustre network transfers.
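
For reference, the same single-server invocation is shown below with each option annotated; the flag meanings follow the IOzone documentation and the test files are written to the current directory.
# Single-server IOzone run used above:
#   -e         include flush (fsync) in the timing
#   -+u        report CPU utilisation
#   -t 12      throughput mode, 12 threads (one per CPU core)
#   -r 1024k   1024kB record (chunk) size
#   -s 24g     24GB file per thread
#   -i 0/1/5/8 write+rewrite, read+reread, stride read, mixed workload
iozone -e -+u -t 12 -r 1024k -s 24g -i 0 -i 1 -i 5 -i 8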

Using the BeeGFS Tips and Recommendations for Storage Server Tuning [7] as a reference, we applied different sets of optimisations to the storage server.

Optimisation 1
# use the deadline I/O scheduler with a deeper request queue and larger read-ahead on the RAID device
echo deadline > /sys/block/sdb/queue/scheduler
echo 4096 > /sys/block/sdb/queue/nr_requests
echo 4096 > /sys/block/sdb/queue/read_ahead_kb
# only use transparent huge pages where an application explicitly requests them (madvise)
echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/enabled
echo madvise > /sys/kernel/mm/redhat_transparent_hugepage/defrag

Optimisation 2 (or alternatively 3), used in conjunction with optimisation 1, tunes the Linux file system caching which Lustre uses to help improve performance. The commands below give the optimisation 2 values, with the alternative optimisation 3 values shown in the comments.
echo 5 > /proc/sys/vm/dirty_background_ratio   # optimisation 3: 1
echo 10 > /proc/sys/vm/dirty_ratio             # optimisation 3: 75
echo 262144 > /proc/sys/vm/min_free_kbytes
echo 50 > /proc/sys/vm/vfs_cache_pressure

To reduce RAID alignment complications the file system was made directly on the storage device (e.g. /dev/sdb), taking into account the RAID configuration (block size, stripe size and width). Lustre uses an EXT4-based file system (ldiskfs), although it is possible to use ZFS instead.
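
As an illustration only (not the exact commands used here), formatting an OST directly on the block device with RAID-aware mkfs options might look like the following; the file system name, OST index, MGS node and RAID chunk size are all assumed values and must match the real RAID controller and Lustre configuration.
# Hypothetical example: format an OST directly on /dev/sdb with RAID-aligned ldiskfs options.
# Assumes a 16-disk RAID 6 (14 data disks), 256kB chunk size and 4kB blocks:
#   stride       = chunk size / block size = 256kB / 4kB = 64
#   stripe-width = stride * data disks     = 64 * 14     = 896
mkfs.lustre --ost --fsname=lustre --index=0 --mgsnode=mgs01@tcp \
    --mkfsoptions="-E stride=64,stripe-width=896" /dev/sdb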

The results of six different IOzone tests on a single server with different optimisations are shown in the figure below (top). The results clearly show the benefit of applying optimisations to the OS to improve file system performance. As optimisation 1+3 gives the highest throughput, this combination has been applied to the Lustre file system.

The single server tests were carried out for each of the 20 R730XD servers as a cross check of performance and as a check for hardware issues. All servers were found to produce similar performance. A further cross check of the single server benchmark, for optimisation 1 only, with the storage servers limited to 2GB of RAM to remove caching effects, was performed and the results were found to be consistent with those presented here.

A near complete 1.5PB Lustre file system with 20 Dell R730XD servers was created, with up to 24 client nodes dedicated to the benchmark tests. Lustre is set up such that individual files remain on a single OSS (i.e. there is no striping of files across OSSs). The well known Lustre client tunings were included by default [1].
echo 256 > /proc/fs/lustre/osc/*/max_pages_per_rpc   # 256 x 4kB pages = 1MB per RPC
echo 1024 > /proc/fs/lustre/osc/*/max_dirty_mb       # up to 1GB of dirty data cached per OSC
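
If the wildcard matches more than one OSC the shell redirect above becomes ambiguous; the same tunings can also be applied with lctl, which should be equivalent:
lctl set_param osc.*.max_pages_per_rpc=256
lctl set_param osc.*.max_dirty_mb=1024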

For Lustre benchmarking using multiple clients, IOzone is run with the "-+m filename" option to specify the client nodes (iozone -+m iozone_client_list_file -+h [IP of master IOzone node] -e -+u -t 10 -r 1024k -s 24g -i0 -i1 -i 5 -i 8). The figure above (bottom) shows the benchmark results for different numbers of clients. Each client has a 10Gb/s network connection, so this sets the upper limit of the storage performance until there are more than 20 clients (black solid line). As the number of clients increases the performance first increases and then falls off for all but the initial write test. The maximum performance of the storage is seen with 18 clients. The anomalous reread result for 18 clients is reproducible and may be due to client side caching effects. With 24 clients the mixed workload performance is below that for 8 clients. The fall off in performance for large numbers of active clients is probably due to contention for resources when seeking data on the file system; this would be less important for the initial write tests.
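
As an illustration, the client list file passed to -+m contains one line per client giving the client hostname, the test directory on the Lustre mount, and the path to the iozone executable on that client. The hostnames and paths below are made-up examples:
client01 /mnt/lustre/iozone_test /usr/bin/iozone
client02 /mnt/lustre/iozone_test /usr/bin/iozone
client03 /mnt/lustre/iozone_test /usr/bin/iozone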

If we assume that a typical data analysis job uses 5MB/s and there is a maximum of 4000 job slots, then a throughput of 20GB/s would be required from the complete Lustre system for our cluster. The read performance measured for the benchmark Lustre system is of the order of 15-20GB/s. The performance of the full Lustre file system, including 20 R730XDs and 70 R510s, is expected to be at least double that of the benchmarked system. If the real world workload is dominated by read type workflows, as is expected, then the full Lustre system should be able to provide the 20GB/s performance required.

NOTE: A number of network optimisations were deployed in production based on recommendations found on the ESnet fasterdata web site [8], both for data transfers within the cluster and for those done over the WAN by StoRM; these have not been benchmarked.
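
The exact settings deployed are not listed in this post, but a representative sketch of the kind of host tuning fasterdata recommends for 10Gb/s data transfer nodes looks like the following; the values are illustrative and should be checked against the current fasterdata guidance.
# Illustrative TCP tuning in the style of the fasterdata recommendations (not our exact values)
sysctl -w net.core.rmem_max=67108864                # maximum receive socket buffer (64MB)
sysctl -w net.core.wmem_max=67108864                # maximum send socket buffer (64MB)
sysctl -w net.ipv4.tcp_rmem="4096 87380 33554432"   # TCP receive autotuning up to 32MB
sysctl -w net.ipv4.tcp_wmem="4096 65536 33554432"   # TCP send autotuning up to 32MB
sysctl -w net.ipv4.tcp_congestion_control=htcp      # H-TCP congestion control for high-latency paths
sysctl -w net.core.netdev_max_backlog=30000         # deeper input queue for 10Gb/s NICs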

For the final part of this story we will discuss the real world Lustre system we have had in production for over 6 months.

[6] IOzone: http://www.iozone.org

[7] BeeGFS Tips and Recommendations for Storage Server Tuning: http://www.beegfs.com/wiki/StorageServerTuning

[8] ESnet Fasterdata Knowledge Base: https://fasterdata.es.net
