20 September 2016

Upgrading and Expanding Lustre Storage (Part 2)

In the last post I introduced Lustre and our history of using it at the Queen Mary Grid site, then discussed the motivation for and benefits of upgrading. In this post I will describe our hardware setup and the most important software configuration options.

Hardware Choice and Setup:

To reduce costs, the existing Lustre OSS/OST servers, 70 Dell R510s each with 12 two or three TB hard disks in RAID 6, were reused, providing 1.5 PB of usable storage. An additional 20 Dell R730XDs, each with 16 six TB disks in RAID 6, were purchased, providing a further 1.5 PB of usable storage and matching the size of the existing Lustre file system. The Dell R730XDs have two Intel E5-2609 V3 processors and 64 GB of RAM; Lustre is a light user of CPU resources on the OSS/OST, and the E5-2609 is one of the cheapest CPUs available. Further savings were made by not purchasing failover OSS/OST hardware, which reduced costs by 40%!
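As a quick sanity check, the usable capacity of the new servers follows from the figures above (RAID 6 sacrifices two disks per array; real formatted capacity is slightly lower):

```shell
# 20 x R730XD, each with 16 x 6 TB disks in RAID 6 (two parity disks per array)
echo "$(( 20 * (16 - 2) * 6 )) TB usable"   # prints "1680 TB usable", i.e. ~1.5 PB
```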

However, the new MDS/MDT was set up in a resilient, automatic failover configuration using two Dell R630s connected to an MD3400 disk array. The Dell R630s have two Intel E5-2637 V3 processors and 256 GB of RAM. The disk array has 12 600 GB 15K SAS disks in RAID 10. Only one MDS/MDT is used in the cluster, and the hardware was specified to the highest level we could afford. Automatic failover was configured using the Corosync, Cman, Fence-agents and Red Hat resource group manager (rgmanager) packages. Lustre itself has protection against the MDT being mounted by more than one MDS at a time.
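With rgmanager in place, service ownership can be inspected, and the MDT service moved between the two MDS nodes by hand when needed. A sketch using the standard rgmanager tools, assuming the Lustre service is registered under the name "mdt" (the service name is illustrative, not our actual configuration):

```shell
# Show cluster membership and which node currently owns each service
clustat

# Manually relocate the (hypothetical) "mdt" service to the standby MDS
clusvcadm -r mdt -m mds06
```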
All servers (storage, compute and service nodes) are connected to one of seven top of rack Dell S4810 network switches with a single 10Gb SFP+ Ethernet connection, which in turn are connected with multiple 40Gb QSFP+ connections to a distributed core switch made up of two Dell Z9000s in a Virtual Link Trunking (VLT) configuration (figure 1).
As a result of design choices and several years of hardware evolution, the network connections from storage and compute servers are mixed across the top of rack switches. This has the advantage of balancing power and network IO [4], at the expense of a more complicated hardware layout.
Figure 1. Schematic of the Queen Mary Grid Cluster hardware layout.
Software Setup:

The Lustre software was installed on servers running a standard SL6 OS. A patch was applied to Lustre for a bug, LU-1482 [1], that caused incorrect interaction between Access Control Lists (ACLs) and extended attribute permissions. This functionality is required by StoRM, which uses extended attributes to store a checksum for every file; after every GridFTP transfer, the checksums of source and destination are compared. This bug will be fixed in the future 2.9 release of Lustre.
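The post-transfer check amounts to comparing a checksum computed at the source against one computed at the destination. A minimal sketch of the idea using md5sum and ordinary temporary files (StoRM itself stores its checksums in extended attributes and chooses its own algorithm):

```shell
# Simulate a transfer and the source/destination checksum comparison
src=$(mktemp); dst=$(mktemp)
echo "example file contents" > "$src"
cp "$src" "$dst"                               # stand-in for a GridFTP transfer
sum_src=$(md5sum "$src" | awk '{print $1}')
sum_dst=$(md5sum "$dst" | awk '{print $1}')
if [ "$sum_src" = "$sum_dst" ]; then
  echo "checksums match"
else
  echo "checksum mismatch" >&2
fi
rm -f "$src" "$dst"
```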
The Lustre manual [1] describes in detail how to set up and configure a Lustre system.
The MDT is formatted and mounted on the MDS using the commands below. On the MDS, add the "acl" option when mounting the MDT to ensure support for ACLs and extended attributes. For simplicity we install the Lustre Management Server (MGS) on the MDS; the MGS will not be discussed further.

[root@mds05 ~]# mkfs.lustre --fsname=lustre_1 --mgs --mdt --servicenode= --servicenode= --index=0 /dev/mapper/mpathb
[root@mds05 ~]# cat /etc/fstab 
/dev/mapper/mpathb  /mnt/mdt lustre rw,noauto,acl,errors=remount-ro,user_xattr  0 0
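Because the fstab entry uses "noauto", the MDT is not mounted at boot; it is mounted explicitly on whichever MDS should be active (in our setup the failover software takes care of this):

```shell
# Mount the MDT on the active MDS only; Lustre refuses a second concurrent mount
mount /mnt/mdt
```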

When formatting an OST on an OSS, each of the MGS nodes (co-hosted on our MDSs) must be specified. Once each file system has been mounted it becomes visible to Lustre.

[root@sn100 ~]# mkfs.lustre --fsname=lustre_1 --mgsnode=mds05@tcp0 --mgsnode=mds06@tcp0 --ost --index=0 /dev/sdb
[root@sn100 ~]# cat /etc/fstab 
/dev/sdb                /mnt/sdb                lustre  defaults        0 0

Lustre clients need to know about both MDS/MGS nodes when mounting Lustre so that they can fail over. Lustre is mounted on clients as a standard POSIX file system of type lustre.

[root@cn200 ~]# cat /etc/fstab 
mds05@tcp0:mds06@tcp0:/lustre_1 /mnt/lustre_1    lustre  flock,user_xattr,_netdev 0 0

The file system mounted on a client appears as any normal file system, just bigger!

[~]$ df -h
Filesystem                       Size  Used Avail Use% Mounted on
mds05@tcp0:mds06@tcp0:/lustre_1  2.9P  2.1P  710T  75% /mnt/lustre_1

StoRM provides remote data management for all Virtual Organisations (VOs) supported by the site, with SRM, HTTP(S) and GridFTP interfaces. Most data is transferred via GridFTP, and three GridFTP nodes were found to be necessary to fully utilise the 20 Gb WAN link. A standalone, read-only installation of XRootD is also deployed and is usable remotely by all site-supported VOs using standard Grid authentication.
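Reading from the read-only XRootD service works like any other XRootD endpoint; a sketch with an illustrative hostname and path (not the site's real endpoint):

```shell
# Copy a file out of the read-only XRootD service (requires a valid Grid proxy)
xrdcp root://xrootd.example.ac.uk//lustre_1/somevo/data/file.root /tmp/file.root
```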
