Showing posts with label configuration. Show all posts

03 October 2019

Modern account mapping for a Ceph/Xrootd or Ceph/GridFTP service.

One advantage of the RAL Echo service having "gone first" in setting up a Ceph object store with direct connections to xrootd and gridftp services is that, now that we are doing the same thing at Scotgrid-Glasgow, we can try new things.

One such change for us is how we do account authorisation and mapping.

The RAL Echo system is deliberately conservative, and has a two-stage process:


  1. User DNs are mapped via a simple grid-mapfile to a specific account.
  2. That account name is then associated with a set of capabilities via an xrootd authdb file.

(These capabilities correspond to access permissions for a small number of ceph pools on the backend, usually one per VO.)

We know that works, but it's unwieldy - you need big grid-mapfiles full of DNs for all the users, and users are hard to map to more than one account.
Additionally, privacy and security concerns have led to VOMS server policies being tightened - it's hard or impossible to even request a list of member DNs for some VOs now.

It would be nice if we could do something more modern, using the VOMS extensions in the certificates. (It would be even nicer if we could, whilst we're doing this, call out to an ARGUS server for banning, as that's a cheap way to provide central banning for our SE.)

It turns out that we can do this, with the magic of a more than 6 year old technology from Nikhef called LCMAPS. The configuration below replaces the grid-mapfile parts of the RAL configuration - you still need the authdb part to map the resulting account names to the underlying capabilities.
(And in the magical world where we just pass capability tokens around, we can probably make this a single step mapping.)

Doing this needs a bit of work, but since we're already compiling our own version of xrootd, and our own gridftp-ceph plugin, a bit more compilation never hurts.

The underlying LCMAPS configuration we're using (in /etc/lcmaps/lcmaps.db) looks like this, with a bit of unique data obscured:

vomsmapfile2 = "lcmaps_voms_localaccount.mod"
              "-gridmap /etc/grid-security/voms-mapfile"

verifyproxynokey = "lcmaps_verify_proxy.mod"
          "--allow-limited-proxy"
          "--discard_private_key_absence"
          " -certdir /etc/grid-security/certificates"
pepc = "lcmaps_c_pep.mod"
            "--pep-daemon-endpoint-url https://ourargusserverauthzpoint"
            "--resourceid ourcephresourceid"
            "--actionid http://glite.org/xacml/action/execute"
            "--capath /etc/grid-security/certificates/"
            "--certificate /etc/grid-security/hostcert.pem"
            "--key /etc/grid-security/hostkey.pem"
            "--banning-only-mode"

good = "lcmaps_dummy_good.mod"
bad = "lcmaps_dummy_bad.mod"

mapping_pol:
verifyproxynokey -> pepc | bad
pepc -> vomsmapfile2 | bad
vomsmapfile2 -> good

Here, the vomsmapfile2 entry uses the lcmaps_voms_localaccount plugin to map by VOMS extension only, to a small number of accounts. So, our local services don't need to maintain a large and brittle grid-mapfile, or call out anywhere from a cron job to update it.

(The voms-mapfile is as simple as, for example:

/dteam* dteamaccount

to map all /dteam* VOMS extensions to the single dteamaccount )
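
To make the idea of mapping by VOMS extension concrete, here is a minimal Python sketch of a glob-style lookup against a voms-mapfile. It is only an illustration: the function names are made up, the "first matching line wins" rule is an assumption, and the real lcmaps_voms_localaccount plugin may match differently from Python's fnmatch.

```python
import fnmatch
import shlex


def parse_mapfile(text):
    """Return a list of (pattern, account) pairs from voms-mapfile text."""
    entries = []
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        parts = shlex.split(line)  # also copes with quoted patterns
        if len(parts) >= 2:
            entries.append((parts[0], parts[1]))
    return entries


def map_fqan(entries, fqans):
    """Return the account for the first FQAN that matches any pattern.

    Assumption for this sketch: FQANs are tried in order and the first
    matching mapfile line wins. Returns None if nothing matches.
    """
    for fqan in fqans:
        for pattern, account in entries:
            if fnmatch.fnmatchcase(fqan, pattern):
                return account
    return None


entries = parse_mapfile("/dteam* dteamaccount\n")
print(map_fqan(entries, ["/dteam/Role=NULL/Capability=NULL"]))
```

The point is that the file only has to list one pattern per VO (or per group/role), not one line per user DN.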

The pepc entry uses the lcmaps_c_pep plugin to call out to our local ARGUS server. Unlike for glExec on worker nodes, or CEs, the only thing we care about here is whether the ARGUS server returns a "Permit" or not. As a result, the policy on the local PAP in our ARGUS server has no obligations in it - in fact, including the local_environment_mapping obligation breaks our chain, since by design we don't have (or need) pool accounts on these servers. We still need to add a policy for the corresponding resourceid we pass, and remember to reload the config on the PDP and PEP afterwards.
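
For readers who haven't met the LCMAPS policy syntax before, the control flow of mapping_pol can be modelled in a few lines of Python. This is only a toy sketch of the rules as we understand them (in `a -> b | c`, run b if a succeeds and c if it fails; a module with no further rule ends the chain with its own result). The dictionary and function names are invented for illustration and none of this is LCMAPS code.

```python
# Toy model of the mapping_pol policy from lcmaps.db (not LCMAPS itself).
# module -> (next module on success, next module on failure or None)
POLICY = {
    "verifyproxynokey": ("pepc", "bad"),
    "pepc": ("vomsmapfile2", "bad"),
    "vomsmapfile2": ("good", None),
}


def run_policy(results, start="verifyproxynokey"):
    """Walk the policy and return (overall outcome, modules visited).

    `results` maps each real module name to True (succeeded) or False.
    The dummy modules "good" and "bad" always succeed / always fail.
    """
    visited = []
    module = start
    while True:
        visited.append(module)
        if module == "good":
            ok = True
        elif module == "bad":
            ok = False
        else:
            ok = results[module]
        rule = POLICY.get(module)
        next_module = None if rule is None else (rule[0] if ok else rule[1])
        if next_module is None:
            return ok, visited  # chain ends with this module's result
        module = next_module


# A user banned in ARGUS: the proxy verifies, but pepc fails.
print(run_policy({"verifyproxynokey": True, "pepc": False, "vomsmapfile2": True}))
```

Seen this way, ordering pepc before vomsmapfile2 means a banned user is rejected before any account mapping is attempted.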

So far, so easy (and all the packages needed are in UMD4 and easy to get).

Getting LCMAPS to work with the vanilla versions of globus-gridftp-server and xrootd is not completely trivial, however.

In gridftp's case: 


globus-gridftp-server is perfectly capable of interfacing with lcmaps, but all of the shipped versions in EPEL and UMD come without the necessary configuration to do so. (In particular, a set of environment variables needs to be present in the environment of the gridftp server daemon; without them, the configured LCMAPS will fail with odd errors about gridftp still being mapped to the root user.)

We can fix this with the addition of a /etc/sysconfig/globus-gridftp-server file containing:

export LCMAPS_DB_FILE=/etc/lcmaps/lcmaps.db
export LLGT_LIFT_PRIVILEGED_PROTECTION=1
export LLGT_RUN_LCAS=no 
export conf=/etc/gridftp.conf

where the LLGT_RUN_LCAS=no line also prevents the configured gridftp service from trying to load LCAS (which we don't need here - since banning is being farmed out to ARGUS).
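
Since the failure mode here is "the variables never reached the daemon", it can be worth checking what the running process actually sees. The sketch below parses the NUL-separated contents of /proc/<pid>/environ and reports which of the settings above are missing or different. The function names are illustrative, and finding the daemon's pid is left to you.

```python
# Settings we expect to find in the gridftp daemon's environment,
# copied from the sysconfig file above.
REQUIRED = {
    "LCMAPS_DB_FILE": "/etc/lcmaps/lcmaps.db",
    "LLGT_LIFT_PRIVILEGED_PROTECTION": "1",
    "LLGT_RUN_LCAS": "no",
}


def parse_environ(raw):
    """Turn the NUL-separated bytes of /proc/<pid>/environ into a dict."""
    env = {}
    for item in raw.split(b"\0"):
        if b"=" in item:
            key, _, value = item.partition(b"=")
            env[key.decode(errors="replace")] = value.decode(errors="replace")
    return env


def missing_settings(env, required=REQUIRED):
    """Return the sorted names of required variables that are unset or differ."""
    return sorted(name for name, value in required.items() if env.get(name) != value)


def check_pid(pid):
    """Check a running process (needs permission to read its environ file)."""
    with open("/proc/{}/environ".format(pid), "rb") as handle:
        return missing_settings(parse_environ(handle.read()))
```

An empty list from check_pid means all three variables are present with the expected values.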

We also need to install the lcas-lcmaps-gt4-interface rpm, which provides the glue to let gsi call out via LCMAPS.

Finally, we install the /etc/grid-security/gsi-authz.conf file to tell gridftp how to authenticate gsi stuff:

globus_mapping liblcas_lcmaps_gt4_mapping lcmaps_callout

(The more exciting thing with gridftp is getting the ceph and authdb stuff to work, about which more in another post)

In Xrootd's case: 

This needs a little more work: xrootd does not have an officially packaged security plugin for interfacing with lcmaps.

Luckily, however, OSG have done some sterling work on this (in fact, most of this blog post is based on their documentation, plus the nikhef LCMAPS docs), and there's a git repository containing a working xrootd-lcmaps plugin, here: https://github.com/opensciencegrid/xrootd-lcmaps.git

In order to build this, we also need the development libraries for the underlying technologies: voms-devel, lcmaps-devel and lcmaps-common-devel, as well as a host of globus libs that you probably already have installed (as well as the xrootd development headers, which we already have since we build xrootd locally too).

Building this, and installing the resulting libXrdLcmaps.so into a suitable place, we just need to add the following to our xrootd config for the externally visible service:

sec.protocol /opt/xrootd/lib64 gsi -certdir:/etc/grid-security/certificates \
                    -cert:/etc/grid-security/hostcert.pem \
                    -key:/etc/grid-security/xrd/hostkey.pem \
                    -crl:1 \
                    -authzfun:libXrdLcmaps.so \
                    -authzfunparms:lcmapscfg=/etc/lcmaps/lcmaps.db,loglevel=1,policy=mapping_pol \
                    -gmapopt:10 -gmapto:0

Here we configure the xrootd service to call out to the library we built. Unlike with gridftp, we have to specify which policy to use from the file (gridftp will use the only policy present if there's just one).
You'll notice that we need a second copy of the hostkey: the xrootd service doesn't run as the same user as the gridftp service, and gridftp won't let you have a hostkey which is accessible by more than one user. So we keep two copies, one for gridftp and one for xrootd.
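
A quick way to sanity-check each copy of the key is to confirm that it is owned by the service account and carries no group or other permission bits. This is a rough sketch only: the exact checks that gridftp and xrootd apply are not reproduced here, and the function name is made up.

```python
import os
import stat


def key_is_private(path, expected_uid):
    """True if `path` is owned by expected_uid and has no group/other access."""
    info = os.stat(path)
    group_other_bits = info.st_mode & (stat.S_IRWXG | stat.S_IRWXO)
    return info.st_uid == expected_uid and group_other_bits == 0
```

For example, key_is_private("/etc/grid-security/xrd/hostkey.pem", uid) with the numeric uid of the account that xrootd runs as.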

EXAMPLE

Once you configure your authdb for the capability mapping you're ready to go!

The LCMAPS logs below show what happens when I do a transfer with a voms-enabled proxy (using globus-url-copy in this case, but it's the same with xrdcp):


Oct  3 15:33:25 cephs02 globus-gridftp-server: lcmaps: Starting policy: mapping_pol
... (some certificate verification) ... 
Oct  3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_plugin_verify_proxy-plugin_run(): verify proxy plugin succeeded
Oct  3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_plugin_c_pep-plugin_run(): Using endpoint OURARGUSENDPPOINT, try #1
Oct  3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_plugin_c_pep-plugin_run(): c_pep plugin succeeded
Oct  3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_gridmapfile: Found mapping dteamaccount for "/dteam/*" (line 1)
Oct  3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_voms_localaccount-plugin_run(): voms_localaccount plugin succeeded
Oct  3 15:33:25 cephs02 globus-gridftp-server: lcmaps: lcmaps_dummy_good-plugin_run(): good plugin succeeded
Oct  3 15:33:25 cephs02 globus-gridftp-server: lcmaps: LCMAPS CRED FINAL: mapped uid:'xxx',pgid:'xxx',sgid:'xxx',sgid:'xxx'
Oct  3 15:33:25 cephs02 globus-gridftp-server: Callout to "LCMAPS" returned local user (service file): "dteamaccount"

and then we go into the gridftp.log for the authdb:

[344940] Thu Oct  3 15:33:25 2019 :: globus_l_gfs_ceph_send: started
[344940] Thu Oct  3 15:33:25 2019 :: globus_l_gfs_ceph_send: rolename is dteamaccount
[344940] Thu Oct  3 15:33:25 2019 :: globus_l_gfs_ceph_send: pathname: dteam:testfile1/
[344940] Thu Oct  3 15:33:25 2019 :: INFO globus_l_gfs_ceph_send: acc.success: 'RETR' operation  allowed
[344940] Thu Oct  3 15:33:25 2019 :: ceph_posix_stat64 : pathname = /dteam:testfile1

where our capabilities are checked (and the dteamaccount is, indeed, allowed to READ from objects in the dteam pool).
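
If you want to keep an eye on which accounts transfers are being mapped to, the "Callout" line in the log above is the easiest one to pick out. A small sketch (the regular expression is written against the one log line shown in this post, so treat it as an assumption that other versions format it the same way):

```python
import re

# Matches e.g.:
#   Callout to "LCMAPS" returned local user (service file): "dteamaccount"
CALLOUT_RE = re.compile(
    r'Callout to "LCMAPS" returned local user \([^)]*\): "([^"]+)"'
)


def mapped_accounts(log_lines):
    """Return the local account names found in LCMAPS callout log lines."""
    accounts = []
    for line in log_lines:
        match = CALLOUT_RE.search(line)
        if match:
            accounts.append(match.group(1))
    return accounts
```

Feeding it the lines of the syslog file gives one account name per successful mapping.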

19 June 2017

Hosting a large web-forum on ZFS (a case study)

Over the course of last weekend I worked with a friend on deploying zfs across their infrastructure.

Their infrastructure in this case is a popular website written in PHP and serving some 20,000+ users. They, like many GridPP sysadmins, use CentOS for their back-end infrastructure. However, because they are regularly a high-profile target for attacks, they have opted to run their systems on the latest kernel installed from ELRepo.
The infrastructure for this website is heavily Docker-oriented due to the (re)deployment advantages that this offers.
Due to problems with the complex workflow, SELinux has been set to permissive.

Data for the site was stored within a /data directory which stored both the main database for the site and files which are hosted by the site.
Prior to the use of zfs the storage used for this site was xfs.

The hardware used to run this site is a dedicated server with 8 Intel cores, 32GB RAM and 2 x 2TB disks managed by software RAID (mirror) and partitioned using LVM.

Installing zfs

Initially setting up ZFS couldn't have been easier. Install the correct rpm repo, update, install zfs and reboot:

yum update
yum install 
yum update
yum install zfs
reboot

Fixing zfs-dkms

As they are using the latest stable kernel, they opted to install zfs using dkms, which has pros and cons compared with the kmod install.

This unfortunately didn't work as it should have done (possibly due to a pending kernel update on reboot). After rebooting the following commands were needed to install the zfs driver:

dkms build spl/0.6.5.10
dkms build zfs/0.6.5.10
dkms install spl/0.6.5.10
dkms install zfs/0.6.5.10

This step triggered the rebuild and installation of the spl (solaris porting layer) and the zfs modules.
(Adding this to the initrd shouldn't be required but can probably be done as per usual once this has been built.)

Migrating data to ZFS

The initial step was to migrate the storage backend and main database for the site. The storage holds approximately 0.5TB of data, made up of numerous files with an average file size close to 1MB. The SQL database is approximately 50GB in size and contains most of the site data.

mv /data/webroot /data/webroot-bak
mv /data/sqlroot /data/sqlroot-bak
zpool create webrootzfs vgs/webrootzfs # pools are created with zpool, not zfs
zpool create sqlrootzfs vgs/sqlrootzfs
zfs set mountpoint=/data/webroot webrootzfs
zfs set mountpoint=/data/sqlroot sqlrootzfs
zfs set compression=lz4 webrootzfs
zfs set compression=lz4 sqlrootzfs
zfs set primarycache=metadata sqlrootzfs
zfs set secondarycache=none webrootzfs
zfs set secondarycache=none sqlrootzfs
zfs set recordsize=16k sqlrootzfs # Matches the db block size
rsync -avP /data/webroot-bak/ /data/webroot/ # trailing slash copies the contents, dotfiles included
rsync -avP /data/sqlroot-bak/ /data/sqlroot/

After migrating these the site was then brought back up for approximately 24hr and there were no performance problems observed.

The webroot data, which consists mainly of user-submitted files, reached a compression ratio of about 1.1x.
The SQL database reached a compression ratio of about 2.4x.
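
To put those two figures in perspective, a compression ratio r means the data takes 1/r of its logical size on disk, so the fraction saved is 1 - 1/r. A quick sketch, assuming the 0.5TB and 50GB quoted earlier are the uncompressed sizes (the function names are just for illustration):

```python
def on_disk_size(logical_size, ratio):
    """Approximate on-disk size for a given logical size and compression ratio."""
    return logical_size / ratio


def space_saved_fraction(ratio):
    """Fraction of space saved at a given compression ratio."""
    return 1 - 1 / ratio


for name, size_gb, ratio in (("webroot", 500, 1.1), ("sqlroot", 50, 2.4)):
    print(name, round(on_disk_size(size_gb, ratio), 1), "GB on disk,",
          round(100 * space_saved_fraction(ratio)), "% saved")
```

So the user-submitted files barely compress at all, while the database shrinks to well under half its logical size.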

Given the increased performance of the site due to this migration it was decided 24hr later to investigate migrating the main website itself rather than just the backend.

Setting up systemd

The following systemd services and targets were enabled but rebooting the system has not (yet) been tested.

systemctl enable zfs.target
systemctl enable zfs-mount
systemctl start zfs-mount
systemctl enable zfs-import-cache
systemctl start zfs-import-cache
 
systemctl enable zfs-share 
systemctl start zfs-share


Impact of using ZFS

For migrating the docker-hosted website itself, a nice solution already exists: the zfs storage driver for docker.

https://docs.docker.com/engine/userguide/storagedriver/zfs-driver/

After this was set up the site was brought back online and the performance improvement was notable.

Page load time for the site dropped from about 600ms to 300ms. That is a 50% drop in page load time entirely due to replacing the backend storage with zfs.
This was with the ARC cache running with a 95% hit rate.
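
For anyone wanting to reproduce the hit-rate figure: ZFS on Linux exposes the ARC counters in /proc/spl/kstat/zfs/arcstats as "name type data" rows, and the hit rate is hits / (hits + misses). A minimal sketch that works on the text of that file (the function names are invented here, and the numbers in the example are made up rather than taken from the site):

```python
def parse_arcstats(text):
    """Parse 'name type data' rows of arcstats text into {name: int}."""
    stats = {}
    for line in text.splitlines():
        fields = line.split()
        # Header rows either have a different field count or non-numeric data.
        if len(fields) == 3 and fields[2].isdigit():
            stats[fields[0]] = int(fields[2])
    return stats


def arc_hit_rate(stats):
    """Overall ARC hit rate as a fraction between 0 and 1."""
    total = stats["hits"] + stats["misses"]
    return stats["hits"] / total if total else 0.0


# Illustrative numbers only.
example = "name type data\nhits 4 950\nmisses 4 50\n"
print(arc_hit_rate(parse_arcstats(example)))
```

On a live system you would pass in open("/proc/spl/kstat/zfs/arcstats").read() instead of the example string.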

Problems Encountered

Unfortunately, after about 30min of running following the migration of the docker service to ZFS, the site fell over.
(Page load times increased to multiple seconds and the backend server load spiked.)

Upon initial inspection it was discovered that the zfs arc cache had dropped to 32M (almost absolute minimum) and the arc-reclaim process was consuming 100% of 1 CPU.

The ZFS arc cache maximum was increased to 10GB but the cache refused to grow.

echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max

Increasing the minimum forced the arc cache to grow; however, the arc-reclaim process was still consuming 1 CPU core.
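
The long number written to zfs_arc_max is simply the limit expressed in bytes (10 x 1024^3). The sketch below does the conversion and also formats a module options line; keeping such a line in a file like /etc/modprobe.d/zfs.conf is a common way to make the values persistent, but that file name and the helper names are assumptions here, not something the site was confirmed to use.

```python
GIB = 1024 ** 3  # bytes in one GiB


def gib_to_bytes(gib):
    """Convert a size in GiB to bytes, as expected by the zfs module parameters."""
    return gib * GIB


def modprobe_options(arc_min_gib, arc_max_gib):
    """Format a modprobe options line setting the ARC minimum and maximum."""
    return "options zfs zfs_arc_min={} zfs_arc_max={}".format(
        gib_to_bytes(arc_min_gib), gib_to_bytes(arc_max_gib)
    )


print(gib_to_bytes(10))
print(modprobe_options(4, 10))
```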

Fixing the Problems

A better workaround was to disable transparent hugepages using:


echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

This stopped the arc-reclaim process from consuming 100% CPU as well as triggering the arc cache to start regrowing.
(For the interested this has been reported: https://github.com/zfsonlinux/zfs/issues/4869)



Summary of Tweaks made

A summary of some of the optimizations applied to these pools:

# ZFS settings
zfs set compression=lz4 webrootzfs # Enable lz4 compression (fast, low overhead)
zfs set compression=lz4 sqlrootzfs # Enable lz4 compression (fast, low overhead)

zfs set primarycache=all webrootzfs # This is the default
zfs set primarycache=metadata sqlrootzfs # Don't store DB data in cache, metadata only
zfs set secondarycache=none webrootzfs # Not using l2arc
zfs set secondarycache=none sqlrootzfs # Not using l2arc
zfs set recordsize=16k sqlrootzfs # Matches the db block size


# Settings changed through /sys
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max # 10Gb max
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_min # 4Gb min

# repeat the following for /sys/block/sda and /sys/block/sdb
echo 4096 > /sys/block/sda/queue/nr_requests
echo 0 > /sys/block/sda/queue/iosched/front_merges
echo noop > /sys/block/sda/queue/scheduler
echo 150 > /sys/block/sda/queue/iosched/read_expire
echo 1500 > /sys/block/sda/queue/iosched/write_expire
echo 4096 > /sys/block/sda/queue/nr_requests
echo 4096 > /sys/block/sda/queue/read_ahead_kb
echo 1 > /sys/block/sda/queue/iosched/fifo_batch
echo 16384 > /sys/block/sda/queue/max_sectors_kb
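
Rather than repeating the echo lines by hand for each disk, the same settings can be applied in a loop. This is a sketch only: the values are copied from the list above, the function name is invented, and paths that don't exist are skipped and reported instead of raising an error (the iosched entries depend on which scheduler is active, so some may not be there).

```python
import os

# (path relative to /sys/block/<dev>, value), in the order used above.
QUEUE_SETTINGS = [
    ("queue/nr_requests", "4096"),
    ("queue/iosched/front_merges", "0"),
    ("queue/scheduler", "noop"),
    ("queue/iosched/read_expire", "150"),
    ("queue/iosched/write_expire", "1500"),
    ("queue/read_ahead_kb", "4096"),
    ("queue/iosched/fifo_batch", "1"),
    ("queue/max_sectors_kb", "16384"),
]


def apply_queue_settings(devices, sys_block="/sys/block", settings=QUEUE_SETTINGS):
    """Write each setting for each device; return the paths that were skipped."""
    skipped = []
    for dev in devices:
        for rel_path, value in settings:
            path = os.path.join(sys_block, dev, rel_path)
            if not os.path.exists(path):
                skipped.append(path)  # tunable not present for this device
                continue
            with open(path, "w") as handle:
                handle.write(value)
    return skipped
```

Run as root, apply_queue_settings(["sda", "sdb"]) covers both disks; the sys_block argument exists only so the function can be tried against a scratch directory first.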



Additionally for the docker-zfs pool:

zfs set primarycache=all zpool-docker
zfs set secondarycache=none zpool-docker
zfs set compression=lz4 zpool-docker

All docker containers built using this engine inherit these properties from the base pool zpool-docker; however, a remove/rebuild will be needed for existing containers to take advantage of settings such as compression.

26 January 2017

ZFS auto mount for CentOS7

When testing ZFS installs on servers running on CentOS7.3, it can happen that ZFS is not available after a restart. After some testing this seems to be related to systemd and probably affects other systemd Linux distributions too.

I tested installs of several different versions of ZFS on Linux. After looking into the system setup, I noticed that the ZFS systemd units are simply disabled by default. Doing the following solved the problem on the machines I tested:

systemctl enable zfs.target
systemctl start zfs.target
systemctl enable zfs-import-cache.service
systemctl enable zfs-mount.service
systemctl enable zfs-share.service 

This solved all auto mount issues for me on the CentOS systems.

Note: At least with the latest version, 0.6.5.8, one can also use the following command, as explained on the ZFSonLinux web page:

systemctl preset zfs-import-cache zfs-import-scan zfs-mount zfs-share zfs-zed zfs.target


Everyone who is upgrading to the latest version should also have a look at the ZFS on Linux web page, since the repository address has changed. The repository should normally be updated automatically, but if you haven't run any updates for some months, the new repository can't be picked up automatically.

11 April 2016

Setting up of a ZFS based storage server

As we previously found that ZFS performs well in our use case, even better than hardware raid, new storage servers at our site to be used within GridPP will use ZFS as their storage file system in the future.
In this post, I will show how a server for that purpose can easily be set up. The previous posts, which also mention details about the hardware used, can be found here, here, and here.

The purpose of this storage server is LHC data storage, which consists mostly of GB-sized files. When these data files are used as input for user jobs, typically the whole file is copied over to the local node where the user job runs. That means the configuration needs to deal with large sequential reads and writes, but not with small random block access.

The typical hardware configuration of  the storage servers we have is:
  • Server with PERC H700 and/or H800 hardware raid controller
  • 36 disk slots available
    • on some servers available through 3 external PowerVault MD-devices (3x12 disks)
    • on some servers available through 2 external PowerVault MD-devices (2x12 disks) and 12 internal storage disks
  • 10Gbps network interface
  • Dual-CPU (8 or 12 physical cores on each)
  • between 12GB and 64GB of RAM 
In this blog post, as I did before, I will describe the ZFS setup based on a machine with 12 internal disks (2TB disks on H700) and 24 external disks (17x8TB + 7x2TB on H800). The machine is already set up with SL6 and has the typical GridPP software (DPM clients, xrootd, httpd, ...) installed.

Preparing the disks

Since neither raid controller supports JBOD, single-disk raid0 devices have to be created first. To find out which disks are available and can be used, omreport can be used:


[root@pool7 ~]# omreport storage pdisk controller=0|grep -E "^ID|Capacity"
ID                              : 0:0:0
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:1
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:2
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:3
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:4
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:5
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:6
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:7
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:8
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:9
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:10
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:11
Capacity                        : 1,862.50 GB (1999844147200 bytes)
ID                              : 0:0:12
Capacity                        : 278.88 GB (299439751168 bytes)
ID                              : 0:0:13
Capacity                        : 278.88 GB (299439751168 bytes)


The disks 0:0:12 and 0:0:13 are the system disks in a mirrored configuration and shouldn't be touched. The disks 0:0:0 to 0:0:11 can be converted to single raid0 using omconfig:
for i in $(seq 0 11); 
do 
  omconfig storage controller controller=0 action=createvdisk raid=r0 size=max pdisk=0:0:$i; 
done

The same procedure has to be repeated for the second controller.

After that, the disks are available to the system and to find out which are the 2TB and which are the 8TB disks, lsblk can be used:

[root@pool7 ~]# lsblk |grep disk
sda      8:0    0 278.9G  0 disk 
sdb      8:16   0   1.8T  0 disk 
sdc      8:32   0   1.8T  0 disk 
sdd      8:48   0   1.8T  0 disk 
sde      8:64   0   1.8T  0 disk 
sdf      8:80   0   1.8T  0 disk 
sdg      8:96   0   1.8T  0 disk 
sdh      8:112  0   1.8T  0 disk 
sdi      8:128  0   1.8T  0 disk 
sdj      8:144  0   1.8T  0 disk 
sdk      8:160  0   1.8T  0 disk 
sdl      8:176  0   1.8T  0 disk 
sdm      8:192  0   1.8T  0 disk 
sdn      8:208  0   7.3T  0 disk 
sdo      8:224  0   7.3T  0 disk 
sdp      8:240  0   7.3T  0 disk 
sdq     65:0    0   7.3T  0 disk 
sdr     65:16   0   7.3T  0 disk 
sds     65:32   0   7.3T  0 disk 
sdt     65:48   0   7.3T  0 disk 
sdu     65:64   0   7.3T  0 disk 
sdv     65:80   0   7.3T  0 disk 
sdw     65:96   0   7.3T  0 disk 
sdx     65:112  0   7.3T  0 disk 
sdy     65:128  0   7.3T  0 disk 
sdz     65:144  0   7.3T  0 disk 
sdaa    65:160  0   7.3T  0 disk 
sdab    65:176  0   1.8T  0 disk 
sdac    65:192  0   7.3T  0 disk 
sdad    65:208  0   1.8T  0 disk 
sdae    65:224  0   1.8T  0 disk 
sdaf    65:240  0   7.3T  0 disk 
sdag    66:0    0   1.8T  0 disk 
sdah    66:16   0   1.8T  0 disk 
sdai    66:32   0   7.3T  0 disk 
sdaj    66:48   0   1.8T  0 disk 
sdak    66:64   0   1.8T  0 disk 

/dev/sda is the system disk and shouldn't be touched, but all the other disks can be used for the storage setup.
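
With this many disks it is easy to mistype a device name, so the grouping by size can also be done programmatically. The sketch below parses lsblk output in the column layout shown above (NAME, MAJ:MIN, RM, SIZE, RO, TYPE) and groups the disk names by their size string; the function name and the default of excluding sda are choices made for this example.

```python
from collections import defaultdict


def disks_by_size(lsblk_text, exclude=("sda",)):
    """Group disk names by their SIZE column, skipping excluded devices."""
    groups = defaultdict(list)
    for line in lsblk_text.splitlines():
        fields = line.split()
        # Expected columns: NAME MAJ:MIN RM SIZE RO TYPE [MOUNTPOINT]
        if len(fields) >= 6 and fields[5] == "disk" and fields[0] not in exclude:
            groups[fields[3]].append(fields[0])
    return dict(groups)


sample = """sda      8:0    0 278.9G  0 disk
sdb      8:16   0   1.8T  0 disk
sdn      8:208  0   7.3T  0 disk
sdab    65:176  0   1.8T  0 disk
"""
print(disks_by_size(sample))
```

The resulting lists can then be used to build the device arguments for each pool.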


ZFS installation

The current version of ZFS can be downloaded from the ZFS on Linux web page. Depending on the distribution used, there are also instructions on how to install ZFS through the package management system. In the worst case, one can download the source code and compile it on one's own system.

Since we use SL, which is RH based, we can follow the instructions provided on the page.
After the installation of ZFS through yum, the module needs to be loaded using modprobe zfs to continue without a reboot.
To have file systems based on zfs automounted at system start, unfortunately the selinux options also need to be changed. In the selinux config file, in our case at /etc/sysconfig/selinux, we need to change "SELINUX=enforcing" to at least "SELINUX=permissive".

This will probably be needed as long as zfs is not part of the RH distribution and is therefore not recognized by selinux as a valid file system. More about this issue can be found here.


ZFS storage setup

Now that we have the ZFS driver installed and the disks prepared, we can continue to set up the storage pools.
In this example, we create 2 different storage pools - one for the 2TB disks and one for the 8TB disks - as a good compromise between possible IOPS and available space. For ZFS, it doesn't matter if the disks within one storage pool are connected through the same controller or through different ones, as is the case for our 2TB disks. For the configuration, we decided to use raidz2, which has 2 redundancy disks, similar to raid6. Also, one disk of each kind will be used as a hot spare.
To do so, we need to find all disks of a given kind in the system and create a storage pool for these disks using zpool create:

[root@pool7 ~]# lsblk |grep 1.8T
sdb      8:16   0   1.8T  0 disk 
sdc      8:32   0   1.8T  0 disk 
sdd      8:48   0   1.8T  0 disk 
sde      8:64   0   1.8T  0 disk 
sdf      8:80   0   1.8T  0 disk 
sdg      8:96   0   1.8T  0 disk 
sdh      8:112  0   1.8T  0 disk 
sdi      8:128  0   1.8T  0 disk 
sdj      8:144  0   1.8T  0 disk 
sdk      8:160  0   1.8T  0 disk 
sdl      8:176  0   1.8T  0 disk 
sdm      8:192  0   1.8T  0 disk 
sdab    65:176  0   1.8T  0 disk 
sdad    65:208  0   1.8T  0 disk 
sdae    65:224  0   1.8T  0 disk 
sdag    66:0    0   1.8T  0 disk 
sdah    66:16   0   1.8T  0 disk 
sdaj    66:48   0   1.8T  0 disk 
sdak    66:64   0   1.8T  0 disk 

[root@pool7 ~]# zpool create -f tank-2TB raidz2 sdb sdc sdd sde sdf sdg sdh sdi sdj sdk sdl sdm sdab sdad sdae sdag sdah sdaj spare sdak



[root@pool7 ~]# lsblk |grep 7.3T
sdn       8:208  0   7.3T  0 disk 
sdo       8:224  0   7.3T  0 disk 
sdp       8:240  0   7.3T  0 disk 
sdq      65:0    0   7.3T  0 disk 
sdr      65:16   0   7.3T  0 disk 
sds      65:32   0   7.3T  0 disk 
sdt      65:48   0   7.3T  0 disk 
sdu      65:64   0   7.3T  0 disk 
sdv      65:80   0   7.3T  0 disk 
sdw      65:96   0   7.3T  0 disk 
sdx      65:112  0   7.3T  0 disk 
sdy      65:128  0   7.3T  0 disk 
sdz      65:144  0   7.3T  0 disk 
sdaa     65:160  0   7.3T  0 disk 
sdac     65:192  0   7.3T  0 disk 
sdaf     65:240  0   7.3T  0 disk 
sdai     66:32   0   7.3T  0 disk 
[root@pool7 ~]# zpool create -f tank-8TB raidz2 sdn sdo sdp sdq sdr sds sdt sdu sdv sdw sdx sdy sdz sdaa sdac sdaf spare sdai


After the zpool create commands, the storage is set up in a raid configuration, a file system is created on top of it, and it is mounted under /tank-2TB and /tank-8TB. No additional commands are needed and everything is available within seconds.
At this point the system looks like:

[root@pool7 ~]# mount|grep zfs
tank-2TB on /tank-2TB type zfs (rw)
tank-8TB on /tank-8TB type zfs (rw)


[root@pool7 ~]# zpool status
  pool: tank-2TB
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank-2TB    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdb     ONLINE       0     0     0
            sdc     ONLINE       0     0     0
            sdd     ONLINE       0     0     0
            sde     ONLINE       0     0     0
            sdf     ONLINE       0     0     0
            sdg     ONLINE       0     0     0
            sdh     ONLINE       0     0     0
            sdi     ONLINE       0     0     0
            sdj     ONLINE       0     0     0
            sdk     ONLINE       0     0     0
            sdl     ONLINE       0     0     0
            sdm     ONLINE       0     0     0
            sdab    ONLINE       0     0     0
            sdad    ONLINE       0     0     0
            sdae    ONLINE       0     0     0
            sdag    ONLINE       0     0     0
            sdah    ONLINE       0     0     0
            sdaj    ONLINE       0     0     0
        spares
          sdak      AVAIL   

errors: No known data errors

  pool: tank-8TB
 state: ONLINE
  scan: none requested
config:

        NAME        STATE     READ WRITE CKSUM
        tank-8TB    ONLINE       0     0     0
          raidz2-0  ONLINE       0     0     0
            sdn     ONLINE       0     0     0
            sdo     ONLINE       0     0     0
            sdp     ONLINE       0     0     0
            sdq     ONLINE       0     0     0
            sdr     ONLINE       0     0     0
            sds     ONLINE       0     0     0
            sdt     ONLINE       0     0     0
            sdu     ONLINE       0     0     0
            sdv     ONLINE       0     0     0
            sdw     ONLINE       0     0     0
            sdx     ONLINE       0     0     0
            sdy     ONLINE       0     0     0
            sdz     ONLINE       0     0     0
            sdaa    ONLINE       0     0     0
            sdac    ONLINE       0     0     0
            sdaf    ONLINE       0     0     0
        spares
          sdai      AVAIL   

errors: No known data errors


[root@pool7 ~]# zpool list
NAME       SIZE  ALLOC   FREE  EXPANDSZ   FRAG    CAP  DEDUP  HEALTH  ALTROOT
tank-2TB  32.5T   153K  32.5T         -     0%     0%  1.00x  ONLINE  -
tank-8TB   116T   153K   116T         -     0%     0%  1.00x  ONLINE  -
[root@pool7 ~]# 
[root@pool7 ~]# zfs list
NAME       USED  AVAIL  REFER  MOUNTPOINT
tank-2TB   120K  28.0T  40.0K  /tank-2TB
tank-8TB   117K  97.7T  39.1K  /tank-8TB
[root@pool7 ~]# 
[root@pool7 ~]# df -h|grep tank
tank-2TB         28T     0   28T   0% /tank-2TB
tank-8TB         98T     0   98T   0% /tank-8TB
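
The sizes reported above can be roughly cross-checked by hand: zpool list shows the raw capacity of all disks in the raidz2 vdev, while zfs list and df show the space left after two disks' worth of parity. The sketch below does that arithmetic for the 2TB pool, using the disk size in bytes from the omreport output. It is an upper-bound estimate only; it does not model raidz allocation overhead or reserved space, which is why the real figures (32.5T raw, 28.0T usable) come out a little lower.

```python
TIB = 1024 ** 4  # bytes in one TiB


def raidz_estimate(n_disks, disk_bytes, parity=2):
    """Return (raw TiB, usable TiB) for a single raidz vdev, ignoring overheads."""
    raw = n_disks * disk_bytes
    usable = (n_disks - parity) * disk_bytes
    return raw / TIB, usable / TIB


# tank-2TB: 18 disks in raidz2 (the 19th disk is the spare).
raw_tib, usable_tib = raidz_estimate(18, 1999844147200)
print(round(raw_tib, 1), round(usable_tib, 1))
```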

Setting additional filesystem properties

Since lz4 is available as a compression algorithm with a very small impact on performance, we can enable compression on our storage. This probably does not have a large impact on the storage of LHC data, but could reduce the space used by non-LHC experiments that will be supported in the near future.
In addition, the storage of xattr will be changed to behave similarly to ext4.
Also, since we have a spare configured in our pools, we need to activate automatic replacement in failure cases, making it a hot spare. Another interesting feature of ZFS is that the pool can grow if the disks are replaced by new disks with a larger capacity. This needs to be done for all disks within one vdev to have an effect, but can be done one by one over time.

[root@pool7 ~]# zfs set compression=lz4 tank-2TB
[root@pool7 ~]# zfs set compression=lz4 tank-8TB

[root@pool7 ~]# zpool set autoreplace=on tank-2TB
[root@pool7 ~]# zpool set autoreplace=on tank-8TB

[root@pool7 ~]# zpool set autoexpand=on tank-2TB
[root@pool7 ~]# zpool set autoexpand=on tank-8TB

[root@pool7 ~]# zfs set relatime=on tank-2TB
[root@pool7 ~]# zfs set relatime=on tank-8TB

[root@pool7 ~]# zfs set xattr=sa tank-2TB
[root@pool7 ~]# zfs set xattr=sa tank-8TB

Changing disk identification

Identifying disks by letters, like sdb or sdc, is easy to handle and good for setting up a pool. However, the order in which disks are identified can change on a reboot, and will also change if the disks need to be rearranged on the server, for example after replacing one of the external MD devices.
While in such cases ZFS should still be able to identify the disks belonging to the same pool and import the pool, it is better to use the disk IDs to identify disks. To change this behaviour, we only need to export the pool and import it using the disk IDs:
[root@pool7 ~]# zpool export -a
[root@pool7 ~]# zpool import -d /dev/disk/by-id tank-8TB
[root@pool7 ~]# zpool import -d /dev/disk/by-id tank-2TB
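
To see which persistent ID belongs to which sdX name (useful when a disk has to be located and replaced later), the symlinks in /dev/disk/by-id can be resolved. A small sketch, with an invented function name; the directory argument is there so it can be tried on a test directory as well:

```python
import os


def by_id_map(by_id_dir="/dev/disk/by-id"):
    """Map kernel device names (e.g. 'sdb') to the by-id links pointing at them."""
    mapping = {}
    for name in sorted(os.listdir(by_id_dir)):
        target = os.path.realpath(os.path.join(by_id_dir, name))
        mapping.setdefault(os.path.basename(target), []).append(name)
    return mapping
```

A disk usually has more than one link in that directory, so each device name maps to a list of IDs.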

Making the space available to DPM

Traditionally, the available space on a large vdev was divided into smaller parts by creating partitions with fdisk. In ZFS, however, this can be done directly on top of the pool just created. All properties set for the top-level ZFS file system are inherited by the new file systems too. There is no need to set the compression property or other properties again, unless one wants different properties than before.
A new file system is created using zfs create:

[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage01
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage02
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage03
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage04
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage05
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage06
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage07
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage08
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage09
[root@pool7 ~]# zfs create -o refreservation=9T tank-8TB/gridstorage10
[root@pool7 ~]# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank-2TB                 144K  28.0T  40.0K  /tank-2TB
tank-8TB                90.0T  7.67T  41.7K  /tank-8TB
tank-8TB/gridstorage01     9T  16.7T  39.1K  /tank-8TB/gridstorage01
tank-8TB/gridstorage02     9T  16.7T  39.1K  /tank-8TB/gridstorage02
tank-8TB/gridstorage03     9T  16.7T  39.1K  /tank-8TB/gridstorage03
tank-8TB/gridstorage04     9T  16.7T  39.1K  /tank-8TB/gridstorage04
tank-8TB/gridstorage05     9T  16.7T  39.1K  /tank-8TB/gridstorage05
tank-8TB/gridstorage06     9T  16.7T  39.1K  /tank-8TB/gridstorage06
tank-8TB/gridstorage07     9T  16.7T  39.1K  /tank-8TB/gridstorage07
tank-8TB/gridstorage08     9T  16.7T  39.1K  /tank-8TB/gridstorage08
tank-8TB/gridstorage09     9T  16.7T  39.1K  /tank-8TB/gridstorage09
tank-8TB/gridstorage10     9T  16.7T  39.1K  /tank-8TB/gridstorage10

[root@pool7 ~]# zfs create -o refreservation=7.66T tank-8TB/gridstorage11
[root@pool7 ~]# zfs list
NAME                     USED  AVAIL  REFER  MOUNTPOINT
tank-2TB                 144K  28.0T  40.0K  /tank-2TB
tank-8TB                97.7T  9.91G  41.7K  /tank-8TB
tank-8TB/gridstorage01     9T  9.01T  39.1K  /tank-8TB/gridstorage01
tank-8TB/gridstorage02     9T  9.01T  39.1K  /tank-8TB/gridstorage02
tank-8TB/gridstorage03     9T  9.01T  39.1K  /tank-8TB/gridstorage03
tank-8TB/gridstorage04     9T  9.01T  39.1K  /tank-8TB/gridstorage04
tank-8TB/gridstorage05     9T  9.01T  39.1K  /tank-8TB/gridstorage05
tank-8TB/gridstorage06     9T  9.01T  39.1K  /tank-8TB/gridstorage06
tank-8TB/gridstorage07     9T  9.01T  39.1K  /tank-8TB/gridstorage07
tank-8TB/gridstorage08     9T  9.01T  39.1K  /tank-8TB/gridstorage08
tank-8TB/gridstorage09     9T  9.01T  39.1K  /tank-8TB/gridstorage09
tank-8TB/gridstorage10     9T  9.01T  39.1K  /tank-8TB/gridstorage10
tank-8TB/gridstorage11  7.66T  7.67T  39.1K  /tank-8TB/gridstorage11

Here a new property, refreservation, is set for each of the new file systems. It reserves the specified space for that particular file system, so the space is guaranteed to be available to it. This is different from a quota, which only sets an upper limit on the space used. However, to make sure the specified space is not exceeded in our case, a quota of the same size should also be set. For the last file system in each pool, a larger quota can be specified, which makes sure that any space the other file systems cannot use due to their quota limits can be used here.
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage01
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage02
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage03
[root@pool7 ~]# zfs set refquota=7T tank-2TB/gridstorage04
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage01
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage02
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage03
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage04
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage05
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage06
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage07
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage08
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage09
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage10
[root@pool7 ~]# zfs set refquota=9T tank-8TB/gridstorage11
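
Typing the same command once per file system gets tedious with many pools. The following is a minimal Python sketch that only prints the zfs create and zfs set refquota lines used above, so they can be reviewed before being run; the function name and its parameters are made up for this illustration.

```python
def plan_filesystems(pool, count, reservation, last_reservation, quota):
    """Return the zfs commands for `count` file systems in `pool`.

    The last file system gets `last_reservation`, i.e. whatever space
    is left in the pool after the other reservations.
    """
    names = [f"{pool}/gridstorage{index:02d}" for index in range(1, count + 1)]
    commands = []
    # First pass: create the file systems with their guaranteed space
    for position, name in enumerate(names, start=1):
        size = last_reservation if position == count else reservation
        commands.append(f"zfs create -o refreservation={size} {name}")
    # Second pass: cap each file system with a quota
    for name in names:
        commands.append(f"zfs set refquota={quota} {name}")
    return commands


if __name__ == "__main__":
    for command in plan_filesystems("tank-8TB", 11, "9T", "7.66T", "9T"):
        print(command)
```

The value for the last reservation still has to be read from the AVAIL column of zfs list after the other file systems exist, as was done for gridstorage11 above.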

After that, the storage setup is finished. The newly created file systems are already mounted and only need to be made available to the DPM user:
[root@pool7 ~]# chown -R dpmmgr:users /tank-2TB
[root@pool7 ~]# chown -R dpmmgr:users /tank-8TB

That was the last step needed on the storage server. The new file systems can now be added to the DPM head node like any other file system.

ZFS configuration options

To customize the ZFS behaviour, 2 main config files are available: /etc/sysconfig/zfs and /etc/zfs/zed.d/zed.rc. I will not go into detail about these 2 files here, but if you want to set up your own ZFS based storage then have a look at them. The options within are mostly self-explanatory; for example, you can specify where to send email about disk problems and under which circumstances.




As a final section, I want to mention 2 very useful commands: zpool history and zpool iostat.
With the first command, one can display all commands that were run against a zpool since its creation, together with a time stamp. This can be very useful for error analysis and also for repeating a configuration on another server.
[root@pool6 ~]# zpool history tank-2TB
History for 'tank-2TB':
2016-04-05.11:27:43 zpool create -f tank-2TB raidz2 sdd sdf sdg sdi sdj sdl sdm sdz sdaa sdab sdac sdad sdae sdaf sdag sdah sdai sdaj spare sdak
2016-04-05.11:40:32 zfs create -o refreservation=7TB tank-2TB/gridstorage01
2016-04-05.11:40:34 zfs create -o refreservation=7TB tank-2TB/gridstorage02
2016-04-05.11:40:39 zfs create -o refreservation=7TB tank-2TB/gridstorage03
2016-04-05.11:41:57 zfs create -o refreservation=6.97T tank-2TB/gridstorage04
2016-04-05.12:02:12 zpool set autoreplace=on tank-2TB
2016-04-05.12:02:17 zpool set autoexpand=on tank-2TB
2016-04-05.12:38:11 zpool export tank-2TB
2016-04-05.12:38:33 zpool import -d /dev/disk/by-id tank-2TB
2016-04-06.13:41:37 zfs set compression=lz4 tank-2TB
2016-04-07.14:36:37 zfs set relatime=on tank-2TB
2016-04-07.14:36:42 zfs set xattr=sa tank-2TB
2016-04-11.11:28:08 zpool scrub tank-2TB
2016-04-11.14:12:41 zfs set refquota=7T tank-2TB/gridstorage01
2016-04-11.14:12:43 zfs set refquota=7T tank-2TB/gridstorage02
2016-04-11.14:12:48 zfs set refquota=7T tank-2TB/gridstorage03
2016-04-11.14:12:59 zfs set refquota=7T tank-2TB/gridstorage04
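
To repeat a configuration on another server, the time stamps have to be stripped from the history again. The following is a minimal Python sketch, assuming the output format shown above; the function name and the default list of commands to skip (export, import and scrub, which describe maintenance rather than configuration) are choices made for this illustration.

```python
SAMPLE = """
History for 'tank-2TB':
2016-04-05.11:40:32 zfs create -o refreservation=7TB tank-2TB/gridstorage01
2016-04-05.12:02:12 zpool set autoreplace=on tank-2TB
2016-04-05.12:38:11 zpool export tank-2TB
2016-04-05.12:38:33 zpool import -d /dev/disk/by-id tank-2TB
2016-04-11.11:28:08 zpool scrub tank-2TB
2016-04-11.14:12:41 zfs set refquota=7T tank-2TB/gridstorage01
"""


def replayable_commands(history_text,
                        skip_prefixes=("zpool export", "zpool import", "zpool scrub")):
    """Extract the commands from `zpool history` output, dropping the time stamps."""
    commands = []
    for line in history_text.splitlines():
        line = line.strip()
        # Skip blank lines and the "History for '<pool>':" header
        if not line or line.startswith("History for"):
            continue
        # Everything after the first space is the command itself
        _timestamp, _, command = line.partition(" ")
        if not command.startswith(("zpool ", "zfs ")):
            continue
        if command.startswith(skip_prefixes):
            continue
        commands.append(command)
    return commands


if __name__ == "__main__":
    for command in replayable_commands(SAMPLE):
        print(command)
```

The zpool create line will still need editing by hand, since the device names differ from server to server.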

The second command, zpool iostat, displays the current I/O on the pool, separately for read/write operations and for bandwidth. Information can be displayed for a given pool but also for each disk within the pool.
The following example is taken from a server configured with 3 raidz2 vdevs, while it was being drained using the dpm-drain command on the head node with threads=1 and one drain per file system, resulting in 5 parallel drain commands running:

[root@pool5 ~]# zpool iostat 1
               capacity     operations    bandwidth
pool        alloc   free   read  write   read  write
----------  -----  -----  -----  -----  -----  -----
tank        6.03T  54.0T    110     24  13.5M  1.33M
tank        6.03T  54.0T  4.74K      0   602M      0
tank        6.03T  54.0T  4.63K      0   589M      0
tank        6.03T  54.0T  4.62K      0   587M      0
tank        6.03T  54.0T  4.42K      0   561M      0
tank        6.03T  54.0T  5.27K      0   669M      0
tank        6.03T  54.0T  4.51K      0   573M      0
tank        6.03T  54.0T  4.47K      0   568M      0
tank        6.03T  54.0T  4.47K      0   568M      0
tank        6.03T  54.0T  4.26K      0   542M      0
tank        6.03T  54.0T  4.56K      0   579M      0
tank        6.03T  54.0T  4.82K      0   613M      0
tank        6.03T  54.0T  4.60K      0   585M      0
tank        6.03T  54.0T  4.73K      0   601M      0
tank        6.03T  54.0T  4.20K      0   533M      0
tank        6.03T  54.0T  4.52K      0   574M      0
tank        6.03T  54.0T  3.72K      0   473M      0
tank        6.03T  54.0T  3.80K      0   484M      0
tank        6.03T  54.0T  4.46K      0   567M      0
tank        6.03T  54.0T  5.16K      0   655M      0
tank        6.03T  54.0T  5.25K      0   667M      0

[root@pool5 ~]# zpool iostat -v 1
                                               capacity     operations    bandwidth
pool                                        alloc   free   read  write   read  write
------------------------------------------  -----  -----  -----  -----  -----  -----
tank                                        5.87T  54.1T  4.77K      0   606M      0
  raidz2                                    1.96T  18.0T  1.60K      0   202M      0
    scsi-36a4badb044e936001e55b2111ca79173      -      -    329      0  21.2M      0
    scsi-36a4badb044e936001e55b2461fc3cb38      -      -    354      0  20.8M      0
    scsi-36a4badb044e936001e55b25520b22e10      -      -    337      0  21.3M      0
    scsi-36a4badb044e936001e55b2622171ff0b      -      -    333      0  21.3M      0
    scsi-36a4badb044e936001e55b26e2232640f      -      -    334      0  21.0M      0
    scsi-36a4badb044e936001e55b27d230ce0f6      -      -    333      0  21.2M      0
    scsi-36a4badb044e936001e55b293245ae11b      -      -    335      0  21.1M      0
    scsi-36a4badb044e936001e55b2b426603fbe      -      -    335      0  21.3M      0
    scsi-36a4badb044e936001e55b2c4274ec795      -      -    338      0  20.9M      0
    scsi-36a4badb044e936001e55b2d128122551      -      -    318      0  20.8M      0
    scsi-36a4badb044e936001e55b2f42a2e3006      -      -    342      0  21.3M      0
  raidz2                                    1.96T  18.0T  1.59K      0   203M      0
    scsi-36a4badb044e936001e830de6afdb1d8f      -      -    332      0  21.6M      0
    scsi-36a4badb044e936001e55b3082b59c1da      -      -    310      0  21.2M      0
    scsi-36a4badb044e936001e55b3142c0ac749      -      -    311      0  21.5M      0
    scsi-36a4badb044e936001e55b31f2cbeb648      -      -    319      0  21.8M      0
    scsi-36a4badb044e936001e55b44e3ecc77ea      -      -    313      0  21.7M      0
    scsi-36a4badb044e936001e55b33b2e6172a4      -      -    213      0  21.8M      0
    scsi-36a4badb044e936001e55b34c2f70184c      -      -    307      0  21.4M      0
    scsi-36a4badb044e936001e55b358301ee6a2      -      -    319      0  21.8M      0
    scsi-36a4badb044e936001e55b36530e4cb2a      -      -    331      0  21.8M      0
    scsi-36a4badb044e936001e55b3793218970b      -      -    325      0  21.8M      0
    scsi-36a4badb044e936001e55b38532cf8c68      -      -    324      0  21.8M      0
  raidz2                                    1.96T  18.0T  1.58K      0   201M      0
    scsi-36a4badb044e936001e55b39033741ccf      -      -    342      0  21.8M      0
    scsi-36a4badb044e936001e55b39b3421403b      -      -    323      0  21.4M      0
    scsi-36a4badb044e936001e55b3de3822569a      -      -    335      0  21.8M      0
    scsi-36a4badb044e936001e55b3eb38df509e      -      -    328      0  21.7M      0
    scsi-36a4badb044e936001e55b3f839a79c83      -      -    301      0  21.5M      0
    scsi-36a4badb044e936001e55b4023a46ae2a      -      -    325      0  21.7M      0
    scsi-36a4badb044e936001e55b40f3b0100cb      -      -    314      0  21.5M      0
    scsi-36a4badb044e936001e55b41d3bdd5a86      -      -    335      0  21.5M      0
    scsi-36a4badb044e936001e55b42b3cb239cd      -      -    324      0  21.8M      0
    scsi-36a4badb044e936001e55b4363d55d784      -      -    322      0  21.7M      0
    scsi-36a4badb044e936001e55b4413e09a9f1      -      -    331      0  21.8M      0
------------------------------------------  -----  -----  -----  -----  -----  -----
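
To turn such a run into a single number, the per-second samples can be averaged. The following is a minimal Python sketch, assuming the seven-column pool lines shown above and binary multiples for the K/M/G/T suffixes; the function names are made up for this illustration. The first sample is skipped because the first report of zpool iostat is an average since boot, not a current value.

```python
UNITS = {"K": 1024, "M": 1024**2, "G": 1024**3, "T": 1024**4}

SAMPLE = """
tank        6.03T  54.0T    110     24  13.5M  1.33M
tank        6.03T  54.0T  4.74K      0   602M      0
tank        6.03T  54.0T  4.63K      0   589M      0
tank        6.03T  54.0T  4.62K      0   587M      0
"""


def parse_size(value):
    """Convert a figure such as '602M' or '4.74K' to a float; '-' becomes 0."""
    if value == "-":
        return 0.0
    if value[-1] in UNITS:
        return float(value[:-1]) * UNITS[value[-1]]
    return float(value)


def read_bandwidths(iostat_text, pool, skip_first=True):
    """Return the read bandwidth in bytes/s for every sample line of `pool`."""
    rates = []
    for line in iostat_text.splitlines():
        fields = line.split()
        # pool, alloc, free, read ops, write ops, read bw, write bw
        if len(fields) == 7 and fields[0] == pool:
            rates.append(parse_size(fields[5]))
    return rates[1:] if skip_first else rates


if __name__ == "__main__":
    rates = read_bandwidths(SAMPLE, "tank")
    mean = sum(rates) / len(rates)
    print(f"mean read bandwidth: {mean / UNITS['M']:.1f}M over {len(rates)} samples")
```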


One important point to keep in mind is that all the properties set with zfs or zpool are stored within the file system and not within the OS config files! That means that if the OS gets upgraded, one can do a zpool import and all properties, like mount point, quota, reservations, compression and history, will instantly be available again. There is no need to manually touch any system config file, like /etc/fstab, to make the storage available. This is also true for other properties, like NFS sharing, but since it's not needed in our case I haven't described that. To get an idea of what properties are available and what else one can do with ZFS, zfs get all and zpool get all are useful.

22 February 2008

dCache configuration, graphviz style


I don't know about anyone else, but I'm fed up with having to debug different sites' PoolManager.conf files, especially with all this LinkGroup stuff going on. I find it far too hard to manually parse a file when it stretches to hundreds of lines, which makes it virtually impossible to know if there are any mistakes.

In an effort to improve the situation, I put together a little Python script last night that converts a PoolManager.conf into a .dot file. This can then be processed by GraphViz to produce a structured graph of the dCache configuration. You can see some examples of currently active dCache configurations here. The above plot shows the config at Edinburgh.
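
The idea can be illustrated with a minimal Python sketch. This is not the script described above; it is a simplified stand-in that assumes the usual psu command forms (psu addto ugroup, psu addto pgroup, psu create link, psu add link) and ignores everything else, including LinkGroups. The function name and the sample configuration are made up for this illustration.

```python
SAMPLE = """
psu create unit -net 0.0.0.0/0.0.0.0
psu create ugroup world-net
psu addto ugroup world-net 0.0.0.0/0.0.0.0
psu create pool pool1
psu create pgroup default
psu addto pgroup default pool1
psu create link default-link world-net
psu add link default-link default
"""


def poolmanager_to_dot(conf_text):
    """Turn a subset of psu commands into a Graphviz dot description."""
    edges = []
    for line in conf_text.splitlines():
        words = line.split()
        if len(words) < 5 or words[0] != "psu":
            continue
        action, kind = words[1], words[2]
        if action == "addto" and kind == "ugroup":
            # psu addto ugroup <ugroup> <unit>: the unit feeds the unit group
            edges.append((words[4], words[3]))
        elif action == "addto" and kind == "pgroup":
            # psu addto pgroup <pgroup> <pool>: the pool group contains the pool
            edges.append((words[3], words[4]))
        elif action == "create" and kind == "link":
            # psu create link <link> <ugroup> [<ugroup> ...]
            edges.extend((ugroup, words[3]) for ugroup in words[4:])
        elif action == "add" and kind == "link":
            # psu add link <link> <pgroup>: the link selects a pool group
            edges.append((words[3], words[4]))
    lines = ["digraph poolmanager {"]
    # Names are quoted because they may contain '/', '.' or '-'
    lines += [f'    "{src}" -> "{dst}";' for src, dst in edges]
    lines.append("}")
    return "\n".join(lines)


if __name__ == "__main__":
    print(poolmanager_to_dot(SAMPLE))
```

Saving the output to a file such as poolmanager.dot and running dot -Tpng poolmanager.dot -o poolmanager.png produces the picture.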

I have been creating both directed (dot) and undirected (neato) graphs. At the moment, the most useful one is the dot plot. I'm still exploring what neato can be used for.

I think the fact that we even have to consider looking at things this way tells you two things:

1. dCache is a complex beast, with a multitude of different ways of setting things up (which has both pros and cons).
2. The basic configuration really has to be improved to save multiple man-hours that are spent across the Grid trying to debug basic problems.

At the moment, this system is only a prototype. It is intended as an aid to understanding dCache configuration and looking for potential bugs. As always, comments are welcome.

PS Thanks to Steve T for inspiring me to work on this following his graphing glue project.