26 June 2017

A Storage view from WLCG 2017

I knew the WLCG 2017 meeting at Manchester was going to be interesting when it started with a heatwave! In addition to various non-storage topics, we had good discussions regarding accounting, object stores and site evolution.

Of particular interest, I think, are the talks from the Tuesday morning sessions on the midterm evolution of sites. Other topics covered included a talk on a distributed dCache setup, an xrootd federation setup (using DynaFed) and the OSG experience of caching using xrootd. The network rates from the Tuesday afternoon VO network requirement talks are worth a look for sites trying to calculate their bandwidth requirements depending on what type of site they intend to be.

My highlights from the Wednesday sessions were the informative talks regarding accounting for storage systems without SRMs.


The last day also contained an IPv6 workshop where site admins not only heard some theory and the expected timescale for deployment, but also got a chance to deploy IPv6 versions of perfSONAR/CVMFS/Frontier/squid/xrootd. (Not much different from deploying the IPv4 versions, but that should not be much of a surprise.)
Links to these talks are:

https://indico.cern.ch/event/609911/timetable/#20170621.detailed
https://indico.cern.ch/event/609911/timetable/#20170622.detailed


A benefit of the meeting being in Manchester was the multiple talks from the Manchester-based SKA project on their data, computing and networking requirements. Links can be found interspersed here:

https://indico.cern.ch/event/609911/timetable/

Also announced was the next pre-GDB on Storage in September:
https://indico.cern.ch/event/578974/

20 June 2017

A solid start for IPv6-accessible storage within the UK for WLCG experiments.

IPv6-accessible storage is becoming a reality in production at WLCG sites within the UK.
We already have some sites which are fully dual-stack IPv4/IPv6. Some sites have their full storage systems dual hosted, while others have parts of their storage systems dual hosted as part of a staged rollout of the service. I am also aware of sites who are actively looking at providing dual-hosted gateways to their storage systems. Now just to work out how to monitor and differentiate network rates between IPv4 and IPv6 traffic. (That is a very liberal usage of the word "just", since I understand that there is a complicated set of issues around further IPv6 deployment and its monitoring.)

Storage news from HEPSYSMAN

The good news is that I heard some good talks (and possibly gave one) at the HEPSYSMAN meeting this week, just in time for the WLCG workshop next week. We started with a day on IPv6 which, as well as increasing my knowledge of networking, highlighted the timescale for WLCG storage to be dual homed, the volume and monitoring of traffic over IPv6 (in Dave Kelsey's talk) and issues with third-party transfers (from the talk by Tim Chown).
During HEPSYSMAN proper, many good things from different sites were reported, of particular interest being how sites should evolve. There was also a very interesting talk comparing RAID6 and ZFS. Slides from the workshop should be available from here:
https://indico.cern.ch/event/592622/


19 June 2017

Hosting a large web-forum on ZFS (a case study)

Over the course of last weekend I worked with a friend on deploying zfs across their infrastructure.

Their infrastructure in this case is a popular website written in PHP and serving some 20,000+ users. They, like many GridPP sysadmins, use CentOS for their back-end infrastructure. However, being a regularly high-profile target for attacks, they have opted to run their systems using the latest kernel installed from ELRepo.
The infrastructure for this website is heavily Docker orientated due to the (re)deployment advantages that this offers.
Due to problems with the complex workflow, SELinux has been set to permissive.
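
As an aside, here is a rough sketch of how such a mainline kernel is typically installed from ELRepo on CentOS 7; the release rpm URL below is an assumption, so check elrepo.org for the current one:

rpm --import https://www.elrepo.org/RPM-GPG-KEY-elrepo.org
yum install https://www.elrepo.org/elrepo-release-7.el7.elrepo.noarch.rpm   # assumed URL; see elrepo.org
yum --enablerepo=elrepo-kernel install kernel-ml                            # the mainline ("latest") kernel
grub2-set-default 0                                                         # boot the newly installed kernel by default
reboot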

Data for the site was stored within a /data directory, which held both the main database for the site and the files hosted by the site.
Prior to the use of ZFS, the storage used for this site was XFS.

The hardware used to run this site is a dedicated server with 8 Intel cores, 32 GB of RAM and 2 x 2 TB disks managed by soft-RAID (mirror) and partitioned using LVM.

Installing zfs

Initially setting up ZFS couldn't have been easier. Install the correct rpm repo, update, install zfs and reboot:

yum update
yum install <zfs-release repo rpm>   # the ZFS on Linux repository package matching this CentOS release
yum update                           # refresh now that the new repo is available
yum install zfs
reboot

Fixing zfs-dkms

As they are using the latest stable kernel, they opted to install ZFS using DKMS, which has pros and cons compared with the kmod install.

This unfortunately didn't work as it should have done (possibly due to a pending kernel update on reboot). After rebooting, the following commands were needed to build and install the ZFS driver:

dkms build spl/0.6.5.10
dkms build zfs/0.6.5.10
dkms install spl/0.6.5.10
dkms install zfs/0.6.5.10

This step triggered the rebuild and installation of the spl (solaris porting layer) and the zfs modules.
(Adding this to the initrd shouldn't be required, but can probably be done as usual once the modules have been built.)
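
For a quick sanity check that the modules built and load cleanly, something like the following can be used:

dkms status                  # spl and zfs should both show as installed for the running kernel
modprobe zfs                 # load the module (pulls in spl as a dependency)
lsmod | grep -E 'spl|zfs'    # confirm both modules are loaded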

Migrating data to ZFS

The initial step was to migrate the storage backend and main database for the site. The storage is approximately 0.5 TB of data made up of numerous files with an average file size close to 1 MB. The SQL database is approximately 50 GB in size and contains most of the site data.

mv /data/webroot /data/webroot-bak
mv /data/sqlroot /data/sqlroot-bak
# Create one pool per area on top of the LVM logical volumes
# (the /dev/vgs/* paths are assumed from the names used in the original commands)
zpool create webrootzfs /dev/vgs/webrootzfs
zpool create sqlrootzfs /dev/vgs/sqlrootzfs
zfs set mountpoint=/data/webroot webrootzfs
zfs set mountpoint=/data/sqlroot sqlrootzfs
zfs set compression=lz4 webrootzfs
zfs set compression=lz4 sqlrootzfs
zfs set primarycache=metadata sqlrootzfs
zfs set secondarycache=none webrootzfs
zfs set secondarycache=none sqlrootzfs
zfs set recordsize=16k sqlrootzfs # Matches the db block size
rsync -avP /data/webroot-bak/ /data/webroot/   # trailing slash so hidden files are copied too
rsync -avP /data/sqlroot-bak/ /data/sqlroot/

After migrating these, the site was brought back up for approximately 24 hours and no performance problems were observed.

The webroot data, which contained mainly user-submitted files, reached a compression ratio of about 1.1x.
The SQL database reached a compression ratio of about 2.4x.
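
These figures can be read back directly from the datasets; a quick sketch using the dataset names above:

zfs get compressratio,used,logicalused webrootzfs sqlrootzfs   # compressratio is the achieved ratio per dataset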

Given the increased performance of the site after this migration, it was decided 24 hours later to investigate migrating the main website itself rather than just the backend.

Setting up systemd

The following systemd services and targets were enabled but rebooting the system has not (yet) been tested.

systemctl enable zfs.target          # umbrella target that the zfs units attach to at boot
systemctl enable zfs-mount           # mounts the zfs datasets
systemctl start zfs-mount
systemctl enable zfs-import-cache    # imports the pools listed in /etc/zfs/zpool.cache
systemctl start zfs-import-cache

systemctl enable zfs-share           # only needed if datasets are exported over NFS/SMB
systemctl start zfs-share
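
Since a reboot has not been tested yet, the wiring can at least be checked without one; a small sketch:

systemctl list-dependencies zfs.target   # the enabled zfs-* services should appear under the target
zfs mount                                # lists the datasets currently mounted
zpool status -x                          # prints "all pools are healthy" if nothing is wrong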


Impact of using ZFS

For moving the Docker-based parts of the site onto ZFS, a nice solution was found to already exist: the ZFS storage driver for Docker.

https://docs.docker.com/engine/userguide/storagedriver/zfs-driver/
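
A minimal sketch of switching the Docker daemon over to that driver, assuming the zpool-docker pool mentioned later in this post already exists and that the existing /var/lib/docker contents can be cleared first:

systemctl stop docker
zfs set mountpoint=/var/lib/docker zpool-docker   # give Docker's data directory its own ZFS dataset
cat > /etc/docker/daemon.json <<'EOF'
{
  "storage-driver": "zfs"
}
EOF
systemctl start docker
docker info | grep -i 'storage driver'            # should now report: Storage Driver: zfs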

After this was set up, the site was brought back online and the performance improvement was notable.

Page load time for the site dropped from about 600 ms to 300 ms. That is a 50% drop in page load time entirely due to replacing the backend storage with ZFS.
This was with the ARC running at a 95% hit rate.

Problems Encountered

Unfortunately, about 30 minutes after migrating the Docker service to use ZFS, the site fell over
(page load times increased to multiple seconds and the backend server load spiked).

Upon initial inspection it was discovered that the ZFS ARC had dropped to 32 MB (almost the absolute minimum) and the arc_reclaim process was consuming 100% of one CPU.
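
For reference, the ARC size and the hit rate quoted earlier can be read from the kernel stats (the hit rate is hits / (hits + misses)); the size values below are in bytes:

awk '$1 ~ /^(size|c_min|c_max|hits|misses)$/ {print $1, $3}' /proc/spl/kstat/zfs/arcstats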

The ZFS ARC maximum was increased to 10 GB, but the cache refused to grow:

echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max

Increasing the minimum (the value used is listed in the summary below) forced the ARC to grow; however, the arc_reclaim process was still consuming one full CPU core.

Fixing the Problems

A better workaround was found to be disabling transparent hugepages using:


echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

This stopped the arc_reclaim process from consuming 100% CPU and also allowed the ARC to start regrowing.
(For the interested, this has been reported upstream: https://github.com/zfsonlinux/zfs/issues/4869)



Summary of Tweaks made

A summary of some of the optimizations applied to these pools is:

# ZFS settings
zfs set compression=lz4 webrootzfs        # lz4: fast compression with a good ratio
zfs set compression=lz4 sqlrootzfs        # lz4: fast compression with a good ratio

zfs set primarycache=all webrootzfs       # the default (dataset assumed here): cache data and metadata in the ARC
zfs set primarycache=metadata sqlrootzfs  # cache only metadata; let the DB do its own data caching
zfs set secondarycache=none webrootzfs    # no L2ARC in use
zfs set secondarycache=none sqlrootzfs    # no L2ARC in use
zfs set recordsize=16k sqlrootzfs         # matches the DB block size


# Settings changed through /sys
echo never > /sys/kernel/mm/transparent_hugepage/enabled
echo never > /sys/kernel/mm/transparent_hugepage/defrag

echo 10737418240 > /sys/module/zfs/parameters/zfs_arc_max # 10 GB max
echo 4294967296 > /sys/module/zfs/parameters/zfs_arc_min # 4 GB min

# Repeat the following for /sys/block/sda and /sys/block/sdb
echo 4096 > /sys/block/sda/queue/nr_requests
echo 0 > /sys/block/sda/queue/iosched/front_merges
echo noop > /sys/block/sda/queue/scheduler
# Note: the iosched/* entries (front_merges, read_expire, write_expire, fifo_batch) are
# deadline-scheduler tunables and only exist while deadline is the selected scheduler
echo 150 > /sys/block/sda/queue/iosched/read_expire
echo 1500 > /sys/block/sda/queue/iosched/write_expire
echo 4096 > /sys/block/sda/queue/read_ahead_kb
echo 1 > /sys/block/sda/queue/iosched/fifo_batch
echo 16384 > /sys/block/sda/queue/max_sectors_kb



Additionally for the docker-zfs pool:

zfs set primarycache=all zpool-docker
zfs set secondarycache=none zpool-docker
zfs set compression=lz4 zpool-docker

All Docker containers built using this engine inherit these properties from the base pool zpool-docker; however, a remove/rebuild of existing containers and images is needed to take advantage of settings such as compression.
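
A quick sketch of checking that the per-image and per-container datasets created by the Docker driver really do inherit these properties:

zfs get -r -t filesystem compression zpool-docker | head   # child datasets should show lz4 with SOURCE "inherited from zpool-docker"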