GridPP storage news: 2018

31 December 2018

Pretrospective

Traditionally at the end of the year it falls to the undersigned to look back at highlights of the past year and forward to issues in the coming year. By which we mean technical issues, not political, economic, or peopleoligical - no mention of the B-word or the impact it has had on us already.

Looking back, without the benefit of the archive (an internal network error has, ironically, taken it temporarily offline), one of the significant developments of 2018 was the formation of what is now known as IRIS, an infrastructure for an amalgamation of STFC-funded research communities. GridPP, of course, is a part of this, an example of how to provide an infrastructure even if the solution would not work directly for all user communities. Also highlightworthy, the work on object stores, and ZFS.

Looking forward, expanding the work with IRIS will be interesting, in particular the competition for the middleware on top of the IaaS and storage, because the current infrastructure requires a fair bit of IaaS expertise. Less new, but still not sufficiently followed up on, are opportunities to work on the integration of data centre automation and information systems, although even some storage vendors at CIUK knew little about this, and WLCG itself seems to have given up on this angle. In fact, looking further back to the early naughties it is remarkable how often the wheels get reinvented, thing B gets replaced with thing C that is a reinvention of thing A which was originally replaced by B.

Coming back to 2019, more pertinent still for the GridPP T2s will be the ongoing evolution of sites to storage sites and cache-only sites, a process which continues to pose lots of interesting technical challenges.

19 September 2018

When did we switch storage systems.... I'm not sure I can tell.....

So as part of the Storage evolution within the UK, the Birmingham site is switching SEs from DPM to EOS for the ATLAS experiment. However this blog isn't about that, this blog is about what is happening in the meantime... Similar to the way in which we pointed UCL WNs to use QMUL storage in London, ATLAS have switched the 'Brum' WNs to use storage at Manchester. This "greatly" increases the rtt between WN and SE from ~0.1ms to ~4ms. So I got to wandering, is this increase in latency noticeable in Job efficiency for the VOs. Here are the graphs from the VO for Birmingham and Manchester for the last month..
For Manchester a a control set for SE performance we have:

And for Birmingham we have:

I leave it to the reader to come to conclusions about wether Birmingham efficiencies have dropped since changing SEs. (Even if that conclusion is "you need more data to make a proper comparison",) but at least its something to get things going...

11 September 2018

40/100G networks are here! (but also have been for a while...)

So hearing that RAL now has a 100G link into JANET started me thinking again about what kind of network tuning might be needed.. A quick bit of using a popular search engine (and knowledge that their are certain individuals I have found that can provide useful info on this area, I found:

https://fasterdata.es.net/host-tuning/100g-tuning/

https://www.es.net/assets/Uploads/100G-Tuning-TechEx2016.tierney.pdf

What surprised me (and shouldn't have) was how old these posts were! Of particular interest has been that there really is a benefit to get diskservers upgraded to SL/CENTOS 7 rather than 6. I also did not know there was a 2G limit for TCP window size.

Successful DPM/DOME BOF/Hackathon

As part of a side meeting to the GRIDPP41 meeting held in Ambleside this year, we also had a successful BOF/Hackathon meeting. Now I probably should care more about the WN configuration/security hack session, but that really is SEP to me. However the discussion we had regarding Lancaster's DOME deployment and DPM legacy mode in general was useful to understand the DPM roadmap and how we are going to roll out SRM less service at tier 2 sites. This is germane at the moment as some of the features such as 3rd party copy (or TPC as it shall now be known as...) are needed for supporting data activities not just for the WLCG VOs but also some of our smaller communities. (DUNE is an example.)

24 August 2018

Help, ZFS ate my storage server (kernel segfaults on SL6)

At Edinburgh our storage test server (sl6) just updated it's kernel and had to reboot. Unfortunately it did not come back and suffered a kernel segfault during the reboot.

This was spotted to be during the filesystem mounting stage in the init scripts and specifically was caused by modprobe-ing the zfs module which had just been built by dkms.

The newer sl6 redhat kernels (2.6.32-754....) appear to have broken part of the abi used by the ZFS modules built by dkms.

The solution to fix this was found to be:

Reboot into the old kernel (anything with a version 2.6.32-696... or older)
check dkms for builds of the zfs/spl modules: dkms status
run: dkms uninstall zfs/0.7.9; dkms uninstall spl/0.7.9
make sure dkms removed this for ALL kernel versions (if needed run dkms uninstall zfs/0.7.9 -k 2.6.32-754) to remove it for a specific kernel
remove all traces of these modules:
for i in /lib/modules/*; do for j in extra weak-updates; do for k in avl icp nvpair spl splat unicode zcommon zfs zpios ; do rm -r ${i}/${j}/${k}; done; done; done
reboot back into the new kernel and reinstall zfs:
dkms install zfs/0.7.9; dkms install spl/0.7.9
Check that you've saved everything important.
Now load the new modules: modprobe zfs
re-import your pools: zpool import -a

Alternatively: Remove all of the zfs modules (steps 3 and 5) before you reboot your system after installing the new kernel and then dkms will re-install everything on the next reboot.

For more info: https://github.com/zfsonlinux/zfs/issues/7704

TL;DR When building new kernel modules dkms doesn't always rebuild external modules safely, make sure you remove these dependencies when you perform a kernel update so that everything is rebuilt safely

27 June 2018

CMS dark data volume after migrtaion to new SE at Tier1 RAL

CMS have recently moved all full usage of disk only storage from CASTOR to CEPH at the RAL-LCG2 tier1. (They are still using CASTOR for tape services.) CMS had been using an approx 2PB of disk only CASTOR space on dedicated servers. What is of interest is that after CMS had moved their data and deleted the old data from CASTOR, there was still ~ 100TB of dark data left on the SE. WE have now been able to removed this data and have started the process of decommissioning the hardware. (Some hardware may be re-provisioned for other tasks, but most is creaking with old age.)

ATLAS are almost in a similar position. I wander how much dark data will be left for them.. ( Capacity is 3.65PB at the moement so my guess is ~150TB)

When all the dark data is removed , and the files removed form the namespace, we can also clean up the dark directory structure which is no longer needed. I leave it to the read to guess how many dark directories CMS have left...

15 June 2018

DPM Workshop 2018 Report

CESNET hosted the 2018 DPM Workshop in Prague, 31st May to 1st June.

As always, the Workshop was built around the announcement of a new DPM release - 1.10.x - and promotion of the aims and roadmaps of the DPM core development team represented by it.

Since the 1.9.x series, the focus of DPM development has been on the next-generation "DOME" codebase. The 1.10 release, therefore, shows performance improvements in managing requests for all supported transfer protocols - GridFTP, Xroot, HTTP - but only when the DOME adapter is managing them.
(DOME itself is an http/WebDAV based management protocol, implemented as a plugin to xrootd, and directly implementing the dmlite API as an adapter.)

By contrast, the old lcgdm code paths are increasingly obsolesced in 1.10 - the most significant work done on the SRM daemon supported via these paths was the fix to Centos7 SOAP handling*.

As a consequence of this, there was a floated suggestion that SRM (and the rest of the lcgdm legacy codebase for DPM) be marked as "unsupported" from 1 June 2019 - a year after this workshop. There was some lively debate about the consequences of this, and two presentations (from ATLAS and CMS) covering the possibility of using SRMless storage. [In short: this is probably not a problem, for those experiements.]
There was some significant concern mainly about historical dependancies on SRM - both for our transfer orchestration infrastructure, for which non-SRM transfers are less tested, and for historical file catalogues, which may have "srm://" paths embedded in them.

As an additional point, there was a discussion of longstanding configuration "issues" with Xrootd redirection into the CMS AAA hierarchy, as discovered and amended by Andrea Sartirana at the end of 2017.

Other presentations from the contributing regions had a significant focus on testing other new features of DPM in 1.9.x; the distributed site approach (using DOME to manage pool nodes at geographically remote locations relative to the head, securely), and the new "volatile pool" model for lightweight caching in DPM.

For Italy, Alessandra Doria reported on the "distributed" DPM configuration across Roma, Napoli and LNF (Frascati), implemented currently as a testbed. This is an interesting implementation of both distributed DPM, and the volatile pools - each site has both a local permanent storage pool, plus a volatile cache pool, enabling the global namespace across the entire distributed DPM to be transparent (as remote files are cached in the volatile pool from other sites).

For Belle 2, Silvio Pardi reported on some investigations and tests of volatile pools for caching of data for analysis.

We also presented, from the UK, work on implementing the old ARGUS-DPM bridge for DMLITE/DOME. This api bridge allows the ARGUS service - the component of the WLCG site framework which centralises authentication and authorisation decisions - to make user and group ban status available to a local DPM. (DPM, especially in DMLITE and DOME eras, does not perform account mapping in the way that compute elements do, so the most useful part of ARGUS's functionality is the binary authorisation component. As site ARGUS' are federated with the general WLCG ARGUS instance to allow "federated user banning" as a security feature, the ability to set storage policies via the same framework is useful.)

*Centos7 gSOAP libraries changed behaviour such that they handle connection timeouts poorly, resulting in erroneous errors being sent to clients to an SRM when they reopen a connection. A work-around was developed at the CGI-GSOAP level, and deployed initially at UK sites which had noticed the issue.

26 April 2018

Impact of Firmware updates of Spectra and Meltdown Mitigation.

In order to address the security issues associated with the Spectra / Meltdown hardware bug found in many modern operating system AND CPUs firmware, CPU microcode updates are required. The microcode updates addresses the Spectre variant 2 attack. Spectre variant 2 attacks work by persuading a processor's branch predictor to make a specific bad prediction about which code will be executed and from which information can be obtained about the process.

Much has been said about the performance impact of Spectra / meltdown mitigation caused by the kernel patches. Less is known about the impact of the firmware updates on system performance. Most of the concern is about the performance impact on processes that switch between user and system calls. These are typically applications that perform disk or network operations.

After one abortive attempt Intel has released a new set of CPU microcode updates that promise to provide stability (https://newsroom.intel.com/wp-content/uploads/sites/11/2018/03/microcode-update-guidance.pdf). We have run some IO intensive benchmarks tests on our servers testing different firmware on our Intel Haswell CPUs (E5 2600 V3).

Our test setup up is made up of 3 HPE DL60 servers each with one OS disk and three data disks (1 TB SATA hard drives). One node is used for control while the other two will be involved in the actual benchmark process. The servers have Intel E5 2650 V3 CPUs and 128GB of RAM. Each server is connected at 10Gb/s SFP+ to a non blocking switch. All system are running scientific linux 6.9 (aka CentOS 6.9) with all the latests updates installed.

The manufacture, HPE, has provided a BIOS update which will deploy this new microbe version and we will investigate the impact of updating the microcode to 0x3C(BIOS 2.52) from previous version 0x3A(2.56) while keeping everything else constant. One nice feature of the HPE servers is the ability to swap to a backup BIOS so updates can be reverted.

Our first test uses a HDFS test called DFSIO with a Hadoop setup (1 name node, 2 data nodes with 3 data disks each). The test will write 1TB of data across the 6 disks and then reads it back. The command run are

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.8.3-tests.jar TestDFSIO -D

test.build.data=mytestdfsio -write -nrFiles 1000 -fileSize 1000

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.8.3-tests.jar TestDFSIO -D

test.build.data=mytestdfsio -read -nrFiles 1000 -fileSize 1000

The results, in minutes taken, clearly show a major performance impact, of order 20%, in using the new microcode update!

As a cross check we did a similar test using IOzone. Here we used the distributed mode of IOzone to run tests on the six disks of the two data nodes. The command run was

iozone -+m clustre.cfg -r 4096k -s 85g -i 0 -i 1 -t 12 1TB, 12 threads, were clustre.cfg defines the nodes and disks used.

The results, in kb/s throughput, again show a measurable impact in performance of using the new firmware, although at a smaller scale (5%).

Instead of using local idisk (direct attached storage) we also did the test over the network, using our Lustre file system instead of the local disks, we saw no performance impact in either test however in this case the 10Gb/s link was a bottle neck and may have influenced the results. We will investigate further as time allows.

13 April 2018

Data rates in the UK for last 12 months: Wow alot of data goes bentween the WNs and SEs...

So with me move to new data and work-flow models as a result of the idea to create further storageless sites and caching sites. I decided to take a look at how large data flows within the UK. Caveat Emptor: I took this data from ATLAS dashboard and make the assumption that there is little WAN traffic from WNs to SEs. I am aware this is not correct, but is at the moment a small factor for ATLAS. (Hence why I reviewed ATLAS rather than CMS whom I know use AAA alot.)

In green are the data volumes in UK disk storage, in red are the rates out of the storage. (In blue is the rate for WAN transfers between UK SEs.) In purple shows the rates to and from the RAL tape system. Of notes is that during the month there was 49.1PB of data deleted from UK storage out of a disk cache of ~33PB. What I note form these rate that the 139PB of data ingest from storage into worker nodes and the 11.4PB out from the completed jobs is data that would have had to go on the WAN if WNs were not co-located with SEs.

3 Meetings, 2 talk topic areas, 1 blogpost: Storage travels this month in a nutshell.

Been a busy month for meetings with the WLCG/HSF joint meeting in Naples, GRIDPP40 at Pitlochry and GDB at CERN. I summarized at the gridpp storage meeting the WLCG/HSF meeting, I expanded with my talk at GRIDPP on eudatalake project. (goo.gl/uvVtdm ).
But overall if you need a summary then these two talks from the GDB are the way to summarize most areas. Short links are:
goo.gl/2vxQYo
goo.gl/fMefh6

07 March 2018

LHCOPN/ONE meeting 2018 at Abingdon UK---Or how the NRENs want to taake over the storage world

OK, So my title may be full of poetic license, but I have fortunately been able to attend the WLCG LHCOPN/ONE meeting at Abingdon UK this week and wanted to find a way to get your attention. It might be easy to ask why would a storage blog be interested in a networking conference; but if you don't take into account how to transfer the data and make sure the network is efficient for the data movement process we will be in trouble. Remember, with no "N" , all we do are WA/LA transfers. (what's WA/LA? Exactly!!).

The agenda for the meeting can be found here:
https://indico.cern.ch/event/681168/
The meeting was attended by ~30 experts from both WLCG and more importantly from NREN network providers such as JISC/GEANT/ESNET/SURFnet (but not limited to these organisations.)

My highlights maybe subjective. and people should feel free to delve into the slides if they wish, (if the have access.) Here however are my highlights and musings:

From site updates. RAL-LCG will be joining LHCONE, BNL and FNAL are connected, or almost connected at 300G with a 400G transatlantic link to Europe. At the moment , the majority of T1-T1 traffic is still using the OPN rather than ONE. However for RAL a back of the envelope calculation shows switching for our connections to US T1s will reduce are latency by 12 and 35 % respectively to either BNL and FNAL so could be a benefit.

Data volume increases are slowing , there was only a 65% increase in rate on LHCONE in the last year compared to the 100% seen in each of the previous two years.

Following the site update was an interesting ipv6 talk where my work looking at perfSONAR rates between ipv4 and ipv6 comparisons was referenced. (See previous blogpost.) It was also stated again that the next version of perfSONAR (4.1) will not be on SL/RHEL 6 and will only be available on 7.

There was an interesting talk on the new DUNE neutrino project and its possible usage of either LHCOPN or ONE .

The day ended for me on a productive discussion to agree that jumbo frames should be highly encouraged/recommended. (but make sure PMTUD is on!)

Day 2 was slightly more network techy side and some parts I have to admit were lost on me, However there were interesting talks regarding DTNs and open storage Networks. Plus topics about demonstrators which were shown at SuperComputing. A rate of 378Gbps memory to memory is not to be sniffed at! How far NRENs want to become persistent storage providers is a question
I would ask myself. However I can see how OSN and SDNs could do to dataI/O workflows what the creation of collimators did to allow radio telescope inter-ferometry to flourish.

23 February 2018

Lisa, the new sister to "Dave the dataset" makes her appearane.

Hello I 'm Lisa, similar to "Dave the dataset" but born in 2017 in the ATLAS experiment . my DNA number is 2.16.2251. My initial size pf my 23 sub sections is 60.8TB 33651 files. My Main physics subsection is 8.73TB (4726 files). I was born on 9 months ago, in that time I have now produced 1281 unique children corresponding to 129.4TB of data in 60904 files. It i snot surprising that I have a large number of children as I am still relatively new and my children have yet been culled.

It is interesting to see for a relatively new dataset, how many copies of myself and my children there are.
There is 46273 files/ 60.248TB with 1 copy, 35807 files/ 62.06TB with 2 copies, 2959 files/ 4.94TB with 3 copies, 9110 files/ 2.16TB with 4 copies, 51 files/ 0.017GB with 5 copies and 80files/ 0.44GB with 6 copies. Only four real scientist have data which doesn't have a second copy

Analyzing how distributed around the world this data is shows the data is in 100 rooms in total across 67 houses.

Of course more data sets are just about to be created with the imminent restart of the LHC , so we will see how I and new datasets distributions develop.

21 February 2018

Dave's Locations in preparation for 2018 data taking.

My powers that be are just about to create more brethren of me in their big circular tunnel. SO I though I would give an update of my locations.

There are currently 479 rooms over 145 houses used by ATLAS. My data is still 8 years on still in 46 rooms in 24 houses. There are 269 individuals. of which 212 are unique , 56 have a twin in another room and one is a triplet. In total this means 13GB of data has double redundancy, 5.48TB has a single redundancy, and 2.45TB has no redundancy. Of note is that 5.28TB of the 7.93TB of data with a twin is from the original produced data.

My main concern is not with those "Dirk" or "Gavin" who are sole children, as they can easily be reproduced in the children "production" factories. Of concern are the 53 "Ursulas" with no redundancy. This equates to 159GB of data/ 6671 files of whose lose would effect 17 real scientists.

06 February 2018

ZFS 0.7.6 release

ZFS on Linux 0.7.6 has now landed.

https://github.com/zfsonlinux/zfs/releases/tag/zfs-0.7.6

For everyone running the 0.7.0-0.7.5 builds I would encourage people to look into updating as there are a few performance fixes associated with this build.
Large storage servers tend to have ample hardware, however if you're running this on systems with a small amount of RAM then the fixes may have a dramatic performance improvement.
Anecdotally I've also seen some improvements on a system which hosts a large number of smaller files which could be due to some fixes around the ZFS cache.

What if an update goes wrong?

I'm linking a draft of a flowchart I'm still working on to help debug what to do if a ZFS filesystem has disappeared after rebooting a machine:
https://drive.google.com/file/d/1hqY_qTfdpo-g_qApcP9nSknIm8X3wMwo/view?usp=sharing(Download and view offline for best results, there's a few things to check for!)

24 January 2018

Can we see Improvement in IPV6 perfSONAR traffic for the RAL-LCG2 Tier1 site?

In three words (or my TL;DR response would be,) Yes and No.
You may remember I made an analysis of the perfSONAR rates for both IPv4 and IPV6 traffic from the WLCG Tier1 at Rutherford Appleton Laboratory to other WLCG sites. Here is a quick update for new measurements is a plot showing current perfSONAR rates for sites from 8 months ago, their new results, and values for new sites which have been IPV6 enabled and included in official WLCG mesh testing.


IPv4 vs IPv6perfSONAR throughput rates between RAL and other WLCG sites.

What I find interesting is that we still have some sites which have vastly better IPv4 rates rather than IPv6. NB we have 16 sites still with data, 5 sites with no current results and 10 new sites which have been added since the last tranche of measurements.

18 January 2018

Dual-stacking Lancaster's SL6 DPM

I'll start with the caveat that this isn't an interesting tale, but then all the happy sysadmin stories are of the form “I just did this, and it worked!”.

Before we tried to dual-stack our DPM we had all the necessary IPv6 infrastructure provided and set up for us by the Lancaster Information System Services team. Our DNS was v6 ready, DHCPv6 had been set up and we had a IPv6 allocation to our subnet. We tested that these services were working on our Perfsonar boxes, so there were no surprises there. When the time came to dual-stack, all we needed to do was request IPv6 addresses for our headnode and pool nodes. It's worth noting that you can run partially dual-stacked without error – we ran with a handful of poolnodes dual-stacked. However I would advise that when the time comes to dual-stack your headnode you do all of your disk pools at the same time.

Once the IPv6 addresses came through and the DNS was updated (with dig returning AAAA records for all our DPM machines) the dual-stacking process was as simple as adding these lines to the network script for our external interfaces (for example ifcfg-eth0):

IPV6INIT=yes
DHCPV6C=yes

And then restarting the network interface, and the DPM services on that node (although we probably only needed to restart dpm-gsiftp). We also of course needed a v6 firewall, so we created a ip6tables firewall that just had all the DPM transfer ports (gsiftp, xrootd, https) open. Luckily the ip6tables syntax is the same as that for iptables, so there wasn't anything new to learn there.

Despite successfully running test by hand we found out all FTS transfers were failing with errors like:

CGSI-gSOAP running on fts-test01.gridpp.rl.ac.uk reports could not open connection to fal-pygrid-30.lancs.ac.uk:8446

Initial flailing had me add this line that was missing from /etc/hosts:

::1 localhost localhost.localdomain localhost6 localhost6.localdomain6

But the fix came after reading a similar thread in the dpm users forum pointing to problems with /etc/gai.conf – a config that I had never heard of before and typically didn't exist or was empty in the average linux installation. In order for globus to work with IPv6 it had to be filled with what is for all intents and purposes an arcane incantation:

# cat /etc/gai.conf

label ::1/128 0

label ::/0 1

label 2002::/16 2

label ::/96 3

label ::ffff:0:0/96 4

label fec0::/10 5

label fc00::/7 6

label 2001:0::/32 7

label ::ffff:7f00:0001/128 8

It's important to note that this is a problem that only effects SL6, RHEL7 installs of DPM should not need it. Filling /etc/gai.conf with the above and then restarting dpm-gsiftp on all our nodes (headnode and disk pools) fixed the issue and FTS transfers started passing again, although we still had transfer failures occurring at quite a high rate.

The final piece of the puzzle was the v6 firewall – remember how I said we opened up all the transfer protocol ports? It appears DPM likes to talk to itself over ipv6, so we had to open up our ip6tables firewall ports a lot more on our head node to bring it more in line with our v4 iptables. Once this was done and the firewall restarted our DPM started running like a dual-stacked dream, and we haven't had any problems since. A happy ending just in time for Christmas!

GridPP storage news