26 April 2018

Impact of Firmware Updates for Spectre and Meltdown Mitigation

To address the security issues associated with the Spectre / Meltdown hardware bugs found in many modern CPUs, both operating system patches and CPU microcode (firmware) updates are required. The microcode updates address the Spectre variant 2 attack. Spectre variant 2 attacks work by persuading a processor's branch predictor to make a specific bad prediction about which code will be executed, from which information about the process can be obtained.

Much has been said about the performance impact of Spectre / Meltdown mitigation caused by the kernel patches. Less is known about the impact of the firmware updates on system performance. Most of the concern is about the performance impact on processes that switch frequently between user space and the kernel via system calls. These are typically applications that perform disk or network operations.

After one abortive attempt, Intel has released a new set of CPU microcode updates that promise to provide stability (https://newsroom.intel.com/wp-content/uploads/sites/11/2018/03/microcode-update-guidance.pdf). We have run some IO-intensive benchmark tests on our servers, testing different firmware on our Intel Haswell CPUs (E5 2600 V3).

Our test setup is made up of 3 HPE DL60 servers, each with one OS disk and three data disks (1 TB SATA hard drives). One node is used for control while the other two are involved in the actual benchmark process. The servers have Intel E5 2650 V3 CPUs and 128GB of RAM. Each server is connected at 10Gb/s SFP+ to a non-blocking switch. All systems are running Scientific Linux 6.9 (aka CentOS 6.9) with all the latest updates installed.

The manufacturer, HPE, has provided a BIOS update which will deploy this new microcode version, and we will investigate the impact of updating the microcode to 0x3C (BIOS 2.52) from the previous version 0x3A (2.56) while keeping everything else constant. One nice feature of the HPE servers is the ability to swap to a backup BIOS so updates can be reverted.

Our first test uses an HDFS test called DFSIO with a Hadoop setup (1 name node, 2 data nodes with 3 data disks each). The test writes 1TB of data across the 6 disks and then reads it back. The commands run are:

yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.8.3-tests.jar TestDFSIO -D test.build.data=mytestdfsio -write -nrFiles 1000 -fileSize 1000
yarn jar /usr/local/hadoop/share/hadoop/mapreduce/hadoop-mapreduce-client-jobclient-2.8.3-tests.jar TestDFSIO -D test.build.data=mytestdfsio -read -nrFiles 1000 -fileSize 1000

The results, in minutes taken, clearly show a major performance impact, of order 20%, from using the new microcode update!

As a cross check we did a similar test using IOzone. Here we used the distributed mode of IOzone to run tests on the six disks of the two data nodes. The command run was

iozone -+m clustre.cfg -r 4096k -s 85g -i 0 -i 1 -t 12

i.e. about 1TB in total over 12 threads, where clustre.cfg defines the nodes and disks used.
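As a rough sanity check that both benchmarks really move about 1TB, the sizes implied by the command-line options can be multiplied out. This is only a sketch and not part of our test scripts: the function names are made up for illustration, TestDFSIO's -fileSize is assumed to be in MB, and IOzone's "g" is treated as a decimal GB for simplicity (IOzone actually counts in binary units, so the true figure is a few percent larger).

```python
# Back-of-the-envelope data volumes for the two benchmarks (decimal units).
# Function names are illustrative only; they are not Hadoop or IOzone APIs.

def dfsio_total_tb(nr_files: int, file_size_mb: int) -> float:
    """Total TestDFSIO volume in TB, assuming -fileSize is given in MB."""
    return nr_files * file_size_mb / 1_000_000


def iozone_total_tb(threads: int, size_gb: int) -> float:
    """Total IOzone volume in TB for -t threads, each handling -s GB."""
    return threads * size_gb / 1_000


if __name__ == "__main__":
    print(f"TestDFSIO: {dfsio_total_tb(1000, 1000):.2f} TB")
    print(f"IOzone:    {iozone_total_tb(12, 85):.2f} TB")
```

Both come out close to 1TB, so the two tests are at least comparable in the amount of data they push through the six disks.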

The results, in kB/s throughput, again show a measurable performance impact from using the new firmware, although at a smaller scale (5%).
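For anyone wanting to repeat the comparison, the percentages quoted are simply the relative change between the old and new firmware. A minimal sketch follows; the numbers in it are hypothetical placeholders chosen to give round figures, not our measured values.

```python
# Relative change between two benchmark results.
# All input numbers below are HYPOTHETICAL, for illustration only.

def percent_change(old: float, new: float) -> float:
    """Change of `new` relative to `old`, in percent."""
    return (new - old) / old * 100.0


if __name__ == "__main__":
    # Elapsed time: a larger number is worse.
    old_minutes, new_minutes = 50.0, 60.0
    print(f"run time change:   {percent_change(old_minutes, new_minutes):+.1f}%")

    # Throughput: a smaller number is worse.
    old_kbs, new_kbs = 100_000.0, 95_000.0
    print(f"throughput change: {percent_change(old_kbs, new_kbs):+.1f}%")
```

Note the sign convention: a slowdown shows up as a positive change in run time but a negative change in throughput.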

Instead of using local disks (direct attached storage), we also ran the tests over the network using our Lustre file system. We saw no performance impact in either test; however, in this case the 10Gb/s link was a bottleneck and may have influenced the results. We will investigate further as time allows.

13 April 2018

Data rates in the UK for the last 12 months: Wow, a lot of data goes between the WNs and SEs...

So with the move to new data and workflow models, as a result of the idea to create further storageless sites and caching sites, I decided to take a look at how large the data flows are within the UK. Caveat emptor: I took this data from the ATLAS dashboard and made the assumption that there is little WAN traffic from WNs to SEs. I am aware this is not correct, but it is at the moment a small factor for ATLAS. (Hence why I reviewed ATLAS rather than CMS, who I know use AAA a lot.)

In green are the data volumes in UK disk storage, in red are the rates out of the storage. (In blue is the rate for WAN transfers between UK SEs.) Purple shows the rates to and from the RAL tape system. Of note is that during the month 49.1PB of data was deleted from UK storage, out of a disk cache of ~33PB. What I note from these rates is that the 139PB of data ingested from storage into worker nodes and the 11.4PB written out from completed jobs is data that would have had to go over the WAN if WNs were not co-located with SEs.
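To put those volumes in network terms, here is a small sketch converting them to an average sustained rate. The reporting period is an assumption on my part (the 12 months of the title), so it is left as a parameter; the function names are just for illustration.

```python
# Convert the WN <-> SE data volumes into an average sustained rate.
# ASSUMPTION: the volumes cover a 365-day period; change `days` if not.

SECONDS_PER_DAY = 86_400


def wan_equivalent_pb(ingest_pb: float, output_pb: float) -> float:
    """Data that would cross the WAN if WNs were not next to SEs (PB)."""
    return ingest_pb + output_pb


def average_rate_gbps(volume_pb: float, days: float) -> float:
    """Average rate in Gb/s for `volume_pb` petabytes moved over `days` days."""
    bits = volume_pb * 1e15 * 8
    return bits / (days * SECONDS_PER_DAY) / 1e9


if __name__ == "__main__":
    total = wan_equivalent_pb(139.0, 11.4)
    print(f"WAN-equivalent volume: {total:.1f} PB")
    print(f"Average rate over 365 days: {average_rate_gbps(total, 365):.1f} Gb/s")
```

This is an average only; real job traffic is bursty, so peak rates would be considerably higher than the figure this gives.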

3 Meetings, 2 talk topic areas, 1 blogpost: Storage travels this month in a nutshell.

It has been a busy month for meetings, with the WLCG/HSF joint meeting in Naples, GRIDPP40 at Pitlochry and the GDB at CERN. I summarized the WLCG/HSF meeting at the GridPP storage meeting, and expanded on it in my talk at GRIDPP40 on the eudatalake project (goo.gl/uvVtdm).
But overall, if you need a summary, then these two talks from the GDB cover most areas. Short links are:

07 March 2018

LHCOPN/ONE meeting 2018 at Abingdon UK, or how the NRENs want to take over the storage world

OK, so my title may be full of poetic license, but I have fortunately been able to attend the WLCG LHCOPN/ONE meeting at Abingdon UK this week and wanted to find a way to get your attention. It might be easy to ask why a storage blog would be interested in a networking conference, but if we don't take into account how to transfer the data, and make sure the network is efficient for the data movement process, we will be in trouble. Remember, with no "N", all we do are WA/LA transfers. (What's WA/LA? Exactly!!)

The agenda for the meeting can be found here:
The meeting was attended by ~30 experts from both WLCG and, more importantly, from NREN network providers such as JISC/GEANT/ESNET/SURFnet (but not limited to these organisations).

My highlights may be subjective, and people should feel free to delve into the slides if they wish (if they have access). Here, however, are my highlights and musings:

From the site updates: RAL-LCG will be joining LHCONE; BNL and FNAL are connected, or almost connected, at 300G with a 400G transatlantic link to Europe. At the moment, the majority of T1-T1 traffic is still using the OPN rather than ONE. However, for RAL a back-of-the-envelope calculation shows that switching our connections to the US T1s would reduce our latency by 12% and 35% to BNL and FNAL respectively, so it could be a benefit.
Data volume increases are slowing: there was only a 65% increase in rate on LHCONE in the last year, compared to the 100% seen in each of the previous two years.
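Even a "slowing" 65% compounds quickly on top of two doublings. A minimal sketch of the arithmetic, using only the yearly percentages quoted above (the function name is just for illustration):

```python
# Compound the quoted year-on-year LHCONE rate increases.
from math import prod


def cumulative_factor(yearly_increase_percent: list[float]) -> float:
    """Overall growth factor after applying each yearly increase in turn."""
    return prod(1 + p / 100 for p in yearly_increase_percent)


if __name__ == "__main__":
    # 100% in each of the two previous years, then 65% in the last year.
    factor = cumulative_factor([100, 100, 65])
    print(f"Rate is now {factor:.1f}x what it was three years ago")
```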

Following the site updates was an interesting IPv6 talk, where my work comparing perfSONAR rates between IPv4 and IPv6 was referenced. (See previous blogpost.) It was also stated again that the next version of perfSONAR (4.1) will not be on SL/RHEL 6 and will only be available on 7.

There was an interesting talk on the new DUNE neutrino project and its possible usage of either LHCOPN or ONE.

The day ended for me with a productive discussion, agreeing that jumbo frames should be highly encouraged/recommended (but make sure PMTUD is on!).
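To illustrate why jumbo frames matter for bulk data movement, here is a rough sketch comparing a standard 1500 byte MTU with a 9000 byte one. It assumes plain Ethernet framing (38 bytes per frame of preamble, header, FCS and inter-frame gap) and IPv4 + TCP headers without options (40 bytes); VLAN tags, TCP options or IPv6 would change the numbers slightly.

```python
# Payload efficiency and frame rate for standard vs jumbo frames.
# ASSUMPTIONS: untagged Ethernet, IPv4 and TCP headers without options.

ETH_OVERHEAD = 38    # preamble+SFD 8, Ethernet header 14, FCS 4, inter-frame gap 12
IP_TCP_HEADERS = 40  # IPv4 header 20 + TCP header 20


def payload_efficiency(mtu: int) -> float:
    """Fraction of bytes on the wire that are TCP payload."""
    return (mtu - IP_TCP_HEADERS) / (mtu + ETH_OVERHEAD)


def frames_per_second(mtu: int, link_bps: float) -> float:
    """Frames per second needed to fill the link with full-size frames."""
    return link_bps / 8 / (mtu + ETH_OVERHEAD)


if __name__ == "__main__":
    for mtu in (1500, 9000):
        eff = payload_efficiency(mtu) * 100
        fps = frames_per_second(mtu, 10e9)
        print(f"MTU {mtu}: {eff:.1f}% payload, {fps:,.0f} frames/s at 10Gb/s")
```

The gain in payload efficiency is a few percent, but the bigger win is the roughly six-fold drop in frames (and hence interrupts and per-packet work) that hosts have to handle at a given rate.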

Day 2 was slightly more on the network techy side, and some parts, I have to admit, were lost on me. However, there were interesting talks regarding DTNs and Open Storage Networks, plus topics about demonstrators which were shown at SuperComputing. A rate of 378Gbps memory to memory is not to be sniffed at! How far NRENs want to become persistent storage providers is a question I would ask myself. However, I can see how OSNs and SDNs could do for data I/O workflows what the creation of collimators did to allow radio telescope interferometry to flourish.

23 February 2018

Lisa, the new sister to "Dave the dataset", makes her appearance.

Hello, I'm Lisa, similar to "Dave the dataset" but born in 2017 in the ATLAS experiment. My DNA number is 2.16.2251. The initial size of my 23 subsections is 60.8TB in 33651 files. My main physics subsection is 8.73TB (4726 files). I was born 9 months ago, and in that time I have produced 1281 unique children corresponding to 129.4TB of data in 60904 files. It is not surprising that I have a large number of children, as I am still relatively new and my children have not yet been culled.

It is interesting to see, for a relatively new dataset, how many copies of myself and my children there are.
There are 46273 files/ 60.248TB with 1 copy, 35807 files/ 62.06TB with 2 copies, 2959 files/ 4.94TB with 3 copies, 9110 files/ 2.16TB with 4 copies, 51 files/ 0.017GB with 5 copies and 80 files/ 0.44GB with 6 copies. Only four real scientists have data which doesn't have a second copy.
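For the curious, the copy counts above can be turned into a few summary numbers. This is just a sketch tallying the file counts quoted in this post (volumes are left out, as the units quoted are mixed); the variable and function names are illustrative only.

```python
# Summarise the replica distribution from the file counts quoted above.
# Key = number of copies, value = number of files with that many copies.
FILES_BY_COPIES = {1: 46273, 2: 35807, 3: 2959, 4: 9110, 5: 51, 6: 80}


def total_files(dist: dict[int, int]) -> int:
    """Number of distinct files."""
    return sum(dist.values())


def total_replicas(dist: dict[int, int]) -> int:
    """Number of file copies stored across all sites."""
    return sum(copies * files for copies, files in dist.items())


def single_copy_fraction(dist: dict[int, int]) -> float:
    """Fraction of distinct files that exist only once."""
    return dist.get(1, 0) / total_files(dist)


if __name__ == "__main__":
    files = total_files(FILES_BY_COPIES)
    replicas = total_replicas(FILES_BY_COPIES)
    print(f"distinct files:     {files}")
    print(f"stored replicas:    {replicas}")
    print(f"mean copies/file:   {replicas / files:.2f}")
    print(f"single-copy files:  {single_copy_fraction(FILES_BY_COPIES):.1%}")
```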

Analyzing how distributed around the world this data is shows that it is in 100 rooms in total across 67 houses.

Of course, more datasets are just about to be created with the imminent restart of the LHC, so we will see how my distribution and those of the new datasets develop.

21 February 2018

Dave's Locations in preparation for 2018 data taking.

My powers that be are just about to create more brethren of me in their big circular tunnel, so I thought I would give an update of my locations.

There are currently 479 rooms over 145 houses used by ATLAS. My data is, 8 years on, still in 46 rooms in 24 houses. There are 269 individuals, of which 212 are unique, 56 have a twin in another room and one is a triplet. In total this means 13GB of data has double redundancy, 5.48TB has single redundancy, and 2.45TB has no redundancy. Of note is that 5.28TB of the 7.93TB of data with a twin is from the original produced data.

My main concern is not with those "Dirk" or "Gavin" who are sole children, as they can easily be reproduced in the children "production" factories. Of concern are the 53 "Ursulas" with no redundancy. This equates to 159GB of data/ 6671 files whose loss would affect 17 real scientists.

06 February 2018

ZFS 0.7.6 release

ZFS on Linux 0.7.6 has now landed.


I would encourage everyone running the 0.7.0-0.7.5 builds to look into updating, as there are a few performance fixes associated with this build.
Large storage servers tend to have ample hardware; however, if you're running this on systems with a small amount of RAM then the fixes may bring a dramatic performance improvement.
Anecdotally, I've also seen some improvements on a system which hosts a large number of smaller files, which could be due to some fixes around the ZFS cache.

What if an update goes wrong?

I'm linking a draft of a flowchart I'm still working on to help debug what to do if a ZFS filesystem has disappeared after rebooting a machine:
https://drive.google.com/file/d/1hqY_qTfdpo-g_qApcP9nSknIm8X3wMwo/view?usp=sharing (Download and view offline for best results, there are a few things to check for!)

24 January 2018

Can we see improvement in IPv6 perfSONAR traffic for the RAL-LCG2 Tier1 site?

In three words (or my TL;DR response): Yes and No.
You may remember I made an analysis of the perfSONAR rates for both IPv4 and IPv6 traffic from the WLCG Tier1 at Rutherford Appleton Laboratory to other WLCG sites. Here is a quick update with new measurements: a plot showing perfSONAR rates for sites from 8 months ago, their new results, and values for new sites which have been IPv6 enabled and included in official WLCG mesh testing.
IPv4 vs IPv6 perfSONAR throughput rates between RAL and other WLCG sites.
What I find interesting is that we still have some sites which have vastly better IPv4 rates than IPv6 rates. NB we have 16 sites still with data, 5 sites with no current results and 10 new sites which have been added since the last tranche of measurements.