11 January 2019

Learning about Globus Connect/Online and more when the GlobusWorld Tour came to RAL

RAL recently hosted a GlobusWorld Tour event where ~50 researchers and system administrators from across the UK (and across research interest groups) gathered to learn about Globus Online.
A link to the agenda is here:
https://www.globusworld.org/tour/program?c=16

The sessions covered various areas, from server system administration, transfer usage and monitoring, to API/CLI access and app integration. As was pointed out, Globus do more than just data transfer. (I got to use Jupyter notebooks again :) ) What I also found helpful was trying to understand tokens as an authentication method. (Although most of it is still just a mystery to me.)
Useful pages to look at are the Globus how-to page:

 https://docs.globus.org/how-to/

and the GitHub repository:

https://github.com/globus

Time to go and see if I can set up a Globus Connect server on a Linux box with no GUI or browser...
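
For future reference, the Globus CLI looks like the easiest way to drive transfers from such a headless box once the endpoints exist. A minimal sketch (the endpoint UUIDs and paths are placeholders, and the flags are from memory, so check the how-to pages above):

  # log in without spawning a local browser; the CLI prints a URL to open
  # on another machine and then asks for the resulting code
  globus login --no-local-server

  # find the endpoints of interest
  globus endpoint search "RAL"

  # queue a recursive transfer between two endpoints and note the task ID
  globus transfer SRC_ENDPOINT_UUID:/data/run123 DST_ENDPOINT_UUID:/scratch/run123 \
      --recursive --label "RAL test transfer"

  # check on it later
  globus task show TASK_ID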

04 January 2019

This is Dave; First of the ATLAS datasets at RAL-LCG2, signing off.

The last physics beams and collisions in the LHC for the foreseeable future finished last month. This has been described better than I could do it here:
https://www.symmetrymagazine.org/article/lhc-ends-second-season-of-data-taking
I thought I would use this moment to finish my description. I and my children are still in 43 rooms spread over 24 houses. We are in Switzerland, USA, Germany, UK, France, Portugal, Netherlands, Japan, Turkey, Canada, Italy, Israel and Czech Republic. There are 268 datasets, of which 1 has three replicas, 56 have a twin, and the remaining 211 are at risk of extinction if a room were to be destroyed. Only 54 datasets are user-derived rather than being centrally produced.

In general, since my birth nearly nine years ago, the number of my primary brethren at my initial home on the RAL tape system may also be of interest to people.
In the "DATATAPE" room of RAL-LCG2 there are 17855 people. It's amazing how we can all fit in one room! 4865 of these are unique data runs. (If you recall that my DNA value is 2.3.7.3623, then the last value at RAL is 2.3.3.20441.) There are 7628774 files filling 7.294 PB of RAW data (out of a total of 9.925 PB).

Below is a plot of the distribution of dataset size (GB) for the stored data runs:


More datasets will be coming later (though when is still to be confirmed). Data rates will be much larger, and so greater volumes of data will need to be recorded. It's going to be an interesting couple of years working out how to do this, and a challenge to prepare for it while still providing the functionality and capacity to deliver my current and future children to the users who will continue to analyse me and my brethren in the coming years. This is Dave; first (but not the last) of the ATLAS raw datasets stored at RAL-LCG2, signing off.

31 December 2018

Pretrospective

Traditionally at the end of the year it falls to the undersigned to look back at highlights of the past year and forward to issues in the coming year. By which we mean technical issues, not political, economic, or peopleological ones - no mention of the B-word or the impact it has already had on us.

Looking back, without the benefit of the archive (an internal network error has, ironically, taken it temporarily offline), one of the significant developments of 2018 was the formation of what is now known as IRIS, an infrastructure for an amalgamation of STFC-funded research communities. GridPP, of course, is a part of this, an example of how to provide an infrastructure even if the solution would not work directly for all user communities. Also highlight-worthy: the work on object stores, and on ZFS.

Looking forward, expanding the work with IRIS will be interesting, in particular the competition for the middleware on top of the IaaS and storage, because the current infrastructure requires a fair bit of IaaS expertise. Less new, but still not sufficiently followed up on, are opportunities to work on the integration of data centre automation and information systems, although even some storage vendors at CIUK knew little about this, and WLCG itself seems to have given up on this angle. In fact, looking further back to the early noughties, it is remarkable how often the wheel gets reinvented: thing B gets replaced with thing C, which is a reinvention of thing A, which was originally replaced by B.

Coming back to 2019, more pertinent still for the GridPP T2s will be the ongoing evolution of sites to storage sites and cache-only sites, a process which continues to pose lots of interesting technical challenges.

19 September 2018

When did we switch storage systems.... I'm not sure I can tell.....

So as part of the storage evolution within the UK, the Birmingham site is switching SEs from DPM to EOS for the ATLAS experiment. However, this blog post isn't about that; it's about what is happening in the meantime... Similar to the way in which we pointed the UCL WNs at QMUL storage in London, ATLAS have switched the 'Brum' WNs to use storage at Manchester. This "greatly" increases the RTT between WN and SE from ~0.1ms to ~4ms. So I got to wondering: is this increase in latency noticeable in the job efficiency for the VOs? Here are the graphs from the VO for Birmingham and Manchester for the last month.
For Manchester, as a control set for SE performance, we have:

And for Birmingham we have:


I leave it to the reader to come to their own conclusions about whether Birmingham efficiencies have dropped since changing SEs (even if that conclusion is "you need more data to make a proper comparison"), but at least it's something to get things going...
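
(As an aside, a quick and dirty way to sanity-check the WN-to-SE round-trip times quoted above is simply to ping the SE headnodes from a worker node; the hostnames below are made up:)

  # from a Birmingham WN, compare the old local SE with the Manchester one
  ping -c 20 local-se.example.ac.uk | tail -1     # expect an average of ~0.1 ms
  ping -c 20 remote-se.example.ac.uk | tail -1    # expect an average of ~4 ms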

11 September 2018

40/100G networks are here! (but also have been for a while...)

So hearing that RAL now has a 100G link into JANET started me thinking again about what kind of network tuning might be needed... A quick bit of searching with a popular search engine (plus the knowledge that there are certain individuals who can provide useful info in this area) turned up:

https://fasterdata.es.net/host-tuning/100g-tuning/

https://www.es.net/assets/Uploads/100G-Tuning-TechEx2016.tierney.pdf

What surprised me (and shouldn't have) was how old these posts were! Of particular interest is that there really is a benefit to getting disk servers upgraded to SL/CentOS 7 rather than 6. I also did not know there was a 2GB limit on the TCP window size.
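
For my own notes, much of the host tuning boils down to a handful of sysctl settings of the sort fasterdata suggests. The values below are illustrative only (the 2147483647 figures being exactly that ~2GB window cap), so check the links above for the current advice:

  # illustrative 100G host tuning, loosely following fasterdata.es.net
  sysctl -w net.core.rmem_max=2147483647                  # ~2GB: the TCP window cap
  sysctl -w net.core.wmem_max=2147483647
  sysctl -w net.ipv4.tcp_rmem="4096 87380 2147483647"
  sysctl -w net.ipv4.tcp_wmem="4096 65536 2147483647"
  sysctl -w net.ipv4.tcp_mtu_probing=1                    # useful where jumbo frames are in play
  sysctl -w net.ipv4.tcp_congestion_control=htcp          # htcp (or bbr on newer kernels)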

Successful DPM/DOME BOF/Hackathon

As part of a side meeting to the GRIDPP41 meeting held in Ambleside this year, we also had a successful BOF/Hackathon meeting. Now, I probably should care more about the WN configuration/security hack session, but that really is SEP to me. However, the discussion we had regarding Lancaster's DOME deployment and DPM legacy mode in general was useful for understanding the DPM roadmap and how we are going to roll out an SRM-less service at Tier-2 sites. This is germane at the moment as some of the features, such as third-party copy (or TPC as it shall now be known...), are needed for supporting data activities not just for the WLCG VOs but also for some of our smaller communities (DUNE is an example).
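
As a reminder to myself of what TPC looks like from the client side, it is roughly the following (hostnames and paths are made up, and gfal-copy should fall back to streaming the data through the client if the endpoints cannot copy directly between themselves):

  # ask the two storage endpoints to copy directly between themselves,
  # rather than pulling the data through the client host
  gfal-copy -p davs://source-se.example.ac.uk/dpm/example.ac.uk/home/dune/file.root \
               davs://dest-se.example.ac.uk/dpm/example.ac.uk/home/dune/file.root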

24 August 2018

Help, ZFS ate my storage server (kernel segfaults on SL6)


At Edinburgh our storage test server (SL6) just updated its kernel and had to reboot. Unfortunately it did not come back, suffering a kernel segfault during the reboot.

The crash was spotted to happen during the filesystem mounting stage of the init scripts, and was specifically caused by modprobe-ing the zfs module which had just been built by dkms.

The newer SL6 Red Hat kernels (2.6.32-754...) appear to have broken part of the kernel ABI used by the ZFS modules built by dkms.

The solution to fix this was found to be:
  1. Reboot into the old kernel (anything with a version 2.6.32-696... or older)
  2. Check dkms for builds of the zfs/spl modules:   dkms status
  3. Run:   dkms uninstall zfs/0.7.9; dkms uninstall spl/0.7.9
  4. Make sure dkms removed these for ALL kernel versions (if needed, run dkms uninstall zfs/0.7.9 -k 2.6.32-754 to remove it for a specific kernel)
  5. Remove all traces of these modules:
     # clear any leftover zfs/spl module files for every installed kernel
     # (-f skips combinations that don't exist rather than erroring)
     for i in /lib/modules/*; do
      for j in extra weak-updates; do
       for k in avl icp nvpair spl splat unicode zcommon zfs zpios ; do
         rm -rf ${i}/${j}/${k};
       done;
      done;
     done
  6. Reboot back into the new kernel and reinstall the modules (spl first, as zfs depends on it):
    dkms install spl/0.7.9; dkms install zfs/0.7.9
  7. Check that you've saved everything important.
  8. Now load the new modules: modprobe zfs
  9. Re-import your pools: zpool import -a
Alternatively: remove all of the zfs modules (steps 3 and 5) before you reboot your system after installing the new kernel, and then dkms will re-install everything on the next reboot.
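
In command form, that pre-emptive route looks roughly like this (run it while still on the old, working kernel, straight after the kernel update):

  dkms uninstall zfs/0.7.9; dkms uninstall spl/0.7.9   # same versions as above
  # ...plus the rm loop from step 5 to clear out the /lib/modules leftovers...
  reboot
  # dkms then rebuilds spl and zfs against the new kernel on the way back up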

For more info: https://github.com/zfsonlinux/zfs/issues/7704


TL;DR: When building against new kernels, dkms doesn't always rebuild external modules safely; make sure you remove these modules when you perform a kernel update so that everything is rebuilt cleanly.

27 June 2018

CMS dark data volume after migration to the new SE at the RAL Tier-1

CMS have recently moved all their disk-only storage usage from CASTOR to CEPH at the RAL-LCG2 Tier-1. (They are still using CASTOR for tape services.) CMS had been using approximately 2PB of disk-only CASTOR space on dedicated servers. What is of interest is that after CMS had moved their data and deleted the old data from CASTOR, there was still ~100TB of dark data left on the SE. We have now been able to remove this data and have started the process of decommissioning the hardware. (Some hardware may be re-provisioned for other tasks, but most is creaking with old age.)
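
(For anyone wondering what "dark data" means in practice: files sitting on the SE that no longer appear in the experiment's catalogue. One generic way to spot them is to diff a namespace dump against a catalogue dump; the file names below are made up, and both dumps are assumed to hold one path per line:)

  sort se_namespace_dump.txt > se_sorted.txt
  sort catalogue_dump.txt > cat_sorted.txt
  # paths present on the SE but absent from the catalogue = dark data candidates
  comm -23 se_sorted.txt cat_sorted.txt > dark_files.txt
  wc -l dark_files.txt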

ATLAS are almost in a similar position. I wonder how much dark data will be left for them... (Their capacity is 3.65PB at the moment, so, scaling from the CMS numbers, my guess is ~150TB.)

When all the dark data is removed, and the files removed from the namespace, we can also clean up the dark directory structure which is no longer needed. I leave it to the reader to guess how many dark directories CMS have left...