02 March 2017

Data Recall Tests from Tape for ATLAS successful... ish

The ATLAS VO wanted to test the data rates achievable when recalling real data from the tape system, to see what we could achieve. We (RAL) decided to make this even harder on ourselves by getting them to move the data to our new CEPH storage system, so as to further test its deployment. This had some success and uncovered issues which we have been working to fix. I won't go into all of them here, and will leave it to others to expand if they wish, other than to say the following.

The data rate was improved by fixing the number of concurrent transfers between the two storage systems, which we did by disabling the auto-configure feature in FTS for the particular FTS "link". We set it to 32, as it had been downgraded to only two due to other issues. Whether 32 is the right value should be investigated in further tests.
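For anyone repeating this, pinning a link looks roughly like sending a link-configuration payload to the FTS3 REST config API. A minimal sketch of building that payload is below; the field names (`symbolicname`, `min_active`, `max_active`) and the endpoint hostnames are assumptions based on the FTS3 config API and should be checked against your FTS version's documentation.

```python
# Hedged sketch: construct a link-config payload for FTS3's REST
# configuration API. Field names and endpoints are assumptions -
# verify against your FTS instance before use.

def build_link_config(source, destination, max_active, symbolic_name):
    """Return a link-config payload that pins the number of concurrent
    transfers, effectively stopping the FTS optimizer from auto-tuning
    the link (in our test it had dropped to 2 before we fixed it at 32)."""
    return {
        "symbolicname": symbolic_name,
        "source": source,
        "destination": destination,
        # Pin min == max so the optimizer cannot scale the link down.
        "min_active": max_active,
        "max_active": max_active,
    }

config = build_link_config(
    "gsiftp://castor.example.ac.uk",    # hypothetical source endpoint
    "gsiftp://ceph-gw.example.ac.uk",   # hypothetical destination endpoint
    max_active=32,
    symbolic_name="castor-to-ceph",
)
```

This payload would then be PUT/POSTed to the FTS config endpoint with an authorised client; the point is simply that both the floor and ceiling are set to the same value so the link stays at 32.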

What I will give here are the headline figures we achieved. When things were working well, we were recalling from tape at ~900MB/s and then copying from the Castor disk cache into our object store (using the gsiftp protocol) via FTS at ~450MB/s. (The total data volume expected to be moved was 150TB, but we seem to have seen 230TB of data moved.) This used three gateway machines on the CEPH cluster and 12 disk servers of varying quality in the Castor disk cache (with capability for up to 160 concurrent transfers), connected to our tape library, which had up to 10 of its 16 tape drives available (this limit is due to an internal tape family setting which we shan't worry about now).
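For a sense of scale, a sustained 450MB/s puts the nominal 150TB volume at roughly four days of continuous copying. A quick back-of-envelope check (decimal units, ignoring stalls and contention):

```python
# Back-of-envelope: how long does 150 TB take at a sustained 450 MB/s?
volume_tb = 150
rate_mb_s = 450

seconds = (volume_tb * 1e6) / rate_mb_s   # 1 TB = 1e6 MB (decimal units)
days = seconds / 86400                    # 86400 seconds per day

print(f"{seconds:.0f} s ≈ {days:.1f} days")  # → 333333 s ≈ 3.9 days
```

The 230TB actually observed would correspondingly be closer to six days of wall-clock copying at that rate, which matches the flavour of a multi-day test.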

We also had to allow for other VOs using the system, and for internal ATLAS contention for resources during the test. There have probably been earlier tests that my colleagues have done quietly in the background, but this is the first large-volume data recall and movement combining CEPH and Castor, so I thought I would record it here for posterity (and as a reference for the next time the test is run).
