30 June 2015

Musings on data confidentiality

Recently I was asked whether STFC should store classified data, such as Secret data (STFC being a government facility, all our data is already Official).

If you look at a "normal" data centre like those run by the big cloud providers, they are typically set up to ensure data confidentiality. They have special personnel who are authorised to enter the data centre, and all sorts of physical security measures. If the centre stores Secret data, those personnel will also need the appropriate clearance.

We have security measures too, but we also take visitors round our data centre, and if they are monitored all the time it is more for their own health and safety than because we don't trust them. They can take pictures if they like. Of course we would very much like them not to press any buttons, but that's also why there's someone with them. We also have students who come and work with us in the data centre, and who leave feeling they have made real contributions.

The three basic data security goals are confidentiality, integrity, and availability, and all three are of course important. A "conventional" data centre would probably prioritise confidentiality first, then integrity, and finally availability: it is better that data is temporarily inaccessible than leaked. RAL's data centre, on the other hand, orders them differently: for us integrity comes top - we spend a lot of time checksumming files at rest and in flight, and comparing lists of files with other lists, data volumes with data volumes. Availability is also highly important, as science data is collected, transmitted, and processed around the clock. And then, in a sense, confidentiality comes last: for example, hardly anything is encrypted in flight because it would just slow transfers down.

Of course we still need to protect scientists' data because "there is definitely a Nobel prize in there!", but our data is not national security, nor even personal/medical data. And there is something to be said for openness too - making open data available, and showing the public some of the good stuff we do. It would also be quite costly to protect against a "highly capable threat" - money which is better spent making things go faster. Leave other data centres to guard the national secrets.
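To give a flavour of what "checksumming and comparing lists of files" means in practice, here is a minimal Python sketch. The catalogue, the file paths and the choice of Adler-32 are placeholders for illustration only - the real machinery lives in the storage systems and grid middleware:

import zlib
from pathlib import Path

def adler32_of(path, chunk_size=1 << 20):
    """Stream a file through Adler-32 so large files need not fit in memory."""
    value = 1  # Adler-32 starting value
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            value = zlib.adler32(chunk, value)
    return format(value & 0xFFFFFFFF, "08x")

def verify(catalogue):
    """Compare checksums recorded in a catalogue (here just a dict of
    path -> checksum) against what is actually on disk, and report
    missing files and mismatches."""
    problems = []
    for path, expected in catalogue.items():
        if not Path(path).is_file():
            problems.append((path, "missing"))
        elif adler32_of(path) != expected:
            problems.append((path, "checksum mismatch"))
    return problems

# Hypothetical catalogue entries; a real catalogue would come from the
# storage system's own database.
catalogue = {
    "/dteam/data/run001/file0001.dat": "1a2b3c4d",
    "/dteam/data/run001/file0002.dat": "deadbeef",
}
for path, problem in verify(catalogue):
    print(f"{problem}: {path}")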

24 June 2015

The firewall did it

Now that we have sort of mostly finished setting up the DiRAC data transfers to RAL, we look at the weeks it took and wonder (a) was it worth it and (b) why did it take weeks?

While we initially only back up data from the DiRAC sites - starting with Durham - into the RAL Tier 1, the reason we set them up as a grid VO is so we can have the grid tools drive the data transfers. The thinking is that although there is an overhead in setting it up and getting it working, the tools that moved nearly a quarter of an exabyte last year will then move the data with the highest possible efficiency. To begin with we are going to let it run as fast as it can, until someone complains or we hit a reasonable target/limit of 300-400 megabytes per second.


[Edit: updated the image as I had inadvertently put in a link to a 'live' image rather than the snapshot]

The green stuff in the plot is primarily DiRAC data coming in at some 250 MB/s; the spike is not related to DiRAC (this would be a case where the most prominent feature in the plot is of no interest to the discussion... a good way to capture readers' attention, perhaps?)
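As a back-of-the-envelope check on those rates, using only the figures quoted above (and decimal units throughout):

# What does "nearly a quarter of an exabyte in a year" mean as a sustained rate,
# and how does the 300-400 MB/s DiRAC target compare? (1 MB = 1e6 bytes here.)

SECONDS_PER_YEAR = 365 * 24 * 3600

grid_volume_bytes = 0.25e18            # roughly a quarter of an exabyte
grid_rate_mb_s = grid_volume_bytes / SECONDS_PER_YEAR / 1e6
print(f"Grid-wide average: ~{grid_rate_mb_s:,.0f} MB/s sustained")   # ~7,900 MB/s

dirac_rate_mb_s = 350                  # middle of the 300-400 MB/s target
dirac_volume_tb_year = dirac_rate_mb_s * 1e6 * SECONDS_PER_YEAR / 1e12
print(f"DiRAC at {dirac_rate_mb_s} MB/s for a year: ~{dirac_volume_tb_year:,.0f} TB")

So the grid tools as a whole sustain the best part of 8 GB/s averaged over a year, and even the comparatively modest DiRAC target would add up to around 11 PB a year if it ran flat out.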

Another advantage of having them griddified is that if, in the future, we decide to do more - like moving the data elsewhere or starting to do analysis - it's all ready to go.

So why does it take time to set up? Part of it is all the technical things that need setting up - VOs, local accounts, mailing lists, certificates, gridmap files, monitoring. None of them is too onerous, but each takes some time to fill in a form and process; they may have changed since the last time we did it; they take time to debug if they aren't working properly; and in the worst case only one person knows how, or is authorised, to do it and is on leave, off sick, or busy.
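As one small example of the kind of configuration item involved: a classic grid-mapfile is just a text file mapping certificate DNs to local accounts, and sanity-checking one takes a few lines of Python. The DNs and account names below are invented for illustration:

import re

# A Globus-style grid-mapfile maps a quoted certificate DN to one or more
# local accounts. The entries here are made up.
GRIDMAP = '''\
"/C=UK/O=eScience/OU=CLRC/L=RAL/CN=some user" dirac
"/C=UK/O=eScience/OU=Durham/CN=another user" dirac,dteam
'''

ENTRY = re.compile(r'^"(?P<dn>[^"]+)"\s+(?P<accounts>\S+)\s*$')

def parse_gridmap(text):
    """Return a list of (DN, [accounts]) pairs, complaining about malformed lines."""
    mappings = []
    for lineno, line in enumerate(text.splitlines(), 1):
        if not line.strip() or line.lstrip().startswith("#"):
            continue  # skip blanks and comments
        m = ENTRY.match(line)
        if not m:
            raise ValueError(f"malformed grid-mapfile line {lineno}: {line!r}")
        mappings.append((m.group("dn"), m.group("accounts").split(",")))
    return mappings

for dn, accounts in parse_gridmap(GRIDMAP):
    print(f"{dn} -> {', '.join(accounts)}")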

Then there are the processes: since what is being granted is access to some very high-end computing and storage systems, there are processes for reviewing authorisations, proposals, permissions, allocations and quotas, etc. These, too, take time, particularly if a panel review is involved.

Finally there's putting all the pieces together to see if it works. And when it doesn't, is it the VO's fault - they may be new to the business and doing something strange - or is there something wrong with the infrastructure - not unlikely if something new has been set up for them? In our case it didn't work, and it turned out that GridFTP, as the data movement protocol, now uses UDP, and the Durham firewall blocked UDP. With firewalls there is a tradeoff between the efficiency of the transfer (less firewall is better) and the security they provide (more firewall is better). GridFTP needs both "control" ports, where services listen all the time, and "data" ports, which are ephemeral and so need to be opened in a known port range.
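For the TCP side at least, the quickest way to find out whether a firewall is in the way is simply to try the ports. A rough sketch follows: the endpoint name is hypothetical, 2811 is the conventional GridFTP control port, the data-channel range shown is only an assumption (each site configures its own, e.g. via GLOBUS_TCP_PORT_RANGE), and the UDP path that actually bit us cannot be probed with a simple connect like this:

import socket

def tcp_port_open(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Hypothetical endpoint and made-up data port range.
host = "gridftp.example.ac.uk"
print("control port 2811:", "open" if tcp_port_open(host, 2811) else "blocked/closed")
for port in (20000, 20025, 20050):
    print(f"data port {port}:", "open" if tcp_port_open(host, port) else "blocked/closed")

# Note: a data port only accepts connections while a transfer is actually using it,
# so "blocked/closed" does not necessarily mean the firewall is at fault; and UDP,
# being connectionless, needs a different test altogether.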