24 March 2016

Some thoughts on data in academic environments vs industry, part 1 (of 2)

I was asked today about my opinion on the difference between big data in academia and industry. As I see it, we have volume: data collections on the order of tens of PBs (e.g. WLCG, climate). Characteristically, velocity is an order of magnitude greater than volume, because data is moved around and replicated; FTS alone was recently estimated to have moved nearly a quarter of an exabyte in a year, and Globus advertise their (presumably estimated) data transfer volumes on their home page. Most science data is copied and replicated, as large scale data science is a global endeavour, requiring the collaboration of research centres across the world.
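For a rough sense of what "nearly a quarter of an exabyte in a year" means as a sustained rate, here's a back-of-envelope sketch in Python (the 0.25 EB figure is the one above; everything else is just unit conversion):

    # Back-of-envelope: ~0.25 EB moved in a year, expressed as an
    # average sustained transfer rate.
    bytes_moved = 0.25 * 10**18          # a quarter of an exabyte, in bytes
    seconds_per_year = 365 * 24 * 3600   # ~3.15e7 seconds

    avg_rate_bps = bytes_moved * 8 / seconds_per_year
    print(f"Average rate: {avg_rate_bps / 1e9:.0f} Gb/s")   # ~63 Gb/s

That's around 60 Gb/s averaged over the whole year, which is comparable to the per-site bandwidth figures below.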

But we (science) have less variety. Physics events are physics events, and even with different types like raw, AOD, ESD, etc., there is a manageable collection of formats. Different communities have different formats, but as a rule, science is fairly consistent.

Bandwidth into sites is measured in tens of Gb/s (10, 40, 60, that sort of thing); and the 16 PB Panasas system for JASMIN can shift something like 2.5 Tb/s (think of it as roughly a million times faster than your home Internet).
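The "million times" comparison is easy to sanity-check, assuming a home connection of around 2.5 Mb/s (an assumed, fairly modest figure; a faster line changes the factor accordingly):

    # Rough check of the "million times faster" comparison.
    jasmin_rate_bps = 2.5e12   # ~2.5 Tb/s for the JASMIN Panasas system
    home_rate_bps = 2.5e6      # assumed ~2.5 Mb/s home broadband

    print(f"Factor: {jasmin_rate_bps / home_rate_bps:,.0f}x")   # ~1,000,000x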

Moreover, for some things like WLCG, data models are very regimented, ensuring that processing happens in an orderly fashion rather than a chaotic one. We have security strong enough to let us run services outside the firewall (as otherwise we'd slow the transfers down a lot, and/or melt the firewall).

And the expected evolution is more of the same: towards 100 PB for data collections and 1 EB for data transfers. More data, more bandwidth, perhaps also more diversity within research disciplines. When will we get there? If growth is truly exponential, it could be too soon :-) we'd get more data before the technology is ready. Even with sub-exponential growth, we may have scalability issues: sure, we could store an exabyte today in a finite footprint, but could we afford it?
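As a purely illustrative toy projection (the starting size and doubling time are assumptions, not measurements): if a 25 PB collection doubled every two years, it would pass 100 PB in about four years and 1 EB in roughly a decade.

    # Toy projection: assumed 25 PB collection doubling every two years.
    import math

    start_pb = 25.0
    doubling_years = 2.0

    def years_to_reach(target_pb):
        # Solve start_pb * 2**(t / doubling_years) = target_pb for t.
        return doubling_years * math.log2(target_pb / start_pb)

    print(f"100 PB in ~{years_to_reach(100):.1f} years")          # ~4.0 years
    print(f"1 EB (1000 PB) in ~{years_to_reach(1000):.1f} years") # ~10.6 years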

"Big Science" security goals are different - data is rarely personal, but may be commercially sensitive (e.g. MX of a protein) or embargoed for research reasons. Integrity is king, availability is the prince, and the third "classic" data security goal, confidentiality, not quite a pauper, but a baronet, maybe?! It's an oversimplified view but these are often the priorities. Big Science requires public funding so the data that comes out of it must be shared to maximise the benefit.

There's also the human side. With the AoD (Archer-on-Duty) looking after our storage systems, it does not seem likely we will ever get ten times more people for an order of magnitude more data. But then, compared to when data was at the order-of-1-PB scale, we haven't needed ten times the people we had then, either.

Agree? Disagree? Did I forget something?
