Big data is a big buzzword these days, but it's also a real "problem" in some practical sense: data can be difficult to move, to store, to process, to preserve when you have lots of it. And yet big data is also an opportunity.
Well, we've had our big data workshop. A proper writeup will be, er, written up, but here are a few personal notes to whet your appetite (hopefully). You can of course also go and view the presentations at the workshop - all speakers have made their presentations available.
- Lots of research communities have "big data" - in fact, it's hard to think of one that doesn't. Perhaps it's like email: after email became widely used at universities, researchers could communicate rapidly with each other, extending remote collaborations - but it's also a double-edged sword in the sense that we now spend much of our time processing email. Big data tools, I think, enable each research community to do more, but perhaps at a cost.
- The cost is not always obvious beforehand - many speakers mentioned the complexity of software. Maintaining software for your data that makes use of hardware advances - and, er, runs correctly - is nontrivial.
- Likewise, adapting existing code to process "big data" - to what extent is it like adapting applications and algorithms to run in parallel, distributed, on the grid, or in the cloud?
- There are clearly opportunities for sharing ideas and innovations - the LHC grid may be finding events not inconsistent with the existence of Higgs-like particles (or, to the less careful, discovering the Higgs), thanks to a global network of data transfer, storage, and computing (of which GridPP is a part) crunching data on the order of hundreds of petabytes. But this work doesn't rely on visualisation to the extent that astronomy does. And in the humanities, artists have found new ways of visualising data.
- And who mentioned software complexity? The computing infrastructure evolved along with the building of the collider, so there was time to test it. The last thing a researcher wants is to debug the infrastructure (well, actually, a few quite like that, but most would rather just get on with the research they're supposed to be doing).
- Data policies - open data, sharing - making data sets usable - and giving academic credit to researchers for doing this. There is more to big data in research than the ability to store it. Human stuff, too. Policy. Security. Identity management. That sort of stuff.
- I think there is a gap between hardware/tools on one side and communities on the other - and the bridge is the infrastructure provider. But there is sometimes more to it than that - some communities find it harder to share than others. A cultural change may be needed.
- And making use of big data tools - as with clouds and grids - sometimes it's easier (or it feels more secure) to run things locally, if you can. Make use of tools that make it easy. Learn from others.