The first thing I ask a candidate who says they want to work on "Big Data" is "What Constitutes Big Data?"
They'll throw out a number like "10GB" or "5PB". Both are ridiculous answers. I'll let you in on a secret: there is no right answer to this. People answer this question based on personal experience. They'll have just been at a company whose 300GB SQL Server installation is creaking under its own weight, so 1TB becomes "Big Data" in their mind.
There are really two axes to look at:
- Size of the data in terms of disk or memory
- Complexity of analyzing that data [EDIT: the complexity of things you want to know]
Most problems don't challenge both of these axes. Furthermore, most people confuse the two.
Size is not usually a problem. As Ted Dziuba points out, eloquently as usual, most processing is not complex, even if the data is large in size. If you need to know whether treatment A or treatment B of your Facebook game sold more virtual junk, it's just not that hard to figure out. grep, cut, sort, and uniq will get you there.
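To make that concrete, here's a minimal sketch of the A-vs-B comparison with nothing but those four tools. The log file name and its space-separated "timestamp event treatment item" format are assumptions for illustration, not anyone's real schema:

```shell
# Fake event log: one line per event, with the treatment group in column 3.
cat > events.log <<'EOF'
2011-05-01T10:00 login A
2011-05-01T10:01 purchase A sword
2011-05-01T10:05 purchase B shield
2011-05-01T10:07 purchase A potion
2011-05-01T10:09 login B
2011-05-01T10:12 purchase B potion
2011-05-01T10:15 purchase A sword
EOF

# Keep only purchases, pull out the treatment column,
# sort so identical groups are adjacent, then tally with uniq -c.
grep purchase events.log | cut -d' ' -f3 | sort | uniq -c
# → 3 A, 2 B: treatment A sold more virtual junk
```

That's the whole analysis: no cluster, no job scheduler, and it runs in milliseconds on anything short of hundreds of gigabytes.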
I jokingly posted to Twitter today that I'm working on my Small Data skills for this reason. I'm helping out with some analytics for some content optimizations, but I'm using pure unix and a simple Python script to pull it together. No Hadoop. No MapReduce. Just not needed here.
Complexity is really the major problem to tackle: making a sensible decision based on the information at hand. Collecting ever more data is often a crutch to avoid needing real intelligence. A "machine learning" algorithm might need 10,000 keyword searches to deduce what kind of person you are, but a human brain might need just one keyword. Or even just by lookin' at you.
Bottom line: learn to differentiate "big" data from just data. Chances are you're working with regular, boring, small data. Embrace traditional data marts if you have to for historical analytics. Then just use the Taco Bell techniques that Ted Dziuba describes above.