Parlor Game . . .
Surely you have been part of a “guessing game” at a
dinner party. Someone says, “Think of 10 numbers, tell me their sum, and then tell me 2 of
the 10 numbers; I will tell you the other 8 numbers you thought of.” The
general reaction will be that this person is crazy! Too many unknowns, not
enough constraints, too many solutions, and so on.
Compressive Sensing:
There is a tsunami of activity in mathematical circles
these days related to this parlor guessing game, with remarkable solutions for the
guesses – which sounds like magic (in fact, “l1-Magic” is the name
of a popular MATLAB package that provides such a solution).
Donoho and Candes are the prime movers of Compressive
Sensing (CS) and related theories, and their pedigrees are impressive (Donoho is a MacArthur
"genius award" winner). For engineers, the simplest context in which
to understand CS is data acquisition. If a time series is a sum of 2 sinusoids,
we have to sample it at twice the frequency of the highest-frequency sinusoid
(the Nyquist rate). Depending on the relative frequencies of the sinusoids, we
may have to acquire hundreds of samples to reconstruct the original
time series faithfully. CS theory asserts that you need only about 8 random
samples, not hundreds, to faithfully reconstruct the data! On the face of it,
this assertion looks absurd: it flies in the face of the long-standing
understanding of the Nyquist criterion.
However, a glimmer of insight arises when we note
that there are only 2 sinusoids in the data! If we can somehow
gather the information about these 2 sinusoids from 8 random samples, we may be
able to reconstruct the data fully. In such a “rank-limited” situation (along
with a few easy-to-meet restrictions), CS theory provides a way to completely
recreate the original data.
CS theory is accessible, and three easy-to-read intros are (1)
Candes & Wakin, “An Introduction to Compressive Sampling”, IEEE Signal
Processing Magazine, 2008; (2) Baraniuk, “Compressive Sensing [Lecture Notes]”, IEEE
Signal Processing Magazine, 2007; and (3) Candes & Plan, “Matrix Completion
with Noise”, Proceedings of the IEEE, 2010.
As you will find from these and other excellent articles
on the topic, the approach is to solve a constrained minimization problem using
a certain norm. The l2-norm we are used to is largely unsuitable for
this task because it almost never leads to the desired sparse solution (perhaps
the reason why no one saw this solution until the 2000s). The so-called l0-norm
will work but is not tractable. However, the l1-norm (the “taxi-cab”
metric) does the trick. Many linear programming solutions exist; MATLAB implementations are available from multiple websites.
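To make the linear-programming route concrete, here is a minimal sketch in Python (not the l1-Magic MATLAB code itself) that recovers a signal built from 2 cosine (DCT) atoms using only a handful of random time samples. The signal length, the number of samples, and the DCT sparsifying basis are illustrative choices, not prescriptions from the articles above.

```python
# A minimal basis-pursuit sketch: recover a signal made of 2 cosine (DCT) atoms
# from a few random time samples by minimizing the l1-norm of its coefficients,
#   minimize ||x||_1  subject to  A x = b,
# cast as a linear program (sizes and the DCT basis are illustrative choices).
import numpy as np
from scipy.fft import idct
from scipy.optimize import linprog

n = 128                                      # length of the full time series
Psi = idct(np.eye(n), axis=0, norm="ortho")  # maps DCT coefficients -> time domain

x_true = np.zeros(n)                         # "rank-limited": only 2 active atoms
x_true[5], x_true[12] = 1.0, 0.5
signal = Psi @ x_true                        # the full series we never fully sample

m = 20                                       # number of random samples (<< n)
rng = np.random.default_rng(0)
idx = np.sort(rng.choice(n, size=m, replace=False))
A, b = Psi[idx, :], signal[idx]              # what we actually measure

# Basis pursuit as an LP over z = [x; u]: minimize sum(u) with A x = b, -u <= x <= u.
c = np.concatenate([np.zeros(n), np.ones(n)])
A_eq = np.hstack([A, np.zeros((m, n))])
A_ub = np.vstack([np.hstack([ np.eye(n), -np.eye(n)]),
                  np.hstack([-np.eye(n), -np.eye(n)])])
res = linprog(c, A_ub=A_ub, b_ub=np.zeros(2 * n), A_eq=A_eq, b_eq=b,
              bounds=[(None, None)] * n + [(0, None)] * n)

x_hat = res.x[:n]
print("max error in the reconstructed series:", np.max(np.abs(Psi @ x_hat - signal)))
```

The same recipe scales to much longer series; the point is simply that the l1 objective, unlike l2, drives most coefficients to exactly zero.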
Data Analytics:
The references I mentioned above provide motivation from the
data analytics domain (mostly from the Netflix recommendation challenge). Let me
give an example from Retail analytics.
In Retail/CPG analytics, a large amount of shopper data is
collected via loyalty cards, transaction logs, credit card information, etc.
All of these data are collated into a large data matrix. For example,
the data matrix may contain, for each shopper (each row of the matrix), the amount
they spent over a year on soaps from different brands (each column is a CPG brand).
A relatively small data set in this business may have 5 or 10 million entries;
however, most of the entries will be missing (usually marked as “zero”). This
stands to reason: I may buy only 1 or 2 of the hundreds of CPG soap
brands, so the rest of my row will be zeros. In one example, 93% of the
entries were zeros!
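As a concrete (and entirely made-up) illustration of what this data matrix looks like, a few transaction-log rows might be collated like this; the column names and figures are hypothetical:

```python
# Hypothetical sketch: collate transaction-log records into the shopper-by-brand
# spend matrix described above (all names and amounts are made up).
import pandas as pd

transactions = pd.DataFrame({
    "shopper_id": ["s1", "s1", "s2", "s3"],
    "brand":      ["BrandA", "BrandB", "BrandA", "BrandC"],
    "spend":      [12.50, 4.99, 8.00, 3.25],
})

# Rows = shoppers, columns = CPG soap brands, entries = spend over the year.
# Brands a shopper never bought come out as missing and get marked "zero",
# which is why the matrix ends up overwhelmingly sparse.
spend_matrix = transactions.pivot_table(index="shopper_id", columns="brand",
                                        values="spend", aggfunc="sum",
                                        fill_value=0.0)
print(spend_matrix)
```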
To relate this situation to the sum-of-sinusoids
sampling, remember that the signal was “rank-limited” in that it was made up of
only 2 sinusoids. Low rank is one of the primary reasons that CS works. In the case
of the soap data matrix, we again have a low-rank situation, since the underlying
preference for soap depends only on a few parameters such as smell, anti-bacterial
property, color, cost, etc. In other words, if we do a singular value decomposition of the
true data matrix, only a few singular values will be
significant. Hence, using the same style of convex optimization (here minimizing the nuclear norm,
i.e., the l1-norm of the singular values), we are able to reconstruct the other 93% of the entries!
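Here is a toy sketch of that reconstruction, using singular-value-thresholding iterations in the spirit of the matrix-completion literature cited above. This is not the exact algorithm from Candes & Plan; the matrix sizes, rank, threshold, and iteration count are all illustrative.

```python
# Toy low-rank matrix completion: fill in ~93% missing entries of a synthetic
# shopper-by-brand spend matrix by iteratively soft-thresholding its singular
# values (Soft-Impute / SVT-style). Sizes, rank, and tuning are illustrative.
import numpy as np

rng = np.random.default_rng(1)
n_shoppers, n_brands, rank = 1000, 100, 2

# "True" spend matrix driven by a couple of latent preference factors
# (smell, anti-bacterial property, ...), hence low rank.
M = rng.random((n_shoppers, rank)) @ rng.random((rank, n_brands))

# Observe only ~7% of the entries, as in the soap example above.
mask = rng.random(M.shape) < 0.07

X = np.where(mask, M, 0.0)      # start from the observed (mostly zero) matrix
tau = 1.0                       # singular-value threshold (illustrative)
for _ in range(300):
    U, s, Vt = np.linalg.svd(X, full_matrices=False)
    X = (U * np.maximum(s - tau, 0.0)) @ Vt   # shrink small singular values
    X[mask] = M[mask]                         # keep the observed entries fixed

err = np.linalg.norm((X - M)[~mask]) / np.linalg.norm(M[~mask])
print("relative error on the 93% missing entries:", err)
```

In practice you would use a production solver rather than this loop, but the ingredients are the same: a low-rank assumption plus a convex surrogate for rank.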
So now we have a tractable (“magical”) way to
identify the entries of the “soap” data matrix. What do these entries tell us?
The reconstructed
entry is the Data Scientist’s best prediction of what a shopper would spend on
a certain CPG soap brand. This is hugely valuable: it is an excellent
proxy for a shopper’s propensity to purchase a particular brand. A CPG can
group the shoppers based on their purchase propensities, identify a group for
whom it is low, and target them with advertisements or discount offers to “switch”
them. This information can be used by media buyers to optimize their ad dollar ‘spend’.
Knowing which shoppers visit a specific store location, Retailers can assay
their ‘spends’ and stock their shelves appropriately. The applications are virtually limitless.
small BIG Data:
If we can reconstruct 93% of the data from a 7% random
sample as in the example above, who needs BIG data?
Clearly, we do not need BIG data *collection*. But we
still need to regenerate BIG data (all the data matrix entries). Perhaps we
do not need to store it, though; it can be regenerated at will (convergence of the
algorithms seems fast). So we do not
need BIG data collection or storage, but we do need to process BIG data.
This seems to point to in-memory BIG data solutions for this type of data analytics
application . . . the next few months will tell.
Comments:
Interesting.

I'm not sure about your conclusion that BIG data is not required. Big Data is not about filling out the sparse matrix of an individual shopper's preferences. Rather, it is the collection of ALL shoppers' data that, when analyzed in aggregate, can yield overall patterns, e.g., why do shoppers in Seattle like soaps that are white in color more than any other region?