Friday, April 19, 2013

Parlor Games, Compressive Sensing, Data Analytics . . . & small BIG Data?


Parlor Game . . .
Surely, you have been part of a “guessing game” at a dinner party. Someone says, “Think of 10 numbers, tell me their sum, and then tell me 2 of the 10 numbers; I will tell you the other 8 numbers you thought of.” The general reaction will be that this person is crazy! Too many unknowns, not enough constraints, too many solutions, and so on.

Compressive Sensing:
There is a tsunami of activity in mathematical circles these days around problems very much like this parlor guessing game, with solutions that sound like magic (in fact, “l1-Magic” is the name of a popular Matlab package that provides one such solution).

Donoho and Candes are the prime movers of Compressive Sensing (CS) and related theories, and their pedigrees are impressive (Donoho is a MacArthur "genius award" winner). For engineers, the simplest context in which to understand CS is data acquisition. If a time series is a sum of 2 sinusoids, classical theory says we have to sample it at twice the frequency of the highest-frequency sinusoid (the Nyquist rate). Depending on the relative frequencies of the sinusoids, we may have to acquire hundreds of samples to reconstruct the original time series faithfully. CS theory asserts that you only need about 8 random samples, not hundreds, to faithfully reconstruct the data! On the face of it, this assertion looks absurd: it flies in the face of our long-standing understanding of the Nyquist criterion.

However, a glimmer of insight arises when we note that there are only 2 sinusoids in the data! If we can somehow gather the information about these 2 sinusoids from 8 random samples, we may be able to reconstruct the data fully. In such a “rank-limited” situation (along with a few easy-to-meet restrictions), CS theory provides a way to completely recreate the original data.
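To make this concrete, here is a small sketch of my own (not taken from the CS papers) that recovers a two-sinusoid signal from a few dozen random time samples. It assumes Python with numpy, scipy, and the cvxpy package standing in for the Matlab routines mentioned above; the signal length, frequencies, and sample count are purely illustrative, and it leans on the l1-minimization machinery described in the next paragraphs.

    import numpy as np
    import scipy.fft
    import cvxpy as cp

    np.random.seed(0)
    N = 256                              # length of the full time series

    # "True" signal: exactly 2 nonzero cosine components in the DCT basis,
    # playing the role of the sum of 2 sinusoids
    c_true = np.zeros(N)
    c_true[14], c_true[62] = 1.0, 0.5
    Psi = scipy.fft.idct(np.eye(N), axis=0, norm='ortho')   # inverse-DCT basis: x = Psi @ c
    x_true = Psi @ c_true

    # Keep only a small number of samples taken at random time locations
    m = 32                               # far fewer than the N Nyquist-rate samples
    idx = np.sort(np.random.choice(N, m, replace=False))
    y = x_true[idx]

    # Basis pursuit: find the coefficient vector with the smallest l1 norm
    # that agrees with the observed samples
    c = cp.Variable(N)
    prob = cp.Problem(cp.Minimize(cp.norm(c, 1)), [Psi[idx, :] @ c == y])
    prob.solve()

    x_rec = Psi @ c.value                # reconstructed time series
    print("max reconstruction error:", np.max(np.abs(x_rec - x_true)))

Run it and the reconstruction error comes out at numerical-precision level; shrink the number of samples far enough and the recovery eventually breaks down, just as the theory says it must.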

CS theory is accessible, and three easy-to-read introductions are (1) Candes & Wakin, “An Introduction to Compressive Sampling”, IEEE Signal Processing Magazine, 2008, (2) Baraniuk, “Compressive Sensing [Lecture Notes]”, IEEE Signal Processing Magazine, 2007, and (3) Candes & Plan, “Matrix Completion with Noise”, Proceedings of the IEEE, 2010.

As you will find from these and other excellent articles on the topic, the approach is to solve a constrained minimization problem using a certain norm. The l2-norm we are used to is largely unsuitable for this task because it almost never leads to the desired sparse solution (perhaps the reason why no one saw this solution until the 2000s). The so-called l0-norm would work but is not computationally tractable. However, the l1-norm (the “taxi-cab” metric) does the trick, and the resulting problem can be posed as a linear program. Many linear programming solutions exist; MATLAB implementations are available from multiple websites.
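To show what “linear programming solution” means here, a minimal sketch of the standard reduction: minimizing ||x||_1 subject to Ax = b becomes an ordinary LP once x is split into its positive and negative parts. The example below uses scipy's general-purpose linprog solver, and the problem sizes and random data are made up for illustration; dedicated packages such as l1-Magic exploit the structure of this problem and scale far better than a generic LP solver.

    import numpy as np
    from scipy.optimize import linprog

    def basis_pursuit_lp(A, b):
        """min ||x||_1 subject to A x = b, posed as a linear program.

        Split x = u - v with u, v >= 0; at the optimum ||x||_1 = sum(u + v),
        and the equality constraint becomes A u - A v = b.
        """
        m, n = A.shape
        cost = np.ones(2 * n)                    # minimize sum(u) + sum(v)
        A_eq = np.hstack([A, -A])                # A u - A v = b
        res = linprog(cost, A_eq=A_eq, b_eq=b, bounds=(0, None), method="highs")
        u, v = res.x[:n], res.x[n:]
        return u - v

    # Tiny demo (sizes are illustrative): a 2-sparse vector in 10 dimensions,
    # observed through 6 random linear measurements
    rng = np.random.default_rng(1)
    x_true = np.zeros(10)
    x_true[[2, 7]] = [3.0, -1.5]
    A = rng.standard_normal((6, 10))
    print(np.round(basis_pursuit_lp(A, A @ x_true), 3))   # typically recovers x_true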

Data Analytics:
The references I mentioned above provide motivation from the data analytics domain (mostly from the Netflix recommendation challenge). Let me give an example from Retail analytics.

In Retail/CPG analytics, a large amount of shopper data is collected via loyalty cards, transaction logs, credit card information, etc. All of these data are collated into a large data matrix. For example, the data matrix may contain, for each shopper (each row of the matrix), the amount they spent over a year on soaps from different brands (each column is a CPG brand). A relatively small data set in this business may have 5 or 10 million entries; however, most of the entries will be missing (usually marked as “zero”). This stands to reason: I may buy only 1 or 2 of the hundreds of CPG soap brands, so the rest of my row will be zeros. In one example, 93% of the entries were zeros!

To relate this situation to the sum-of-sinusoids sampling, remember that the signal was “rank-limited” in that it was made up of only 2 sinusoids. Low rank is one of the primary reasons that CS works. In the case of the soap data matrix, we again have a low-rank situation, since the underlying preference for soap depends on only a few parameters such as smell, anti-bacterial property, color, cost, etc. In other words, if we do a singular value decomposition of the true data matrix, only a few singular values will be significant. Hence, by solving the matrix analogue of the l1 problem, minimizing the nuclear norm (the l1-norm of the singular values) subject to matching the known entries, we are able to reconstruct the other 93% of the entries!
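Here is a minimal sketch of that completion step, again assuming cvxpy and a synthetic low-rank “shopper x brand” matrix in place of real transaction data. For a toy matrix this small, a larger fraction of observed entries is needed than the 7% quoted above (you need a few observations per degree of freedom), so the numbers below are illustrative only.

    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)

    # Synthetic "shopper x brand" spend matrix of rank 2, standing in for real data
    n_shoppers, n_brands, rank = 100, 60, 2
    M_true = rng.random((n_shoppers, rank)) @ rng.random((rank, n_brands))

    # Observe only a fraction of the entries at random (1 = observed, 0 = missing)
    W = (rng.random(M_true.shape) < 0.30).astype(float)

    # Matrix completion a la Candes & Plan: minimize the nuclear norm (the l1
    # norm of the singular values) subject to agreeing with the observed entries
    X = cp.Variable(M_true.shape)
    constraints = [cp.multiply(W, X) == W * M_true]
    cp.Problem(cp.Minimize(cp.norm(X, "nuc")), constraints).solve()

    rel_err = np.linalg.norm(X.value - M_true) / np.linalg.norm(M_true)
    print("relative reconstruction error:", round(rel_err, 3))

The nuclear norm plays the same role for rank that the l1-norm plays for sparsity: it is the convex surrogate that turns an otherwise intractable rank-minimization problem into one a standard solver can handle.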

So now, we have a tractable (“magical”) solution to identify the entries of the “soap” data matrix. What do these entries tell us?

The reconstructed entry is the Data Scientist’s best prediction of what a shopper would spend on a certain CPG soap brand. This is hugely valuable: it is an excellent proxy for a shopper’s propensity to purchase a particular brand. A CPG can group shoppers based on their purchase propensities, identify a group for whom propensity is low, and target them with advertisements or discount offers to “switch” them. This information can be used by media buyers to optimize their ad dollar ‘spend’. Knowing which shoppers visit a specific store location, Retailers can assess their likely ‘spends’ and stock their shelves appropriately. The applications are virtually limitless.

small BIG Data:
If we can reconstruct 93% of the data from a 7% random sample as in the example above, who needs BIG data?

Clearly, we do not need BIG data *collection*. But we still need to regenerate BIG data (all the data matrix entries). Perhaps we do not need to store it either; it can be regenerated at will (convergence of the algorithms seems fast). So we do not need BIG data collection or storage, but we do need to process BIG data. That seems to point to in-memory BIG data solutions for this type of data analytics application . . . the next few months will tell.