Sunday, November 24, 2013

X-Event Marketing


Dr. PG Madhavan developed his expertise in Data Analytics as an EECS Professor, Computational Neuroscience researcher, Bell Labs MTS, Microsoft Architect and startup CEO. Overall, he has extensive experience of 20+ years in leadership roles at major corporations such as Microsoft, Lucent, AT&T and Rockwell, as well as startups: Zaplah Corp (Founder and CEO), Global Logic, Solavei and SymphonyEYC. He is continually engaged hands-on in the development of advanced Analytics algorithms and all aspects of innovation (12 issued US patents with deep interest in adaptive systems and social networks). More at www.linkedin.com/in/pgmad


This blog is a coming together of a bunch of my past blogs and my reading of a recent book by John Casti called “X-Events”.

PG Blog: “What does ‘Emergent Properties in Network Dynamics’ have to do with Shopping?”; http://pgmadblog.blogspot.com/2012/10/what-does-emergent-properties-in.html?m=1
PG Blog: “Network Dynamics & Coupling: Shannon’s Reverie Reprised . . . “ http://pgmadblog.blogspot.com/2012/11/network-dynamics-coupling-shannons.html?m=1
Book by John Casti, "X-Events: The Collapse of Everything".

Reading the X-Events book triggered ideas from some of my past academic work (http://www.jininnovation.com/SoF.PDF) which, combined with my present startup, “Syzen Analytics” (http://www.SyzenAnalytics.com/), lead me to suggest an interesting way to create “pocket” X-Events in Retail Commerce via X-Event Marketing or “XM”!

Let us start at the beginning . . .
1.     Shannon understood the importance of rare events in conveying information. He sought a mathematical formulation to capture this important property of our brains and came up with the definition of “entropy”. As I made clear in the original blog, this is a fictitious reverie I made up to give one the guts to turn other valuable physical/practical insights into mathematical forms!

2.     Similarly, from my past research in Neuroscience, I had some strong impressions that have refused to go away despite the passage of decades. One is the activity that precedes the most significant X-Event in each of our lives, Death! Unreplicated and unpublished observations in our neurophysiology labs had shown that one of the surprising events that happen before a mouse with brain in-dwelling electrodes dies is the all-out firing of “complex spike cells” (which, when the mouse is alive and performing “place cell” tasks, are painfully difficult to find and record). In other words, the “natural state” of the brain seems to be excitatory, and the work of a healthy brain seems to be to assert control and coordination by keeping the *inhibitory tone* high so that all hell does not break loose (by cells firing away with no coordination or control). I speculate that a regime transition of the sort shown below occurs at death: the right-most picture is indicative of brain cells firing away in an uncontrolled fashion (just before a major “X-Event”, death or epileptic seizure) with the left-most indicative of a properly functioning brain (“nicely” coupled). The next story will illuminate the idea of the middle picture, where the system is “uncoupled” and poised for all its energy to pile into a single “mode”, creating a major suppression of inhibition in the brain.

3.     Following “Shannon’s approach”, how do I make these intuitions mathematical? Let us look at the right-most picture. As I outlined in my “Shannon’s Reverie Reprised” blog, everything is firing away and the potential field will show up equally across the scalp, much like a single “wavefront” that reaches all the regions of the scalp simultaneously. Imagine you are at a beach looking out to the sea and gentle waves are rolling in – let us say in parallel to the beach. If you are standing knee-deep in the water and look in a direction parallel to the beachfront (i.e., up or down the beach), the spatial frequency in your “look direction” (or the frequency of “corrugation”) is 0 cycles/meter! Much like the ocean waves, the single wavefront of activity is nearly “constant” across the distributed neocortex and its relevant strength can be thought of as the power at zero frequency. I happen to know that power at zero frequency is called “Scale of Fluctuation” or “θ” in random field theory. From the previous work of Eric Vanmarcke (“Random Fields”, 1983, with an updated edition in 2010), θ is defined below. I refer you to my blog and academic paper mentioned at the outset for the gory details!
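For readers who do not want to dig through the blog and the paper for the definition, here is the gist as best I can restate it (compare with Shannon’s entropy, $H = -\sum_i p_i \log_2 p_i$). For a stationary random process/field with correlation function $\rho(\tau)$, variance $\sigma^2$ and two-sided spectral density $S(\omega)$ (normalized so that $\sigma^2 = \int S(\omega)\,d\omega$), Vanmarcke’s Scale of Fluctuation is

$$\theta \;=\; \int_{-\infty}^{\infty} \rho(\tau)\,d\tau \;=\; \frac{2\pi\, S(0)}{\sigma^{2}},$$

i.e., Theta is (up to the variance normalization) exactly the power at zero frequency.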
 
Is Theta as powerful and useful as Shannon’s entropy? Time will tell. As you can see in my academic publication, Theta does have some intriguing properties in the case of 2nd-order linear time-invariant systems which may indeed prove useful in the future! Adding in results from my simulation and data analysis of machine tool chatter, my summary observation is that:

A large value of Theta can be good or bad. It indicates “coupling”: *good* coupling, as in a nicely functioning brain (or lathe), or *bad* coupling, as in death (or chatter)! The key here is that a low value of Theta can be a “predictor” of an impending X-Event!

As you can imagine, if our speculation above holds up, we may be able to predict X-Events (or Taleb’s “black swans”) such as the 2008 Great Recession or the Tohoku earthquake and the ensuing tsunami. In the case of Tohoku, a few extra hours of warning might have helped Fukushima engineers reach a consensus and move the backup power generators to a higher location (thus avoiding the nuclear disaster).

XM:
In this blog on X-Event Marketing (“XM”), my purpose is different. I want to *create* desirable, “pocket” X-Events to beneficial ends!

One of the preconditions of “interesting” dynamics in systems is sufficient “complexity”. This can be taken to mean multiple actors with many degrees of freedom and dense interconnections with linear and non-linear coupling – in general, hard to analyze! One such natural system is the “shopper network”.
A modern-day shopper is enmeshed in an ever-varying network of preferences and influences (from friends and family) and embedded in a network distributed in time and space – just the type of complex system where “interesting” dynamics can arise. We want to create desirable “pocket” X-Events in the shopper network with the purpose of increasing business value for the shopper or the seller or both.


This is the so-called “Purchase Funnel” that an individual shopper goes through when executing a product transaction. The marketing, merchandizing and offer activities are devised to impact the shopper at predetermined points of importance in the shopper’s decision-making process. For example, a well-conceived offer delivered via the shopper’s mobile phone at the point of “Desire” may precipitate a Purchase “Action” – a win for the seller (in this case both the retailer and the manufacturer of that product; for example, Walmart and Procter & Gamble, respectively).

This was one shopper. Now, you have to consider the shopper being embedded in a network. At this point, it may be useful for you to review my earlier blog, “Social Network Theory” http://pgmadblog.blogspot.com/2013/08/social-network-theory.html, where Barabasi’s work in Network Graphs and the following terms are discussed:
1. Clusters (of friends who “hang out”).
2. Weak links (to your long-forgotten high school classmates).
3. Hubs (politicians and others with massive number of contacts).

Clusters and hubs are some examples of the influences on an individual shopper. There are many such shoppers, all interconnected, to consider when studying the dynamics with a view to creating a “pocket” X-Event. Even this does not complete the complexity picture – the shoppers are distributed in space (you are in NYC and your buddy from Seattle tweets about a cool product that he bought) and in time (good and bad product reviews reach you at various times and not in a synchronized fashion).



That the Shopper Network is sufficiently complex to engender “interesting” dynamics must be beyond a doubt by now! The picture above may be a reasonable representation of the Shopper Network (WITHOUT the time element captured – we will need a video showing the undulations in the “heat map”). The “heat”, shown as yellow spots, indicates the total dollars spent on a particular product on a particular day, aggregated to the county level for all of the USA. Clearly, this will vary from day to day.

How do we capture the dynamics of the Shopper Network shown above in a tractable manner, so that we can understand those dynamics, track the effect of any manipulations we make in the network, and close the loop to adaptively adjust our manipulations to achieve a desired effect?

This is the essential question of XM. Of course, there are other important questions such as the type and timing of manipulations (Offers to selected customers? A country-wide branding effort? Is it a new product or an existing one? And so on). As a systems engineer, I will find a partner who is more qualified than I am in answering the marketing questions; I will focus on Systems Analytics tools to create a quantitative execution infrastructure to deploy marketeers’ ingenuity!

Engineers love scalars! I suspect it is because having a single control variable is easier to manage than many. Probability theory is replete with reductions to scalars – I am talking about mean, variance, correlation coefficient, Entropy, Theta, . . . The full underlying information is available in the joint probability density functions but they are a bear to handle! So, we reduce them to a scalar – clearly, in the process, we have thrown away MUCH information, but what is left is what we plan to use; so our justification for selecting and using a scalar for a particular engineering task is very *operational*.

We will use the scalar Theta to answer the Shopper Network analysis and control question I asked a few paragraphs earlier.

Scenario: New Product Introduction
Procter & Gamble is going to introduce a new detergent and wants to create a ground-swell of purchase interest (our “pocket” X-Event).

·         P&G introduces the product quietly and collects the heat-map data from T-log data of their partner Retailers.
·         Sysan (our startup) processes the heat-map data: total dollars spent each day on the new detergent, aggregated to a county for all of the USA.
·         Sysan calculates Theta (a single number) as the baseline measure.
·         P&G Marketing has developed 2 major tools for the launch of the new detergent: (1) a nation-wide TV ad campaign and (2) a 10% cash discount for the first 1 million customers in the US.
·         P&G runs a 1-week TV ad campaign while collecting T-log data.
·         Sysan monitors Theta for the week.
·         P&G suspends the TV ad campaign and starts the cash discount offer.
·         Sysan monitors Theta and finds that Theta is entering a “Decoupled” regime with low values of Theta.
·         Sysan advises P&G to resume the TV ad campaign in an effort to trigger “bad” coupling among the various counties of the US.
·        This pushes Theta to a high value, the coupling between counties gets tight and a “pocket” X-Event occurs where there is a ground-swell of purchases of the new detergent country-wide!


NOTE that in this fictitious example, P&G would have pretty much run the various marketing campaigns that I speculate about here on their own. The KEY point is that P&G does NOT have a scalar (or any) measure to get immediate feedback regarding their efforts, combined for the whole country. The ability to quantify “closed-loop” intervention is the value of our humble scalar, Theta, in X-Event Marketing!
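To make the closed-loop idea concrete, here is a minimal sketch (in Python, under my own simplifying assumptions and certainly not Sysan’s actual algorithm) of how a Theta-like scalar could be computed from each day’s county-level spend “field” and watched for a slide into the “decoupled” regime. The spatial ordering of counties, the estimator (sum of the sample autocorrelation, a stand-in for the normalized power at zero spatial frequency) and the alert threshold are all illustrative choices.

```python
import numpy as np

def theta_estimate(field, max_lag=None):
    """Crude scale-of-fluctuation estimate for a 1-D sample of a spatial field:
    the sum of the sample autocorrelation over lags (unit lag spacing assumed)."""
    x = np.asarray(field, dtype=float)
    x = x - x.mean()
    n = len(x)
    if max_lag is None:
        max_lag = n // 4                                   # keep only well-estimated lags
    acov = np.correlate(x, x, mode="full")[n - 1:] / n     # biased autocovariance, lags 0..n-1
    rho = acov / acov[0]                                   # autocorrelation
    return rho[0] + 2.0 * rho[1:max_lag].sum()             # ~ integral of rho over all lags

def monitor(daily_spend_by_county, baseline_theta, low_frac=0.5):
    """daily_spend_by_county: dict mapping day -> 1-D array of spend per county
    (counties kept in a fixed spatial order). Flags days whose Theta falls below a
    fraction of the baseline, i.e., the "decoupled" regime."""
    alerts = []
    for day, field in sorted(daily_spend_by_county.items()):
        theta = theta_estimate(field)
        if theta < low_frac * baseline_theta:
            alerts.append((day, theta))
    return alerts
```

The point is only that the whole country’s T-log feed collapses into one number per day that the marketing team can watch, and react to, in near real time.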

Clearly, which manipulations are best applied to the network, and when, is an open question. Utilizing Theta, Sysan is developing methods using “massive simulation methods” (for background, please see: John Casti, “Would-Be Worlds: How Simulation is Changing the Frontiers of Science”) that can predict the effect of various interventions at various instants of time and points in space, thus providing guidance for the most effective use of the vast sums of money required for ad campaigns and discount offers.

Stay tuned for our initial results . . .

PG

Wednesday, August 28, 2013

Predictive Analytics Automation


There is much discussion about whether Predictive Analytics (PA) can be automated or not. This is a false dichotomy.

Predictive Analytics is a strange beast - it needs to be ‘learned by learning‘ and ‘learned by doing‘ – BOTH! That is due to the interconnected nature of the field. To be a successful hyper-specialist in “left nostril” diseases, one needs to have done Anatomy, Physiology and Biochemistry in med school. Similarly, for PA, learning-by-learning (which takes at least 6 years of grad school) is not a step you can skip and go directly to learning-by-doing and hope to become a true curer of business diseases!

In PA, learning-by-doing can be an even steeper curve. As I have noted before in my blogs, PA skills will have to be rounded out with mathematical inventiveness and ingenuity applied repeatedly in a specific business vertical. These are the hallmarks of an uber Data Scientist. Clearly, an uber data scientist as described above cannot be bottled and passed around. Don’t even think of “automating” all the things that an uber data scientist does. So what do we do about “scaling”? Are there support pieces we can automate to scale the solution?

A comparison to a programming environment such as MATLAB is appropriate. MATLAB supplies you with all kinds of toolboxes. Similarly, in PA, many basic operations can be automated – clustering, learning, classification, etc. But, like MATLAB, you also need an environment where these toolboxes can be fine-tuned with inventiveness appropriate to the business vertical, mixed and matched, and augmented with additional one-off solutions to address the overall business problem at hand. Otherwise, the solution will fall short (or flat!).

So, part of PA can be automated. PA toolboxes can be fine-tuned by data scientist associates and the overall solution can be conceived and put together with these toolboxes (with added “glue”) by the uber data scientist.

Note that everything I talked about here refers to PA solution development. Once the overall solution is developed, “production runs” by customer personnel and visualizations by executives of the PA solution developed above can be mostly automated (with a data scientist looking over their shoulders – data can change on you on a dime; someone has to watch for the sanctity of the data and for non-stationarity problems!). Production is where the solution needs to scale, and it can.

In summary, PA solution development will require manual work by uber data scientists supported by data science associates; automated toolboxes for basic PA functions will help speed up the process, and once the overall solution is manually cobbled together, production runs can be automated, along with some amount of ongoing data science audit of the process and results.



Dr. PG Madhavan developed his expertise in analytics as an EECS Professor, Computational Neuroscience researcher, Bell Labs MTS, Microsoft Architect and startup CEO. Overall, he has extensive experience of 20+ years in leadership roles at major corporations such as Microsoft, Lucent, AT&T and Rockwell as well as four startups including Zaplah Corp as Founder and CEO. He is continually engaged hands-on in the development of advanced Analytics algorithms and all aspects of innovation (12 issued US patents with deep interest in adaptive systems and social networks).

Tuesday, August 20, 2013

Predictive Analytics – Where from and where to?

Dr. PG Madhavan developed his expertise in analytics as an EECS Professor, Computational Neuroscience researcher, Bell Labs MTS, Microsoft Architect and startup CEO. Overall, he has extensive experience of 20+ years in leadership roles at major corporations such as Microsoft, Lucent, AT&T and Rockwell and startups, Zaplah Corp (Founder and CEO), Global Logic, Solavei and Aldata. He is continually engaged hands-on in the development of advanced Analytics algorithms and all aspects of innovation (12 issued US patents with deep interest in adaptive systems and social networks).


Big Data is big business, but the “open secret” in this business is that what the paying client really wants is Predictive Analytics (“what should my business do next to improve?!”). To be explicit, Big Data is all about the technology for storage and manipulation of lots of data (terabytes to petabytes) as well as new types of data such as graph and unstructured data. This is an extremely important precursor to doing anything useful with data – 10 or 20 years ago, we had to settle for small amounts of representative data or simulated data. The outcomes were clever and new data analysis *techniques* rather than useful answers!

The next step is to make sense of the large amounts of data at your disposal (now that you have Hadoop, NoSQL and Graph databases). This is where visualization and Descriptive Analytics come in. They provide historic “snap-shots” with graphs and charts – another important precursor to coming up with answers to the question, “What should my business do next to improve?”

In 2013, Big Data is maturing with still many open technology challenges; Descriptive Analytics is in its pre-teen years. Predictive Analytics is in its infancy, nurtured by its stronger siblings, Big Data and Descriptive Analytics!

Predictive Analytics (PA):
In a business context, Predictive Analytics (PA) attempts to answer the question, “What should my business do next to improve?”

Prediction is an old preoccupation of mankind! What comes next – the next meal, the next rain, the next mate? Techniques have changed little from the Stone Age until recently: see what has happened before, notice the cyclical nature and any new information, and combine them to forecast what may come next. Take tomorrow’s temperature, for example:
1.       Predict it as today’s temperature.
2.       Predict it as the average of Tuesday’s (today) temperature for the past month.
3.       Predict it as the average of Tuesday’s (today) temperature of the month of August for the past 50 years.
4.       Look at the pattern of today’s, yesterday’s and the day before yesterday’s temperatures; go back in the weather record and find “triples” that match (by your criterion of a match). Collect the next day’s temperature after each matching pattern in the record and average them as your prediction of tomorrow’s temperature.
5.       And so on . . .
As you can imagine (and can easily demonstrate to yourself), predictions 1 through 4 get better and better. If your business is shipping fresh flowers, you may be able to use just this simple Predictive Analytics method to lower your cost of shipping! Simple PA, but it answers your “What next?” question. By the way, this PA technique is not as naïve as it looks; complexity-theory “quants” used to use its extended forms in financial engineering on Wall Street.
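As an aside, method 4 is easy to try yourself. Here is a minimal sketch (my own toy code; the Euclidean match criterion and the number of analogues averaged are arbitrary choices, not a prescription):

```python
import numpy as np

def analogue_forecast(history, pattern_len=3, k=5):
    """Predict the next value of `history`: find the k past windows of length `pattern_len`
    most similar to the most recent window and average the values that followed them."""
    history = np.asarray(history, dtype=float)
    recent = history[-pattern_len:]                    # today's, yesterday's, ... temperatures
    candidates = []
    for start in range(len(history) - pattern_len - 1):
        window = history[start:start + pattern_len]
        dist = np.linalg.norm(window - recent)         # match criterion: Euclidean distance
        next_value = history[start + pattern_len]      # what actually followed this pattern
        candidates.append((dist, next_value))
    candidates.sort(key=lambda c: c[0])
    return float(np.mean([nv for _, nv in candidates[:k]]))

# e.g.: temps = np.loadtxt("daily_temps.txt"); tomorrow = analogue_forecast(temps)
```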

So one feature of PA is clear: there is a time element – historical data and a future state. Historical data can be limited in time as in the first method of temperature prediction or as extensive as in the fourth method. Descriptive Analytics can be of use here for sure – looking at the trends in the temperature data plot, one can heuristically see where it is headed and act accordingly (to ship flowers or not tomorrow). However, PA incorporates time as an essential element in quantitative ways.

Quantitative Methods in PA:
My intent here is not to provide a catalog of statistical and probabilistic methods and when and where to use them. Hire a stats or math or physics or engineering Ph.D. and they will know them backwards and forwards. Applying them to Big Data in business, however, requires much more inventiveness and ingenuity – let me explain.

PA has an essential time element to it. That makes prediction possible but life difficult! There is a notion called “non-stationarity”. While reading Taleb’s or Silver’s books, I have been puzzled to find that this word is hardly mentioned (not once in Taleb’s books, if I am not mistaken). One reason may be that those books would have ended up being 10 pages long instead of the actual many hundreds of pages!

Non-stationarity is a rigorous concept, but for our purposes think about it as “changing behavior”. I do not look, act and think the same as I did 30 years ago – there is an underlying similarity for sure, but equally surely, the specifics have changed. Global warming may be a steady linear change, but it will systematically affect tree-ring width data over centuries. Some other changes are cyclical – at various times, they are statistically the same but at other times, they are different. Systems more non-stationary than this will be all over the place! Thus, historical and future behavior and the resultant data have variability that constrains our ability to predict. Not all is lost – weather can be predicted pretty accurately for up to a week now but not for the next many months (this may be an unsolvable issue per complexity theory). Every system, including business “systems”, has its window of predictability; finding the predictability window and finding historical data that can help us predict within that window is an art.

I do not want this blog to be a litany of problems, but there are two more that need our attention. Heteroscedasticity is the next fly in the ointment! This is also a formal, rigorous statistical concept, but we will talk about it as “variability upon variability”. Business data that we want to study are definitely variable, and if they vary in a “well-behaved” way, we can handle them well. But if the variability varies, which is often the case in naturally occurring data, we have constraints on what we can hope to accomplish with that data. Similar to non-stationary data, we have to chunk them, transform variables, etc. to make them behave.

The third issue is that of “noise”. Noise is not just the hiss you hear when you listen to an AM radio station. The best definition of “noise” is data that you do not want. Desirable data is “signal” and anything undesirable is “noise”. In engineered systems such as a communication channel, these are clearly identifiable entities – “signal” is what you sent out at one end and anything else in addition to the signal that you pick up at the receiver is “noise”. In a business case, “unwanted” data or “noise” are not so clearly identifiable and separable. Think of “Relative Spend” data among various brands of beer in a chain of stores, or sentiment analysis results for those brands. Depending on the objective of the analysis, the signal we are looking for may be the “purchase propensity” of each shopper (so that we can individually target them with discount offers). Relative Spends and Likes are not pure measures of “purchase propensity” – I may have bought Brand X beer because my friend asked me to pick some up for him, which has nothing to do with my purchase propensity for that brand! Purchases like that will pollute my Relative Spend data. How do you separate this noise from the data? There may also be pure noise – incorrect entry of data, data corruption and such errors that affect all data uniformly.

Fortunately, there is a framework from engineering that provides a comprehensive approach to containing these issues while providing powerful analytics solutions.

Model-based Analytics (MBA):
Model-based and model-free methods have a long history in many areas of Engineering, Science and Mathematics. I take an Engineering approach below.

Data that you have can be taken “as is” or can be considered to be generated by an underlying model. “Model” is a very utilitarian concept; it is like a “map” of the world. You can have a street map or a satellite map – you use them for different purposes. If you need to see the terrain, you look for a topographic map. A map is never fully accurate in itself – it serves a purpose. As the old saw goes, if a world map has to be accurate, the map will have to be as large as the world – what is the point of a map then?

Let us call “model-free” methods today’s “Data Analytics” (DA) to distinguish them from “Model-based Analytics” (MBA). DA analyzes measured data to aid business decisions and predictions. MBA attempts to model the system that generated the measured data.

Model-based methods form the basis of innumerable quantitative techniques in Engineering. Theory and practice have shown that data analysis approaches (similar to periodogram spectrum estimation) are robust but not powerful, while model-based methods (similar to AR-spectrum estimation) are powerful but not robust (an incorrect model order, for example, can lead to misleading results – it needs an expert practitioner).
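To make the robust-versus-powerful trade-off concrete, here is a small self-contained comparison (my own toy example, not from any reference above): a periodogram versus a Yule-Walker AR spectrum for two closely spaced sinusoids in noise. A well-chosen AR order resolves the two peaks sharply; a badly chosen order can mislead, which is exactly the “needs expert practitioner” caveat.

```python
import numpy as np

def periodogram(x):
    """Model-free estimate: squared magnitude of the DFT."""
    X = np.fft.rfft(x - x.mean())
    return np.abs(X) ** 2 / len(x)

def ar_spectrum(x, order, n_freq=512):
    """Model-based estimate: fit an AR(order) model via the Yule-Walker equations."""
    x = x - x.mean()
    n = len(x)
    acov = np.correlate(x, x, mode="full")[n - 1:] / n           # biased autocovariance
    R = np.array([[acov[abs(i - j)] for j in range(order)] for i in range(order)])
    a = np.linalg.solve(R, acov[1:order + 1])                    # AR coefficients
    sigma2 = acov[0] - a @ acov[1:order + 1]                     # driving-noise variance
    w = np.linspace(0.0, np.pi, n_freq)
    denom = np.abs(1.0 - np.exp(-1j * np.outer(w, np.arange(1, order + 1))) @ a) ** 2
    return w, sigma2 / denom                                     # AR power spectral density

# Toy data: two sinusoids at nearby frequencies plus noise.
rng = np.random.default_rng(0)
t = np.arange(256)
x = np.sin(0.20 * np.pi * t) + np.sin(0.23 * np.pi * t) + 0.5 * rng.standard_normal(256)
pgram = periodogram(x)                # broad, leaky peaks - robust but not sharp
w, ar_psd = ar_spectrum(x, order=12)  # two sharp peaks - powerful, but order-sensitive
```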

MBA goes beyond data slicing/dicing and heuristics. In our previous “brands of beer in a store chain” example, the model-based approach hypothesizes that there is a system, either explicit or implicit, behind the scenes generating customer purchase behaviors and purchase propensities. From measured data, MBA identifies the key attributes of the underlying hidden system (to understand the commerce business quantitatively) and provides ways to regulate system outputs (to produce desirable business outcomes).

MBA does not solve but alleviates the three pain points in Predictive Analytics quantitative methods: (1) Non-stationarity, (2) Heteroscedasticity and (3) Noise.

MBA – Personal Commerce Example:
I do not have an MBA (the university kind), nor have I taken a Marketing course. So here is a layman’s take on Personal Commerce. My main objective is to show a practical example of model-based predictive analytics within the MBA framework.

Personal Commerce has 2 objectives: (1) Customer acquisition and (2) Customer retention. There are 3 ways to accomplish these 2 tasks: (1) Marketing, (2) Merchandizing and (3) Affinity programs.

Let us take “Merchandizing” (the business of bringing the product to the shopper). An online example is the “recommendation engine”. When you log into Amazon, their PA software will bring products from multiple product categories (books, electronics, etc.) to your tablet screen. An offline example is a brick-and-mortar store that arranges beer brands on its physical shelf such that it will entice its shoppers to buy a particular brand (assume that the store has a promotion agreement with the brand and hence a business reason to do so). This type of merchandizing is called “assortment optimization”. Note that both Assortment Optimization and Recommendation Engine are general concepts that have many variations in their applications. The MBA approach below applies to the 2 Personal Commerce objectives and the 3 programs to accomplish them.

Assortment Optimization:
As a practical example of MBA, let us explore Assortment Optimization. From the various data sources available to you, you construct a model of the shopper groups with their beer-affinity as the dependent variable. Then you construct a model of a specific store with the shopper groups as the dependent variable. Once these 2 models are in hand, you combine them to obtain the optimal shelf assortment for beer at that particular store so that the store revenue can be maximized.
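To make “combine them” a little less abstract, here is one simple way the combination *could* work; this is a sketch under my own assumptions (made-up shopper groups, affinities and store mix), not the actual model construction referred to above:

```python
import numpy as np

# Model 1 (hypothetical): beer-affinity of each shopper group for each brand,
# e.g. expected spend per store visit.     brands ->  A     B     C     D
group_brand_affinity = np.array([[4.0, 1.0, 0.5, 2.0],   # "value seekers"
                                 [1.0, 5.0, 2.0, 0.5],   # "craft enthusiasts"
                                 [2.0, 2.0, 3.0, 1.0]])  # "occasional buyers"

# Model 2 (hypothetical): share of each shopper group among this store's visitors.
store_group_mix = np.array([0.5, 0.2, 0.3])

# Combine the two models: expected revenue contribution of each brand at this store,
# then fill the limited shelf with the top-scoring brands.
brand_scores = store_group_mix @ group_brand_affinity
shelf_slots = 2
assortment = np.argsort(brand_scores)[::-1][:shelf_slots]
print("brand scores:", brand_scores, "-> stock brands (by index):", assortment)
```

A real assortment model would, of course, add constraints (shelf space, margins, supplier agreements) and be estimated from T-log data rather than typed in by hand.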

Clearly, I have not explained the details of the construction of these models and how they can be combined to give you the optimal product assortment. That is not the focus of this blog – it is to show that such an approach will allow you to contain the 3 banes of PA quantitative methods and hence get powerful results. In my actual use case, we achieve “sizes of the prize” (in industry parlance, the potential peak increase in revenue) greater than with any current merchandizing Big Data method!

(1)    Non-stationarity: As we discussed earlier, different systems vary over time in their own way. If you always used the past 52 weeks of data, it may be appropriate for some products but not for others. For example, in certain cases of Fashion or FMCG, non-homogeneity can be minimized by selecting shorter durations, but not so short that you do not have enough representative data!
(2)    Heteroscedasticity: There is a fundamental concept here again of choosing just enough data (even if you have tons in your Big Data store!) that address the business question you have but not too much. When you have selected just enough data, you may also escape severe heteroscedasticity. If not, variable transformations (such as log transformation) may have to be adopted.
(3)    Noise: As we noted, Noise is unwanted data. Consider the previous Merchandizing case, but where you try to analyze 2 product categories together, say, Beer and Soap. Since the fundamental purchase-propensity driving forces are most likely different for these two product categories, one will act as noise to the other – deal with them separately. In addition, some eigen-decomposition pre-processing may allow you to separate “signal” from “noise” (a small sketch of all three mitigations follows this list).
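Here is the promised sketch of the three mitigations, under my own illustrative assumptions (a weekly sales series, a fixed window, a fixed rank cut-off); there is no magic formula here, just the mechanics:

```python
import numpy as np

def prepare(weekly_sales, window_weeks=26):
    """(1) Non-stationarity: keep only a recent window of the series.
       (2) Heteroscedasticity: log-transform to tame variance that grows with the level."""
    recent = np.asarray(weekly_sales, dtype=float)[-window_weeks:]
    return np.log1p(recent)                    # log(1 + x) also copes with zero-sales weeks

def svd_denoise(data_matrix, rank=3):
    """(3) Noise: keep only the leading singular components of a shoppers-by-brands matrix
       and treat everything else as "unwanted data"."""
    U, s, Vt = np.linalg.svd(np.asarray(data_matrix, dtype=float), full_matrices=False)
    s[rank:] = 0.0                             # zero out the minor (noise-dominated) components
    return (U * s) @ Vt
```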

Many of you will find this discussion inadequate – part of it is because some details are trade secrets and part of it is because there are no magic formulas. Each business problem is different and calls for ingenuity and insight particular to that data set and business question.

I have only scratched the surface of Model-based Analytics here. The sub-disciplines of Electrical Engineering such as Digital Signal Processing, Control Theory and Systems Theory are replete with frameworks and solutions developed over the last two decades or so for deploying model-based solutions and extending them to closed-loop systems. Thus, we go beyond predictions to actions, with their results fed back into the model to refine its predictions. The next five years will see a great increase in the efficacy of Predictive Analytics solutions with the incorporation of more model-based approaches.



Post Script: Other major “flies in the ointment” are non-linearity and non-normality; I am of the opinion that practical and efficient methods to battle these issues are still not available (I know that the thousands of authors of books and papers in these fields will disagree with me!). So, the approach I take is that non-linearity and non-normality issues are minor in most cases and MBA techniques will work adequately; when in a few cases I cannot make any headway, I reckon that these issues are so extreme that I have a currently-intractable problem!

Friday, August 9, 2013

Social Network Theory


In the last few decades, interest in Social Networks has seen explosive growth, with Facebook being the poster child. Indeed, work had gone on before that time in Physics, especially in the area of “random networks” (Erdos and Renyi are famous names). In such networks, each new node is randomly connected to existing nodes and the evolution of the network is studied.
The work of Barabasi and others showed that while the physics of Random Networks was interesting, it had little to do with “human-made” networks such as the Web, the Internet, friendship networks, etc. These all fall into the category of “Scale-Free Networks”.
Scale-Free Network (SFN) Model:
“Scale-free” refers to the nature of some networks whose degree distribution has no characteristic scale; one consequence is that the diameter (the average # of links needed to reach one node from another; remember the “six degrees of separation” between people) stays small and grows only very slowly as the network grows in size.
An SFN’s degree (# of links per node) is distributed as a power law (and not the binomial distribution that holds for a purely random network). A well-known example of a power law is the Pareto distribution (the “80/20 rule”). SFNs arise in networks that grow and exhibit preferential attachment while growing (giving rise to the “rich get richer” and “first mover advantage” phenomena).

Figure 1: Pareto distribution (example of a Power Law distribution)
If a “fitness” rating is attached to each node, a new phenomenon of “winner takes all” emerges (the Google search engine rating very high among websites, as an example). This has been shown to be a “Bose-Einstein condensate” state (somewhat similar to the condensation transitions that occur in traffic jams where long queues of cars are found, in wealth distribution models where a few people might have a huge fraction of all the wealth, and in operating systems usage where Windows has close to 90% market share).
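If you want to see preferential attachment and the resulting heavy-tailed degree distribution for yourself, a few lines of Python with the networkx package will do it (my example, not from the references):

```python
import collections
import networkx as nx

# Grow a scale-free network: each new node attaches preferentially to well-connected nodes.
G = nx.barabasi_albert_graph(n=10000, m=2, seed=42)

degree_counts = collections.Counter(d for _, d in G.degree())
for degree in sorted(degree_counts)[:10]:
    print(degree, degree_counts[degree])    # counts fall off roughly as a power law in degree
```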
Real Social Networks:
Real social networks have 3 major features:
1. Clusters (of friends who “hang out”).
2. Weak links (to your long forgotten high school classmates).
3. Hubs (politicians and others with massive number of contacts).

Figure 2: Natural social network structure with Cluster Groups
Looking at the Pareto (power law) distribution that governs SFNs, we can see that Hubs with many links (large “k”) are nodes that occur rarely, whereas nodes with few links, typical of Clusters, occur very often. Here we elaborate on a mid-scale situation where certain nodes have a fair number of links and occur with some frequency – we call them “Cluster Groups”. They arise in cases where a certain Member belongs to multiple Clusters: their family cluster, work cluster, golf club cluster, etc. The same Member belonging to many Clusters that are tightly coupled leads to a Cluster Group.
Failure Modes:
Failure modes of SFNs have been widely studied. The 2 main observations are that (1) SFNs are robust to failure and (2) SFNs are vulnerable to attack.
Failure means random loss of network elements, such as some computers or local routers going down in the Internet. Attack refers to the concerted take-down of major “hub” routers. It has been shown that removal of a small percentage (15 to 20%) of “hub” routers will pull apart the network, and “cascading failure” will spread (due to routers NOT under attack being unable to carry the extra load from the “hub” routers), leading to total network failure.
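The robustness/vulnerability asymmetry is easy to reproduce in simulation; here is my own toy sketch (assuming the size of the largest connected component is a fair proxy for “the network still works”):

```python
import random
import networkx as nx

def giant_fraction(G):
    """Fraction of nodes sitting in the largest connected component."""
    if G.number_of_nodes() == 0:
        return 0.0
    return max(len(c) for c in nx.connected_components(G)) / G.number_of_nodes()

def remove_and_measure(G, fraction, targeted):
    """Remove a fraction of nodes, either at random (failure) or highest-degree first (attack)."""
    H = G.copy()
    k = int(fraction * H.number_of_nodes())
    if targeted:
        by_degree = sorted(H.degree, key=lambda nd: nd[1], reverse=True)
        victims = [node for node, _ in by_degree[:k]]
    else:
        victims = random.sample(list(H.nodes()), k)
    H.remove_nodes_from(victims)
    return giant_fraction(H)

G = nx.barabasi_albert_graph(5000, 2, seed=1)
print("random failure, 20% removed :", remove_and_measure(G, 0.20, targeted=False))
print("targeted attack, 20% removed:", remove_and_measure(G, 0.20, targeted=True))
```

Random removals barely dent the giant component; removing the same number of hubs damages it far more.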

Here are N=10 and N=100 examples; d=2 in both cases.
The N=100 example already shows some interesting features. Remember “Hubs”? One would expect the “root” node at the very top to be a Hub but the left-most node in the 2nd level seems to be developing into a Hub with lots more “child” nodes.

Have fun with social networks!
PG


Friday, April 19, 2013

Parlor Games, Compressive Sensing, Data Analytics . . . & small BIG Data?


Parlor Game . . .
Surely, you have been part of a “guessing game” at a dinner party. Someone says, “Think of 10 numbers and tell me their sum and then tell me 2 of the 10 numbers; I will tell you the other 8 numbers you thought of”. The general reaction will be that this person is crazy! Too many unknowns, not enough constraints, too many solutions and so on.

Compressive Sensing:
There is a tsunami of activity in mathematical circles these days related to the parlor guessing game with great solutions for the guesses – which sounds like magic (in fact, “l1-Magic” is the name of a popular Matlab routine that provides a solution).

Donoho and Candes are the prime movers of Compressive Sensing (CS) and related theories, and their pedigrees are impressive (Donoho is a MacArthur "genius award" winner). For engineers, the simplest context in which to understand CS is data acquisition. If a time series is a sum of 2 sinusoids, we have to sample it at twice the frequency of the sinusoid with the highest frequency (at the Nyquist rate). Depending on the relative frequencies of the sinusoids, we may have to acquire hundreds of samples to be able to reconstruct the original time series faithfully. CS theory asserts that you only need about 8 random samples, and not hundreds, to faithfully reconstruct the data! On the face of it, this assertion looks absurd in that it flies in the face of the long-standing understanding of the Nyquist criterion.

However, a glimmer of insight may arise when we take note of the fact that there are only 2 sinusoids in the data! Somehow, if we can gather the information about these 2 sinusoids from 8 random samples, we may be able to reconstruct the data fully. Under such a “rank-limited” situation (along with a few easy-to-meet restrictions), CS theory provides a way to completely recreate the original data.

CS theory is accessible and 3 easy to read intros are (1) Candes & Wakin, “Introduction to Compressive Sampling”, IEEE Signal Processing Magazine, 2008, (2) Baraniuk, “Compressive Sensing lecture notes”, IEEE Signal Processing Magazine, 2007 and (3) Candes & Plan, “Matrix Completion with Noise”, Proceedings of the IEEE, 2010.

As you will find from these and other excellent articles on the topic, the approach is to solve a constrained minimization problem using a certain norm. The l2-norm we are used to is largely unsuitable for this task because it almost never reaches the desired sparse solution (perhaps the reason why no one saw this solution until the 2000s). The so-called l0-norm will work but is not tractable. However, the l1-norm (or the “taxi-cab” metric) does the trick. Many linear programming solutions exist; MATLAB implementations are available from multiple websites.
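Here is a small end-to-end sketch of the idea in Python rather than MATLAB (my own toy, using a DCT basis so everything stays real-valued): a signal that is sparse in frequency is sampled at a handful of random time points and recovered by l1-minimization posed as a linear program.

```python
import numpy as np
from scipy.fft import idct
from scipy.optimize import linprog

rng = np.random.default_rng(7)
n = 256
Psi = idct(np.eye(n), axis=0, norm="ortho")    # columns = DCT basis vectors in the time domain

# A frequency-sparse signal: only 2 active "tones" out of n possible.
coeffs = np.zeros(n)
coeffs[[30, 71]] = [1.0, 0.7]
x = Psi @ coeffs

# Acquire far fewer samples than classical sampling would suggest.
m = 30
rows = rng.choice(n, size=m, replace=False)
A, y = Psi[rows, :], x[rows]

# Basis pursuit: minimize ||c||_1 subject to A c = y, via the split c = u - v with u, v >= 0.
res = linprog(c=np.ones(2 * n),
              A_eq=np.hstack([A, -A]), b_eq=y,
              bounds=[(0, None)] * (2 * n), method="highs")
c_hat = res.x[:n] - res.x[n:]
x_hat = Psi @ c_hat                            # reconstruct all 256 samples from just 30
print("max reconstruction error:", np.max(np.abs(x_hat - x)))
```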

Data Analytics:
The references I mentioned above provide motivation from the data analytics domain (mostly from the Netflix recommendation challenge). Let me give an example from Retail analytics.

In Retail/CPG analytics, a large amount of shopper data is collected via loyalty cards, transaction logs, credit card information, etc. All of these data are collected and collated into a large data matrix. For example, the data matrix may contain for each shopper (each row of the matrix) the amount they spent over a year on soaps from different brands (each column is a CPG brand). A relatively small data set in this business may have 5 or 10 million entries; however, most of the entries will be missing (usually marked as “zero”). This stands to reason because I may buy only 1 or 2 brands of the hundreds of CPG soap brands; so the rest of my row will be zeros. In one example, we had 93% of the entries as zeros!

To relate this situation to the sum-of-sinusoids sampling, remember that the signal was “rank-limited” in that it was made up of only 2 sinusoids. Low rank is one of the primary reasons that CS works. In the case of the soap data matrix, we again have a low-rank situation since the underlying preference for soap depends only on a few parameters such as smell, anti-bacterial property, color, cost, etc. In other words, if we do an eigen-decomposition of the true data matrix, only a few eigenvalues (singular values) will be significant. Hence, using the convex-optimization approach and minimizing the l1-norm (applied to the singular values – the “nuclear norm” – in the matrix case), we are able to reconstruct the other 93% of the entries!
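For the matrix case, the same l1 philosophy acts on the singular values. A bare-bones iterative sketch (a SoftImpute-style loop; my simplification, certainly not a production algorithm) looks like this:

```python
import numpy as np

def complete_matrix(M, observed_mask, tau=0.5, n_iter=200):
    """Fill in the missing entries of M (only entries where observed_mask is True are trusted)
    by repeatedly soft-thresholding the singular values of the current estimate."""
    X = np.where(observed_mask, M, 0.0)                   # start with zeros in the holes
    for _ in range(n_iter):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        Z = (U * np.maximum(s - tau, 0.0)) @ Vt           # shrink small singular values away
        X = np.where(observed_mask, M, Z)                 # keep known entries, impute the rest
    return X

# Toy check: a rank-2 "shopper x brand" spend matrix with ~90% of the entries hidden.
rng = np.random.default_rng(3)
true = rng.random((200, 2)) @ rng.random((2, 40))         # rank-2 ground truth
mask = rng.random(true.shape) < 0.10                      # only ~10% of entries observed
recovered = complete_matrix(true, mask)
hidden_rmse = np.sqrt(np.mean((recovered[~mask] - true[~mask]) ** 2))
print("RMSE on the hidden entries:", hidden_rmse)
```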

So now, we have a tractable (“magical”) solution to identify the entries of the “soap” data matrix. What do these entries tell us?

The reconstructed entry is the Data Scientist’s best prediction of what a shopper would spend on a certain CPG soap brand. This is hugely valuable – it is an excellent proxy for a shopper’s propensity to purchase a particular brand. A CPG can group the shoppers based on their purchase propensities, identify a group for whom it is low, and target them with advertisements or discount offers to “switch” them. This information can be used by media buyers to optimize their ad dollar ‘spend’. Knowing which shoppers visit a specific store location, Retailers can assay their ‘spends’ and stock their shelves appropriately. The applications are virtually limitless.

small BIG Data:
If we can reconstruct 93% of the data from a 7% random sample as in the example above, who needs BIG data?

Clearly, we do not need BIG data *collection*. But we still need to regenerate BIG data (all the data matrix entries). But perhaps we do not need to store it; it can be regenerated at will (convergence of the algorithms seems fast). So, we do not need BIG data collection or storage, but we do need to process BIG data. This seems to point to in-memory BIG data solutions for this type of data analytics application . . . the next few months will tell.