Wednesday, August 28, 2013

Predictive Analytics Automation

There is much discussion about whether Predictive Analytics (PA) can be automated or not. This is a false dichotomy.

Predictive Analytics is a strange beast - it needs to be ‘learned by learning‘ and ‘learned by doing‘ – BOTH! That is due to the interconnected nature of the field. To be a successful hyper-specialist in “left nostril” diseases, one needs to have done Anatomy, Physiology and Biochemistry in med school. Similarly, for PA, learning-by-learning (which takes at least 6 years of grad school) is not a step you can skip and go directly to learning-by-doing and hope to become a true curer of business diseases!

In PA, learning-by-doing can be an even steeper curve. As I have noted before in my blogs, PA skills will have to be rounded out with mathematical inventiveness and ingenuity applied repeatedly in a specific business vertical. These are the hallmarks of an uber Data Scientist. Clearly, an uber data scientist as described above cannot be bottled and passed around. Don’t even think of “automating” all the things that an uber data scientist does. So what do we do about “scaling”? Are there support pieces we can automate to scale the solution.

Comparison to a programing environment such as MATLAB is appropriate. MATLAB supplies you with all kinds of toolboxes. Similarly, in PA, many basic operations can be automated – clustering, learning, classification, etc. But, like MATLAB, you also need an environment where these toolboxes can be fine-tuned with inventiveness appropriate to the business vertical, mixed and matched and augmented with additional one-off solutions to address the overall business problem at hand. Otherwise, the solution will fall short (or flat!).

So, part of PA can be automated. PA toolboxes can be fine-tuned by data scientist associates and the overall solution can be conceived and put together with these toolboxes (with added “glue”) by the uber data scientist.

Note that everything I talked about here refers to PA solution development. Once the overall solution is developed, “production runs” by customer personnel and visualizations by executives of the PA solution developed above can be mostly automated (with data scientist looking over their shoulders – data can change on you on a dime; someone has to watch for the sanctity of the data and non-stationarity problems!). Production is where the solution needs to scale and it can.

In summary, PA solution development will require manual work by uber data scientists supported by data science associates; automated toolboxes for basic PA functions will help speed up the process and once the overall solution is manually cobbled together, production runs can be automated along with some amount of ongoing data science audit of the process and results.

Dr. PG Madhavan developed his expertise in analytics as an EECS Professor, Computational Neuroscience researcher, Bell Labs MTS, Microsoft Architect and startup CEO. Overall, he has extensive experience of 20+ years in leadership roles at major corporations such as Microsoft, Lucent, AT&T and Rockwell as well as four startups including Zaplah Corp as Founder and CEO. He is continually engaged hands-on in the development of advanced Analytics algorithms and all aspects of innovation (12 issued US patents with deep interest in adaptive systems and social networks).

Tuesday, August 20, 2013

Predictive Analytics – Where from and where to?

Dr. PG Madhavan developed his expertise in analytics as an EECS Professor, Computational Neuroscience researcher, Bell Labs MTS, Microsoft Architect and startup CEO. Overall, he has extensive experience of 20+ years in leadership roles at major corporations such as Microsoft, Lucent, AT&T and Rockwell and startups, Zaplah Corp (Founder and CEO), Global Logic, Solavei and Aldata. He is continually engaged hands-on in the development of advanced Analytics algorithms and all aspects of innovation (12 issued US patents with deep interest in adaptive systems and social networks).

Big Data is big business but the “open secret” in this business is that what the paying client really wants is Predictive Analytics (“what should my business do next to improve?!”). To be explicit, Big Data is all about the technology for storage and manipulation of lots of data (terabytes to petabytes) as well as new types of data such as graph and unstructured data. This is an extremely important precursor to doing anything useful with data – 10 or 20 years ago, we had to settle for small amounts of representative data or simulated data. Outcomes were clever and new data analysis *techniques* rather than useful answers!

The next step is to make sense of the large amounts of data at your disposal (now that you have Hadoop and NoSQL and Graph database). This is where visualization and Descriptive Analytics come in. They provide historic “snap-shots” with graphs and charts – another important precursor to coming up with answers to the question, “What should my business do next to improve?”

In 2013, Big Data is maturing with still many open technology challenges; Descriptive Analytics is in its pre-teen years. Predictive Analytics is in its infancy, nurtured by its stronger siblings, Big Data and Descriptive Analytics!

Predictive Analytics (PA):
In a business context, Predictive Analytics (PA), attempts to answer the question, “What should my business do next to improve?”

Prediction is an old preoccupation of mankind! What comes next, next meal, next rain, next mate? Techniques have changed little from Stone Age till recently; see what has happened before, notice the cyclical nature and any new information; combine them to forecast what may come next. Take tomorrow’s temperature for example:
1.       Predict it as today’s temperature.
2.       Predict it as the average of Tuesday’s (today) temperature for the past month.
3.       Predict it as the average of Tuesday’s (today) temperature of the month of August for the past 50 years.
4.       Look at the pattern of today’s, yesterday’s and day before yesterday’s temperature; go back in the weather record and find “triples” that match (your criteria of match). Collect the next day’s temperature after matching patterns in the record and average them as your prediction of tomorrow’s temperature.
5.       And so on . . .
As you can imagine (and can easily demonstrate to yourself), predictions from 1 to 5 get better and better. If your business is in shipping fresh flowers, your business may be able to use just this simple Predictive Analytics method to lower your cost of shipping! Simple PA but answers your “What next?” question. By the way, this PA technique is not as na├»ve as it looks; complexity-theory-“quants” used to use its extended forms in financial engineering on Wall Street.

So one feature of PA is clear; there is a time element, historical data and future state. Historical data can be limited in time as in the first method of temperature prediction or as extensive as in the fourth method. Descriptive Analytics can be of use here for sure – looking at the trends in the temperature data plot, one can heuristically see where it is headed and act accordingly (to ship flowers or not tomorrow). However, PA incorporates time as an essential element in quantitative ways.

Quantitative Methods in PA:
My intent here is not to provide a catalog of statistical and probabilistic methods and when and where to use them. Hire a stats or math or physics or engineering Ph.D. and they will know them backwards and forwards. Applying it to Big Data in business however requires much more inventiveness and ingenuity – let me explain.

PA has an essential time element to it. That makes prediction possible but life difficult! There is a notion called “non-stationarity”. While reading Taleb’s or Silver’s books, I have been puzzled by finding that this word is hardly mentioned (not once in Taleb’s books, if I am not mistaken). One reason may be that those books would have ended up being 10 pages long instead of the actual many 100’s of pages!

Non-stationarity is a rigorous concept but for our purposes think about it as “changing behavior”. I do not look, act and think the same as I did 30 years ago – there is an underlying similarity for sure but equally surely, the specifics have changed. Global warming may be steady linear change but it will systematically affect tree-ring width data over centuries. Some other changes are cyclical – at various times, they are statistically the same but at other times, they are different. Systems more non-stationary than this will be all over the place! Thus, historical and future behavior and resultant data have variability that constrains our ability to predict. Not all is lost – weather can be predicted pretty accurately for up to a week now but not into the next many months (this may be an unsolvable issue per complexity theory). Every system, including business “systems”, has its window of predictability; finding the predictability window and finding historical data that can help us predict within that window is an art.

I do not want this blog to be a litany of problems but there are two more that need our attention. Heteroscedasticity is the next fly in the ointment! This is also a formal rigorous statistical concept but we will talk about it as “variability upon variability”.  Business data that we want to study are definitely variable and if they vary in a “well-behaved” way, we can handle them well. But if the variability varies, which is often the case in naturally occurring data, we have constraints in what we can hope to accomplish with that data. Similar to non-stationary data, we have to chunk them, transform variables, etc. to make them behave.

The third issue is that of “noise”. Noise is not just the hiss you hear when you listen to an AM radio station. The best definition for “noise” is the data that you do not want. Desirable data is “signal” and anything undesirable is “noise”. In engineered systems such as a communication channel, these are clearly identifiable entities – “signal” is what you sent out at one end and anything else in additional to the signal that you pick up at the receiver is “noise”. In a business case, “unwanted” data or “noise” are not so clearly identifiable and separable. Think of “Relative Spend” data among various brands of beer in a chain of stores; or sentiment analysis results for those brands. Depending on the objective of the analysis, the signal we are looking for may be “purchase propensity” of each shopper (so that we can individually target them with discount offers). Relative Spends and Likes are not pure measures of “purchase propensity” – I may have bought Brand X beer because my friend asked me to pick some up for him which has nothing to do with my purchase propensity for that brand! Purchases like that will pollute my Relative Spend data. How do you separate this noise from the data. There may also be pure noise – incorrect entry of data, data corruption and such errors that affect all data uniformly.

Fortunately, there is a framework from engineering that provides a comprehensive approach to containing these issues while providing powerful analytics solutions.

Model-based Analytics (MBA):
Model-based and model-free methods have a long history in many areas of Engineering, Science and Mathematics. I take an Engineering approach below.

Data that you have can be taken “as is” or can be considered to be generated by an underlying model. “Model” is a very utilitarian concept; it is like a “map” of the world. You can have a street map or a satellite map – you use them for different purposes. If you need to see the terrain, you look for a topographic map. A map is never fully accurate in itself – it serves a purpose. As the old saw goes, if a world map has to be accurate, the map will have to be as large as the world – what is the point of a map then?

Let us call “model-free” methods as today’s “Data Analytics (DA)” to distinguish it from “Model-based Analytics (MBA)”. DA analyzes measured data to aid business decisions and predictions. MBA attempts to model the system that generated the measured data.

Model-based methods form the basis of innumerable quantitative techniques in Engineering. Theory and practice have shown that data analysis approaches (similar to periodogram spectrum estimation) are robust but not powerful, while model-based methods (similar to AR-spectrum estimation) are powerful but not robust (incorrect model order, for example, can lead to misleading results - needs expert practitioner).

MBA go beyond data slicing/ dicing and heuristics. In our previous “brands of beer in a store chain” example, model-based approach hypothesizes that there is a system, either explicit or implicit, behind the scenes generating customer purchase behaviors and purchase propensities. From measured data, MBA identifies the key attributes of the underlying hidden system (to understand commerce business quantitatively) and provides ways to regulate system outputs (to produce desirable business outcomes).

MBA does not solve but alleviates the three pain points in Predictive Analytics quantitative methods: (1) Non-stationarity, (2) Heteroscedasticity and (3) Noise.

MBA – Personal Commerce Example:
I do not have an MBA (the university kind) nor have I taken a Marketing course. So here is a layman’s take on Personal Commerce. My main objective is to show a practical example of model-based predictive analytics within MBA framework.

Personal Commerce has 2 objectives: (1) Customer acquisition and (2) Customer retention. There are 3 ways to accomplish these 2 tasks: (1) Marketing, (2) Merchandizing and (3) Affinity programs.

Let us take “Merchandizing” (the business of bringing the product to the shopper). An online example is “recommendation engine”. When you log into Amazon, their PA software will bring products from multiple product categories (books, electronics, etc.) to your tablet screen. An offline example is brick-and-mortar store that arranges beer brands on their physical shelf such that it will entice their shoppers to buy a particular brand (assume that the store has a promotion agreement with the brand and hence a business reason to do so). This type of merchandizing is called “assortment optimization”. Note that both Assortment Optimization and Recommendation Engine are general concepts that have many variations in their applications. The MBA approach below applies to the 2 Personal Commerce objectives and the 3 programs to accomplish them.

Assortment Optimization:
As practical example of MBA, let us explore Assortment Optimization. From the various data sources available to you, you construct a model of the shopper groups with their beer-affinity as the dependent variable. Then construct a model of a specific store with the shopper groups as the dependent variable. Once these 2 models are in hand, you combine them to obtain the optimal shelf assortment for beer at that particular store so that the store revenue can be maximized.

Clearly, I have not explained the details of the construction of these models and how they can be combined to give you the optimal product assortment. That is not the focus of this blog – it is to show that such an approach will allow you to contain the 3 banes of PA quantitative methods and hence get powerful results. In my actual use case, we achieve “sizes of the prize” (in industry parlance; the potential peak increase in revenue) greater than any current merchandizing Big Data methods!

(1)    Non-stationarity: As we discussed earlier, different systems vary over time in their own way. If you always used the past 52 weeks of data, it may be appropriate for some products but not others. For example, in certain cases of Fashion or FMCG, non-homogeniety can be minimized by selecting shorter durations but not so short that you do not have enough representative data!
(2)    Heteroscedasticity: There is a fundamental concept here again of choosing just enough data (even if you have tons in your Big Data store!) that address the business question you have but not too much. When you have selected just enough data, you may also escape severe heteroscedasticity. If not, variable transformations (such as log transformation) may have to be adopted.
(3)    Noise: As we noted, Noise is unwanted data. Consider the previous Merchandizing case but where you tried to analyze 2 product categories together, say, Beer and Soap. Since the fundamental purchase propensity driving-forces are most likely different for these two product categories, one will act as noise to the other – deal with them separately. In addition, doing some eigen-decomposition pre-processing may allow you to separate “signal from noise”.

Many of you will find this discussion inadequate – part of it is because they are trade secrets and part of it is because there are no magic formulas. Each business problem is different and calls for ingenuity and insight particular to that data set and business question.

I have only scratched the surface of Model-based Analytics here. The sub-disciples of Electrical Engineering such as Digital Signal Processing, Control Theory and Systems Theory are replete with frameworks and solutions developed in the last two decades or so for deploying model-based solutions and extending them to closed-loop systems. Thus, we go beyond predictions to actions with their results fed-back into the model to refine its predictions. Next five years will see a great increase in the efficacy of Predictive Analytics solutions with the incorporation of more model-based approaches.

Post Script: Other major “flies-in-the ointment” are non-linearity and non-normality; I am of the opinion that methods that are practical and efficient are still not available to battle these issues (I know that the 1000’s of authors of books and papers of these fields will disagree with me!). So, the approach I take is that non-linearity and non-normality issues are minor in most cases and MBA techniques will work adequately; when in a few cases I cannot make any headway, I reckon that these issues are so extreme that I have a currently-intractable problem!

Friday, August 9, 2013

Social Network Theory

In the last few decades, interest in Social Networks has seen explosive growth with Facebook being the poster child. Indeed work had gone on before that time in Physics especially in the area of “random networks” (Erdos and Renyi are famous names). In such networks, each new node is randomly connected to any previous node and its evolution studied.
The work of Barabasi and others showed that while the physics of Random Networks was interesting, it had little to do with “human made” networks such as Web, Internet, friendship networks, etc. They all fall into the category of “Scale-Free Networks”.
Scale-Free Network (SFN) Model:
“Scale-free” refers to the nature of some networks where its diameter (average # of links to reach a node from the present node; remember “six degrees of separation” between people) is mostly unchanged as the network grows in size.
SFN’s degree (# of links per node) is distributed as a power law (and not a binomial distribution which is the case for a purely random network). A well-known example of power law is Pareto distribution (“80/20 rule”). SFNs arise in networks that grow and have preferential attachment property while growing (exhibits “rich get richer” and “first mover advantage” phenomena).

Figure 1: Pareto distribution (example of a Power Law distribution)
If a “fitness” rating is attached to each node, a new phenomenon of “winner takes all” emerges (Google search engine will rate very high among websites as an example). This has been shown to be a “Bose-Einstein” condensate state (somewhat similar to condensation transitions that occur in traffic jams where long queues of cars are foundin wealth distribution models where a few people might have a hugefraction of all the wealth and in operating systems usage where Windows has close to 90% market share).
Real Social Networks:
Real social network has 3 major features:
1. Clusters (of friends who “hang out”).
2. Weak links (to your long forgotten high school classmates).
3.Hubs (politicians and others with massive number of contacts).

Figure 2: Natural social network structure with Cluster Groups
Comparing the Pareto (power law) distribution that governs SFNs, we can see that Hubs with many links (large “k”) are nodes that occur rarely whereas Clusters with few links are nodes that occur very often. Here we elaborate a mid-scale situation where certain nodes have a fair number of links and occur with some frequency – we call them “Cluster Groups”. They arise in cases where a certain Member belongs to multiple Clusters, their family cluster, work cluster, golf club cluster, etc. The same Member belongs to many Clusters that are tightly coupled leading to a Custer Group.
Failure Modes:
Failure modes of SFNs have been widely studied. The main 2 observations are that (1) SFN is robust to Failure and (2) SFN is vulnerable to attacks.
Failure means random loss of network elements such as some computers or local routers going down in the Internet. Attack refers to concerted take-down of major “hub” routers. It has been shown that removal of a small percent (15 to 20%) of “hub” routers will pull apart the network and “cascading failure” will spread (due to routers NOT under attack being unable to carry the extra load from “hub” routers) leading to total network failure.

Here are N=10 and N=100 examples; d=2 in both cases.
The N=100 example already shows some interesting features. Remember “Hubs”? One would expect the “root” node at the very top to be a Hub but the left-most node in the 2nd level seems to be developing into a Hub with lots more “child” nodes.

Have fun with social networks!