
Wednesday, January 27, 2016

Analytics Made Simple


PG Madhavan, Ph.D.
Chief Algorist, Syzen Analytics, Inc.
Seattle, WA, USA

Brief bio: PG developed his expertise as an EECS Professor, Computational Neuroscience researcher, Bell Labs MTS, Microsoft Architect and leader of multiple startups. He has over 100 publications & platform presentations to Sales, Marketing, Product, Standards and Research groups as well as 12 issued US patents. Major Contributions:
·       Computational Neuroscience of the Hippocampal Place Cell phenomenon, related to the subject matter of the 2014 Nobel Prize in Medicine.
·       Random Field Theory estimation methods, relationship to systems theory and industry applications.
·       Systems Analytics bringing model-based methods into current analytics practice.
·       Four startups with two as Founder.
 

Four hard topics in Analytics are explained in plain English in this article:
1.       Machine Learning.
2.       Why is Predictive Analytics important to business?
3.       Prediction – the other dismal science?
4.       Future of Analytics.

Machine Learning in plain English

If someone asks you, “What is ML?”, what will be your conceptual, non-technical answer?

Mine is . . . ML is “cluster”, “classify” and “convert”. I use these words in their English language sense and not as techniques. What do I mean by that?

Cluster: Structure in the data is information – find the structure.
Classify: Transform structure into a Mathematical form.
Convert: Convert into insight/ action.
Do this by Learning – meaning, use the ability to generalize from experience.

This captures the essence of ML for me. From my experience, I find that –
·       Convert: best done by a “paired” (Data Scientist + Domain Expert) combo.
·       Classify: there is a grab bag of tools and techniques that the Data Scientist can exploit on their own. You can see my attempt at unifying this bag of tricks here – “Unifying Machine Learning to create breakthrough perspectives”.
·       Cluster: I am not referring to specific clustering *algorithms* here. This step is where the Data Scientist works to sense, identify and extract structure, patterns or features in the data, which are the bearers of information! (A toy sketch follows this list.)
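
Here is that sketch: a minimal Cluster → Classify → Convert pass using scikit-learn on synthetic data. The data set, the group count and the final "route to playbook" rule are all hypothetical stand-ins, not a prescription.

```python
# A toy Cluster -> Classify -> Convert pass over synthetic data.
# Everything here (data set, group count, the "playbook" rule) is illustrative.
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

X, _ = make_blobs(n_samples=300, centers=3, random_state=0)     # raw data

# Cluster: look for structure in the data (here, three groups).
groups = KMeans(n_clusters=3, n_init=10, random_state=0).fit_predict(X)

# Classify: turn the discovered structure into a mathematical form --
# a model that maps new observations onto the groups found above.
model = LogisticRegression(max_iter=1000).fit(X, groups)

# Convert: translate model output into an action; this rule is a placeholder
# that a Data Scientist + Domain Expert pair would define together.
new_points = np.array([[0.0, 0.0], [5.0, 5.0]])
for point, group in zip(new_points, model.predict(new_points)):
    print(f"point {point} falls in group {group} -> route to playbook #{group}")
```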

“Cluster” is the hardest part – the data do not tell you where the structure hides. Finding patterns is an “art” where inspiration, skill, experience, knowledge of inter-related theories, etc. play a major part. In a current piece of algorithm work, it turned out (after *months* of slicing and dicing the data) that rendering the data as “phasors” (or complex variables) revealed the hidden structure “by itself”!
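
The post does not say how the phasor representation was computed; as one hedged illustration, the sketch below uses the analytic signal (via the Hilbert transform) to turn a real-valued series into complex “phasors” with instantaneous amplitude and phase.

```python
# Rendering a real-valued series as "phasors" (complex variables) via the
# analytic signal; an illustrative route, not necessarily the one used above.
import numpy as np
from scipy.signal import hilbert

fs = 100.0                                   # sampling rate (assumed)
t = np.arange(0, 2.0, 1.0 / fs)
x = np.cos(2 * np.pi * 5 * t) * (1 + 0.3 * np.sin(2 * np.pi * 0.5 * t))

z = hilbert(x)                               # complex analytic signal x + j*H{x}
amplitude = np.abs(z)                        # instantaneous amplitude (envelope)
phase = np.unwrap(np.angle(z))               # instantaneous phase
inst_freq = np.diff(phase) * fs / (2 * np.pi)   # instantaneous frequency (Hz)

print(amplitude[:5], inst_freq[:5])
```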

If you are able to get at the most descriptive and discriminatory features at the “Cluster” stage, the rest of the steps will (almost) fall into place and provide a robust solution! If not, you may still succeed, but you will work many times harder to Classify and Convert and end up with non-optimal answers.

It must be clear that my comments apply only to the first-time development of an algorithm for a new business problem; once an end-to-end algorithm is in place, the Cluster-Classify-Convert steps can of course be automated for repeated application to similar data sets. But for first-time ML algorithm development, automation cannot replace art!

Why is Predictive Analytics important to business?

A prerequisite for performance at a high level in business is the ability to understand and manage complexity. Managing complex systems properly requires a great deal of data at the right time. Big Data provides the data we need; to put these data to work at the high levels of complexity required, while still managing that complexity, we have to anticipate what is about to happen and react when it happens, in a closed-loop manner. Predictive Analytics allows us to push our “system” to the edge (without “falling over”) in a managed fashion. This is why businesses embrace Predictive Analytics – to run at a high level of performance at the edge of complexity overload.

Prediction – the other dismal science?

An insightful person once said, “Prediction is like driving your car forward by looking only at the rearview mirror!”. If the road is dead-straight, you are good . . . UNLESS there is a stalled vehicle ahead in the middle of the road.

We should consider short-term and long-term prediction separately. Long-term prediction is nearly a lost cause. In the 80’s and 90’s, chaos and complexity theorists showed us that things can spin out of control even when we have perfect past and present information (predicting weather beyond 3 weeks is a major challenge, if not impossible). Even earlier, stochastic process theory told us that “non-stationarity”, where statistics evolve (slowly or quickly), can render longer-term predictions unreliable.

If the underlying systems do not evolve quickly or suddenly, there is some hope. In causal systems (in Systems Theory, this means that no future information of any kind is available in the current state of the system), where “the car is driven forward strictly by using the rearview mirror”, outcomes are predictable in the sense that, as long as the “road is straight” or “curves only gently”, we can be somewhat confident in predicting a few steps ahead. This may be quite useful in some Data Science applications (such as in Fintech).

Another type of prediction involves not the actual path of future events (or the “state space trajectories”, in the parlance) but the occurrence of a “black swan” or an “X-event” (for an elegant in-depth discussion, see John Casti, “X-Events: Complexity Overload and the Collapse of Everything”, 2013). For that matter, ANY unwanted event can be good to know about in advance – consider unwanted destructive vibrations (called “chatter”) in machine tools, as an example; early warning may be possible and very useful in saving expensive work pieces (“Instantaneous Scale of Fluctuation Using Kalman-TFD and Applications in Machine Tool Monitoring”). We find that sometimes the underlying system does undergo pre-event changes (such as approaching “complexity overload”, “state-space volume inflation”, “increase in degrees of freedom”, etc.) which may be detectable and trackable. However, there is NO escaping False Positives (and the associated waste of resources preparing for an event that does not come) or False Negatives (being blind-sided when we are told it is not going to happen).

At Syzen Analytics, Inc., we use an explicit systems theory approach to Analytics. In our SYSTEMS Analytics formulation (“Future of Analytics – a definitive Roadmap”), the parameters of the system and their variation over time are tracked adaptively in real time, which tells us how far into the future we can predict safely – if the parameters evolve slowly or cyclically, we have higher confidence in our predictive analytics solutions.
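
Syzen’s actual formulation is not spelled out here, so the sketch below is only a generic stand-in: recursive least squares with a forgetting factor, one standard way to track slowly drifting model parameters and hence to gauge how quickly a system is evolving.

```python
# Recursive least squares with a forgetting factor: a generic way to track
# time-varying parameters of y[t] = phi[t] . theta[t] + noise.
# A stand-in sketch, not Syzen's SYSTEMS Analytics implementation.
import numpy as np

rng = np.random.default_rng(1)
T, d, lam = 500, 2, 0.98                 # steps, parameter count, forgetting factor
theta_hat = np.zeros(d)
P = np.eye(d) * 1e3                      # large initial covariance

for t in range(T):
    true_theta = np.array([1.0 + 0.002 * t, -0.5])   # slowly drifting system
    phi = rng.normal(size=d)                         # regressor vector
    y = phi @ true_theta + 0.1 * rng.normal()

    k = P @ phi / (lam + phi @ P @ phi)              # standard RLS update
    theta_hat = theta_hat + k * (y - phi @ theta_hat)
    P = (P - np.outer(k, phi @ P)) / lam

print("tracked:", theta_hat, "true (final):", true_theta)
```

The forgetting factor plays the role of the prediction window: the faster the parameters drift, the shorter the horizon over which the tracked model can be trusted.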

Wanting to know the future has always been a human preoccupation – we see that you cannot truly know the future, but in some cases predictions are possible to some extent . . . surrounded by many caveats; more “excuses” than definitive answers. Sounds a lot like a dismal science!

Future of Analytics – Spatio-temporal data

As businesses push to higher levels of performance, higher fidelity models are going to be necessary to produce more accurate and hence valuable predictions and recommendations for business operations.

ALL data are spatio-temporal! From the simplest to more complex levels -
·       Data can be considered isolated at the simplest level – a “snapshot”.
·       Then we realize that data exist in a “social” network with mutual interactions.
·       In reality, data exist in *embedded* forms in “influence” networks of one type or another, distributed in time and space – a “video”!

The spatial extent of data (distance) can be folded into time if we assume a certain information diffusion speed. Graph-theoretic methods do not account for the time dimension. For accurate analysis, there is no escaping Dynamics over Time, meaning the use of differential (or difference) equations . . . and Systems Theory!
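
As a toy illustration of “dynamics over time on a network”, the difference equation below diffuses a quantity over a small, made-up influence graph; the adjacency matrix and the rate alpha are arbitrary choices for the example.

```python
# A difference equation on a network: x[t+1] = x[t] - alpha * L x[t],
# where L is the graph Laplacian of a small, made-up influence graph.
import numpy as np

A = np.array([[0, 1, 1, 0],              # adjacency of a 4-node influence graph
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)
L = np.diag(A.sum(axis=1)) - A           # graph Laplacian
alpha = 0.2                              # diffusion rate, kept small for stability

x = np.array([1.0, 0.0, 0.0, 0.0])       # "information" starts at node 0
for _ in range(50):
    x = x - alpha * (L @ x)              # discrete-time dynamics over the graph

print(x)                                 # spreads toward the network average
```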


Systems Theory + Analytics = “SYSTEMS Analytics”! A few example business applications are shown above. As you can see, it spans most of the current Analytics use cases and many more promising ones once network graphs and the spatio-temporal nature of data are fully incorporated in the coming years – basic theories and some algorithms are already in hand. For specific technologies, see –
·       For a full 30-minute discourse, the YouTube video on “Future of Analytics – a definitive roadmap”.

From the simple explanation of ML, the power and limitations of prediction and the promising Analytics technology roadmap ahead, it is clear that Data Science is indeed a rich area to mine – one that can create an even bigger impact on business performance in the coming years.

PG Madhavan



 

Sunday, May 17, 2015

“IA not AI” in Retail Commerce – Enhanced Tanpin Kanri



Highlights: Enhanced Tanpin Kanri is a specific example of Intelligence Augmentation. Store staff's local knowledge and engagement with their shoppers cannot be replaced; but Big Data & Analytics can provide a significant leg-up in their difficult job of hypothesis-generation by providing data-driven predictions that they can safely rely on and improve incrementally. In a highly competitive low-margin business such as fast moving consumer goods retail, pioneering use of IA in their operations will determine the winners.

Dr. PG Madhavan is the Founder and Chairman of Syzen Analytics, Inc. He developed his expertise in analytics as an EECS Professor, Computational Neuroscience researcher, Bell Labs MTS, Microsoft Architect and startup CEO. He has been involved in four startups with two as Founder. He has over 100 publications & platform presentations to Sales, Marketing, Product, Industry Standards and Research groups and 12 issued US patents. He conceived and leads the development of SYSTEMS Analytics and is continually engaged hands-on in the development of advanced Analytics algorithms.

Artificial Intelligence or “AI” is the technology of the future; it has been so for the past 50 years . . . and it continues to be today! Intelligence Augmentation or “IA” has been around for as long. IA as a paradigm for value creation by computers was demonstrated by Douglas Engelbart during his 1968 “Mother of all Demos” (and in his 1962 report, “Augmenting Human Intellect: A Conceptual Framework”). While Engelbart’s working demo had to do with the mouse, networking, hypertext, etc. (way before their day-to-day use), IA has increased in scope massively in the last 5 plus years. Now, Big Data & Analytics can truly augment human intelligence beyond anything Engelbart could have imagined.


IA is contrasted with Artificial Intelligence which in its early days was the realm of theorem proving, chess playing, expert systems and neural networks. AI has made large strides but its full scope is yet to be realized. IA on the other hand can be put to significant use and benefit today. Such is the story of Enhanced Tanpin Kanri in retail commerce.


As a keen observer of the retail ecosystem’s demand-chain portion for the past 3 years or so, I am struck by the inefficiencies in large portions of Retail. Shoppers not finding what they want on the shelf is considered a BIG problem in the industry (the so-called “OOS” or Out-Of-Stock problem) – to the tune of $170 Billion per year! The following diagram illustrates the portion of the Retail ecosystem on which we will focus in this blog.


The left half represents the current dominant model. The “Push” model drives manufacturers’ FMCG (fast-moving consumer goods from “brands” such as Procter & Gamble, Kraft, Unilever and others) into the supply chain, with store shelves as the final destination. As retailers move to the right half of the picture, they tend to be more agile and mature in their practices and seek a competitive edge through differentiation – they find that by focusing their store operations on satisfying the LOCAL customer, making available what she prefers on the store shelves, they can win. The poster child of this revolution is 7-Eleven, the ubiquitous store at virtually every street corner around the globe!

7-Eleven’s use of the “Pull” model since the early 2000’s has been very successful, as captured in a Harvard Business School case study (HBS Case Study 9-506-002, rev. February 23, 2011). The study notes that “Toshifumi Suzuki, Chairman and CEO of Seven and I Holdings Co., was widely credited as the mastermind behind Seven-Eleven Japan’s rise” and goes on to say that “Suzuki’s emphasis on fresh merchandise, innovative inventory management techniques, and numerous technological improvements guided Seven-Eleven Japan’s rapid growth. At the core of these lay Tanpin Kanri, Suzuki’s signature management framework”.

The proof is in the pudding – “Tanpin Kanri has yielded merchandising decisions that has decreased inventory levels, while increasing margins and daily store sales” since the 2000’s, the HBS case study points out. So what exactly is Tanpin Kanri?

Tanpin Kanri, or "management by single product," is an approach to merchandising pioneered by 7-Eleven in Japan that considers demand on a store-by-store and product-by-product basis. Essentially, it empowers store-level retail clerks to tweak suggested assortments and order quantities based on their own educated hypotheses . . .

You can tell that Tanpin Kanri lies well to the right in the Retail Ecosystem diagram above. To call out some features:
·       PULL model
·       For a buyers’ market
·       Focus on How to satisfy customer
·       Item planning and supply driven by retailer and customer
·       Symbolic of Consumer Initiative

I believe that it is simply a matter of time before Tanpin Kanri variants dominate the Retail demand chain model. As shoppers clamor even more for their preferences to be made available, Retailers will evolve incrementally.

So, where does the PULL Model provide most bang for the buck today?


Wherever Product Density (the number of products to be stocked per unit area) is high, “customer pull” will help prioritize which products to stock. Today’s Tanpin Kanri at 7-Eleven Japan accomplishes “customer pull” incorporation via super-diligent store staff manually making the choices.

Here are the “hypothesis testing” steps that 7-Eleven store staff go through in operationalizing Tanpin Kanri. Based on frequent interactions and personal relationships with the shoppers at a store, the staff generate hypotheses about the shoppers’ needs, wants and dislikes. Based on such information, they formulate hypotheses of what to carry (or not) on their store shelves (“merchandising”). Sales during the following days and weeks allow them to ascertain whether their hypotheses should be rejected; this continuous iteration goes on over time to “track” shopper preferences. Clearly, the Tanpin Kanri methodology has been highly successful for 7-Eleven, according to the HBS case study.



IA based on Big Data & Analytics can play a major role in Tanpin Kanri. In any scientific methodology, coming up with meaningful hypotheses is the HARD part! In the Tanpin Kanri case, it is the personal relationships, diligence and intelligence of the store staff that help generate the hypotheses. This is the super-important human value-add that no AI can fully replace! However, we can AUGMENT hypothesis-driven Tanpin Kanri with a data-driven precursor that enhances staff intelligence by providing them with predictions they can build on to formulate their hypotheses.

IA happens in the data-driven precursor step. 7-Eleven has vast amounts of transaction and customer data in its data warehouse. These can be “data-mined” to find shopper preferences at a particular store, which can form the basis of what to carry, and how much, on the store shelf. The data-driven predictions then become the “foundation” on which the store staff add their own “deltas” based on the shopper quirks that they have surmised through their all-important personal relationships.

Syzen Analytics, Inc. has accomplished IA integration using Machine Learning and a new development in Analytics called “Systems Analytics”. A dumb “prediction” for SKU shares is the same sales as last year (see the multi-colored bar chart in the middle of the picture below) – in other words, historical sales is the “information-neutral” prediction. But surely we can do better than that with Systems Analytics.

Syzen is able to provide SKU-by-SKU, store-by-store and week-by-week predictions using typical T-Log data that every Retailer has in its data archives. A typical prediction of Syzen’s ROG-0 SaaS product for a typical SKU at a particular store looks like this.


·       The purple uneven “picket fence” (the lowest bar chart) shows the weekly predictions. It is obtained by combining different “masks”.
·       The papaya-colored mask is the new and most significant one. Its values are predicted from appropriate past intervals of T-Log data, digested via Systems Analytics and updated adaptively.
·       The numbered masks in the middle account for things the store manager knows will happen next year, such as a local festival or a rock concert in the nearby park.
·       The MANUAL part of Tanpin Kanri now only involves the store staff making small daily adjustments to the purple bar chart’s SKU “facings” based on local shopper “gossip”! (A toy sketch of how these pieces might combine follows this list.)
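
The exact mask arithmetic inside ROG-0 is not described in this post; the sketch below is only a guess at how such pieces could be composed – an adaptively predicted baseline, event multipliers the manager knows about, and a small manual delta from the staff.

```python
# A hypothetical composition of a weekly SKU prediction from "masks".
# The actual ROG-0 arithmetic is not published; this is only an illustration.
import numpy as np

weeks = 52
last_year = np.full(weeks, 100.0)                    # information-neutral baseline
adaptive = last_year * (1 + 0.1 * np.sin(np.arange(weeks) / 8.0))  # model-based mask

event_mask = np.ones(weeks)                          # manager-known events
event_mask[30] = 1.5                                 # e.g. local festival in week 30
event_mask[42] = 1.2                                 # e.g. rock concert in week 42

staff_delta = np.zeros(weeks)                        # small manual facings tweaks
staff_delta[10] = 5.0                                # local shopper "gossip"

prediction = adaptive * event_mask + staff_delta     # the purple "picket fence"
print(prediction[[10, 30, 42]])
```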

Convenience stores are drawn to the Enhanced Tanpin Kanri method because of the maturity of operations they already possess. With more agile supply chains and the desire to differentiate their stores in response to their local clientele, Syzen finds a lot of enthusiasm among “high-density” Retailers for our predictive solution, which makes Tanpin Kanri more scalable by reducing the dependence on super-diligent store staff. Advances in Systems Analytics and other quantitative methods will refine products such as Syzen’s ROG-0 SaaS in the future to sharpen shopper-preference-based product assortment predictions.

Enhanced Tanpin Kanri is a specific example of Intelligence Augmentation. Store staff's knowledge of local happenings and engagement with their store shoppers cannot be replaced; but Big Data & Analytics can provide a significant leg-up in their difficult job of hypothesis-generation by providing data-driven predictions that they can safely rely on and improve incrementally. In a highly competitive low-margin business such as fast moving consumer goods retail, pioneering use of IA in their operations will determine the winners.

Syzen website: www.syzenanalytics.com


Wednesday, August 28, 2013

Predictive Analytics Automation


There is much discussion about whether Predictive Analytics (PA) can be automated or not. This is a false dichotomy.

Predictive Analytics is a strange beast - it needs to be ‘learned by learning‘ and ‘learned by doing‘ – BOTH! That is due to the interconnected nature of the field. To be a successful hyper-specialist in “left nostril” diseases, one needs to have done Anatomy, Physiology and Biochemistry in med school. Similarly, for PA, learning-by-learning (which takes at least 6 years of grad school) is not a step you can skip and go directly to learning-by-doing and hope to become a true curer of business diseases!

In PA, learning-by-doing can be an even steeper curve. As I have noted before in my blogs, PA skills have to be rounded out with mathematical inventiveness and ingenuity applied repeatedly in a specific business vertical. These are the hallmarks of an uber Data Scientist. Clearly, an uber data scientist as described above cannot be bottled and passed around. Don’t even think of “automating” all the things that an uber data scientist does. So what do we do about “scaling”? Are there support pieces we can automate to scale the solution?

A comparison to a programming environment such as MATLAB is appropriate. MATLAB supplies you with all kinds of toolboxes. Similarly, in PA, many basic operations can be automated – clustering, learning, classification, etc. But, as with MATLAB, you also need an environment where these toolboxes can be fine-tuned with inventiveness appropriate to the business vertical, mixed and matched, and augmented with additional one-off solutions to address the overall business problem at hand. Otherwise, the solution will fall short (or flat!).

So, part of PA can be automated. PA toolboxes can be fine-tuned by data scientist associates and the overall solution can be conceived and put together with these toolboxes (with added “glue”) by the uber data scientist.

Note that everything I talked about here refers to PA solution development. Once the overall solution is developed, “production runs” by customer personnel and visualizations by executives can be mostly automated (with a data scientist looking over their shoulders – data can change on you on a dime; someone has to watch over the sanctity of the data and non-stationarity problems!). Production is where the solution needs to scale, and it can.

In summary, PA solution development will require manual work by uber data scientists supported by data science associates; automated toolboxes for basic PA functions will help speed up the process and once the overall solution is manually cobbled together, production runs can be automated along with some amount of ongoing data science audit of the process and results.



Dr. PG Madhavan developed his expertise in analytics as an EECS Professor, Computational Neuroscience researcher, Bell Labs MTS, Microsoft Architect and startup CEO. Overall, he has extensive experience of 20+ years in leadership roles at major corporations such as Microsoft, Lucent, AT&T and Rockwell as well as four startups including Zaplah Corp as Founder and CEO. He is continually engaged hands-on in the development of advanced Analytics algorithms and all aspects of innovation (12 issued US patents with deep interest in adaptive systems and social networks).

Tuesday, August 20, 2013

Predictive Analytics – Where from and where to?

Dr. PG Madhavan developed his expertise in analytics as an EECS Professor, Computational Neuroscience researcher, Bell Labs MTS, Microsoft Architect and startup CEO. Overall, he has extensive experience of 20+ years in leadership roles at major corporations such as Microsoft, Lucent, AT&T and Rockwell and startups, Zaplah Corp (Founder and CEO), Global Logic, Solavei and Aldata. He is continually engaged hands-on in the development of advanced Analytics algorithms and all aspects of innovation (12 issued US patents with deep interest in adaptive systems and social networks).


Big Data is big business but the “open secret” in this business is that what the paying client really wants is Predictive Analytics (“what should my business do next to improve?!”). To be explicit, Big Data is all about the technology for storage and manipulation of lots of data (terabytes to petabytes) as well as new types of data such as graph and unstructured data. This is an extremely important precursor to doing anything useful with data – 10 or 20 years ago, we had to settle for small amounts of representative data or simulated data. Outcomes were clever and new data analysis *techniques* rather than useful answers!

The next step is to make sense of the large amounts of data at your disposal (now that you have Hadoop, NoSQL and graph databases). This is where visualization and Descriptive Analytics come in. They provide historic “snapshots” with graphs and charts – another important precursor to coming up with answers to the question, “What should my business do next to improve?”

In 2013, Big Data is maturing with still many open technology challenges; Descriptive Analytics is in its pre-teen years. Predictive Analytics is in its infancy, nurtured by its stronger siblings, Big Data and Descriptive Analytics!

Predictive Analytics (PA):
In a business context, Predictive Analytics (PA) attempts to answer the question, “What should my business do next to improve?”

Prediction is an old preoccupation of mankind! What comes next – the next meal, the next rain, the next mate? Techniques have changed little from the Stone Age until recently: see what has happened before, notice the cyclical patterns and any new information, and combine them to forecast what may come next. Take tomorrow’s temperature, for example:
1.       Predict it as today’s temperature.
2.       Predict it as the average of Tuesday’s (today is a Tuesday) temperatures over the past month.
3.       Predict it as the average of Tuesday’s temperatures in the month of August over the past 50 years.
4.       Look at the pattern of today’s, yesterday’s and the day before yesterday’s temperatures; go back through the weather record and find “triples” that match (by your criteria of a match). Collect the next-day temperatures that follow the matching patterns and average them as your prediction of tomorrow’s temperature.
5.       And so on . . .
As you can imagine (and can easily demonstrate to yourself), predictions 1 through 5 get better and better. If your business is shipping fresh flowers, you may be able to use just this simple Predictive Analytics method to lower your cost of shipping! Simple PA, but it answers your “What next?” question. By the way, this PA technique is not as naïve as it looks; complexity-theory “quants” used to use its extended forms in financial engineering on Wall Street.
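
The four methods are simple enough to state in code; here is a hedged sketch on a synthetic daily series (the window lengths and matching tolerance are arbitrary choices).

```python
# The four naive temperature predictors described above, applied to a synthetic
# daily series `temps` (temps[-1] is today). Windows and tolerance are arbitrary.
import numpy as np

def predict_1(temps):
    return temps[-1]                                   # 1. tomorrow = today

def predict_2(temps, weeks=4):
    return temps[-1::-7][:weeks].mean()                # 2. same weekday, past month

def predict_3(temps, years=50):
    return temps[-1::-365][:years].mean()              # 3. same calendar day, past years

def predict_4(temps, tol=1.0):
    # 4. find historical "triples" matching the last three days, then average
    #    the temperature of the day that followed each match
    pattern = temps[-3:]
    followers = [temps[i + 3] for i in range(len(temps) - 3)
                 if np.abs(temps[i:i + 3] - pattern).max() <= tol]
    return np.mean(followers) if followers else predict_1(temps)

rng = np.random.default_rng(0)
temps = 20 + 5 * np.sin(np.arange(4000) * 2 * np.pi / 365) + rng.normal(0, 1, 4000)
print(predict_1(temps), predict_2(temps), predict_3(temps), predict_4(temps))
```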

So one feature of PA is clear: there is a time element – historical data and a future state. Historical data can be limited in time, as in the first method of temperature prediction, or as extensive as in the fourth method. Descriptive Analytics can be of use here for sure – looking at the trends in the temperature data plot, one can heuristically see where it is headed and act accordingly (to ship flowers or not tomorrow). However, PA incorporates time as an essential element in quantitative ways.

Quantitative Methods in PA:
My intent here is not to provide a catalog of statistical and probabilistic methods and when and where to use them. Hire a stats or math or physics or engineering Ph.D. and they will know them backwards and forwards. Applying them to Big Data in business, however, requires much more inventiveness and ingenuity – let me explain.

PA has an essential time element to it. That makes prediction possible but life difficult! There is a notion called “non-stationarity”. While reading Taleb’s or Silver’s books, I have been puzzled to find that this word is hardly mentioned (not once in Taleb’s books, if I am not mistaken). One reason may be that those books would have ended up being 10 pages long instead of the actual many hundreds of pages!

Non-stationarity is a rigorous concept, but for our purposes think about it as “changing behavior”. I do not look, act and think the same as I did 30 years ago – there is an underlying similarity for sure, but equally surely the specifics have changed. Global warming may be a steady linear change, but it will systematically affect tree-ring width data over centuries. Some other changes are cyclical – at various times they are statistically the same, but at other times they are different. Systems more non-stationary than this will be all over the place! Thus, historical and future behavior and the resultant data have variability that constrains our ability to predict. Not all is lost – weather can be predicted pretty accurately for up to a week now, but not for the next many months (this may be an unsolvable issue per complexity theory). Every system, including business “systems”, has its window of predictability; finding the predictability window, and finding historical data that can help us predict within that window, is an art.

I do not want this blog to be a litany of problems, but there are two more that need our attention. Heteroscedasticity is the next fly in the ointment! This is also a formal, rigorous statistical concept, but we will talk about it as “variability upon variability”. The business data that we want to study are definitely variable, and if they vary in a “well-behaved” way, we can handle them well. But if the variability itself varies, which is often the case in naturally occurring data, we are constrained in what we can hope to accomplish with that data. As with non-stationary data, we have to chunk them, transform variables, etc. to make them behave.

The third issue is that of “noise”. Noise is not just the hiss you hear when you listen to an AM radio station. The best definition of “noise” is the data that you do not want. Desirable data is “signal” and anything undesirable is “noise”. In engineered systems such as a communication channel, these are clearly identifiable entities – “signal” is what you sent out at one end, and anything in addition to the signal that you pick up at the receiver is “noise”. In a business case, “unwanted” data or “noise” are not so clearly identifiable and separable. Think of “Relative Spend” data among various brands of beer in a chain of stores, or sentiment analysis results for those brands. Depending on the objective of the analysis, the signal we are looking for may be the “purchase propensity” of each shopper (so that we can individually target them with discount offers). Relative Spends and Likes are not pure measures of “purchase propensity” – I may have bought Brand X beer because my friend asked me to pick some up for him, which has nothing to do with my purchase propensity for that brand! Purchases like that pollute my Relative Spend data. How do you separate this noise from the data? There may also be pure noise – incorrect entry of data, data corruption and similar errors that affect all data uniformly.

Fortunately, there is a framework from engineering that provides a comprehensive approach to containing these issues while providing powerful analytics solutions.

Model-based Analytics (MBA):
Model-based and model-free methods have a long history in many areas of Engineering, Science and Mathematics. I take an Engineering approach below.

Data that you have can be taken “as is” or can be considered to be generated by an underlying model. “Model” is a very utilitarian concept; it is like a “map” of the world. You can have a street map or a satellite map – you use them for different purposes. If you need to see the terrain, you look for a topographic map. A map is never fully accurate in itself – it serves a purpose. As the old saw goes, if a world map has to be accurate, the map will have to be as large as the world – what is the point of a map then?

Let us call “model-free” methods today’s “Data Analytics (DA)” to distinguish them from “Model-based Analytics (MBA)”. DA analyzes measured data to aid business decisions and predictions. MBA attempts to model the system that generated the measured data.

Model-based methods form the basis of innumerable quantitative techniques in Engineering. Theory and practice have shown that data-analysis approaches (similar to periodogram spectrum estimation) are robust but not powerful, while model-based methods (similar to AR-spectrum estimation) are powerful but not robust (an incorrect model order, for example, can lead to misleading results – it needs an expert practitioner).
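
A small numerical illustration of that contrast, with assumed frequencies and an assumed AR order rather than any real application: the periodogram is simple and robust, while a Yule-Walker AR spectrum gives sharper peaks but depends on choosing the order well.

```python
# Model-free vs model-based spectrum estimation on a short noisy record.
# Frequencies, record length and AR order are assumed for the illustration.
import numpy as np
from scipy.signal import periodogram
from scipy.linalg import solve_toeplitz

rng = np.random.default_rng(0)
n, fs = 256, 1.0
t = np.arange(n) / fs
x = np.sin(2 * np.pi * 0.10 * t) + 0.5 * np.sin(2 * np.pi * 0.13 * t) + rng.normal(0, 1, n)

# Model-free: periodogram -- robust, but high variance and limited resolution.
f_pg, p_pg = periodogram(x, fs=fs)

# Model-based: AR(p) via Yule-Walker -- sharper peaks, but a poor choice of the
# order p can produce misleading results.
p = 12
r = np.array([np.dot(x[:n - k], x[k:]) / n for k in range(p + 1)])  # autocorrelation
a = solve_toeplitz(r[:p], r[1:p + 1])            # AR coefficients
sigma2 = r[0] - a @ r[1:p + 1]                   # innovation variance
f_ar = np.linspace(0, 0.5, 512)
denom = np.abs(1 - np.exp(-2j * np.pi * np.outer(f_ar, np.arange(1, p + 1))) @ a) ** 2
p_ar = sigma2 / denom

print("periodogram peak:", f_pg[np.argmax(p_pg)], "AR peak:", f_ar[np.argmax(p_ar)])
```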

MBA goes beyond data slicing/dicing and heuristics. In our previous “brands of beer in a store chain” example, the model-based approach hypothesizes that there is a system, explicit or implicit, behind the scenes generating customer purchase behaviors and purchase propensities. From measured data, MBA identifies the key attributes of the underlying hidden system (to understand the commerce business quantitatively) and provides ways to regulate system outputs (to produce desirable business outcomes).

MBA does not solve but alleviates the three pain points in Predictive Analytics quantitative methods: (1) Non-stationarity, (2) Heteroscedasticity and (3) Noise.

MBA – Personal Commerce Example:
I do not have an MBA (the university kind) nor have I taken a Marketing course. So here is a layman’s take on Personal Commerce. My main objective is to show a practical example of model-based predictive analytics within MBA framework.

Personal Commerce has 2 objectives: (1) Customer acquisition and (2) Customer retention. There are 3 ways to accomplish these 2 tasks: (1) Marketing, (2) Merchandizing and (3) Affinity programs.

Let us take “Merchandizing” (the business of bringing the product to the shopper). An online example is the “recommendation engine”. When you log into Amazon, their PA software brings products from multiple product categories (books, electronics, etc.) to your tablet screen. An offline example is a brick-and-mortar store that arranges beer brands on its physical shelf so as to entice its shoppers to buy a particular brand (assume that the store has a promotion agreement with the brand and hence a business reason to do so). This type of merchandizing is called “assortment optimization”. Note that both Assortment Optimization and Recommendation Engine are general concepts that have many variations in their applications. The MBA approach below applies to the 2 Personal Commerce objectives and the 3 programs to accomplish them.

Assortment Optimization:
As a practical example of MBA, let us explore Assortment Optimization. From the various data sources available to you, you construct a model of the shopper groups with their beer affinity as the dependent variable. Then you construct a model of a specific store with the shopper groups as the dependent variable. Once these 2 models are in hand, you combine them to obtain the optimal shelf assortment for beer at that particular store, so that store revenue can be maximized.
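
Purely to fix ideas (and emphatically not the author’s proprietary method), here is a toy combination of a made-up group-by-brand affinity model with a made-up store-by-group composition model.

```python
# Toy combination of two made-up models for assortment optimization.
# The real model construction is not described in the post.
import numpy as np

brands = ["BrandA", "BrandB", "BrandC", "BrandD", "BrandE"]

# Model 1: shopper-group beer affinity (rows = shopper groups, columns = brands).
group_affinity = np.array([[0.6, 0.1, 0.2, 0.05, 0.05],
                           [0.1, 0.5, 0.1, 0.20, 0.10],
                           [0.2, 0.2, 0.3, 0.10, 0.20]])

# Model 2: composition of a specific store's shoppers by group.
store_mix = np.array([0.5, 0.3, 0.2])

# Combine: expected per-brand demand at this store, then rank for the shelf.
expected_demand = store_mix @ group_affinity
shelf_order = [brands[i] for i in np.argsort(expected_demand)[::-1]]
print(dict(zip(brands, expected_demand.round(3))), shelf_order)
```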

Clearly, I have not explained the details of the construction of these models or how they can be combined to give you the optimal product assortment. That is not the focus of this blog – it is to show that such an approach will allow you to contain the 3 banes of PA quantitative methods and hence get powerful results. In my actual use case, we achieve “sizes of the prize” (industry parlance for the potential peak increase in revenue) greater than current merchandizing Big Data methods deliver!

(1)    Non-stationarity: As we discussed earlier, different systems vary over time in their own way. If you always used the past 52 weeks of data, that may be appropriate for some products but not others. For example, in certain cases of Fashion or FMCG, non-homogeneity can be minimized by selecting shorter durations – but not so short that you do not have enough representative data!
(2)    Heteroscedasticity: There is a fundamental concept here again of choosing just enough data (even if you have tons in your Big Data store!) to address the business question you have, but not too much. When you have selected just enough data, you may also escape severe heteroscedasticity. If not, variable transformations (such as a log transformation) may have to be adopted.
(3)    Noise: As we noted, Noise is unwanted data. Consider the previous Merchandizing case, but where you tried to analyze 2 product categories together, say Beer and Soap. Since the fundamental purchase-propensity driving forces are most likely different for these two product categories, one will act as noise to the other – deal with them separately. In addition, some eigen-decomposition pre-processing may allow you to separate “signal from noise”. (A small sketch of points (2) and (3) follows this list.)
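
Here is that sketch – generic, with arbitrary sizes and thresholds: a log transform to tame multiplicative variability, and an eigen-decomposition (via the SVD) that keeps the dominant components as “signal” and treats the remainder as “noise”.

```python
# Generic remedies for points (2) and (3); all sizes and thresholds are arbitrary.
import numpy as np

rng = np.random.default_rng(2)

# (2) Heteroscedasticity: a log transformation often stabilizes multiplicative noise.
spend = rng.lognormal(mean=3.0, sigma=1.0, size=1000)       # heavy-tailed "spend"
log_spend = np.log1p(spend)                                 # variance-stabilized

# (3) Noise: keep only the dominant eigen-components of a data matrix.
latent = rng.normal(size=(200, 2)) @ rng.normal(size=(2, 20))   # rank-2 "signal"
data = latent + 0.5 * rng.normal(size=latent.shape)             # plus noise
centered = data - data.mean(axis=0)
U, s, Vt = np.linalg.svd(centered, full_matrices=False)
k = 2                                                        # components to keep
denoised = (U[:, :k] * s[:k]) @ Vt[:k] + data.mean(axis=0)
print(spend.var(), log_spend.var(), np.abs(denoised - latent).mean())
```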

Many of you will find this discussion inadequate – partly because these are trade secrets and partly because there are no magic formulas. Each business problem is different and calls for ingenuity and insight particular to that data set and business question.

I have only scratched the surface of Model-based Analytics here. The sub-disciplines of Electrical Engineering such as Digital Signal Processing, Control Theory and Systems Theory are replete with frameworks and solutions, developed over the last two decades or so, for deploying model-based solutions and extending them to closed-loop systems. Thus, we go beyond predictions to actions, with their results fed back into the model to refine its predictions. The next five years will see a great increase in the efficacy of Predictive Analytics solutions with the incorporation of more model-based approaches.



Post Script: Other major “flies in the ointment” are non-linearity and non-normality; I am of the opinion that practical and efficient methods are still not available to battle these issues (I know that the thousands of authors of books and papers in these fields will disagree with me!). So, the approach I take is that non-linearity and non-normality issues are minor in most cases and MBA techniques will work adequately; in the few cases where I cannot make any headway, I reckon that these issues are so extreme that I have a currently intractable problem!

Friday, April 19, 2013

Parlor Games, Compressive Sensing, Data Analytics . . . & small BIG Data?


Parlor Game . . .
Surely, you have been part of a “guessing game” at a dinner party. Someone says, “Think of 10 numbers and tell me their sum and then tell me 2 of the 10 numbers; I will tell you the other 8 numbers you thought of”. The general reaction will be that this person is crazy! Too many unknowns, not enough constraints, too many solutions and so on.

Compressive Sensing:
There is a tsunami of activity in mathematical circles these days related to the parlor guessing game, with great solutions for the guesses – which sounds like magic (in fact, “l1-Magic” is the name of a popular MATLAB routine that provides a solution).

Donoho and Candes are the prime movers of Compressive Sensing (CS) and related theories, and their pedigrees are impressive (Donoho is a MacArthur “genius award” winner). For engineers, the simplest context in which to understand CS is data acquisition. If a time series is a sum of 2 sinusoids, we have to sample it at twice the frequency of the highest-frequency sinusoid (at the Nyquist rate). Depending on the relative frequencies of the sinusoids, we may have to acquire hundreds of samples to be able to reconstruct the original time series faithfully. CS theory asserts that you only need about 8 random samples, not hundreds, to faithfully reconstruct the data! On the face of it, this assertion looks absurd in that it flies in the face of the long-standing understanding of the Nyquist criterion.

However, a glimmer of insight may arise when we take note of the fact that there are only 2 sinusoids in the data! Somehow, if we can gather the information about these 2 sinusoids from 8 random samples, we may be able to reconstruct the data fully. Under such a “rank-limited” situation (along with a few easy-to-meet restrictions), CS theory provides a way to completely recreate the original data.

CS theory is accessible and 3 easy to read intros are (1) Candes & Wakin, “Introduction to Compressive Sampling”, IEEE Signal Processing Magazine, 2008, (2) Baraniuk, “Compressive Sensing lecture notes”, IEEE Signal Processing Magazine, 2007 and (3) Candes & Plan, “Matrix Completion with Noise”, Proceedings of the IEEE, 2010.

As you will find from these and other excellent articles on the topic, the approach is to solve a constrained minimization problem using a certain norm. The l2-norm we are used to is largely unsuitable for this task because it almost never reaches the desired solution (perhaps the reason why no one saw this solution until the 2000’s). The so-called l0-norm will work but is not tractable. However, the l1-norm (or the “taxi-cab” metric) does the trick. Many linear programming solutions exist; MATLAB implementations are available from multiple websites.
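
For readers without MATLAB, here is a self-contained sketch of the same idea in Python: recover a signal that is 2-sparse in a DCT basis from a handful of random time samples by minimizing the l1-norm as a linear program. Sizes and seeds are arbitrary, and this is an illustration of the principle, not the l1-Magic code.

```python
# Basis pursuit: min ||c||_1 subject to (Phi Psi) c = y, solved as a linear
# program. Toy sizes; the same idea as l1-Magic, not that code itself.
import numpy as np
from scipy.fft import idct
from scipy.optimize import linprog

rng = np.random.default_rng(0)
n, m = 128, 20                                # signal length, number of samples

# A signal that is exactly 2-sparse in the DCT domain (two "sinusoids").
c_true = np.zeros(n)
c_true[[10, 37]] = [1.0, 0.7]
Psi = idct(np.eye(n), axis=0, norm='ortho')   # inverse-DCT basis (columns)
x = Psi @ c_true                              # time-domain signal

# Take m random time samples -- far fewer than dense Nyquist-style sampling.
rows = rng.choice(n, size=m, replace=False)
A, y = Psi[rows, :], x[rows]

# LP form: variables z = [c, t]; minimize sum(t) with -t <= c <= t and A c = y.
I = np.eye(n)
res = linprog(c=np.r_[np.zeros(n), np.ones(n)],
              A_ub=np.block([[I, -I], [-I, -I]]), b_ub=np.zeros(2 * n),
              A_eq=np.hstack([A, np.zeros((m, n))]), b_eq=y,
              bounds=[(None, None)] * n + [(0, None)] * n, method='highs')
c_rec = res.x[:n]
print("max reconstruction error:", np.max(np.abs(Psi @ c_rec - x)))
```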

Data Analytics:
The references I mentioned above provide motivation from the data analytics domain (mostly from the Netflix recommendation challenge). Let me give an example from Retail analytics.

In Retail/CPG analytics, large amounts of shopper data are collected via loyalty cards, transaction logs, credit card information, etc. All of these data are collected and collated into a large data matrix. For example, the data matrix may contain, for each shopper (each row of the matrix), the amount they spent over a year on soaps from different brands (each column is a CPG brand). A relatively small data set in this business may have 5 or 10 million entries; however, most of the entries will be missing (usually marked as “zero”). This stands to reason because I may buy only 1 or 2 of the hundreds of CPG soap brands, so the rest of my row will be zeros. In one example, we had 93% of the entries as zeros!

To relate this situation to the sum-of-sinusoids sampling, remember that the signal was “rank-limited” in that it was made up of only 2 sinusoids. Low rank is one of the primary reasons that CS works. In the case of the soap data matrix, we again have a low-rank situation, since the underlying preference for soap depends only on a few parameters such as smell, anti-bacterial property, color, cost, etc. In other words, if we do an eigen-decomposition of the true data matrix, only a few eigenvalues will be significant. Hence, using the linear programming approach and minimizing the l1-norm (for a matrix, the analogous quantity is the nuclear norm – the l1-norm of its singular values), we are able to reconstruct the other 93% of the entries!
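
One standard recipe for this kind of low-rank completion, shown only as an illustrative sketch with synthetic numbers (not the exact routine behind the 93% figure above), is iterative soft-thresholded SVD imputation.

```python
# Iterative soft-thresholded SVD ("SoftImpute"-style) completion of a sparse,
# low-rank spend matrix. Synthetic numbers; an illustration of the recipe only.
import numpy as np

rng = np.random.default_rng(3)
shoppers, brands, rank = 300, 40, 2
true = np.abs(rng.normal(size=(shoppers, rank)) @ rng.normal(size=(rank, brands)))
observed = rng.random(true.shape) < 0.07            # only ~7% of entries observed

def soft_impute(M, mask, lam=0.5, iters=300):
    X = np.where(mask, M, 0.0)
    for _ in range(iters):
        U, s, Vt = np.linalg.svd(X, full_matrices=False)
        low_rank = (U * np.maximum(s - lam, 0.0)) @ Vt   # shrink singular values
        X = np.where(mask, M, low_rank)                  # keep observed entries fixed
    return X

filled = soft_impute(true, observed)
err = np.abs(filled - true)[~observed].mean() / true[~observed].mean()
print(f"mean relative error on the unobserved entries: {err:.2f}")
```

How well the fill-in works depends heavily on the true rank, the fraction observed and the shrinkage weight; the point is only that low rank is what makes reconstruction of the missing entries possible at all.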

So now, we have a tractable (“magical”) solution to identify the entries of the “soap” data matrix. What do these entries tell us?

The reconstructed entry is the Data Scientist’s best prediction of what a shopper would spend on a certain CPG soap brand. This is hugely valuable – it is an excellent proxy for the shopper’s propensity to purchase a particular brand. A CPG can group the shoppers based on their purchase propensities, identify a group for whom it is low and target them with advertisements or discount offers to “switch” them. This information can be used by media buyers to optimize their ad-dollar ‘spend’. Knowing which shoppers visit a specific store location, Retailers can assay their ‘spends’ and stock their shelves appropriately. The applications are virtually limitless.

small BIG Data:
If we can reconstruct 93% of the data from a 7% random sample as in the example above, who needs BIG data?

Clearly, we do not need BIG data *collection*. But we still need to regenerate BIG data (all the data matrix entries). Perhaps, though, we do not need to store it; it can be regenerated at will (convergence of the algorithms seems fast). So, we do not need BIG data collection or storage, but we do need to process BIG data. This seems to point to in-memory BIG data solutions for this type of data analytics application . . . the next few months will tell.