Saturday, April 29, 2017

An output of a truly random process

Recently I have a discussion with my data science team whether we can challenge the observations is following a random process or not.  Basically, data science is all about learning hidden pattern that is affecting the observations.  If the observation is following a random process, then there is nothing we can learn about.  Let me walk through an example to illustrate.

Lets say someone is making a claim that he is throwing a fair dice (with number 1 to 6) sequentially.

Lets say I claim the output of my dice throw is uniformly random, ie: with equal chances of getting a number from 1 to 6.

And then he throws the dice 12 times, and show you the output sequence.  From the output, can you make a judgement whether this is really a sequential flow of a fair dice ?  In other words, is the output really follow a random process as expected ?

Lets look at 3 situations
  • Situation 1 output is [4, 1, 3, 1, 2, 6, 3, 5, 5, 1, 2, 4]
  • Situation 2 output is [1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1]
  • Situation 3 output is [1, 2, 3, 4, 5, 6, 1, 2, 3, 4, 5, 6]
At first glance, the output of situation 1 looks like resulting from a random process.  Situation 2 definitely doesn't look like it.  Situation 3 is harder to judge.  If you look at the proportion of the output numbers, the frequency of each output number of situation 3 definitely follows a uniform distribution of a fair dice.  But if you look at the number ordering, situation 3 follows a well-defined ordering that doesn't seem to be random at all.  Therefore, I don't think the output of situation 3 is following a random process.

However, this seems to be a very arbitrary choice.  Why would I look at the number ordering at all ? Should I look for more properties ?  such as ...
  • Whether the number of the even position are even
  • Average gap between consecutive throws
  • Whether the number in the 3rd position always smaller than the 10th position
  • ...
As you can see, depends on my imagination, the list can go on and on.  How can I tell whether situation 3 is following a random process or not ?

Method 1: Randomization Test

This is based on the hypothesis testing methodology.  We establish null hypothesis H0 that situation 3 follows a random process.

First, I define an arbitrary list of statistics of my choices
  • statisticA = proportion of even numbers in even position
  • statisticB = average gap between consecutive output numbers
  • statisticC = ...
Second, I run a simulation to generate 12 numbers based on a random process.  Calculate the corresponding statistics defined above.

Third repeat the simulation for N times, output the mean and standard deviation of the statistics.

If the statisticA or B or C of situation 3 are too far away (based on the likelihood pValue) from the mean of statistics A/B/C by the number of standard deviation of statistics A/B/C, then we conclude that situation 3 is not following a random process.  Otherwise, we don't have enough evidence to show our null hypothesis is violated and so we accept situation 3 follows the random process.

Method 2: Predictability Test

This is based on the theory of predictive analytics.

First, I pick a particular machine learning algorithm, lets say time series forecast using ARIMA.
Notice that I can also choose to use RandomForest and create some arbitrary input features (such as previous output number, maximum number in last 3 numbers ... etc)

Second, I train my selected predictive model based on the output data of situation 3 (in this example, situation 3 has only 12 data point, but imagine we have much more than 12 data point).

Third, I evaluate my model in the test set.  And see whether the prediction is much better than a random guess.  For example I can measure the lift of my model by comparing the RMSE (root mean square error) or my prediction and the standard deviation of the testing data.  If the lift is very insignificant, then I conclude that situation 3 results from a random process, because my predictive model doesn't learn any pattern.

Tuesday, August 25, 2015

Common techniques in optimization

Optimization is a frequently encountered problem in real life.  We need to make a decision to achieve something within a set of constraints, and we need to maximize or minimize our objective based on some measurement.  For example, a restaurant may need to decide how many workers (of each position) to hire to serve the customers, with the constraint that workers cannot work overtime, and the objective is to minimize the cost.  A car manufacturer may need to decide how many units of each model to be produced, within the constraint of the storage capacity of its warehouse, and maximize the profit of its sales.

Exhaustive Search

If there isn't a lot of choices, an exhaustive search (e.g. breath-first, depth-first, best-first, A* ... etc.) can be employed to evaluate each option, see if it meets all the constraints (ie: a feasible solution), and then compute its objective value.  Then we sort all feasible solutions based on its corresponding objective value and pick the solution that has the max (or min) objective value as our decision.  Unfortunately, real world problem usually involve a large number (exponentially large due to combinatorial explosion) of choices, making the exhaustive search in many cases impractical.

When this happens, two other solution approaches are commonly used.
1) Mathematical Programming
2) Local Greedy Search.

Mathematical Programming

Mathematical programming is a classical way to solve optimization problem.  It is a family of approaches including linear programming, integer programming, quadratic programming and even non-linear programming.  The development process usually go through the following steps ...

From a problem description, the modeler will express the problem into a mathematical structure containing 3 parts.
  • Variables:  there are two types of variables.  Data variables contains the current value of your business environment (e.g. cost of hiring a waiter, price of a car), and Decision variables hold the decision you make to optimize your objective (e.g. how many staff to hire, how many cars to make).
  • Constraints:  it is a set of rules that you cannot break.  Effectively, constraints disallow certain combinations of your decision variables and is mainly used to filter out in-feasible solutions.  In a typical settings, constraints are expressed as a system of linear inequality equations where a linear expression of decision variables are specified on the left hand side and a value is specified on the right hand side after an inequality comparison.
  • Objective function: it encapsulates the quantitative measure of how our goal has been achieved.  In a typical setting, objective function is expressed as a single linear (or quadratic) combination of decision variables
After the mathematical structure is defined, the modeler will submit it to a solver, which will output the best solution.  The process can be depicted as follows.

Expressing the problem in the mathematical structure is the key design of the solution.  There are many elements to be considered, which we described below.

The first consideration is how to represent your decision, specially whether the decision is a quantity (real number), a number of units (integer) or a binary decision (integer 0, 1).

The next step is to represent the constraints in terms of inequality equations of linear combination of decision variables.  You need to think whether the constraints is a hard of soft constraints.  Hard constraints should be expressed in the constraint part of the problem.  Notice the solver will not consider any solution once it violates any constraints.  Therefore, if the solver cannot find a feasible solution that fulfill all constraints, it will simply abort.  In other words, it won't return you a solution that violates the smallest number of constraints.  If you want the solve to tell you that because you have rooms to relax them, you should model these soft constraints in the objective function rather than in the constraint section.  Typically, you define an objective function that quantifies the degree of violation.  The solver will then give you the optimal solution (violating least number of constraints) rather than just telling you no solution is found.

Finally you define the objective function.  In most real-life problems, you have multiple goals in mind (e.g. you want to minimize your customer's wait time in the queue, but you also want to minimize your cost of hiring staffs).  First you express each goal as a linear expression of decision variables and then take the weighted average among different goals (which is also a linear combination of all decision variables) to form the final objective function.  How to choose the weights among different goals is a business question, based on how much you are willing to trade off between conflicting goals.

There are some objectives that cannot be expressed by a linear expression of decision variables.  One frequently encountered example is about minimizing the absolute deviation from a target value (ie. no matter the deviation is a positive and negative value).  A common way is to minimize the sum of square of the difference.  But after we square it, it is no longer a linear expression.  To address this requirement, there is a more powerful class of solver call "quadratic programming" which relax the objective function to allow a degree 2 polynomial expression.

After you expressed the problem in the mathematical structure.  You can pass it to a solver (there are many open source and commercial solver available) which you can treat it as a magical black box.  The solver will output an optimal solution (with a value assigned to each decision variable) that fulfill all constraints and maximize (or minimize) the objective function.

Once you received the solution from the solver, you need to evaluate how "reliable" is this optimal solution.  There may be fluctuations in the data variables due to collection error, or the data may have a volatile value that changes rapidly, or the data is an estimation of another unknown quantity.

Ideally, the optimal solution doesn't change much when we fluctuate the data values within its error bound.  In this case, our optimal solution is stable against error in our data variables and therefore is a good one.  However, if the optimal solution changes drastically when the data variable fluctuates, we say the optimal solution is unreliable and cannot use it.  In the case, we usually modify each data variables one at a time to figure out which data variable is causing a big swing of optimal solution and try to reduce our estimation error of that data variable.

The sensitivity analysis is an important step to evaluate the stability and hence the quality of our optimal solution.  It also provides guidance on which area we need to invest effort to make the estimation more accurate.

Mathematical Programming allows you to specify your optimization problem in a very declarative manner and also output an optimal solution if it exist.  It should be the first-to-go solution.  The downside of Mathematical programming is that it requires linear constraints and linear (or quadratic) objectives.  And it also has limits in terms of number of decision variables and constraints that it can store (and this limitation varies among different implementations).  Although there are non-linear solvers, the number of variables it can take is even smaller.

Usually the first thing to do is to test out if your problem is small enough to fit in the mathematically programming solver.  If not, you may need to use another approach.

Greedy Local Search

The key words here are "local" and "greedy".  Greedy local search starts at a particular selection of decision variables, and then it look around surrounding neighborhood (hence the term "local") and within that find the best solution (hence the term "greedy").  Then it moves the current point to this best neighbor and then the process repeats until the new solution stays at the same place (ie. there is no better neighbor found).

If the initial point is selected to be a non-feasible solution, the greedy search will first try to find a feasible solution first by looking for neighbors that has less number of constraint violation.  After the feasible solution is found, the greedy search will only look for neighbors that fulfill all constraints and within which to find a neighbor with the best objective value.  Another good initialization strategy is to choose the initial point to be a feasible (of course not optimal) solution and then start the local search from there.

Because local search limits its search within a neighborhood, it can control the degree of complexity by simply limits the scope of neighbor.  Greedy local search can evaluate a large number of variables by walking across a chain of neighborhoods.  However, because local search only navigate towards the best neighbor within a limited scope, it loses the overall picture and rely on the path to be a convex shape.  If there are valleys along the path, the local search will stop there and never reach the global optimum.  This is called a local optimal trap because the search is trapped within a local maximum and not able to escape.  There are some techniques to escape from local optimum that I describe in my previous blog post.

When performing greedy local search, it is important to pick the right greedy function (also called heuristic function) as well as the right neighborhood.  Although it is common to choose the greedy function to be the same objective function itself, they don't have to be the same.  In many good practices, the greedy function is chosen to be the one that has high correlation with the objective function, but can be computed much faster.  On the other hand, evaluating objective function of every point in the neighborhood can also be a very expensive operation, especially when the data has a high dimension and the number of neighbors can be exponentially large.  A good neighbor function can be combined with a good greedy function such that we don't need to evaluate each neighbor to figure out which one has the best objective value.

Combining the two approaches

In my experience, combining the two approaches can be a very powerful solution itself.  At the outer layer, we are using the greedy local search to navigate the direction towards better solution.  However, when we search for the best neighbor within the neighborhood, we use the mathematical programming by taking the greedy function as the objective function, and we use the constraints directly.  This way, we can use a very effective approach to search for best solution within the neighborhood and can use the flexible neighborhood scoping of local search to control the complexity of the mathematical programming solver.

Saturday, June 27, 2015

When machine replace human

Recently, a good friend sent me an article from Harvard Business Review called "Beyond Automation", written by Thomas H. Davenport and Julia Kirby.  The article talked about how automation affects our job forces and displacing values from human workers.  It proposed 5 strategies in how we can get prepared to retain competitiveness in the automation era.  This is a very good article and triggers me a lot of thoughts.

I want to explore a fundamental question:  "Can machine replace a human in future ?"

Lets start looking at what machines are doing and not doing today.  Machines are operating under a human's program, and therefore it can only solve those problems that we, human can express or codified in a structural form.  Don't underestimate its power underneath.  With good abstract thinking, smartest human in the world has partitioned large number of problems (by its problem nature) into different problem categories.  Each category is expressed in form od a "generic problem" and subsequently a "general solution" is developed.  Notice that computer scientist has been doing this for many decades, and come up with the powerful algorithm such as "Sorting", "Finding shortest path", "Heuristic search" ... etc.

By grouping concrete problems by their nature into a "generic, abstract problem", we can significantly reduce the volume of cases/scenarios while still covers a large area of ground.  The "generic solution" we developed can also be specialized for each concrete problem scenario.  After that we can develop a software program which can be executed in a large cluster of machines equipped with fast CPU and a lot of memory.  Compare this automated solution with what a human can do in a manual fashion.  In these areas, once problems are well-defined and solutions are automated by software program, computers with much powerful CPU and memory will always beat human in many many orders of magnitude.  There is no question that the human job in these areas will be eliminated.

In terms of capturing our experience using a abstract data structure and algorithm, computer scientist are very far from done.  There are still a very large body of problems that even the smartest human haven't completely figured out how to put them in a structural form yet.  Things that involve "perception", "intuition", "decision making", "estimation", "creativity" are primarily done today by human.  I believe these type of jobs will continue to be done by human workers in our next decade.  On the other hand, with our latest technology research, we continuously push our boundary of automation into some of these areas.  "Face recognition", "Voice recognition" that involves high degree of perception can now be done very accurately by software program.  With "machine learning" technology, we can do "prediction" and make judgement in a more objective way than a human.  Together with "planning" and "optimization" algorithm, large percentage of decision making can be automated, and the result is usually better because of a less biased and data-driven manner.

However, in these forefront areas where latest software technology is unable to automate every steps, the human is need in the path to make a final decision, or interven in those exceptional situation that the software is not programmed to handled.  There are jobs that a human and machine can working together to make better outcome.  This is what is called "augmentation" in the article.  Some job examples are artists are using advanced software to touchup their photos, using computer graphics to create movies, using machine learning to do genome sequence processing, using robots to perform surgery, driver-less vehicles ... etc.

Whether computer programming can replace human completely remains to be seen, but I don't think this will happen in the next 2 decades.  We humans are unique and good at perceiving things with multiple level of abstractions from different angles.  We are good at connecting the dots between unrelated areas.  We can invent new things.  These are things that machine will be very hard to do, or at least will take a long time if at all possible.

"When can we program a machine that can write program ?"

The HBR article suggests a person can consider five strategies (step up, step aside, step in, step narrowly and step forward) to retain value in the automation era.  I favor the "step forward" strategy because the person is driving the trend rather than passively reacting to the trend.  Date back to our history, human's value system has been shifted across industry revolution, internet revolution etc.   At the end of the day, it is more-sophisticated human who take away jobs (and value) from other less-sophisticated human.  And it is always the people who drives the movement to be the winner of this value shift.  It happens in the past and will continue into future.

Sunday, February 22, 2015

Big Data Processing in Spark

In the traditional 3-tier architecture, data processing is performed by the application server where the data itself is stored in the database server.  Application server and database server are typically two different machine.  Therefore, the processing cycle proceeds as follows
  1. Application server send a query to the database server to retrieve the necessary data
  2. Application server perform processing on the received data
  3. Application server will save the changed data to the database server
In the traditional data processing paradigm, we move data to the code.
It can be depicted as follows ...

Then big data phenomenon arrives.  Because the data volume is huge, it cannot be hold by a single database server.  Big data is typically partitioned and stored across many physical DB server machines.  On the other hand, application servers need to be added to increase the processing power of big data.

However, as we increase the number of App servers and DB servers for storing and processing the big data, more data need to be transfer back and forth across the network during the processing cycle, up to a point where the network becomes a major bottleneck.

Moving code to data

To overcome the network bottleneck, we need a new computing paradigm.  Instead of moving data to the code, we move the code to the data and perform the processing at where the data is stored.

Notice the change of the program structure
  • The program execution starts at a driver, which orchestrate the execution happening remotely across many worker servers within a cluster.
  • Data is no longer transferred to the driver program, the driver program holds a data reference in its variable rather than the data itself.  The data reference is basically an id to locate the corresponding data residing in the database server
  • Code is shipped from the program to the database server, where the execution is happening, and data is modified at the database server without leaving the server machine.
  • Finally the program request a save of the modified data.  Since the modified data resides in the database server, no data transfer happens over the network.
By moving the code to the data, the volume of data transfer over network is significantly reduced.  This is an important paradigm shift for big data processing.

In the following session, I will use Apache Spark to illustrate how this big data processing paradigm is implemented.


Resilient Distributed Dataset (RDD) is how Spark implements the data reference concept.  RDD is a logical reference of a dataset which is partitioned across many server machines in the cluster.

To make a clear distinction between data reference and data itself, a Spark program is organized as a sequence of execution steps, which can either be a "transformation" or an "action".

Programming Model

A typical program is organized as follows
  1. From an environment variable "context", create some initial data reference RDD objects
  2. Transform initial RDD objects to create more RDD objects.  Transformation is expressed in terms of functional programming where a code block is shipped from the driver program to multiple remote worker server, which hold a partition of the RDD.  Variable appears inside the code block can either be an item of the RDD or a local variable inside the driver program which get serialized over to the worker machine.  After the code (and the copy of the serialized variables) is received by the remote worker server, it will be executed there by feeding the items of RDD residing in that partition.  Notice that the result of a transformation is a brand new RDD (the original RDD is not mutated)
  3. Finally, the RDD object (the data reference) will need to be materialized.  This is achieved through an "action", which will dump the RDD into a storage, or return its value data to the driver program.
Here is a word count example

# Get initial RDD from the context
file = spark.textFile("hdfs://...")

# Three consecutive transformation of the RDD
counts = file.flatMap(lambda line: line.split(" "))
             .map(lambda word: (word, 1))
             .reduceByKey(lambda a, b: a + b)

# Materialize the RDD using an action

When the driver program starts its execution, it builds up a graph where nodes are RDD and edges are transformation steps.  However, no execution is happening at the cluster until an action is encountered.  At that point, the driver program will ship the execution graph as well as the code block to the cluster, where every worker server will get a copy.

The execution graph is a DAG.
  • Each DAG is a atomic unit of execution. 
  • Each source node (no incoming edge) is an external data source or driver memory
  • Each intermediate node is a RDD
  • Each sink node (no outgoing edge) is an external data source or driver memory
  • Green edge connecting to RDD represents a transformation.  Red edge connecting to a sink node represents an action

Data Shuffling

Although we ship the code to worker server where the data processing happens, data movement cannot be completely eliminated.  For example, if the processing requires data residing in different partitions to be grouped first, then we need to shuffle data among worker server.

Spark carefully distinguish "transformation" operation in two types.
  • "Narrow transformation" refers to the processing where the processing logic depends only on data that is already residing in the partition and data shuffling is unnecessary.  Examples of narrow transformation includes filter(), sample(), map(), flatMap() .... etc.
  • "Wide transformation" refers to the processing where the processing logic depends on data residing in multiple partitions and therefore data shuffling is needed to bring them together in one place.  Example of wide transformation includes groupByKey(), reduceByKey() ... etc.

Joining two RDD can also affect the amount of data being shuffled.  Spark provides two ways to join data.  In a shuffle join implementation, data of two RDD with the same key will be redistributed to the same partition.  In other words, each of the items in each RDD will be shuffled across worker servers.

Beside shuffle join, Spark provides another alternative call broadcast join.  In this case, one of the RDD will be broadcasted and copied over to every partition.  Imagine the situation when one of the RDD is significantly smaller relative to the other, then broadcast join will reduce the network traffic because only the small RDD need to be copied to all worker servers while the large RDD doesn't need to be shuffled at all.

In some cases, transformation can be re-ordered to reduce the amount of data shuffling.  Below is an example of a JOIN between two huge RDDs followed by a filtering.

Plan1 is a naive implementation which follows the given order.  It first join the two huge RDD and then apply the filter on the join result.  This ends up causing a big data shuffling because the two RDD is huge, even though the result after filtering is small.

Plan2 offers a smarter way by using the "push-down-predicate" technique where we first apply the filtering in both RDDs before joining them.  Since the filtering will reduce the number of items of each RDD significantly, the join processing will be much cheaper.

Execution planning

As explain above, data shuffling incur the most significant cost in the overall data processing flow.  Spark provides a mechanism that generate an execute plan from the DAG that minimize the amount of data shuffling.
  1. Analyze the DAG to determine the order of transformation.  Notice that we starts from the action (terminal node) and trace back to all dependent RDDs.
  2. To minimize data shuffling, we group the narrow transformation together in a "stage" where all transformation tasks can be performed within the partition and no data shuffling is needed.  The transformations becomes tasks that are chained together within a stage
  3. Wide transformation sits at the boundary of two stages, which requires data to be shuffled to a different worker server.  When a stage finishes its execution, it persist the data into different files (one per partition) of the local disks.  Worker nodes of the subsequent stage will come to pickup these files and this is where data shuffling happens
Below is an example how the execution planning turns the DAG into an execution plan involving stages and tasks.

Reliability and Fault Resiliency

Since the DAG defines a deterministic transformation steps between different partitions of data within each RDD RDD, fault recovery is very straightforward.  Whenever a worker server crashes during the execution of a stage, another worker server can simply re-execute the stage from the beginning by pulling the input data from its parent stage that has the output data stored in local files.  In case the result of the parent stage is not accessible (e.g. the worker server lost the file), the parent stage need to be re-executed as well.  Imagine this is a lineage of transformation steps, and any failure of a step will trigger a restart of execution from its last step.

Since the DAG itself is an atomic unit of execution, all the RDD values will be forgotten after the DAG finishes its execution.  Therefore, after the driver program finishes an action (which execute a DAG to its completion), all the RDD value will be forgotten and if the program access the RDD again in subsequent statement, the RDD needs to be recomputed again from its dependents.  To reduce this repetitive processing, Spark provide a caching mechanism to remember RDDs in worker server memory (or local disk).  Once the execution planner finds the RDD is already cache in memory, it will use the RDD right away without tracing back to its parent RDDs.  This way, we prune the DAG once we reach an RDD that is in the cache.

Overall speaking, Apache Spark provides a powerful framework for big data processing.  By the caching mechanism that holds previous computation result in memory, Spark out-performs Hadoop significantly because it doesn't need to persist all the data into disk for each round of parallel processing.  Although it is still very new, I think Spark will take off as the main stream approach to process big data.

Friday, November 28, 2014

Spark Streaming

In this post, we'll discuss another important topic of big data processing: real-time stream processing area.  This is an area where Hadoop falls short because of its high latency, and another open source framework Storm is developed to cover the need in real-time processing.  Unfortunately, Hadoop and Storm provides quite different programming model, resulting in high development and maintenance cost.

Continue from my previous post on Spark, which provides a highly efficient parallel processing framework.  Spark streaming is a natural extension of its core programming paradigm to provide large-scale, real-time data processing.  The biggest benefits of using Spark Streaming is that it is based on a similar programming paradigm of its core and there is no need to develop and maintain a completely different programming paradigm for batch and realtime processing.

Spark Core Programming Paradigm Recap

The core Spark programming paradigm consists of the following steps ...
  1. Taking input data from an external data source and create an RDD (a distributed data set across many servers)
  2. Transform the RDD to another RDD (these transformation defines a direct acyclic graph of dependencies between RDD)
  3. Output the final RDD to an external data source

Notice that the RDD is immutable, therefore the sequence of transformations is deterministic and therefore recovery from intermediate processing failure is simply by tracing back to the parent of the failure node (in the DAG) and redo the processing from there.

Spark Streaming

Spark Streaming introduce a data structure call DStream which is basically a sequence of RDD where each RDD contains data associated with a time interval.  DStream is created with a frequency parameters which defines the frequency RDD creation into the sequence.

Transformation of a DStream boils down to transformation of each RDD (within the sequence of RDD that the DStream contains).  Within the transformation, the RDD inside the DStream can "join" with another RDD (outside the DStream), hence provide a mix processing paradigm between DStream and other RDDs.  Also, since each transformation produces an output RDD, the result of transforming a DStream results in another sequence of RDDs that defines an output DStream.

Here is the basic transformation where each RDD in the output DStream has a one to one correspondence with each RDD in the input DStream. 

Instead of performing a 1 to 1 transformation of each RDD in the DStream.  Spark streaming enable a sliding window operation by defining a WINDOW which groups consecutive RDDs along the time dimension.  There are 2 parameters that the window is defined ...
  1. Window length: defines how many consecutive RDDs will be combined for performing the transformation.
  2. Slide interval: defines how many RDD will be skipped before the next transformation executes.

By providing a similar set of transformation operation for both RDD and DStream, Spark enable a unified programming paradigm across both batch and real-time processing, and hence reduce the corresponding development and maintenance cost.

Wednesday, November 5, 2014

Common data science project flow

As working across multiple data science projects, I observed a similar pattern across a group of strategic data science projects where a common methodology can be used.  In this post, I want to sketch this methodology at a high level.

First of all, "data science" itself is a very generic term that means different things to different people.  For the projects I involved, many of them target to solve a very tactical and specific problem.  However, over the last few years more and more enterprises start to realize the strategic value of data.  I observed a growing number of strategic data science projects were started from a very broad scope and took a top-down approach to look at the overall business operation.  Along the way, the enterprise prioritize the most critical areas within their operation cycle and build sophisticated models to guide and automate the decision process.

Usually, my engagement started as a data scientist / consultant, with very little (or even no) domain knowledge.  Being unfamiliar with the domain is nothing to be proud of and often slow down my initial discussion.  Therefore, within a squeezed time period I need to quickly learn enough "basic domain knowledge" to facilitate the discussion smooth.  On the other hand, lacking a per-conceived model enables me (or you can say force me) to look from a fresh-eye view, from which I can trim off unnecessary details from the legacies and only focus on those essential elements that contributes to the core part of the data model.  It is also fun to go through the concept blending process between a data scientist and a domain expert.  I force them to think in my way and they force me to think in their way.  This is by far the most effective way for me to learn any new concepts.

Recently I had a discussion with a company who has a small, but very sophisticated data science team that build pricing model, and demand forecasting for their product line.  I am, by no means an expert in their domain.  But their problem (how to predict demand, and how to set price) is general enough across many industries.  Therefore, I will use this problem as an example to illustration the major steps in the common pattern that I describe above.

Problem Settings

Lets say a car manufacturer starts its quarterly planning process.  Here are some key decisions that need to be made by the management.
  • How many cars the company should produce for next year ?
  • What should be the renew price of the cars ?
First of all, we need to identify the ultimate "goal" of these decisions.  Such goal is usually easy to find as it usually in the company's mission statement.

In this problem, the goal is to ...
maximize: "Profit_2015"

In general, I find it is a good start to look at the problem from an "optimization" angle, from which we define our goal in terms of an objective function as well as a set of constraints.

Step 1:  Identify variables and define its dependency graph

Build the dependency graph between different variables starting from the Objective function.  Separate between the decision variables (where you have control) and environment variable (where you have no control).

As an illustrative example, we start from our objective function "Profit_2015" and define the dependency relationship below. Decision variable is highlighted in blue.

Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)
UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)
Demand_2015 = H(GDP_2014, PhoneSold_2014)
GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)

Identifying these variable and their potential dependencies typically come from a well-studied theory from University, or domain experts in the industry.  At this stage, we don't need to know the exact formula of the function F/G/H.  We only need to capture the links between the variables.  It is also ok to include a link that shouldn't have exist (ie: there is no relationship between the 2 variables in reality).  However, it is not good if we miss a link (ie: fail to capture a strong, existing dependency).

This round usually involves 4 to 5 half day brainstorming sessions with the domain experts, facilitated by the data scientist/consultant who is familiar with the model building process.  There may be additional investigation, background studies if the subject matter experts doesn't exist.  Starting from scratch, this round can take somewhere between couple weeks to couple months

Step 2:  Define the dependency function

In this round,  we want to identify the relationship between variable using formula of F(), G(), H().

Well-Known Function
For some relationship that is well-studied, we can use a known mathematical model.

For example, in the relationship
Profit_2015 = F(UnitSold_2015, UnitProduced_2015, Price_2015, Cost_2015)

We can use the following Mathematical formula in a very straightforward manner
Profit = (UnitSold * Price) - (UnitProduced * Cost)

Semi-Known Function
However, some of the relationship is not as straightforward as that.  For those relationship that we don't exactly know the formula, but can make a reasonable assumption on the shape of the formula, we can assume the relationship follows a family of models (e.g. Linear, Quadratic ... etc.), and then figure out the optimal parameters that best fit the historical data.

For example, in the relationship
Demand_2015 = H(GDP_2014, PhoneSold_2014)

Lets assume the "demand" is a linear combination of "GDP" and "Phone sold", which seems to be a reasonable assumption.

For the linear model we assume
Demand = w0 + (w1 * GDP) + (w2 * PhoneSold)

Then we feed the historical training data to a build a linear regression model and figure out what the fittest value of w0, w1, w2 should be.

Time-Series Function
In some cases, a variable depends only on its own past value but not other variables, here we can train a Time Series model to predict the variable based on its own past values.  Typically, the model is decomposed into 3 components; Noise, Trend and Seasonality.  One popular approach is to use exponential smoothing techniques such as Holt/Winters model.  Another popular approach is to use the ARIMA model which decomposed the value into "Auto-Regression" and "Moving-Average".

For example, in the relationship
GDP_2015 = T(GDP_2014, GDP_2013, GDP_2012, GDP_2011 ...)

We can use TimeSeries model to learn the relationship between the historical data to its future value.

Completely Unknown Function
But if we cannot even assume the model family, we can consider using "k nearest neighbor" approach to interpolate the output from its input.  We need to define the "distance function" between data points based on domain knowledge and also to figure out what the optimal value of k should be.  In many case, using a weighted average of the k-nearest neighbor is a good interpolation.

For example, in the relationship
UnitSold_2015 = G(Supply_2015, Demand_2015, Price_2015, CompetitorPrice_2015)
 It is unclear what model to be used in representing UnitSold as a function of Supply, Demand, Price and CompetitorPrice.  So we go with a nearest neighbor approach.

Based on monthly sales of past 3 years, we can use "Euclidean distance" (we can also consider scaling the data to a comparable range by minus its mean and divide by its standard deviation) to find out the closest 5 neighbors, and then using the weighted average to predict the unit sold.

Step 3: Optimization

At this point, we have the following defined
  • A goal defined by maximizing (or minimizing) an objective function
  • A set of variables (including the decision and environment variables)
  • A set of functions that define how these variables are inter-related to each other.  Some of them is defined by a mathematical formula and some of them is defined as a black-box (base on a predictive model)
Our goal is to figure out what the decision variables (which we have control) should be set such that the objective function is optimized (maximized or minimized).

Determine the value of environment variables
For those environment variables that has no dependencies on other variables, we can acquire their value from external data sources.  For those environment variables that has dependencies on other environment variables (but not decision variables), we can estimate their value using the corresponding dependency function (of course, we need to estimate all its depending variables first).  For those environment variables that has dependencies (direct or indirect) on decision variables, leave it as undefined.

Determine the best value of decision variables
Once we formulate the dependency function, depends on the format of these function, we can employ different optimization methods.  Here is how I choose the appropriate method based on the formulation of dependency functions.

Additional Challenges

To summarize, we have following the process below
  1. Define an objective function, constraints, decision variables and environment variables
  2. Identify the relationship between different variables
  3. Collect or predict those environment variables
  4. Optimize those decision variables based on the objective functions
  5. Return the optimal value of decision variables as the answer
So far, our dependency graph is acyclic where our decision won't affect the underlying variables.  Although this is reasonably true if the enterprise is an insignificant small player in the market, it is no longer true if the enterprise is one of the few major players.  For example, their pricing strategy may causes other competitors to change their own pricing strategy as well.  And how the competitors would react is less predictable and historical data play a less important role here.  At some point, human judgement will get involved to fill the gaps.

Sunday, August 17, 2014

Lambda Architecture Principles

"Lambda Architecture" (introduced by Nathan Marz) has gained a lot of traction recently.  Fundamentally, it is a set of design patterns of dealing with Batch and Real time data processing workflow that fuel many organization's business operations.  Although I don't realize any novice ideas has been introduced, it is the first time these principles are being outlined in such a clear and unambiguous manner.

In this post, I'd like to summarize the key principles of the Lambda architecture, focus more in the underlying design principles and less in the choice of implementation technologies, which I may have a different favors from Nathan.

One important distinction of Lambda architecture is that it has a clear separation between the batch processing pipeline (ie: Batch Layer) and the real-time processing pipeline (ie: Real-time Layer).  Such separation provides a means to localize and isolate complexity for handling data update.  To handle real-time query, Lambda architecture provide a mechanism (ie: Serving Layer) to merge/combine data from the Batch Layer and Real-time Layer and return the latest information to the user.

Data Source Entry

At the very beginning, data flows in Lambda architecture as follows ...
  • Transaction data starts streaming in from OLTP system during business operations.  Transaction data ingestion can be materialized in the form of records in OLTP systems, or text lines in App log files, or incoming API calls, or an event queue (e.g. Kafka)
  • This transaction data stream is replicated and fed into both the Batch Layer and Realtime Layer
Here is an overall architecture diagram for Lambda.

Batch Layer

For storing the ground truth, "Master dataset" is the most fundamental DB that captures all basic event happens.  It stores data in the most "raw" form (and hence the finest granularity) that can be used to compute any perspective at any given point in time.  As long as we can maintain the correctness of master dataset, every perspective of data view derived from it will be automatically correct.

Given maintaining the correctness of master dataset is crucial, to avoid the complexity of maintenance, master dataset is "immutable".  Specifically data can only be appended while update and delete are disallowed.  By disallowing changes of existing data, it avoids the complexity of handling the conflicting concurrent update completely.

Here is a conceptual schema of how the master dataset can be structured.  The center green table represents the old, traditional-way of storing data in RDBMS.  The surrounding blue tables illustrates the schema of how the master dataset can be structured, with some key highlights
  • Data are partitioned by columns and stored in different tables.  Columns that are closely related can be stored in the same table
  • NULL values are not stored
  • Each data record is associated with a time stamp since then the record is valid

Notice that every piece of data is tagged with a time stamp at which the data is changed (or more precisely, a change record that represents the data modification is created).  The latest state of an object can be retrieved by extracting the version of the object with the largest time stamp.

Although master dataset stores data in the finest granularity and therefore can be used to compute result of any query, it usually take a long time to perform such computation if the processing starts with such raw form.  To speed up the query processing, various data at intermediate form (called Batch View) that aligns closer to the query will be generated in a periodic manner.  These batch views (instead of the original master dataset) will be used to serve the real-time query processing. 

To generate these batch views, the "Batch Layer" use a massively parallel, brute force approach to process the original master dataset.  Notice that since data in master data set is timestamped, the data candidate can be identified simply from those that has the time stamp later than the last round of batch processing.  Although less efficient, Lambda architecture advocates that at each round of batch view generation, the previous batch view should just be simply discarded and the new batch view is computed  from master dataset.  This simple-mind, compute-from-scratch approach has some good properties in stopping error propagation (since error cannot be accumulated), but the processing may not be optimized and may take a longer time to finish.  This can increase the "staleness" of the batch view.

Real time Layer

As discussed above, generating the batch view requires scanning a large volume of master dataset that takes few hours.  The batch view will therefore be stale for at least the processing time duration (ie: between the start and end of the Batch processing).  But the maximum staleness can be up to the time period between the end of this Batch processing and the end of next Batch processing (ie: the batch cycle).  The following diagram illustrate this staleness.

Even the batch view is stale period, business operates as usual and transaction data will be streamed in continuously.  To answer user's query with the latest, up-to-date information.  The business transaction records need to be captured and merged into the real-time view.  This is the responsibility of the Real-time Layer.  To reduce the latency of latest information availability close to zero, the merge mechanism has to be done in an incremental manner such that no batching delaying the processing will be introduced.  This requires the real time view update to be very different from the batch view update, which can tolerate a high latency.  The end goal is that the latest information that is not captured in the Batch view will be made available in the Realtime view.

The logic of doing the incremental merge on Realtime view is application specific.  As a common use case, lets say we want to compute a set of summary statistics (e.g. mean, count, max, min, sum, standard deviation, percentile) of the transaction data since the last batch view update.  To compute the sum, we can simply add the new transaction data to the existing sum and then write the new sum back to the real-time view.  To compute the mean, we can multiply the existing count with existing mean, adding the transaction sum and then divide by the existing count plus one.  To implement this logic, we need to READ data from the Realtime view, perform the merge and WRITE the data back to the Realtime view.  This requires the Realtime serving DB (which host the Realtime view) to support both random READ and WRITE.  Fortunately, since the realtime view only need to store the stale data up to one batch cycle, its scale is limited to some degree.
Once the batch view update is completed, the real-time layer will discard the data from the real time serving DB that has time stamp earlier than the batch processing.  This not only limit the data volume of Realtime serving DB, but also allows any data inconsistency (of the realtime view) to be clean up eventually.  This drastically reduce the requirement of sophisticated multi-user, large scale DB.  Many DB system support multiple user random read/write and can be used for this purpose.

Serving Layer

The serving layer is responsible to host the batch view (in the batch serving database) as well as hosting the real-time view (in the real-time serving database).  Due to very different accessing pattern, the batch serving DB has a quite different characteristic from the real-time serving DB.

As mentioned in above, while required to support efficient random read at large scale data volume, the batch serving DB doesn't need to support random write because data will only be bulk-loaded into the batch serving DB.  On the other hand, the real-time serving DB will be incrementally (and continuously) updated by the real-time layer, and therefore need to support both random read and random write.

To maintain the batch serving DB updated, the serving layer need to periodically check the batch layer progression to determine whether a later round of batch view generation is finished.  If so, bulk load the batch view into the batch serving DB.  After completing the bulk load, the batch serving DB has contained the latest version of batch view and some data in the real-time view is expired and therefore can be deleted.  The serving layer will orchestrate these processes.  This purge action is especially important to keep the size of the real-time serving DB small and hence can limit the complexity for handling real-time, concurrent read/write.

To process a real-time query, the serving layer disseminates the incoming query into 2 different sub-queries and forward them to both the Batch serving DB and Realtime serving DB, apply application-specific logic to combine/merge their corresponding result and form a single response to the query.  Since the data in the real-time view and batch view are different from a timestamp perspective, the combine/merge is typically done by concatenate the results together.  In case of any conflict (same time stamp), the one from Batch view will overwrite the one from Realtime view.

Final Thoughts

By separating different responsibility into different layers, the Lambda architecture can leverage different optimization techniques specifically designed for different constraints.  For example, the Batch Layer focuses in large scale data processing using simple, start-from-scratch approach and not worrying about the processing latency.  On the other hand, the Real-time Layer covers where the Batch Layer left off and focus in low-latency merging of the latest information and no need to worry about large scale.  Finally the Serving Layer is responsible to stitch together the Batch View and Realtime View to provide the final complete picture.

The clear demarcation of responsibility also enable different technology stacks to be utilized at each layer and hence can tailor more closely to the organization's specific business need.  Nevertheless, using a very different mechanism to update the Batch view (ie: start-from-scratch) and Realtime view (ie: incremental merge) requires two different algorithm implementation and code base to handle the same type of data.  This can increase the code maintenance effort and can be considered to be the price to pay for bridging the fundamental gap between the "scalability" and "low latency" need.

Nathan's Lambda architecture also introduce a set of candidate technologies which he has developed and used in his past projects (e.g. Hadoop for storing Master dataset, Hadoop for generating Batch view, ElephantDB for batch serving DB, Cassandra for realtime serving DB, STORM for generating Realtime view).  The beauty of Lambda architecture is that the choice of technologies is completely decoupled so I intentionally do not describe any of their details in this post.  On the other hand, I have my own favorite which is different and that will be covered in my future posts.