The DIFA Framework for evaluating data science projects

A situation that many CIOs and data science departments face today is that the list of possible analytics projects is seemingly endless. The range of project candidates usually begins with analyzing customers and ends somewhere at utilizing social media data. Needless to say, not all project candidates will make sense from a business and especially an ROI perspective. Also, some projects will run into dead ends because fundamental prerequisites turn out to be missing.
Even though experimentation and some vagueness about eventual monetary success are normal in data science projects, there are some hard facts that heavily influence whether such a project succeeds. These facts are structured in the DIFA framework, which is explained below.

The DIFA framework is a set of rules and guidelines that help to analyze and prioritize data science projects from a management and business perspective.

DIFA stands for:

D = Data
I = Impact
F = Frequency
A = Automation

In the following, each of these four cornerstones will be explained. At first sight, you might be surprised that the accuracy of analytics or forecasts is missing. The reason for this is that the DIFA framework is applied before a data science project starts. At this point, forecasts are not yet available, hence their accuracy is still unknown.

Data

From a high-level perspective, data needs to be available in a sufficient number of independent observations. Independence matters because redundancies in the data have a negative impact on a model’s real-world performance. Next, the data needs to be free from structural breaks: a data point from today must be directly comparable to a data point from a year ago. Sometimes, variable definitions or the way data are recorded change over time. Such structural breaks need to be carefully analyzed and, if possible, fixed.
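A minimal sanity check for such breaks can already be scripted before the project starts. The sketch below assumes the data lives in a pandas DataFrame with a datetime column and a value column; the column names and the break date are purely illustrative.

```python
import pandas as pd

def check_structural_break(df: pd.DataFrame, date_col: str,
                           value_col: str, break_date: str) -> pd.DataFrame:
    """Compare basic statistics before and after a suspected break date.

    A large jump in mean or variance hints that the variable's definition
    or the way it is recorded changed at that point (a structural break).
    Assumes `date_col` is a datetime column.
    """
    before = df.loc[df[date_col] < break_date, value_col]
    after = df.loc[df[date_col] >= break_date, value_col]
    return pd.DataFrame(
        {"before": [before.mean(), before.std(), len(before)],
         "after": [after.mean(), after.std(), len(after)]},
        index=["mean", "std", "n"],
    )

# Hypothetical usage: the recording method allegedly changed on 2015-01-01.
# print(check_structural_break(sales, "date", "units_sold", "2015-01-01"))
```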
Given enough independent and comparable data points, it’s important to take a closer look at the data itself. Here, the variable to be modeled (the target or dependent variable) needs to fulfill stricter criteria than the input (or independent) variables. First and most important: does the target variable actually measure what it is supposed to? As an example, consider a retailer who wants to forecast the daily demand for its products. Is the daily sales figure a good variable to model demand? The answer is very close to yes, but consider the case when a product sells out before the store closes. Demand might still be there, but since the product is already sold out, this demand would not be reflected in the sales numbers. It is therefore very important to ask the right questions and to find or construct the correct data points. Furthermore, the target variable needs to be free of bias (it must not be artificially “tilted” in one direction) and it needs to be uncensored (there must be no cut-offs).

One specialty to consider regarding input variables is that they must not contain future information. Especially in the case of aggregation (such as summation or averaging), future information can very easily end up in one of these aggregates. Methods that impute missing values may also introduce future information and must therefore be handled with care. A more robust approach is to handle missing values explicitly rather than filling them in.
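To make the leakage point concrete, here is a small pandas sketch with made-up numbers: a centered rolling average quietly pulls in tomorrow’s value, while a trailing, shifted rolling average only uses information that was available at the time.

```python
import pandas as pd

# Hypothetical daily sales series.
sales = pd.Series([12, 15, 11, 18, 20, 17, 22], name="units_sold")

# Leaky feature: a centered 3-day average includes the next day's value,
# which would not be known when the forecast is actually made.
leaky = sales.rolling(window=3, center=True).mean()

# Safe feature: average over the previous 3 days, shifted by one day so
# that the value for day t only uses data up to day t-1.
safe = sales.rolling(window=3).mean().shift(1)

print(pd.DataFrame({"sales": sales, "leaky": leaky, "safe": safe}))
```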

Impact

It is very important to know in advance how an analytics result or forecast would eventually be used in an organization. This matters because most processes introduce friction costs that even the best model might not be able to outweigh. Here is an example: consider a manufacturer of consumer goods that ships its products in multiples of 1,000 items to its network of wholesalers. A 10% improvement in forecasting accuracy will presumably disappear in that rounding and not translate into an improvement of the business. It is therefore important for data scientists to understand the process chain in enough detail to identify the lever by which a process would actually be impacted.
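A back-of-the-envelope version of this friction effect, with all numbers invented for illustration, looks like this:

```python
import math

def shipped_quantity(forecast: float, lot_size: int = 1000) -> int:
    """Round a demand forecast up to the next full shipping lot."""
    return math.ceil(forecast / lot_size) * lot_size

# Hypothetical weekly demand forecasts for one wholesaler.
old_forecast = 4360   # baseline model
new_forecast = 4510   # improved model, noticeably closer to the true demand

print(shipped_quantity(old_forecast))  # 5000
print(shipped_quantity(new_forecast))  # 5000 -- the accuracy gain is rounded away
```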

But even without considering a process chain, the question of impact remains. The key question to be answered from a very high-level perspective is: what would the organization do with my forecast? Would they even care after the project is finished? And how exactly would it impact their business from a monetary perspective?

Frequency

One other key question is how frequently the model will be used in daily business. Is it a one-off analysis or will it be employed continuously? If the latter, will the model run on a daily, weekly or monthly basis? Or even in real time or on demand?
If you can determine the monetary impact of your model (see above), knowing the frequency helps to determine the actual value of a data science project. The frequency also tells you something about what the deliverable should look like. A one-off project is likely to result in a website with charts and tables or some kind of infographic. A daily or monthly report would probably be managed by some automation software, and an on-demand analytics service needs to talk to the outside world through a web API. Answering this question also tells you something about implementation and maintenance costs and the required skill sets, which of course impacts the cost side of the business case.
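A rough sketch of how frequency feeds into the business case might look as follows; every figure is a placeholder to be replaced by the numbers collected in the workshops.

```python
# Hypothetical business case: the value of the model scales with how often it runs.
value_per_run = 250.0         # estimated monetary impact per forecast run
runs_per_year = 365           # daily usage; set to 1 for a one-off project
build_cost = 40_000.0         # one-time implementation cost
run_cost_per_year = 12_000.0  # hosting, maintenance, monitoring

annual_value = value_per_run * runs_per_year
first_year_net = annual_value - build_cost - run_cost_per_year
print(f"Annual gross value: {annual_value:,.0f}")
print(f"Net value in year one: {first_year_net:,.0f}")
```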

Automation

Automation is a key question for many reasons. First, it’s important to determine for each case whether automation is desired and feasible at all. Hopefully, the answer is yes. Automation does not necessarily mean automated decision making, but it certainly means that the analytics runs without additional manual work and that the output, in the form of charts and numbers, is used by decision makers or staff to drive decisions, whether from a strategic or a rather operational perspective. A very common example is anomaly detection. Consider a company selling web hosting as a service. For each of their customers they measure the hourly or daily number of website calls to find out whether the customer’s website behaves normally. An explosive increase in visitors might indicate some sort of attack on the website. However, when do you call an increase in visitors explosive? Your client might have a seasonal business, or it might be around Christmas. Hence, analytics is needed on a continuous basis to produce numbers that reliably indicate this sort of anomaly, for example based on moving averages or portfolio benchmarks.
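A minimal sketch of such a continuously running check, based on a trailing moving average and standard deviation (window size, threshold and column names are illustrative choices, not a prescription):

```python
import pandas as pd

def flag_anomalies(calls: pd.Series, window: int = 28,
                   threshold: float = 4.0) -> pd.Series:
    """Flag days where traffic deviates strongly from its recent history.

    Uses a trailing moving average and standard deviation, so a gradual
    seasonal build-up (e.g. towards Christmas) raises the baseline instead
    of triggering an alert, while a sudden explosion in calls does.
    """
    baseline = calls.rolling(window).mean().shift(1)
    spread = calls.rolling(window).std().shift(1)
    z_score = (calls - baseline) / spread
    return z_score.abs() > threshold

# Hypothetical usage: daily_calls is a pd.Series of website calls per day.
# alerts = flag_anomalies(daily_calls)
```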

If the answer to automation is no, the next question should be “why not?”. Automation is in almost all cases preferable to manual and repetitive analytics work. Sometimes this means that more upfront effort has to be put into preparing the data needed for the analytics. However, this step is usually absolutely worth the effort, because it opens the door to doing even more with the data once they are available.

 

Collecting answers to these four points (DIFA) is consultative work rather than actual data science work. It includes workshops, some number crunching at spreadsheet level, learning about processes and so on. However, looking at data science projects from this perspective will absolutely pay off and tell a company where it can start and what homework still needs to be done before other projects can be kicked off.

Banks, don’t wait for your competition to become data driven


“Who Moved My Cheese?” is the title of the famous management book by Spencer Johnson. It deals with the fact that some of our current and beloved business models will fade out sooner or later, and that you should recognise and embrace the point in time when a paradigm shift is about to hit your business model.

As an example, think of retail banks, which used to sell their products primarily through a network of branches. Today, the number of these points of sale is dropping like a rock. Obviously, this is because an ever-increasing number of people find it way more attractive to manage their finances through user-friendly apps while sitting on the couch.

Data science is frequently classified as a disruptive collection of methods and a new paradigm of data-driven business models for many (if not all) industries. In particular, industries in which margin pressure is high, like retail and e-commerce, or those with a high affinity for collecting data, like online advertising, spearhead this movement.

Other industries, with finance as a prime example, seem like perfect candidates to exploit the power of data science. They, however, appear to be among the laggards in adopting these new possibilities. This is surprising for two reasons:
First, leveraging the potential of data science and analytics and developing data-driven business models is not only a measure to increase internal process efficiency, but above all a way to attract customers and maintain a sustainable business. Second, a “sit tight and wait” strategy is truly risky: establishing a data-driven business culture cannot be done overnight and needs time for training and developing people, leaving aside the effort and time needed to choose and set up the systems and infrastructure.

Recall how Google disrupted the search industry. Yahoo, Lycos and all those almost forgotten dinosaurs could never catch up or even come close to Google’s success after they had been disrupted. Hence, from the day some financial institutions start to seriously pursue a complete digital vision, the air will become thin for the laggards in the financial system.

So, which road blocks are holding the financial world back?

  1. it’s hard to determine an ROI figure before first results are available
  2. data is still hard to make available on a continuous basis
  3. regulation limits the possibilities for combining data

1. It’s hard to determine an ROI figure before first results are available
It’s true that you cannot tell how well you will be able to forecast, say, customer behaviour from your data before you have tried it. The first stages of a data science project will involve some experimentation and trial and error. This might feel uncomfortable, but it is an inherent characteristic of many data science use cases. However, this is not an excuse not to do it at all. Rather, you should have some R&D budget available for this sort of project and use it to collect experience. And if you don’t try to boil the ocean but laser-focus on a few promising use cases selected with the DIFA framework, chances are high you will strike a rich vein.

2. Data is still hard to make available on a continuous basis
Usually, in discussions about which data might influence an outcome, the sky is the limit. Including every data source that might conceivably influence the outcome is like building the Great Wall of China: a never-ending undertaking. At this point it is wise not to try to boil the ocean.
There is a very neat way to prioritise data sources, as shown in the data prioritisation matrix below. In brief, you want to start with the bread & butter quadrant and only extend your data basis beyond it if absolutely necessary. In many cases, it’s not.

[Figure: data prioritisation matrix]
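The matrix itself is not reproduced here, so as a hedged illustration only: assuming its two dimensions are expected business value and the effort to make a source available (a common way to build such a matrix), the bread & butter quadrant can be picked out with a few lines of code. The source names and scores below are entirely made up.

```python
# Assumed scoring: value and effort on a 1-5 scale; "bread & butter" is
# taken to mean high value at low effort. These axes are an assumption,
# not necessarily the original matrix.
sources = [
    {"name": "core banking transactions", "value": 5, "effort": 2},
    {"name": "CRM contact history",        "value": 4, "effort": 2},
    {"name": "clickstream data",           "value": 3, "effort": 4},
    {"name": "social media data",          "value": 2, "effort": 5},
]

bread_and_butter = [s["name"] for s in sources
                    if s["value"] >= 4 and s["effort"] <= 3]
print(bread_and_butter)  # start here; extend only if absolutely necessary
```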

3. Regulation limits the possibilities for combining data
The finance industry and its regulators have had a tough time with each other over recent years. As a result, the impact of regulation on data and analytics topics is often perceived as more severe than it actually is. Of course, there are clear lines that may not be crossed, and reputation risk should be part of the discussion.
The critical point is personal data. And here is the good news: analyses can almost always be carried out without personal data. Since the goal of most analyses is to discover patterns, clusters and the like, a surname or phone number doesn’t really help anyway. The bottom line is to stay in line with regulation, but at the same time not to be overly cautious about it.
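In practice this can be as simple as dropping the direct identifiers and pseudonymising the customer key before any pattern analysis starts. The sketch below assumes a pandas DataFrame with made-up column names; a production setup would use a salted or keyed hash and a proper data protection review.

```python
import hashlib
import pandas as pd

def pseudonymise(df: pd.DataFrame, id_col: str, pii_cols: list) -> pd.DataFrame:
    """Drop direct identifiers and replace the customer key with a hash.

    Patterns and clusters can still be found on the remaining behavioural
    columns; a surname or phone number adds nothing to that kind of analysis.
    (For real use, prefer a salted or keyed hash.)
    """
    out = df.drop(columns=pii_cols)
    out[id_col] = out[id_col].astype(str).map(
        lambda v: hashlib.sha256(v.encode()).hexdigest()[:16]
    )
    return out

# Hypothetical usage:
# clean = pseudonymise(customers, "customer_id", ["surname", "phone_number", "email"])
```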

Twitter: @AlexanderD_Beck

Vita: Alexander Beck holds a master’s degree in physics and a PhD in economics. He has worked as a data scientist for a quant hedge fund and has spent several years as a data science and pre-sales consultant. Today, he leads a data science team at a startup in the financial industry.