The DIFA Framework for evaluating data science projects

A situation that many CIOs and data science departments face today is that the list of possible analytics projects is seemingly endless. The range of project candidates usually begins with analyzing customers and ends somewhere around utilizing social media data. Needless to say, not all project candidates will make sense from a business and especially an ROI perspective. Also, some projects will run into dead ends because fundamental pieces turn out to be missing.
Even though experimentation and some vagueness about eventual monetary success are normal in the context of data science projects, there are some hard facts that heavily influence the success of such a project. These facts are structured in the DIFA framework, which is explained below.

The DIFA framework is a set of rules and guidelines that help to analyze and prioritize data science projects from a management and business perspective.

DIFA stands for:

D = Data
I = Impact
F = Frequency
A = Automation

In the following, each of these four cornerstones will be explained. At first sight, you might be surprised that the accuracy of analytics or forecasts is missing. The reason for this is that the DIFA framework is applied before a data science project starts. At this point, forecasts are not yet available, hence their accuracy is still unknown.

Data

From a very high-level perspective, data needs to be available in a sufficient number of independent observations. Independence is important because redundancies in the data have a negative impact on a model’s real-world performance. Next, the data needs to be free of structural breaks. This means that a data point from today must be directly comparable to a data point from a year ago. Sometimes, variable definitions or the way data is recorded change over time. Such structural breaks need to be carefully analyzed and, if possible, fixed.
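A first, very rough check for such a break is to compare summary statistics before and after a suspected change date. The values and the break date in this sketch are made up for illustration:

```python
import pandas as pd

# Made-up monthly values of a variable whose definition may have changed
# around a (hypothetical) suspected break date in mid-2022.
values = pd.Series(
    [100, 102, 98, 101, 99, 103, 140, 142, 138, 141, 139, 143],
    index=pd.date_range("2022-01-01", periods=12, freq="MS"),
)

suspected_break = "2022-07-01"
before = values[values.index < suspected_break]
after = values[values.index >= suspected_break]

# A large shift in level or spread around the suspected date is a first hint
# that data points are not directly comparable across the break.
print(before.describe())
print(after.describe())
```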
Given enough independent and comparable data points, it’s important to take a closer look at the data. Here, the variable to be modeled (the target or dependent variable) needs to fulfill stricter criteria than the input (or independent) variables. First and most important: does the target variable reflect exactly what it is supposed to measure? As an example, consider a retailer who wants to forecast the daily demand for its products. Is the daily sales figure a good variable to model demand? The answer is very close to yes, but consider the case where a product sells out before the store closes. Demand might still be there, but since the product is already sold out, this demand would not be reflected in the sales numbers. It is therefore very important to ask the right questions and to find or construct the correct data points. Furthermore, the target variable needs to be bias-free (meaning it is not artificially “tilted” in one direction) and uncensored (meaning there are no cut-offs).

One particular point to consider regarding input variables is that they must not contain future information. Especially in the case of aggregations (like sums or averages), it happens very quickly that future information ends up in one of these aggregates. Methods that impute missing values may also introduce future information and must therefore be handled with care; a more robust approach is to find a satisfactory way of dealing with missing data explicitly. The sketch below illustrates the aggregation pitfall.
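As a minimal illustration of future-information leakage (the sales figures below are made up), a centered rolling average mixes future values into the feature for a given day, while aggregating only past values and shifting by one day keeps the feature clean:

```python
import pandas as pd

# Made-up daily sales series, purely for illustration.
sales = pd.Series(
    [120, 135, 128, 150, 160, 155, 170],
    index=pd.date_range("2023-01-01", periods=7, freq="D"),
    name="sales",
)

# Leaky feature: a centered rolling mean uses observations from the future.
leaky_feature = sales.rolling(window=3, center=True).mean()

# Safe feature: aggregate past values only, then shift by one day so the
# value for day t is built exclusively from days before t.
safe_feature = sales.rolling(window=3).mean().shift(1)

print(pd.DataFrame({"sales": sales, "leaky": leaky_feature, "safe": safe_feature}))
```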

Impact

It is very important to know in advance how an analytics result or forecast would eventually be used in an organization. This is especially important since most processes introduce friction costs that even the best model might not be able to outweigh. Here is an example: consider a manufacturer of consumer goods that ships its products in multiples of 1000 items to its network of wholesalers. Because order quantities are rounded to full shipping lots, an improvement of 10% in forecasting accuracy will presumably not be reflected in an actual improvement of the business; the extra precision is rounded away before it reaches the process. Here, it is important for data scientists to understand the process chain in enough detail to identify the lever by which a process would actually be impacted.
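A tiny back-of-the-envelope sketch (with made-up forecast figures) shows how such a lot-size constraint can swallow a forecast improvement:

```python
import math

LOT_SIZE = 1_000  # products are shipped in multiples of 1000 items

def order_quantity(forecast: float) -> int:
    """Round a demand forecast up to the next full shipping lot."""
    return math.ceil(forecast / LOT_SIZE) * LOT_SIZE

old_forecast = 4_400  # baseline model (hypothetical figure)
new_forecast = 4_150  # improved model with lower forecast error (hypothetical figure)

print(order_quantity(old_forecast))  # 5000
print(order_quantity(new_forecast))  # 5000 -> the improvement never reaches the shipping process
```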

But even without considering a process chain, the question of impact remains. The key issue to be answered from a very high-level perspective is: what would the organization do with my forecast? Would they even care after the project is finished? And how exactly would it impact their business from a monetary perspective?

Frequency

Another key question is how frequently a model will be used in daily business. Is it a one-off effort, or will it be employed continuously? If the latter, will the model run on a daily, weekly, or monthly basis? Or even in real time or on demand?
If you can determine the monetary impact (see above) of your model, knowing the frequency helps to determine the actual value of a data science project. The frequency also tells you something about what the deliverable should look like. A one-off project is likely to result in a website with charts and tables or some kind of infographic. A daily or monthly report would probably be managed by some automation software, and an on-demand analytics service needs to talk to the outside world through a web API. Answering this question also tells you something about implementation and maintenance costs and the required skill sets, which of course impacts the cost side of the business case.
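Combining impact and frequency gives a rough value estimate for a project. The sketch below uses purely hypothetical cost and impact figures:

```python
def project_value(impact_per_run: float,
                  runs_per_year: int,
                  implementation_cost: float,
                  yearly_maintenance_cost: float,
                  years: int = 3) -> float:
    """Rough net value of a data science project over its expected lifetime."""
    benefit = impact_per_run * runs_per_year * years
    cost = implementation_cost + yearly_maintenance_cost * years
    return benefit - cost

# A daily forecast worth ~200 per run vs. a monthly report worth ~5,000 per run
# (all numbers made up for illustration).
print(project_value(impact_per_run=200, runs_per_year=250,
                    implementation_cost=80_000, yearly_maintenance_cost=15_000))
print(project_value(impact_per_run=5_000, runs_per_year=12,
                    implementation_cost=40_000, yearly_maintenance_cost=5_000))
```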

Automation

Automation is a key question for many reasons. First, it’s important to determine for each case whether automation is desired and feasible at all. Hopefully, the answer is yes. Automation does not necessarily mean automated decision making, but it certainly means that the analytics runs without additional manual work and that its output, in the form of charts and numbers, is used by decision makers or staff to drive decisions, either from a strategic or from a rather operational perspective.

A very common example is anomaly detection. Consider a company selling web hosting as a service. It measures, for each of its customers, the hourly or daily number of website calls to find out whether the customer’s website behaves normally. An explosive increase in visitors might indicate some sort of attack on the website. However, when do you call an increase in visitors explosive? Your client might have a seasonal business, or it might be around Christmas. Hence, analytics is needed on a continuous basis to produce numbers that reliably indicate this sort of anomaly, perhaps based on moving averages or portfolio benchmarks. A minimal version of such a check is sketched below.
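One simple way to turn the moving-average idea into an automated check is to flag hours whose traffic deviates strongly from a rolling baseline. The figures and the three-sigma threshold below are illustrative assumptions, not part of the original example:

```python
import pandas as pd

# Made-up hourly website calls for one customer; the last value spikes.
calls = pd.Series(
    [310, 290, 305, 320, 298, 315, 300, 2400],
    index=pd.date_range("2023-01-01", periods=8, freq="h"),
    name="calls",
)

# Rolling baseline built from past hours only (shifted so the current hour
# does not influence its own baseline).
window = 5
baseline = calls.rolling(window).mean().shift(1)
spread = calls.rolling(window).std().shift(1)

# Flag hours where traffic exceeds the baseline by more than 3 standard deviations.
anomalies = calls[calls > baseline + 3 * spread]
print(anomalies)
```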

If the answer to automation is no, the next question should be: why not? Automation is, in almost all cases, preferable to manual and repetitive analytics work. Sometimes this means that more upfront effort has to be put into preparing the data needed for analytics. However, this step is usually well worth it, because it opens the door to doing even more with the data once it is available.

 

Collecting answers to these four points (DIFA) is consultative rather than actual data science work. It involves workshops, some number crunching at spreadsheet level, learning about processes, and so on. However, looking at data science projects from this perspective will absolutely pay off and tell a company where it can start and what homework still needs to be done before other projects can be kicked off.
