So many data scientists select an analytic technique in hopes of achieving a magical solution, but in the end, the solution may simply not be possible due to other limiting factors. It is important for organizations working with analytic capabilities to understand the constraints that most real-world implementations will encounter. When developing a solution, one has to consider: data complexity, speed, analytic complexity, accuracy and precision, and data size. Neither Data Scientists nor the organizations they work for can be the best in every category simultaneously; however, it is necessary to understand the trade-offs among them.
It is important to know as much as possible about the data. Practically, this means understanding the data types, formal complexity measures, measures of overlap and linear separability, the number of dimensions/columns, and the linkages between data sets. For example, one must be able to link healthcare remittances to paid claims that come in all flavors (fully paid, partially paid, and denied) over long periods of time. These linkages can be extremely complex.
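To make the linkage point concrete, here is a minimal pandas sketch of joining remittances to claims. The table and column names (claim_id, remit_status, paid_amount) are hypothetical stand-ins, not a real payer schema:

```python
# A minimal sketch of linking remittances to paid claims.
# All names and values here are hypothetical illustrations.
import pandas as pd

claims = pd.DataFrame({
    "claim_id": [101, 102, 103],
    "billed_amount": [250.0, 1200.0, 90.0],
})

remittances = pd.DataFrame({
    "claim_id": [101, 102, 102, 104],
    "remit_status": ["fully_paid", "partially_paid", "denied", "fully_paid"],
    "paid_amount": [250.0, 800.0, 0.0, 60.0],
})

# A left join keeps every claim, exposing claims with no remittance yet
# (NaN status) and claims with multiple remittances arriving over time.
linked = claims.merge(remittances, on="claim_id", how="left")
print(linked)
```

Even in this toy version, one claim maps to zero, one, or several remittances; real linkages add identifiers that mutate across systems, which is where the complexity explodes.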
The speed at which an analytic outcome must be produced (e.g. near real-time, hourly, daily), or the time it takes to develop and implement the analytic solution, is another key consideration. This dimension causes a great deal of angst for most Data Scientists, primarily because they generally want to pursue an optimal solution regardless of time. However, we can all agree that if an enterprise needs to deploy new predictions every 15 minutes but it takes 1.5 hours to retrain the algorithm, the solution will not be successful.
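As a rough illustration of that 15-minute constraint, the sketch below times a retraining run against a deployment budget. The model and synthetic data are stand-ins; a real check would time the production pipeline's own training job:

```python
# A minimal sketch: does retraining fit inside the deployment window?
# Assumes scikit-learn; the model and data set are placeholders.
import time
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

DEPLOY_WINDOW_SECONDS = 15 * 60  # predictions must refresh every 15 minutes

X, y = make_classification(n_samples=50_000, n_features=40, random_state=0)

start = time.perf_counter()
model = LogisticRegression(max_iter=1000).fit(X, y)
elapsed = time.perf_counter() - start

print(f"Retraining took {elapsed:.1f}s against a {DEPLOY_WINDOW_SECONDS}s budget")
if elapsed > DEPLOY_WINDOW_SECONDS:
    print("Retraining cannot keep up with the deployment cadence.")
```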
Algorithm complexity is measured in terms of complexity class and the resources required for execution. This dimension can be limiting if the complexity needs to stay low enough for the business to grasp what is going on, which clearly constrains a Data Scientist's ability to create an optimal outcome. Some industries, healthcare among them, prefer a lower-quality prediction if it comes with more understanding of the factors contributing to that prediction. A great example of this is the $1 million Netflix Prize. A team of Data Scientists put in over 2,000 hours of work to come up with a combination of 107 algorithms that won first place by besting Netflix's own algorithm by 10%. However, Netflix never implemented the full first-place solution because of the engineering effort needed to bring it into a production environment.
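One way to see the trade-off described above is to fit an easily explained model next to a more complex one on the same data. The sketch below pits a shallow decision tree against a random forest on synthetic data; it illustrates the general idea, not the Netflix setup:

```python
# A minimal sketch of the interpretability trade-off: a shallow decision
# tree (explainable in a meeting) versus a larger ensemble. Data is synthetic.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=5_000, n_features=20, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

simple = DecisionTreeClassifier(max_depth=3).fit(X_train, y_train)
ensemble = RandomForestClassifier(n_estimators=300).fit(X_train, y_train)

# The ensemble usually scores a bit higher, but the tree's handful of splits
# can be read aloud to stakeholders; that gap is the price of explainability.
print("tree accuracy:  ", simple.score(X_test, y_test))
print("forest accuracy:", ensemble.score(X_test, y_test))
```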
Most businesses do not understand the nuances of predictive accuracy, so it is essential for a Data Scientist to help the organization move beyond the simple notion of accuracy. Obviously we all want to hit the proverbial target, at least directionally, but as a Data Scientist you will want to steer the conversation toward something more useful, like whether an algorithm produces "high accuracy/low precision" or "high accuracy/high precision". It usually proves beneficial to a business audience to distinguish what is meant by accuracy and precision, as the two appear close in meaning. Help them see that "accuracy" refers to the closeness of a predicted value to the actual value. A good example: a data science model predicts the weight of a package to be 19 lbs, but the actual weight of the package is 28 lbs. This demonstrates "low accuracy". "Precision", on the other hand, refers to the closeness of two or more measurements to each other. For example, if a Data Scientist predicts the weight of a package to be 19 lbs over 5 separate iterations, the prediction is said to be "precise". From a business perspective, it is critical to note that a data science model can be extremely precise but inaccurate in its prediction.
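The package example translates directly into a few lines of arithmetic: bias against the actual value captures (in)accuracy, while the spread across repeated predictions captures precision. A minimal sketch, using the 19 lb predictions and 28 lb actual from above:

```python
# Accuracy vs. precision, using the package-weight example from the text.
import statistics

actual_weight = 28.0                          # lbs, the true value
predictions = [19.0, 19.0, 19.0, 19.0, 19.0]  # 5 separate iterations

bias = statistics.mean(predictions) - actual_weight  # accuracy: closeness to the actual value
spread = statistics.stdev(predictions)               # precision: closeness of predictions to each other

print(f"bias (accuracy): {bias:+.1f} lbs")      # -9.0 lbs -> low accuracy
print(f"spread (precision): {spread:.1f} lbs")  # 0.0 lbs -> high precision
```

The large bias with zero spread is exactly the "extremely precise, but inaccurate" case worth warning a business audience about.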
The size of a data set is viewed as the number of rows and the number of fields. Many organizations may not understand that, when dealing with prediction, the more data you have, the better the output. However, there may be a point at which the size of the data goes beyond the typical tools and skill set of the average Data Scientist. In fact, many of the classic algorithms one might use on smaller data sets may simply vanish as options once one begins navigating bigger data waters. As a Data Scientist, it is worth investigating the limits of your skills and tools before you get in front of an executive audience; they are counting on you to be the expert, as well they should.
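Before that conversation with an executive audience, a back-of-the-envelope size check is cheap insurance. A minimal sketch, with hypothetical row and field counts, assuming 8-byte float values:

```python
# A minimal sketch of sizing a data set (rows x fields) before picking tools.
# Row and field counts here are hypothetical; float64 takes 8 bytes per value.
rows = 200_000_000
fields = 50
bytes_per_value = 8  # float64

estimated_gb = rows * fields * bytes_per_value / 1e9
print(f"~{estimated_gb:.0f} GB just to hold the raw matrix in memory")

# If that exceeds a single machine's RAM, in-memory classics stop being
# options, and out-of-core or distributed approaches have to take their place.
```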
By: Damian Mingle
Originally published at www.smartdatacollective.com
Great article, Jason. I particularly liked the dimensional approach and calling out the balance required… great results depend on all of these aspects being harmonized; they have multiplicative effects on the outcome.
My “pet” issue is accuracy and precision. In my experience as an innovator, this is all too often overlooked. We have created new methods of refining the accuracy of analytics by as much as 30% by eliminating false positives in the target sets in banking. The best models in the world are pretty poor if they are predicting and prescribing actions directed at the wrong objectives!
Hope to read more from you. Thank you.
– Dave McNab