Machine learning can bring tremendous value to the applications you build, but beware the bearers of false gifts.
In providing the ability to discover patterns buried deep within data, machine learning has the potential to make applications more powerful and more responsive to users’ needs. Well-tuned algorithms allow value to be extracted from immense and disparate data sources without the limits of human thinking and analysis. For developers, machine learning offers the promise of applying business-critical analytics to any application in order to accomplish everything from improving customer experience to providing product recommendations to serving up hyper-personalized content.
Cloud-based machine learning tools give developers a way to dip their toes into the possibilities that machine learning creates, and they can offer novel functionality. When used incorrectly, however, these tools produce poor results, which can be frustrating for users. As anyone who tested Microsoft’s age-detecting machine learning tool probably discovered, the plug-and-play ease of use came with major accuracy problems, which is not something one should accept for critical applications or when making important decisions.
Developers looking to incorporate machine learning in their applications need to be aware of some keys to success:
1. The more data an algorithm has, the more accurate it tends to become, so avoid subsampling if possible. Machine learning theory has a very intuitive characterization of the prediction error. In brief, the gap in prediction error between a machine learning model and the optimal predictor (the one that achieves the best possible error in theory) can be decomposed into three parts:
The error due to not having the right functional form for the model (approximation error)
The error due to not finding the optimal parameters for the model (optimization error)
The error due to not feeding enough data to the model (estimation error)
If the training data is limited, it may not be able to support the model complexity needed for the problem. Basic statistics points the same way: estimates computed from more samples have lower variance, so use all the data you have, rather than a subsample, whenever you can.
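The effect of subsampling on the third error term can be sketched with a toy experiment. The example below is only an illustration under stated assumptions: the data is synthetic (a line through the origin with Gaussian noise), and the function name `slope_estimation_error`, the sample sizes, and the seed are all hypothetical choices, not anything from a real project.

```python
import random

def slope_estimation_error(n_samples, n_trials=200, true_slope=3.0, seed=0):
    """Mean squared error of a least-squares slope fitted on n_samples points."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_trials):
        xs = [rng.uniform(-1.0, 1.0) for _ in range(n_samples)]
        ys = [true_slope * x + rng.gauss(0.0, 1.0) for x in xs]
        # Closed-form least squares for a line through the origin.
        slope = sum(x * y for x, y in zip(xs, ys)) / sum(x * x for x in xs)
        total += (slope - true_slope) ** 2
    return total / n_trials

full_error = slope_estimation_error(n_samples=1000)
subsample_error = slope_estimation_error(n_samples=20)
print(f"error with 1000 rows: {full_error:.5f}")
print(f"error with   20 rows: {subsample_error:.5f}")
```

The model and the fitting procedure are identical in both runs, so the difference in error comes only from the amount of data each fit sees.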
2. Selecting the machine learning method that works best for the given problem is key and often determines success or failure. For example, Gradient Boosting Trees (GBT) is a popular supervised learning algorithm widely used by industry practitioners due to its accuracy. However, despite its high popularity, it should not be blindly treated as the algorithm for every problem. Instead, one should always use the algorithm that best fits the characteristics of the data for the most accurate results.
To demonstrate this concept, consider an experiment comparing the accuracy of GBT and the linear Support Vector Machine (SVM) algorithm on the popular text categorization dataset rcv1. We observed that linear SVM is superior to GBT in terms of error rate on this problem. This is because in the domain of text, the data is often high-dimensional. A linear classifier can perfectly separate N examples in N − 1 dimensions (provided the examples are in general position), and thus a simple model is likely to work well on such data. Moreover, the simpler the model, the easier it is to learn its parameters from a finite number of training examples without overfitting, and so to deliver an accurate model.
On the other hand, GBT is highly nonlinear and more powerful, but more difficult to learn and more prone to overfitting in such a setting. It often ends up with inferior accuracy.
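The separability claim can be checked with a small sketch. The example below is not the rcv1 experiment and involves neither GBT nor an SVM solver; it trains a plain perceptron (a simple linear classifier, swapped in to keep the code dependency-free) on synthetic Gaussian points with random labels. The sizes, the seed, and the helper name `train_perceptron` are assumptions chosen for illustration.

```python
import random

def train_perceptron(points, labels, max_epochs=200):
    """Train a perceptron with a bias term.

    Returns (weights, bias, number of mistakes made in the last epoch run).
    """
    dim = len(points[0])
    weights, bias = [0.0] * dim, 0.0
    mistakes = 0
    for _ in range(max_epochs):
        mistakes = 0
        for x, y in zip(points, labels):
            activation = sum(w * xi for w, xi in zip(weights, x)) + bias
            if y * activation <= 0:  # misclassified: move the boundary toward x
                weights = [w + y * xi for w, xi in zip(weights, x)]
                bias += y
                mistakes += 1
        if mistakes == 0:  # a full pass with no errors: the data is separated
            break
    return weights, bias, mistakes

rng = random.Random(0)
n_examples, n_dims = 20, 200  # far more dimensions than examples, as in text data
points = [[rng.gauss(0.0, 1.0) for _ in range(n_dims)] for _ in range(n_examples)]
labels = [rng.choice([-1, 1]) for _ in range(n_examples)]  # labels carry no signal

_, _, final_mistakes = train_perceptron(points, labels)
print("training mistakes in final epoch:", final_mistakes)
```

In this setup the perceptron is expected to reach zero training mistakes even though the labels are pure noise. That perfect fit says nothing about accuracy on new data; it only shows how much capacity a linear model already has when dimensions outnumber examples, which is why a more flexible model is rarely needed there.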
3. To get a great model, the method and the parameters pertaining to the method must be chosen well. This may not be simple for someone who is not a data scientist. Modern machine learning algorithms have a number of knobs to tweak. For example, the popular GBT algorithm alone can have up to a dozen parameter settings, including how to control tree size, the learning rate, the sampling methodology for rows or columns, the loss function, the regularization options, and more. A typical project requires finding the best values for each of those parameters to get the highest possible accuracy for a given data set, and this is no easy feat. Intuition and experience help, but for best results, a data scientist needs to train a large number of models, look at their cross-validated scores, and think carefully about which parameters to try next.
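The train-many-models-and-compare loop can be sketched as an exhaustive grid search scored by cross-validation. This is a minimal illustration, not a GBT tuner: the model is a one-feature ridge regression with two hypothetical knobs (`lam` and `fit_intercept`), the data is synthetic, and the helper names `fit_ridge_1d`, `cv_mse`, and `grid_search` are made up for this example.

```python
import itertools
import random

def fit_ridge_1d(xs, ys, lam, fit_intercept):
    """Closed-form one-feature ridge regression; returns (slope, intercept)."""
    x_mean = sum(xs) / len(xs) if fit_intercept else 0.0
    y_mean = sum(ys) / len(ys) if fit_intercept else 0.0
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    return slope, y_mean - slope * x_mean

def cv_mse(xs, ys, params, n_folds=5):
    """Mean squared error averaged over n_folds held-out folds."""
    fold_errors = []
    for fold in range(n_folds):
        train = [i for i in range(len(xs)) if i % n_folds != fold]
        test = [i for i in range(len(xs)) if i % n_folds == fold]
        slope, intercept = fit_ridge_1d(
            [xs[i] for i in train], [ys[i] for i in train], **params
        )
        errors = [(ys[i] - (slope * xs[i] + intercept)) ** 2 for i in test]
        fold_errors.append(sum(errors) / len(errors))
    return sum(fold_errors) / n_folds

def grid_search(xs, ys, grid):
    """Score every parameter combination; return (best_params, best_cv_mse)."""
    names = sorted(grid)
    combos = itertools.product(*(grid[name] for name in names))
    scored = [(cv_mse(xs, ys, dict(zip(names, c))), dict(zip(names, c))) for c in combos]
    best_score, best_params = min(scored, key=lambda pair: pair[0])
    return best_params, best_score

rng = random.Random(0)
xs = [rng.uniform(0.0, 10.0) for _ in range(100)]
ys = [2.0 * x + 5.0 + rng.gauss(0.0, 1.0) for x in xs]  # synthetic, nonzero intercept

grid = {"lam": [0.0, 0.1, 1.0, 10.0], "fit_intercept": [True, False]}
best_params, best_score = grid_search(xs, ys, grid)
print(best_params, round(best_score, 3))
```

With two knobs the grid has eight candidates. A dozen knobs, as with GBT, makes exhaustive search impractical, which is where the intuition and experience mentioned above come in.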
4. Machine learning models can only be as good as the data. Improper data collection and cleaning will hurt your ability to build predictive, generalizable machine learning models. Experience suggests carefully reviewing the data with subject matter experts to gain insight into the data and the process that generated it. Often this review can identify data quality issues related to records, features, values, or sampling.
5. Understanding the features in the data and improving upon them (by creating new features and eliminating existing ones) has a large effect on predictive power. One fundamental task of machine learning is to represent the raw data in a rich feature space that can be effectively exploited by the machine learning algorithm. For example, feature transformation is a popular method that achieves this by deriving new features from the original ones through mathematical transformations. The resulting feature space (that is, the collection of features used to characterize the data) better captures various complex characteristics of the data (such as nonlinearity and interaction between multiple features), which are important for the subsequent learning process.
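A tiny worked case shows what a feature transformation buys. In the sketch below, the target depends on the raw feature quadratically, so a straight-line fit on the raw feature does poorly while the same fit on a squared feature is exact. The data and the helper name `fit_line_mse` are assumptions made up for this illustration.

```python
def fit_line_mse(inputs, targets):
    """Fit target = slope * input + intercept by least squares; return training MSE."""
    n = len(inputs)
    x_mean, y_mean = sum(inputs) / n, sum(targets) / n
    sxx = sum((x - x_mean) ** 2 for x in inputs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(inputs, targets))
    slope = sxy / sxx
    intercept = y_mean - slope * x_mean
    residuals = [y - (slope * x + intercept) for x, y in zip(inputs, targets)]
    return sum(r * r for r in residuals) / n

raw = [float(x) for x in range(-5, 6)]
targets = [x * x for x in raw]   # the target is a nonlinear function of the raw feature

squared = [x ** 2 for x in raw]  # engineered feature: a simple mathematical transformation

mse_raw = fit_line_mse(raw, targets)
mse_transformed = fit_line_mse(squared, targets)
print(f"linear fit on raw feature:     MSE = {mse_raw:.2f}")
print(f"linear fit on squared feature: MSE = {mse_transformed:.2f}")
```

The learning algorithm is unchanged between the two fits. Only the representation of the data differs, and that alone removes the error.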
6. Selecting the appropriate objective/loss function inspired by the business value is important for ultimate success in the application. Almost all machine learning algorithms are formulated as optimization problems. Based on the nature of the business, appropriately setting or adjusting the objective function of the optimization is a key step to the success of machine learning.
SVM, as an example, in its standard form optimizes the generalization error for a binary classification problem under the assumption that all types of errors carry equal weight. This is not appropriate for cost-sensitive problems, such as failure detection, in which certain types of errors cost more than others. In this case, it is recommended to adjust the SVM loss function by placing heavier penalties on the costlier types of error.
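One common way to add such penalties is to scale each example's hinge loss by a per-class cost. The sketch below computes only the loss, not a full SVM solver, and the function name `weighted_hinge_loss`, the labels, and the 10-to-1 cost ratio are hypothetical values chosen for illustration.

```python
def weighted_hinge_loss(scores, labels, class_weight):
    """Average hinge loss, with each example scaled by the cost of its true class.

    scores: raw decision values (w.x + b); labels: +1 or -1;
    class_weight: dict mapping each label to its misclassification cost.
    """
    losses = [class_weight[y] * max(0.0, 1.0 - y * s) for s, y in zip(scores, labels)]
    return sum(losses) / len(losses)

# Hypothetical failure-detection batch: +1 = failure, -1 = healthy.
scores = [-0.5, 0.5]  # the model misses the failure and raises one false alarm
labels = [1, -1]

equal_cost = weighted_hinge_loss(scores, labels, {1: 1.0, -1: 1.0})
costly_misses = weighted_hinge_loss(scores, labels, {1: 10.0, -1: 1.0})
print(equal_cost, costly_misses)
```

With equal costs the two mistakes contribute the same amount. With the weighted version, the missed failure dominates the loss, so an optimizer minimizing it is pushed to fix that error first. Libraries often expose this directly; scikit-learn's SVM classifiers, for instance, accept a `class_weight` argument.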
7. Ensure proper handling of training and testing data so the testing data mimics incoming data when the model is deployed in production. We can see, for example, how important this is for time-dependent data. In this case, using the standard cross-validation approach for training, tuning, and testing models would give misleading or outright wrong performance estimates, because randomly shuffled folds let the model train on data from the future relative to the data on which it is tested. To correct this, one must mimic how the model is used when deployed: use time-based cross-validation, in which the model is always validated on data newer than the data it was trained on.
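A time-based split can be sketched as an expanding window over rows sorted from oldest to newest. This is one possible scheme among several, and the function name `time_based_splits` and the sizes used are assumptions for illustration; scikit-learn offers a comparable utility in `TimeSeriesSplit`.

```python
def time_based_splits(n_samples, n_splits):
    """Yield (train_indices, test_indices) pairs for rows sorted oldest to newest.

    Each split trains on an expanding window of past rows and validates on the
    block of rows that immediately follows, so no future row leaks into training.
    """
    fold_size = n_samples // (n_splits + 1)
    if fold_size == 0:
        raise ValueError("not enough samples for the requested number of splits")
    for split in range(1, n_splits + 1):
        train_end = fold_size * split
        # The final validation block absorbs any leftover rows.
        test_end = n_samples if split == n_splits else train_end + fold_size
        yield list(range(train_end)), list(range(train_end, test_end))

splits = list(time_based_splits(n_samples=10, n_splits=3))
for train_idx, test_idx in splits:
    print("train:", train_idx, "validate:", test_idx)
```

Every validation index is larger than every training index in the same split, which is the property that ordinary shuffled cross-validation does not guarantee.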
8. Understand the generalization error of the model before deployment. Generalization error measures how well a model performs on unseen data. Just because a model performs well on training data doesn’t necessarily mean it will generalize well on unseen data. A carefully designed model evaluation process, which mimics the real deployment usage, is needed to estimate the generalization error of the model.
It’s easy to violate the rules of cross-validation without noticing, and there are non-obvious ways to perform cross-validation incorrectly, which often happens when you attempt to take computational shortcuts. It is essential to pay careful attention to proper and diligent cross-validation before deploying any models to obtain a scientific estimation of the deployment performance.
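One well-known shortcut of this kind is selecting features once on the full data set and cross-validating only the final model. The sketch below, built entirely on synthetic noise, compares that shortcut with doing the selection inside each fold. The helper names `best_feature` and `cv_accuracy`, the one-feature nearest-class-mean classifier, and all sizes are assumptions chosen to keep the illustration small.

```python
import random

def best_feature(rows, labels, indices):
    """Index of the feature whose class means differ most on the given rows."""
    def gap(j):
        ones = [rows[i][j] for i in indices if labels[i] == 1]
        zeros = [rows[i][j] for i in indices if labels[i] == 0]
        return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))
    return max(range(len(rows[0])), key=gap)

def cv_accuracy(rows, labels, select_inside_fold, n_folds=5):
    """Cross-validated accuracy of a one-feature nearest-class-mean classifier."""
    everything = list(range(len(rows)))
    # The shortcut: pick the feature once, using rows that will later be held out.
    leaked = None if select_inside_fold else best_feature(rows, labels, everything)
    correct = 0
    for fold in range(n_folds):
        train = [i for i in everything if i % n_folds != fold]
        test = [i for i in everything if i % n_folds == fold]
        j = best_feature(rows, labels, train) if select_inside_fold else leaked
        ones = [rows[i][j] for i in train if labels[i] == 1]
        zeros = [rows[i][j] for i in train if labels[i] == 0]
        mean1, mean0 = sum(ones) / len(ones), sum(zeros) / len(zeros)
        for i in test:
            predicted = 1 if abs(rows[i][j] - mean1) < abs(rows[i][j] - mean0) else 0
            correct += predicted == labels[i]
    return correct / len(rows)

rng = random.Random(0)
leaky_scores, honest_scores = [], []
for _ in range(20):  # repeat on fresh pure-noise data sets to smooth out chance
    rows = [[rng.gauss(0.0, 1.0) for _ in range(300)] for _ in range(40)]
    labels = [0, 1] * 20
    rng.shuffle(labels)  # labels are unrelated to the features
    leaky_scores.append(cv_accuracy(rows, labels, select_inside_fold=False))
    honest_scores.append(cv_accuracy(rows, labels, select_inside_fold=True))

leaky = sum(leaky_scores) / len(leaky_scores)
honest = sum(honest_scores) / len(honest_scores)
print(f"selection outside CV: {leaky:.2f}   selection inside CV: {honest:.2f}")
```

Since the labels are random, no model can truly do better than chance. The shortcut is nonetheless expected to report a noticeably higher accuracy, because the held-out rows already influenced which feature was chosen.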
This excerpt is from InfoWorld.