Dr. Timothy Smith

A Tree from the Forest

Beech Forest

Photo Source: Wikimedia Commons

Have you ever wanted to predict the future? Artificial intelligence can help with that using a technique known as supervised machine learning. With supervised machine learning, the computer practices making predictions on information for which the correct answers are already known. In a way, it is the same as working on problems that you already have the answers for. The set of known answers is called a training set. Using the training set, the computer looks for ways to predict the right answers. The learning is called supervised because people define what the computer is learning by sorting and classifying the examples in the training set. This sorting and classifying reveals patterns in the information that the computer can use to predict the future. In other words, supervised learning is very good at answering the question, “Is it A or is it B?” By extension, it is also great at the questions “Will it be A or B?” and “Will it be A, B, or C?” For example, “What will customers want in the fall: vanilla, chocolate, or coffee ice cream?” Supervised machine learning can sort and it can predict.
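
To make the idea of a training set concrete, here is a minimal sketch in Python. The sales records are invented for illustration, and the "learning" is deliberately simple: the program counts which flavor was chosen most often in each season and uses that as its prediction. Real tools find subtler patterns, but the workflow of training on known answers and then predicting is the same.

```python
from collections import Counter, defaultdict

# Hypothetical training set: (season, flavor the customer chose).
# These rows are invented for illustration; they are not real sales data.
TRAINING_SET = [
    ("summer", "vanilla"), ("summer", "vanilla"), ("summer", "chocolate"),
    ("fall", "coffee"), ("fall", "coffee"), ("fall", "vanilla"),
    ("winter", "chocolate"), ("winter", "chocolate"), ("winter", "coffee"),
]

def train(examples):
    """Count the known answers per season and keep the most common one."""
    counts = defaultdict(Counter)
    for season, flavor in examples:
        counts[season][flavor] += 1
    return {season: c.most_common(1)[0][0] for season, c in counts.items()}

def predict(model, season):
    """Answer "Will it be A, B, or C?" for a season seen during training."""
    return model[season]

model = train(TRAINING_SET)
prediction = predict(model, "fall")
print("Predicted fall favorite:", prediction)
```

The names train and predict are just labels chosen for this sketch; they mirror the two phases described above.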

Decision Tree

(Photo Source: Wikimedia Commons) The figure above is an example of a decision tree that tries to answer the question "Should I go play outside?" It uses aspects such as humidity, weather outlook, and amount of wind to answer the question.
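
A decision tree like this one can be written out as a short chain of if statements. The sketch below is an assumption about how such a tree might be laid out: it follows the common textbook version of the "play outside" example, so the exact branches may differ from the figure, and the function and parameter names are made up for illustration.

```python
def should_play_outside(outlook, humidity, windy):
    """Walk a small hand-written decision tree, one branch at a time.

    outlook:  "sunny", "overcast", or "rain"
    humidity: "high" or "normal"
    windy:    True or False
    """
    if outlook == "sunny":
        # On sunny days the deciding aspect is humidity.
        return humidity != "high"
    if outlook == "overcast":
        # Overcast days are always fine in this version of the tree.
        return True
    if outlook == "rain":
        # On rainy days the deciding aspect is wind.
        return not windy
    raise ValueError(f"unknown outlook: {outlook!r}")

print(should_play_outside("sunny", "normal", False))
print(should_play_outside("rain", "high", True))
```

Here a person wrote the branches by hand. The point of machine learning is that the computer works out the branches itself from the training set.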

One type of supervised machine learning is called random forest. Random forest is a machine learning tool that builds decision trees from data to classify information and make predictions. Decision trees divide aspects of something into branches (see figure above). A good example is the use of random forest to predict the price at which a house would sell using aspects such as other house prices, elevation, or closeness to a commuter rail station. Each tree splits the information into branches to find the biggest patterns. For example, in a hypothetical town with houses that range in value from $70,000 to $200,000, it may be observed that house prices are always over $110,000 when they are a mile or less from a commuter rail station. Other information can be added to the tree, such as the number of bathrooms, which gives another branch of patterns to work with.

Random forest builds many decision trees, each from a random sample of the training data and a random selection of the variables, and then combines their predictions: it averages them for a number such as a price, or takes a vote for a category. Since random forest is computerized, it can build thousands of decision trees. The trees can be built from different starting points, such as elevation first and then closeness to a commuter rail line. One of the great strengths of random forest is that it gives every variable a chance to prove useful instead of relying on assumptions about which ones matter, which reduces human bias in the prediction. For example, in predicting house prices, another pattern may be distance from a school or whether the house has municipal sewer and water or a well and septic system. This might help experts spot important patterns that they may have missed. The point is that random forest will try thousands of scenarios and, using statistics, measure which variables are the most informative across the trees it builds. Sometimes a variable that experts ignore may be much more important than originally thought.

Once random forest has been trained, it can be given new data, and it will predict an outcome like a house price by combining the patterns its trees learned earlier.
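
The house price example can be sketched in a few dozen lines of Python. This is a toy under stated assumptions, not a production implementation: the eight houses are invented, each "tree" is limited to a single split (often called a stump), and all function names are hypothetical. Full random forests grow deeper trees, and libraries such as scikit-learn provide ready-made versions. What the sketch does show is the core recipe: give each tree a random sample of the training set and a randomly chosen variable, then average the predictions of all the trees.

```python
import random
from statistics import mean

# Hypothetical training set: (miles to rail station, bathrooms) -> sale price.
# Invented numbers, chosen to fit the $70,000 to $200,000 town in the text.
HOUSES = [
    ((0.3, 2), 180_000), ((0.5, 1), 150_000), ((0.8, 3), 200_000),
    ((0.9, 1), 120_000), ((1.5, 2), 100_000), ((2.0, 1), 80_000),
    ((2.5, 3), 105_000), ((3.0, 1), 70_000),
]

def fit_stump(rows, feature):
    """Find the single split on one variable with the smallest squared error."""
    values = sorted({x[feature] for x, _ in rows})
    best = None
    for low, high in zip(values, values[1:]):
        threshold = (low + high) / 2
        left = [price for x, price in rows if x[feature] <= threshold]
        right = [price for x, price in rows if x[feature] > threshold]
        left_mean, right_mean = mean(left), mean(right)
        error = (sum((p - left_mean) ** 2 for p in left)
                 + sum((p - right_mean) ** 2 for p in right))
        if best is None or error < best[0]:
            best = (error, threshold, left_mean, right_mean)
    if best is None:
        # Every sampled house shares one value here, so no split is possible.
        overall = mean(price for _, price in rows)
        return (feature, float("inf"), overall, overall)
    return (feature, best[1], best[2], best[3])

def predict_stump(stump, x):
    feature, threshold, left_mean, right_mean = stump
    return left_mean if x[feature] <= threshold else right_mean

def fit_forest(rows, n_trees=200, seed=0):
    """Build many stumps, each from a random sample and a random variable."""
    rng = random.Random(seed)
    n_features = len(rows[0][0])
    forest = []
    for _ in range(n_trees):
        sample = [rng.choice(rows) for _ in rows]  # sample with replacement
        feature = rng.randrange(n_features)        # pick one variable at random
        forest.append(fit_stump(sample, feature))
    return forest

def predict_forest(forest, x):
    """Average the predictions of every tree in the forest."""
    return mean(predict_stump(stump, x) for stump in forest)

forest = fit_forest(HOUSES)
near = predict_forest(forest, (0.6, 2))  # new house close to the station
far = predict_forest(forest, (2.8, 2))   # new house far from the station
print(f"near the station: ${near:,.0f}   far from the station: ${far:,.0f}")
```

Because both new houses have two bathrooms, any difference between the two predictions comes from the trees that split on distance to the station.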

Artificial intelligence comes in many forms, and supervised machine learning can be a powerful tool for classifying information or making predictions. One type of machine learning known as random forest builds thousands of decision trees from its training data. Once random forest has been trained, it can make predictions on new data by combining the answers of its trees. Random forest is a powerful tool that works in the background of many systems, such as those that predict credit scores, stock prices, and even election outcomes. A great power of random forest derives from giving every variable a chance to matter; it may offer insights that experts would otherwise overlook.

For a great advanced graphical explanation of predictions and decision trees, see R2D3.

Dr. Smith’s career in scientific and information research spans the areas of bioinformatics, artificial intelligence, toxicology, and chemistry. He has published a number of peer-reviewed scientific papers. He has worked over the past seventeen years developing advanced analytics, machine learning, and knowledge management tools to enable research and support high-level decision making. Tim completed his Ph.D. in Toxicology at Cornell University and a Bachelor of Science in chemistry from the University of Washington.

You can buy his book on Amazon in paperback and in Kindle format here.