On Pomona College Week: What is the Random Forest prediction model?
Jo Hardin, professor of math and statistics and Hardison Chair of analytical thinking, explores how it works.
Jo Hardin is Professor of Mathematics & Statistics and Hardison Chair of Analytical Thinking at Pomona College. Her research areas include machine learning, methods development for biological high-throughput data, and statistics & data science education. She has an active undergraduate research group and seeks to find ways to make statistics and data science more accessible. She recently co-authored the online and completely free introductory statistics textbook, Introduction to Modern Statistics. When not working with students or on her research, she loves running, hiking, and jigsaw puzzles.
A Unified Framework for Random Forest Prediction Error Estimation
From Netflix to Amazon, our lives are awash in models predicting individual preferences. Instead of predicting a distinct item like a movie or book, a different sub-class of models predicts a numerical outcome. For example, we might be interested in models predicting how much an individual will spend at Target, how much electricity an individual home will use, or how many months it will take someone to recover from a particular surgery.
My research focuses on the Random Forest prediction model. It takes a set of variables, such as where a home is located, the type of heating system, and how old and how big the home is, and uses them to predict a specific numerical value, such as how many kilowatt hours of electricity the family will use. We know that the model isn’t perfect, so we take the predicted value with a grain of salt.
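To make the setup concrete, here is a minimal sketch of fitting a Random Forest to predict electricity use. The data, variable names, and coefficients below are all invented for illustration, and the code uses scikit-learn's off-the-shelf `RandomForestRegressor` rather than anything specific to the research described here.

```python
# Hypothetical example: predict monthly kilowatt hours from home
# characteristics using a Random Forest (scikit-learn).
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(0)
n = 500
# Invented predictors: home age (years), size (sq ft), heating type (coded 0-2)
age = rng.uniform(0, 80, n)
size = rng.uniform(500, 4000, n)
heat = rng.integers(0, 3, n)
X = np.column_stack([age, size, heat])
# Invented response: monthly kWh, driven mostly by home size, plus noise
y = 0.25 * size + 2.0 * age + 50 * heat + rng.normal(0, 50, n)

model = RandomForestRegressor(n_estimators=200, random_state=0).fit(X, y)

# Predict for one new home: 30 years old, 2000 sq ft, heating type 1
new_home = np.array([[30.0, 2000.0, 1.0]])
pred = model.predict(new_home)[0]
print(f"predicted usage: about {pred:.0f} kWh")
```

The forest averages many decision trees, each grown on a different bootstrap sample of the data, which is what makes the single predicted number possible in the first place.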
For example, the model might predict that this home will use 1000 kilowatt hours of electricity. But for the model to be most useful, you would also want to know whether that estimate is 1000 plus-or-minus 100 kilowatt hours, or 1000 plus-or-minus 10 kilowatt hours. So instead of a single estimate, we report a prediction interval.
One distinctive aspect of Random Forests is that the model partitions the data: some parts of the data build the model, while other parts independently assess it. We take advantage of these data splits by using the held-out observations to measure individual variability. In that way, we have developed a method for calculating interval estimates for every individual in the dataset.
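The idea of using held-out ("out-of-bag") observations to calibrate interval width can be sketched as follows. This is a simplified illustration of the general out-of-bag approach, not the exact method from the paper; the data and names are invented, and a 95% interval of fixed width is used for all individuals, whereas the research described here tailors intervals to each individual.

```python
# Simplified sketch: out-of-bag residuals calibrate a prediction interval.
# Each training point's out-of-bag prediction comes only from trees that
# never saw it, so the resulting errors honestly reflect prediction error.
import numpy as np
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(1)
n = 1000
X = rng.uniform(0, 10, (n, 2))
y = 100 * X[:, 0] + rng.normal(0, 30, n)  # invented kWh-style response

forest = RandomForestRegressor(
    n_estimators=500, oob_score=True, random_state=1
).fit(X, y)

# Residuals from out-of-bag predictions on the training data
oob_resid = y - forest.oob_prediction_
lo, hi = np.quantile(oob_resid, [0.025, 0.975])

# Interval for a new observation: point prediction plus residual quantiles
x_new = np.array([[5.0, 5.0]])
pred = forest.predict(x_new)[0]
interval = (pred + lo, pred + hi)
print(f"predicted {pred:.0f} kWh, interval ({interval[0]:.0f}, {interval[1]:.0f})")
```

Because the out-of-bag errors are computed on observations each tree never trained on, the interval width reflects genuine prediction uncertainty rather than how well the forest memorized its training data.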
Using simulations and multiple real datasets, we are able to show that our intervals are just as accurate as, and generally narrower than, prediction intervals that do not harness Random Forests.
Read More:
[JMLR] – A Unified Framework for Random Forest Prediction Error Estimation

