Regression-Enhanced Random Forests


Random forests are a useful statistical learning method for predicting response values (e.g., corn yield) from predictor variables (e.g., soil type, soil moisture, and temperatures).  Random forest predictions are weighted averages of response values that occur in a training dataset.  The random forest algorithm provides a clever mechanism for determining the weight assigned to each training dataset observation when predicting the response value associated with given values of the predictor variables.  However, one limitation of random forest predictions is that, as weighted averages of training dataset responses, all predictions are necessarily bounded by the range of response values in the training dataset.  Thus, random forests may not be able to generate accurate predictions of future response values if a response (like corn yield) is generally increasing over time.  Baker Center personnel (Zhang, Zhu, and Nettleton) are developing an improved version of random forests (regression-enhanced random forests) that can account for linear structures in the data and capitalize on trends over time.  This new hybrid of penalized regression and random forests can greatly increase the accuracy of random forest predictions by blending the strengths of multiple linear regression with the flexibility of random forests.
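
The bounding limitation, and the way a linear component can relieve it, can be illustrated with a small sketch.  This is not the authors' implementation, and it rests on several assumptions: k-nearest-neighbour weights stand in for the weights a real random forest would produce, ordinary least squares stands in for penalized regression, and the data (a yield-like response rising by 2 units per year) are hypothetical.  All function names are made up for the illustration.  The hybrid shown here fits a line first and then takes a weighted average of the residuals, which is one plausible way to combine the two ideas.

```python
def knn_weights(train_x, x, k=3):
    """Stand-in for forest weights: 1/k on the k nearest training points, 0 elsewhere."""
    order = sorted(range(len(train_x)), key=lambda i: abs(train_x[i] - x))
    nearest = set(order[:k])
    return [1.0 / k if i in nearest else 0.0 for i in range(len(train_x))]


def weighted_average_predict(train_x, train_y, x, k=3):
    """Predict as a weighted average of training responses (weights sum to 1)."""
    weights = knn_weights(train_x, x, k)
    return sum(w * y for w, y in zip(weights, train_y))


def fit_line(xs, ys):
    """Ordinary least squares fit of y = a + b * x (assumed simplification:
    the source describes penalized regression, not plain OLS)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    b = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
         / sum((x - mean_x) ** 2 for x in xs))
    a = mean_y - b * mean_x
    return a, b


def hybrid_predict(train_x, train_y, x, k=3):
    """Linear trend plus a weighted average of the residuals from that trend."""
    a, b = fit_line(train_x, train_y)
    residuals = [y - (a + b * xi) for xi, y in zip(train_x, train_y)]
    return a + b * x + weighted_average_predict(train_x, residuals, x, k)


# Hypothetical training data: response increases by 2 units per year.
years = list(range(2000, 2010))
yields = [100.0 + 2.0 * (year - 2000) for year in years]

# Predict for a year beyond the training period.
plain = weighted_average_predict(years, yields, 2015)
hybrid = hybrid_predict(years, yields, 2015)

print("largest training response:", max(yields))
print("weighted-average prediction for 2015:", round(plain, 3))
print("hybrid prediction for 2015:", round(hybrid, 3))
```

In this sketch the plain weighted-average prediction cannot exceed the largest training response, whereas the hybrid prediction continues along the fitted trend beyond the training range.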