Jackknife variance estimates for random forest


In statistics, jackknife variance estimates for random forest are a way to estimate the variance of random forest predictions so that the effects of bootstrap resampling are eliminated.

Jackknife variance estimates

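The sampling variance of a bagged learner is:
$$V(x) = \operatorname{Var}\left[\hat{\theta}^{\infty}(x)\right]$$
where $\hat{\theta}^{\infty}(x)$ is the bagged prediction at $x$ obtained with infinitely many bootstrap replicates.
Jackknife estimates can be used to eliminate the bootstrap effects. The jackknife variance estimator is defined as:
$$\hat{V}_J = \frac{n-1}{n} \sum_{i=1}^{n} \left(\hat{\theta}_{(-i)} - \bar{\theta}\right)^2$$
where $n$ is the number of observations, $\hat{\theta}_{(-i)}$ is the estimate computed without the $i$-th observation, and $\bar{\theta}$ is the average of the $\hat{\theta}_{(-i)}$.
In some classification problems, when a random forest is used to fit the model, the jackknife variance estimate is defined as:
$$\hat{V}_J = \frac{n-1}{n} \sum_{i=1}^{n} \left(\bar{t}^{*}_{(-i)}(x) - \bar{t}^{*}(x)\right)^2$$
Here, $t^{*}$ denotes a decision tree after training, $\bar{t}^{*}(x)$ denotes the average prediction of all trees at $x$, and $\bar{t}^{*}_{(-i)}(x)$ denotes the result based on the bootstrap samples that do not contain the $i$-th observation.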
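The random forest form of the estimator can be illustrated with a minimal sketch in Python. It assumes that, for each of B bootstrap samples, the in-bag counts of every observation and the tree's prediction at a single point x have been recorded. The function name `jackknife_after_bootstrap`, its arguments, and the toy "tree" (the mean of each bootstrap sample) are hypothetical choices made for illustration only and do not come from any library.

```python
import random


def jackknife_after_bootstrap(in_bag_counts, tree_preds):
    """Monte Carlo jackknife variance estimate for a bagged prediction at x.

    in_bag_counts[b][i]: number of times observation i appears in
                         bootstrap sample b (hypothetical input layout).
    tree_preds[b]:       prediction t_b^*(x) of the tree grown on sample b.
    """
    n = len(in_bag_counts[0])
    t_bar = sum(tree_preds) / len(tree_preds)
    total = 0.0
    for i in range(n):
        # Keep only trees whose bootstrap sample omits observation i.
        out = [t for counts, t in zip(in_bag_counts, tree_preds)
               if counts[i] == 0]
        if not out:
            raise ValueError(f"observation {i} is in every bootstrap sample")
        t_bar_minus_i = sum(out) / len(out)
        total += (t_bar_minus_i - t_bar) ** 2
    return (n - 1) / n * total


# Toy demonstration: the "tree" is simply the mean of its bootstrap sample.
rng = random.Random(0)
data = [rng.gauss(0.0, 1.0) for _ in range(20)]
n, B = len(data), 2000

in_bag_counts, tree_preds = [], []
for _ in range(B):
    idx = [rng.randrange(n) for _ in range(n)]
    counts = [0] * n
    for j in idx:
        counts[j] += 1
    in_bag_counts.append(counts)
    tree_preds.append(sum(data[j] for j in idx) / n)

v_j = jackknife_after_bootstrap(in_bag_counts, tree_preds)
print(v_j)
```

Because only the trees that never saw observation i contribute to the i-th term, no additional model fitting is needed beyond the B trees already grown.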

Examples

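The e-mail spam problem is a common classification problem in which 57 features are used to classify e-mail as spam or non-spam. The bias-corrected infinitesimal jackknife (IJ-U) variance formula is applied to evaluate the accuracy of random forest models with m = 5, 19 and 57, where m is the number of features considered at each split. The results reported in the paper "Confidence Intervals for Random Forests: The Jackknife and the Infinitesimal Jackknife" show that predictions made by the m = 57 random forest appear to be quite unstable, while predictions made by the m = 5 random forest appear to be quite stable. These results agree with the evaluation by error percentage, in which the accuracy of the model with m = 5 is high and that of the model with m = 57 is low.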
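Here, accuracy is measured by the error rate, which is defined as:
$$\text{ErrorRate} = \frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \left(1 - \hat{y}_{ij}\right)$$
Here $N$ is the number of samples, $M$ is the number of classes, $y_{ij}$ is the indicator function, which equals 1 when the $i$-th observation is in class $j$ and 0 otherwise, and $\hat{y}_{ij}$ is the corresponding indicator for the predicted class. No probability is considered here. Another measure of accuracy, similar to the error rate, is the log loss:
$$\text{logloss} = -\frac{1}{N} \sum_{i=1}^{N} \sum_{j=1}^{M} y_{ij} \log\left(p_{ij}\right)$$
Here $N$, $M$ and $y_{ij}$ are as above, and $p_{ij}$ is the predicted probability that the $i$-th observation is in class $j$. This measure is used in Kaggle competitions.
The two measures are very similar: both average a per-observation penalty over the $N$ samples, but the error rate uses only the predicted class, while the log loss uses the predicted probabilities.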
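The difference between the two measures can be seen in a small Python sketch. The function names `error_rate` and `log_loss`, the clipping constant `eps`, and the toy labels and probabilities are assumptions made for illustration. They are not taken from the spam data set or from any particular library.

```python
import math


def error_rate(y_true, y_pred):
    """Fraction of observations whose predicted class differs from the true class."""
    return sum(t != p for t, p in zip(y_true, y_pred)) / len(y_true)


def log_loss(y_true, probs, eps=1e-15):
    """Multiclass log loss: -(1/N) * sum_i log(p_i,c_i), c_i the true class.

    probs[i][j] is the predicted probability that observation i is in class j.
    Probabilities are clipped to [eps, 1 - eps] so that log(0) never occurs.
    """
    total = 0.0
    for t, row in zip(y_true, probs):
        p = min(max(row[t], eps), 1.0 - eps)
        total += math.log(p)
    return -total / len(y_true)


# Toy example: 0 = non-spam, 1 = spam (made-up values).
y_true = [0, 1, 1, 0]
probs = [[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]]

# Hard prediction = class with the largest predicted probability.
y_pred = [max(range(len(row)), key=row.__getitem__) for row in probs]

print(error_rate(y_true, y_pred))
print(log_loss(y_true, probs))
```

In this sketch the error rate only registers that the third observation is misclassified, whereas the log loss also penalizes correct predictions that were made with low confidence.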

Modification for bias

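When Monte Carlo approximations based on $B$ bootstrap replicates are used to estimate the infinitesimal jackknife and jackknife variances $V_{IJ}^{\infty}$ and $V_{J}^{\infty}$, the Monte Carlo bias should be considered. The bias grows with $n$ and becomes large when $n$ is large:
$$E^{*}\left[\hat{V}_{IJ}^{B}\right] - \hat{V}_{IJ}^{\infty} \approx \frac{n}{B^{2}} \sum_{b=1}^{B} \left(t_{b}^{*}(x) - \bar{t}^{*}(x)\right)^2$$
To eliminate this influence, bias-corrected modifications are suggested:
$$\hat{V}_{IJ-U}^{B} = \hat{V}_{IJ}^{B} - \frac{n}{B^{2}} \sum_{b=1}^{B} \left(t_{b}^{*}(x) - \bar{t}^{*}(x)\right)^2$$
$$\hat{V}_{J-U}^{B} = \hat{V}_{J}^{B} - (e-1)\frac{n}{B^{2}} \sum_{b=1}^{B} \left(t_{b}^{*}(x) - \bar{t}^{*}(x)\right)^2$$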
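A minimal Python sketch of the bias-corrected estimates is given below. It assumes the infinitesimal jackknife estimate is computed as the sum over observations of the squared covariance between the in-bag count of observation i and the tree predictions, and that both corrections subtract a multiple of n/B² times the sum of squared deviations of the tree predictions from their mean. The function name `bias_corrected_variances`, the input layout, and the toy "tree" (the mean of each bootstrap sample) are hypothetical.

```python
import math
import random


def bias_corrected_variances(in_bag_counts, tree_preds):
    """Return the IJ, IJ-U, J and J-U variance estimates at one point x.

    in_bag_counts[b][i]: times observation i appears in bootstrap sample b.
    tree_preds[b]:       prediction t_b^*(x) of the tree grown on sample b.
    """
    B = len(tree_preds)
    n = len(in_bag_counts[0])
    t_bar = sum(tree_preds) / B

    # Shared Monte Carlo correction: n / B^2 * sum_b (t_b - t_bar)^2.
    correction = n / B**2 * sum((t - t_bar) ** 2 for t in tree_preds)

    v_ij = 0.0
    v_j = 0.0
    for i in range(n):
        # Covariance between the in-bag count of observation i and t_b^*(x).
        cov_i = sum((c[i] - 1) * (t - t_bar)
                    for c, t in zip(in_bag_counts, tree_preds)) / B
        v_ij += cov_i ** 2

        # Average prediction of the trees that did not see observation i.
        out = [t for c, t in zip(in_bag_counts, tree_preds) if c[i] == 0]
        if not out:
            raise ValueError(f"observation {i} is in every bootstrap sample")
        v_j += (sum(out) / len(out) - t_bar) ** 2
    v_j *= (n - 1) / n

    return {
        "IJ": v_ij,
        "IJ-U": v_ij - correction,
        "J": v_j,
        "J-U": v_j - (math.e - 1) * correction,
    }


# Toy demonstration: the "tree" is simply the mean of its bootstrap sample.
rng = random.Random(1)
data = [rng.gauss(0.0, 1.0) for _ in range(20)]
n, B = len(data), 2000

in_bag_counts, tree_preds = [], []
for _ in range(B):
    idx = [rng.randrange(n) for _ in range(n)]
    counts = [0] * n
    for j in idx:
        counts[j] += 1
    in_bag_counts.append(counts)
    tree_preds.append(sum(data[j] for j in idx) / n)

estimates = bias_corrected_variances(in_bag_counts, tree_preds)
for name, value in estimates.items():
    print(name, value)
```

The correction term shrinks as B grows for fixed n, so the corrected and uncorrected estimates approach each other when many bootstrap replicates are used.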