Scientific publication
T. M. Lange, M. Gültas, A. O. Schmitt & F. Heinrich (2025). optRF: Optimising random forest stability by determining the optimal number of trees. BMC Bioinformatics, 26(1), 95.
Follow this LINK to the original publication.
Random Forest: A Powerful Tool for Anyone Working With Data
What’s Random Forest?
Have you ever wished you could make better decisions using data, like predicting the risk of diseases, forecasting crop yields, or spotting patterns in customer behaviour? That's where machine learning comes in, and one of the most accessible and powerful tools in this field is something called Random Forest.
So why is random forest so popular? For one, it's incredibly versatile. It works well with many types of data, whether numbers, categories, or both. It's also widely used in many fields, from predicting patient outcomes in healthcare to detecting fraud in finance, and from improving online shopping experiences to optimising agricultural practices.
Despite the name, random forest has nothing to do with the trees in a forest, but it does use something called decision trees to make smart predictions. You can think of a decision tree as a flowchart that guides you through a series of yes/no questions based on the data you give it. A random forest creates a whole bunch of these trees (hence the "forest"), each slightly different, and then combines their results to make one final decision. It's a bit like asking a group of experts for their opinion and then going with the majority vote.
But until recently, one question remained unanswered: how many decision trees do I actually need? If each decision tree can lead to different results, averaging many trees should lead to better and more reliable results. But how many are enough? Luckily, the optRF package answers this question!
So let's take a look at how to optimise random forest for predictions and variable selection!
Making Predictions with Random Forests
To optimise and use random forest for making predictions, we can use the open-source statistics programme R. Once we have opened R, we have to install the two R packages "ranger", which allows us to use random forests in R, and "optRF", which optimises random forests. Both packages are open source and available via the official R repository CRAN. To install and load these packages, the following lines of R code can be run:
> install.packages("ranger")
> install.packages("optRF")
> library(ranger)
> library(optRF)
Now that the packages are installed and loaded into the library, we can use the functions that these packages contain. Furthermore, we can also use the data set included in the optRF package, which is free to use under the GPL licence (just like the optRF package itself). This data set, called SNPdata, contains in its first column the yield of 250 wheat plants as well as 5,000 genomic markers (so-called single nucleotide polymorphisms, or SNPs) that can take either the value 0 or 2.
> SNPdata[1:5,1:5]
Yield SNP_0001 SNP_0002 SNP_0003 SNP_0004
ID_001 670.7588 0 0 0 0
ID_002 542.5611 0 2 0 0
ID_003 591.6631 2 2 0 2
ID_004 476.3727 0 0 0 0
ID_005 635.9814 2 2 0 2
This data set is an example of genomic data and can be used for genomic prediction, an important tool for breeding high-yielding crops and, thus, for fighting world hunger. The idea is to predict the yield of plants using genomic markers, and random forest can be used for exactly this purpose! That means a random forest model is used to describe the relationship between the yield and the genomic markers. Afterwards, we can predict the yield of wheat plants for which we only have the genomic markers.
Therefore, let's imagine that we have 200 wheat plants for which we know both the yield and the genomic markers. This is the so-called training data set. Let's further assume that we have 50 wheat plants for which we know the genomic markers but not the yield. This is the so-called test data set. Thus, we split the data frame SNPdata so that the first 200 rows are saved as training data and the last 50 rows, without their yield, are saved as test data:
> Training = SNPdata[1:200,]
> Test = SNPdata[201:250,-1]
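As a quick sanity check, we can confirm that the split worked as intended: the training data should contain 200 rows and 5,001 columns (the yield plus 5,000 markers), while the test data should contain 50 rows and 5,000 columns (the markers only):
> dim(Training)
[1]  200 5001
> dim(Test)
[1]   50 5000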
With these data sets, we can now take a look at how to make predictions using random forests!
First, we have to calculate the optimal number of trees for random forest. Since we want to make predictions, we use the function opt_prediction from the optRF package. Into this function we have to insert the response from the training data set (in this case the yield), the predictors from the training data set (in this case the genomic markers), and the predictors from the test data set. Before we run this function, we can use the set.seed function to ensure reproducibility, even though this is not mandatory (we will see later why reproducibility is an issue here):
> set.seed(123)
> optRF_result = opt_prediction(y = Training[,1],
+ X = Training[,-1],
+ X_Test = Test)
Recommended number of trees: 19000
All the results from the opt_prediction function are now saved in the object optRF_result. However, the most important piece of information was already printed to the console: for this data set, we should use 19,000 trees.
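If you want to look beyond the printed recommendation, you can also inspect the returned object directly. Since the exact element names may differ between optRF versions, a safe way to explore the result is base R's str function:
> str(optRF_result, max.level = 1)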
With this information, we can now use random forest to make predictions. Therefore, we use the ranger function to derive a random forest model that describes the relationship between the genomic markers and the yield in the training data set. Here, too, we have to insert the response in the y argument and the predictors in the x argument. Furthermore, we can set the write.forest argument to TRUE and insert the optimal number of trees in the num.trees argument:
> RF_model = ranger(y = Training[,1], x = Training[,-1],
+ write.forest = TRUE, num.trees = 19000)
And that's it! The object RF_model contains the random forest model that describes the relationship between the genomic markers and the yield. With this model, we can now predict the yield for the 50 plants in the test data set for which we have the genomic markers but do not know the yield:
> predictions = predict(RF_model, data=Test)$predictions
> predicted_Test = data.frame(ID = row.names(Test), predicted_yield = predictions)
The data frame predicted_Test now contains the IDs of the wheat plants together with their predicted yield:
> head(predicted_Test)
ID predicted_yield
ID_201 593.6063
ID_202 596.8615
ID_203 591.3695
ID_204 589.3909
ID_205 599.5155
ID_206 608.1031
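Since SNPdata actually contains the observed yield for all 250 plants, we can, purely as a sanity check for this example, compare the predicted yields of the 50 test plants with their observed values, for instance via their correlation and a simple scatter plot:
> observed_yield = SNPdata[201:250, 1]
> cor(observed_yield, predicted_Test$predicted_yield)
> plot(observed_yield, predicted_Test$predicted_yield,
+ xlab = "Observed yield", ylab = "Predicted yield")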
Variable Selection with Random Forests
A different way to analyse such a data set would be to find out which variables are most important for predicting the response. In this case, the question is which genomic markers are most important for predicting the yield. This, too, can be done with random forests!
If we tackle such a task, we do not need separate training and test data sets. We can simply use the entire data set SNPdata and see which of the variables are the most important ones. But before we do that, we should again determine the optimal number of trees using the optRF package. Since we are interested in calculating the variable importance, we use the function opt_importance:
> set.seed(123)
> optRF_result = opt_importance(y=SNPdata[,1],
+ X=SNPdata[,-1])
Recommended number of trees: 40000
One can see that the optimal number of trees is now higher than it was for predictions. This is, in fact, often the case. With this number of trees, we can now use the ranger function to calculate the importance of the variables. Therefore, we use the ranger function as before but change the number of trees in the num.trees argument to 40,000 and set the importance argument to "permutation" (other options are "impurity" and "impurity_corrected").
> set.seed(123)
> RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1],
+ write.forest = TRUE, num.trees = 40000,
+ importance="permutation")
> D_VI = data.frame(variable = names(SNPdata)[-1],
+ importance = RF_model$variable.importance)
> D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]
The data frame D_VI now contains all the variables, thus all the genomic markers, and next to them their importance. We have also immediately ordered this data frame so that the most important markers are at the top and the least important markers are at the bottom. This means that we can look at the most important variables using the head function:
> head(D_VI)
variable importance
SNP_0020 45.75302
SNP_0004 38.65594
SNP_0019 36.81254
SNP_0050 34.56292
SNP_0033 30.47347
SNP_0043 28.54312
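One simple way to visualise these results is to plot the importance scores of, say, the 20 highest-ranked markers using base R:
> barplot(head(D_VI$importance, 20),
+ names.arg = head(D_VI$variable, 20),
+ las = 2, ylab = "Permutation importance")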
And that's it! We have used random forest to make predictions and to estimate the most important variables in a data set. Furthermore, we have optimised random forest using the optRF package!
Why Do We Need Optimisation?
Now that we've seen how easy it is to use random forest and how quickly it can be optimised, it's time to take a closer look at what's happening behind the scenes. Specifically, we'll explore how random forest works and why the results can change from one run to another.
To do that, we'll use random forest to calculate the importance of each genomic marker, but instead of optimising the number of trees beforehand, we'll stick with the default settings in the ranger function. By default, ranger uses 500 decision trees. Let's try it out:
> set.seed(123)
> RF_model = ranger(y=SNPdata[,1], x=SNPdata[,-1],
+ write.forest = TRUE, importance="permutation")
> D_VI = data.frame(variable = names(SNPdata)[-1],
+ importance = RF_model$variable.importance)
> D_VI = D_VI[order(D_VI$importance, decreasing=TRUE),]
> head(D_VI)
variable importance
SNP_0020 80.22909
SNP_0019 60.37387
SNP_0043 50.52367
SNP_0005 43.47999
SNP_0034 38.52494
SNP_0015 34.88654
As expected, everything runs smoothly, and quickly! In fact, this run was considerably faster than when we previously used 40,000 trees. But what happens if we run the exact same code again, this time with a different seed?
> set.seed(321)
> RF_model2 = ranger(y=SNPdata[,1], x=SNPdata[,-1],
+ write.forest = TRUE, importance="permutation")
> D_VI2 = data.frame(variable = names(SNPdata)[-1],
+ importance = RF_model2$variable.importance)
> D_VI2 = D_VI2[order(D_VI2$importance, decreasing=TRUE),]
> head(D_VI2)
variable importance
SNP_0050 60.64051
SNP_0043 58.59175
SNP_0033 52.15701
SNP_0020 51.10561
SNP_0015 34.86162
SNP_0019 34.21317
Once again, everything seems to work fine, but take a closer look at the results. In the first run, SNP_0020 had the highest importance score at 80.23, but in the second run, SNP_0050 takes the top spot and SNP_0020 drops to fourth place with a much lower importance score of 51.11. That's a substantial shift! So what changed?
The answer lies in something called non-determinism. Random forest, as the name suggests, involves a lot of randomness: it randomly selects data samples and subsets of variables at various points during training. This randomness helps prevent overfitting, but it also means that results can vary slightly each time you run the algorithm, even with the exact same data set. That's where the set.seed() function comes in. It acts like a bookmark in a shuffled deck of cards: by setting the same seed, you ensure that the random choices made by the algorithm follow the same sequence every time you run the code. But when you change the seed, you effectively change the random path the algorithm follows. That's why the most important genomic markers came out differently in each run. This behaviour, where the same process can yield different results due to internal randomness, is a classic example of non-determinism in machine learning.
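To put a number on how much the two runs disagree, we can, for example, merge the two importance tables and compute the Spearman rank correlation between the two sets of importance scores (a value of 1 would mean that both runs rank the markers identically):
> D_compare = merge(D_VI, D_VI2, by = "variable",
+ suffixes = c("_run1", "_run2"))
> cor(D_compare$importance_run1, D_compare$importance_run2,
+ method = "spearman")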
As we just saw, random forest models can produce slightly different results every time you run them, even on the same data, because of the algorithm's built-in randomness. So, how can we reduce this randomness and make our results more stable?
One of the simplest and most effective ways is to increase the number of trees. Each tree in a random forest is trained on a random subset of the data and variables, so the more trees we add, the better the model can "average out" the noise caused by individual trees. Think of it like asking 10 people for their opinion versus asking 1,000: you're more likely to get a reliable answer from the larger group.
With more trees, the model's predictions and variable importance rankings tend to become more stable and reproducible, even without setting a specific seed. In other words, adding more trees helps to tame the randomness. However, there's a catch: more trees also mean more computation time. Training a random forest with 500 trees might take a few seconds, but training one with 40,000 trees can take several minutes or more, depending on the size of your data set and your computer's performance.
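If you want to see this trade-off on your own machine, you can time both settings directly with system.time; the exact numbers will of course depend on your hardware:
> system.time(ranger(y = SNPdata[,1], x = SNPdata[,-1],
+ num.trees = 500))
> system.time(ranger(y = SNPdata[,1], x = SNPdata[,-1],
+ num.trees = 40000))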
However, the relationship between the stability and the computation time of random forest is non-linear. While going from 500 to 1,000 trees can improve stability considerably, going from 5,000 to 10,000 trees might only provide a tiny improvement in stability while doubling the computation time. At some point, you hit a plateau where adding more trees yields diminishing returns: you pay more in computation time but gain very little in stability. That's why it is essential to find the right balance: enough trees to ensure stable results, but not so many that your analysis becomes unnecessarily slow.
And this is exactly what the optRF package does: it analyses the relationship between stability and the number of trees in random forests and uses this relationship to determine the optimal number of trees, that is, the number that leads to stable results and beyond which adding more trees would only increase the computation time unnecessarily.
Above, we already used the opt_importance function and saved the results as optRF_result. This object contains the information about the optimal number of trees, but it also contains information about the relationship between stability and the number of trees. Using the plot_stability function, we can visualise this relationship. To do so, we have to insert the name of the optRF object, the measure we are interested in (here, the "importance"), the interval we want to visualise on the x axis, and whether the recommended number of trees should be added:
> plot_stability(optRF_result, measure="importance",
+ from=0, to=50000, add_recommendation=FALSE)

This plot clearly shows the non-linear relationship between stability and the number of trees. With 500 trees, random forest only reaches a stability of around 0.2, which explains why the results changed so drastically when random forest was repeated with a different seed. With the recommended 40,000 trees, however, the stability is close to 1 (which indicates perfect stability). Adding more than 40,000 trees would bring the stability even closer to 1, but this increase would be very small while the computation time would keep growing. That is why 40,000 trees is the optimal number of trees for this data set.
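If you prefer to have the recommended number of trees marked directly in the plot, you can rerun the same call with add_recommendation set to TRUE:
> plot_stability(optRF_result, measure="importance",
+ from=0, to=50000, add_recommendation=TRUE)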
The Takeaway: Optimise Random Forest to Get the Most Out of It
Random forest is a powerful ally for anyone working with data, whether you're a researcher, analyst, student, or data scientist. It's easy to use, remarkably flexible, and highly effective across a wide range of applications. But like any tool, using it well means understanding what's happening under the hood. In this post, we've uncovered one of its hidden quirks: the randomness that makes it strong can also make it unstable if not carefully managed. Fortunately, with the optRF package, we can strike the right balance between stability and performance, ensuring we get reliable results without wasting computational resources. Whether you're working in genomics, medicine, economics, agriculture, or any other data-rich field, mastering this balance will help you make smarter, more confident decisions based on your data.