Random forest importance

8/10/2023

A key advantage of random forest variable importance measures, compared with univariate screening methods, is that they cover the impact of each predictor variable individually as well as in multivariate interactions with other predictor variables. For example, genetic markers that are relevant in interactions with other markers or environmental variables have been reported to be detected more efficiently by random forests than by univariate screening methods like Fisher's exact test. In the analysis of amino acid sequence data, Segal et al. also point out the necessity to consider interactions between sequence positions. Tree-based methods like random forests can help identify relevant predictor variables even in such high-dimensional settings involving complex interactions. Therefore, the impact of different amino acid properties, some of which have been shown to be relevant in DNA and protein evolution, for predicting peptide binding is investigated in our application example in Section 4.
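To make the contrast between univariate screening and permutation-based importance concrete, here is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the analysis from the source: the data are simulated, the response depends only on the interaction (XOR) of two binary predictors, and the function predict is a hard-coded stand-in for a fitted tree-based model that has learned this interaction. All names (make_xor_data, univariate_score, permutation_importance) are hypothetical.

```python
import random


def make_xor_data(n, seed=0):
    """Simulate three binary predictors; y depends only on x0 XOR x1."""
    rng = random.Random(seed)
    X = [[rng.randint(0, 1) for _ in range(3)] for _ in range(n)]
    y = [row[0] ^ row[1] for row in X]
    return X, y


def univariate_score(X, y, j):
    """Univariate screening: absolute difference in mean response
    between the groups x_j = 1 and x_j = 0."""
    ones = [yi for row, yi in zip(X, y) if row[j] == 1]
    zeros = [yi for row, yi in zip(X, y) if row[j] == 0]
    return abs(sum(ones) / len(ones) - sum(zeros) / len(zeros))


def predict(row):
    """Assumption: stand-in for a fitted model that captured the interaction."""
    return row[0] ^ row[1]


def accuracy(X, y):
    return sum(predict(row) == yi for row, yi in zip(X, y)) / len(y)


def permutation_importance(X, y, j, rng):
    """Drop in accuracy after randomly permuting column j."""
    column = [row[j] for row in X]
    rng.shuffle(column)
    X_perm = [row[:j] + [v] + row[j + 1:] for row, v in zip(X, column)]
    return accuracy(X, y) - accuracy(X_perm, y)


X, y = make_xor_data(2000)
rng = random.Random(1)
screen = [univariate_score(X, y, j) for j in range(3)]
importance = [permutation_importance(X, y, j, rng) for j in range(3)]
print("univariate scores:      ", [round(s, 3) for s in screen])
print("permutation importances:", [round(s, 3) for s in importance])
```

In this setup neither x0 nor x1 shows a marginal association with the response, so the univariate scores stay close to zero, while permuting either of them clearly degrades the predictions. The irrelevant predictor x2 receives an importance of zero because the stand-in model never uses it.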

Random forests are becoming increasingly popular in many scientific fields because they can cope with "small n large p" problems, complex interactions and even highly correlated predictor variables. Their variable importance measures have recently been suggested as screening tools for, e.g., gene expression studies. However, these variable importance measures show a bias towards correlated predictor variables. We identify two mechanisms responsible for this finding: (i) a preference for the selection of correlated predictors in the tree building process and (ii) an additional advantage for correlated predictor variables induced by the unconditional permutation scheme that is employed in the computation of the variable importance measure. Based on these considerations we develop a new, conditional permutation scheme for the computation of the variable importance measure. The resulting conditional variable importance reflects the true impact of each predictor variable more reliably than the original marginal approach.

Within the past few years, random forests have become a popular and widely used tool for non-parametric regression in many scientific areas. They show high predictive accuracy and are applicable even in high-dimensional problems with highly correlated variables, a situation which often occurs in bioinformatics. Recently, the variable importance measures yielded by random forests have also been suggested for the selection of relevant predictor variables in the analysis of microarray data, DNA sequencing and other applications. Identifying relevant predictor variables, rather than only predicting the response by means of some "black-box" model, is of interest in many applications. By means of variable importance measures, the candidate predictor variables can be compared with respect to their impact in predicting the response or even their causal effect, although interpreting the importance of a variable as a causal effect requires additional assumptions.
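The difference between an unconditional (marginal) and a conditional permutation scheme can be illustrated with a small sketch in plain Python. This is a simplified illustration, not the scheme developed in the source: the data are simulated, x2 is correlated with x1 but has no effect of its own on the response, and predict is a hard-coded stand-in for an ensemble that gives x2 some weight because it is correlated with x1. The strata for the conditional permutation are simply the values of x1, which is an assumption made for this example. All function names are hypothetical.

```python
import random


def make_data(n, seed=0):
    """x1 drives y; x2 agrees with x1 in about 90% of cases but has no own effect."""
    rng = random.Random(seed)
    x1 = [rng.randint(0, 1) for _ in range(n)]
    x2 = [v if rng.random() < 0.9 else 1 - v for v in x1]
    y = [v + rng.gauss(0.0, 0.1) for v in x1]
    return x1, x2, y


def predict(a, b):
    """Assumption: stand-in ensemble that averages a stump on x1 and a stump on x2."""
    return 0.5 * a + 0.5 * b


def mse(x1, x2, y):
    return sum((yi - predict(a, b)) ** 2 for a, b, yi in zip(x1, x2, y)) / len(y)


def marginal_importance(x1, x2, y, rng):
    """Permute x2 over all observations, breaking its link to both y and x1."""
    shuffled = x2[:]
    rng.shuffle(shuffled)
    return mse(x1, shuffled, y) - mse(x1, x2, y)


def conditional_importance(x1, x2, y, rng):
    """Permute x2 only within groups of equal x1, preserving their association."""
    shuffled = x2[:]
    for level in set(x1):
        idx = [i for i, v in enumerate(x1) if v == level]
        values = [x2[i] for i in idx]
        rng.shuffle(values)
        for i, v in zip(idx, values):
            shuffled[i] = v
    return mse(x1, shuffled, y) - mse(x1, x2, y)


x1, x2, y = make_data(2000)
rng = random.Random(1)
marginal = marginal_importance(x1, x2, y, rng)
conditional = conditional_importance(x1, x2, y, rng)
print("marginal importance of x2:   ", round(marginal, 4))
print("conditional importance of x2:", round(conditional, 4))
```

Under these assumptions the marginal scheme attributes a clearly positive importance to x2, even though x2 carries no information beyond x1, whereas the within-stratum permutation leaves the prediction error essentially unchanged.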
