Using Random Forest for feature selection - Dealing with correlated variables

Keywords: python random-forest


I would like to know how to deal with correlated variables when building a Random Forest for feature selection.

I need to do some feature selection on different datasets that contain both categorical and continuous variables. I'm a bit lost here because the most obvious correlation measure, Pearson's correlation coefficient, works for continuous variables, but what about categorical variables?
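To make the continuous case concrete, here is a minimal sketch of flagging strongly correlated numeric columns with Pearson's coefficient in pandas. The helper name `highly_correlated_pairs`, the 0.9 threshold and the toy data are all assumptions made up for illustration; the categorical column is simply skipped, which is exactly the gap I am asking about.

```python
import numpy as np
import pandas as pd


def highly_correlated_pairs(df, threshold=0.9):
    """Return (col_a, col_b, |r|) for numeric column pairs with |Pearson r| >= threshold."""
    # Only numeric columns are considered; categorical columns are dropped here.
    corr = df.select_dtypes(include="number").corr(method="pearson").abs()
    cols = list(corr.columns)
    pairs = []
    for i, a in enumerate(cols):
        for b in cols[i + 1:]:  # upper triangle only, so each pair appears once
            r = corr.loc[a, b]
            if r >= threshold:
                pairs.append((a, b, float(r)))
    return pairs


# Toy data (made up): y is a linear function of x, z is only weakly related to both.
x = np.arange(10, dtype=float)
toy = pd.DataFrame({
    "x": x,
    "y": 2 * x + 1,
    "z": [1, -1] * 5,
    "colour": ["red", "blue"] * 5,  # categorical, ignored by select_dtypes
})
pairs = highly_correlated_pairs(toy)
print(pairs)
```

The threshold is arbitrary, and which column of a flagged pair to drop is a separate decision that this sketch does not make.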

Would the following approach work if I wanted to do good feature selection using RF?

Do some kind of feature selection on the continuous variables, independently of the categorical variables, using any of the techniques described in this article:

From what I understand, univariate selection, RFE and PCA are only valid for continuous variables. (I doubt that one-hot encoding a categorical variable and then applying these techniques would be beneficial.)

Once the continuous variables are chosen, build the random forest, which would give uncorrelated variables.
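As a rough sketch of what the random forest step could look like, assuming scikit-learn and entirely synthetic data: the columns `signal`, `signal_copy`, `noise` and `group` are made up, and `signal_copy` is deliberately almost identical to `signal` so that the effect of a correlated pair on the importances can be inspected. As far as I understand, impurity-based importances tend to be shared between correlated columns rather than one of them being dropped, so this is a way to check that assumption, not a definitive procedure.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (assumption): the target depends only on "signal".
rng = np.random.default_rng(0)
n = 500
signal = rng.normal(size=n)
df = pd.DataFrame({
    "signal": signal,
    "signal_copy": signal + rng.normal(scale=0.01, size=n),  # almost identical to "signal"
    "noise": rng.normal(size=n),
    "group": rng.choice(["a", "b", "c"], size=n),  # categorical, unrelated to the target
})
y = (signal > 0).astype(int)

# One-hot encode the categorical column so the forest receives numeric input only.
X = pd.get_dummies(df, columns=["group"], dtype=float)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Impurity-based importances, normalised by scikit-learn to sum to 1.
importances = pd.Series(forest.feature_importances_, index=X.columns)
importances = importances.sort_values(ascending=False)
print(importances)
```

Comparing the values for `signal` and `signal_copy` shows how the forest treats the correlated pair on this toy data; real datasets may behave differently.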

However, can categorical variables be correlated? If so, would a chi-square test on only the categorical variables be useful?
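For reference, this is the kind of chi-square check I have in mind, sketched with `scipy.stats.chi2_contingency` on a contingency table built by `pandas.crosstab`. Cramér's V is added as an effect size on a 0 to 1 scale; using it here is my own assumption, as are the helper name `cramers_v` and the toy columns.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency


def cramers_v(a, b):
    """Chi-square test of independence between two categorical variables.

    Returns (Cramér's V, p-value). V is 0 for no association and 1 for a
    perfect one. Assumes both variables have at least two categories.
    """
    table = pd.crosstab(pd.Series(a, name="a"), pd.Series(b, name="b"))
    # correction=False disables Yates' continuity correction for 2x2 tables.
    chi2, p_value, _, _ = chi2_contingency(table, correction=False)
    n = table.to_numpy().sum()
    k = min(table.shape) - 1
    return float(np.sqrt(chi2 / (n * k))), float(p_value)


# Toy data (made up): "shade" is fully determined by "colour", "size" is balanced across it.
colour = ["red", "blue"] * 50
shade = ["dark" if c == "red" else "light" for c in colour]
size = ["S", "S", "L", "L"] * 25

v_same, p_same = cramers_v(colour, shade)
v_indep, p_indep = cramers_v(colour, size)
print(v_same, p_same)
print(v_indep, p_indep)
```

Note that this measures association between two features, not between a feature and the target, so on its own it only says which categorical columns may be redundant with each other.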

In the end, would combining the results of the feature selection on the continuous variables with the chi-square test on the categorical variables be a good solution?

Thank you for your help, I'm new to feature selection :)