I am currently working with PyCaret 3.4.0, since 4.0 lacks some configuration parameters that are useful for my case.
I tried to replicate PyCaret results using scikit-learn.
This is my script, run after PyCaret's setup and after obtaining the transformed data:
compare_models(include=['rf'],cross_validation=True)
scoring=['accuracy','precision','recall','f1','f1_macro','f1_weighted','f1_micro', 'roc_auc']
rfc = RandomForestClassifier(random_state=42, n_jobs=-1)
scores = cross_validate(rfc, xtrain_trans, ytrain_trans, scoring=scoring, cv=cv)
compare_models returns these results:
    Model                     Accuracy  AUC     Recall  Prec.   F1      Kappa   MCC     TT (Sec)
rf  Random Forest Classifier  0.7164    0.7617  0.7164  0.7254  0.7137  0.4329  0.4416  0.25
but I get these results from sklearn:
Accuracy=0.6948717948717947
AUC=0.7411665257819103
Recall=0.6908791208791208
Precision=0.7005056185644422
F1=0.6867142420587947
F1_macro=0.6910345427878428
F1_weighted=0.6911863570749788
F1_micro=0.6948717948717947
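A side note on the two result sets above: PyCaret reports Recall equal to Accuracy (both 0.7164), whereas sklearn's scoring='recall' scores only the positive class. Weighted-average recall is always identical to accuracy, so it may be worth checking whether the two tools average the per-class metrics differently. The sketch below only demonstrates that identity on made-up labels; whether PyCaret 3.4.0 actually uses weighted averaging in this setup is an assumption, not something verified here.

```python
import numpy as np
from sklearn.metrics import accuracy_score, recall_score

# Made-up binary labels, purely for illustration (not the data from the question).
rng = np.random.default_rng(42)
y_true = rng.integers(0, 2, size=200)
# Flip roughly 30% of the labels to simulate an imperfect classifier.
y_pred = np.where(rng.random(200) < 0.7, y_true, 1 - y_true)

acc = accuracy_score(y_true, y_pred)
recall_binary = recall_score(y_true, y_pred)  # default average='binary'
recall_weighted = recall_score(y_true, y_pred, average="weighted")

print(f"accuracy          = {acc:.4f}")
print(f"recall (binary)   = {recall_binary:.4f}")
print(f"recall (weighted) = {recall_weighted:.4f}")  # same value as accuracy
```

If the averaging does differ, passing 'recall_weighted', 'precision_weighted' and 'f1_weighted' as sklearn scorers would be the like-for-like comparison.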
Just to clarify, I am using exactly the same CV splitter in both PyCaret and sklearn:
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
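As a sanity check on the splitter itself, here is a minimal sketch showing that it yields 50 folds and that, with an integer random_state, repeated calls give identical folds for the same input. X_demo and y_demo are placeholder arrays, not the real data.

```python
import numpy as np
from sklearn.model_selection import RepeatedStratifiedKFold

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)

# Placeholder data: only the number of rows and the labels influence the folds.
y_demo = np.array([0, 1] * 35)          # 70 rows, balanced classes
X_demo = np.zeros((len(y_demo), 1))

# Collect the test indices of every fold from two separate calls to split().
first = [test.tolist() for _, test in cv.split(X_demo, y_demo)]
second = [test.tolist() for _, test in cv.split(X_demo, y_demo)]

n_folds = cv.get_n_splits()             # n_splits * n_repeats
same_splits = first == second
print(n_folds, same_splits)
```

The folds are a function of the rows passed in, so the same splitter object only guarantees the same folds when both tools give it the same rows in the same order. With remove_outliers=True in the setup, that may be worth verifying.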
The setup used was:
setup(xtrain,
target = 'Group',
session_id=42,
test_data=xtest, #None default
imputation_type=None,
remove_multicollinearity=True,
multicollinearity_threshold=0.70,
remove_outliers=True,
transformation=True,
transformation_method='quantile',
normalize=True,
feature_selection=False,
fold_strategy=cv,
use_gpu=True)
I know the differences are small, but I would like to use RFECV to reduce the number of features. Since RFECV is not implemented in PyCaret, I have to run it with sklearn, and then the results are not comparable with PyCaret's. After RFECV, the best average score I get (F1=0.70, reduced number of features) is lower than the average score I get with PyCaret (F1=0.7137, all features), but higher than the average score I get with sklearn (F1=0.6867). So now I am not sure how to move forward, or whether I should use the reduced feature set.
REPRODUCIBLE EXAMPLE:
import pandas as pd
from pycaret.classification import setup, compare_models
from sklearn.model_selection import RepeatedStratifiedKFold
from sklearn.model_selection import cross_validate
from sklearn.ensemble import RandomForestClassifier
cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=100, n_features=5, n_informative=2, n_classes=2, flip_y=0.2, random_state=42)
dataset = pd.DataFrame(X)
dataset.columns = ['X1', 'X2', 'X3', 'X4', 'X5']
dataset['y'] = y
setup0 = setup(
    dataset,
    target='y',
    session_id=42,
    train_size=0.7,
    imputation_type=None,
    remove_multicollinearity=True,
    multicollinearity_threshold=0.70,
    remove_outliers=True,
    transformation=True,
    transformation_method='quantile',
    normalize=True,
    feature_selection=False,
    fold_strategy=cv,
    use_gpu=True,
)
# CV average results from PyCaret
compare_models(include=['rf'],sort='F1',errors='raise',cross_validation=True)
xtrain_trans = setup0.get_config('X_train_transformed')
ytrain_trans = setup0.get_config('y_train_transformed')
scoring=['accuracy', 'roc_auc', 'recall', 'precision', 'f1','f1_macro','f1_weighted','f1_micro']
rfc = RandomForestClassifier(random_state=42, n_jobs=-1)
scores = cross_validate(rfc, xtrain_trans, ytrain_trans, scoring=scoring, cv=cv)
scores=pd.DataFrame(scores)
scores=scores.mean().round(4)
# CV average results from sklearn
pd.DataFrame(scores).transpose()
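Because the gaps are small, it may also help to look at the fold-to-fold spread and not only the means. A minimal sketch of that, using only sklearn on the same synthetic data (n_estimators is reduced for speed, and the metric list is shortened for brevity):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RepeatedStratifiedKFold, cross_validate

X, y = make_classification(n_samples=100, n_features=5, n_informative=2,
                           n_classes=2, flip_y=0.2, random_state=42)

cv = RepeatedStratifiedKFold(n_splits=5, n_repeats=10, random_state=42)
rfc = RandomForestClassifier(n_estimators=50, random_state=42)

scores = cross_validate(rfc, X, y, scoring=["accuracy", "f1"], cv=cv)

# One row per fold (5 splits x 10 repeats = 50 rows), one column per metric.
per_fold = pd.DataFrame(scores)[["test_accuracy", "test_f1"]]

# Mean and standard deviation across the 50 folds.
summary = per_fold.agg(["mean", "std"]).round(4)
print(summary)
```

If the standard deviation across folds is of the same order as the gap between the PyCaret and sklearn means, the two results may not be meaningfully different.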