I am currently working with a fraud detection dataset as part of research on using uncertainty to improve the results of a neural-network ensemble of experts.
First, I split the dataset's training data into 6 domains (5 in-distribution and 1 OOD). The goal is to train 5 experts, each on a different fraud pattern, and then compute the uncertainty of each expert (using Monte Carlo dropout) as well as the uncertainty of the ensemble of experts.
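To make the uncertainty step concrete, here is a minimal sketch of Monte Carlo dropout for a single expert. It is not my actual model: the tiny one-hidden-layer network, the weights, and the names `forward_with_dropout` and `mc_dropout_uncertainty` are all made up for illustration. The idea is just that dropout stays active at inference time, the same input is passed through many times, and the spread of the predicted fraud probabilities is used as the uncertainty signal.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward_with_dropout(x, w_hidden, w_out, p_drop, rng):
    """One stochastic forward pass; dropout is kept ON at inference time."""
    hidden = []
    for weights in w_hidden:
        activation = max(0.0, sum(w * xi for w, xi in zip(weights, x)))  # ReLU
        keep = rng.random() >= p_drop
        # Inverted dropout: rescale kept units so the expected activation is unchanged.
        hidden.append(activation / (1.0 - p_drop) if keep else 0.0)
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)))


def binary_entropy(p):
    eps = 1e-12
    p = min(max(p, eps), 1.0 - eps)
    return -(p * math.log(p) + (1.0 - p) * math.log(1.0 - p))


def mc_dropout_uncertainty(x, w_hidden, w_out, p_drop=0.5, n_passes=100, seed=0):
    """Return (mean fraud probability, variance across passes, entropy of the mean)."""
    rng = random.Random(seed)
    probs = [forward_with_dropout(x, w_hidden, w_out, p_drop, rng)
             for _ in range(n_passes)]
    mean_p = sum(probs) / n_passes
    variance = sum((p - mean_p) ** 2 for p in probs) / n_passes
    return mean_p, variance, binary_entropy(mean_p)


# Hypothetical weights for a 2-feature input and 3 hidden units.
w_hidden = [[0.5, -0.2], [0.1, 0.8], [-0.3, 0.4]]
w_out = [0.7, -0.5, 0.2]
mean_p, variance, entropy = mc_dropout_uncertainty([1.0, 2.0], w_hidden, w_out)
```

The variance across passes and the entropy of the mean prediction are two common choices of uncertainty score; which one is used can change how fraud and legit transactions compare.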
In fraud detection we expect the data to be heavily imbalanced in favor of the non-fraud class. In this case the ratio is 1:90 (fraud to non-fraud). Given this, it makes sense to use oversampling when training the neural networks, so I did. I wrote a sampler that raises the ratio to 1:10, not by creating new transactions (because of the data split, some domains have too few fraudulent transactions to generate reliable synthetic ones) but by having the existing fraudulent transactions seen x times more often. With this in place, I'm having a problem with the uncertainty signal. In a perfect scenario, fraudulent transactions would be more uncertain than legit ones, but with oversampling the experts seem to be more uncertain about the legit transactions than about the fraudulent ones.
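For reference, the kind of sampler I mean can be sketched as below. This is a simplified stand-in, not my exact code: the function name `oversampled_indices` and its parameters are hypothetical, and it assumes labels are 1 for fraud and 0 for legit. It only repeats existing fraud rows; no synthetic transactions are created.

```python
import math
import random


def oversampled_indices(labels, target_legit_per_fraud=10, seed=0):
    """Return shuffled row indices in which every fraud row is repeated
    enough times to reach roughly 1:target_legit_per_fraud."""
    fraud = [i for i, y in enumerate(labels) if y == 1]
    legit = [i for i, y in enumerate(labels) if y == 0]
    if not fraud:
        return legit[:]  # nothing to oversample in this domain

    # Number of fraud rows needed to hit the target ratio.
    needed = math.ceil(len(legit) / target_legit_per_fraud)
    # How many times each existing fraud row has to be seen per epoch.
    repeats = max(1, math.ceil(needed / len(fraud)))

    indices = legit + fraud * repeats
    random.Random(seed).shuffle(indices)
    return indices


# Example with the 1:90 ratio: 10 fraud rows and 900 legit rows.
labels = [1] * 10 + [0] * 900
epoch_indices = oversampled_indices(labels)
n_fraud_seen = sum(labels[i] for i in epoch_indices)
```

With these numbers each fraud row is repeated 9 times, so one epoch contains 90 fraud rows against 900 legit rows.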
I have tried different oversampling rates, and the uncertainty signal is only correct when I use no oversampling at all, but then the overall predictive performance suffers.
So what should I do? Should I try a different oversampling method, or even try undersampling? Would it make sense to just go forward without any oversampling? I'd welcome a bit of discussion with some real humans rather than just going back and forth with an unreliable AI.
Cheers, good people of Stack Overflow.