What follows is a general scenario that describes the statistical problem I’m facing; I am happy to describe more specifics if necessary.

An uncommon, dangerous, and purely hypothetical disease has begun to spread across the nation. To quell panic, the authorities have created the following medical scheme. Anyone who suspects they are exhibiting symptoms can check into a clinic for a preliminary blood test. Those who test positive are immediately sent to a hospital for a detailed (and expensive) examination, which conclusively determines whether they have the disease. Those who test negative are sent home; however, a small, randomly selected fraction of them are asked to check into the hospital anyway.

The authorities now wish to create a new preliminary blood test.

We have the biological features catalogued by the old blood test for all patients; these will act as our predictor variables. For those who tested positive, we also have the determination of whether they did in fact have the disease (i.e. the outcome variable). Unfortunately, the vast majority (say, ~80%) of the preliminary blood tests came back negative, so the outcome variable for those patients is unknown. However, we do still have outcome data for some of them: the randomly hospitalised sample of the negative population. A small but nonzero percentage of these patients turned out to have the disease.
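
To make the setup concrete, here is a minimal simulation sketch of the data as I understand it. Everything in it is an assumption for illustration: the single biomarker `x`, the risk model, the 0.94 cut-off (chosen so roughly 20% test positive), and the 5% audit rate `AUDIT_RATE` are all hypothetical. It also shows one common way such a design is sometimes handled, inverse-probability weighting, where each labelled patient is weighted by one over their chance of being hospitalised. I am not claiming this is the soundest approach, only that it illustrates the selection problem.

```python
import math
import random

AUDIT_RATE = 0.05  # assumed fraction of test-negatives hospitalised at random


def simulate(n=20000, seed=0):
    """Simulate the screening scheme with one hypothetical biomarker."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        x = rng.gauss(0.0, 1.0)  # hypothetical biomarker from the old test
        p_disease = 1.0 / (1.0 + math.exp(-(-3.0 + 1.5 * x)))  # assumed risk model
        diseased = rng.random() < p_disease
        # Old test: noisy threshold on the biomarker, about 20% positive.
        old_positive = x + rng.gauss(0.0, 0.5) > 0.94
        audited = (not old_positive) and rng.random() < AUDIT_RATE
        labelled = old_positive or audited  # outcome observed only for these
        if old_positive:
            weight = 1.0               # always hospitalised
        elif audited:
            weight = 1.0 / AUDIT_RATE  # stands in for the unaudited negatives
        else:
            weight = 0.0               # outcome unknown
        rows.append({"x": x, "diseased": diseased, "old_positive": old_positive,
                     "labelled": labelled, "weight": weight})
    return rows


def labelled_prevalence(rows, weighted):
    """Disease prevalence among labelled rows, optionally design-weighted."""
    num = den = 0.0
    for r in rows:
        if not r["labelled"]:
            continue
        w = r["weight"] if weighted else 1.0
        num += w * r["diseased"]
        den += w
    return num / den


rows = simulate()
true_prev = sum(r["diseased"] for r in rows) / len(rows)  # unknowable in practice
naive_prev = labelled_prevalence(rows, weighted=False)    # ignores the design
ipw_prev = labelled_prevalence(rows, weighted=True)       # reweights audited negatives
print(f"true={true_prev:.3f} naive={naive_prev:.3f} weighted={ipw_prev:.3f}")
```

In this toy setting the unweighted labelled sample over-represents positives, so its prevalence is far from the population value, while the weighted estimate is much closer. The same weights could in principle be passed as per-sample weights to a classifier that accepts them (for example the `sample_weight` argument of scikit-learn's `LogisticRegression.fit`), though whether that is appropriate here is part of what I am asking.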

What is the soundest way of handling the missing data in this classification problem?

Training a classifier on data with systematically missing outcome variable [closed]