How to Handle Missing Data: Best Practices 2026

The deadline is close, the file finally lands in your inbox, and half the columns are speckled with NaN, blanks, or placeholder junk that someone forgot to document. Most analysts react the same way. Drop the bad rows, fill the rest with a mean, move on.

That instinct is understandable, and it's often wrong.

How to handle missing data isn't a cleanup chore. It's part of the analysis. The choice you make can change the sample you're analyzing, flatten important variation, or hide the fact that the missingness itself is telling you something useful. Under deadline pressure, you need a decision process that's fast enough to use and disciplined enough to defend later.

A practical way to work is this: diagnose, choose, implement, validate. If your team is trying to shorten the repetitive parts of that workflow, tools like an AI agent for data analysis can help with profiling and first-pass checks. But the judgment still matters. You need to know when deletion is acceptable, when imputation is defensible, and when the smartest move is to leave values missing and model around them.

Your Data Is Full of Holes Now What
- A better workflow under pressure
- The question that actually matters
Diagnose Your Missing Data Problem
Choose Your Strategy Deletion Imputation or Modeling
Practical Implementation With Python
When Missingness Itself Is the Signal
Avoid These Common Pitfalls

Your Data Is Full of Holes Now What

The first bad move is treating missing data like lint on a sweater. Brush it off and keep walking. That's how analysts end up defending results they can't really trust.

A blank field changes the analysis in at least two ways. It can remove observations from the usable sample if you rely on complete cases, and it can distort the relationships among variables if you replace the blanks with something convenient. The core distinction matters: complete-case or listwise deletion discards records with missing values, while imputation and model-based approaches try to recover information instead of throwing it away. The practical guidance collected by the VA HERC overview of missing-data methods makes that trade-off explicit.

A better workflow under pressure

When I mentor junior analysts, I push a four-step sequence:

Diagnose the pattern of missingness before touching values.
Choose a handling strategy that matches the likely mechanism.
Implement it in a way that won't contaminate validation.
Validate whether your conclusion changes.

That sounds slower than a quick fillna(). It usually isn't. What burns time is rework after someone asks the obvious question: why did you handle the missing values this way?

Practical rule: If you can't explain why a field is missing, you're not ready to explain why your fix is appropriate.

The question that actually matters

Analysts under pressure often ask the wrong question. They ask, “How do I fill these values?” The better question is whether a given method changes the conclusion as little as possible while staying defensible.

That shift in mindset is the whole game. Some datasets deserve deletion. Some need multiple imputation. Some should keep explicit missingness flags. Some should be left incomplete because forcing a fake value into the data does more damage than leaving the absence visible.

Diagnose Your Missing Data Problem

Before you choose a method, you need to understand what kind of problem you have. Not every blank means the same thing, and treating them as interchangeable is how bad habits survive.

[Infographic: Diagnosing Missing Data, a five-step detective process for analyzing and handling absent information.]

Start with the pattern not the patch

There are three mechanisms analysts use as a working model.

MCAR means missing completely at random. Think of a server glitch that wipes out a few entries with no pattern.
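MAR means missing at random conditional on observed data: the chance that a value is missing depends on variables you did observe, not on the missing value itself. A survey field might be missing more often for one observed group than another.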
MNAR means missing not at random. The missingness depends on the missing value itself, like people with high income being less likely to report income.

You usually won't prove one of these with certainty. But you can form a defensible hypothesis from patterns in the observed data.
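To make the three mechanisms concrete, here is a minimal simulation sketch. Everything in it is an assumption for illustration: the synthetic `age` and `income` columns, the drop probabilities, and the cutoffs are made up, not taken from any real dataset. It shows how the mean of the observed values drifts away from the true mean under each mechanism.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5_000

# Hypothetical synthetic data: income rises with age.
age = rng.integers(20, 70, size=n)
income = 20_000 + 800 * age + rng.normal(0, 10_000, size=n)
demo = pd.DataFrame({"age": age, "income": income})

# MCAR: every row has the same 20% chance of losing income.
mcar_mask = rng.random(n) < 0.20

# MAR: the chance depends only on an observed column (age).
mar_mask = rng.random(n) < np.where(demo["age"] < 35, 0.40, 0.10)

# MNAR: the chance depends on the value that goes missing (high income).
high_income = demo["income"] > demo["income"].quantile(0.75)
mnar_mask = rng.random(n) < np.where(high_income, 0.50, 0.10)

true_mean = demo["income"].mean()
masks = {"MCAR": mcar_mask, "MAR": mar_mask, "MNAR": mnar_mask}

# Bias of the observed-only mean relative to the full-data mean.
bias = {name: demo.loc[~mask, "income"].mean() - true_mean for name, mask in masks.items()}
print(pd.Series(bias).round(0))
```

In a simulation like this, the MCAR bias should hover near zero while the MAR and MNAR cases drift in opposite directions. With real data you never see the masked values, which is why the diagnosis has to lean on observed patterns.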

The strongest practical framing I know comes from MeasuringU's guide to handling missing data: the core question for analysts under pressure is not how to fill values, but which method changes the conclusion least while remaining methodologically defensible. That's why you inspect patterns and test whether missingness differs across observed groups before defaulting to a fix.

If you need a broader foundation for that reasoning, this overview of statistical analysis methodology is a useful companion.

Copy paste checks in Python

Start with basic visibility. Don't guess where the problem is.

import pandas as pd

# df is assumed to be a DataFrame you have already loaded
# share of missing values per column, highest first
missing_pct = df.isna().mean().sort_values(ascending=False)
print(missing_pct)

# rows with any missing values
rows_with_missing = df[df.isna().any(axis=1)]
print(rows_with_missing.head())

Then visualize structure. The missingno library is useful because it shows whether columns tend to go missing together.

import missingno as msno
import matplotlib.pyplot as plt

msno.matrix(df)
plt.show()

msno.heatmap(df)
plt.show()

Now test whether missingness in one column lines up with observed values in another. That's often the first clue that you're looking at MAR rather than MCAR.

# Example: does income missingness differ by region?
df["income_missing"] = df["income"].isna().astype(int)

group_check = df.groupby("region")["income_missing"].mean().sort_values(ascending=False)
print(group_check)

For numeric predictors, compare observed groups directly.

# Example: compare age for rows where income is missing vs present
comparison = df.groupby(df["income"].isna())["age"].describe()
print(comparison)

Missingness that clusters by observed groups is rarely a random annoyance. It's a clue about the data-generating process.
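If the group-level rates look different, a quick significance check can tell you whether the gap is larger than chance alone would produce. The sketch below is one way to do that with a chi-square test of independence from SciPy. The `region` and `income` columns and the drop rates are hypothetical stand-ins. A small p-value is evidence against MCAR only; it cannot separate MAR from MNAR.

```python
import numpy as np
import pandas as pd
from scipy.stats import chi2_contingency

rng = np.random.default_rng(1)
n = 2_000

# Hypothetical data where one region loses income far more often.
demo = pd.DataFrame({
    "region": rng.choice(["north", "south", "west"], size=n),
    "income": rng.normal(50_000, 12_000, size=n),
})
drop_prob = np.where(demo["region"] == "south", 0.35, 0.08)
demo.loc[rng.random(n) < drop_prob, "income"] = np.nan

# Contingency table: region by whether income is missing.
table = pd.crosstab(demo["region"], demo["income"].isna())
chi2, p_value, dof, expected = chi2_contingency(table)

print(table)
print(f"chi2={chi2:.1f}, dof={dof}, p={p_value:.3g}")
```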

What to write down before moving on

Keep a short note for each key variable:

What is missing
How often it's missing
Whether the missingness appears patterned
What mechanism seems most plausible
What business or collection process might explain it

That note is more valuable than a fancy chart. It forces an explicit hypothesis, and that hypothesis should drive the treatment.
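One lightweight way to keep that note next to the data is a small log table. The helper below is a hypothetical sketch, not a standard function: `missingness_log`, the note fields, and the example columns are all names I made up for illustration.

```python
import pandas as pd


def missingness_log(df: pd.DataFrame, notes: dict[str, dict[str, str]]) -> pd.DataFrame:
    """Return one row per column with gaps: its missing share plus analyst notes."""
    rows = []
    for col in df.columns:
        share = df[col].isna().mean()
        if share == 0:
            continue  # nothing to document for complete columns
        entry = {"column": col, "missing_share": round(float(share), 3)}
        # Columns without a written note are marked TODO so they stand out.
        entry.update(notes.get(col, {"mechanism": "TODO", "explanation": "TODO"}))
        rows.append(entry)
    return pd.DataFrame(rows)


# Hypothetical example data and notes.
demo = pd.DataFrame({
    "age": [31, 45, None, 52],
    "income": [None, 52_000, None, 61_000],
    "region": ["n", "s", "s", "w"],
})
notes = {"income": {"mechanism": "MAR?", "explanation": "optional field on the web form"}}

log = missingness_log(demo, notes)
print(log)
```

Any column that still shows TODO is a column you have not diagnosed yet.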

Choose Your Strategy Deletion Imputation or Modeling

It's 4 p.m., the model review is tomorrow, and 18 percent of a key field is missing. At that point, the primary question is not which function to call. It's what mistake you can afford to make.

Once the diagnosis is done, choose based on the job. Are you trying to publish an effect estimate, ship a forecast, or get a baseline in front of stakeholders before the deadline? The right answer changes with the consequence of being wrong.

Columbia's guidance on missing data and multiple imputation makes the core trade-off clear. Simple mean or median imputation is easy, but it can bias results by shrinking variance. Multiple imputation is more appropriate when data are MAR and you need valid uncertainty estimates, not just a filled-in table.
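The variance-shrinking effect is easy to see on synthetic data. The sketch below assumes a made-up pair of correlated columns, `x` and `y`, with 30 percent of `y` removed completely at random, then compares the spread and correlation before and after a mean fill.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(2)
n = 4_000

# Hypothetical correlated columns.
x = rng.normal(0, 1, size=n)
y = 2 * x + rng.normal(0, 1, size=n)
full = pd.DataFrame({"x": x, "y": y})

# Remove 30% of y completely at random.
holed = full.copy()
holed.loc[rng.random(n) < 0.30, "y"] = np.nan

# Fill every gap with the observed mean of y.
filled = holed.fillna({"y": holed["y"].mean()})

summary = pd.DataFrame(
    {
        "std_y": [full["y"].std(), holed["y"].std(), filled["y"].std()],
        "corr_xy": [
            full["x"].corr(full["y"]),
            holed["x"].corr(holed["y"]),  # pairwise: uses observed rows only
            filled["x"].corr(filled["y"]),
        ],
    },
    index=["complete data", "observed only", "mean-filled"],
)
print(summary.round(3))
```

The observed-only row should track the complete data closely here because the gaps are random. The mean-filled row is the one that looks tighter and less correlated than the data really is.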

Deletion is acceptable only when the loss is tolerable

Deletion is sometimes the right call. If the missing share is small, the pattern looks close to random, and the dropped records are not concentrated in the subgroup you care about, complete-case analysis can be a reasonable baseline.

The risk is not just fewer rows. The risk is changing the sample in a way that answers a different question. If low-income customers, one hospital, or one device type is more likely to have blanks, deletion shifts the population under analysis.
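A quick way to check for that shift is to compare group shares before and after dropping incomplete rows. This is a sketch on invented data: the `segment` labels, the `spend` column, and the drop rates are assumptions chosen to make the effect visible.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(3)
n = 3_000

# Hypothetical customers where low-income rows lose spend most often.
demo = pd.DataFrame({
    "segment": rng.choice(
        ["low_income", "mid_income", "high_income"], size=n, p=[0.3, 0.5, 0.2]
    ),
    "spend": rng.gamma(2.0, 50.0, size=n),
})
drop_prob = demo["segment"].map(
    {"low_income": 0.45, "mid_income": 0.10, "high_income": 0.05}
)
demo.loc[rng.random(n) < drop_prob, "spend"] = np.nan

# Share of each segment in all rows vs. in complete cases only.
before = demo["segment"].value_counts(normalize=True)
after = demo.dropna(subset=["spend"])["segment"].value_counts(normalize=True)

shift = pd.DataFrame({"all_rows": before, "complete_cases": after})
shift["change"] = shift["complete_cases"] - shift["all_rows"]
print(shift.round(3))
```

If one group's share moves by several points, the complete-case sample no longer describes the population you started with.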

I usually tell junior analysts to defend deletion in one sentence: why does the remaining sample still represent the decision you need to make? If that sentence sounds shaky, deletion is probably the wrong choice.

Simple imputation buys speed, not credibility

Mean, median, mode, and constant fills are workflow tools. They are useful when you need a fast baseline, a quick prototype, or a model pipeline that runs end to end today instead of next week.

They are weak choices for final inference. They reduce variation, dilute relationships between variables, and make the dataset look cleaner than it really is. That trade-off can be fine in an operational model comparison. It is much harder to justify in a report where coefficients, intervals, or segment-level conclusions matter.

If you need a practical reference for fitting this into preprocessing work, this guide to data transformation techniques for model pipelines is a good complement, because missing-data handling rarely lives on its own.

Model-based methods are worth the effort when the decision is expensive

When accuracy is critical, extra work is usually cheaper than a bad conclusion. Multiple imputation earns its complexity when preserving sample size and uncertainty matters, especially under a plausible MAR story. It forces analysts to be explicit about which variables help explain the missingness and the missing values themselves.
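For a feel of what multiple imputation involves, here is a rough sketch, not a full implementation. It assumes scikit-learn's `IterativeImputer` with `sample_posterior=True` as the imputation engine, run with several seeds, and pools a simple mean with Rubin's rules. The data, the column names, and the choice of ten imputations are all hypothetical. A dedicated multiple-imputation package may be a better fit for serious inferential work.

```python
import numpy as np
import pandas as pd
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

rng = np.random.default_rng(4)
n = 1_000

# Hypothetical data: income (in thousands) depends on age.
age = rng.normal(45, 12, size=n)
income_k = 20 + 0.8 * age + rng.normal(0, 8, size=n)
true_mean = income_k.mean()
demo = pd.DataFrame({"age": age, "income_k": income_k})

# MAR: younger rows lose income more often.
drop_prob = np.where(demo["age"] < 40, 0.50, 0.10)
demo.loc[rng.random(n) < drop_prob, "income_k"] = np.nan

m = 10  # number of imputed datasets (an arbitrary choice for this sketch)
estimates, variances = [], []
for seed in range(m):
    imputer = IterativeImputer(sample_posterior=True, random_state=seed)
    completed = pd.DataFrame(imputer.fit_transform(demo), columns=demo.columns)
    estimates.append(completed["income_k"].mean())
    # Squared standard error of the mean within this completed dataset.
    variances.append(completed["income_k"].var(ddof=1) / n)

# Rubin's rules: pool the estimates and combine both sources of variance.
pooled_mean = float(np.mean(estimates))
within_var = float(np.mean(variances))
between_var = float(np.var(estimates, ddof=1))
total_var = within_var + (1 + 1 / m) * between_var
pooled_se = float(np.sqrt(total_var))

print(f"pooled mean={pooled_mean:.2f}, pooled SE={pooled_se:.3f}")
```

The between-imputation term is what single imputation throws away. It is the part of the uncertainty that comes from not knowing the missing values.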

That last part is where common advice often fails. Analysts are often told to fill the blanks and move on. A better question is whether the missingness process belongs in the model. In predictive work, a missingness indicator can carry real signal. In some tree-based systems, leaving values missing can perform better than forcing a guess into every cell. In other cases, especially regulated or high-stakes inference, that same shortcut can hide assumptions you should document.

Here's the framework I use in practice.

Strategy	Best For	Primary Risk	Implementation Effort
Deletion	Small, plausibly random missingness and fast baseline checks	Loss of sample and bias if missingness is systematic	Low
Simple imputation	Exploratory analysis, rough prototypes, operational placeholders	Distorted variance and weaker correlations	Low
Multiple imputation	MAR settings where inference or stable estimates matter	More setup, more modeling choices, easy to misuse if diagnosis is weak	Medium to high
Model-based handling	Workflows where the model can directly work with incomplete data	Hidden assumptions and validation errors	Medium to high
Missingness as a feature	Predictive modeling where absence may carry signal	Leakage or overinterpretation if the collection process changes	Medium

One rule helps under deadline pressure. Choose the fastest honest baseline first, then upgrade only if the result will change a real decision.

Sometimes the correct strategy is to leave the raw value missing, add a flag that marks the absence, and let the model use both pieces of information. Missingness is not always damage to repair. Sometimes it is part of the pattern you are trying to predict.

Practical Implementation With Python

Monday morning, the model has to be in review by 3 p.m., and three columns are 20 percent missing. That is not the moment to hunt for a perfect method. It is the moment to build a baseline you can defend, test where the choice changes the result, and document the assumptions you are making.

[Illustration: handling missing data with SimpleImputer in the Python scikit-learn library.]

In practice, missing-data code belongs inside the same preprocessing pipeline as encoding, scaling, and type cleanup. If your workflow is still a set of ad hoc notebook cells, tighten that up first. This guide to data transformation techniques in a production-friendly preprocessing workflow is a useful reference for that broader setup.

Simple imputation when you need a fast baseline

For a first pass, SimpleImputer is often enough. The goal is not to declare victory. The goal is to get a clean benchmark so you can see whether more effort changes the conclusion.

import pandas as pd
from sklearn.impute import SimpleImputer

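# Numeric columns to fill (example names; adjust to your data)
num_cols = ["age", "income", "spend"]

# Baseline 1: replace each gap with the column mean
mean_imputer = SimpleImputer(strategy="mean")
df_mean = df.copy()
df_mean[num_cols] = mean_imputer.fit_transform(df_mean[num_cols])

# Baseline 2: replace each gap with the column median
# For modeling work, fit on training rows only and reuse the fitted imputer elsewhere
median_imputer = SimpleImputer(strategy="median")
df_median = df.copy()
df_median[num_cols] = median_imputer.fit_transform(df_median[num_cols])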

Use mean for roughly symmetric variables. Use median when a few large values would pull the average around. Use a constant only when that fill value has a clear operational meaning and downstream users will understand it.

cat_cols = ["channel", "device_type"]

constant_imputer = SimpleImputer(strategy="constant", fill_value="Missing")
df_cat = df.copy()
df_cat[cat_cols] = constant_imputer.fit_transform(df_cat[cat_cols])

A common mistake is treating simple imputation as harmless cleanup. It changes the distribution. It can shrink variance, weaken correlations, and make a model look more certain than it should. I still use it often, because under deadline pressure a fast benchmark is better than a complicated method you cannot validate in time.

Iterative imputation when relationships between variables matter

If age, income, tenure, and spend move together, filling each column independently throws away information. IterativeImputer uses the other columns to estimate plausible values, which usually makes more sense than plugging the same mean into every gap.

import pandas as pd
# Required: this import switches on the experimental IterativeImputer
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer

imp_cols = ["age", "income", "spend", "tenure"]

iter_imputer = IterativeImputer(random_state=42)
df_iter = df.copy()
df_iter[imp_cols] = iter_imputer.fit_transform(df_iter[imp_cols])

This is a practical approximation, not a magic fix. It can work well for prediction pipelines, but it still bakes in modeling assumptions. If the missingness mechanism is badly misunderstood, better code will not save the analysis.

A few habits matter more than the specific class you import:

Include predictors that are related to both the missing field and the reason it went missing.
Fit the imputer on training data only, then apply it to validation and test sets.
Keep the imputation step inside a saved pipeline so scoring later uses the same logic.
Compare model performance and business conclusions against a complete-case baseline.
Check whether imputed values land in sensible ranges instead of assuming the algorithm got it right.
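Several of those habits come down to one pattern: put the imputer inside a scikit-learn `Pipeline` and fit the whole thing on the training split. The sketch below assumes synthetic features and a made-up binary target, and the median-plus-logistic-regression setup is just a placeholder baseline.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(5)
n = 1_500

# Hypothetical features and target.
X = pd.DataFrame({
    "age": rng.normal(45, 12, size=n),
    "income_k": rng.normal(55, 15, size=n),
    "tenure": rng.gamma(2.0, 3.0, size=n),
})
y = (X["income_k"] + rng.normal(0, 10, size=n) > 55).astype(int)
X = X.mask(rng.random(X.shape) < 0.15)  # knock out 15% of cells at random

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = Pipeline([
    ("impute", SimpleImputer(strategy="median")),
    ("scale", StandardScaler()),
    ("clf", LogisticRegression(max_iter=1000)),
])

# Fitting the pipeline learns the medians from the training rows only.
model.fit(X_train, y_train)
test_accuracy = model.score(X_test, y_test)  # test rows reuse the training medians

train_medians = model.named_steps["impute"].statistics_
print(dict(zip(X.columns, train_medians.round(2))), round(test_accuracy, 3))
```

Saving `model` with your usual serialization tool keeps the fill values and the classifier together, so later scoring uses the same logic.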


A practical standard for shipping work

The standard I use is simple. If a quick imputation method and a stricter method lead to the same decision, ship the simpler one and document it. If the recommendation changes, the missing-data choice is part of the analysis, not a preprocessing footnote.

For internal prototypes, that may mean median imputation plus careful validation. For client work, policy decisions, or anything likely to be audited, raise the bar. Show how sensitive the result is to the handling method, record the assumptions, and keep enough of the workflow reproducible that another analyst can rerun it without guessing what happened.

When Missingness Itself Is the Signal

It is 4:30 p.m., the model review is at 5, and one column is missing for a third of the rows. The rushed move is to fill the blanks and keep going. That is often the wrong call.

Some missing values are part of the process you are trying to model. A customer skips a phone number. A claims adjuster leaves a field blank until a manual review happens. A lab result is absent because the test was never ordered. Those cases do not all mean the same thing, and treating them as generic damage usually throws away useful information.

The practical point is simple. Missingness can be predictive. Zest AI makes this point well in its discussion of methods for dealing with missing data. In many real workflows, the fact that a value is absent carries information about behavior, process, or risk.

[Infographic: Missingness as a Feature, five advanced machine learning techniques for handling missing data.]

Preserve the absence before you fill anything

A good default in predictive work is to create a missingness flag before imputation. It takes seconds and protects information you cannot recover later.

df["income_was_missing"] = df["income"].isna().astype(int)
df["phone_was_missing"] = df["phone_number"].isna().astype(int)

I use this most when the blank plausibly reflects user choice, operational routing, eligibility, or fraud screening. In those settings, the indicator often matters more than the filled value itself.

That does not mean every column needs a flag. If a field is missing because of random sensor dropouts or a one-off export problem, the indicator may add noise. The job is to ask what generated the blank, not to add dummy variables by habit.
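If you are already using scikit-learn, `SimpleImputer` can produce the flags and the fill in one step through its `add_indicator` option. A small sketch on a made-up frame follows. The `_was_missing` suffix is my own naming choice, not something the library produces.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Hypothetical frame with gaps in both columns.
demo = pd.DataFrame({
    "income": [52.0, np.nan, 61.0, np.nan, 48.0],
    "tenure": [3.0, 5.0, np.nan, 2.0, 7.0],
})

# add_indicator=True appends one 0/1 column per feature that had gaps at fit time.
imputer = SimpleImputer(strategy="median", add_indicator=True)
values = imputer.fit_transform(demo)

# indicator_.features_ holds the positions of the flagged input columns.
flag_names = [f"{demo.columns[i]}_was_missing" for i in imputer.indicator_.features_]
flagged = pd.DataFrame(values, columns=list(demo.columns) + flag_names)
print(flagged)
```

Because the flags are created by the fitted imputer, the same columns appear at scoring time even if a new batch happens to have no gaps in a field.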

Tree-based models can help here because many of them tolerate missingness well or can split on the indicator effectively. But convenience is not the same as judgment. A model can learn a temporary collection quirk just as easily as it learns a durable business pattern.

A useful rule is this. If the missingness could exist at scoring time for the same operational reason, keep and test it. If it comes from a historical migration bug that will never happen again, building around it is usually a mistake.

Sometimes the right choice is to leave values missing

Analysts are often taught that every null must be replaced before modeling. That advice breaks down fast in operational data.

For some methods, especially tree ensembles with native handling for nulls, leaving the value missing is a valid modeling choice. It can be better than forcing mean, median, or KNN imputation onto a field where the absence has a different meaning from any observed value. The trade-off is interpretability and portability. A model that handles nulls internally may score well, but it can be harder to explain or reimplement in another system.

This is also common in sequential systems. In event logs, device telemetry, and product funnels, a blank may mean “not observed yet,” “not applicable,” or “the event never happened.” Those are different states. Analysts working in those settings usually need process-aware handling, not default fills. The same logic shows up in broader time series analysis methods, where the gap itself can carry information.

Use a simple decision test

Under deadline pressure, I ask three questions:

Does the blank reflect behavior, workflow state, eligibility, or risk?
Will that same kind of blank appear when the model is used in production?
Does preserving missingness change model performance or the business decision?

If the answer is yes to the first two, test a missingness indicator or a model that can keep nulls intact, and let the third question decide whether it stays. If the answer to either of the first two is no, standard imputation is usually fine.
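The third question is an empirical one, so it helps to answer it with a side-by-side comparison. Here is a minimal sketch that cross-validates the same model with and without a missingness flag. The data is synthetic and built so that blanks carry signal, and the column names and the logistic regression baseline are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(8)
n = 3_000

# Hypothetical data where a blank income field is associated with the outcome.
was_blank = rng.random(n) < 0.30
y = (rng.random(n) < np.where(was_blank, 0.8, 0.2)).astype(int)
X = pd.DataFrame({
    "income_k": np.where(was_blank, np.nan, rng.normal(55, 15, size=n)),
    "tenure": rng.gamma(2.0, 3.0, size=n),
})

candidates = {
    "fill_only": make_pipeline(
        SimpleImputer(strategy="median"),
        LogisticRegression(max_iter=1000),
    ),
    "fill_plus_flag": make_pipeline(
        SimpleImputer(strategy="median", add_indicator=True),
        LogisticRegression(max_iter=1000),
    ),
}

# Same folds, same model, only the handling of missingness differs.
scores = {
    name: cross_val_score(pipe, X, y, cv=5, scoring="roc_auc").mean()
    for name, pipe in candidates.items()
}
print({name: round(score, 3) for name, score in scores.items()})
```

On real data the gap may be small or absent. That is a useful answer too, because it tells you the simpler pipeline is enough.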

The point is not to treat missingness as special in every project. The point is to stop erasing it before you decide what it means.

Avoid These Common Pitfalls

Missing-data mistakes usually happen because it is 5 p.m. on a deadline, not because the analyst lacks methods. The pattern is familiar. Someone cleans the dataframe fast, fills blanks with a default rule, gets a model to run, and only later realizes the cleanup step changed the result more than the model did.

The first failure point is leakage. If you calculate imputation values before the train-test split, your validation is biased from the start. Mean, median, mode, KNN, or model-based imputation: it does not matter which. If the test set influenced the fill values, the score is too optimistic. Put the imputer inside the pipeline and fit it on training data only.

The second failure point is picking a method because it is convenient instead of because it matches the missingness pattern. Analysts do this all the time under pressure. They diagnose a process issue, a subgroup effect, or a field that is missing for a business reason, then still use a global mean fill because it is quick. That decision can erase the very pattern the analysis was supposed to examine.

A clean dataframe proves very little. The real question is whether your handling choice changed the story, and whether you can defend that choice to someone who has to act on the result.

One practical mistake is treating single imputation as if it removed uncertainty. It does not. A filled value is still a guess. If the analysis is inferential, that distinction matters because standard errors and confidence in the estimate can look better than they should. If the analysis is purely predictive and the deadline is tight, a simpler method may still be the right call. The point is to be honest about the trade-off.

Use this checklist before you call the work done:

Compare the distribution of observed values to the distribution after imputation.
Refit the main analysis on complete cases and on the handled dataset.
Check whether conclusions change in direction, ranking, or business recommendation.
Record what you assumed about the missingness mechanism and why that assumption is reasonable.
Make the workflow reproducible so another analyst can rerun it without guessing what happened.
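The first three checks can be scripted in a few lines. The sketch below runs them on synthetic data with a median fill. The columns, the linear model, and the use of a single coefficient as "the conclusion" are simplifying assumptions, so swap in whatever your main analysis actually estimates.

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(7)
n = 2_000

# Hypothetical data: spend depends on income, income depends on tenure.
demo = pd.DataFrame({"tenure": rng.gamma(2.0, 3.0, size=n)})
demo["income"] = 40 + 2.0 * demo["tenure"] + rng.normal(0, 6, size=n)
demo["spend"] = 5 + 0.5 * demo["income"] + rng.normal(0, 4, size=n)
demo.loc[rng.random(n) < 0.25, "income"] = np.nan

# Check 1: distribution of observed values vs. values after the fill.
handled = demo.copy()
handled[["income"]] = SimpleImputer(strategy="median").fit_transform(handled[["income"]])
dist_check = pd.DataFrame({
    "observed": demo["income"].describe(),
    "after_fill": handled["income"].describe(),
})


def income_slope(frame: pd.DataFrame) -> float:
    """Coefficient on income from a linear model of spend on income and tenure."""
    fitted = LinearRegression().fit(frame[["income", "tenure"]], frame["spend"])
    return float(fitted.coef_[0])


# Checks 2 and 3: refit on complete cases and on the handled data, then compare.
slopes = {
    "complete_cases": income_slope(demo.dropna()),
    "median_filled": income_slope(handled),
}
relative_change = abs(slopes["median_filled"] - slopes["complete_cases"]) / abs(
    slopes["complete_cases"]
)

print(dist_check.round(2))
print(slopes, f"relative change: {relative_change:.1%}")
```

How large a relative change counts as material is a judgment call for the decision at hand. The script only makes the difference visible.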

Teams that want those checks built into the workflow often use automated data processing software to keep preprocessing, modeling, and documentation in one auditable path.

My standard is simple. Another analyst should be able to inspect your work and answer three questions without chasing you on Slack. Why was this method chosen? Was it applied correctly? Did it materially change the result?

PlotStudio AI is a good fit if you want that level of rigor without spending your week wiring notebooks together by hand. It helps analysts turn plain-English questions into publication-ready analyses, with code execution, methodological planning, reproducible outputs, and reviewable workflows in one place. If you want to move faster without giving up analyst judgment, explore PlotStudio AI.