
Handling Missing or Inconsistent Data

Lesson 13/31 | Study Time: 24 Min

Handling missing or inconsistent data is one of the most important and time-consuming steps in the data science lifecycle. In real-world datasets, missing values, incomplete records, mismatched formats, and incorrect entries are the norm rather than the exception. These issues arise from human errors, system failures, operational oversights, or limitations in data collection processes. The presence of missing or inconsistent data can significantly weaken the reliability of insights and predictions, making this submodule critical in preparing the dataset for modeling.

The goal of this phase is not just to “fill blanks” or “correct values” but to understand the underlying patterns, causes, and implications of missingness and inconsistencies. This understanding shapes the strategy you choose, because each type of missingness calls for a different treatment. Mishandling missing data can introduce bias, distort distributions, or artificially inflate trends, ultimately leading to misleading conclusions.

Types of Missing Data 

In many situations missing data is not random, and identifying its type helps you avoid improper imputation and inaccurate modeling.


A. MCAR — Missing Completely at Random

1. Values are missing with no pattern or correlation to other variables

2. Occurs due to accidental deletion or random system glitches

3. Safest type because handling it introduces minimal bias

B. MAR — Missing at Random

1. Missingness depends on other observed variables

2. Example: income missing more for certain job titles

3. Needs careful imputation (conditional methods are preferred)

C. MNAR — Missing Not at Random

1. Missingness depends on the missing value itself

2. Example: high spenders hiding spending data

3. Hardest to treat; often requires domain expertise or advanced techniques




Understanding these categories ensures that the method chosen does not distort the dataset’s true nature.
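
The distinction becomes concrete in code. Below is a minimal sketch, assuming a small synthetic dataset (the job_title and income columns are hypothetical, not from the lesson), that injects each kind of missingness with NumPy and pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df = pd.DataFrame({
    "job_title": rng.choice(["analyst", "manager", "director"], size=1_000),
    "income": rng.normal(60_000, 15_000, size=1_000).round(2),
})

# MCAR: every value has the same 10% chance of being missing,
# independent of anything else in the data
mcar = rng.random(len(df)) < 0.10
df["income_mcar"] = df["income"].mask(mcar)

# MAR: missingness depends on ANOTHER observed column --
# here, managers skip the income field more often
mar = rng.random(len(df)) < np.where(df["job_title"] == "manager", 0.40, 0.05)
df["income_mar"] = df["income"].mask(mar)

# MNAR: missingness depends on the missing value ITSELF --
# high earners are the ones hiding their income
mnar = rng.random(len(df)) < np.where(df["income"] > 75_000, 0.50, 0.05)
df["income_mnar"] = df["income"].mask(mnar)

# Under MCAR the observed mean stays close to the true mean;
# under MNAR it is biased downward, since high values vanish more often
print(df[["income", "income_mcar", "income_mar", "income_mnar"]].mean())
```

Comparing the column means shows why the categories matter: deleting or naively imputing MNAR values bakes that downward bias into the dataset.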

Why Missing Data Happens (Real-World Reasons)

Missing data is rarely accidental; something in the pipeline causes it.


1. Data entry errors (manual form filling)

2. Optional fields not filled by users

3. Sensor failures producing blank readings

4. Privacy concerns, especially in sensitive fields like income

5. Disconnected systems during data collection


Recognizing causes helps prevent future missingness, not just fix current gaps.

Detecting Missing Data (Practical Techniques)

Before fixing values, analysts explore missingness patterns. These methods help identify which variables are too incomplete to be used and which require imputation.


1. Counting nulls in each column

2. Visualizing missingness using heatmaps

3. Reviewing patterns within groups (e.g., missing income only for one region)

4. Checking percentages: columns with >30% missing often require removal or transformation
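
These checks translate directly into a few lines of pandas. The sketch below builds a tiny hypothetical frame (region, income, and age are illustrative names); the heatmap step is left as a comment because it needs a plotting library such as missingno or seaborn:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberate gaps
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "east", "east"],
    "income": [52_000, np.nan, 61_000, np.nan, np.nan, 58_000],
    "age":    [34, 29, np.nan, 41, 38, 45],
})

# 1. Count (and rank) nulls in each column
print(df.isna().sum().sort_values(ascending=False))

# 2. Visualize missingness as a heatmap/matrix (optional dependency)
# import missingno as msno; msno.matrix(df)
# or: import seaborn as sns; sns.heatmap(df.isna())

# 3. Review patterns within groups, e.g. share of missing income per region
print(df.groupby("region")["income"].apply(lambda s: s.isna().mean()))

# 4. Check percentages; flag columns above the ~30% threshold
pct_missing = df.isna().mean() * 100
print(pct_missing[pct_missing > 30])  # candidates for removal or transformation
```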


Strategies to Handle Missing Data (Choosing the Right Method)

Choosing an incorrect strategy can affect model fairness, accuracy, and interpretability, so the method should match the type and extent of missingness.

Handling Techniques and When to Use Them
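
Common options include deletion, simple statistical imputation, conditional (group-wise) imputation, and missingness indicator flags. The pandas sketch below illustrates each one on a hypothetical frame (the column names are assumptions, not from the lesson); heavier model-based imputers such as scikit-learn's KNNImputer are the next step when these simple statistics distort the distribution:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "job_title": ["analyst", "manager", "analyst", "manager", "director"],
    "income":    [55_000, np.nan, 58_000, 72_000, 95_000],
    "rating":    [3.8, 4.1, np.nan, 4.4, 4.9],
})

# Deletion: reasonable under MCAR when few rows are affected
dropped = df.dropna(subset=["income"])

# Mean/median imputation: simple but flattens variance;
# prefer the median for skewed columns such as income
df["rating_filled"] = df["rating"].fillna(df["rating"].mean())

# Conditional (group-wise) imputation: suited to MAR, where
# missingness depends on another column such as job_title
df["income_filled"] = df.groupby("job_title")["income"].transform(
    lambda s: s.fillna(s.median())
)

# Indicator flag: records THAT a value was missing, which can
# itself be predictive (useful when MNAR is suspected)
df["income_was_missing"] = df["income"].isna().astype(int)
```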


Handling Inconsistent Data (Cleaning Values That Don’t Match Rules)

Inconsistent data refers to values that contradict logical, contextual, or formatting expectations. This often occurs across systems where data entry rules are not standardized.

Examples of inconsistencies

1. “Male”, “male”, “M”, “m” → represent the same category

2. Dates in different formats

3. Currency symbols missing or mixed

4. Units mismatched (“5kg” vs “5000g”)

5. Typos and spelling variations
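
Before mapping anything, it helps to surface these variants; a quick check on a hypothetical gender column is to list the distinct values and their counts:

```python
import pandas as pd

gender = pd.Series(["Male", "male", "M", "m", "Female", "F"])
print(gender.value_counts())  # one real category hiding behind several spellings
```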

Fixing Inconsistencies Typically Involves

1. Standardizing case (uppercase/lowercase)
2. Unifying date formats
3. Mapping categories to standard labels
4. Converting units
5. Removing impossible values (e.g., negative height)
6. Validating values against business rules

This process significantly improves the uniformity and reliability of the dataset.
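
A minimal sketch of these six steps, assuming a hypothetical df whose gender, signup_date, weight, and height_cm columns show the problems listed above (format="mixed" requires pandas 2.0+):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "male", "M", "f", "Female"],
    "signup_date": ["2024-01-05", "Jan 5, 2024", "2024/02/10",
                    "February 10, 2024", "2024-03-01"],
    "weight": ["5kg", "5000g", "4.2kg", "3800g", "6kg"],
    "height_cm": [172, -5, 181, 165, 158],
})

# 1 & 3. Standardize case, then map categories to standard labels
df["gender"] = df["gender"].str.lower().map(
    {"male": "male", "m": "male", "female": "female", "f": "female"}
)

# 2. Unify date formats into a single datetime dtype
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed")

# 4. Convert units so every weight is expressed in kilograms
def to_kg(value: str) -> float:
    if value.endswith("kg"):
        return float(value[:-2])
    return float(value[:-1]) / 1000  # grams -> kilograms

df["weight_kg"] = df["weight"].map(to_kg)

# 5. Remove impossible values (a negative height becomes missing)
df.loc[df["height_cm"] <= 0, "height_cm"] = np.nan

# 6. Validate against a business rule, e.g. weight must fall in 0-500 kg
assert df["weight_kg"].between(0, 500).all()
```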