
Handling Missing or Inconsistent Data

Lesson 13/31 | Study Time: 24 Min

Handling missing or inconsistent data is one of the most important and time-consuming steps in the data science lifecycle. In real-world datasets, missing values, incomplete records, mismatched formats, and incorrect entries are the norm rather than the exception. These issues arise from human errors, system failures, operational oversights, or limitations in data collection processes. The presence of missing or inconsistent data can significantly weaken the reliability of insights and predictions, making this submodule critical in preparing the dataset for modeling.

The goal of this phase is not just to “fill blanks” or “correct values” but to understand the underlying patterns, causes, and implications of missingness and inconsistencies. This understanding shapes the strategy you choose, because each type of missingness calls for a different treatment. Mishandling missing data can introduce bias, distort distributions, or artificially inflate trends, ultimately leading to misleading conclusions.

Types of Missing Data 

In many situations missing data is not random, and identifying its type helps you avoid improper imputation and inaccurate modeling.


A. MCAR — Missing Completely at Random

1. Values are missing with no pattern or correlation to other variables

2. Occurs due to accidental deletion or random system glitches

3. Safest type because handling it introduces minimal bias

B. MAR — Missing at Random

1. Missingness depends on other observed variables

2. Example: income missing more for certain job titles

3. Needs careful imputation (conditional methods are preferred)

C. MNAR — Missing Not at Random

1. Missingness depends on the missing value itself

2. Example: high spenders hiding spending data

3. Hardest to treat; often requires domain expertise or advanced techniques




Understanding these categories ensures that the method chosen does not distort the dataset’s true nature.
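
The distinction becomes concrete in code. Below is a minimal sketch, assuming a small synthetic dataset (the job_title and income columns are hypothetical, not from the lesson), that injects each kind of missingness with NumPy and pandas:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=42)
df = pd.DataFrame({
    "job_title": rng.choice(["analyst", "manager", "director"], size=1_000),
    "income": rng.normal(60_000, 15_000, size=1_000).round(2),
})

# MCAR: every value has the same 10% chance of being missing,
# independent of anything else in the data
mcar = rng.random(len(df)) < 0.10
df["income_mcar"] = df["income"].mask(mcar)

# MAR: missingness depends on ANOTHER observed column --
# here, managers skip the income field more often
mar = rng.random(len(df)) < np.where(df["job_title"] == "manager", 0.40, 0.05)
df["income_mar"] = df["income"].mask(mar)

# MNAR: missingness depends on the missing value ITSELF --
# high earners are the ones hiding their income
mnar = rng.random(len(df)) < np.where(df["income"] > 75_000, 0.50, 0.05)
df["income_mnar"] = df["income"].mask(mnar)

# Under MCAR the observed mean stays close to the true mean;
# under MNAR it is biased downward, since high values vanish more often
print(df[["income", "income_mcar", "income_mar", "income_mnar"]].mean())
```

Comparing the column means shows why the categories matter: deleting or naively imputing MNAR values bakes that downward bias into the dataset.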

Why Missing Data Happens (Real-World Reasons)

Missing data is rarely accidental; something in the pipeline causes it.


1. Data entry errors (manual form filling)

2. Optional fields not filled by users

3. Sensor failures producing blank readings

4. Privacy concerns, especially in sensitive fields like income

5. Disconnected systems during data collection


Recognizing causes helps prevent future missingness, not just fix current gaps.

Detecting Missing Data (Practical Techniques)

Before fixing values, analysts explore missingness patterns. These methods help identify which variables are too incomplete to be used and which require imputation.


1. Counting nulls in each column

2. Visualizing missingness using heatmaps

3. Reviewing patterns within groups (e.g., missing income only for one region)

4. Checking percentages: columns with >30% missing often require removal or transformation
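
These checks translate directly into a few lines of pandas. The sketch below builds a tiny hypothetical frame (region, income, and age are illustrative names); the heatmap step is left as a comment because it needs a plotting library such as missingno or seaborn:

```python
import numpy as np
import pandas as pd

# Hypothetical frame with deliberate gaps
df = pd.DataFrame({
    "region": ["north", "north", "south", "south", "east", "east"],
    "income": [52_000, np.nan, 61_000, np.nan, np.nan, 58_000],
    "age":    [34, 29, np.nan, 41, 38, 45],
})

# 1. Count (and rank) nulls in each column
print(df.isna().sum().sort_values(ascending=False))

# 2. Visualize missingness as a heatmap/matrix (optional dependency)
# import missingno as msno; msno.matrix(df)
# or: import seaborn as sns; sns.heatmap(df.isna())

# 3. Review patterns within groups, e.g. share of missing income per region
print(df.groupby("region")["income"].apply(lambda s: s.isna().mean()))

# 4. Check percentages; flag columns above the ~30% threshold
pct_missing = df.isna().mean() * 100
print(pct_missing[pct_missing > 30])  # candidates for removal or transformation
```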


Strategies to Handle Missing Data (Choosing the Right Method)

Choosing an incorrect strategy can affect model fairness, accuracy, and interpretability, so the method should match the type and extent of missingness.

Handling Techniques and When to Use Them
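
Common options include deletion, simple statistical imputation, conditional (group-wise) imputation, and missingness indicator flags. The pandas sketch below illustrates each one on a hypothetical frame (the column names are assumptions, not from the lesson); heavier model-based imputers such as scikit-learn's KNNImputer are the next step when these simple statistics distort the distribution:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "job_title": ["analyst", "manager", "analyst", "manager", "director"],
    "income":    [55_000, np.nan, 58_000, 72_000, 95_000],
    "rating":    [3.8, 4.1, np.nan, 4.4, 4.9],
})

# Deletion: reasonable under MCAR when few rows are affected
dropped = df.dropna(subset=["income"])

# Mean/median imputation: simple but flattens variance;
# prefer the median for skewed columns such as income
df["rating_filled"] = df["rating"].fillna(df["rating"].mean())

# Conditional (group-wise) imputation: suited to MAR, where
# missingness depends on another column such as job_title
df["income_filled"] = df.groupby("job_title")["income"].transform(
    lambda s: s.fillna(s.median())
)

# Indicator flag: records THAT a value was missing, which can
# itself be predictive (useful when MNAR is suspected)
df["income_was_missing"] = df["income"].isna().astype(int)
```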


Handling Inconsistent Data (Cleaning Values That Don’t Match Rules)

Inconsistent data refers to values that contradict logical, contextual, or formatting expectations. This often occurs across systems where data entry rules are not standardized.

Examples of inconsistencies

1. “Male”, “male”, “M”, “m” → represent the same category

2. Dates in different formats

3. Currency symbols missing or mixed

4. Units mismatched (“5kg” vs “5000g”)

5. Typos and spelling variations
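
Before mapping anything, it helps to surface these variants; a quick check on a hypothetical gender column is to list the distinct values and their counts:

```python
import pandas as pd

gender = pd.Series(["Male", "male", "M", "m", "Female", "F"])
print(gender.value_counts())  # one real category hiding behind several spellings
```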

Fixing Inconsistencies Typically Involves

1. Standardizing case (uppercase/lowercase)
2. Unifying date formats
3. Mapping categories to standard labels
4. Converting units
5. Removing impossible values (e.g., negative height)
6. Validating values against business rules

This process significantly improves the uniformity and reliability of the dataset.
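
A minimal sketch of these six steps, assuming a hypothetical df whose gender, signup_date, weight, and height_cm columns show the problems listed above (format="mixed" requires pandas 2.0+):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "gender": ["Male", "male", "M", "f", "Female"],
    "signup_date": ["2024-01-05", "Jan 5, 2024", "2024/02/10",
                    "February 10, 2024", "2024-03-01"],
    "weight": ["5kg", "5000g", "4.2kg", "3800g", "6kg"],
    "height_cm": [172, -5, 181, 165, 158],
})

# 1 & 3. Standardize case, then map categories to standard labels
df["gender"] = df["gender"].str.lower().map(
    {"male": "male", "m": "male", "female": "female", "f": "female"}
)

# 2. Unify date formats into a single datetime dtype
df["signup_date"] = pd.to_datetime(df["signup_date"], format="mixed")

# 4. Convert units so every weight is expressed in kilograms
def to_kg(value: str) -> float:
    if value.endswith("kg"):
        return float(value[:-2])
    return float(value[:-1]) / 1000  # grams -> kilograms

df["weight_kg"] = df["weight"].map(to_kg)

# 5. Remove impossible values (a negative height becomes missing)
df.loc[df["height_cm"] <= 0, "height_cm"] = np.nan

# 6. Validate against a business rule, e.g. weight must fall in 0-500 kg
assert df["weight_kg"].between(0, 500).all()
```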