
Introduction to data wrangling

Lesson 15/31 | Study Time: 27 Min

Data wrangling is the step that transforms raw, messy, and unstructured data into formats optimized for analysis, modeling, reporting, or visualization. While data cleaning focuses on correcting errors, data wrangling focuses on reshaping, restructuring, and enriching the dataset so it becomes analytically powerful.

This module explores the modern, industry-standard concepts of data wrangling, including reshaping data structures, joining datasets, engineering new features, and subsetting large data collections for efficient analysis.

Data wrangling is a creative, iterative process—part engineering, part analysis, and part domain reasoning. When performed effectively, it elevates dataset usability and directly improves model performance.


Reshaping Data 

Reshaping data involves reorganizing the structure of a dataset to make it more suitable for analytical objectives. Raw data is rarely structured exactly the way a data scientist needs it. You may receive it as a wide spreadsheet, nested JSON, log files, or a narrow long-format table.
Reshaping techniques convert these structures into analytical formats compatible with statistical models, BI dashboards, and machine learning algorithms.

1. Long vs Wide Format Conversion

Data can exist in two essential structural formats:


Wide Format:

1. Each subject appears once

2. Multiple attributes appear as separate columns

3. Example: Monthly sales columns like Jan, Feb, Mar

Long Format:

1. Each row represents one observation of a variable

2. Suitable for time-series and machine learning

3. Example: Month and Sales columns instead of 12 monthly columns

Why Long/Wide Conversion Matters:

1. Machine learning algorithms often work best with long format

2. Business reporting often prefers wide format

3. Time-series forecasting requires long format with timestamp fields

This transformation ensures compatibility with the intended analysis.
Tools like pd.melt() and pivot() in Python make conversions seamless.
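
As a brief sketch, the snippet below converts a small, hypothetical monthly-sales table from wide to long format and back using pandas (the store, Jan, Feb, and Mar column names are invented for illustration):

import pandas as pd

# Hypothetical wide-format table: one row per store, one column per month
wide = pd.DataFrame({
    "store": ["A", "B"],
    "Jan": [100, 80],
    "Feb": [120, 95],
    "Mar": [130, 90],
})

# Wide -> long: each row becomes one (store, month, sales) observation
long_form = wide.melt(id_vars="store", var_name="month", value_name="sales")

# Long -> wide: pivot back into one column per month
wide_again = long_form.pivot(index="store", columns="month", values="sales").reset_index()

print(long_form)
print(wide_again)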


2. Pivoting and Unpivoting Data

Pivoting transforms long data into wide format to create summaries or reorganize categories.


Use Cases of Pivoting:

1. Creating user-level summaries (e.g., total purchases by category)

2. Generating cross-tab reports for BI dashboards

3. Reshaping transactional data into analytical matrices

Unpivoting (also called melting) is the reverse process, in which wide data is condensed into long format.


Why It Matters:

Unpivoting standardizes datasets with many repeated columns and makes them more compatible with statistical algorithms. For example, converting columns like “Test1, Test2, Test3” into a single “Test Score” column improves modeling consistency.
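
The sketch below illustrates both directions with invented data: pivoting a small transaction log into a user-by-category spending matrix, and melting hypothetical Test1, Test2, and Test3 columns into a single score column:

import pandas as pd

# Hypothetical transaction log: one row per purchase
tx = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "category": ["food", "books", "food", "food", "books"],
    "amount": [12.5, 30.0, 8.0, 15.0, 22.0],
})

# Pivot: long transactions -> wide user-level matrix (total spend per category)
user_matrix = tx.pivot_table(index="customer_id", columns="category",
                             values="amount", aggfunc="sum", fill_value=0)

# Unpivot (melt): wide test-score columns -> one observation per row
tests = pd.DataFrame({"student": ["Asha", "Ben"],
                      "Test1": [78, 85], "Test2": [82, 79], "Test3": [91, 88]})
long_tests = tests.melt(id_vars="student", var_name="test", value_name="score")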


3. Grouping and Aggregating

Aggregation helps summarize and condense large datasets.


Common Aggregations:

Sum of sales

Average temperature readings

Maximum and minimum values per group

Count of unique users or sessions

Why Aggregation is Critical:

Simplifies large datasets

Supports feature engineering (e.g., customer lifetime value)

Helps identify trends and patterns

Enables efficient model-ready datasets

Grouping and aggregation directly influence the quality of descriptive analytics and predictive features.
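
As a minimal sketch, the snippet below groups a hypothetical orders table by customer and computes several aggregates at once (all column names are assumptions for the example):

import pandas as pd

orders = pd.DataFrame({
    "customer_id": [1, 1, 2, 3, 3, 3],
    "amount": [20.0, 35.0, 15.0, 50.0, 10.0, 40.0],
    "session_id": ["a", "b", "b", "c", "c", "d"],
})

# Group by customer and summarize with named aggregations
summary = orders.groupby("customer_id").agg(
    total_spend=("amount", "sum"),
    avg_order=("amount", "mean"),
    max_order=("amount", "max"),
    n_sessions=("session_id", "nunique"),
)

print(summary)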


Feature Creation (Feature Engineering)

Feature engineering is the process of creating meaningful new variables that enhance the modeling power of algorithms. Raw data often lacks the expressiveness required by machine learning models. By deriving new features, data scientists expose hidden patterns and relationships.


1. Deriving New Columns from Existing Variables

This involves creating new metrics or indicators based on combinations or transformations of existing columns.

Examples:

Age from birthdate

Time difference between events

Revenue growth rate

Customer activity scores

Why It Matters:

Machine learning models identify patterns based on the features provided. Well-designed features capture essential business logic and improve predictions significantly.
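
The following sketch derives a few such columns with pandas, assuming hypothetical birthdate, signup, first_purchase, and revenue fields:

import pandas as pd

customers = pd.DataFrame({
    "birthdate": pd.to_datetime(["1990-05-01", "1984-11-23"]),
    "signup": pd.to_datetime(["2023-01-10", "2023-02-01"]),
    "first_purchase": pd.to_datetime(["2023-01-15", "2023-03-20"]),
    "revenue_2023": [1200.0, 800.0],
    "revenue_2024": [1500.0, 760.0],
})

today = pd.Timestamp("2024-06-01")

# Age (in years) derived from birthdate
customers["age"] = (today - customers["birthdate"]).dt.days // 365

# Time difference between two events, in days
customers["days_to_first_purchase"] = (customers["first_purchase"] - customers["signup"]).dt.days

# Revenue growth rate between two periods
customers["revenue_growth"] = customers["revenue_2024"] / customers["revenue_2023"] - 1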


2. Encoding Categorical Variables (OHE, Label Encoding)

Most machine learning algorithms require numerical inputs. Categorical variables need encoding.


One-Hot Encoding:

Creates binary columns for each category.
Useful for nominal (unordered) categories like color or city.

Label Encoding:

Assigns numeric labels to categories.
Useful for ordinal categories like education level.

Why Encoding Is Important:

Incorrect encoding distorts patterns and biases models. Proper encoding ensures compatibility with ML algorithms and preserves categorical meaning.
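
As a minimal sketch, the snippet below applies one-hot encoding to a nominal city column and an ordered numeric encoding to an education column; the data and column names are invented, and scikit-learn's OrdinalEncoder stands in here for label-style encoding with an explicit category order:

import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

df = pd.DataFrame({
    "city": ["Delhi", "Mumbai", "Delhi", "Pune"],
    "education": ["High School", "Bachelor", "Master", "Bachelor"],
})

# One-hot encoding for the nominal variable: one binary column per city
one_hot = pd.get_dummies(df["city"], prefix="city")

# Ordinal (label-style) encoding for the ordered variable,
# with the category order supplied explicitly
edu_order = [["High School", "Bachelor", "Master"]]
df["education_level"] = OrdinalEncoder(categories=edu_order).fit_transform(
    df[["education"]]
).ravel()

encoded = pd.concat([df, one_hot], axis=1)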


3. Binning and Scaling Data

Binning (Discretization):

Splits continuous variables into ranges.
Examples include income brackets or age groups.

Scaling:

Standardization (Z-score scaling)

Normalization (Min–Max scaling)

Impact on Modeling:

Many algorithms—such as kNN, SVM, and neural networks—are sensitive to scale.
Scaling ensures that all features contribute equally and models converge faster.
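
A short sketch of both operations, using invented age and income columns with pandas and scikit-learn:

import pandas as pd
from sklearn.preprocessing import StandardScaler, MinMaxScaler

df = pd.DataFrame({
    "age": [22, 35, 47, 61, 29],
    "income": [28000, 52000, 61000, 75000, 33000],
})

# Binning: cut continuous age into labelled groups
df["age_group"] = pd.cut(df["age"],
                         bins=[0, 30, 45, 60, 120],
                         labels=["<30", "30-44", "45-59", "60+"])

# Standardization: zero mean, unit variance (Z-score)
df["income_z"] = StandardScaler().fit_transform(df[["income"]]).ravel()

# Normalization: rescale to the [0, 1] range (Min-Max)
df["income_minmax"] = MinMaxScaler().fit_transform(df[["income"]]).ravel()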

Merging and Joining Datasets 

Modern data ecosystems rely on multiple structured and unstructured sources—databases, APIs, log files, CRM systems, ERP platforms, and more. Combining these datasets is essential to build a unified and complete analytical view.


1. Combining Multiple Data Sources Using Keys

Joining involves merging datasets on common identifiers like:

Customer ID

Order ID

Product SKU

Email or phone number

Timestamp

Why Joins Are Critical:

Real-world data is distributed across systems. For example:

1. The CRM contains customer profiles

2. The transaction system contains purchase data

3. Web logs contain browsing behavior

Joining these sources produces a single unified dataset that models real-world entities accurately.
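
The sketch below merges two hypothetical sources, a CRM profile table and a transaction log, on a shared customer_id key (all field names are assumptions):

import pandas as pd

# CRM-style customer profiles
crm = pd.DataFrame({
    "customer_id": [1, 2, 3],
    "name": ["Asha", "Ben", "Chen"],
    "plan": ["basic", "premium", "basic"],
})

# Transaction-system purchase records
transactions = pd.DataFrame({
    "customer_id": [1, 1, 3, 4],
    "order_id": [101, 102, 103, 104],
    "amount": [25.0, 40.0, 15.0, 60.0],
})

# Merge on the shared key to build a unified analytical view
unified = crm.merge(transactions, on="customer_id", how="left")
print(unified)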


2. Handling Join Types (Inner, Left, Right, Outer)

Different join types control how datasets combine.
Common Join Types:

1. Inner join: keeps only rows whose key appears in both datasets

2. Left join: keeps all rows from the left dataset, plus matching rows from the right

3. Right join: keeps all rows from the right dataset, plus matching rows from the left

4. Outer join: keeps all rows from both datasets, filling unmatched values with missing values

Real-World Relevance:


1. Left joins are common in analytics (retain base population)

2. Inner joins are common in modeling (ensure complete data)

3. Outer joins are used when diagnosing mismatches or missing links

Choosing the wrong join type leads to data loss, duplicates, or misaligned rows.
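
Continuing with the same hypothetical CRM and transaction tables, the sketch below shows how the how parameter of pandas merge selects the join type:

import pandas as pd

crm = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["Asha", "Ben", "Chen"]})
transactions = pd.DataFrame({"customer_id": [1, 1, 3, 4],
                             "amount": [25.0, 40.0, 15.0, 60.0]})

# Inner join: only keys present in both tables (complete data for modeling)
inner = crm.merge(transactions, on="customer_id", how="inner")

# Left join: retain the full CRM base population, even customers with no purchases
left = crm.merge(transactions, on="customer_id", how="left")

# Outer join: keep everything from both sides; the indicator column flags
# unmatched keys, which helps when diagnosing mismatches or missing links
outer = crm.merge(transactions, on="customer_id", how="outer", indicator=True)
print(outer)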