Data wrangling is the step that transforms raw, messy, and unstructured data into formats optimized for analysis, modeling, reporting, or visualization. While data cleaning focuses on correcting errors, data wrangling focuses on reshaping, restructuring, and enriching the dataset so it becomes analytically powerful.
This module explores the modern, industry-standard concepts of data wrangling, including reshaping data structures, joining datasets, engineering new features, and subsetting large data collections for efficient analysis.
Data wrangling is a creative, iterative process: part engineering, part analysis, and part domain reasoning. When performed effectively, it makes datasets far more usable and can substantially improve model performance.
Reshaping Data
Reshaping data involves reorganizing the structure of a dataset to make it more suitable for analytical objectives. Raw data is rarely structured exactly the way a data scientist needs it. You may receive it as a wide spreadsheet, nested JSON, log files, or a narrow long-format table.
Reshaping techniques convert these structures into analytical formats compatible with statistical models, BI dashboards, and machine learning algorithms.
1. Long vs Wide Format Conversion
Data can exist in two essential structural formats: wide and long.
Wide format:
1. Each subject appears once
2. Multiple attributes appear as separate columns
3. Example: Monthly sales columns like Jan, Feb, Mar
Long format:
1. Each row represents one observation of a variable
2. Suitable for time-series and machine learning
3. Example: Month and Sales columns instead of 12 monthly columns
When to use which:
1. Machine learning algorithms often work best with long format
2. Business reporting often prefers wide format
3. Time-series forecasting requires long format with timestamp fields
This transformation ensures compatibility with the intended analysis.
Tools like pd.melt() and pivot() in Python make conversions seamless.
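A minimal sketch of both directions in pandas, using a hypothetical monthly sales table (the store, Jan, Feb, and Mar column names are illustrative):

```python
import pandas as pd

# Hypothetical wide-format data: one row per store, one column per month
wide = pd.DataFrame({
    "store": ["A", "B"],
    "Jan": [100, 80],
    "Feb": [120, 95],
    "Mar": [130, 110],
})

# Wide -> long: melt the month columns into a single Month/Sales pair
long_df = wide.melt(id_vars="store", var_name="Month", value_name="Sales")

# Long -> wide: pivot back so each month becomes its own column again
wide_again = long_df.pivot(index="store", columns="Month", values="Sales").reset_index()
```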
2. Pivoting and Unpivoting Data
Pivoting transforms long data into wide format to create summaries or reorganize categories. Common use cases include:
1. Creating user-level summaries (e.g., total purchases by category)
2. Generating cross-tab reports for BI dashboards
3. Reshaping transactional data into analytical matrices
Unpivoting (also called melting) is the reverse process, in which wide data is condensed into long format.
Unpivoting standardizes datasets with many repeated columns and makes them more compatible with statistical algorithms. For example, converting columns like “Test1, Test2, Test3” into a single “Test Score” column improves modeling consistency.
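As a rough illustration, the sketch below pivots hypothetical transaction data into a per-customer category summary and then melts hypothetical Test1, Test2, and Test3 columns into a single score column (all table and column names are made up for the example):

```python
import pandas as pd

# Pivot: long transaction records -> wide per-customer summary (total purchases by category)
transactions = pd.DataFrame({
    "customer_id": [1, 1, 2, 2, 2],
    "category": ["books", "games", "books", "books", "games"],
    "amount": [20, 50, 15, 30, 60],
})
summary = transactions.pivot_table(
    index="customer_id", columns="category", values="amount",
    aggfunc="sum", fill_value=0,
)

# Unpivot (melt): wide test-score columns -> one long "Test Score" column
scores = pd.DataFrame({
    "student": ["s1", "s2"],
    "Test1": [78, 91], "Test2": [85, 88], "Test3": [90, 79],
})
tidy_scores = scores.melt(id_vars="student", var_name="Test", value_name="Test Score")
```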
3. Grouping and Aggregating
Aggregation helps summarize and condense large datasets. Common aggregations include:
Sum of sales
Average temperature readings
Maximum and minimum values per group
Count of unique users or sessions
Why it matters:
Simplifies large datasets
Supports feature engineering (e.g., customer lifetime value)
Helps identify trends and patterns
Enables efficient model-ready datasets
Grouping and aggregation directly influence the quality of descriptive analytics and predictive features.
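A small sketch of grouped aggregation in pandas, assuming an illustrative sales table with region, user_id, and revenue columns:

```python
import pandas as pd

# Hypothetical sales data; column names and values are illustrative
sales = pd.DataFrame({
    "region": ["North", "North", "South", "South", "South"],
    "user_id": [1, 2, 2, 3, 3],
    "revenue": [100.0, 250.0, 80.0, 40.0, 120.0],
})

# Group by region and compute several aggregates in one pass
per_region = sales.groupby("region").agg(
    total_revenue=("revenue", "sum"),
    avg_revenue=("revenue", "mean"),
    max_revenue=("revenue", "max"),
    unique_users=("user_id", "nunique"),
)
```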
Feature Creation (Feature Engineering)
Feature engineering is the process of creating meaningful new variables that enhance the modeling power of algorithms. Raw data often lacks the expressiveness required by machine learning models. By deriving new features, data scientists expose hidden patterns and relationships.
1. Deriving New Columns from Existing Variables
This involves creating new metrics or indicators based on combinations or transformations of existing columns, such as:
Age from birthdate
Time difference between events
Revenue growth rate
Customer activity scores
Machine learning models identify patterns based on the features provided. Well-designed features capture essential business logic and improve predictions significantly.
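For instance, a sketch of a few derived columns in pandas, assuming illustrative birthdate, event, and revenue fields:

```python
import pandas as pd

# Hypothetical customer records; all column names and dates are illustrative
df = pd.DataFrame({
    "birthdate": pd.to_datetime(["1990-05-01", "1985-11-23"]),
    "signup": pd.to_datetime(["2023-01-10", "2023-02-01"]),
    "last_purchase": pd.to_datetime(["2023-03-15", "2023-02-20"]),
    "revenue_prev": [1000.0, 500.0],
    "revenue_curr": [1200.0, 450.0],
})

# Age (in whole years) derived from birthdate, relative to a chosen as-of date
as_of = pd.Timestamp("2024-01-01")
df["age_years"] = (as_of - df["birthdate"]).dt.days // 365

# Time difference between two events, in days
df["days_to_purchase"] = (df["last_purchase"] - df["signup"]).dt.days

# Revenue growth rate between two periods
df["revenue_growth"] = (df["revenue_curr"] - df["revenue_prev"]) / df["revenue_prev"]
```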
2. Encoding Categorical Variables (OHE, Label Encoding)
Most machine learning algorithms require numerical inputs, so categorical variables must be encoded.
One-Hot Encoding (OHE):
Creates binary columns for each category.
Useful for nominal (unordered) categories like color or city.
Label Encoding:
Assigns numeric labels to categories.
Useful for ordinal categories like education level.
Incorrect encoding distorts patterns and biases models. Proper encoding ensures compatibility with ML algorithms and preserves categorical meaning.
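A minimal sketch of both encodings in pandas, using made-up city and education columns (the category values and the ordinal mapping are assumptions for the example):

```python
import pandas as pd

# Hypothetical categorical data; values are illustrative
df = pd.DataFrame({
    "city": ["Pune", "Delhi", "Pune", "Mumbai"],
    "education": ["High School", "Bachelor", "Master", "Bachelor"],
})

# One-hot encoding for a nominal variable: one binary column per city
df = pd.get_dummies(df, columns=["city"], prefix="city")

# Label (ordinal) encoding for an ordered variable: map categories to ranked integers
education_order = {"High School": 0, "Bachelor": 1, "Master": 2}
df["education_level"] = df["education"].map(education_order)
```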
3. Binning and Scaling Data
Binning:
Splits continuous variables into ranges.
Examples include income brackets or age groups.
Scaling:
Standardization (Z-score scaling)
Normalization (Min–Max scaling)
Impact on Modeling:
Many algorithms—such as kNN, SVM, and neural networks—are sensitive to scale.
Scaling ensures that all features contribute equally and models converge faster.
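As a sketch, the snippet below bins an illustrative age column with pd.cut() and applies both scaling techniques to an illustrative income column (the bin edges and labels are assumptions):

```python
import pandas as pd

# Hypothetical numeric features; names, values, and bin edges are illustrative
df = pd.DataFrame({
    "age": [22, 35, 47, 63, 71],
    "income": [28000, 52000, 75000, 61000, 90000],
})

# Binning: split a continuous variable into labeled ranges
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 70, 120],
    labels=["<30", "30-49", "50-69", "70+"],
)

# Standardization (Z-score): subtract the mean, divide by the standard deviation
df["income_z"] = (df["income"] - df["income"].mean()) / df["income"].std()

# Normalization (Min-Max): rescale to the [0, 1] range
df["income_minmax"] = (df["income"] - df["income"].min()) / (df["income"].max() - df["income"].min())
```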
Joining and Merging Datasets
Modern data ecosystems rely on multiple structured and unstructured sources: databases, APIs, log files, CRM systems, ERP platforms, and more. Combining these datasets is essential to build a unified and complete analytical view.
Joining involves merging datasets on common identifiers like:
Customer ID
Order ID
Product SKU
Email or phone number
Timestamp
Real-world data is distributed across systems. For example:
CRM contains customer profiles
Transaction system contains purchase data
Web logs contain browsing behavior
Joining enriches the data by combining these sources into a single unified dataset that models real-world entities accurately.

Choosing a join type:
1. Left joins are common in analytics (retain the base population)
2. Inner joins are common in modeling (ensure complete data)
3. Outer joins are used when diagnosing mismatches or missing links
Choosing the wrong join type leads to data loss, duplicates, or misaligned rows.
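A short sketch of the three join types with the pandas merge() function, using hypothetical customer and order tables keyed on customer_id:

```python
import pandas as pd

# Hypothetical CRM and transaction tables; names and values are illustrative
customers = pd.DataFrame({"customer_id": [1, 2, 3], "segment": ["gold", "silver", "gold"]})
orders = pd.DataFrame({"customer_id": [1, 1, 2, 4], "order_value": [120, 80, 200, 50]})

# Left join: keep every customer (the base population), even those without orders
analytics_view = customers.merge(orders, on="customer_id", how="left")

# Inner join: keep only customers that have at least one matching order
modeling_view = customers.merge(orders, on="customer_id", how="inner")

# Outer join with an indicator column: diagnose mismatches and missing links on either side
diagnostic_view = customers.merge(orders, on="customer_id", how="outer", indicator=True)
```

The _merge column added by indicator=True shows whether each row matched in the left table, the right table, or both, which makes mismatched keys easy to spot.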