8. Encoding Categorical Variables

Ankit Tomar, June 25, 2025

Great job sticking through the foundational parts of ML so far. Now let’s talk about something crucial — how to handle categorical variables. This is one of the first real technical steps when working with data, and it can make or break your model’s performance.


🧠 Why Do We Need Encoding?

Here’s a simple truth: machines don’t understand words — they only work with numbers. Whether it’s a linear model doing math or a deep learning model crunching through matrices, everything boils down to numbers.

So, whenever your dataset has text — like country names, colors, product types, etc. — you need to convert those into numbers. That process is called encoding.


🔍 But First: What Types of Categorical Variables Are There?

Before we jump into encoding techniques, let’s understand the different types of categorical variables you’ll come across:


1. Nominal Variables

These are just names or labels — no order or ranking.
👉 Example: {India, Netherlands, Germany}
There’s no “greater” or “lesser” here — they’re just categories.


2. Ordinal Variables

These have a natural order or ranking.
👉 Example: {Low, Medium, High} or {High School, Graduate, Postgraduate}
In this case, order does matter — but the gap between levels isn’t always consistent.


3. Cyclical Variables

These repeat in a cycle — like months or hours.
👉 Example: Day of Week, Hour of Day
Here, the end connects back to the beginning, so treating them as just numbers can mislead your model.
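A common workaround is to project the value onto a circle with sine and cosine so that the last value lands next to the first. Here is a minimal sketch with pandas and NumPy, using a made-up hour-of-day column for illustration:

```python
import numpy as np
import pandas as pd

# Illustrative data: hour of day, 0-23
df = pd.DataFrame({"hour": [0, 6, 12, 18, 23]})

# Project each hour onto a circle so that 23:00 and 00:00 end up close together
df["hour_sin"] = np.sin(2 * np.pi * df["hour"] / 24)
df["hour_cos"] = np.cos(2 * np.pi * df["hour"] / 24)
print(df)
```

With raw numbers, hour 23 and hour 0 look maximally far apart; on the circle they sit next to each other, which is what the model should see.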


4. Binary Variables

Only two categories — usually yes/no or true/false.
👉 Example: {Male, Female}, {Yes, No}
These are the easiest to convert into numbers — just use 0 and 1.


🔧 So, How Do We Encode These?

Let’s look at the most common methods you can use to convert these variables into machine-friendly formats.


1. One-Hot Encoding (OHE)

This is the most popular method. It creates a new column for each category with a 1 or 0 depending on whether that row belongs to the category.

📌 Example:

Country: India → [1, 0, 0]
Country: Netherlands → [0, 1, 0]
Country: Germany → [0, 0, 1]

✅ Pros:

  • Simple and widely used.
  • Keeps categories separate.

❌ Cons:

  • Have 1,000 categories? Boom: 1,000 new columns. High-dimensional data means more complexity.

Best for: Nominal data with fewer unique values.
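A minimal sketch of one-hot encoding with pandas (the country column and its values are just illustrative):

```python
import pandas as pd

# Illustrative data: the country column and values are made up
df = pd.DataFrame({"country": ["India", "Netherlands", "Germany", "India"]})

# get_dummies creates one 0/1 column per category
encoded = pd.get_dummies(df, columns=["country"], dtype=int)
print(encoded)
# Columns: country_Germany, country_India, country_Netherlands
```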


2. Label Encoding

This method assigns a number to each category (like 0, 1, 2…).

📌 Example:

Education Level: High School → 0, Graduate → 1, Postgraduate → 2

✅ Pros:

  • Compact and simple.
  • Works great for tree-based models like Random Forest.

❌ Cons:

  • Can mislead linear models into treating the codes as real magnitudes, e.g. that “Postgraduate” (2) is worth twice “Graduate” (1).

Best for: Ordinal data or when using tree-based models.
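A minimal sketch using scikit-learn. Note that scikit-learn’s LabelEncoder is intended for target labels; for ordinal features, OrdinalEncoder lets you fix the category order explicitly (the education column here is illustrative):

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Illustrative data with a natural order
df = pd.DataFrame({"education": ["High School", "Graduate", "Postgraduate", "Graduate"]})

# Fix the category order explicitly so the mapping matches the intended ranking
encoder = OrdinalEncoder(categories=[["High School", "Graduate", "Postgraduate"]])
df["education_encoded"] = encoder.fit_transform(df[["education"]]).ravel()
print(df)
# High School -> 0, Graduate -> 1, Postgraduate -> 2
```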


3. Target Encoding

This one is a bit advanced — it replaces a category with the average value of the target for that category.

📌 Example:

If you’re predicting customer purchases and “Region A” has an average conversion of 0.45, you’ll replace “Region A” with 0.45.

✅ Pros:

  • Very useful for high-cardinality data (like product IDs).

❌ Cons:

  • Risk of overfitting! Make sure you use proper validation techniques.

Best for: When you have many unique categories (like zip codes, product IDs).
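A minimal sketch of the idea with pandas (the region and purchased columns are made up). This naive version computes the means on the full dataset, so it leaks the target; the next section points to a safer variant:

```python
import pandas as pd

# Illustrative data: region and a binary purchase target
df = pd.DataFrame({
    "region":    ["A", "A", "B", "B", "B", "C"],
    "purchased": [1,   0,   1,   1,   0,   0],
})

# Replace each region with the mean of the target for that region
means = df.groupby("region")["purchased"].mean()
df["region_encoded"] = df["region"].map(means)
print(df)
# A -> 0.50, B -> 0.67, C -> 0.00
```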


⚖️ Things to Watch Out For

  • Dimensionality explosion: One-hot encoding with many categories can create thousands of columns. You can manage this with tools like PCA.
  • Overfitting: Especially in Target Encoding — avoid data leakage by using techniques like K-fold mean encoding (see the sketch after this list).
  • Algorithm sensitivity: Some models (like logistic regression) are sensitive to how you encode. Others (like decision trees) are more forgiving.
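A rough sketch of K-fold mean encoding, reusing the same made-up region/purchased data from above. Each row’s encoding is computed only from folds that do not contain that row, which is what prevents the leakage:

```python
import pandas as pd
from sklearn.model_selection import KFold

# Same illustrative region/purchased data, a few more rows
df = pd.DataFrame({
    "region":    ["A", "A", "B", "B", "B", "C", "C", "A"],
    "purchased": [1,   0,   1,   1,   0,   0,   1,   1],
})

global_mean = df["purchased"].mean()
df["region_encoded"] = global_mean  # fallback for categories unseen in a fold

kf = KFold(n_splits=4, shuffle=True, random_state=42)
for train_idx, val_idx in kf.split(df):
    # Compute category means on the training folds only...
    fold_means = df.iloc[train_idx].groupby("region")["purchased"].mean()
    # ...and apply them to the held-out fold, so no row sees its own target
    encoded = df.iloc[val_idx]["region"].map(fold_means).fillna(global_mean)
    df.loc[encoded.index, "region_encoded"] = encoded

print(df)
```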

🎯 Final Thoughts

There’s no one-size-fits-all approach here. The right encoding depends on:

  • The type of data.
  • The algorithm you’re using.
  • Your business goals.

As you grow more confident, you’ll be able to experiment and make these decisions intuitively. And remember — encoding isn’t just a technical step. It’s your first chance to shape how your model understands the world.
