8. Encoding Categorical Variables

Ankit Tomar, June 25, 2025

Great job sticking through the foundational parts of ML so far. Now let’s talk about something crucial — how to handle categorical variables. This is one of the first real technical steps when working with data, and it can make or break your model’s performance.


🧠 Why Do We Need Encoding?

Here’s a simple truth: machines don’t understand words — they only work with numbers. Whether it’s a linear model doing math or a deep learning model crunching through matrices, everything boils down to numbers.

So, whenever your dataset has text — like country names, colors, product types, etc. — you need to convert those into numbers. That process is called encoding.


🔍 But First: What Types of Categorical Variables Are There?

Before we jump into encoding techniques, let’s understand the different types of categorical variables you’ll come across:


1. Nominal Variables

These are just names or labels — no order or ranking.
👉 Example: {India, Netherlands, Germany}
There’s no “greater” or “lesser” here — they’re just categories.


2. Ordinal Variables

These have a natural order or ranking.
👉 Example: {Low, Medium, High} or {High School, Graduate, Postgraduate}
In this case, order does matter — but the gap between levels isn’t always consistent.


3. Cyclical Variables

These repeat in a cycle — like months or hours.
👉 Example: Day of Week, Hour of Day
Here, the end connects back to the beginning, so treating them as just numbers can mislead your model.
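To make that concrete, here is a minimal sketch of one common workaround: placing each value on a circle using sine and cosine. This technique is not covered elsewhere in this post, and the helper names `encode_cyclical` and `circle_distance` are made up for illustration.

```python
import math

def encode_cyclical(value, period):
    """Map a cyclical value (e.g. hour 0-23) to a point on the unit circle."""
    angle = 2 * math.pi * value / period
    return math.sin(angle), math.cos(angle)

def circle_distance(a, b, period):
    """Straight-line distance between two encoded values."""
    a_sin, a_cos = encode_cyclical(a, period)
    b_sin, b_cos = encode_cyclical(b, period)
    return math.hypot(a_sin - b_sin, a_cos - b_cos)

# As raw numbers, 23:00 and 00:00 look 23 units apart.
raw_gap = abs(23 - 0)

# On the circle, 23:00 is exactly as close to 00:00 as 01:00 is,
# and 12:00 sits on the opposite side.
gap_23_to_0 = circle_distance(23, 0, period=24)
gap_0_to_1 = circle_distance(0, 1, period=24)
gap_0_to_12 = circle_distance(0, 12, period=24)
```

The trade-off is that one column becomes two, but neighbouring hours now stay neighbours across midnight.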


4. Binary Variables

Only two categories — usually yes/no or true/false.
👉 Example: {Male, Female}, {Yes, No}
These are the easiest to convert into numbers — just use 0 and 1.


🔧 So, How Do We Encode These?

Let’s look at the most common methods you can use to convert these variables into machine-friendly formats.


1. One-Hot Encoding (OHE)

This is one of the most widely used methods. It creates a new column for each category and sets it to 1 if the row belongs to that category and 0 otherwise.

📌 Example:

Country: India → [1, 0, 0]
Country: Netherlands → [0, 1, 0]
Country: Germany → [0, 0, 1]

✅ Pros:

  • Simple and widely used.
  • Keeps categories separate.

❌ Cons:

  • With 1000 categories you get 1000 new columns. High-dimensional data means more memory, slower training, and a higher risk of overfitting.

Best for: Nominal data with few unique values.
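Here is a small sketch of the country example using pandas, assuming pandas is available. The DataFrame and the column name `country` are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({"country": ["India", "Netherlands", "Germany", "India"]})

# One column per category; dtype=int gives 0/1 instead of True/False.
one_hot = pd.get_dummies(df["country"], dtype=int)

# get_dummies sorts the columns alphabetically, so reorder them
# to match the [India, Netherlands, Germany] layout used in the example.
one_hot = one_hot[["India", "Netherlands", "Germany"]]

print(one_hot)
```

Each row has exactly one 1, which is where the name "one-hot" comes from.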


2. Label Encoding

This method assigns a number to each category (like 0, 1, 2…).

📌 Example:

Education Level: High School → 0, Graduate → 1, Postgraduate → 2

✅ Pros:

  • Compact and simple.
  • Often works well for tree-based models like Random Forest, which split on thresholds instead of treating the codes as magnitudes.

❌ Cons:

  • Can confuse linear models, which read the codes as magnitudes: with 0, 1, 2, “Postgraduate” looks like twice “Graduate,” and every step looks equally large.

Best for: Ordinal data or when using tree-based models.
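A minimal sketch of the education example, assuming pandas. The point it tries to show is that for ordinal data the order should be written down explicitly. Automatic encoders may not preserve it; scikit-learn's `LabelEncoder`, for instance, assigns codes in sorted (alphabetical) order. The variable names here are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    {"education": ["Graduate", "High School", "Postgraduate", "Graduate"]}
)

# Explicit order, lowest to highest, mirroring the example above.
levels = ["High School", "Graduate", "Postgraduate"]
mapping = {level: code for code, level in enumerate(levels)}

df["education_code"] = df["education"].map(mapping)

# For comparison: codes assigned alphabetically put "Graduate" (0)
# below "High School" (1), which scrambles the real ranking.
alphabetical = {level: code for code, level in enumerate(sorted(levels))}

print(df)
print(alphabetical)
```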


3. Target Encoding

This one is a bit advanced — it replaces a category with the average value of the target for that category.

📌 Example:

If you’re predicting customer purchases and “Region A” has an average conversion of 0.45, you’ll replace “Region A” with 0.45.

✅ Pros:

  • Very useful for high-cardinality data (like product IDs).

❌ Cons:

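  • Risk of overfitting and target leakage, because the encoding is computed from the very labels you are trying to predict. Fit it on training data only and use proper validation techniques.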

Best for: When you have many unique categories (like zip codes, product IDs).
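The basic idea can be sketched in a few lines of pandas. The data below is made up, and this is the naive version that computes the means on the full dataset, so treat it as an illustration of the mechanics and not as something to use unchanged.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["A", "A", "A", "A", "B", "B", "C"],
    "purchased": [1, 0, 1, 0, 1, 1, 0],
})

# Average target value per category.
region_means = df.groupby("region")["purchased"].mean()

# Replace each category with its average.
df["region_te"] = df["region"].map(region_means)

print(df)
```

Notice region C: it has a single row, so its encoded value is simply that row's own label. That is the overfitting risk in miniature.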


⚖️ Things to Watch Out For

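  • Dimensionality explosion: One-hot encoding with many categories can create thousands of columns. You can manage this by grouping rare categories, switching to a more compact encoding, or applying dimensionality reduction such as PCA.
  • Overfitting: This is a particular risk with target encoding. Avoid data leakage by using techniques like K-fold mean encoding, where each row is encoded using target means computed from the other folds.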
  • Algorithm sensitivity: Some models (like logistic regression) are sensitive to how you encode. Others (like decision trees) are more forgiving.
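Here is one way K-fold mean encoding could look, as a rough sketch assuming pandas and NumPy. The data, the fold assignment, and the fallback to the global mean are all illustrative choices, not a prescribed recipe.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "region": ["A", "B", "A", "B", "A", "B", "A", "B"],
    "purchased": [1, 0, 1, 1, 0, 0, 1, 1],
})

n_folds = 3
# Simple deterministic folds for the demo; shuffle the rows in practice.
fold_id = np.arange(len(df)) % n_folds
global_mean = df["purchased"].mean()

encoded = pd.Series(np.nan, index=df.index, dtype=float)
for k in range(n_folds):
    held_out = fold_id == k
    # Category means computed only from the rows outside this fold.
    fold_means = df.loc[~held_out].groupby("region")["purchased"].mean()
    encoded.loc[held_out] = df.loc[held_out, "region"].map(fold_means).to_numpy()

# A category missing from the other folds falls back to the global mean.
df["region_te_oof"] = encoded.fillna(global_mean)

print(df)
```

No row's encoded value is computed from its own label, which is what keeps the target from leaking into the feature.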

🎯 Final Thoughts

There’s no one-size-fits-all approach here. The right encoding depends on:

  • The type of data.
  • The algorithm you’re using.
  • Your business goals.

As you grow more confident, you’ll be able to experiment and make these decisions intuitively. And remember — encoding isn’t just a technical step. It’s your first chance to shape how your model understands the world.
