CatBoost – An Algorithm you need

Ankit Tomar, July 2, 2025 (updated July 3, 2025)

Hi there!

In this post, we'll explore CatBoost in depth: what it is, why it was created, how it works internally (including symmetric trees, ordered boosting, and ordered target statistics), and guidance on when to use or avoid it.


๐Ÿˆ What is CatBoost?

CatBoost is a gradient boosting library developed by Yandex, uniquely built to handle categorical features natively. Instead of manually applying one-hot or target encoding, CatBoost automatically processes categorical variables in a way that avoids overfitting.
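To make that concrete, here is a minimal sketch (the dataset and column names are invented for illustration): raw string columns are passed straight to the model and listed in cat_features, and CatBoost encodes them internally.

```python
# Minimal sketch: CatBoost accepts raw string categories via cat_features,
# with no manual one-hot or target encoding. Toy data for illustration only.
import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({
    "membership_level": ["gold", "silver", "gold", "bronze", "silver", "gold",
                         "bronze", "gold", "silver", "bronze"],
    "purchase_count":   [12, 3, 8, 1, 5, 20, 2, 15, 4, 1],
    "churned":          [0, 1, 0, 1, 1, 0, 1, 0, 1, 1],
})
X, y = df[["membership_level", "purchase_count"]], df["churned"]

model = CatBoostClassifier(iterations=100, depth=4, learning_rate=0.1, verbose=False)
model.fit(X, y, cat_features=["membership_level"])  # columns named here are treated as categorical

print(model.predict(X))
```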


๐Ÿ” Why was CatBoost introduced?

Most boosting libraries struggle with categorical variables:

  • One-hot encoding increases dimensionality and sparsity.
  • Naive target encoding can introduce data leakage and overfitting.

CatBoost uses ordered target statistics to calculate target encodings in a way that avoids future data leakage.


🧠 How CatBoost builds models

  • Starts with an initial prediction.
  • Computes residuals (negative gradients).
  • Fits new trees on residuals and updates predictions.
  • Uses ordered boosting and ordered target statistics to reduce prediction shift and prevent target leakage.
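These steps are easiest to see in a stripped-down boosting loop. The sketch below is deliberately generic (plain scikit-learn regression trees on synthetic data, no ordered boosting or symmetric trees), so it illustrates the boosting idea from the list above rather than CatBoost's actual implementation.

```python
# Generic gradient boosting on synthetic data: fit each new tree to the
# residuals (negative gradients of squared loss) of the current prediction.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X[:, 0] ** 2 + X[:, 1] + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
pred = np.full_like(y, y.mean())   # step 1: initial prediction
trees = []

for _ in range(50):                # boosting rounds
    residuals = y - pred           # step 2: residuals = negative gradients
    tree = DecisionTreeRegressor(max_depth=3).fit(X, residuals)  # step 3: fit tree on residuals
    trees.append(tree)
    pred += learning_rate * tree.predict(X)                      # update predictions

print("final training MSE:", round(float(np.mean((y - pred) ** 2)), 4))
```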

🌳 Symmetric trees and ordered boosting

Symmetric trees: CatBoost grows balanced, symmetric (oblivious) trees in which every node at the same depth uses the same split. This:

  • Simplifies model structure.
  • Speeds up inference.
  • Improves parallelism.

Ordered boosting: Instead of training on the whole dataset directly, CatBoost permutes the dataset and, for each data point, computes its residual using a model trained only on the points that come before it in the permutation. This combats prediction shift, the bias that arises when the same examples are used both to compute gradients and to fit the trees.


๐Ÿ” Preventing target leakage (ordered target statistics)

CatBoost computes target statistics (like mean target values for categories) in a way that each row's encoding only depends on preceding rows in the permutation. This prevents the model from "peeking" into future data and overfitting.
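Here is a simplified sketch of that idea on toy data (CatBoost's real statistic also averages over several random permutations; the prior term here is just the global target mean): each row's encoding is built only from the targets of earlier rows with the same category, so the row's own label never leaks into its feature.

```python
import pandas as pd

# Toy data, already in one fixed permutation order.
df = pd.DataFrame({
    "city": ["A", "A", "B", "A", "B", "B"],
    "y":    [1,   0,   1,   1,   1,   0],
})

prior = df["y"].mean()

# Naive target encoding: uses the whole column, including the row's own label (leaky).
df["naive_enc"] = df.groupby("city")["y"].transform("mean")

# Ordered statistic: running mean of *previous* targets in the same category, plus a prior.
def ordered_stat(s: pd.Series) -> pd.Series:
    prev_sum = s.shift().fillna(0).cumsum()              # sum of earlier labels in this category
    prev_cnt = pd.Series(range(len(s)), index=s.index)   # number of earlier rows in this category
    return (prev_sum + prior) / (prev_cnt + 1)

df["ordered_enc"] = df.groupby("city")["y"].transform(ordered_stat)
print(df)
```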


โš™๏ธ Important hyperparameters

  • depth: Tree depth.
  • iterations: Number of boosting rounds.
  • learning_rate: Step size.
  • l2_leaf_reg: L2 regularization on leaf values.
  • random_strength: Adds randomness to split scoring to prevent overfitting.
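For reference, a sketch of how these parameters appear in the constructor; the values are illustrative, not tuned recommendations.

```python
from catboost import CatBoostRegressor

model = CatBoostRegressor(
    iterations=500,       # number of boosting rounds
    depth=6,              # depth of each (symmetric) tree
    learning_rate=0.05,   # step size applied to each tree's contribution
    l2_leaf_reg=3.0,      # L2 regularization on leaf values
    random_strength=1.0,  # randomness added to split scoring to fight overfitting
    verbose=100,          # print progress every 100 iterations
)
# model.fit(X_train, y_train, cat_features=categorical_columns)  # your data / column list
```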

🧮 Key formulas and intuition

CatBoost, like other gradient boosting methods, fits trees to minimize a loss:

  • Regression: typically the squared error, $L = \frac{1}{n}\sum_{i}(y_i - \hat{y}_i)^2$.
  • Classification: typically the log loss, $L = -\frac{1}{n}\sum_{i}\left[y_i \log p_i + (1 - y_i)\log(1 - p_i)\right]$.

At each boosting step, a new tree predicts negative gradients to reduce the loss.
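As a quick check of that intuition: writing the per-sample squared error with the conventional factor of 1/2, the negative gradient with respect to the current prediction is exactly the residual, which is why each new tree is fit on residuals:

$$-\frac{\partial}{\partial \hat{y}_i}\left[\tfrac{1}{2}(y_i - \hat{y}_i)^2\right] = y_i - \hat{y}_i$$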


✅ When to use CatBoost

  • Datasets with many categorical features.
  • Projects needing good out-of-the-box accuracy.
  • Situations where prediction stability is key.

โš ๏ธ When to consider alternatives

  • Purely numeric or very large-scale data: LightGBM or XGBoost may be faster.
  • When model size is critical.
  • When categorical handling isn't important.

CatBoost is a very powerful algorithm. You can read more in the official tutorials: https://catboost.ai/docs/en/concepts/tutorials

๐Ÿˆ CatBoost Algorithm Example โ€“ How it Handles Features, Splits & Reuse

Imagine you're working with a dataset containing 10 features, like:

  • purchase_count
  • membership_level
  • visit_days
  • age
  • … and so on.

You might wonder:

How does CatBoost decide which feature to split on? Does it use the same features in all trees? And can it reuse the same feature multiple times inside one tree?

Let's walk through this step by step.


🌳 Step 1: Building symmetric trees

CatBoost builds symmetric, balanced trees:

  • At each depth (tree level), all nodes must split on the same feature and threshold.
  • This makes trees balanced and allows faster parallel prediction.

๐Ÿ” Step 2: How CatBoost picks splits

At each depth, CatBoost:

  1. Considers all available features (all 10 in our example).
  2. Computes which feature & threshold best reduce the loss (using a gradient-based split score).
  3. Chooses the split that offers the most gain.

Importantly, this process repeats at each depth:

  • The model checks again across all 10 features, including any it already used earlier in the tree.
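The toy sketch below mimics this procedure (it is a simplification: real CatBoost quantizes features into borders and scores candidates using gradient statistics). At each depth it scans all 10 features and a few candidate thresholds, then applies the single best split to every node at that depth.

```python
# Toy symmetric (oblivious) split selection on synthetic data: one
# (feature, threshold) pair per depth, applied to all current nodes,
# scored by how much it reduces squared error of the residuals.
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(300, 10))                 # 10 features, as in the example
residuals = X[:, 0] * 2 + rng.normal(size=300)

def sse(v):
    return ((v - v.mean()) ** 2).sum() if len(v) else 0.0

leaves = [np.arange(len(X))]                   # start with one node holding all rows
for depth in range(3):
    best = None
    for f in range(X.shape[1]):                                   # scan all features each depth
        for t in np.quantile(X[:, f], [0.25, 0.5, 0.75]):         # a few candidate thresholds
            score = sum(sse(residuals[idx[X[idx, f] <= t]]) +
                        sse(residuals[idx[X[idx, f] > t]]) for idx in leaves)
            if best is None or score < best[0]:
                best = (score, f, t)
    _, f, t = best
    print(f"depth {depth + 1}: split all nodes on feature {f} at threshold {t:.2f}")
    leaves = [idx[X[idx, f] <= t] for idx in leaves] + [idx[X[idx, f] > t] for idx in leaves]
```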

โ™ป๏ธ Step 3: Reusing the same feature

Yes: the same feature can appear multiple times in the same tree.

For instance:

  • At depth 1, CatBoost might choose:
    purchase_count > 5
  • At depth 2, after data is split, it could again find:
    purchase_count > 12 is the best next split.

Each reuse can have a different threshold or grouping because:

  • The data subset in each node is different.
  • The optimal split point changes.

✅ Step 4: Using all features in all trees

  • CatBoost keeps the same set of features (the full 10) available at each depth.
  • It does not drop features permanently just because they've been used.
  • Across different trees, the splits can vary: some trees might reuse a feature heavily; others might not use it at all.

🧠 Why does this matter?

  • Reusing the same feature helps capture non-linear relationships and complex patterns.
  • It helps the model remain flexible and powerful, despite using symmetric trees.
  • Each split is chosen only if it truly improves the model on that branch of data.

🔄 Symmetric trees recap

  • All nodes at the same depth must split on the same feature & threshold.
  • But across depths (e.g., depth 1 vs depth 2), CatBoost is free to:
    • Choose different features
    • Reuse the same feature with new thresholds

📦 Putting it together (quick example)

Tree depth 1:

  • Split: purchase_count > 5

Tree depth 2:

  • Re-evaluates all 10 features.
  • Finds purchase_count > 12 again gives the best gain → reuses purchase_count.

Tree depth 3:

  • Re-evaluates again.
  • Chooses a different feature this time, e.g., visit_days > 3.
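If you want to see this on a fitted model, CatBoost exposes feature importances and can render individual trees; the sketch below uses a tiny invented dataset.

```python
# Fit a small model on toy data, then inspect which features the trees rely on.
import pandas as pd
from catboost import CatBoostClassifier

df = pd.DataFrame({
    "purchase_count": [12, 3, 8, 1, 5, 20, 2, 15, 4, 1],
    "visit_days":     [30, 2, 10, 1, 6, 40, 3, 25, 5, 2],
    "churned":        [0, 1, 0, 1, 1, 0, 1, 0, 1, 1],
})
X, y = df[["purchase_count", "visit_days"]], df["churned"]

model = CatBoostClassifier(iterations=50, depth=3, verbose=False)
model.fit(X, y)

# One importance score per feature; higher means it was used in more valuable splits.
print(pd.Series(model.get_feature_importance(), index=X.columns).sort_values(ascending=False))

# To see the feature/threshold chosen at each depth of one symmetric tree
# (requires the graphviz package):
# model.plot_tree(tree_idx=0)
```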

โœ๏ธ Summary

  • CatBoost keeps the same feature set at each depth.
  • At each depth, it re-evaluates all features to find the best split.
  • The same feature can appear multiple times in one tree.
  • Symmetric trees enforce same split per level, but not across levels.

This design helps CatBoost remain fast, powerful, and very effective at modeling complex data, especially with categorical features.
