ML Diagnostics Mastery — Part 1 of the Series

Table of Contents

Series note: This is Part 1 of the ML Diagnostics Mastery series. Future parts will cover class imbalance diagnostics, feature engineering audits, and production drift detection. All code from this article is self-contained and reproducible with random_state=42.

I’ve reviewed a lot of model pull requests over the years, and there is one pattern that shows up more reliably than any bug: a model that looks brilliant on the training set, gets shipped, and quietly embarrasses everyone in production. The engineer who built it wasn’t careless. They just didn’t have a structured diagnostic workflow — they had a number (train accuracy), and it was high, so they moved on.

This article is my attempt to build that workflow in public. Everything here came out of an afternoon of deliberate experimentation: I sat down, wrote nine diagnostic functions from scratch, ran them against controlled synthetic datasets, and documented what each one reveals. The plots tell the story better than I can in prose, so I’ve left placeholders throughout for each figure. If you’re following along with the companion code, you’ll generate those images yourself.

Let’s start with the concept that makes all of this necessary.

What Overfitting Actually Is

Overfitting is a model memorising the training data rather than learning the underlying distribution. It recognises the noise in the examples it has seen, and that noise has zero predictive value on new data. The result: spectacular train metrics, mediocre real-world performance.

The opposite problem — underfitting — is a model too simple to capture the genuine signal. It performs poorly on both training and test data, which is at least honest.

Neither condition is rare. In my experience, most models that get deployed sit somewhere on the overfit end of the spectrum, because the incentives during development push in that direction: we optimise what we measure, and we measure training performance because it’s convenient.

The solution isn’t to distrust your model — it’s to interrogate it systematically. Below are nine tools that let you do that.

The Experimental Setup

All experiments use scikit-learn’s synthetic dataset generators (make_classification, make_regression) with fixed random seeds, so every number in this article is reproducible [1]. I used:

Python 3.11
scikit-learn 1.4 [2]
NumPy 1.26 [3]
Matplotlib 3.8 [4]

No real datasets were harmed. I chose synthetic data deliberately: it gives me ground truth about what the correct model complexity should be, so I can manufacture clean examples of each failure mode.

Diagnostic 1: Polynomial Overfitting — The Visual Anchor

Before any code, it helps to see overfitting in the most unambiguous setting possible. Polynomial regression on a 1D noisy sine wave is the textbook example, and it earns that status.

I generated 30 samples from y = sin(2πx) + ε, where ε ~ N(0, 0.3), then fit three pipelines — degree-1 (linear), degree-4, and degree-15 — and evaluated each on a held-out test split.

degrees = [1, 4, 15]

for degree in degrees:
    pipeline = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ])
    pipeline.fit(X_train, y_train)
    train_score = pipeline.score(X_train, y_train)
    test_score  = pipeline.score(X_test,  y_test)

degrees = [1, 4, 15]

for degree in degrees:
    pipeline = Pipeline([
        ("poly", PolynomialFeatures(degree=degree)),
        ("scaler", StandardScaler()),
        ("model", LinearRegression()),
    ])
    pipeline.fit(X_train, y_train)
    train_score = pipeline.score(X_train, y_train)
    test_score  = pipeline.score(X_test,  y_test)

Three side-by-side scatter plots of the same noisy sine wave dataset fitted with polynomial regression at degree 1, degree 4, and degree 15. The degree-1 line slopes gently downward and misses the curve entirely. The degree-4 line follows the sine wave closely. The degree-15 line passes through the training points but oscillates wildly, reaching extreme values near x=0.6 and x=0.8, with a test R² of −1,166,955. — Degree-1 underfits, degree-4 finds the signal, degree-15 memorises the noise — and the test R² tells you everything.

The degree-15 model achieves an R² close to 1.0 on the training set and a sharply negative R² on the test set — it has learned the noise so thoroughly that it can’t generalise to points a few millimetres away on the x-axis. Degree-1 is the opposite failure: it has the capacity of a ruler.

What to look for: The train/test R² printed in each subplot title. A high train score alongside a low or negative test score is the overfitting signature in its purest form.

Diagnostic 2: Learning Curves — The First Thing I Always Plot

If I could recommend only one diagnostic tool, it would be the learning curve. It answers the most important question in applied ML: will getting more data help?

A learning curve plots train and validation score as a function of training set size. Three regimes emerge:

Pattern	Diagnosis	Fix
Large train–val gap, both curves flat	Overfitting (high variance)	More data, regularization, simpler model
Both curves low, converging tightly	Underfitting (high bias)	More complex model, better features
Converging at high score	Good fit	Ship it (carefully)

I tested three decision trees — unconstrained depth (overfit), max_depth=4 (good fit), and depth-1 stump (underfit):

train_sizes, train_scores, val_scores = learning_curve(
    estimator, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="accuracy", n_jobs=-1
)

train_sizes, train_scores, val_scores = learning_curve(
    estimator, X, y,
    train_sizes=np.linspace(0.1, 1.0, 10),
    cv=5, scoring="accuracy", n_jobs=-1
)

Learning curve for a depth-4 decision tree, showing training score (blue) starting near 1.0 at small dataset sizes and declining to around 0.90, while validation score (red) rises from about 0.63 to 0.83 as training set size increases from 45 to 400 samples. The two curves converge toward each other with a final gap of 0.068. Shaded bands show standard deviation across cross-validation folds. — The training and validation curves converge as data grows — this is the shape you’re aiming for.

Learning curve for an unconstrained deep decision tree. The training score (blue) sits flat at 1.0 across all training sizes from 45 to 400. The validation score (red) starts around 0.63, rises to approximately 0.81 by 80 samples, then stays roughly flat with some variance. The final train-validation gap is 0.186. The two curves never converge. — Training score frozen at 1.0 across every data size — that flatline is the overfit signature, not a success story.

Learning curve for a depth-1 decision tree (stump). Both training score (blue) and validation score (red) hover between 0.69 and 0.80, with the training score declining from about 0.80 to 0.75 and the validation score fluctuating around 0.72. The final gap is only 0.015. The curves run close together but at uniformly low accuracy throughout. — Both curves are converged and close — but converging at 0.75 means the model has hit a ceiling, not found a solution.

The deep tree (Figure 2a) is the canonical overfit signature: training accuracy stays near 1.0 regardless of data volume, and the validation curve plateaus far below it. More data will not close this gap — you need to constrain the model. The stump (Figure 2c) is the opposite: both curves converge, but they converge at a bad number. The gap between them is fine; the absolute level is the problem.

Diagnostic 3: Validation Curves — Finding the Hyperparameter Sweet Spot

A validation curve is a learning curve’s cousin, but instead of varying data size, you vary a single hyperparameter. This is how you find the complexity sweet spot before committing to a final model.

I swept max_depth from 1 to 19 on a decision tree classifier:

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth",
    param_range=np.arange(1, 20),
    cv=5, scoring="accuracy"
)

train_scores, val_scores = validation_curve(
    DecisionTreeClassifier(random_state=42), X, y,
    param_name="max_depth",
    param_range=np.arange(1, 20),
    cv=5, scoring="accuracy"
)

The experiment found max_depth=8 as the optimal value, with a validation accuracy of 0.808.

Validation curve showing decision tree accuracy versus max_depth from 1 to 19. The training score (blue) rises continuously from 0.74 at depth 1 to 1.0 at depth 15 and beyond. The validation score (red) rises from 0.73, peaks around 0.806 at depth 8, then gradually declines. A vertical green dashed line marks max_depth=8 as the optimal value. — The green dashed line at depth 8 is the only point where training the model harder still helps generalisation — past it, you’re buying noise.

The shape in Figure 3 is what you want to see. Left of the peak: both scores are low, the model is too simple. Right of the peak: training score continues climbing but validation score drops — classic overfitting. The validation curve gives you a principled reason to stop at max_depth=8 rather than guessing.

Diagnostic 4: Train/Validation Loss Curves — The Neural Network Signature

For neural networks, we don’t get a single accuracy number — we get a loss trajectory across epochs. The overfitting signature here is unmistakable once you know what to look for: validation loss starts rising while training loss continues to fall.

I simulated this with synthetic loss curves to isolate the pattern cleanly:

train_loss = 1.0 / (1 + 0.07 * epochs) + np.random.normal(0, 0.005, 100)
val_loss   = (1.0 / (1 + 0.05 * epochs)
              + 0.003 * np.maximum(0, epochs - 40)
              + np.random.normal(0, 0.01, 100))

train_loss = 1.0 / (1 + 0.07 * epochs) + np.random.normal(0, 0.005, 100)
val_loss   = (1.0 / (1 + 0.05 * epochs)
              + 0.003 * np.maximum(0, epochs - 40)
              + np.random.normal(0, 0.01, 100))

The early-stopping point — epoch 68 — is where validation loss is minimised. Training past that point is pure memorisation.

Train and validation loss curves over 100 epochs. Training loss (blue) decreases monotonically from about 0.95 to near 0.10. Validation loss (red) decreases in parallel until approximately epoch 68, then begins rising, reaching about 0.36 by epoch 100. The region where validation loss exceeds training loss is shaded in pink as the generalisation gap. A green dashed vertical line marks epoch 68 as the early stopping point. — Everything to the right of the green line is the model getting better at the training set and worse at the world.

The shaded region between the two curves in Figure 4 is the generalisation gap — the performance you’re giving up to memorisation. Early stopping is simply agreeing to save the checkpoint at the leftmost edge of that shaded region [5].

Practical note: In real training pipelines, save model weights at every epoch or at regular intervals using callbacks (e.g., ModelCheckpoint in Keras [6], EarlyStopping in PyTorch Lightning [7]). The epoch with minimum validation loss is your restore point.

Diagnostic 5: Cross-Validation Score Distribution — Measuring Stability

A single train/test split tells you how the model performs on one particular partition of your data. Cross-validation tells you how it performs on all of them — and crucially, it tells you how much performance varies across splits. High variance across folds is itself a diagnostic signal.

I ran 10-fold cross-validation on four classifiers:

scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")

scores = cross_val_score(model, X_train, y_train, cv=10, scoring="accuracy")

Results:

Model	CV Mean	CV Std	Test Acc
Deep Tree (overfit)	0.810	0.064	0.810
Shallow Tree	0.817	0.054	0.860
Random Forest	0.872	0.049	0.900
Gradient Boosting	0.893	0.046	0.890

Box plot showing 10-fold cross-validation accuracy distributions for four classifiers: Deep Tree (overfit), Shallow Tree, Random Forest, and Gradient Boosting. The Deep Tree box spans roughly 0.77 to 0.85 with a whisker reaching down to 0.70, indicating high variance. The Shallow Tree has a similar spread. Random Forest and Gradient Boosting boxes are tighter, positioned higher between approximately 0.84 and 0.93, with shorter whiskers. — The width of each box is the model’s instability — Random Forest and Gradient Boosting earn their reputation by shrinking it.

The box plot makes the instability visible. The deep tree’s wide interquartile range means its performance is highly sensitive to which 10% of the data ends up in the validation fold — it has memorised patterns that only hold for certain data subsets. Ensembles shrink this variance dramatically, which is one of the reasons they’re so reliable in practice [8].

Rule of thumb: A CV standard deviation above 0.06–0.07 is a flag worth investigating. It doesn’t always mean overfit, but it means the model is not robust to data partitioning.

Diagnostic 6: Bias–Variance Decomposition — Finding the Root Cause

All prediction error can be decomposed into three components [9]:

Total Error = Bias² + Variance + Irreducible Noise

Bias²: How far off the average prediction is from the truth. Systematic, model-structural error.
Variance: How much predictions fluctuate across different training sets. Sensitivity to noise.
Irreducible noise: The floor. You cannot do better than this regardless of model complexity.

The textbook result: simpler models have high bias and low variance; complex models have low bias and high variance. The sweet spot minimises their sum.

I estimated bias and variance via bootstrap resampling [10] — training 200 models on bootstrap samples of the training data and measuring the spread of their predictions on a fixed test set:

for i in range(n_bootstrap):
    rng = np.random.RandomState(i)
    idx = rng.choice(len(X_train_full), size=len(X_train_full), replace=True)
    estimator.fit(X_train_full[idx], y_train_full[idx])
    all_preds.append(estimator.predict(X_test))

mean_pred = all_preds.mean(axis=0)
bias_sq   = np.mean((mean_pred - y_test) ** 2)
variance  = np.mean(all_preds.var(axis=0))

for i in range(n_bootstrap):
    rng = np.random.RandomState(i)
    idx = rng.choice(len(X_train_full), size=len(X_train_full), replace=True)
    estimator.fit(X_train_full[idx], y_train_full[idx])
    all_preds.append(estimator.predict(X_test))

mean_pred = all_preds.mean(axis=0)
bias_sq   = np.mean((mean_pred - y_test) ** 2)
variance  = np.mean(all_preds.var(axis=0))

Results across tree depths:

max_depth	Bias²	Variance	Total Error
1	29,682	6,507	74,759
2	23,451	8,137	70,158
4	14,626	13,768	66,964
6	12,366	16,317	67,253
10	11,603	17,577	67,749
None	11,610	17,898	68,078

Grouped bar chart with blue bars for Bias² and red/orange bars for Variance at each tree depth setting: 1, 2, 4, 6, 10, and None. At depth 1, Bias² is near 30,000 and Variance is around 6,500. By depth 4, they are roughly equal near 14,000 and 13,500 respectively. At depth None, Bias² is about 11,600 and Variance is about 17,900. A green diamond-line for Total Error runs above all bars, dipping to its minimum around depth 4 before rising again. — The green line doesn’t care which component is larger — it only cares about their sum, and that sum bottoms out at depth 4.

The total error curve (green line in Figure 6) bottoms out around max_depth=4. That’s the sweet spot — enough capacity to capture the signal, not enough to memorise the noise. Note that max_depth=None (fully grown tree) has nearly the same bias as max_depth=10, but meaningfully higher variance. The marginal gains in bias reduction stop paying off past a certain depth.

Implementation note: This is a manual bootstrap approximation suitable for illustration. For production use, consider mlxtend.evaluate.bias_variance_decomp [11], which implements the formal Kohavi-Wolpert decomposition [12].

Diagnostic 7: Regularisation — The Cheapest Fix for Regression Overfit

If you have a regression problem with many features — especially more features than samples — regularisation is the fastest path to closing the generalisation gap. Ridge regression adds an L2 penalty to the coefficient magnitudes; Lasso adds L1, which additionally drives some coefficients to exactly zero [13].

I set up a deliberately pathological problem: 100 samples, 80 features, only 10 of which are truly informative. Without regularisation, a linear model wildly overfits.

n_samples, n_features = 100, 80  # More features than useful → easy to overfit
X, y = make_regression(n_samples=n_samples, n_features=n_features,
                       n_informative=10, noise=30, random_state=42)

alphas = np.logspace(-3, 4, 30)

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    # Track train and test MSE at each alpha

n_samples, n_features = 100, 80  # More features than useful → easy to overfit
X, y = make_regression(n_samples=n_samples, n_features=n_features,
                       n_informative=10, noise=30, random_state=42)

alphas = np.logspace(-3, 4, 30)

for alpha in alphas:
    ridge = Ridge(alpha=alpha)
    ridge.fit(X_train, y_train)
    # Track train and test MSE at each alpha

Two side-by-side semi-log plots showing MSE versus regularisation strength (alpha) for Ridge (left) and Lasso (right). In the Ridge plot, Train MSE (blue) stays near zero until alpha around 10, then rises steeply, while Test MSE (red) remains flat around 13,000 for low alphas, rises gradually, and converges with Train MSE around alpha=1,000. In the Lasso plot, Train MSE stays near zero then rises sharply around alpha=30, while Test MSE starts around 16,000, drops steeply to a minimum near alpha=10, then rises sharply to join Train MSE around alpha=100. — At low alpha both models overfit badly; at high alpha both underfit — the sweet spot is narrow and only visible because we swept the full range.

The pattern in Figure 7 is the same in both subplots: at very low alpha, train MSE is tiny but test MSE is enormous (overfit). At very high alpha, both converge to a high value (underfit). Somewhere in between — typically identified by cross-validation on a held-out set — is the optimum.

For Lasso, the L1 penalty has the additional effect of zeroing out the 70 uninformative features entirely, giving you implicit feature selection [14]. In practice, I usually try both: Ridge when all features are plausibly relevant, Lasso when I suspect many are noise.

Diagnostic 8: Feature Importance and Data Leakage Detection

Data leakage is a particularly insidious form of overfitting. The model achieves near-perfect train accuracy not because it has learned something real, but because information from the target has leaked into its features [15]. This happens more often than people expect — target encoding computed on the full dataset before splitting, time-series features that encode future information, or administrative fields that are only populated after the outcome is known.

The feature importance plot from a tree ensemble is one of the fastest ways to flag leakage. If one feature dwarfs all others in importance, and the train/test gap is large despite high absolute accuracy, the dominant feature warrants investigation.

I simulated this by injecting a near-perfect noisy copy of the label as feature zero in the training set only. At test time, that feature contained only random noise — simulating what happens when a leaky feature is absent at serving time.

X_train_leaky[:, 0] = y_train + np.random.normal(0, 0.05, len(y_train))
# Test set gets noise only — the label signal is gone (simulates production)
X_test_leaky[:, 0]  = np.random.normal(0, 1.0, len(y_test))

X_train_leaky[:, 0] = y_train + np.random.normal(0, 0.05, len(y_train))
# Test set gets noise only — the label signal is gone (simulates production)
X_test_leaky[:, 0]  = np.random.normal(0, 1.0, len(y_test))

Result:

With leaky feature → Train: 1.000 | Test: 0.547
⚠  Large train/test gap despite high train accuracy = leakage signature!

Horizontal bar chart of the top 10 feature importances from a Random Forest trained with a leaky feature. LEAKY_feat (red bar) has importance approximately 0.575, roughly eight times larger than the next highest feature (feat_16 at about 0.075). The remaining nine features (blue bars) are all below 0.08. Chart title states Train=1.00, Test=0.55 and notes the red bar is a leaky feature absent at test time. — One feature at 0.575 importance and nine others fighting over the remaining 0.425 — that imbalance alone should trigger an investigation.

The red bar in Figure 8 is the diagnostic. One feature having an importance two to three times larger than the next — especially when combined with the train/test gap shown in the title — is the leakage signature. In a real project, that feature name goes straight onto the investigation list.

Common real-world leakage sources:

Target-mean encoding computed before train/test split
Timestamps or IDs that correlate with outcome ordering
Features derived from post-outcome administrative events
Join keys that inadvertently pull in outcome-adjacent tables

Diagnostic 9: The Full Diagnosis Report — A One-Call Sanity Check

The previous eight diagnostics each answer a specific question. For day-to-day use, you want a single function that gives you a structured verdict without requiring you to remember the full workflow. I built full_diagnosis() for exactly this purpose.

def full_diagnosis(model, X, y, model_name="Model"):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc  = accuracy_score(y_test,  model.predict(X_test))
    cv_scores = cross_val_score(model, X_train, y_train, cv=10)
    gap       = train_acc - test_acc

    # Verdict logic: underfitting check first, then gap severity
    if train_acc < 0.78 and test_acc < 0.78:
        verdict = "UNDERFITTING ..."
    elif gap > 0.18:
        verdict = "SEVERE OVERFITTING ..."
    elif gap > 0.12:
        verdict = "MODERATE OVERFITTING ..."
    elif gap > 0.06:
        verdict = "MILD OVERFITTING ..."
    else:
        verdict = "GOOD FIT ..."

def full_diagnosis(model, X, y, model_name="Model"):
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)

    model.fit(X_train, y_train)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc  = accuracy_score(y_test,  model.predict(X_test))
    cv_scores = cross_val_score(model, X_train, y_train, cv=10)
    gap       = train_acc - test_acc

    # Verdict logic: underfitting check first, then gap severity
    if train_acc < 0.78 and test_acc < 0.78:
        verdict = "UNDERFITTING ..."
    elif gap > 0.18:
        verdict = "SEVERE OVERFITTING ..."
    elif gap > 0.12:
        verdict = "MODERATE OVERFITTING ..."
    elif gap > 0.06:
        verdict = "MILD OVERFITTING ..."
    else:
        verdict = "GOOD FIT ..."

Running this on five classifiers:

╔══════════════════════════════════════════════════════════════╗
  OVERFITTING DIAGNOSIS — Deep Decision Tree (overfit)
  Train accuracy      : 1.0000
  Test accuracy       : 0.7750
  Train–Test gap      : 0.2250  ⚠ HIGH
  CV mean (10-fold)   : 0.7417
  CV std  (10-fold)   : 0.0705  ⚠ UNSTABLE
  Verdict: SEVERE OVERFITTING — reduce complexity, add regularization, get more data
╚══════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════╗
  OVERFITTING DIAGNOSIS — Random Forest
  Train accuracy      : 1.0000
  Test accuracy       : 0.9333
  Train–Test gap      : 0.0667  ✓ OK
  CV mean (10-fold)   : 0.8958
  CV std  (10-fold)   : 0.0280  ✓ STABLE
  Verdict: MILD OVERFITTING — monitor, may be acceptable in some cases
╚══════════════════════════════════════════════════════════════╝

╔══════════════════════════════════════════════════════════════╗
  OVERFITTING DIAGNOSIS — Decision Stump (underfit)
  Train accuracy      : 0.7708
  Test accuracy       : 0.7000
  Train–Test gap      : 0.0708  ✓ OK
  CV mean (10-fold)   : 0.7604
  CV std  (10-fold)   : 0.0553  ✓ STABLE
  Verdict: UNDERFITTING — increase model complexity, add better features
╚══════════════════════════════════════════════════════════════╝

Two things worth noting in the verdict logic. First, the underfitting check runs before the gap check — a decision stump has a small gap and stable CV, but both scores are low. Without the absolute-accuracy gate, it would pass as a “good fit.” Second, the Random Forest shows train=1.0 with a small test gap — this is typical behaviour for bagged ensembles and is usually acceptable. The key number is the test accuracy, not the train accuracy.

Key Takeaways

Pulling everything together, here is the diagnostic workflow I’ve settled on:

Start with learning curves. They answer the most fundamental question — is the problem high bias or high variance — and that answer determines every subsequent decision. If the gap is large, more data or regularisation. If both scores are low, more model capacity or better features.

Use validation curves to tune hyperparameters. Don’t grid-search blindly. Plot the validation curve first; it tells you the direction of the optimum and the sensitivity of performance to the parameter. You’ll waste far fewer compute cycles.

Monitor CV standard deviation, not just the mean. A model with high CV std is an unstable model. It may get lucky on your test split. It will not get lucky in production.

Treat near-perfect train accuracy with suspicion. It is almost never a sign that you’ve built a perfect model. It is usually a sign of overfitting, leakage, or both. The feature importance plot is the fastest way to distinguish them.

Apply the bias–variance decomposition to understand which lever to pull. Bias-dominated error → increase capacity, add features. Variance-dominated error → regularise, get more data, use an ensemble.

Regularise aggressively on high-dimensional regression. Ridge and Lasso are fast, interpretable, and almost always improve the generalisation gap when n_features > n_samples / 10.

Wire full_diagnosis() into your training script. Make the diagnostic automated so you can’t forget to run it. A two-second sanity check at the end of every training run will catch the majority of problems before they reach a reviewer.

What’s Coming in Part 2

Part 2 of this series will cover class imbalance — a different but related failure mode where models that look healthy on standard metrics are quietly useless on the minority class. We’ll examine precision-recall curves, calibration plots, and threshold selection strategies. The diagnostic mindset is the same; the tools differ.

Code Availability

The complete code for all nine diagnostics (app.py) is self-contained and requires only standard scientific Python libraries. All random states are fixed at seed=42 for reproducibility. The nine output figures are saved as numbered PNGs in the working directory.

Complete code: https://github.com/Emmimal/ml-diagnostics-overfitting/

Disclosure

All experiments in this article were designed and run by the author. The code was written from scratch for this article and does not incorporate proprietary data or models. Synthetic datasets were generated using scikit-learn’s built-in generators with no real-world data involved.

The author has no financial relationship with any of the libraries referenced. All tools mentioned are open-source.

The code, output, and analysis are the author’s own work.

Further Exploration

If this article got you thinking more carefully about model behaviour, these are worth reading next:

Feature Engineering in Machine Learning — Overfitting and underfitting are often symptoms of what happened before training. This piece covers how to build features that give your model something real to learn from, which is the upstream fix for half the bias problems in Diagnostic 6.

Top 6 Regression Techniques — Once you can diagnose the generalisation gap, the next question is which model family to reach for. A good companion to the Ridge and Lasso section in Diagnostic 7.

Handle Missing Values in Data Science — Missing value imputation is one of the most common sources of the data leakage pattern shown in Diagnostic 8. Worth reading alongside that section.

Machine Learning Algorithms — A broader map of the algorithm landscape. Useful context for understanding why different model families sit in different parts of the bias-variance spectrum covered in Diagnostic 6.

References

[1] Pedregosa, F., et al. (2011). Scikit-learn: Machine Learning in Python. Journal of Machine Learning Research, 12, 2825–2830. https://jmlr.org/papers/v12/pedregosa11a.html

[2] scikit-learn developers. (2024). scikit-learn 1.4 documentation. https://scikit-learn.org/stable/

[3] Harris, C.R., et al. (2020). Array programming with NumPy. Nature, 585, 357–362. https://doi.org/10.1038/s41586-020-2649-2

[4] Hunter, J.D. (2007). Matplotlib: A 2D graphics environment. Computing in Science & Engineering, 9(3), 90–95. https://doi.org/10.1109/MCSE.2007.55

[5] Prechelt, L. (1998). Early Stopping — But When? In Neural Networks: Tricks of the Trade, Lecture Notes in Computer Science, vol. 1524. Springer, Berlin, Heidelberg. https://doi.org/10.1007/3-540-49430-8_3

[6] Chollet, F., et al. (2015). Keras. https://keras.io

[7] Falcon, W., et al. (2019). PyTorch Lightning. https://www.pytorchlightning.ai

[8] Breiman, L. (2001). Random Forests. Machine Learning, 45(1), 5–32. https://doi.org/10.1023/A:1010933404324

[9] Geman, S., Bienenstock, E., & Doursat, R. (1992). Neural Networks and the Bias/Variance Dilemma. Neural Computation, 4(1), 1–58. https://doi.org/10.1162/neco.1992.4.1.1

[10] Efron, B., & Tibshirani, R. (1994). An Introduction to the Bootstrap. Chapman & Hall/CRC.

[11] Raschka, S. (2018). MLxtend: Providing machine learning and data science utilities and extensions to Python’s scientific computing stack. The Journal of Open Source Software, 3(24). https://doi.org/10.21105/joss.00638

[12] Kohavi, R., & Wolpert, D.H. (1996). Bias Plus Variance Decomposition for Zero-One Loss Functions. In Proceedings of the 13th International Conference on Machine Learning (ICML). Morgan Kaufmann.

[13] Tibshirani, R. (1996). Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society: Series B (Methodological), 58(1), 267–288. https://doi.org/10.1111/j.2517-6161.1996.tb02080.x

[14] Hastie, T., Tibshirani, R., & Friedman, J. (2009). The Elements of Statistical Learning: Data Mining, Inference, and Prediction (2nd ed.). Springer. https://doi.org/10.1007/978-0-387-84858-7

[15] Kaufman, S., Rosset, S., Perlich, C., & Stitelman, O. (2012). Leakage in Data Mining: Formulation, Detection, and Avoidance. ACM Transactions on Knowledge Discovery from Data, 6(4), 1–21. https://doi.org/10.1145/2382577.2382579

Part 1 of the ML Diagnostics Mastery series. Follow for Part 2: Class Imbalance — When Accuracy Lies.

How to Diagnose Overfitting in Machine Learning — 9 Proven Tools

ML Diagnostics Mastery — Part 1 of the Series

What Overfitting Actually Is

The Experimental Setup

Diagnostic 1: Polynomial Overfitting — The Visual Anchor

Diagnostic 2: Learning Curves — The First Thing I Always Plot

Diagnostic 3: Validation Curves — Finding the Hyperparameter Sweet Spot

Diagnostic 4: Train/Validation Loss Curves — The Neural Network Signature

Diagnostic 5: Cross-Validation Score Distribution — Measuring Stability

Diagnostic 6: Bias–Variance Decomposition — Finding the Root Cause

Diagnostic 7: Regularisation — The Cheapest Fix for Regression Overfit

Diagnostic 8: Feature Importance and Data Leakage Detection

Diagnostic 9: The Full Diagnosis Report — A One-Call Sanity Check

Key Takeaways

What’s Coming in Part 2

Code Availability

Disclosure

Further Exploration

References

Understanding Random Variables in Data Science

Edge AI for Real-Time Data Processing

How to Import and Use Modules in Python

How to Create an Invisible Cloak Using OpenCV

Variational Autoencoders (VAEs) in Generative Music: A New Era of Creativity

Encapsulation and Abstraction in Python: A Complete Guide

Leave a Reply Cancel reply

ML Diagnostics Mastery — Part 1 of the Series

What Overfitting Actually Is

The Experimental Setup

Diagnostic 1: Polynomial Overfitting — The Visual Anchor

Diagnostic 2: Learning Curves — The First Thing I Always Plot

Diagnostic 3: Validation Curves — Finding the Hyperparameter Sweet Spot

Diagnostic 4: Train/Validation Loss Curves — The Neural Network Signature

Diagnostic 5: Cross-Validation Score Distribution — Measuring Stability

Diagnostic 6: Bias–Variance Decomposition — Finding the Root Cause

Diagnostic 7: Regularisation — The Cheapest Fix for Regression Overfit

Diagnostic 8: Feature Importance and Data Leakage Detection

Diagnostic 9: The Full Diagnosis Report — A One-Call Sanity Check

Key Takeaways

What’s Coming in Part 2

Code Availability

Disclosure

Further Exploration

References

RELATED POSTS

Leave a Reply Cancel reply