How to properly diagnose and fix class imbalance in machine learning — threshold tuning, calibration, cost-sensitive evaluation, and what every tutorial gets wrong

Part of a series. Part 1 covered overfitting, instability, and data leakage diagnostics. If you haven’t read it, see “How to Diagnose Overfitting in Machine Learning: 9 Proven Tools.” Part 3 (coming soon) covers production drift: why models fail after deployment.

Before we get technical — a story

A few years back I was reviewing a colleague’s fraud detection pipeline. The model had cleared every standard check — train/test split, no leakage, hyperparameters tuned, 94.2% accuracy on hold-out data. The stakeholder presentation was already drafted.

I asked one question: “How many actual fraud cases did it catch?”

Silence.

We ran the numbers. Out of 60 fraudulent transactions in the test set, the model flagged 12. The other 48 sailed through undetected at $500 each in absorbed losses. The model wasn’t broken — it had learned exactly what accuracy rewarded it to learn: predict “legitimate” almost always, because that’s what 94% of the data is.

This is what class imbalance looks like in a production system. It’s not a paper exercise. It shows up in fraud detection, medical diagnosis, churn prediction, and equipment failure forecasting: any domain where the event you care most about is also the rarest one.

This article walks through a complete diagnostic framework I built for imbalanced classification, covering six diagnostics from “why accuracy lies” to “which model, at which threshold, loses you the least money.” The full reproducible code is available at the end. Every number you’ll see is traceable to a real output.

Who this is for

You’ve shipped classification models before. You know what a confusion matrix is. You’ve heard of SMOTE. But you’ve hit a wall: the model looks fine on paper, something feels off in production, and you’re not sure which knob to turn first.

That’s exactly the gap this framework addresses.

Class Imbalance in Machine Learning: Dataset and Cost Framing

The dataset is 4,000 synthetic transactions with a 6% fraud rate — consistent with real-world card fraud prevalence reported by major payment networks (Nilson Report, 2023). Training split: 3,000 transactions. Test split: 1,000.

Dataset (full)
├─ Total transactions : 4,000
├─ Legitimate (0)     : 3,758  (94.0%)
└─ Fraudulent (1)     :   242  ( 6.0%)

Before touching a model, you need cost framing. This is where most imbalanced learning tutorials check out — they benchmark metrics without a business objective. Here’s the scenario:

Error Type	Real-world meaning	Cost per event
False Positive	Legitimate transaction blocked; customer inconvenienced	$10
False Negative	Fraud missed; bank absorbs the loss	$500

That 50:1 asymmetry shapes every decision downstream. A model that’s cautious with fraud (high recall, some false alarms) costs a lot less than one that’s precise but misses cases. Any evaluation that ignores this ratio is optimizing the wrong objective.

The break-even threshold — the point where the expected cost of a false negative equals the expected cost of a false positive — falls at:

Break-even = FP_cost / (FP_cost + FN_cost) = 10 / 510 ≈ 0.0196

Keep that number in mind. We’ll return to it.
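
The break-even figure falls out of a one-line comparison of expected costs, and the same cost matrix prices any set of predictions. Here is a minimal sketch, assuming the $10 and $500 costs from the table above; the function names total_cost and break_even_threshold are mine for illustration and are not taken from the companion code.

```python
def total_cost(fp: int, fn: int, fp_cost: float = 10.0, fn_cost: float = 500.0) -> float:
    """Operational cost of a set of predictions: FPs and FNs are priced separately."""
    return fp * fp_cost + fn * fn_cost


def break_even_threshold(fp_cost: float = 10.0, fn_cost: float = 500.0) -> float:
    """Fraud probability p at which flagging and not flagging cost the same.

    Flagging a transaction costs (1 - p) * fp_cost in expectation.
    Letting it through costs p * fn_cost in expectation.
    Setting the two equal gives p = fp_cost / (fp_cost + fn_cost).
    """
    return fp_cost / (fp_cost + fn_cost)


print(round(break_even_threshold(), 4))  # 0.0196
print(total_cost(fp=0, fn=60))           # 30000.0 (flag nothing, miss all 60 fraud cases)
```

Passing the costs as explicit arguments keeps the cost framing visible at every call site, which matters once the business revises its estimates.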

Why Accuracy Fails in Class Imbalance in Machine Learning

The first diagnostic is the one that should be taught in every ML intro course but usually isn’t.

On a 94%-majority dataset, a model that predicts “legitimate” on every single transaction — without learning a single pattern, using a DummyClassifier — scores 94.0% accuracy. Not 50%. Not random. 94%.
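
You can reproduce the effect in a few lines. This is a sketch on made-up labels (940 legitimate, 60 fraudulent, mirroring the test-set proportions), not the article’s dataset; the feature matrix is a placeholder because DummyClassifier ignores it.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, recall_score

# Illustrative labels only: 940 legitimate (0) and 60 fraudulent (1) transactions.
y = np.array([0] * 940 + [1] * 60)
X = np.zeros((len(y), 1))  # placeholder features; DummyClassifier never looks at them

# "most_frequent" always predicts the majority class, here "legitimate".
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_accuracy = accuracy_score(y, y_pred)  # equals the majority-class share
baseline_recall = recall_score(y, y_pred)      # fraction of fraud cases caught
print(f"accuracy={baseline_accuracy:.3f}  recall={baseline_recall:.3f}")
```

Accuracy lands at the majority-class share while recall is zero, which is the whole problem in two numbers.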

Here’s how that compares against a properly trained Balanced Random Forest at two different thresholds:

	(A) Always “Legitimate”	(B) Balanced RF @ 0.50	(C) Balanced RF @ 0.26
Accuracy	0.9400	0.9500	0.9580
Precision	0.000	0.857	0.667
Recall	0.000	0.200	0.600
F1	0.000	0.324	0.632
Fraud caught	0 / 60	12 / 60	36 / 60
Total cost	$30,000	$24,020	$12,180

Two things in that table should make you uncomfortable.

First: Model A (does nothing) and Model B (trained Random Forest at default threshold) have nearly identical accuracy — 0.940 vs 0.950. Accuracy tells you nothing about whether a trained model is better than a completely naive one on imbalanced data.

Second: Moving from the default 0.50 threshold to 0.26 cuts the total cost by 49% — with zero changes to the model itself. Same weights, same training, same architecture. Just a different cutoff.

This is the accuracy lie. And the fix has nothing to do with your model.

Three confusion matrices comparing a naive classifier, a Random Forest at the 0.50 threshold, and a Random Forest at the 0.26 threshold on a fraud detection dataset. Accuracy is nearly the same across all three (0.940 to 0.958), but fraud caught and total cost differ dramatically. Moving from the default 0.50 threshold to 0.26 cuts losses from $24,020 to $12,180 with zero model changes. Image by Author.

Why the default 0.50 threshold fails on imbalanced data

When 94% of examples are majority-class, the model’s internal score for a minority-class sample is being pulled toward zero throughout training. Most fraud transactions end up scoring somewhere between 0.20 and 0.45 — real signal, real discriminative power — but silently discarded by a cutoff designed for balanced classes (Davis & Goadrich, 2006).

The 0.50 threshold assumes you’ll see roughly equal class probabilities at the boundary. On a 6% minority dataset, that assumption fails from the outset.
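
To make the cutoff effect concrete, here is a toy sketch with six hypothetical fraud scores (invented for illustration, not model output). The only thing that changes between the two calls is the threshold.

```python
import numpy as np

# Hypothetical scores for six transactions that are all truly fraudulent.
fraud_scores = np.array([0.22, 0.31, 0.38, 0.44, 0.52, 0.71])


def recall_at(scores: np.ndarray, threshold: float) -> float:
    """Fraction of true fraud cases whose score clears the threshold."""
    return float(np.mean(scores >= threshold))


print(recall_at(fraud_scores, 0.50))  # 2 of 6 flagged
print(recall_at(fraud_scores, 0.26))  # 5 of 6 flagged
```

In scikit-learn terms, the scores would come from model.predict_proba(X)[:, 1], and model.predict(X) is equivalent to applying the 0.50 cutoff for a binary classifier.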

ROC vs Precision-Recall in Class Imbalance in Machine Learning

Once you know accuracy is broken, the natural move is ROC-AUC. It’s better. But on heavily imbalanced data, it’s still not honest.

The Balanced RF achieves ROC-AUC of 0.8811. That sounds solid. But look at what drives that number.

ROC plots True Positive Rate (recall) against False Positive Rate:

FPR = FP / (FP + TN)

With 940 legitimate transactions in the test set, TN is enormous. Even 50 false alarms put FPR at only 50/940 ≈ 5.3%. The curve hugs the upper-left corner and looks excellent while the model misses more than half the fraud.

Average Precision (AP) removes this escape hatch entirely. Every false positive directly reduces precision, because precision has no TN term:

Precision = TP / (TP + FP)

The comparison:

Metric	Value	Interpretation
ROC-AUC	0.8811	Inflated by large majority class
Average Precision	0.6170	Honest minority-class picture
No-skill baseline	0.0600	Random guesser’s AP = prevalence
AP lift	10.3×	Genuine signal above chance

The gap between 0.88 and 0.62 is the imbalance tax. The two metrics are not on the same scale, so read it as a rough indication of how optimistic ROC was rather than an exact measurement.
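
Computing both metrics side by side takes two scikit-learn calls. The sketch below uses stand-in synthetic data and a RandomForestClassifier with class_weight="balanced"; those settings are my assumptions, so the printed values will not match the table above.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import average_precision_score, roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed settings, not the article's exact dataset).
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.94, 0.06],
                           flip_y=0.02, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = RandomForestClassifier(n_estimators=100, class_weight="balanced",
                               random_state=0)
scores = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]

roc_auc = roc_auc_score(y_te, scores)
ap = average_precision_score(y_te, scores)
no_skill_ap = float(y_te.mean())  # a random ranker's AP is roughly the prevalence
print(f"ROC-AUC={roc_auc:.3f}  AP={ap:.3f}  "
      f"no-skill AP={no_skill_ap:.3f}  lift={ap / no_skill_ap:.1f}x")
```

Reporting AP next to its no-skill baseline is the useful habit here: the lift over prevalence says more than either curve metric alone.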

Side-by-side comparison of the ROC curve and the Precision-Recall curve for a Balanced Random Forest on an imbalanced fraud detection dataset. ROC-AUC of 0.881 looks strong; Average Precision of 0.617 tells the real story. The gap between the two numbers is the imbalance tax, showing how much ROC flatters models when the majority class is dominant.

Saito & Rehmsmeier (2015) showed that PR curves are more informative than ROC curves when positive class prevalence is low, a result that holds across a wide range of imbalanced problems. From this point forward, AP is the primary curve metric. ROC goes in the appendix.

Diagnostic 3: Calibration — Does Your Score Mean What You Think It Means?

Calibration is the diagnostic that practitioners skip until it causes an operational incident.

Here’s a concrete scenario: you build a review queue — every transaction scoring above 0.30 gets routed to a human analyst. You assume p=0.30 means “30% chance of fraud.” But does the model’s score actually correspond to observed fraud rates at that level?

A reliability diagram (calibration curve) answers this. It plots the model’s mean predicted probability against the actual fraction of positives in each probability bin. A perfectly calibrated model lies on the diagonal. Most don’t.

Results for Gradient Boosting (uncalibrated vs isotonic calibration):

Model	Brier Score	vs Uncalibrated
Uncalibrated GBM	0.0393	—
Calibrated GBM (isotonic, cv=5)	0.0366	6.8% improvement

The Brier score (Brier, 1950) measures the mean squared error of probability forecasts — lower is better, 0 is perfect. A 6.8% improvement from three lines of code is worth taking.

Reliability diagram comparing uncalibrated and calibrated Gradient Boosting models on a fraud detection dataset. Isotonic calibration brings predicted probabilities closer to observed fraud rates and improves the Brier score from 0.0393 to 0.0366. An uncalibrated model that says p=0.40 may reflect a true fraud rate of only 15%, nearly tripling the analyst workload with no benefit.

The operational consequence of poor calibration isn’t subtle. If your model says p=0.40 for transactions that are actually fraudulent only 15% of the time, you’re sending nearly three times as many cases to human review as warranted. Analyst queues saturate. Real fraud slips through because the team is buried in low-risk work.

This issue becomes critical for LR (balanced), which we’ll see in Diagnostic 6 hitting a Brier score of 0.176 — well above the 0.10 warning threshold. Its probabilities are not safe to use for queue routing.

The fix: CalibratedClassifierCV(method='isotonic', cv=5). Wrap your model before deployment wherever probabilities feed downstream decisions.
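
Here is roughly what that wrapper looks like in context, with a Brier score before and after. Treat it as a sketch: the data is a synthetic stand-in with assumed settings, so the scores will differ from the table above, and calibration is not guaranteed to improve the Brier score on every dataset.

```python
from sklearn.calibration import CalibratedClassifierCV
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import brier_score_loss
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (assumed settings, not the article's exact dataset).
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.94, 0.06],
                           flip_y=0.02, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Uncalibrated model: probabilities straight from the booster.
raw = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)

# Calibrated model: 5-fold cross-fitted isotonic regression on top of the booster.
calibrated = CalibratedClassifierCV(
    GradientBoostingClassifier(random_state=0), method="isotonic", cv=5
).fit(X_tr, y_tr)

brier_raw = brier_score_loss(y_te, raw.predict_proba(X_te)[:, 1])
brier_cal = brier_score_loss(y_te, calibrated.predict_proba(X_te)[:, 1])
print(f"Brier uncalibrated={brier_raw:.4f}  calibrated={brier_cal:.4f}")
```

For the reliability diagram itself, sklearn.calibration.calibration_curve returns the per-bin observed fraud rate and mean predicted probability that get plotted against the diagonal.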

Zadrozny & Elkan (2002) provide theoretical grounding for isotonic calibration over Platt scaling (Platt, 1999) when training sets are large and the calibration function is monotone but not sigmoid-shaped, which is the common case with tree ensembles on imbalanced data.

Threshold Tuning for Class Imbalance in Machine Learning

This is the single most impactful diagnostic in the framework. I’ve said that about threshold tuning before, but the numbers keep proving it.

The same Balanced RF. The same test set. Four different thresholds:

Threshold strategy	Value	Precision	Recall	F1	Total Cost
Default 0.50	0.499	0.857	0.200	0.324	$24,020
F1-optimal ✓	0.263	0.667	0.600	0.632	$12,180
F2-optimal ✓	0.128	0.405	0.817	0.541	$6,220
Cost-optimal ✓	0.096	0.277	0.850	0.418	$5,830

Moving from the default threshold to cost-optimal saves $18,190 per 1,000 transactions — 76% cost reduction — with zero model changes.

Two-panel chart showing F1, F2, precision, and recall versus decision threshold on the left, and total operational cost in dollars versus decision threshold on the right, with the four threshold strategies annotated. The default 0.50 threshold rarely fires on the minority class (recall 0.20). Moving to the cost-optimal threshold at 0.096 saves $18,190 per 1,000 transactions with the same trained model.

How the sweep works

The implementation sweeps 400 threshold values from 0.01 to 0.99. At each step, it computes predictions and evaluates three metrics:

F1-optimal — balances precision and recall. Correct choice when FP and FN costs are comparable.

F2-optimal — recall-weighted using the Fβ formula (Van Rijsbergen, 1979) with β=2:

Fβ = (1 + β²) · precision · recall / (β² · precision + recall)

With β = 2:
F2 = (1 + 4) · p · r / (4 · p + r) = 5pr / (4p + r)

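β=2 gives recall four times the weight of precision in the weighted harmonic mean (in Van Rijsbergen’s terms, recall is treated as twice as important as precision). Use this when missing a positive event costs significantly more than a false alarm.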

Cost-optimal — directly minimizes FP × $10 + FN × $500. Most honest when you can quantify error costs. Always prefer this in production if the business provides cost estimates.
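
The three criteria above can share a single pass over the threshold grid. This is a minimal sketch of that idea, not the companion code: sweep_thresholds, fbeta, and best are names I made up here, and the toy scores are random numbers invented purely so the block runs on its own.

```python
import numpy as np


def fbeta(precision: float, recall: float, beta: float) -> float:
    """General F-beta: (1 + b^2) * p * r / (b^2 * p + r), or 0 when undefined."""
    b2 = beta ** 2
    denom = b2 * precision + recall
    return (1 + b2) * precision * recall / denom if denom else 0.0


def sweep_thresholds(y_true, scores, fp_cost=10.0, fn_cost=500.0, n_steps=400):
    """Evaluate F1, F2, and dollar cost at each threshold from 0.01 to 0.99."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    rows = []
    for t in np.linspace(0.01, 0.99, n_steps):
        pred = scores >= t
        tp = int(np.sum(pred & y_true))
        fp = int(np.sum(pred & ~y_true))
        fn = int(np.sum(~pred & y_true))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        rows.append({
            "threshold": float(t),
            "f1": fbeta(precision, recall, 1.0),
            "f2": fbeta(precision, recall, 2.0),
            "cost": fp * fp_cost + fn * fn_cost,
        })
    return rows


def best(rows, key, maximize=True):
    """Row that maximizes (or minimizes) the chosen metric."""
    pick = max if maximize else min
    return pick(rows, key=lambda r: r[key])


# Toy data: ~6% positives, with positives scoring higher on average.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.06).astype(int)
scores = np.clip(0.08 + 0.35 * y + rng.normal(0, 0.12, size=1000), 0, 1)

rows = sweep_thresholds(y, scores)
f1_opt = best(rows, "f1")
f2_opt = best(rows, "f2")
cost_opt = best(rows, "cost", maximize=False)
print(f1_opt["threshold"], f2_opt["threshold"], cost_opt["threshold"])
```

In practice I would pick the threshold on a validation split and report cost on the test split; choosing and scoring on the same data gives an optimistic number.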

Reading the cost curve

As threshold drops from 0.50 toward 0, more fraud gets caught (FN count falls, saving $500 each) at the expense of more false alarms (FP count rises, costing $10 each). Since FN dominates, the cost curve falls steeply as threshold decreases — until FP volume eventually overwhelms the savings. The minimum sits near 0.096.

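Note that 0.096 is above the break-even threshold of 0.0196. Break-even is the cost-optimal cutoff only when the scores are calibrated probabilities: at p = 0.0196 the expected cost of flagging a transaction equals the expected cost of letting it through. A Balanced RF’s scores are not calibrated, because training on rebalanced classes typically pushes fraud scores upward, so the empirical cost minimum on this test set lands at a higher score than the theoretical break-even. Treat 0.0196 as a reference point and the swept minimum as the operating threshold for this particular model.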

Diagnostic 5: Resampling — What It Actually Fixes (and What It Doesn’t)

Resampling is the intervention most practitioners reach for first when they see class imbalance. Add synthetic minority samples with SMOTE (Chawla et al., 2002), upweight the minority class, or both. The question is how to evaluate whether it actually helped.

Using Logistic Regression as the base classifier (so any differences are attributable to resampling, not architecture), here’s the comparison:

Strategy	AP	Optimal threshold	F1	Recall	Total cost
No correction	0.3005	0.2114	0.4138	0.4000	$18,320
class_weight=balanced	0.2740	0.7518	0.3969	0.4333	$17,450

Something counterintuitive shows up immediately: class_weight=balanced has lower AP than no correction — 0.274 vs 0.300 — yet it saves $870 more per 1,000 transactions. If you ranked strategies by AP, you’d pick the wrong one.

Three-panel figure comparing no-correction and class-weight-balanced strategies using Logistic Regression on imbalanced fraud data, with panels showing PR curves, threshold-tuned bar metrics, and operational cost. Average Precision drops for class_weight=balanced, but cost drops too. AP summarizes rank ordering across all thresholds, which makes it the wrong scorecard for comparing resampling strategies. Judge by threshold-tuned recall and dollar cost instead.

Why AP drops when recall improves

class_weight=balanced shifts the model’s internal probability scale. The minority class gets pushed toward higher scores across the board, which is why the optimal threshold moves from 0.21 to 0.75. Reweighting also refits the coefficients under a different loss, so the rank order of transactions changes. That can reduce area under the precision-recall curve even when recall at the optimal threshold genuinely improves.

This is a known and frequently misunderstood behavior. He & Garcia (2009) note in their survey of imbalanced learning methods that class reweighting changes the output distribution in ways that decouple rank-ordering performance (which AP measures) from threshold-specific performance. Always evaluate resampling strategies by threshold-tuned recall and operational cost, not raw AP.
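
A comparison along those lines might look like the sketch below. It reports AP next to the lowest dollar cost found on a threshold grid, so the two scorecards can disagree in plain sight. Assumptions to note: the data is a synthetic stand-in, min_cost is a helper I invented, and it minimizes cost directly, whereas the table above prices each strategy at its F1-optimal threshold.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import average_precision_score
from sklearn.model_selection import train_test_split

FP_COST, FN_COST = 10, 500  # dollar costs from the cost-framing table


def min_cost(y_true, scores) -> int:
    """Lowest total cost over a 400-point threshold grid (hypothetical helper)."""
    y_true = np.asarray(y_true).astype(bool)
    costs = []
    for t in np.linspace(0.01, 0.99, 400):
        pred = scores >= t
        fp = int(np.sum(pred & ~y_true))
        fn = int(np.sum(~pred & y_true))
        costs.append(fp * FP_COST + fn * FN_COST)
    return min(costs)


# Synthetic stand-in data (assumed settings, not the article's exact dataset).
X, y = make_classification(n_samples=4000, n_features=20, weights=[0.94, 0.06],
                           flip_y=0.02, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

results = {}
for name, class_weight in [("no correction", None),
                           ("class_weight=balanced", "balanced")]:
    clf = LogisticRegression(class_weight=class_weight, max_iter=1000)
    s = clf.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    results[name] = {"ap": average_precision_score(y_te, s),
                     "min_cost": min_cost(y_te, s)}
    print(f"{name:24s} AP={results[name]['ap']:.4f}  "
          f"min cost=${results[name]['min_cost']:,}")
```

Whichever strategy wins on your data, the point is to read the cost column first and the AP column second.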

What most tutorials don’t tell you: the review capacity constraint

Here’s the piece that never makes it into benchmark papers: every transaction flagged by your model needs to go somewhere.

At the cost-optimal threshold of 0.096, the model flags roughly 18% of all transactions (about 184 of the 1,000 test transactions, which follows from the precision of 0.277 and recall of 0.850 in the threshold table). In a high-volume environment, say 100,000 transactions per day, that’s about 18,000 cases requiring review or automated holds. Most fraud teams don’t have that capacity. Most automated hold policies have churn and customer satisfaction consequences that dwarf the fraud savings.

The theoretically optimal threshold isn’t the deployable threshold. The deployable threshold is:

Minimize: FP × $10 + FN × $500
Subject to: flagged_rate ≤ max_review_capacity

This is a business constraint. Your modeling framework won’t surface it. You have to bring it yourself, usually in a conversation with fraud operations before you ever run a training loop.
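
Once operations gives you a capacity number, the constrained version is a small change to the sweep: skip any threshold that flags more than the team can review. The sketch below assumes the capacity is expressed as a maximum flagged share; constrained_threshold is an illustrative name, and the toy scores are invented.

```python
import numpy as np


def constrained_threshold(y_true, scores, max_flag_rate,
                          fp_cost=10.0, fn_cost=500.0, n_steps=400):
    """Cheapest grid threshold whose flagged share stays within review capacity."""
    y_true = np.asarray(y_true).astype(bool)
    scores = np.asarray(scores, dtype=float)
    best = None
    for t in np.linspace(0.01, 0.99, n_steps):
        pred = scores >= t
        flag_rate = float(pred.mean())
        if flag_rate > max_flag_rate:
            continue  # infeasible: more flags than the team can review
        fp = int(np.sum(pred & ~y_true))
        fn = int(np.sum(~pred & y_true))
        cost = fp * fp_cost + fn * fn_cost
        if best is None or cost < best["cost"]:
            best = {"threshold": float(t), "cost": cost, "flag_rate": flag_rate}
    return best  # None when no threshold on the grid satisfies the constraint


# Toy data: ~6% positives, with positives scoring higher on average.
rng = np.random.default_rng(0)
y = (rng.random(1000) < 0.06).astype(int)
scores = np.clip(0.08 + 0.35 * y + rng.normal(0, 0.12, size=1000), 0, 1)

unconstrained = constrained_threshold(y, scores, max_flag_rate=1.0)
capped = constrained_threshold(y, scores, max_flag_rate=0.05)
print(unconstrained)
print(capped)
```

The gap in cost between the two results is the price of the capacity limit, which is a useful number to bring back to the operations conversation.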

Model Comparison for Class Imbalance in Machine Learning

Five models, evaluated at their own optimal F1 threshold — not at 0.50. This is the only fair comparison.

Model	Threshold	F1	Recall	AP	Cost	Verdict
LR (no fix)	0.2114	0.4138	0.4000	0.3005	$18,320	SEVERE
LR (balanced)	0.7518	0.3969	0.4333	0.2740	$17,450	SEVERE
Decision Tree	0.1377	0.4299	0.3833	0.2815	$18,740	SEVERE
RF (balanced)	0.2630	0.6316	0.6000	0.6170	$12,180	MODERATE
Gradient Boost	0.0861	0.5874	0.7000	0.5213	$9,410	GOOD FIT

Three-panel summary figure comparing five classifiers on fraud detection metrics, including F1, recall, average precision, operational cost, and cross-validation F1 stability with 95% confidence intervals. All five models are evaluated at their own optimal F1 threshold, not the default 0.50. Gradient Boosting wins on cost ($9,410) despite having a lower F1 than Random Forest (0.587 vs 0.632). Cost framing changes the ranking.

Three things in those results are worth pulling apart.

Gradient Boosting wins on cost, not on F1. Its F1 is 0.587 — lower than RF’s 0.632. But its recall is 0.70 vs RF’s 0.60, and in a $500-per-missed-case cost structure, catching more fraud matters more than precision balance. You can’t rank these models without the cost framing. A ranking by F1 alone gives you the wrong winner.

LR (balanced) is poorly calibrated. Its Brier score is 0.176 — above the 0.10 warning threshold that triggers a calibration flag in the diagnostic output. Its predicted probabilities cannot be trusted for queue routing or risk scoring. If you deployed this model and used its scores to build a review queue, you’d be routing cases based on probabilities that don’t correspond to observed fraud rates.

CV stability is reasonable across all models. Using repeated stratified K-fold (3 splits × 3 repeats = 9 folds), CV F1 standard deviations range from 0.014 to 0.057. All fall within the “stable” band. This matters because with only 182 minority training samples, single-run 10-fold CV can produce standard deviations above 0.15 — which looks like model instability when it’s actually sampling noise from an underpopulated minority class.
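
The 3 × 3 scheme maps directly onto scikit-learn’s RepeatedStratifiedKFold. One caveat on this sketch: it scores F1 at the default 0.50 cutoff via cross_val_score, whereas the framework described here applies its threshold-selection procedure inside each fold. The data and model settings are stand-ins I chose for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import RepeatedStratifiedKFold, cross_val_score

# Synthetic stand-in for the 3,000-row training split (assumed settings).
X, y = make_classification(n_samples=3000, n_features=20, weights=[0.94, 0.06],
                           flip_y=0.02, random_state=0)

# 3 stratified splits repeated 3 times with different shuffles = 9 fits.
cv = RepeatedStratifiedKFold(n_splits=3, n_repeats=3, random_state=42)

model = LogisticRegression(class_weight="balanced", max_iter=1000)
f1_scores = cross_val_score(model, X, y, cv=cv, scoring="f1")

print(f"folds={len(f1_scores)}  mean F1={f1_scores.mean():.3f}  "
      f"std={f1_scores.std():.3f}")
```

Stratification keeps the fraud share roughly constant in every fold, and three larger folds leave more minority cases per validation set than ten small ones, which is what keeps the standard deviation honest.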

Verdict tiers

The framework assigns each model a severity verdict:

Verdict	Recall range	Meaning
CRITICAL	< 0.30	Minority class nearly invisible
SEVERE	0.30 – 0.50	Model misses most minority events
MODERATE	0.50 – 0.65	Acceptable if FP cost is low
PRECISION PROBLEM	≥ 0.65, F1 < 0.50	Recall OK, too many false alarms
GOOD FIT	≥ 0.65, F1 ≥ 0.50	Model is working

Gradient Boosting reaches GOOD FIT. Three of five models sit at SEVERE, and that is with threshold correction applied. At the default 0.50 threshold the picture can be much worse: RF (balanced), for example, drops to 0.20 recall, which is CRITICAL.
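
The tier table translates into a small function. This is my reading of the table rather than the framework’s actual code: in particular, I’m assuming each range includes its lower bound and excludes its upper bound, since the table doesn’t say which tier owns the boundary values.

```python
def verdict(recall: float, f1: float) -> str:
    """Map threshold-tuned recall and F1 to a severity tier.

    Assumption: ranges are half-open, e.g. SEVERE covers 0.30 <= recall < 0.50.
    """
    if recall < 0.30:
        return "CRITICAL"
    if recall < 0.50:
        return "SEVERE"
    if recall < 0.65:
        return "MODERATE"
    # Recall is adequate from here on; F1 decides whether precision is acceptable.
    return "GOOD FIT" if f1 >= 0.50 else "PRECISION PROBLEM"


print(verdict(recall=0.4000, f1=0.4138))  # SEVERE   (LR, no fix)
print(verdict(recall=0.6000, f1=0.6316))  # MODERATE (RF, balanced)
print(verdict(recall=0.7000, f1=0.5874))  # GOOD FIT (Gradient Boost)
```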

What Most Tutorials Skip: The Review Capacity Constraint

I want to spend a moment on something that doesn’t appear in any benchmark paper I’ve read but comes up in every real deployment I’ve worked on.

When you lower your threshold to catch more fraud, you don’t just improve recall. You increase the total volume of flagged transactions. Each of those flagged transactions has to go somewhere — a fraud analyst queue, an automated hold, a customer callback flow, a rule engine for secondary review.

These downstream systems have capacity constraints. Fraud analyst teams are typically sized to handle some manageable false positive rate. Automated holds trigger customer service contacts and can increase card abandonment rates. Secondary rule engines have their own processing latency.

The moment you optimize threshold purely for cost minimization, you may be optimizing yourself into an operationally unsustainable position. A threshold of 0.096 flagging roughly 18% of transactions might save $18,000 per 1,000 cases on paper while triggering $40,000 in customer service costs and churn.

The correct formulation for threshold selection in production is always:

Minimize: FP × direct_fp_cost + FN × direct_fn_cost + FP × indirect_fp_cost
Subject to: flagged_rate ≤ analyst_capacity / transaction_volume

The indirect FP cost — customer friction, card abandonment, service calls — is often omitted from cost matrices because it’s harder to quantify. It shouldn’t be. Work with your fraud operations and customer experience teams to get a number. Even a rough estimate changes the optimal threshold significantly.

This is the conversation that distinguishes ML practitioners who ship models from the ones who keep them running in production.

The Five Decisions, in Order

If you take one checklist away from this framework, make it this one.

1. Stop reporting accuracy on imbalanced data. A model that always predicts the majority class already scores 94% here, so accuracy barely separates a trained model from a naive one. Replace it with recall, F1, and AP as the minimum reporting standard.

2. Use Average Precision as your primary curve metric. ROC-AUC flatters any model with a large majority class. AP doesn’t. On any problem where minority prevalence is below 15%, AP is the right headline number.

3. Calibrate before deployment. If your model’s output probabilities feed any downstream decision — a queue, a score, a human review — uncalibrated outputs route the wrong work to the wrong place. Wrap with CalibratedClassifierCV(method='isotonic', cv=5). Three lines. No excuses.

4. Sweep thresholds against your business objective. The default 0.50 threshold is almost never correct on imbalanced data. Sweep against F1, F2, or direct dollar cost depending on error cost asymmetry. Your model isn’t broken — it’s using the wrong decision boundary.

5. Evaluate resampling by cost, not AP. Class reweighting and oversampling change the probability scale. AP can decline even when the model genuinely improves. Use threshold-tuned recall and operational cost as the honest scoreboard for resampling comparisons.

The line this whole framework leads to

Two Logistic Regression models, a decision tree, two tree ensembles. Six diagnostics. One conclusion:

The difference between $30,000 in losses (naive baseline) and $9,410 (Gradient Boosting at its F1-optimal threshold) is not a better model. It’s a better evaluation framework applied to the same problem space.

Your model isn’t the bottleneck. Your threshold is. And then your cost framing. And then your calibration. The model is last on the list.

Implementation Notes and Reproducibility

All code is reproducible with random_state=42. The dataset is generated with sklearn.datasets.make_classification (Pedregosa et al., 2011) at 6% minority prevalence.

Eight fixes were applied in the current version (v2):

Fix	What it addressed
FIX-1	`FAST_MODE=True` default — figures in seconds, not minutes
FIX-2	`_check_binary()` guard — clear ValueError instead of cryptic unpack crash on non-binary input
FIX-3	`fp_cost`/`fn_cost` as explicit function arguments — no hidden globals
FIX-4	F2 formula derivation shown inline — β=2 → 5pr/(4p+r), fully verifiable
FIX-5	Single `_sweep_thresholds()` pass shared across diagnostics — duplicate computation eliminated
FIX-6	D5 panel B annotates the AP vs Recall trade-off visually — panels A and B no longer contradict each other
FIX-7	Brier > 0.10 surfaces in verdict flag and action text — LR (balanced) degradation no longer invisible
FIX-8	`repeated_cv_f1` documents that each CV fold uses the identical threshold-selection procedure as hold-out — CV F1 and hold-out F1 are directly comparable

Dependencies:

pip install scikit-learn numpy pandas matplotlib
pip install imbalanced-learn   # optional — unlocks SMOTE in Diagnostic 5

Full source code: available in the companion repository

What’s Next

Part 3: Production Drift — why models that pass every diagnostic in this article still fail three months after deployment, and how to build a lightweight monitoring system that catches distribution shift before your stakeholders do.

References

All citations are to original sources. No paraphrased content is claimed as original work.

Brier, G. W. (1950). Verification of forecasts expressed in terms of probability. Monthly Weather Review, 78(1), 1–3. https://journals.ametsoc.org/view/journals/mwre/78/1/1520-0493_1950_078_0001_vofeit_2_0_co_2.xml (Source for the Brier score as a probability calibration metric.)
Chawla, N. V., Bowyer, K. W., Hall, L. O., & Kegelmeyer, W. P. (2002). SMOTE: Synthetic minority over-sampling technique. Journal of Artificial Intelligence Research, 16, 321–357. https://doi.org/10.1613/jair.953 (Original SMOTE paper.)
Davis, J., & Goadrich, M. (2006). The relationship between precision-recall and ROC curves. Proceedings of the 23rd International Conference on Machine Learning (ICML), 233–240. https://doi.org/10.1145/1143844.1143874 (Formal demonstration that PR curves are more informative than ROC on imbalanced datasets.)
He, H., & Garcia, E. A. (2009). Learning from imbalanced data. IEEE Transactions on Knowledge and Data Engineering, 21(9), 1263–1284. https://doi.org/10.1109/TKDE.2008.239 (Comprehensive survey of imbalanced learning methods, including analysis of class reweighting effects on probability outputs.)
Nilson Report. (2023). Payment card fraud losses worldwide. The Nilson Report, Issue 1232. https://nilsonreport.com (Source for the 6% fraud rate approximation used to calibrate the synthetic dataset.)
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., … & Duchesnay, E. (2011). Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12, 2825–2830. http://www.jmlr.org/papers/v12/pedregosa11a.html (Source for all scikit-learn tools used in this article: make_classification, RandomForestClassifier, GradientBoostingClassifier, LogisticRegression, CalibratedClassifierCV, RepeatedStratifiedKFold, and all metric functions.)
Platt, J. (1999). Probabilistic outputs for support vector machines and comparisons to regularized likelihood methods. Advances in Large Margin Classifiers, 10(3), 61–74. (Original paper on Platt scaling for probability calibration.)
Saito, T., & Rehmsmeier, M. (2015). The precision-recall plot is more informative than the ROC plot when evaluating binary classifiers on imbalanced datasets. PLOS ONE, 10(3), e0118432. https://doi.org/10.1371/journal.pone.0118432 (Empirical validation across 15 datasets that PR curves outperform ROC for imbalanced evaluation.)
Van Rijsbergen, C. J. (1979). Information Retrieval (2nd ed.). Butterworth-Heinemann. (Original source for the Fβ metric formula. The F2 formula used in Diagnostic 4 derives from the general Fβ formula with β=2.)
Zadrozny, B., & Elkan, C. (2002). Transforming classifier scores into accurate multiclass probability estimates. Proceedings of the 8th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 694–699. https://doi.org/10.1145/775047.775151 (Theoretical basis for isotonic regression calibration over Platt scaling for tree-based models on imbalanced data.)

Disclosure

Dataset: All analyses in this article use a fully synthetic dataset generated with sklearn.datasets.make_classification. No real transaction data, personal data, or proprietary financial data was used at any point. The 6% fraud rate approximates published industry figures (Nilson Report, 2023) for illustrative realism but does not represent any real institution’s data.

Code authorship: The diagnostic framework, figures, and all associated Python code presented in this article are the original work of the author. The framework builds on open-source libraries (scikit-learn, imbalanced-learn, matplotlib, numpy, pandas) under their respective BSD and MIT licenses. All library citations appear in the References section above.

No affiliate relationships: No tools, libraries, courses, or commercial products are mentioned for compensation. All recommendations are based on the author’s independent technical evaluation.

Reproducibility: All results are fully reproducible using random_state=42. Outputs may vary slightly across operating systems due to floating-point differences in numpy and scikit-learn’s underlying LAPACK/BLAS implementations, but material conclusions will not change.

Series affiliation: This article is Part 2 of the ML Diagnostics Mastery series published on Towards Data Science. Part 1 and Part 3 are linked where referenced. No compensation is received for cross-series links.

Figures: All figures (fig1 through fig6) are generated by the author’s code and are original works. They are not reproduced from any external publication. If you reproduce any figure from this article, please attribute it to the original series.

Questions, corrections, or extensions? Leave a comment or reach out directly — I read everything.

If this helped you catch a real imbalanced classification problem in your own work, I’d genuinely like to hear about it.
