I work in compliance surveillance analytics. My job, broadly, is to help make sure a financial institution is following the rules. A big part of that involves alerts — automated flags that say "hey, this trade or this communication pattern looks suspicious." There are a lot of these alerts. Most of them are noise.
So naturally, people want to use AI to classify them. Train a model, score the alerts, surface the real risks, suppress the junk. Sounds great. Here's where it gets tricky.
In compliance, false negatives are existential. A false positive means an analyst reviews something that turns out to be nothing. That's wasted time, but it's just time. A false negative means a genuinely suspicious pattern goes unreviewed. That's a potential regulatory violation. That's a fine. That's front-page news. The asymmetry is brutal.
When you're training a binary classifier — "risky" vs. "not risky" — you're always making a tradeoff between precision and recall. High precision means when the model says something is risky, it's probably right. High recall means the model catches most of the actual risky stuff, even if it also flags some false alarms. In compliance, you almost always need to bias toward recall. You'd rather over-flag than under-flag.
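To make that concrete, here is a minimal sketch of what biasing toward recall can look like, using scikit-learn on synthetic data. The features, the class weighting, and the 95% recall floor are all illustrative assumptions, not a description of any production system.

```python
# Minimal sketch: bias a binary alert classifier toward recall.
# Synthetic data; the 0.95 recall floor is an illustrative policy choice.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_recall_curve
from sklearn.model_selection import train_test_split

# X: alert features, y: 1 = confirmed risky, 0 = benign (synthetic)
rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 4))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=5000) > 1.5).astype(int)
X_train, X_test, y_train, y_test = train_test_split(X, y, stratify=y, random_state=0)

# class_weight="balanced" penalizes missed positives more when risky alerts are rare
clf = LogisticRegression(class_weight="balanced", max_iter=1000).fit(X_train, y_train)

# Instead of the default 0.5 cutoff, pick the highest score threshold that still
# clears the recall floor, then report the precision you pay for it.
probs = clf.predict_proba(X_test)[:, 1]
precision, recall, thresholds = precision_recall_curve(y_test, probs)
RECALL_FLOOR = 0.95
ok = recall[:-1] >= RECALL_FLOOR  # recall/precision have one extra trailing entry
if ok.any():
    print(f"threshold={thresholds[ok][-1]:.3f}, "
          f"precision at that recall={precision[:-1][ok][-1]:.3f}")
else:
    print("no threshold reaches the recall floor; flag everything for review")
```

The recall floor is decided up front as a policy question; the threshold is whatever the model needs to honor it, and precision is simply what you get.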
The problem is, leadership hears "AI" and thinks "efficiency." They want fewer alerts, not more. They want to reduce the analyst workload. And a high-recall model does the opposite — it might actually increase the number of flagged items, at least initially. The efficiency gain comes from better prioritization, not from suppression. You're not eliminating alerts, you're ranking them. That's a harder sell in a boardroom than "AI reduces alerts by 40%."
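In code terms the difference is almost embarrassingly small: prioritization means every alert stays in the queue and the model only decides review order. A toy sketch, with hypothetical column names:

```python
# Sketch: prioritization, not suppression. Column names are hypothetical.
import pandas as pd

alerts = pd.DataFrame({
    "alert_id": [101, 102, 103, 104],
    "risk_score": [0.12, 0.87, 0.45, 0.91],
})

# Nothing is dropped; the model only decides what analysts look at first.
review_queue = alerts.sort_values("risk_score", ascending=False)
print(review_queue)
```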
When I built a risk scoring model using SageMaker Autopilot, the first thing I learned was that the default optimization metric didn't match my actual goal. For binary classification, Autopilot optimizes for F1 by default, which balances precision and recall evenly. I needed recall weighted higher. That meant overriding the default objective metric and a lot of conversations about what "good enough" actually means in a regulated environment.
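For what it's worth, overriding the objective looks roughly like this with the SageMaker Python SDK. The role ARN, S3 paths, and label column are placeholders, and whether "Recall" is accepted as a MetricName depends on the Autopilot API version you're on; "AUC" and "F1" are the long-standing options, so treat this as a sketch and check the current docs.

```python
# Sketch: launching an Autopilot job with a non-default objective metric.
# Role ARN, S3 paths, and the label column are placeholders. Whether "Recall"
# is a valid MetricName depends on your Autopilot/SDK version; fall back to
# "AUC" or "F1" if it isn't accepted.
from sagemaker.automl.automl import AutoML

automl = AutoML(
    role="arn:aws:iam::111122223333:role/SageMakerExecutionRole",  # placeholder
    target_attribute_name="is_risky",            # placeholder label column
    problem_type="BinaryClassification",
    job_objective={"MetricName": "Recall"},      # override the F1 default
    max_candidates=20,
    output_path="s3://example-bucket/autopilot-output/",  # placeholder
)
automl.fit(inputs="s3://example-bucket/alerts/train.csv", wait=False)
```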
The other thing nobody talks about: explainability. A regulator isn't going to accept "the model said so." You need to show why a particular alert was scored the way it was. Black-box models are a tough sell in compliance. I ended up leaning toward models with more interpretable features — things like transaction velocity, account age, communication frequency — rather than deep embeddings that perform slightly better but can't be explained in an audit.
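As a sketch of what "explainable in an audit" can mean in practice: with a linear model over named features, the contribution of each feature to a specific alert's score is a single multiplication you can print into the case file. The data and fitted coefficients below are illustrative only.

```python
# Sketch: an auditable score built from named features. Feature names follow
# the ones mentioned above; the data and coefficients are illustrative.
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression

features = ["transaction_velocity", "account_age_days", "communication_frequency"]
X = pd.DataFrame(np.random.default_rng(1).normal(size=(1000, 3)), columns=features)
y = (X["transaction_velocity"] - 0.5 * X["account_age_days"] > 0.8).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

# For a single alert, the per-feature contribution to the log-odds is just
# coefficient * feature value, which is something you can hand to an auditor.
alert = X.iloc[0]
contributions = pd.Series(model.coef_[0] * alert.values, index=features)
print(contributions.sort_values(ascending=False))
print("intercept:", model.intercept_[0])
```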
AI in compliance isn't about replacing analysts. It's about giving them better tools to focus on what actually matters. But getting there requires understanding the domain deeply enough to know where the model's failures will hurt the most. That's not a data science problem. That's a compliance problem.