The Problem: The Model Existed. The System Didn't.#
When I joined Moody's Analytics, the content review operation ran across three separate legacy tools — no shared taxonomy, no performance tracking, and no systematic feedback loop between what moderators observed in production and what the AI research team iterated on. The LLM-based moderation assistant had been built by Engineering. But it hadn't been operationalized.
The model existed; the system around it didn't.
What I Built: Four Phases#
I treated this as a product launch, not a tool rollout. The work had four distinct phases:
Process Mapping
I documented the existing workflow end-to-end — where decisions were made, where errors clustered, where latency was introduced. This created a shared baseline and exposed the gaps the AI system needed to close.
Evaluation Framework
I defined what "working" meant before we deployed anything. I established a Safety Index System tracking Precision, Recall, and False Positive Rate across decision categories — the contract between the AI team and ops about what "good" looked like.
Phased Deployment + Feedback Loops
Piloted with one review team, captured structured feedback, and surfaced failure patterns to AI Research in actionable terms ("over-triggering on X category in Y context"). Iterated before global rollout.
Enablement Layer
Built SOPs, training materials, and workflow documentation tailored to non-technical reviewers. Any new team member could reach full productivity without Engineering involvement.
Results#
- 22%↑ moderation accuracy: driven by tighter feedback loops between operational signals and model updates
- 15%↓ average handle time: from better AI-assisted decision support at the point of review
- 40%↓ onboarding time: structured documentation replaced ad hoc Engineering-led training
- 3→1 tools consolidated: three legacy tools merged into a unified AI platform across global review teams
The Insight#
Shipping an AI system is table stakes. Making it work — mapping the process, defining evaluation criteria, building the feedback loop, enabling the team, and measuring the impact — is the hard part. That's what I do.
Frequently Asked Questions#
What was the hardest part of this project?
The hardest part wasn't technical — it was trust. The team already had an LLM tool, but without transparent evaluation criteria, nobody knew whether it was worth trusting. Establishing the Safety Index System — defining "working" in terms of Precision, Recall, and False Positive Rate — was the unlock that shifted the team from hesitation to adoption.
How was the Safety Index System designed?
The Safety Index System is a contract between the ops team and AI Research that defines three core metrics across decision categories: Precision (of what was flagged, how much was actually problematic), Recall (of what was actually problematic, how much was caught), and False Positive Rate (how often normal content gets wrongly flagged). Each metric has a threshold, and any category that falls outside its threshold (Precision or Recall below target, or False Positive Rate above target) triggers a structured feedback loop back to the model team.
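As a rough sketch of the per-category check described above (the function names, counts, and threshold values here are illustrative assumptions, not the production system):

```python
# Illustrative sketch of a Safety Index check for one decision category.
# Threshold values and example counts are hypothetical.

def safety_index(tp, fp, fn, tn):
    """Compute Precision, Recall, and False Positive Rate from confusion counts."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    fpr = fp / (fp + tn) if (fp + tn) else 0.0
    return {"precision": precision, "recall": recall, "fpr": fpr}

# Precision and Recall must stay at or above target; FPR must stay at or below it.
THRESHOLDS = {"precision": 0.90, "recall": 0.85, "fpr": 0.05}

def breached_metrics(metrics, thresholds=THRESHOLDS):
    """Return the metrics that fall outside their threshold for one category."""
    breaches = []
    if metrics["precision"] < thresholds["precision"]:
        breaches.append("precision")
    if metrics["recall"] < thresholds["recall"]:
        breaches.append("recall")
    if metrics["fpr"] > thresholds["fpr"]:
        breaches.append("fpr")
    return breaches  # a non-empty list triggers the feedback loop

# Example category: 180 true positives, 20 false positives,
# 40 false negatives, 760 true negatives.
m = safety_index(tp=180, fp=20, fn=40, tn=760)
print(breached_metrics(m))  # recall = 180/220 ≈ 0.82 < 0.85 → ["recall"]
```

Any non-empty result would be translated into the structured feedback described above ("recall below target in category Y") rather than surfaced as raw numbers.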
How does this apply to advertiser integrity or seller trust domains?
The same framework transfers directly: ad fraud detection requires the same precision/recall tradeoffs (false positives hurt legitimate advertisers, false negatives let fraud through); seller trust requires the same phased deployment and feedback loops (a policy change in one category ripples through the entire marketplace ecosystem). The methodology for making a system actually work is domain-agnostic.