Case Study · AI Operations · Moody's Analytics

Operationalizing an LLM Moderation Platform

The model existed. Teams didn't trust it, couldn't measure it, and weren't using it consistently. My job wasn't to build AI — it was to make the AI system actually work.

Elena Liu

Trust & Safety PM · Moody's Analytics

8 min read

22%↑ moderation accuracy · 15%↓ avg handle time · 40%↓ onboarding time · 3→1 tools consolidated


The Problem: The Model Existed. The System Didn't.

When I joined Moody's Analytics, the content review operation ran across three separate legacy tools — no shared taxonomy, no performance tracking, and no systematic feedback loop between what moderators observed in production and what the AI research team iterated on. The LLM-based moderation assistant had been built by Engineering. But it hadn't been operationalized.

The model existed; the system around it didn't.

What I Built: Four Phases

I treated this as a product launch, not a tool rollout. The work had four distinct phases:

1. Process Mapping

I documented the existing workflow end-to-end — where decisions were made, where errors clustered, where latency was introduced. This created a shared baseline and exposed the gaps the AI system needed to close.

2. Evaluation Framework

I defined what "working" meant before we deployed anything. I established a Safety Index System tracking Precision, Recall, and False Positive Rate across decision categories — the contract between the AI team and ops about what "good" looked like. (A minimal sketch of this computation appears after the diagram below.)

3. Phased Deployment + Feedback Loops

I piloted with one review team, captured structured feedback, and surfaced failure patterns to AI Research in actionable terms ("over-triggering on X category in Y context"), iterating before global rollout.

4. Enablement Layer

I built SOPs, training materials, and workflow documentation tailored to non-technical reviewers, so any new team member could reach full productivity without Engineering involvement.

[Diagram. Before: three siloed legacy review tools (Tool A: fragmented workflow; Tool B: no shared taxonomy; Tool C: no performance tracking). After: the Internal Safety OS, a unified platform with a shared taxonomy and end-to-end workflow ownership, combining the LLM Moderation Agent (AI-assisted decision support; 22%↑ accuracy, 15%↓ AHT), the Safety Index System (Precision, Recall, FPR with per-category thresholds), and a continuous Structured Feedback Loop (moderator observations → AI Research iteration → model improvement).]
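To make the evaluation framework concrete, here is a minimal Python sketch of how a per-category Safety Index check could be computed. The counts, category names, and threshold values are illustrative assumptions, not the production system's:

```python
from dataclasses import dataclass

@dataclass
class CategoryCounts:
    """Moderator-reviewed outcomes for one decision category (illustrative)."""
    true_positives: int   # flagged and actually problematic
    false_positives: int  # flagged but benign
    false_negatives: int  # problematic but missed
    true_negatives: int   # benign and correctly passed

def precision(c: CategoryCounts) -> float:
    """Of what was flagged, how much was actually problematic."""
    flagged = c.true_positives + c.false_positives
    return c.true_positives / flagged if flagged else 0.0

def recall(c: CategoryCounts) -> float:
    """Of what was actually problematic, how much was caught."""
    actual = c.true_positives + c.false_negatives
    return c.true_positives / actual if actual else 0.0

def false_positive_rate(c: CategoryCounts) -> float:
    """How often benign content gets wrongly flagged."""
    benign = c.false_positives + c.true_negatives
    return c.false_positives / benign if benign else 0.0

# Hypothetical per-category thresholds: (min precision, min recall, max FPR).
THRESHOLDS = {
    "spam": (0.90, 0.85, 0.05),
    "harassment": (0.95, 0.90, 0.02),
}

def below_threshold(category: str, c: CategoryCounts) -> bool:
    """True when any Safety Index metric breaches its per-category threshold."""
    min_p, min_r, max_fpr = THRESHOLDS[category]
    return (precision(c) < min_p
            or recall(c) < min_r
            or false_positive_rate(c) > max_fpr)
```

Any category failing this check would open a structured feedback item back to AI Research, the continuous loop shown in the diagram above.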

Results

22%↑ moderation accuracy · driven by tighter feedback loops between operational signals and model updates

15%↓ average handle time · from better AI-assisted decision support at the point of review

40%↓ onboarding time · structured documentation replaced ad hoc, Engineering-led training

3→1 tools consolidated · three legacy tools merged into a unified AI platform across global review teams

The Insight

Shipping an AI system is table stakes. Making it work — mapping the process, defining evaluation criteria, building the feedback loop, enabling the team, and measuring the impact — is the hard part. That's what I do.

Frequently Asked Questions

What was the hardest part of this project?

The hardest part wasn't technical — it was trust. The team already had an LLM tool, but without transparent evaluation criteria, nobody knew whether it was worth trusting. Establishing the Safety Index System — defining 'working' in terms of Precision, Recall, and False Positive Rate — was the unlock that shifted the team from hesitation to adoption.

How was the Safety Index System designed?

The Safety Index System is a contract between the ops team and AI Research that defines three core metrics across decision categories: Precision (of what was flagged, how much was actually problematic), Recall (of what was actually problematic, how much was caught), and False Positive Rate (how often normal content gets wrongly flagged). Each metric has a threshold, and any category below threshold triggers a structured feedback loop back to the model team.
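To illustrate what "actionable terms" could look like as data, here is a hedged sketch of one possible feedback record; every field name and the summary format are hypothetical, not the actual internal schema:

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackItem:
    """One structured feedback item for the model team (hypothetical schema)."""
    category: str          # decision category, e.g. "spam"
    context: str           # where the failures cluster, e.g. "user bios"
    failure_mode: str      # e.g. "over-triggering" or "under-triggering"
    metric: str            # which Safety Index metric breached its threshold
    observed_value: float
    threshold: float
    example_ids: list[str] = field(default_factory=list)  # reviewed cases

    def summary(self) -> str:
        """Render the "over-triggering on X category in Y context" phrasing."""
        return (f"{self.failure_mode} on {self.category} in {self.context}: "
                f"{self.metric}={self.observed_value:.2f} "
                f"(threshold {self.threshold:.2f}), "
                f"{len(self.example_ids)} examples attached")
```

Attaching reviewed example IDs is what turns a moderator observation into something the model team can iterate on directly.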

How does this apply to advertiser integrity or seller trust domains?

The same framework transfers directly: ad fraud detection requires the same precision/recall tradeoffs (false positives hurt legitimate advertisers, false negatives let fraud through); seller trust requires the same phased deployment and feedback loops (a policy change in one category ripples through the entire marketplace ecosystem). The methodology for making a system actually work is domain-agnostic.

Elena Liu

Product Operations Specialist · Trust & Safety PM

SF Bay Area PM specializing in Trust & Safety infrastructure and AI-driven workflow automation. Building the systems that make moderation scale.
