
Automating Content Triage with ML Pipelines

Manual processing can't scale. I designed classification systems that let machines handle 65% of routine decisions — so humans focus on the edge cases that actually need judgment.

Elena Liu

Trust & Safety PM · Flip · LeanData

6 min read

65% · Tier-1 reports automated (Flip)
12%↑ · moderator decision speed (Flip)
35%↑ · automation coverage (LeanData)
30%↓ · manual reconciliation errors (LeanData)

At Flip, manual content report processing had become a bottleneck — as the platform grew, Tier-1 report volume outpaced the team's ability to triage manually. At LeanData, the absence of a standardized classification taxonomy meant inconsistent data across teams and rising governance costs. The core logic was the same in both cases: define clear classification standards, automate the structured cases, and route the genuinely ambiguous ones to human review.

Flip: Automating Tier-1 Content Triage

Flip is a peer-to-peer resale marketplace. As the platform grew, user-reported content violations (Tier-1 reports) scaled faster than the moderation team could process them manually. Moderators were spending most of their time on structured, predictable cases — leaving less capacity for the complex edge cases that actually required judgment.

The Problem: Scale vs. Manual Processing

Most Tier-1 reports followed predictable patterns — specific violation categories showed consistent text signals, metadata patterns, and behavioral features. Processing them one by one was the highest-cost, lowest-value use of moderator time.

The Solution: ML Classifiers + Human-in-the-Loop

1. Data Analysis and Feature Engineering

Analyzed historical report data to identify violation categories with high automation potential. Extracted text features (TF-IDF), metadata signals (item category, account age, prior violation history), and behavioral patterns to build the training set.
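
In sketch form, the feature assembly might look like this with scikit-learn; the column names and sample values here are illustrative, not Flip's actual schema:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical report records; columns mirror the feature groups above.
reports = pd.DataFrame({
    "report_text": ["counterfeit bag, fake logo", "seller never shipped the item"],
    "item_category": ["handbags", "electronics"],
    "account_age_days": [12, 840],
    "prior_violations": [3, 0],
})

# TF-IDF for free text, one-hot for categoricals, scaling for counts.
features = ColumnTransformer([
    ("text", TfidfVectorizer(max_features=5000), "report_text"),  # 1-D text column
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["item_category"]),
    ("num", StandardScaler(), ["account_age_days", "prior_violations"]),
])

X = features.fit_transform(reports)
print(X.shape)  # (n_reports, text + categorical + numeric feature count)
```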

2. Classifier Training and Threshold Calibration

Trained multi-class classifiers using Python and Scikit-learn. Key decision: differentiated confidence thresholds by violation category — high-confidence cases auto-resolved, low-confidence cases routed to the human review queue. The threshold calibration directly reflected the cost asymmetry between false positives and false negatives.
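
The routing step reduces to a per-category lookup over the classifier's confidence. A minimal sketch, with made-up thresholds and category names (the production values were calibrated from historical cost data):

```python
import numpy as np

# Illustrative auto-resolve bars. High-risk categories get conservative
# (high) thresholds so borderline cases go to humans; low-risk repetitive
# categories can auto-resolve more aggressively.
AUTO_RESOLVE_THRESHOLDS = {
    "fraud": 0.98,
    "underage_protection": 0.99,
    "prohibited_item": 0.90,
    "spam": 0.75,
}

def route_report(proba: np.ndarray, classes: list[str]) -> tuple[str, str]:
    """Map one report's class probabilities to (category, decision)."""
    idx = int(np.argmax(proba))
    category, confidence = classes[idx], float(proba[idx])
    # Categories without a calibrated threshold effectively never auto-resolve.
    threshold = AUTO_RESOLVE_THRESHOLDS.get(category, 1.0)
    decision = "auto_resolve" if confidence >= threshold else "human_review"
    return category, decision

# Usage with a fitted scikit-learn classifier `clf`:
# for proba in clf.predict_proba(X_new):
#     category, decision = route_report(proba, list(clf.classes_))
```

The conservative default for unlisted categories is the point of the design: anything outside the calibrated taxonomy falls through to human review.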

3. Human-in-the-Loop Design

Automation wasn't a replacement for judgment — it was a precision triage system. The classifier handled predictable, structured cases; moderators focused on the complex edge cases requiring contextual reasoning. Built a feedback loop where human corrections fed back into model improvement.
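
A minimal sketch of the correction log that feeds retraining; the file format and field names are hypothetical, since the case study doesn't specify the storage layer:

```python
import csv
from datetime import datetime, timezone

CORRECTIONS_LOG = "moderator_corrections.csv"  # hypothetical path

def log_correction(report_id: str, predicted: str, corrected: str) -> None:
    """Append a moderator override for the next training run to learn from."""
    with open(CORRECTIONS_LOG, "a", newline="") as f:
        csv.writer(f).writerow(
            [datetime.now(timezone.utc).isoformat(), report_id, predicted, corrected]
        )

# At retraining time these rows are merged into the labeled set, with the
# human label taking precedence over the model's original call.
```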

4. GenAI Policy Enforcement Tool Pilot

Supported the pilot launch of a GenAI-powered content policy enforcement tool — defining evaluation metrics, collecting structured moderator feedback, and documenting policy boundaries. This directly informed the methodology I later applied when operationalizing Moody's LLM moderation assistant.
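
The evaluation side of such a pilot reduces to precision, recall, and false positive rate over moderator-labeled outcomes — the same three metrics the Safety Index System later tracked. A toy computation with scikit-learn, on made-up labels:

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

# Made-up binary labels: 1 = violation, 0 = benign.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]  # moderator ground truth
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]  # tool's decisions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
precision = precision_score(y_true, y_pred)  # tp / (tp + fp)
recall = recall_score(y_true, y_pred)        # tp / (tp + fn)
false_positive_rate = fp / (fp + tn)         # benign content wrongly flagged

print(f"precision={precision:.2f} recall={recall:.2f} fpr={false_positive_rate:.2f}")
```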

65% · Tier-1 reports automated
Structured, high-confidence cases resolved by classifier without human intervention

12%↑ · moderator decision speed
Moderators focused on complex edge cases — average handle time decreased

LeanData: Data Governance and Auto-Classification

LeanData is a B2B revenue operations SaaS company. As a Data Governance Analyst there, I tackled a core problem: inconsistent classification standards across teams made downstream analytics unreliable and drove up reconciliation costs.

The Problem: Taxonomy Fragmentation

Without a unified classification schema, each team labeled data according to its own interpretation. The same entity could have different classifications across systems, generating constant manual reconciliation work and eroding trust in the data.

The Solution: JSON Taxonomy + Python Automation

1. JSON Taxonomy Standardization

Collaborated with business teams to define unified classification standards and field specifications. Published a JSON Schema as the single data contract across systems — eliminating the root cause of each team doing its own ad-hoc labeling.
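
A toy version of what such a contract can look like, validated with the jsonschema package; the fields and enum values are illustrative, not LeanData's actual taxonomy:

```python
from jsonschema import ValidationError, validate  # pip install jsonschema

# Illustrative schema, standing in for the published data contract.
ACCOUNT_SCHEMA = {
    "type": "object",
    "properties": {
        "account_id": {"type": "string"},
        "segment": {"enum": ["enterprise", "mid_market", "smb"]},
        "region": {"enum": ["amer", "emea", "apac"]},
    },
    "required": ["account_id", "segment", "region"],
    "additionalProperties": False,
}

record = {"account_id": "A-1042", "segment": "mid-market", "region": "amer"}

try:
    validate(record, ACCOUNT_SCHEMA)
except ValidationError as e:
    # "mid-market" is not in the enum: the contract catches the drift that
    # previously surfaced downstream as manual reconciliation work.
    print(f"rejected: {e.message}")
```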

2. Python Auto-Classification System

Built an automated classification pipeline in Python (Scikit-learn) against the standardized taxonomy to process structured input data. Expanded automation coverage by 35%, substantially reducing manual labeling workload.

3. Data Quality Monitoring Framework

Established continuous monitoring for classification consistency, anomaly rates, and coverage gaps. Gave teams an actionable data quality dashboard rather than only post-hoc reports.
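
In sketch form, the core checks are a few aggregations over the classification output table; the table and valid-label set below are hypothetical:

```python
import pandas as pd

# Hypothetical classification output table.
labels = pd.DataFrame({
    "record_id": [1, 2, 3, 4, 5],
    "label": ["smb", "enterprise", None, "smb", "unknown_tier"],
    "source": ["auto", "auto", "auto", "manual", "auto"],
})

VALID_LABELS = {"enterprise", "mid_market", "smb"}

coverage = labels["label"].notna().mean()              # share of records labeled at all
automation_rate = (labels["source"] == "auto").mean()  # share labeled without a human
anomaly_rate = (                                       # labels outside the taxonomy
    labels["label"].notna() & ~labels["label"].isin(VALID_LABELS)
).mean()

print(f"coverage={coverage:.0%} automation={automation_rate:.0%} anomalies={anomaly_rate:.0%}")
```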

35%↑ · automation coverage
Python classification pipeline handles structured inputs, replacing manual labeling

30%↓ · manual reconciliation errors
Unified taxonomy eliminated cross-system inconsistencies

The Through-Line

Automation isn't about replacing judgment — it's about applying judgment where it matters. Consistent taxonomy + high-confidence auto-resolution + human focus on edge cases: this methodology transfers directly across content triage, data governance, and LLM moderation platforms.

Frequently Asked Questions

How does this connect to the Moody's Analytics LLM moderation work?

The Flip ML triage pipeline is the direct predecessor to the Moody's work: both involve designing a classification system where automation handles structured cases and humans handle edge cases. The feature engineering, threshold calibration, and HITL design at Flip fed directly into how I framed the Safety Index System — tracking Precision, Recall, and False Positive Rate as the operational contract.

How did you handle false positives in the classifier?

Differentiated thresholds by violation category: the cost of a false positive varies by category — high-risk categories (fraud, underage protection) got conservative thresholds that route to human review; low-risk repetitive categories got aggressive thresholds for auto-resolution. This asymmetric calibration is the same logic behind tracking False Positive Rate as a separate metric in the Safety Index System.

How does this transfer to seller trust or advertiser integrity domains?

Directly: seller trust requires the same classification taxonomy (which seller behaviors trigger review), automation coverage metrics, and edge-case routing. Advertiser integrity requires the same false positive/negative tradeoff — the cost of wrongly flagging a legitimate advertiser is high, but so is the cost of missing a fraudulent one. The methodology is domain-agnostic.

Elena Liu

Product Operations Specialist · Trust & Safety PM

SF Bay Area PM specializing in Trust & Safety infrastructure and AI-driven workflow automation. Building the systems that make moderation scale.

© 2026 Elena Liu. All rights reserved.