Flip: Automating Tier-1 Content Triage
Flip is a peer-to-peer resale marketplace. As the platform grew, user-reported content violations (Tier-1 reports) scaled faster than the moderation team could process them manually. Moderators were spending most of their time on structured, predictable cases — leaving less capacity for the complex edge cases that actually required judgment.
The Problem: Scale vs. Manual Processing
Most Tier-1 reports followed predictable patterns — specific violation categories showed consistent text signals, metadata patterns, and behavioral features. Processing them one-by-one was the highest-cost, lowest-value use of moderator time.
The Solution: ML Classifiers + Human-in-the-Loop
Data Analysis and Feature Engineering
Analyzed historical report data to identify violation categories with high automation potential. Extracted text features (TF-IDF), metadata signals (item category, account age, prior violation history), and behavioral patterns to build the training set.
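The feature pipeline described above can be sketched with scikit-learn's `ColumnTransformer`, combining TF-IDF text features with one-hot metadata and passthrough behavioral signals. The column names and sample reports here are illustrative, not Flip's actual schema:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import OneHotEncoder

# Toy stand-in for historical report data (columns are hypothetical)
reports = pd.DataFrame({
    "report_text": ["counterfeit designer bag", "rude message from seller"],
    "item_category": ["fashion", "electronics"],
    "account_age_days": [12, 980],
    "prior_violations": [3, 0],
})

features = ColumnTransformer([
    # Text signals from the report body
    ("text", TfidfVectorizer(), "report_text"),
    # Metadata signals, one-hot encoded so unseen categories don't crash inference
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["item_category"]),
    # Behavioral signals passed through as-is
    ("num", "passthrough", ["account_age_days", "prior_violations"]),
])

X = features.fit_transform(reports)
print(X.shape)  # (n_reports, n_features)
```

The fitted transformer can then feed any scikit-learn classifier in a single `Pipeline`, so training and inference share one feature definition.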
Classifier Training and Threshold Calibration
Trained multi-class classifiers using Python and Scikit-learn. Key decision: differentiated confidence thresholds by violation category — high-confidence cases auto-resolved, low-confidence cases routed to the human review queue. The threshold calibration directly reflected the cost asymmetry between false positives and false negatives.
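A minimal sketch of the per-category routing logic; the category names and threshold values are hypothetical, chosen only to show the asymmetry between high-risk and low-risk categories:

```python
# Per-category confidence thresholds (illustrative values).
# High-risk categories get conservative thresholds: a false positive
# is costly, so more cases are routed to human review.
THRESHOLDS = {
    "fraud": 0.98,            # conservative: auto-resolve only near-certain cases
    "spam": 0.80,             # aggressive: repetitive, low-cost-of-error category
    "prohibited_item": 0.90,
}

def route(category: str, confidence: float) -> str:
    """Auto-resolve only when confidence clears the category's threshold."""
    threshold = THRESHOLDS.get(category, 0.95)  # unknown categories default to conservative
    return "auto_resolve" if confidence >= threshold else "human_review"

print(route("spam", 0.85))   # auto_resolve
print(route("fraud", 0.85))  # human_review
```

The same classifier score produces different outcomes by category, which is the whole point of the asymmetric calibration.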
Human-in-the-Loop Design
Automation wasn't a replacement for judgment — it was a precision triage system. The classifier handled predictable, structured cases; moderators focused on the complex edge cases requiring contextual reasoning. Built a feedback loop where human corrections fed back into model improvement.
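The feedback loop can be sketched as a store that keeps only moderator overrides, i.e. cases where the human disagreed with the model, as labeled examples for the next retraining cycle. Field names are hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class FeedbackStore:
    """Collects human corrections to use as retraining labels."""
    corrections: list = field(default_factory=list)

    def record(self, report_id: str, predicted: str, human_label: str) -> None:
        # Only disagreements carry new training signal; agreements are no-ops
        if predicted != human_label:
            self.corrections.append({"report_id": report_id, "label": human_label})

store = FeedbackStore()
store.record("r1", predicted="spam", human_label="spam")   # agreement: skipped
store.record("r2", predicted="spam", human_label="fraud")  # correction: stored
print(len(store.corrections))  # 1
```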
GenAI Policy Enforcement Tool Pilot
Supported the pilot launch of a GenAI-powered content policy enforcement tool — defining evaluation metrics, collecting structured moderator feedback, and documenting policy boundaries. This directly informed the methodology I later applied when operationalizing Moody's LLM moderation assistant.
65% of Tier-1 reports automated: structured, high-confidence cases resolved by the classifier without human intervention.
12%↑ moderator decision speed: with moderators focused on complex edge cases, average handle time decreased.
LeanData: Data Governance and Auto-Classification
LeanData is a B2B revenue operations SaaS company. As a Data Governance Analyst, the core problem was inconsistent classification standards across teams — downstream analytics were unreliable and reconciliation costs were high.
The Problem: Taxonomy Fragmentation
Without a unified classification schema, each team labeled data according to its own interpretation. The same entity could have different classifications across systems, generating constant manual reconciliation work and eroding trust in the data.
The Solution: JSON Taxonomy + Python Automation
JSON Taxonomy Standardization
Collaborated with business teams to define unified classification standards and field specifications. Published a JSON Schema as the single data contract across systems — eliminating the root cause of each team doing its own ad-hoc labeling.
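An illustrative slice of what such a contract enforces. In production the published JSON Schema would be checked with a schema validation library; this stdlib sketch, with hypothetical field names and allowed values, mirrors the same two rules (required fields, closed enum):

```python
# Hypothetical taxonomy contract: not LeanData's actual schema
TAXONOMY_SCHEMA = {
    "required": ["entity_id", "classification"],
    "allowed_classifications": {"account", "contact", "opportunity"},
}

def validate(record: dict) -> list:
    """Return a list of contract violations (empty list = conforming record)."""
    errors = [
        f"missing field: {f}"
        for f in TAXONOMY_SCHEMA["required"]
        if f not in record
    ]
    cls = record.get("classification")
    if cls is not None and cls not in TAXONOMY_SCHEMA["allowed_classifications"]:
        errors.append(f"unknown classification: {cls}")
    return errors

print(validate({"entity_id": "E-1001", "classification": "account"}))  # []
print(validate({"entity_id": "E-1002", "classification": "lead_v2"}))
```

Because every system validates against the same contract, an ad-hoc label like `lead_v2` is rejected at ingestion instead of surfacing later as a reconciliation ticket.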
Python Auto-Classification System
Built an automated classification pipeline in Python (Scikit-learn) that labels structured input data against the standardized taxonomy. Expanded automation coverage by 35%, substantially reducing manual labeling workload.
Data Quality Monitoring Framework
Established continuous monitoring for classification consistency, anomaly rates, and coverage gaps. Gave teams an actionable data quality dashboard rather than only post-hoc reports.
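Two of the monitored signals, coverage and anomaly rate, reduce to simple ratios over the classified records. The record layout below is illustrative:

```python
# Toy batch of processed records (fields are hypothetical):
# auto_label is None when the record fell through to manual labeling,
# valid is False when the record violated the taxonomy contract.
records = [
    {"id": 1, "auto_label": "account", "valid": True},
    {"id": 2, "auto_label": None, "valid": True},
    {"id": 3, "auto_label": "contact", "valid": False},
    {"id": 4, "auto_label": "account", "valid": True},
]

# Coverage: share of records classified without manual labeling
coverage = sum(r["auto_label"] is not None for r in records) / len(records)
# Anomaly rate: share of records failing the taxonomy contract
anomaly_rate = sum(not r["valid"] for r in records) / len(records)

print(f"coverage={coverage:.0%} anomaly_rate={anomaly_rate:.0%}")  # coverage=75% anomaly_rate=25%
```

Computed per team and per classification, these ratios are what the dashboard surfaces, so a coverage gap or anomaly spike is visible before it becomes a reconciliation backlog.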
35%↑ automation coverage: the Python classification pipeline handles structured inputs, replacing manual labeling.
30%↓ manual reconciliation errors: the unified taxonomy eliminated cross-system inconsistencies.
The Through-Line
Automation isn't about replacing judgment — it's about applying judgment where it matters. Consistent taxonomy + high-confidence auto-resolution + human focus on edge cases: this methodology transfers directly across content triage, data governance, and LLM moderation platforms.
Frequently Asked Questions
How does this connect to the Moody's Analytics LLM moderation work?
The Flip ML triage pipeline is the direct predecessor to the Moody's work: both involve designing a classification system where automation handles structured cases and humans handle edge cases. The feature engineering, threshold calibration, and HITL design at Flip fed directly into how I framed the Safety Index System — tracking Precision, Recall, and False Positive Rate as the operational contract.
How did you handle false positives in the classifier?
Differentiated thresholds by violation category: the cost of a false positive varies by category — high-risk categories (fraud, underage protection) got conservative thresholds that route to human review; low-risk repetitive categories got aggressive thresholds for auto-resolution. This asymmetric calibration is the same logic behind tracking False Positive Rate as a separate metric in the Safety Index System.
How does this transfer to seller trust or advertiser integrity domains?
Directly: seller trust requires the same classification taxonomy (which seller behaviors trigger review), automation coverage metrics, and edge-case routing. Advertiser integrity requires the same false positive/negative tradeoff — the cost of wrongly flagging a legitimate advertiser is high, but so is the cost of missing a fraudulent one. The methodology is domain-agnostic.