Portfolio | Pathstone Analytics

60% to 22%: LLM Output Optimization Protocol

Production Quality Control for Multi-LLM Workflows

The compounding reliability problem

AI language models in multi-turn workflows produce compounding reliability problems. Content goes unprocessed. Fabricated information passes through unchecked. Early errors cascade, and by the time anyone notices, the downstream output is already contaminated. Most organizations either trust AI output entirely or waste resources verifying everything at the same depth. Neither scales.

A verification protocol built through empirical testing

Tested on the most advanced models available in early 2026 from OpenAI, Anthropic, Google (Gemini), and xAI (Grok). Testing surfaced two distinct failure modes: incomplete content ingestion (the model skips what it was given) and hallucination (the model invents what it thinks should be there). We developed a mathematical model to predict contamination rates across multi-turn workflows, then validated it against real production data.
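
The production model itself is not reproduced here. As a minimal sketch of why errors compound across turns, assume each turn independently introduces an error with the same probability. Both the independence assumption and the sample inputs are illustrative, not the figures or the formula used in testing.

```python
def contamination_rate(per_turn_error: float, turns: int) -> float:
    """Chance that at least one of `turns` independent turns introduces an error.

    Illustrative assumption: every turn fails independently with the same
    probability. Real workflows may violate both parts of that assumption.
    """
    if not 0.0 <= per_turn_error <= 1.0:
        raise ValueError("per_turn_error must be between 0 and 1")
    if turns < 0:
        raise ValueError("turns must be non-negative")
    return 1.0 - (1.0 - per_turn_error) ** turns


if __name__ == "__main__":
    # Hypothetical per-turn rate, chosen only to show the compounding shape.
    for n in (1, 5, 10, 20):
        print(n, round(contamination_rate(0.05, n), 3))
```

Even a small per-turn rate grows quickly with workflow length, which is the pattern the protocol is built to interrupt.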

Output gets classified into risk tiers with different verification requirements. Critical-risk output gets full verification. Low-risk output gets a lighter pass. Resources go where the exposure actually is, not spread uniformly across everything.
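
A rough sketch of tiered verification, assuming only the two tiers named above. The classification rule, the `Output` fields, and the step names are hypothetical placeholders, not the protocol's actual criteria.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class RiskTier(Enum):
    CRITICAL = "critical"
    LOW = "low"


@dataclass(frozen=True)
class Output:
    text: str
    feeds_downstream: bool  # reused as input to a later turn (hypothetical field)
    client_facing: bool     # leaves the team unreviewed otherwise (hypothetical field)


# Hypothetical step names: a full pass for critical output, a lighter pass for low risk.
VERIFICATION_STEPS = {
    RiskTier.CRITICAL: ["source_check", "completeness_check", "human_review"],
    RiskTier.LOW: ["spot_check"],
}


def classify(output: Output) -> RiskTier:
    """Assumed rule: exposure, not length or topic, decides the tier."""
    if output.feeds_downstream or output.client_facing:
        return RiskTier.CRITICAL
    return RiskTier.LOW


def verification_plan(output: Output) -> List[str]:
    """Return the verification steps required for this output's tier."""
    return list(VERIFICATION_STEPS[classify(output)])
```

The point of the structure is that verification cost follows the tier, so the expensive steps run only where an error would propagate or be seen.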

AI providers like OpenAI, Anthropic, Google, and xAI are all reducing hallucination rates from inside the model, each in different ways. None of them offers human-in-the-loop verification as a standard protocol because the token overhead makes it impractical at platform scale. They optimize the model. We optimize what comes out of it. Both matter. Neither replaces the other.

Hallucination rates dropped from 20% to 5%

Unprocessed content fell from 60% to 22% across three verification rounds. Hallucination rates fell from approximately 20% to approximately 5%. The predictive contamination model produced a 59% estimate against 60% observed in production, validating the mathematical framework. Protocol is deployed across active multi-LLM production workflows and scales without degradation.

Models have improved since this testing was conducted. Baseline hallucination rates are lower now than they were in early 2026. The methodology still produces measurable gains on top of what the models deliver natively, because it addresses the architectural problem of how output is verified and used, not the model problem of how output is generated.

Recurring 15-25% Efficiency Gain: AI-Powered Franchise Operations

Production AI Deployment for a 4-Location Franchise

Three constraints, one mandated budget

A 4-location franchise operation faced three simultaneous constraints: perishable inventory waste cutting into margins, a franchisor-mandated 2% local marketing spend (~$32K/year) being allocated to low-return activities by the operator, and no systematic way to identify which community organizations would generate the highest goodwill-to-customer conversion from surplus product.

AI on the backend, a clean app on the front

Engineered an AI-powered system with API-driven scoring and decision logic on the backend, delivered through a custom application that store managers interact with directly. The AI handles multi-dimensional scoring, venue ranking, and trigger-based recommendations. Operators see ranked options, pre-built scripts, and configurable proximity parameters through a clean interface. They never touch the infrastructure underneath.

The system was purpose-built around franchise compliance constraints from the ground up. Not adapted after the fact. Compliance risk operates as a penalty multiplier rather than a standard scoring factor. That architectural choice is what makes this work: the system's own logic filters out franchisor-violating recommendations before they surface, regardless of how well a target scores on every other dimension. The system scans for nearby community events and organizations, then ranks them by fit for the product. Sometimes that surfaces two options. Sometimes sixty. The number was fluid by design because the goal was quality matches, not volume.
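
To illustrate the difference between a penalty multiplier and an ordinary scoring factor, here is a minimal sketch. The dimensions, weights, and the 0-to-1 `compliance_risk` scale are assumptions for illustration, not the deployed scoring model.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Venue:
    name: str
    fit: float              # 0..1, product fit (hypothetical dimension)
    proximity: float        # 0..1, closeness to the store (hypothetical dimension)
    reach: float            # 0..1, expected audience (hypothetical dimension)
    compliance_risk: float  # 0 = no franchisor concern, 1 = clear violation


# Hypothetical weights for the ordinary, additive scoring factors.
WEIGHTS = {"fit": 0.5, "proximity": 0.3, "reach": 0.2}


def score(venue: Venue) -> float:
    """Weighted base score, then scaled down by compliance risk."""
    base = (
        WEIGHTS["fit"] * venue.fit
        + WEIGHTS["proximity"] * venue.proximity
        + WEIGHTS["reach"] * venue.reach
    )
    # As a multiplier, a risk of 1.0 zeroes the score no matter how high base is.
    return base * (1.0 - venue.compliance_risk)


def rank(venues: List[Venue]) -> List[Venue]:
    """Best first; venues whose score was zeroed out never surface."""
    scored = [(score(v), v) for v in venues]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [v for s, v in scored if s > 0.0]
```

If compliance were just another weighted term, a venue with perfect fit, proximity, and reach could still outrank compliant options. As a multiplier, it cannot.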

Month-over-month improvement, zero compliance violations

Recurring 15-25% efficiency improvement on the mandated local marketing spend, sustained month over month across the deployment period. Projected annual recaptured value of $4.8K-$8K per location at sustained rates. Full franchise compliance maintained throughout deployment, with zero policy violations flagged. AI infrastructure operated transparently to end users with zero training burden on store-level staff.
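
The dollar range can be checked with simple arithmetic, assuming the ~$32K annual figure is the base the 15-25% improvement applies to. That base is an assumption taken from the spend figure quoted earlier.

```python
ANNUAL_SPEND_USD = 32_000  # approximate mandated local marketing spend (assumed base)


def recaptured_value(spend_usd: int, improvement_pct: int) -> int:
    """Dollar value of an efficiency improvement expressed in whole percent."""
    return spend_usd * improvement_pct // 100


low = recaptured_value(ANNUAL_SPEND_USD, 15)
high = recaptured_value(ANNUAL_SPEND_USD, 25)
print(f"${low:,} to ${high:,} per year")
```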

82.4% Productivity Gain, 50%+ Cost Reduction: AI Workforce Optimization

Internal Deployment, 5-Person Distributed Team

Token costs climbing, no visibility into what they produced

Our team had the same problem every AI-first operation hits eventually. Token costs climbing with no visibility into what those tokens actually produced. The team was defaulting to the most powerful model for every task. Data uploads, research synthesis, document formatting, complex analysis. All running on the same tier. Screenshots were being used as primary AI input instead of text, burning 3-5x the tokens for worse output quality. Context windows were polluted with planning debris from prior exchanges, degrading every subsequent response.

We were regularly exceeding AI platform budgets with token overages, and the budget limitations were actively constraining workflow. The standard industry approach: track token volume, set budget caps, hope for the best. Nobody was measuring the relationship between what went in and what came out.

One ratio: Individual Potential divided by Token Precision

Built a framework around a single trackable ratio. The numerator captures what each team member is capable of producing. Not a job title. A 26-dimension capability profile that maps actual skill distribution, learning patterns, AI interaction maturity, and task-specific strengths. Each person's profile is different because each person is different. The AI deployment strategy adapts to the individual, not the other way around.

The denominator captures how intelligently tokens are deployed. Not how many. How precisely. Three variables: model selection per task (the task determines the tool, not habit), input format discipline (text first, screenshots only when visual context adds something text cannot), and context hygiene (chat health tracking and abort protocols that define when a conversation has degraded past the point of productive use).
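
The three variables can be sketched as three small decisions. The tier names, the task-to-tier mapping, and the abort thresholds below are hypothetical, chosen to show the shape of the rules rather than the values we use.

```python
from enum import Enum


class ModelTier(Enum):
    LIGHT = "light"
    STANDARD = "standard"
    FRONTIER = "frontier"


# Hypothetical mapping: the task type picks the tier, not habit.
TASK_TIER = {
    "data_upload": ModelTier.LIGHT,
    "document_formatting": ModelTier.LIGHT,
    "research_synthesis": ModelTier.STANDARD,
    "complex_analysis": ModelTier.FRONTIER,
}


def pick_model(task_type: str) -> ModelTier:
    """Model selection per task; unknown tasks fall back to the middle tier."""
    return TASK_TIER.get(task_type, ModelTier.STANDARD)


def pick_input_format(needs_visual_context: bool) -> str:
    """Input format discipline: text first, screenshots only when visuals add something."""
    return "screenshot" if needs_visual_context else "text"


def should_abort(turn_count: int, stale_turns: int,
                 max_turns: int = 40, max_stale_ratio: float = 0.3) -> bool:
    """Context hygiene: restart when the chat is too long or too cluttered.

    `stale_turns` counts turns that are leftover planning debris. Both
    thresholds are illustrative defaults.
    """
    if turn_count <= 0:
        return False
    return turn_count >= max_turns or stale_turns / turn_count > max_stale_ratio
```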

High potential with high precision means the system is working. High potential with low precision means tokens are being wasted. Low potential with high precision means you are efficiently accomplishing nothing.

Six weeks of work completed in four days

Over 50% reduction in AI platform costs while simultaneously removing the workflow limitations the previous budget was creating. Overages eliminated.

For team members with prior AI experience but no structured workflow: 82.4% productivity gain. Work that previously required six weeks was completed in four days. The gain came from the framework, not from first exposure to AI. For AI-native users already running AI daily: 50-55% productivity gain. They were producing good work inefficiently. The framework closed that gap.

We confirmed the framework could sustain output at the lowest available plan tier, but the tradeoff was not Pareto-optimal: reasoning capability lost at that tier introduced risk of inferior work that outweighed the cost savings.

Before the framework, AI costs were rising while output was being capped by rate limits and budget constraints. After deployment, costs dropped, output increased, and the team has not hit a workflow limitation since. Framework operational within 30 days. Deployable to any team running AI operations.

Near-Zero Contamination: 22-System AI Governance Architecture

Enterprise-Scale Governance Design

When systems start interfering with each other

Managing operations across multiple business units, AI workflows, and knowledge repositories creates a compounding information management problem. Without explicit governance, systems start interfering with each other. Data drifts between contexts. Authority over decisions gets ambiguous. The cost of maintaining consistency grows faster than the organization itself.

22 systems, one set of governing contracts

Designed and built a 22-system directive architecture processing over 50MB of raw structured text, equivalent to tens of thousands of pages of operational data, across multiple business units. Every system has defined scope, explicit permissions, and prohibited logic. When systems conflict, resolution rules are predetermined. Foundational systems automatically override higher-level analytical systems in any data integrity conflict.

Information flows through explicit routing rules with scope boundaries. When systems produce conflicting conclusions, both are preserved with full provenance. The architecture never resolves contradictions by choosing a winner and discarding the loser.
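
One way to read these two rules together: precedence decides which value governs in a data integrity conflict, while the record keeps both conclusions. The sketch below assumes that reading. The layer names, fields, and `record_conflict` helper are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional, Tuple


@dataclass(frozen=True)
class Conclusion:
    system: str            # which system produced the conclusion
    layer: str             # "foundational" or "analytical" (assumed layer names)
    claim: str
    source: str            # provenance: where the underlying data came from
    recorded_at: datetime  # provenance: when the conclusion was recorded


@dataclass(frozen=True)
class ConflictRecord:
    topic: str
    conclusions: Tuple[Conclusion, Conclusion]  # both sides are always kept
    governing_system: Optional[str]             # set only when precedence applies


def record_conflict(topic: str, a: Conclusion, b: Conclusion,
                    data_integrity: bool) -> ConflictRecord:
    """Preserve both conclusions; mark a governing system only by predetermined rule."""
    governing = None
    if data_integrity:
        foundational = [c for c in (a, b) if c.layer == "foundational"]
        # Precedence applies only when exactly one side is foundational.
        if len(foundational) == 1:
            governing = foundational[0].system
    return ConflictRecord(topic=topic, conclusions=(a, b), governing_system=governing)
```

Nothing is deleted in either branch: the losing side of a precedence rule stays in the record with its provenance intact.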

Failure modes designed out, not monitored after

Architecture processes tens of thousands of pages of raw structured text across multiple business units with near-zero cross-system contamination. New systems plug into the same governing contracts without refactoring the foundation. The most common failure modes in complex systems are preemptively addressed in the architecture itself, making them statistically unlikely rather than relying on detection after the fact.

Framework Validated, Project Killed: Viral Signal Detection

Market Analytics and Feasibility Evaluation

Billions of data points, no way to tell what's real

A client in the short-form video space needed a way to distinguish genuine early-stage virality from noise: algorithmic test pushes, influencer-driven distortion, and manufactured spikes that existing analytics tools treat identically to organic momentum. The gap in the market was clear, with billions of data points generated daily and no reliable methodology for separating real signals from artificial ones.

A detection framework built from first principles

The system neutralizes large-account distortion, identifies and discounts algorithmic test windows, and isolates organic carry. All metrics are scale-invariant, working across any account size, niche, or platform without recalibration. The framework was designed for portability across TikTok, Instagram Reels, YouTube Shorts, and future short-form surfaces.
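
As an illustration of what scale-invariant means here, the sketch below normalizes post-test-window view velocity by follower count, so an account ten times larger with ten times the views scores the same. The metric, the field names, and the two-hour test window are assumptions for illustration, not the framework's actual definitions.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class HourlySample:
    hour: int   # hours since posting, starting at 0
    views: int  # views recorded during that hour


def organic_carry(samples: Sequence[HourlySample], followers: int,
                  test_window_hours: int = 2) -> float:
    """Average hourly views after the assumed test window, per follower."""
    if followers <= 0:
        raise ValueError("followers must be positive")
    # Discount the assumed algorithmic test window entirely.
    organic = [s.views for s in samples if s.hour >= test_window_hours]
    if not organic:
        return 0.0
    return sum(organic) / len(organic) / followers
```

Because the result is a per-follower rate, the same threshold can be applied across account sizes without recalibration.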

We tested the framework against live platform data. Initial results validated the detection methodology.

The honest call was to walk away

The framework worked. The economics required a commitment the client wasn't prepared to make. Ongoing data collection depended on scrapers operating against platforms that were already tightening access restrictions, and we predicted that enforcement would only accelerate. Building and maintaining the infrastructure was doable with the right team and regular upkeep, but we couldn't guarantee what platform access policies would look like in the future. That uncertainty, combined with the outcomes we projected as most likely, made continued investment hard to justify. We recommended killing the project and provided a full feasibility assessment documenting what worked, what didn't, and why walking away was the right call.

The enforcement prediction held up. By early 2026, social platforms had moved from reactive enforcement to preventive detection. Scripts that worked one month were returning block pages the next. The cat-and-mouse game we forecasted became the industry norm.

Five problems. Five systems.

Have a problem that fits this pattern?