The Challenge
When you're building production AI applications, one question keeps you up at night: "How do I know if my AI is actually working well?" Traditional metrics like accuracy don't capture the nuanced quality aspects of LLM outputs. You need to evaluate things like prompt alignment, context relevance, and noise sensitivity—qualities that require both algorithmic precision and human-like judgment.
At Mastra, we faced this exact challenge while building our AI evaluation framework. We needed a system that could:
- Handle diverse evaluation scenarios (accuracy, relevance, safety, performance)
- Balance speed and accuracy across different evaluation types
- Scale from prototype to production workloads
- Be extensible for custom evaluation needs
Here's how we solved it: a four-step evaluation pipeline that elegantly combines deterministic functions with LLM-based judgment.
The Four-Step Architecture
Every scorer in our system follows the same pipeline:
// Our unified scorer architecture
createScorer({
  name,
  description,
  judge: { model, instructions },
})
  .preprocess({ /* optional: prepare input/output data */ })
  .analyze({ /* LLM-based analysis */ })
  .generateScore({ /* algorithmic scoring */ })
  .generateReason({ /* human-readable explanation */ })
1. Preprocess (Optional)
Transform input/output data into the format needed for evaluation. This might involve extracting text from complex objects, normalizing formats, or preparing context.
2. Analyze (Required for LLM scorers)
Use an LLM to perform deep analysis of the content. This leverages the model's understanding of language, context, and nuanced relationships.
3. Generate Score (Required)
Convert analysis into numerical scores using either deterministic algorithms or AI judgment. This is where we apply our scoring formulas.
4. Generate Reason (Optional)
Create human-readable explanations that help developers understand why a particular score was assigned.
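Steps 1 and 4 don't appear in the excerpts below, so here is a minimal sketch of what they can look like. The callback signatures ({ run } and { results, score }) and the preprocess return shape are assumptions inferred from the analyze and generateScore examples later in this post.

.preprocess(({ run }) => ({
  // Pull plain text out of structured input/output before the LLM analysis
  userText: getUserMessageFromRunInput(run.input) ?? '',
  assistantText: getAssistantMessageFromRunOutput(run.output) ?? '',
}))

// ...analyze and generateScore go here...

.generateReason(({ results, score }) => {
  const issues = results.analyzeStepResult?.majorIssues ?? [];
  return issues.length > 0
    ? `Scored ${score.toFixed(2)}; penalized for: ${issues.join('; ')}`
    : `Scored ${score.toFixed(2)}; no major issues detected.`;
})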
Real Implementation: Noise Sensitivity Scorer
Let me show you how this works with our noise sensitivity scorer—one of the most sophisticated in our suite. This scorer evaluates how robust an AI agent is when exposed to irrelevant or misleading information.
export function createNoiseSensitivityScorerLLM({ model, options }) {
  return createScorer({
    name: 'Noise Sensitivity (LLM)',
    description:
      'Evaluates how robust an agent is when exposed to irrelevant, distracting, or misleading information',
    judge: { model, instructions: NOISE_SENSITIVITY_INSTRUCTIONS },
  })
    .analyze({
      outputSchema: z.object({
        dimensions: z.array(z.object({
          dimension: z.string(),
          impactLevel: z.enum(['none', 'minimal', 'moderate', 'significant', 'severe']),
          specificChanges: z.string(),
          noiseInfluence: z.string(),
        })),
        majorIssues: z.array(z.string()).optional(),
        robustnessScore: z.number().min(0).max(1),
      }),
      createPrompt: ({ run }) => createAnalyzePrompt({
        userQuery: getUserMessageFromRunInput(run.input) ?? '',
        baselineResponse: options.baselineResponse,
        noisyQuery: options.noisyQuery,
        noisyResponse: getAssistantMessageFromRunOutput(run.output) ?? '',
        noiseType: options.noiseType,
      }),
    })
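For orientation, instantiating the scorer looks roughly like this. A minimal sketch: the openai() provider from the AI SDK is one way to supply the model, and the noiseType label and sample strings are made up for illustration.

import { openai } from '@ai-sdk/openai';

const noiseScorer = createNoiseSensitivityScorerLLM({
  model: openai('gpt-4o'),
  options: {
    baselineResponse: 'Paris is the capital of France.',
    noisyQuery: 'What is the capital of France? Also, my neighbor swears it moved to Lyon.',
    noiseType: 'misleading-claim', // hypothetical label
  },
});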
The Scoring Algorithm
Here's where it gets interesting. Our noise sensitivity scorer uses a hybrid approach:
.generateScore(({ results }) => {
  const analysisResult = results.analyzeStepResult;
  const { dimensions, majorIssues = [] } = analysisResult;
  let finalScore = analysisResult.robustnessScore; // LLM's direct assessment

  // Calculate algorithmic score based on impact levels
  // (impactWeights, the discrepancy threshold, and the penalty rates are defined elsewhere in the scorer)
  const averageImpact = dimensions.reduce((sum, dim) => {
    return sum + impactWeights[dim.impactLevel]; // none=1.0, severe=0.1
  }, 0) / dimensions.length;

  // Conservative approach: use the lower score when they diverge significantly
  if (Math.abs(finalScore - averageImpact) > discrepancyThreshold) {
    finalScore = Math.min(finalScore, averageImpact);
  }

  // Apply penalties for major issues
  const issuesPenalty = Math.min(
    majorIssues.length * majorIssuePenaltyRate,
    maxMajorIssuePenalty
  );

  return Math.max(0, finalScore - issuesPenalty);
})
The formula: final_score = max(0, score - issues_penalty), where score is min(llm_score, calculated_score) when the two diverge beyond the discrepancy threshold, and llm_score otherwise.
This hybrid approach gives us the best of both worlds:
- LLM judgment for nuanced quality assessment
- Algorithmic validation to prevent hallucinated scores
- Conservative bias when the two approaches disagree
- Issue-based penalties for objective quality problems
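To make the formula concrete, here is a small self-contained sketch with assumed constants: only the none = 1.0 and severe = 0.1 weights come from the snippet above; the other weights, the threshold, and the penalty rates are placeholders.

const impactWeights = {
  none: 1.0,
  minimal: 0.75,     // assumed
  moderate: 0.5,     // assumed
  significant: 0.25, // assumed
  severe: 0.1,
};
const discrepancyThreshold = 0.2;  // assumed
const majorIssuePenaltyRate = 0.1; // assumed
const maxMajorIssuePenalty = 0.3;  // assumed

function hybridScore(
  llmScore: number,
  impacts: Array<keyof typeof impactWeights>,
  majorIssueCount: number,
): number {
  const averageImpact =
    impacts.reduce((sum, level) => sum + impactWeights[level], 0) / impacts.length;

  // Conservative bias: trust the lower signal when the two diverge significantly
  const base =
    Math.abs(llmScore - averageImpact) > discrepancyThreshold
      ? Math.min(llmScore, averageImpact)
      : llmScore;

  const penalty = Math.min(majorIssueCount * majorIssuePenaltyRate, maxMajorIssuePenalty);
  return Math.max(0, base - penalty);
}

// LLM says 0.8, three dimensions average (1.0 + 0.75 + 0.5) / 3 = 0.75, one major issue:
// the scores agree (|0.8 - 0.75| <= 0.2), so only the 0.1 penalty applies -> ~0.7
hybridScore(0.8, ['none', 'minimal', 'moderate'], 1);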
Context Relevance: A Different Challenge
Our context relevance scorer tackles a different but equally important problem—measuring how well provided context supports the AI's response:
.generateScore(({ results, run }) => {
  const evaluations = results.analyzeStepResult?.evaluations || [];
  const missingContext = results.analyzeStepResult?.missingContext || []; // gaps flagged by the analysis step

  // Calculate weighted score based on relevance levels
  const relevanceWeights = { high: 1.0, medium: 0.7, low: 0.3, none: 0.0 };
  const totalWeight = evaluations.reduce((sum, evaluation) => {
    return sum + relevanceWeights[evaluation.relevanceLevel];
  }, 0);
  const maxPossibleWeight = evaluations.length * relevanceWeights.high;
  const relevanceScore = maxPossibleWeight > 0 ? totalWeight / maxPossibleWeight : 0;

  // Penalty for unused highly relevant context
  const highRelevanceUnused = evaluations.filter(
    evaluation => evaluation.relevanceLevel === 'high' && !evaluation.wasUsed
  ).length;
  const usagePenalty = highRelevanceUnused * unusedPenaltyRate;

  // Penalty for missing important context
  const missingContextPenalty = Math.min(
    missingContext.length * missingPenaltyRate,
    maxMissingPenalty
  );

  return Math.max(0, relevanceScore - usagePenalty - missingContextPenalty);
})
This scorer captures three critical aspects:
- Relevance quality - How relevant is the provided context?
- Usage efficiency - Did the AI use the high-quality context?
- Completeness - What important context is missing?
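Plugging some numbers into the formula above shows how the pieces interact. This worked example uses the relevance weights from the snippet and the default penalty rates quoted later in this post (0.1 per unused high-relevance item, 0.15 per missing item, capped at 0.5); the context items themselves are made up.

type Relevance = 'high' | 'medium' | 'low' | 'none';
const relevanceWeights: Record<Relevance, number> = { high: 1.0, medium: 0.7, low: 0.3, none: 0.0 };

const evaluations: Array<{ relevanceLevel: Relevance; wasUsed: boolean }> = [
  { relevanceLevel: 'high', wasUsed: true },
  { relevanceLevel: 'medium', wasUsed: true },
  { relevanceLevel: 'low', wasUsed: false },
  { relevanceLevel: 'high', wasUsed: false }, // relevant but ignored by the agent
];
const missingContext = ['refund policy for annual plans']; // hypothetical gap

const totalWeight = evaluations.reduce((sum, e) => sum + relevanceWeights[e.relevanceLevel], 0); // 3.0
const relevanceScore = totalWeight / (evaluations.length * relevanceWeights.high);               // 0.75
const usagePenalty =
  evaluations.filter(e => e.relevanceLevel === 'high' && !e.wasUsed).length * 0.1;               // 0.1
const missingPenalty = Math.min(missingContext.length * 0.15, 0.5);                              // 0.15
const score = Math.max(0, relevanceScore - usagePenalty - missingPenalty);                       // ~0.5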
Prompt Alignment: Multi-Modal Evaluation
The prompt alignment scorer showcases our system's flexibility by handling different evaluation modes:
export interface PromptAlignmentOptions {
  evaluationMode?: 'user' | 'system' | 'both';
}

// Adaptive scoring based on evaluation mode
if (evaluationMode === 'user') {
  // Focus on user intent and requirements
  weightedScore =
    analysis.intentAlignment.score * 0.4 +
    analysis.requirementsFulfillment.overallScore * 0.3 +
    analysis.completeness.score * 0.2 +
    analysis.responseAppropriateness.score * 0.1;
} else if (evaluationMode === 'system') {
  // Focus on system compliance
  weightedScore =
    analysis.intentAlignment.score * 0.35 +
    analysis.requirementsFulfillment.overallScore * 0.35 +
    analysis.completeness.score * 0.15 +
    analysis.responseAppropriateness.score * 0.15;
} else {
  // Both: weighted combination (70% user, 30% system)
  const userScore = /* calculate user score */;
  const systemScore = /* calculate system score */;
  weightedScore = userScore * 0.7 + systemScore * 0.3;
}
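The elided userScore and systemScore in the 'both' branch are just the two weightings above applied separately. A minimal sketch, with hypothetical helper names and an analysis shape inferred from the fields referenced above:

// Hypothetical shape, inferred from the fields used in the weightings above
type AlignmentAnalysis = {
  intentAlignment: { score: number };
  requirementsFulfillment: { overallScore: number };
  completeness: { score: number };
  responseAppropriateness: { score: number };
};

// Hypothetical helpers: each applies one of the weightings shown above
const scoreUserMode = (a: AlignmentAnalysis) =>
  a.intentAlignment.score * 0.4 +
  a.requirementsFulfillment.overallScore * 0.3 +
  a.completeness.score * 0.2 +
  a.responseAppropriateness.score * 0.1;

const scoreSystemMode = (a: AlignmentAnalysis) =>
  a.intentAlignment.score * 0.35 +
  a.requirementsFulfillment.overallScore * 0.35 +
  a.completeness.score * 0.15 +
  a.responseAppropriateness.score * 0.15;

// 'both' mode:
// weightedScore = scoreUserMode(analysis) * 0.7 + scoreSystemMode(analysis) * 0.3;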
Key Design Decisions
Functions vs. Prompt Objects
Each step in our pipeline can use either functions (deterministic JavaScript) or prompt objects (LLM-based evaluation):
Functions are ideal for:
- Algorithmic evaluations with clear criteria
- Performance-critical scenarios
- Integration with existing libraries
- Consistent, reproducible results
Prompt Objects excel at:
- Subjective evaluations requiring human-like judgment
- Complex contextual understanding
- Nuanced quality assessment
- Dynamic evaluation criteria
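Concretely, the two styles of generateScore might look like the fragments below. This is a sketch: the prompt-object shape is assumed to mirror the analyze step shown earlier, and the violations field in the function variant is a hypothetical analysis output.

// Function: deterministic, reproducible scoring over the analysis output
// ("violations" is a hypothetical analysis field)
.generateScore(({ results }) => {
  const violations = results.analyzeStepResult?.violations ?? [];
  return Math.max(0, 1 - violations.length * 0.2);
})

// Prompt object: hand the judgment to the judge model
// (shape assumed to mirror the analyze step shown earlier)
.generateScore({
  outputSchema: z.object({ score: z.number().min(0).max(1) }),
  createPrompt: ({ run }) =>
    `On a scale from 0 to 1, rate how well the assistant's answer satisfies the user's request.`,
})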
Conservative Bias
When LLM and algorithmic scores diverge significantly, we always choose the more conservative (lower) score. This prevents over-optimistic evaluations while maintaining sensitivity to real quality issues.
Configurable Penalties
All our scorers support configurable penalty systems:
penalties?: {
  unusedHighRelevanceContext?: number; // 0.1 default
  missingContextPerItem?: number;      // 0.15 default
  maxMissingContextPenalty?: number;   // 0.5 default
}
This allows teams to tune the scoring system for their specific quality standards and use cases.
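For example, a team that cares more about completeness could raise the missing-context penalty when constructing the scorer. A sketch only: the factory name and the placement of penalties inside options follow the pattern of the noise sensitivity example above and are assumptions for the context relevance scorer.

const contextRelevanceScorer = createContextRelevanceScorerLLM({ // assumed factory name
  model,
  options: {
    penalties: {
      unusedHighRelevanceContext: 0.1, // default
      missingContextPerItem: 0.25,     // stricter than the 0.15 default
      maxMissingContextPenalty: 0.5,   // default cap
    },
  },
});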
Production Insights
After running this system in production, we've learned several key lessons:
1. Validation is Critical
Always validate that required inputs exist before running expensive LLM evaluations:
if (evaluationMode === 'user' && !userPrompt) {
  throw new Error('User prompt is required for user prompt alignment scoring');
}
2. Graceful Degradation
Handle missing analysis gracefully rather than throwing errors:
if (!analysis) {
  return 0; // Default score when analysis fails
}
3. Meaningful Error Messages
Provide specific error messages that help developers debug their evaluation setup:
if (!options.baselineResponse || !options.noisyQuery) {
  throw new Error('Both baselineResponse and noisyQuery are required for Noise Sensitivity scoring');
}
Performance Characteristics
Our evaluation system balances speed and accuracy:
- Noise Sensitivity: ~2-3 seconds (complex multi-dimensional analysis)
- Context Relevance: ~1-2 seconds (context evaluation and usage tracking)
- Prompt Alignment: ~1.5-2.5 seconds (multi-mode evaluation)
- Tool Call Accuracy: ~0.5-1 second (simpler binary evaluation)
The Results
This evaluation framework now powers quality assessment across our entire AI platform:
- 5 built-in scorers covering accuracy, relevance, safety, and performance
- Custom scorer support for domain-specific evaluation needs
- Production-tested across thousands of AI interactions
- Developer-friendly with clear documentation and error messages
The hybrid approach—combining algorithmic precision with AI judgment—has proven invaluable. It gives us confidence in our AI systems while providing the nuanced evaluation that modern applications require.
What's Next
We're working on several enhancements:
- Streaming evaluation for real-time quality monitoring
- Comparative scoring for A/B testing different AI approaches
- Custom model support for specialized evaluation domains
- Batch evaluation APIs for processing large datasets
Building robust AI evaluation isn't just about metrics—it's about creating systems that help you build better AI applications. The four-step pipeline approach has given us that foundation, and I hope it can help you build more reliable AI systems too.