The Challenge
When you're building production AI applications, one question keeps you up at night: "How do I know if my AI is actually working well?" Traditional metrics like accuracy don't capture the nuanced quality aspects of LLM outputs. You need to evaluate things like prompt alignment, context relevance, and noise sensitivity—qualities that require both algorithmic precision and human-like judgment.
At Mastra, we faced this exact challenge while building our AI evaluation framework. We needed a system that could:
- Handle diverse evaluation scenarios (accuracy, relevance, safety, performance)
- Balance speed and accuracy across different evaluation types
- Scale from prototype to production workloads
- Be extensible for custom evaluation needs
Here's how we solved it: a four-step evaluation pipeline that elegantly combines deterministic functions with LLM-based judgment.
The Four-Step Architecture
Every scorer in our system follows the same pipeline:
// Our unified scorer architecture
createScorer({
  name,
  description,
  judge: { model, instructions },
})
  .preprocess({ /* optional: prepare input/output data */ })
  .analyze({ /* LLM-based analysis */ })
  .generateScore({ /* algorithmic scoring */ })
  .generateReason({ /* human-readable explanation */ })
1. Preprocess (Optional)
Transform input/output data into the format needed for evaluation. This might involve extracting text from complex objects, normalizing formats, or preparing context.
2. Analyze (Required for LLM scorers)
Use an LLM to perform deep analysis of the content. This leverages the model's understanding of language, context, and nuanced relationships.
3. Generate Score (Required)
Convert analysis into numerical scores using either deterministic algorithms or AI judgment. This is where we apply our scoring formulas.
4. Generate Reason (Optional)
Create human-readable explanations that help developers understand why a particular score was assigned.
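Steps 1 and 4 don't appear in the excerpts below, so here is a minimal sketch of what they can look like. The callback signatures ({ run } and { results, score }) and the preprocess return shape are assumptions inferred from the analyze and generateScore examples later in this post.

.preprocess(({ run }) => ({
  // Pull plain text out of structured input/output before the LLM analysis
  userText: getUserMessageFromRunInput(run.input) ?? '',
  assistantText: getAssistantMessageFromRunOutput(run.output) ?? '',
}))

// ...analyze and generateScore go here...

.generateReason(({ results, score }) => {
  const issues = results.analyzeStepResult?.majorIssues ?? [];
  return issues.length > 0
    ? `Scored ${score.toFixed(2)}; penalized for: ${issues.join('; ')}`
    : `Scored ${score.toFixed(2)}; no major issues detected.`;
})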
Real Implementation: Noise Sensitivity Scorer
Let me show you how this works with our noise sensitivity scorer—one of the most sophisticated in our suite. This scorer evaluates how robust an AI agent is when exposed to irrelevant or misleading information.
export function createNoiseSensitivityScorerLLM({ model, options }) {
  return createScorer({
    name: 'Noise Sensitivity (LLM)',
    description:
      'Evaluates how robust an agent is when exposed to irrelevant, distracting, or misleading information',
    judge: { model, instructions: NOISE_SENSITIVITY_INSTRUCTIONS },
  })
    .analyze({
      outputSchema: z.object({
        dimensions: z.array(z.object({
          dimension: z.string(),
          impactLevel: z.enum(['none', 'minimal', 'moderate', 'significant', 'severe']),
          specificChanges: z.string(),
          noiseInfluence: z.string(),
        })),
        majorIssues: z.array(z.string()).optional(),
        robustnessScore: z.number().min(0).max(1),
      }),
      createPrompt: ({ run }) => createAnalyzePrompt({
        userQuery: getUserMessageFromRunInput(run.input) ?? '',
        baselineResponse: options.baselineResponse,
        noisyQuery: options.noisyQuery,
        noisyResponse: getAssistantMessageFromRunOutput(run.output) ?? '',
        noiseType: options.noiseType,
      }),
    })
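For orientation, instantiating the scorer looks roughly like this. A minimal sketch: the openai() provider from the AI SDK is one way to supply the model, and the noiseType label and sample strings are made up for illustration.

import { openai } from '@ai-sdk/openai';

const noiseScorer = createNoiseSensitivityScorerLLM({
  model: openai('gpt-4o'),
  options: {
    baselineResponse: 'Paris is the capital of France.',
    noisyQuery: 'What is the capital of France? Also, my neighbor swears it moved to Lyon.',
    noiseType: 'misleading-claim', // hypothetical label
  },
});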
The Scoring Algorithm
Here's where it gets interesting. Our noise sensitivity scorer uses a hybrid approach:
.generateScore(({ results }) => {
  const analysisResult = results.analyzeStepResult;
  const { dimensions, majorIssues = [] } = analysisResult;
  let finalScore = analysisResult.robustnessScore; // LLM's direct assessment

  // Calculate algorithmic score based on impact levels
  // (impactWeights, the discrepancy threshold, and the penalty rates are defined elsewhere in the scorer)
  const averageImpact = dimensions.reduce((sum, dim) => {
    return sum + impactWeights[dim.impactLevel]; // none=1.0, severe=0.1
  }, 0) / dimensions.length;

  // Conservative approach: use the lower score when they diverge significantly
  if (Math.abs(finalScore - averageImpact) > discrepancyThreshold) {
    finalScore = Math.min(finalScore, averageImpact);
  }

  // Apply penalties for major issues
  const issuesPenalty = Math.min(
    majorIssues.length * majorIssuePenaltyRate,
    maxMajorIssuePenalty
  );

  return Math.max(0, finalScore - issuesPenalty);
})
The formula: final_score = max(0, score - issues_penalty), where score is min(llm_score, calculated_score) when the two diverge beyond the discrepancy threshold, and llm_score otherwise.
This hybrid approach gives us the best of both worlds:
- LLM judgment for nuanced quality assessment
- Algorithmic validation to prevent hallucinated scores
- Conservative bias when the two approaches disagree
- Issue-based penalties for objective quality problems
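To make the formula concrete, here is a small self-contained sketch with assumed constants: only the none = 1.0 and severe = 0.1 weights come from the snippet above; the other weights, the threshold, and the penalty rates are placeholders.

const impactWeights = {
  none: 1.0,
  minimal: 0.75,     // assumed
  moderate: 0.5,     // assumed
  significant: 0.25, // assumed
  severe: 0.1,
};
const discrepancyThreshold = 0.2;  // assumed
const majorIssuePenaltyRate = 0.1; // assumed
const maxMajorIssuePenalty = 0.3;  // assumed

function hybridScore(
  llmScore: number,
  impacts: Array<keyof typeof impactWeights>,
  majorIssueCount: number,
): number {
  const averageImpact =
    impacts.reduce((sum, level) => sum + impactWeights[level], 0) / impacts.length;

  // Conservative bias: trust the lower signal when the two diverge significantly
  const base =
    Math.abs(llmScore - averageImpact) > discrepancyThreshold
      ? Math.min(llmScore, averageImpact)
      : llmScore;

  const penalty = Math.min(majorIssueCount * majorIssuePenaltyRate, maxMajorIssuePenalty);
  return Math.max(0, base - penalty);
}

// LLM says 0.8, three dimensions average (1.0 + 0.75 + 0.5) / 3 = 0.75, one major issue:
// the scores agree (|0.8 - 0.75| <= 0.2), so only the 0.1 penalty applies -> ~0.7
hybridScore(0.8, ['none', 'minimal', 'moderate'], 1);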
Context Relevance: A Different Challenge
Our context relevance scorer tackles a different but equally important problem—measuring how well provided context supports the AI's response:
.generateScore(({ results, run }) => {
  const evaluations = results.analyzeStepResult?.evaluations || [];
  const missingContext = results.analyzeStepResult?.missingContext || []; // gaps flagged by the analysis step

  // Calculate weighted score based on relevance levels
  const relevanceWeights = { high: 1.0, medium: 0.7, low: 0.3, none: 0.0 };
  const totalWeight = evaluations.reduce((sum, evaluation) => {
    return sum + relevanceWeights[evaluation.relevanceLevel];
  }, 0);
  const maxPossibleWeight = evaluations.length * relevanceWeights.high;
  const relevanceScore = maxPossibleWeight > 0 ? totalWeight / maxPossibleWeight : 0;

  // Penalty for unused highly relevant context
  const highRelevanceUnused = evaluations.filter(
    evaluation => evaluation.relevanceLevel === 'high' && !evaluation.wasUsed
  ).length;
  const usagePenalty = highRelevanceUnused * unusedPenaltyRate;

  // Penalty for missing important context
  const missingContextPenalty = Math.min(
    missingContext.length * missingPenaltyRate,
    maxMissingPenalty
  );

  return Math.max(0, relevanceScore - usagePenalty - missingContextPenalty);
})
This scorer captures three critical aspects:
- Relevance quality - How relevant is the provided context?
- Usage efficiency - Did the AI use the high-quality context?
- Completeness - What important context is missing?
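Plugging some numbers into the formula above shows how the pieces interact. This worked example uses the relevance weights from the snippet and the default penalty rates quoted later in this post (0.1 per unused high-relevance item, 0.15 per missing item, capped at 0.5); the context items themselves are made up.

type Relevance = 'high' | 'medium' | 'low' | 'none';
const relevanceWeights: Record<Relevance, number> = { high: 1.0, medium: 0.7, low: 0.3, none: 0.0 };

const evaluations: Array<{ relevanceLevel: Relevance; wasUsed: boolean }> = [
  { relevanceLevel: 'high', wasUsed: true },
  { relevanceLevel: 'medium', wasUsed: true },
  { relevanceLevel: 'low', wasUsed: false },
  { relevanceLevel: 'high', wasUsed: false }, // relevant but ignored by the agent
];
const missingContext = ['refund policy for annual plans']; // hypothetical gap

const totalWeight = evaluations.reduce((sum, e) => sum + relevanceWeights[e.relevanceLevel], 0); // 3.0
const relevanceScore = totalWeight / (evaluations.length * relevanceWeights.high);               // 0.75
const usagePenalty =
  evaluations.filter(e => e.relevanceLevel === 'high' && !e.wasUsed).length * 0.1;               // 0.1
const missingPenalty = Math.min(missingContext.length * 0.15, 0.5);                              // 0.15
const score = Math.max(0, relevanceScore - usagePenalty - missingPenalty);                       // ~0.5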
Prompt Alignment: Multi-Modal Evaluation
The prompt alignment scorer showcases our system's flexibility by handling different evaluation modes:
export interface PromptAlignmentOptions {
  evaluationMode?: 'user' | 'system' | 'both';
}

// Adaptive scoring based on evaluation mode
if (evaluationMode === 'user') {
  // Focus on user intent and requirements
  weightedScore =
    analysis.intentAlignment.score * 0.4 +
    analysis.requirementsFulfillment.overallScore * 0.3 +
    analysis.completeness.score * 0.2 +
    analysis.responseAppropriateness.score * 0.1;
} else if (evaluationMode === 'system') {
  // Focus on system compliance
  weightedScore =
    analysis.intentAlignment.score * 0.35 +
    analysis.requirementsFulfillment.overallScore * 0.35 +
    analysis.completeness.score * 0.15 +
    analysis.responseAppropriateness.score * 0.15;
} else {
  // Both: weighted combination (70% user, 30% system)
  const userScore = /* calculate user score */;
  const systemScore = /* calculate system score */;
  weightedScore = userScore * 0.7 + systemScore * 0.3;
}
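The elided userScore and systemScore in the 'both' branch are just the two weightings above applied separately. A minimal sketch, with hypothetical helper names and an analysis shape inferred from the fields referenced above:

// Hypothetical shape, inferred from the fields used in the weightings above
type AlignmentAnalysis = {
  intentAlignment: { score: number };
  requirementsFulfillment: { overallScore: number };
  completeness: { score: number };
  responseAppropriateness: { score: number };
};

// Hypothetical helpers: each applies one of the weightings shown above
const scoreUserMode = (a: AlignmentAnalysis) =>
  a.intentAlignment.score * 0.4 +
  a.requirementsFulfillment.overallScore * 0.3 +
  a.completeness.score * 0.2 +
  a.responseAppropriateness.score * 0.1;

const scoreSystemMode = (a: AlignmentAnalysis) =>
  a.intentAlignment.score * 0.35 +
  a.requirementsFulfillment.overallScore * 0.35 +
  a.completeness.score * 0.15 +
  a.responseAppropriateness.score * 0.15;

// 'both' mode:
// weightedScore = scoreUserMode(analysis) * 0.7 + scoreSystemMode(analysis) * 0.3;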
Key Design Decisions
Functions vs. Prompt Objects
Each step in our pipeline can use either functions (deterministic JavaScript) or prompt objects (LLM-based evaluation):
Functions are ideal for:
- Algorithmic evaluations with clear criteria
- Performance-critical scenarios
- Integration with existing libraries
- Consistent, reproducible results
Prompt Objects excel at:
- Subjective evaluations requiring human-like judgment
- Complex contextual understanding
- Nuanced quality assessment
- Dynamic evaluation criteria
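Concretely, the two styles of generateScore might look like the fragments below. This is a sketch: the prompt-object shape is assumed to mirror the analyze step shown earlier, and the violations field in the function variant is a hypothetical analysis output.

// Function: deterministic, reproducible scoring over the analysis output
// ("violations" is a hypothetical analysis field)
.generateScore(({ results }) => {
  const violations = results.analyzeStepResult?.violations ?? [];
  return Math.max(0, 1 - violations.length * 0.2);
})

// Prompt object: hand the judgment to the judge model
// (shape assumed to mirror the analyze step shown earlier)
.generateScore({
  outputSchema: z.object({ score: z.number().min(0).max(1) }),
  createPrompt: ({ run }) =>
    `On a scale from 0 to 1, rate how well the assistant's answer satisfies the user's request.`,
})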
Conservative Bias
When LLM and algorithmic scores diverge significantly, we always choose the more conservative (lower) score. This prevents over-optimistic evaluations while maintaining sensitivity to real quality issues.
Configurable Penalties
All our scorers support configurable penalty systems:
penalties?: {
  unusedHighRelevanceContext?: number; // 0.1 default
  missingContextPerItem?: number;      // 0.15 default
  maxMissingContextPenalty?: number;   // 0.5 default
}
This allows teams to tune the scoring system for their specific quality standards and use cases.
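For example, a team that cares more about completeness could raise the missing-context penalty when constructing the scorer. A sketch only: the factory name and the placement of penalties inside options follow the pattern of the noise sensitivity example above and are assumptions for the context relevance scorer.

const contextRelevanceScorer = createContextRelevanceScorerLLM({ // assumed factory name
  model,
  options: {
    penalties: {
      unusedHighRelevanceContext: 0.1, // default
      missingContextPerItem: 0.25,     // stricter than the 0.15 default
      maxMissingContextPenalty: 0.5,   // default cap
    },
  },
});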
Production Insights
After running this system in production, we've learned several key lessons:
1. Validation is Critical
Always validate that required inputs exist before running expensive LLM evaluations:
if (evaluationMode === 'user' && !userPrompt) {
  throw new Error('User prompt is required for user prompt alignment scoring');
}
2. Graceful Degradation
Handle missing analysis gracefully rather than throwing errors:
if (!analysis) {
  return 0; // Default score when analysis fails
}
3. Meaningful Error Messages
Provide specific error messages that help developers debug their evaluation setup:
if (!options.baselineResponse || !options.noisyQuery) {
  throw new Error('Both baselineResponse and noisyQuery are required for Noise Sensitivity scoring');
}
Performance Characteristics
Our evaluation system balances speed and accuracy:
- Noise Sensitivity: ~2-3 seconds (complex multi-dimensional analysis)
- Context Relevance: ~1-2 seconds (context evaluation and usage tracking)
- Prompt Alignment: ~1.5-2.5 seconds (multi-mode evaluation)
- Tool Call Accuracy: ~0.5-1 second (simpler binary evaluation)
The Results
This evaluation framework now powers quality assessment across our entire AI platform:
- 5 built-in scorers covering accuracy, relevance, safety, and performance
- Custom scorer support for domain-specific evaluation needs
- Production-tested across thousands of AI interactions
- Developer-friendly with clear documentation and error messages
The hybrid approach—combining algorithmic precision with AI judgment—has proven invaluable. It gives us confidence in our AI systems while providing the nuanced evaluation that modern applications require.
What's Next
We're working on several enhancements:
- Streaming evaluation for real-time quality monitoring
- Comparative scoring for A/B testing different AI approaches
- Custom model support for specialized evaluation domains
- Batch evaluation APIs for processing large datasets
Building robust AI evaluation isn't just about metrics—it's about creating systems that help you build better AI applications. The four-step pipeline approach has given us that foundation, and I hope it can help you build more reliable AI systems too.