# Data Sheriff Agent - Integration Guide

## Overview

The Data Sheriff Agent is a middleware layer that validates and enriches sports queries before they're processed by the Grok LLM. It ensures data accuracy, resolves entities, and provides context about data coverage.

## Quick Start

### 1. Import the Agent

```typescript
import {
  dataSheriffAgent,
  preprocessChatMessage,
  buildEnhancedSystemPrompt,
  postprocessResponse,
} from '@/agents/dataSheriff';
```

### 2. Process Queries Before LLM

```typescript
// In your chat handler (domain is 'sports' for these queries)
const enhanced = await preprocessChatMessage(userMessage, domain);

if (!enhanced.shouldProceed && enhanced.suggestedResponse) {
  // Return clarification request to user
  return { message: enhanced.suggestedResponse };
}

// Use enhanced query and context
const aiResponse = await callGrok4Fast(
  enhanced.enhancedMessage,
  domain,
  userId,
  conversationHistory,
  enhanced.dataSheriffContext // Pass context to LLM
);
```

### 3. Add Context to System Prompt

```typescript
const systemPrompt = buildEnhancedSystemPrompt(
  baseSystemPrompt,
  enhanced.dataSheriffContext
);
```

### 4. Post-Process Response

```typescript
const finalResponse = postprocessResponse(
  aiResponse.message.content,
  enhanced.validationSummary
);
```

## Integration with `/api/chat/route.ts`

Add the following to your chat route:

```typescript
// At the top of the file
import {
  preprocessChatMessage,
  postprocessResponse
} from '@/agents/dataSheriff';

// Before calling callGrok4Fast (around line 391)
let dataSheriffContext = '';
let validationSummary = null;
let messageForLLM = sanitizedMessage; // Raw message is used as-is for non-sports domains

if (domain === 'sports') {
  const enhanced = await preprocessChatMessage(sanitizedMessage, domain);

  // Log for debugging
  console.log(`[DATA_SHERIFF] Confidence: ${enhanced.validationSummary.confidence}, Coverage: ${enhanced.validationSummary.coverageScore}`);

  // If clarification needed, return early
  if (!enhanced.shouldProceed && enhanced.suggestedResponse) {
    await DatabaseService.addMessageToConversation(
      conversation.id,
      'ASSISTANT',
      enhanced.suggestedResponse
    );

    return createSecureResponse({
      message: { content: enhanced.suggestedResponse, role: 'assistant' },
      conversationId: conversation.id,
      needsClarification: true,
    }, 200, user);
  }

  dataSheriffContext = enhanced.dataSheriffContext;
  validationSummary = enhanced.validationSummary;
  messageForLLM = enhanced.enhancedMessage; // Same enhanced query as in Quick Start step 2
}

// Pass context to Grok (modify callGrok4Fast to accept context)
const aiResponse = await callGrok4Fast(
  messageForLLM,
  domain,
  user.id,
  conversationHistory,
  dataSheriffContext // New parameter
);

// After getting response, add data quality notes
if (validationSummary && aiResponse?.message?.content) {
  aiResponse.message.content = postprocessResponse(
    aiResponse.message.content,
    validationSummary
  );
}
```
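The snippet above asks you to modify `callGrok4Fast` to accept the context, but this guide does not show that function's internals. One possible shape for the change is sketched below. `ChatMessage` and `buildGrokMessages` are hypothetical names introduced for illustration, and the "Data Sheriff Context" heading is an arbitrary choice; in the real route you would more likely call `buildEnhancedSystemPrompt` where this sketch concatenates strings.

```typescript
// Hypothetical message shape; adapt it to whatever callGrok4Fast already sends.
interface ChatMessage {
  role: 'system' | 'user' | 'assistant';
  content: string;
}

// Builds the message list for the LLM call, appending the Data Sheriff
// context to the system prompt only when a non-empty context was produced.
function buildGrokMessages(
  baseSystemPrompt: string,
  history: ChatMessage[],
  userMessage: string,
  dataSheriffContext = ''
): ChatMessage[] {
  const context = dataSheriffContext.trim();
  const systemContent = context
    ? `${baseSystemPrompt}\n\n## Data Sheriff Context\n${context}`
    : baseSystemPrompt;

  return [
    { role: 'system', content: systemContent },
    ...history,
    { role: 'user', content: userMessage },
  ];
}
```

Defaulting the new parameter to an empty string keeps existing non-sports call sites working without changes.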

## Full API Reference

### `dataSheriffAgent.processQuery(query: string)`

Main entry point. Returns a promise that resolves to:

```typescript
interface DataSheriffResponse {
  success: boolean;
  deconstructedQuery: {
    originalQuery: string;
    normalizedQuery: string;
    entities: ResolvedEntity[];
    intent: IntentClassification;
    timeframe?: DateEntity;
    league?: string;
  };
  validation: {
    isValid: boolean;
    confidence: number;
    corrections: QueryCorrection[];
    warnings: string[];
    enrichedContext: EnrichedContext;
  };
  enhancedQuery: string;
  contextForLLM: string;
  processingTimeMs: number;
  metadata: {
    entitiesResolved: number;
    correctionsApplied: number;
    coverageScore: number;
    confidenceScore: number;
  };
}
```
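For logging or monitoring, the fields above can be condensed into a single line. The sketch below is one way to do that. `SheriffLogFields` and `formatSheriffLog` are hypothetical names, the type copies only the fields it needs from `DataSheriffResponse`, and it assumes scores are fractions between 0 and 1 (consistent with `minConfidenceThreshold: 0.6` in the configuration).

```typescript
// Subset of DataSheriffResponse used for logging; field names match the interface above.
interface SheriffLogFields {
  success: boolean;
  processingTimeMs: number;
  validation: { warnings: string[] };
  metadata: {
    entitiesResolved: number;
    correctionsApplied: number;
    coverageScore: number;   // assumed to be in the range 0..1
    confidenceScore: number; // assumed to be in the range 0..1
  };
}

// Formats one compact, grep-friendly log line per processed query.
function formatSheriffLog(r: SheriffLogFields): string {
  const m = r.metadata;
  return [
    `[DATA_SHERIFF] success=${r.success}`,
    `entities=${m.entitiesResolved}`,
    `corrections=${m.correctionsApplied}`,
    `coverage=${m.coverageScore.toFixed(2)}`,
    `confidence=${m.confidenceScore.toFixed(2)}`,
    `warnings=${r.validation.warnings.length}`,
    `time=${r.processingTimeMs}ms`,
  ].join(' ');
}
```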

### Entity Types

- **Player**: `{ playerId, name, team, position, league, injuryStatus }`
- **Team**: `{ teamId, name, abbreviation, league }`
- **Date**: `{ startDate, endDate, isRange, season, week }`
- **Metric**: `{ metricKey, category }` (category is `stat`, `betting`, or `derived`)
- **Event**: `{ eventType, league, date }` (e.g., Super Bowl, Finals)
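The entity shapes listed above map naturally onto a discriminated union. The sketch below shows how consuming code might branch on entity type. The `kind` discriminant, the field types, and which fields are optional are all assumptions made for illustration; the actual `ResolvedEntity` type exported by the agent may differ.

```typescript
// Illustrative union of the entity shapes; `kind` is a hypothetical discriminant.
type SheriffEntity =
  | { kind: 'player'; playerId: string; name: string; team: string; position: string; league: string; injuryStatus?: string }
  | { kind: 'team'; teamId: string; name: string; abbreviation: string; league: string }
  | { kind: 'date'; startDate: string; endDate: string; isRange: boolean; season?: string; week?: number }
  | { kind: 'metric'; metricKey: string; category: 'stat' | 'betting' | 'derived' }
  | { kind: 'event'; eventType: string; league: string; date: string };

// Returns a short human-readable label; the switch is exhaustive over `kind`,
// so adding a new entity type without handling it becomes a compile error.
function describeEntity(e: SheriffEntity): string {
  switch (e.kind) {
    case 'player':
      return `${e.name} (${e.position}, ${e.team})`;
    case 'team':
      return `${e.name} [${e.abbreviation}]`;
    case 'date':
      return e.isRange ? `${e.startDate} to ${e.endDate}` : e.startDate;
    case 'metric':
      return `${e.metricKey} (${e.category})`;
    case 'event':
      return `${e.eventType} (${e.league}, ${e.date})`;
  }
}
```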

### Intent Classification

- `historical_performance` - "How did LeBron do last game?"
- `strategy_backtesting` - "Backtest home favorites ATS"
- `player_comparison` - "LeBron vs Curry stats"
- `team_comparison` - "Lakers vs Celtics history"
- `future_prediction` - "Who wins tonight?"
- `odds_lookup` - "What's the spread on the Lakers game?"
- `injury_check` - "Is Steph playing tonight?"
- `prop_lookup` - "LeBron points over/under"
- `trend_analysis` - "Lakers ATS record last 10 games"
- `general_knowledge` - General questions
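When intent values cross a boundary as plain strings (logs, JSON payloads, query parameters), a runtime guard keeps them type-safe. The sketch below derives a union type from the intent names listed above. `SHERIFF_INTENTS`, `SheriffIntent`, and `isSheriffIntent` are hypothetical names, not exports of the agent.

```typescript
// Intent names as listed in this guide; `as const` preserves the literal types.
const SHERIFF_INTENTS = [
  'historical_performance',
  'strategy_backtesting',
  'player_comparison',
  'team_comparison',
  'future_prediction',
  'odds_lookup',
  'injury_check',
  'prop_lookup',
  'trend_analysis',
  'general_knowledge',
] as const;

// Union of all intent literals, derived from the array above.
type SheriffIntent = (typeof SHERIFF_INTENTS)[number];

// Type guard: narrows an arbitrary string to SheriffIntent when it matches.
function isSheriffIntent(value: string): value is SheriffIntent {
  return (SHERIFF_INTENTS as readonly string[]).includes(value);
}
```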

## Configuration

```typescript
import { createDataSheriffAgent } from '@/agents/dataSheriff';

const customAgent = createDataSheriffAgent({
  minConfidenceThreshold: 0.6,    // Min confidence to accept entity match
  enableFuzzyMatching: true,       // Fuzzy name matching
  maxFuzzyDistance: 3,             // Max Levenshtein distance
  enableInference: true,           // Infer missing data from averages
  inferenceConfidenceThreshold: 0.7,
  maxResponseTimeMs: 2000,         // Timeout
  enableCaching: true,             // Cache results
  cacheTTLSeconds: 300,            // Cache TTL
});
```
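To make `maxFuzzyDistance` concrete, here is a minimal sketch of Levenshtein-based matching. It is a textbook implementation, not the agent's actual matcher, and `levenshtein` and `isFuzzyMatch` are hypothetical names; the agent may normalize names or score candidates differently.

```typescript
// Classic dynamic-programming Levenshtein distance, keeping only two rows.
function levenshtein(a: string, b: string): number {
  // prev is the previous row of the DP table: prev[j] is the distance
  // between the prefix of `a` processed so far and the first j chars of `b`.
  let prev = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const curr = [i]; // distance from a[0..i) to the empty string
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      curr[j] = Math.min(
        prev[j] + 1,        // deletion
        curr[j - 1] + 1,    // insertion
        prev[j - 1] + cost  // substitution (or match)
      );
    }
    prev = curr;
  }
  return prev[b.length];
}

// Case-insensitive match within the configured edit distance.
function isFuzzyMatch(input: string, candidate: string, maxFuzzyDistance = 3): boolean {
  return levenshtein(input.toLowerCase(), candidate.toLowerCase()) <= maxFuzzyDistance;
}
```

With this sketch, `isFuzzyMatch('Lebrn James', 'LeBron James')` is `true` (one edit apart). A fixed distance of 3 is permissive for short names, though: "Heat" and "Nets" are exactly 3 edits apart and would also match, which is one reason to pair fuzzy matching with a confidence threshold.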

## Examples

### Example 1: Player Query

```typescript
const result = await dataSheriffAgent.processQuery(
  "How did LeBron do against Boston last Tuesday?"
);

// Result:
// - Entities: [LeBron James (player), Boston Celtics (team), 2024-01-23 (date)]
// - Intent: historical_performance (95%)
// - Coverage: 87%
// - Enhanced: "[NBA] How did LeBron James do against Boston Celtics on 2024-01-23?"
```

### Example 2: Ambiguous Query

```typescript
const result = await dataSheriffAgent.processQuery(
  "LA vs NY last game"
);

// Warnings: ["'LA' could refer to Lakers, Clippers (NBA)...", "'NY' could refer to Knicks, Nets (NBA)..."]
// Confidence: 45%
// suggestedResponse (via preprocessChatMessage): "I need clarification - which LA and NY teams?"
```
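If you call `processQuery` directly rather than `preprocessChatMessage`, you need to decide for yourself when to ask the user for clarification. The sketch below shows one possible rule based on the `validation` fields. `ValidationLike` and `buildClarificationPrompt` are hypothetical names, the rule is an assumption rather than the agent's own `shouldProceed` logic, and confidence is assumed to be a fraction between 0 and 1.

```typescript
// Subset of the `validation` object from DataSheriffResponse.
interface ValidationLike {
  isValid: boolean;
  confidence: number; // assumed to be in the range 0..1
  warnings: string[];
}

// Returns a clarification message when the query looks too uncertain to
// answer, or null when it is safe to continue to the LLM call.
function buildClarificationPrompt(v: ValidationLike, minConfidence = 0.6): string | null {
  if (v.isValid && v.confidence >= minConfidence) {
    return null;
  }
  if (v.warnings.length === 0) {
    return 'Could you rephrase your question with more specific names or dates?';
  }
  const bullets = v.warnings.map((w) => `- ${w}`).join('\n');
  return `I need clarification:\n${bullets}`;
}
```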

### Example 3: Strategy Query

```typescript
const result = await dataSheriffAgent.processQuery(
  "Backtest home favorites +5 spread in NBA 2024"
);

// Intent: strategy_backtesting (92%)
// Entities: [2024 season (date), spread +5 (metric)]
// Context includes betting line coverage assessment
```

## Performance

- Typical processing time: 50-200 ms per query
- Caching is enabled by default with a 5-minute TTL (`cacheTTLSeconds: 300`)
- Lazy-loads Prisma clients on first use
- Parallel entity resolution where possible
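To illustrate what the TTL setting controls, here is a minimal in-memory TTL cache. It is a sketch of the general technique only: `TtlCache` is a hypothetical class, and the agent's real cache (keying, eviction, storage) is not documented in this guide. The clock is injectable so expiry can be tested without waiting.

```typescript
// Minimal in-memory cache whose entries expire after a fixed TTL.
class TtlCache<V> {
  private readonly store = new Map<string, { value: V; expiresAt: number }>();
  private readonly ttlMs: number;
  private readonly now: () => number;

  // `now` returns the current time in milliseconds; defaults to Date.now.
  constructor(ttlSeconds: number, now: () => number = Date.now) {
    this.ttlMs = ttlSeconds * 1000;
    this.now = now;
  }

  // Returns the cached value, or undefined if missing or expired.
  get(key: string): V | undefined {
    const entry = this.store.get(key);
    if (!entry) return undefined;
    if (this.now() >= entry.expiresAt) {
      this.store.delete(key); // drop stale entries lazily on read
      return undefined;
    }
    return entry.value;
  }

  // Stores a value that expires ttlSeconds from now.
  set(key: string, value: V): void {
    this.store.set(key, { value, expiresAt: this.now() + this.ttlMs });
  }
}
```

A repeated query within the TTL window is served from memory; after the window, the next lookup misses and the query is processed again.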

## Troubleshooting

### Low Confidence Scores

- Check if player/team names are spelled correctly
- Add league context to disambiguate (e.g., "NBA Lakers")
- Specify dates explicitly

### Missing Data Warnings

- Check database coverage for the requested time period
- Some historical data may not be available
- Props data coverage is typically lower than game data

### Slow Queries

- The first query may be slower while Prisma clients are lazy-loaded
- Complex queries with many entities take longer
- Check database connection health