# The Pool — Peer Review & Consensus Plan
**Generated: 2026-02-22 | Models: DeepSeek Chat v3, Grok 3 Mini, GPT-4o**

---

## 1. PEER REVIEWS (Raw, via OpenRouter)

---

### DeepSeek (Engineering-first, Cost-aware)

| # | Severity | Issue | Fix |
|---|----------|-------|-----|
| 1 | **Critical** | Cache key collision — SHA-256 of the normalized query alone doesn't account for game state or model version. A "game tonight" response cached pre-game could be served mid-game. | Include `game_state + model_version + query_category` in the cache key formula |
| 2 | **High** | Cross-user leakage via intent classification — "best picks" responses cached across users could leak signals from one user's betting strategy | Add `user_context_hash` column (derived from stable user features, NOT PII). Accept ~15% higher miss rate. |
| 3 | **High** | Thundering herd at 4AM ET cutoff — 80% of game states flip simultaneously | Stagger TTLs using game ID as entropy: `ttlJitter = (gameId % 30) * 60` (0-29 min jitter) |
| 4 | **Medium** | Embedding cost blindspot — pg_trgm fuzzy matching has hidden per-query cost | Pre-compute and cache embeddings during stage 1 writes. Batch where possible. |
| 5 | **Medium** | Backtesting validation gap — no automated drift detection between cached and fresh responses | Add drift_score metric; trigger cache invalidation if drift_score falls below 0.85 |
| 6 | **Low** | Warming worker burns always-on compute for marginal gains | Consider event-driven triggers (activity spikes, major sporting events) instead of fixed 15-min interval |

**DeepSeek Bottom Line:** Pause scaling until Critical/High items addressed. Could save $15k/month if cache correctness is bulletproof.

---

### Grok (Product/Real-time, Blunt)

| # | Severity | Issue | Fix |
|---|----------|-------|-----|
| 1 | **High** | Freshness risk during live games — cached responses from even 1-2 min ago could mislead bettors | Integrate webhook/polling from sports API to invalidate cache keys in real-time. Use Redis pub/sub. 1-2 day fix. |
| 2 | **Medium** | Two-stage matching over-engineered — fuzzy matching may rarely trigger if most repeats are exact | Analyze `cache_events` after 1 week. If fuzzy hits < 10%, disable it. Simplify operations. |
| 3 | **Medium** | Warming worker not targeting highest-frequency queries | Enhance worker to focus on top 5-10 query patterns from `cache_metrics_daily`. Only warm if game state is stable. |
| 4 | **Critical** | PII leakage edge cases — fuzzy matching + warming could inadvertently cache "injury for my favorite team" type content | Audit filter with 100 queries through backtesting. Add regex sanitizer on `response_payload`. |
| 5 | **High** | Cache miss latency hurts UX — two-stage matching adds delay during peak game nights | Add 200ms timeout cap for cache lookups. If exceeded, bypass to LLM directly. |
| 6 | **Low** | Backtesting not yet in refinement loop — just built, not actively improving thresholds | Schedule weekly backtests. Automate config updates based on results. |

**Grok Bottom Line:** Don't overbuild. Focus on what repeats most. Use existing data before adding new components.

---

### GPT-4o (Balanced, Safety + Implementation)

| # | Severity | Issue | Fix |
|---|----------|-------|-----|
| 1 | **Critical** | Cross-user data safety — safety filter may not cover evolving input formats | Dedicated NLP-based PII detection + regular audits with anonymized query samples |
| 2 | **High** | Cache key rigidity — SHA-256 misses on minor query variations | Secondary embedding-based similarity scoring before storage to group closely related queries |
| 3 | **Medium** | pg_trgm threshold suboptimal — could cause false negatives | Run extensive backtests. Use weighted scoring + dynamic threshold adjustment. |
| 4 | **Medium** | No separate test environment — testing on live system risks reliability | Build full-featured test environment. Use synthetic datasets. |
| 5 | **High** | No A/B validation — can't objectively measure cache effectiveness | A/B framework: subset of queries bypass cache. Compare tokens, latency, accuracy. |
| 6 | **High** | TTL relies on game-timing data that may lag | Dynamic monitoring with sports API webhooks/push updates for near-real-time TTL updates |
| 7 | **Medium** | Rollout too abrupt — insufficient baseline metrics | Phased rollout with checkpoints. Feed backtesting results into real-time config. |

**GPT-4o Bottom Line:** Success depends on scope, matching quality, staleness control, and safety. A/B test before trusting cache savings numbers.

---

## 2. CONSENSUS FINDINGS (Cross-model Agreement)

### All 3 agree on (must-fix):
1. **PII / Cross-user leakage** needs hardening beyond current regex filter (Critical)
2. **Live game freshness** is the biggest user-trust risk — need real-time invalidation (High)
3. **Cache key needs context** — game_state and data versioning must be part of the key (High)
4. **A/B validation** is essential to prove savings are real (High)
5. **Backtesting loop** should be automated, not manual (Medium)

### 2 of 3 agree on:
6. **Fuzzy matching may be premature** — exact match could cover 80%+ of cases
7. **TTL thundering herd** at daily cutoff needs jitter
8. **Warming worker** should target top patterns, not blanket recent messages

### Divergent opinions:
- DeepSeek wants embeddings pre-cached; Grok says don't overbuild
- GPT-4o wants NLP-based PII; DeepSeek prefers user_context_hash
- Grok would disable fuzzy matching if underperforming; GPT-4o would optimize thresholds

---

## 3. CONSENSUS PLAN OF ATTACK

### Phase 0 — Harden Safety (1-2 days)
- [ ] Audit safety filter with 100 diverse queries through backtesting dashboard
- [ ] Add response-level PII regex sanitizer on `response_payload` before caching
- [ ] Ensure "generic analysis core" separation — never cache anything referencing user history
- [ ] Add `data_version` field to `query_cache` (injuries/lines timestamp hash)

### Phase 1 — Cache Key + Freshness Fixes (2-3 days)
- [ ] Extend cache key to include: `game_state` + `data_version`
- [ ] Add TTL jitter: `(gameId % 30) * 60` seconds to prevent thundering herd
- [ ] Add 200ms timeout cap on cache lookups — bypass to LLM if exceeded
- [ ] Add `generated_at` and `data_version` to cached response metadata
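The first two items could look roughly like this (function names and the `gameState` values are illustrative, not the actual normalizer.ts API; the jitter formula is the one from the review):

```typescript
import { createHash } from "crypto";

// Illustrative only: extended cache key folds game state, data version,
// and model version into the hash, so a pre-game entry can never be
// served mid-game or after a data/model update.
export function cacheKey(
  normalizedQuery: string,
  gameState: "pre" | "live" | "final",
  dataVersion: string,
  modelVersion: string,
): string {
  return createHash("sha256")
    .update(`${normalizedQuery}|${gameState}|${dataVersion}|${modelVersion}`)
    .digest("hex");
}

// TTL jitter from the review: (parseInt(gameId, 36) % 30) * 60 seconds,
// spreading expirations over a 0-29 minute window per game.
export function ttlWithJitter(baseTtlSeconds: number, gameId: string): number {
  const jitter = (parseInt(gameId, 36) % 30) * 60;
  return baseTtlSeconds + jitter;
}
```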

### Phase 2 — A/B Validation Framework (2-3 days)
- [ ] Route 10% of queries to bypass cache (control group)
- [ ] Compare: token usage, latency, response quality (via pg_trgm similarity to cached)
- [ ] Store A/B results in `cache_events` with `ab_group` flag
- [ ] Dashboard widget in The Pool showing A/B comparison metrics
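Consistent per-session assignment can be done by hashing the session ID into a stable 0-99 bucket — a sketch, assuming a `sessionId` is available at the proxy (the hashing choice is an assumption; the 10% default is from the plan):

```typescript
import { createHash } from "crypto";

// Sketch of deterministic A/B bucketing: the same session always lands in
// the same group, so a user never flips between cached and uncached paths
// mid-conversation.
export function abGroup(
  sessionId: string,
  controlPercent = 10, // % of sessions that bypass the cache
): "control" | "experiment" {
  const digest = createHash("sha256").update(sessionId).digest();
  const bucket = digest.readUInt16BE(0) % 100; // stable 0-99 bucket
  return bucket < controlPercent ? "control" : "experiment";
}
```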

### Phase 3 — Smart Warming + Clustering (3-5 days)
- [ ] Analyze `cache_metrics_daily` for top 20 query patterns
- [ ] Refocus warming worker on top patterns only (not blanket recent messages)
- [ ] Add pre-game warming trigger 2h before scheduled game times
- [ ] Build "matchup card" cache: reusable analysis blocks across similar queries
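The pre-game warming trigger reduces to a small window check (a sketch; the 2h window is from the checklist above, the function name is hypothetical):

```typescript
// Sketch of the pre-game warming trigger: warm a matchup's cache entries
// once the game is inside the 2-hour pre-game window but hasn't started.
export function shouldWarmForGame(
  gameStartMs: number,
  nowMs: number,
  windowMs = 2 * 60 * 60 * 1000, // 2h pre-game window from the plan
): boolean {
  const untilStart = gameStartMs - nowMs;
  return untilStart > 0 && untilStart <= windowMs;
}
```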

### Phase 4 — Backtesting Automation (2-3 days)
- [ ] Schedule weekly automated backtests via cron
- [ ] Add drift_score metric (compare cached vs fresh responses)
- [ ] Auto-adjust similarity threshold based on backtest results
- [ ] Alert if drift_score < 0.85 for any query cluster
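drift_score could be approximated in-process with trigram-set similarity, roughly mirroring what pg_trgm does in Postgres (a simplified sketch without pg_trgm's word padding; the production pipeline could equally call `similarity()` in SQL):

```typescript
// Sketch of a drift_score: trigram Jaccard similarity between a cached
// response and a freshly generated one. 1.0 = identical trigram sets;
// below the 0.85 alert threshold, the cluster should be invalidated.
function trigrams(text: string): Set<string> {
  const t = text.toLowerCase().replace(/\s+/g, " ").trim();
  const grams = new Set<string>();
  for (let i = 0; i <= t.length - 3; i++) grams.add(t.slice(i, i + 3));
  return grams;
}

export function driftScore(cached: string, fresh: string): number {
  const a = trigrams(cached);
  const b = trigrams(fresh);
  if (a.size === 0 && b.size === 0) return 1; // both trivially identical
  let overlap = 0;
  for (const g of a) if (b.has(g)) overlap++;
  return overlap / (a.size + b.size - overlap); // Jaccard index
}
```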

### Phase 5 — Real-time Invalidation (3-5 days)
- [ ] Webhook/polling integration with sports data API
- [ ] Redis pub/sub channel for live game state changes
- [ ] Auto-invalidate cache entries when lines/injuries update materially
- [ ] Evaluate: should fuzzy matching be kept or disabled based on Phase 2 data
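"Materially" needs a concrete policy; one possible sketch, with placeholder thresholds to be tuned against real `cache_events` data (the interface and cutoffs are assumptions, not agreed values):

```typescript
// Hypothetical invalidation policy: decide whether a lines/injuries update
// is big enough to justify blowing away cached analysis for a game.
export interface LineUpdate {
  spreadDelta: number;          // absolute change in point spread
  totalDelta: number;           // absolute change in over/under total
  injuryStatusChanged: boolean; // any player status flip on either team
}

export function shouldInvalidate(u: LineUpdate): boolean {
  if (u.injuryStatusChanged) return true;          // injuries always invalidate
  if (Math.abs(u.spreadDelta) >= 0.5) return true; // half-point spread move
  if (Math.abs(u.totalDelta) >= 1.0) return true;  // full-point total move
  return false;
}
```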

---

## 4. TOKEN SAVINGS PROJECTION FRAMEWORK

### Variables (plug in your real numbers)

| Variable | Description | How to measure |
|----------|-------------|----------------|
| **Q** | Total queries/day | `SELECT COUNT(*) FROM cache_events WHERE created_at::date = CURRENT_DATE` |
| **r** | Repeat rate (fraction that are repeats/paraphrases) | `1 - (COUNT(DISTINCT normalized_query)::numeric / COUNT(*))` from cache_events |
| **h** | Cache hit rate among repeats | `cache_hits / (cache_hits + cache_misses)` from daily metrics |
| **T** | Avg tokens per full LLM response (miss) | `AVG(response_tokens) FROM query_cache` |
| **Tc** | Avg tokens per cache hit (formatting overhead) | Typically ~5% of T, or measure from proxy logs |
| **C_miss** | Cost per miss | `T × price_per_token` (varies by model) |
| **C_hit** | Cost per hit | `Tc × price_per_token` (near zero) |

### Formula

```
Daily token savings     = Q × r × h × (T - Tc)
Daily cost savings ($)  = Q × r × h × (C_miss - C_hit)
Monthly projection      = Daily × 30
```
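The formula translates directly into a pure function that can live next to the dashboard code (a sketch; names match the variables above):

```typescript
// Direct translation of the projection formula above.
export function dailyTokenSavings(
  Q: number,  // queries/day
  r: number,  // repeat rate
  h: number,  // hit rate among repeats
  T: number,  // avg tokens per miss
  Tc: number, // avg tokens per hit
): number {
  return Q * r * h * (T - Tc);
}

export function monthlyCostSavings(
  tokensPerDay: number,
  pricePerMillionTokens: number, // e.g. 3 for $3/1M tokens
): number {
  return (tokensPerDay * pricePerMillionTokens * 30) / 1_000_000;
}
```

For example, the Week 1 row below: `dailyTokenSavings(50, 0.40, 0.50, 1200, 60)` gives 11,400 tokens/day, and `monthlyCostSavings(11_400, 3)` gives ~$1.03/month.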

### Sample Forecast (using day 1 actuals + growth projections)

| Metric | Day 1 (actual) | Week 1 (est.) | Month 1 (est.) | Month 3 (est.) |
|--------|----------------|---------------|-----------------|-----------------|
| Q (queries/day) | 3 | 50 | 200 | 1,000 |
| r (repeat rate) | 0.67 | 0.40 | 0.35 | 0.35 |
| h (hit rate) | 0.67 | 0.50 | 0.65 | 0.80 |
| T (avg tokens/miss) | 1,200 | 1,200 | 1,200 | 1,200 |
| Tc (tokens/hit) | 60 | 60 | 60 | 60 |
| **Tokens saved/day** | **1,535** | **11,400** | **51,870** | **319,200** |
| **$ saved/day** (@ $3/1M tok) | **$0.005** | **$0.034** | **$0.156** | **$0.958** |
| **$ saved/month** | **$0.14** | **$1.03** | **$4.67** | **$28.73** |

> At scale (10K queries/day, 35% repeats, 80% hit rate):
> **3.19M tokens/day saved = ~$9.58/day = ~$287/month**

### How to compute `r` from logs

```sql
-- Repeat rate from last 7 days
SELECT
  1.0 - (COUNT(DISTINCT normalized_query)::numeric / COUNT(*)) AS repeat_rate
FROM cache_events
WHERE created_at > NOW() - interval '7 days'
  AND event_type IN ('hit', 'miss');
```

### ROI breakeven

```
Monthly cache infra cost ≈ Redis ($0) + PG overhead ($0, already running) + warming worker CPU (~$5)
Breakeven at: ~215 queries/day with r=0.35, h=0.65
```
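The breakeven figure comes from inverting the monthly-savings formula (a sketch; the ~$5/month infra estimate is the one above):

```typescript
// Invert the savings formula: smallest Q (queries/day) whose monthly
// savings cover the monthly infra cost.
export function breakevenQueriesPerDay(
  monthlyInfraCost: number,      // dollars, e.g. ~5 for the warming worker
  r: number,                     // repeat rate
  h: number,                     // hit rate among repeats
  T: number,                     // avg tokens per miss
  Tc: number,                    // avg tokens per hit
  pricePerMillionTokens: number, // e.g. 3 for $3/1M tokens
): number {
  const dollarsPerDailyQueryPerMonth =
    (r * h * (T - Tc) * pricePerMillionTokens * 30) / 1_000_000;
  return Math.ceil(monthlyInfraCost / dollarsPerDailyQueryPerMonth);
}
```

With r=0.35, h=0.65, T=1200, Tc=60 at $3/1M tokens, a $5/month infra cost breaks even around 215 queries/day.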

---

## 5. CLAUDE FINAL-REFINEMENT PROMPT

Copy/paste this to Claude to merge the peer review and finalize implementation:

```
You are Claude acting as the final system integrator for the OpenClaw Daily Query Cache.

CONTEXT:
The cache system is LIVE in production with these components:
- Proxy (port 18790) intercepting OpenAI-format requests
- Normalizer (team aliases, dates, market terms, intent classification → SHA-256 key)
- Two-stage matching (Redis exact + pg_trgm fuzzy at threshold 0.6)
- Game-state-aware TTL (props/odds/scores blocked during live games)
- Safety filter (PII, user-specific content, short queries)
- Warming worker (PM2, every 15min, pairs recent user/assistant messages)
- Metrics (Redis counters + PG cache_events + daily aggregation)
- Backtesting ("The Pool" dashboard with OpenRouter multi-model comparison)

PEER REVIEW FINDINGS (from DeepSeek, Grok, GPT-4o):
1. CRITICAL: PII/cross-user leakage needs hardening — current regex filter has edge cases
2. CRITICAL: Cache key must include game_state + data_version to prevent stale serving
3. HIGH: Live game freshness — need real-time invalidation via sports API webhooks
4. HIGH: No A/B validation — can't prove savings are real without control group
5. HIGH: TTL thundering herd at 4AM cutoff — add jitter via game ID entropy
6. HIGH: Cache miss latency — add 200ms timeout cap, bypass to LLM if exceeded
7. MEDIUM: Fuzzy matching may be premature — exact match could handle 80%+ of cases
8. MEDIUM: Warming worker blanket approach — should target top query patterns only
9. MEDIUM: Backtesting not automated — needs weekly schedule + drift detection
10. LOW: Warming worker resource usage — consider event-driven triggers

TASK:
Produce implementation code for these HIGH-PRIORITY items (in order):

A) Add `data_version` column to `query_cache` and include it in cache key generation.
   - Modify normalizer.ts to accept and incorporate data_version
   - Modify similarity.ts to filter by data_version on lookups

B) Add PII hardening to safety-filter.ts:
   - Expand regex patterns for edge cases (partial emails, phone fragments, Telegram IDs)
   - Add response-level sanitization check on response_payload before caching
   - Block any response containing "your", "you previously", "based on your" patterns

C) Add TTL jitter to ttl-strategy.ts:
   - Compute jitter from game ID: (parseInt(gameId, 36) % 30) * 60 seconds
   - Apply to all TTLs to prevent thundering herd

D) Add 200ms cache lookup timeout to cache-service.ts:
   - If lookup exceeds 200ms, abort and return cache miss
   - Log timeout events for monitoring

E) A/B test framework in proxy.ts:
   - Route configurable % of requests (default 10%) to bypass cache
   - Tag cache_events with ab_group = 'control' or 'experiment'
   - Ensure consistent assignment per user session

F) Token savings SQL view for The Pool dashboard:
   - CREATE VIEW cache_savings_summary that computes Q, r, h, T, Tc, daily/monthly projections
   - Make it queryable from the existing /api/cache/stats endpoint

For each item, produce:
1. The exact file path and code changes (diffs preferred)
2. Any new SQL migrations needed
3. Edge cases to test
4. Success metric (how to verify it works)

Output as a structured implementation ticket with clear sections per item.
```

---

*Generated by Claude Opus via OpenRouter peer review pipeline. Model costs: DeepSeek ($0.002), Grok ($0.004), GPT-4o ($0.008).*
