# Chunk Scoring System

## Overview
The RAG system uses a **hybrid scoring approach** that combines **vector similarity** (semantic meaning) with **keyword matching** (fuzzy text matching) to rank chunks.

## Scoring Process

### Step 1: Vector Embedding Search
1. **Query Embedding**: The user's question is converted to a vector using `sentence-transformers/all-MiniLM-L6-v2`.
2. **FAISS Search**: The query vector is searched against a FAISS (Facebook AI Similarity Search) `IndexFlatIP` (inner product) index.
3. **Normalization**: All embeddings are L2-normalized, so the inner product of two embeddings equals their **cosine similarity**.
4. **Oversampling**: The search retrieves `k * HYBRID_SEARCH_OVERSAMPLE` candidates (defaults: k=5 and an oversample factor of 4, so 20 candidates).

**Vector Score Range**: -1.0 to 1.0 (cosine similarity)
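- **1.0** = Vectors point in the same direction (near-identical meaning)
- **0.0** = Vectors are orthogonal (no measurable similarity)
- **-1.0** = Vectors point in opposite directions (the theoretical minimum)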
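The normalization claim in Step 1 can be checked with a few lines of plain Python. This is a minimal sketch, not the project's code: the toy 3-dimensional vectors stand in for real 384-dimensional embeddings, and the helper names (`l2_normalize`, `inner_product`, `cosine_similarity`) are hypothetical.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (L2 norm of 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero vector")
    return [x / norm for x in vec]

def inner_product(a, b):
    """Sum of element-wise products."""
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Textbook cosine similarity on raw (unnormalized) vectors."""
    return inner_product(a, b) / (
        math.sqrt(inner_product(a, a)) * math.sqrt(inner_product(b, b))
    )

# Toy vectors (assumption: illustrative values, not real embeddings).
query_vec = [1.0, 2.0, 2.0]
chunk_vec = [2.0, 1.0, 2.0]

# Inner product AFTER normalization vs. cosine similarity on the raw vectors.
ip_of_normalized = inner_product(l2_normalize(query_vec), l2_normalize(chunk_vec))
cos = cosine_similarity(query_vec, chunk_vec)
print(ip_of_normalized, cos)
```

Both values agree up to floating-point error, which is why an inner-product index can serve as a cosine-similarity index once every stored and queried vector has been normalized.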

### Step 2: Keyword Re-ranking
For each candidate chunk, a keyword score is calculated:

```python
keyword_score = fuzz.partial_ratio(query.lower(), chunk_text.lower()) / 100.0
```

**Keyword Score Range**: 0.0 to 1.0
- Uses the `rapidfuzz` library for fuzzy string matching; `partial_ratio` returns a value from 0 to 100, hence the division by 100
- Checks against: `metadata.summary` OR `metadata.clean_excerpt` OR `content`
- **Minimum threshold**: 0.1 (a keyword score below this is set to 0.0)
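The field fallback and the threshold can be sketched as follows. Several things here are assumptions rather than the project's actual code: the chunk is modeled as a plain dict, the "OR" is read as "first non-empty field in that order", and the scorer is passed in as a parameter so the sketch runs without `rapidfuzz` installed. In the real system the scorer would presumably be `rapidfuzz.fuzz.partial_ratio`; `toy_scorer` is a stand-in for demonstration only.

```python
def pick_keyword_text(chunk):
    """Return the first non-empty field among summary, clean_excerpt, content."""
    metadata = chunk.get("metadata", {})
    return (
        metadata.get("summary")
        or metadata.get("clean_excerpt")
        or chunk.get("content", "")
    )

def keyword_score(query, chunk, scorer, min_score=0.1):
    """Scale a 0-100 scorer result to 0.0-1.0 and zero out weak matches."""
    text = pick_keyword_text(chunk)
    score = scorer(query.lower(), text.lower()) / 100.0
    return score if score >= min_score else 0.0

def toy_scorer(query, text):
    """Stand-in scorer (hypothetical): 100 if query is a substring, else 5."""
    return 100.0 if query in text else 5.0

chunk = {
    "metadata": {"summary": ""},  # empty, so the fallback continues
    "content": "Cannabis is a genus of flowering plants",
}
hit = keyword_score("cannabis", chunk, toy_scorer)   # scorer gives 100 -> 1.0
miss = keyword_score("aspirin", chunk, toy_scorer)   # scorer gives 5 -> 0.05 -> 0.0
print(hit, miss)
```

Zeroing out weak scores keeps incidental character overlap from contributing to the combined score.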

### Step 3: Combined Score
The final score combines both methods:

```python
combined_score = (
    HYBRID_VECTOR_WEIGHT * vector_score +      # 0.7 weight
    HYBRID_KEYWORD_WEIGHT * keyword_score      # 0.3 weight
)
```

**Current Weights** (from `config.py`):
- **Vector Weight**: 70% (`HYBRID_VECTOR_WEIGHT = 0.7`)
- **Keyword Weight**: 30% (`HYBRID_KEYWORD_WEIGHT = 0.3`)
- **Minimum Keyword Score**: 0.1 (`HYBRID_MIN_KEYWORD_SCORE`)
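As a quick check of what these weights imply, the sketch below wraps the formula in a function and evaluates the extreme cases. The function name `combined_score` is hypothetical; the weights are the defaults listed above.

```python
HYBRID_VECTOR_WEIGHT = 0.7
HYBRID_KEYWORD_WEIGHT = 0.3

def combined_score(vector_score, keyword_score,
                   vector_weight=HYBRID_VECTOR_WEIGHT,
                   keyword_weight=HYBRID_KEYWORD_WEIGHT):
    """Weighted sum of the vector and keyword signals."""
    return vector_weight * vector_score + keyword_weight * keyword_score

best_case = combined_score(1.0, 1.0)    # both signals at their maximum
worst_case = combined_score(-1.0, 0.0)  # opposite vector, no keyword match
print(best_case, worst_case)
```

With the default weights, the combined score therefore lies between -0.7 and 1.0, and a chunk with no keyword match can reach at most 0.7.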

### Step 4: Final Ranking
1. All candidates are sorted by `combined_score` (descending)
2. Top `k` chunks are returned (default: 5)
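The ranking step amounts to a descending sort followed by a slice. A minimal sketch, assuming each candidate is a dict with a `combined_score` key (the dict layout and the `top_k` helper are hypothetical):

```python
def top_k(candidates, k=5):
    """Sort candidates by combined_score (highest first) and keep the first k."""
    ranked = sorted(candidates, key=lambda c: c["combined_score"], reverse=True)
    return ranked[:k]

candidates = [
    {"id": "a", "combined_score": 0.32},
    {"id": "b", "combined_score": 0.88},
    {"id": "c", "combined_score": 0.56},
]
best_two = [c["id"] for c in top_k(candidates, k=2)]
print(best_two)
```

If fewer than `k` candidates exist, the slice simply returns all of them.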

## Example Scoring

**Query**: "What is cannabis?" (the scores below are illustrative values chosen to show the arithmetic, not measured output)

**Chunk 1**: "Cannabis is a genus of flowering plants..."
- Vector Score: 0.85 (high semantic similarity)
- Keyword Score: 0.95 (exact word match)
- **Combined**: (0.7 × 0.85) + (0.3 × 0.95) = **0.88**

**Chunk 2**: "Marijuana has been used for centuries..."
- Vector Score: 0.80 (high semantic similarity, "marijuana" ≈ "cannabis")
- Keyword Score: 0.0 (no exact match, below 0.1 threshold)
- **Combined**: (0.7 × 0.80) + (0.3 × 0.0) = **0.56**

**Chunk 3**: "The history of medicine..."
- Vector Score: 0.45 (moderate semantic similarity)
- Keyword Score: 0.0 (no match)
- **Combined**: (0.7 × 0.45) + (0.3 × 0.0) = 0.315 ≈ **0.32**

**Result**: Chunk 1 wins (0.88 > 0.56 > 0.32)

## Configuration

**File**: `/var/www/html/leadgen/airagagent/config.py`

```python
# Hybrid retrieval settings
HYBRID_SEARCH_OVERSAMPLE = 4          # Retrieve 4x more candidates, then re-rank
HYBRID_VECTOR_WEIGHT = 0.7             # 70% weight on semantic similarity
HYBRID_KEYWORD_WEIGHT = 0.3            # 30% weight on keyword matching
HYBRID_MIN_KEYWORD_SCORE = 0.1         # Minimum keyword score threshold
```

## Why Hybrid Scoring?

1. **Vector Similarity** (70%): Captures semantic meaning
   - Finds chunks with similar meaning even if words differ
   - Example: "cannabis" matches "marijuana", "weed", "THC"

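2. **Keyword Matching** (30%): Rewards literal text overlap
   - Boosts chunks whose text closely matches the wording of the query
   - Helps when the user asks about specific terms

3. **Oversampling**: Retrieves 4x candidates, then re-ranks
   - Lets a chunk that sits just outside the top `k` by vector score be promoted by a strong keyword score
   - Improves recall in the candidate pool, while re-ranking restores precision in the final top `k`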
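The effect of oversampling can be seen in a small simulation. This is a sketch under stated assumptions: the candidate scores are made up, the `rerank` helper is hypothetical, and the initial sort by vector score only mimics what the FAISS search would return.

```python
def rerank(candidates, k, oversample, vector_weight=0.7, keyword_weight=0.3):
    """Take the k * oversample best candidates by vector score, then re-rank them."""
    by_vector = sorted(candidates, key=lambda c: c["vector"], reverse=True)
    pool = by_vector[: k * oversample]
    pool.sort(
        key=lambda c: vector_weight * c["vector"] + keyword_weight * c["keyword"],
        reverse=True,
    )
    return [c["id"] for c in pool[:k]]

# Hypothetical scores: "d" is only 4th by vector score but has a strong keyword match.
candidates = [
    {"id": "a", "vector": 0.82, "keyword": 0.0},
    {"id": "b", "vector": 0.80, "keyword": 0.0},
    {"id": "c", "vector": 0.78, "keyword": 0.2},
    {"id": "d", "vector": 0.75, "keyword": 0.9},
]

without_oversampling = rerank(candidates, k=2, oversample=1)
with_oversampling = rerank(candidates, k=2, oversample=4)
print(without_oversampling, with_oversampling)
```

Without oversampling, chunks `c` and `d` never enter the candidate pool, so their keyword scores cannot influence the result. With a factor of 4 they are re-ranked to the top.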

## Code Location

**Main Search Function**: `airagagent/vector_store.py` → `search()` method (lines 143-201)

**Key Components**:
- **Embedding Model**: `sentence-transformers/all-MiniLM-L6-v2` (384-dimensional vectors)
- **Index Type**: `faiss.IndexFlatIP` (inner product, which equals cosine similarity on L2-normalized vectors)
- **Fuzzy Matching**: `rapidfuzz.fuzz.partial_ratio()` for keyword scoring

## Adjusting Scoring

To change scoring behavior, edit `config.py`:

```python
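# Pick ONE of the two weight presets below; if both are left in,
# the later assignment overrides the earlier one.

# Preset A: more weight on semantic similarity
HYBRID_VECTOR_WEIGHT = 0.9
HYBRID_KEYWORD_WEIGHT = 0.1

# Preset B: more weight on keyword matching
HYBRID_VECTOR_WEIGHT = 0.5
HYBRID_KEYWORD_WEIGHT = 0.5

# Stricter keyword matching (independent of the presets above)
HYBRID_MIN_KEYWORD_SCORE = 0.3  # Keyword scores below 0.3 are set to 0.0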
```
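To get a feel for how the weights shift results, the sketch below recomputes Chunk 1 and Chunk 2 from the Example Scoring section under three weight settings. The preset names and the `combined` helper are hypothetical; the score pairs are the illustrative values from that section.

```python
def combined(vector_score, keyword_score, vector_weight, keyword_weight):
    """Weighted sum of the vector and keyword signals."""
    return vector_weight * vector_score + keyword_weight * keyword_score

# (vector_score, keyword_score) pairs from the Example Scoring section.
chunks = {"Chunk 1": (0.85, 0.95), "Chunk 2": (0.80, 0.0)}

# (vector_weight, keyword_weight) per preset; names are for illustration only.
presets = {"default": (0.7, 0.3), "semantic-heavy": (0.9, 0.1), "balanced": (0.5, 0.5)}

gaps = {}
for name, (vw, kw) in presets.items():
    scores = {cid: combined(v, k, vw, kw) for cid, (v, k) in chunks.items()}
    gaps[name] = round(scores["Chunk 1"] - scores["Chunk 2"], 3)
    print(name, {cid: round(s, 3) for cid, s in scores.items()}, "gap:", gaps[name])
```

Raising the keyword weight widens the gap between a chunk that matches the query wording and one that is only semantically close. Raising the vector weight narrows it.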

## Score Interpretation

When viewing chunks in the admin interface, scores typically fall into these ranges:
- **0.7 - 1.0**: Excellent match (highly relevant)
- **0.5 - 0.7**: Good match (relevant)
- **0.3 - 0.5**: Moderate match (somewhat relevant)
- **0.0 - 0.3**: Weak match (may not be relevant)
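These bands can be turned into a small labeling helper, for example when logging search results. The function below is a hypothetical sketch. It assumes that a boundary value belongs to the higher band and that anything below 0.3, including negative scores, counts as weak; the source does not specify either.

```python
def describe_score(score):
    """Map a combined score to the bands listed above (lower bound inclusive)."""
    if score >= 0.7:
        return "excellent"
    if score >= 0.5:
        return "good"
    if score >= 0.3:
        return "moderate"
    return "weak"

# Combined scores from the Example Scoring section.
labels = [describe_score(s) for s in (0.88, 0.56, 0.32)]
print(labels)
```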

The system returns the top 5 chunks by default, but you can adjust `k` in the search call.

