# Chunk Formatting & Dense Vector Best Practices

## Overview
This guide describes how to format retrieved chunks for LLM context and how to use dense embedding vectors for retrieval in a RAG (retrieval-augmented generation) system.

## Current Implementation

### Chunk Structure
```json
{
  "content": "Full text content of the chunk",
  "metadata": {
    "source": "filename.pdf",
    "chunk_id": 123,
    "summary": "Optional summary",
    "key_points": ["point1", "point2"],
    "themes": ["theme1", "theme2"],
    "clean_excerpt": "Cleaned excerpt"
  }
}
```

### Vector Embedding
- **Model**: `sentence-transformers/all-MiniLM-L6-v2`
- **Dimensions**: 384
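- **Normalization**: L2-normalized, so the inner product of two vectors equals their cosine similarity
- **Storage**: FAISS `IndexFlatIP` (exact inner-product search; equivalent to cosine similarity only because the vectors are normalized)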
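
The link between normalization and the inner-product index can be checked by hand. The sketch below uses plain Python lists rather than real embeddings; the helper names are illustrative and not part of the current codebase.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return list(vec) if norm == 0 else [x / norm for x in vec]

def inner_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def cosine_similarity(a, b):
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return inner_product(a, b) / (norm_a * norm_b)

a, b = [3.0, 4.0], [4.0, 3.0]
raw_ip = inner_product(a, b)                               # 24.0, not a cosine
unit_ip = inner_product(l2_normalize(a), l2_normalize(b))  # matches the cosine (about 0.96)
```

If vectors are added to an inner-product index without normalization, scores behave like `raw_ip` and grow with vector length, so a fixed minimum-score threshold stops being meaningful.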

## Best Practices for Chunk Formatting

### 1. **Content-First Approach**
```python
# ✅ GOOD: Prioritize actual content
content = chunk.get('content', '')
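# technical_terms is a list, so test each term (a bare `list in str` raises TypeError)
if technical_question and any(term in content.lower() for term in technical_terms):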
    use_full_content = True

# ❌ BAD: Always use summaries
content = metadata.get('summary', '')  # Loses technical details
```

### 2. **Technical Query Handling**
```python
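import re

# Detect technical queries by whole-word match, so 'ph' does not fire on "phone"
technical_terms = {'ph', 'temperature', 'humidity', 'ppm', 'ec', 'tds'}
question_words = set(re.findall(r'\b\w+\b', question.lower()))
is_technical = bool(question_words & technical_terms)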

# Use full content for technical queries
if is_technical:
    content = extract_relevant_sentences(chunk_content, query_terms)
else:
    content = use_summary_or_excerpt(chunk_content)
```

### 3. **Filter Generic Chunks**
```python
# Skip intro/preface chunks when technical info exists
intro_indicators = [
    'this book is written',
    'putting aside any legal',
    'introduction cannabis',
    'preface', 'foreword', 'copyright'
]

is_intro = any(indicator in content[:500].lower() for indicator in intro_indicators)
if is_intro and has_technical_chunks:
    skip_chunk()
```
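
A runnable version of this filter might look like the sketch below. The function names, the shortened indicator list, and the `has_digits` stand-in for technical-content detection are assumptions made for illustration.

```python
INTRO_INDICATORS = ("this book is written", "preface", "foreword", "copyright")

def is_intro_chunk(content, indicators=INTRO_INDICATORS, window=500):
    """True if any indicator phrase appears in the first `window` characters."""
    head = content[:window].lower()
    return any(indicator in head for indicator in indicators)

def drop_intro_chunks(chunks, is_technical_chunk):
    """Drop intro chunks, but only when at least one technical chunk exists."""
    has_technical = any(is_technical_chunk(c["content"]) for c in chunks)
    if not has_technical:
        return list(chunks)
    return [c for c in chunks if not is_intro_chunk(c["content"])]

def has_digits(text):
    # Hypothetical stand-in for a real technical-content detector
    return any(ch.isdigit() for ch in text)

chunks = [
    {"content": "Preface. This book is written for growers."},
    {"content": "Cannabis grows best in a 6.5 to 8 pH range."},
]
kept = drop_intro_chunks(chunks, has_digits)  # keeps only the pH chunk
```

When every retrieved chunk is an intro chunk, nothing is dropped, so the model still receives some context.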

### 4. **Query-Term Prioritization**
```python
import re

# Extract query terms
query_terms = set(re.findall(r'\b\w+\b', question.lower()))
stop_words = {'the', 'a', 'an', 'is', 'are', ...}
query_terms = query_terms - stop_words

# Prioritize sentences that share at least one whole word with the query
# (substring tests would match 'ph' inside "graph")
relevant_sentences = [
    s for s in sentences
    if query_terms & set(re.findall(r'\b\w+\b', s.lower()))
]
```
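
Put together, the extraction and prioritization steps can be sketched as follows. The stop-word set is an illustrative subset, the sentence splitter is deliberately naive, and ordering by overlap count is an assumption rather than the current behavior.

```python
import re

STOP_WORDS = {"the", "a", "an", "is", "are", "what", "for", "of", "in", "to"}  # illustrative subset

def extract_query_terms(question):
    """Lowercased whole words from the question, minus stop words."""
    return set(re.findall(r"\b\w+\b", question.lower())) - STOP_WORDS

def split_sentences(text):
    # Naive splitter: break after '.', '!' or '?' when followed by whitespace
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def extract_relevant_sentences(text, query_terms):
    """Sentences sharing a word with the query, most overlap first."""
    scored = []
    for position, sentence in enumerate(split_sentences(text)):
        words = set(re.findall(r"\b\w+\b", sentence.lower()))
        overlap = len(words & query_terms)
        if overlap:
            scored.append((-overlap, position, sentence))
    return [sentence for _, _, sentence in sorted(scored)]

text = (
    "Cannabis grows best in a 6.5 to 8 pH range. "
    "Lighting matters too. "
    "The pH of the nutrient solution controls availability."
)
terms = extract_query_terms("What is the best pH range?")
relevant = extract_relevant_sentences(text, terms)  # drops the lighting sentence
```

Ties keep document order, so the extracted context still reads naturally.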

## Dense Vector Best Practices

### 1. **Hybrid Scoring**
```python
# For technical queries, boost the keyword score before combining;
# boosting afterwards would leave combined_score unchanged
if technical_query:
    keyword_score += 0.2  # Boost for technical term matches

# Current: 65% Vector + 35% Keyword
combined_score = (
    0.65 * vector_score +      # Semantic similarity
    0.35 * keyword_score       # Exact term matching
)
```
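
As a self-contained function, the scoring rule could be written as below. Clamping the boosted keyword score to 1.0 is an added assumption to keep the combined score within [0, 1]; the weights and boost value are the ones given in this guide.

```python
def hybrid_score(vector_score, keyword_score, is_technical=False,
                 vector_weight=0.65, keyword_weight=0.35, technical_boost=0.2):
    """Weighted sum of semantic and keyword scores, both assumed to lie in [0, 1]."""
    if is_technical:
        # Boost before combining; the clamp to 1.0 is an assumption
        keyword_score = min(1.0, keyword_score + technical_boost)
    return vector_weight * vector_score + keyword_weight * keyword_score

general = hybrid_score(0.8, 0.5)                       # about 0.695
technical = hybrid_score(0.8, 0.5, is_technical=True)  # about 0.765
```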

### 2. **Vector Embedding Strategy**

#### **Chunk-Level Embedding** (Current)
- Embed entire chunk as one vector
- **Pros**: Captures context, relationships
- **Cons**: May dilute specific information

#### **Sentence-Level Embedding** (Alternative)
- Embed each sentence separately
- **Pros**: More precise matching
- **Cons**: Loses context, more vectors to manage

#### **Hybrid Approach** (Recommended)
```python
# Index time: embed the chunk and also its key sentences
chunk_vector = embed(chunk_content)
key_sentences = extract_key_sentences(chunk_content)
sentence_vectors = [embed(s) for s in key_sentences]

# Query time: search both indexes with the query vector
query_vector = embed(query)
results = search(chunk_index, query_vector) + search(sentence_index, query_vector)
```
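
The sketch below shows one way the hybrid approach could fit together end to end. `toy_embed` is a bag-of-words stand-in for the real embedding model, and keeping the best score per chunk is one possible merge rule, not necessarily the one to adopt.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in for a real embedding model: L2-normalized word counts."""
    counts = Counter(re.findall(r"\w+", text.lower()))
    norm = math.sqrt(sum(v * v for v in counts.values()))
    return {w: v / norm for w, v in counts.items()} if norm else {}

def dot(a, b):
    return sum(v * b.get(w, 0.0) for w, v in a.items())

def build_index(chunks):
    """Index each chunk once as a whole and once per sentence."""
    entries = []
    for chunk_id, text in chunks.items():
        entries.append((chunk_id, "chunk", toy_embed(text)))
        for sentence in re.split(r"(?<=[.!?])\s+", text):
            if sentence.strip():
                entries.append((chunk_id, "sentence", toy_embed(sentence)))
    return entries

def search(index, query, k=10):
    """Score every entry against the query; keep the best score per chunk."""
    query_vec = toy_embed(query)
    best = {}
    for chunk_id, _level, vec in index:
        best[chunk_id] = max(best.get(chunk_id, 0.0), dot(query_vec, vec))
    return sorted(best.items(), key=lambda item: item[1], reverse=True)[:k]

chunks = {
    1: "Cannabis grows best in a 6.5 to 8 pH range. "
       "Lighting schedules vary by strain and season.",
    2: "This book is written for hobby growers. It covers many topics.",
}
results = search(build_index(chunks), "best pH range")  # chunk 1 ranks first
```

Here the sentence-level entry for chunk 1 scores higher than its chunk-level entry, because the unrelated lighting sentence dilutes the whole-chunk vector. That is the dilution effect listed under the chunk-level cons.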

### 3. **Metadata Enrichment**
```python
# Add dense metadata to vectors
metadata = {
    "source": "filename.pdf",
    "chunk_id": 123,
    "page_number": 45,
    "section": "pH Levels",
    "technical_terms": ["ph", "6.5", "8.0"],  # Extracted terms
    "has_measurements": True,
    "has_numbers": True
}
```
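
The extracted fields can be derived from the chunk text with a couple of regular expressions. This is a sketch under stated assumptions: the term list is the one from the technical-query section, and treating "has a technical term and a number" as `has_measurements` is a heuristic chosen for illustration.

```python
import re

TECHNICAL_TERMS = ("ph", "temperature", "humidity", "ppm", "ec", "tds")
NUMBER_RE = re.compile(r"\d+(?:\.\d+)?")

def enrich_metadata(content, base_metadata):
    """Return a copy of base_metadata with extracted technical fields added."""
    words = set(re.findall(r"\w+", content.lower()))
    numbers = NUMBER_RE.findall(content)
    terms = [t for t in TECHNICAL_TERMS if t in words]
    enriched = dict(base_metadata)
    enriched["technical_terms"] = terms + numbers
    enriched["has_numbers"] = bool(numbers)
    # Heuristic (assumption): a technical term plus a number suggests a measurement
    enriched["has_measurements"] = bool(terms and numbers)
    return enriched

meta = enrich_metadata(
    "Cannabis grows best in a 6.5 to 8 pH range.",
    {"source": "filename.pdf", "chunk_id": 123},
)
```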

### 4. **Multi-Vector Strategy**
```python
# Create multiple vectors for different aspects
vectors = {
    "content": embed(full_content),
    "technical": embed(extract_technical_info(content)),
    "summary": embed(summary),
    "key_points": embed(" ".join(key_points))
}

# Search across all vectors; `weights` maps each vector type to its score weight
results = []
for vector_type, vector in vectors.items():
    results.extend(search(vector, weight=weights[vector_type]))
```
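
Because the same chunk can be returned by several vector types, the per-type hits need to be merged into one score per chunk. The sketch below sums weighted scores; the weight values and the summing rule are assumptions chosen for illustration.

```python
from collections import defaultdict

def merge_weighted_results(results_by_type, weights):
    """Combine per-vector-type hits into one weighted score per chunk."""
    combined = defaultdict(float)
    for vector_type, hits in results_by_type.items():
        weight = weights.get(vector_type, 0.0)  # unknown types contribute nothing
        for chunk_id, score in hits:
            combined[chunk_id] += weight * score
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)

# Hypothetical weights; tune them against real queries
weights = {"content": 0.5, "technical": 0.3, "summary": 0.1, "key_points": 0.1}
results_by_type = {
    "content": [(1, 0.8), (2, 0.6)],
    "technical": [(2, 0.9)],
    "summary": [(1, 0.5)],
}
ranked = merge_weighted_results(results_by_type, weights)  # chunk 2 outranks chunk 1
```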

## Formatting Strategies

### Strategy 1: **Full Content** (Best for Technical)
```
[Source: filename.pdf, Relevance: 0.640]
Cannabis grows best in a 6.5 to 8 pH range. pH tester: electronic instrument 
or chemical used to measure the acid or alkaline balance. The pH of the 
nutrient solution controls the availability of ions that cannabis needs to 
assimilate. Maintaining proper pH is crucial for nutrient uptake.
```

### Strategy 2: **Relevant Sentences** (Best for Long Documents)
```
[Source: filename.pdf, Relevance: 0.640]
Cannabis grows best in a 6.5 to 8 pH range. The pH of the nutrient solution 
controls the availability of ions that cannabis needs to assimilate. 
Maintaining proper pH is crucial for nutrient uptake.
```

### Strategy 3: **Summary + Excerpt** (Best for General Queries)
```
[Source: filename.pdf, Relevance: 0.640]
Summary: This section discusses pH levels for cannabis cultivation.
Excerpt: Cannabis grows best in a 6.5 to 8 pH range. pH tester: electronic 
instrument or chemical used to measure the acid or alkaline balance.
```

### Strategy 4: **Structured Format** (Best for Complex Info)
```
[Source: filename.pdf, Relevance: 0.640]
Topic: pH Levels for Cannabis
Key Information:
- Optimal Range: 6.5 to 8.0
- Measurement Tool: pH tester (electronic or chemical)
- Importance: Controls nutrient availability
Details: The pH of the nutrient solution controls the availability of ions 
that cannabis needs to assimilate. Maintaining proper pH is crucial for 
nutrient uptake.
```
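
The layouts above share one header line, so they can be produced by small formatting helpers. A minimal sketch, with function names and signatures chosen for illustration:

```python
def format_header(source, score):
    """Header line shared by all strategies."""
    return f"[Source: {source}, Relevance: {score:.3f}]"

def format_summary_excerpt(source, score, summary, excerpt):
    """Strategy 3: summary followed by an excerpt."""
    return "\n".join([
        format_header(source, score),
        f"Summary: {summary}",
        f"Excerpt: {excerpt}",
    ])

def format_structured(source, score, topic, key_info, details):
    """Strategy 4: topic, labeled key facts, then supporting details."""
    lines = [format_header(source, score), f"Topic: {topic}", "Key Information:"]
    lines += [f"- {label}: {value}" for label, value in key_info.items()]
    lines.append(f"Details: {details}")
    return "\n".join(lines)

block = format_structured(
    "filename.pdf", 0.64, "pH Levels for Cannabis",
    {"Optimal Range": "6.5 to 8.0", "Importance": "Controls nutrient availability"},
    "Maintaining proper pH is crucial for nutrient uptake.",
)
```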

## Recommendations

### For Technical Queries (pH, temperature, measurements):
1. ✅ Use **full content** or **relevant sentences**
2. ✅ Filter out **intro/preface chunks**
3. ✅ Boost **keyword matching** for technical terms
4. ✅ Extract **numbers and ranges** explicitly

### For General Queries (what is, explain, describe):
1. ✅ Use **summary + excerpt** format
2. ✅ Include **key points** and **themes**
3. ✅ Prioritize **vector similarity** (semantic meaning)

### For Factual Queries (who, when, where):
1. ✅ Use **structured format** with key information
2. ✅ Extract **specific facts** explicitly
3. ✅ Include **source context**
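
These recommendations amount to a small routing table from query type to formatting strategy. The sketch below classifies by whole-word match; the precedence (technical before factual) and the strategy labels are assumptions for illustration.

```python
import re

TECHNICAL_TERMS = {"ph", "temperature", "humidity", "ppm", "ec", "tds"}
FACTUAL_WORDS = {"who", "when", "where"}

STRATEGY_BY_TYPE = {
    "technical": "relevant_sentences",
    "factual": "structured",
    "general": "summary_excerpt",
}

def classify_query(question):
    """Return 'technical', 'factual', or 'general' based on whole-word matches."""
    words = set(re.findall(r"\w+", question.lower()))
    if words & TECHNICAL_TERMS:
        return "technical"
    if words & FACTUAL_WORDS:
        return "factual"
    return "general"

strategy = STRATEGY_BY_TYPE[classify_query("What pH range is best?")]
```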

## Implementation Example

```python
def format_chunk_for_context(chunk, question, is_technical=False):
    content = chunk.get('content', '')
    metadata = chunk.get('metadata', {})
    
    # Extract query terms and check whether the chunk holds technical content
    query_terms = extract_query_terms(question)
    has_technical_info = detect_technical_content(content)
    
    # Choose formatting strategy
    if is_technical and has_technical_info:
        # Strategy 1: Full content with relevant sentences
        sentences = extract_relevant_sentences(content, query_terms)
        formatted = format_full_content(sentences)
    elif is_technical:
        # Strategy 2: Relevant sentences only
        sentences = extract_relevant_sentences(content, query_terms)
        formatted = format_sentences(sentences)
    else:
        # Strategy 3: Summary + excerpt
        formatted = format_summary_excerpt(metadata, content)
    
    return formatted
```

## Current System Configuration

- **Chunk Size**: 1000 characters
- **Chunk Overlap**: 200 characters
- **Embedding Model**: all-MiniLM-L6-v2 (384 dims)
- **Vector Weight**: 65%
- **Keyword Weight**: 35%
- **Retrieval**: 10 chunks (k=10)
- **Min Score**: 0.5
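
To make the chunking and retrieval settings concrete, the sketch below applies them with a plain sliding window and a score filter. The actual splitter used by the system may differ (for example, it may respect sentence or paragraph boundaries), so treat this as an illustration of the numbers only.

```python
def chunk_text(text, chunk_size=1000, overlap=200):
    """Split text into fixed-size character windows that overlap by `overlap`."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reaches the end of the text
    return chunks

def select_results(scored_chunks, k=10, min_score=0.5):
    """Keep at most k (chunk, score) pairs scoring at least min_score."""
    kept = [(chunk, score) for chunk, score in scored_chunks if score >= min_score]
    return sorted(kept, key=lambda item: item[1], reverse=True)[:k]

sample = "".join(str(i % 10) for i in range(2500))
pieces = chunk_text(sample)  # three chunks: 1000, 1000 and 900 characters
```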

## Future Improvements

1. **Semantic Chunking**: Group related sentences semantically
2. **Hierarchical Vectors**: Document → Section → Chunk → Sentence
3. **Query-Aware Embedding**: Fine-tune the embedding model on domain-specific queries and passages
4. **Multi-Modal**: Include images/diagrams from PDFs
5. **Temporal Context**: Track when information was added/updated

