# Complete RAG System Workflow

## Overview
The RAG system processes PDFs in two main stages:
1. **Chunking** (Automatic) - Breaks PDFs into searchable chunks
2. **Enrichment** (Manual) - Uses the Grok API to enhance chunks with summaries, key points, and themes

---

## Stage 1: PDF Chunking (Automatic on Upload)

### What Happens:
1. **PDF Upload** → File saved to `pdf_directory/`
2. **Text Extraction** → Extracts text using PyMuPDF/pdfminer
3. **Document Type Detection** → Classifies as: technical, research, legal, manual, or default
4. **Structure-Aware Chunking**:
   - Detects headings, sections, tables, lists
   - Uses sentence-based sliding window
   - Applies document-type-specific parameters
5. **Chunk Validation** → Filters out boilerplate and low-quality chunks
6. **Metadata Extraction** → Adds structural info, content quality metrics
7. **Save Chunks** → Saves to `processed/documents/filename_chunks.json`
8. **Update Metadata** → Updates `processed/metadata/processed_files.json`
9. **Add to Vector Store** → Adds raw chunks to FAISS index for searching
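
A minimal sketch of the sentence-based sliding window from step 4, assuming a naive regex sentence splitter. The `CHUNK_PARAMS` values, the `window`/`overlap` parameter names, and both function names are illustrative assumptions, not the system's actual settings:

```python
import re

# Hypothetical per-document-type parameters (sizes counted in sentences).
# The real values used by the system are not given in this document.
CHUNK_PARAMS = {
    "technical": {"window": 8, "overlap": 2},
    "default": {"window": 6, "overlap": 1},
}

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]

def sliding_window_chunks(text, doc_type="default"):
    """Group sentences into overlapping windows for the given document type."""
    params = CHUNK_PARAMS.get(doc_type, CHUNK_PARAMS["default"])
    window, overlap = params["window"], params["overlap"]
    step = window - overlap  # how far the window advances each time
    sentences = split_sentences(text)
    chunks = []
    for start in range(0, len(sentences), step):
        chunks.append(" ".join(sentences[start:start + window]))
        if start + window >= len(sentences):
            break  # the last window already reached the end of the text
    return chunks
```

Consecutive chunks share `overlap` sentences, so a statement that straddles a chunk boundary still appears whole in at least one chunk.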

### Files Created:
- `processed/documents/*_chunks.json` - Raw chunks with metadata
- `processed/metadata/processed_files.json` - Processing tracking
- `processed/embeddings/faiss_index.index` - Vector search index
- `processed/embeddings/documents_metadata.json` - Vector store metadata

### Chunk Format:
```json
{
  "content": "chunk text here...",
  "metadata": {
    "source": "filename.pdf",
    "chunk_id": 0,
    "total_chunks": 100,
    "document_title": "Document Title",
    "chunk_method": "structure_aware_technical",
    "page": 15,
    "section": "Section Title",
    "structural_info": {
      "sentence_count": 8,
      "word_count": 215,
      "char_count": 1200,
      "content_type": "text"
    },
    "content_quality": {
      "readability_score": 0.82,
      "has_technical_content": true,
      "is_boilerplate": false
    },
    "clean_excerpt": "First 200 chars..."
  }
}
```
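
As a rough illustration of how the `structural_info` counts and `clean_excerpt` in this format could be derived from the chunk text, here is a sketch. The helper names and the regex-based sentence count are assumptions; the real pipeline may count differently, and fields such as `page`, `section`, and `content_quality` are omitted:

```python
import re

def structural_info(content, content_type="text"):
    """Compute simple size metrics for one chunk of text."""
    sentences = [s for s in re.split(r"(?<=[.!?])\s+", content.strip()) if s]
    return {
        "sentence_count": len(sentences),
        "word_count": len(content.split()),
        "char_count": len(content),
        "content_type": content_type,
    }

def make_chunk(content, source, chunk_id, total_chunks, excerpt_len=200):
    """Assemble a chunk record with a subset of the metadata fields shown above."""
    return {
        "content": content,
        "metadata": {
            "source": source,
            "chunk_id": chunk_id,
            "total_chunks": total_chunks,
            "structural_info": structural_info(content),
            # Excerpt length of 200 follows the "First 200 chars" note above.
            "clean_excerpt": content[:excerpt_len],
        },
    }
```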

---

## Stage 2: Enrichment with Grok API (Manual - Button Click)

### What Happens:
1. **Check for Files** → Finds files with chunks but no enriched cards
2. **Load Chunks** → Reads `processed/documents/*_chunks.json` files
3. **For Each Chunk**:
   - Creates basic card (simple summary, key points, themes)
   - Calls Grok API to improve the card:
     ```
     Prompt: "Refine this knowledge card. Extract important info.
     EXCERPT: [chunk text]
     Provide: Summary, Key Points, Themes"
     ```
   - Grok returns improved summary, key points, themes
   - Saves enriched card to `processed/notes/*_cards.json`
4. **Update Vector Store** → Replaces raw chunks with enriched chunks
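
The per-chunk loop in step 3 might look roughly like the sketch below. The prompt text follows the template above; everything else is an assumption. In particular, `call_llm` is a hypothetical stand-in for whatever client actually calls the Grok API, and falling back to the basic card on failure is an illustrative design choice, not documented behavior:

```python
import re

def basic_card(chunk):
    """Cheap first-pass card built without any API call (hypothetical heuristic)."""
    text = chunk["content"].strip()
    first_sentence = re.split(r"(?<=[.!?])\s+", text, maxsplit=1)[0]
    return {"summary": first_sentence, "key_points": [], "themes": []}

def build_prompt(chunk_text):
    """Fill the refinement prompt template with the chunk text."""
    return (
        "Refine this knowledge card. Extract important info.\n"
        f"EXCERPT: {chunk_text}\n"
        "Provide: Summary, Key Points, Themes"
    )

def enrich(chunk, call_llm):
    """Improve the basic card using call_llm(prompt) -> dict.

    call_llm is a placeholder for the real API client; it is assumed to
    return a dict that may contain 'summary', 'key_points', and 'themes'.
    """
    card = basic_card(chunk)
    try:
        improved = call_llm(build_prompt(chunk["content"]))
    except Exception:
        return card  # assumed fallback: keep the basic card if the call fails
    for key in ("summary", "key_points", "themes"):
        if key in improved:
            card[key] = improved[key]
    return card
```

Passing the client in as a callable keeps the loop testable without network access or API credits.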

### Files Created:
- `processed/notes/*_cards.json` - Enriched knowledge cards
- `processed/digests/*.json` - Document digests
- `processed/capsules/*.json` - Theme capsules

### Enriched Card Format:
```json
{
  "card_id": "filename.pdf::chunk-0",
  "source": "filename.pdf",
  "summary": "AI-generated 1-2 sentence summary",
  "key_points": ["Point 1", "Point 2", "Point 3"],
  "themes": ["theme1", "theme2", "theme3"],
  "clean_excerpt": "First 600 chars..."
}
```
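
A sketch of assembling a card in this format and writing it to `processed/notes/`. It assumes that each `*_cards.json` file holds a list of cards and is named after the PDF's stem; neither detail is confirmed here, and the function names are hypothetical:

```python
import json
from pathlib import Path

def make_card(source, chunk_id, content, summary, key_points, themes,
              excerpt_len=600):
    """Build one enriched card; card_id follows the 'source::chunk-N' pattern."""
    return {
        "card_id": f"{source}::chunk-{chunk_id}",
        "source": source,
        "summary": summary,
        "key_points": list(key_points),
        "themes": list(themes),
        "clean_excerpt": content[:excerpt_len],
    }

def save_cards(cards, source, notes_dir="processed/notes"):
    """Write all cards for one PDF to <notes_dir>/<stem>_cards.json (assumed naming)."""
    out = Path(notes_dir) / f"{Path(source).stem}_cards.json"
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps(cards, indent=2, ensure_ascii=False),
                   encoding="utf-8")
    return out
```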

---

## Complete Data Flow

```
PDF Upload
    ↓
Text Extraction (PyMuPDF/pdfminer)
    ↓
Document Type Detection
    ↓
Structure-Aware Chunking
    ↓
Chunk Validation & Filtering
    ↓
Metadata Extraction
    ↓
Save: processed/documents/*_chunks.json
    ↓
Add to Vector Store (FAISS)
    ↓
[RAW CHUNKS READY FOR SEARCH]
    ↓
[USER CLICKS ENRICHMENT BUTTON]
    ↓
Load Chunks from processed/documents/
    ↓
For Each Chunk:
    Generate Basic Card
    ↓
    Call Grok API → Get Improved Card
    ↓
    Save: processed/notes/*_cards.json
    ↓
Update Vector Store with Enriched Chunks
    ↓
[ENRICHED CHUNKS READY - MORE SEARCHABLE]
```

---

## Key Directories

- `pdf_directory/` - Original PDF files
- `processed/documents/` - Raw chunk files (`*_chunks.json`)
- `processed/notes/` - Enriched card files (`*_cards.json`)
- `processed/embeddings/` - Vector store (FAISS index + metadata)
- `processed/metadata/` - Processing tracking files
- `processed/digests/` - Document summaries
- `processed/capsules/` - Theme aggregations
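
For scripts that need these locations, one option is to keep them in a single mapping. This is a sketch only: the directory names come from the list above, while `LAYOUT` and `ensure_layout` are hypothetical helpers:

```python
from pathlib import Path

# Directory names are taken from the list above; the helper itself is hypothetical.
LAYOUT = {
    "pdfs": "pdf_directory",
    "documents": "processed/documents",
    "notes": "processed/notes",
    "embeddings": "processed/embeddings",
    "metadata": "processed/metadata",
    "digests": "processed/digests",
    "capsules": "processed/capsules",
}

def ensure_layout(root="."):
    """Create any missing directories under root and return their paths."""
    root = Path(root)
    paths = {name: root / rel for name, rel in LAYOUT.items()}
    for path in paths.values():
        path.mkdir(parents=True, exist_ok=True)
    return paths
```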

---

## Why Two Stages?

**Stage 1 (Chunking)**:
- ✅ Fast (no API calls)
- ✅ No API cost
- ✅ Makes files searchable immediately after upload
- ✅ Uses structure-aware, sentence-based chunking with document-type-specific parameters

**Stage 2 (Enrichment)**:
- ⏱️ Slower (one Grok API call per chunk)
- 💰 Uses API credits
- ✅ Aims to improve search quality by replacing raw chunks with enriched ones in the vector store
- ✅ Refines the basic summaries, key points, and themes
- ✅ Optional - the system works without it

---

## Current Status Check

Run these commands from the directory that contains `processed/` to check status:
```bash
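# Count chunk files (stderr suppressed so an empty directory prints 0, not an error)
ls processed/documents/*_chunks.json 2>/dev/null | wc -l

# Count enriched card files
ls processed/notes/*_cards.json 2>/dev/null | wc -l

# List vector store index files
ls processed/embeddings/*.index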
```
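
The same check can be sketched in Python, which also makes it easy to list files that have chunks but no enriched cards yet (the condition Stage 2 looks for). This assumes chunk and card files share the same filename stem; the `status` function is a hypothetical helper, not part of the system:

```python
from pathlib import Path

def status(root="processed"):
    """Summarize chunk files, card files, and files still awaiting enrichment."""
    root = Path(root)
    chunk_stems = {
        p.name[: -len("_chunks.json")]
        for p in (root / "documents").glob("*_chunks.json")
    }
    card_stems = {
        p.name[: -len("_cards.json")]
        for p in (root / "notes").glob("*_cards.json")
    }
    return {
        "chunk_files": len(chunk_stems),
        "card_files": len(card_stems),
        # Chunked but not yet enriched (assumes matching stems).
        "pending_enrichment": sorted(chunk_stems - card_stems),
        "index_files": sorted(p.name for p in (root / "embeddings").glob("*.index")),
    }

if __name__ == "__main__":
    for key, value in status().items():
        print(f"{key}: {value}")
```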

