A Python API flow for Retrieval-Augmented Generation (RAG) that handles file-based queries with FastAPI and Qdrant.
RAG Chat integrates document understanding with conversational AI: uploaded files are chunked, embedded, and stored in a vector database so responses can draw on the most relevant passages.

Processing Pipeline

1. File Upload and Preprocessing

  1. The user uploads a file through the frontend interface
  2. The backend reads the content and splits it into chunks
  3. Each chunk is embedded with an embedding model to enable similarity search
Technical Notes:
  • Text chunking uses a tokenizer-aware splitter (recursive chunker or NLTK-based); see the sketch below
  • The embedding model must match between indexing and querying (OpenAI, HuggingFace, etc.)
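
A minimal sketch of this step, assuming LangChain's RecursiveCharacterTextSplitter and the OpenAI embeddings client; the chunk size, overlap, and model name are illustrative choices, not prescribed by this flow:

```python
# Chunk a document and embed each chunk for similarity search.
# Assumes langchain-text-splitters and openai are installed, and
# OPENAI_API_KEY is set in the environment.
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI

client = OpenAI()

def chunk_and_embed(text: str) -> list[tuple[str, list[float]]]:
    splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
    chunks = splitter.split_text(text)
    # One batched call; the same model must be reused at query time.
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    return [(chunk, item.embedding) for chunk, item in zip(chunks, resp.data)]
```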

2. Storage and Retrieval

  • Storage: Embedded chunks are stored in the Qdrant vector database
  • Retrieval: Similar chunks are retrieved by semantic similarity to the query
Implementation:
  • Embeddings are stored under a collection_name in Qdrant
  • Queries perform a top_k similarity search with optional metadata filtering (see the sketch below)
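
A minimal sketch of storage and retrieval with the qdrant-client library; the collection name, vector size, and payload keys are assumptions for illustration:

```python
# Store embedded chunks and run a top_k similarity search with an
# optional metadata filter. Vector size 1536 matches text-embedding-3-small.
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, FieldCondition, Filter, MatchValue, PointStruct, VectorParams,
)

client = QdrantClient(url="http://localhost:6333")
COLLECTION = "rag_chunks"  # hypothetical collection_name

client.create_collection(
    collection_name=COLLECTION,
    vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
)

def store_chunks(embedded: list[tuple[str, list[float]]], source: str) -> None:
    points = [
        PointStruct(id=i, vector=vec, payload={"text": text, "source": source})
        for i, (text, vec) in enumerate(embedded)
    ]
    client.upsert(collection_name=COLLECTION, points=points)

def retrieve(query_vector: list[float], source: str, top_k: int = 5) -> list[str]:
    hits = client.search(
        collection_name=COLLECTION,
        query_vector=query_vector,
        limit=top_k,
        query_filter=Filter(  # optional metadata filter on the source file
            must=[FieldCondition(key="source", match=MatchValue(value=source))]
        ),
    )
    return [hit.payload["text"] for hit in hits]
```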

3. Model Selection and Initialization

  1. The Service Controller selects a model based on the user's choice
  2. The appropriate LLM service is initialized for contextual processing
Technical Notes:
  • The controller routes requests to the selected LLM (GPT-4, Claude, etc.); see the routing sketch below
  • The model is dynamically bound to the Retrieval Chain after initialization
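
A minimal sketch of the routing step, assuming LangChain's chat-model wrappers; the registry keys and model identifiers are illustrative:

```python
# Route the user's model selection to the appropriate LLM client.
from langchain_anthropic import ChatAnthropic
from langchain_openai import ChatOpenAI

MODEL_REGISTRY = {
    "gpt-4": lambda: ChatOpenAI(model="gpt-4"),
    "claude": lambda: ChatAnthropic(model="claude-3-5-sonnet-20241022"),
}

def init_llm(user_selection: str):
    try:
        factory = MODEL_REGISTRY[user_selection]
    except KeyError:
        raise ValueError(f"Unsupported model: {user_selection!r}")
    # The returned model is then bound to the Retrieval Chain.
    return factory()
```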

4. Context Construction

  • The Chat Repository fetches previous messages using the chat ID
  • Retrieval Chain combines:
    • Retrieved document chunks
    • Chat history
    • Current user query
Prompt Structure: History + Context + Question, assembled from templates (see the sketch below)
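
A minimal sketch of the template; the exact wording and field names are assumptions, the History + Context + Question ordering is the point:

```python
# Assemble the prompt from chat history, retrieved chunks, and the query.
PROMPT_TEMPLATE = """Use the context below to answer the question.

Chat history:
{history}

Context (retrieved document chunks):
{context}

Question:
{question}

Answer:"""

def build_prompt(history: list[str], chunks: list[str], question: str) -> str:
    return PROMPT_TEMPLATE.format(
        history="\n".join(history),
        context="\n\n".join(chunks),
        question=question,
    )
```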

5. Response Generation

  1. The LLM generates a response that is streamed live to the frontend
  2. The system logs via the MongoDB Handler:
    • Generated response
    • Token-based cost (Cost Callback)
    • Analytics and audit data
Cost Calculation: computed post-inference from prompt and completion token counts (sketched below)
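
A minimal sketch of streaming plus post-inference cost logging with FastAPI and the OpenAI client; the endpoint path, per-token prices, and the print placeholder (standing in for the MongoDB Handler) are illustrative:

```python
# Stream tokens to the frontend; compute cost once usage arrives on the
# final chunk (enabled via stream_options={"include_usage": True}).
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from openai import AsyncOpenAI

app = FastAPI()
client = AsyncOpenAI()
PRICE = {"prompt": 0.03 / 1000, "completion": 0.06 / 1000}  # hypothetical rates

@app.post("/chat/{chat_id}")
async def chat(chat_id: str, prompt: str):
    async def token_stream():
        stream = await client.chat.completions.create(
            model="gpt-4",
            messages=[{"role": "user", "content": prompt}],
            stream=True,
            stream_options={"include_usage": True},
        )
        async for chunk in stream:
            if chunk.choices and chunk.choices[0].delta.content:
                yield chunk.choices[0].delta.content
            if chunk.usage:  # final chunk carries the token counts
                cost = (chunk.usage.prompt_tokens * PRICE["prompt"]
                        + chunk.usage.completion_tokens * PRICE["completion"])
                print(f"chat {chat_id}: ${cost:.4f}")  # MongoDB Handler would log here

    return StreamingResponse(token_stream(), media_type="text/event-stream")
```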

Architecture

Diagram: RAG Chat architecture

Diagram: RAG Chat processing flow

Key Components

  • Qdrant: High-performance vector database for embeddings
  • Retrieval Chain: Context enrichment logic wrapper
  • Chat Repository: Message history management
  • Service Controller: Model routing and initialization
  • Cost Callback: Token usage and pricing tracking

Note: Ensure the Qdrant service is running and correctly configured. The embedding model must match between indexing and querying for accurate vector similarity.

Troubleshooting

Chunks Not Retrieved from Qdrant

  • Verify embeddings are correctly generated and stored
  • Check Qdrant connection status and schema consistency
  • Validate the similarity threshold and filtering logic (a diagnostic sketch follows)
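
A quick diagnostic sketch, assuming the collection name and embedding model from the examples above:

```python
# Confirm the collection is healthy and a probe query returns hits.
from openai import OpenAI
from qdrant_client import QdrantClient

qdrant = QdrantClient(url="http://localhost:6333")
openai_client = OpenAI()

# 1. Collection health: status should be green and points_count nonzero.
info = qdrant.get_collection("rag_chunks")
print("status:", info.status, "points:", info.points_count)

# 2. Probe query: embed a phrase with the *same* model used at indexing time.
probe = openai_client.embeddings.create(
    model="text-embedding-3-small",
    input="a phrase you know appears in an uploaded document",
).data[0].embedding
hits = qdrant.search(collection_name="rag_chunks", query_vector=probe, limit=3)
for hit in hits:
    print(round(hit.score, 3), hit.payload.get("text", "")[:80])
```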

Streaming Performance Issues

  • Monitor token usage and LLM latency (see the timing sketch below)
  • Validate the streaming infrastructure (WebSocket/EventStream) status
  • Check whether large documents are causing retrieval delays
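
A small timing sketch for separating LLM latency from infrastructure issues; the model name and test prompt are illustrative:

```python
# Measure time to first token and total generation time for a test prompt.
import time
from openai import OpenAI

client = OpenAI()

start = time.perf_counter()
first_token_at = None
stream = client.chat.completions.create(
    model="gpt-4",
    messages=[{"role": "user", "content": "ping"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices and chunk.choices[0].delta.content and first_token_at is None:
        first_token_at = time.perf_counter() - start
total = time.perf_counter() - start
print(f"first token: {first_token_at:.2f}s, total: {total:.2f}s")
```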