RAG Chat combines document understanding with conversational AI: uploaded files are chunked, embedded, and stored in a vector database so that responses can be grounded in the most relevant passages.
Processing Pipeline
1. File Upload and Preprocessing
- User uploads a file through the frontend interface
- Backend reads the content and splits it into chunks
- Each chunk is embedded with an embedding model for similarity search
- Text chunking uses a tokenizer-aware splitter (recursive chunker or NLTK-based)
- Embedding model must match between indexing and querying (OpenAI, HuggingFace, etc.); see the sketch below
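A minimal sketch of the chunk-and-embed step, assuming LangChain's tokenizer-aware `RecursiveCharacterTextSplitter` and the OpenAI embeddings API; the chunk size, overlap, and model name are illustrative, not this project's actual configuration:

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter
from openai import OpenAI

# Tokenizer-aware splitting: sizes are measured in tokens, not characters.
# chunk_size and chunk_overlap values here are illustrative defaults.
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=64,
)

def chunk_and_embed(text: str) -> tuple[list[str], list[list[float]]]:
    chunks = splitter.split_text(text)
    client = OpenAI()  # reads OPENAI_API_KEY from the environment
    # The same embedding model MUST be reused later at query time.
    resp = client.embeddings.create(model="text-embedding-3-small", input=chunks)
    vectors = [item.embedding for item in resp.data]
    return chunks, vectors
```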
2. Storage and Retrieval
- Storage: Embedded chunks stored in Qdrant vector database
- Retrieval: Relevant chunks retrieved by semantic similarity to the query
- Embeddings stored using a `collection_name` in Qdrant
- Queries perform `top_k` similarity search with optional metadata filtering (see the sketch below)
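A sketch of the upsert and `top_k` search calls using the official qdrant-client; the collection name, vector size, and `file_id` payload filter are assumptions for illustration:

```python
import uuid
from qdrant_client import QdrantClient
from qdrant_client.models import (
    Distance, VectorParams, PointStruct, Filter, FieldCondition, MatchValue,
)

client = QdrantClient(url="http://localhost:6333")  # assumed local instance
COLLECTION = "rag_chat_docs"  # illustrative collection_name

# Create the collection once; size must equal the embedding dimension
# (1536 for text-embedding-3-small in the sketch above).
if not client.collection_exists(COLLECTION):
    client.create_collection(
        collection_name=COLLECTION,
        vectors_config=VectorParams(size=1536, distance=Distance.COSINE),
    )

def store(chunks: list[str], vectors: list[list[float]], file_id: str) -> None:
    client.upsert(
        collection_name=COLLECTION,
        points=[
            PointStruct(id=str(uuid.uuid4()), vector=vec,
                        payload={"text": chunk, "file_id": file_id})
            for chunk, vec in zip(chunks, vectors)
        ],
    )

def retrieve(query_vector: list[float], file_id: str, top_k: int = 5):
    # top_k similarity search with optional metadata filtering on file_id
    return client.search(
        collection_name=COLLECTION,
        query_vector=query_vector,
        limit=top_k,
        query_filter=Filter(must=[
            FieldCondition(key="file_id", match=MatchValue(value=file_id))
        ]),
    )
```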
3. Model Selection and Initialization
- Service Controller determines model based on user selection
- Appropriate LLM service initialized for contextual processing
- Controller routes to LLMs (GPT-4, Claude, etc.)
- Model dynamically bound to the Retrieval Chain after initialization (see the sketch below)
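One way the Service Controller's routing could look, sketched with LangChain chat-model wrappers; the registry, its keys, and the `init_llm` helper are hypothetical, not this project's actual interface:

```python
from langchain_openai import ChatOpenAI
from langchain_anthropic import ChatAnthropic

# Hypothetical registry mapping user-facing names to LLM factories.
MODEL_REGISTRY = {
    "gpt-4": lambda: ChatOpenAI(model="gpt-4"),
    "claude": lambda: ChatAnthropic(model="claude-3-5-sonnet-20241022"),
}

def init_llm(selection: str):
    """Resolve the user's selection to an initialized LLM service."""
    try:
        return MODEL_REGISTRY[selection]()
    except KeyError:
        raise ValueError(f"Unsupported model: {selection!r}")
```

Keeping the factories behind a registry means new providers can be added in one place without touching the chain-binding logic.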
4. Context Construction
- Chat Repository fetches previous messages using chat ID
- Retrieval Chain combines (see the sketch below):
- Retrieved document chunks
- Chat history
- Current user query
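A sketch of how the Retrieval Chain might assemble these three inputs into a prompt; the message format and the `build_messages` helper are hypothetical:

```python
def build_messages(chat_history: list[dict], retrieved_chunks, query: str) -> list[dict]:
    """Combine retrieved context, prior turns, and the new query into a
    chat-completion style message list (format is illustrative)."""
    # retrieved_chunks are assumed to be Qdrant hits carrying a text payload.
    context = "\n\n".join(hit.payload["text"] for hit in retrieved_chunks)
    messages = [{
        "role": "system",
        "content": "Answer using only the context below.\n\n" + context,
    }]
    # chat_history is assumed to be a list of {"role", "content"} dicts
    # fetched by the Chat Repository via the chat ID.
    messages.extend(chat_history)
    messages.append({"role": "user", "content": query})
    return messages
```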
5. Response Generation
- LLM generates a response that is streamed live to the frontend (see the sketch below)
- System logging via MongoDB Handler:
- Generated response
- Token-based cost (Cost Callback)
- Analytics and audit data
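A sketch of the generation step: tokens are yielded to the caller as they arrive, then the full response and token usage are logged. The MongoDB collection layout and the flat per-token price stand in for the real MongoDB Handler and Cost Callback:

```python
import datetime
from openai import OpenAI
from pymongo import MongoClient

mongo = MongoClient("mongodb://localhost:27017")  # assumed connection
log = mongo["rag_chat"]["chat_logs"]              # hypothetical collection

def generate(messages: list[dict], chat_id: str, model: str = "gpt-4",
             usd_per_1k_tokens: float = 0.01):
    client = OpenAI()
    stream = client.chat.completions.create(
        model=model,
        messages=messages,
        stream=True,
        stream_options={"include_usage": True},  # final chunk carries usage
    )
    parts, usage = [], None
    for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            token = chunk.choices[0].delta.content
            parts.append(token)
            yield token          # streamed live to the frontend
        if chunk.usage:          # set only on the final chunk
            usage = chunk.usage
    # Illustrative flat pricing; a real Cost Callback would price
    # prompt and completion tokens separately per model.
    log.insert_one({
        "chat_id": chat_id,
        "response": "".join(parts),
        "total_tokens": usage.total_tokens if usage else None,
        "estimated_cost_usd": (usage.total_tokens / 1000 * usd_per_1k_tokens)
                              if usage else None,
        "ts": datetime.datetime.now(datetime.timezone.utc),
    })
```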
Architecture

[Diagram: RAG Chat Processing Flow]
Key Components
- Qdrant: High-performance vector database for embeddings
- Retrieval Chain: Context enrichment logic wrapper
- Chat Repository: Message history management
- Service Controller: Model routing and initialization
- Cost Callback: Token usage and pricing tracking
Ensure the Qdrant service is running and configured correctly. The same embedding model must be used for indexing and querying, or vector similarity scores will not be meaningful.
Troubleshooting
Chunks Not Retrieved from Qdrant
- Verify embeddings are correctly generated and stored
- Check Qdrant connection status and schema consistency
- Validate similarity threshold and filtering logic (see the diagnostic sketch below)
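A few qdrant-client calls that help isolate empty retrievals; the collection name is the same illustrative one used above:

```python
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")

# 1. Is the collection there, and does its vector config match the embedder?
info = client.get_collection("rag_chat_docs")
print(info.config.params.vectors)  # size/distance must match indexing

# 2. Were any points actually stored?
print(client.count(collection_name="rag_chat_docs", exact=True))

# 3. Retry the query without filters or score thresholds to isolate the issue.
query_vector = [0.0] * 1536  # replace with a real embedded query
hits = client.search(collection_name="rag_chat_docs",
                     query_vector=query_vector, limit=5)
print([(h.id, h.score) for h in hits])
```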
Streaming Performance Issues
- Monitor token usage and LLM latency (see the timing sketch below)
- Validate streaming infrastructure (WebSocket/EventStream) status
- Check for large document sizes causing retrieval delays
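A small stdlib-only sketch for measuring time-to-first-token and total latency around any token stream, such as the `generate` sketch above; the reporting format is illustrative:

```python
import time

def profile_stream(stream):
    """Wrap a token stream and report time-to-first-token and totals."""
    start = time.perf_counter()
    first = None
    tokens = 0
    for token in stream:
        if first is None:
            first = time.perf_counter() - start  # time-to-first-token
        tokens += 1
        yield token
    total = time.perf_counter() - start
    print(f"TTFT: {first:.2f}s, total: {total:.2f}s, tokens: {tokens}")
```

A high time-to-first-token with a normal total usually points at retrieval or prompt-construction delays rather than the LLM itself.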