RAG Chat integrates document understanding with conversational AI:
  • Files are chunked, embedded, and stored in a vector database
  • Query responses are augmented with retrieved context for higher relevance
This guide walks through the RAG-based query resolution pipeline using FastAPI and Qdrant.

1. File Upload and Preprocessing

  1. User uploads a file through the frontend interface.
  2. The backend reads the file content and splits it into chunks.
  3. Each chunk is embedded using an embedding model to enable similarity search.
Developer Note:
  • Text chunking uses a tokenizer-aware splitter (e.g., recursive chunker or NLTK-based).
  • The embedding model can be from OpenAI, HuggingFace, etc., but it must match the model used at query time (see the sketch below).
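
The sketch below illustrates this flow under stated assumptions: a FastAPI upload endpoint, LangChain's RecursiveCharacterTextSplitter for chunking, and a HuggingFace sentence-transformers model for embeddings. The endpoint path, chunk sizes, and model name are illustrative, not the project's actual API.

```python
# Minimal sketch of the upload path (names and values are illustrative).
from fastapi import FastAPI, UploadFile
from langchain_text_splitters import RecursiveCharacterTextSplitter
from sentence_transformers import SentenceTransformer

app = FastAPI()
splitter = RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50)
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # must be the same model at query time

@app.post("/upload")
async def upload(file: UploadFile):
    # 1) Read the uploaded file, 2) split it into chunks, 3) embed each chunk.
    text = (await file.read()).decode("utf-8", errors="ignore")
    chunks = splitter.split_text(text)
    vectors = embedder.encode(chunks)  # one vector per chunk, ready for indexing
    # Storage in Qdrant is covered in the next step.
    return {"filename": file.filename, "chunks": len(chunks)}
```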

2. Storage and Retrieval

  • The embedded chunks are stored in Qdrant, a high-performance vector database.
  • Upon receiving a query, similar chunks are retrieved based on semantic similarity.
Developer Note:
  • Embeddings are stored in a Qdrant collection identified by collection_name.
  • Queries perform a top_k similarity search with optional filtering (e.g., by file tag or other metadata), as in the sketch below.
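
A minimal sketch with qdrant-client, reusing chunks, vectors, and embedder from the upload sketch above; the collection name, vector size, payload fields, and top_k value are assumptions.

```python
# Hypothetical Qdrant setup, indexing, and top_k retrieval (names are assumed).
from qdrant_client import QdrantClient
from qdrant_client.models import (Distance, FieldCondition, Filter, MatchValue,
                                  PointStruct, VectorParams)

client = QdrantClient(url="http://localhost:6333")
collection_name = "rag_chunks"

# One-time setup: the vector size must equal the embedding model's output dimension.
client.recreate_collection(
    collection_name=collection_name,
    vectors_config=VectorParams(size=384, distance=Distance.COSINE),
)

# Indexing: store each chunk's vector together with its text and a file tag.
client.upsert(
    collection_name=collection_name,
    points=[
        PointStruct(id=i, vector=vec.tolist(),
                    payload={"text": chunk, "file": "report.pdf"})
        for i, (chunk, vec) in enumerate(zip(chunks, vectors))
    ],
)

# Query time: top_k similarity search, optionally filtered by metadata.
hits = client.search(
    collection_name=collection_name,
    query_vector=embedder.encode("What does the report conclude?").tolist(),
    limit=5,  # top_k
    query_filter=Filter(must=[FieldCondition(key="file",
                                             match=MatchValue(value="report.pdf"))]),
)
retrieved_chunks = [hit.payload["text"] for hit in hits]
```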

3. Model Selection and Initialization

  1. The Service Controller determines the model to use based on user selection.
  2. It then initializes the appropriate LLM service, ready to process contextual input.
Developer Note:
  • Controller handles routing to LLMs like GPT-4, Claude, etc.
  • The model is dynamically bound to the Retrieval Chain after initialization (a routing sketch follows).
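
A sketch of how such a controller might resolve the user's selection to an initialized LLM client; the registry, wrapper class, and provider call stubs are hypothetical placeholders, not the project's actual Service Controller.

```python
# Hypothetical model routing; registry keys and wrappers are illustrative.
from dataclasses import dataclass
from typing import Callable

@dataclass
class LLMService:
    name: str
    generate: Callable[[str], str]  # prompt in, completion out

def _call_gpt4(prompt: str) -> str:
    raise NotImplementedError("Call the OpenAI API here.")

def _call_claude(prompt: str) -> str:
    raise NotImplementedError("Call the Anthropic API here.")

MODEL_REGISTRY = {
    "gpt-4": lambda: LLMService("gpt-4", _call_gpt4),
    "claude": lambda: LLMService("claude", _call_claude),
}

def init_llm_service(user_selection: str) -> LLMService:
    """Resolve the user's model selection and initialize the matching service."""
    if user_selection not in MODEL_REGISTRY:
        raise ValueError(f"Unsupported model: {user_selection}")
    return MODEL_REGISTRY[user_selection]()
```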

4. Chat History and Retrieval Chain

  • The Chat Repository fetches previous messages using the provided chat ID.
  • A retrieval chain is constructed, combining:
    • Retrieved document chunks
    • Chat history
    • The current user query
  • This composite input serves as the prompt to the LLM.
Developer Note:
  • Retrieval Chain wraps logic for context enrichment.
  • Prompts are generated from templates, formatted in a “history + context + question” order (see the sketch below).
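
A minimal sketch of that prompt assembly; the template wording and message format are assumptions.

```python
# Hypothetical prompt assembly in "history + context + question" order.
PROMPT_TEMPLATE = """Use the context below to answer the question.

Chat history:
{history}

Context:
{context}

Question:
{question}
"""

def build_prompt(history: list[dict], context_chunks: list[str], question: str) -> str:
    history_text = "\n".join(f"{m['role']}: {m['content']}" for m in history)
    context_text = "\n\n".join(context_chunks)
    return PROMPT_TEMPLATE.format(history=history_text,
                                  context=context_text,
                                  question=question)
```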

5. Response Streaming and Storage

  1. The LLM generates a response, which is streamed live to the frontend.
  2. The system logs:
    • The generated response
    • The token-based cost, computed using the Cost Callback
  3. All data is stored via the MongoDB Handler.
Developer Note:
  • Token cost calculation is triggered post-inference, using the prompt and completion lengths.
  • MongoDB entries are structured for analytics, traceability, and auditing; a streaming-and-logging sketch follows.
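
A sketch of streaming the completion while logging the response and cost afterward, reusing build_prompt from the previous sketch. The stub LLM stream, the per-token prices, the Mongo database and collection names, and the whitespace-split token counts are all assumptions for illustration.

```python
# Hypothetical streaming endpoint with post-inference cost logging.
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pymongo import MongoClient

app = FastAPI()
messages = MongoClient("mongodb://localhost:27017")["rag_chat"]["messages"]

async def stream_llm(prompt: str):
    """Placeholder for the real LLM streaming call; yields tokens as they arrive."""
    for token in ["This ", "is ", "a ", "stub ", "answer."]:
        yield token

def estimate_cost(prompt_tokens: int, completion_tokens: int) -> float:
    # Illustrative per-1K-token rates; substitute the provider's real pricing.
    return prompt_tokens / 1000 * 0.03 + completion_tokens / 1000 * 0.06

@app.post("/chat/{chat_id}")
async def chat(chat_id: str, query: str):
    # History and retrieved chunks would come from the Chat Repository and Qdrant
    # (previous steps); empty placeholders keep this sketch self-contained.
    prompt = build_prompt(history=[], context_chunks=[], question=query)

    async def stream():
        tokens = []
        async for token in stream_llm(prompt):
            tokens.append(token)
            yield token  # streamed live to the frontend
        # Post-inference: log the response and its token-based cost.
        messages.insert_one({
            "chat_id": chat_id,
            "query": query,
            "response": "".join(tokens),
            "cost_usd": estimate_cost(len(prompt.split()), len(tokens)),
        })

    return StreamingResponse(stream(), media_type="text/plain")
```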

6. Architecture Diagram

[Figure: LLM Query Flow Diagram]
Make sure your Qdrant service is running and configured correctly. The embedding model used at query time must be the same one used during indexing; otherwise, vector similarity search will not return meaningful results. A quick dimension check is sketched below.
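
A quick check, assuming the client and embedder from the sketches above; the exact attribute path on the collection info object may differ between qdrant-client versions.

```python
# Sanity check: the collection's vector size must equal the embedding dimension.
info = client.get_collection("rag_chunks")
expected = info.config.params.vectors.size             # size configured at indexing time
actual = embedder.get_sentence_embedding_dimension()   # size produced at query time
assert expected == actual, f"Collection expects {expected}-d vectors, model produces {actual}-d."
```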

Troubleshooting