Overview

Blog Engine is a multi-service application for planning and generating content. It combines a React frontend, a NestJS API, and a FastAPI service to support article ideation, content drafting, site audits, and content review workflows.

What it does (Capabilities)

  • Article and project management: Create and manage projects and articles, including titles, keywords, and outlines.
  • Rich text editor: BlockNote-based editor module with toolbars, focus mode, sidebars, checks, and share/export options.
  • SEO audit workflow: Chunked, real-time progress updates for site audits and final report handling.
  • Sitemap discovery and analysis: Resolve sitemap.xml or sitemap_index.xml, discover URLs, and analyze pages in batches.
  • Web scraping for references: Google Custom Search + Playwright scraping to gather reference URLs and content signals.
  • AI summaries and content: Generate content via OpenAI, Gemini, and Claude with references appended.
  • OCR extract: Extract text from PDF/DOCX/DOC (DOC via LibreOffice conversion) for content ingestion.
  • Auth and rate limiting: JWT-based guards in the Node API with throttling and security middleware.
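
To illustrate the sitemap discovery capability, the following is a minimal sketch of how a fetched sitemap document could be classified as either a sitemap index or a URL set. It uses only the Python standard library and the namespace defined by the sitemaps.org protocol. The function name parse_sitemap and the sample XML are hypothetical and not taken from the Blog Engine codebase.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index" or "urlset", list of <loc> values) for a sitemap document.

    Hypothetical helper for illustration; not the service's actual code.
    """
    root = ET.fromstring(xml_text)
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        kind, child = "index", "sitemap"   # entries point to other sitemaps
    elif root.tag == f"{SITEMAP_NS}urlset":
        kind, child = "urlset", "url"      # entries point to pages
    else:
        raise ValueError(f"not a sitemap document: {root.tag}")
    locs = [
        loc.text.strip()
        for loc in root.iterfind(f"{SITEMAP_NS}{child}/{SITEMAP_NS}loc")
        if loc.text
    ]
    return kind, locs


# Sample documents (made up for this example).
INDEX_XML = """<sitemapindex xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <sitemap><loc>https://example.com/sitemap-posts.xml</loc></sitemap>
</sitemapindex>"""

URLSET_XML = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/blog/first-post</loc></url>
  <url><loc>https://example.com/about</loc></url>
</urlset>"""
```

An "index" result would typically mean each returned location is itself a sitemap to fetch and parse in turn, while a "urlset" result yields page URLs ready for batch analysis.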

User flows

  • Plan content: Define a project, keywords, and titles; optionally fetch suggested titles and outlines.
  • Draft & edit: Use the editor to draft content with toolbar options and writing checks.
  • Reference & cite: Pull reference links via scraping; add citations and review sources.
  • Audit site: Run site audits and monitor progress in real time; view the final audit report once complete.
  • Export & share: Export or share content from the editor with role-based controls.

Architecture

  • Frontend (React + Vite): The UI in frontend/ provides project/article pages, the editor module (src/modules/editor/), and utilities. It communicates with the Node API using a configurable base URL.
  • Node API (NestJS): The main API in node/ exposes domain modules for projects, articles, prompts, and streaming updates (SSE). It uses MongoDB via Mongoose and applies Helmet, CORS, and rate limiting.
  • Python API (FastAPI): The service in backend_python/ handles scraping, sitemap analysis, AI summarization, OCR, and the SEO Audit router with chunked updates.
  • Data store: MongoDB via Mongoose/Motor stores projects, articles, prompts, guidelines, and audit records.
  • External services: OpenAI, Gemini, Claude, Google Custom Search; Playwright for page rendering.

Technical design

  • Modules & features (Node):
    • projects, articles, article-documents, guidelines, system-prompts, prompt-types, sse, webhooks, openai, claude, gemini.
    • Global prefix seo-content-api; Helmet, CORS, throttling; JWT guards.
  • Pipelines (Python):
    • SEO audit: Single endpoint streams progress; final report stored with status and steps.
    • Sitemap analysis: Discover → batch analyze with concurrency limits → attach content types.
    • Reference scraping: Google Custom Search → Playwright extraction → clean content → dedupe → append citations.
    • Summarization: Generate content via OpenAI/Gemini/Claude; return with references, optionally via webhooks to the Node API.
  • Session/state: Frontend uses zustand for state and react-query for fetching; Node uses JWT guards. No multi-tenant specifics stated.
  • Security & validation:
    • Helmet, CORS, throttling (@nestjs/throttler).
    • Validation in Python endpoints for ObjectId formats and input shapes; global exception handling in Node.
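
The sitemap analysis pipeline above (discover, then batch analyze with concurrency limits) can be sketched with asyncio.Semaphore. This is an illustrative sketch under stated assumptions, not the service's implementation: analyze_in_batches, fake_analyze, the default batch size, the concurrency limit, and the content-type rule are all hypothetical.

```python
import asyncio
from typing import Awaitable, Callable


async def analyze_in_batches(
    urls: list[str],
    analyze: Callable[[str], Awaitable[dict]],
    batch_size: int = 10,        # assumed default, tune per deployment
    max_concurrency: int = 3,    # assumed default, tune per deployment
) -> list[dict]:
    """Analyze URLs batch by batch, with at most max_concurrency calls in flight."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> dict:
        async with semaphore:
            try:
                return await analyze(url)
            except Exception as exc:
                # Keep one failing page from aborting the whole batch.
                return {"url": url, "error": str(exc)}

    results: list[dict] = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start:start + batch_size]
        # gather() returns results in the same order as its inputs.
        results.extend(await asyncio.gather(*(guarded(u) for u in batch)))
    return results


async def fake_analyze(url: str) -> dict:
    """Stand-in for real page analysis (hypothetical); performs no network access."""
    await asyncio.sleep(0)
    content_type = "article" if "/blog/" in url else "page"
    return {"url": url, "content_type": content_type}
```

Lowering max_concurrency or batch_size trades throughput for lower memory pressure and fewer timeouts, which is the tuning knob the pipeline description refers to.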

API reference (high level)

  • Node API (NestJS, prefixed with seo-content-api):
    • Projects, Articles, Article Documents, Guidelines, System Prompts, Prompt Types, SSE, Webhooks, AI providers (OpenAI/Claude/Gemini)
    • Auth: JWT-based guards; rate limiting enabled.
    • Exact routes: Not listed in this overview.
  • Python API (FastAPI):
    • GET /health: Service/Mongo health.
    • POST /company-business-summary: Summarize company details.
    • POST /target-audience: Generate target audience from details.
    • POST /generate-outline: Generate content preview/outline for an article by ID.
    • POST /fetch-sitemaps: Locate sitemap and summarize counts.
    • POST /sitemap: Discover URLs and analyze in batches; returns annotated results.
    • POST /get-titles: Generate SEO titles per keyword and prompt types; similarity checks.
    • POST /check-title: Check if a title already exists within a project.
    • POST /file-ocr: Extract text from documents (PDF/DOCX/DOC).
    • router /seo-audit/*: SEO Audit endpoints (progress and final report).
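
Because the Node API sits behind the global prefix seo-content-api, client code has to join the base URL, the prefix, and the route consistently. The sketch below shows one way to do that without doubled or missing slashes. The helper name node_api_url, the port 3000, and the routes "projects" and "articles/123" are assumptions for illustration, and the sketch assumes the configured base URL does not already include the prefix.

```python
from urllib.parse import urlsplit, urlunsplit

GLOBAL_PREFIX = "seo-content-api"  # global prefix named in this document


def node_api_url(base_url: str, route: str) -> str:
    """Join base URL, global prefix, and route into one normalized URL.

    Hypothetical helper; assumes base_url does not already contain the prefix.
    """
    parts = urlsplit(base_url)
    # Strip surrounding slashes from every segment, then drop empty ones.
    segments = [seg.strip("/") for seg in (parts.path, GLOBAL_PREFIX, route)]
    path = "/" + "/".join(seg for seg in segments if seg)
    return urlunsplit((parts.scheme, parts.netloc, path, "", ""))
```

For example, a base URL of http://localhost:3000/ and a route of /projects would resolve to http://localhost:3000/seo-content-api/projects.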

Data and schemas (conceptual)

  • Project: Name, website URL, language, location, targeted audience, guideline reference, additional brand fields.
  • Article: Name (title), keywords, secondary keywords, project reference, generated outline, scraped content metadata.
  • Guideline/System Prompt/Prompt Type: Stored prompt texts and relationships used for content generation.
  • SEO Audit: Status, current step, progress steps, final audit_report, error message.
  • Users/Auth: JWT-based authentication with single-user assumptions; no multi-tenant fields are stated.
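
As a conceptual illustration of the SEO Audit record described above, the following sketch models its fields as a Python dataclass. Only progress_steps and audit_report are field names that appear in this document. The names status, current_step, and error_message, the status values, and the helper methods are assumptions and may differ from the stored schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SeoAuditRecord:
    """Conceptual shape of an audit record (hypothetical names and status values)."""

    status: str = "in_progress"
    current_step: Optional[str] = None
    progress_steps: list[str] = field(default_factory=list)
    audit_report: Optional[dict[str, Any]] = None
    error_message: Optional[str] = None

    def add_step(self, step: str) -> None:
        # Record every step so the UI can show granular progress.
        self.current_step = step
        self.progress_steps.append(step)

    def complete(self, report: dict[str, Any]) -> None:
        self.status = "completed"
        self.audit_report = report

    def fail(self, message: str) -> None:
        self.status = "failed"
        self.error_message = message
```

Keeping the full list of steps alongside the current step lets a frontend render both a live indicator and a history once the audit finishes or fails.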

Troubleshooting

  • Audit Issues
Audit stuck in progress
  • Symptoms:
    • Progress stalls at a step
    • No final report stored
    • Frontend shows in-progress indefinitely
  • Common causes:
    • Timeouts during chunked processing
    • Network/stream parsing errors
    • Background scheduler not updating failed audits
  • Solutions:
    • Check scheduler logs; confirm stuck audits are marked failed
    • Increase timeout limits or reduce chunk sizes, if applicable
    • Verify router integration under /seo-audit

Audit progress missing steps
  • Symptoms:
    • Only final status appears
    • No progress_steps saved
    • UI lacks granular updates
  • Common causes:
    • Parser ignoring non-JSON lines
    • Stream buffering
    • Schema changes not reflected in code
  • Solutions:
    • Ensure each progress line is appended to progress_steps
    • Flush handlers more frequently
    • Re-run migration

Final audit report lost
  • Symptoms:
    • Final JSON lost
    • DB record lacks audit_report
    • Error message populated
  • Common causes:
    • JSON framing after “Done.” not parsed
    • DB validation error
    • Connectivity to Mongo
  • Solutions:
    • Handle the JSON that follows “Done.” explicitly
    • Validate schema fields before insert/update
    • Verify database connectivity and indexes
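
Several of the audit issues above come down to how the progress stream is parsed. The sketch below shows one possible parser under an assumed wire format: one progress update per line (plain text or JSON), a line containing only Done. as the end marker, and the final report as JSON on the lines that follow. The exact format of the real stream is not specified in this document, and parse_audit_stream is a hypothetical name.

```python
import json
from typing import Any, Iterable, Optional


def parse_audit_stream(
    lines: Iterable[str],
) -> tuple[list[str], Optional[dict[str, Any]], Optional[str]]:
    """Split a streamed audit into (progress_steps, audit_report, error).

    Assumed format: progress lines, then "Done.", then the final report JSON.
    """
    progress_steps: list[str] = []
    report_chunks: list[str] = []
    done_seen = False
    for raw in lines:
        line = raw.strip()
        if not line:
            continue
        if done_seen:
            # Everything after the marker belongs to the final report.
            report_chunks.append(line)
        elif line == "Done.":
            done_seen = True
        else:
            # Keep every progress line, including non-JSON ones.
            progress_steps.append(line)
    if not done_seen:
        return progress_steps, None, "stream ended before 'Done.'"
    try:
        report = json.loads("\n".join(report_chunks))
    except json.JSONDecodeError as exc:
        return progress_steps, None, f"final report is not valid JSON: {exc}"
    return progress_steps, report, None
```

Returning an explicit error instead of raising makes it straightforward to mark the audit as failed and populate the error message, rather than leaving the record in progress indefinitely.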

Tech Stack

| Layer | Technologies | Purpose |
| --- | --- | --- |
| Frontend | React, Vite, React Router, React Query, Zustand, BlockNote, Tailwind | UI, editor, state, data fetching |
| Node API | NestJS, Mongoose, @nestjs/*, SSE, Helmet, CORS, Throttler | Core APIs, auth, rate limiting, streaming |
| Python API | FastAPI, Playwright, aiohttp, httpx, Crawl4AI | Scraping, sitemap analysis, SEO audit, AI orchestration |
| AI/LLM | OpenAI, Google Generative AI (Gemini), Anthropic (Claude), LangChain | Content generation and summarization |
| Storage | MongoDB (Mongoose + Motor) | Projects, articles, prompts, audits |
| Parsing/OCR | PyMuPDF, python-docx, LibreOffice CLI | Text extraction from documents |
| Utilities | date-fns, Winston logging | Formatting and logging |

Best practices

  • Validate inputs early: Enforce ObjectId and payload validation at the edge.
  • Control concurrency: Tune semaphore and batch sizes to avoid timeouts and memory pressure.
  • Truncate long contexts: Summarize scraped text before sending to LLMs to reduce cost.
  • Harden scraping: Block heavy assets, use timeouts and retries, and sanitize selectors.
  • Align API base/prefixes: Keep VITE_API_BASE_URL and Node SERVER_PORT consistent; be mindful of the global prefix.
  • Instrument and log: Use centralized logging and surface progress via SSE for UX and debugging.
  • Secure webhooks: Verify BASE_URL and auth tokens; capture non-2xx responses.
  • Version alignment: Keep BlockNote and Radix UI packages in sync to avoid UI issues.
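
As an example of validating inputs early, the sketch below checks that an identifier has the string form of a MongoDB ObjectId (24 hexadecimal characters) before any database call is made. The function names are hypothetical, and a real endpoint might instead rely on its framework's validation layer or on bson.ObjectId.is_valid from PyMongo.

```python
import re

# A MongoDB ObjectId is 12 bytes, written as 24 hexadecimal characters.
_OBJECT_ID_RE = re.compile(r"[0-9a-fA-F]{24}")


def is_valid_object_id(value: object) -> bool:
    """Return True if value is a 24-character hex string."""
    return isinstance(value, str) and _OBJECT_ID_RE.fullmatch(value) is not None


def require_object_id(value: object, field_name: str = "id") -> str:
    """Return the value unchanged, or raise ValueError naming the bad field."""
    if not is_valid_object_id(value):
        raise ValueError(f"{field_name} must be a 24-character hex ObjectId")
    return value  # type: ignore[return-value]
```

Rejecting malformed identifiers at the edge keeps invalid requests from reaching the database layer and produces clearer error messages for API clients.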