Overview
Blog Engine is a multi-service application for planning and generating content. It combines a React frontend, a NestJS API, and a FastAPI service to support article ideation, content drafting, site audits, and content review workflows.
What it does (Capabilities)
- Article and project management: Create and manage projects and articles, including titles, keywords, and outlines.
- Rich text editor: BlockNote-based editor module with toolbars, focus mode, sidebars, checks, and share/export options.
- SEO audit workflow: Chunked, real-time progress updates for site audits and final report handling.
- Sitemap discovery and analysis: Resolve `sitemap.xml` or `sitemap_index.xml`, discover URLs, and analyze pages in batches.
- Web scraping for references: Google Custom Search + Playwright scraping to gather reference URLs and content signals.
- AI summaries and content: Generate content via OpenAI, Gemini, and Claude with references appended.
- OCR extract: Extract text from PDF/DOCX/DOC (DOC via LibreOffice conversion) for content ingestion.
- Auth and rate limiting: JWT-based guards in the Node API with throttling and security middleware.
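The sitemap discovery capability has to tell a sitemap index (which points at other sitemaps) apart from a URL set (which lists pages). The sketch below shows one way to do that with the standard library, assuming documents follow the sitemaps.org namespace. The function name `parse_sitemap` is hypothetical and is not taken from the project code.

```python
import xml.etree.ElementTree as ET

# Namespace used by the sitemaps.org protocol.
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"


def parse_sitemap(xml_text: str) -> tuple[str, list[str]]:
    """Return ("index" or "urlset", list of <loc> values) for a sitemap document.

    Hypothetical helper for illustration only.
    """
    root = ET.fromstring(xml_text)
    if root.tag == f"{SITEMAP_NS}sitemapindex":
        kind = "index"   # entries are child sitemaps to fetch next
    elif root.tag == f"{SITEMAP_NS}urlset":
        kind = "urlset"  # entries are page URLs to analyze
    else:
        raise ValueError(f"Not a sitemap document: {root.tag}")
    # Both document types store their targets in <loc> elements.
    locs = [el.text.strip() for el in root.iter(f"{SITEMAP_NS}loc") if el.text]
    return kind, locs
```

A caller would typically recurse into each `loc` when the kind is `"index"` and queue the `loc` values for batch analysis when it is `"urlset"`.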
User flows
- Plan content: Define a project, keywords, and titles; optionally fetch suggested titles and outlines.
- Draft & edit: Use the editor to draft content with toolbar options and writing checks.
- Reference & cite: Pull reference links via scraping; add citations and review sources.
- Audit site: Run site audits and monitor progress in real time; view the final audit report once complete.
- Export & share: Export or share content from the editor with role-based controls.
Architecture
- Frontend (React + Vite): The UI in `frontend/` provides project/article pages, the editor module (`src/modules/editor/`), and utilities. It communicates with the Node API using a configurable base URL.
- Node API (NestJS): The main API in `node/` exposes domain modules for projects, articles, prompts, and streaming updates (SSE). It uses MongoDB via Mongoose and applies Helmet, CORS, and rate limiting.
- Python API (FastAPI): The service in `backend_python/` handles scraping, sitemap analysis, AI summarization, OCR, and the SEO Audit router with chunked updates.
- Data store: MongoDB via Mongoose/Motor stores projects, articles, prompts, guidelines, and audit records.
- External services: OpenAI, Gemini, Claude, Google Custom Search; Playwright for page rendering.
Technical design
- Modules & features (Node): `projects`, `articles`, `article-documents`, `guidelines`, `system-prompts`, `prompt-types`, `sse`, `webhooks`, `openai`, `claude`, `gemini`.
  - Global prefix `seo-content-api`; Helmet, CORS, throttling; JWT guards.
- Pipelines (Python):
  - SEO audit: Single endpoint streams progress; final report stored with status and steps.
  - Sitemap analysis: Discover → batch analyze with concurrency limits → attach content types.
  - Reference scraping: Google Custom Search → Playwright extraction → clean content → dedupe → append citations.
  - Summarization: Generate content via OpenAI/Gemini/Claude; return with references, optionally via webhooks to the Node API.
- Session/state: Frontend uses `zustand` for state and `react-query` for fetching; Node uses JWT guards. No multi-tenant specifics stated.
- Security & validation:
  - Helmet, CORS, throttling (`@nestjs/throttler`).
  - Validation in Python endpoints for ObjectId formats and input shapes; global exception handling in Node.
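As an illustration of the ObjectId format validation mentioned above, here is a minimal sketch that checks the 24-character hexadecimal string form of a MongoDB ObjectId. The helper name `is_valid_object_id` is hypothetical. The project may instead rely on `bson.ObjectId.is_valid` from PyMongo, which also performs this kind of check.

```python
import re

# A MongoDB ObjectId is 12 bytes, rendered as 24 hexadecimal characters.
_OBJECT_ID_RE = re.compile(r"[0-9a-fA-F]{24}")


def is_valid_object_id(value: object) -> bool:
    """Return True if value is a string in ObjectId hex format (hypothetical helper)."""
    return isinstance(value, str) and _OBJECT_ID_RE.fullmatch(value) is not None
```

Rejecting malformed IDs at the endpoint boundary lets the service return a clear client error before any database query runs.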
API reference (high level)
- Node API (NestJS, prefixed with `seo-content-api`):
  - Projects, Articles, Article Documents, Guidelines, System Prompts, Prompt Types, SSE, Webhooks, AI providers (OpenAI/Claude/Gemini)
  - Auth: JWT-based guards; rate limiting enabled.
  - Exact routes: Not applicable.
- Python API (FastAPI):
  - `GET /health`: Service/Mongo health.
  - `POST /company-business-summary`: Summarize company details.
  - `POST /target-audience`: Generate target audience from details.
  - `POST /generate-outline`: Generate content preview/outline for an article by ID.
  - `POST /fetch-sitemaps`: Locate sitemap and summarize counts.
  - `POST /sitemap`: Discover URLs and analyze in batches; returns annotated results.
  - `POST /get-titles`: Generate SEO titles per keyword and prompt types; similarity checks.
  - `POST /check-title`: Check if a title already exists within a project.
  - `POST /file-ocr`: Extract text from documents (PDF/DOCX/DOC).
  - Router `/seo-audit/*`: SEO Audit endpoints (progress and final report).
Data and schemas (conceptual)
- Project: Name, website URL, language, location, targeted audience, guideline reference, additional brand fields.
- Article: Name (title), keywords, secondary keywords, project reference, generated outline, scraped content metadata.
- Guideline/System Prompt/Prompt Type: Stored prompt texts and relationships used for content generation.
- SEO Audit: Status, current step, progress steps, final `audit_report`, error message.
- Users/Auth: JWT-only single-user assumptions; no multi-tenant fields stated.
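To make the conceptual SEO Audit record more concrete, the following sketch models it as a Python dataclass. Only `audit_report` and `progress_steps` are names that appear in this document. The class name, the other field names, and the status values are assumptions made for illustration and may not match the stored schema.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class SeoAuditRecord:
    """Illustrative shape of an audit record; field names are partly assumed."""

    status: str = "in_progress"  # assumed values: in_progress, completed, failed
    current_step: Optional[str] = None
    progress_steps: list[str] = field(default_factory=list)
    audit_report: Optional[dict[str, Any]] = None
    error_message: Optional[str] = None

    def add_step(self, step: str) -> None:
        """Record a progress step and mark it as the current one."""
        self.progress_steps.append(step)
        self.current_step = step

    def complete(self, report: dict[str, Any]) -> None:
        """Attach the final report and mark the audit as completed."""
        self.audit_report = report
        self.status = "completed"
```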
Troubleshooting
- Audit Issues
- No reference URLs returned
- Sitemap & Crawling
- Editor & Real-time
- API & Webhooks
Failed to create interview
Symptoms
- Progress stalls at a step
- No final report stored
- Frontend shows in-progress indefinitely
Possible causes
- Timeouts during chunked processing
- Network/stream parsing errors
- Background scheduler not updating failed audits
Fixes
- Check scheduler logs; confirm stuck audits are marked failed
- Reduce timeouts or chunk sizes if applicable
- Verify router integration under `/seo-audit`
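The scheduler check above amounts to finding audits that are still in progress but have not been updated for too long, then marking them failed. The sketch below shows that logic on plain dictionaries. The field names `status`, `updated_at`, and `error_message`, the status values, and the function name are assumptions. A real scheduler would run an equivalent update against MongoDB.

```python
from datetime import datetime, timedelta, timezone


def mark_stuck_audits(audits: list[dict], now: datetime, max_age: timedelta) -> list[dict]:
    """Mark in-progress audits not updated within max_age as failed.

    Mutates the matching records in place and returns them.
    Hypothetical helper; field names are assumed.
    """
    changed = []
    for audit in audits:
        if audit["status"] == "in_progress" and now - audit["updated_at"] > max_age:
            audit["status"] = "failed"
            audit["error_message"] = "Audit timed out without a final report"
            changed.append(audit)
    return changed
```

If this returns an empty list while the frontend still shows an audit as in progress, the threshold or the timestamp updates are worth checking first.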
Audit progress missing steps
Symptoms
- Only final status appears
- No `progress_steps` saved
- UI lacks granular updates
Possible causes
- Parser ignoring non-JSON lines
- Stream buffering
- Schema changes not reflected in code
Fixes
- Ensure each progress line is appended to `progress_steps`
- Flush handlers more frequently
- Re-run migration
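One common reason for missing steps is that network chunks do not align with line boundaries, so a parser that treats each chunk as a line drops or mangles progress messages. The sketch below reassembles lines from arbitrary chunks and keeps every non-empty line, JSON or not. It assumes newline-delimited progress output. The function name `collect_progress` is hypothetical.

```python
from typing import Iterable


def collect_progress(chunks: Iterable[str]) -> list[str]:
    """Reassemble newline-delimited lines from arbitrary chunks.

    Returns every non-empty line in order, ready to append to progress_steps.
    Hypothetical helper; assumes one progress message per line.
    """
    steps: list[str] = []
    buffer = ""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; the rest stays buffered.
        *complete, buffer = buffer.split("\n")
        for line in complete:
            line = line.strip()
            if line:
                steps.append(line)
    # Flush a trailing line that arrived without a final newline.
    if buffer.strip():
        steps.append(buffer.strip())
    return steps
```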
Audit reports fail to save
Symptoms
- Final JSON lost
- DB record lacks `audit_report`
- Error message populated
Possible causes
- JSON framing after “Done.” not parsed
- DB validation error
- Connectivity to Mongo
Fixes
- Handle the post-“Done.” JSON explicitly
- Validate schema fields before insert/update
- Verify database connectivity and indexes
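Handling the JSON that follows "Done." explicitly could look like the sketch below. It assumes the marker arrives on its own line and that everything after it is a single JSON document. That framing is an assumption about the stream format, and `split_final_report` is a hypothetical name.

```python
import json
from typing import Any, Optional


def split_final_report(
    stream_text: str, marker: str = "Done."
) -> tuple[list[str], Optional[Any]]:
    """Split stream output into progress lines and the JSON report after the marker.

    Returns (progress_lines, report); report is None if the marker is missing
    or the trailing text is not valid JSON. Hypothetical helper.
    """
    lines = stream_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == marker:
            progress = [ln.strip() for ln in lines[: i + 1] if ln.strip()]
            tail = "\n".join(lines[i + 1 :]).strip()
            if not tail:
                return progress, None
            try:
                return progress, json.loads(tail)
            except json.JSONDecodeError:
                return progress, None
    # Marker never seen: everything is progress output and there is no report.
    return [ln.strip() for ln in lines if ln.strip()], None
```

Returning `None` instead of raising lets the caller record an error message on the audit while still saving the progress lines that did arrive.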
Tech Stack
| Layer | Technologies | Purpose |
|---|---|---|
| Frontend | React, Vite, React Router, React Query, Zustand, BlockNote, Tailwind | UI, editor, state, data fetching |
| Node API | NestJS, Mongoose, `@nestjs/*`, SSE, Helmet, CORS, Throttler | Core APIs, auth, rate limiting, streaming |
| Python API | FastAPI, Playwright, aiohttp, httpx, Crawl4AI | Scraping, sitemap analysis, SEO audit, AI orchestration |
| AI/LLM | OpenAI, Google Generative AI (Gemini), Anthropic (Claude), LangChain | Content generation and summarization |
| Storage | MongoDB (Mongoose + Motor) | Projects, articles, prompts, audits |
| Parsing/OCR | PyMuPDF, python-docx, LibreOffice CLI | Text extraction from documents |
| Utilities | date-fns, Winston logging | Formatting and logging |
Best practices
- Validate inputs early: Enforce ObjectId and payload validation at the edge.
- Control concurrency: Tune semaphore and batch sizes to avoid timeouts and memory pressure.
- Truncate long contexts: Summarize scraped text before sending to LLMs to reduce cost.
- Harden scraping: Block heavy assets, use timeouts and retries, and sanitize selectors.
- Align API base/prefixes: Keep `VITE_API_BASE_URL` and Node `SERVER_PORT` consistent; be mindful of the global prefix.
- Instrument and log: Use centralized logging and surface progress via SSE for UX and debugging.
- Secure webhooks: Verify `BASE_URL` and auth tokens; capture non-2xx responses.
- Version alignment: Keep BlockNote and Radix UI packages in sync to avoid UI issues.
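The concurrency advice above can be sketched with `asyncio.Semaphore`, processing URLs in fixed-size batches while capping how many analyses run at once. This is a minimal illustration, not the project's implementation. The function name, the `analyze` callback, and the default values are assumptions.

```python
import asyncio
from typing import Awaitable, Callable, Sequence, TypeVar

T = TypeVar("T")


async def analyze_in_batches(
    urls: Sequence[str],
    analyze: Callable[[str], Awaitable[T]],
    batch_size: int = 10,
    max_concurrency: int = 3,
) -> list[T]:
    """Run analyze(url) over urls in batches with a concurrency cap.

    Results are returned in the same order as urls. Hypothetical helper.
    """
    semaphore = asyncio.Semaphore(max_concurrency)

    async def guarded(url: str) -> T:
        # At most max_concurrency calls to analyze() are in flight at a time.
        async with semaphore:
            return await analyze(url)

    results: list[T] = []
    for start in range(0, len(urls), batch_size):
        batch = urls[start : start + batch_size]
        # gather preserves input order, so results line up with the batch.
        results.extend(await asyncio.gather(*(guarded(u) for u in batch)))
    return results
```

Smaller batches bound memory per step, while the semaphore bounds load on the target site and the browser pool. The two can be tuned independently.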