Content-Analyzer

The problem

Feeding raw HTML web pages directly into an LLM context is expensive and error-prone. Standard web pages are packed with boilerplate markup: navigation menus, site footers, sharing buttons, sidebar ads, tracking scripts, and styling rules. This noise can represent up to 80% of the raw characters, diluting the actual information and inflating your API costs.

To score sources quickly and accurately, you need an extraction engine that strips the markup hierarchy, isolates the core text body, and runs a fast structured evaluation to filter out irrelevant or low-quality articles before they reach your synthesis pipeline.

How it works: step by step

Step 1: URL Retrieval. The system accepts a target URL or raw text string. If a URL is provided, it fetches the page using standard HTTP request libraries, handling headers to mimic standard browser calls.
Step 2: HTML Cleaning and Noise Stripping. The BeautifulSoup parser walks the HTML tree and removes script tags, CSS styles, comment nodes, navigation elements, sidebars, and footer sections. It then reads the main body from container elements such as <article> and <main>.
Step 3: Token-Efficient Truncation. The extracted text is sliced, keeping the first 5000 characters. Slicing prevents long documents from exhausting API token limits during the initial assessment step.
Step 4: Structured Prompt Query. The cleaned text is sent to GPT-4o-mini. The prompt requests a JSON structure containing sentiment flags, relevance rankings, and summary sentences.
Step 5: Parse and Fallback. The response is validated as JSON. If it cannot be decoded, the analyzer falls back to empty values so that a malformed response does not crash the calling pipeline.
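
Steps 2 and 3 can be sketched in a few lines. The version below is a minimal, dependency-free illustration that uses Python's standard-library `html.parser` as a stand-in for the BeautifulSoup filters described above; it is not the code in src/extractor.py. The names `TextExtractor`, `clean_and_truncate`, `NOISE_TAGS`, and `MAIN_TAGS` are hypothetical, and the tag lists are assumptions.

```python
from html.parser import HTMLParser

# Assumed tag lists; the real extractor may strip or target different elements.
NOISE_TAGS = {"script", "style", "nav", "aside", "footer", "noscript"}
MAIN_TAGS = {"article", "main"}
MAX_CHARS = 5000  # slice boundary from Step 3


class TextExtractor(HTMLParser):
    """Collects visible text, skipping everything nested inside a noise tag."""

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self.noise_depth = 0
        self.main_depth = 0
        self.all_chunks: list[str] = []
        self.main_chunks: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag in NOISE_TAGS:
            self.noise_depth += 1
        elif tag in MAIN_TAGS:
            self.main_depth += 1

    def handle_endtag(self, tag):
        if tag in NOISE_TAGS:
            self.noise_depth = max(0, self.noise_depth - 1)
        elif tag in MAIN_TAGS:
            self.main_depth = max(0, self.main_depth - 1)

    def handle_data(self, data):
        if self.noise_depth:
            return  # inside <script>, <nav>, <footer>, ...
        self.all_chunks.append(data)
        if self.main_depth:
            self.main_chunks.append(data)

    # HTML comments are dropped because handle_comment is left as a no-op.


def clean_and_truncate(html: str, max_chars: int = MAX_CHARS) -> str:
    """Strip noise, prefer <article>/<main> text, and keep the first max_chars."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    # Fall back to all non-noise text when the page has no <article> or <main>.
    chunks = parser.main_chunks or parser.all_chunks
    text = " ".join(" ".join(chunks).split())  # collapse runs of whitespace
    return text[:max_chars]
```

Slicing by characters rather than tokens keeps this step free of any tokenizer dependency, at the cost of an approximate token budget.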

Structured JSON output schema

{
  "summary": "A single-sentence summary of the page content.",
  "key_insights": [
    "Insight bullet point 1",
    "Insight bullet point 2",
    "Insight bullet point 3"
  ],
  "sentiment": "positive | negative | neutral",
  "relevance_score": 8, // scale of 0-10
  "quality_score": 7,   // scale of 0-10
  "content_type": "news | analysis | tutorial | report | marketing"
}
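
The parse-and-fallback step can be sketched against this schema. The example below is an assumption about how such validation might look, not the code in src/analyzer.py; `empty_analysis`, `parse_analysis`, and the score-clamping behavior are hypothetical. Note that the `//` scale annotations in the schema above are documentation only: strict JSON has no comments, so a response containing them would fail to decode and trigger the fallback.

```python
import json


def empty_analysis() -> dict:
    """Fallback record with empty values for every field in the schema."""
    return {
        "summary": "",
        "key_insights": [],
        "sentiment": "",
        "relevance_score": 0,
        "quality_score": 0,
        "content_type": "",
    }


def _clamp_score(value) -> int:
    """Coerce a score onto the 0-10 scale; anything unusable becomes 0."""
    try:
        return max(0, min(10, int(value)))
    except (TypeError, ValueError, OverflowError):
        return 0


def parse_analysis(raw: str) -> dict:
    """Decode the model response, falling back to empty values on bad JSON."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return empty_analysis()
    if not isinstance(data, dict):
        return empty_analysis()

    result = empty_analysis()
    for key in result:
        if key in data:
            result[key] = data[key]  # unknown keys in the response are ignored

    result["relevance_score"] = _clamp_score(result["relevance_score"])
    result["quality_score"] = _clamp_score(result["quality_score"])
    if not isinstance(result["key_insights"], list):
        result["key_insights"] = []
    return result
```

Returning a complete record in every case means the caller can read any field without checking whether the analysis succeeded.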

File Architecture

src/extractor.py: Contains URL loading routines, User-Agent rotation parameters, BeautifulSoup filters for stripping markup noise, and character slice boundaries.
src/analyzer.py: Handles client calls to the OpenAI SDK, wraps inputs in a JSON schema template, and catches decoding errors.
demo/run.py: A command-line test script showing how to scrape a single webpage and print the resulting metadata block.

How to run it

git clone https://github.com/shubham0086/content-analyzer
cd content-analyzer
pip install -r requirements.txt
cp .env.example .env

# Run simple test parser
python demo/run.py

Where this fits

Content-Analyzer runs as a **data ingestion utility**. It acts as the core scoring node inside the Research Agent. By analyzing each candidate webpage beforehand, the Research Agent can filter out low-scoring or promotional links and pass only high-quality content summaries to the final synthesis models.
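
As an illustration of that filtering step, the sketch below keeps only candidates that clear a relevance and quality threshold and are not tagged as marketing. The function name `select_sources`, the threshold values, and the `url` key on each record are assumptions made for the example; the Research Agent's actual cutoffs are not documented here.

```python
def select_sources(
    analyses: list[dict],
    min_relevance: int = 6,   # assumed cutoff on the 0-10 scale
    min_quality: int = 5,     # assumed cutoff on the 0-10 scale
    excluded_types: tuple[str, ...] = ("marketing",),
) -> list[dict]:
    """Return the analyses worth passing on to the synthesis stage."""
    return [
        a
        for a in analyses
        if a.get("relevance_score", 0) >= min_relevance
        and a.get("quality_score", 0) >= min_quality
        and a.get("content_type", "") not in excluded_types
    ]


candidates = [
    {"url": "a", "relevance_score": 8, "quality_score": 7, "content_type": "analysis"},
    {"url": "b", "relevance_score": 9, "quality_score": 8, "content_type": "marketing"},
    {"url": "c", "relevance_score": 3, "quality_score": 9, "content_type": "news"},
]
kept = [a["url"] for a in select_sources(candidates)]
```

Here only candidate "a" survives: "b" is promotional and "c" falls below the relevance cutoff.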

Honest framing

This parser relies on CSS class identifiers to strip boilerplate, so it only sees the markup the server returns. Sites that render content entirely client-side with JavaScript (single-page applications without server-side pre-rendering) return nearly empty pages to a plain HTTP request. To scrape SPAs reliably, you would need a headless browser (such as Playwright), which adds significant memory and runtime overhead. For standard blog posts and news publications, the static scraper typically retrieves content in under 200ms.
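
One way to notice this failure mode is to check how much visible text a fetched page contains once scripts and tags are removed. The heuristic below is a rough sketch, not part of the repository: `looks_client_rendered` and the 200-character threshold are assumptions, and regex-based tag stripping is only good enough for a coarse signal, not for extraction.

```python
import re

# Remove <script>...</script> and <style>...</style> blocks, including contents.
_SCRIPT_STYLE = re.compile(r"<(script|style)\b.*?</\1\s*>", re.IGNORECASE | re.DOTALL)
# Remove any remaining tags, leaving only text between them.
_TAG = re.compile(r"<[^>]+>")


def looks_client_rendered(html: str, min_text_chars: int = 200) -> bool:
    """Return True when the static HTML carries almost no visible text."""
    without_code = _SCRIPT_STYLE.sub(" ", html)
    visible = " ".join(_TAG.sub(" ", without_code).split())
    return len(visible) < min_text_chars
```

A caller could use this signal to skip the page or to route it to a headless browser instead of scoring an empty body.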

Content Analysis Flowchart: Annotated Reference