Content-Analyzer

URL/text analysis engine for clean content extraction & scoring.

Track 02 · Data Prep. The text preprocessing utility built for the Research Agent. Content-Analyzer strips layout noise (navigation, sidebars, footers, scripts) from raw HTML pages, truncates text to fit token budgets, sends it to GPT-4o-mini, and parses the response into structured JSON scores. Extracted from production Agentic OS.

Open source github.com ↗
Track
Track 02 · Data Preprocessing & Parsing
Runtime
Python 3.10+ pip install
LLM Backend
GPT-4o-mini Structured JSON
Features
HTML Boilerplate Stripping Token Truncation Sentiment & Quality Scoring
Repository

Content Analysis Flowchart : Annotated Reference

Strip boilerplate markup, extract body content, truncate text, run structured GPT prompt, and write JSON output.

The problem

Feeding raw HTML web pages directly into an LLM context is expensive and error-prone. Standard web pages are packed with boilerplate markup: navigation menus, site footers, sharing buttons, sidebar ads, tracking scripts, and styling rules. This noise can represent up to 80% of the raw characters, diluting the actual information and inflating your API costs.

To score sources fast and accurately, you need an extraction engine that strips the markup hierarchy, isolates the core text body, and runs a fast structured evaluation to filter out irrelevant or low-quality articles before they reach your synthesis pipeline.

How it works: step by step

  • Step 1: URL Retrieval. The system accepts a target URL or raw text string. If a URL is provided, it fetches the page using standard HTTP request libraries, handling headers to mimic standard browser calls.
  • Step 2: HTML Clean & Noise Stripping. The BeautifulSoup parser scans the HTML structure. It strips script tags, CSS styles, comment nodes, navigation tags, sidebars, and footer sections. It targets selectors like <article> and <main> to extract raw content.
  • Step 3: Token-Efficient Truncation. The extracted text is sliced, keeping the first 5000 characters. Slicing prevents long documents from exhausting API token limits during the initial assessment step.
  • Step 4: Structured Prompt Query. The cleaned text is sent to GPT-4o-mini. The prompt requests a JSON structure containing sentiment flags, relevance rankings, and summary sentences.
  • Step 5: Parse and Fallback. The response is validated. If the JSON is corrupt, it falls back to empty values, protecting the caller from pipeline crashes.

Interactive: Content Quality Evaluator

Input raw text to see the structured JSON analysis output.

Input Text

Extracted JSON

{
  // Awaiting analysis...
}

Structured JSON output schema

{
  "summary": "A single-sentence summary of the page content.",
  "key_insights": [
    "Insight bullet point 1",
    "Insight bullet point 2",
    "Insight bullet point 3"
  ],
  "sentiment": "positive | negative | neutral",
  "relevance_score": 8, // scale of 0-10
  "quality_score": 7,   // scale of 0-10
  "content_type": "news | analysis | tutorial | report | marketing"
}

File Architecture

  • src/extractor.py: Contains URL loading routines, User-Agent rotation parameters, BeautifulSoup filters for stripping markup noise, and character slice boundaries.
  • src/analyzer.py: Handles client calls to the OpenAI SDK, wraps inputs in a JSON schema template, and catches decoding errors.
  • demo/run.py: A test command line script showing how to scrape a single webpage and print the resulting metadata block.

How to run it

git clone https://github.com/shubham0086/content-analyzer
cd content-analyzer
pip install -r requirements.txt
cp .env.example .env

# Run simple test parser
python demo/run.py

Where this fits

Content-Analyzer runs as a **data ingestion utility**. It acts as the core scoring node inside the Research Agent. By analyzing each webpage candidate beforehand, the Research Agent can filter out low-scoring or promotional links, only passing high-quality content summaries to the final synthesis models.

Honest framing

This parser relies on CSS class identifiers to strip boilerplate. Sites that render content entirely client-side via complex JavaScript (Single Page Applications without server-side pre-rendering) will return blank pages to simple requests. To scrape SPAs reliably, you would need to hook up a headless browser (like Playwright), which adds significant memory and runtime overhead. For standard blog posts and news publications, this static scraper retrieves contents in under 200ms.