Aggressively clean HTML for RAG pipelines and LLM ingestion 🧹✨
Transform bloated web pages into pure, semantic HTML perfect for embeddings, vector databases, and AI processing.
🚀 Used in Production: This library powers page-replica.com's structured scraping engine. See it in action with our live demo to extract and structure real web pages into clean JSON, Markdown, or HTML.
Modern web pages are cluttered with tracking scripts, analytics, styling, ads, and interactive elements that waste tokens and dilute semantic meaning when processing content for AI systems. This library strips away the noise to give you clean, meaningful HTML that:
- ✅ Reduces token count by 60-90% (fewer API costs)
- ✅ Improves embedding quality (less noise = better semantic search)
- ✅ Speeds up processing (smaller payloads = faster inference)
- ✅ Preserves structure (headings, paragraphs, links stay intact)
- ✅ Zero dependencies (pure JavaScript, no bloat)
npm install @page-replica/pure-html-for-ragconst { cleanHtml } = require("@page-replica/pure-html-for-rag");
const rawHtml = `
<html>
<head>
<title>Example</title>
<script src="tracker.js"></script>
</head>
<body>
<h1>Example page</h1>
<p style="color:red" onclick="alert('hi')">Hello world!</p>
<img src="hero.jpg" alt="" />
</body>
</html>
`;
const cleaned = cleanHtml(rawHtml);
// => "<html><head><title>Example</title></head><body><h1>Example page</h1><p>Hello world!</p></body></html>"Result: 189 characters → 105 characters (44% reduction)
Clean HTML before chunking and embedding web content for vector databases like Pinecone, Weaviate, or ChromaDB.
const { cleanHtml } = require("@page-replica/pure-html-for-rag");
const fetch = require("node-fetch");
async function indexWebPage(url) {
const response = await fetch(url);
const html = await response.text();
const cleaned = cleanHtml(html);
// Now chunk and embed the cleaned HTML
const chunks = chunkText(cleaned, 512);
await vectorDB.upsert(chunks);
}Reduce token costs when feeding web content to GPT-4, Claude, or other LLMs.
const { cleanHtml } = require("@page-replica/pure-html-for-rag");
const puppeteer = require("puppeteer");
async function scrapeForLLM(url) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
await page.goto(url);
const html = await page.content();
await browser.close();
const cleaned = cleanHtml(html);
// Use cleaned HTML in your prompt
const response = await openai.chat.completions.create({
model: "gpt-4",
messages: [
{ role: "user", content: `Summarize this page: ${cleaned}` }
]
});
}Extract clean, structured content from websites for data analysis or content migration.
const { cleanHtml } = require("@page-replica/pure-html-for-rag");
const axios = require("axios");
async function extractArticle(url) {
const { data } = await axios.get(url);
const cleaned = cleanHtml(data, {
removeComments: true,
collapseWhitespace: true,
});
// Extract text or parse structure
return parseArticleContent(cleaned);
}| Option | Type | Default | Description |
|---|---|---|---|
collapseWhitespace |
boolean |
true |
Converts repeated whitespace to single spaces for a smaller payload. |
removeEmptyElements |
boolean |
true |
Iteratively drops elements that become empty after cleaning. |
removeComments |
boolean |
true |
Removes HTML comments. |
allowedAttributeTags |
string[] |
["a"] |
Tags that should keep their attributes (e.g. keep href on links). |
Returns a minimised HTML string containing only textual content and headings.
Create a reusable cleaner with baked-in defaults:
const { createCleaner } = require("@page-replica/pure-html-for-rag");
const clean = createCleaner({ allowedAttributeTags: [] });
const output = clean(rawHtml, { collapseWhitespace: false });You can use this library in the browser by including the bundled version:
<!DOCTYPE html>
<html>
<head>
<title>HTML Cleaner Demo</title>
</head>
<body>
<script src="https://unpkg.com/@page-replica/pure-html-for-rag@latest/demo/pure-html-for-rag.bundle.js"></script>
<script>
const { cleanHtml } = window.pureHtmlForRag;
const dirtyHtml = document.documentElement.outerHTML;
const cleaned = cleanHtml(dirtyHtml);
console.log('Cleaned HTML:', cleaned);
</script>
</body>
</html>Or use with a module bundler (Webpack, Rollup, Vite):
import { cleanHtml } from '@page-replica/pure-html-for-rag';
const cleaned = cleanHtml(document.body.innerHTML);const puppeteer = require("puppeteer");
const { cleanHtml } = require("@page-replica/pure-html-for-rag");
async function scrapeAndClean(url) {
const browser = await puppeteer.launch();
const page = await browser.newPage();
// Block unnecessary resources
await page.setRequestInterception(true);
page.on('request', (req) => {
const resourceType = req.resourceType();
if (['image', 'stylesheet', 'font', 'media'].includes(resourceType)) {
req.abort();
} else {
req.continue();
}
});
await page.goto(url, { waitUntil: 'domcontentloaded' });
const html = await page.content();
await browser.close();
return cleanHtml(html);
}const { createCleaner } = require("@page-replica/pure-html-for-rag");
const cleaner = createCleaner({
collapseWhitespace: true,
removeEmptyElements: true,
allowedAttributeTags: ['a', 'img'], // Keep links and images with alt text
});
async function processMultiplePages(urls) {
const results = await Promise.all(
urls.map(async (url) => {
const response = await fetch(url);
const html = await response.text();
return {
url,
cleaned: cleaner(html),
originalSize: html.length,
cleanedSize: cleaner(html).length,
};
})
);
return results;
}const { cleanHtml } = require("@page-replica/pure-html-for-rag");
// Minimal cleaning - keep everything except scripts
const minimal = cleanHtml(html, {
removeComments: false,
removeEmptyElements: false,
collapseWhitespace: false,
allowedAttributeTags: ['a', 'img', 'table', 'td', 'th'],
});
// Aggressive cleaning - strip everything
const aggressive = cleanHtml(html, {
removeComments: true,
removeEmptyElements: true,
collapseWhitespace: true,
allowedAttributeTags: [], // No attributes preserved
});<!DOCTYPE html>
<html lang="en">
<head>
<meta charset="UTF-8">
<title>Blog Post - 10 Tips</title>
<link rel="stylesheet" href="style.css">
<script async src="https://www.google-analytics.com/ga.js"></script>
<script src="https://cdn.segment.com/analytics.js"></script>
<style>
body { font-family: Arial; color: #333; }
.container { max-width: 1200px; margin: 0 auto; }
</style>
</head>
<body>
<nav class="navbar navbar-expand-lg">
<a href="/" class="logo">
<img src="logo.png" alt="Logo" width="150">
</a>
<button onclick="toggleMenu()" class="menu-btn">Menu</button>
</nav>
<article>
<h1>10 Productivity Tips</h1>
<p>Learn how to boost your productivity with these proven strategies.</p>
<h2>Tip 1: Start Early</h2>
<p>The early bird catches the worm.</p>
<h2>Tip 2: Take Breaks</h2>
<p>Regular breaks prevent burnout.</p>
</article>
<form class="newsletter">
<input type="email" placeholder="Your email">
<button type="submit">Subscribe</button>
</form>
<script>
function toggleMenu() { /* ... */ }
gtag('event', 'page_view');
</script>
</body>
</html><html><head><title>Blog Post - 10 Tips</title></head><body><nav><a href="/"> </a></nav><article><h1>10 Productivity Tips</h1><p>Learn how to boost your productivity with these proven strategies.</p><h2>Tip 1: Start Early</h2><p>The early bird catches the worm.</p><h2>Tip 2: Take Breaks</h2><p>Regular breaks prevent burnout.</p></article></body></html><script>,<style>,<noscript>,<iframe>,<svg>,<video>,<audio>,<canvas>,<form>,<button>,<select>,<textarea>and similar interactive blocks.<img>,<source>,<track>,<input>,<meta>,<base>,<link>(including preload/stylesheet/analytics variants).- Inline
style=attributes, event handlers likeonclick=, and extraclass=clutter. - HTML comments and empty containers created by the removal step.
The end result is a compact, stable string ready to feed into embeddings or LLM prompts without wasting budget on layout cruft.
- Fast: Processes typical web pages in under 5ms
- Lightweight: Zero dependencies, ~5KB minified
- Memory efficient: Streams through content without building large DOM trees
- Regex-based: Uses optimized regular expressions for maximum speed
Benchmark on a typical blog post (100KB HTML):
Original size: 102,847 bytes
Cleaned size: 12,441 bytes (87.9% reduction)
Processing time: 3.2ms
If you just want to see how the structured cleanup works without setting anything up locally, you can try the live demo.
It lets you experiment with structuring real pages into JSON, Markdown, or low-noise HTML, which is useful if you need to quickly inspect or structure page content.
https://page-replica.com/structured/live-demo
