Crawl4AI: Open Source LLM-Friendly Web Crawler and Scraping Tool

Crawl4AI is an open-source, LLM-friendly web crawler and scraper framework delivering blazing-fast, real-time web crawling tailored for AI workflows, with built-in support for Markdown generation, structured data extraction, managed browsers, and multiple crawling strategies. The project boasts 43.2k stars and is licensed under Apache-2.0, backed by an active community with continuous updates. Users can install via pip or deploy with Docker to seamlessly integrate into AI pipelines and LLM applications. The latest release v0.6.0 introduces world-aware crawling, table-to-DataFrame extraction, and browser pooling with pre-warming, dramatically enhancing performance and flexibility. Crawl4AI also supports proxy authentication, session management, caching, and error-handling hooks, and provides the crwl CLI for deep crawls and LLM-powered content extraction.
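
For readers who want to try it immediately, here is a minimal quick-start sketch based on the project's documented Python API (AsyncWebCrawler and arun); the target URL is a placeholder, and result attributes can vary slightly between releases.

```python
# Install, per the project's README: pip install crawl4ai && crawl4ai-setup
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    # AsyncWebCrawler manages the browser lifecycle as an async context manager.
    async with AsyncWebCrawler() as crawler:
        # arun() crawls a single URL and returns a result with Markdown,
        # links, media, and metadata attached.
        result = await crawler.arun(url="https://example.com")
        print(result.markdown)

asyncio.run(main())
```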


Project repository: https://github.com/unclecode/crawl4ai

✨ Features

📝 Markdown Generation
  • 🧹 Clean Markdown: Generates clean, structured Markdown with accurate formatting.
  • 🎯 Fit Markdown: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
  • 🔗 Citations and References: Converts page links into a numbered reference list with clean citations.
  • 🛠️ Custom Strategies: Users can create their own Markdown generation strategies tailored to specific needs.
  • 📚 BM25 Algorithm: Employs BM25-based filtering for extracting core information and removing irrelevant content.
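
As an illustration of the Fit Markdown and BM25 filtering features listed above, the following sketch attaches a BM25 content filter to the library's default Markdown generator; the query string and URL are placeholders, and import paths or result attributes may differ slightly across versions.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator
from crawl4ai.content_filter_strategy import BM25ContentFilter

async def main():
    # Attach a BM25-based filter so the generated "fit" Markdown keeps only
    # chunks relevant to the query and drops navigation, ads, and boilerplate.
    md_generator = DefaultMarkdownGenerator(
        content_filter=BM25ContentFilter(user_query="pricing and features")  # placeholder query
    )
    config = CrawlerRunConfig(markdown_generator=md_generator)

    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com", config=config)
        # Recent releases expose both the full and the filtered Markdown.
        print(result.markdown.raw_markdown[:300])
        print(result.markdown.fit_markdown[:300])

asyncio.run(main())
```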
📊 Structured Data Extraction
  • 🤖 LLM-Driven Extraction: Supports all LLMs (open-source and proprietary) for structured data extraction.
  • 🧱 Chunking Strategies: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
  • 🌌 Cosine Similarity: Find relevant content chunks based on user queries for semantic extraction.
  • 🔎 CSS-Based Extraction: Fast schema-based data extraction using XPath and CSS selectors.
  • 🔧 Schema Definition: Define custom schemas for extracting structured JSON from repetitive patterns.
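
To make schema-based extraction concrete, here is a sketch using the documented JsonCssExtractionStrategy; the schema, selectors, and URL describe a hypothetical article listing page.

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.extraction_strategy import JsonCssExtractionStrategy

# A schema describes one repeating pattern: a base selector plus the fields
# to pull from each matched element (selectors here are hypothetical).
schema = {
    "name": "Articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "url", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/blog", config=config)
        # extracted_content is returned as a JSON string of structured records.
        print(json.loads(result.extracted_content))

asyncio.run(main())
```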
🌐 Browser Integration
  • 🖥️ Managed Browser: Use user-owned browsers with full control, avoiding bot detection.
  • 🔄 Remote Browser Control: Connect to Chrome Developer Tools Protocol for remote, large-scale data extraction.
  • 👤 Browser Profiler: Create and manage persistent profiles with saved authentication states, cookies, and settings.
  • 🔒 Session Management: Preserve browser states and reuse them for multi-step crawling.
  • 🧩 Proxy Support: Seamlessly connect to proxies with authentication for secure access.
  • ⚙️ Full Browser Control: Modify headers, cookies, user agents, and more for tailored crawling setups.
  • 🌍 Multi-Browser Support: Compatible with Chromium, Firefox, and WebKit.
  • 📐 Dynamic Viewport Adjustment: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.
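
The sketch below shows how these browser options are typically combined through BrowserConfig; the user agent and proxy credentials are placeholder values, and the exact proxy_config shape should be verified against the current documentation.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig

async def main():
    browser_config = BrowserConfig(
        browser_type="chromium",   # "firefox" and "webkit" are also supported
        headless=True,
        user_agent="Mozilla/5.0 (X11; Linux x86_64)",  # placeholder user agent
        proxy_config={             # authenticated proxy; placeholder values
            "server": "http://proxy.example.com:8080",
            "username": "user",
            "password": "pass",
        },
    )
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(url="https://example.com")
        print(result.success)

asyncio.run(main())
```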
🔎 Crawling & Scraping
  • 🖼️ Media Support: Extract images, audio, videos, and responsive image formats like srcset and picture.
  • 🚀 Dynamic Crawling: Execute JavaScript and wait for async or sync conditions to extract dynamic content.
  • 📸 Screenshots: Capture page screenshots during crawling for debugging or analysis.
  • 📂 Raw Data Crawling: Directly process raw HTML (raw:) or local files (file://).
  • 🔗 Comprehensive Link Extraction: Extracts internal and external links as well as embedded iframe content.
  • 🛠️ Customizable Hooks: Define hooks at every step to customize crawling behavior.
  • 💾 Caching: Cache data for improved speed and to avoid redundant fetches.
  • 📄 Metadata Extraction: Retrieve structured metadata from web pages.
  • 📡 IFrame Content Extraction: Seamless extraction from embedded iframe content.
  • 🕵️ Lazy Load Handling: Waits for images to fully load, ensuring no content is missed due to lazy loading.
  • 🔄 Full-Page Scanning: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.
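
The following sketch ties several of these crawling options together in a single run configuration; the URL, JavaScript snippet, and wait-for selector are placeholders, and individual parameter names may vary between releases.

```python
import asyncio
import base64
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    config = CrawlerRunConfig(
        js_code="window.scrollTo(0, document.body.scrollHeight);",  # trigger lazy-loaded content
        wait_for="css:.loaded-item",   # placeholder selector for dynamic content
        scan_full_page=True,           # simulate scrolling for infinite-scroll pages
        screenshot=True,               # capture a screenshot for debugging
        cache_mode=CacheMode.ENABLED,  # reuse cached fetches and avoid redundant requests
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/feed", config=config)
        print(len(result.links["internal"]), "internal links")
        print(len(result.media["images"]), "images")
        if result.screenshot:  # base64-encoded image data
            with open("page.png", "wb") as f:
                f.write(base64.b64decode(result.screenshot))

asyncio.run(main())
```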
🚀 Deployment
  • 🐳 Dockerized Setup: Optimized Docker image with FastAPI server for easy deployment.
  • 🔑 Secure Authentication: Built-in JWT token authentication for API security.
  • 🔄 API Gateway: One-click deployment with secure token authentication for API-based workflows.
  • 🌐 Scalable Architecture: Designed for mass-scale production and optimized server performance.
  • ☁️ Cloud Deployment: Ready-to-deploy configurations for major cloud platforms.
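
As a rough illustration of API-based deployment, the sketch below posts a crawl job to a locally running Docker instance; the endpoint path, payload shape, and token handling are assumptions to be checked against the official deployment docs.

```python
# Assumes the official image is already running locally, for example:
#   docker run -p 11235:11235 unclecode/crawl4ai
# The endpoint path, payload shape, and auth header below are illustrative
# assumptions only; consult the deployment docs for the exact API contract.
import requests

API_URL = "http://localhost:11235/crawl"           # assumed endpoint
headers = {"Authorization": "Bearer <JWT_TOKEN>"}  # placeholder token if JWT auth is enabled

payload = {"urls": ["https://example.com"]}
response = requests.post(API_URL, json=payload, headers=headers, timeout=60)
response.raise_for_status()
print(response.json())
```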
🎯 Additional Features
  • 🕶️ Stealth Mode: Avoid bot detection by mimicking real users.
  • 🏷️ Tag-Based Content Extraction: Refine crawling based on custom tags, headers, or metadata.
  • 🔗 Link Analysis: Extract and analyze all links for detailed data exploration.
  • 🛡️ Error Handling: Robust error management for seamless execution.
  • 🔐 CORS & Static Serving: Supports filesystem-based caching and cross-origin requests.
  • 📖 Clear Documentation: Simplified and updated guides for onboarding and advanced usage.
  • 🙌 Community Recognition: Acknowledges contributors and pull requests for transparency.
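
To show what error handling looks like from the caller's side, here is a small sketch that checks the result object's success flag; attribute names follow the documented CrawlResult fields, and the URL is a placeholder.

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(url="https://example.com/might-fail")  # placeholder URL
        if result.success:
            print("Title:", result.metadata.get("title"))
        else:
            # error_message carries the failure reason (timeouts, HTTP errors, etc.)
            print("Crawl failed:", result.error_message)

asyncio.run(main())
```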

