## Similar Projects

### stable-diffusion-webui
`AUTOMATIC1111/stable-diffusion-webui` · ★ 162,132

stable-diffusion-webui is a web interface built on Gradio that lets users easily run and use the powerful Stable Diffusion image-generation model locally. It addresses the pain points of the original model (command-line only, a high barrier to entry, scattered functionality) by integrating the complex AI image-generation workflow into a single intuitive graphical platform.

Casual creators who want to get going quickly, designers who need fine control over image details, and developers or researchers who want to probe the model's potential can all benefit. Its core strength is sheer feature richness: beyond the basic text-to-image, image-to-image, inpainting, and outpainting modes, it pioneered advanced features such as attention adjustment, prompt matrices, negative prompts, and "highres fix". It also bundles face-restoration tools such as GFPGAN and CodeFormer, supports multiple neural-network upscaling algorithms, and can be extended without limit through its extension system. Even on devices with limited VRAM, stable-diffusion-webui offers optimization options that put high-quality AI art within reach.

### everything-claude-code
`affaan-m/everything-claude-code` · ★ 138,956

everything-claude-code is a high-performance optimization system built for AI coding assistants such as Claude Code, Codex, and Cursor. It is more than a set of configuration files: it is a complete, battle-tested framework aimed at the core pain points AI agents hit in real development work, including inefficiency, lost memory, security risks, and the lack of continuous learning.

By introducing modular skills, intuition enhancement, persistent memory, and built-in security scanning, everything-claude-code markedly improves how agents perform on complex tasks and helps developers build more stable, production-grade AI agents. Its "research-first" development philosophy and token-consumption optimizations make model responses faster and cheaper while defending against potential attack vectors.

It is especially suited to software developers, AI researchers, and teams that want to deeply customize their AI workflows, whether building a large codebase or using AI for security audits and automated testing. An open-source project that won an Anthropic hackathon award, it combines multi-language support with a rich set of practical hooks, letting AI truly grow into one that understands …

### ComfyUI
`Comfy-Org/ComfyUI` · ★ 107,662

ComfyUI is a powerful, highly modular visual AI engine built for designing and executing complex Stable Diffusion image-generation pipelines. Instead of writing code, users work in an intuitive node-based graph interface, wiring functional modules together to build personalized generation pipelines.

This design neatly solves the complexity and inflexibility of configuring advanced AI image workflows. Users without a programming background can freely combine models, tune parameters, and preview results in real time, handling everything from basic text-to-image generation up to multi-step high-resolution restoration. ComfyUI is broadly compatible: it runs on Windows, macOS, and Linux, supports NVIDIA, AMD, Intel, and Apple Silicon hardware, and was among the first to support cutting-edge models such as SDXL, Flux, and SD3.

Researchers and developers probing the limits of the algorithms, as well as designers and experienced AI-art enthusiasts who want maximum creative freedom, will all find strong support here. Its modular architecture lets the community keep extending it, making it one of the most flexible, ecosystem-rich open-source diffusion-model tools available, and helping users turn ideas into results efficiently.

### ML-For-Beginners
`microsoft/ML-For-Beginners` · ★ 84,991

ML-For-Beginners is Microsoft's systematic introductory machine-learning curriculum, designed to help users with no prior background master classic machine learning. The course is planned as a 12-week path with 26 concise lessons and 52 accompanying quizzes, covering the full journey from basic concepts to practical applications, and it solves the beginner's problem of facing a huge body of knowledge with no structured guidance.

Developers looking to switch fields, researchers who need to fill in algorithmic background, and curious hobbyists can all benefit. The course pairs clear theory with hands-on practice so learners build solid skills step by step. A standout feature is its strong multi-language support: an automated pipeline provides versions in more than 50 languages, including Simplified Chinese, dramatically lowering the barrier for learners worldwide. The project is developed in the open with an active community and continuously updated content, so learners get current, accurate material. If you are looking for a clear, friendly, professional entry into machine learning, ML-For-Beginners is an ideal starting point.

### ragflow
`infiniflow/ragflow` · ★ 77,062

RAGFlow is a leading open-source retrieval-augmented generation (RAG) engine that builds a more accurate, reliable context layer for large language models. It combines state-of-the-art RAG techniques with agent capabilities, so it can not only extract knowledge efficiently from all kinds of documents but also let models reason and execute tasks on top of that knowledge.

Hallucination and stale knowledge are common pain points in LLM applications. By deeply parsing complex document structures (tables, charts, mixed layouts), RAGFlow significantly improves retrieval accuracy, curbing made-up answers and keeping responses both grounded and current. Its built-in agent mechanism goes further, letting the system plan its own steps to solve complex problems rather than just answer questions.

It suits developers, enterprise engineering teams, and AI
researchers alike: from teams standing up a private knowledge-base Q&A system to innovators exploring vertical-domain LLM deployments. RAGFlow offers a visual workflow-orchestration UI plus flexible APIs, lowering the barrier for users without an algorithms background while still supporting deep customization by professional developers. Released under the Apache 2.0 license, it is becoming an important bridge between general-purpose LLMs and domain-specific knowledge.

### OpenHands
`OpenHands/OpenHands` · ★ 70,612

OpenHands is an open-source platform for AI-driven development, built so that agents can understand, write, and debug code the way human developers do. It removes the repetitive labor, complex environment setup, and inefficient human-machine handoffs of traditional programming, using automation to speed development substantially.

Software engineers who want to code faster, researchers exploring agent technology, and teams that need rapid prototyping can all benefit. OpenHands is flexible in how you use it: start on your own machine through the CLI or a local GUI for a Devin-like experience, customize agent logic with its powerful Python SDK, or deploy thousands of agents working in parallel in the cloud.

Its core technical highlight is a modular software-agent SDK that both powers the platform and enables highly composable development. OpenHands scores 77.6% on the SWE-bench benchmark, demonstrating its ability to solve real-world software-engineering problems. The platform also ships enterprise-grade features, including Slack and Jira integrations and fine-grained permission management, serving everyone from individual developers to large organizations.

---

## crawl4ai
`unclecode/crawl4ai`

> 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper. Don't be shy, join here: https://discord.gg/jP8KfhDhyN

Crawl4AI is an open-source web crawler and data-extraction tool designed specifically for large language models (LLMs). Its core mission is to turn messy web content into clean, structured Markdown that feeds directly into retrieval-augmented generation (RAG), agent building, and data pipelines, so AI can "read" the web with ease.

Traditional crawlers run into anti-bot blocking, hard-to-load dynamic content, and messy output formats, driving up downstream processing costs. Crawl4AI breaks through these obstacles with built-in automatic three-tier anti-bot detection, proxy-escalation strategies, and deep support for the Shadow DOM. It can intelligently remove consent popups, handle deep links, and recover long-running jobs from crashes, keeping data collection stable and efficient.

The tool is aimed at developers, AI researchers, and data engineers. Whether building a knowledge base for a local model or running a large-scale automated collection pipeline, Crawl4AI offers a high degree of control and flexibility. A high-profile open-source project on GitHub, it is completely free, with no registration or expensive API fees, so users can focus on the value of their data rather than the difficulty of collecting it.

---

# 🚀🤖 Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper.

<div align="center">

<a href="https://trendshift.io/repositories/11716" target="_blank"><img src="https://oss.gittoolsai.com/images/unclecode_crawl4ai_readme_4a68feb902da.png" alt="unclecode%2Fcrawl4ai | Trendshift" style="width: 250px; height: 55px;" width="250" height="55"/></a>

[![GitHub Stars](https://img.shields.io/github/stars/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/stargazers)
[![GitHub Forks](https://img.shields.io/github/forks/unclecode/crawl4ai?style=social)](https://github.com/unclecode/crawl4ai/network/members)

[![PyPI version](https://badge.fury.io/py/crawl4ai.svg)](https://badge.fury.io/py/crawl4ai)
[![Python Version](https://img.shields.io/pypi/pyversions/crawl4ai)](https://pypi.org/project/crawl4ai/)
[![Downloads](https://oss.gittoolsai.com/images/unclecode_crawl4ai_readme_e61070d2d408.png)](https://pepy.tech/project/crawl4ai)
[![GitHub Sponsors](https://img.shields.io/github/sponsors/unclecode?style=flat&logo=GitHub-Sponsors&label=Sponsors&color=pink)](https://github.com/sponsors/unclecode)

---
#### 🚀 Crawl4AI Cloud API — Closed Beta (Launching Soon)
Reliable, large-scale web extraction, now built to be _**drastically more cost-effective**_ than any of the existing solutions.

👉 **Apply [here](https://forms.gle/E9MyPaNXACnAMaqG7) for early access**  
_We’ll be onboarding in phases and working closely with early users.
Limited slots._

---

<p align="center">
    <a href="https://x.com/crawl4ai">
      <img src="https://img.shields.io/badge/Follow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white" alt="Follow on X" />
    </a>
    <a href="https://www.linkedin.com/company/crawl4ai">
      <img src="https://img.shields.io/badge/Follow%20on%20LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white" alt="Follow on LinkedIn" />
    </a>
    <a href="https://discord.gg/jP8KfhDhyN">
      <img src="https://img.shields.io/badge/Join%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white" alt="Join our Discord" />
    </a>
  </p>
</div>

Crawl4AI turns the web into clean, LLM ready Markdown for RAG, agents, and data pipelines. Fast, controllable, battle tested by a 50k+ star community.

[✨ Check out latest update v0.8.6](#-recent-updates)

✨ **New in v0.8.6**: Security hotfix — replaced `litellm` with `unclecode-litellm` due to a PyPI supply chain compromise. If you're on v0.8.5, please upgrade immediately.

✨ Recent v0.8.5: Anti-Bot Detection, Shadow DOM & 60+ Bug Fixes! Automatic 3-tier anti-bot detection with proxy escalation, Shadow DOM flattening, deep crawl cancellation, config defaults API, consent popup removal, and critical security patches. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.5.md)

✨ Previous v0.8.0: Crash Recovery & Prefetch Mode! Deep crawl crash recovery with `resume_state` and `on_state_change` callbacks for long-running crawls. New `prefetch=True` mode for 5-10x faster URL discovery. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)

✨ Previous v0.7.8: Stability & Bug Fix Release! 11 bug fixes addressing Docker API issues, LLM extraction improvements, URL handling fixes, and dependency updates. [Release notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)

<details>
  <summary>🤓 <strong>My Personal Story</strong></summary>

I grew up on an Amstrad, thanks to my dad, and never stopped building. In grad school I specialized in NLP and built crawlers for research. That’s where I learned how much extraction matters.

In 2023, I needed web-to-Markdown. The “open source” option wanted an account, API token, and $16, and still under-delivered. I went turbo anger mode, built Crawl4AI in days, and it went viral. Now it’s the most-starred crawler on GitHub.

I made it open source for **availability**, anyone can use it without a gate. Now I’m building the platform for **affordability**, anyone can run serious crawls without breaking the bank. If that resonates, join in, send feedback, or just crawl something amazing.
</details>
<details>
  <summary>Why developers pick Crawl4AI</summary>

- **LLM ready output**, smart Markdown with headings, tables, code, citation hints
- **Fast in practice**, async browser pool, caching, minimal hops
- **Full control**, sessions, proxies, cookies, user scripts, hooks
- **Adaptive intelligence**, learns site patterns, explores only what matters
- **Deploy anywhere**, zero keys, CLI and Docker, cloud friendly
</details>

## 🚀 Quick Start

1. Install Crawl4AI:
```bash
# Install the package
pip install -U crawl4ai

# For pre-release versions
pip install crawl4ai --pre

# Run post-installation setup
crawl4ai-setup

# Verify your installation
crawl4ai-doctor
```

If you encounter any browser-related issues, you can install the browsers manually:
```bash
python -m playwright install --with-deps chromium
```

2. Run a simple web crawl with Python:
```python
import asyncio
from crawl4ai import *

async def main():
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun(
            url="https://www.nbcnews.com/business",
        )
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```

3. Or use the new command-line interface:
```bash
# Basic crawl with markdown output
crwl https://www.nbcnews.com/business -o markdown

# Deep crawl with BFS strategy, max 10 pages
crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10

# Use LLM extraction with a specific question
crwl https://www.example.com/products -q "Extract all product prices"
```
## 💖 Support Crawl4AI

> 🎉 **Sponsorship Program Now Open!** After powering 51K+ developers and 1 year of growth, Crawl4AI is launching dedicated support for **startups** and **enterprises**. Be among the first 50 **Founding Sponsors** for permanent recognition in our Hall of Fame.

Crawl4AI is the #1 trending open-source web crawler on GitHub. Your support keeps it independent, innovative, and free for the community — while giving you direct access to premium benefits.

<div align="center">

[![Become a Sponsor](https://img.shields.io/badge/Become%20a%20Sponsor-pink?style=for-the-badge&logo=github-sponsors&logoColor=white)](https://github.com/sponsors/unclecode)  
[![Current Sponsors](https://img.shields.io/github/sponsors/unclecode?style=for-the-badge&logo=github&label=Current%20Sponsors&color=green)](https://github.com/sponsors/unclecode)

</div>

### 🤝 Sponsorship Tiers

- **🌱 Believer ($5/mo)** — Join the movement for data democratization  
- **🚀 Builder ($50/mo)** — Priority support & early access to features  
- **💼 Growing Team ($500/mo)** — Bi-weekly syncs & optimization help  
- **🏢 Data Infrastructure Partner ($2000/mo)** — Full partnership with dedicated support  
  *Custom arrangements available - see [SPONSORS.md](SPONSORS.md) for details & contact*

**Why sponsor?**  
No rate-limited APIs. No lock-in. Build and own your data pipeline with direct guidance from the creator of Crawl4AI.

[See All Tiers & Benefits →](https://github.com/sponsors/unclecode)

## ✨ Features

<details>
<summary>📝 <strong>Markdown Generation</strong></summary>

- 🧹 **Clean Markdown**: Generates clean, structured Markdown with accurate formatting.
- 🎯 **Fit Markdown**: Heuristic-based filtering to remove noise and irrelevant parts for AI-friendly processing.
- 🔗 **Citations and References**: Converts page links into a numbered reference list with clean citations (sketched below).
- 🛠️ **Custom Strategies**: Users can create their own Markdown generation strategies tailored to specific needs.
- 📚 **BM25 Algorithm**: Employs BM25-based filtering for extracting core information and removing irrelevant content.
</details>
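As a quick illustration of the raw, fit, and citation outputs listed above, here is a minimal sketch. It assumes the `raw_markdown`, `fit_markdown`, `markdown_with_citations`, and `references_markdown` fields on `result.markdown`; the fuller example under Advanced Usage below shows the same generator and filter in context.

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.content_filter_strategy import PruningContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    # The pruning filter drives the "fit" markdown; raw markdown is always produced.
    config = CrawlerRunConfig(
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed")
        )
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com", config=config)
        md = result.markdown
        print(md.raw_markdown[:300])              # full page as Markdown
        print(md.fit_markdown[:300])              # noise-filtered variant
        print(md.markdown_with_citations[:300])   # links rewritten as [n] citations
        print(md.references_markdown)             # the numbered reference list

asyncio.run(main())
```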
<details>
<summary>📊 <strong>Structured Data Extraction</strong></summary>

- 🤖 **LLM-Driven Extraction**: Supports all LLMs (open-source and proprietary) for structured data extraction.
- 🧱 **Chunking Strategies**: Implements chunking (topic-based, regex, sentence-level) for targeted content processing.
- 🌌 **Cosine Similarity**: Finds relevant content chunks based on user queries for semantic extraction.
- 🔎 **CSS-Based Extraction**: Fast schema-based data extraction using XPath and CSS selectors (sketched below).
- 🔧 **Schema Definition**: Define custom schemas for extracting structured JSON from repetitive patterns.

</details>
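To make the schema bullets concrete, here is a minimal sketch of CSS-based extraction. The selectors and target page are placeholders; a complete, real-page version of this pattern appears under Advanced Usage below.

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, JsonCssExtractionStrategy

# Placeholder schema: one JSON object per element matching baseSelector.
schema = {
    "name": "Articles",
    "baseSelector": "article.post",
    "fields": [
        {"name": "title", "selector": "h2", "type": "text"},
        {"name": "link", "selector": "a", "type": "attribute", "attribute": "href"},
    ],
}

async def main():
    config = CrawlerRunConfig(extraction_strategy=JsonCssExtractionStrategy(schema))
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/blog", config=config)
        print(json.loads(result.extracted_content))  # list of {"title", "link"} dicts

asyncio.run(main())
```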
<details>
<summary>🌐 <strong>Browser Integration</strong></summary>

- 🖥️ **Managed Browser**: Use user-owned browsers with full control, avoiding bot detection.
- 🔄 **Remote Browser Control**: Connect to the Chrome DevTools Protocol for remote, large-scale data extraction.
- 👤 **Browser Profiler**: Create and manage persistent profiles with saved authentication states, cookies, and settings.
- 🔒 **Session Management**: Preserve browser states and reuse them for multi-step crawling (sketched below).
- 🧩 **Proxy Support**: Seamlessly connect to proxies with authentication for secure access.
- ⚙️ **Full Browser Control**: Modify headers, cookies, user agents, and more for tailored crawling setups.
- 🌍 **Multi-Browser Support**: Compatible with Chromium, Firefox, and WebKit.
- 📐 **Dynamic Viewport Adjustment**: Automatically adjusts the browser viewport to match page content, ensuring complete rendering and capturing of all elements.

</details>
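The session bullet above supports multi-step crawls against the same live tab. A minimal sketch, assuming the `session_id` and `js_only` parameters of `CrawlerRunConfig` as documented for session reuse; the target site, the "load more" selector, and the `kill_session` cleanup call are assumptions here:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig, CacheMode

async def main():
    async with AsyncWebCrawler() as crawler:
        # Step 1: open the page and keep the tab alive under a session id.
        first = await crawler.arun(
            "https://example.com/feed",
            config=CrawlerRunConfig(session_id="feed", cache_mode=CacheMode.BYPASS),
        )
        # Step 2: reuse the same tab; run JS only, without a fresh navigation.
        more = await crawler.arun(
            "https://example.com/feed",
            config=CrawlerRunConfig(
                session_id="feed",
                js_only=True,
                js_code="document.querySelector('#load-more')?.click();",
                cache_mode=CacheMode.BYPASS,
            ),
        )
        print(len(first.markdown.raw_markdown), len(more.markdown.raw_markdown))
        # Assumed cleanup call to release the session's tab.
        await crawler.crawler_strategy.kill_session("feed")

asyncio.run(main())
```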
<details>
<summary>🔎 <strong>Crawling & Scraping</strong></summary>

- 🖼️ **Media Support**: Extract images, audio, videos, and responsive image formats like `srcset` and `picture`.
- 🚀 **Dynamic Crawling**: Execute JavaScript and wait on sync or async conditions to extract dynamic content.
- 📸 **Screenshots**: Capture page screenshots during crawling for debugging or analysis.
- 📂 **Raw Data Crawling**: Directly process raw HTML (`raw:`) or local files (`file://`) (sketched below).
- 🔗 **Comprehensive Link Extraction**: Extracts internal links, external links, and embedded iframe content.
- 🛠️ **Customizable Hooks**: Define hooks at every step to customize crawling behavior (supports both string and function-based APIs).
- 💾 **Caching**: Cache data for improved speed and to avoid redundant fetches.
- 📄 **Metadata Extraction**: Retrieve structured metadata from web pages.
- 📡 **IFrame Content Extraction**: Seamless extraction from embedded iframe content.
- 🕵️ **Lazy Load Handling**: Waits for images to fully load, ensuring no content is missed due to lazy loading.
- 🔄 **Full-Page Scanning**: Simulates scrolling to load and capture all dynamic content, perfect for infinite scroll pages.

</details>
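The `raw:` and `file://` inputs above mean the same pipeline can run on HTML you already have. A minimal sketch (the HTML snippet and file path are placeholders):

```python
import asyncio
from crawl4ai import AsyncWebCrawler

async def main():
    html = "<html><body><h1>Hello</h1><p>Already-fetched markup.</p></body></html>"
    async with AsyncWebCrawler() as crawler:
        # Process an in-memory HTML string via the raw: prefix.
        from_raw = await crawler.arun(url="raw:" + html)
        # Process a local file via file:// (placeholder path).
        from_file = await crawler.arun(url="file:///tmp/snapshot.html")
        print(from_raw.markdown)
        print(from_file.markdown)

asyncio.run(main())
```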
<details>
<summary>🚀 <strong>Deployment</strong></summary>

- 🐳 **Dockerized Setup**: Optimized Docker image with a FastAPI server for easy deployment.
- 🔑 **Secure Authentication**: Built-in JWT token authentication for API security.
- 🔄 **API Gateway**: One-click deployment with secure token authentication for API-based workflows.
- 🌐 **Scalable Architecture**: Designed for mass-scale production and optimized server performance.
- ☁️ **Cloud Deployment**: Ready-to-deploy configurations for major cloud platforms.

</details>

<details>
<summary>🎯 <strong>Additional Features</strong></summary>

- 🕶️ **Stealth Mode**: Avoid bot detection by mimicking real users.
- 🏷️ **Tag-Based Content Extraction**: Refine crawling based on custom tags, headers, or metadata (sketched below).
- 🔗 **Link Analysis**: Extract and analyze all links for detailed data exploration.
- 🛡️ **Error Handling**: Robust error management for seamless execution.
- 🔐 **CORS & Static Serving**: Supports filesystem-based caching and cross-origin requests.
- 📖 **Clear Documentation**: Simplified and updated guides for onboarding and advanced usage.
- 🙌 **Community Recognition**: Acknowledges contributors and pull requests for transparency.

</details>
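For the tag-based extraction bullet, here is a minimal sketch using `CrawlerRunConfig`'s content-scoping parameters (`css_selector`, `excluded_tags`, `word_count_threshold`); the selector and URL are placeholders:

```python
import asyncio
from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

async def main():
    config = CrawlerRunConfig(
        css_selector="main.article",               # keep only this region (placeholder)
        excluded_tags=["nav", "footer", "aside"],  # drop boilerplate elements
        word_count_threshold=10,                   # skip near-empty text blocks
    )
    async with AsyncWebCrawler() as crawler:
        result = await crawler.arun("https://example.com/post/1", config=config)
        print(result.markdown)

asyncio.run(main())
```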
## Try it Now!

✨ Play around with this [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)

✨ Visit our [Documentation Website](https://docs.crawl4ai.com/)

## Installation 🛠️

Crawl4AI offers flexible installation options to suit various use cases. You can install it as a Python package or use Docker.

<details>
<summary>🐍 <strong>Using pip</strong></summary>

Choose the installation option that best fits your needs:

### Basic Installation

For basic web crawling and scraping tasks:

```bash
pip install crawl4ai
crawl4ai-setup # Setup the browser
```

By default, this will install the asynchronous version of Crawl4AI, using Playwright for web crawling.

👉 **Note**: When you install Crawl4AI, `crawl4ai-setup` should automatically install and set up Playwright. However, if you encounter any Playwright-related errors, you can manually install it using one of these methods:

1. Through the command line:

   ```bash
   playwright install
   ```

2. If the above doesn't work, try this more specific command:

   ```bash
   python -m playwright install chromium
   ```

This second method has proven to be more reliable in some cases.

---

### Installation with Synchronous Version

The sync version is deprecated and will be removed in future versions. If you need the synchronous version using Selenium:

```bash
pip install crawl4ai[sync]
```

---

### Development Installation

For contributors who plan to modify the source code:

```bash
git clone https://github.com/unclecode/crawl4ai.git
cd crawl4ai
pip install -e .                    # Basic installation in editable mode
```

Install optional features:

```bash
pip install -e ".[torch]"           # With PyTorch features
pip install -e ".[transformer]"     # With Transformer features
pip install -e ".[cosine]"          # With cosine similarity features
pip install -e ".[sync]"            # With synchronous crawling (Selenium)
pip install -e ".[all]"             # Install all optional features
```

</details>

<details>
<summary>🐳 <strong>Docker Deployment</strong></summary>

> 🚀 **Now Available!** Our completely redesigned Docker implementation is here! This new solution makes deployment more efficient and seamless than ever.

### New Docker Features

The new Docker implementation includes:
- **Real-time Monitoring Dashboard** with live system metrics and browser pool visibility
- **Browser pooling** with page pre-warming for faster response times
- **Interactive playground** to test and generate request code
- **MCP integration** for direct connection to AI tools like Claude Code
- **Comprehensive API endpoints** including HTML extraction, screenshots, PDF generation, and JavaScript execution
- **Multi-architecture support** with automatic detection (AMD64/ARM64)
- **Optimized resources** with improved memory management

### Getting Started

```bash
# Pull and run the latest release
docker pull unclecode/crawl4ai:latest
docker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode/crawl4ai:latest

# Visit the monitoring dashboard at http://localhost:11235/dashboard
# Or the playground at http://localhost:11235/playground
```

### Quick Test

Run a quick test (works for both Docker options):

```python
import requests

# Submit a crawl job
response = requests.post(
    "http://localhost:11235/crawl",
    json={"urls": ["https://example.com"], "priority": 10}
)
if response.status_code == 200:
    print("Crawl job submitted successfully.")

data = response.json()
if "results" in data:
    print("Crawl job completed. Results:")
    for result in data["results"]:
        print(result)
else:
    # Async mode: poll the task endpoint with the returned ID
    task_id = data["task_id"]
    print(f"Crawl job submitted. Task ID: {task_id}")
    result = requests.get(f"http://localhost:11235/task/{task_id}")
```

For more examples, see our [Docker Examples](https://github.com/unclecode/crawl4ai/blob/main/docs/examples/docker_example.py). For advanced configuration, monitoring features, and production deployment, see our [Self-Hosting Guide](https://docs.crawl4ai.com/core/self-hosting/).

</details>

---

## 🔬 Advanced Usage Examples 🔬

You can check the project structure in the directory [docs/examples](https://github.com/unclecode/crawl4ai/tree/main/docs/examples). There you'll find a variety of examples; some popular ones are shared here.
<details>
<summary>📝 <strong>Heuristic Markdown Generation with Clean and Fit Markdown</strong></summary>

```python
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter
from crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator

async def main():
    browser_config = BrowserConfig(
        headless=True,
        verbose=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.ENABLED,
        markdown_generator=DefaultMarkdownGenerator(
            content_filter=PruningContentFilter(threshold=0.48, threshold_type="fixed", min_word_threshold=0)
        ),
        # markdown_generator=DefaultMarkdownGenerator(
        #     content_filter=BM25ContentFilter(user_query="WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY", bm25_threshold=1.0)
        # ),
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://docs.micronaut.io/4.9.9/guide/",
            config=run_config
        )
        print(len(result.markdown.raw_markdown))
        print(len(result.markdown.fit_markdown))

if __name__ == "__main__":
    asyncio.run(main())
```

</details>

<details>
<summary>🖥️ <strong>Executing JavaScript & Extracting Structured Data without LLMs</strong></summary>

```python
import asyncio
import json
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode
from crawl4ai import JsonCssExtractionStrategy

async def main():
    schema = {
        "name": "KidoCode Courses",
        "baseSelector": "section.charge-methodology .w-tab-content > div",
        "fields": [
            {"name": "section_title", "selector": "h3.heading-50", "type": "text"},
            {"name": "section_description", "selector": ".charge-content", "type": "text"},
            {"name": "course_name", "selector": ".text-block-93", "type": "text"},
            {"name": "course_description", "selector": ".course-content-text", "type": "text"},
            {"name": "course_icon", "selector": ".image-92", "type": "attribute", "attribute": "src"},
        ],
    }

    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)

    browser_config = BrowserConfig(
        headless=False,
        verbose=True
    )
    run_config = CrawlerRunConfig(
        extraction_strategy=extraction_strategy,
        # Click through all tabs so their content is in the DOM before extraction
        js_code=["""(async () => {const tabs = document.querySelectorAll("section.charge-methodology .tabs-menu-3 > div");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();"""],
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url="https://www.kidocode.com/degrees/technology",
            config=run_config
        )

        courses = json.loads(result.extracted_content)
        print(f"Successfully extracted {len(courses)} course entries")
        print(json.dumps(courses[0], indent=2))

if __name__ == "__main__":
    asyncio.run(main())
```

</details>

<details>
<summary>📚 <strong>Extracting Structured Data with LLMs</strong></summary>

```python
import os
import asyncio
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig
from crawl4ai import LLMExtractionStrategy
from pydantic import BaseModel, Field

class OpenAIModelFee(BaseModel):
    model_name: str = Field(..., description="Name of the OpenAI model.")
    input_fee: str = Field(..., description="Fee for input token for the OpenAI model.")
    output_fee: str = Field(..., description="Fee for output token for the OpenAI model.")

async def main():
    browser_config = BrowserConfig(verbose=True)
    run_config = CrawlerRunConfig(
        word_count_threshold=1,
        extraction_strategy=LLMExtractionStrategy(
            # Here you can use any provider that the LiteLLM library supports, for instance: ollama/qwen2
            # provider="ollama/qwen2", api_token="no-token",
            llm_config=LLMConfig(provider="openai/gpt-4o", api_token=os.getenv('OPENAI_API_KEY')),
            schema=OpenAIModelFee.schema(),
            extraction_type="schema",
            instruction="""From the crawled content, extract all mentioned model names along with their fees for input and output tokens.
            Do not miss any models in the entire content. One extracted model JSON format should look like this:
            {"model_name": "GPT-4", "input_fee": "US$10.00 / 1M tokens", "output_fee": "US$30.00 / 1M tokens"}."""
        ),
        cache_mode=CacheMode.BYPASS,
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun(
            url='https://openai.com/api/pricing/',
            config=run_config
        )
        print(result.extracted_content)

if __name__ == "__main__":
    asyncio.run(main())
```

</details>

<details>
<summary>🤖 <strong>Using Your Own Browser with a Custom User Profile</strong></summary>

```python
import os
import asyncio
from pathlib import Path
from crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode

async def test_news_crawl():
    # Create a persistent user data directory
    user_data_dir = os.path.join(Path.home(), ".crawl4ai", "browser_profile")
    os.makedirs(user_data_dir, exist_ok=True)

    browser_config = BrowserConfig(
        verbose=True,
        headless=True,
        user_data_dir=user_data_dir,
        use_persistent_context=True,
    )
    run_config = CrawlerRunConfig(
        cache_mode=CacheMode.BYPASS
    )

    async with AsyncWebCrawler(config=browser_config) as crawler:
        url = "ADDRESS_OF_A_CHALLENGING_WEBSITE"

        result = await crawler.arun(
            url,
            config=run_config,
            magic=True,
        )

        print(f"Successfully crawled {url}")
        print(f"Content length: {len(result.markdown)}")

if __name__ == "__main__":
    asyncio.run(test_news_crawl())
```

</details>

---

> **💡 Tip:** Some websites may use **CAPTCHA**-based verification mechanisms to prevent automated access. If your workflow encounters such challenges, you may optionally integrate a third-party CAPTCHA-handling service such as <strong>[CapSolver](https://www.capsolver.com/blog/Partners/crawl4ai-capsolver/?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration)</strong>. They support reCAPTCHA v2/v3, Cloudflare Turnstile, Challenge, AWS WAF, and more. Please ensure that your usage complies with the target website’s terms of service and applicable laws.

## ✨ Recent Updates

<details open>
<summary><strong>Version 0.8.6 — Security Hotfix: litellm Supply Chain Fix</strong></summary>

Replaced the `litellm` dependency with `unclecode-litellm` due to a PyPI supply chain compromise affecting the original package. If you're on v0.8.5 or earlier, upgrade immediately.

```bash
pip install -U crawl4ai
```

</details>

<details>
<summary><strong>Version 0.8.5 Release Highlights - Anti-Bot Detection, Shadow DOM & 60+ Bug Fixes</strong></summary>

Our biggest release since v0.8.0. Anti-bot detection with proxy escalation, Shadow DOM flattening, deep crawl cancellation, and over 60 bug fixes.

- **🛡️ Anti-Bot Detection & Proxy Escalation**:
  - 3-tier detection: known vendors, generic block indicators, structural integrity checks
  - Automatic retry with proxy chain and fallback fetch function
  ```python
  from crawl4ai import CrawlerRunConfig
  from crawl4ai.async_configs import ProxyConfig

  config = CrawlerRunConfig(
      proxy_config=[ProxyConfig.DIRECT, ProxyConfig(server="http://my-proxy:8080")],
      max_retries=2,
      fallback_fetch_function=my_web_unlocker,
  )
  ```

- **🌑 Shadow DOM Flattening**:
  - Extract content hidden inside shadow DOM components
  ```python
  config = CrawlerRunConfig(flatten_shadow_dom=True)
  ```

- **🛑 Deep Crawl Cancellation**:
  - Stop long crawls gracefully with `cancel()` or a `should_cancel` callback
  - Works with BFS, DFS, and BestFirst strategies

- **⚙️ Config Defaults API**:
  - `set_defaults()` / `get_defaults()` / `reset_defaults()` on BrowserConfig and CrawlerRunConfig (sketched after this release note)

- **🔒 Critical Security Fixes**:
  - RCE via deserialization in Docker `/crawl` endpoint — removed `eval()`, added allowlist
  - Redis CVE-2025-49844 (CVSS 10.0) — upgraded to 7.2.7

- **60+ Bug Fixes** across browser management, proxy, deep crawling, extraction, CLI, and Docker

[Full v0.8.5 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.5.md)

</details>
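The defaults API above is named but not shown in this README, so here is only a sketch of how it could be wired; the keyword-argument form and the dict return of `get_defaults()` are assumptions, not the confirmed signatures (see the release notes for the real API).

```python
from crawl4ai import BrowserConfig, CrawlerRunConfig

# Assumed usage: set process-wide defaults once, then build plain configs.
BrowserConfig.set_defaults(headless=True, verbose=False)
CrawlerRunConfig.set_defaults(word_count_threshold=10)

print(CrawlerRunConfig.get_defaults())  # inspect current defaults (assumed dict return)

config = CrawlerRunConfig()  # would pick up the defaults set above

# Restore stock behavior when done.
BrowserConfig.reset_defaults()
CrawlerRunConfig.reset_defaults()
```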
<details>
<summary><strong>Version 0.8.0 Release Highlights - Crash Recovery & Prefetch Mode</strong></summary>

This release introduces crash recovery for deep crawls, a new prefetch mode for fast URL discovery, and critical security fixes for Docker deployments.

- **🔄 Deep Crawl Crash Recovery**:
  - `on_state_change` callback fires after each URL for real-time state persistence
  - `resume_state` parameter to continue from a saved checkpoint
  - JSON-serializable state for Redis/database storage
  - Works with BFS, DFS, and Best-First strategies
  ```python
  from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

  strategy = BFSDeepCrawlStrategy(
      max_depth=3,
      resume_state=saved_state,  # Continue from checkpoint
      on_state_change=save_to_redis,  # Called after each URL
  )
  ```

- **⚡ Prefetch Mode for Fast URL Discovery**:
  - `prefetch=True` skips markdown, extraction, and media processing
  - 5-10x faster than full processing
  - Perfect for two-phase crawling: discover first, process selectively
  ```python
  config = CrawlerRunConfig(prefetch=True)
  result = await crawler.arun("https://example.com", config=config)
  # Returns HTML and links only - no markdown generation
  ```

- **🔒 Security Fixes (Docker API)**:
  - Hooks disabled by default (`CRAWL4AI_HOOKS_ENABLED=false`)
  - `file://` URLs blocked on API endpoints to prevent LFI
  - `__import__` removed from hook execution sandbox

[Full v0.8.0 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.8.0.md)

</details>

<details>
<summary><strong>Version 0.7.8 Release Highlights - Stability & Bug Fix Release</strong></summary>

This release focuses on stability with 11 bug fixes addressing issues reported by the community. No new features, but significant improvements to reliability.

- **🐳 Docker API Fixes**:
  - Fixed `ContentRelevanceFilter` deserialization in deep crawl requests (#1642)
  - Fixed `ProxyConfig` JSON serialization in `BrowserConfig.to_dict()` (#1629)
  - Fixed `.cache` folder permissions in Docker image (#1638)

- **🤖 LLM Extraction Improvements**:
  - Configurable rate limiter backoff with new `LLMConfig` parameters (#1269):
    ```python
    from crawl4ai import LLMConfig

    config = LLMConfig(
        provider="openai/gpt-4o-mini",
        backoff_base_delay=5,            # Wait 5s on first retry
        backoff_max_attempts=5,          # Try up to 5 times
        backoff_exponential_factor=3     # Multiply the delay by 3 each attempt
    )
    ```
  - HTML input format support for `LLMExtractionStrategy` (#1178):
    ```python
    from crawl4ai import LLMExtractionStrategy

    strategy = LLMExtractionStrategy(
        llm_config=config,
        instruction="Extract table data",
        input_format="html"  # Now supports: "html", "markdown", "fit_markdown"
    )
    ```
  - Fixed raw HTML URL variable - extraction strategies now receive `"Raw HTML"` instead of an HTML blob (#1116)

- **🔗 URL Handling**:
  - Fixed relative URL resolution after JavaScript redirects (#1268)
  - Fixed import statement formatting in extracted code (#1181)

- **📦 Dependency Updates**:
  - Replaced deprecated PyPDF2 with pypdf (#1412)
  - Pydantic v2 ConfigDict compatibility - no more deprecation warnings (#678)

- **🧠 AdaptiveCrawler**:
  - Fixed query expansion to actually use the LLM instead of hardcoded mock data (#1621)

[Full v0.7.8 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.8.md)

</details>
<details>
<summary><strong>Version 0.7.7 Release Highlights - The Self-Hosting & Monitoring Update</strong></summary>

- **📊 Real-time Monitoring Dashboard**: Interactive web UI with live system metrics and browser pool visibility
  ```python
  # Access the monitoring dashboard
  # Visit: http://localhost:11235/dashboard

  # Real-time metrics include:
  # - System health (CPU, memory, network, uptime)
  # - Active and completed request tracking
  # - Browser pool management (permanent/hot/cold)
  # - Janitor cleanup events
  # - Error monitoring with full context
  ```

- **🔌 Comprehensive Monitor API**: Complete REST API for programmatic access to all monitoring data
  ```python
  import httpx

  async with httpx.AsyncClient() as client:
      # System health
      health = await client.get("http://localhost:11235/monitor/health")

      # Request tracking
      requests = await client.get("http://localhost:11235/monitor/requests")

      # Browser pool status
      browsers = await client.get("http://localhost:11235/monitor/browsers")

      # Endpoint statistics
      stats = await client.get("http://localhost:11235/monitor/endpoints/stats")
  ```

- **⚡ WebSocket Streaming**: Real-time updates every 2 seconds for custom dashboards
- **🔥 Smart Browser Pool**: 3-tier architecture (permanent/hot/cold) with automatic promotion and cleanup
- **🧹 Janitor System**: Automatic resource management with event logging
- **🎮 Control Actions**: Manual browser management (kill, restart, cleanup) via API
- **📈 Production Metrics**: 6 critical metrics for operational excellence with Prometheus integration
- **🐛 Critical Bug Fixes**:
  - Fixed async LLM extraction blocking issue (#1055)
  - Enhanced DFS deep crawl strategy (#1607)
  - Fixed sitemap parsing in AsyncUrlSeeder (#1598)
  - Resolved browser viewport configuration (#1495)
  - Fixed CDP timing with exponential backoff (#1528)
  - Security update for pyOpenSSL (>=25.3.0)

[Full v0.7.7 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.7.md)

</details>

<details>
<summary><strong>Version 0.7.5 Release Highlights - The Docker Hooks & Security Update</strong></summary>

- **🔧 Docker Hooks System**: Complete pipeline customization with user-provided Python functions at 8 key points
- **✨ Function-Based Hooks API (NEW)**: Write hooks as regular Python functions with full IDE support:
  ```python
  from crawl4ai import hooks_to_string
  from crawl4ai.docker_client import Crawl4aiDockerClient

  # Define hooks as regular Python functions
  async def on_page_context_created(page, context, **kwargs):
      """Block images to speed up crawling"""
      await context.route("**/*.{png,jpg,jpeg,gif,webp}", lambda route: route.abort())
      await page.set_viewport_size({"width": 1920, "height": 1080})
      return page

  async def before_goto(page, context, url, **kwargs):
      """Add custom headers"""
      await page.set_extra_http_headers({'X-Crawl4AI': 'v0.7.5'})
      return page

  # Option 1: Use hooks_to_string() utility for REST API
  hooks_code = hooks_to_string({
      "on_page_context_created": on_page_context_created,
      "before_goto": before_goto
  })

  # Option 2: Docker client with automatic conversion (Recommended)
  client = Crawl4aiDockerClient(base_url="http://localhost:11235")
  results = await client.crawl(
      urls=["https://httpbin.org/html"],
      hooks={
          "on_page_context_created": on_page_context_created,
          "before_goto": before_goto
      }
  )
  # ✓ Full IDE support, type checking, and reusability!
  ```

- **🤖 Enhanced LLM Integration**: Custom providers with temperature control and base_url configuration
- **🔒 HTTPS Preservation**: Secure internal link handling with `preserve_https_for_internal_links=True`
- **🐍 Python 3.10+ Support**: Modern language features and enhanced performance
- **🛠️ Bug Fixes**: Resolved multiple community-reported issues including URL processing, JWT authentication, and proxy configuration

[Full v0.7.5 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.5.md)

</details>
<details>
<summary><strong>Version 0.7.4 Release Highlights - The Intelligent Table Extraction & Performance Update</strong></summary>

- **🚀 LLMTableExtraction**: Revolutionary table extraction with intelligent chunking for massive tables:
  ```python
  from crawl4ai import LLMTableExtraction, LLMConfig, CrawlerRunConfig

  # Configure intelligent table extraction
  table_strategy = LLMTableExtraction(
      llm_config=LLMConfig(provider="openai/gpt-4.1-mini"),
      enable_chunking=True,           # Handle massive tables
      chunk_token_threshold=5000,     # Smart chunking threshold
      overlap_threshold=100,          # Maintain context between chunks
      extraction_type="structured"    # Get structured data output
  )

  config = CrawlerRunConfig(table_extraction_strategy=table_strategy)
  result = await crawler.arun("https://complex-tables-site.com", config=config)

  # Tables are automatically chunked, processed, and merged
  for table in result.tables:
      print(f"Extracted table: {len(table['data'])} rows")
  ```

- **⚡ Dispatcher Bug Fix**: Fixed sequential processing bottleneck in `arun_many` for fast-completing tasks
- **🧹 Memory Management Refactor**: Consolidated memory utilities into the main utils module for cleaner architecture
- **🔧 Browser Manager Fixes**: Resolved race conditions in concurrent page creation with thread-safe locking
- **🔗 Advanced URL Processing**: Better handling of raw:// URLs and base tag link resolution
- **🛡️ Enhanced Proxy Support**: Flexible proxy configuration supporting both dict and string formats

[Full v0.7.4 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.4.md)

</details>

<details>
<summary><strong>Version 0.7.3 Release Highlights - The Multi-Config Intelligence Update</strong></summary>

- **🕵️ Undetected Browser Support**: Bypass sophisticated bot detection systems:
  ```python
  from crawl4ai import AsyncWebCrawler, BrowserConfig

  browser_config = BrowserConfig(
      browser_type="undetected",  # Use undetected Chrome
      headless=True,              # Can run headless with stealth
      extra_args=[
          "--disable-blink-features=AutomationControlled",
          "--disable-web-security"
      ]
  )

  async with AsyncWebCrawler(config=browser_config) as crawler:
      result = await crawler.arun("https://protected-site.com")
  # Successfully bypass Cloudflare, Akamai, and custom bot detection
  ```

- **🎨 Multi-URL Configuration**: Different strategies for different URL patterns in one batch:
  ```python
  from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode

  configs = [
      # Documentation sites - aggressive caching
      CrawlerRunConfig(
          url_matcher=["*docs*", "*documentation*"],
          cache_mode=CacheMode.WRITE_ONLY,
          markdown_generator_options={"include_links": True}
      ),

      # News/blog sites - fresh content
      CrawlerRunConfig(
          url_matcher=lambda url: 'blog' in url or 'news' in url,
          cache_mode=CacheMode.BYPASS
      ),

      # Fallback for everything else
      CrawlerRunConfig()
  ]

  results = await crawler.arun_many(urls, config=configs)
  # Each URL gets the right configuration automatically
  ```

- **🧠 Memory Monitoring**: Track and optimize memory usage during crawling:
  ```python
  from crawl4ai.memory_utils import MemoryMonitor

  monitor = MemoryMonitor()
  monitor.start_monitoring()

  results = await crawler.arun_many(large_url_list)

  report = monitor.get_report()
  print(f"Peak memory: {report['peak_mb']:.1f} MB")
  print(f"Efficiency: {report['efficiency']:.1f}%")
  # Get optimization recommendations
  ```

- **📊 Enhanced Table Extraction**: Direct DataFrame conversion from web tables:
  ```python
  result = await crawler.arun("https://site-with-tables.com")

  # New way - direct table access
  if result.tables:
      import pandas as pd
      for table in result.tables:
          df = pd.DataFrame(table['data'])
          print(f"Table: {df.shape[0]} rows × {df.shape[1]} columns")
  ```

- **💰 GitHub Sponsors**: 4-tier sponsorship system for project sustainability
- **🐳 Docker LLM Flexibility**: Configure providers via environment variables

[Full v0.7.3 Release Notes →](https://github.com/unclecode/crawl4ai/blob/main/docs/blog/release-v0.7.3.md)

</details>
<details>
<summary><strong>Version 0.7.0 Release Highlights - The Adaptive Intelligence Update</strong></summary>

- **🧠 Adaptive Crawling**: Your crawler now learns and adapts to website patterns automatically:
  ```python
  config = AdaptiveConfig(
      confidence_threshold=0.7,  # Min confidence to stop crawling
      max_depth=5,               # Maximum crawl depth
      max_pages=20,              # Maximum number of pages to crawl
      strategy="statistical"
  )

  async with AsyncWebCrawler() as crawler:
      adaptive_crawler = AdaptiveCrawler(crawler, config)
      state = await adaptive_crawler.digest(
          start_url="https://news.example.com",
          query="latest news content"
      )
  # Crawler learns patterns and improves extraction over time
  ```

- **🌊 Virtual Scroll Support**: Complete content extraction from infinite scroll pages:
  ```python
  scroll_config = VirtualScrollConfig(
      container_selector="[data-testid='feed']",
      scroll_count=20,
      scroll_by="container_height",
      wait_after_scroll=1.0
  )

  result = await crawler.arun(url, config=CrawlerRunConfig(
      virtual_scroll_config=scroll_config
  ))
  ```

- **🔗 Intelligent Link Analysis**: 3-layer scoring system for smart link prioritization:
  ```python
  link_config = LinkPreviewConfig(
      query="machine learning tutorials",
      score_threshold=0.3,
      concurrent_requests=10
  )

  result = await crawler.arun(url, config=CrawlerRunConfig(
      link_preview_config=link_config,
      score_links=True
  ))
  # Links ranked by relevance and quality
  ```

- **🎣 Async URL Seeder**: Discover thousands of URLs in seconds:
  ```python
  seeder = AsyncUrlSeeder(SeedingConfig(
      source="sitemap+cc",
      pattern="*/blog/*",
      query="python tutorials",
      score_threshold=0.4
  ))

  urls = await seeder.discover("https://example.com")
  ```

- **⚡ Performance Boost**: Up to 3x faster with optimized resource handling and memory efficiency

Read the full details in our [0.7.0 Release Notes](https://docs.crawl4ai.com/blog/release-v0.7.0) or check the [CHANGELOG](https://github.com/unclecode/crawl4ai/blob/main/CHANGELOG.md).

</details>

## Version Numbering in Crawl4AI

Crawl4AI follows standard Python version numbering conventions (PEP 440) to help users understand the stability and features of each release.

<details>
<summary>📈 <strong>Version Numbers Explained</strong></summary>

Our version numbers follow this pattern: `MAJOR.MINOR.PATCH` (e.g., 0.4.3)

#### Pre-release Versions
We use different suffixes to indicate development stages:

- `dev` (0.4.3.dev1): Development versions, unstable
- `a` (0.4.3a1): Alpha releases, experimental features
- `b` (0.4.3b1): Beta releases, feature complete but needs testing
- `rc` (0.4.3rc1): Release candidates, potential final version

#### Installation
- Regular installation (stable version):
  ```bash
  pip install -U crawl4ai
  ```

- Install pre-release versions:
  ```bash
  pip install crawl4ai --pre
  ```

- Install a specific version:
  ```bash
  pip install crawl4ai==0.4.3b1
  ```

#### Why Pre-releases?
We use pre-releases to:
- Test new features in real-world scenarios
- Gather feedback before final releases
- Ensure stability for production users
- Allow early adopters to try new features

For production environments, we recommend using the stable version. For testing new features, you can opt in to pre-releases using the `--pre` flag.

</details>

## 📖 Documentation & Roadmap

> 🚨 **Documentation Update Alert**: We're undertaking a major documentation overhaul next week to reflect recent updates and improvements. Stay tuned for a more comprehensive and up-to-date guide!

For current documentation, including installation instructions, advanced features, and API reference, visit our [Documentation Website](https://docs.crawl4ai.com/).

To check our development plans and upcoming features, visit our [Roadmap](https://github.com/unclecode/crawl4ai/blob/main/ROADMAP.md).

<details>
<summary>📈 <strong>Development TODOs</strong></summary>

- [x] 0. Graph Crawler: Smart website traversal using graph search algorithms for comprehensive nested page extraction
- [x] 1. Question-Based Crawler: Natural language driven web discovery and content extraction
- [x] 2. Knowledge-Optimal Crawler: Smart crawling that maximizes knowledge while minimizing data extraction
- [x] 3. Agentic Crawler: Autonomous system for complex multi-step crawling operations
- [x] 4. Automated Schema Generator: Convert natural language to extraction schemas
- [x] 5. Domain-Specific Scrapers: Pre-configured extractors for common platforms (academic, e-commerce)
- [x] 6. Web Embedding Index: Semantic search infrastructure for crawled content
- [x] 7. Interactive Playground: Web UI for testing, comparing strategies with AI assistance
- [x] 8. Performance Monitor: Real-time insights into crawler operations
- [ ] 9. Cloud Integration: One-click deployment solutions across cloud providers
- [x] 10. Sponsorship Program: Structured support system with tiered benefits
- [ ] 11. Educational Content: "How to Crawl" video series and interactive tutorials

</details>

## 🤝 Contributing

We welcome contributions from the open-source community. Check out our [contribution guidelines](https://github.com/unclecode/crawl4ai/blob/main/CONTRIBUTORS.md) for more information.

## 📄 License & Attribution

This project is licensed under the Apache License 2.0; attribution via the badges below is recommended. See the [Apache 2.0 License](https://github.com/unclecode/crawl4ai/blob/main/LICENSE) file for details.

### Attribution Requirements
When using Crawl4AI, you must include one of the following attribution methods:

<details>
<summary>📈 <strong>1. Badge Attribution (Recommended)</strong></summary>
Add one of these badges to your README, documentation, or website:

| Theme | Badge |
|-------|-------|
| **Disco Theme (Animated)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Night Theme (Dark with Neon)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Dark Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/></a> |
| **Light Theme (Classic)** | <a href="https://github.com/unclecode/crawl4ai"><img src="./docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/></a> |

HTML code for adding the badges:
```html
<!-- Disco Theme (Animated) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-disco.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Night Theme (Dark with Neon) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-night.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Dark Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-dark.svg" alt="Powered by Crawl4AI" width="200"/>
</a>

<!-- Light Theme (Classic) -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://raw.githubusercontent.com/unclecode/crawl4ai/main/docs/assets/powered-by-light.svg" alt="Powered by Crawl4AI" width="200"/>
</a>
<!-- Simple Shield Badge -->
<a href="https://github.com/unclecode/crawl4ai">
  <img src="https://img.shields.io/badge/Powered%20by-Crawl4AI-blue?style=flat-square" alt="Powered by Crawl4AI"/>
</a>
```

</details>

<details>
<summary>📖 <strong>2. Text Attribution</strong></summary>
Add this line to your documentation:
```
This project uses Crawl4AI (https://github.com/unclecode/crawl4ai) for web data extraction.
```
</details>

## 📚 Citation

If you use Crawl4AI in your research or project, please cite:

```bibtex
@software{crawl4ai2024,
  author = {UncleCode},
  title = {Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper},
  year = {2024},
  publisher = {GitHub},
  journal = {GitHub Repository},
  howpublished = {\url{https://github.com/unclecode/crawl4ai}},
  commit = {Please use the commit hash you're working with}
}
```

Text citation format:
```
UncleCode. (2024). Crawl4AI: Open-source LLM Friendly Web Crawler & Scraper [Computer software].
GitHub. https://github.com/unclecode/crawl4ai
```

## 📧 Contact

For questions, suggestions, or feedback, feel free to reach out:

- GitHub: [unclecode](https://github.com/unclecode)
- Twitter: [@unclecode](https://twitter.com/unclecode)
- Website: [crawl4ai.com](https://crawl4ai.com)

Happy Crawling! 🕸️🚀

## 🗾 Mission

Our mission is to unlock the value of personal and enterprise data by transforming digital footprints into structured, tradeable assets. Crawl4AI empowers individuals and organizations with open-source tools to extract and structure data, fostering a shared data economy.

We envision a future where AI is powered by real human knowledge, ensuring data creators directly benefit from their contributions. By democratizing data and enabling ethical sharing, we are laying the foundation for authentic AI advancement.

<details>
<summary>🔑 <strong>Key Opportunities</strong></summary>

- **Data Capitalization**: Transform digital footprints into measurable, valuable assets.
- **Authentic AI Data**: Provide AI systems with real human insights.
- **Shared Economy**: Create a fair data marketplace that benefits data creators.

</details>

<details>
<summary>🚀 <strong>Development Pathway</strong></summary>

1. **Open-Source Tools**: Community-driven platforms for transparent data extraction.
2. **Digital Asset Structuring**: Tools to organize and value digital knowledge.
3. **Ethical Data Marketplace**: A secure, fair platform for exchanging structured data.
\n\nFor more details, see our [full mission statement](.\u002FMISSION.md).\n\u003C\u002Fdetails>\n\n## 🌟 Current Sponsors\n\n### 🏢 Enterprise Sponsors & Partners\n\nOur enterprise sponsors and technology partners help scale Crawl4AI to power production-grade data pipelines.\n\n| Company | About | Sponsorship Tier |\n|------|------|----------------------------|\n| \u003Ca href=\"https:\u002F\u002Fwww.thordata.com\u002F?ls=github&lk=crawl4ai\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fgist.github.com\u002Faravindkarnam\u002Fdfc598a67be5036494475acece7e54cf\u002Fraw\u002Fthor_data.svg\" alt=\"Thor Data\" width=\"120\"\u002F>\u003C\u002Fa>  | Leveraging Thordata ensures seamless compatibility with any AI\u002FML workflows and data infrastructure, massively accessing web data with 99.9% uptime, backed by one-on-one customer support. | 🥈 Silver |\n| \u003Ca href=\"https:\u002F\u002Fapp.nstproxy.com\u002Fregister?i=ecOqW9\" target=\"_blank\">\u003Cpicture>\u003Csource width=\"250\" media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fgist.github.com\u002Faravindkarnam\u002F62f82bd4818d3079d9dd3c31df432cf8\u002Fraw\u002Fnst-light.svg\">\u003Csource width=\"250\" media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fwww.nstproxy.com\u002Flogo.svg\">\u003Cimg alt=\"nstproxy\" src=\"https:\u002F\u002Fwww.nstproxy.com\u002Flogo.svg\">\u003C\u002Fpicture>\u003C\u002Fa>  | NstProxy is a trusted proxy provider with over 110M real residential IPs, city-level targeting, 99.99% uptime, and low pricing at $0.1\u002FGB; it delivers unmatched stability, scale, and cost-efficiency. | 🥈 Silver |\n| \u003Ca href=\"https:\u002F\u002Fapp.scrapeless.com\u002Fpassport\u002Fregister?utm_source=official&utm_term=crawl4ai\" target=\"_blank\">\u003Cpicture>\u003Csource width=\"250\" media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fgist.githubusercontent.com\u002Faravindkarnam\u002F0d275b942705604263e5c32d2db27bc1\u002Fraw\u002FScrapeless-light-logo.svg\">\u003Csource width=\"250\" media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fgist.githubusercontent.com\u002Faravindkarnam\u002F22d0525cc0f3021bf19ebf6e11a69ccd\u002Fraw\u002FScrapeless-dark-logo.svg\">\u003Cimg alt=\"Scrapeless\" src=\"https:\u002F\u002Fgist.githubusercontent.com\u002Faravindkarnam\u002F22d0525cc0f3021bf19ebf6e11a69ccd\u002Fraw\u002FScrapeless-dark-logo.svg\">\u003C\u002Fpicture>\u003C\u002Fa>  | Scrapeless provides production-grade infrastructure for Crawling, Automation, and AI Agents, offering Scraping Browser, 4 Proxy Types and Universal Scraping API. | 🥈 Silver |\n| \u003Ca href=\"https:\u002F\u002Fdashboard.capsolver.com\u002Fpassport\u002Fregister?inviteCode=ESVSECTX5Q23\" target=\"_blank\">\u003Cpicture>\u003Csource width=\"120\" media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fdocs.crawl4ai.com\u002Fuploads\u002Fsponsors\u002F20251013045338_72a71fa4ee4d2f40.png\">\u003Csource width=\"120\" media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_feed151b0bf6.png\">\u003Cimg alt=\"Capsolver\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_feed151b0bf6.png\">\u003C\u002Fpicture>\u003C\u002Fa> | AI-powered Captcha solving service. 
Supports all major Captcha types, including reCAPTCHA, Cloudflare, and more | 🥉 Bronze |\n| \u003Ca href=\"https:\u002F\u002Fkipo.ai\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_0881c5f473f5.png\" alt=\"DataSync\" width=\"120\"\u002F>\u003C\u002Fa> | Helps engineers and buyers find, compare, and source electronic & industrial parts in seconds, with specs, pricing, lead times & alternatives. | 🥇 Gold |\n| \u003Ca href=\"https:\u002F\u002Fwww.kidocode.com\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fdocs.crawl4ai.com\u002Fuploads\u002Fsponsors\u002F20251013045045_bb8dace3f0440d65.svg\" alt=\"Kidocode\" width=\"120\"\u002F>\u003Cp align=\"center\">KidoCode\u003C\u002Fp>\u003C\u002Fa> | Kidocode is a hybrid technology and entrepreneurship school for kids aged 5–18, offering both online and on-campus education. | 🥇 Gold |\n| \u003Ca href=\"https:\u002F\u002Fwww.alephnull.sg\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fdocs.crawl4ai.com\u002Fuploads\u002Fsponsors\u002F20251013050323_a9e8e8c4c3650421.svg\" alt=\"Aleph null\" width=\"120\"\u002F>\u003C\u002Fa> | Singapore-based Aleph Null is Asia’s leading edtech hub, dedicated to student-centric, AI-driven education—empowering learners with the tools to thrive in a fast-changing world. | 🥇 Gold |\n\n\n\n### 🧑‍🤝‍🧑 Individual Sponsors\n\nA heartfelt thanks to our individual supporters! Every contribution helps us keep our open-source mission alive and thriving!\n\n\u003Cp align=\"left\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhafezparast\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_b87016576f1d.png\" style=\"border-radius:50%;\" width=\"64px\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fntohidi\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_a7f59d2272d2.png\" style=\"border-radius:50%;\" width=\"64px\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FSjoeborg\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_5fe63a7c96c7.png\" style=\"border-radius:50%;\" width=\"64px\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fromek-rozen\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_d65e0acafaa4.png\" style=\"border-radius:50%;\" width=\"64px\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FKourosh-Kiyani\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_4a9dad2c26a2.png\" style=\"border-radius:50%;\" width=\"64px\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FEtherdrake\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_b2919e96816f.png\" style=\"border-radius:50%;\" width=\"64px\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fshaman247\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_552e83377704.png\" style=\"border-radius:50%;\" width=\"64px\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fwork-flow-manager\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_f25e38f207f5.png\" style=\"border-radius:50%;\" width=\"64px\"\u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n> Want to join 
them? [Sponsor Crawl4AI →](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Funclecode)\n\n## Star History\n\n[![Star History Chart](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_b3bea90a214a.png)](https:\u002F\u002Fstar-history.com\u002F#unclecode\u002Fcrawl4ai&Date)\n","# 🚀🤖 Crawl4AI：开源、适合大语言模型的网络爬虫与数据抓取工具。\n\n\u003Cdiv align=\"center\">\n\n\u003Ca href=\"https:\u002F\u002Ftrendshift.io\u002Frepositories\u002F11716\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_4a68feb902da.png\" alt=\"unclecode%2Fcrawl4ai | Trendshift\" style=\"width: 250px; height: 55px;\" width=\"250\" height=\"55\"\u002F>\u003C\u002Fa>\n\n[![GitHub 星标数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002Funclecode\u002Fcrawl4ai?style=social)](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fstargazers)\n[![GitHub 复刻数](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fforks\u002Funclecode\u002Fcrawl4ai?style=social)](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fnetwork\u002Fmembers)\n\n[![PyPI 版本](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fcrawl4ai.svg)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fcrawl4ai)\n[![Python 版本](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fcrawl4ai)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fcrawl4ai\u002F)\n[![下载量](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_e61070d2d408.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fcrawl4ai)\n[![GitHub 赞助者](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fsponsors\u002Funclecode?style=flat&logo=GitHub-Sponsors&label=Sponsors&color=pink)](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Funclecode)\n\n---\n#### 🚀 Crawl4AI 云 API — 封闭测试版（即将上线）\n可靠的大规模网页提取功能，现已构建得比现有任何解决方案都_**更具成本效益**_。\n\n👉 **请在此处申请[early access]**  \n_我们将分阶段进行接入，并与早期用户紧密合作。\n名额有限。_\n\n---\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fx.com\u002Fcrawl4ai\">\n      \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FFollow%20on%20X-000000?style=for-the-badge&logo=x&logoColor=white\" alt=\"关注 X\" \u002F>\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fwww.linkedin.com\u002Fcompany\u002Fcrawl4ai\">\n      \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FFollow%20on%20LinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white\" alt=\"关注 LinkedIn\" \u002F>\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fdiscord.gg\u002FjP8KfhDhyN\">\n      \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJoin%20our%20Discord-5865F2?style=for-the-badge&logo=discord&logoColor=white\" alt=\"加入我们的 Discord\" \u002F>\n    \u003C\u002Fa>\n  \u003C\u002Fp>\n\u003C\u002Fdiv>\n\nCrawl4AI 将网页转化为干净、适合大语言模型的 Markdown 格式，用于 RAG、智能体和数据管道。快速、可控，经过拥有超过 5 万颗星的社区的严格考验。\n\n[✨ 查看最新更新 v0.8.6](#-recent-updates)\n\n✨ **v0.8.6 新增**：安全补丁 — 由于 PyPI 供应链遭到破坏，已将 `litellm` 替换为 `unclecode-litellm`。如果您使用的是 v0.8.5，请立即升级。\n\n✨ 最近的 v0.8.5：反机器人检测、Shadow DOM 及 60 多项 Bug 修复！自动三重防机器人检测并可升级代理，支持 Shadow DOM 展平、深度爬取取消、配置默认值 API、移除同意弹窗以及关键安全补丁。[发布说明 →](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.8.5.md)\n\n✨ 之前的 v0.8.0：崩溃恢复与预取模式！通过 `resume_state` 和 `on_state_change` 回调实现深度爬取的崩溃恢复，适用于长时间运行的爬取任务。新增 `prefetch=True` 模式，使 URL 发现速度提升 5–10 倍。[发布说明 →](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.8.0.md)\n\n✨ 之前的 v0.7.8：稳定性与 Bug 
修复版本！修复了 11 个问题，包括 Docker API 问题、LLM 提取改进、URL 处理修复以及依赖库更新。[发布说明 →](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.7.8.md)\n\n\u003Cdetails>\n  \u003Csummary>🤓 \u003Cstrong>我的个人故事\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n我从小在父亲的影响下接触 Amstrad 电脑，从此便从未停止过开发。研究生期间，我专攻自然语言处理，并为研究项目构建爬虫。正是在那时，我才深刻体会到数据提取的重要性。\n\n2023 年，我需要一个能将网页转换为 Markdown 的工具。“开源”方案却要求注册账号、获取 API Token，还要支付 16 美元，结果却远未达到预期。于是我怒而行动，几天内就开发出了 Crawl4AI，随后迅速走红。如今，它已成为 GitHub 上获赞最多的爬虫工具。\n\n我将其开源是为了确保**可用性**，任何人都可以无需门槛地使用它。现在，我正致力于打造一个**经济实惠**的平台，让每个人都能以低成本运行大规模爬取任务。如果您对此深有共鸣，欢迎加入我们，提供反馈，或者用它来爬取一些精彩的内容。\n\u003C\u002Fdetails>\n\n\n\u003Cdetails>\n  \u003Csummary>为什么开发者选择 Crawl4AI\u003C\u002Fsummary>\n\n- **适合大语言模型的输出**，智能 Markdown 格式，包含标题、表格、代码及引用提示\n- **实际速度快**，异步浏览器池、缓存机制、跳转次数少\n- **完全可控**，支持会话、代理、Cookie、用户脚本和钩子\n- **自适应智能**，能够学习网站结构模式，只抓取重要内容\n- **随处部署**，无需密钥，提供 CLI 和 Docker 支持，兼容云端环境\n\u003C\u002Fdetails>\n\n\n## 🚀 快速入门 \n\n1. 安装 Crawl4AI：\n```bash\n# 安装软件包\npip install -U crawl4ai\n\n# 如果需要预发布版本\npip install crawl4ai --pre\n\n# 运行安装后设置\ncrawl4ai-setup\n\n# 验证安装是否成功\ncrawl4ai-doctor\n```\n\n如果遇到与浏览器相关的问题，您可以手动安装：\n```bash\npython -m playwright install --with-deps chromium\n```\n\n2. 使用 Python 运行简单的网页爬取：\n```python\nimport asyncio\nfrom crawl4ai import *\n\nasync def main():\n    async with AsyncWebCrawler() as crawler:\n        result = await crawler.arun(\n            url=\"https:\u002F\u002Fwww.nbcnews.com\u002Fbusiness\",\n        )\n        print(result.markdown)\n\nif __name__ == \"__main__\":\n    asyncio.run(main())\n```\n\n3. 或者使用新的命令行界面：\n```bash\n# 基础爬取，输出 Markdown 格式\ncrwl https:\u002F\u002Fwww.nbcnews.com\u002Fbusiness -o markdown\n\n# 深度爬取，采用 BFS 策略，最多抓取 10 页\ncrwl https:\u002F\u002Fdocs.crawl4ai.com --deep-crawl bfs --max-pages 10\n\n# 使用 LLM 提取特定问题的答案\ncrwl https:\u002F\u002Fwww.example.com\u002Fproducts -q \"提取所有产品价格\"\n```\n\n## 💖 支持 Crawl4AI\n\n> 🎉 **赞助计划现已开放！** 在为超过 5.1 万名开发者提供支持并经历一年快速发展之后，Crawl4AI 正式推出针对**初创企业**和**企业**的专属支持计划。成为首批 50 名**创始赞助商**之一，即可永久铭刻于我们的名人堂。\n\nCrawl4AI 是 GitHub 上最受欢迎的开源网络爬虫工具。您的支持不仅使其保持独立、创新和对社区免费，还能让您直接享受高级权益。\n\n\u003Cdiv align=\"\">\n  \n[![成为赞助商](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBecome%20a%20Sponsor-pink?style=for-the-badge&logo=github-sponsors&logoColor=white)](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Funclecode)  \n[![当前赞助商](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fsponsors\u002Funclecode?style=for-the-badge&logo=github&label=Current%20Sponsors&color=green)](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Funclecode)\n\n\u003C\u002Fdiv>\n\n### 🤝 赞助等级\n\n- **🌱 信仰者（每月5美元）** — 加入数据民主化的行列  \n- **🚀 构建者（每月50美元）** — 优先支持与抢先体验新功能  \n- **💼 成长团队（每月500美元）** — 双周同步会议与优化协助  \n- **🏢 数据基础设施合作伙伴（每月2000美元）** — 全面合作及专属支持  \n  *可提供定制化方案 - 详情及联系方式请参阅 [SPONSORS.md](SPONSORS.md)*\n\n**为何要赞助？**  \n无速率限制的API。无厂商锁定。在Crawl4AI创建者的直接指导下，构建并掌控您的数据管道。\n\n[查看所有等级与权益 →](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Funclecode)\n\n\n## ✨ 功能 \n\n\u003Cdetails>\n\u003Csummary>📝 \u003Cstrong>Markdown生成\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- 🧹 **纯净Markdown**：生成格式准确、结构清晰的Markdown文档。\n- 🎯 **精炼Markdown**：基于启发式过滤，去除噪声和无关内容，便于AI处理。\n- 🔗 **引用与参考文献**：将页面链接转换为带编号的参考列表，并附上规范的引用格式。\n- 🛠️ **自定义策略**：用户可根据特定需求，制定个性化的Markdown生成策略。\n- 📚 **BM25算法**：采用BM25算法进行核心信息提取，剔除无关内容。\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>📊 \u003Cstrong>结构化数据提取\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- 🤖 **LLM驱动提取**：支持所有开源及专有大模型进行结构化数据提取。\n- 🧱 **分块策略**：实施基于主题、正则表达式或句子级别的分块技术，实现精准内容处理。\n- 🌌 **余弦相似度**：根据用户查询找到相关的内容块，完成语义提取。\n- 🔎 
**CSS选择器提取**：利用XPath和CSS选择器快速完成基于Schema的数据提取。\n- 🔧 **Schema定义**：定义自定义Schema，从重复性模式中提取结构化JSON数据。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>🌐 \u003Cstrong>浏览器集成\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- 🖥️ **托管浏览器**：使用用户自有浏览器，完全掌控，避免被识别为机器人。\n- 🔄 **远程浏览器控制**：连接Chrome开发者工具协议，实现大规模远程数据提取。\n- 👤 **浏览器配置文件**：创建并管理持久化配置文件，保存认证状态、Cookie和设置。\n- 🔒 **会话管理**：保留浏览器状态，供多步骤爬取复用。\n- 🧩 **代理支持**：无缝连接带有认证的代理，确保安全访问。\n- ⚙️ **全面浏览器控制**：修改请求头、Cookie、User-Agent等参数，打造定制化的爬取环境。\n- 🌍 **多浏览器支持**：兼容Chromium、Firefox和WebKit。\n- 📐 **动态视口调整**：自动调整浏览器视口以匹配页面内容，确保完整渲染并捕获所有元素。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>🔎 \u003Cstrong>爬取与抓取\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- 🖼️ **媒体支持**：提取图片、音频、视频以及响应式图片格式，如`srcset`和`picture`。\n- 🚀 **动态爬取**：执行JavaScript代码，等待异步或同步操作完成，从而提取动态内容。\n- 📸 **截图**：在爬取过程中截取页面截图，用于调试或分析。\n- 📂 **原始数据爬取**：直接处理原始HTML（`raw:`）或本地文件（`file:\u002F\u002F`）。\n- 🔗 **全面链接提取**：提取内部链接、外部链接以及嵌入式iframe内容。\n- 🛠️ **可定制钩子**：在每个步骤定义钩子，自定义爬取行为（支持字符串和函数两种API）。\n- 💾 **缓存机制**：对数据进行缓存，提升速度并避免重复抓取。\n- 📄 **元数据提取**：从网页中获取结构化的元数据。\n- 📡 **IFrame内容提取**：无缝提取嵌入式iframe中的内容。\n- 🕵️ **懒加载处理**：等待图片完全加载，确保不会因懒加载而遗漏内容。\n- 🔄 **全页扫描**：模拟滚动加载，捕获所有动态内容，特别适用于无限滚动页面。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>🚀 \u003Cstrong>部署\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- 🐳 **Docker化部署**：优化后的Docker镜像搭配FastAPI服务器，轻松部署。\n- 🔑 **安全认证**：内置JWT令牌认证，保障API安全性。\n- 🔄 **API网关**：一键部署，配备安全令牌认证，适用于基于API的工作流。\n- 🌐 **可扩展架构**：专为大规模生产设计，优化服务器性能。\n- ☁️ **云部署**：针对主流云平台提供开箱即用的部署配置。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>🎯 \u003Cstrong>附加功能\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- 🕶️ **隐身模式**：通过模拟真实用户行为，避免被检测为机器人。\n- 🏷️ **标签内容提取**：根据自定义标签、头部信息或元数据细化爬取范围。\n- 🔗 **链接分析**：提取并分析所有链接，深入探索数据。\n- 🛡️ **错误处理**：强大的错误管理机制，确保流程顺畅运行。\n- 🔐 **CORS与静态服务**：支持基于文件系统的缓存及跨域请求。\n- 📖 **清晰文档**：简化并更新的指南，方便新手入门及高级用户使用。\n- 🙌 **社区认可**：公开感谢贡献者及Pull Request，增强透明度。\n\n\u003C\u002Fdetails>\n\n## 立即试用！\n\n✨ 在此玩转 [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1SgRPrByQLzjRfwoRNq1wSGE9nYY_EE8C?usp=sharing)\n\n✨ 访问我们的 [文档网站](https:\u002F\u002Fdocs.crawl4ai.com\u002F)\n\n## 安装 🛠️\n\nCrawl4AI提供灵活的安装选项，满足不同使用场景的需求。您可以通过Python包或Docker进行安装。\n\n\u003Cdetails>\n\u003Csummary>🐍 \u003Cstrong>使用pip\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n请选择最适合您的安装方式：\n\n### 基础安装\n\n适用于基本的网页爬取任务：\n\n```bash\npip install crawl4ai\ncrawl4ai-setup # 设置浏览器\n```\n\n默认情况下，这将安装Crawl4AI的异步版本，使用Playwright进行网页爬取。\n\n👉 **注意**：安装Crawl4AI时，`crawl4ai-setup`应自动安装并配置Playwright。但如果遇到Playwright相关问题，您可以尝试以下方法手动安装：\n\n1. 通过命令行：\n\n   ```bash\n   playwright install\n   ```\n\n2. 如果上述方法无效，请尝试更具体的命令：\n\n   ```bash\n   python -m playwright install chromium\n   ```\n\n后一种方法在某些情况下更为可靠。\n\n---\n\n### 同步版本安装\n\n同步版本已弃用，未来版本中将被移除。如果您需要使用Selenium的同步版本：\n\n```bash\npip install crawl4ai[sync]\n```\n\n---\n\n### 开发安装\n\n对于计划修改源代码的贡献者：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai.git\ncd crawl4ai\npip install -e .                    
# 以可编辑模式进行基础安装\n```\n\n安装可选功能：\n\n```bash\npip install -e \".[torch]\"           # 包含 PyTorch 功能\npip install -e \".[transformer]\"     # 包含 Transformer 功能\npip install -e \".[cosine]\"          # 包含余弦相似度功能\npip install -e \".[sync]\"            # 包含同步爬取（Selenium）功能\npip install -e \".[all]\"             # 安装所有可选功能\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>🐳 \u003Cstrong>Docker 部署\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n> 🚀 **现已可用！** 我们全新设计的 Docker 实现正式上线！这一新方案让部署比以往更加高效、无缝。\n\n### 新 Docker 特性\n\n新版 Docker 实现包括：\n- **实时监控仪表板**，提供系统运行指标和浏览器池状态的实时可视化\n- **浏览器池化**，支持页面预热以提升响应速度\n- **交互式 Playground**，用于测试和生成请求代码\n- **MCP 集成**，可直接连接到 Claude Code 等 AI 工具\n- **全面的 API 端点**，涵盖 HTML 提取、截图、PDF 生成以及 JavaScript 执行等功能\n- **多架构支持**，自动检测 AMD64\u002FARM64 架构\n- **资源优化**，改进了内存管理\n\n### 快速入门\n\n```bash\n# 拉取并运行最新版本\ndocker pull unclecode\u002Fcrawl4ai:latest\ndocker run -d -p 11235:11235 --name crawl4ai --shm-size=1g unclecode\u002Fcrawl4ai:latest\n\n# 访问监控仪表板：http:\u002F\u002Flocalhost:11235\u002Fdashboard\n# 或访问 Playground：http:\u002F\u002Flocalhost:11235\u002Fplayground\n```\n\n### 快速测试\n\n运行一个快速测试（两种 Docker 方案均适用）：\n\n```python\nimport requests\n\n# 提交一个爬取任务\nresponse = requests.post(\n    \"http:\u002F\u002Flocalhost:11235\u002Fcrawl\",\n    json={\"urls\": [\"https:\u002F\u002Fexample.com\"], \"priority\": 10}\n)\nif response.status_code == 200:\n    print(\"爬取任务提交成功。\")\n    \nif \"results\" in response.json():\n    results = response.json()[\"results\"]\n    print(\"爬取任务已完成。结果如下：\")\n    for result in results:\n        print(result)\nelse:\n    task_id = response.json()[\"task_id\"]\n    print(f\"爬取任务已提交。任务 ID：{task_id}。\")\n    result = requests.get(f\"http:\u002F\u002Flocalhost:11235\u002Ftask\u002F{task_id}\")\n```\n\n更多示例请参阅我们的 [Docker 示例](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fexamples\u002Fdocker_example.py)。如需高级配置、监控功能及生产环境部署，请参考我们的 [自托管指南](https:\u002F\u002Fdocs.crawl4ai.com\u002Fcore\u002Fself-hosting\u002F)。\n\n\u003C\u002Fdetails>\n\n---\n\n## 🔬 高级使用示例 🔬\n\n您可以在 [docs\u002Fexamples](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Ftree\u002Fmain\u002Fdocs\u002Fexamples) 目录中查看项目结构。那里提供了丰富的示例，以下分享一些热门用法。\n\n\u003Cdetails>\n\u003Csummary>📝 \u003Cstrong>基于 Clean 和 Fit Markdown 的启发式 Markdown 生成\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n```python\nimport asyncio\nfrom crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode\nfrom crawl4ai.content_filter_strategy import PruningContentFilter, BM25ContentFilter\nfrom crawl4ai.markdown_generation_strategy import DefaultMarkdownGenerator\n\nasync def main():\n    browser_config = BrowserConfig(\n        headless=True,  \n        verbose=True,\n    )\n    run_config = CrawlerRunConfig(\n        cache_mode=CacheMode.ENABLED,\n        markdown_generator=DefaultMarkdownGenerator(\n            content_filter=PruningContentFilter(threshold=0.48, threshold_type=\"fixed\", min_word_threshold=0)\n        ),\n        # markdown_generator=DefaultMarkdownGenerator(\n        #     content_filter=BM25ContentFilter(user_query=\"WHEN_WE_FOCUS_BASED_ON_A_USER_QUERY\", bm25_threshold=1.0)\n        # ),\n    )\n    \n    async with AsyncWebCrawler(config=browser_config) as crawler:\n        result = await crawler.arun(\n            url=\"https:\u002F\u002Fdocs.micronaut.io\u002F4.9.9\u002Fguide\u002F\",\n            config=run_config\n        )\n        print(len(result.markdown.raw_markdown))\n        print(len(result.markdown.fit_markdown))\n\nif __name__ == 
\"__main__\":\n    asyncio.run(main())\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>🖥️ \u003Cstrong>无需 LLM 执行 JavaScript 并提取结构化数据\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n```python\nimport asyncio\nfrom crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode\nfrom crawl4ai import JsonCssExtractionStrategy\nimport json\n\nasync def main():\n    schema = {\n    \"name\": \"KidoCode Courses\",\n    \"baseSelector\": \"section.charge-methodology .w-tab-content > div\",\n    \"fields\": [\n        {\n            \"name\": \"section_title\",\n            \"selector\": \"h3.heading-50\",\n            \"type\": \"text\",\n        },\n        {\n            \"name\": \"section_description\",\n            \"selector\": \".charge-content\",\n            \"type\": \"text\",\n        },\n        {\n            \"name\": \"course_name\",\n            \"selector\": \".text-block-93\",\n            \"type\": \"text\",\n        },\n        {\n            \"name\": \"course_description\",\n            \"selector\": \".course-content-text\",\n            \"type\": \"text\",\n        },\n        {\n            \"name\": \"course_icon\",\n            \"selector\": \".image-92\",\n            \"type\": \"attribute\",\n            \"attribute\": \"src\"\n        }\n    ]\n}\n\n    extraction_strategy = JsonCssExtractionStrategy(schema, verbose=True)\n\n    browser_config = BrowserConfig(\n        headless=False,\n        verbose=True\n    )\n    run_config = CrawlerRunConfig(\n        extraction_strategy=extraction_strategy,\n        js_code=[\"\"\"(async () => {const tabs = document.querySelectorAll(\"section.charge-methodology .tabs-menu-3 > div\");for(let tab of tabs) {tab.scrollIntoView();tab.click();await new Promise(r => setTimeout(r, 500));}})();\"\"\"],\n        cache_mode=CacheMode.BYPASS\n    )\n        \n    async with AsyncWebCrawler(config=browser_config) as crawler:\n        \n        result = await crawler.arun(\n            url=\"https:\u002F\u002Fwww.kidocode.com\u002Fdegrees\u002Ftechnology\",\n            config=run_config\n        )\n\n        companies = json.loads(result.extracted_content)\n        print(f\"成功提取了 {len(companies)} 家公司\")\n        print(json.dumps(companies[0], indent=2))\n\n\nif __name__ == \"__main__\":\n    asyncio.run(main())\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>📚 \u003Cstrong>使用 LLM 提取结构化数据\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n```python\nimport os\nimport asyncio\nfrom crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode, LLMConfig\nfrom crawl4ai import LLMExtractionStrategy\nfrom pydantic import BaseModel, Field\n\nclass OpenAIModelFee(BaseModel):\n    model_name: str = Field(..., description=\"Name of the OpenAI model.\")\n    input_fee: str = Field(..., description=\"Fee for input token for the OpenAI model.\")\n    output_fee: str = Field(..., description=\"Fee for output token for the OpenAI model.\")\n\nasync def main():\n    browser_config = BrowserConfig(verbose=True)\n    run_config = CrawlerRunConfig(\n        word_count_threshold=1,\n        extraction_strategy=LLMExtractionStrategy(\n            # 在这里你可以使用 Litellm 库支持的任何提供商，例如：ollama\u002Fqwen2\n            # provider=\"ollama\u002Fqwen2\", api_token=\"no-token\", \n            llm_config = LLMConfig(provider=\"openai\u002Fgpt-4o\", api_token=os.getenv('OPENAI_API_KEY')), \n            schema=OpenAIModelFee.schema(),\n            extraction_type=\"schema\",\n            instruction=\"\"\"从抓取的内容中提取所有提到的模型名称及其输入和输出 token 
的费用。\n            不要遗漏整个内容中的任何模型。提取出的一个模型 JSON 格式应如下所示：\n            {\"model_name\": \"GPT-4\", \"input_fee\": \"US$10.00 \u002F 1M tokens\", \"output_fee\": \"US$30.00 \u002F 1M tokens\"}.\"\"\"\n        ),            \n        cache_mode=CacheMode.BYPASS,\n    )\n    \n    async with AsyncWebCrawler(config=browser_config) as crawler:\n        result = await crawler.arun(\n            url='https:\u002F\u002Fopenai.com\u002Fapi\u002Fpricing\u002F',\n            config=run_config\n        )\n        print(result.extracted_content)\n\nif __name__ == \"__main__\":\n    asyncio.run(main())\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>🤖 \u003Cstrong>使用自定义用户配置文件的浏览器\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n```python\nimport os, sys\nfrom pathlib import Path\nimport asyncio, time\nfrom crawl4ai import AsyncWebCrawler, BrowserConfig, CrawlerRunConfig, CacheMode\n\nasync def test_news_crawl():\n    # 创建一个持久化的用户数据目录\n    user_data_dir = os.path.join(Path.home(), \".crawl4ai\", \"browser_profile\")\n    os.makedirs(user_data_dir, exist_ok=True)\n\n    browser_config = BrowserConfig(\n        verbose=True,\n        headless=True,\n        user_data_dir=user_data_dir,\n        use_persistent_context=True,\n    )\n    run_config = CrawlerRunConfig(\n        cache_mode=CacheMode.BYPASS\n    )\n    \n    async with AsyncWebCrawler(config=browser_config) as crawler:\n        url = \"ADDRESS_OF_A_CHALLENGING_WEBSITE\"\n        \n        result = await crawler.arun(\n            url,\n            config=run_config,\n            magic=True,\n        )\n        \n        print(f\"成功抓取了 {url}\")\n        print(f\"内容长度: {len(result.markdown)}\")\n```\n\n\u003C\u002Fdetails>\n\n---\n\n> **💡 小贴士:** 有些网站可能会使用基于 **CAPTCHA** 的验证机制来防止自动化访问。如果您的工作流程遇到此类挑战，您可以选择集成第三方 CAPTCHA 处理服务，例如 \u003Cstrong>[CapSolver](https:\u002F\u002Fwww.capsolver.com\u002Fblog\u002FPartners\u002Fcrawl4ai-capsolver\u002F?utm_source=crawl4ai&utm_medium=github_pr&utm_campaign=crawl4ai_integration)\u003C\u002Fstrong>。他们支持 reCAPTCHA v2\u002Fv3、Cloudflare Turnstile、Challenge、AWS WAF 等。请确保您的使用符合目标网站的服务条款和相关法律法规。\n\n\n\n## ✨ 最新更新\n\n\u003Cdetails open>\n\u003Csummary>\u003Cstrong>版本 0.8.6 — 安全补丁：litellm 供应链修复\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n由于原始包受到 PyPI 供应链攻击的影响，已将 `litellm` 依赖替换为 `unclecode-litellm`。如果您使用的是 v0.8.5 或更早版本，请立即升级。\n\n```bash\npip install -U crawl4ai\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>版本 0.8.5 发布亮点 - 反机器人检测、Shadow DOM 处理及 60 多项 Bug 修复\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n这是我们自 v0.8.0 以来最大的一次发布。新增反机器人检测与代理升级功能、Shadow DOM 展平处理、深度爬虫取消功能，以及超过 60 项 Bug 修复。\n\n- **🛡️ 反机器人检测与代理升级**:\n  - 三层检测机制：识别已知厂商、通用拦截指标、结构完整性检查\n  - 自动重试功能，结合代理链和回退抓取函数\n  ```python\n  from crawl4ai import CrawlerRunConfig\n  from crawl4ai.async_configs import ProxyConfig\n\n  config = CrawlerRunConfig(\n      proxy_config=[ProxyConfig.DIRECT, ProxyConfig(server=\"http:\u002F\u002Fmy-proxy:8080\")],\n      max_retries=2,\n      fallback_fetch_function=my_web_unlocker,\n  )\n  ```\n\n- **🌑 Shadow DOM 展平处理**:\n  - 提取隐藏在 Shadow DOM 组件内的内容\n  ```python\n  config = CrawlerRunConfig(flatten_shadow_dom=True)\n  ```\n\n- **🛑 深度爬虫取消功能**:\n  - 可通过 `cancel()` 方法或 `should_cancel` 回调函数优雅地停止长时间运行的爬虫\n  - 适用于 BFS、DFS 和 BestFirst 策略（用法示意见本列表之后）\n\n- **⚙️ 配置默认值 API**:\n  - 在 BrowserConfig 和 CrawlerRunConfig 中新增 `set_defaults()`、`get_defaults()` 和 `reset_defaults()`\n\n- **🔒 重要安全修复**:\n  - 移除了 Docker `\u002Fcrawl` 端点中通过反序列化导致的 RCE 漏洞，并添加了白名单机制\n  - 升级到 Redis 7.2.7 版本，修复了 CVE-2025-49844 漏洞（CVSS 评分 10.0）\n\n- **60+ Bug 修复**：涵盖浏览器管理、代理、深度爬虫、内容提取、命令行工具和 Docker 等方面
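\n\n> 针对上述取消功能与配置默认值 API，下面给出一个最小用法示意（依据本节发行说明整理；`deep_crawl_strategy` 的挂载方式与 `set_defaults()` 的具体签名属推测，请以官方文档为准）：\n\n```python\nimport asyncio\nfrom crawl4ai import AsyncWebCrawler, CrawlerRunConfig\nfrom crawl4ai.deep_crawling import BFSDeepCrawlStrategy\n\nasync def main():\n    strategy = BFSDeepCrawlStrategy(max_depth=3)\n    # 挂载方式属推测：发行说明仅说明取消功能适用于 BFS\u002FDFS\u002FBestFirst 策略\n    config = CrawlerRunConfig(deep_crawl_strategy=strategy)\n    async with AsyncWebCrawler() as crawler:\n        task = asyncio.create_task(\n            crawler.arun(\"https:\u002F\u002Fdocs.crawl4ai.com\", config=config)\n        )\n        await asyncio.sleep(60)  # 设定 60 秒的时间预算\n        strategy.cancel()        # 发行说明中的 cancel() 方法：优雅停止爬取\n        results = await task     # 取消后仍可取回已完成部分的结果\n\n# 配置默认值 API 的用法示意（具体签名属推测）：\n# CrawlerRunConfig.set_defaults(cache_mode=CacheMode.BYPASS)\n# print(CrawlerRunConfig.get_defaults())\n# CrawlerRunConfig.reset_defaults()\n\nasyncio.run(main())\n```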
\n\n[完整 v0.8.5 发布说明 →](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.8.5.md)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>版本 0.8.0 发布亮点 - 崩溃恢复与预取模式\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n本次发布引入了深度爬虫的崩溃恢复功能、用于快速发现 URL 的预取模式，以及针对 Docker 部署的重要安全修复。\n\n- **🔄 深度爬虫崩溃恢复**:\n  - `on_state_change` 回调会在每次处理完一个 URL 后触发，实现实时状态保存\n  - `resume_state` 参数可用于从保存的检查点继续爬取\n  - 支持 JSON 序列化的状态存储，便于存入 Redis 或数据库\n  - 适用于 BFS、DFS 和 Best-First 策略\n  ```python\n  from crawl4ai.deep_crawling import BFSDeepCrawlStrategy\n\n  strategy = BFSDeepCrawlStrategy(\n      max_depth=3,\n      resume_state=saved_state,  # 从检查点继续\n      on_state_change=save_to_redis,  # 每处理完一个 URL 调用\n  )\n  ```\n\n- **⚡ 预取模式，加速 URL 发现**:\n  - 设置 `prefetch=True` 可跳过 Markdown 处理、内容提取和媒体处理\n  - 速度比完整处理快 5 到 10 倍\n  - 非常适合两阶段爬取：先发现 URL，再有选择地进行处理\n  ```python\n  config = CrawlerRunConfig(prefetch=True)\n  result = await crawler.arun(\"https:\u002F\u002Fexample.com\", config=config)\n  # 只返回 HTML 和链接，不生成 Markdown\n  ```\n\n- **🔒 安全修复（Docker API）**:\n  - 默认禁用钩子功能（`CRAWL4AI_HOOKS_ENABLED=false`）\n  - 在 API 端点上阻止 `file:\u002F\u002F` URL，以防止 LFI 攻击\n  - 从钩子执行沙盒中移除了 `__import__` 函数\n\n[完整 v0.8.0 发布说明 →](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.8.0.md)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>版本 0.7.8 发布亮点 - 稳定性与 Bug 修复版\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n本次发布专注于稳定性，修复了社区反馈的 11 个问题。没有新增功能，但显著提升了系统的可靠性。\n\n- **🐳 Docker API 修复**:\n  - 修复了深度爬虫请求中 `ContentRelevanceFilter` 的反序列化问题 (#1642)\n  - 修复了 `BrowserConfig.to_dict()` 中 `ProxyConfig` 的 JSON 序列化问题 (#1629)\n  - 修复了 Docker 镜像中 `.cache` 文件夹的权限问题 (#1638)\n\n- **🤖 LLM 提取功能改进**：\n  - 可配置的速率限制退避机制，新增 `LLMConfig` 参数 (#1269)：\n    ```python\n    from crawl4ai import LLMConfig\n\n    config = LLMConfig(\n        provider=\"openai\u002Fgpt-4o-mini\",\n        backoff_base_delay=5,           # 第一次重试等待5秒\n        backoff_max_attempts=5,          # 最多重试5次\n        backoff_exponential_factor=3     # 每次重试延迟时间乘以3\n    )\n    ```\n  - `LLMExtractionStrategy` 支持 HTML 输入格式 (#1178)：\n    ```python\n    from crawl4ai import LLMExtractionStrategy\n\n    strategy = LLMExtractionStrategy(\n        llm_config=config,\n        instruction=\"提取表格数据\",\n        input_format=\"html\"  # 现在支持：\"html\"、\"markdown\"、\"fit_markdown\"\n    )\n    ```\n  - 修复了原始 HTML URL 变量问题——提取策略现在接收 `\"Raw HTML\"` 而不是 HTML Blob (#1116)\n\n- **🔗 URL 处理**：\n  - 修复了 JavaScript 重定向后的相对 URL 解析问题 (#1268)\n  - 修复了提取代码中导入语句的格式问题 (#1181)\n\n- **📦 依赖项更新**：\n  - 用 pypdf 替代已弃用的 PyPDF2 (#1412)\n  - Pydantic v2 ConfigDict 兼容性——不再出现弃用警告 (#678)\n\n- **🧠 AdaptiveCrawler**：\n  - 修复了查询扩展功能，使其真正使用 LLM 而不是硬编码的模拟数据 (#1621)\n\n[完整 v0.7.8 发行说明 →](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.7.8.md)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>版本 0.7.7 发布亮点——自托管与监控更新\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- **📊 实时监控仪表板**：交互式 Web UI，提供系统实时指标和浏览器池状态可视化\n  ```python\n  # 访问监控仪表板\n  # 地址：http:\u002F\u002Flocalhost:11235\u002Fdashboard\n\n  # 实时指标包括：\n  #  - 系统健康状况（CPU、内存、网络、运行时间）\n  #  - 当前活动及已完成请求跟踪\n  #  - 浏览器池管理（永久\u002F热\u002F冷）\n  #  - 清理任务事件\n  #  - 错误监控，附带完整上下文信息\n  ```\n\n- **🔌 全面监控 API**：完整的 REST API，用于程序化访问所有监控数据\n  ```python\n  import httpx\n\n  async with httpx.AsyncClient() as client:\n      # 系统健康状况\n      health = await 
client.get(\"http:\u002F\u002Flocalhost:11235\u002Fmonitor\u002Fhealth\")\n\n      # 请求跟踪\n      requests = await client.get(\"http:\u002F\u002Flocalhost:11235\u002Fmonitor\u002Frequests\")\n\n      # 浏览器池状态\n      browsers = await client.get(\"http:\u002F\u002Flocalhost:11235\u002Fmonitor\u002Fbrowsers\")\n\n      # 端点统计信息\n      stats = await client.get(\"http:\u002F\u002Flocalhost:11235\u002Fmonitor\u002Fendpoints\u002Fstats\")\n  ```\n\n- **⚡ WebSocket 实时流**：每 2 秒推送一次实时更新，适用于自定义仪表板\n- **🔥 智能浏览器池**：三层架构（永久\u002F热\u002F冷），具备自动提升和清理功能\n- **🧹 清洁系统**：自动资源管理，并记录事件日志\n- **🎮 控制操作**：可通过 API 手动管理浏览器（终止、重启、清理）\n- **📈 生产级指标**：6 项关键指标，助力卓越运营，并集成 Prometheus\n- **🐛 重要错误修复**：\n  - 修复了异步 LLM 提取阻塞问题 (#1055)\n  - 增强了 DFS 深度爬取策略 (#1607)\n  - 修复了 AsyncUrlSeeder 中站点地图解析问题 (#1598)\n  - 解决了浏览器视口配置问题 (#1495)\n  - 修复了 CDP 定时问题，并引入指数退避机制 (#1528)\n  - 更新了 pyOpenSSL 安全补丁（≥25.3.0）\n\n[完整 v0.7.7 发行说明 →](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.7.7.md)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>版本 0.7.5 发布亮点——Docker 钩子与安全更新\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- **🔧 Docker 钩子系统**：通过用户提供的 Python 函数，在 8 个关键节点实现完整管道自定义\n- **✨ 基于函数的钩子 API（新）**：可将钩子编写为常规 Python 函数，享受完整的 IDE 支持：\n  ```python\n  from crawl4ai import hooks_to_string\n  from crawl4ai.docker_client import Crawl4aiDockerClient\n\n  # 定义常规 Python 函数作为钩子\n  async def on_page_context_created(page, context, **kwargs):\n      \"\"\"阻止加载图片以加快爬取速度\"\"\"\n      await context.route(\"**\u002F*.{png,jpg,jpeg,gif,webp}\", lambda route: route.abort())\n      await page.set_viewport_size({\"width\": 1920, \"height\": 1080})\n      return page\n\n  async def before_goto(page, context, url, **kwargs):\n      \"\"\"添加自定义请求头\"\"\"\n      await page.set_extra_http_headers({'X-Crawl4AI': 'v0.7.5'})\n      return page\n\n  # 方法一：使用 hooks_to_string() 工具生成 REST API 代码\n  hooks_code = hooks_to_string({\n      \"on_page_context_created\": on_page_context_created,\n      \"before_goto\": before_goto\n  })\n\n  # 方法二：推荐使用 Docker 客户端自动转换\n  client = Crawl4aiDockerClient(base_url=\"http:\u002F\u002Flocalhost:11235\")\n  results = await client.crawl(\n      urls=[\"https:\u002F\u002Fhttpbin.org\u002Fhtml\"],\n      hooks={\n          \"on_page_context_created\": on_page_context_created,\n          \"before_goto\": before_goto\n      }\n  )\n  # ✓ 完整的 IDE 支持、类型检查及代码复用！\n  ```\n\n- **🤖 增强的 LLM 集成**：支持自定义提供商，可控制温度并配置基础 URL\n- **🔒 HTTPS 保留**：通过 `preserve_https_for_internal_links=True` 实现内部链接的安全处理\n- **🐍 Python 3.10+ 支持**：采用现代语言特性，性能进一步提升\n- **🛠️ 错误修复**：解决了社区反馈的多个问题，包括 URL 处理、JWT 认证和代理配置等\n\n[完整 v0.7.5 发行说明 →](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.7.5.md)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>版本 0.7.4 发布亮点——智能表格提取与性能优化\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- **🚀 LLMTableExtraction**：革命性的表格提取功能，采用智能分块技术，可处理超大表格：\n  ```python\n  from crawl4ai import LLMTableExtraction, LLMConfig\n\n  # 配置智能表格提取策略\n  table_strategy = LLMTableExtraction(\n      llm_config=LLMConfig(provider=\"openai\u002Fgpt-4.1-mini\"),\n      enable_chunking=True,           # 处理超大表格\n      chunk_token_threshold=5000,     # 智能分块阈值\n      overlap_threshold=100,          # 保持分块间的上下文连贯性\n      extraction_type=\"structured\"    # 输出结构化数据\n  )\n\n  config = CrawlerRunConfig(table_extraction_strategy=table_strategy)\n  result = await crawler.arun(\"https:\u002F\u002Fcomplex-tables-site.com\", config=config)\n\n  # 
表格会自动分块、处理并合并\n  for table in result.tables:\n      print(f\"提取的表格：{len(table['data'])} 行\")\n  ```\n\n- **⚡ 调度器错误修复**：修复了 arun_many 中针对快速完成任务的顺序处理瓶颈\n- **🧹 内存管理重构**：将内存相关工具整合到主工具模块中，使架构更加清晰\n- **🔧 浏览器管理器修复**：通过线程安全锁解决了并发页面创建中的竞态条件\n- **🔗 高级 URL 处理**：更好地处理 raw:\u002F\u002F URL 以及 base 标签的链接解析\n- **🛡️ 增强的代理支持**：灵活的代理配置，同时支持字典和字符串格式（用法示意见下）
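\n\n> 上面提到的“字典和字符串格式”代理配置，最小示意如下（字段名基于常见用法推测，请以官方文档为准）：\n\n```python\nfrom crawl4ai import BrowserConfig\n\n# 字符串形式：直接给出含凭据的代理地址（示意）\nconfig_str = BrowserConfig(proxy=\"http:\u002F\u002Fuser:pass@my-proxy:8080\")\n\n# 字典形式：分字段指定服务器与凭据（字段名属推测）\nconfig_dict = BrowserConfig(\n    proxy_config={\n        \"server\": \"http:\u002F\u002Fmy-proxy:8080\",\n        \"username\": \"user\",\n        \"password\": \"pass\",\n    }\n)\n```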
\n\n[完整 v0.7.4 发布说明 →](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.7.4.md)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>v0.7.3 版本亮点 - 多配置智能更新\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- **🕵️ 无痕浏览器支持**：绕过复杂的机器人检测系统：\n  ```python\n  from crawl4ai import AsyncWebCrawler, BrowserConfig\n  \n  browser_config = BrowserConfig(\n      browser_type=\"undetected\",  # 使用无痕 Chrome 浏览器\n      headless=True,              # 可以无头运行并保持隐身\n      extra_args=[\n          \"--disable-blink-features=AutomationControlled\",\n          \"--disable-web-security\"\n      ]\n  )\n  \n  async with AsyncWebCrawler(config=browser_config) as crawler:\n      result = await crawler.arun(\"https:\u002F\u002Fprotected-site.com\")\n  # 成功绕过 Cloudflare、Akamai 和自定义机器人检测\n  ```\n\n- **🎨 多 URL 配置**：为同一批次中的不同 URL 模式采用不同策略：\n  ```python\n  from crawl4ai import CrawlerRunConfig, MatchMode, CacheMode\n  \n  configs = [\n      # 文档站点 - 积极缓存\n      CrawlerRunConfig(\n          url_matcher=[\"*docs*\", \"*documentation*\"],\n          cache_mode=CacheMode.WRITE_ONLY,\n          markdown_generator_options={\"include_links\": True}\n      ),\n      \n      # 新闻\u002F博客站点 - 获取最新内容\n      CrawlerRunConfig(\n          url_matcher=lambda url: 'blog' in url or 'news' in url,\n          cache_mode=CacheMode.BYPASS\n      ),\n      \n      # 其他情况的默认配置\n      CrawlerRunConfig()\n  ]\n  \n  results = await crawler.arun_many(urls, config=configs)\n  # 每个 URL 都会自动获得最合适的配置\n  ```\n\n- **🧠 内存监控**：在爬取过程中跟踪并优化内存使用：\n  ```python\n  from crawl4ai.memory_utils import MemoryMonitor\n  \n  monitor = MemoryMonitor()\n  monitor.start_monitoring()\n  \n  results = await crawler.arun_many(large_url_list)\n  \n  report = monitor.get_report()\n  print(f\"峰值内存：{report['peak_mb']:.1f} MB\")\n  print(f\"效率：{report['efficiency']:.1f}%\")\n  # 获取优化建议\n  ```\n\n- **📊 增强的表格提取**：直接将网页表格转换为 DataFrame：\n  ```python\n  result = await crawler.arun(\"https:\u002F\u002Fsite-with-tables.com\")\n  \n  # 新方式 - 直接访问表格\n  if result.tables:\n      import pandas as pd\n      for table in result.tables:\n          df = pd.DataFrame(table['data'])\n          print(f\"表格：{df.shape[0]} 行 × {df.shape[1]} 列\")\n  ```\n\n- **💰 GitHub 赞助**：项目可持续发展的四层赞助体系\n- **🐳 Docker LLM 灵活性**：可通过环境变量配置提供商\n\n[完整 v0.7.3 发布说明 →](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.7.3.md)\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>\u003Cstrong>v0.7.0 版本亮点 - 自适应智能更新\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- **🧠 自适应爬取**：您的爬虫现在可以自动学习并适应网站模式：\n  ```python\n  config = AdaptiveConfig(\n      confidence_threshold=0.7, # 停止爬取的最低置信度\n      max_depth=5, # 最大爬取深度\n      max_pages=20, # 最大爬取页数\n      strategy=\"statistical\"\n  )\n  \n  async with AsyncWebCrawler() as crawler:\n      adaptive_crawler = AdaptiveCrawler(crawler, config)\n      state = await adaptive_crawler.digest(\n          start_url=\"https:\u002F\u002Fnews.example.com\",\n          query=\"latest news content\"\n      )\n  # 爬虫会随着时间推移学习模式并改进提取效果\n  ```\n\n- **🌊 虚拟滚动支持**：从无限滚动页面中完整提取内容：\n  ```python\n  scroll_config = VirtualScrollConfig(\n      container_selector=\"[data-testid='feed']\",\n      scroll_count=20,\n      scroll_by=\"container_height\",\n      wait_after_scroll=1.0\n  )\n  \n  result = await crawler.arun(url, config=CrawlerRunConfig(\n      virtual_scroll_config=scroll_config\n  ))\n  ```\n\n- **🔗 智能链接分析**：三层评分系统实现智能链接优先级排序：\n  ```python\n  link_config = LinkPreviewConfig(\n      query=\"machine learning tutorials\",\n      score_threshold=0.3,\n      concurrent_requests=10\n  )\n  \n  result = await crawler.arun(url, config=CrawlerRunConfig(\n      link_preview_config=link_config,\n      score_links=True\n  ))\n  # 链接按相关性和质量排序\n  ```\n\n- **🎣 异步 URL 种子生成器**：可在几秒钟内发现数千个 URL：\n  ```python\n  seeder = AsyncUrlSeeder(SeedingConfig(\n      source=\"sitemap+cc\",\n      pattern=\"*\u002Fblog\u002F*\",\n      query=\"python tutorials\",\n      score_threshold=0.4\n  ))\n  \n  urls = await seeder.discover(\"https:\u002F\u002Fexample.com\")\n  ```\n\n- **⚡ 性能提升**：通过优化资源管理和内存效率，速度最高可提升至 3 倍\n\n更多详细信息请参阅我们的 [v0.7.0 发布说明](https:\u002F\u002Fdocs.crawl4ai.com\u002Fblog\u002Frelease-v0.7.0)，或查看 [CHANGELOG](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FCHANGELOG.md)。\n\n\u003C\u002Fdetails>\n\n\n\n## Crawl4AI 的版本号规则\n\nCrawl4AI 遵循标准的 Python 版本号命名规范（PEP 440），以便用户了解每个版本的稳定性和功能特性。\n\n\u003Cdetails>\n\u003Csummary>📈 \u003Cstrong>版本号详解\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n我们的版本号采用 `MAJOR.MINOR.PATCH` 格式（例如：0.4.3）。\n\n#### 预发布版本\n我们使用不同的后缀来表示开发阶段：\n\n- `dev`（0.4.3dev1）：开发版本，不稳定\n- `a`（0.4.3a1）：Alpha 版本，包含实验性功能\n- `b`（0.4.3b1）：Beta 版本，功能已完善但需测试\n- `rc`（0.4.3rc1）：发布候选版本，可能是最终版本\n\n#### 安装\n- 正常安装（稳定版本）：\n  ```bash\n  pip install -U crawl4ai\n  ```\n\n- 安装预发布版本：\n  ```bash\n  pip install crawl4ai --pre\n  ```\n\n- 安装特定版本：\n  ```bash\n  pip install crawl4ai==0.4.3b1\n  ```\n\n#### 为什么使用预发布版本？\n我们使用预发布版本是为了：\n- 在实际场景中测试新功能\n- 在正式发布前收集反馈\n- 确保生产环境的稳定性\n- 让早期使用者尝试新功能\n\n对于生产环境，我们建议使用稳定版本。如果需要测试新功能，可以使用 `--pre` 标志来安装预发布版本。\n\n\u003C\u002Fdetails>\n\n## 📖 文档与路线图 \n\n> 🚨 **文档更新提醒**：我们将于下周进行一次大规模的文档改版，以反映最近的更新和改进。敬请期待更加全面和最新的指南！\n\n如需查看当前文档，包括安装说明、高级功能和 API 参考，请访问我们的[文档网站](https:\u002F\u002Fdocs.crawl4ai.com\u002F)。\n\n要了解我们的开发计划和即将推出的功能，请访问我们的[路线图](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FROADMAP.md)。\n\n\u003Cdetails>\n\u003Csummary>📈 \u003Cstrong>开发待办事项\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n- [x] 0. 图形爬虫：使用图搜索算法实现智能网站遍历，用于全面提取嵌套页面\n- [x] 1. 基于问题的爬虫：通过自然语言驱动的网页发现与内容提取\n- [x] 2. 知识最优爬虫：在最小化数据提取的同时最大化知识获取的智能爬取\n- [x] 3. 智能体式爬虫：用于复杂多步骤爬取操作的自主系统\n- [x] 4. 自动化模式生成器：将自然语言转换为提取模式\n- [x] 5. 领域特定爬虫：针对常见平台（学术、电商等）预配置的提取工具\n- [x] 6. 网页嵌入索引：用于已爬取内容的语义搜索基础设施\n- [x] 7. 交互式 Playground：带有 AI 辅助的 Web UI，用于测试和比较策略\n- [x] 8. 性能监控器：实时洞察爬虫运行情况\n- [ ] 9. 云集成：跨云服务商的一键部署解决方案\n- [x] 10. 赞助计划：分层权益的结构化支持体系\n- [ ] 11. 教育内容：《如何爬取》视频系列及互动教程\n\n\u003C\u002Fdetails>\n\n## 🤝 贡献\n\n我们欢迎开源社区的贡献。请查看我们的[贡献指南](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FCONTRIBUTORS.md)，以获取更多信息。\n\n## 📄 许可证与署名\n\n本项目采用 Apache License 2.0 许可证，建议通过以下徽章进行署名。详细信息请参阅[Apache 2.0 许可证](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FLICENSE)文件。\n\n### 署名要求\n在使用 Crawl4AI 时，您必须包含以下署名方式之一：\n\n\u003Cdetails>\n\u003Csummary>📈 \u003Cstrong>1. 
徽章署名（推荐）\u003C\u002Fstrong>\u003C\u002Fsummary>\n将以下任一徽章添加到您的 README、文档或网站中：\n\n| 主题 | 徽章 |\n|-------|-------|\n| **迪斯科主题（动画）** | \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\">\u003Cimg src=\".\u002Fdocs\u002Fassets\u002Fpowered-by-disco.svg\" alt=\"Powered by Crawl4AI\" width=\"200\"\u002F>\u003C\u002Fa> |\n| **夜景主题（暗黑霓虹）** | \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\">\u003Cimg src=\".\u002Fdocs\u002Fassets\u002Fpowered-by-night.svg\" alt=\"Powered by Crawl4AI\" width=\"200\"\u002F>\u003C\u002Fa> |\n| **经典暗黑主题** | \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\">\u003Cimg src=\".\u002Fdocs\u002Fassets\u002Fpowered-by-dark.svg\" alt=\"Powered by Crawl4AI\" width=\"200\"\u002F>\u003C\u002Fa> |\n| **经典亮色主题** | \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\">\u003Cimg src=\".\u002Fdocs\u002Fassets\u002Fpowered-by-light.svg\" alt=\"Powered by Crawl4AI\" width=\"200\"\u002F>\u003C\u002Fa> |\n\n添加徽章的 HTML 代码：\n```html\n\u003C!-- 迪斯科主题（动画） -->\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Funclecode\u002Fcrawl4ai\u002Fmain\u002Fdocs\u002Fassets\u002Fpowered-by-disco.svg\" alt=\"Powered by Crawl4AI\" width=\"200\"\u002F>\n\u003C\u002Fa>\n\n\u003C!-- 夜景主题（暗黑霓虹） -->\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Funclecode\u002Fcrawl4ai\u002Fmain\u002Fdocs\u002Fassets\u002Fpowered-by-night.svg\" alt=\"Powered by Crawl4AI\" width=\"200\"\u002F>\n\u003C\u002Fa>\n\n\u003C!-- 经典暗黑主题 -->\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Funclecode\u002Fcrawl4ai\u002Fmain\u002Fdocs\u002Fassets\u002Fpowered-by-dark.svg\" alt=\"Powered by Crawl4AI\" width=\"200\"\u002F>\n\u003C\u002Fa>\n\n\u003C!-- 经典亮色主题 -->\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\">\n  \u003Cimg src=\"https:\u002F\u002Fraw.githubusercontent.com\u002Funclecode\u002Fcrawl4ai\u002Fmain\u002Fdocs\u002Fassets\u002Fpowered-by-light.svg\" alt=\"Powered by Crawl4AI\" width=\"200\"\u002F>\n\u003C\u002Fa>\n\n\u003C!-- 简单盾牌徽章 -->\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPowered%20by-Crawl4AI-blue?style=flat-square\" alt=\"Powered by Crawl4AI\"\u002F>\n\u003C\u002Fa>\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>📖 \u003Cstrong>2. 文本署名\u003C\u002Fstrong>\u003C\u002Fsummary>\n在您的文档中添加以下文字：\n```\n本项目使用 Crawl4AI（https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai）进行网页数据提取。\n```\n\u003C\u002Fdetails>\n\n## 📚 引用\n\n如果您在研究或项目中使用了 Crawl4AI，请引用：\n\n```bibtex\n@software{crawl4ai2024,\n  author = {UncleCode},\n  title = {Crawl4AI: 开源 LLM 友好型网络爬虫与抓取工具},\n  year = {2024},\n  publisher = {GitHub},\n  journal = {GitHub 仓库},\n  howpublished = {\\url{https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai}},\n  commit = {请使用您正在使用的提交哈希值}\n}\n```\n\n文本引用格式：\n```\nUncleCode. (2024). Crawl4AI：开源 LLM 友好型网络爬虫与抓取工具 [计算机软件]。\nGitHub. 
https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\n```\n\n## 📧 联系方式\n\n如有任何问题、建议或反馈，欢迎随时联系我们：\n\n- GitHub：[unclecode](https:\u002F\u002Fgithub.com\u002Funclecode)\n- Twitter：[@unclecode](https:\u002F\u002Ftwitter.com\u002Funclecode)\n- 官网：[crawl4ai.com](https:\u002F\u002Fcrawl4ai.com)\n\n祝您爬取愉快！🕸️🚀\n\n## 🗾 使命\n\n我们的使命是通过将数字足迹转化为结构化、可交易的资产，释放个人和企业数据的价值。Crawl4AI 为个人和组织提供开源工具，帮助他们提取和结构化数据，从而推动共享数据经济的发展。\n\n我们设想一个未来，在这个未来中，人工智能将由真实的人类知识驱动，确保数据创造者能够直接从其贡献中受益。通过数据民主化并促进道德共享，我们正在为真正的 AI 进步奠定基础。\n\n\u003Cdetails>\n\u003Csummary>🔑 \u003Cstrong>关键机遇\u003C\u002Fstrong>\u003C\u002Fsummary>\n \n- **数据资本化**：将数字足迹转化为可衡量的价值资产。\n- **真实的人工智能数据**：为 AI 系统提供真正的人类见解。\n- **共享经济**：创建一个公平的数据市场，使数据创造者受益。\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>🚀 \u003Cstrong>发展路径\u003C\u002Fstrong>\u003C\u002Fsummary>\n\n1. **开源工具**：由社区驱动的透明数据提取平台。\n2. **数字资产结构化**：用于组织和评估数字知识的工具。\n3. **道德数据市场**：一个安全、公平的平台，用于交换结构化数据。\n\n更多详情，请参阅我们的[完整使命宣言](.\u002FMISSION.md)。\n\u003C\u002Fdetails>\n\n## 🌟 当前赞助商\n\n### 🏢 企业赞助商与合作伙伴\n\n我们的企业赞助商和技术合作伙伴助力 Crawl4AI 扩展规模，为生产级数据管道提供支持。\n\n| 公司 | 简介 | 赞助级别 |\n|------|------|----------------------------|\n| \u003Ca href=\"https:\u002F\u002Fwww.thordata.com\u002F?ls=github&lk=crawl4ai\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fgist.github.com\u002Faravindkarnam\u002Fdfc598a67be5036494475acece7e54cf\u002Fraw\u002Fthor_data.svg\" alt=\"Thor Data\" width=\"120\"\u002F>\u003C\u002Fa>  | 使用 Thordata 可确保与任何 AI\u002FML 工作流和数据基础设施无缝兼容，在 99.9% 的正常运行时间内大规模获取网络数据，并享有专人客户支持。 | 🥈 银牌 |\n| \u003Ca href=\"https:\u002F\u002Fapp.nstproxy.com\u002Fregister?i=ecOqW9\" target=\"_blank\">\u003Cpicture>\u003Csource width=\"250\" media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fgist.github.com\u002Faravindkarnam\u002F62f82bd4818d3079d9dd3c31df432cf8\u002Fraw\u002Fnst-light.svg\">\u003Csource width=\"250\" media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fwww.nstproxy.com\u002Flogo.svg\">\u003Cimg alt=\"nstproxy\" src=\"https:\u002F\u002Fwww.nstproxy.com\u002Flogo.svg\">\u003C\u002Fpicture>\u003C\u002Fa>  | NstProxy 是一家值得信赖的代理服务提供商，拥有超过 1.1 亿个真实住宅 IP 地址，可按城市级别精准定位，提供 99.99% 的正常运行时间，且价格低廉，仅需 0.1 美元\u002FGB，带来无与伦比的稳定性、扩展性和成本效益。 | 🥈 银牌 |\n| \u003Ca href=\"https:\u002F\u002Fapp.scrapeless.com\u002Fpassport\u002Fregister?utm_source=official&utm_term=crawl4ai\" target=\"_blank\">\u003Cpicture>\u003Csource width=\"250\" media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fgist.githubusercontent.com\u002Faravindkarnam\u002F0d275b942705604263e5c32d2db27bc1\u002Fraw\u002FScrapeless-light-logo.svg\">\u003Csource width=\"250\" media=\"(prefers-color-scheme: light)\" srcset=\"https:\u002F\u002Fgist.githubusercontent.com\u002Faravindkarnam\u002F22d0525cc0f3021bf19ebf6e11a69ccd\u002Fraw\u002FScrapeless-dark-logo.svg\">\u003Cimg alt=\"Scrapeless\" src=\"https:\u002F\u002Fgist.githubusercontent.com\u002Faravindkarnam\u002F22d0525cc0f3021bf19ebf6e11a69ccd\u002Fraw\u002FScrapeless-dark-logo.svg\">\u003C\u002Fpicture>\u003C\u002Fa>  | Scrapeless 提供用于爬取、自动化和 AI 代理的生产级基础设施，包括抓取浏览器、四种代理类型以及通用抓取 API。 | 🥈 银牌 |\n| \u003Ca href=\"https:\u002F\u002Fdashboard.capsolver.com\u002Fpassport\u002Fregister?inviteCode=ESVSECTX5Q23\" target=\"_blank\">\u003Cpicture>\u003Csource width=\"120\" media=\"(prefers-color-scheme: dark)\" srcset=\"https:\u002F\u002Fdocs.crawl4ai.com\u002Fuploads\u002Fsponsors\u002F20251013045338_72a71fa4ee4d2f40.png\">\u003Csource width=\"120\" media=\"(prefers-color-scheme: light)\" 
srcset=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_feed151b0bf6.png\">\u003Cimg alt=\"Capsolver\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_feed151b0bf6.png\">\u003C\u002Fpicture>\u003C\u002Fa> | 基于 AI 的验证码破解服务。支持所有主流验证码类型，包括 reCAPTCHA、Cloudflare 等。 | 🥉 铜牌 |\n| \u003Ca href=\"https:\u002F\u002Fkipo.ai\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_0881c5f473f5.png\" alt=\"DataSync\" width=\"120\"\u002F>\u003C\u002Fa> | 帮助工程师和采购人员在几秒钟内查找、比较并采购电子和工业零部件，提供规格、价格、交货周期及替代方案。 | 🥇 金牌 |\n| \u003Ca href=\"https:\u002F\u002Fwww.kidocode.com\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fdocs.crawl4ai.com\u002Fuploads\u002Fsponsors\u002F20251013045045_bb8dace3f0440d65.svg\" alt=\"Kidocode\" width=\"120\"\u002F>\u003Cp align=\"center\">KidoCode\u003C\u002Fp>\u003C\u002Fa> | Kidocode 是一所面向 5 至 18 岁儿童的混合型技术和创业学校，提供线上和校内两种教育模式。 | 🥇 金牌 |\n| \u003Ca href=\"https:\u002F\u002Fwww.alephnull.sg\u002F\" target=\"_blank\">\u003Cimg src=\"https:\u002F\u002Fdocs.crawl4ai.com\u002Fuploads\u002Fsponsors\u002F20251013050323_a9e8e8c4c3650421.svg\" alt=\"Aleph null\" width=\"120\"\u002F>\u003C\u002Fa> | 新加坡的 Aleph Null 是亚洲领先的教育科技中心，致力于以学生为中心、由 AI 驱动的教育，帮助学习者掌握在快速变化的世界中茁壮成长所需的工具。 | 🥇 金牌 |\n\n\n\n### 🧑‍🤝 个人赞助商\n\n衷心感谢我们的个人支持者！每一份捐助都帮助我们延续并蓬勃发展开源使命！\n\n\u003Cp align=\"left\">\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fhafezparast\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_b87016576f1d.png\" style=\"border-radius:50%;\" width=\"64px;\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fntohidi\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_a7f59d2272d2.png\" style=\"border-radius:50%;\"width=\"64px;\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FSjoeborg\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_5fe63a7c96c7.png\" style=\"border-radius:50%;\"width=\"64px;\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fromek-rozen\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_d65e0acafaa4.png\" style=\"border-radius:50%;\"width=\"64px;\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FKourosh-Kiyani\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_4a9dad2c26a2.png\" style=\"border-radius:50%;\"width=\"64px;\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FEtherdrake\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_b2919e96816f.png\" style=\"border-radius:50%;\"width=\"64px;\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fshaman247\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_552e83377704.png\" style=\"border-radius:50%;\"width=\"64px;\"\u002F>\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fwork-flow-manager\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_f25e38f207f5.png\" style=\"border-radius:50%;\"width=\"64px;\"\u002F>\u003C\u002Fa>\n\u003C\u002Fp>\n\n> 想加入他们吗？[赞助 Crawl4AI →](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Funclecode)\n\n## 
星标历史\n\n[![星标历史图表](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_readme_b3bea90a214a.png)](https:\u002F\u002Fstar-history.com\u002F#unclecode\u002Fcrawl4ai&Date)","# Crawl4AI 快速上手指南\n\nCrawl4AI 是一款开源的、对大语言模型（LLM）友好的网络爬虫与数据提取工具。它能将网页转换为干净、结构化的 Markdown 格式，专为 RAG（检索增强生成）、AI Agent 和数据管道设计。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**：Windows, macOS, 或 Linux\n*   **Python 版本**：Python 3.9 或更高版本\n*   **前置依赖**：\n    *   `pip` (Python 包管理工具)\n    *   浏览器内核（安装过程中会自动处理，若失败需手动安装 Playwright 浏览器）\n\n> **提示**：国内用户若遇到网络连接问题，建议在安装前配置 pip 国内镜像源（如清华源或阿里源）。\n\n## 安装步骤\n\n### 1. 安装 Python 包\n\n使用 pip 安装最新版本的 Crawl4AI：\n\n```bash\n# 推荐使用国内镜像源加速安装\npip install -U crawl4ai -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 如需体验预发布版本\npip install crawl4ai --pre -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 2. 运行初始化设置\n\n安装完成后，必须运行设置命令以下载必要的浏览器组件和依赖：\n\n```bash\ncrawl4ai-setup\n```\n\n### 3. 验证安装\n\n运行诊断工具确认环境是否配置成功：\n\n```bash\ncrawl4ai-doctor\n```\n\n> **注意**：如果上述步骤中浏览器安装失败，可以手动执行以下命令安装 Chromium：\n> ```bash\n> python -m playwright install --with-deps chromium\n> ```\n\n## 基本使用\n\nCrawl4AI 支持 Python 代码调用和命令行工具两种使用方式。\n\n### 方式一：Python 代码调用（推荐）\n\n这是最灵活的使用方式，适合集成到现有项目中。创建一个 `.py` 文件并运行以下代码：\n\n```python\nimport asyncio\nfrom crawl4ai import *\n\nasync def main():\n    # 初始化异步爬虫\n    async with AsyncWebCrawler() as crawler:\n        # 执行抓取，默认返回清洗后的 Markdown\n        result = await crawler.arun(\n            url=\"https:\u002F\u002Fwww.nbcnews.com\u002Fbusiness\",\n        )\n        # 输出结果\n        print(result.markdown)\n\nif __name__ == \"__main__\":\n    asyncio.run(main())\n```\n\n### 方式二：命令行工具 (CLI)\n\n如果您需要快速测试或进行简单的批量抓取，可以直接使用内置的 `crwl` 命令：\n\n```bash\n# 基础抓取：获取网页的 Markdown 内容\ncrwl https:\u002F\u002Fwww.nbcnews.com\u002Fbusiness -o markdown\n\n# 深度抓取：使用 BFS 策略，最多抓取 10 个页面\ncrwl https:\u002F\u002Fdocs.crawl4ai.com --deep-crawl bfs --max-pages 10\n\n# LLM 智能提取：针对特定问题提取数据（需配置 LLM）\ncrwl https:\u002F\u002Fwww.example.com\u002Fproducts -q \"Extract all product prices\"\n```\n\n### 核心特性简述\n\n*   **LLM Ready**：自动去除网页噪音，生成包含标题、表格和代码块的纯净 Markdown。\n*   **动态渲染**：内置浏览器引擎，完美支持 JavaScript 渲染和懒加载内容。\n*   **结构化提取**：支持通过 CSS 选择器或自然语言提问直接从网页提取 JSON 数据。","某金融科技团队需要每日从数十个动态渲染的新闻网站和论坛中抓取最新的市场舆情，并将其清洗为结构化数据供内部大模型进行风险研判。\n\n### 没有 crawl4ai 时\n- **动态内容获取失败**：传统爬虫无法执行 JavaScript，导致大量依赖前端渲染的核心新闻内容抓取为空。\n- **反爬机制阻碍严重**：面对网站的机器人检测机制，脚本频繁被拦截或封禁，需耗费大量精力维护代理池和伪装头。\n- **数据清洗成本高昂**：抓取的原始 HTML 包含大量导航栏、广告和脚本标签，开发人员需编写复杂的正则表达式进行清洗，耗时且易出错。\n- **大模型对接困难**：非结构化的噪声数据直接输入 LLM 会导致幻觉增加，团队需额外开发中间件将 HTML 转换为适合 RAG（检索增强生成）的格式。\n\n### 使用 crawl4ai 后\n- **完整内容直达**：crawl4ai 内置浏览器引擎，自动处理 Shadow DOM 和动态加载，确保所有隐藏或延迟加载的舆情信息完整获取。\n- **智能绕过检测**：利用其自带的三层反机器人检测和代理升级机制，稳定突破网站防御，大幅降低请求失败率。\n- **原生 Markdown 输出**：crawl4ai 直接将网页转换为干净的 Markdown 格式，自动剔除广告与无关标签，省去了繁琐的后处理步骤。\n- **LLM 就绪数据流**：输出的数据天然适配大模型上下文窗口，团队可直接将其接入 RAG 管道，显著提升了舆情分析的准确率与响应速度。\n\ncrawl4ai 将原本需要数天构建的复杂采集清洗链路简化为几行代码，让团队能专注于核心业务逻辑而非数据基建。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Funclecode_crawl4ai_580352f7.png","unclecode","UncleCode","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Funclecode_2ccc07ad.jpg","Author of Crawl4AI (#1 GitHub Trending). Founder of Kidocode, SE Asia's largest tech & biz school. 
Leading AI research on synthetic data & coffee lover.","AlephNul, and KidoCode","Singapore","unclecode@kidocode.com","http:\u002F\u002Fkidocode.com","https:\u002F\u002Fgithub.com\u002Funclecode",[85,89,93,97],{"name":86,"color":87,"percentage":88},"Python","#3572A5",98.8,{"name":90,"color":91,"percentage":92},"JavaScript","#f1e05a",0.8,{"name":94,"color":95,"percentage":96},"Shell","#89e051",0.2,{"name":98,"color":99,"percentage":96},"Dockerfile","#384d54",63242,6463,"2026-04-02T22:29:19","Apache-2.0","Linux, macOS, Windows","未说明",{"notes":107,"python":108,"dependencies":109},"安装后需运行 'crawl4ai-setup' 进行配置，并建议手动执行 'python -m playwright install --with-deps chromium' 安装浏览器依赖。支持 Docker 部署及通过 Chrome DevTools Protocol 进行远程浏览器控制。v0.8.6 版本因安全问题将 litellm 替换为 unclecode-litellm。","3.8+",[110,111,112],"playwright","litellm (或 unclecode-litellm)","fastapi",[43,15],null,21,"2026-03-27T02:49:30.150509","2026-04-06T06:55:39.010209",[119,124,129,134,139,143],{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},11060,"遇到 'Page.content: Target page, context or browser has been closed' 错误怎么办？","该错误通常发生在并发爬取时。请检查代码中是否实例化了多次 AsyncWebCrawler，确保只实例化一次。如果在本地使用 arun_many 进行并发运行时仍出现此错误，可能需要检查是否依赖了过时的 monkey patch（猴子补丁），建议更新到最新版本或调整并发策略。","https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fissues\u002F842",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},11061,"为什么会出现 'maximum recursion depth exceeded'（超过最大递归深度）错误？","这通常是由多个爬取进程重叠引起的，特别是在 Docker 容器中同时运行不同频率的任务（如每小时和每分钟的任务）时。当一个长任务运行时，短任务可能会挂起并最终导致递归错误。尝试将 thread_safe 设置为 True 可能有所帮助，但根本原因通常是进程竞争资源。建议错开任务执行时间或优化并发控制逻辑。","https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fissues\u002F399",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},11062,"爬取时遇到 'net::ERR_ABORTED' 导航错误如何解决？","该错误常见于多页爬取、深度爬取或使用代理配置时。尝试在配置中将 wait_until 参数设置为 \"domcontentloaded\" 而非默认的 \"load\"，这通常能解决大部分导航中断问题。如果使用了代理或持久化浏览器上下文（use_persistent_context=True），请检查代理配置是否正确，或参考相关 Issue #1779 获取针对 Docker 环境下的修复方案。","https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fissues\u002F1367",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},11063,"在 AWS Lambda 或只读文件系统环境中导入 crawl4ai 报错 'Read-only file system' 怎么办？","这是因为 crawl4ai 在初始化时试图写入临时文件或缓存到默认目录，而 AWS Lambda 等环境的文件系统是只读的。解决方案是设置环境变量 PLAYWRIGHT_BROWSERS_PATH 指向一个可写目录（如 \u002Ftmp），或者在代码中配置浏览器启动参数，将用户数据目录指向可写路径。例如：export PLAYWRIGHT_BROWSERS_PATH=\u002Ftmp 或在 Python 代码中指定 browser_type.launch(user_data_dir=\"\u002Ftmp\u002Fmy-user-data\")。","https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fissues\u002F155",{"id":140,"question_zh":141,"answer_zh":142,"source_url":133},11064,"如何在 Docker 环境中正确运行 crawl4ai 以避免导航错误？","在 Docker 中运行 crawl4ai 时，经常遇到 net::ERR_ABORTED 错误。这通常与浏览器启动参数或网络配置有关。确保 Docker 容器内安装了必要的依赖（如 chromium 驱动），并尝试添加启动参数 --no-sandbox 和 --disable-setuid-sandbox。此外，对于深度爬取任务，建议跟踪 Issue #1779 以获取最新的 Docker 导航问题修复补丁。",{"id":144,"question_zh":145,"answer_zh":146,"source_url":123},11065,"并发爬取时如何避免浏览器上下文关闭导致的错误？","并发爬取时，如果多个任务共享同一个浏览器实例或上下文，可能会导致 'Target page, context or browser has been closed' 错误。建议在并发场景下为每个任务创建独立的浏览器上下文，或者使用库提供的并发安全方法（如 arun_many）并确保内部实现了正确的资源隔离。避免手动重复实例化 AsyncWebCrawler，除非明确需要独立的浏览器实例。",[148,153,158,163,168,173,178,183,188,193,198,203,208,213],{"id":149,"version":150,"summary_zh":151,"released_at":152},60721,"v0.8.5","## 🎉 Crawl4AI v0.8.5 发布！\n\n### 📦 安装\n\n**PyPI:**\n```bash\npip install crawl4ai==0.8.5\n```\n\n**Docker:**\n```bash\ndocker pull unclecode\u002Fcrawl4ai:0.8.5\ndocker pull unclecode\u002Fcrawl4ai:latest\n```\n\n**注意：** Docker 
镜像正在构建中，不久后即可使用。请查看 [Docker Release 工作流](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Factions\u002Fworkflows\u002Fdocker-release.yml)，以了解构建状态。\n\n### 📝 变更内容\n详情请参阅 [CHANGELOG.md](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FCHANGELOG.md)。","2026-03-18T03:34:27",{"id":154,"version":155,"summary_zh":156,"released_at":157},60722,"v0.8.0","## 🎉 Crawl4AI v0.8.0 发布！\n\n### 📦 安装\n\n**PyPI:**\n```bash\npip install crawl4ai==0.8.0\n```\n\n**Docker:**\n```bash\ndocker pull unclecode\u002Fcrawl4ai:0.8.0\ndocker pull unclecode\u002Fcrawl4ai:latest\n```\n\n**注意：** Docker 镜像正在构建中，不久后即可使用。请查看 [Docker Release 工作流](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Factions\u002Fworkflows\u002Fdocker-release.yml) 以了解构建状态。\n\n### 📝 变更内容\n详情请参阅 [CHANGELOG.md](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FCHANGELOG.md)。","2026-01-16T10:40:17",{"id":159,"version":160,"summary_zh":161,"released_at":162},60723,"v0.7.8","## 🎉 Crawl4AI v0.7.8 发布！\n\n### 📦 安装\n\n**PyPI:**\n```bash\npip install crawl4ai==0.7.8\n```\n\n**Docker:**\n```bash\ndocker pull unclecode\u002Fcrawl4ai:0.7.8\ndocker pull unclecode\u002Fcrawl4ai:latest\n```\n\n**注意：** Docker 镜像正在构建中，不久后即可使用。请查看 [Docker Release 工作流](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Factions\u002Fworkflows\u002Fdocker-release.yml)，以了解构建状态。\n\n### 📝 变更内容\n详情请参阅 [CHANGELOG.md](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FCHANGELOG.md)。","2025-12-09T08:49:38",{"id":164,"version":165,"summary_zh":166,"released_at":167},60724,"v0.7.7","## 🎉 Crawl4AI v0.7.7 发布！\n\n本次发布推出了一套完整的企业级实时监控自托管平台。这一版本将 Crawl4AI Docker 从一个简单的容器化爬虫工具，升级为具备完全运行透明度和控制能力的生产就绪平台。\n\n## 🚀 新特性\n\n### 重大功能：实时监控与自托管平台\n\n现在的 Docker 部署包含：\n\n- 📊 交互式监控仪表板 (\u002Fdashboard)\n- 🔌 全面的监控 API\n- ⚡ WebSocket 流式传输\n- 🔥 智能浏览器池（三层架构）\n- 🧹 清理系统\n- 📈 生产就绪\n\n### 🐛 重要 bug 修复\n\n- 修复了异步 LLM 提取阻塞问题 (#1055) —— 现已支持真正的并行处理\n- 修复了 CDP 端点验证中的指数退避机制 (#1445)\n- 修正了 arun_many 函数，使其即使在异常情况下也能始终返回列表\n\n### 配置与功能\n\n- 更新了浏览器和爬虫配置文档，使其与实际实现一致\n- 增强了深度优先搜索（DFS）爬取策略，加入了已访问 URL 的追踪功能\n- 修复了 AsyncUrlSeeder 中的站点地图解析和 URL 规范化问题 (#1559)\n- 修复了托管浏览器中的视口配置问题 (#1490)\n- 修复了 remove_overlay_elements 功能 (#1396)\n\n### Docker 与基础设施\n\n- 修复了多提供商支持下的 LLM API 密钥处理问题\n- 将所有配置中的 Docker 端口统一标准化为 11235\n- 改进了错误处理，增加了全面的状态码支持\n- 修复了 \u002Fcrawl 和 \u002Fcrawl\u002Fstream 端点中 fit_html 的序列化问题\n\n### 安全性\n\n- 将 pyOpenSSL 从 >=24.3.0 升级至 >=25.3.0（修复安全漏洞）\n- 增加了针对安全更新的验证测试\n\n### 📦 安装\n\n**PyPI:**\n```bash\npip install crawl4ai==0.7.7\n```\n\n**Docker:**\n```bash\ndocker pull unclecode\u002Fcrawl4ai:0.7.7\ndocker pull unclecode\u002Fcrawl4ai:latest\n```\n\n**注意：** Docker 镜像正在构建中，不久即可使用。请查看 [Docker Release 工作流](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Factions\u002Fworkflows\u002Fdocker-release.yml)，以了解构建状态。\n\n### 📝 变更内容\n详情请参阅 [CHANGELOG.md](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FCHANGELOG.md)。","2025-11-14T09:28:40",{"id":169,"version":170,"summary_zh":171,"released_at":172},60725,"v0.7.6","## 🎉 Crawl4AI v0.7.6 发布！\n\nCrawl4AI v0.7.6 - Docker 作业队列 API 的 Webhook 支持\n\n用户现在可以：\n- 在 \u002Fcrawl\u002Fjob 和 \u002Fllm\u002Fjob 两个端点上使用 Webhook\n- 获取实时通知，无需轮询\n- 配置带有自定义头部的 Webhook 传递\n- 在 Webhook 负载中包含完整数据\n- 在 config.yml 中设置全局 Webhook URL\n- 享受带指数退避的自动重试功能\n\n### 📦 安装\n\n**PyPI：**\n```bash\npip install crawl4ai==0.7.6\n```\n\n**Docker：**\n```bash\ndocker pull unclecode\u002Fcrawl4ai:0.7.6\ndocker pull 
unclecode\u002Fcrawl4ai:latest\n```\n\n**注意：** Docker 镜像正在构建中，很快就会发布。请查看 [Docker Release 工作流](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Factions\u002Fworkflows\u002Fdocker-release.yml)，以了解构建状态。\n\n### 📝 变更内容\n详细信息请参阅 [CHANGELOG.md](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FCHANGELOG.md)。","2025-10-22T12:06:09",{"id":174,"version":175,"summary_zh":176,"released_at":177},60726,"v0.7.5","# 🚀 Crawl4AI v0.7.5：Docker 钩子与安全更新\n\n## 🎯 新增功能\n\n### 🔧 Docker 钩子系统\n在 8 个关键的管道节点中注入自定义 Python 函数，用于身份验证、性能优化和内容处理。\n\n**基于函数的 API**，支持 IDE 开发：\n```python\nfrom crawl4ai import hooks_to_string\n\nasync def on_page_context_created(page, context, **kwargs):\n    \"\"\"阻止加载图片以加快爬取速度\"\"\"\n    await context.route(\"**\u002F*.{png,jpg,jpeg,gif,webp}\", lambda route: route.abort())\n    return page\n\nhooks_code = hooks_to_string({\"on_page_context_created\": on_page_context_created})\n```\n\n8 个可用钩子点：\n`on_browser_created`、`on_page_context_created`、`before_goto`、`after_goto`、`on_user_agent_updated`、`on_execution_started`、`before_retrieve_html`、`before_return_html`\n\n### 🤖 增强的 LLM 集成\n\n- 自定义温度参数，用于控制创造力\n- 多提供商支持（OpenAI、Gemini、自定义端点）\n- `base_url` 配置，适用于自托管模型\n- 改进的 Docker API 集成\n\n### 🔒 HTTPS 保留\n\n新增 `preserve_https_for_internal_links` 选项，在整个爬取过程中保持安全协议——这对于已认证会话和注重安全的应用程序至关重要。\n\n### 🛠️ 重大错误修复\n\n- URL 处理：修复了查询参数中“+”号被转义的问题 (#1332)\n- JWT 身份验证：解决了 Docker 中 JWT 验证问题 (#1442)\n- Playwright 隐身模式：修复了隐身功能集成问题 (#1481)\n- 代理配置：通过新的 `proxy_config` 结构增强了解析能力\n- 内存管理：修复了长时间运行会话中的内存泄漏\n- Docker 序列化：解决了 JSON 编码错误 (#1419)\n- LLM 提供商：修复了自定义提供商集成到自适应爬虫中的问题 (#1291)\n- 性能：解决了退避策略失效的问题 (#989)\n\n---\n\n## 📦 安装\n\n**PyPI：**\n```bash\npip install crawl4ai==0.7.5\n```\n\n**Docker：**\n```bash\ndocker pull unclecode\u002Fcrawl4ai:0.7.5\ndocker pull unclecode\u002Fcrawl4ai:latest\n```\n\n支持平台：Linux\u002FAMD64、Linux\u002FARM64（Apple Silicon、AWS Graviton）\n\n---\n\n## ⚠️ 破坏性变更\n\n1. 需要 Python 3.10 或更高版本（从 3.9 升级）\n2. 代理参数已弃用——请使用新的 `proxy_config` 结构（示例见下）\n3. 新依赖项——添加了 `cssselect`，以更好地处理 CSS
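针对上面第 2 条破坏性变更，补充一个新 `proxy_config` 结构的最小示意（代理地址与凭据均为演示假设；`ProxyConfig` 的导入路径按 v0.6.0 说明中“移至 `async_configs`”书写，具体字段与参数名请以当前版本文档为准）：

```python
import asyncio

from crawl4ai import AsyncWebCrawler, BrowserConfig
from crawl4ai.async_configs import ProxyConfig

# 结构化代理配置，替代旧的字符串/字典式代理参数
proxy = ProxyConfig(
    server="http://127.0.0.1:8080",  # 演示用代理地址
    username="user",                 # 代理无需认证时可省略
    password="pass",
)

browser_config = BrowserConfig(proxy_config=proxy)

async def main():
    async with AsyncWebCrawler(config=browser_config) as crawler:
        result = await crawler.arun("https://httpbin.org/ip")
        print(result.markdown)

if __name__ == "__main__":
    asyncio.run(main())
```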
\n\n---\n\n## 📚 资源\n\n- 📖 完整发布说明：https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002Fdocs\u002Fblog\u002Frelease-v0.7.5.md\n- 📘 文档：https:\u002F\u002Fdocs.crawl4ai.com\n- 💬 Discord 社区：https:\u002F\u002Fdiscord.gg\u002FjP8KfhDhyN\n- 🐛 问题追踪：https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fissues\n\n---\n\n## 🙏 感谢\n\n感谢所有报告问题、提供反馈并为本次发布做出贡献的开发者和用户！\n\n完整变更日志：https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fcompare\u002Fv0.7.4...v0.7.5","2025-10-21T08:15:10",{"id":179,"version":180,"summary_zh":181,"released_at":182},60727,"v0.7.4","## 🎉 Crawl4AI v0.7.4 发布！\n\n### 📦 安装\n\n**PyPI:**\n```bash\npip install crawl4ai==0.7.4\n```\n\n**Docker:**\n```bash\ndocker pull unclecode\u002Fcrawl4ai:0.7.4\ndocker pull unclecode\u002Fcrawl4ai:latest\n```\n\n### 📝 变更内容\n详情请参阅 [CHANGELOG.md](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FCHANGELOG.md)。","2025-08-17T12:12:56",{"id":184,"version":185,"summary_zh":186,"released_at":187},60728,"v0.7.3","# 🚀 Crawl4AI v0.7.3：多配置智能更新\n\n欢迎使用 Crawl4AI v0.7.3！本次发布带来了强大的新功能，包括隐身爬取、智能 URL 配置、内存优化以及增强的数据提取能力。无论您是在处理反爬虫保护的网站、混合内容类型，还是大规模爬取任务，这次更新都能满足您的需求。\n\n## 💖 GitHub 赞助现已上线！\n\n在为 **51,000+ 开发者** 提供支持并成为 **#1 热门网络爬虫** 后，我们推出了 GitHub 赞助计划，以确保 Crawl4AI 永远保持独立与创新。\n\n### 🏆 成为创始赞助者（仅限前 50 名！）\n\n- **🌱 支持者（每月 $5）**：加入我们的社区 + 赞助者专属 Discord 社区\n- **🚀 构建者（每月 $50）**：优先支持 + 提前体验新功能  \n- **💼 发展团队（每月 $500）**：每两周一次同步会议 + 优化协助\n- **🏢 数据基础设施合作伙伴（每月 $2000）**：全面合作 + 专属支持\n\n**为什么选择赞助？** 掌控您的数据管道。无 API 限制。直接与开发者沟通交流。\n\n[**立即成为赞助者 →**](https:\u002F\u002Fgithub.com\u002Fsponsors\u002Funclecode) | [查看权益](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\u002Fblob\u002Fmain\u002FSPONSORS.md)\n\n---\n\n## 🎯 重大特性\n\n### 🕵️ 隐身浏览器支持\n借助全新的隐身能力，突破复杂的机器人检测系统：\n\n```python\nfrom crawl4ai import AsyncWebCrawler, BrowserConfig\n\n# 启用隐身模式，绕过机器人检测\nbrowser_config = BrowserConfig(\n    browser_type=\"undetected\",  # 使用反检测（undetected）Chrome 浏览器\n    headless=True,              # 可以在无头模式下运行并保持隐身\n    extra_args=[\n        \"--disable-blink-features=AutomationControlled\",\n        \"--disable-web-security\"\n    ]\n)\n\nasync with AsyncWebCrawler(config=browser_config) as crawler:\n    # 成功绕过 Cloudflare、Akamai 以及自定义的机器人检测机制\n    result = await crawler.arun(\"https:\u002F\u002Fprotected-site.com\")\n    print(f\"✅ 已绕过防护！内容长度：{len(result.markdown)} 字符\")\n```\n\n**它能实现什么：**\n- 访问此前被屏蔽的企业级网站和数据库\n- 从受保护的来源获取竞争对手数据  \n- 监控具有反爬虫措施的电商平台价格信息\n- 在防护系统存在的情况下采集新闻和社交媒体内容\n\n### 🎨 多 URL 配置系统\n自动为不同的 URL 模式应用不同的爬取策略：\n\n```python\nfrom crawl4ai import CrawlerRunConfig\n\n# 为不同类型的内容定义专用配置\nconfigs = [\n    # 文档类网站 - 积极缓存，包含链接\n    CrawlerRunConfig(\n        url_matcher=[\"*docs*\", \"*documentation*\"],\n        cache_mode=\"write\",\n        markdown_generator_options={\"include_links\": True}\n    ),\n    \n    # 新闻\u002F博客类网站 - 获取最新内容，模拟滚动加载\n    CrawlerRunConfig(\n        url_matcher=lambda url: 'blog' in url or 'news' in url,\n        cache_mode=\"bypass\",\n        js_code=\"window.scrollTo(0, document.body.scrollHeight\u002F2);\"\n    ),\n    \n    # API 接口 - 结构化提取\n    CrawlerRunConfig(\n  ","2025-08-09T12:38:57",{"id":189,"version":190,"summary_zh":191,"released_at":192},60729,"v0.7.2","# 🚀 Crawl4AI v0.7.2：CI\u002FCD 与依赖优化更新\n\n*2025年7月25日 • 阅读需3分钟*\n\n---\n\n本次发布引入了自动化的 CI\u002FCD 流水线，实现无缝部署，并对依赖项进行了优化，使软件包更轻量、更高效。\n\n## 🎯 新增功能\n\n### 🔄 自动化发布流水线\n- **GitHub Actions CI\u002FCD**：标签推送时自动触发 PyPI 和 Docker Hub 的发布\n- **多平台 Docker 镜像**：同时支持 AMD64 和 ARM64 架构\n- **版本一致性检查**：确保标签、软件包和 Docker 镜像的版本一致\n- 
**自动生成发行说明**：GitHub 发布会自动创建\n\n### 📦 依赖优化\n- **将 sentence-transformers 移至可选依赖**：显著减小默认安装包的大小\n- **更轻量的 Docker 镜像**：优化 Dockerfile，构建速度更快、镜像更小\n- **更好的依赖管理**：核心依赖与可选依赖清晰分离\n\n## 🏗️ CI\u002FCD 流水线\n\n新的自动化发布流程确保了发布的一致性和可靠性：\n\n```bash\n# 通过简单打标签触发发布\ngit tag v0.7.2\ngit push origin v0.7.2\n\n# 自动执行：\n# ✅ 校验版本一致性\n# ✅ 构建并发布到 PyPI\n# ✅ 构建多平台 Docker 镜像\n# ✅ 带正确标签推送到 Docker Hub\n# ✅ 创建 GitHub 发布\n```\n\n## 💾 更轻量的安装\n\n默认安装现在明显更小：\n\n```bash\n# 核心安装（更小、更快）\npip install crawl4ai==0.7.2\n\n# 包含 ML 功能（包含 sentence-transformers）\npip install crawl4ai[transformer]==0.7.2\n\n# 完整安装\npip install crawl4ai[all]==0.7.2\n```\n\n## 🐳 Docker 改进\n\n增强了 Docker 支持，提供多平台镜像：\n\n```bash\n# 拉取最新版本\ndocker pull unclecode\u002Fcrawl4ai:0.7.2\ndocker pull unclecode\u002Fcrawl4ai:latest\n\n# 可用标签：\n# - unclecode\u002Fcrawl4ai:0.7.2（特定版本）\n# - unclecode\u002Fcrawl4ai:0.7（次版本）\n# - unclecode\u002Fcrawl4ai:0（主版本）\n# - unclecode\u002Fcrawl4ai:latest\n```\n\n## 🔧 技术细节\n\n### 依赖变更\n- `sentence-transformers` 由必选依赖改为可选依赖\n- 默认安装包大小减少约 500MB\n- 在不需要 transformer 功能时，不会影响正常使用\n\n### CI\u002FCD 配置\n- 使用 GitHub Actions 工作流实现自动化发布\n- 发布前进行版本验证\n- 并行部署 PyPI 和 Docker Hub\n- Docker 镜像采用自动打标签策略\n\n## 🚀 安装\n\n```bash\npip install crawl4ai==0.7.2\n```\n\n无破坏性变更——可直接从 v0.7.0 或 v0.7.1 升级。\n\n---\n\n有问题或疑问？\n- GitHub：[github.com\u002Funclecode\u002Fcrawl4ai](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai)\n- Discord：[discord.gg\u002Fcrawl4ai](https:\u002F\u002Fdiscord.gg\u002FjP8KfhDhyN)\n- Twitter：[@unclecode](https:\u002F\u002Fx.com\u002Funclecode)\n\n*附注：新的 CI\u002FCD 流水线将使未来的发布更加迅速和可靠。感谢您的耐心等待，我们正在不断改进发布流程！*","2025-07-25T10:19:25",{"id":194,"version":195,"summary_zh":196,"released_at":197},60730,"v0.7.1","# 🛠️ Crawl4AI v0.7.1：小幅清理更新\n\n*2025年7月17日 · 阅读需2分钟*\n\n---\n\n这是一次小型维护版本更新，移除了未使用的代码并改进了文档。\n\n## 🎯 变更内容\n\n- 从 `crawl4ai\u002Fbrowser_manager.py` 中移除了未使用的 `StealthConfig`\n- 更新了文档，增加了更好的示例和参数说明\n- 修复了文档中虚拟滚动配置的示例\n\n## 🧹 代码清理\n\n移除了未使用的 `StealthConfig` 导入及配置，这些代码在项目中从未被使用过。目前项目通过 JavaScript 注入实现了自定义的隐身功能。\n\n```python\n# 移除的未使用代码：\nfrom playwright_stealth import StealthConfig\nstealth_config = StealthConfig(...)  # 此处从未被使用\n```\n\n## 📖 文档更新\n\n- 修正了自适应爬取参数的示例\n- 更新了会话管理相关文档\n- 纠正了虚拟滚动配置的示例\n\n## 🚀 安装\n\n```bash\npip install crawl4ai==0.7.1\n```\n\n本次更新无破坏性变更，可直接从 v0.7.0 升级。\n\n---\n\n有问题或遇到困难？  \n- GitHub：[github.com\u002Funclecode\u002Fcrawl4ai](https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai)  \n- Discord：[discord.gg\u002Fcrawl4ai](https:\u002F\u002Fdiscord.gg\u002FjP8KfhDhyN)","2025-07-17T09:48:00",{"id":199,"version":200,"summary_zh":201,"released_at":202},60731,"v0.7.0","# 🚀 Crawl4AI v0.7.0: The Adaptive Intelligence Update\n\n*January 28, 2025 • 10 min read*\n\n---\n\nToday I'm releasing Crawl4AI v0.7.0—the Adaptive Intelligence Update. 
This release introduces fundamental improvements in how Crawl4AI handles modern web complexity through adaptive learning, intelligent content discovery, and advanced extraction capabilities.\n\n## 🎯 What's New at a Glance\n\n- **Adaptive Crawling**: Your crawler now learns and adapts to website patterns\n- **Virtual Scroll Support**: Complete content extraction from infinite scroll pages\n- **Link Preview with 3-Layer Scoring**: Intelligent link analysis and prioritization\n- **Async URL Seeder**: Discover thousands of URLs in seconds with intelligent filtering\n- **PDF Parsing**: Extract data from PDF documents\n- **Performance Optimizations**: Significant speed and memory improvements\n\n## 🧠 Adaptive Crawling: Intelligence Through Pattern Learning\n\n**The Problem:** Websites change. Class names shift. IDs disappear. Your carefully crafted selectors break at 3 AM, and you wake up to empty datasets and angry stakeholders.\n\n**My Solution:** I implemented an adaptive learning system that observes patterns, builds confidence scores, and adjusts extraction strategies on the fly. It's like having a junior developer who gets better at their job with every page they scrape.\n\n### Technical Deep-Dive\n\nThe Adaptive Crawler maintains a persistent state for each domain, tracking:\n- Pattern success rates\n- Selector stability over time  \n- Content structure variations\n- Extraction confidence scores\n\n```python\nfrom crawl4ai import AsyncWebCrawler, CrawlerRunConfig, AdaptiveCrawler, AdaptiveConfig, CrawlState\n\n# Initialize with custom learning parameters\nconfig = AdaptiveConfig(\n    confidence_threshold=0.7,    # Min confidence to use learned patterns\n    max_history=100,            # Remember last 100 crawls per domain\n    learning_rate=0.2,          # How quickly to adapt to changes\n    patterns_per_page=3,        # Patterns to learn per page type\n    extraction_strategy='css'   # 'css' or 'xpath'\n)\n\nadaptive_crawler = AdaptiveCrawler(config)\n\n# First crawl - crawler learns the structure\nasync with AsyncWebCrawler() as crawler:\n    result = await crawler.arun(\n        \"https:\u002F\u002Fnews.example.com\u002Farticle\u002F12345\",\n        config=CrawlerRunConfig(\n            adaptive_config=config,\n            extraction_hints={  # Optional hints to speed up learning\n                \"title\": \"article h1\",\n                \"content\": \"article .body-content\"\n            }\n        )\n    )\n    \n    # Crawler identifies and stores patterns\n    if result.success:\n        state = adaptive_crawler.get_state(\"news.example.com\")\n        print(f\"Learned {len(state.patterns)} patterns\")\n        print(f\"Confidence: {state.avg_confidence:.2%}\")\n\n    # Subsequent crawls - reuse learned patterns (still inside the crawler context)\n    result2 = await crawler.arun(\n        \"https:\u002F\u002Fnews.example.com\u002Farticle\u002F67890\",\n        config=CrawlerRunConfig(adaptive_config=config)\n    )\n    # Automatically extracts using learned patterns!\n```\n\n**Expected Real-World Impact:**\n- **News Aggregation**: Maintain 95%+ extraction accuracy even as news sites update their templates\n- **E-commerce Monitoring**: Track product changes across hundreds of stores without constant maintenance\n- **Research Data Collection**: Build robust academic datasets that survive website redesigns\n- **Reduced Maintenance**: Cut selector update time by 80% for frequently-changing sites\n\n## 🌊 Virtual Scroll: Complete Content Capture\n\n**The Problem:** Modern web apps only render what's visible. 
Scroll down, new content appears, old content vanishes into the void. Traditional crawlers capture that first viewport and miss 90% of the content. It's like reading only the first page of every book.\n\n**My Solution:** I built Virtual Scroll support that mimics human browsing behavior, capturing content as it loads and preserving it before the browser's garbage collector strikes.\n\n### Implementation Details\n\n```python\nfrom crawl4ai import VirtualScrollConfig\n\n# For social media feeds (Twitter\u002FX style)\ntwitter_config = VirtualScrollConfig(\n    container_selector=\"[data-testid='primaryColumn']\",\n    scroll_count=20,                    # Number of scrolls\n    scroll_by=\"container_height\",       # Smart scrolling by container size\n    wait_after_scroll=1.0,             # Let content load\n    capture_method=\"incremental\",       # Capture new content on each scroll\n    deduplicate=True                   # Remove duplicate elements\n)\n\n# For e-commerce product grids (Instagram style)\ngrid_config = VirtualScrollConfig(\n    container_selector=\"main .product-grid\",\n    scroll_count=30,\n    scroll_by=800,                     # Fixed pixel scrolling\n    wait_after_scroll=1.5,             # Images need time\n    stop_on_no_change=True            # Smart stopping\n)\n\n# For news feeds with lazy loading\nnews_config = VirtualScrollConfig(\n    container_selector=\".article-feed\",\n    scroll_count=50,\n    scroll_by=\"page_height\",           # Viewport-based scrolling\n    wait_afte","2025-07-12T11:13:36",{"id":204,"version":205,"summary_zh":206,"released_at":207},60732,"v0.6.3","**Release 0.6.3 (unreleased)**\n\n**Features**\n\n* **extraction**: add `RegexExtractionStrategy` for pattern-based extraction, including built-in patterns for emails, URLs, phones, dates, support for custom regexes, an LLM-assisted pattern generator, optimized HTML preprocessing via `fit_html`, and enhanced network response body capture (9b5ccac)\n* **docker-api**: introduce job-based polling endpoints—`POST \u002Fcrawl\u002Fjob` & `GET \u002Fcrawl\u002Fjob\u002F{task_id}` for crawls, `POST \u002Fllm\u002Fjob` & `GET \u002Fllm\u002Fjob\u002F{task_id}` for LLM tasks—backed by Redis task management with configurable TTL, moved schemas to `schemas.py`, and added `demo_docker_polling.py` example (94e9959)\n* **browser**: improve profile management and cleanup—add process cleanup for existing Chromium instances on Windows\u002FUnix, fix profile creation by passing full browser config, ship detailed browser\u002FCLI docs and initial profile-creation test, bump version to 0.6.3 (9499164)\n\n**Fixes**\n\n* **crawler**: remove automatic page closure in `take_screenshot` and `take_screenshot_naive`, preventing premature teardown; callers now must explicitly close pages (BREAKING CHANGE) (a3e9ef9)\n\n**Documentation**\n\n* format bash scripts in `docs\u002Fapps\u002Flinkdin\u002FREADME.md` so examples copy & paste cleanly (87d4b0f)\n* update the same README with full `litellm` argument details for correct script usage (bd5a9ac)\n\n**Refactoring**\n\n* **logger**: centralize color codes behind an `Enum` in `async_logger`, `browser_profiler`, `content_filter_strategy` and related modules for cleaner, type-safe formatting (cd2b490)\n\n**Experimental**\n\n* start migration of logging stack to `rich` (WIP, work ongoing) (b2f3cb0)\n","2025-05-12T13:44:02",{"id":209,"version":210,"summary_zh":211,"released_at":212},60733,"vr0.6.0","## 🚀 0.6.0 — 22 Apr 2025\r\n\r\n### Highlights\r\n1. 
**World‑aware crawlers**:\r\n```python\r\ncrun_cfg = CrawlerRunConfig(\r\n        url=\"https:\u002F\u002Fbrowserleaks.com\u002Fgeo\",          # test page that shows your location\r\n        locale=\"en-US\",                              # Accept-Language & UI locale\r\n        timezone_id=\"America\u002FLos_Angeles\",           # JS Date()\u002FIntl timezone\r\n        geolocation=GeolocationConfig(                 # override GPS coords\r\n            latitude=34.0522,\r\n            longitude=-118.2437,\r\n            accuracy=10.0,\r\n        )\r\n    )\r\n```\r\n\r\n2. **Table‑to‑DataFrame** extraction: run `df = pd.DataFrame(result.media[\"tables\"][0][\"rows\"], columns=result.media[\"tables\"][0][\"headers\"])` and get CSV or pandas without extra parsing.  \r\n3. **Crawler pool with pre‑warm**, pages launch hot, lower P90 latency, lower memory.  \r\n4. **Network and console capture**, full traffic log plus MHTML snapshot for audits and debugging.  \r\n\r\n### Added\r\n- Geolocation, locale, and timezone flags for every crawl.\r\n- Browser pooling with page pre‑warming.\r\n- Table extractor that exports to CSV or pandas.\r\n- Crawler pool manager in SDK and Docker API.\r\n- Network & console log capture, plus MHTML snapshot.\r\n- MCP socket and SSE endpoints with playground UI.\r\n- Stress‑test framework (`tests\u002Fmemory`) for 1 k+ URL runs.\r\n- Docs v2: TOC, GitHub badge, copy‑code buttons, Docker API demo.\r\n- “Ask AI” helper button, work in progress, shipping soon.\r\n- New examples: geo location, network\u002Fconsole capture, Docker API, markdown source selection, crypto analysis.\r\n\r\n### Changed\r\n- Browser strategy consolidation, legacy docker modules removed.\r\n- `ProxyConfig` moved to `async_configs`.\r\n- Server migrated to pool‑based crawler management.\r\n- FastAPI validators replace custom query validation.\r\n- Docker build now uses a Chromium base image.\r\n- Repo cleanup, ≈36 k insertions, ≈5 k deletions across 121 files.\r\n\r\n### Fixed\r\n- Session leaks, duplicate visits, URL normalisation.\r\n- Target‑element regressions in scraping strategies.\r\n- Logged URL readability, encoded URL decoding, middle truncation.\r\n- Closed issues: #701 #733 #756 #774 #804 #822 #839 #841 #842 #843 #867 #902 #911.\r\n\r\n### Removed\r\n- Obsolete modules in `crawl4ai\u002Fbrowser\u002F*`.\r\n\r\n### Deprecated\r\n- Old markdown generator names now alias `DefaultMarkdownGenerator` and warn.\r\n\r\n### Upgrade notes\r\n1. Update any imports from `crawl4ai\u002Fbrowser\u002F*` to the new pooled browser modules.  \r\n2. If you override `AsyncPlaywrightCrawlerStrategy.get_page` adopt the new signature.  \r\n3. Rebuild Docker images to pick up the Chromium layer.  \r\n4. Switch to `DefaultMarkdownGenerator` to silence deprecation warnings.\r\n\r\n`121 files changed, ≈36 223 insertions, ≈4 975 deletions`\r\n","2025-04-22T15:24:54",{"id":214,"version":215,"summary_zh":216,"released_at":217},60734,"v0.5.0.post1","# Crawl4AI v0.5.0.post1 Release\n\n**Release Theme: Power, Flexibility, and Scalability**\n\nCrawl4AI v0.5.0 is a major release focused on significantly enhancing the library's power, flexibility, and scalability. \n\n## Key Features\n\n1. **Deep Crawling System** - Explore websites beyond initial URLs with BFS, DFS, and BestFirst strategies, with page limiting and scoring capabilities (see the BFS sketch after these notes)\n2. **Memory-Adaptive Dispatcher** - Scale to thousands of URLs with intelligent memory monitoring and concurrency control\n3. 
**Multiple Crawling Strategies** - Choose between browser-based (Playwright) or lightweight HTTP-only crawling\n4. **Docker Deployment** - Easy deployment with FastAPI server, JWT authentication, and streaming\u002Fnon-streaming endpoints\n5. **Command-Line Interface** - New `crwl` CLI provides convenient access to all features with intuitive commands\n6. **Browser Profiler** - Create and manage persistent browser profiles to save authentication states for protected content\n7. **Crawl4AI Coding Assistant** - Interactive chat interface for asking questions about Crawl4AI and generating Python code examples\n8. **LXML Scraping Mode** - Fast HTML parsing using the `lxml` library for 10-20x speedup with complex pages\n9. **Proxy Rotation** - Built-in support for dynamic proxy switching with authentication and session persistence\n10. **PDF Processing** - Extract and process data from PDF files (both local and remote)\n\n## Additional Improvements\n\n- LLM Content Filter for intelligent markdown generation\n- URL redirection tracking\n- LLM-powered schema generation utility for extraction templates\n- robots.txt compliance support\n- Enhanced browser context management\n- Improved serialization and config handling\n\n## Breaking Changes\n\nThis release contains several breaking changes. Please review the full release notes for migration guidance.\n\nFor complete details, visit: https:\u002F\u002Fdocs.crawl4ai.com\u002Fblog\u002Freleases\u002F0.5.0\u002F\n","2025-03-04T14:21:20"]
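To make Key Feature 1 above concrete, here is a minimal sketch of a breadth-first deep crawl in Python. It assumes the current deep-crawling API (`BFSDeepCrawlStrategy` with `max_depth`/`max_pages`, wired in through `CrawlerRunConfig`); the seed URL mirrors the CLI example in the quickstart (`crwl https://docs.crawl4ai.com --deep-crawl bfs --max-pages 10`), and parameter names should be checked against the docs for your installed version:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig
from crawl4ai.deep_crawling import BFSDeepCrawlStrategy

async def main():
    # Breadth-first crawl: follow internal links up to depth 2,
    # stopping after at most 10 pages.
    config = CrawlerRunConfig(
        deep_crawl_strategy=BFSDeepCrawlStrategy(
            max_depth=2,
            include_external=False,
            max_pages=10,
        ),
    )
    async with AsyncWebCrawler() as crawler:
        # With a deep-crawl strategy set, arun returns one result per visited page.
        results = await crawler.arun("https://docs.crawl4ai.com", config=config)
        for result in results:
            print(result.url, "ok" if result.success else "failed")

if __name__ == "__main__":
    asyncio.run(main())
```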
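Several of the FAQs earlier concern concurrency failures ("Target page, context or browser has been closed", recursion-depth errors from overlapping crawls). The pattern those answers recommend is one `AsyncWebCrawler` instance plus the library's own `arun_many` batching, rather than hand-rolled parallel instances. A minimal sketch, with placeholder URLs and the `wait_until="domcontentloaded"` tip from the ERR_ABORTED FAQ applied:

```python
import asyncio

from crawl4ai import AsyncWebCrawler, CrawlerRunConfig

URLS = [
    "https://example.com/page-1",
    "https://example.com/page-2",
    "https://example.com/page-3",
]

async def main():
    # domcontentloaded avoids most navigation-abort errors per the FAQ above
    run_config = CrawlerRunConfig(wait_until="domcontentloaded")
    # One crawler instance for the whole batch; arun_many handles
    # concurrency and resource isolation internally.
    async with AsyncWebCrawler() as crawler:
        results = await crawler.arun_many(URLS, config=run_config)
        for result in results:
            status = "ok" if result.success else f"failed: {result.error_message}"
            print(result.url, status)

if __name__ == "__main__":
    asyncio.run(main())
```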