[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-ucbepic--docetl":3,"tool-ucbepic--docetl":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":81,"owner_twitter":76,"owner_website":82,"owner_url":83,"languages":84,"stars":107,"forks":108,"last_commit_at":109,"license":110,"difficulty_score":111,"env_os":112,"env_gpu":113,"env_ram":113,"env_deps":114,"category_tags":121,"github_topics":122,"view_count":23,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":136,"updated_at":137,"faqs":138,"releases":169},2546,"ucbepic\u002Fdocetl","docetl","A system for agentic LLM-powered data processing and ETL","DocETL 是一个专为复杂文档处理设计的智能数据提取与转换（ETL）系统。它利用大语言模型（LLM）的代理能力，帮助用户构建高效的数据处理流水线，特别擅长应对非结构化文档中那些传统规则难以处理的复杂任务。\n\n在日常工作中，从大量文档中精准提取、清洗和整合信息往往是一项耗时且容易出错的挑战。DocETL 正是为了解决这一痛点而生，它将繁琐的数据预处理过程自动化，确保数据质量的同时大幅降低人工成本。无论是处理长篇报告、会议记录还是多媒体转录文本，它都能提供稳定可靠的解决方案。\n\n这款工具主要面向开发者、数据工程师以及 AI 应用研究人员。如果你正在构建需要深度理解文档内容的应用，或者希望优化现有的数据处理流程，DocETL 将是一个得力的助手。普通用户若具备一定的技术基础，也可通过其提供的辅助提示词快速上手。\n\nDocETL 的核心亮点在于其独特的“双模式”工作流。一方面，它提供了名为 DocWrangler 的交互式可视化界面，允许用户在开发阶段实时调试提示词、逐步构建管道并即时查看结果，极大地降低了试错门槛；另一方面，它提供了成熟的 Python 包，支持将验证后的流程无缝部署到生产","DocETL 是一个专为复杂文档处理设计的智能数据提取与转换（ETL）系统。它利用大语言模型（LLM）的代理能力，帮助用户构建高效的数据处理流水线，特别擅长应对非结构化文档中那些传统规则难以处理的复杂任务。\n\n在日常工作中，从大量文档中精准提取、清洗和整合信息往往是一项耗时且容易出错的挑战。DocETL 
正是为了解决这一痛点而生，它将繁琐的数据预处理过程自动化，确保数据质量的同时大幅降低人工成本。无论是处理长篇报告、会议记录还是多媒体转录文本，它都能提供稳定可靠的解决方案。\n\n这款工具主要面向开发者、数据工程师以及 AI 应用研究人员。如果你正在构建需要深度理解文档内容的应用，或者希望优化现有的数据处理流程，DocETL 将是一个得力的助手。普通用户若具备一定的技术基础，也可通过其提供的辅助提示词快速上手。\n\nDocETL 的核心亮点在于其独特的“双模式”工作流。一方面，它提供了名为 DocWrangler 的交互式可视化界面，允许用户在开发阶段实时调试提示词、逐步构建管道并即时查看结果，极大地降低了试错门槛；另一方面，它提供了成熟的 Python 包，支持将验证后的流程无缝部署到生产环境中。此外，DocETL 还引入了如“Gleaning”（精炼）等先进操作符，能显著提升模型输出的准确性与一致性。结合对 Claude Code 等 AI 编程助手的良好支持，用户可以更轻松地编写和维护复杂的数据处理逻辑，实现从原型设计到实际落地的平滑过渡。","# 📜 DocETL: Powering Complex Document Processing Pipelines\n\n[![Website](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-docetl.org-blue)](https:\u002F\u002Fdocetl.org)\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocumentation-docs-green)](https:\u002F\u002Fucbepic.github.io\u002Fdocetl)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F1285485891095236608?label=Discord&logo=discord)](https:\u002F\u002Fdiscord.gg\u002FfHp7B2X3xx)\n[![Paper](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-red)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12189)\n\n![DocETL Figure](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fucbepic_docetl_readme_30149b2787a3.png)\n\nDocETL is a tool for creating and executing data processing pipelines, especially suited for complex document processing tasks. It offers:\n\n1. An interactive UI playground for iterative prompt engineering and pipeline development\n2. 
A Python package for running production pipelines from the command line or Python code\n\n> 💡 **Need Help Writing Your Pipeline?**  \n> You can use **Claude Code** (recommended) to help you write your pipeline—see the quickstart: https:\u002F\u002Fucbepic.github.io\u002Fdocetl\u002Fquickstart-claude-code\u002F  \n> If you’d rather use ChatGPT or the Claude app, see [docetl.org\u002Fllms.txt](https:\u002F\u002Fdocetl.org\u002Fllms.txt) for a big prompt you can copy\u002Fpaste before describing your task.\n\n\n### 🌟 Community Projects\n\n- [Conversation Generator](https:\u002F\u002Fgithub.com\u002FPassionFruits-net\u002Fdocetl-conversation)\n- [Text-to-speech](https:\u002F\u002Fgithub.com\u002FPassionFruits-net\u002Fdocetl-speaker)\n- [YouTube Transcript Topic Extraction](https:\u002F\u002Fgithub.com\u002Frajib76\u002Fdocetl_examples)\n\n### 📚 Educational Resources\n\n- [UI\u002FUX Thoughts](https:\u002F\u002Fx.com\u002Fsh_reya\u002Fstatus\u002F1846235904664273201)\n- [Using Gleaning to Improve Output Quality](https:\u002F\u002Fx.com\u002Fsh_reya\u002Fstatus\u002F1843354256335876262)\n- [Deep Dive on Resolve Operator](https:\u002F\u002Fx.com\u002Fsh_reya\u002Fstatus\u002F1840796824636121288)\n\n\n## 🚀 Getting Started\n\nThere are two main ways to use DocETL:\n\n### 1. 🎮 DocWrangler, the Interactive UI Playground (Recommended for Development)\n\n[DocWrangler](https:\u002F\u002Fdocetl.org\u002Fplayground) helps you iteratively develop your pipeline:\n- Experiment with different prompts and see results in real-time\n- Build your pipeline step by step\n- Export your finalized pipeline configuration for production use\n\n![DocWrangler](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fucbepic_docetl_readme_04174ae867ad.png)\n\nDocWrangler is hosted at [docetl.org\u002Fplayground](https:\u002F\u002Fdocetl.org\u002Fplayground). 
But to run the playground locally, you can either:\n- Use Docker (recommended for quick start): `make docker`\n- Set up the development environment manually\n\nSee the [Playground Setup Guide](https:\u002F\u002Fucbepic.github.io\u002Fdocetl\u002Fplayground\u002F) for detailed instructions.\n\n### 2. 📦 Python Package (For Production Use)\n\nIf you want to use DocETL as a Python package:\n\n#### Prerequisites\n- Python 3.10 or later\n- OpenAI API key\n\n```bash\npip install docetl\n```\n\nCreate a `.env` file in your project directory:\n```bash\nOPENAI_API_KEY=your_api_key_here  # Required for LLM operations (or the key for the LLM of your choice)\n```\n\n> ⚠️ **Important: Two Different .env Files**\n> - **Root `.env`**: Used by the backend Python server that executes DocETL pipelines\n> - **`website\u002F.env.local`**: Used by the frontend TypeScript code in DocWrangler (UI features like improve prompt and chatbot)\n\nTo see examples of how to use DocETL, check out the [tutorial](https:\u002F\u002Fucbepic.github.io\u002Fdocetl\u002Ftutorial\u002F).\n\n### 2. 🎮 DocWrangler Setup\n\nTo run DocWrangler locally, you have two options:\n\n#### Option A: Using Docker (Recommended for Quick Start)\n\nThe easiest way to get the DocWrangler playground running:\n\n1. 
Create the required environment files:\n\nCreate `.env` in the root directory (for the backend Python server that executes pipelines):\n```bash\nOPENAI_API_KEY=your_api_key_here  # Used by DocETL pipeline execution engine\n# BACKEND configuration\nBACKEND_ALLOW_ORIGINS=http:\u002F\u002Flocalhost:3000,http:\u002F\u002F127.0.0.1:3000\nBACKEND_HOST=localhost\nBACKEND_PORT=8000\nBACKEND_RELOAD=True\n\n# FRONTEND configuration\nFRONTEND_HOST=0.0.0.0\nFRONTEND_PORT=3000\n\n# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)\nFRONTEND_DOCKER_COMPOSE_PORT=3031\nBACKEND_DOCKER_COMPOSE_PORT=8081\n\n# Supported text file encodings\nTEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1\n```\n\nCreate `.env.local` in the `website` directory (for DocWrangler UI features like improve prompt and chatbot):\n```bash\nOPENAI_API_KEY=sk-xxx  # Used by TypeScript features: improve prompt, chatbot, etc.\nOPENAI_API_BASE=https:\u002F\u002Fapi.openai.com\u002Fv1\nMODEL_NAME=gpt-4o-mini  # Model used by the UI assistant\n\nNEXT_PUBLIC_BACKEND_HOST=localhost\nNEXT_PUBLIC_BACKEND_PORT=8000\nNEXT_PUBLIC_HOSTED_DOCWRANGLER=false\n```\n\n2. Run Docker:\n```bash\nmake docker\n```\n\nThis will:\n- Create a Docker volume for persistent data\n- Build the DocETL image\n- Run the container with the UI accessible at http:\u002F\u002Flocalhost:3000\n\nTo clean up Docker resources (note that this will delete the Docker volume):\n```bash\nmake docker-clean\n```\n\n##### AWS Bedrock\n\nThis framework supports integration with AWS Bedrock. To enable:\n\n1. Configure AWS credentials:\n```bash\naws configure\n```\n\n2. Test your AWS credentials:\n```bash\nmake test-aws\n```\n\n3. 
Run with AWS support:\n```bash\nAWS_PROFILE=your-profile AWS_REGION=your-region make docker\n```\n\nOr using Docker Compose:\n```bash\nAWS_PROFILE=your-profile AWS_REGION=your-region docker compose --profile aws up\n```\n\nEnvironment variables:\n- `AWS_PROFILE`: Your AWS CLI profile (default: 'default')\n- `AWS_REGION`: AWS region (default: 'us-west-2')\n\nBedrock models are prefixed with `bedrock`. See liteLLM [docs](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fproviders\u002Fbedrock#supported-aws-bedrock-models) for more details.\n\n#### Option B: Manual Setup (Development)\n\nFor development or if you prefer not to use Docker:\n\n1. Clone the repository:\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl.git\ncd docetl\n```\n\n2. Set up environment variables in `.env` in the root\u002Ftop-level directory (for the backend Python server):\n```bash\nOPENAI_API_KEY=your_api_key_here  # Used by DocETL pipeline execution engine\n# BACKEND configuration\nBACKEND_ALLOW_ORIGINS=http:\u002F\u002Flocalhost:3000,http:\u002F\u002F127.0.0.1:3000\nBACKEND_HOST=localhost\nBACKEND_PORT=8000\nBACKEND_RELOAD=True\n\n# FRONTEND configuration\nFRONTEND_HOST=0.0.0.0\nFRONTEND_PORT=3000\n\n# Host port mapping for docker-compose (if not set, defaults are used in docker-compose.yml)\nFRONTEND_DOCKER_COMPOSE_PORT=3031\nBACKEND_DOCKER_COMPOSE_PORT=8081\n\n# Supported text file encodings\nTEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1\n```\n\nAnd create a `.env.local` file in the `website` directory (for DocWrangler UI features):\n```bash\nOPENAI_API_KEY=sk-xxx  # Used by TypeScript features: improve prompt, chatbot, etc.\nOPENAI_API_BASE=https:\u002F\u002Fapi.openai.com\u002Fv1\nMODEL_NAME=gpt-4o-mini  # Model used by the UI assistant\n\nNEXT_PUBLIC_BACKEND_HOST=localhost\nNEXT_PUBLIC_BACKEND_PORT=8000\nNEXT_PUBLIC_HOSTED_DOCWRANGLER=false\n```\n\n3. 
Install dependencies:\n```bash\nmake install      # Install Python deps with uv and set up pre-commit\nmake install-ui   # Install UI dependencies\n```\n\nIf you prefer using uv directly instead of Make:\n```bash\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\nuv sync --all-groups --all-extras\n```\n\n\n\n4. Start the development server:\n```bash\nmake run-ui-dev\n```\n\n5. Visit http:\u002F\u002Flocalhost:3000\u002Fplayground to access the interactive UI.\n\n### 🛠️ Development Setup\n\nIf you're planning to contribute or modify DocETL, you can verify your setup by running the test suite:\n\n```bash\nmake tests-basic  # Runs basic test suite (costs \u003C $0.01 with OpenAI)\n```\n\nFor detailed documentation and tutorials, visit our [documentation](https:\u002F\u002Fucbepic.github.io\u002Fdocetl).\n","# 📜 DocETL：赋能复杂文档处理流水线\n\n[![官网](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FWebsite-docetl.org-blue)](https:\u002F\u002Fdocetl.org)\n[![文档](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDocumentation-docs-green)](https:\u002F\u002Fucbepic.github.io\u002Fdocetl)\n[![Discord](https:\u002F\u002Fimg.shields.io\u002Fdiscord\u002F1285485891095236608?label=Discord&logo=discord)](https:\u002F\u002Fdiscord.gg\u002FfHp7B2X3xx)\n[![论文](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPaper-arXiv-red)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2410.12189)\n\n![DocETL示意图](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fucbepic_docetl_readme_30149b2787a3.png)\n\nDocETL 是一款用于创建和执行数据处理流水线的工具，尤其适用于复杂的文档处理任务。它提供以下功能：\n\n1. 一个交互式 UI 演示环境，用于迭代式提示工程和流水线开发\n2. 
一个 Python 软件包，可用于从命令行或 Python 代码中运行生产级流水线\n\n> 💡 **需要帮助编写你的流水线吗？**  \n> 你可以使用 **Claude Code**（推荐）来辅助你编写流水线——请参阅快速入门指南：https:\u002F\u002Fucbepic.github.io\u002Fdocetl\u002Fquickstart-claude-code\u002F  \n> 如果你更倾向于使用 ChatGPT 或 Claude 应用程序，请访问 [docetl.org\u002Fllms.txt](https:\u002F\u002Fdocetl.org\u002Fllms.txt)，那里提供了一个大型提示模板，你可以在描述任务之前直接复制粘贴。\n\n### 🌟 社区项目\n\n- [对话生成器](https:\u002F\u002Fgithub.com\u002FPassionFruits-net\u002Fdocetl-conversation)\n- [文本转语音](https:\u002F\u002Fgithub.com\u002FPassionFruits-net\u002Fdocetl-speaker)\n- [YouTube 字幕主题提取](https:\u002F\u002Fgithub.com\u002Frajib76\u002Fdocetl_examples)\n\n### 📚 教育资源\n\n- [UI\u002FUX 思考](https:\u002F\u002Fx.com\u002Fsh_reya\u002Fstatus\u002F1846235904664273201)\n- [利用 Gleaning 提升输出质量](https:\u002F\u002Fx.com\u002Fsh_reya\u002Fstatus\u002F1843354256335876262)\n- [Resolve 操作符深度解析](https:\u002F\u002Fx.com\u002Fsh_reya\u002Fstatus\u002F1840796824636121288)\n\n\n## 🚀 快速上手\n\n使用 DocETL 主要有两种方式：\n\n### 1. 🎮 DocWrangler，交互式 UI 演示环境（推荐用于开发）\n\n[DocWrangler](https:\u002F\u002Fdocetl.org\u002Fplayground) 可帮助你逐步开发流水线：\n- 实验不同的提示并实时查看结果\n- 分步构建你的流水线\n- 导出最终确定的流水线配置以供生产使用\n\n![DocWrangler](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fucbepic_docetl_readme_04174ae867ad.png)\n\nDocWrangler 托管在 [docetl.org\u002Fplayground](https:\u002F\u002Fdocetl.org\u002Fplayground) 上。但如果你想在本地运行演示环境，可以采取以下两种方式之一：\n- 使用 Docker（推荐，快速启动）：`make docker`\n- 手动搭建开发环境\n\n详细说明请参阅 [Playground 设置指南](https:\u002F\u002Fucbepic.github.io\u002Fdocetl\u002Fplayground\u002F)。\n\n### 2. 
📦 Python 软件包（用于生产环境）\n\n如果你想将 DocETL 作为 Python 软件包使用：\n\n#### 前提条件\n- Python 3.10 或更高版本\n- OpenAI API 密钥\n\n```bash\npip install docetl\n```\n\n在你的项目目录中创建一个 `.env` 文件：\n```bash\nOPENAI_API_KEY=your_api_key_here  # LLM 操作所需（或你选择的其他 LLM 的密钥）\n```\n\n> ⚠️ **重要提示：两个不同的 .env 文件**\n> - **根目录下的 `.env`**：由执行 DocETL 流水线的后端 Python 服务器使用\n> - **`website\u002F.env.local`**：由 DocWrangler 的前端 TypeScript 代码使用（用于改进提示和聊天机器人等功能）\n\n要查看如何使用 DocETL 的示例，请参阅 [教程](https:\u002F\u002Fucbepic.github.io\u002Fdocetl\u002Ftutorial\u002F)。\n\n### 2. 🎮 DocWrangler 设置\n\n要在本地运行 DocWrangler，你有两种选择：\n\n#### 选项 A：使用 Docker（推荐，快速启动）\n\n让 DocWrangler 演示环境运行起来最简单的方法是：\n\n1. 创建所需的环境文件：\n\n在根目录下创建 `.env` 文件（用于执行流水线的后端 Python 服务器）：\n```bash\nOPENAI_API_KEY=your_api_key_here  # 用于 DocETL 流水线执行引擎\n# 后端配置\nBACKEND_ALLOW_ORIGINS=http:\u002F\u002Flocalhost:3000,http:\u002F\u002F127.0.0.1:3000\nBACKEND_HOST=localhost\nBACKEND_PORT=8000\nBACKEND_RELOAD=True\n\n# 前端配置\nFRONTEND_HOST=0.0.0.0\nFRONTEND_PORT=3000\n\n# Docker Compose 中的主机端口映射（若未设置，则使用 docker-compose.yml 中的默认值）\nFRONTEND_DOCKER_COMPOSE_PORT=3031\nBACKEND_DOCKER_COMPOSE_PORT=8081\n\n# 支持的文本文件编码\nTEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1\n```\n\n在 `website` 目录下创建 `.env.local` 文件（用于 DocWrangler UI 功能，如改进提示和聊天机器人等）：\n```bash\nOPENAI_API_KEY=sk-xxx  # 用于 TypeScript 功能：改进提示、聊天机器人等\nOPENAI_API_BASE=https:\u002F\u002Fapi.openai.com\u002Fv1\nMODEL_NAME=gpt-4o-mini  # UI 助手使用的模型\n\nNEXT_PUBLIC_BACKEND_HOST=localhost\nNEXT_PUBLIC_BACKEND_PORT=8000\nNEXT_PUBLIC_HOSTED_DOCWRANGLER=false\n```\n\n2. 运行 Docker：\n```bash\nmake docker\n```\n\n这将完成以下操作：\n- 创建用于持久化数据的 Docker 卷\n- 构建 DocETL 镜像\n- 运行容器，UI 将可通过 http:\u002F\u002Flocalhost:3000 访问\n\n若需清理 Docker 资源（请注意，这将删除 Docker 卷）：\n```bash\nmake docker-clean\n```\n\n##### AWS Bedrock\n\n该框架支持与 AWS Bedrock 的集成。要启用：\n\n1. 配置 AWS 凭证：\n```bash\naws configure\n```\n\n2. 测试你的 AWS 凭证：\n```bash\nmake test-aws\n```\n\n3. 
在启用 AWS 支持的情况下运行：\n```bash\nAWS_PROFILE=your-profile AWS_REGION=your-region make docker\n```\n\n或者使用 Docker Compose：\n```bash\nAWS_PROFILE=your-profile AWS_REGION=your-region docker compose --profile aws up\n```\n\n环境变量：\n- `AWS_PROFILE`：你的 AWS CLI 配置文件（默认为 'default'）\n- `AWS_REGION`：AWS 区域（默认为 'us-west-2'）\n\nBedrock 模型名称前会加上 `bedrock_` 前缀。更多详情请参阅 liteLLM 的 [文档](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fproviders\u002Fbedrock#supported-aws-bedrock-models)。\n\n#### 选项 B：手动设置（开发模式）\n\n如果你希望进行开发，或者不想使用 Docker，可以按以下步骤操作：\n\n1. 克隆仓库：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl.git\ncd docetl\n```\n\n2. 在根目录\u002F顶级目录下的 `.env` 文件中设置环境变量（用于后端 Python 服务器）：\n```bash\nOPENAI_API_KEY=your_api_key_here  # 用于 DocETL 流水线执行引擎\n# 后端配置\nBACKEND_ALLOW_ORIGINS=http:\u002F\u002Flocalhost:3000,http:\u002F\u002F127.0.0.1:3000\nBACKEND_HOST=localhost\nBACKEND_PORT=8000\nBACKEND_RELOAD=True\n\n# 前端配置\nFRONTEND_HOST=0.0.0.0\nFRONTEND_PORT=3000\n\n# Docker Compose 中的主机端口映射（若未设置，则使用 docker-compose.yml 中的默认值）\nFRONTEND_DOCKER_COMPOSE_PORT=3031\nBACKEND_DOCKER_COMPOSE_PORT=8081\n\n# 支持的文本文件编码\nTEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1\n```\n\n并在 `website` 目录下创建一个 `.env.local` 文件（用于 DocWrangler 的 UI 功能）：\n```bash\nOPENAI_API_KEY=sk-xxx  # 由 TypeScript 功能使用：改进提示、聊天机器人等。\nOPENAI_API_BASE=https:\u002F\u002Fapi.openai.com\u002Fv1\nMODEL_NAME=gpt-4o-mini  # UI 助手使用的模型。\n\nNEXT_PUBLIC_BACKEND_HOST=localhost\nNEXT_PUBLIC_BACKEND_PORT=8000\nNEXT_PUBLIC_HOSTED_DOCWRANGLER=false\n```\n\n3. 安装依赖：\n```bash\nmake install      # 使用 uv 安装 Python 依赖并设置 pre-commit 钩子\nmake install-ui   # 安装 UI 依赖\n```\n\n如果您更倾向于直接使用 uv 而不是 Make：\n```bash\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\nuv sync --all-groups --all-extras\n```\n\n\n\n4. 启动开发服务器：\n```bash\nmake run-ui-dev\n```\n\n5. 
访问 http:\u002F\u002Flocalhost:3000\u002Fplayground 即可进入交互式 UI。\n\n### 🛠️ 开发环境搭建\n\n如果您计划为 DocETL 做贡献或进行修改，可以通过运行测试套件来验证您的环境是否配置正确：\n\n```bash\nmake tests-basic  # 运行基础测试套件（使用 OpenAI 时成本低于 0.01 美元）\n```\n\n如需详细文档和教程，请访问我们的[文档](https:\u002F\u002Fucbepic.github.io\u002Fdocetl)。","# DocETL 快速上手指南\n\nDocETL 是一款专为复杂文档处理任务设计的数据处理流水线工具。它提供交互式 UI（DocWrangler）用于迭代开发，以及 Python 包用于生产环境部署。\n\n## 1. 环境准备\n\n在开始之前，请确保满足以下系统要求和前置依赖：\n\n*   **操作系统**：支持 Docker 的环境（推荐）或 Linux\u002FmacOS\u002FWindows。\n*   **Python 版本**：Python 3.10 或更高版本。\n*   **API 密钥**：需要 OpenAI API Key（或其他兼容 LLM 的 API Key）。\n*   **其他依赖**：\n    *   若使用 Docker 方式：需安装 Docker 和 Docker Compose。\n    *   若手动开发：需安装 `git` 和 `uv`（可选，用于加速依赖安装）。\n\n## 2. 安装步骤\n\nDocETL 提供两种主要使用方式：**交互式 UI (DocWrangler)** 和 **Python 包**。推荐新手先通过 Docker 运行 UI 进行体验。\n\n### 方式一：使用 Docker 运行交互式 UI（推荐）\n\n这是最快上手的方式，无需配置复杂的本地开发环境。\n\n1.  **克隆仓库**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl.git\n    cd docetl\n    ```\n\n2.  **配置环境变量**\n    \n    在项目根目录创建 `.env` 文件（用于后端服务）：\n    ```bash\n    OPENAI_API_KEY=your_api_key_here\n    BACKEND_ALLOW_ORIGINS=http:\u002F\u002Flocalhost:3000,http:\u002F\u002F127.0.0.1:3000\n    BACKEND_HOST=localhost\n    BACKEND_PORT=8000\n    BACKEND_RELOAD=True\n    FRONTEND_HOST=0.0.0.0\n    FRONTEND_PORT=3000\n    FRONTEND_DOCKER_COMPOSE_PORT=3031\n    BACKEND_DOCKER_COMPOSE_PORT=8081\n    TEXT_FILE_ENCODINGS=utf-8,latin1,cp1252,iso-8859-1\n    ```\n\n    在 `website` 目录下创建 `.env.local` 文件（用于前端 UI 功能）：\n    ```bash\n    OPENAI_API_KEY=sk-xxx\n    OPENAI_API_BASE=https:\u002F\u002Fapi.openai.com\u002Fv1\n    MODEL_NAME=gpt-4o-mini\n    NEXT_PUBLIC_BACKEND_HOST=localhost\n    NEXT_PUBLIC_BACKEND_PORT=8000\n    NEXT_PUBLIC_HOSTED_DOCWRANGLER=false\n    ```\n\n3.  **启动服务**\n    ```bash\n    make docker\n    ```\n    启动成功后，访问 [http:\u002F\u002Flocalhost:3000](http:\u002F\u002Flocalhost:3000) 即可使用 DocWrangler  playground。\n\n### 方式二：作为 Python 包安装（生产环境）\n\n如果你希望在代码中直接调用 DocETL：\n\n1.  
**安装库**\n    ```bash\n    pip install docetl\n    ```\n\n2.  **配置 API Key**\n    在项目目录下创建 `.env` 文件：\n    ```bash\n    OPENAI_API_KEY=your_api_key_here\n    ```\n\n## 3. 基本使用\n\n### 使用交互式 UI (DocWrangler)\n\n1.  浏览器打开 [http:\u002F\u002Flocalhost:3000\u002Fplayground](http:\u002F\u002Flocalhost:3000\u002Fplayground)。\n2.  **构建流水线**：\n    *   上传你的文档数据。\n    *   添加操作算子（如提取、总结、转换等）。\n    *   在右侧实时预览 Prompt 效果，调整提示词直到满意。\n3.  **导出配置**：完成调试后，导出 pipeline 配置文件（YAML\u002FJSON），用于后续生产环境运行。\n\n### 使用 Python 代码运行流水线\n\n安装并配置好环境后，你可以编写简单的 Python 脚本来执行处理任务：\n\n```python\nimport docetl\n\n# 加载之前从 UI 导出的 pipeline 配置\npipeline = docetl.load_pipeline(\"path\u002Fto\u002Fyour\u002Fpipeline.yaml\")\n\n# 执行流水线\nresults = pipeline.run()\n\n# 查看结果\nfor result in results:\n    print(result)\n```\n\n> **提示**：如果你不熟悉如何编写 Pipeline 配置，可以使用 Claude Code 或参考 [docetl.org\u002Fllms.txt](https:\u002F\u002Fdocetl.org\u002Fllms.txt) 中的提示词模板，让 AI 辅助生成初始配置。","某金融科技公司风控团队需从数千份非结构化的企业信贷申请 PDF 中提取关键财务指标（如营收、负债率）及风险条款，用于自动化审批决策。\n\n### 没有 docetl 时\n- **提取准确率极低**：传统 OCR 配合正则表达式难以处理复杂的表格跨页、手写签名遮挡或非标准排版，导致关键字段遗漏或错位，人工复核成本极高。\n- **逻辑一致性难保障**：不同文档中对“净利润”的定义可能隐含不同扣除项，硬编码规则无法理解上下文语义，导致数据口径混乱，后续清洗工作量巨大。\n- **迭代开发周期漫长**：当发现新的文档格式或提取错误时，工程师需重新编写解析代码并全量回归测试，调整一次提示词或逻辑往往需要数天时间。\n- **缺乏中间态调试能力**：黑盒式的处理流程让开发者难以定位具体是哪一步骤出错，面对成千上万份文档，排查个别异常案例如同大海捞针。\n\n### 使用 docetl 后\n- **智能语义提取**：利用 docetl 构建基于 LLM 的代理管道，能精准理解复杂语境下的财务术语，即使面对非标准表格也能通过语义推理准确抓取数据，显著降低人工复核率。\n- **自动消歧与标准化**：通过 docetl 的 Resolve 操作符，系统能自动识别并统一不同文档中的异构字段定义（如将“EBITDA”与“息税折旧摊销前利润”对齐），确保输出数据结构一致。\n- **交互式快速迭代**：借助 DocWrangler 可视化界面，分析师可实时调整提示词并即时查看样本结果，无需编写代码即可在几分钟内完成策略优化与验证。\n- **透明化链路追踪**：管道每一步的中间结果均可见，开发者能快速定位特定文档的处理瓶颈，针对性地优化单个节点，极大提升了调试效率。\n\ndocetl 将原本耗时数周的非结构化文档清洗工作缩短至小时级，同时通过交互式开发模式大幅降低了 AI 数据管道的构建门槛与维护成本。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fucbepic_docetl_04174ae8.png","ucbepic","EPIC Data Lab","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fucbepic_44be8e0b.png","Effective Programming Interaction and Computation with Data 
",null,"epic@berkeley.edu","https:\u002F\u002Fepic.berkeley.edu","https:\u002F\u002Fgithub.com\u002Fucbepic",[85,89,93,97,100,103],{"name":86,"color":87,"percentage":88},"Python","#3572A5",64.6,{"name":90,"color":91,"percentage":92},"TypeScript","#3178c6",35.1,{"name":94,"color":95,"percentage":96},"Makefile","#427819",0.1,{"name":98,"color":99,"percentage":96},"CSS","#663399",{"name":101,"color":102,"percentage":96},"Dockerfile","#384d54",{"name":104,"color":105,"percentage":106},"JavaScript","#f1e05a",0,3699,386,"2026-04-03T00:04:37","MIT",4,"Linux, macOS, Windows","未说明",{"notes":115,"python":116,"dependencies":117},"该工具主要依赖 LLM API（如 OpenAI 或 AWS Bedrock），需配置相应的 API Key。支持通过 Docker 快速部署或使用 uv 进行本地开发环境安装。前端基于 TypeScript\u002FNext.js，后端基于 Python。","3.10+",[118,119,120],"uv","litellm","openai",[51,15,26,13],[123,124,125,126,127,128,129,130,131,132,133,134,135],"data","etl","llm","python","data-pipelines","elt","workflow","agents","semantic-data","document-processing","unstructured-data","unstructured-data-analysis","document-analysis","2026-03-27T02:49:30.150509","2026-04-06T05:32:14.355875",[139,144,149,154,159,164],{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},11762,"如何配置 DocETL 以支持 vLLM 或其他非 OpenAI 的 LLM 提供商？","可以通过更新 LiteLLM 版本并使用其提供的特定提供商配置来实现。例如，对于 vLLM，可以在 `pipeline.yaml` 中将 `default_model` 设置为 `hosted_vllm\u002Fxxxx` 格式。此外，建议检查是否需要在 `.env` 文件中设置特定的环境变量（如 `vllm api_base`），或者使用 LiteLLM 代理设置来简化不同提供商之间的切换，而无需修改代码。","https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fissues\u002F35",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},11763,"DocETL 目前支持哪些 PDF 解析和 OCR 工具？","虽然原生支持仍在开发中，但社区推荐了一些强大的替代方案。如果计算资源和许可证允许，可以使用 NougatOCR 或 Marker 进行高质量的 PDF 解析和 OCR。对于简单的文本提取，可以使用 PyPDF2，但请注意它可能无法很好地处理复杂的布局。对于需要保留布局的场景，可能需要专门的布局解析器。","https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fissues\u002F3",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},11764,"在运行 Playground 时遇到 \"Path arguments must not be null\" 
错误怎么办？","这通常与环境配置或安装状态有关。建议尝试重新进行全新安装（fresh installation）来解决路径参数为空的问题。确保 `.env` 和 `.env.local` 文件中的配置正确，特别是后端和前端的主机及端口设置。","https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fissues\u002F238",{"id":155,"question_zh":156,"answer_zh":157,"source_url":158},11765,"如何处理 LLM 调用中的超时和速率限制（Rate Limit）问题？","项目已改进超时处理机制。为了避免因单个调用失败导致整个管道经济效率低下（即其他调用继续执行但最终验证失败），建议配置全局的令牌\u002F分钟或调用\u002F分钟速率限制。如果发生超时或速率限制错误，管道应尽早失败，而不是继续消耗 API 调用配额。相关改进已通过 PR 合并。","https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fissues\u002F46",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},11766,"如何在 UI 中自定义后端服务器地址？","用户现在可以通过 UI 自定义后端服务器（docking server）的端点。此功能已通过 PR #264 实现并合并，允许用户更灵活地配置后端连接地址，而不仅仅依赖于默认的环境变量设置。","https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fissues\u002F253",{"id":165,"question_zh":166,"answer_zh":167,"source_url":168},11767,"如何扩展 DocETL 以支持更多文件格式（如 PDF、Wikipedia 等）？","可以通过集成 LlamaIndex 的解析器来实现。LlamaIndex 提供了广泛的文件解析支持。DocETL 的入口点（entrypoint）设计本身就是一个简单的插件系统，可以利用这一点来包装和接入 LlamaIndex 的解析功能，从而轻松支持多种文件格式。","https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fissues\u002F51",[170,175,180,185,190,195,200,205,210,215,220,225,230,235],{"id":171,"version":172,"summary_zh":173,"released_at":174},62230,"0.2.6","## 变更内容\n* 网站：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F403 中添加新的展示示例\n* 修复展示示例中的 bug：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F404 中完成\n* 修复 map 参数中硬编码的 `calibrate: true` 温度覆盖问题：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F406 中完成\n* 重构支持多分层键的样本操作：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F408 中完成\n* 更新 Pandas API 以使用新的输出参数格式：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F409 中完成\n* 新特性：添加 topk 实现：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F410 中完成\n* 新特性：更新 topk 以同时获取排名：由 
@shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F411 中完成\n* 在 init 文件中导入所有操作：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F413 中完成\n* 实现 gleaning 模型临时修复：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F414 中完成\n* 澄清前端和后端环境变量的使用方法：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F418 中完成\n* 修复网站 npm 安装错误和警告：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F423 中完成\n* 修复提示与输出模式之间的不匹配：由 @anishathalye 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F425 中完成\n* 为验证函数添加类型检查：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F427 中完成\n* 修复 resolve 操作中余弦相似度阻塞的问题：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F428 中完成\n* 修复 gather 运算符文档中的错别字：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F430 中完成\n* 修复一个 gleaning 相关的 bug：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F432 中完成\n* 使错误提示可复制：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F433 中完成\n* 允许在 UI 中加载并行 map 操作：由 @jonhilgart22 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F434 中完成\n* 使用 gleaning 模型验证并优化答案：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F436 中完成\n* 构建交互式管道可视化编辑器：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F437 中完成\n* 使用操作更新 llms-full.txt 文件：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F440 中完成\n* 修复 Python 3.14 下 pydantic-core 构建错误：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F441 中完成\n* 将自然语言设置为默认入口点：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F438 中完成\n* 
澄清管道生成器的对话退出选项：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F443 中完成\n* 基于代理的网络爬虫，配备交互式数据查看器：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F442 中完成\n* 处理爬虫工具调用错误：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F444 中完成\n* 修复 Next.js 中的 TypeScript 构建错误：由 @shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F445 中完成\n* 关闭 Issue #448。添加 OR 条件…","2025-12-28T04:44:06",{"id":176,"version":177,"summary_zh":178,"released_at":179},62231,"0.2.5","## 变更内容\n* 功能：为地图操作添加校准支持，以提高一致性，由 shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F365 中实现\n* 杂项：添加对 docling-serve v1alpha API 的支持，并提供旧版回退功能，由 shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F364 中实现\n* 修复：正确创建连接优化器对象，由 shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F366 中实现\n* 修复：减少优化器中的错误，由 shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F368 中实现\n* 修复：在添加系统提示之前移动截断消息的操作，由 shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F369 中实现\n* 文档：改进 gleaning 的描述，由 shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F370 和 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F371 中实现\n* 修复：改进缓存机制，并且对于无效的 gather 配置不再抛出错误，由 shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F373 中实现\n* 功能：添加条件式 gleaning 功能，由 shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F375 中实现\n* 杂项：升级 fastapi 和 Python 的 multipart 库，由 shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F376 中实现\n* 清理并重新组织 pytest 测试用例，由 shreyashankar 在 https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F377 中实现\n* 重构 api.py 以支持结构化输出，由 shreyashankar 在 
https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F378\n* Add operators to the pandas API by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F379\n* feat: add global cache bypass by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F383\n* Update cluster docs to use Jinja loops by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F386\n* fix: truncate documents for embeddings in resolve by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F385\n* Fix GitHub issue 384 by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F387\n* Get make lint passing by sidjha1 in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F388\n* Add stub files for PyYAML and requests by sidjha1 in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F390\n* Remove old typing imports by sidjha1 in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F389\n* Clean up utility code by sidjha1 in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F391\n* Improve syntax_check using pydantic validation by sidjha1 in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F392\n* Add structured output mode support to the pandas API by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F394\n* Add new demo by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F395\n* website: add SEO optimization for the showcase page by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F396\n* website: further SEO improvements by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F397\n* website: add SEO optimization by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F398\n* Add SEO optimization by shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F399\n* chore: switch to cloudbank for blob storage by shreyashankar in 
https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpul","2025-08-09T20:33:26",{"id":181,"version":182,"summary_zh":183,"released_at":184},62232,"0.2.4","## What's Changed\n* fix: add system prompt and other config variables to the Python API by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F356\n* Support Pandas DataFrame as input by @sfc-gh-jdu in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F357\n* feat: add `api_base` config option to the YAML file by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F359\n* feat: add extract operator by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F361\n* Update README.md by @sotoblanco in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F358\n\n## New Contributors\n* @sotoblanco made their first contribution in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F358\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fcompare\u002F0.2.3...0.2.4","2025-05-21T05:52:26",{"id":186,"version":187,"summary_zh":188,"released_at":189},62233,"0.2.3","## What's Changed\n* expts: evaluate the performance of structured outputs by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F291\n* Handle DeepSeek R1 distilled models: no tool calling, and add a think column by @lemig in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F295\n* chore: update packages by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F300\n* Allow importing non-UTF-8 encoded text files by @lemig in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F299\n* Detect case variants: DeepSeek-R1 vs. deepseek-r1 by @lemig in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F304\n* 🚀 feat: add Docker Compose support by @Sunwood-ai-labs in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F301\n* feat: allow the optimizer model to be any Litellm-supported model by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F307\n* feat: support native PDF processing for Gemini and Claude LLMs by @shreyashankar in 
https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F309\n* feat: refactor the Map optimizer by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F311\n* feat: add rate limiting in the optimizer by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F313\n* feat: add NL functionality to the pipeline dialog by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F314\n* fix: warnings in pd_accessors.py by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F316\n* feat: add a PDF URL key option for Map operations in DocWrangler by @shabie in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F315\n* fix: uploading pipeline YAML files with complex schemas by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F317\n* feat: add n parameter to output by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F320\n* Update sample.py by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F322\n* feat: add a flag to stream Map operation outputs to disk by @shabie in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F323\n* perf: remove client-side JSON validation logic by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F325\n* fix: do not print local variables in Rich tracebacks by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F329\n* feat: render `vega` or `vega-lite` code blocks as interactive charts by @hydrosquall in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F327\n* feat: add GPT tokenizer to the website by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F330\n* feat: add AWS support by @shabie in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F331\n* docs: add Python documentation examples by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F334\n* feat: add a _file_path field to each document when uploading a dataset by @shreyashankar in 
https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F335\n* feat: add support for Snowflake Cortex as an LLM provider by @sfc-gh-jdu in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F336\n* docs: add Python API quickstart guide by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F337\n* feat: add TPM rate limiting and pipeline settings in DocWrangler by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F","2025-04-29T04:05:25",{"id":191,"version":192,"summary_zh":193,"released_at":194},62234,"0.2.2","## What's Changed\n* refactor: DSLRunner now uses a pull-based execution model by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F273\n* feat: add customer support ticket pipeline by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F276\n* test: fix Makefile by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F277\n* fix: improve dataset loading performance by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F279\n* feat: UX improvements for the equijoin optimizer by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F280\n* docs: add verbose parameter by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F283\n* fix: allow litellm completion kwargs to be used in the frontend by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F285\n* feat: add remote upload and CSV file upload by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F289\n* feat: add Pandas DataFrame accessors by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F287\n* feat: add enum types in the UI by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F292\n* fix: fix input context length logic by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F293 
\n\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fcompare\u002F0.2.1...0.2.2","2025-01-29T16:48:47",{"id":196,"version":197,"summary_zh":198,"released_at":199},62235,"0.2.1","## What's Changed\n* fix: ensure output display works after namespacing by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F230\n* chore: update lock file by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F231\n* Converting PDFs on a local GPU is unbearably slow… by @plpycoin in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F229\n* feat: clean up the UI so it looks more consistent and polished by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F233\n* fix: add sampling back by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F236\n* fix: add strict Jinja template rendering by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F237\n* fix: resolve infinite re-render bug by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F241\n* wip: separate the frontend from the backend so we can host the frontend by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F242\n* fix: add OpenAI key to chatbot request headers in the Playground by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F243\n* chore: improve the website's mobile experience by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F244\n* feat: improve website layout by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F245\n* fix: resizing the website breaks the demo by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F246\n* fix: show namespace dialog on load by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F247\n* Minor UI tweaks by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F249\n* feat: add example pipeline for Supreme Court hearing transcripts by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F250\n* fix: even when not all documents contain every key, output 
CSV writing should still work by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F252\n* Add enum support by @rrawatt in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F254\n* chore: add a parameter to skip LLM calls when they fail by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F255\n* feat: add Azure OpenAI for the frontend assistant by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F256\n* refactor: update UI text to docwrangler by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F257\n* fix: map optimizer metadata error by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F258\n* docs: update Playground docs by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F259\n* chore: improve logging by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F262\n* feat: let users specify their own docling server by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F263\n* fix: bypass Vercel serverless functions when uploading datasets by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F266\n* feat: add llmstxt by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F267\n* fix: equijoin is outdated by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F269\n* chore: add friendlier error messages for Gem","2025-01-09T09:10:11",{"id":201,"version":202,"summary_zh":203,"released_at":204},62236,"0.2.0","## What's Changed\n* Sample feature by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F92\n* Outlier handling feature by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F91\n* Operation improvements for the sample feature (including outlier handling) by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F100\n* #91 docs > project rename by @garuna-m6 in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F103\n* chore: update dependency versions by @shreyashankar in 
https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F105\n* docs: add 'output' parameter to the ResolveOp code example by @goutham794 in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F106\n* fix: edit the agent that synthesizes resolve tasks by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F109\n* fix: update Python API with cluster and sample operations by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F113\n* docs: add sample and cluster to the docs by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F114\n* New API by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F115\n* fix: make docs work by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F119\n* feat: add human-in-the-loop to the split-map-gather decomposition by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F120\n* fix: cache partial pipeline runs by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F122\n* Mark test as flaky by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F123\n* Only compare distinct pairs in resolve by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F124\n* Link resolve operation by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F117\n* Fix resolve and map progress bars by @michielree in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F126\n* Improve auto-batching of resolve LLM calls by @sushruth2003 in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F128\n* Merge auto-batching PR by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F129\n* UI v1! by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F118\n* Upgrade litellm to v1.51.0-stable by @Tendo33 in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F131\n* feat: add batching for map and filter calls by @shreyashankar in 
https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F133\n* docs: link filter to map by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F135\n* fix: optimizer bug that prevented optimizing reduce operations with Azure by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F136\n* fix: only call os.makedirs when the path is non-empty by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F137\n* UI: add a basic chat-based assistant by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F139\n* fix: clear and run button should also bypass the cache by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F140\n* feat: add UDF support by @staru09 in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F138\n* Merge staging into MAIN by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F141\n* chore: load environment variables from the current directory by @plpycoin in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F142\n* feat:","2024-12-04T04:34:41",{"id":206,"version":207,"summary_zh":208,"released_at":209},62237,"0.1.7","## What's Changed\n* Add operation hashing and caching by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F61\n* docs: improve pipeline API documentation by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F62\n* refactor: add website code by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F65\n* (partial) fix: add exponential backoff for rate limit errors by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F66\n* fix: make gleaning LLM calls work by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F70\n* Add llama-index-based parser by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F71\n* Merge staging into main by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F74\n* feat: add pdfgpt to parse PDF files by @staru09 in 
https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F67\n* Merge staging into main (from add gpt_pdf) by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F76\n* fix: disable additional properties for Gemini by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F73\n* Rate limiting by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F64\n* feat: support rate limiting by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F79\n* feat: add verbose parameter for gleaning by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F80\n* Parsers can now return any number of fields and can access the whole item by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F81\n* Merge staging into main (after parser refactor) by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F82\n* docs: add sample parameter by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F87\n* Clustering by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F84\n* Merge staging into main (after adding the cluster operator) by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F88\n* feat: output as CSV if the user specifies a CSV file by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F89\n* Rename internal methods by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F90\n* Small improvements to clean up the API by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F93\n* refactor: move validation and gleaning into call llm by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F98\n* Merge staging into main by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F99\n* feat: add lineage tracking for reduce operations by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F101\n* fix: change gleaning prompt to validation_prompt by @shreyashankar in 
https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F102\n\n## New Contributors\n* @staru09 made their first contribution in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F67\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fcompare\u002F0.1.6...0.1.7","2024-10-14T02:01:17",{"id":211,"version":212,"summary_zh":213,"released_at":214},62238,"0.1.6","## What's Changed\n* docs: fix the resolve docs by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F27\n* docs: add link to Ollama chat by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F28\n* feat: better progress bars by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F30\n* Add batching support for map operations with configurable parameters by @orban in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F16\n* feat: add batch limits to map operations by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F31\n* Add dataset class and parsing tools by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F32\n* docs: improve clarity of custom parsing by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F34\n* Add Azure Document Intelligence reader tool by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F36\n* fix: read the .env file from the user's current working directory and adjust the tool schema so Ollama Llama models work better by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F38\n* Fix sqlite3 operational error in the cache by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F40\n* fix: make diskcache reads thread-safe by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F42\n* fix: fix template issue in tutorial.md by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F43\n* Fix RateLimit error by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F39\n* feat: add PaddleOCR by @shreyashankar in 
https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F44\n* Add entry points by @redhog in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F45\n* fix: avoid caching results of invalid LLM calls by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F52\n* fix: set default tokenizer to GPT-4o by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F57\n* feat: print the LLM message history and tools used when an InvalidOutputError occurs by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F53\n* feat: do not use tool calling for Ollama and OSS models if the output schema has only one parameter by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F59\n* docs: add documentation on using the split gather pipeline by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F60\n\n## New Contributors\n* @orban made their first contribution in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F16\n* @redhog made their first contribution in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F40\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fcompare\u002F0.1.5...0.1.6","2024-10-03T21:31:47",{"id":216,"version":217,"summary_zh":218,"released_at":219},62239,"0.1.5","## What's Changed\n* fix: add an error message if the model does not support tool calling by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F26\n\n\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fcompare\u002F0.1.4...0.1.5","2024-09-30T03:44:22",{"id":221,"version":222,"summary_zh":223,"released_at":224},62240,"0.1.4","## What's Changed\r\n* fix: manually try to parse ollama outputs, even if it is not valid json by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F25\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fcompare\u002F0.1.3...0.1.4","2024-09-30T03:07:49",{"id":226,"version":227,"summary_zh":228,"released_at":229},62241,"0.1.3","## What's Changed\r\n* quality 
of life: show error when trying to execute resolve without blocking by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F12\r\n* Optionally persist intermediates for reduce by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F14\r\n* Add a save config method to the Python API by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F15\r\n* Remove unnecessary name parameter from parallel map operation. by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F17\r\n* Add podcast to readme by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F18\r\n* fix: remove openai client call in `utils.py` by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F22\r\n* Add Configurable Timeouts for Operations and Ollama Integration Documentation by @shreyashankar in https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fpull\u002F24\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fucbepic\u002Fdocetl\u002Fcompare\u002F0.1.2...0.1.3","2024-09-29T23:08:15",{"id":231,"version":232,"summary_zh":233,"released_at":234},62242,"0.1.2","This release fixes a bug where the typer dependency was missing. It also adds a Python API.","2024-09-23T05:09:48",{"id":236,"version":237,"summary_zh":238,"released_at":239},62243,"0.1.1","**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fshreyashankar\u002Fdocetl\u002Fcompare\u002F0.1.0...0.1.1","2024-09-17T19:09:41"]