[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Unstructured-IO--unstructured-api":3,"tool-Unstructured-IO--unstructured-api":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":68,"owner_location":68,"owner_email":80,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":111,"forks":112,"last_commit_at":113,"license":114,"difficulty_score":10,"env_os":115,"env_gpu":115,"env_ram":115,"env_deps":116,"category_tags":119,"github_topics":68,"view_count":23,"oss_zip_url":68,"oss_zip_packed_at":68,"status":16,"created_at":120,"updated_at":121,"faqs":122,"releases":151},2447,"Unstructured-IO\u002Funstructured-api","unstructured-api",null,"unstructured-api 是一款专为处理非结构化数据设计的开源 API 工具，旨在帮助开发者轻松将各种复杂文档转化为机器可读的结构化数据。在日常工作中，我们常常面对 PDF、Word、PowerPoint、图片甚至电子邮件等格式各异的文件，这些非结构化数据难以直接被人工智能模型或数据分析系统利用。unstructured-api 正是为了解决这一痛点而生，它能够高效解析并提取这些文档中的文本、表格及布局信息，将其转换为干净、标准的格式，从而打通从原始文档到智能应用之间的“最后一公里”。\n\n这款工具特别适合软件开发者、数据工程师以及从事大语言模型（LLM）应用构建的研究人员使用。如果你正在开发需要读取和理解文档内容的 AI 应用，或者需要构建企业级的知识库检索系统，unstructured-api 能提供强大的底层支持。它不仅简化了繁琐的数据预处理流程，还显著提升了数据清洗的效率。\n\n在技术亮点方面，unstructured-api 引入了名为“Chipper”的测试版模型，专门针对高分辨率和布局复杂的文档进行了优化。通过启用“hi_res”策略，用户可以显著提升对复杂版面文档的","unstructured-api 是一款专为处理非结构化数据设计的开源 API 工具，旨在帮助开发者轻松将各种复杂文档转化为机器可读的结构化数据。在日常工作中，我们常常面对 PDF、Word、PowerPoint、图片甚至电子邮件等格式各异的文件，这些非结构化数据难以直接被人工智能模型或数据分析系统利用。unstructured-api 
正是为了解决这一痛点而生，它能够高效解析并提取这些文档中的文本、表格及布局信息，将其转换为干净、标准的格式，从而打通从原始文档到智能应用之间的“最后一公里”。\n\n这款工具特别适合软件开发者、数据工程师以及从事大语言模型（LLM）应用构建的研究人员使用。如果你正在开发需要读取和理解文档内容的 AI 应用，或者需要构建企业级的知识库检索系统，unstructured-api 能提供强大的底层支持。它不仅简化了繁琐的数据预处理流程，还显著提升了数据清洗的效率。\n\n在技术亮点方面，unstructured-api 引入了名为“Chipper”的测试版模型，专门针对高分辨率和布局复杂的文档进行了优化。通过启用“hi_res”策略，用户可以显著提升对复杂版面文档的解析精度。此外，该服务目前提供免费访问（需申请 API Key），并拥有活跃的社区支持，方便用户交流反馈。无论是初创团队还是大型企业，都能借助 unstructured-api 快速实现文档数据的智能化转型，让非结构化数据真正产生价值。","\u003Ch3 align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured-api_readme_00da0c2b151a.png\" height=\"200\">\n\u003C\u002Fh3>\n\n\u003Ch3 align=\"center\">\n  \u003Cp>API Announcement!\u003C\u002Fp>\n\u003C\u002Fh3>\n\nWe are thrilled to announce our newly launched [Unstructured API](https:\u002F\u002Fdocs.unstructured.io\u002Fapi-reference\u002Foverview). While access to the hosted Unstructured API will remain free, API Keys are required to make requests. To prevent disruption, get yours [here](https:\u002F\u002Fwww.unstructured.io\u002F#get-api-key) now and start using it today! Check out the [readme](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api#--) here to get started making API calls.\u003C\u002Fp>\n\n#### :rocket: Beta Feature: Chipper Model\n\nWe are releasing the beta version of our Chipper model to deliver superior performance when processing high-resolution, complex documents. To start using the Chipper model in your API request, you can utilize the `hi_res` strategy. Please refer to the documentation [here](https:\u002F\u002Funstructured-io.github.io\u002Funstructured\u002Fapi.html#strategies).\n\nAs the Chipper model is in beta version, we welcome feedback and suggestions. 
For those interested in testing the Chipper model, we encourage you to connect with us on [Slack community](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Funstructuredw-kbe4326\u002Fshared_invite\u002Fzt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ).\n\n\u003Cdiv align=\"center\">\n\n \u003Ca\n   href=\"https:\u002F\u002Fwww.phorm.ai\u002Fquery?projectId=34efc517-2201-4376-af43-40c4b9da3dc5\">\n\t\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPhorm-Ask_AI-%23F2777A.svg?&logo=data:image\u002Fsvg+xml;base64,PHN2ZyB3aWR0aD0iNSIgaGVpZ2h0PSI0IiBmaWxsPSJub25lIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPgogIDxwYXRoIGQ9Ik00LjQzIDEuODgyYTEuNDQgMS40NCAwIDAgMS0uMDk4LjQyNmMtLjA1LjEyMy0uMTE1LjIzLS4xOTIuMzIyLS4wNzUuMDktLjE2LjE2NS0uMjU1LjIyNmExLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxMmMtLjA5OS4wMTItLjE5Mi4wMTQtLjI3OS4wMDZsLTEuNTkzLS4xNHYtLjQwNmgxLjY1OGMuMDkuMDAxLjE3LS4xNjkuMjQ2LS4xOTFhLjYwMy42MDMgMCAwIDAgLjItLjEwNi41MjkuNTI5IDAgMCAwIC4xMzgtLjE3LjY1NC42NTQgMCAwIDAgLjA2NS0uMjRsLjAyOC0uMzJhLjkzLjkzIDAgMCAwLS4wMzYtLjI0OS41NjcuNTY3IDAgMCAwLS4xMDMtLjIuNTAyLjUwMiAwIDAgMC0uMTY4LS4xMzguNjA4LjYwOCAwIDAgMC0uMjQtLjA2N0wyLjQzNy43MjkgMS42MjUuNjcxYS4zMjIuMzIyIDAgMCAwLS4yMzIuMDU4LjM3NS4zNzUgMCAwIDAtLjExNi4yMzJsLS4xMTYgMS40NS0uMDU4LjY5Ny0uMDU4Ljc1NEwuNzA1IDRsLS4zNTctLjA3OUwuNjAyLjkwNkMuNjE3LjcyNi42NjMuNTc0LjczOS40NTRhLjk1OC45NTggMCAwIDEgLjI3NC0uMjg1Ljk3MS45NzEgMCAwIDEgLjMzNy0uMTRjLjExOS0uMDI2LjIyNy0uMDM0LjMyNS0uMDI2TDMuMjMyLjE2Yy4xNTkuMDE0LjMzNi4wMy40NTkuMDgyYTEuMTczIDEuMTczIDAgMCAxIC41NDUuNDQ3Yy4wNi4wOTQuMTA5LjE5Mi4xNDQuMjkzYTEuMzkyIDEuMzkyIDAgMCAxIC4wNzguNThsLS4wMjkuMzJaIiBmaWxsPSIjRjI3NzdBIi8+CiAgPHBhdGggZD0iTTQuMDgyIDIuMDA3YTEuNDU1IDEuNDU1IDAgMCAxLS4wOTguNDI3Yy0uMDUuMTI0LS4xMTQuMjMyLS4xOTIuMzI0YTEuMTMgMS4xMyAwIDAgMS0uMjU0LjIyNyAxLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxNGMtLjEuMDEyLS4xOTMuMDE0LS4yOC4wMDZsLTEuNTYtLjEwOC4wMzQtLjQwNi4wMy0uMzQ4IDEuNTU5LjE1NGMuMDkgMCAuMTczLS4wMS4yNDgtLjAzM2EuNjAzLjYwMyAwIDAgMCAuMi0uMTA2LjUzMi41MzIgMCAwIDAgLjEzOS0uMTcyLjY2LjY2IDAgMCAwIC4wNjQtLjI0MWwuMDI5LS4zMjFhLjk0Ljk0IDAgM
CAwLS4wMzYtLjI1LjU3LjU3IDAgMCAwLS4xMDMtLjIwMi41MDIuNTAyIDAgMCAwLS4xNjgtLjEzOC42MDUuNjA1IDAgMCAwLS4yNC0uMDY3TDEuMjczLjgyN2MtLjA5NC0uMDA4LS4xNjguMDEtLjIyMS4wNTUtLjA1My4wNDUtLjA4NC4xMTQtLjA5Mi4yMDZMLjcwNSA0IDAgMy45MzhsLjI1NS0yLjkxMUExLjAxIDEuMDEgMCAwIDEgLjM5My41NzIuOTYyLjk2MiAwIDAgMSAuNjY2LjI4NmEuOTcuOTcgMCAwIDEgLjMzOC0uMTRDMS4xMjIuMTIgMS4yMy4xMSAxLjMyOC4xMTlsMS41OTMuMTRjLjE2LjAxNC4zLjA0Ny40MjMuMWExLjE3IDEuMTcgMCAwIDEgLjU0NS40NDhjLjA2MS4wOTUuMTA5LjE5My4xNDQuMjk1YTEuNDA2IDEuNDA2IDAgMCAxIC4wNzcuNTgzbC0uMDI4LjMyMloiIGZpbGw9IndoaXRlIi8+CiAgPHBhdGggZD0iTTQuMDgyIDIuMDA3YTEuNDU1IDEuNDU1IDAgMCAxLS4wOTguNDI3Yy0uMDUuMTI0LS4xMTQuMjMyLS4xOTIuMzI0YTEuMTMgMS4xMyAwIDAgMS0uMjU0LjIyNyAxLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxNGMtLjEuMDEyLS4xOTMuMDE0LS4yOC4wMDZsLTEuNTYtLjEwOC4wMzQtLjQwNi4wMy0uMzQ4IDEuNTU5LjE1NGMuMDkgMCAuMTczLS4wMS4yNDgtLjAzM2EuNjAzLjYwMyAwIDAgMCAuMi0uMTA2LjUzMi41MzIgMCAwIDAgLjEzOS0uMTcyLjY2LjY2IDAgMCAwIC4wNjQtLjI0MWwuMDI5LS4zMjFhLjk0Ljk0IDAgMCAwLS4wMzYtLjI1LjU3LjU3IDAgMCAwLS4xMDMtLjIwMi41MDIuNTAyIDAgMCAwLS4xNjgtLjEzOC42MDUuNjA1IDAgMCAwLS4yNC0uMDY3TDEuMjczLjgyN2MtLjA5NC0uMDA4LS4xNjguMDEtLjIyMS4wNTUtLjA1My4wNDUtLjA4NC4xMTQtLjA5Mi4yMDZMLjcwNSA0IDAgMy45MzhsLjI1NS0yLjkxMUExLjAxIDEuMDEgMCAwIDEgLjM5My41NzIuOTYyLjk2MiAwIDAgMSAuNjY2LjI4NmEuOTcuOTcgMCAwIDEgLjMzOC0uMTRDMS4xMjIuMTIgMS4yMy4xMSAxLjMyOC4xMTlsMS41OTMuMTRjLjE2LjAxNC4zLjA0Ny40MjMuMWExLjE3IDEuMTcgMCAwIDEgLjU0NS40NDhjLjA2MS4wOTUuMTA5LjE5My4xNDQuMjk1YTEuNDA2IDEuNDA2IDAgMCAxIC4wNzcuNTgzbC0uMDI4LjMyMloiIGZpbGw9IndoaXRlIi8+Cjwvc3ZnPgo=\" \u002F>\n   \u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\n---\n\n\u003Ch3 align=\"center\">\n  \u003Cp>General Pre-Processing Pipeline for Documents\u003C\u002Fp>\n\u003C\u002Fh3>\n\nThis repo implements a pre-processing pipeline for the following documents. 
Currently, the pipeline is capable of recognizing the file type and choosing the relevant partition function to process the file.\n\n\n| Category  | Document Types                |\n|-----------|-------------------------------|\n| Plaintext | `.txt`, `.eml`, `.msg`, `.xml`, `.html`, `.md`, `.rst`, `.json`, `.rtf` |\n| Images    | `.jpeg`, `.png`               |\n| Documents | `.doc`, `.docx`, `.ppt`, `.pptx`, `.pdf`, `.odt`, `.epub`, `.csv`, `.tsv`, `.xlsx` |\n| Zipped    | `.gz`                         |\n\n\n## :rocket: Unstructured API\n\nTry our hosted API! It's freely available to use with any of the filetypes listed above. This is the easiest way to get started. If you'd like to host your own version of the API, jump down to the [Developer Quickstart Guide](#developer-quick-start).\n\n```\n curl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -H 'unstructured-api-key: \u003CYOUR API KEY>' \\\n  -F 'files=@sample-docs\u002Ffamily-day.eml' \\\n  | jq -C . | less -R\n```\n\n### Parameters\n\n#### Strategies\n\nFour strategies are available for processing PDF\u002FImages files: `hi_res`, `fast`, `ocr_only` and `auto`. `fast` is the default `strategy` and works well for documents that do not have text embedded in images.\n\nOn the other hand, `hi_res` is the better choice for PDFs that may have text within embedded images, or for achieving greater precision of [element types](https:\u002F\u002Funstructured-io.github.io\u002Funstructured\u002Fgetting_started.html#document-elements) in the response JSON. Please be aware that, as of writing, `hi_res` requests may take 20 times longer to process compared to the `fast` option. 
See the example below for making a `hi_res` request.\n\n```\n curl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Flayout-parser-paper.pdf' \\\n  -F 'strategy=hi_res' \\\n  | jq -C . | less -R\n```\n\nThe `ocr_only` strategy runs the document through Tesseract for OCR. Currently, `hi_res` has difficulty ordering elements for documents with multiple columns. If you have a document with multiple columns that do not have extractable text, we recommend using the `ocr_only` strategy. Please be aware that `ocr_only` will fall back to another strategy if Tesseract is not available.\n\nFor the best of all worlds, `auto` will determine when a page can be extracted using `fast` or `ocr_only` mode, otherwise it will fall back to `hi_res`.\n\n#### Hi Res model name\n\nThe `hi_res` strategy supports different models, and the default is `detectron2onnx`. You can also specify `hi_res_model_name` parameter to run `hi_res` strategy with the chipper model while using the host API:\n\n```\n curl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Flayout-parser-paper.pdf' \\\n  -F 'strategy=hi_res' \\\n  -F 'hi_res_model_name=chipper'  \\\n  | jq -C . | less -R\n```\n\nWe also support models to be used locally, for example, `yolox`. Please refer to the `using-the-api-locally` section for more information on how to use the local API.\n\n#### OCR languages\n\nNote: This kwarg will eventually be deprecated. Please use `languages`.\nYou can also specify what languages to use for OCR with the `ocr_languages` kwarg. See the [Tesseract documentation](https:\u002F\u002Fgithub.com\u002Ftesseract-ocr\u002Ftessdata) for a full list of languages and install instructions. 
OCR is only applied if the text is not already available in the PDF document.\n\n```\ncurl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Fenglish-and-korean.png' \\\n  -F 'strategy=ocr_only' \\\n  -F 'ocr_languages=eng'  \\\n  -F 'ocr_languages=kor'  \\\n  | jq -C . | less -R\n```\n\n#### Languages\n\nYou can also specify what languages to use for OCR with the `languages` kwarg. See the [Tesseract documentation](https:\u002F\u002Fgithub.com\u002Ftesseract-ocr\u002Ftessdata) for a full list of languages and install instructions. OCR is only applied if the text is not already available in the PDF document.\n\n```\ncurl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Fenglish-and-korean.png' \\\n  -F 'strategy=ocr_only' \\\n  -F 'languages=eng'  \\\n  -F 'languages=kor'  \\\n  | jq -C . | less -R\n```\n\n#### Coordinates\n\nWhen elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well. Set the `coordinates` parameter to `true` to add this field to the elements in the response.\n\n```\n curl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Flayout-parser-paper.pdf' \\\n  -F 'coordinates=true' \\\n  | jq -C . | less -R\n```\n\n#### Skip Table Extraction\n\nCurrently, we provide support for enabling and disabling table extraction for all file types. Set parameter `skip_infer_table_types` to specify the document types that you want to skip table extraction with. 
By default, we enable table extraction\nfor all file types (`skip_infer_table_types=[]`). Again, please note that table extraction only works with `hi_res` strategy. For example, if you want to skip table extraction for images, you can pass a list with matching image file types:\n\n```\n curl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Flayout-parser-paper-with-table.jpg' \\\n  -F 'strategy=hi_res' \\\n  -F 'skip_infer_table_types=[\"jpg\"]' \\\n  | jq -C . | less -R\n```\n\n#### Encoding\n\nYou can specify the encoding to use to decode the text input. If no value is provided, utf-8 will be used.\n\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'files=@sample-docs\u002Ffake-power-point.pptx' \\\n -F 'encoding=utf_8' \\\n | jq -C . | less -R\n```\n\n#### Gzipped files\n\nYou can send gzipped file and api will un-gzip it. \n\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'gz_uncompressed_content_type=application\u002Fpdf' \\\n -F 'files=@sample-docs\u002Flayout-parser-paper.pdf.gz' \n```\n\nIf field `gz_uncompressed_content_type` is set, the API will use its value as content-type of all files\nafter uncompressing the .gz files that are sent in single batch. If not set, the API will use\nvarious heuristics to detect the filetypes after uncompressing from .gz.\n\n#### XML Tags\n\nWhen processing XML documents, set the `xml_keep_tags` parameter to `true` to retain the XML tags in the output. 
If not specified, it will simply extract the text from within the tags.\n\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'files=@sample-docs\u002Ffake-xml.xml' \\\n -F 'xml_keep_tags=true' \\\n | jq -C . | less -R\n```\n\n#### Page Breaks\n\nFor supported filetypes, set the `include_page_breaks` parameter to `true` to include `PageBreak` elements in the output.\n\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'files=@sample-docs\u002Flayout-parser-paper-fast.pdf' \\\n -F 'include_page_breaks=true' \\\n | jq -C . | less -R\n```\n\n\n#### Unique element IDs\n\nBy default, the element ID is a SHA-256 hash of the element text. This is to ensure that\nthe ID is deterministic. One downside is that the ID is not guaranteed to be unique.\nDifferent elements with the same text will have the same ID, and there could also be hash collisions.\nTo use UUIDs in the output instead, set `unique_element_ids=true`. Note: this means that the element IDs\nwill be random, so with every partition of the same file, you will get different IDs.\nThis can be helpful if you'd like to use the IDs as a primary key in a database, for example.\n\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'files=@sample-docs\u002Flayout-parser-paper-fast.pdf' \\\n -F 'unique_element_ids=true' \\\n | jq -C . | less -R\n```\n\n\n#### Chunking Elements\n\nUse the `chunking_strategy` form-field to chunk text into larger or smaller elements. Defaults to `None`, which performs no chunking. 
The available chunking strategies are `basic` and `by_title`.\n\nThe `basic` strategy combines whole consecutive document elements to maximally fill chunks of `max_characters` length. A single element that by itself exceeds `max_characters` is divided into two or more chunks by text-splitting (on a word boundary).\n\nThe `by_title` strategy has the same behaviors except document section boundaries are respected, meaning elements from two different sections never occur in the same chunk. A `Title` (section heading) element introduces a new section, hence the name.\n\n  Additional Parameters (all optional):\n\n    `max_characters`\n      The hard maximum chunk size. No chunk will exceed this length. Defaults to 500.\n\n    `new_after_n_chars`\n      A chunk of this length or greater is considered \"full\" and will not receive an additional element, even if it would fit within `max_characters`.\n      This \"soft-maximum\" defaults to `max_characters`.\n\n    `overlap`\n      Specifies the length of a string (\"tail\") to be drawn from each chunk and prefixed to the\n      next chunk as a context-preserving mechanism. By default, this only applies to split-chunks\n      where an oversized element is divided into multiple chunks by text-splitting.\n\n    `overlap_all`\n      Default: `False`. When `True`, apply overlap between \"normal\" chunks formed from whole\n      elements and not subject to text-splitting. Use this with caution as it entails a certain\n      level of \"pollution\" of otherwise clean semantic chunk boundaries.\n\n    `combine_under_n_chars`\n      Combines elements (for example a series of titles) until a section reaches a\n      length of n characters. Defaults to 500. Only operative for the \"by_title\"\n      strategy.\n\n    `multipage_sections`\n      If True, sections can span multiple pages. Defaults to True. 
Only operative for\n      the \"by_title\" strategy.\n\n\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'files=@sample-docs\u002Flayout-parser-paper-fast.pdf' \\\n -F 'chunking_strategy=by_title' \\\n | jq -C . | less -R\n```\n\n## Developer Quick Start\n\n* Using `pyenv` to manage virtualenvs is recommended\n\t* Mac install instructions. See [here](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Fcommunity#mac--homebrew) for more detailed instructions.\n\t\t* `brew install pyenv-virtualenv`\n\t  * `pyenv install 3.12`\n  * Linux instructions are available [here](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Fcommunity#linux).\n\n  * Create a virtualenv to work in and activate it, e.g. for one named `unstructured-api`:\n\n\t`pyenv virtualenv 3.12 unstructured-api` \u003Cbr \u002F>\n\t`pyenv activate unstructured-api`\n\nSee the [Unstructured Quick Start](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured#eight_pointed_black_star-quick-start) for the OS dependencies required to process all file types.\n\n* Run `make install`\n* Start a local jupyter notebook server with `make run-jupyter` \u003Cbr \u002F>\n\t**OR** \u003Cbr \u002F>\n\tjust start the FastAPI app locally with `make run-web-app`\n\n#### Using the API locally\n\nAfter running `make run-web-app` (or `make docker-start-api` to run in the container), you can hit the API locally at port 8000. The `sample-docs` directory has a number of example file types that are currently supported.\n\nFor example:\n```\n curl -X 'POST' \\\n  'http:\u002F\u002Flocalhost:8000\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Ffamily-day.eml' \\\n  | jq -C . 
| less -R\n```\n\nThe response will be a list of the extracted elements:\n```\n[\n  {\n    \"element_id\": \"db1ca22813f01feda8759ff04a844e56\",\n    \"coordinates\": null,\n    \"text\": \"Hi All,\",\n    \"type\": \"UncategorizedText\",\n    \"metadata\": {\n      \"date\": \"2022-12-21T10:28:53-06:00\",\n      \"sent_from\": [\n        \"Mallori Harrell \u003Cmallori@unstructured.io>\"\n      ],\n      \"sent_to\": [\n        \"Mallori Harrell \u003Cmallori@unstructured.io>\"\n      ],\n      \"subject\": \"Family Day\",\n      \"filename\": \"family-day.eml\"\n    }\n  },\n...\n...\n```\n\nThe output format can also be set to `text\u002Fcsv` to get the data in csv format rather than json:\n```\n curl -X 'POST' \\\n  'http:\u002F\u002Flocalhost:8000\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Ffamily-day.eml' \\\n  -F 'output_format=\"text\u002Fcsv\"'\n```\n\nThe response will be a list of the extracted elements in csv format:\n```\ntype,element_id,text,filename,sent_from,sent_to,subject,languages,filetype\nUncategorizedText,db1ca22813f01feda8759ff04a844e56,\"Hi All,\",family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nNarrativeText,a663c393a5e143c01ef2bb5c98efa2c1,Get excited for our first annual family day! 
,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nNarrativeText,ce65ca3bef59957d3f1c2bab5725c82f,\"There will be face painting, a petting zoo, funnel cake and more.\",family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nNarrativeText,d7bcf988af9f06042d83e25c531e5744,Make sure to RSVP!,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nTitle,5550577db69c2c8aabcd90979698120a,Best.,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nTitle,ca1c571d993b6c1ed8ef56a06c16ba22,Mallori Harrell,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nTitle,d5b612de8cd918addd9569b0255b65b2,Unstructured Technologies,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nTitle,2e0b9e8ee04b9594a9c26d8535b818ff,Data Scientist,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\n```\n\n#### Parallel Mode for PDFs\nAs mentioned above, processing a pdf using `hi_res` is currently a slow operation. One workaround is to split the pdf into smaller files, process these asynchronously, and merge the results. 
You can enable parallel processing mode with the following env variables:\n\n* `UNSTRUCTURED_PARALLEL_MODE_ENABLED` - set to `true` to process individual pdf pages remotely, default is `false`.\n* `UNSTRUCTURED_PARALLEL_MODE_URL` - the location to send pdf pages to asynchronously, no default setting at the moment.\n* `UNSTRUCTURED_PARALLEL_MODE_THREADS` - the number of threads making requests at once, default is `3`.\n* `UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE` - the number of pages to be processed in one request, default is `1`.\n* `UNSTRUCTURED_PARALLEL_RETRY_ATTEMPTS` - the number of retry attempts on a retryable error, default is `2` (i.e. 3 attempts are made in total).\n\nDue to the overhead associated with file splitting, parallel processing mode is only recommended for the `hi_res` strategy. Additionally, users of the official [Python client](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-python-client?tab=readme-ov-file#splitting-pdf-by-pages) can enable client-side splitting by setting `split_pdf_page=True`.\n\n#### Security\nYou may also set the optional `UNSTRUCTURED_API_KEY` env variable to enable request validation for your self-hosted instance of Unstructured. If set, only requests including an `unstructured-api-key` header with the same value will be fulfilled. Otherwise, the server will return a 401 indicating that the request is unauthorized.\n\n#### Controlling Server Load\nSome documents will use a lot of memory as they're being processed. To mitigate OOM errors, the server will return a 503 if the host's available memory drops below 2GB. This is configured with the environment variable `UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB`, which defaults to 2048. You can lower this value to reduce these 503 responses, that is, allow the server to use more memory. Alternatively, set it to 0 to remove this check entirely.\n\n#### Controlling server life time\nBy default, the server will run indefinitely. 
To change this, set the `MAX_LIFETIME_SECONDS` environment variable. If the server is run with this variable set, it will enter a graceful shutdown period `MAX_LIFETIME_SECONDS` seconds after its initialization. The graceful shutdown period lasts for up to 3600 seconds, and during it:\n- the server denies any new requests - they're met with an empty response,\n- the server continues processing active requests and shuts down (ending the graceful period) once all of them are processed.\n\nIf the server is still running after the graceful period is over, it is shut down forcefully, cancelling all active requests and sending empty responses to each of them.\n\n*Max lifetime requires GNU [timeout](https:\u002F\u002Fwww.gnu.org\u002Fsoftware\u002Fcoreutils\u002Fmanual\u002Fhtml_node\u002Ftimeout-invocation.html#timeout-invocation) to be installed, which is available by default on most Linux systems. On macOS it can be installed as gtimeout with GNU coreutils.*\n\n## :dizzy: Instructions for using the Docker image\n\nThe following instructions are intended to help you get up and running using Docker to interact with `unstructured-api`.\nSee [here](https:\u002F\u002Fdocs.docker.com\u002Fget-docker\u002F) if you don't already have Docker installed on your machine.\n\nNOTE: we build multi-platform images to support both x86_64 and Apple silicon hardware. `docker pull` should download the corresponding image for your architecture, but you can specify it with `--platform` (e.g. `--platform linux\u002Famd64`) if needed.\n\nWe build Docker images for all pushes to `main`. We tag each image with the corresponding short commit hash (e.g. `fbc7a69`) and the application version (e.g. `0.5.5-dev1`). We also tag the most recent image with `latest`. 
To leverage this, `docker pull` from our image repository.\n\n```bash\ndocker pull downloads.unstructured.io\u002Funstructured-io\u002Funstructured-api:latest\n```\n\nOnce pulled, you can launch the container as a web app on localhost:8000.\n\n```bash\ndocker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io\u002Funstructured-io\u002Funstructured-api:latest\n```\n\nYou can pass in a PORT variable to run the server on a different port in the container.\n\n```bash\ndocker run -p 9500:9500 -d --rm --name unstructured-api -e PORT=9500 downloads.unstructured.io\u002Funstructured-io\u002Funstructured-api:latest\n```\n\n## Security Policy\n\nSee our [security policy](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Fpipeline-emails\u002Fsecurity\u002Fpolicy) for\ninformation on how to report security vulnerabilities.\n\n## Learn more\n\n| Section | Description |\n|-|-|\n| [Unstructured Community GitHub](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Fcommunity) | Information about Unstructured.io community projects  |\n| [Unstructured GitHub](https:\u002F\u002Fgithub.com\u002FUnstructured-IO) | Unstructured.io open source repositories |\n| [Company Website](https:\u002F\u002Funstructured.io) | Unstructured.io product and company info |\n\n## :chart_with_upwards_trend: Analytics\n\nWe’ve partnered with Scarf (https:\u002F\u002Fscarf.sh) to collect anonymized user statistics to understand which features our community is using and how to prioritize product decision-making in the future. 
To learn more about how we collect and use this data, please read our [Privacy Policy](https:\u002F\u002Funstructured.io\u002Fprivacy-policy).\n","\u003Ch3 align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured-api_readme_00da0c2b151a.png\" height=\"200\">\n\u003C\u002Fh3>\n\n\u003Ch3 align=\"center\">\n  \u003Cp>API Announcement!\u003C\u002Fp>\n\u003C\u002Fh3>\n\nWe are thrilled to announce our newly launched [Unstructured API](https:\u002F\u002Fdocs.unstructured.io\u002Fapi-reference\u002Foverview). While access to the hosted Unstructured API will remain free, an API key is required to make requests. To avoid disruption, get your key [here](https:\u002F\u002Fwww.unstructured.io\u002F#get-api-key) now and start using it today! Check out the [README](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api#--) here to get started making API calls.\u003C\u002Fp>\n\n#### :rocket: Beta Feature: Chipper Model\n\nWe have released a beta version of the Chipper model, which delivers better performance when processing high-resolution, complex documents. To use the Chipper model in your API requests, use the `hi_res` strategy. See [here](https:\u002F\u002Funstructured-io.github.io\u002Funstructured\u002Fapi.html#strategies) for details.\n\nBecause the Chipper model is still in beta, we welcome your feedback and suggestions. If you are interested in trying out the Chipper model, join our [Slack community](https:\u002F\u002Fjoin.slack.com\u002Ft\u002Funstructuredw-kbe4326\u002Fshared_invite\u002Fzt-1x7cgo0pg-PTptXWylzPQF9xZolzCnwQ) to talk with us.\n\n\u003Cdiv align=\"center\">\n\n \u003Ca\n   href=\"https:\u002F\u002Fwww.phorm.ai\u002Fquery?projectId=34efc517-2201-4376-af43-40c4b9da3dc5\">\n\t\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPhorm-Ask_AI-%23F2777A.svg\">......\n\nThe `ocr_only` strategy runs the document through Tesseract for OCR. Currently, `hi_res` has difficulty ordering elements in multi-column documents. If you have a multi-column document that does not have extractable text, we recommend the `ocr_only` strategy. Note that `ocr_only` falls back to another strategy if Tesseract is not available.\n\nFor the best of all worlds, the `auto` strategy determines whether a page can be extracted using `fast` or `ocr_only` mode, and otherwise falls back to `hi_res`.\n\n#### Hi Res Model Name\n\nThe `hi_res` strategy supports different models; the default is `detectron2onnx`. You can also specify the `hi_res_model_name` parameter to run the `hi_res` strategy with the chipper model while using the hosted API:\n\n```\n curl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Flayout-parser-paper.pdf' \\\n  -F 'strategy=hi_res' \\\n  -F 'hi_res_model_name=chipper'  \\\n  | jq -C . 
| less -R\n```\n\nWe also support models for use locally, for example `yolox`. For more information on using the API locally, see the `using-the-api-locally` section.\n\n#### OCR languages\n\nNote: this kwarg will eventually be deprecated. Please use `languages` instead.\nYou can also specify which languages to use for OCR with the `ocr_languages` kwarg. See the [Tesseract documentation](https:\u002F\u002Fgithub.com\u002Ftesseract-ocr\u002Ftessdata) for a full list of languages and install instructions. OCR is applied only if text is not already available in the PDF document.\n\n```\ncurl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Fenglish-and-korean.png' \\\n  -F 'strategy=ocr_only' \\\n  -F 'ocr_languages=eng'  \\\n  -F 'ocr_languages=kor'  \\\n  | jq -C . | less -R\n```\n\n#### Languages\n\nYou can also specify which languages to use for OCR with the `languages` kwarg. See the [Tesseract documentation](https:\u002F\u002Fgithub.com\u002Ftesseract-ocr\u002Ftessdata) for a full list of languages and install instructions. OCR is applied only if text is not already available in the PDF document.\n\n```\ncurl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Fenglish-and-korean.png' \\\n  -F 'strategy=ocr_only' \\\n  -F 'languages=eng'  \\\n  -F 'languages=kor'  \\\n  | jq -C . | less -R\n```\n\n#### Coordinates\n\nWhen elements are extracted from PDFs or images, it may be useful to get their bounding boxes as well. Set the `coordinates` parameter to `true` to add this field to the elements in the response.\n\n```\n curl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Flayout-parser-paper.pdf' \\\n  -F 'coordinates=true' \\\n  | jq -C . 
| less -R\n```\n\n#### Skip Table Extraction\n\nCurrently, table extraction can be enabled or disabled for all file types. Set the `skip_infer_table_types` parameter to specify the document types for which table extraction should be skipped. By default, table extraction is enabled for all file types (`skip_infer_table_types=[]`). Note that table extraction only works with the `hi_res` strategy. For example, to skip table extraction for images, pass a list containing the corresponding image file types:\n\n```\n curl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Flayout-parser-paper-with-table.jpg' \\\n  -F 'strategy=hi_res' \\\n  -F 'skip_infer_table_types=[\"jpg\"]' \\\n  | jq -C . | less -R\n```\n\n#### Encoding\n\nYou can specify the encoding used to decode text input. If no value is provided, UTF-8 is used.\n\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'files=@sample-docs\u002Ffake-power-point.pptx' \\\n -F 'encoding=utf_8' \\\n | jq -C . | less -R\n```\n\n#### Gzipped files\n\nYou can send gzipped files and the API will decompress them automatically.\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'gz_uncompressed_content_type=application\u002Fpdf' \\\n -F 'files=@sample-docs\u002Flayout-parser-paper.pdf.gz'\n```\n\nIf the `gz_uncompressed_content_type` field is set, the API uses its value as the content type of every file after decompression, provided the files are sent in a single batch. If the field is not set, the API detects the type of each decompressed file using various heuristics.\n\n#### XML Tags\n\nWhen processing XML documents, set the `xml_keep_tags` parameter to `true` to retain the XML tags in the output. If the parameter is not specified, only the text inside the tags is extracted.\n\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'files=@sample-docs\u002Ffake-xml.xml' \\\n -F 'xml_keep_tags=true' \\\n | jq -C . 
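| less -R\n```\n\n#### Page Breaks\n\nFor supported file types, set the `include_page_breaks` parameter to `true` to include `PageBreak` elements in the output.\n\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'files=@sample-docs\u002Flayout-parser-paper-fast.pdf' \\\n -F 'include_page_breaks=true' \\\n | jq -C . | less -R\n```\n\n\n#### Unique element IDs\n\nBy default, the element ID is a SHA-256 hash of the element text, which makes the ID deterministic. One downside is that the ID is not guaranteed to be unique: different elements with the same text have the same ID, and hash collisions are also possible. To use UUIDs in the output instead, set `unique_element_ids=true`. Note that element IDs are then random, so you get different IDs each time you partition the same file. This can be helpful if you want to use the IDs as, for example, a primary key in a database.\n\n```\ncurl -X 'POST' \\\n 'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'files=@sample-docs\u002Flayout-parser-paper-fast.pdf' \\\n -F 'unique_element_ids=true' \\\n | jq -C . | less -R\n```\n\n\n#### Chunking Elements\n\nUse the `chunking_strategy` form field to divide the text into larger or smaller elements. The default is `None`, which performs no chunking. The available chunking strategies are `basic` and `by_title`.\n\nThe `basic` strategy combines consecutive whole document elements to fill each chunk as fully as possible within the `max_characters` limit. If a single element exceeds `max_characters` on its own, it is split into two or more chunks on word boundaries.\n\nThe `by_title` strategy behaves the same way but also respects the document's section boundaries, so elements from two different sections never appear in the same chunk. A `Title` (section heading) element introduces a new section, hence the name.\n\n  Additional parameters (all optional):\n\n    `max_characters`\n      Hard maximum chunk size. No chunk will exceed this length. Defaults to 500.\n\n    `new_after_n_chars`\n      A chunk that reaches or exceeds this length is considered “full”, and no further elements are added to it even if they would still fit within `max_characters`. This “soft max” defaults to the value of `max_characters`.\n\n    `overlap`\n      Specifies the length of a string (the “tail”) taken from each chunk and prefixed to the next chunk as a context-preserving mechanism. By default, this applies only to oversized elements that were divided into multiple chunks by text splitting.\n\n    `overlap_all`\n      Default: `False`. When set to `True`, overlap is also applied between “normal” chunks formed from whole elements without text splitting. Use this with caution, as it can blur otherwise clean semantic chunk boundaries to some extent.\n\n    `combine_under_n_chars`\n      Combines elements (for example, a series of titles) until a section reaches the specified length in characters. Defaults to 500. Only applies to the `by_title` strategy.\n\n    `multipage_sections`\n      If `True`, sections can span multiple pages. Defaults to `True`. Only applies to the `by_title` strategy.\n\n\n```\ncurl -X 'POST' \\\n 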
'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n -H 'accept: application\u002Fjson'  \\\n -H 'Content-Type: multipart\u002Fform-data' \\\n -F 'files=@sample-docs\u002Flayout-parser-paper-fast.pdf' \\\n -F 'chunking_strategy=by_title' \\\n | jq -C . | less -R\n```\n\n\n\n## Developer Quick Start\n\n* Using `pyenv` to manage virtualenvs is recommended\n\t* Mac install instructions. See [here](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Fcommunity#mac--homebrew) for more detailed instructions.\n\t\t* `brew install pyenv-virtualenv`\n\t\t* `pyenv install 3.12`\n\t* Linux instructions are available [here](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Fcommunity#linux).\n\n\t* Create a virtualenv to work in and activate it, e.g. for one named `unstructured-api`:\n\n\t`pyenv virtualenv 3.12 unstructured-api` \u003Cbr \u002F>\n\t`pyenv activate unstructured-api`\n\nSee the [Unstructured Quick Start](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured#eight_pointed_black_star-quick-start) for the many OS dependencies that are required if the API needs to process all file types.\n\n* Run `make install`\n* Start a local Jupyter notebook server with `make run-jupyter` \u003Cbr \u002F>\n\t**OR** \u003Cbr \u002F>\n\tjust start the FastAPI app locally with `make run-web-app`\n\n#### Using the API locally\n\nAfter running `make run-web-app` (or `make docker-start-api` to run in the container), you can access the API on local port 8000. The `sample-docs` directory contains a number of example file types that are currently supported.\n\nFor example:\n```\n curl -X 'POST' \\\n  'http:\u002F\u002Flocalhost:8000\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Ffamily-day.eml' \\\n  | jq -C . 
| less -R\n```\n\nThe response will be a list of the extracted elements:\n```\n[\n  {\n    \"element_id\": \"db1ca22813f01feda8759ff04a844e56\",\n    \"coordinates\": null,\n    \"text\": \"Hi All,\",\n    \"type\": \"UncategorizedText\",\n    \"metadata\": {\n      \"date\": \"2022-12-21T10:28:53-06:00\",\n      \"sent_from\": [\n        \"Mallori Harrell \u003Cmallori@unstructured.io>\"\n      ],\n      \"sent_to\": [\n        \"Mallori Harrell \u003Cmallori@unstructured.io>\"\n      ],\n      \"subject\": \"Family Day\",\n      \"filename\": \"family-day.eml\"\n    }\n  },\n...\n...\n```\n\nThe output format can also be set to `text\u002Fcsv` to get the data in CSV format rather than JSON:\n```\n curl -X 'POST' \\\n  'http:\u002F\u002Flocalhost:8000\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Ffamily-day.eml' \\\n  -F 'output_format=\"text\u002Fcsv\"'\n```\n\nThe response will be a list of the extracted elements in CSV format:\n```\ntype,element_id,text,filename,sent_from,sent_to,subject,languages,filetype\nUncategorizedText,db1ca22813f01feda8759ff04a844e56,\"Hi All,\",family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nNarrativeText,a663c393a5e143c01ef2bb5c98efa2c1,Get excited for our first annual family day! 
,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nNarrativeText,ce65ca3bef59957d3f1c2bab5725c82f,\"There will be face painting, a petting zoo, funnel cake and more.\",family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nNarrativeText,d7bcf988af9f06042d83e25c531e5744,Make sure to RSVP!,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nTitle,5550577db69c2c8aabcd90979698120a,Best.,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nTitle,ca1c571d993b6c1ed8ef56a06c16ba22,Mallori Harrell,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nTitle,d5b612de8cd918addd9569b0255b65b2,Unstructured Technologies,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\nTitle,2e0b9e8ee04b9594a9c26d8535b818ff,Data Scientist,family-day.eml,['Mallori Harrell \u003Cmallori@unstructured.io>'],['Mallori Harrell \u003Cmallori@unstructured.io>'],Family Day,['eng'],message\u002Frfc822\n```\n\n#### Parallel Mode for PDFs\nAs mentioned above, processing a PDF with `hi_res` is currently a slow operation. One workaround is to split the PDF into smaller files, process them asynchronously, and merge the results. Parallel processing mode can be enabled with the following environment variables:\n\n* `UNSTRUCTURED_PARALLEL_MODE_ENABLED` - set to `true` to process individual PDF pages remotely; default is `false`.\n* `UNSTRUCTURED_PARALLEL_MODE_URL` - the location to send PDF pages to asynchronously; there is currently no default.\n* `UNSTRUCTURED_PARALLEL_MODE_THREADS` - the number of threads making requests at once; default is `3`.\n* `UNSTRUCTURED_PARALLEL_MODE_SPLIT_SIZE` - the number of pages processed per request; default is `1`.\n* `UNSTRUCTURED_PARALLEL_RETRY_ATTEMPTS` - the number of retries on a retryable error; default is `2` (i.e. 3 attempts in total).\n\nBecause splitting files adds overhead, parallel mode is only recommended for the `hi_res` strategy. Additionally, users of the official [Python client](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-python-client?tab=readme-ov-file#splitting-pdf-by-pages) can enable client-side splitting by setting `split_pdf_page=True`.\n\n#### Security\nYou can also set the optional `UNSTRUCTURED_API_KEY` environment variable to enable request validation for your self-hosted instance of Unstructured. If it is set, only requests that include an `unstructured-api-key` header with the same value are fulfilled. Otherwise, the server returns a 401 indicating that the request is unauthorized.\n\n#### Controlling Server Load\nSome documents use a lot of memory while they are being processed. To mitigate out-of-memory (OOM) errors, the server returns a 503 if the host's available memory drops below 2GB. This is configured with the environment variable `UNSTRUCTURED_MEMORY_FREE_MINIMUM_MB`, which defaults to 2048. You can lower this value to reduce these messages, which allows the server to use more memory, or set it to 0 to disable the check entirely.\n\n#### Controlling Server Lifetime\nBy default, the server runs indefinitely. To change that, set the `MAX_LIFETIME_SECONDS` environment variable. If the variable is set, the server enters a graceful shutdown period `MAX_LIFETIME_SECONDS` seconds after startup. The graceful shutdown period lasts for up to 3600 seconds, and during it:\n- the server rejects all new requests, which receive an empty response;\n- the server continues processing active requests and ends the graceful shutdown period once all of them are processed.\n\nIf the server is still running when the graceful shutdown period ends, it is shut down forcefully: all active requests are cancelled and each receives an empty response.\n\n*Max lifetime requires GNU [timeout](https:\u002F\u002Fwww.gnu.org\u002Fsoftware\u002Fcoreutils\u002Fmanual\u002Fhtml_node\u002Ftimeout-invocation.html#timeout-invocation) to be installed; it is available by default on most Linux systems. On macOS it is available as gtimeout from the GNU coreutils package.*\n\n\n\n## :dizzy: Instructions for using the Docker image\n\nThe following instructions are intended to help you interact with `unstructured-api` using Docker. See [here](https:\u002F\u002Fdocs.docker.com\u002Fget-docker\u002F) if you don't already have Docker installed on your machine.\n\nNOTE: We build multi-platform images to support both x86_64 and Apple silicon hardware. `docker pull` should download the corresponding image for your architecture, but you can specify one with `--platform` (e.g. `--platform linux\u002Famd64`) if needed.\n\nWe build Docker images for all pushes to `main`. We tag each image with the corresponding short commit hash (e.g. `fbc7a69`) and the application version (e.g. `0.5.5-dev1`). We also tag the most recent image with `latest`. To use these images, `docker pull` from our image repository.\n\n```bash\ndocker pull downloads.unstructured.io\u002Funstructured-io\u002Funstructured-api:latest\n```\n\nOnce pulled, you can launch the container as a web app on localhost:8000.\n\n```bash\ndocker run -p 8000:8000 -d --rm --name unstructured-api downloads.unstructured.io\u002Funstructured-io\u002Funstructured-api:latest\n```\n\nYou can also pass in a `PORT` variable to run the server on a different port in the container.\n\n```bash\ndocker run -p 9500:9500 -d --rm --name 
unstructured-api -e PORT=9500 downloads.unstructured.io\u002Funstructured-io\u002Funstructured-api:latest\n```\n\n## Security Policy\n\nSee our [security policy](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Fpipeline-emails\u002Fsecurity\u002Fpolicy) for information on how to report security vulnerabilities.\n\n## Learn more\n\n| Section | Description |\n|-|-|\n| [Unstructured Community GitHub](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Fcommunity) | Information about Unstructured.io community projects |\n| [Unstructured GitHub](https:\u002F\u002Fgithub.com\u002FUnstructured-IO) | Unstructured.io open source repositories |\n| [Company Website](https:\u002F\u002Funstructured.io) | Unstructured.io product and company info |\n\n## :chart_with_upwards_trend: Analytics\n\nWe've partnered with Scarf (https:\u002F\u002Fscarf.sh) to collect anonymized user statistics to understand which features our community is using and how to prioritize product decision-making in the future. To learn more about how we collect and use this data, please read our [Privacy Policy](https:\u002F\u002Funstructured.io\u002Fprivacy-policy).","# Unstructured API Quick Start Guide\n\nUnstructured API is a general-purpose document preprocessing pipeline that automatically detects the file type and selects the appropriate processing function. It supports parsing plain text, images, common office documents (PDF, Word, PPT, Excel, and others), and compressed files.\n\n## Prerequisites\n\n*   **API key**: Using the hosted Unstructured API requires an API key. Visit the [Unstructured website](https:\u002F\u002Fwww.unstructured.io\u002F#get-api-key) to sign up and get one for free.\n*   **Tools**: Make sure `curl` and `jq` (used to format JSON output) are installed, or use an API testing tool such as Postman.\n*   **Network**: You need to be able to reach `https:\u002F\u002Fapi.unstructured.io`.\n\n## Installation\n\nThis guide focuses on calling the **hosted API**, which requires no complex local environment setup. For local deployment, see the official Developer Quickstart Guide.\n\nAll you need to get started is your API key.\n\n## Basic usage\n\n### 1. Basic request example\n\nThe following command uploads an `.eml` email file for parsing. Replace `\u003CYOUR API KEY>` with the key you obtained.\n\n```bash\ncurl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -H 'unstructured-api-key: \u003CYOUR API KEY>' \\\n  -F 'files=@sample-docs\u002Ffamily-day.eml' \\\n  | jq -C . | less -R\n```\n\n### 2. Advanced strategy configuration\n\nFor PDF or image files, the API offers several processing strategies (`strategy`):\n\n*   `fast` (default): suited to documents that do not have text embedded in images; fast.\n*   `hi_res`: suited to documents that may have text within embedded images, or to cases that need high-precision element extraction. Slower, but more precise.\n*   `ocr_only`: forces OCR with Tesseract; suited to multi-column documents without extractable text.\n*   `auto`: automatically determines the best strategy.\n\n**Example using the `hi_res` strategy:**\n\n```bash\ncurl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Flayout-parser-paper.pdf' \\\n  -F 'strategy=hi_res' \\\n  | jq -C . | less -R\n```\n\n**Using the beta Chipper model (high-resolution, complex documents):**\n\n```bash\ncurl -X 'POST' \\\n  'https:\u002F\u002Fapi.unstructured.io\u002Fgeneral\u002Fv0\u002Fgeneral' \\\n  -H 'accept: application\u002Fjson' \\\n  -H 'Content-Type: multipart\u002Fform-data' \\\n  -F 'files=@sample-docs\u002Flayout-parser-paper.pdf' \\\n  -F 'strategy=hi_res' \\\n  -F 'hi_res_model_name=chipper'  \\\n  | jq -C . | less -R\n```\n\n### 3. Common parameters\n\n*   **Specify OCR languages**: use the `languages` parameter to specify language codes (e.g. `eng`, `kor`).\n    ```bash\n    -F 'languages=eng' \\\n    -F 'languages=kor'\n    ```\n*   **Get coordinates**: set `coordinates=true` to include each element's bounding box coordinates in the response.\n    ```bash\n    -F 'coordinates=true'\n    ```\n*   **Skip table extraction**: in `hi_res` mode, use `skip_infer_table_types` to specify the file types for which tables are not extracted (e.g. `[\"jpg\"]`).\n    ```bash\n    -F 'skip_infer_table_types=[\"jpg\"]'\n    ```\n*   **Keep XML tags**: when processing XML files, set `xml_keep_tags=true` to retain the original tags.\n    ```bash\n    -F 'xml_keep_tags=true'\n    ```\n*   **Handle gzip files**: when you upload a `.gz` file, the API decompresses it automatically. Use `gz_uncompressed_content_type` to specify the content type after decompression.\n    ```bash\n    -F 'gz_uncompressed_content_type=application\u002Fpdf' \\\n    -F 'files=@sample-docs\u002Flayout-parser-paper.pdf.gz'\n    ```\n\nFor more parameters and the response structure, see the [official API documentation](https:\u002F\u002Fdocs.unstructured.io\u002Fapi-reference\u002Foverview).","某金融科技公司的大数据团队需要构建一个智能投研系统，每天需自动解析并入库数千份格式各异的上市公司财报（PDF）、行业研报（PPT）及内部会议纪要（DOCX），以便后续进行大模型问答检索。\n\n### 没有 unstructured-api 时\n- **解析逻辑碎片化**：开发人员必须为 PDF、PPT、Word 等不同文件格式分别集成 
PyPDF2、python-pptx 等独立库，代码维护成本极高，且难以统一处理逻辑。\n- **非结构化数据丢失**：传统工具往往忽略文档中的页眉页脚、脚注或复杂表格结构，导致提取的文本顺序混乱，关键财务数据与上下文脱节，严重影响下游分析准确性。\n- **高分辨率文档处理困难**：面对包含大量图表和扫描件的“高保真”财报，普通 OCR 方案识别率低且部署复杂，缺乏针对复杂布局的高效预处理模型，导致大量人工复核工作。\n- **扩展性差**：每当新增一种文件格式或优化解析策略，都需要重新测试整个流水线，迭代周期长，无法快速响应业务对多模态数据接入的需求。\n\n### 使用 unstructured-api 后\n- **统一接口标准化**：通过调用 unstructured-api 的单一 RESTful 接口，即可自动识别并处理所有主流文件格式，后端屏蔽了底层库的差异，大幅简化了工程架构。\n- **智能元素提取**：利用其先进的分区算法，能精准识别标题、段落、列表及表格边界，保留文档的逻辑结构，确保存入向量数据库的内容具备清晰的语义层次。\n- **Chipper 模型赋能**：启用 beta 版的 Chipper 模型配合 `hi_res` 策略，显著提升了对高分辨率扫描件和复杂排版文档的解析精度，无需自建复杂的 OCR 集群即可获得高质量文本。\n- **敏捷迭代与扩展**：新增文件类型支持仅需调整 API 参数，无需重构代码；团队可将精力集中在上层应用逻辑，而非底层解析细节，显著缩短了功能上线时间。\n\nunstructured-api 的核心价值在于将繁琐、异构的文档解析工作转化为标准化的 API 服务，让企业能以极低的技术门槛实现高质量的非结构化数据清洗与结构化转换。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured-api_00da0c2b.png","Unstructured-IO","Unstructured","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FUnstructured-IO_190dee06.jpg","","security@unstructured.io","UnstructuredIO","https:\u002F\u002Funstructured.io\u002F","https:\u002F\u002Fgithub.com\u002FUnstructured-IO",[85,89,93,97,101,105,108],{"name":86,"color":87,"percentage":88},"Python","#3572A5",63,{"name":90,"color":91,"percentage":92},"Jupyter Notebook","#DA5B0B",24.2,{"name":94,"color":95,"percentage":96},"Shell","#89e051",8.1,{"name":98,"color":99,"percentage":100},"Makefile","#427819",2.5,{"name":102,"color":103,"percentage":104},"Dockerfile","#384d54",2.2,{"name":106,"color":68,"percentage":107},"Rich Text Format",0.1,{"name":109,"color":110,"percentage":107},"HTML","#e34c26",900,190,"2026-04-02T05:55:30","Apache-2.0","未说明",{"notes":117,"python":115,"dependencies":118},"提供的 README 内容主要介绍 API 的使用方法和参数配置，未包含本地部署的具体环境需求（如操作系统、Python 版本、依赖库等）。文中提到若需本地运行 API，请参考 'Developer Quickstart Guide' 或 'using-the-api-locally' 部分，但这些部分的内容未在提供的文本中。此外，文中提及 OCR 功能依赖 Tesseract，hi_res 策略默认使用 detectron2onnx 模型，也支持 chipper (Beta) 和 yolox 
模型。",[115],[51,13],"2026-03-27T02:49:30.150509","2026-04-06T05:15:16.084455",[123,128,133,138,142,147],{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},11255,"如何在本地 Docker 环境中解决 401\u002F404 错误并正确调用 API？","出现 401 或 404 错误通常是因为请求的端点路径不正确，而非缺少 API Key。在本地部署时，请确保 `apiUrl` 包含完整的路径 `\u002Fgeneral\u002Fv0\u002Fgeneral`。例如：\n```javascript\nconst options = {\n  apiUrl: \"http:\u002F\u002Flocalhost:8000\u002Fgeneral\u002Fv0\u002Fgeneral\"\n};\n```\n正确的 Docker 启动命令示例：\n```bash\ndocker run -p 8000:8000 -d --rm --name unstructured-api quay.io\u002Funstructured-io\u002Funstructured-api:latest --port 8000 --host 0.0.0.0\n```","https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fissues\u002F386",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},11256,"如何配置 Unstructured API 使用 PaddleOCR 后端以加速表格提取并利用 GPU？","可以通过设置环境变量 `OCR_AGENT=paddle` 来启用 PaddleOCR。如果使用 Docker Compose，可以创建一个 shell 脚本（如 `paddle.sh`）来安装必要的依赖，并在 `docker-compose.yml` 中配置环境变量。示例配置如下：\n\n1. 创建 `paddle.sh`:\n```sh\n#!\u002Fbin\u002Fsh\nset -e\nif ! python3.11 -c \"import paddle\" &> \u002Fdev\u002Fnull; then\n    echo \"Installing paddle...\"\n    apk update && apk add --no-cache build-base python-3.11-dev\n    pip install paddlepaddle\n    pip install unstructured.paddleocr\nelse\n    echo \"paddle is already installed.\"\nfi\nsh scripts\u002Fapp-start.sh\n```\n2. 在 `docker-compose.yml` 中设置:\n```yml\nenvironment:\n  - OCR_AGENT=paddle\n```","https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fissues\u002F247",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},11257,"为什么调整分块参数（如 combine_under_n_chars）后结果没有变化？","这通常是由于参数名称拼写错误或使用了旧版本导致的。官方曾存在文档拼写错误，正确的参数名应为 `combine_under_n_chars`（注意是 `chars` 而不是 `char`）。请确保使用最新的 Docker 镜像（如 0.63 版本及以上），并在请求中使用正确的参数名：\n```bash\ncurl ... 
--form 'combine_under_n_chars=\"1500\"' ...\n```\n如果参数值设置得非常大（如 1000000）仍无效果，请检查是否成功传递了参数以及使用的分块策略（chunking_strategy）是否支持这些参数。","https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fissues\u002F337",{"id":139,"question_zh":140,"answer_zh":141,"source_url":132},11258,"PaddleOCR 默认不支持中文识别，如何临时修改语言设置以支持中文？","目前 PaddleOCR 后端默认仅支持英语（`en`）。若需支持中文，可以临时通过修改源码中的默认语言参数来实现：\n1. 找到 `unstructured` 包的安装位置：\n```python\nimport unstructured\nprint(unstructured.__file__)\n```\n2. 打开文件 `unstructured\u002Fpartition\u002Futils\u002Focr_models\u002Fpaddle_ocr.py`。\n3. 将第 7 行左右的默认语言参数从 `en` 修改为 `ch`（代表中英文）。\n注意：这是一种临时变通方法，未来版本可能会支持通过参数直接传递语言代码。",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},11259,"PDF 或图片中的表格解析效果不佳，返回 UncategorizedText 怎么办？","表格解析效果受底层 OCR 模型和检测模型影响。建议尝试以下优化步骤：\n1. 确保启用了表格结构推断：在 API 请求中设置 `pdf_infer_table_structure=true`。\n2. 尝试更换 OCR 后端：如果在 x86 架构上运行，尝试安装并使用 PaddleOCR (`pip install unstructured_paddleocr`) 替代默认的 Tesseract，因为 PaddleOCR 在某些场景下对表格的支持更好。\n3. 检查硬件架构：某些模型在 M1\u002FM2 芯片或特定 x86 CPU 上的表现可能不同，如有条件可尝试在不同环境下测试。\n4. 确保使用了正确的策略：对于图片文件，使用 `strategy=ocr_only`；对于 PDF，通常使用 `strategy=auto` 或 `hi_res`。","https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fissues\u002F191",{"id":148,"question_zh":149,"answer_zh":150,"source_url":132},11260,"如何在 Docker 环境中自动安装 PaddleOCR 依赖？","在基于 Wolfi\u002FAlpine 的 Docker 镜像中，需要手动安装编译依赖和 Python 开发头文件。可以在入口脚本中添加以下逻辑：\n```sh\n# 检查并安装 Paddle\nif ! 
python3.11 -c \"import paddle\" &> \u002Fdev\u002Fnull; then\n    echo \"Installing paddle...\"\n    # 安装 build-base 和 python 开发库\n    apk update && apk add --no-cache build-base python-3.11-dev\n    pip install paddlepaddle\n    pip install unstructured.paddleocr\nelse\n    echo \"paddle is already installed.\"\nfi\n```\n确保在 `pip install` 之前系统已具备编译 C++ 扩展的能力（即安装了 `build-base`）。",[152,157,162,167,172,177,182,187,192,197,202,207,212,217,222,227,232,237,242,247],{"id":153,"version":154,"summary_zh":155,"released_at":156},61733,"0.1.1","## 变更内容\n* 修复(依赖)：由 @lawrence-u10d 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F538 中将依赖管理工具从 pip-compile 切换至 uv 的 pip compile。\n* 构建(Docker)：由 @luke-kucing 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F539 中切换到 Chainguard 的 wolfi-base 镜像，并将依赖升级至 0.0.93。\n* 修复：由 @awalker4 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F541 中修正 Swagger UI 示例中列表参数的无效 JSON。\n* 迁移到 uv：由 @PastelStorm 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F540 中完成。\n* 修复 ARM64 运行器：由 @PastelStorm 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F543 中完成。\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fcompare\u002F0.0.92...0.1.1","2026-02-11T20:14:35",{"id":158,"version":159,"summary_zh":160,"released_at":161},61734,"0.0.92","## 变更内容\n* chore：由 @lawrence-u10d 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F534 中修复了 CVE 漏洞\n* fix：在 Docker 发布工作流中添加了额外的清理操作，以解决运行器磁盘空间不足的问题，由 @lawrence-u10d 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F535 中完成\n* build(deps)：将 actions\u002Fcache 从版本 4 升级到版本 5，由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F530 中完成\n* perf：将 pdfminer-six 升级至 
20260107 版本，由 @CyMule 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F536 中完成\n\n## 新贡献者\n* @lawrence-u10d 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F534 中完成了首次贡献\n* @CyMule 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F536 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fcompare\u002F0.0.90...0.0.92","2026-01-08T14:21:22",{"id":163,"version":164,"summary_zh":165,"released_at":166},61735,"0.0.90","## 变更内容\n* 构建（依赖）：由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F523 中将 actions\u002Fcheckout 从 4 升级至 5\n* 构建（依赖）：由 @dependabot[bot] 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F526 中将 actions\u002Fsetup-python 从 5 升级至 6\n* 修复 Unstructured API 文档链接，由 @chiedo 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F524 中完成\n* 更新版本号并升级依赖项，由 @luke-kucing 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F528 中完成\n\n## 新贡献者\n* @chiedo 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F524 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fcompare\u002F0.0.89...0.0.90","2025-11-08T01:11:31",{"id":168,"version":169,"summary_zh":170,"released_at":171},61736,"0.0.89","* 将 Pillow 升级至 11.3.0 版本以修复一个 CVE 漏洞","2025-07-05T19:32:42",{"id":173,"version":174,"summary_zh":175,"released_at":176},61737,"0.0.87","## 变更内容\n* 由 @jiajun-unstructured 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F509 中实现测试并行化\n* 由 @PastelStorm 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F514 中切换到 ubuntu-latest，并为 CI 测试安装 Python\n* 杂项：由 @cragwolfe 在 
https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F513 中添加 Claude 集成\n* 构建（依赖）：由 @dependabot 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F516 中将 stefanzweifel\u002Fgit-auto-commit-action 从 5 升级到 6\n* 更好的版本管理：由 @PastelStorm 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F515 中实现\n\n## 新贡献者\n* @jiajun-unstructured 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F509 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fcompare\u002F0.0.86...0.0.87","2025-06-16T19:09:58",{"id":178,"version":179,"summary_zh":180,"released_at":181},61738,"0.0.86","## 变更内容\n* 修复：通过 @emmanuel-ferdman 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F507 中的提交，解决了 FastAPI 对示例字段的弃用警告。\n* 升级 torch 和 requests 以修复 CVE 漏洞，由 @PastelStorm 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F511 中完成。\n\n## 新贡献者\n* @emmanuel-ferdman 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F507 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fcompare\u002F0.0.85...0.0.86","2025-06-11T03:43:04",{"id":183,"version":184,"summary_zh":185,"released_at":186},61739,"0.0.85","## 变更内容\n* 更新容器镜像，修复由 @luke-kucing 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F483 中解决的 Python 相关 CVE 漏洞\n* 由 @Umaaz 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F486 中更新了 `app-start.sh` 脚本\n* 由 @six5532one 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F487 中修复了 Starlette 的漏洞\n* 由 @ahmetmeleq 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F492 中修补了多个 CVE 漏洞\n* 由 @PastelStorm 在 
https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F502 中升级 `h11` 库以消除漏洞\n* 由 @PastelStorm 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F503 中提升了 API 版本\n* 由 @PastelStorm 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F505 中修补了多个 CVE 漏洞\n\n## 新贡献者\n* @Umaaz 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F486 中完成了首次贡献\n* @ahmetmeleq 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F492 中完成了首次贡献\n* @PastelStorm 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fpull\u002F502 中完成了首次贡献\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-api\u002Fcompare\u002F0.0.82...0.0.85","2025-05-27T23:16:27",{"id":188,"version":189,"summary_zh":190,"released_at":191},61740,"0.0.82","## 0.0.82\r\n\r\n* 升级 `unstructured` 至 0.16.11\r\n* 不再尝试从 S3 下载 NLTK 资源，以避免可能发生的 403 错误。","2024-12-14T00:18:19",{"id":193,"version":194,"summary_zh":195,"released_at":196},61741,"0.0.81","## 0.0.81\n\n* 升级至 `unstructured==0.15.13`\n* 更新 `strategy` 参数，允许使用 `'` 和 `\"` 作为值周围的引号。","2024-09-23T17:56:57",{"id":198,"version":199,"summary_zh":200,"released_at":201},61742,"0.0.80","## 0.0.80\n\n* 升级至 `unstructured` 0.15.10\n* 新增 `include_slide_notes` 参数，用于指定是否对 `ppt` 和 `pptx` 文件中的幻灯片备注进行分割。默认值为 `True`。现在，当文件中存在幻灯片备注时，备注内容将与其他元素一同被分割，这可能会导致非备注元素的索引编号发生偏移。","2024-09-10T20:06:14",{"id":203,"version":204,"summary_zh":205,"released_at":206},61743,"0.0.79","## 0.0.79\r\n\r\n* Bump to `unstructured` 0.15.7\r\n\r\n## 0.0.78\r\n\r\n* Resolve NLTK CVE.\r\n* Bump to `unstructured` 0.15.6\r\n\r\n## 0.0.77\r\n\r\n* Bump to `unstructured` 0.15.5","2024-08-21T12:27:29",{"id":208,"version":209,"summary_zh":210,"released_at":211},61744,"0.0.76","# 0.0.76\r\n\r\n* Use the library's `detect_filetype` in API to determine mimetype\r\n* Add content_type api parameter\r\n* Bump to 
`unstructured` 0.15.1","2024-08-06T18:32:45",{"id":213,"version":214,"summary_zh":215,"released_at":216},61745,"0.0.75","## 0.0.75\r\n\r\n* Remove constraint on `safetensors` that was preventing us from bumping `transformers`.","2024-07-24T15:47:09",{"id":218,"version":219,"summary_zh":220,"released_at":221},61746,"0.0.74","## 0.0.74\r\n\r\n* Bump to `unstructured` 0.15.0","2024-07-22T22:31:15",{"id":223,"version":224,"summary_zh":225,"released_at":226},61747,"0.0.73","## 0.0.73\r\n\r\n* Bump to `unstructured` 0.14.10","2024-07-09T16:24:01",{"id":228,"version":229,"summary_zh":230,"released_at":231},61748,"0.0.72","## 0.0.72\r\n\r\n* Fix certain filetypes failing mimetype lookup in the new base image","2024-06-28T13:02:48",{"id":233,"version":234,"summary_zh":235,"released_at":236},61749,"0.0.71","## 0.0.71\r\n\r\n* Replace rockylinux with chainguard\u002Fwolfi as a base image for `amd64`","2024-06-28T13:02:21",{"id":238,"version":239,"summary_zh":240,"released_at":241},61750,"0.0.70","## 0.0.70\r\n\r\n* Bump to `unstructured` 0.14.6\r\n* Bump to `unstructured-inference` 0.7.35","2024-06-18T16:37:45",{"id":243,"version":244,"summary_zh":245,"released_at":246},61751,"0.0.68","## 0.0.68\r\n\r\n* Fix list params such as `extract_image_block_types` not working via the python\u002Fjs clients\r\n\r\n## 0.0.67\r\n\r\n* Allow for a different server port with the PORT variable\r\n* Change pdf_infer_table_structure parameter from being disabled in auto strategy.\r\n\r\n## 0.0.66\r\n\r\n* Add support for `unique_element_ids` parameter.\r\n* Add max lifetime, via MAX_LIFETIME_SECONDS env-var, to API containers\r\n* Bump unstructured to 0.13.5\r\n* Change default values for `pdf_infer_table_structure` and `skip_infer_table_types`. 
Mark `pdf_infer_table_structure` deprecated.\r\n* Add support for the `starting_page_number` param.","2024-05-12T18:51:44",{"id":248,"version":249,"summary_zh":250,"released_at":251},61752,"0.0.65","## 0.0.65\r\n\r\n* Bump unstructured to 0.12.4\r\n* Add support for both `list[str]` and `str` input formats for `ocr_languages` parameter\r\n* Adds support for additional MIME types from `unstructured`\r\n* Document the support for gzip files and add additional testing \r\n\r\n## 0.0.64\r\n\r\n* Bump Pydantic to 2.5.x and remove it from explicit dependencies list (will be managed by fastapi)\r\n* Introduce Form params description in the code, which will form openapi and swagger documentation\r\n* Roll back some openapi customizations\r\n* Keep backward compatibility for passing parameters in form of `list[str]` (will not be shown in the documentation)\r\n\r\n## 0.0.63\r\n\r\n* Bump unstructured to 0.12.2\r\n* Fix bug that ignored `combine_under_n_chars` chunking option argument.","2024-03-13T19:43:21"]