[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-Unstructured-IO--unstructured":3,"tool-Unstructured-IO--unstructured":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":112,"forks":113,"last_commit_at":114,"license":115,"difficulty_score":10,"env_os":116,"env_gpu":117,"env_ram":117,"env_deps":118,"category_tags":127,"github_topics":128,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":149,"updated_at":150,"faqs":151,"releases":172},3935,"Unstructured-IO\u002Funstructured","unstructured","Convert documents to structured data effortlessly. Unstructured is open-source ETL solution for transforming complex documents into clean, structured formats for language models.  
Visit our website to learn more about our enterprise grade Platform product for production grade workflows, partitioning, enrichments, chunking and embedding.","unstructured 是一款开源的 ETL（提取、转换、加载）工具，旨在轻松地将各类复杂文档转化为结构化数据。在人工智能应用开发中，大语言模型通常难以直接理解 PDF、Word、PPT 或扫描件等非结构化文件中的杂乱信息。unstructured 正是为了解决这一痛点而生，它能自动解析多种文件格式，清理冗余内容，并将文本整理成模型易于处理的干净格式。\n\n这款工具特别适合开发者、数据工程师以及 AI 研究人员使用。当你需要构建基于私有文档的知识库、研发 RAG（检索增强生成）系统，或进行大规模文档数据分析时，unstructured 能提供强大的支持。其核心技术亮点在于卓越的文档“分区”能力，能够精准识别并分离文档中的标题、段落、表格、页眉页脚等元素，同时支持智能分块（chunking）与数据富化，为后续的向量化嵌入打下坚实基础。作为连接原始文档与大模型之间的桥梁，unstructured 以开源免费的姿态，帮助用户高效打通数据预处理流程，让非结构化数据真正变得可用、好用。","\u003Ch3 align=\"center\">\n  \u003Cimg\n    src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured_readme_7e8d60f174a7.png\"\n    height=\"200\"\n  >\n\u003C\u002Fh3>\n\n\u003Cdiv align=\"center\">\n\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fblob\u002Fmain\u002FLICENSE.md\">![https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Funstructured\u002F](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fl\u002Funstructured.svg)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Funstructured\u002F\">![https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Funstructured\u002F](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Funstructured.svg)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002FGitHub.com\u002Funstructured-io\u002Funstructured\u002Fgraphs\u002Fcontributors\">![https:\u002F\u002FGitHub.com\u002Funstructured-io\u002Funstructured.js\u002Fgraphs\u002Fcontributors](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002Funstructured-io\u002Funstructured)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fblob\u002Fmain\u002FCODE_OF_CONDUCT.md\">![code_of_conduct.md](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FContributor%20Covenant-2.1-4baaaa.svg) \u003C\u002Fa>\n  \u003Ca 
href=\"https:\u002F\u002FGitHub.com\u002Funstructured-io\u002Funstructured\u002Freleases\">![https:\u002F\u002FGitHub.com\u002Funstructured-io\u002Funstructured.js\u002Freleases](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Funstructured-io\u002Funstructured)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Funstructured\u002F\">![https:\u002F\u002Fgithub.com\u002FNaereen\u002Fbadges\u002F](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured_readme_bf1ed38bbc78.png)\u003C\u002Fa>\n  [![Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured_readme_114f467f7024.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Funstructured)\n  [![Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured_readme_114f467f7024.png\u002Fmonth)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Funstructured)\n  \u003Ca\n   href=\"https:\u002F\u002Fwww.phorm.ai\u002Fquery?projectId=34efc517-2201-4376-af43-40c4b9da3dc5\">\n\t\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPhorm-Ask_AI-%23F2777A.svg?&logo=data:image\u002Fsvg+xml;base64,PHN2ZyB3aWR0aD0iNSIgaGVpZ2h0PSI0IiBmaWxsPSJub25lIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPgogIDxwYXRoIGQ9Ik00LjQzIDEuODgyYTEuNDQgMS40NCAwIDAgMS0uMDk4LjQyNmMtLjA1LjEyMy0uMTE1LjIzLS4xOTIuMzIyLS4wNzUuMDktLjE2LjE2NS0uMjU1LjIyNmExLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxMmMtLjA5OS4wMTItLjE5Mi4wMTQtLjI3OS4wMDZsLTEuNTkzLS4xNHYtLjQwNmgxLjY1OGMuMDkuMDAxLjE3LS4xNjkuMjQ2LS4xOTFhLjYwMy42MDMgMCAwIDAgLjItLjEwNi41MjkuNTI5IDAgMCAwIC4xMzgtLjE3LjY1NC42NTQgMCAwIDAgLjA2NS0uMjRsLjAyOC0uMzJhLjkzLjkzIDAgMCAwLS4wMzYtLjI0OS41NjcuNTY3IDAgMCAwLS4xMDMtLjIuNTAyLjUwMiAwIDAgMC0uMTY4LS4xMzguNjA4LjYwOCAwIDAgMC0uMjQtLjA2N0wyLjQzNy43MjkgMS42MjUuNjcxYS4zMjIuMzIyIDAgMCAwLS4yMzIuMDU4LjM3NS4zNzUgMCAwIDAtLjExNi4yMzJsLS4xMTYgMS40NS0uMDU4LjY5Ny0uMDU4Ljc1NEwuNzA1IDRsLS4zNTctLjA3OUwuNjAyLjkwNkMuNjE3LjcyNi42NjMuNTc0LjczOS40NTRhLjk1OC45NTggMCAwIDEgLjI3NC0uMjg1Ljk3MS45NzEgMCAwIDEgLjMzNy0uMTRjLjExOS0uMDI2LjIyNy0uMDM0LjMyNS0uMDI2TDMuMjMyLjE2Yy4xNTkuMDE0LjMzNi4wMy40NTkuMDgyYTEuMTczIDEuMTczIDAgMCAxIC41NDUuNDQ3Yy4wNi4wOTQuMTA5LjE5Mi4xNDQuMjkzYTEuMzkyIDEuMzkyIDAgMCAxIC4wNzguNThsLS4wMjkuMzJaIiBmaWxsPSIjRjI3NzdBIi8+CiAgPHBhdGggZD0iTTQuMDgyIDIuMDA3YTEuNDU1IDEuNDU1IDAgMCAxLS4wOTguNDI3Yy0uMDUuMTI0LS4xMTQuMjMyLS4xOTIuMzI0YTEuMTMgMS4xMyAwIDAgMS0uMjU0LjIyNyAxLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxNGMtLjEuMDEyLS4xOTMuMDE0LS4yOC4wMDZsLTEuNTYtLjEwOC4wMzQtLjQwNi4wMy0uMzQ4IDEuNTU5LjE1NGMuMDkgMCAuMTczLS4wMS4yNDgtLjAzM2EuNjAzLjYwMyAwIDAgMCAuMi0uMTA2LjUzMi41MzIgMCAwIDAgLjEzOS0uMTcyLjY2LjY2IDAgMCAwIC4wNjQtLjI0MWwuMDI5LS4zMjFhLjk0Ljk0IDAgMCAwLS4wMzYtLjI1LjU3LjU3IDAgMCAwLS4xMDMtLjIwMi41MDIuNTAyIDAgMCAwLS4xNjgtLjEzOC42MDUuNjA1IDAgMCAwLS4yNC0uMDY3TDEuMjczLjgyN2MtLjA5NC0uMDA4LS4xNjguMDEtLjIyMS4wNTUtLjA1My4wNDUtLjA4NC4xMTQtLjA5Mi4yMDZMLjcwNSA0IDAgMy45MzhsLjI1NS0yLjkxMUExLjAxIDEuMDEgMCAwIDEgLjM5My41NzIuOTYyLjk2MiAwIDAgMSAuNjY2LjI4NmEuOTcuOTcgMCAwIDEgLjMzOC0uMTRDMS4xMjIuMTIgMS4yMy4xMSAxLjMyOC4xMTlsMS41OTMuMTRjLjE2LjAxNC4zLjA0Ny40MjMuM
WExLjE3IDEuMTcgMCAwIDEgLjU0NS40NDhjLjA2MS4wOTUuMTA5LjE5My4xNDQuMjk1YTEuNDA2IDEuNDA2IDAgMCAxIC4wNzcuNTgzbC0uMDI4LjMyMloiIGZpbGw9IndoaXRlIi8+CiAgPHBhdGggZD0iTTQuMDgyIDIuMDA3YTEuNDU1IDEuNDU1IDAgMCAxLS4wOTguNDI3Yy0uMDUuMTI0LS4xMTQuMjMyLS4xOTIuMzI0YTEuMTMgMS4xMyAwIDAgMS0uMjU0LjIyNyAxLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxNGMtLjEuMDEyLS4xOTMuMDE0LS4yOC4wMDZsLTEuNTYtLjEwOC4wMzQtLjQwNi4wMy0uMzQ4IDEuNTU5LjE1NGMuMDkgMCAuMTczLS4wMS4yNDgtLjAzM2EuNjAzLjYwMyAwIDAgMCAuMi0uMTA2LjUzMi41MzIgMCAwIDAgLjEzOS0uMTcyLjY2LjY2IDAgMCAwIC4wNjQtLjI0MWwuMDI5LS4zMjFhLjk0Ljk0IDAgMCAwLS4wMzYtLjI1LjU3LjU3IDAgMCAwLS4xMDMtLjIwMi41MDIuNTAyIDAgMCAwLS4xNjgtLjEzOC42MDUuNjA1IDAgMCAwLS4yNC0uMDY3TDEuMjczLjgyN2MtLjA5NC0uMDA4LS4xNjguMDEtLjIyMS4wNTUtLjA1My4wNDUtLjA4NC4xMTQtLjA5Mi4yMDZMLjcwNSA0IDAgMy45MzhsLjI1NS0yLjkxMUExLjAxIDEuMDEgMCAwIDEgLjM5My41NzIuOTYyLjk2MiAwIDAgMSAuNjY2LjI4NmEuOTcuOTcgMCAwIDEgLjMzOC0uMTRDMS4xMjIuMTIgMS4yMy4xMSAxLjMyOC4xMTlsMS41OTMuMTRjLjE2LjAxNC4zLjA0Ny40MjMuMWExLjE3IDEuMTcgMCAwIDEgLjU0NS40NDhjLjA2MS4wOTUuMTA5LjE5My4xNDQuMjk1YTEuNDA2IDEuNDA2IDAgMCAxIC4wNzcuNTgzbC0uMDI4LjMyMloiIGZpbGw9IndoaXRlIi8+Cjwvc3ZnPgo=\" \u002F>\n   \u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\u003Cdiv>\n  \u003Cp align=\"center\">\n  \u003Ca\n  href=\"https:\u002F\u002Fshort.unstructured.io\u002Fpzw05l7\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJOIN US ON SLACK-4A154B?style=for-the-badge&logo=slack&logoColor=white\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.linkedin.com\u002Fcompany\u002Funstructuredio\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white\" \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Ch2 align=\"center\">\n  \u003Cp>Open-Source Pre-Processing Tools for Unstructured Data\u003C\u002Fp>\n\u003C\u002Fh2>\n\nThe `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and 
[many more](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fcore-functionality\u002Fpartitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. The modular functions and connectors in `unstructured` form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.\n\n## Try the Unstructured Platform Product\n\nReady to move your data processing pipeline to production and take advantage of advanced features? Check out [Unstructured Platform](https:\u002F\u002Funstructured.io\u002Fenterprise). In addition to better processing performance, take advantage of chunking, embedding, and image and table enrichment generation, all from a low-code UI or an API. [Request a demo](https:\u002F\u002Funstructured.io\u002Fcontact) from our sales team to learn more about how to get started.\n\n## :eight_pointed_black_star: Quick Start\n\nThere are several ways to use the `unstructured` library:\n* [Run the library in a container](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured#run-the-library-in-a-container) or\n* Install the library\n    1. [Install from PyPI](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured#installing-the-library)\n    2. 
[Install for local development](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured#installation-instructions-for-local-development)\n* For installation with `conda` on Windows, please refer to the [documentation](https:\u002F\u002Funstructured-io.github.io\u002Funstructured\u002Finstalling.html#installation-with-conda-on-windows)\n\n### Run the library in a container\n\nThe following instructions are intended to help you get up and running using Docker to interact with `unstructured`.\nSee [here](https:\u002F\u002Fdocs.docker.com\u002Fget-docker\u002F) if you don't already have Docker installed on your machine.\n\nNOTE: we build multi-platform images to support both x86_64 and Apple silicon hardware. `docker pull` should download the corresponding image for your architecture, but you can specify one with `--platform` (e.g. `--platform linux\u002Famd64`) if needed.\n\nWe build Docker images for all pushes to `main`. We tag each image with the corresponding short commit hash (e.g. `fbc7a69`) and the application version (e.g. `0.5.5-dev1`). We also tag the most recent image with `latest`. To use these images, run `docker pull` against our image repository.\n\n```bash\ndocker pull downloads.unstructured.io\u002Funstructured-io\u002Funstructured:latest\n```\n\nOnce pulled, you can create a container from this image and open a shell in it.\n\n```bash\n# create the container\ndocker run -dt --name unstructured downloads.unstructured.io\u002Funstructured-io\u002Funstructured:latest\n\n# open a bash shell inside the running container\ndocker exec -it unstructured bash\n```\n\nYou can also build your own Docker image. Note that the base image is `wolfi-base`, which is\nupdated regularly. 
If you are building the image locally, it is possible that `docker-build` could\nfail due to upstream changes in `wolfi-base`.\n\nIf you only plan on parsing one type of data, you can speed up building the image by commenting out some\nof the packages\u002Frequirements necessary for other data types. See the Dockerfile to find out which lines are necessary\nfor your use case.\n\n```bash\nmake docker-build\n\n# open a bash shell inside the running container\nmake docker-start-bash\n```\n\nOnce in the running container, you can try things directly in the Python interpreter's interactive mode.\n```bash\n# this will drop you into a python console so you can run the below partition functions\npython3\n\n>>> from unstructured.partition.pdf import partition_pdf\n>>> elements = partition_pdf(filename=\"example-docs\u002Flayout-parser-paper-fast.pdf\")\n\n>>> from unstructured.partition.text import partition_text\n>>> elements = partition_text(filename=\"example-docs\u002Ffake-text.txt\")\n```\n\n### Installing the library\nUse the following instructions to get up and running with `unstructured` and test your\ninstallation.\n\n- Install the Python SDK to support all document types with `pip install \"unstructured[all-docs]\"`\n  - For plain text files, HTML, XML, JSON, and emails, which do not require any extra dependencies, you can run `pip install unstructured`\n  - To process other doc types, you can install the extras required for those documents, such as `pip install \"unstructured[docx,pptx]\"`\n- Install the following system dependencies if they are not already available on your system.\n  Depending on what document types you're parsing, you may not need all of these.\n    - `libmagic-dev` (filetype detection)\n    - `poppler-utils` (images and PDFs)\n    - `tesseract-ocr` (images and PDFs, install `tesseract-lang` for additional language support)\n    - `libreoffice` (MS Office docs)\n    - `pandoc` is bundled automatically via the `pypandoc-binary` Python 
package (no system install needed)\n\n- For suggestions on how to install on Windows and to learn about dependencies for other features, see the\n  installation documentation [here](https:\u002F\u002Funstructured-io.github.io\u002Funstructured\u002Finstalling.html).\n\nAt this point, you should be able to run the following code:\n\n```python\nfrom unstructured.partition.auto import partition\n\nelements = partition(filename=\"example-docs\u002Feml\u002Ffake-email.eml\")\nprint(\"\\n\\n\".join([str(el) for el in elements]))\n```\n\n### Installation Instructions for Local Development\n\nThe following instructions are intended to help you get up and running with `unstructured`\nlocally if you are planning to contribute to the project.\n\nThis project uses [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) for dependency management. Install it first:\n\n```bash\n# macOS \u002F Linux\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n```\n\nThen install all dependencies (base, extras, dev, test, and lint groups):\n\n```bash\nmake install\n```\n\nThis runs `uv sync --locked --all-extras --all-groups`, which creates a virtual environment\nand installs everything in one step. There is no need to manually create or activate a virtualenv.\n\nTo install only specific document-type extras:\n\n```bash\nuv sync --extra pdf\nuv sync --extra csv --extra docx\n```\n\nTo update the lock file after changing dependencies in `pyproject.toml`:\n\n```bash\nmake lock\n```\n\n* Optional:\n  * To install extras for processing images and PDFs locally, run `uv sync --extra pdf --extra image`.\n  * For processing image files, `tesseract` is required. See [here](https:\u002F\u002Ftesseract-ocr.github.io\u002Ftessdoc\u002FInstallation.html) for installation instructions.\n  * For processing PDF files, `tesseract` and `poppler` are required. 
The [pdf2image docs](https:\u002F\u002Fpdf2image.readthedocs.io\u002Fen\u002Flatest\u002Finstallation.html) have instructions on installing `poppler` across various platforms.\n\nAdditionally, if you're planning to contribute to `unstructured`, we provide an optional `pre-commit` configuration\nfile to ensure your code matches the formatting and linting standards used in `unstructured`.\nIf you'd prefer not to have code changes auto-tidied before every commit, you can use `make check` to see\nwhether any linting or formatting changes should be applied, and `make tidy` to apply them.\n\nIf using the optional `pre-commit`, you'll just need to install the hooks with `pre-commit install` since the\n`pre-commit` package is installed as part of `make install` mentioned above. Finally, if you decide to use `pre-commit`,\nyou can also uninstall the hooks with `pre-commit uninstall`.\n\nIn addition to developing on your local OS, we also provide a helper that uses Docker to provide a development environment:\n\n```bash\nmake docker-start-dev\n```\n\nThis starts a Docker container with your local repo mounted to `\u002Fmnt\u002Flocal_unstructured`. This Docker image allows you to develop without worrying about your OS's compatibility with the repo and its dependencies.\n\n## :clap: Quick Tour\n\n### Documentation\nFor more comprehensive documentation, visit https:\u002F\u002Fdocs.unstructured.io . 
You can also learn\nmore about our other products on the documentation page, including our SaaS API.\n\nHere are a few pages from the [Open Source documentation page](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fintroduction\u002Foverview)\nthat are helpful for new users to review:\n\n- [Quick Start](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fintroduction\u002Fquick-start)\n- [Using the `unstructured` open source package](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fcore-functionality\u002Foverview)\n- [Connectors](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fingest\u002Foverview)\n- [Concepts](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fconcepts\u002Fdocument-elements)\n- [Integrations](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fintegrations)\n\n\n### PDF Document Parsing Example\nThe following example shows how to get started with the `unstructured` library. The easiest way to parse a document with `unstructured` is to use the `partition` function. If you use the `partition` function, `unstructured` will detect the file type and route it to the appropriate file-specific partitioning function. 
If you are using the `partition` function, you may need to install additional dependencies per doc type.\nFor example, to install docx dependencies you need to run `pip install \"unstructured[docx]\"`.\nSee our  [installation guide](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Finstallation\u002Ffull-installation) for more details.\n\n```python\nfrom unstructured.partition.auto import partition\n\nelements = partition(\"example-docs\u002Flayout-parser-paper.pdf\")\n```\n\nRun `print(\"\\n\\n\".join([str(el) for el in elements]))` to get a string representation of the\noutput, which looks like:\n\n```\n\nLayoutParser : A Uniﬁed Toolkit for Deep Learning Based Document Image Analysis\n\nZejiang Shen 1 ( (cid:0) ), Ruochen Zhang 2 , Melissa Dell 3 , Benjamin Charles Germain Lee 4 , Jacob Carlson 3 , and\nWeining Li 5\n\nAbstract. Recent advances in document image analysis (DIA) have been primarily driven by the application of neural\nnetworks. Ideally, research outcomes could be easily deployed in production and extended for further investigation.\nHowever, various factors like loosely organized codebases and sophisticated model conﬁgurations complicate the easy\nreuse of important innovations by a wide audience. Though there have been ongoing eﬀorts to improve reusability and\nsimplify deep learning (DL) model development in disciplines like natural language processing and computer vision, none\nof them are optimized for challenges in the domain of DIA. This represents a major gap in the existing toolkit, as DIA\nis central to academic research across a wide range of disciplines in the social sciences and humanities. This paper\nintroduces LayoutParser, an open-source library for streamlining the usage of DL in DIA research and applications.\nThe core LayoutParser library comes with a set of simple and intuitive interfaces for applying and customizing DL models\nfor layout detection, character recognition, and many other document processing tasks. 
To promote extensibility,\nLayoutParser also incorporates a community platform for sharing both pre-trained models and full document digitization\npipelines. We demonstrate that LayoutParser is helpful for both lightweight and large-scale digitization pipelines in\nreal-word use cases. The library is publicly available at https:\u002F\u002Flayout-parser.github.io\n\nKeywords: Document Image Analysis · Deep Learning · Layout Analysis · Character Recognition · Open Source library ·\nToolkit.\n\nIntroduction\n\nDeep Learning(DL)-based approaches are the state-of-the-art for a wide range of document image analysis (DIA) tasks\nincluding document image classiﬁcation [11,\n```\n\nSee the [partitioning](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fcore-functionality\u002Fpartitioning)\nsection in our documentation for a full list of options and instructions on how to use\nfile-specific partitioning functions.\n\n## :guardsman: Security Policy\n\nSee our [security policy](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fsecurity\u002Fpolicy) for\ninformation on how to report security vulnerabilities.\n\n## :bug: Reporting Bugs\n\nEncountered a bug? Please create a new [GitHub issue](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fissues\u002Fnew\u002Fchoose) and use our bug report template to describe the problem. To help us diagnose the issue, use the `python scripts\u002Fcollect_env.py` command to gather your system's environment information and include it in your report. 
Your assistance helps us continuously improve our software - thank you!\n\n## :books: Learn more\n\n| Section | Description |\n|-|-|\n| [Company Website](https:\u002F\u002Funstructured.io) | Unstructured.io product and company info |\n| [Documentation](https:\u002F\u002Fdocs.unstructured.io\u002F) | Full API documentation |\n| [Batch Processing](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-ingest) | Ingesting batches of documents through Unstructured |\n\n## :chart_with_upwards_trend: Analytics\n\nTelemetry is **off by default**. To opt in, set `UNSTRUCTURED_TELEMETRY_ENABLED=true` (or `=1`) before importing `unstructured`. To opt out, set `DO_NOT_TRACK` or `SCARF_NO_ANALYTICS` to any non-empty value (e.g. `true`, `1`, `yes`, `false`, `0`—any non-empty string opts out); opt-out takes precedence. Unset the variable or leave it empty if you do not want to opt out. See our [Privacy Policy](https:\u002F\u002Funstructured.io\u002Fprivacy-policy).\n","\u003Ch3 align=\"center\">\n  \u003Cimg\n    src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured_readme_7e8d60f174a7.png\"\n    height=\"200\"\n  >\n\u003C\u002Fh3>\n\n\u003Cdiv align=\"center\">\n\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fblob\u002Fmain\u002FLICENSE.md\">![https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Funstructured\u002F](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fl\u002Funstructured.svg)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Funstructured\u002F\">![https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Funstructured\u002F](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Funstructured.svg)\u003C\u002Fa>\n  \u003Ca 
href=\"https:\u002F\u002FGitHub.com\u002Funstructured-io\u002Funstructured\u002Fgraphs\u002Fcontributors\">![https:\u002F\u002FGitHub.com\u002Funstructured-io\u002Funstructured.js\u002Fgraphs\u002Fcontributors](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002Funstructured-io\u002Funstructured)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fblob\u002Fmain\u002FCODE_OF_CONDUCT.md\">![code_of_conduct.md](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FContributor%20Covenant-2.1-4baaaa.svg) \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002FGitHub.com\u002Funstructured-io\u002Funstructured\u002Freleases\">![https:\u002F\u002FGitHub.com\u002Funstructured-io\u002Funstructured.js\u002Freleases](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Funstructured-io\u002Funstructured)\u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Funstructured\u002F\">![https:\u002F\u002Fgithub.com\u002FNaereen\u002Fbadges\u002F](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured_readme_bf1ed38bbc78.png)\u003C\u002Fa>\n  [![Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured_readme_114f467f7024.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Funstructured)\n  [![Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured_readme_114f467f7024.png\u002Fmonth)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Funstructured)\n  \u003Ca\n   href=\"https:\u002F\u002Fwww.phorm.ai\u002Fquery?projectId=34efc517-2201-4376-af43-40c4b9da3dc5\">\n\t\u003Cimg 
src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPhorm-Ask_AI-%23F2777A.svg?&logo=data:image\u002Fsvg+xml;base64,PHN2ZyB3aWR0aD0iNSIgaGVpZ2h0PSI0IiBmaWxsPSJub25lIiB4bWxucz0iaHR0cDovL3d3dy53My5vcmcvMjAwMC9zdmciPgogIDxwYXRoIGQ9Ik00LjQzIDEuODgyYTEuNDQgMS40NCAwIDAgMS0uMDk4LjQyNmMtLjA1LjEyMy0uMTE1LjIzLS4xOTIuMzIyLS4wNzUuMDktLjE2LjE2NS0uMjU1LjIyNmExLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxMmMtLjA5OS4wMTItLjE5Mi4wMTQtLjI3OS4wMDZsLTEuNTkzLS4xNHYtLjQwNmgxLjY1OGMuMDkuMDAxLjE3LS4xNjkuMjQ2LS4xOTFhLjYwMy42MDMgMCAwIDAgLjItLjEwNi41MjkuNTI5IDAgMCAwIC4xMzgtLjE3LjY1NC42NTQgMCAwIDAgLjA2NS0uMjRsLjAyOC0uMzJhLjkzLjkzIDAgMCAwLS4wMzYtLjI0OS41NjcuNTY3IDAgMCAwLS4xMDMtLjIuNTAyLjUwMiAwIDAgMC0uMTY4LS4xMzguNjA4LjYwOCAwIDAgMC0uMjQtLjA2N0wyLjQzNy43MjkgMS42MjUuNjcxYS4zMjIuMzIyIDAgMCAwLS4yMzIuMDU4LjM3NS4zNzUgMCAwIDAtLjExNi4yMzJsLS4xMTYgMS40NS0uMDU4LjY5Ny0uMDU4Ljc1NEwuNzA1IDRsLS4zNTctLjA3OUwuNjAyLjkwNkMuNjE3LjcyNi42NjMuNTc0LjczOS40NTRhLjk1OC45NTggMCAwIDEgLjI3NC0uMjg1Ljk3MS45NzEgMCAwIDEgLjMzNy0uMTRjLjExOS0uMDI2LjIyNy0uMDM0LjMyNS0uMDI2TDMuMjMyLjE2Yy4xNTkuMDE0LjMzNi4wMy40NTkuMDgyYTEuMTczIDEuMTczIDAgMCAxIC41NDUuNDQ3Yy4wNi4wOTQuMTA5LjE5Mi4xNDQuMjkzYTEuMzkyIDEuMzkyIDAgMCAxIC4wNzguNThsLS4wMjkuMzJaIiBmaWxsPSIjRjI3NzdBIi8+CiAgPHBhdGggZD0iTTQuMDgyIDIuMDA3YTEuNDU1IDEuNDU1IDAgMCAxLS4wOTguNDI3Yy0uMDUuMTI0LS4xMTQuMjMyLS4xOTIuMzI0YTEuMTMgMS4xMyAwIDAgMS0uMjU0LjIyNyAxLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxNGMtLjEuMDEyLS4xOTMuMDE0LS4yOC4wMDZsLTEuNTYtLjEwOC4wMzQtLjQwNi4wMy0uMzQ4IDEuNTU5LjE1NGMuMDkgMCAuMTczLS4wMS4yNDgtLjAzM2EuNjAzLjYwMyAwIDAgMCAuMi0uMTA2LjUzMi41MzIgMCAwIDAgLjEzOS0uMTcyLjY2LjY2IDAgMCAwIC4wNjQtLjI0MWwuMDI5LS4zMjFhLjk0Ljk0IDAgMCAwLS4wMzYtLjI1LjU3LjU3IDAgMCAwLS4xMDMtLjIwMi41MDIuNTAyIDAgMCAwLS4xNjgtLjEzOC42MDUuNjA1IDAgMCAwLS4yNC0uMDY3TDEuMjczLjgyN2MtLjA5NC0uMDA4LS4xNjguMDEtLjIyMS4wNTUtLjA1My4wNDUtLjA4NC4xMTQtLjA5Mi4yMDZMLjcwNSA0IDAgMy45MzhsLjI1NS0yLjkxMUExLjAxIDEuMDEgMCAwIDEgLjM5My41NzIuOTYyLjk2MiAwIDAgMSAuNjY2LjI4NmEuOTcuOTcgMCAwIDEgLjMzOC0uMTRDMS4xMjIuMTIgMS4yMy4xMSAxLjMyOC4xMTlsMS41OTMuMTRjLjE2LjAxNC4zLjA0Ny40MjMuM
WExLjE3IDEuMTcgMCAwIDEgLjU0NS40NDhjLjA2MS4wOTUuMTA5LjE5My4xNDQuMjk1YTEuNDA2IDEuNDA2IDAgMCAxIC4wNzcuNTgzbC0uMDI4LjMyMloiIGZpbGw9IndoaXRlIi8+CiAgPHBhdGggZD0iTTQuMDgyIDIuMDA3YTEuNDU1IDEuNDU1IDAgMCAxLS4wOTguNDI3Yy0uMDUuMTI0LS4xMTQuMjMyLS4xOTIuMzI0YTEuMTMgMS4xMyAwIDAgMS0uMjU0LjIyNyAxLjM1MyAxLjM1MyAwIDAgMS0uNTk1LjIxNGMtLjEuMDEyLS4xOTMuMDE0LS4yOC4wMDZsLTEuNTYtLjEwOC4wMzQtLjQwNi4wMy0uMzQ4IDEuNTU5LjE1NGMuMDkgMCAuMTczLS4wMS4yNDgtLjAzM2EuNjAzLjYwMyAwIDAgMCAuMi0uMTA2LjUzMi41MzIgMCAwIDAgLjEzOS0uMTcyLjY2LjY2IDAgMCAwIC4wNjQtLjI0MWwuMDI5LS4zMjFhLjk0Ljk0IDAgMCAwLS4wMzYtLjI1LjU3LjU3IDAgMCAwLS4xMDMtLjIwMi41MDIuNTAyIDAgMCAwLS4xNjgtLjEzOC42MDUuNjA1IDAgMCAwLS4yNC0uMDY3TDEuMjczLjgyN2MtLjA5NC0uMDA4LS4xNjguMDEtLjIyMS4wNTUtLjA1My4wNDUtLjA4NC4xMTQtLjA5Mi4yMDZMLjcwNSA0IDAgMy45MzhsLjI1NS0yLjkxMUExLjAxIDEuMDEgMCAwIDEgLjM5My41NzIuOTYyLjk2MiAwIDAgMSAuNjY2LjI4NmEuOTcuOTcgMCAwIDEgLjMzOC0uMTRDMS4xMjIuMTIgMS4yMy4xMSAxLjMyOC4xMTlsMS41OTMuMTRjLjE2LjAxNC4zLjA0Ny40MjMuMWExLjE3IDEuMTcgMCAwIDEgLjU0NS40NDhjLjA2MS4wOTUuMTA5LjE5My4xNDQuMjk1YTEuNDA2IDEuNDA2IDAgMCAxIC4wNzcuNTgzbC0uMDI4LjMyMloiIGZpbGw9IndoaXRlIi8+Cjwvc3ZnPgo=\" \u002F>\n   \u003C\u002Fa>\n\n\u003C\u002Fdiv>\n\n\u003Cdiv>\n  \u003Cp align=\"center\">\n  \u003Ca\n  href=\"https:\u002F\u002Fshort.unstructured.io\u002Fpzw05l7\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FJOIN US ON SLACK-4A154B?style=for-the-badge&logo=slack&logoColor=white\" \u002F>\n  \u003C\u002Fa>\n  \u003Ca href=\"https:\u002F\u002Fwww.linkedin.com\u002Fcompany\u002Funstructuredio\u002F\">\n    \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLinkedIn-0077B5?style=for-the-badge&logo=linkedin&logoColor=white\" \u002F>\n  \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n\u003Ch2 align=\"center\">\n  \u003Cp>Open-Source Pre-Processing Tools for Unstructured Data\u003C\u002Fp>\n\u003C\u002Fh2>\n\nThe `unstructured` library provides open-source components for ingesting and pre-processing images and text documents, such as PDFs, HTML, Word docs, and 
[many more](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fcore-functionality\u002Fpartitioning). The use cases of `unstructured` revolve around streamlining and optimizing the data processing workflow for LLMs. `unstructured` modular functions and connectors form a cohesive system that simplifies data ingestion and pre-processing, making it adaptable to different platforms and efficient in transforming unstructured data into structured outputs.\n\n## Try the Unstructured Platform Product\n\nReady to move your data processing pipeline to production, and take advantage of advanced features? Check out [Unstructured Platform](https:\u002F\u002Funstructured.io\u002Fenterprise). In addition to better processing performance, take advantage of chunking, embedding, and image and table enrichment generation, all from a low code UI or an API. [Request a demo](https:\u002F\u002Funstructured.io\u002Fcontact) from our sales team to learn more about how to get started.\n\n## :eight_pointed_black_star: Quick Start\n\nThere are several ways to use the `unstructured` library:\n* [Run the library in a container](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured#run-the-library-in-a-container) or\n* Install the library\n    1. [Install from PyPI](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured#installing-the-library)\n    2. 
[Install for local development](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured#installation-instructions-for-local-development)\n* For installation with `conda` on Windows system, please refer to the [documentation](https:\u002F\u002Funstructured-io.github.io\u002Funstructured\u002Finstalling.html#installation-with-conda-on-windows)\n\n### 在容器中运行该库\n\n以下说明旨在帮助您使用 Docker 与 `unstructured` 进行交互。如果您尚未在本地安装 Docker，请参阅 [此处](https:\u002F\u002Fdocs.docker.com\u002Fget-docker\u002F)。\n\n注意：我们构建了多平台镜像，以支持 x86_64 和 Apple Silicon 硬件。`docker pull` 应会下载适合您架构的镜像，但如有需要，您也可以通过 `--platform` 参数指定（例如 `--platform linux\u002Famd64`）。\n\n我们为每次推送到 `main` 分支都会构建 Docker 镜像。每个镜像都打上对应的短提交哈希标签（如 `fbc7a69`）和应用版本标签（如 `0.5.5-dev1`）。我们还会将最新镜像标记为 `latest`。要利用这一点，请从我们的镜像仓库拉取镜像。\n\n```bash\ndocker pull downloads.unstructured.io\u002Funstructured-io\u002Funstructured:latest\n```\n\n拉取完成后，您可以基于此镜像创建一个容器，并进入该容器的 Shell。\n\n```bash\n# 创建容器\ndocker run -dt --name unstructured downloads.unstructured.io\u002Funstructured-io\u002Funstructured:latest\n\n# 这将带您进入正在运行 Docker 镜像的 Bash Shell\ndocker exec -it unstructured bash\n```\n\n您也可以自行构建 Docker 镜像。请注意，基础镜像是 `wolfi-base`，它会定期更新。如果您在本地构建镜像，由于 `wolfi-base` 的上游变更，`docker build` 可能会失败。\n\n如果您只打算解析一种类型的数据，可以通过注释掉其他数据类型所需的某些包或依赖项来加快镜像构建速度。请参考 Dockerfile，了解哪些行对您的用例是必要的。\n\n```bash\nmake docker-build\n\n# 这将带您进入正在运行 Docker 镜像的 Bash Shell\nmake docker-start-bash\n```\n\n进入运行中的容器后，您可以直接在 Python 解释器的交互模式下尝试操作。\n\n```bash\n# 这将带您进入 Python 控制台，以便运行下面的分区函数\npython3\n\n>>> from unstructured.partition.pdf import partition_pdf\n>>> elements = partition_pdf(filename=\"example-docs\u002Flayout-parser-paper-fast.pdf\")\n\n>>> from unstructured.partition.text import partition_text\n>>> elements = partition_text(filename=\"example-docs\u002Ffake-text.txt\")\n```\n\n### 安装该库\n\n请按照以下说明开始使用 `unstructured` 并测试您的安装。\n\n- 使用 `pip install \"unstructured[all-docs]\"` 安装支持所有文档类型的 Python SDK。\n  - 对于不需要任何额外依赖的纯文本文件、HTML、XML、JSON 和电子邮件，您可以直接运行 `pip install 
unstructured`。\n  - 若要处理其他类型的文档，可以安装这些文档所需的附加组件，例如 `pip install \"unstructured[docx,pptx]\"`。\n- 如果您的系统尚未安装以下系统依赖项，请进行安装。根据您解析的文档类型，可能并不需要全部依赖项：\n  - `libmagic-dev`（文件类型检测）\n  - `poppler-utils`（图像和 PDF）\n  - `tesseract-ocr`（图像和 PDF；安装 `tesseract-lang` 可获得额外的语言支持）\n  - `libreoffice`（MS Office 文档）\n  - `pandoc` 会通过 `pypandoc-binary` Python 包自动打包，无需单独安装。\n\n- 关于如何在 Windows 上安装以及了解其他功能的依赖项，请参阅安装文档 [此处](https:\u002F\u002Funstructured-io.github.io\u002Funstructured\u002Finstalling.html)。\n\n此时，您应该能够运行以下代码：\n\n```python\nfrom unstructured.partition.auto import partition\n\nelements = partition(filename=\"example-docs\u002Feml\u002Ffake-email.eml\")\nprint(\"\\n\\n\".join([str(el) for el in elements]))\n```\n\n### 本地开发的安装说明\n\n以下说明旨在帮助您在计划为项目做出贡献时，在本地设置并运行 `unstructured`。\n\n该项目使用 [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) 进行依赖管理。请先安装它：\n\n```bash\n# macOS \u002F Linux\ncurl -LsSf https:\u002F\u002Fastral.sh\u002Fuv\u002Finstall.sh | sh\n```\n\n然后安装所有依赖项（基础、附加、开发、测试和 lint 组）：\n\n```bash\nmake install\n```\n\n这会运行 `uv sync --locked --all-extras --all-groups`，从而一步创建虚拟环境并安装所有内容。无需手动创建或激活虚拟环境。\n\n如果仅需安装特定文档类型的附加组件：\n\n```bash\nuv sync --extra pdf\nuv sync --extra csv --extra docx\n```\n\n在更改 `pyproject.toml` 中的依赖项后，更新锁定文件：\n\n```bash\nmake lock\n```\n\n* 可选：\n  * 要在本地安装用于处理图像和 PDF 的附加组件，可运行 `uv sync --extra pdf --extra image`。\n  * 处理图像文件时，需要 `tesseract`。安装说明请参见 [此处](https:\u002F\u002Ftesseract-ocr.github.io\u002Ftessdoc\u002FInstallation.html)。\n  * 处理 PDF 文件时，需要 `tesseract` 和 `poppler`。[pdf2image 文档](https:\u002F\u002Fpdf2image.readthedocs.io\u002Fen\u002Flatest\u002Finstallation.html) 提供了在不同平台上安装 `poppler` 的说明。\n\n此外，如果您计划为 `unstructured` 做出贡献，我们还提供了一个可选的 `pre-commit` 配置文件，以确保您的代码符合 `unstructured` 中使用的格式和 lint 标准。如果您不希望每次提交前自动整理代码，可以使用 `make check` 来查看是否需要进行 lint 或格式化调整，并使用 `make tidy` 来应用这些调整。\n\n如果使用可选的 `pre-commit`，只需运行 `pre-commit install` 即可安装钩子，因为 `pre-commit` 包已在上述 `make install` 中一并安装。最后，如果您决定使用 `pre-commit`，也可以通过 `pre-commit uninstall` 
来卸载这些钩子。\n\n除了在本地操作系统中开发之外，我们还提供了一个辅助工具，即使用 Docker 提供开发环境：\n\n```bash\nmake docker-start-dev\n```\n\n这将启动一个 Docker 容器，其中您的本地仓库被挂载到 `\u002Fmnt\u002Flocal_unstructured`。该 Docker 镜像使您无需担心本地操作系统与仓库及其依赖项的兼容性问题，即可进行开发。\n\n## :clap: 快速游览\n\n### 文档\n如需更全面的文档，请访问 https:\u002F\u002Fdocs.unstructured.io 。您还可以在文档页面上了解更多关于我们其他产品的信息，包括我们的 SaaS API。\n\n以下是 [开源文档页面](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fintroduction\u002Foverview) 中对新用户有帮助的一些页面：\n\n- [快速入门](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fintroduction\u002Fquick-start)\n- [使用 `unstructured` 开源软件包](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fcore-functionality\u002Foverview)\n- [连接器](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fingest\u002Foverview)\n- [概念](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fconcepts\u002Fdocument-elements)\n- [集成](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fintegrations)\n\n### PDF 文档解析示例\n以下示例展示了如何开始使用 `unstructured` 库。在 unstructured 中解析文档最简单的方式是使用 `partition` 函数。如果使用 `partition` 函数，`unstructured` 会自动检测文件类型，并将其路由到相应的特定于文件类型的分割函数。如果您使用 `partition` 函数，可能需要根据文档类型安装额外的依赖项。\n例如，要安装 docx 的依赖项，您需要运行 `pip install \"unstructured[docx]\"`。\n更多详细信息请参阅我们的 [安装指南](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Finstallation\u002Ffull-installation)。\n\n```python\nfrom unstructured.partition.auto import partition\n\nelements = partition(\"example-docs\u002Flayout-parser-paper.pdf\")\n```\n\n运行 `print(\"\\n\\n\".join([str(el) for el in elements]))` 可以获取输出的字符串表示，其内容如下：\n\n```\n\nLayoutParser：基于深度学习的文档图像分析统一工具包\n\n沈泽江 1 ( (cid:0) ), 张若晨 2 , 梅丽莎·戴尔 3 , 本杰明·查尔斯·热姆·李 4 , 雅各布·卡尔森 3 , 和 李伟宁 5\n\n摘要。近年来，文档图像分析（DIA）领域的进步主要得益于神经网络的应用。理想情况下，研究成果应能轻松部署到生产环境中，并为进一步研究提供扩展性。然而，代码库组织松散、模型配置复杂等多种因素，使得重要创新难以被广泛用户群体重复使用。尽管自然语言处理和计算机视觉等领域一直在努力提高可复用性并简化深度学习（DL）模型开发，但这些方法并未针对 DIA 领域的挑战进行优化。这构成了现有工具集中的一个重大缺口，因为 DIA 在社会科学和人文学科的众多研究领域中占据核心地位。本文介绍了 LayoutParser，一个开源库，旨在简化深度学习在 DIA 研究和应用中的使用。LayoutParser 
核心库提供了一组简单直观的接口，用于应用和定制深度学习模型，以实现版面检测、字符识别以及许多其他文档处理任务。为促进扩展性，LayoutParser 还整合了一个社区平台，用于共享预训练模型和完整的文档数字化流水线。我们证明了 LayoutParser 对于实际应用场景中的轻量级和大规模数字化流水线均有所帮助。该库已在 https:\u002F\u002Flayout-parser.github.io 上公开发布。\n\n关键词：文档图像分析 · 深度学习 · 版面分析 · 字符识别 · 开源库 · 工具包。\n\n引言\n\n基于深度学习（DL）的方法已成为多种文档图像分析（DIA）任务的最先进手段，包括文档图像分类 [11,\n```\n\n有关完整选项列表及如何使用特定于文件类型的分割函数的说明，请参阅我们文档中的 [分割](https:\u002F\u002Fdocs.unstructured.io\u002Fopen-source\u002Fcore-functionality\u002Fpartitioning) 部分。\n\n## :guardsman: 安全政策\n\n有关如何报告安全漏洞的信息，请参阅我们的 [安全政策](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fsecurity\u002Fpolicy)。\n\n## :bug: 报告错误\n\n遇到错误了吗？请创建一个新的 [GitHub 问题](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fissues\u002Fnew\u002Fchoose)，并使用我们的错误报告模板描述问题。为了帮助我们诊断问题，请使用 `python scripts\u002Fcollect_env.py` 命令收集您的系统环境信息，并将其包含在报告中。您的帮助将有助于我们不断改进软件——感谢您！\n\n## :books: 了解更多\n\n| 版块 | 描述 |\n|-|-|\n| [公司官网](https:\u002F\u002Funstructured.io) | Unstructured.io 产品及公司信息 |\n| [文档](https:\u002F\u002Fdocs.unstructured.io\u002F) | 完整的 API 文档 |\n| [批量处理](https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured-ingest) | 通过 Unstructured 批量导入文档 |\n\n## :chart_with_upwards_trend: 分析\n\n遥测功能 **默认关闭**。如需启用，请在导入 `unstructured` 之前设置 `UNSTRUCTURED_TELEMETRY_ENABLED=true`（或 `=1`）。如需禁用，请将 `DO_NOT_TRACK` 或 `SCARF_NO_ANALYTICS` 设置为任意非空值（例如 `true`、`1`、`yes`、`false`、`0`——任何非空字符串都会禁用遥测功能）；禁用优先于启用。如果您不想禁用遥测功能，可以不设置该变量或将其留空。详情请参阅我们的 [隐私政策](https:\u002F\u002Funstructured.io\u002Fprivacy-policy)。","# Unstructured 快速上手指南\n\n`unstructured` 是一个开源的非结构化数据预处理工具库，专为大语言模型（LLM）设计。它能高效地将 PDF、HTML、Word 文档、图片等非结构化数据转换为结构化输出，简化数据摄入和清洗流程。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下要求并安装了必要的系统依赖。根据您的使用场景（处理的文件类型），可能需要安装不同的依赖包。\n\n### 系统要求\n- Python 3.8+\n- 操作系统：Linux, macOS, Windows (Windows 用户建议参考官方文档使用 Conda 或 Docker)\n\n### 前置系统依赖\n如果您计划处理多种文档格式（如 PDF、图片、Office 文档），请在安装 Python 包前先安装以下系统级依赖：\n\n**Ubuntu\u002FDebian:**\n```bash\nsudo apt-get update\nsudo apt-get install -y libmagic-dev 
poppler-utils tesseract-ocr libreoffice pandoc\n# 如需更多语言支持，可安装 tesseract-lang\n```\n\n**macOS (使用 Homebrew):**\n```bash\nbrew install libmagic poppler tesseract libreoffice pandoc\n```\n\n> **注意**：\n> - `libmagic-dev`: 用于文件类型检测\n> - `poppler-utils`: 用于处理图片和 PDF\n> - `tesseract-ocr`: 用于 OCR 文字识别（图片和 PDF）\n> - `libreoffice`: 用于处理 MS Office 文档\n> - `pandoc`: Python 包 `pypandoc-binary` 会自动捆绑，通常无需手动安装系统版。\n\n## 安装步骤\n\n您可以选择直接通过 PyPI 安装，或使用 Docker 容器运行。\n\n### 方式一：使用 Pip 安装（推荐）\n\n**1. 全功能安装**\n支持所有文档类型（PDF, Word, PPT, 图片等）：\n```bash\npip install \"unstructured[all-docs]\"\n```\n\n**2. 按需安装**\n如果仅处理纯文本、HTML、JSON 或邮件，无需额外依赖：\n```bash\npip install unstructured\n```\n\n如果只需处理特定格式（例如仅 Word 和 PPT）：\n```bash\npip install \"unstructured[docx,pptx]\"\n```\n\n### 方式二：使用 Docker 运行\n\n如果您希望避免配置系统依赖，可以直接拉取官方镜像：\n\n```bash\n# 拉取最新镜像\ndocker pull downloads.unstructured.io\u002Funstructured-io\u002Funstructured:latest\n\n# 创建并启动容器\ndocker run -dt --name unstructured downloads.unstructured.io\u002Funstructured-io\u002Funstructured:latest\n\n# 进入容器交互界面\ndocker exec -it unstructured bash\n```\n\n## 基本使用\n\n安装完成后，您可以使用 `partition` 函数自动识别文件类型并进行解析。以下是最简单的使用示例：\n\n### Python 代码示例\n\n```python\nfrom unstructured.partition.auto import partition\n\n# 自动识别文件类型并解析\nelements = partition(filename=\"example-docs\u002Feml\u002Ffake-email.eml\")\n\n# 打印提取的内容\nprint(\"\\n\\n\".join([str(el) for el in elements]))\n```\n\n### 针对特定文件类型的用法\n\n您也可以直接调用特定格式的解析函数：\n\n```python\n# 解析 PDF\nfrom unstructured.partition.pdf import partition_pdf\nelements = partition_pdf(filename=\"example-docs\u002Flayout-parser-paper-fast.pdf\")\n\n# 解析纯文本\nfrom unstructured.partition.text import partition_text\nelements = partition_text(filename=\"example-docs\u002Ffake-text.txt\")\n```\n\n解析后的 `elements` 对象包含文档的结构化块（如标题、段落、表格等），可直接用于后续的 LLM 嵌入或向量数据库存储。","某金融合规团队需要从数千份格式各异的 PDF 财报、扫描件和 Word 合同中提取关键风险数据，以构建企业级 RAG（检索增强生成）问答系统。\n\n### 没有 unstructured 时\n- **解析格式单一**：传统库如 PyPDF2 难以处理扫描版图片或复杂排版的文档，导致大量非文本内容直接丢失。\n- 
**结构信息混乱**：提取出的文字往往丢失了标题、页眉页脚与正文的层级关系，变成无意义的“文字汤”，大模型无法理解文档逻辑。\n- **清洗成本高昂**：开发人员需编写大量正则表达式手动去除乱码和无关字符，耗时且容易误删关键数据。\n- **分块效果差**：由于缺乏语义感知，简单的按字符数切分会切断完整的句子或段落，严重降低后续向量检索的准确率。\n\n### 使用 unstructured 后\n- **全格式统一支持**：unstructured 能自动识别并高质量解析 PDF、图片、PPT 等 20+ 种复杂格式，即使是扫描件也能通过内置 OCR 提取文字。\n- **智能结构还原**：自动将文档元素分类为标题、叙述文本、表格等结构化对象，完整保留文档的逻辑层级，让大模型“读懂”内容。\n- **开箱即用的清洗**：内置去重、去除特殊符号等清洗流程，无需额外编码即可输出干净、标准化的 JSON 数据。\n- **语义感知分块**：提供基于标题和段落语义的智能切分策略，确保每个数据块上下文完整，显著提升 RAG 系统的回答质量。\n\nunstructured 将原本需要数周的非结构化数据清洗工程，缩短为几行代码的自动化流程，让团队能专注于核心业务逻辑而非数据预处理。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FUnstructured-IO_unstructured_7e8d60f1.png","Unstructured-IO","Unstructured","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FUnstructured-IO_190dee06.jpg","",null,"security@unstructured.io","UnstructuredIO","https:\u002F\u002Funstructured.io\u002F","https:\u002F\u002Fgithub.com\u002FUnstructured-IO",[85,89,93,97,99,103,106,109],{"name":86,"color":87,"percentage":88},"HTML","#e34c26",90.2,{"name":90,"color":91,"percentage":92},"Python","#3572A5",8.5,{"name":94,"color":95,"percentage":96},"Shell","#89e051",0.6,{"name":98,"color":79,"percentage":96},"Rich Text Format",{"name":100,"color":101,"percentage":102},"Makefile","#427819",0,{"name":104,"color":105,"percentage":102},"Dockerfile","#384d54",{"name":107,"color":108,"percentage":102},"XSLT","#EB8CEB",{"name":110,"color":111,"percentage":102},"Go","#00ADD8",14393,1208,"2026-04-05T15:30:07","Apache-2.0","Linux, macOS, Windows","未说明",{"notes":119,"python":120,"dependencies":121},"该工具主要依赖系统级库而非单纯的 Python 包。处理不同文档类型需安装对应的系统依赖：libmagic-dev 用于文件类型检测，poppler-utils 用于 PDF 和图片处理，tesseract-ocr 用于 OCR（可选装 tesseract-lang 支持多语言），libreoffice 用于 MS Office 文档。Windows 用户建议使用 conda 安装。项目使用 uv 进行依赖管理。Docker 镜像支持 x86_64 和 Apple Silicon 
架构。","3.8+",[122,123,124,125,126],"libmagic-dev","poppler-utils","tesseract-ocr","libreoffice","pypandoc-binary",[13,51,54,26,14],[129,130,131,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148],"deep-learning","document-parsing","machine-learning","nlp","ocr","information-retrieval","data-pipelines","ml","preprocessing","pdf-to-text","natural-language-processing","pdf","pdf-to-json","document-image-analysis","donut","document-image-processing","document-parser","docx","langchain","llm","2026-03-27T02:49:30.150509","2026-04-06T08:35:16.654809",[152,157,162,167],{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},17966,"运行 partition_pdf() 时遇到 NLTK 错误：Resource \"punkt_tab\" not found 如何解决？","该问题通常由 unstructured 和 nltk 版本不兼容引起。解决方案如下：\n1. 升级 unstructured 到 0.16.16 或更高版本，该版本会自动下载缺失的 NLTK 数据。\n2. 如果升级无效，可以尝试特定的版本组合：unstructured==0.16.12 配合 nltk==3.9.1，或者 unstructured==0.15.5 配合 nltk==3.8.1。\n3. 也可以手动运行以下 Python 代码下载资源：\n   import nltk\n   nltk.download('punkt_tab')","https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fissues\u002F3511",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},17967,"处理 PDF 时遇到 PIL.UnidentifiedImageError: cannot identify image file 错误怎么办？","此错误通常发生在 Linux 容器环境中，是因为 unstructured 使用 poppler 将 PDF 渲染为 PPM 格式的临时图像供模型处理，而某些环境下的 Pillow (PIL) 无法识别这些文件。\n原因分析：这是 PDF 渲染过程中的中间步骤，并非提取嵌入式图片失败。\n建议检查：\n1. 确认是否仅在 Linux 容器中出现，Mac 上通常正常。\n2. 
检查 poppler-utils 的版本，不同版本可能导致兼容性问题。\n目前暂无直接的配置开关来跳过特定格式，需确保运行环境中 poppler 和 Pillow 版本兼容。","https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fissues\u002F3102",{"id":163,"question_zh":164,"answer_zh":165,"source_url":166},17968,"partition_html 为什么只能提取包含嵌套标签（如 span, b）的 div 元素的部分文本？","这是由于底层 lxml 库在处理包含内联标签（inline tags）的文本时的行为特性导致的，它会丢失嵌套标签后的文本内容。\n虽然官方曾尝试修复，但在某些复杂嵌套结构（如 div 中包含带 span 的 div）下，可能仍会失败或导致所有文本合并到一个元素中从而改变原有结构。\n如果遇到此问题，建议检查 unstructured 的版本是否包含最新的 lxml 文本提取修复，或者在下游处理中对提取结果进行额外的文本完整性校验。","https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fissues\u002F2362",{"id":168,"question_zh":169,"answer_zh":170,"source_url":171},17969,"如何从 PDF 中提取图片并获取 base64 格式的元数据？","在使用 partition_pdf 函数时，需要正确配置以下参数才能获取图片的 base64 数据：\n1. 设置 extract_images_in_pdf=True 以启用图片提取。\n2. 设置 extract_image_block_types=[\"Image\", \"Table\"] 指定要提取的类型。\n3. 关键步骤：必须设置 extract_image_block_to_payload=True。\n配置示例代码：\nraw_pdf_elements = partition_pdf(\n    filename=\"path\u002Fto\u002Ffile.pdf\",\n    extract_images_in_pdf=True,\n    extract_image_block_types=[\"Image\", \"Table\"],\n    extract_image_block_to_payload=True\n)\n设置后，返回的元素元数据中将包含 base64 编码的图片字符串。","https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fissues\u002F2603",[173,178,183,188,193,198,203,207,212,217,222,227,232,237,242,247,252,257,262,267],{"id":174,"version":175,"summary_zh":176,"released_at":177},108325,"0.22.16","## 0.22.16\n\n### 功能增强\n- **公式 Markdown 导出（`element_to_md` \u002F `elements_to_md`）**：新增仅限关键字参数 `formula_markdown_style`（`\"auto\"`、`\"display_math\"`、`\"plain\"`；默认为 `\"auto\"`）。在 `\"auto\"` 模式下，仅当文本看起来像数学符号（基于启发式评分）且不包含 `$` 或 `$$` 时才使用显示数学格式（`$$ ... 
$$`），以避免破坏 Markdown 格式和 OCR 字幕的混乱显示。`\"display_math\"` 模式会在安全的情况下始终使用显示数学格式（如果 `$` 会破坏代码块标记，则回退到普通文本）。`\"plain\"` 模式则仅输出纯文本。可选参数 `normalize_formula`（默认为 `True`）会将常见的 Unicode 运算符映射为类似 LaTeX 的标记；`normalize_formula` 参数位于仅限关键字的选项之前，因此按位置传递 `encoding` 或 `no_group_by_page` 的调用者无需更改。Unicode 中的 `√` 不会被映射为 `\\\\sqrt{}`。模块常量：`FORMULA_MARKDOWN_AUTO`、`FORMULA_MARKDOWN_DISPLAY_MATH`、`FORMULA_MARKDOWN_PLAIN`。\n\n## 0.22.15\n\n### 安全性\n\n- **安全**：修复（依赖项）：升级存在漏洞的间接依赖项 [安全]\n\n## 0.22.14\n\n### 功能增强\n- **去重 PDF 渲染**：移除 `_render_pdf_pages` 方法，并委托给 `unstructured-inference` 的 `convert_pdf_to_image` 函数（该函数已支持按页懒加载渲染）。当 `path_only=True` 时，峰值内存占用从 O(n_pages) 降至 O(1 页)，对于一个 100 页的 PDF 来说，内存占用减少了 97%。将推理依赖版本提升至 `>=1.6.2`。\n\n## 0.22.13\n\n### 功能增强\n- **加速 `standardize_quotes`**：用预先计算好的翻译表，通过一次 `str.translate()` 调用来替换原有的基于循环的字符替换操作。同时修复了一个已存在的 bug：由于字典键重复，左侧智能引号从未被规范化。","2026-04-03T20:44:39",{"id":179,"version":180,"summary_zh":181,"released_at":182},108326,"0.22.12","## 变更内容\n* mem：排除未使用的 spaCy 管道组件，以减少模型内存占用，由 @KRRT7 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4296 中实现。\n* fix：修复了 pdfminer 导致可提取文本丢失的问题，由 @qued 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4310 中完成。\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.22.10...0.22.12","2026-04-02T16:27:21",{"id":184,"version":185,"summary_zh":186,"released_at":187},108327,"0.22.10","## 变更内容\n* 修复（分块）：通过 @cragwolfe 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4301 中的修改，在重建时保留嵌套表格结构。\n* 将 lazyproperty 替换为 functools.cached_property，由 @KRRT7 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4282 中完成。\n* 内存优化：将 PaddleOCR 的 rec_batch_num 从 6 减少到 1，由 @KRRT7 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4295 中完成。\n* 修复：在预分块中隔离表格元素，由 @claytonlin1110 在 
https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4307 中完成。\n* 功能（分块）：在后续分块中重复表格标题，由 @cragwolfe 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4298 中完成。\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.22.6...0.22.10","2026-03-31T15:50:46",{"id":189,"version":190,"summary_zh":191,"released_at":192},108328,"0.22.6","## 变更内容\n* 修复（依赖）：更新安全补丁 [SECURITY]，由 @utic-renovate[bot] 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4303 中完成\n* 修复：在发布 CI 中添加自包含的版本提取脚本，由 @vladimir-kivi-ds 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4304 中完成\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.22.4...0.22.6","2026-03-26T21:20:33",{"id":194,"version":195,"summary_zh":196,"released_at":197},108329,"0.22.4","## 变更内容\n* 功能：新增 `create_file_from_elements()` 方法，用于根据元素重新创建文档文件，由 @claytonlin1110 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4259 中实现。\n* 升级依赖：由 @PastelStorm 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4265 中完成。\n* 修复：避免 `_patch_current_chars_with_render_mode` 中 O(N²) 的重复扫描问题，由 @KRRT7 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4266 中修复。\n* 添加检查：在 libmagic 失败时进行检查，由 @aadland6 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4273 中实现。\n* 新增表单元素：由 @aadland6 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4272 中添加。\n* 功能：音频转文本分区功能，由 @claytonlin1110 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4264 中实现。\n* 添加复杂 PDF 检查：由 @aadland6 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4268 中添加。\n* 杂项：禁用 Anchore 容器扫描的失败构建检查，由 @lawrence-u10d 在 
https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4285 中完成。\n* 功能：将遥测功能默认关闭，由 @claytonlin1110 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4281 中实现。\n* 修复（依赖）：更新 pypdf 中的安全漏洞至 v6.9.1 [安全]，由 @utic-renovate[bot] 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4248 中完成。\n* 功能：在 ElementMetadata 中存储路由信息，由 @vladimir-kivi-ds 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4293 中实现。\n* 功能：为 `partition_md` 添加自定义 Markdown 扩展，由 @claytonlin1110 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4292 中实现。\n* 功能：tablechunks 现可重构表格，由 @qued 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4291 中实现。\n\n## 新贡献者\n* @KRRT7 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4266 中完成了首次贡献。\n* @vladimir-kivi-ds 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4293 中完成了首次贡献。\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.21.5...0.22.4","2026-03-26T19:32:23",{"id":199,"version":200,"summary_zh":201,"released_at":202},108330,"0.21.5","## 变更内容\n* 功能：由 @claytonlin1110 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4238 中实现的语言检测自定义回退机制\n* 添加用于时间回归测试的 GitHub Actions 工作流，由 @aadland6 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4261 中完成\n* 修复：放宽 pdfminer.six 的下限版本约束，由 @badGarnet 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4262 中完成\n\n## 新贡献者\n* @claytonlin1110 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4238 中完成了首次贡献\n* @aadland6 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4261 
中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.21.2...0.21.5","2026-02-24T15:28:48",{"id":204,"version":205,"summary_zh":79,"released_at":206},108331,"0.21.2","2026-02-23T01:16:21",{"id":208,"version":209,"summary_zh":210,"released_at":211},108332,"0.21.1","## 变更内容\n* 由 @badGarnet 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4257 中更新版本号\n\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.21.0...0.21.1","2026-02-22T20:48:40",{"id":213,"version":214,"summary_zh":215,"released_at":216},108333,"0.21.0","## 0.21.0\n\n### 修复\n- **将 NLTK 替换为 spaCy，以缓解 CVE-2025-14009 漏洞**：NLTK 的下载器在未进行路径校验的情况下使用 `zipfile.extractall()`，从而可通过恶意包实现远程代码执行（CVSS 评分 10.0，暂无补丁）。而 spaCy 的模型以 pip 包的形式安装，彻底消除了这一存在漏洞的下载器。","2026-02-22T19:36:59",{"id":218,"version":219,"summary_zh":220,"released_at":221},108334,"0.20.8","## 变更内容\n* 修复：由 @qued 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4244 中设置 JSON 元素的最大解压缩大小\n* 修复：由 @badGarnet 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4247 中更新依赖项\n\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.20.6...0.20.8","2026-02-20T00:34:09",{"id":223,"version":224,"summary_zh":225,"released_at":226},108335,"0.20.6","## 变更内容\n* 由 @PastelStorm 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4239 中实现 PyPI 发布的自动化\n* 修复：由 @bittoby 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4215 中移除 PDF 中因虚假加粗渲染导致的重复字符\n* 由 @CyMule 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4242 中优化快速分区的冷启动性能\n* 修复：由 @badGarnet 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4243 中在分块过程中优雅地处理无效的 HTML 字符串\n* 修复：由 @badGarnet 在 
https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4245 中在哈希后重新映射父级 ID\n\n## 新贡献者\n* @bittoby 在 https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4215 中完成了首次贡献\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.20.1...0.20.6","2026-02-19T16:19:03",{"id":228,"version":229,"summary_zh":230,"released_at":231},108336,"0.20.2","Release 0.20.2","2026-02-13T03:04:28",{"id":233,"version":234,"summary_zh":235,"released_at":236},108337,"0.20.1","## What's Changed\r\n* Add Python 3.11 and 3.13 support by @PastelStorm in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4236\r\n* Use bigger runner to publish images by @PastelStorm in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4237\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.19.3...0.20.1","2026-02-12T02:44:53",{"id":238,"version":239,"summary_zh":240,"released_at":241},108338,"0.19.3","## What's Changed\r\n* feat: add group_elements_by_parent_id utility function by @MkDev11 in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4207\r\n* Preserve newlines in Table and TableChunk elements during PDF partitioning by @eureka928 in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4214\r\n* Fix: make pdf image dpi consistent by @badGarnet in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4217\r\n* chore: bump dependencies for 0.18.34 by @luke-kucing in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4221\r\n* feat: increase PIL's max image pixel value for pdf partition by @badGarnet in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4220\r\n* Migrate to uv by @PastelStorm in 
https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4226\r\n* Fix ARM64 paddlepaddle image builder bug by @PastelStorm in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4228\r\n* Fix Docker ARM64 image failure, use 8-core github runners by @PastelStorm in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4232\r\n* Fix ARM64 image issues by @PastelStorm in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4233\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.18.32...0.19.3","2026-02-11T19:17:01",{"id":243,"version":244,"summary_zh":245,"released_at":246},108339,"0.18.32","## What's Changed\r\n* feat: put pdfium call behind a threadlock by @badGarnet in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4211\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.18.31...0.18.32","2026-02-10T22:26:23",{"id":248,"version":249,"summary_zh":250,"released_at":251},108340,"0.18.31","## What's Changed\r\n* Feat: patch pdfminer and use rendermode to detect invisible text by @badGarnet in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4158\r\n* fix: add EN DASH to UNICODE_BULLETS for clean_bullets by @MkDev11 in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4186\r\n* fix: fix version number by @badGarnet in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4189\r\n* enhancement: render pdfs with pdfium by @qued in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4185\r\n* feat: consider rotated text as low fidelityfeat: consider rotated text by @badGarnet in 
https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4190\r\n* fix: address jaraco CVE by @qued in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4198\r\n* fix: hange default for languages parameter from [\"auto\"] to None by @eureka928 in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4194\r\n* ⚡️ Speed up function `_get_optimal_value_for_bbox` by 2,883% by @aseembits93 in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4181\r\n* ⚡️ Speed up method `_DocxPartitioner._style_based_element_type` by 593% by @aseembits93 in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4179\r\n* Luke\u002Fupdate dockerfile by @luke-kucing in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4192\r\n* fix: reduce default dpi to 350 by @qued in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4199\r\n* fix(deps): switch from pip-compile to uv pip compile by @lawrence-u10d in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4202\r\n* fix: remove sandbox=True from pypandoc to fix ODT conversion by @MkDev11 in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4193\r\n* Token-Based Chunking Support by @eureka928 in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4203\r\n* fix: filter coordinates kwargs to prevent TypeError in hi_res PDF processing by @MkDev11 in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4206\r\n* fix(deps): Update docker.elastic.co\u002Felasticsearch\u002Felasticsearch Docker tag to v8.19.10 by @utic-renovate[bot] in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4133\r\n* fix(deps): Update opensearchproject\u002Fopensearch Docker tag to v2.19.4 by 
@utic-renovate[bot] in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4134\r\n* fix(deps): Update semitechnologies\u002Fweaviate Docker tag to v1.35.3 by @utic-renovate[bot] in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4135\r\n* fix: Preserve Line Breaks in Code Blocks During Chunking by @eureka928 in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4196\r\n* chorse sep bump to resolve open CVEs by @luke-kucing in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4205\r\n\r\n## New Contributors\r\n* @MkDev11 made their first contribution in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4186\r\n* @eureka928 made their first contribution in https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fpull\u002F4194\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FUnstructured-IO\u002Funstructured\u002Fcompare\u002F0.18.28...0.18.31","2026-01-27T15:29:50",{"id":253,"version":254,"summary_zh":255,"released_at":256},108341,"0.18.28","### Enhancement\r\n- Optimize `clean_extra_whitespace_with_index_run` (codeflash)\r\n- Optimize `recursive_xy_cut_swapped` (codeflash)\r\n- Optimize `_DocxPartitioner._parse_category_depth_by_style_name` (codeflash)\r\n- Optimize `VertexAIEmbeddingEncoder._add_embeddings_to_elements` (codeflash)\r\n- Optimize `ngrams` (codeflash)\r\n- Optimize `stage_for_datasaur` (codeflash)\r\n","2026-01-09T19:24:33",{"id":258,"version":259,"summary_zh":260,"released_at":261},108342,"0.18.27","## 0.18.27\r\n\r\n### Fixes\r\n- Comment no-ops in `zoom_image` (codeflash)\r\n- Fix an issue where elements with partially filled extracted text are marked as extracted\r\n\r\n### Enhancement\r\n- Optimize `sentence_count` (codeflash)\r\n- Optimize `_PartitionerLoader._load_partitioner` (codeflash)\r\n- Optimize `detect_languages` (codeflash)\r\n- Optimize 
`contains_verb` (codeflash)\r\n- Optimize `get_bbox_thickness` (codeflash)\r\n- Upgrade pdfminer-six to 20260107 to fix ~15-18% performance regression from eager f-string evaluation\r\n","2026-01-08T00:01:14",{"id":263,"version":264,"summary_zh":265,"released_at":266},108343,"0.18.26","## 0.18.26\r\n\r\n### Fixes\r\n- Pin `deltalake\u003C1.3.0` to fix ARM64 Docker builds (1.3.0 missing Linux ARM64 wheels)\r\n\r\n## 0.18.25\r\n\r\n### Fixes\r\n- **Security update**: Removed pdfminer.six version constraint and bumped pdfminer.six and urllib3 to address high severity CVEs","2026-01-05T21:41:26",{"id":268,"version":269,"summary_zh":270,"released_at":271},108344,"0.18.24","### Enhancement\r\n- Optimize `OCRAgentTesseract.extract_word_from_hocr` (codeflash)\r\n\r\n\r\n### Fixes\r\n- **Security update**: Bumped dependencies to address security vulnerabilities","2025-12-30T17:54:42"]