[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-shcherbak-ai--contextgem":3,"tool-shcherbak-ai--contextgem":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":79,"owner_website":81,"owner_url":82,"languages":83,"stars":100,"forks":101,"last_commit_at":102,"license":103,"difficulty_score":23,"env_os":104,"env_gpu":104,"env_ram":104,"env_deps":105,"category_tags":111,"github_topics":112,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":131,"updated_at":132,"faqs":133,"releases":162},734,"shcherbak-ai\u002Fcontextgem","contextgem","ContextGem: Effortless LLM extraction from documents","ContextGem 是一款专注于大语言模型上下文提取的开源 Python 库，其核心使命是让从文档中获取 LLM 可用信息的过程变得毫不费力。面对复杂的文档格式和非结构化数据，手动清洗和转换上下文往往耗时且易错，ContextGem 通过自动化流程解决了这一难题，显著提升了文档处理的效率。\n\n对于正在构建 RAG 系统、开发 AI Agent 或从事 NLP 相关研究的开发者而言，ContextGem 是非常合适的选择。ContextGem 不仅功能强大，还展现了极高的工程标准：全面覆盖 Python 3.10 至 3.14 版本，集成 uv、ruff 等现代化开发工具，并通过多项安全与质量测试。特别值得一提的是，ContextGem 对 AI Agent 非常友好，原生支持 AGENTS.md 和 CLAUDE.md，便于智能体理解项目结构。凭借详尽的文档和严格的 CI 流程，ContextGem 成为值得信赖的开发伙伴。","![ContextGem](https:\u002F\u002Fcontextgem.dev\u002F_static\u002Fcontextgem_readme_header.png \"ContextGem - Effortless LLM extraction from documents\")\n\n# ContextGem: Effortless LLM extraction from documents\n\n|          |        |\n|----------|--------|\n| **Package** | [![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fcontextgem?logo=pypi&label=PyPi&logoColor=gold&style=flat-square)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fcontextgem\u002F) [![PyPI Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fshcherbak-ai_contextgem_readme_7878541d54b6.png)](https:\u002F\u002Fpepy.tech\u002Fprojects\u002Fcontextgem) [![Python Versions](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13%20%7C%203.14-blue?logo=python&logoColor=gold&style=flat-square)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F) [![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue?style=flat-square)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0) |\n| **Quality** | [![tests](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fshcherbak-ai\u002Fcontextgem\u002Fci-tests.yml?branch=main&style=flat-square&label=tests)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Factions\u002Fworkflows\u002Fci-tests.yml) [![Coverage](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fgist.githubusercontent.com\u002FSergiiShcherbak\u002Fdaaee00e1dfff7a29ca10a922ec3becd\u002Fraw\u002Fcoverage.json&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Factions) [![CodeQL](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fshcherbak-ai\u002Fcontextgem\u002Fcodeql.yml?branch=main&style=flat-square&label=CodeQL)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Factions\u002Fworkflows\u002Fcodeql.yml) [![bandit security](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fshcherbak-ai\u002Fcontextgem\u002Fbandit-security.yml?branch=main&style=flat-square&label=bandit)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Factions\u002Fworkflows\u002Fbandit-security.yml) [![OpenSSF Best Practices](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fshcherbak-ai_contextgem_readme_50ccde67b228.png)](https:\u002F\u002Fwww.bestpractices.dev\u002Fprojects\u002F10489) |\n| **Tools** | [![uv](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fraw.githubusercontent.com\u002Fastral-sh\u002Fuv\u002Fmain\u002Fassets\u002Fbadge\u002Fv0.json&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) [![Ruff](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fraw.githubusercontent.com\u002Fastral-sh\u002Fruff\u002Fmain\u002Fassets\u002Fbadge\u002Fv2.json&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fruff) [![Pydantic v2](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fraw.githubusercontent.com\u002Fpydantic\u002Fpydantic\u002Fmain\u002Fdocs\u002Fbadge\u002Fv2.json&style=flat-square)](https:\u002F\u002Fpydantic.dev) [![pyright](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpyright-checked-blue?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fpyright) [![pre-commit](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpre--commit-enabled-blue?logo=pre-commit&logoColor=white&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fpre-commit\u002Fpre-commit) [![deptry](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdeptry-checked-blue?style=flat-square)](https:\u002F\u002Fgithub.com\u002Ffpgmaas\u002Fdeptry) [![egress: tethered](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fegress-tethered-orange?labelColor=4B8BBE)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Ftethered) [![Hatch project](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A5%9A-Hatch-4051b5.svg?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fpypa\u002Fhatch) |\n| **Docs** | [![docs](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fshcherbak-ai\u002Fcontextgem\u002Fdocs.yml?branch=main&style=flat-square&label=docs)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Factions\u002Fworkflows\u002Fdocs.yml) [![documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-latest-blue?style=flat-square)](https:\u002F\u002Fshcherbak-ai.github.io\u002Fcontextgem\u002F) ![Docstring Coverage](https:\u002F\u002Fcontextgem.dev\u002F_static\u002Finterrogate-badge.svg) [![DeepWiki](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=DeepWiki&message=Chat%20with%20Code&labelColor=%23283593&color=%237E57C2&style=flat-square)](https:\u002F\u002Fdeepwiki.com\u002Fshcherbak-ai\u002Fcontextgem) |\n| **AI Agents** | [![AGENTS.md](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAGENTS.md-✓-blue?style=flat-square)](https:\u002F\u002Fagents.md) [![CLAUDE.md](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCLAUDE.md-✓-blue?style=flat-square)](https:\u002F\u002Fdocs.anthropic.com\u002Fen\u002Fdocs\u002Fclaude-code\u002Fmemory#claudemd) |\n| **Community** | [![Contributor Covenant](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FContributor%20Covenant-2.1-4baaaa?style=flat-square)](CODE_OF_CONDUCT.md) [![GitHub issues closed](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed\u002Fshcherbak-ai\u002Fcontextgem?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fissues?q=is%3Aissue+is%3Aclosed) [![GitHub latest commit](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flast-commit\u002Fshcherbak-ai\u002Fcontextgem?label=latest%20commit&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fcommits\u002Fmain) |\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fshcherbak-ai_contextgem_readme_770dfa57d73c.png\" alt=\"ContextGem: 2nd Product of the week\" width=\"250\">\n\u003C\u002Fdiv>\n\u003Cbr\u002F>\u003Cbr\u002F>\n\nContextGem is a free, open-source LLM framework that makes it radically easier to extract structured data and insights from documents — with minimal code.\n\n---\n\n## 💎 Why ContextGem?\n\nMost popular LLM frameworks for extracting structured data from documents require extensive boilerplate code to extract even basic information. This significantly increases development time and complexity.\n\nContextGem addresses this challenge by providing a flexible, intuitive framework that extracts structured data and insights from documents with minimal effort. The complex, most time-consuming parts are handled with **powerful abstractions**, eliminating boilerplate code and reducing development overhead.\n\n📖 Read more on the project [motivation](https:\u002F\u002Fcontextgem.dev\u002Fmotivation\u002F) in the documentation.\n\n\n## ⭐ Key features\n\n\u003Ctable>\n    \u003Cthead>\n        \u003Ctr style=\"text-align: left; opacity: 0.8;\">\n            \u003Cth style=\"width: 75%\">Built-in abstractions\u003C\u002Fth>\n            \u003Cth style=\"width: 10%\">\u003Cstrong>ContextGem\u003C\u002Fstrong>\u003C\u002Fth>\n            \u003Cth style=\"width: 15%\">Other LLM frameworks*\u003C\u002Fth>\n        \u003C\u002Ftr>\n    \u003C\u002Fthead>\n    \u003Ctbody>\n        \u003Ctr>\n            \u003Ctd>\n                Automated dynamic prompts\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Automated data modelling and validators\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Precise granular reference mapping (paragraphs & sentences)\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Justifications (reasoning backing the extraction)\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Neural segmentation (using wtpsplit's SaT models)\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Multilingual support (I\u002FO without prompting)\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Single, unified extraction pipeline (declarative, reusable, fully serializable)\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Grouped LLMs with role-specific tasks\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Nested context extraction\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Unified, fully serializable results storage model (document)\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Extraction task calibration with examples\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Built-in concurrent I\u002FO processing\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Automated usage & costs tracking\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Fallback and retry logic\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                Multiple LLM providers\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n        \u003C\u002Ftr>\n    \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n🟢 - fully supported - no additional setup required\u003Cbr>\n🟡 - partially supported - requires additional setup\u003Cbr>\n◯ - not supported - requires custom logic\n\n\\* See [descriptions](https:\u002F\u002Fcontextgem.dev\u002Fmotivation\u002F#the-contextgem-solution) of ContextGem abstractions and [comparisons](https:\u002F\u002Fcontextgem.dev\u002Fvs_other_frameworks\u002F) of specific implementation examples using ContextGem and other popular open-source LLM frameworks.\n\n## 💡 What you can build\n\nWith **minimal code**, you can:\n\n- **Extract structured data** from documents (text, images)\n- **Identify and analyze key aspects** (topics, themes, categories) within documents ([learn more](https:\u002F\u002Fcontextgem.dev\u002Faspects\u002Faspects\u002F))\n- **Extract specific concepts** (entities, facts, conclusions, assessments) from documents ([learn more](https:\u002F\u002Fcontextgem.dev\u002Fconcepts\u002Fsupported_concepts\u002F))\n- **Build complex extraction workflows** through a simple, intuitive API\n- **Create multi-level extraction pipelines** (aspects containing concepts, hierarchical aspects)\n\n\u003Cbr\u002F>\n\n![ContextGem extraction example](https:\u002F\u002Fcontextgem.dev\u002F_static\u002Freadme_code_snippet.png \"ContextGem extraction example\")\n\n\n## 📦 Installation\n\nUsing [uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) (recommended):\n\n```bash\nuv add contextgem\n```\n\nOr using pip:\n\n```bash\npip install -U contextgem\n```\n\n\n## 🚀 Quick start\n\nThe following example demonstrates how to use ContextGem to extract **anomalies** from a legal document - a complex concept that requires contextual understanding. Unlike traditional RAG approaches that might miss subtle inconsistencies, ContextGem analyzes the entire document context to identify content that doesn't belong, complete with source references and justifications.\n\n```python\n# Quick Start Example - Extracting anomalies from a document, with source references and justifications\n\nimport os\n\nfrom contextgem import Document, DocumentLLM, StringConcept\n\n\n# Sample document text (shortened for brevity)\ndoc = Document(\n    raw_text=(\n        \"Consultancy Agreement\\n\"\n        \"This agreement between Company A (Supplier) and Company B (Customer)...\\n\"\n        \"The term of the agreement is 1 year from the Effective Date...\\n\"\n        \"The Supplier shall provide consultancy services as described in Annex 2...\\n\"\n        \"The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\\n\"\n        \"The purple elephant danced gracefully on the moon while eating ice cream.\\n\"  # 💎 anomaly\n        \"Time-traveling dinosaurs will review all deliverables before acceptance.\\n\"  # 💎 another anomaly\n        \"This agreement is governed by the laws of Norway...\\n\"\n    ),\n)\n\n# Attach a document-level concept\ndoc.concepts = [\n    StringConcept(\n        name=\"Anomalies\",  # in longer contexts, this concept is hard to capture with RAG\n        description=\"Anomalies in the document\",\n        add_references=True,\n        reference_depth=\"sentences\",\n        add_justifications=True,\n        justification_depth=\"brief\",\n        # see the docs for more configuration options\n    )\n    # add more concepts to the document, if needed\n    # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.\n]\n# Or use `doc.add_concepts([...])`\n\n# Define an LLM for extracting information from the document\nllm = DocumentLLM(\n    model=\"openai\u002Fgpt-4o-mini\",  # or another provider\u002FLLM\n    api_key=os.environ.get(\n        \"CONTEXTGEM_OPENAI_API_KEY\"\n    ),  # your API key for the LLM provider\n    # see the docs for more configuration options\n)\n\n# Extract information from the document\ndoc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`\n\n# Access extracted information in the document object\nanomalies_concept = doc.concepts[0]\n# or `doc.get_concept_by_name(\"Anomalies\")`\nfor item in anomalies_concept.extracted_items:\n    print(\"Anomaly:\")\n    print(f\"  {item.value}\")\n    print(\"Justification:\")\n    print(f\"  {item.justification}\")\n    print(\"Reference paragraphs:\")\n    for p in item.reference_paragraphs:\n        print(f\"  - {p.raw_text}\")\n    print(\"Reference sentences:\")\n    for s in item.reference_sentences:\n        print(f\"  - {s.raw_text}\")\n    print()\n\n```\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002Fdev\u002Fnotebooks\u002Freadme\u002Fquickstart_concept.ipynb)\n\n---\n\n\n## 🧠 How it works\n\n### 📝 Step 1: Define extraction context\n\n\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth width=\"100%\" align=\"left\">📄 \u003Cstrong>Document\u003C\u002Fstrong>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\n\u003Ctr>\n\u003Ctd>Create a Document that contains text and\u002For visual content representing your document (contract, invoice, report, CV, etc.), from which an LLM extracts information (aspects and\u002For concepts). \u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Fdocuments\u002Fdocument_config\u002F\">Learn more\u003C\u002Fa>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n```python\ndocument = Document(raw_text=\"Non-Disclosure Agreement...\")\n```\n\n### 🎯 Step 2: Define what to extract\n\n\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth width=\"50%\" align=\"left\">🔍 \u003Cstrong>Aspects\u003C\u002Fstrong>\u003C\u002Fth>\n\u003Cth width=\"50%\" align=\"left\">💡 \u003Cstrong>Concepts\u003C\u002Fstrong>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\n\u003Ctr>\n\u003Ctd>Define Aspects to extract text segments from the document (sections, topics, themes). You can organize content hierarchically and combine with concepts for comprehensive analysis. \u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Faspects\u002Faspects\u002F\">Learn more\u003C\u002Fa>\u003C\u002Ftd>\n\u003Ctd>Define Concepts to extract specific data points with intelligent inference: entities, insights, structured objects, classifications, numerical calculations, dates, ratings, and assessments. \u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Fconcepts\u002Fsupported_concepts\u002F\">Learn more\u003C\u002Fa>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n```python\n# Extract document sections\naspect = Aspect(\n    name=\"Term and termination\",\n    description=\"Clauses on contract term and termination\",\n)\n# Extract specific data points\nconcept = BooleanConcept(\n    name=\"NDA check\",\n    description=\"Is the contract an NDA?\",\n)\n# Add these to the document instance for further extraction\ndocument.add_aspects([aspect])\ndocument.add_concepts([concept])\n```\n\n\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth width=\"100%\" align=\"left\">🔄 \u003Ci>Alternative\u003C\u002Fi>: Configure  \u003Cstrong>Extraction Pipeline\u003C\u002Fstrong>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\n\u003Ctr>\n\u003Ctd>Create a reusable collection of predefined aspects and concepts that enables consistent extraction across multiple documents. \u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Fpipelines\u002Fextraction_pipelines\u002F\">Learn more\u003C\u002Fa>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n### 🧠 Step 3: Run LLM extraction\n\n\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth width=\"50%\" align=\"left\">🤖 \u003Cstrong>LLM\u003C\u002Fstrong>\u003C\u002Fth>\n\u003Cth width=\"50%\" align=\"left\">🤖🤖 \u003Ci>Alternative\u003C\u002Fi>: \u003Cstrong>LLM Group (advanced)\u003C\u002Fstrong>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\n\u003Ctr>\n\u003Ctd>Configure a cloud or local LLM that will extract aspects and\u002For concepts from the document. DocumentLLM supports fallback models and role-based task routing for optimal performance. \u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Fllms\u002Fllm_extraction_methods\u002F\">Learn more\u003C\u002Fa>\u003C\u002Ftd>\n\u003Ctd>Configure a group of LLMs with unique roles for complex extraction workflows. You can route different aspects and\u002For concepts to specialized LLMs (e.g., simple extraction vs. reasoning tasks). \u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Fllms\u002Fllm_config\u002F#llm-groups\">Learn more\u003C\u002Fa>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n```python\nllm = DocumentLLM(\n    model=\"openai\u002Fgpt-5-mini\",  # or another provider\u002FLLM\n    api_key=\"...\",\n)\ndocument = llm.extract_all(document)\n# print(document.aspects[0].extracted_items)\n# print(document.concepts[0].extracted_items)\n```\n\n📖 Learn more about ContextGem's [core components](https:\u002F\u002Fcontextgem.dev\u002Fhow_it_works\u002F) and their practical examples in the documentation.\n\n## 📚 Usage Examples\n\n🌟 **Basic usage:**\n- [Aspect Extraction from Document](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#aspect-extraction-from-document)\n- [Extracting Aspect with Sub-Aspects](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#extracting-aspect-with-sub-aspects)\n- [Concept Extraction from Aspect](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#concept-extraction-from-aspect)\n- [Concept Extraction from Document (text)](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#concept-extraction-from-document-text)\n- [Concept Extraction from Document (vision)](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#concept-extraction-from-document-vision)\n- [LLM chat interface with tool calling](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#lightweight-llm-chat-interface)\n\n🚀 **Advanced usage:**\n- [Extracting Aspects Containing Concepts](https:\u002F\u002Fcontextgem.dev\u002Fadvanced_usage\u002F#extracting-aspects-with-concepts)\n- [Extracting Aspects and Concepts from a Document](https:\u002F\u002Fcontextgem.dev\u002Fadvanced_usage\u002F#extracting-aspects-and-concepts-from-a-document)\n- [Using a Multi-LLM Pipeline to Extract Data from Several Documents](https:\u002F\u002Fcontextgem.dev\u002Fadvanced_usage\u002F#using-a-multi-llm-pipeline-to-extract-data-from-several-documents)\n\n\n## 🔄 Document converters\n\nTo create a ContextGem document for LLM analysis, you can either pass raw text directly, or use built-in converters that handle various file formats.\n\n### 📄 DOCX converter\n\n ContextGem provides a built-in converter to easily transform DOCX files into LLM-ready data.\n\n- **Comprehensive extraction of document elements**: paragraphs, headings, lists, tables, comments, footnotes, textboxes, headers\u002Ffooters, links, embedded images, and inline formatting\n- **Document structure preservation** with rich metadata for improved LLM analysis\n- **Built-in converter** that directly processes Word XML\n\n> 🚀 **Performance improvement in v0.17.1**: DOCX converter now converts files **~2X faster**.\n\n```python\n# Using ContextGem's DocxConverter\n\nfrom contextgem import DocxConverter\n\n\nconverter = DocxConverter()\n\n# Convert a DOCX file to an LLM-ready ContextGem Document\n# from path\ndocument = converter.convert(\"path\u002Fto\u002Fdocument.docx\")\n# or from file object\nwith open(\"path\u002Fto\u002Fdocument.docx\", \"rb\") as docx_file_object:\n    document = converter.convert(docx_file_object)\n\n# Perform data extraction on the resulting Document object\n# document.add_aspects(...)\n# document.add_concepts(...)\n# llm.extract_all(document)\n\n# You can also use DocxConverter instance as a standalone text extractor\ndocx_text = converter.convert_to_text_format(\n    \"path\u002Fto\u002Fdocument.docx\",\n    output_format=\"markdown\",  # or \"raw\"\n)\n\n```\n\n📖 Learn more about [DOCX converter features](https:\u002F\u002Fcontextgem.dev\u002Fconverters\u002Fdocx\u002F) in the documentation.\n\n\n## 🎯 Focused document analysis\n\nContextGem leverages LLMs' long context windows to deliver superior extraction accuracy from individual documents. Unlike RAG approaches that often [struggle with complex concepts and nuanced insights](https:\u002F\u002Fwww.linkedin.com\u002Fpulse\u002Fraging-contracts-pitfalls-rag-contract-review-shcherbak-ai-ptg3f), ContextGem capitalizes on continuously expanding context capacity, evolving LLM capabilities, and decreasing costs. This focused approach enables direct information extraction from complete documents, eliminating retrieval inconsistencies while optimizing for in-depth single-document analysis. While this delivers higher accuracy for individual documents, ContextGem does not currently support cross-document querying or corpus-wide retrieval - for these use cases, modern RAG frameworks (e.g., LlamaIndex, Haystack) remain more appropriate.\n\n📖 Read more on [how ContextGem works](https:\u002F\u002Fcontextgem.dev\u002Fhow_it_works\u002F) in the documentation.\n\n## 🤖 Supported LLMs\n\nContextGem supports both cloud-based and local LLMs through [LiteLLM](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm) integration:\n- **Cloud LLMs**: OpenAI, Anthropic, Google, Azure OpenAI, xAI, and more\n- **Local LLMs**: Run models locally using providers like Ollama, LM Studio, etc.\n- **Model Architectures**: Works with both reasoning\u002FCoT-capable (e.g. gpt-5) and non-reasoning models (e.g. gpt-4.1)\n- **Simple API**: Unified interface for all LLMs with easy provider switching\n\n> **💡 Model Selection Note:** For reliable structured extraction, we recommend using models with performance equivalent to or exceeding `gpt-4o-mini`. Smaller models (such as 8B parameter models) may struggle with ContextGem's detailed extraction instructions. If you encounter issues with smaller models, see our [troubleshooting guide](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_small_llm_troubleshooting\u002F) for potential solutions.\n\n📖 Learn more about [supported LLM providers and models](https:\u002F\u002Fcontextgem.dev\u002Fllms\u002Fsupported_llms\u002F), how to [configure LLMs](https:\u002F\u002Fcontextgem.dev\u002Fllms\u002Fllm_config\u002F), and [LLM extraction methods](https:\u002F\u002Fcontextgem.dev\u002Fllms\u002Fllm_extraction_methods\u002F) in the documentation.\n\n## ⚡ Optimizations\n\nContextGem documentation offers guidance on optimization strategies to maximize performance, minimize costs, and enhance extraction accuracy:\n\n- [Optimizing for Accuracy](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_accuracy\u002F)\n- [Optimizing for Speed](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_speed\u002F)\n- [Optimizing for Cost](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_cost\u002F)\n- [Dealing with Long Documents](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_long_docs\u002F)\n- [Choosing the Right LLM(s)](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_choosing_llm\u002F)\n- [Troubleshooting Issues with Small Models](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_small_llm_troubleshooting\u002F)\n\n\n## 💾 Serializing results\n\nContextGem allows you to save and load Document objects, pipelines, and LLM configurations with built-in serialization methods:\n\n- Save processed documents to avoid repeating expensive LLM calls\n- Transfer extraction results between systems\n- Persist pipeline and LLM configurations for later reuse\n\n📖 Learn more about [serialization options](https:\u002F\u002Fcontextgem.dev\u002Fserialization\u002F) in the documentation.\n\n\n## 📚 Documentation\n\n📖 **Full documentation:** [contextgem.dev](https:\u002F\u002Fcontextgem.dev)\n\n> **⚠️ Official Documentation Notice:** [https:\u002F\u002Fcontextgem.dev\u002F](https:\u002F\u002Fcontextgem.dev\u002F) is the only official source of ContextGem documentation. Please be aware of unauthorized copies or mirrors that may contain outdated or incorrect information.\n\n🤖 **AI-powered code exploration:** [DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fshcherbak-ai\u002Fcontextgem) provides visual architecture maps and natural language Q&A for the codebase.\n\n📈 **Change history:** See the [CHANGELOG](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FCHANGELOG.md) for version history, improvements, and bug fixes.\n\n## 💬 Community\n\n🐛 **Found a bug or have a feature request?** [Open an issue](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fissues\u002Fnew) on GitHub.\n\n💭 **Need help or want to discuss?** Start a thread in [GitHub Discussions](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fdiscussions\u002Fnew\u002F).\n\n## 🤝 Contributing\n\nWe welcome contributions from the community - whether it's fixing a typo or developing a completely new feature!\n\n📋 **Get started:** Check out our [Contributor Guidelines](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FCONTRIBUTING.md).\n\n## 🤖 AI Agent-Friendly\n\nThis repository follows modern development practices with first-class support for AI coding assistants:\n\n- **[AGENTS.md](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FAGENTS.md)** - Guidelines for AI assistants working with this codebase ([agents.md standard](https:\u002F\u002Fagents.md))\n- **[CLAUDE.md](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FCLAUDE.md)** - Configuration for [Claude Code](https:\u002F\u002Fdocs.anthropic.com\u002Fen\u002Fdocs\u002Fclaude-code\u002Foverview)\n\nThese files provide AI assistants with project-specific context, coding conventions, architecture patterns, and contribution workflows - enabling more accurate and consistent AI-assisted development.\n\n## 🔐 Security\n\nThis project is automatically scanned for security vulnerabilities using multiple security tools:\n\n- **[CodeQL](https:\u002F\u002Fcodeql.github.com\u002F)** - GitHub's semantic code analysis engine for vulnerability detection\n- **[Bandit](https:\u002F\u002Fgithub.com\u002FPyCQA\u002Fbandit)** - Python security linter for common security issues  \n- **[Snyk](https:\u002F\u002Fsnyk.io)** - Dependency vulnerability monitoring (used as needed)\n\n🛡️ **Security policy:** See [SECURITY](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FSECURITY.md) file for details.\n\n## 💖 Acknowledgements\n\nContextGem relies on these excellent open-source packages:\n\n- [aiolimiter](https:\u002F\u002Fgithub.com\u002Fmjpieters\u002Faiolimiter): Powerful rate limiting for async operations\n- [colorlog](https:\u002F\u002Fgithub.com\u002Fborntyping\u002Fpython-colorlog): Colored formatter for Python's logging module\n- [docstring-parser](https:\u002F\u002Fgithub.com\u002Frr-\u002Fdocstring_parser): Docstring parsing for auto-generating tool schemas\n- [fastjsonschema](https:\u002F\u002Fgithub.com\u002Fhorejsek\u002Fpython-fastjsonschema): Ultra-fast JSON schema validation\n- [genai-prices](https:\u002F\u002Fgithub.com\u002Fpydantic\u002Fgenai-prices): LLM pricing data and utilities (by Pydantic) to automatically estimate costs\n- [Jinja2](https:\u002F\u002Fgithub.com\u002Fpallets\u002Fjinja): Fast, expressive, extensible templating engine used for prompt rendering\n- [litellm](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm): Unified interface to multiple LLM providers with seamless provider switching\n- [lxml](https:\u002F\u002Fgithub.com\u002Flxml\u002Flxml): High-performance XML processing library for parsing DOCX document structure\n- [pillow](https:\u002F\u002Fgithub.com\u002Fpython-pillow\u002FPillow): Image processing library for local model image handling\n- [pydantic](https:\u002F\u002Fgithub.com\u002Fpydantic\u002Fpydantic): The gold standard for data validation\n- [python-ulid](https:\u002F\u002Fgithub.com\u002Fmdomke\u002Fpython-ulid): Efficient ULID generation for unique object identification\n- [tenacity](https:\u002F\u002Fgithub.com\u002Fjd\u002Ftenacity): General-purpose retry library for Python\n- [typing-extensions](https:\u002F\u002Fgithub.com\u002Fpython\u002Ftyping_extensions): Backports of the latest typing features for enhanced type annotations\n- [wtpsplit-lite](https:\u002F\u002Fgithub.com\u002Fsuperlinear-ai\u002Fwtpsplit-lite): Lightweight version of [wtpsplit](https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit) for state-of-the-art paragraph\u002Fsentence segmentation using wtpsplit's SaT models\n\n\n## 🌱 Support the project\n\nContextGem is just getting started, and your support means the world to us! \n\n⭐ **Star the project** if you find ContextGem useful  \n📢 **Share it** with others who might benefit  \n🔧 **Contribute** with feedback, issues, or code improvements\n\nYour engagement is what makes this project grow!\n\n\n## 📄 License & Contact\n\n**License:** Apache 2.0 License - see the [LICENSE](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FLICENSE) and [NOTICE](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FNOTICE) files for details.\n\n**Copyright:** © 2025 [Shcherbak AI AS](https:\u002F\u002Fshcherbak.ai) — Enterprise AI Engineering. We build AI agents that transform how enterprises operate.\n\n**Connect:** [LinkedIn](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fsergii-shcherbak-10068866\u002F) or [X](https:\u002F\u002Fx.com\u002Fseshch) for questions or collaboration ideas.\n\nBuilt with ❤️ in Oslo, Norway.\n\n## 📦 More from Shcherbak AI\n\n| Package | Description |\n|---------|-------------|\n| **[tethered](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Ftethered)** [![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftethered?logo=pypi&logoColor=gold&style=flat-square)](https:\u002F\u002Fpypi.org\u002Fproject\u002Ftethered\u002F) [![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-blue.svg?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Ftethered\u002Fblob\u002Fmain\u002FLICENSE) | Runtime network egress control for Python. One function call blocks all unauthorized outbound connections — zero dependencies, no infrastructure changes. Ideal for supply chain defense, AI agent guardrails, and test isolation. |\n","![ContextGem](https:\u002F\u002Fcontextgem.dev\u002F_static\u002Fcontextgem_readme_header.png \"ContextGem - 从文档中轻松提取 LLM 数据\")\n\n# ContextGem：从文档中轻松提取 LLM 数据\n\n|          |        |\n|----------|--------|\n| **包** | [![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fcontextgem?logo=pypi&label=PyPi&logoColor=gold&style=flat-square)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fcontextgem\u002F) [![PyPI Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fshcherbak-ai_contextgem_readme_7878541d54b6.png)](https:\u002F\u002Fpepy.tech\u002Fprojects\u002Fcontextgem) [![Python Versions](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-3.10%20%7C%203.11%20%7C%203.12%20%7C%203.13%20%7C%203.14-blue?logo=python&logoColor=gold&style=flat-square)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F) [![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache_2.0-blue?style=flat-square)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0) |\n| **质量** | [![tests](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fshcherbak-ai\u002Fcontextgem\u002Fci-tests.yml?branch=main&style=flat-square&label=tests)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Factions\u002Fworkflows\u002Fci-tests.yml) [![Coverage](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fgist.githubusercontent.com\u002FSergiiShcherbak\u002Fdaaee00e1dfff7a29ca10a922ec3becd\u002Fraw\u002Fcoverage.json&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Factions) [![CodeQL](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fshcherbak-ai\u002Fcontextgem\u002Fcodeql.yml?branch=main&style=flat-square&label=CodeQL)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Factions\u002Fworkflows\u002Fcodeql.yml) [![bandit security](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fshcherbak-ai\u002Fcontextgem\u002Fbandit-security.yml?branch=main&style=flat-square&label=bandit)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Factions\u002Fworkflows\u002Fbandit-security.yml) [![OpenSSF Best Practices](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fshcherbak-ai_contextgem_readme_50ccde67b228.png)](https:\u002F\u002Fwww.bestpractices.dev\u002Fprojects\u002F10489) |\n| **工具** | [![uv](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fraw.githubusercontent.com\u002Fastral-sh\u002Fuv\u002Fmain\u002Fassets\u002Fbadge\u002Fv0.json&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv) [![Ruff](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fraw.githubusercontent.com\u002Fastral-sh\u002Fruff\u002Fmain\u002Fassets\u002Fbadge\u002Fv2.json&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fruff) [![Pydantic v2](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fraw.githubusercontent.com\u002Fpydantic\u002Fpydantic\u002Fmain\u002Fdocs\u002Fbadge\u002Fv2.json&style=flat-square)](https:\u002F\u002Fpydantic.dev) [![pyright](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpyright-checked-blue?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fpyright) [![pre-commit](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpre--commit-enabled-blue?logo=pre-commit&logoColor=white&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fpre-commit\u002Fpre-commit) [![deptry](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdeptry-checked-blue?style=flat-square)](https:\u002F\u002Fgithub.com\u002Ffpgmaas\u002Fdeptry) [![egress: tethered](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fegress-tethered-orange?labelColor=4B8BBE)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Ftethered) [![Hatch project](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A5%9A-Hatch-4051b5.svg?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fpypa\u002Fhatch) |\n| **文档** | [![docs](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002Fshcherbak-ai\u002Fcontextgem\u002Fdocs.yml?branch=main&style=flat-square&label=docs)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Factions\u002Fworkflows\u002Fdocs.yml) [![documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-latest-blue?style=flat-square)](https:\u002F\u002Fshcherbak-ai.github.io\u002Fcontextgem\u002F) ![Docstring Coverage](https:\u002F\u002Fcontextgem.dev\u002F_static\u002Finterrogate-badge.svg) [![DeepWiki](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?label=DeepWiki&message=Chat%20with%20Code&labelColor=%23283593&color=%237E57C2&style=flat-square)](https:\u002F\u002Fdeepwiki.com\u002Fshcherbak-ai\u002Fcontextgem) |\n| **AI 代理** | [![AGENTS.md](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAGENTS.md-✓-blue?style=flat-square)](https:\u002F\u002Fagents.md) [![CLAUDE.md](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FCLAUDE.md-✓-blue?style=flat-square)](https:\u002F\u002Fdocs.anthropic.com\u002Fen\u002Fdocs\u002Fclaude-code\u002Fmemory#claudemd) |\n| **社区** | [![Contributor Covenant](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FContributor%20Covenant-2.1-4baaaa?style=flat-square)](CODE_OF_CONDUCT.md) [![GitHub issues closed](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-closed\u002Fshcherbak-ai\u002Fcontextgem?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fissues?q=is%3Aissue+is%3Aclosed) [![GitHub latest commit](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Flast-commit\u002Fshcherbak-ai\u002Fcontextgem?label=latest%20commit&style=flat-square)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fcommits\u002Fmain) |\n\n\u003Cdiv align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fshcherbak-ai_contextgem_readme_770dfa57d73c.png\" alt=\"ContextGem：本周第二产品\" width=\"250\">\n\u003C\u002Fdiv>\n\u003Cbr\u002F>\u003Cbr\u002F>\n\nContextGem 是一个免费、开源的 LLM（大型语言模型）框架，它通过最少的代码量，使从文档中提取结构化数据和见解变得极其容易。\n\n---\n\n## 💎 为什么选择 ContextGem？\n\n大多数流行的用于从文档中提取结构化数据的 LLM 框架都需要大量的样板代码才能提取基本信息。这显著增加了开发时间和复杂性。\n\nContextGem 通过提供一个灵活、直观的框架来解决这一挑战，能够以最小的努力从文档中提取结构化数据和见解。复杂且最耗时的部分由**强大的抽象**处理，消除了样板代码并减少了开发开销。\n\n📖 在文档中阅读更多关于项目 [动机](https:\u002F\u002Fcontextgem.dev\u002Fmotivation\u002F) 的信息。\n\n## ⭐ 核心功能\n\n\u003Ctable>\n    \u003Cthead>\n        \u003Ctr style=\"text-align: left; opacity: 0.8;\">\n            \u003Cth style=\"width: 75%\">内置抽象\u003C\u002Fth>\n            \u003Cth style=\"width: 10%\">\u003Cstrong>ContextGem\u003C\u002Fstrong>\u003C\u002Fth>\n            \u003Cth style=\"width: 15%\">其他 LLM 框架*\u003C\u002Fth>\n        \u003C\u002Ftr>\n    \u003C\u002Fthead>\n    \u003Ctbody>\n        \u003Ctr>\n            \u003Ctd>\n                自动化动态提示词\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                自动化数据建模和验证器\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                精确的细粒度引用映射（段落与句子）\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                理由说明（支撑提取的推理）\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                神经分割（使用 wtpsplit 的 SaT 模型）\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                多语言支持（无需提示词的输入\u002F输出）\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>◯\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                单一统一的提取管道（声明式、可复用、完全序列化）\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                具有特定角色任务的分组 LLM\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                嵌套上下文提取\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                统一、完全序列化的结果存储模型（文档）\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                带示例的提取任务校准\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                内置并发 I\u002FO 处理\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                自动化用量与成本追踪\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟡\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                降级与重试逻辑\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n        \u003C\u002Ftr>\n        \u003Ctr>\n            \u003Ctd>\n                多个 LLM 提供商\n            \u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n            \u003Ctd>🟢\u003C\u002Ftd>\n        \u003C\u002Ftr>\n    \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n🟢 - 完全支持 - 无需额外设置\u003Cbr>\n🟡 - 部分支持 - 需要额外设置\u003Cbr>\n◯ - 不支持 - 需要自定义逻辑\n\n\\* 请参阅 [描述](https:\u002F\u002Fcontextgem.dev\u002Fmotivation\u002F#the-contextgem-solution) 了解 ContextGem 抽象，以及使用 ContextGem 与其他流行开源 LLM 框架的具体实现示例的 [比较](https:\u002F\u002Fcontextgem.dev\u002Fvs_other_frameworks\u002F)。\n\n## 💡 你能构建什么\n\n通过**最少的代码**，你可以：\n\n- **从文档中提取结构化数据**（文本、图像）\n- **识别并分析文档内的关键方面**（主题、话题、类别）([了解更多](https:\u002F\u002Fcontextgem.dev\u002Faspects\u002Faspects\u002F))\n- **从文档中提取特定概念**（实体、事实、结论、评估）([了解更多](https:\u002F\u002Fcontextgem.dev\u002Fconcepts\u002Fsupported_concepts\u002F))\n- **通过简单直观的 API 构建复杂的提取工作流**\n- **创建多级提取管道**（包含概念的方面、分层方面）\n\n\u003Cbr\u002F>\n\n![ContextGem extraction example](https:\u002F\u002Fcontextgem.dev\u002F_static\u002Freadme_code_snippet.png \"ContextGem extraction example\")\n\n\n## 📦 安装\n\n使用 [uv](https:\u002F\u002Fgithub.com\u002Fastral-sh\u002Fuv)（推荐）：\n\n```bash\nuv add contextgem\n```\n\n或使用 pip：\n\n```bash\npip install -U contextgem\n```\n\n\n## 🚀 快速开始\n\n以下示例演示了如何使用 ContextGem 从法律文档中提取**异常**——这是一个需要上下文理解的复杂概念。与可能忽略细微不一致的传统 RAG（检索增强生成）方法不同，ContextGem 分析整个文档上下文来识别不相关的内容，并提供完整的来源引用和理由说明。\n\n```python\n# Quick Start Example - Extracting anomalies from a document, with source references and justifications\n\nimport os\n\nfrom contextgem import Document, DocumentLLM, StringConcept\n\n\n# Sample document text (shortened for brevity)\ndoc = Document(\n    raw_text=(\n        \"Consultancy Agreement\\n\"\n        \"This agreement between Company A (Supplier) and Company B (Customer)...\\n\"\n        \"The term of the agreement is 1 year from the Effective Date...\\n\"\n        \"The Supplier shall provide consultancy services as described in Annex 2...\\n\"\n        \"The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\\n\"\n        \"The purple elephant danced gracefully on the moon while eating ice cream.\\n\"  # 💎 anomaly\n        \"Time-traveling dinosaurs will review all deliverables before acceptance.\\n\"  # 💎 another anomaly\n        \"This agreement is governed by the laws of Norway...\\n\"\n    ),\n)\n\n# Attach a document-level concept\ndoc.concepts = [\n    StringConcept(\n        name=\"Anomalies\",  # in longer contexts, this concept is hard to capture with RAG\n        description=\"Anomalies in the document\",\n        add_references=True,\n        reference_depth=\"sentences\",\n        add_justifications=True,\n        justification_depth=\"brief\",\n        # see the docs for more configuration options\n    )\n    # add more concepts to the document, if needed\n    # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.\n]\n# Or use `doc.add_concepts([...])`\n\n# Define an LLM for extracting information from the document\nllm = DocumentLLM(\n    model=\"openai\u002Fgpt-4o-mini\",  # or another provider\u002FLLM\n    api_key=os.environ.get(\n        \"CONTEXTGEM_OPENAI_API_KEY\"\n    ),  # your API key for the LLM provider\n    # see the docs for more configuration options\n)\n\n# Extract information from the document\ndoc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`\n\n# Access extracted information in the document object\nanomalies_concept = doc.concepts[0]\n```\n\n# or `doc.get_concept_by_name(\"Anomalies\")`\nfor item in anomalies_concept.extracted_items:\n    print(\"Anomaly:\")\n    print(f\"  {item.value}\")\n    print(\"Justification:\")\n    print(f\"  {item.justification}\")\n    print(\"Reference paragraphs:\")\n    for p in item.reference_paragraphs:\n        print(f\"  - {p.raw_text}\")\n    print(\"Reference sentences:\")\n    for s in item.reference_sentences:\n        print(f\"  - {s.raw_text}\")\n    print()\n\n```\n[![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002Fdev\u002Fnotebooks\u002Freadme\u002Fquickstart_concept.ipynb)\n```\n\n---\n\n\n## 🧠 工作原理\n\n### 📝 步骤 1：定义提取上下文\n\n\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth width=\"100%\" align=\"left\">📄 \u003Cstrong>文档 (Document)\u003C\u002Fstrong>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\n\u003Ctr>\n\u003Ctd>创建一个包含文本和\u002F或视觉内容的 Document，代表您的文档（合同、发票、报告、简历等），LLM 从中提取信息（方面和\u002F或概念）。\u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Fdocuments\u002Fdocument_config\u002F\">了解更多\u003C\u002Fa>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n```python\ndocument = Document(raw_text=\"Non-Disclosure Agreement...\")\n```\n\n### 🎯 步骤 2：定义要提取的内容\n\n\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth width=\"50%\" align=\"left\">🔍 \u003Cstrong>方面 (Aspects)\u003C\u002Fstrong>\u003C\u002Fth>\n\u003Cth width=\"50%\" align=\"left\">💡 \u003Cstrong>概念 (Concepts)\u003C\u002Fstrong>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\n\u003Ctr>\n\u003Ctd>定义 Aspects 以从文档中提取文本片段（章节、主题、主题）。您可以分层组织内容并结合 Concept 进行全面分析。\u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Faspects\u002Faspects\u002F\">了解更多\u003C\u002Fa>\u003C\u002Ftd>\n\u003Ctd>定义 Concept 以提取特定数据点，具备智能推理能力：实体、见解、结构化对象、分类、数值计算、日期、评级和评估。\u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Fconcepts\u002Fsupported_concepts\u002F\">了解更多\u003C\u002Fa>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n```python\n# Extract document sections\naspect = Aspect(\n    name=\"Term and termination\",\n    description=\"Clauses on contract term and termination\",\n)\n# Extract specific data points\nconcept = BooleanConcept(\n    name=\"NDA check\",\n    description=\"Is the contract an NDA?\",\n)\n# Add these to the document instance for further extraction\ndocument.add_aspects([aspect])\ndocument.add_concepts([concept])\n```\n\n\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth width=\"100%\" align=\"left\">🔄 \u003Ci>替代方案\u003C\u002Fi>：配置 \u003Cstrong>提取管道 (Extraction Pipeline)\u003C\u002Fstrong>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\n\u003Ctr>\n\u003Ctd>创建可重用的预定义 Aspect 和 Concept 集合，以便在多个文档之间实现一致的提取。\u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Fpipelines\u002Fextraction_pipelines\u002F\">了解更多\u003C\u002Fa>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n### 🧠 步骤 3：运行 LLM 提取\n\n\u003Ctable>\n\u003Cthead>\n\u003Ctr>\n\u003Cth width=\"50%\" align=\"left\">🤖 \u003Cstrong>大语言模型 (LLM)\u003C\u002Fstrong>\u003C\u002Fth>\n\u003Cth width=\"50%\" align=\"left\">🤖🤖 \u003Ci>替代方案\u003C\u002Fi>：\u003Cstrong>LLM 组 (高级)\u003C\u002Fstrong>\u003C\u002Fth>\n\u003C\u002Ftr>\n\u003C\u002Fthead>\n\u003Ctbody>\n\u003Ctr>\n\u003Ctd>配置云或本地 LLM，用于从文档中提取方面和\u002F或概念。DocumentLLM 支持回退模型和基于角色的任务路由以实现最佳性能。\u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Fllms\u002Fllm_extraction_methods\u002F\">了解更多\u003C\u002Fa>\u003C\u002Ftd>\n\u003Ctd>配置一组具有独特角色的 LLM，用于复杂的提取工作流。您可以将不同的方面和\u002F或概念路由到专门的 LLM（例如，简单提取与推理任务）。\u003Ca href=\"https:\u002F\u002Fcontextgem.dev\u002Fllms\u002Fllm_config\u002F#llm-groups\">了解更多\u003C\u002Fa>\u003C\u002Ftd>\n\u003C\u002Ftr>\n\u003C\u002Ftbody>\n\u003C\u002Ftable>\n\n```python\nllm = DocumentLLM(\n    model=\"openai\u002Fgpt-5-mini\",  # or another provider\u002FLLM\n    api_key=\"...\",\n)\ndocument = llm.extract_all(document)\n# print(document.aspects[0].extracted_items)\n# print(document.concepts[0].extracted_items)\n```\n\n📖 了解更多关于 ContextGem 的 [核心组件](https:\u002F\u002Fcontextgem.dev\u002Fhow_it_works\u002F) 及其在实际中的示例。\n\n## 📚 使用示例\n\n🌟 **基本用法：**\n- [从文档提取方面](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#aspect-extraction-from-document)\n- [提取带有子方面的方面](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#extracting-aspect-with-sub-aspects)\n- [从方面提取概念](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#concept-extraction-from-aspect)\n- [从文档提取概念（文本）](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#concept-extraction-from-document-text)\n- [从文档提取概念（视觉）](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#concept-extraction-from-document-vision)\n- [带工具调用的 LLM 聊天界面](https:\u002F\u002Fcontextgem.dev\u002Fquickstart\u002F#lightweight-llm-chat-interface)\n\n🚀 **高级用法：**\n- [提取包含概念的方面](https:\u002F\u002Fcontextgem.dev\u002Fadvanced_usage\u002F#extracting-aspects-with-concepts)\n- [从文档提取方面和概念](https:\u002F\u002Fcontextgem.dev\u002Fadvanced_usage\u002F#extracting-aspects-and-concepts-from-a-document)\n- [使用多 LLM 管道从多个文档提取数据](https:\u002F\u002Fcontextgem.dev\u002Fadvanced_usage\u002F#using-a-multi-llm-pipeline-to-extract-data-from-several-documents)\n\n\n## 🔄 文档转换器\n\n要为 LLM 分析创建 ContextGem 文档，您可以直接传递原始文本，或使用内置转换器处理各种文件格式。\n\n### 📄 DOCX 转换器\n\nContextGem 提供了一个内置转换器，可以轻松地将 DOCX 文件转换为 LLM 就绪的数据。\n\n- **全面提取文档元素**：段落、标题、列表、表格、注释、脚注、文本框、页眉\u002F页脚、链接、嵌入图像和内联格式\n- **保留文档结构**，并附带丰富元数据以改进 LLM 分析\n- **内置转换器**，直接处理 Word XML\n\n> 🚀 **v0.17.1 版本性能提升**：DOCX 转换器现在转换速度**快约 2 倍**。\n\n```python\n# Using ContextGem's DocxConverter\n\nfrom contextgem import DocxConverter\n\n\nconverter = DocxConverter()\n\n# Convert a DOCX file to an LLM-ready ContextGem Document\n# from path\ndocument = converter.convert(\"path\u002Fto\u002Fdocument.docx\")\n# or from file object\nwith open(\"path\u002Fto\u002Fdocument.docx\", \"rb\") as docx_file_object:\n    document = converter.convert(docx_file_object)\n\n# Perform data extraction on the resulting Document object\n# document.add_aspects(...)\n# document.add_concepts(...)\n# llm.extract_all(document)\n\n# You can also use DocxConverter instance as a standalone text extractor\ndocx_text = converter.convert_to_text_format(\n    \"path\u002Fto\u002Fdocument.docx\",\n    output_format=\"markdown\",  # or \"raw\"\n)\n\n```\n\n📖 在文档中了解更多关于 [DOCX 转换器功能](https:\u002F\u002Fcontextgem.dev\u002Fconverters\u002Fdocx\u002F) 的信息。\n\n## 🎯 聚焦文档分析\n\nContextGem 利用 LLM（大语言模型）的长上下文窗口，从单个文档中提供卓越的提取精度。与经常 [难以处理复杂概念和细微见解](https:\u002F\u002Fwww.linkedin.com\u002Fpulse\u002Fraging-contracts-pitfalls-rag-contract-review-shcherbak-ai-ptg3f) 的 RAG（检索增强生成）方法不同，ContextGem 利用了不断扩展的上下文容量、进化的 LLM 能力和下降的成本。这种聚焦方法实现了从完整文档中直接提取信息，消除了检索不一致性，同时优化了深入的单文档分析。虽然这为单个文档提供了更高的准确性，但 ContextGem 目前尚不支持跨文档查询或语料库范围的检索——对于这些用例，现代 RAG 框架（例如 LlamaIndex, Haystack）仍然更为合适。\n\n📖 在文档中阅读更多关于 [ContextGem 如何工作](https:\u002F\u002Fcontextgem.dev\u002Fhow_it_works\u002F) 的内容。\n\n## 🤖 支持的 LLM\n\nContextGem 通过 [LiteLLM](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm) 集成支持基于云和本地的 LLM：\n- **云端 LLM**：OpenAI, Anthropic, Google, Azure OpenAI, xAI 等\n- **本地 LLM**：使用 Ollama, LM Studio 等提供商在本地运行模型\n- **模型架构**：适用于具备推理\u002FCoT（思维链）能力（如 gpt-5）和非推理模型（如 gpt-4.1）\n- **简单 API**：所有 LLM 的统一接口，轻松切换提供商\n\n> **💡 模型选择说明：** 为了可靠的结构化提取，我们建议使用性能相当于或超过 `gpt-4o-mini` 的模型。较小的模型（如 8B 参数模型）可能在 ContextGem 的详细提取指令上遇到困难。如果您在使用较小模型时遇到问题，请参阅我们的 [故障排除指南](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_small_llm_troubleshooting\u002F) 以获取潜在解决方案。\n\n📖 在文档中了解更多关于 [支持的 LLM 提供商和模型](https:\u002F\u002Fcontextgem.dev\u002Fllms\u002Fsupported_llms\u002F)，如何 [配置 LLM](https:\u002F\u002Fcontextgem.dev\u002Fllms\u002Fllm_config\u002F)，以及 [LLM 提取方法](https:\u002F\u002Fcontextgem.dev\u002Fllms\u002Fllm_extraction_methods\u002F) 的信息。\n\n## ⚡ 优化\n\nContextGem 文档提供了关于优化策略的指导，以最大化性能、最小化成本并提高提取精度：\n\n- [针对精度的优化](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_accuracy\u002F)\n- [针对速度的优化](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_speed\u002F)\n- [针对成本的优化](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_cost\u002F)\n- [处理长文档](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_long_docs\u002F)\n- [选择合适的 LLM](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_choosing_llm\u002F)\n- [小型模型的故障排除](https:\u002F\u002Fcontextgem.dev\u002Foptimizations\u002Foptimization_small_llm_troubleshooting\u002F)\n\n\n## 💾 序列化结果\n\nContextGem 允许您使用内置的 serialization（序列化）方法保存和加载 Document 对象、pipeline（管道）和 LLM 配置：\n\n- 保存已处理的文档以避免重复昂贵的 LLM 调用\n- 在不同系统之间传输提取结果\n- 持久化 pipeline 和 LLM 配置以便以后重用\n\n📖 在文档中了解更多关于 [serialization 选项](https:\u002F\u002Fcontextgem.dev\u002Fserialization\u002F) 的信息。\n\n\n## 📚 文档\n\n📖 **完整文档：** [contextgem.dev](https:\u002F\u002Fcontextgem.dev)\n\n> **⚠️ 官方文档通知：** [https:\u002F\u002Fcontextgem.dev\u002F](https:\u002F\u002Fcontextgem.dev\u002F) 是 ContextGem 文档的唯一官方来源。请注意可能存在包含过时或不正确信息的未经授权副本或镜像。\n\n🤖 **AI 驱动的代码探索：** [DeepWiki](https:\u002F\u002Fdeepwiki.com\u002Fshcherbak-ai\u002Fcontextgem) 为代码库提供可视化架构映射和自然语言问答。\n\n📈 **变更历史：** 请查看 [CHANGELOG](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FCHANGELOG.md) 以了解版本历史、改进和错误修复。\n\n## 💬 社区\n\n🐛 **发现错误或有任何功能请求？** 在 GitHub 上 [提交问题](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fissues\u002Fnew)。\n\n💭 **需要帮助或想要讨论？** 在 [GitHub Discussions](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fdiscussions\u002Fnew\u002F) 中开启一个线程。\n\n## 🤝 贡献\n\n我们欢迎社区的贡献——无论是修复拼写错误还是开发完全新功能！\n\n📋 **开始：** 查看我们的 [贡献者指南](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FCONTRIBUTING.md)。\n\n## 🤖 对 AI 智能体友好\n\n此仓库遵循现代开发实践，并提供对 AI 编码助手的一流支持：\n\n- **[AGENTS.md](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FAGENTS.md)** - 针对使用此代码库的 AI 助手的指南 ([agents.md 标准](https:\u002F\u002Fagents.md))\n- **[CLAUDE.md](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FCLAUDE.md)** - [Claude Code](https:\u002F\u002Fdocs.anthropic.com\u002Fen\u002Fdocs\u002Fclaude-code\u002Foverview) 的配置\n\n这些文件为 AI 助手提供特定于项目的上下文、编码约定、架构模式和贡献工作流程——从而实现更准确和一致的 AI 辅助开发。\n\n## 🔐 安全\n\n本项目使用多种安全工具自动扫描安全漏洞：\n\n- **[CodeQL](https:\u002F\u002Fcodeql.github.com\u002F)** - GitHub 的语义代码分析引擎，用于漏洞检测\n- **[Bandit](https:\u002F\u002Fgithub.com\u002FPyCQA\u002Fbandit)** - Python 安全 linter（代码检查器），用于常见安全问题  \n- **[Snyk](https:\u002F\u002Fsnyk.io)** - 依赖项漏洞监控（按需使用）\n\n🛡️ **安全策略：** 详细信息请参见 [SECURITY](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FSECURITY.md) 文件。\n\n## 💖 致谢\n\nContextGem 依赖于这些优秀的开源包：\n\n- [aiolimiter](https:\u002F\u002Fgithub.com\u002Fmjpieters\u002Faiolimiter): 用于异步操作的强大速率限制\n- [colorlog](https:\u002F\u002Fgithub.com\u002Fborntyping\u002Fpython-colorlog): Python 日志模块的彩色格式化器\n- [docstring-parser](https:\u002F\u002Fgithub.com\u002Frr-\u002Fdocstring_parser): 用于自动生成工具架构的文档字符串解析\n- [fastjsonschema](https:\u002F\u002Fgithub.com\u002Fhorejsek\u002Fpython-fastjsonschema): 超快速的 JSON 架构验证\n- [genai-prices](https:\u002F\u002Fgithub.com\u002Fpydantic\u002Fgenai-prices): LLM（大型语言模型）定价数据和实用工具（由 Pydantic 提供），用于自动估算成本\n- [Jinja2](https:\u002F\u002Fgithub.com\u002Fpallets\u002Fjinja): 用于提示词渲染的快速、表达力强且可扩展的模板引擎\n- [litellm](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm): 多个 LLM 提供商的统一接口，支持无缝切换提供商\n- [lxml](https:\u002F\u002Fgithub.com\u002Flxml\u002Flxml): 用于解析 DOCX 文档结构的高性能 XML 处理库\n- [pillow](https:\u002F\u002Fgithub.com\u002Fpython-pillow\u002FPillow): 用于本地模型图像处理的图像处理库\n- [pydantic](https:\u002F\u002Fgithub.com\u002Fpydantic\u002Fpydantic): 数据验证的黄金标准\n- [python-ulid](https:\u002F\u002Fgithub.com\u002Fmdomke\u002Fpython-ulid): 用于唯一对象标识的高效 ULID 生成\n- [tenacity](https:\u002F\u002Fgithub.com\u002Fjd\u002Ftenacity): Python 通用重试库\n- [typing-extensions](https:\u002F\u002Fgithub.com\u002Fpython\u002Ftyping_extensions): 最新类型特性的向后移植版本，用于增强类型注解\n- [wtpsplit-lite](https:\u002F\u002Fgithub.com\u002Fsuperlinear-ai\u002Fwtpsplit-lite): [wtpsplit](https:\u002F\u002Fgithub.com\u002Fsegment-any-text\u002Fwtpsplit) 的轻量级版本，使用 wtpsplit 的 SaT 模型进行最先进的段落\u002F句子分割\n\n\n## 🌱 支持项目\n\nContextGem 才刚刚开始，您的支持对我们意义重大！ \n\n⭐ 如果您觉得 ContextGem 有用，请**给项目加星**  \n📢 **分享**给可能受益的其他人  \n🔧 通过反馈、问题报告或代码改进来**贡献**\n\n您的参与是这个项目成长的关键！\n\n\n## 📄 许可证与联系方式\n\n**许可证：** Apache 2.0 许可证 - 详见 [LICENSE](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FLICENSE) 和 [NOTICE](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fblob\u002Fmain\u002FNOTICE) 文件。\n\n**版权：** © 2025 [Shcherbak AI AS](https:\u002F\u002Fshcherbak.ai) — 企业人工智能工程。我们构建改变企业运营方式的 AI 智能体。\n\n**联系：** 如有疑问或合作想法，请联系 [LinkedIn](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fsergii-shcherbak-10068866\u002F) 或 [X](https:\u002F\u002Fx.com\u002Fseshch)。\n\n在挪威奥斯陆用心打造。\n\n## 📦 Shcherbak AI 更多产品\n\n| 包名 | 描述 |\n|---------|-------------|\n| **[tethered](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Ftethered)** [![PyPI](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftethered?logo=pypi&logoColor=gold&style=flat-square)](https:\u002F\u002Fpypi.org\u002Fproject\u002Ftethered\u002F) [![License: MIT](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-blue.svg?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Ftethered\u002Fblob\u002Fmain\u002FLICENSE) | Python 运行时网络出口控制。一次函数调用即可阻止所有未经授权的出站连接——零依赖，无需更改基础设施。适用于供应链防御、AI 智能体护栏和测试隔离。 |","# ContextGem 快速上手指南\n\n**ContextGem** 是一个免费的开源 LLM 框架，旨在通过极简代码从文档中提取结构化数据和洞察。它提供了强大的抽象能力，消除了繁琐的样板代码，支持多语言、神经分割及自动引用映射等功能。\n\n## 环境准备\n\n- **Python 版本**: 3.10 及以上 (支持 3.10 | 3.11 | 3.12 | 3.13 | 3.14)\n- **依赖项**: 无额外系统级依赖，需配置 LLM 提供商的 API Key（如 OpenAI）\n- **网络**: 需要访问 PyPI 及 LLM 服务接口\n\n## 安装步骤\n\n推荐使用 `uv` 进行快速安装：\n\n```bash\nuv add contextgem\n```\n\n或者使用 `pip` 安装：\n\n```bash\npip install -U contextgem\n```\n\n## 基本使用\n\n以下示例演示如何从文档中提取异常信息（Anomalies），包含源引用和理由说明。\n\n```python\n# Quick Start Example - Extracting anomalies from a document, with source references and justifications\n\nimport os\n\nfrom contextgem import Document, DocumentLLM, StringConcept\n\n\n# Sample document text (shortened for brevity)\ndoc = Document(\n    raw_text=(\n        \"Consultancy Agreement\\n\"\n        \"This agreement between Company A (Supplier) and Company B (Customer)...\\n\"\n        \"The term of the agreement is 1 year from the Effective Date...\\n\"\n        \"The Supplier shall provide consultancy services as described in Annex 2...\\n\"\n        \"The Customer shall pay the Supplier within 30 calendar days of receiving an invoice...\\n\"\n        \"The purple elephant danced gracefully on the moon while eating ice cream.\\n\"  # 💎 anomaly\n        \"Time-traveling dinosaurs will review all deliverables before acceptance.\\n\"  # 💎 another anomaly\n        \"This agreement is governed by the laws of Norway...\\n\"\n    ),\n)\n\n# Attach a document-level concept\ndoc.concepts = [\n    StringConcept(\n        name=\"Anomalies\",  # in longer contexts, this concept is hard to capture with RAG\n        description=\"Anomalies in the document\",\n        add_references=True,\n        reference_depth=\"sentences\",\n        add_justifications=True,\n        justification_depth=\"brief\",\n        # see the docs for more configuration options\n    )\n    # add more concepts to the document, if needed\n    # see the docs for available concepts: StringConcept, JsonObjectConcept, etc.\n]\n# Or use `doc.add_concepts([...])`\n\n# Define an LLM for extracting information from the document\nllm = DocumentLLM(\n    model=\"openai\u002Fgpt-4o-mini\",  # or another provider\u002FLLM\n    api_key=os.environ.get(\n        \"CONTEXTGEM_OPENAI_API_KEY\"\n    ),  # your API key for the LLM provider\n    # see the docs for more configuration options\n)\n\n# Extract information from the document\ndoc = llm.extract_all(doc)  # or use async version `await llm.extract_all_async(doc)`\n\n# Access extracted information in the document object\nanomalies_concept = doc.concepts[0]\n```","某金融风控团队正在构建内部知识库问答系统，核心任务是从数百份历史审计报告中提取关键信息供大模型进行风险推理。\n\n### 没有 contextgem 时\n- 开发人员需手写复杂的 PDF 解析脚本，处理扫描件乱码及特殊字符极其耗时\n- 文档中的表格和图表被强制转成纯文本后丢失结构，严重影响模型理解能力\n- 不同格式的文档（DOCX、PDF、TXT）需要分别适配不同的清洗逻辑，维护困难\n- 上下文切片过于粗糙，导致检索增强生成（RAG）系统中的关键信息召回率偏低\n\n### 使用 contextgem 后\n- contextgem 能够统一处理多格式文档，自动剥离页眉页脚等无关噪声保留核心语义\n- 智能算法维持原文档的层级结构，确保长文本在输入模型时逻辑依然连贯清晰\n- 提供标准化的输出接口，可无缝对接现有的向量数据库和复杂提示词模板\n- 将原本需要数周的文档预处理工作压缩至几天，大幅降低了后续的工程维护成本\n\ncontextgem 通过自动化文档上下文提取，让开发者能专注于业务逻辑而非繁琐的数据清洗。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fshcherbak-ai_contextgem_f211f96d.png","shcherbak-ai","Shcherbak AI","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fshcherbak-ai_4266b5b3.png","AI engineering company developing tools for AI\u002FML\u002FNLP developers.",null,"sergii@shcherbak.ai","https:\u002F\u002Fwww.shcherbak.ai\u002F","https:\u002F\u002Fgithub.com\u002Fshcherbak-ai",[84,88,92,96],{"name":85,"color":86,"percentage":87},"Python","#3572A5",85.2,{"name":89,"color":90,"percentage":91},"Jupyter Notebook","#DA5B0B",11.2,{"name":93,"color":94,"percentage":95},"Jinja","#a52a22",3.4,{"name":97,"color":98,"percentage":99},"HTML","#e34c26",0.2,1823,150,"2026-04-04T07:48:39","Apache-2.0","未说明",{"notes":106,"python":107,"dependencies":108},"需配置外部 LLM 提供商 API Key（如 OpenAI）；推荐使用 uv 进行环境管理；支持多语言文档及结构化数据提取。","3.10 | 3.11 | 3.12 | 3.13 | 3.14",[109,110],"pydantic","wtpsplit",[51,14,26,15,13],[113,114,115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130],"ai","contract-analysis","data-extraction","document-intelligence","generative-ai","legaltech","llm","llm-extraction","llm-framework","llm-pipeline","llms","nlp","prompt-engineering","text-analysis","unstructured-data","docx","docx2md","docx2txt","2026-03-27T02:49:30.150509","2026-04-06T05:36:23.812896",[134,139,143,148,152,157],{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},3108,"如何导入和使用本地 SAT 模型？","从 v0.4.0 版本开始，ContextGem 已支持在 Document 类的 `sat_model_id` 参数中指定本地 SaT 模型路径。如果您使用的是旧版本，可能需要升级才能使用此功能。","https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fissues\u002F16",{"id":140,"question_zh":141,"answer_zh":142,"source_url":138},3109,"如何在提取结果中获取源引用（参考段落或句子）？","可以通过访问提取项的属性来获取引用：\n1. `aspect_or_concept.extracted_items[0].reference_paragraphs` 获取参考段落。\n2. 如果设置了 `reference_depth=\"sentences\"`，可访问 `reference_sentences`。\n注意：Aspect 类型始终启用源引用；Concept 类型需要在实例化时设置 `add_references=True`。这些是属性，不会直接显示在打印输出的摘要中。",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},3110,"安装时遇到 `mosestokenizer` 依赖错误或包元数据生成失败怎么办？","在 v0.5.0 及以上版本中，`mosestokenizer` 已不再是必需的依赖项（项目已迁移至轻量级的 `wtpsplit-lite`）。建议将 ContextGem 升级到最新版本以解决此问题。","https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fissues\u002F17",{"id":149,"question_zh":150,"answer_zh":151,"source_url":147},3111,"Windows 系统安装时报错 `UnicodeDecodeError: 'gbk' codec can't decode` 如何解决？","这是由于 Windows 默认编码为 GBK 导致的字符编码问题。推荐以下解决方案：\n**方案 1（推荐）**：在安装前设置环境变量\nCMD: `set PYTHONIOENCODING=utf-8` 然后 `pip install -U contextgem`\nPowerShell: `$env:PYTHONIOENCODING = \"utf-8\"` 然后 `pip install -U contextgem`\n**方案 2**：单独安装依赖时指定编码（可靠性较低）。",{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},3112,"ContextGem 是否支持 Python 3.14？","是的，Python 3.14 的支持已在 v0.21.0 版本中上线。请确保您的环境满足该版本要求，且所有依赖项也支持 Python 3.14。","https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fissues\u002F84",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},3113,"使用 Ollama 或 LiteLLM 进行概念提取时出现超时或无效 JSON 错误怎么办？","这通常是因为选用的模型太小（如 deepseek-r1:8b 或 phi3:3.8b），无法遵循复杂的内部提示指令。建议尝试选择参数量更大的模型来提高解析成功率和稳定性。","https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fissues\u002F34",[163,168,173,178,183,188,193,198,203,208,213,218,223,228,233,238,243,248,253,258],{"id":164,"version":165,"summary_zh":166,"released_at":167},102649,"v0.18.0","### Added\r\n- Chat: Added optional `chat_session` parameter (accepts a `ChatSession`) to preserve message history across turns in `DocumentLLM.chat()`. When this parameter is omitted, chat is single-turn, without message history.","2025-09-01T21:08:23",{"id":169,"version":170,"summary_zh":171,"released_at":172},102650,"v0.17.1","### Changed\r\n- `DocxConverter`: Conversion speed improved by ~2X, significantly reducing processing time for DOCX files.","2025-08-26T21:49:48",{"id":174,"version":175,"summary_zh":176,"released_at":177},102651,"v0.17.0","### Added\r\n- Multimodal LLM roles (`\"extractor_multimodal\"` and `\"reasoner_multimodal\"`) to support extraction of multimodal document-level concepts from both text and images. Previously, only text and vision roles were supported, requiring choosing either text or image context for extraction, not both.","2025-08-24T17:00:00",{"id":179,"version":180,"summary_zh":181,"released_at":182},102652,"v0.16.1","### Fixed\r\n- Added support for `\"minimal\"` reasoning effort for gpt-5 models.","2025-08-19T01:36:42",{"id":184,"version":185,"summary_zh":186,"released_at":187},102653,"v0.16.0","### Added\r\n- Reasoning-aware extraction prompts: Automatically enables private chain-of-thought guidance on models that support reasoning, yielding higher-quality outputs (no change for other models).","2025-08-19T00:13:13",{"id":189,"version":190,"summary_zh":191,"released_at":192},102654,"v0.15.0","### Added\r\n- Auto-pricing for LLMs: enable via `auto_pricing=True` to automatically estimate costs using pydantic's `genai-prices`; optional `auto_pricing_refresh=True` refreshes cached price data at runtime.\r\n\r\n### Refactor\r\n- Public API made more consistent and stable: user-facing classes are now thin, well-documented facades over internal implementations. No behavior changes.\r\n- Internal reorganization for maintainability and future-proofing.\r\n\r\n### Docs\r\n- Added guidance for configuring auto-pricing for LLMs.","2025-08-13T22:26:41",{"id":194,"version":195,"summary_zh":196,"released_at":197},102655,"v0.14.4","### Fixed\r\n- Suppressed noisy LiteLLM proxy missing-dependency error logs (prompting to install `litellm[proxy]`) emitted by `litellm>=1.75.2` during LLM API calls. ContextGem does not require LiteLLM proxy features. Suppression is scoped to LiteLLM loggers.","2025-08-08T19:25:28",{"id":199,"version":200,"summary_zh":201,"released_at":202},102656,"v0.14.3","### Fixed\r\n- Enabled `reasoning_effort` parameter for gpt-5 models by explicitly forwarding it via `allowed_openai_params`, since `litellm.get_supported_openai_params()` does not yet include this parameter for gpt-5 models.","2025-08-07T22:37:46",{"id":204,"version":205,"summary_zh":206,"released_at":207},102657,"v0.14.2","### Added\r\n- Added warning for `gpt-oss` models used with `lm_studio\u002F` provider due to performance issues (according to tests), with a recommendation to use Ollama as a working alternative (e.g., `ollama_chat\u002Fgpt-oss:20b`).","2025-08-06T18:05:27",{"id":209,"version":210,"summary_zh":211,"released_at":212},102658,"v0.14.1","### Added\r\n- Added step-by-step usage guide in README, with brief descriptions of core components.\r\n- Added new documentation on documents, extraction pipelines, and logging configuration.\r\n\r\n### Changed\r\n- Renamed `DocumentPipeline` to `ExtractionPipeline` to better reflect its purpose and scope. `DocumentPipeline` is maintained as a deprecated wrapper class for backwards compatibility until v1.0.0.\r\n- Simplified logging config to use a single environment variable.","2025-08-05T23:27:32",{"id":214,"version":215,"summary_zh":216,"released_at":217},102659,"v0.14.0","### Added\r\n- Added utility function `create_image()` for flexible image creation from various sources (file paths, PIL objects, file-like objects, raw bytes) with automatic MIME type detection.\r\n\r\n### Changed\r\n- Updated `image_to_base64()` utility function to accept more image source types (file-like objects and raw bytes) in addition to file paths.\r\n- Made `temperature` and `top_p` parameters for DocumentLLM optional.","2025-08-02T21:57:55",{"id":219,"version":220,"summary_zh":221,"released_at":222},102660,"v0.13.0","### Changed\r\n- Enhanced LLM prompts with XML tags for improved instruction clarity and higher-quality extraction outputs.\r\n- Updated LabelConcept documentation with clearer distinction between multi-label and multi-class classification types.\r\n\r\n### Fixed\r\n- Fixed a bug where LabelConcept with multi-class classification type did not always return a label, as expected for multi-class classification tasks.","2025-07-29T22:20:59",{"id":224,"version":225,"summary_zh":226,"released_at":227},102641,"v0.22.0","### Changed\r\n- Upgraded pinned dependency versions: `litellm==1.82.2`, `genai-prices==0.0.55`. Versions remain pinned to maintain stability and avoid occasional breaking changes and API inconsistencies observed in previous unpinned releases.\r\n\r\n### Deprecated\r\n- `DocxConverter` is deprecated and will be removed in v1.0.0. Use dedicated document conversion libraries (e.g., [Docling](https:\u002F\u002Fgithub.com\u002Fdocling-project\u002Fdocling), [MarkItDown](https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002Fmarkitdown)) to convert files to text, then pass the result to `Document(raw_text=...)`.\r\n\r\n### Removed\r\n- Removed `llms.txt` generation (build script, pre-commit hook, and Sphinx integration). Use `AGENTS.md` and `CLAUDE.md` for AI assistant context instead.","2026-03-16T00:13:33",{"id":229,"version":230,"summary_zh":231,"released_at":232},102642,"v0.21.0","### Added\r\n- Python 3.14 support.\r\n\r\n### Fixed\r\n- Fixed `Union.__getitem__()` incompatibility on Python 3.14, which caused `JsonObjectConcept` structure validation to fail when using complex union\u002Foptional type annotations (e.g., `Optional[Literal[...] | Literal[...]]`).\r\n- Fixed incomplete `types.UnionType` detection in type normalization and validation, ensuring union types created with the `|` operator (Python 3.10+) are handled consistently across Python 3.10–3.14.","2026-02-22T21:34:39",{"id":234,"version":235,"summary_zh":236,"released_at":237},102643,"v0.20.0","### Added\r\n- Auto-generate tool schemas from `@register_tool` decorated functions. Pass functions directly to `tools=[...]` — schemas are built automatically from type hints and docstrings. Explicit OpenAI-compatible schema dicts remain supported for full backward compatibility.\r\n- Added `docstring-parser` dependency for extracting tool parameter descriptions from Sphinx\u002FreST, Google, and NumPy style docstrings.\r\n\r\n### Fixed\r\n- Deterministic tool schema generation: `required` field ordering in auto-generated schemas is now sorted, preventing non-deterministic output from `frozenset` iteration across Python process invocations.\r\n- Pinned `onnxruntime\u003C1.24.0` for Python 3.10, as `onnxruntime` 1.24+ dropped Python 3.10 wheels. Python 3.11+ is unaffected and uses the latest version.\r\n\r\n### Changed\r\n- Upgraded pinned dependency versions: `litellm==1.81.14`, `openai==2.21.0`, `genai-prices==0.0.54`. Versions remain pinned to maintain stability and avoid occasional breaking changes and API inconsistencies observed in previous unpinned releases.\r\n\r\n### Docs\r\n- Added dedicated \"Chat with Tools\" documentation page with examples for auto-schema generation, supported type hints, `TypedDict` usage, custom schema overrides, and tool configuration options.\r\n- Simplified quickstart tools example using the new `@register_tool` function-passing syntax.\r\n- Updated `CONTRIBUTING.md` with AI assistant guidance and Fabric commands.","2026-02-22T19:16:27",{"id":239,"version":240,"summary_zh":241,"released_at":242},102644,"v0.19.4","### Fixed\r\n- Applied fix for missing quote in JSON example format within prompt template. (PR [#86](https:\u002F\u002Fgithub.com\u002Fshcherbak-ai\u002Fcontextgem\u002Fpull\u002F86))\r\n\r\n### Added\r\n- Added support for `reasoning_effort=\"xhigh\"` for gpt-5.2 models.\r\n\r\n### Changed\r\n- Upgraded pinned dependency versions: `litellm==1.80.10`, `openai==2.13.0`, `genai-prices==0.0.49`. Versions remain pinned to maintain stability and avoid occasional breaking changes and API inconsistencies observed in previous unpinned releases.","2025-12-19T03:10:13",{"id":244,"version":245,"summary_zh":246,"released_at":247},102645,"v0.19.3","### Changed\r\n- Upgraded pinned dependency versions: `litellm==1.80.0`, `openai==2.8.0`, `genai-prices==0.0.39`. Versions remain pinned to maintain stability and avoid occasional breaking changes and API inconsistencies observed in previous unpinned releases.\r\n\r\n### Note\r\n- The litellm 1.80.0 upgrade introduces Pydantic serialization warnings during async cleanup phases. These warnings are non-actionable and do not affect functionality. This is a known upstream issue tracked at [litellm PR #16299](https:\u002F\u002Fgithub.com\u002FBerriAI\u002Flitellm\u002Fpull\u002F16299). ContextGem suppresses these warnings within its execution contexts; warnings may appear only during cleanup phases after operations complete.","2025-11-16T20:54:18",{"id":249,"version":250,"summary_zh":251,"released_at":252},102646,"v0.19.2","### Fixed\r\n- Logging system refactored to use Python's standard library logging with namespaced logger (`contextgem`) for production-ready integration. Eliminates global state pollution, prevents conflicts with host application logging, and enables independent configuration. Replaced Loguru with colorlog for colored output.","2025-09-30T15:15:39",{"id":254,"version":255,"summary_zh":256,"released_at":257},102647,"v0.19.1","### Changed\r\n- Enhanced documentation with pretty URLs (removing `.html` extensions) for improved SEO and user experience\r\n- Added official documentation source notice to prevent confusion with unauthorized copies or mirrors","2025-09-19T17:35:32",{"id":259,"version":260,"summary_zh":261,"released_at":262},102648,"v0.19.0","### Added\r\n- Tool calling support in `DocumentLLM.chat(...)`.","2025-09-09T13:16:16"]