[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-thiswillbeyourgithub--wdoc":3,"tool-thiswillbeyourgithub--wdoc":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",144730,2,"2026-04-07T23:26:32",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 
效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":32,"env_os":94,"env_gpu":95,"env_ram":96,"env_deps":97,"category_tags":107,"github_topics":108,"view_count":32,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":118,"updated_at":119,"faqs":120,"releases":149},5291,"thiswillbeyourgithub\u002Fwdoc","wdoc","Summarize and query from a lot of heterogeneous documents. Any LLM provider, any filetype, advanced RAG, advanced summaries, scriptable, etc","wdoc 是一款强大的检索增强生成（RAG）系统，专为从海量异构文档中提炼摘要和精准问答而设计。它诞生于处理复杂多源信息的实际需求，旨在解决传统工具在面对音频、视频、PDF、电子书及 Anki 卡片等多种格式时，难以统一检索且容易产生“幻觉”回答的痛点。\n\n无论是需要跨学科文献研究的学生、处理大量病例记录的医疗从业者，还是希望构建私有知识库的开发者，wdoc 都能提供极大帮助。其核心优势在于“高召回率”与“精准溯源”：系统能同时检索数万个不同格式的文档，通过巧妙的语义分批聚合技术，结合高低成本大语言模型协作，最终输出带有确切来源引用的 Markdown 格式答案，让每一条信息都有据可查。\n\n此外，wdoc 具备极高的灵活性，支持接入几乎所有大模型服务商（包括本地部署模型），并提供基于 Gradio 的 Docker Web 界面，让用户无需编写代码即可轻松上手。它不仅是一个高效的查询工具，更是一个可扩展的开发库，甚至能作为插件集成到 Open-WebUI 等平台，是管理繁杂信息流的得力助手。","[![PyPI version](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fwdoc.svg)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fwdoc)\n\n# wdoc\n\n\u003Cp align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fthiswillbeyourgithub_wdoc_readme_ad45d36c7ca0.png\" width=\"512\" style=\"background-color: transparent !important\">\u003C\u002Fp>\n\n> *I'm wdoc. I solve RAG problems.*\n> - wdoc, imitating Winston \"The Wolf\" Wolf\n\n`wdoc` is a powerful RAG (Retrieval-Augmented Generation) system designed to summarize, search, and query documents across various file types. It's particularly useful for handling large volumes of diverse document types, making it ideal for researchers, students, and professionals dealing with extensive information sources.\n\nCreated by a psychiatry resident who needed a way to get a definitive answer from multiple sources at the same time (audio recordings, video lectures, [Anki flashcards](https:\u002F\u002Fapps.ankiweb.net\u002F), PDFs, EPUBs, etc.). 
`wdoc` was born from frustration with existing RAG solutions for querying and summarizing.\n\n*(The online documentation can be found [here](https:\u002F\u002Fwdoc.readthedocs.io\u002Fen\u002Fstable))*\n\n* **Goal and project specifications**: `wdoc`'s goal is to create **perfectly useful** summaries and **perfectly useful** sourced answers to questions on a heterogeneous corpus. It's capable of querying **tens of thousands** of documents across [various file types](#filetypes) at the same time. The project also includes an opinionated summary feature to help users efficiently keep up with large amounts of information. It uses mostly [LangChain](https:\u002F\u002Fpython.langchain.com\u002F) and [LiteLLM](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002F) as backends.\n\n* **Current status**: **usable, tested, still under active development, tens of planned features**\n    * I don't plan to stop reading anytime soon, so if you find it promising, stick around as I have many improvements planned (see the roadmap section).\n    * **I would greatly benefit from testing by users, as it's the quickest way for me to find the many minor quick-to-fix bugs.**\n    * The main branch is more stable than the dev branch, which in turn offers more features.\n    * Open to feature requests and pull requests. All feedback, including reports of typos, is highly appreciated.\n    * Please open an issue before making a PR, as there may be ongoing improvements in the pipeline.\n\n* **Key Features**:\n    * **Docker Web UI**: Easy deployment with a [Gradio-based web interface](.\u002Fdocker\u002FREADME.md) for simplified document processing without CLI interaction.\n    * **High recall and specificity**: it was made to find A LOT of documents using carefully designed embedding search, then gradually aggregate the answers using semantic batching to produce a single answer that mentions the sources pointing to the exact portion of the source document.\n        * Uses both an expensive and a cheap LLM to make recall as high as possible, because we can afford fetching a lot of documents per query (via embeddings)\n    * Supports **virtually any LLM provider**, including local ones, and even with extra layers of security for super secret stuff.\n    * Aims to **support *any* filetype** and query from all of them at the same time (**15+** are already implemented!)\n    * **Actually *useful* AI powered summary**: get the thought process of the author instead of nebulous takeaways.\n    * **Actually *useful* AI powered queries**: get a **sourced**, indented markdown answer to your questions instead of hallucinated nonsense.\n    * **Extensible**: this is both a tool and a library. It was even turned into [an Open-WebUI Tool](https:\u002F\u002Fopenwebui.com\u002Ft\u002Fqqqqqqqqqqqqqqqqqqqq\u002Fwdoctool). 
Also available as a [Docker web UI](.\u002Fdocker\u002FREADME.md) for easy deployment.\n    * **Web Search**: Preliminary web search support using [DuckDuckGo](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDuckDuckGo) (via the [ddgs](https:\u002F\u002Fpypi.org\u002Fproject\u002Fddgs\u002F) library)\n\n### Table of contents\n- [Comprehensive reference (SKILL.md)](#comprehensive-reference)\n- [Explanatory diagrams](#explanatory-diagrams)\n- [Ultra short guide for people in a hurry](#ultra-short-guide-for-people-in-a-hurry)\n- [Features](#features)\n  - [Tasks](#tasks)\n  - [Filetypes](#filetypes)\n  - [Walkthrough and examples](#walkthrough-and-examples)\n- [Getting started](#getting-started)\n  - [Direct installation](#direct-installation)\n  - [Experimental Docker Interface](#experimental-docker-interface)\n- [Scripts made with wdoc](#scripts-made-with-wdoc)\n- [FAQ](#faq)\n- [Roadmap](#roadmap)\n\n## Comprehensive reference\n\nA single-page comprehensive reference covering every CLI argument, environment variable, filetype, and the full Python API can be found in **[SKILL.md](.\u002FSKILL.md)**.\n\n## Explanatory diagrams\n\n\u003Cp float=\"left\" align=\"middle\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fthiswillbeyourgithub_wdoc_readme_c6a907a68177.png\" alt=\"Query task workflow diagram showing the flow from user inputs through Raphael the Rephraser, VectorStore, Eve the Evaluator, Anna the Answerer, and recursive combining to final output\" height=\"400\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fthiswillbeyourgithub_wdoc_readme_6918dc7f0065.png\" alt=\"Summary task workflow diagram showing the flow from user inputs through loading & chunking, Sam the Summarizer, concatenation to wdocSummary output\" height=\"400\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fthiswillbeyourgithub_wdoc_readme_890baf33afe3.png\" alt=\"Search task workflow diagram showing the flow from user inputs through Raphael the Rephraser, VectorStore, Eve the Evaluator to search output\" height=\"400\">\n\u003C\u002Fp>\n\n## Ultra short guide for people in a hurry\n\u003Cdetails>\n\u003Csummary>\nGive it to me, I am in a hurry!\n\u003C\u002Fsummary>\n\n**Note: a list of examples can be found in [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md)**\n\n**Quick Start with Docker**: If you want an experimental web UI, check out the [Docker deployment guide](.\u002Fdocker\u002FREADME.md).\n\nFirst, let's see how to *query* a pdf.\n\n``` zsh\nlink=\"https:\u002F\u002Fsituational-awareness.ai\u002Fwp-content\u002Fuploads\u002F2024\u002F06\u002Fsituationalawareness.pdf\"\n\nwdoc --path=$link --task=query --filetype=\"online_pdf\" --query=\"What does it say about alphago?\" --query_retrievers='basic_multiquery' --top_k=auto_200_500\n```\n* This will:\n    1. Parse what's in --path as a link to a pdf to download (otherwise the url could simply be a webpage; in most cases you can leave the filetype at its default 'auto', as heuristics are in place to detect the most appropriate parser).\n    2. Cut the text into chunks and create embeddings for each\n    3. Take the user query, create embeddings for it ('basic') AND ask the default LLM to generate alternative queries and embed those\n    4. Use those embeddings to search through all chunks of the text and get the 200 most appropriate documents\n    5. 
Pass each of those documents to the smaller LLM (default: openrouter\u002Fgoogle\u002Fgemini-2.5-flash) to tell us if the document seems appropriate given the user query.\n    6. If more than 90% of the 200 documents are appropriate, then we do another search with a higher top_k and repeat until documents start to be irrelevant OR we hit 500 documents.\n    7. Then each relevant doc is sent to the strong LLM (by default, openrouter\u002Fgoogle\u002Fgemini-3.1-pro-preview) to extract relevant info and give one answer per relevant document.\n    8. Then all those \"intermediate\" answers are 'semantic batched' (meaning we create embeddings, do hierarchical clustering, then create small batches containing several intermediate answers of similar semantics, and sort each batch in semantic order too), and each batch is combined into a single answer per batch of relevant docs (or later: per batch of batches).\n    9. Rinse and repeat steps 7+8 (i.e. gradually aggregate batches) until we have only one answer, which is returned to the user.\n\n
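* To picture the 'semantic batching' of steps 8 and 9, here is a minimal illustrative sketch (not `wdoc`'s actual code) of the scipy hierarchical clustering and leaf ordering mentioned in the Features section below:\n\n    ``` python\n    import numpy as np\n    from scipy.cluster.hierarchy import linkage, leaves_list\n\n    def semantic_batches(embeddings: np.ndarray, batch_size: int = 5) -> list:\n        # Cluster the intermediate-answer embeddings, read the leaves back in\n        # semantic order, then cut that ordering into small batches that are\n        # combined one batch at a time.\n        Z = linkage(embeddings, method=\"ward\")\n        order = leaves_list(Z)\n        return [order[i:i + batch_size].tolist() for i in range(0, len(order), batch_size)]\n    ```\n\n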
Now, let's see how to summarize a pdf.\n\n``` zsh\nlink=\"https:\u002F\u002Fsituational-awareness.ai\u002Fwp-content\u002Fuploads\u002F2024\u002F06\u002Fsituationalawareness.pdf\"\n\nwdoc --path=$link --task=summarize --filetype=\"online_pdf\"\n```\n* This will:\n    1. Split the text into chunks\n    2. Pass each chunk into the strong LLM (by default openrouter\u002Fgoogle\u002Fgemini-3.1-pro-preview) for a very low level (=with all details) summary. The format is markdown bullet points for each idea and with logical indentation.\n    3. When summarizing each new chunk, the LLM has access to the previous chunk for context.\n    4. All summaries are then concatenated and returned to the user\n\n
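* To picture that loop, here is a minimal illustrative sketch (assuming nothing beyond a generic `llm(prompt)` callable; the real prompts live in `utils\u002Fprompts.py`):\n\n    ``` python\n    def summarize_chunks(chunks: list, llm) -> str:\n        # Each chunk becomes indented markdown bullets; the previous chunk is\n        # passed along so the model keeps the surrounding context.\n        summaries = []\n        for i, chunk in enumerate(chunks):\n            context = chunks[i - 1] if i > 0 else \"\"\n            prompt = (\"Previous chunk (context only):\\n\" + context\n                      + \"\\n\\nSummarize as logically indented markdown bullet points:\\n\" + chunk)\n            summaries.append(llm(prompt))\n        return \"\\n\".join(summaries)\n    ```\n\n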
* For extra large documents, like books, this summary can be recursively fed to `wdoc` using, for example, the argument --summary_n_recursion=2.\n\n* Those two tasks, query and summary, can be combined with --task summarize_then_query which will summarize the document but give you a prompt at the end to ask questions in case you want to clarify things.\n\n* For more, you can read [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md).\n\n* Note that there is [an official Open-WebUI Tool](https:\u002F\u002Fopenwebui.com\u002Ft\u002Fqqqqqqqqqqqqqqqqqqqq\u002Fwdoctool) that is even simpler to use.\n\n\u003C\u002Fdetails>\n\n## Features\n* **15+ filetypes**: also supports combinations to load recursively or define a complex heterogeneous corpus like a list of files, a list of links, using regex, youtube playlists etc. See [Filetypes](#filetypes) and [Recursive Filetypes](#recursive-filetypes). All filetypes can be seamlessly combined in the same index, meaning you can query your anki collection at the same time as your work PDFs. It supports removing silence from audio files and youtube videos too! There is even a `ddg` filetype to search the web using [DuckDuckGo](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDuckDuckGo).\n* **100+ LLMs and many embeddings**: Supports any LLM by OpenAI, Mistral, Claude, Ollama, Openrouter, etc. thanks to [litellm](https:\u002F\u002Fdocs.litellm.ai\u002F). The list of supported embedding engines can be found [here](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fembedding\u002Fsupported_embedding) but includes at least OpenAI (or any OpenAI-API-compatible models), Cohere, Azure, Bedrock, NVIDIA NIM, Hugging Face, Mistral, Ollama, Gemini, Vertex, Voyage.\n* **Local and Private LLM**: When in private mode, measures are taken to make sure no data leaves your computer and goes to an LLM provider: no API keys are used, all `api_base` are user set, caches are isolated from the rest, outgoing connections are censored by overloading python sockets, etc.\n* **Advanced RAG to query lots of diverse documents**:\n    1. The documents are retrieved using embeddings\n    2. Then a weak LLM model (\"Eve the Evaluator\") is used to tell which of those documents are not relevant\n    3. Then the strong LLM (\"Anna the Answerer\") is used to answer the question using each individual remaining document.\n    4. Then all relevant answers are combined (\"Carl the Combiner\") into a single short markdown-formatted answer. Before being combined, they are batched by semantic clusters\n    and semantic order using scipy's hierarchical clustering and leaf ordering, which makes it easier for the LLM to combine the answers in a manner that makes bottom-up sense.\n    `Eve the Evaluator`, `Anna the Answerer` and `Carl the Combiner` are the names given to each LLM in their system prompt; this way you can easily add specific additional instructions to a specific step. There's also `Sam the Summarizer` for summaries and `Raphael the Rephraser` to expand your query.\n    5. Each document is identified by a unique hash and the answers are sourced, meaning you know from which document each piece of information in the answer comes.\n    * Supports a special syntax like \"QE >>>> QA\" where QE is a question used to filter the embeddings and QA is the actual question you want answered.\n* **Web Search**: Preliminary support for web search using [DuckDuckGo](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDuckDuckGo). Just do `wdoc web \"How is Nvidia doing this month?\"`\n* **Advanced summary**:\n    * Instead of unusable \"high level takeaway\" points, compress the reasoning, arguments, thought process etc. of the author into an easy to skim markdown file.\n    * The summaries are then checked again n times for correct logical indentation etc.\n    * The summary can be in the same language as the documents or directly translated.\n* **Many tasks**: See [Supported tasks](#tasks).\n* **Trust but verify**: The answer is sourced: `wdoc` keeps track of the hash of each document used in the answer, allowing you to verify each assertion.\n* **Markdown formatted answers and summaries**: using [rich](https:\u002F\u002Fgithub.com\u002FTextualize\u002Frich).\n* **Sane embeddings**: By default uses sophisticated embeddings like [multi query retrievers](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fhow_to\u002FMultiQueryRetriever) but also includes SVM, KNN, parent retriever etc. Customizable.\n* **Fully documented**: Lots of docstrings, lots of in code comments, detailed `--help` etc. Take a look at [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md) for a list of shell and python examples. The full help can be found in the file [help.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fdocs\u002Fhelp.md) or via `python -m wdoc --help`. I work hard to maintain exhaustive documentation. The complete documentation in a single page is available [on the website](https:\u002F\u002Fwdoc.readthedocs.io\u002Fen\u002Flatest\u002Fall_docs.html).\n* **Scriptable \u002F Extensible**: You can use `wdoc` as an executable or as a library. Take a look at the scripts [below](#scripts-made-with-wdoc). There is even [an open-webui Tool](https:\u002F\u002Fopenwebui.com\u002Ft\u002Fqqqqqqqqqqqqqqqqqqqq\u002Fwdoctool).\n* **Strictly Typed**: Runtime type checking without performance penalty thanks to the incredible [beartype](https:\u002F\u002Fbeartype.readthedocs.io\u002Fen\u002Flatest\u002F)! Opt out using an environment flag: `WDOC_TYPECHECKING=\"disabled \u002F warn \u002F crash\" wdoc` (by default: `warn`).\n* **LLM (and embeddings) caching**: speeds things up; also supports index storing and loading (handy for large collections).\n* **Good PDF parsing**: PDF parsers are notoriously unreliable, so 15 (!) different loaders are used, and the best according to a parsing scorer is kept. Including table support via [openparse](https:\u002F\u002Fgithub.com\u002FFilimoa\u002Fopen-parse\u002F) (no GPU needed by default) or via [UnstructuredPDFLoader](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fintegrations\u002Fdocument_loaders\u002Funstructured_pdfloader\u002F).\n* **Langfuse support**: If you set the appropriate langfuse environment variables they will be used. See [this guide](https:\u002F\u002Flangfuse.com\u002Fdocs\u002Fintegrations\u002Flangchain\u002Ftracing) or [this one](https:\u002F\u002Flangfuse.com\u002Fdocs\u002Fintegrations\u002Flitellm\u002Ftracing) to learn more (Note: this is disabled if using private_mode to avoid any leaks).\n* **Document filtering**: based on regex for document content or metadata.\n* **Binary embeddings support**: Custom langchain VectorStore to use binary embeddings, leading (potentially, as it depends on the embeddings model) to [~32x better compression ratio, faster search and usually negligible accuracy loss](https:\u002F\u002Fsimonwillison.net\u002F2024\u002FMar\u002F26\u002Fbinary-vector-search\u002F). The core idea is sketched just below.\n
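    * A minimal sketch of that idea (illustrative only, not the actual `BinaryFAISS` code): sign-threshold each float vector and pack the bits, so every 32-bit float dimension becomes a single bit:\n        ``` python\n        import numpy as np\n\n        def binarize(vectors: np.ndarray) -> np.ndarray:\n            # positive components become 1, the rest 0, then 8 bits are packed\n            # per byte: roughly 32x smaller than float32 vectors\n            return np.packbits(vectors > 0, axis=-1)\n        ```\n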
* **Fast**: Parallel document loading, parsing, embeddings, querying, etc.\n* **Shell autocompletion** using [python-fire](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fpython-fire\u002Fblob\u002Fmaster\u002Fdocs\u002Fusing-cli.md#completion-flag)\n* **Notification callback**: Can be used for example to get summaries on your phone using [ntfy.sh](ntfy.sh).\n* **Hacker mindset**: I'm a friendly dev! Just open an issue if you have a feature request or anything else.\n\n### Tasks\n* **query**: give documents and ask questions about them.\n* **search**: only returns the documents and their metadata. For anki it can be used to directly open cards in the browser.\n* **summarize**: give documents and read a summary. The summary prompt can be found in `utils\u002Fprompts.py`.\n* **summarize_then_query**: summarize the document, then query it directly.\n\n### Filetypes\n* **anki**: any subset of an [anki](https:\u002F\u002Fgithub.com\u002Fankitects\u002Fanki) collection db. `alt` and `title` of images can be shown to the LLM, meaning that if you used [the ankiOCR addon](https:\u002F\u002Fgithub.com\u002Fcfculhane\u002FAnkiOCR) this information will help contextualize the note for the LLM.\n* **auto**: default, guess the filetype for you\n* **epub**: barely tested because epub is in general a poorly defined format\n* **json_dict**: a text file containing a single json dict.\n* **local_audio**: supports many file formats, can use either OpenAI's whisper or [deepgram](https:\u002F\u002Fdeepgram.com)'s Nova-3 model. Supports automatically removing silence etc. Note: audio files that are too large for whisper (usually >25mb) are automatically split into smaller files, transcribed, then combined. Also, audio transcripts are converted to text containing timestamps at regular intervals, making it possible to ask the LLM when something was said.\n* **local_html**: useful for website dumps\n* **local_video**: extract the audio then treat it as **local_audio**\n* **logseq_markdown**: thanks to my other project, [LogseqMarkdownParser](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002FLogseqMarkdownParser), you can use your [Logseq graph](https:\u002F\u002Fgithub.com\u002Flogseq\u002Flogseq\u002F)\n* **online_media**: use youtube_dl to try to download videos\u002Faudio; if that fails, try to intercept good url candidates using playwright to load the page. Then processed as **local_audio** (but works with video too).\n* **online_pdf**: via URL then treated as a **pdf** (see above)\n* **pdf**: 15 default loaders are implemented, heuristics are used to keep the best one and stop early. Table support via [openparse](https:\u002F\u002Fgithub.com\u002FFilimoa\u002Fopen-parse\u002F) or [UnstructuredPDFLoader](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fintegrations\u002Fdocument_loaders\u002Funstructured_pdfloader\u002F). Easy to add more.\n* **powerpoint**: .ppt, .pptx, .odp, ...\n* **string**: the cli prompts you for a text so you can easily paste something, handy for paywalled articles!\n* **text**: send text content directly as the path\n* **txt**: .txt, markdown, etc\n* **url**: try many ways to load a webpage, with heuristics to find the best parsed one\n* **word**: .doc, .docx, .odt, ...\n* **youtube**: text is then either from the yt subtitles \u002F translation or, even better, using whisper \u002F deepgram. Note that youtube subtitles are downloaded with the timecode (so you can ask 'when does the author talk about such and such') but at a lower sampling frequency (instead of one timecode per second, only one per 15s). Youtube chapters are also given as context to the LLM when summarizing, which probably helps it a lot.\n\n### Recursive Filetypes\n* **ddg**: does an online web search using [DuckDuckGo](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDuckDuckGo). This is not an agent search, we only use `wdoc` over the urls fetched by DuckDuckGo and return the result. Only supported by `query` tasks.\n* **json_entries**: turns a path to a file where each line is a json **dict** that contains arguments to use when loading. Example: load several other recursive types. An example can be found in `docs\u002Fjson_entries_example.json`.\n* **link_file**: turn a text file where each line contains a url into appropriate loader arguments. Supports any link, so for example webpages, links to pdfs and youtube links can be in the same file. 
Handy for summarizing lots of things!\n* **recursive_paths**: turns a path, a regex pattern and a filetype into all the files found recursively, treated as the specified filetype (for example many PDFs or lots of HTML files etc).\n* **toml_entries**: read a .toml file. An example can be found in `docs\u002Ftoml_entries_example.toml`.\n* **youtube playlists**: get the link for each video then process as **youtube**\n\n## Walkthrough and examples\n\nRefer to [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md).\n\n## Getting started\n*`wdoc` was mainly developed and tested on python 3.13.5 but for compatibility it is installable with python version `>=3.11`. If possible, try to use python `3.13`.*\n\n### Direct Installation\n\n1. To install:\n    * Using pip: `pip install -U wdoc[full]` (if you want to try the version with far fewer dependencies, use `pip install -U wdoc`, but you will have to manually install the missing dependencies for your usecase).\n    * Or to get a specific git branch:\n        * `dev` branch: `pip install git+https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc@dev[full]`\n        * `main` branch: `pip install git+https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc@main[full]`\n    * You can also use uvx or pipx. But as I'm not experienced with them, I don't know if they can cause issues with, for example, caching etc. Do tell me if you tested it!\n        * Using uvx: `uvx wdoc[full]@latest --help`\n        * Using pipx: `pipx run wdoc[full] --help`\n    * In any case, it is recommended to:\n        * Install the `wdoc[full]` version except if you have specific constraints.\n        * Try to install pdftotext with `pip install -U wdoc[pdftotext]` as well as add fasttext support with `pip install -U wdoc[fasttext]`.\n    * If you plan on contributing, you will also need `wdoc[dev]` for the commit hooks.\n    * **Claude Code users**: to give Claude Code knowledge of `wdoc`'s CLI and Python API, install the [SKILL.md](.\u002FSKILL.md) reference file:\n        ```bash\n        mkdir -p ~\u002F.claude\u002Fskills\u002Fwdoc && wget -O ~\u002F.claude\u002Fskills\u002Fwdoc\u002FSKILL.md https:\u002F\u002Fraw.githubusercontent.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fmain\u002FSKILL.md\n        ```\n2. Add the API key for the backend you want as an environment variable: for example `export ANTHROPIC_API_KEY=\"***my_key***\"`\n3. Launching is as easy as using `wdoc --task=query --path=MYDOC [ARGS]`\n    * If for some reason this fails, maybe try with `python -m wdoc`. And if everything fails, try with `uvx wdoc@latest`, or as a last resort clone this repo and try again after `cd`-ing inside it. Don't hesitate to open an issue.\n    * To get shell autocompletion: if you're using zsh: `eval $(cat shell_completions\u002Fwdoc_completion.zsh)`. Also provided for `bash` and `fish`. You can generate your own with `wdoc -- --completion MYSHELL > my_completion_file`.\n    * Don't forget that if you're using a lot of documents (notably via recursive filetypes) it can take a lot of time (depending on parallel processing too, but then you might run into memory errors).\n    * Take a look at [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md) for a list of shell and python examples.\n4. 
To ask questions about a local document: `wdoc query --path=\"PATH\u002FTO\u002FYOUR\u002FFILE\" --filetype=\"auto\"`\n    * If you want to reduce the startup time by directly loading the embeddings from a previous run (although the embeddings are always cached anyway): add `--saveas=\"some\u002Fpath\"` to the previous command to save the generated embeddings to a file, then replace it with `--loadfrom \"some\u002Fpath\"` on every subsequent call.\n5. To do an online search, the idea is `wdoc --task=query --path='How is Nvidia doing this month?' --query='How is Nvidia doing this month' --filetype=ddg`. But if either `path` or `query` is missing, it is replaced by the other one. This can also be used like so: `wdoc web 'How is Nvidia doing this month?'`.\n6. For more: read the documentation at `wdoc --help`\n\n### Experimental Docker Interface\n\nYou can also use the experimental docker interface to use `wdoc` in the browser (including on a smartphone!).\n\nSee the [Docker README](.\u002Fdocker\u002FREADME.md) for detailed instructions.\n\n## Scripts made with wdoc\n* *More to come in [the scripts folder](.\u002Fscripts\u002F)*.\n* [Ntfy Summarizer](scripts\u002FNtfySummarizer): automatically summarize a document from your android phone using [ntfy.sh](ntfy.sh).\n* [TheFiche](scripts\u002FTheFiche): create summaries for specific notions directly as a [logseq](https:\u002F\u002Fgithub.com\u002Flogseq\u002Flogseq) page.\n* [FilteredDeckCreator](scripts\u002FFilteredDeckCreator): directly create an [anki](https:\u002F\u002Fankitects.github.io\u002F) filtered deck from the cards found by `wdoc`.\n* [Official Open-WebUI Tool](https:\u002F\u002Fopenwebui.com\u002Ft\u002Fqqqqqqqqqqqqqqqqqqqq\u002Fwdoctool), hosted [here](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fopenwebui_custom_pipes_filters\u002Fblob\u002Fmain\u002Ftools\u002Fwdoc_tools.py).\n* [MediaURLFinder](scripts\u002FMediaURLFinder) simply leverages the `find_online_media` loader helper to use `playwright` and `yt-dlp` to find all the URLs of media (videos, audio etc). This is especially useful if `yt-dlp` alone is not able to find the URL of a resource.\n\n## FAQ\n\n\u003Cdetails>\n\u003Csummary>\nFAQ\n\u003C\u002Fsummary>\n\n* **Who is this for?**\n    * `wdoc` is for power users who want document querying on steroids, and in-depth AI-powered document summaries.\n* **What's RAG?**\n    * A RAG system (retrieval augmented generation) is basically an LLM powered search through a text corpus.\n* **Why make another RAG system? Can't you use any of the others?**\n    * I'm [Olicorne](https:\u002F\u002Folicorne.org\u002F), a psychiatry resident who needed a tool to ask medical questions from **a lot** (tens of thousands) of documents, of different types (epub, pdf, [anki](https:\u002F\u002Fankitects.github.io\u002F) database, [Logseq](https:\u002F\u002Fgithub.com\u002Flogseq\u002Flogseq\u002F), website dump, youtube videos and playlists, recorded conferences, audio files, etc). Existing solutions couldn't handle this diversity and scale of content.\n* **Why is `wdoc` better than most RAG systems for asking questions on documents?**\n    * It uses both a strong and a query_eval LLM. After finding the appropriate documents using embeddings, the query_eval LLM is used to filter out the documents that don't seem to be about the question, then the strong LLM answers the question based on each remaining document, then combines them all in a neat markdown answer. 
Also, `wdoc` is very customizable.\n* **Can you use wdoc on `wdoc`'s documentation?**\n    * Yes of course! `wdoc --task=query --path https:\u002F\u002Fwdoc.readthedocs.io\u002Fen\u002Flatest\u002Fall_docs.html`\n* **Why can `wdoc` also produce summaries?**\n    * I have little free time so I needed a tailor made summary feature to keep up with the news. But most summary systems are rubbish: they just try to give you the high level takeaway points and don't handle text chunking properly. So I made my own tailor made summarizer. **The summary prompts can be found in `utils\u002Fprompts.py` and focus on extracting the arguments\u002Freasoning\u002Fthought process of the author, then use indented markdown bullet points to make it easy to read.** It's really good! The prompts dataclass is not frozen so you can provide your own prompt if you want.\n* **Which tasks are supported by `wdoc`?**\n    * See [Tasks](#tasks).\n* **Which LLM providers are supported by `wdoc`?**\n    * `wdoc` supports virtually any LLM provider thanks to [litellm](https:\u002F\u002Fdocs.litellm.ai\u002F). It even supports local LLMs and local embeddings (see [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md)). The list of supported embedding engines can be found [here](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fembedding\u002Fsupported_embedding) but includes at least OpenAI (or any OpenAI-API-compatible models), Cohere, Azure, Bedrock, NVIDIA NIM, Hugging Face, Mistral, Ollama, Gemini, Vertex, Voyage.\n* **What do you use `wdoc` for?**\n    * I follow heterogeneous sources to keep up with the news: youtube, websites, etc. So thanks to `wdoc` I can automatically create awesome markdown summaries that end up straight in my [Logseq](https:\u002F\u002Fgithub.com\u002Flogseq\u002Flogseq\u002F) database as a bunch of `TODO` blocks.\n    * I use it to ask technical questions to my vast heterogeneous corpus of medical knowledge.\n    * I use it to query my personal documents using the `--private` argument.\n    * I sometimes use it to summarize a document, then go straight to asking questions about it, all in the same command.\n    * I use it to ask questions about entire youtube playlists.\n    * Other use cases are the reason I made the [scripts made with `wdoc` section](#scripts-made-with-wdoc)\n* **What's up with the name?**\n    * One of my favorite characters (and somewhat of a role model) is [Winston Wolf](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UeoMuK536C8), and after much hesitation I decided `WolfDoc` would be too confusing and `WinstonDoc` sounds like something micro$oft would do. Also `wd` and `wdoc` were free, whereas `doctools` was already taken. The initial name of the project was `DocToolsLLM`, a play on words between 'doctor' and 'tool'.\n* **How can I improve the prompt for a specific task without coding?**\n    * The prompts of the `query` task are roleplaying as employees working for WDOC-CORP©: `Eve the Evaluator` (the LLM that filters out irrelevant documents), `Anna the Answerer` (the LLM that answers the question from a filtered document) and `Carl the Combiner` (the LLM that combines answers from the Answerer into one). There's also `Sam the Summarizer` for summaries and `Raphael the Rephraser` to expand your query. 
They are all receiving orders from you if you talk to them in a prompt.\n* **How can I use `wdoc`'s parser for my own documents?**\n    * If you are in the shell cli you can easily use `wdoc parse my_file.pdf`.\n    Add `--format=langchain_dict` to get the text and metadata as a list of dicts, otherwise you will only get the text. Other formats exist, including `--format=xml` to make it LLM friendly like [files-to-prompt](https:\u002F\u002Fgithub.com\u002Fsimonw\u002Ffiles-to-prompt).\n    * If you want the document using python:\n        ``` python\n        from wdoc import wdoc\n        list_of_docs = wdoc.parse_doc(path=my_path)\n        ```\n    * Another example would be to use wdoc to parse an anki deck: `wdoc parse --filetype \"anki\" --anki_profile \"Main\" --anki_deck \"mydeck::subdeck1\" --anki_notetype \"my_notetype\" --anki_template \"\u003Cheader>\\n{header}\\n\u003C\u002Fheader>\\n\u003Cbody>\\n{body}\\n\u003C\u002Fbody>\\n\u003Cpersonal_notes>\\n{more}\\n\u003C\u002Fpersonal_notes>\\n\u003Ctags>{tags}\u003C\u002Ftags>\\n{image_ocr_alt}\" --anki_tag_filter \"a::tag::regex::.*something.*\" --format=text`\n* **What should I do if my PDFs are encrypted?**\n    * If you're on linux you can try running `qpdf --decrypt input.pdf output.pdf`\n        * I made a quick and dirty batch script for this [in this repo](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002FPDF_batch_decryptor)\n* **How can I add my own pdf parser?**\n    * Write a python class and add it there: `wdoc.utils.loaders.pdf_loaders['parser_name']=parser_object`, then call `wdoc` with `--pdf_parsers=parser_name`.\n        * The class has to take a `path` argument in `__init__` and have a `load` method taking no arguments but returning a `List[Document]`. Take a look at the `OpenparseDocumentParser` class for an example, or at the sketch just below.\n
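    * For illustration, here is a minimal hypothetical sketch of such a parser (the class name, the `pypdf` usage and the metadata keys are assumptions for the example, not `wdoc` internals):\n        ``` python\n        from typing import List\n\n        from langchain_core.documents import Document\n        from pypdf import PdfReader  # any pdf library would do here\n\n        import wdoc.utils.loaders\n\n        class MyPdfParser:\n            def __init__(self, path: str):\n                self.path = path\n\n            def load(self) -> List[Document]:\n                # one Document per page, keeping the source path in the metadata\n                reader = PdfReader(self.path)\n                return [\n                    Document(\n                        page_content=page.extract_text() or \"\",\n                        metadata={\"source\": self.path, \"page\": i},\n                    )\n                    for i, page in enumerate(reader.pages)\n                ]\n\n        # register it, then call wdoc with --pdf_parsers=my_parser\n        wdoc.utils.loaders.pdf_loaders[\"my_parser\"] = MyPdfParser\n        ```\n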
\n* **What should I do if I keep hitting rate limits?**\n    * The simplest way is to add the `debug` argument. It will disable multithreading,\n        multiprocessing and LLM concurrency. A less harsh alternative is to set the\n        environment variable `WDOC_LLM_MAX_CONCURRENCY` to a lower value.\n\n* **How can I run the tests?**\n    * Take a look at the file `.\u002Ftests\u002Frun_all_tests.sh`.\n\n* **How can I query a text but without chunking? \u002F How can I query a text with the full text as context?**\n    * If you set the environment variable `WDOC_MAX_CHUNK_SIZE` to a very high value and use a model with enough context according to litellm's metadata, then no chunking will happen and the LLM will have the full text as context.\n\n* **Is there a way to use `wdoc` with [Open-WebUI](https:\u002F\u002Fgithub.com\u002Fopen-webui\u002Fopen-webui\u002F)?**\n    * Yes! I am maintaining an [official Open-WebUI Tool](https:\u002F\u002Fopenwebui.com\u002Ft\u002Fqqqqqqqqqqqqqqqqqqqq\u002Fwdoctool) which is hosted [here](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fopenwebui_custom_pipes_filters\u002Fblob\u002Fmain\u002Ftools\u002Fwdoc_tools.py).\n\n* **Is there a web UI for `wdoc`?**\n    * Yes! An [experimental Docker-based Gradio web interface](.\u002Fdocker\u002FREADME.md) is available for easy deployment and use without command-line interaction.\n\n* **Can I use shell pipes with `wdoc`?**\n    * Yes! Data sent using shell pipes (be it strings or binary data) will be automatically saved to a temporary file which is then passed as the `--path=[temp_file]` argument. For example `cat **\u002F*.txt | wdoc --task=query`, `echo $my_url | wdoc parse` or even `cat my_file.pdf | wdoc parse --filetype=pdf`. For binary input it is strongly recommended to use a `--filetype` argument because `python-magic` version \u003C=0.4.27 chokes otherwise (see [that issue](https:\u002F\u002Fgithub.com\u002Fahupp\u002Fpython-magic\u002Fissues\u002F261)).\n\n* **Can the environment variables be set at runtime?**\n    * Sort of. Actually, when importing `wdoc`, code in `wdoc\u002Futils\u002Fenv.py` creates a dataclass that holds the environment variables used by `wdoc`. This is done primarily to ensure runtime type checking and to ensure that when an env variable is accessed inside wdoc's code (through the dataclass) it is always compared to the environment one. If you change env variables throughout the code, the new value will be used inside `wdoc`. But that's somewhat brittle, because some env variables are used to store the *default* value of some function or class and hence are only used when importing code, so they will be out of sync. Additionally, `wdoc` will intentionally crash if it suspects the `WDOC_PRIVATE_MODE` env var is out of sync, just to be safe. Also note that if env vars like `WDOC_LANGFUSE_PUBLIC_KEY` are found, `wdoc` will overwrite `LANGFUSE_PUBLIC_KEY` with it. This is because `litellm` (maybe others) looks for this env variable to enable `langfuse` callbacks. This whole contraption allows setting env variables for a specific user when using the `open-webui` `wdoc` tool. Feedback is very welcome on this feature.\n
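    * Given that import-time snapshot, the safest pattern is to set `WDOC_*` variables before importing `wdoc`; a minimal sketch (the values are arbitrary examples):\n        ``` python\n        import os\n\n        # the dataclass built in wdoc\u002Futils\u002Fenv.py snapshots these at import\n        # time, so set them before the import happens\n        os.environ[\"WDOC_TYPECHECKING\"] = \"crash\"\n        os.environ[\"WDOC_LLM_MAX_CONCURRENCY\"] = \"1\"\n\n        import wdoc\n        ```\n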
\n* **How can I build the autodoc using sphinx?**\n    * The command I've been using is `sphinx-apidoc -o docs\u002Fsource\u002F wdoc --force`, to be called from the root of this repository.\n\n* **Why can't I load the vectorstores in other langchain projects?**\n    * In `wdoc\u002Futils\u002Fcustoms\u002Fbinary_faiss_vectorstore.py`, we create `BinaryFAISS` and `CompressedFAISS`. The latter is just like FAISS but applies zlib compression to the pickled index, and the former adds binary embeddings on top, resulting in faster and more compact embeddings. If you want to disable compression altogether, use the env variable `WDOC_MOD_FAISS_COMPRESSION=false`.\n\n* **Which python version is used in the test suite?**\n    * The recommended python version is `3.12.11`.\n\n* **Why does the online search only support the 'query' task?**\n    * The way `wdoc` works for summaries is to take the \"whole document\", chunk it into sequential \"documents\" and iteratively create the summary. But if we start with several documents (say different web pages) then the \"sequence\" wouldn't make sense.\n\n\u003C\u002Fdetails>\n\n## Roadmap\n\n\u003Cdetails>\n\u003Csummary>\nClick to read more\n\u003C\u002Fsummary>\n\n\u003Ci>This TODO list is maintained automatically by [MdXLogseqTODOSync](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002FMdXLogseqTODOSync)\u003C\u002Fi>\n\n\u003C!-- BEGIN_TODO -->\n- ## Most urgent\n    - figure out a good way to skip merging batches that are too large before trying to merge them\n        - probably means adding an env var to store a max value, document it in the help.md\n        - then check after batch creation if a batch is that large\n        - if it is, put it in a separate var, to be concatenated later with the rest of the answers\n    - add more tests\n        - add test for the private mode\n        - add test for the testing models\n        - add test for the recursive loader functions\n        - add test for each loader\n    - rewrite the python API to make it more usable. (also related to https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fissues\u002F13)\n        - pay attention to how to modify the init and main.py files\n        - pay attention to how the --help flag works\n        - pay attention to how the USAGE document is structured\n    - support other vector databases\n    - learn how to set a github action for test code coverage\n    - allow anki to use anki type search queries\n    - refactor the tasks to use langgraph, as it seems easier to do complex recursive tasks with it\n    - use async for the langchain chains\n- ### Features\n    - use clusters of semantic ordering instead of just the order, you dumbass\n    - ability to cap the searched documents by a number of tokens instead of a number of documents\n    - Add prompt caching for claude\n    - add a \"fast summary\" feature that does not use recursive summary if you care more about speed than overlapping summaries\n    - count how many times each source is used, as it can be relevant to infer answer quality\n    - add an html format output. It would display a nice UI with proper dropdowns for sources etc\n    - if a model supports structured output we should make use of it to get the thinking and answer part. Opt in because some models hide their thoughts.\n    - add an intermediate step for queries that asks the LLM for appropriate headers for the md output. Then for each intermediate answer attribute it a list of 1 to 3 headers (because a given intermediate answer can contain several pieces of information), then do the batch merge of intermediate answers per header.\n        - this needs to be scalable and easy to add recursion to (because then we can do this for subheaders and so on)\n        - the end goal is to have a scalable solution to answer queries about extremely large documents for impossibly vast questions\n    - use apprise instead of ntfy for the scripts\n    - add crawl4ai parser: https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\n    - Way to add the title (or all metadata) of a document to its own text. Enabled by default. 
Because this would allow searching among many documents that don't refer to the original title (for example: material safety datasheets)\n        - default value is \"author\", \"page\", \"title\"\n        - pay attention to avoid including personal info (for example use relative paths instead of absolute paths)\n    - add a \u002Fsave PATH command to save the chat and metadata to a json file\n    - add support for printing images via icat or via the other lib you found last time, would be useful for summaries etc\n    - add wdoc to tldr pages\n    - add an audio backend to use the subtitles from a video file directly\n    - store the anki images as 'imagekeys' as the idea works for other parsers too\n    - investigate asking the LLM to add leading emojis to the bullet point for improved reading\n    - add a key\u002Fval arg to specify the trust we have in a doc, call it context\n    - add a way to open the documents automatically, based on platform dirs etc. For ex if okular is installed, open pdfs directly at the right page\n        - the best way would be to create opener.py that does a bit like loader but for all filetypes and platforms\n        - use a cli selector like in mnemonics creator\n            - add shortcut to sort by score or by name\n            - display metadata and score in a previewer\n    - add an argument --whole_text to avoid chunking (this would just increase the chunk size to a super large number I guess)\n    - add apprise callback support\n    - add a filetype \"custom_parser\" and an argument \"--custom_parser\" containing a path to a python file. Must receive a docdict and a few other things and return a list of documents\n    - add bespoke-minicheck from ollama to fact check when using RAG: https:\u002F\u002Follama.com\u002Flibrary\u002Fbespoke-minicheck\n        - or via their API directly : https:\u002F\u002Fdocs.bespokelabs.ai\u002Fbespoke-minicheck\u002Fapi but they don't seem to properly disclose what they do with the data\n    - add a langchain code loader that uses aider to get the repomap\n        - https:\u002F\u002Fgithub.com\u002Fpaul-gauthier\u002Faider\u002Fissues\u002F1043#issuecomment-2278486840\n        - https:\u002F\u002Faider.chat\u002Fdocs\u002Fscripting.html\n    - add a pikepdf loader because it can be used to automatically decrypt pdfs\n    - add a query_branching_nb argument that asks an LLM to identify a list of keywords from the intermediate answers, then look again for documents using this keyword and filtering via the weak llm\n    - write a script that shows how to use bertopic on the documents of wdoc\n    - add a retriever where the LLM answers without any context\n    - add support for readabilipy for parsing html\n        - https:\u002F\u002Fgithub.com\u002Falan-turing-institute\u002FReadabiliPy\n    - add an obsidian loader\n        - https:\u002F\u002Fpypi.org\u002Fproject\u002Fobsidiantools\u002F\n    - add a \u002Fchat command to the prompt, it would enable starting an interactive session directly with the llm\n    - find a way to make it work with llm from simonw\n    - make images an actual filetype\n- ### Enhancements\n    - store the available tasks in a dataclass in misc.py\n    - turn arguments that contain a _ into arguments with a -\n        - in the cli launcher function, manually convert arguments\n    - maybe add support for docling to parse documents?\n    - when querying hard stuff the number of dropped documents after batching is non negligible, we should remove those from the list of documents to display and instead 
store those in another variable\n    - check if using html syntax is less costly and confusing to LLMs than markdown with all that indentation. Or maybe json. It would be simple to turn that into markdown afterwards.\n    - check that the task search works on things other than anki\n    - create a custom retriever, derived from multiquery retriever, that does actual parallel requests. Right now it's not the case (maybe in async but I don't plan on using async for now). This retriever seems a good part of the slow down.\n    - stop using your own youtube timecode parser and instead use langchain's chunk transcript format\n    - implement usearch instead of faiss, it seems faster on all points, supports quantized embeddings, I trust their langchain implementation more\n        - https:\u002F\u002Fpython.langchain.com\u002Fapi_reference\u002Fcommunity\u002Fvectorstores\u002Flangchain_community.vectorstores.usearch.USearch.html#langchain_community.vectorstores.usearch.USearch\n    - Use an env var to drop_params of litellm\n    - add more specific exceptions for file loading errors. One exception for all, one for batch and one for individual loader\n    - use heuristics to find the best number of clusters when doing semantic reranking\n    - arg to use jina v3 embeddings for semantic batching because it allows specifying tasks that seem appropriate for that\n    - add an env variable or arg to overload the backend url for whisper. Then set it always for you and mention it there: https:\u002F\u002Fgithub.com\u002Ffedirz\u002Ffaster-whisper-server\u002Fissues\u002F5\n    - find a way to set a max cost at which to crash if it exceeds a maximum cost during a query, probably via the price callback\n    - anki_profile should be able to be a path\n    - store wdoc's version and indexing timestamp in the metadata of the document\n    - arg --oneoff that does not trigger the chat after replying. Allowing it to not hog all the RAM if run in multiple terminals, for example through SSH\n    - add a (high) token threshold above which two texts are not combined but just concatenated in the semantic order. It would avoid it losing context. Use a --- separator\n    - compute the cost of whisper and deepgram\n    - use a pydantic basemodel for output instead of a dict\n        - same for summaries, it should at least contain the method to substitute the sources and then back\n    - investigate storing the vectors in a sqlite3 file\n    - make a plugin for llm that looks like file-to-prompt from simonw\n    - Always bind user metadata to litellm for langfuse etc\n        - Add more metadata to each request to make langfuse more informative\n    - add a reranker to better sort the output of the retrievers. 
Right now with the multiquery it returns way too many and I'm thinking it might be a bad idea to just crop at top_k as I'm doing currently\n    - add a status argument that just outputs the logs location and size, the cache location and size, the number of documents etc\n    - add the python magic of the file as a file metadata\n    - add an env var to specify the threshold for relevant documents by the query eval llm\n    - find a way to return the evaluations for each document also\n    - move retrievers.py in an embeddings folder\n    - stop using lambda functions in the chains because it makes the code barely readable\n    - when doing recursive summary: tell the model that if it's really sure that there are no modifications to do: it should just reply \"EXIT\" and it would save time and money instead of waiting for it to copy back the exact content\n    - add image parsing as base64 metadata from pdf\n    - use multiple small chains instead of one large, complicated and hard to maintain chain\n    - add an arg to bypass query combine, useful for small models\n    - tell the llm to write a special message if the parsing failed or we got a 404 or paywall etc\n        - catch this text and crash\n    - add check that all metadata is only made of int, float and str\n    - move the code that filters embeddings inside the embeddings.py file\n        - this way we can dynamically refilter using the chat prompt\n    - task summary then query should keep in context both the full text and the summary\n    - if there's only one intermediate answer, pass it as answer without trying to recombine\n    - filter_metadata should support an OR syntax\n    - add a --show_models argument to display the list of available models\n    - add a way to open the documents automatically, based on platform dirs etc. For ex if okular is installed, open pdfs directly at the right page\n        - the best way would be to create opener.py that does a bit like loader but for all filetypes and platforms\n    - add an image filetype: it will either be OCR'd and\u002For captioned using a multimodal llm, for example gpt4o mini\n        - nanollava is a 0.5b that probably can be used for that with proper prompting\n    - add a key\u002Fval arg to specify the trust we have in a doc, call this metadata context in the prompt\n    - add an arg to return just the dict of all documents and embeddings. 
Notably useful to debug documents\n    - use a class for the cli prompt, instead of a dumb function\n    - arg to disable eval llm filtering\n        - just answer 1 directly if no eval llm is set\n    - display the number of documents and tokens in the bottom toolbar\n    - add a demo gif\n    - investigate asking the LLM to add leading emojis to the bullet point for quicker reading of summaries\n    - see how easy or hard it is to use an async chain\n    - ability to cap the searched documents by a number of tokens instead of a number of documents\n    - for anki, allow using a query instead of loading with ankipandas\n    - add a \"try_all\" filetype that will try each filetype and keep the first that works\n    - add textract extractor : https:\u002F\u002Ftextract.readthedocs.io\u002Fen\u002Fstable\u002F\n    - write a langchain compatible tool for agents\n    - add bespoke-minicheck from ollama to fact check when using RAG: https:\u002F\u002Follama.com\u002Flibrary\u002Fbespoke-minicheck\n        - or via their API directly : https:\u002F\u002Fdocs.bespokelabs.ai\u002Fbespoke-minicheck\u002Fapi but they don't seem to properly disclose what they do with the data\n\u003C!-- END_TODO -->\n\n\u003C\u002Fdetails>\n","[![PyPI版本](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fwdoc.svg)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fwdoc)\n\n# wdoc\n\n\u003Cp align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fthiswillbeyourgithub_wdoc_readme_ad45d36c7ca0.png\" width=\"512\" style=\"background-color: transparent !important\">\u003C\u002Fp>\n\n> *我是wdoc。我解决RAG问题。*\n> - wdoc，模仿温斯顿“狼”沃尔夫\n\n`wdoc`是一个功能强大的RAG（检索增强生成）系统，旨在对多种文件类型的文档进行摘要、搜索和查询。它特别适用于处理大量不同类型的文档，因此非常适合研究人员、学生以及需要处理海量信息来源的专业人士。\n\n`wdoc`由一位精神科住院医师创建，他需要一种方法能够同时从多个来源（音频录音、视频讲座、[Anki闪卡](https:\u002F\u002Fapps.ankiweb.net\u002F)、PDF、EPUB等）获取明确的答案。面对现有的用于查询和摘要的RAG解决方案，他感到十分沮丧，于是便诞生了`wdoc`。\n\n*（在线文档可以在这里找到 [https:\u002F\u002Fwdoc.readthedocs.io\u002Fen\u002Fstable](https:\u002F\u002Fwdoc.readthedocs.io\u002Fen\u002Fstable)）*\n\n* **目标与项目规格**：`wdoc`的目标是为异构语料库中的问题生成**完全实用**的摘要和**完全实用**的带来源答案。它能够同时查询跨越[多种文件类型](#filetypes)的**数万份**文档。该项目还包含一个有观点的摘要功能，帮助用户高效地跟进大量信息。它主要使用[LangChain](https:\u002F\u002Fpython.langchain.com\u002F)和[LiteLLM](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002F)作为后端。\n\n* **当前状态**：**可用、经过测试、仍在积极开发中、计划添加数十项功能**\n    * 我暂时没有停止开发的打算，所以如果你觉得它很有前景，请继续关注，因为我还有很多改进计划（见路线图部分）。\n    * **用户的测试对我帮助很大，因为这是我发现并快速修复许多小bug的最快方式。**\n    * 主分支比开发分支更稳定，而开发分支则提供了更多功能。\n    * 欢迎提出功能请求和拉取请求。所有反馈，包括错别字报告，都将不胜感激。\n    * 请在提交PR之前先打开一个问题，因为可能已经有正在进行的改进。\n\n* **关键特性**：\n    * **Docker Web UI**：通过基于[Gradio的Web界面](.\u002Fdocker\u002FREADME.md)，无需CLI交互即可轻松部署，简化文档处理流程。\n    * **高召回率和高特异性**：它被设计用来通过精心设计的嵌入式搜索找到大量文档，然后利用语义批处理逐步聚合每个答案，最终生成一条引用来源并指向源文档确切位置的回答。\n        * 同时使用昂贵和廉价的LLM，以尽可能提高召回率，因为我们有能力在每次查询中抓取大量文档（通过嵌入）。\n    * 支持**几乎任何LLM提供商**，包括本地模型，甚至还能为超级机密内容提供额外的安全层。\n    * 目标是**支持*任何*文件类型**，并能同时从所有这些类型中进行查询（目前已实现**15种以上**！）\n    * **真正*实用*的AI驱动摘要**：获取作者的思考过程，而不是模糊不清的要点总结。\n    * **真正*实用*的AI驱动查询**：针对你的问题给出带有**来源**的缩进Markdown格式回答，而不是幻觉般的胡言乱语。\n    * **可扩展性**：这既是一个工具，也是一个库。它甚至被改造成了[Open-WebUI工具](https:\u002F\u002Fopenwebui.com\u002Ft\u002Fqqqqqqqqqqqqqqqqqqqq\u002Fwdoctool)。此外，还提供了一个[基于Docker的Web UI](.\u002Fdocker\u002FREADME.md)，方便部署。\n    * **网络搜索**：初步支持使用[DuckDuckGo](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDuckDuckGo)进行网络搜索（通过[ddgs](https:\u002F\u002Fpypi.org\u002Fproject\u002Fddgs\u002F)库）\n\n### 目录\n- [全面参考（SKILL.md）](#comprehensive-reference)\n- [说明性图表](#explanatory-diagrams)\n- 
[给赶时间的人的超短指南](#ultra-short-guide-for-people-in-a-hurry)\n- [特性](#features)\n  - [任务](#Tasks)\n  - [文件类型](#filetypes)\n  - [操作指南和示例](#walkthrough-and-examples)\n- [开始使用](#getting-started)\n  - [直接安装](#direct-installation)\n  - [实验性Docker界面](#experiental-docker-interface)\n- [用wdoc制作的脚本](#scripts-made-with-wdoc)\n- [常见问题](#faq)\n- [路线图](#roadmap)\n\n## 全面参考\n\n涵盖所有CLI参数、环境变量、文件类型以及完整Python API的单页综合参考资料可以在**[SKILL.md](.\u002FSKILL.md)**中找到。\n\n## 说明性图表\n\n\u003Cp float="left" align="middle">\n\u003Cimg src="https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fthiswillbeyourgithub_wdoc_readme_c6a907a68177.png" alt="查询任务工作流图，展示从用户输入经过Raphael重述者、向量存储、Eve评估者、Anna回答者，再到递归合并最终输出的流程" height="400">\n\u003Cimg src="https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fthiswillbeyourgithub_wdoc_readme_6918dc7f0065.png" alt="摘要任务工作流图，展示从用户输入经过加载与分块、Sam摘要生成者，再到wdocSummary输出的流程" height="400">\n\u003Cimg src="https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fthiswillbeyourgithub_wdoc_readme_890baf33afe3.png" alt="搜索任务工作流图，展示从用户输入经过Raphael重述者、向量存储、Eve评估者，到搜索结果输出的流程" height="400">\n\u003C\u002Fp>\n\n## 为赶时间的人准备的超短指南\n\u003Cdetails>\n\u003Csummary>\n直接给我答案，我赶时间！\n\u003C\u002Fsummary>\n\n**注：示例列表可在 [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md) 中找到**\n\n**Docker 快速入门**：如果您想要一个实验性的 Web 界面，请查看 [Docker 部署指南](.\u002Fdocker\u002FREADME.md)。\n\n首先，我们来看看如何 *查询* 一份 PDF 文件。\n\n``` zsh\nlink="https:\u002F\u002Fsituational-awareness.ai\u002Fwp-content\u002Fuploads\u002F2024\u002F06\u002Fsituationalawareness.pdf"\n\nwdoc --path=$link --task=query --filetype="online_pdf" --query="What does it say about alphago?" --query_retrievers='basic_multiquery' --top_k=auto_200_500\n```\n* 这将：\n    1. 将 `--path` 中的内容解析为一个用于下载 PDF 的链接（否则该 URL 可能只是一个网页，但在大多数情况下，您可以将其保留为默认值 `auto`，因为系统内置了启发式算法来检测最合适的解析器）。\n    2. 将文本切分成多个块，并为每个块创建嵌入向量。\n    3. 对用户查询进行嵌入处理（使用 `basic` 模型），并让默认的大型语言模型生成替代查询，同时对这些替代查询也创建嵌入向量。\n    4. 使用这些嵌入向量在所有文本块中进行搜索，找出最相关的 200 个文档。\n    5. 将这 200 个文档逐一传递给小型语言模型（默认为 openrouter\u002Fgoogle\u002Fgemini-2.5-flash），以判断它们是否与用户查询相关。\n    6. 如果这 200 个文档中有超过 90% 被判定为相关，则再次以更高的 `top_k` 值进行搜索，重复此过程，直到文档开始变得不相关，或者已检索到 500 个文档为止。\n    7. 接着，将每个相关的文档发送给强大的大型语言模型（默认为 openrouter\u002Fgoogle\u002Fgemini-3.1-pro-preview），提取相关信息，并为每个相关文档生成一个答案。\n    8. 然后，对所有这些“中间”答案进行“语义分批”处理（即先创建嵌入向量，再进行层次聚类，最后将语义相似的多个中间答案组合成一个小批次，并按语义顺序排序）。每个批次会被合并为一个答案，最终形成针对相关文档的单个答案（或进一步合并为针对多个批次的单一答案）。语义分批的极简代码示意见下文。\n    9. 重复步骤 7 和 8（即逐步聚合各个批次），直到只剩下一个答案，然后将其返回给用户。\n\n接下来，我们看看如何总结一份 PDF 文件。\n\n``` zsh\nlink="https:\u002F\u002Fsituational-awareness.ai\u002Fwp-content\u002Fuploads\u002F2024\u002F06\u002Fsituationalawareness.pdf"\n\nwdoc --path=$link --task=summarize --filetype="online_pdf"\n```\n* 这将：\n    1. 将文本分割成多个块。\n    2. 将每个块传递给强大的大型语言模型（默认为 openrouter\u002Fgoogle\u002Fgemini-3.1-pro-preview），生成非常详细的摘要。格式为 Markdown 列表，每条列出一个观点，并根据逻辑关系进行缩进。\n    3. 在处理每个新块时，语言模型会参考前一个块的内容以获取上下文信息。\n    4. 最后，将所有摘要拼接在一起并返回给用户。
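\n\n上文查询流程的第 8 步提到了“语义分批”：先为各个中间答案创建嵌入，再做层次聚类，并按叶序排列得到语义顺序后分批合并。下面是一个极简的 Python 草图（仅用于说明思路，并非 `wdoc` 的实际实现；其中的嵌入向量为随机占位数据）：\n\n``` python\n# 语义分批示意：层次聚类 + 叶序排列，让语义相近的中间答案相邻成批\nimport numpy as np\nfrom scipy.cluster.hierarchy import linkage, leaves_list\n\nanswers = ["中间答案 A", "中间答案 B", "中间答案 C", "中间答案 D"]\nembeddings = np.random.rand(len(answers), 384)  # 占位：实际应为各答案的真实嵌入\n\n# 对嵌入做层次聚类，取叶序排列作为“语义顺序”\norder = leaves_list(linkage(embeddings, method="ward"))\n\nbatch_size = 2  # 实际批大小会受 token 预算等因素限制\nbatches = [[answers[i] for i in order[j : j + batch_size]] for j in range(0, len(order), batch_size)]\nprint(batches)  # 每个批次交给 LLM 合并为一个答案，再逐层聚合\n```\n\n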
* 对于像书籍这样非常大的文档，可以递归地将生成的摘要再次输入 `wdoc`，例如使用 `--summary_n_recursion=2` 参数。\n\n* 查询和总结这两个任务也可以结合使用，只需指定 `--task summarize_then_query`，即可先对文档进行总结，最后还会提供一个提示，方便您进一步提问以澄清细节。\n\n* 如需了解更多，请参阅 [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md)。\n\n* 请注意，还有一个[官方 Open-WebUI 工具](https:\u002F\u002Fopenwebui.com\u002Ft\u002Fqqqqqqqqqqqqqqqqqqqq\u002Fwdoctool)，使用起来更加简单。\n\u003C\u002Fdetails>\n\n## 功能\n* **15+ 种文件类型**：还支持组合使用，以递归方式加载，或定义复杂的异构语料库，例如文件列表、链接列表、正则表达式、YouTube 播放列表等。请参阅 [文件类型](#Filetypes) 和 [递归文件类型](#recursive-filetypes)。所有文件类型都可以无缝地结合在同一索引中，这意味着您可以同时查询 Anki 词库和工作 PDF 文件。它还支持从音频文件和 YouTube 视频中去除静音！甚至还有一个 `ddg` 文件类型，可用于使用 [DuckDuckGo](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDuckDuckGo) 搜索网络。\n* **100 多种 LLM 和多种嵌入模型**：得益于 [litellm](https:\u002F\u002Fdocs.litellm.ai\u002F)，支持 OpenAI、Mistral、Claude、Ollama、Openrouter 等任何 LLM。支持的嵌入引擎列表可在 [这里](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fembedding\u002Fsupported_embedding) 查看，至少包括 OpenAI（或任何兼容 OpenAI API 的模型）、Cohere、Azure、Bedrock、NVIDIA NIM、Hugging Face、Mistral、Ollama、Gemini、Vertex、Voyage 等。\n* **本地私有 LLM**：在私有模式下，会采取措施确保数据不会离开您的计算机并传输到 LLM 提供商：不使用 API 密钥，所有 `api_base` 均由用户设置，缓存与其余部分隔离，出站连接通过重载 Python 套接字进行审查等。\n* **高级 RAG 查询大量多样化文档**：\n    1. 文档通过嵌入检索获得。\n    2. 随后使用一个弱 LLM 模型（“Eve the Evaluator”）来判断哪些文档不相关。\n    3. 再使用强 LLM（“Anna the Answerer”）根据剩余的每一份文档回答问题。\n    4. 最后，所有相关答案将由“Carl the Combiner”合并成一条简短的 Markdown 格式答案。在合并之前，这些答案会先按语义聚类和语义顺序分批处理，使用 SciPy 的层次聚类和叶序排列方法，这样 LLM 就能更轻松地以自底向上的逻辑方式整合答案。\n    “Eve the Evaluator”、“Anna the Answerer”和“Carl the Combiner”是系统提示中为每个 LLM 赋予的名称，这样您可以轻松地为特定步骤添加额外指令。此外还有用于摘要的“Sam the Summarizer”以及用于扩展查询的“Raphael the Rephraser”。\n    5. 每份文档都由唯一哈希值标识，答案也会注明来源，因此您清楚每条信息来自哪份文档。\n    * 支持特殊语法，如“QE >>>> QA”，其中 QE 是用于筛选嵌入的过滤性问题，而 QA 则是您真正想要得到答案的问题。\n* **网络搜索**：初步支持使用 [DuckDuckGo](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDuckDuckGo) 进行网络搜索。只需输入 `wdoc web "How is Nvidia today this month?"` 即可。\n* **高级摘要**：\n    * 不再生成无法使用的“高层次要点”，而是将作者的推理、论点、思考过程等内容压缩成易于浏览的 Markdown 文件。\n    * 摘要随后会经过多次检查，确保逻辑缩进等正确无误。\n    * 摘要可以采用与原文相同的语言，也可以直接翻译。\n* **多项任务**：请参阅 [支持的任务](#Tasks)。\n* **信任但需验证**：答案会注明来源：`wdoc` 会跟踪答案中所用每份文档的哈希值，方便您核实每一项断言。\n* **Markdown 格式的答案和摘要**：使用 [rich](https:\u002F\u002Fgithub.com\u002FTextualize\u002Frich)。\n* **合理的嵌入策略**：默认使用诸如 [多查询检索器](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fhow_to\u002FMultiQueryRetriever) 等复杂嵌入技术，同时也包含 SVM、KNN、父级检索器等，且可自定义。\n* **文档齐全**：大量 docstring、代码注释以及详细的 `--help` 说明等。请查看 [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md)，其中列出了 Shell 和 Python 示例。完整的帮助信息可在 [help.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fdocs\u002Fhelp.md) 文件中找到，或通过 `python -m wdoc --help` 获取。我致力于维护详尽的文档。单页完整文档可在 [网站](https:\u002F\u002Fwdoc.readthedocs.io\u002Fen\u002Flatest\u002Fall_docs.html) 上查阅。\n* **可脚本化\u002F可扩展**：您可以将 `wdoc` 作为可执行文件或库来使用。请参阅下方的 [使用 wdoc 编写的脚本](#scripts-made-with-wdoc)。甚至还有一款 [开放 Web UI 工具](https:\u002F\u002Fopenwebui.com\u002Ft\u002Fqqqqqqqqqqqqqqqqqqqq\u002Fwdoctool)。\n* **严格类型检查**：借助出色的 [beartype](https:\u002F\u002Fbeartype.readthedocs.io\u002Fen\u002Flatest\u002F) 库，在不降低性能的情况下实现运行时类型检查！可通过环境变量选择关闭类型检查：`WDOC_TYPECHECKING="disabled \u002F warn \u002F crash" wdoc`（默认为 `warn`）。\n* **LLM（及嵌入）缓存**：加快处理速度，同时优化索引的存储和加载（对于大型语料库非常实用）。\n* **优秀的 PDF 解析**：PDF 解析器通常不太可靠，因此我们采用了 15 种不同的加载器，并根据解析评分保留最佳结果。其中包括通过 
[openparse](https:\u002F\u002Fgithub.com\u002FFilimoa\u002Fopen-parse\u002F) 实现的表格支持（默认无需 GPU），或通过 [UnstructuredPDFLoader](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fintegrations\u002Fdocument_loaders\u002Funstructured_pdfloader\u002F) 实现。\n* **Langfuse 支持**：如果您设置了相应的 Langfuse 环境变量，它们将会被使用。请参阅 [这篇指南](https:\u002F\u002Flangfuse.com\u002Fdocs\u002Fintegrations\u002Flangchain\u002Ftracing) 或 [这篇](https:\u002F\u002Flangfuse.com\u002Fdocs\u002Fintegrations\u002Flitellm\u002Ftracing) 了解更多（注意：若处于私有模式，则此功能会被禁用，以避免任何数据泄露）。\n* **文档过滤**：可根据文档内容或元数据的正则表达式进行过滤。\n* **二进制嵌入支持**：自定义 LangChain 向量存储，以使用二进制嵌入，这可能带来 [约 32 倍的压缩比、更快的搜索速度，且通常几乎不会损失准确性](https:\u002F\u002Fsimonwillison.net\u002F2024\u002FMar\u002F26\u002Fbinary-vector-search\u002F)。\n* **快速高效**：并行加载、解析、嵌入、查询文档等。\n* **Shell 自动补全**：使用 [python-fire](https:\u002F\u002Fgithub.com\u002Fgoogle\u002Fpython-fire\u002Fblob\u002Fmaster\u002Fdocs\u002Fusing-cli.md#completion-flag)。\n* **通知回调**：例如，可以使用 [ntfy.sh](https:\u002F\u002Fntfy.sh) 将摘要发送到您的手机上。\n* **黑客思维**：我是一位友好的开发者！如果您有任何功能请求或其他问题，请随时提交 Issue。\n\n### 任务\n* **query**：提供文档并针对其内容提问。\n* **search**：仅返回文档及其元数据。对于 Anki 用户，可以直接在浏览器中打开卡片。\n* **summarize**：提供文档并生成摘要。摘要提示可在 `utils\u002Fprompts.py` 中找到。\n* **summarize_then_query**：先对文档进行摘要，然后允许您直接就摘要内容提问。\n\n### 文件类型\n* **anki**: Anki 数据库集合的任意子集。图像的 `alt` 和 `title` 属性可以展示给大模型，这意味着如果你使用了 [ankiOCR 插件](https:\u002F\u002Fgithub.com\u002Fcfculhane\u002FAnkiOCR)，这些信息将有助于为大模型提供笔记的上下文。\n* **auto**: 默认选项，自动猜测文件类型。\n* **epub**: 由于 epub 格式本身定义不够清晰，因此测试较少。\n* **json_dict**: 包含单个 JSON 字典的文本文件。\n* **local_audio**: 支持多种音频格式，可选择使用 OpenAI 的 Whisper 模型或 [Deepgram](https:\u002F\u002Fdeepgram.com) 的 Nova-3 模型。支持自动去除静音等功能。注意：对于 Whisper 无法处理的大文件（通常大于 25MB），系统会自动将其拆分为较小的文件进行转录，然后再合并。此外，音频转录内容会被转换为带有规律时间戳的文本，从而允许用户询问大模型某段内容是在何时出现的。\n* **local_html**: 适用于网站数据转储。\n* **local_video**: 先提取音频，然后按 **local_audio** 处理。\n* **logseq_markdown**: 借助我的另一个项目 [LogseqMarkdownParser](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002FLogseqMarkdownParser)，你可以直接使用你的 [Logseq 图谱](https:\u002F\u002Fgithub.com\u002Flogseq\u002Flogseq\u002F)。\n* **online_media**: 使用 youtube_dl 尝试下载视频或音频；若失败，则通过 Playwright 加载页面来拦截可能有效的 URL。随后按 **local_audio** 处理（但也可处理视频）。\n* **online_pdf**: 通过 URL 获取 PDF，然后按 **pdf** 类型处理（见上文）。\n* **pdf**: 实现了 15 种默认加载器，系统会根据启发式算法选择最佳方案并提前停止。表格支持可通过 [openparse](https:\u002F\u002Fgithub.com\u002FFilimoa\u002Fopen-parse\u002F) 或 [UnstructuredPDFLoader](https:\u002F\u002Fpython.langchain.com\u002Fdocs\u002Fintegrations\u002Fdocument_loaders\u002Funstructured_pdfloader\u002F) 实现。易于扩展更多支持（自定义解析器的骨架示意见下文）。\n* **powerpoint**: .ppt、.pptx、.odp 等。\n* **string**: 命令行会提示你输入文本，方便粘贴内容，尤其适合付费文章！\n* **text**: 直接以路径形式传递文本内容。\n* **txt**: .txt、Markdown 等。\n* **url**: 尝试多种方式加载网页，并通过启发式算法找到解析效果最好的版本。\n* **word**: .doc、.docx、.odt 等。\n* **youtube**: 文本内容来自 YouTube 字幕或翻译，或者更好的是，使用 Whisper 或 Deepgram 进行转录。需要注意的是，YouTube 字幕会附带时间码（因此你可以询问“作者是在什么时候提到某某内容的”），但其采样频率较低（不是每秒一个时间码，而是每 15 秒一个）。在总结时，YouTube 章节也会作为上下文提供给大模型，这可能会显著提升总结效果。\n\n### 递归文件类型\n* **ddg**: 使用 [DuckDuckGo](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDuckDuckGo) 进行在线网页搜索。这不是代理搜索，我们仅对 DuckDuckGo 检索到的 URL 使用 `wdoc` 进行处理并返回结果。此功能仅支持 `query` 任务。\n* **json_entries**: 将路径指向一个文件，其中每行都是一个包含加载参数的 JSON **字典**。例如，用于加载其他多种递归类型。示例可在 `docs\u002Fjson_entries_example.json` 中找到。\n* **link_file**: 将每行包含一个 URL 的文本文件转换为相应的加载器参数。支持任何类型的链接，例如网页、PDF 链接和 YouTube 链接都可以放在同一文件中。非常适合批量总结内容！\n* **recursive_paths**: 将路径、正则表达式模式和文件类型结合，递归地查找所有匹配的文件，并按照指定的文件类型进行处理（例如多个 PDF 或大量 HTML 文件等）。\n* **toml_entries**: 读取 .toml 文件。示例可在 `docs\u002Ftoml_entries_example.toml` 中找到。\n* **youtube playlists**: 获取每个视频的链接，然后按 **youtube** 类型处理。
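\n\n上面 **pdf** 条目提到“易于扩展更多支持”。按照下文 FAQ 中给出的接口约定（`__init__` 接收 `path` 参数，`load` 方法无参数并返回 `List[Document]`，再注册到 `wdoc.utils.loaders.pdf_loaders`），一个自定义解析器的骨架大致如下。这只是一个示意草图，真正的抽取逻辑需自行实现：\n\n``` python\n# 自定义 PDF 解析器骨架：满足文中描述的解析器类接口约定\nfrom pathlib import Path\nfrom typing import List\n\nfrom langchain_core.documents import Document  # wdoc 基于 LangChain 的文档类型\n\nclass MyPdfParser:\n    def __init__(self, path: str):\n        # 按约定，构造函数必须接收 path 参数\n        self.path = Path(path)\n\n    def load(self) -> List[Document]:\n        # 示意：真实解析器应在此处抽取 PDF 文本\n        text = f"parsed content of {self.path.name}"\n        return [Document(page_content=text, metadata={"source": str(self.path)})]\n\n# 注册后即可通过 --pdf_parsers=my_parser 启用（假设 wdoc 已安装）：\n# from wdoc.utils import loaders\n# loaders.pdf_loaders["my_parser"] = MyPdfParser\n```\n\n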
## 操作指南与示例\n\n请参阅 [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md)。\n\n## 开始使用\n*`wdoc` 主要在 Python 3.13.5 上开发和测试，但为了兼容性，也可以在 Python 版本 `>=3.11` 上安装。如果条件允许，请尽量使用 Python 3.13。*\n\n### 直接安装\n\n1. 安装方法：\n    * 使用 pip：`pip install -U wdoc[full]`（如果你想尝试依赖项更少的版本，可以使用 `pip install -U wdoc`，但你需要手动为你的用例安装缺失的依赖）。\n    * 或者获取特定的 Git 分支：\n        * `dev` 分支：`pip install git+https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc@dev[full]`\n        * `main` 分支：`pip install git+https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc@main[full]`\n    * 你也可以使用 uvx 或 pipx。但由于我对它们不太熟悉，不确定是否会导致缓存等问题。如果你试过的话，请告诉我！\n        * 使用 uvx：`uvx wdoc[full]@latest --help`\n        * 使用 pipx：`pipx run wdoc[full] --help`\n    * 无论如何，建议：\n        * 安装 `wdoc[full]` 版本，除非有特殊限制。\n        * 同时尝试安装 `pdftotext`：`pip install -U wdoc[pdftotext]`，以及添加 `fasttext` 支持：`pip install -U wdoc[fasttext]`。\n    * 如果你计划贡献代码，还需要 `wdoc[dev]` 来使用提交钩子。\n    * **Claude Code 用户**：为了让 Claude Code 了解 `wdoc` 的 CLI 和 Python API，请安装 [SKILL.md](.\u002FSKILL.md) 参考文件：\n        ```bash\n        mkdir -p ~\u002F.claude\u002Fskills\u002Fwdoc && wget -O ~\u002F.claude\u002Fskills\u002Fwdoc\u002FSKILL.md https:\u002F\u002Fraw.githubusercontent.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fmain\u002FSKILL.md\n        ```\n2. 将你想要使用的后端 API 密钥添加为环境变量：例如 `export ANTHROPIC_API_KEY="***my_key***"`\n3. 启动非常简单，只需运行 `wdoc --task=query --path=MYDOC [ARGS]` 即可。\n    * 如果由于某种原因失败了，可以尝试使用 `python -m wdoc`。如果仍然不行，可以试试 `uvx wdoc@latest`；作为最后的手段，可以克隆这个仓库并进入目录后再试一次。如有问题，请随时提交 issue。\n    * 要启用 Shell 自动补全功能：如果你使用 zsh，运行 `eval $(cat shell_completions\u002Fwdoc_completion.zsh)`。同时提供适用于 `bash` 和 `fish` 的补全脚本。你也可以通过 `wdoc -- --completion MYSHELL > my_completion_file` 生成自己的补全文件。\n    * 请注意，如果你处理大量文档（尤其是递归类型的文件），可能需要较长时间（这也取决于并行处理能力，但可能会遇到内存错误）。\n    * 查看 [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md)，其中列出了 Shell 和 Python 的示例。\n4. 要对本地文档提问：`wdoc query --path="PATH\u002FTO\u002FYOUR\u002FFILE" --filetype="auto"`\n    * 如果你想通过直接加载之前运行生成的嵌入来减少启动时间（尽管嵌入本身始终会被缓存），可以在之前的命令中添加 `--saveas="some\u002Fpath"`，将生成的嵌入保存到文件中，并在后续每次调用时替换为 `--loadfrom "some\u002Fpath"`。\n5. 要进行在线搜索，可以这样操作：`wdoc --task=query --path='How is Nvidia doing this month?' --query='How is Nvidia doing this month' --filetype=ddg`。但如果 `path` 或 `query` 中有任何一个缺失，我们会用另一个来代替。也可以这样使用：`wdoc web 'How is Nvidia doing this month?'`。\n6. 更多信息请阅读文档：`wdoc --help`
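\n\n除了命令行，`wdoc` 也可以作为 Python 库调用（见上文“可脚本化\u002F可扩展”）。后文 FAQ 中展示了 `wdoc.parse_doc` 的用法；下面这个查询示例则属于合理推测：假设 `wdoc` 类接受与 CLI 标志同名的关键字参数（`task`、`path`、`query` 等），具体签名请以 `wdoc --help` 和 [SKILL.md](.\u002FSKILL.md) 为准：\n\n``` python\n# 以库方式调用 wdoc 的草图（假设性示例：关键字参数与 CLI 标志同名）\n# 运行前需已安装 wdoc 并设置好后端 API 密钥环境变量\nfrom wdoc import wdoc\n\ninstance = wdoc(\n    task="query",\n    path="path\u002Fto\u002Fyour\u002Ffile.pdf",\n    query="这份文档的主要结论是什么？",\n)\n```\n\n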
### 实验性 Docker 界面\n\n你还可以使用实验性的 Docker 界面，在浏览器中（包括在智能手机上）使用 `wdoc`。\n\n详细说明请参阅 [Docker README](.\u002Fdocker\u002FREADME.md)。\n\n## 使用 wdoc 制作的脚本\n* *更多内容将在 [scripts 文件夹](.\u002Fscripts\u002F) 中陆续发布*。\n* [Ntfy Summarizer](scripts\u002FNtfySummarizer)：利用 [ntfy.sh](https:\u002F\u002Fntfy.sh) 自动从你的 Android 手机总结文档。\n* [TheFiche](scripts\u002FTheFiche)：直接以 [logseq](https:\u002F\u002Fgithub.com\u002Flogseq\u002Flogseq) 页面的形式为特定概念创建摘要。\n* [FilteredDeckCreator](scripts\u002FFilteredDeckCreator)：根据 `wdoc` 检索到的卡片，直接为 [anki](https:\u002F\u002Fankitects.github.io\u002F) 创建筛选后的学习卡片组。\n* [官方 Open-WebUI 工具](https:\u002F\u002Fopenwebui.com\u002Ft\u002Fqqqqqqqqqqqqqqqqqqqq\u002Fwdoctool)，托管于 [这里](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fopenwebui_custom_pipes_filters\u002Fblob\u002Fmain\u002Ftools\u002Fwdoc_tools.py)。\n* [MediaURLFinder](scripts\u002FMediaURLFinder) 简单地利用 `find_online_media` 加载器助手，结合 `playwright` 和 `yt-dlp` 查找所有媒体资源的 URL（视频、音频等）。这在仅靠 `yt-dlp` 无法找到资源 URL 时特别有用。\n\n## 常见问题解答\n\n\u003Cdetails>\n\u003Csummary>\n常见问题解答\n\u003C\u002Fsummary>\n\n* **这是为谁准备的？**\n    * `wdoc` 专为希望进行强大文档查询并获得深度 AI 驱动文档摘要的高级用户设计。\n* **什么是 RAG？**\n    * RAG 系统（检索增强生成）本质上是基于大型语言模型对文本语料库进行搜索。\n* **为什么要再做一个 RAG 系统？不能用现有的其他系统吗？**\n    * 我是 [Olicorne](https:\u002F\u002Folicorne.org\u002F)，一名精神科住院医师，我需要一个工具来从**大量**（数万份）不同类型的文档中提出医学问题，这些文档包括 epub、pdf、[Anki](https:\u002F\u002Fankitects.github.io\u002F) 数据库、[Logseq](https:\u002F\u002Fgithub.com\u002Flogseq\u002Flogseq\u002F)、网站备份、YouTube 视频和播放列表、会议录音、音频文件等。现有的解决方案无法处理如此多样且规模庞大的内容。\n* **为什么 `wdoc` 比大多数 RAG 系统更适合用于文档问答？**\n    * 它同时使用了强模型和查询评估模型。首先通过嵌入技术找到合适的文档，然后由查询评估模型过滤掉与问题无关的文档；接着，强模型会根据剩余的每一份文档回答问题，并将所有答案以整洁的 Markdown 格式整合在一起。此外，`wdoc` 具有高度可定制性。\n* **能否在 `wdoc` 的文档上使用 `wdoc`？**\n    * 当然可以！`wdoc --task=query --path https:\u002F\u002Fwdoc.readthedocs.io\u002Fen\u002Flatest\u002Fall_docs.html`\n* **为什么 `wdoc` 还能生成摘要？**\n    * 我空闲时间很少，因此需要一个量身定制的摘要功能来及时了解新闻动态。但大多数摘要系统质量很差，只是简单地给出高层次的要点，而没有正确处理文本分块。于是我自己开发了一个专门的摘要器。**摘要提示可以在 `utils\u002Fprompts.py` 中找到，其重点在于提取作者的论点、推理过程和思想脉络，然后使用 Markdown 缩进的项目符号列表使内容易于阅读。** 效果非常好！该提示的数据类未被冻结，因此你可以根据需要提供自己的提示。\n* **`wdoc` 支持哪些任务？**\n    * 请参阅 [任务](#tasks)。\n* **`wdoc` 支持哪些 LLM 提供商？**\n    * 借助 [litellm](https:\u002F\u002Fdocs.litellm.ai\u002F)，`wdoc` 几乎支持任何 LLM 提供商。它甚至支持本地 LLM 和本地嵌入（详见 [examples.md](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fblob\u002Fmain\u002Fwdoc\u002Fdocs\u002Fexamples.md)）。支持的嵌入引擎列表可在 [这里](https:\u002F\u002Fdocs.litellm.ai\u002Fdocs\u002Fembedding\u002Fsupported_embedding) 查看，其中包括至少 OpenAI（或任何兼容 OpenAI API 的模型）、Cohere、Azure、Bedrock、NVIDIA NIM、Hugging Face、Mistral、Ollama、Gemini、Vertex、Voyage 等。\n* **你用 `wdoc` 来做什么？**\n    * 我会关注各种不同的信息来源以了解最新动态：YouTube、网站等。借助 `wdoc`，我可以自动生成精美的 Markdown 摘要，并直接将其导入我的 [Logseq](https:\u002F\u002Fgithub.com\u002Flogseq\u002Flogseq\u002F) 数据库，形成一系列 `TODO` 块。\n    * 我用它向我庞大的医学知识语料库提出技术性问题。\n    * 我还使用 `--private` 参数查询个人文档。\n    * 有时我会先对文档进行摘要，然后再在同一命令中直接提问。\n    * 我也会用它来询问整个 YouTube 播放列表的相关问题。\n    * 其他使用场景则构成了我创建的 [使用 `wdoc` 制作的脚本部分](#scripts-made-with-wdoc)。\n* **这个名字有什么含义？**\n    * 我最喜欢的角色之一（也可以说是我的榜样）是 [温斯顿·沃尔夫](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=UeoMuK536C8)，但在经过一番犹豫后，我决定“WolfDoc”会让人感到困惑，“WinstonDoc”又像是微软会起的名字。另外，“wd”和“wdoc”这两个名称当时还未被占用，而“doctools”已经被注册了。该项目最初的名字是“DocToolsLLM”，这是一个结合了“医生”和“工具”的双关语。\n* **如何在不编写代码的情况下改进特定任务的提示？**\n    * `query` 任务中的每个提示都扮演着 WDOC-CORP© 公司员工的角色，分别是“Eve the Evaluator”（负责筛选相关文档的 LLM）、“Anna the Answerer”（根据筛选后的文档回答问题的 LLM）以及“Carl the Combiner”（将 Answerer 的答案整合为一个整体的 LLM）。此外还有“Sam the 
Summarizer”用于生成摘要，以及“Raphael the Rephraser”用于扩展你的查询。只要你以提示的方式与它们对话，它们就会听从你的指令。\n* **如何将 `wdoc` 的解析器用于我自己的文档？**\n    * 如果你在 Shell 命令行界面中，可以轻松使用 `wdoc parse my_file.pdf`。\n        * 添加 `--format=langchain_dict` 可以将文本和元数据以字典列表的形式输出，否则只会得到纯文本。还有其他格式，例如 `--format=xml`，使其更符合 LLM 的要求，类似于 [files-to-prompt](https:\u002F\u002Fgithub.com\u002Fsimonw\u002Ffiles-to-prompt)。\n    * 如果你想用 Python 处理文档：\n        ``` python\n        from wdoc import wdoc\n        list_of_docs = wdoc.parse_doc(path=my_path)\n        ```\n    * 另一个例子是使用 `wdoc` 解析 Anki 词库：`wdoc parse --filetype \"anki\" --anki_profile \"Main\" --anki_deck \"mydeck::subdeck1\" --anki_notetype \"my_notetype\" --anki_template \"\u003Cheader>\\n{header}\\n\u003C\u002Fheader>\\n\u003Cbody>\\n{body}\\n\u003C\u002Fbody>\\n\u003Cpersonal_notes>\\n{more}\\n\u003C\u002Fpersonal_notes>\\n\u003Ctags>{tags}\u003C\u002Ftags>\\n{image_ocr_alt}\" --anki_tag_filter \"a::tag::regex::.*something.*\" --format=text`\n* **如果我的 PDF 文件被加密了，该怎么办？**\n    * 如果你在 Linux 系统上，可以尝试运行 `qpdf --decrypt input.pdf output.pdf`。\n        * 我在这个仓库里制作了一个简单粗暴的批处理脚本：[PDF_batch_decryptor](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002FPDF_batch_decryptor)。\n* **如何添加我自己的 PDF 解析器？**\n    * 编写一个 Python 类，并将其添加到此处：`wdoc.utils.loaders.pdf_loaders['parser_name']=parser_object`，然后在调用 `wdoc` 时指定 `--pdf_parsers=parser_name`。\n        * 该类必须在 `__init__` 方法中接收一个 `path` 参数，并包含一个无参数但返回 `List[Document]` 的 `load` 方法。可以参考 `OpenparseDocumentParser` 类作为示例。\n\n* **如果我频繁遇到速率限制，该怎么办？**\n    * 最简单的方法是添加 `debug` 参数。这将禁用多线程、多进程和 LLM 并发。另一种较为温和的方式是将环境变量 `WDOC_LLM_MAX_CONCURRENCY` 设置为较低的值。\n\n* **如何运行测试？**\n    * 请查看 `.\u002Ftests\u002Frun_all_tests.sh` 文件。\n\n* **如何在不进行分块的情况下查询文本？\u002F 如何以全文作为上下文进行查询？**\n    * 如果你将环境变量 `WDOC_MAX_CHUNK_SIZE` 设置为一个非常高的值，并使用一个根据 litellm 元数据具有足够上下文窗口的模型，那么就不会发生分块，LLM 将以全文作为上下文进行处理。\n\n* **有没有办法将 `wdoc` 与 [Open-WebUI](https:\u002F\u002Fgithub.com\u002Fopen-webui\u002Fopen-webui\u002F) 一起使用？**\n    * 是的！我维护了一个[官方 Open-WebUI 工具](https:\u002F\u002Fopenwebui.com\u002Ft\u002Fqqqqqqqqqqqqqqqqqqqq\u002Fwdoctool)，托管在[这里](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fopenwebui_custom_pipes_filters\u002Fblob\u002Fmain\u002Ftools\u002Fwdoc_tools.py)。\n\n* **`wdoc` 有 Web 界面吗？**\n    * 有！提供了一个[基于 Docker 的 Gradio 实验性 Web 界面](.\u002Fdocker\u002FREADME.md)，可以轻松部署和使用，无需命令行交互。\n\n* **我可以将 shell 管道与 `wdoc` 一起使用吗？**\n    * 可以！通过 shell 管道发送的数据（无论是字符串还是二进制数据）都会自动保存到一个临时文件中，然后作为 `--path=[temp_file]` 参数传递。例如 `cat **\u002F*.txt | wdoc --task=query`、`echo $my_url | wdoc parse`，甚至 `cat my_file.pdf | wdoc parse --filetype=pdf`。对于二进制输入，强烈建议使用 `--filetype` 参数，因为 `python-magic` 版本 ≤0.4.27 在没有指定文件类型时会出错（参见[这个问题](https:\u002F\u002Fgithub.com\u002Fahupp\u002Fpython-magic\u002Fissues\u002F261)）。\n\n* **是否可以在运行时设置环境变量？**\n    * 基本可以。实际上，在导入 `wdoc` 时，`wdoc\u002Futils\u002Fenv.py` 中的代码会创建一个数据类来存储 `wdoc` 使用的环境变量。这样做主要是为了确保运行时的类型检查，并保证当 `wdoc` 代码内部通过该数据类访问某个环境变量时，始终会与系统环境中的值进行比较。如果你决定在整个代码中更改环境变量，那么这些新值会在 `wdoc` 内部生效。不过这种方式有些脆弱，因为某些环境变量用于存储函数或类的默认值，因此只在代码导入时使用一次，后续可能会不同步。此外，如果 `wdoc` 怀疑 `WDOC_PRIVATE_MODE` 环境变量不同步，它会故意崩溃，以确保安全。另外需要注意的是，如果发现类似 `WDOC_LANGFUSE_PUBLIC_KEY` 的环境变量，`wdoc` 会用其覆盖 `LANGFUSE_PUBLIC_KEY`。这是因为 `litellm`（可能还有其他库）会查找这个环境变量来启用 `langfuse` 回调。整个机制允许为特定用户或在使用 `open-webui` 的 `wdoc` 工具时设置环境变量。欢迎对此功能提出反馈。\n\n* **如何使用 Sphinx 构建 autodoc？**\n    * 我一直使用的命令是 `sphinx-apidoc -o docs\u002Fsource\u002F wdoc --force`，需要从本仓库的根目录执行。\n\n* **为什么我无法在其他 LangChain 项目中加载向量存储？**\n    * 在 `wdoc\u002Futils\u002Fcustoms\u002Fbinary_faiss_vectorstore.py` 中，我们创建了 `BinaryFAISS` 和 `CompressedFAISS`。后者与 
FAISS 类似，只是对序列化的索引进行了 zlib 压缩；而前者则在此基础上增加了二进制嵌入，从而实现更快、更紧凑的嵌入表示。如果你想完全禁用压缩，可以使用环境变量 `WDOC_MOD_FAISS_COMPRESSION=false`。\n\n* **测试套件使用的是哪个 Python 版本？**\n    * 推荐的 Python 版本是 `3.12.11`。\n\n* **为什么在线搜索只支持“query”任务？**\n    * `wdoc` 处理摘要的方式是将“整篇文档”分割成连续的“子文档”，然后逐个生成摘要。但如果一开始就有多篇文档（比如不同的网页），这种“顺序”就失去了意义。\n\n\u003C\u002Fdetails>\n\n\n\n## 路线图\n\n\u003Cdetails>\n\u003Csummary>\n点击查看更多\n\u003C\u002Fsummary>\n\n\u003Ci>此待办事项列表由 [MdXLogseqTODOSync](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002FMdXLogseqTODOSync) 自动维护。\u003C\u002Fi>\n\n\u003C!-- BEGIN_TODO -->\n- ## Most urgent\n    - figure out a good way to skip merging batches that are too large before trying to merge them\n        - probably means adding an env var to store a max value, document it in the help.md\n        - then check after batch creation if a batch is that large\n        - if it is put it in a separate var, to be concatenated later with the rest of the answers\n    - add more tests\n        - add test for the private mode\n        - add test for the testing models\n        - add test for the recursive loader functions\n        - add test for each loader\n    - rewrite the python API to make it more useable. (also related to https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fissues\u002F13)\n        - pay attention to how to modify the init and main.py files\n        - pay attention to how the --help flag works\n        - pay attention to how the USAGE document is structured\n    - support other vector databases\n    - learn how to set a github action for test code coverage\n    - allow anki to use anki type search queries\n    - refactor the tasks to use langgraph, as it seems easier to do complex recursive tasks with it\n    - use async for the langchain chains\n- ### Features\n    -  use clusters of semantic ordering instead of just the order you dumbass\n    - ability to cap the search documents capped by a number of tokens instead of a number of documents\n    - Add prompt caching for claude\n    - add a \"fast summary\" feature that does not use recursive summary if you care more about speed than overlapping summaries\n    - count how many time each source is used, as it can be relevant to infer answer quality\n    - add an html format output. It would display a nice UI with proper dropdowns for sources etc\n    - if a model supports structured output we should make use of it to get the thinking and answer part. Opt in because some models hide their thoughts.\n    - add an intermediate step for queries that asks the LLM for appropriate headers for the md output. Then for each intermediate answer attribute it a list of 1 to 3 headers (because a given intermediate answer can  contain several pieces of information), then do the batch merge of intermediate answer per header.\n        - this needs to be scalable and easy to add recursion to (because then we can do this for subheaders and so on)\n        - the end goal is to have a scalable solution to answer queries about extremely large documents for impossibly vast questions\n    - use apprise instead of ntfy for the scripts\n    - add crawl4ai parser: https:\u002F\u002Fgithub.com\u002Funclecode\u002Fcrawl4ai\n    - Way to add the title (or all metadata) of a document to its own text. Enabled by default. 
Because this would allow searching among many documents that don't refer to the original title (for example: material safety datasheets)\n        - default value is "author", "page", "title"\n        - pay attention to avoid including personal info (for example use relative paths instead of absolute paths)\n    - add a \u002Fsave PATH command to save the chat and metadata to a json file\n    - add image support printing via icat or via the other lib you found last time, would be useful for summaries etc\n    - add wdoc to tldr pages\n    - add an audio backend to use the subtitles from a video file directly\n    - store the anki images as 'imagekeys' as the idea works for other parsers too\n    - investigate asking the LLM to add leading emojis to the bullet point for improved reading\n    - add a key\u002Fval arg to specify the trust we have in a doc, call it context\n    - add a way to open the documents automatically, based on platform dirs etc. For ex if okular is installed, open pdfs directly at the right page\n        - the best way would be to create opener.py that does a bit like loader but for all filetypes and platforms\n        - use a cli selector like in mnemonics creator\n            - add shortcut to sort by score or by name\n            - display metadata and score in a previewer\n    - add an argument --whole_text to avoid chunking (this would just increase the chunk size to a super large number I guess)\n    - add apprise callback support\n    - add a filetype "custom_parser" and an argument "--custom_parser" containing a path to a python file. Must receive a docdict and a few other things and return a list of documents\n    - add bespoke-minicheck from ollama to fact check when using RAG: https:\u002F\u002Follama.com\u002Flibrary\u002Fbespoke-minicheck\n        - or via their API directly: https:\u002F\u002Fdocs.bespokelabs.ai\u002Fbespoke-minicheck\u002Fapi but they don't seem to properly disclose what they do with the data\n    - add a langchain code loader that uses aider to get the repomap\n        - https:\u002F\u002Fgithub.com\u002Fpaul-gauthier\u002Faider\u002Fissues\u002F1043#issuecomment-2278486840\n        - https:\u002F\u002Faider.chat\u002Fdocs\u002Fscripting.html\n    - add a pikepdf loader because it can be used to automatically decrypt pdfs\n    - add a query_branching_nb argument that asks an LLM to identify a list of keywords from the intermediate answers, then look again for documents using this keyword and filtering via the weak llm\n    - write a script that shows how to use bertopic on the documents of wdoc\n    - add a retriever where the LLM answers without any context\n    - add support for readabilipy for parsing html\n        - https:\u002F\u002Fgithub.com\u002Falan-turing-institute\u002FReadabiliPy\n    - add an obsidian loader\n        - https:\u002F\u002Fpypi.org\u002Fproject\u002Fobsidiantools\u002F\n    - add a \u002Fchat command to the prompt, it would enable starting an interactive session directly with the llm\n    - find a way to make it work with llm from simonw\n    - make images an actual filetype\n- ### Enhancements\n    - store the available tasks in a dataclass in misc.py\n    - turn arguments that contain a _ into arguments with a -\n        - in the cli launcher function, manually convert arguments\n    - maybe add support for docling to parse documents?\n    - when querying hard stuff the number of dropped documents after batching is non-negligible, we should remove those from the list of documents to display and instead 
store those in another variable\n    - check if using html syntax is less costly and confusing to LLMs than markdown with all that indentation. Or maybe json. It would be simple to turn that into markdown afterwards.\n    - check that the task search works on things other than anki\n    - create a custom retriever, derived from multiquery retriever that does actual parallel requests. Right now it's not the case (maybe in async but I don't plan on using async for now). This retriever seems a good part of the slow down.\n    - stop using your own youtube timecode parser and instead use langchain's chunk transcript format\n    - implement usearch instead of faiss, it seems in all points faster, supports quantized embeddings, I trust their langchain implementation more\n        - https:\u002F\u002Fpython.langchain.com\u002Fapi_reference\u002Fcommunity\u002Fvectorstores\u002Flangchain_community.vectorstores.usearch.USearch.html#langchain_community.vectorstores.usearch.USearch\n    - use an env var to set litellm's drop_params\n    - add more specific exceptions for file loading errors. One exception for all, one for batch and one for individual loader\n    - use heuristics to find the best number of clusters when doing semantic reranking\n    - arg to use jina v3 embeddings for semantic batching because it allows specifying tasks that seem appropriate for that\n    - add an env variable or arg to overload the backend url for whisper. Then set it always for you and mention it there: https:\u002F\u002Fgithub.com\u002Ffedirz\u002Ffaster-whisper-server\u002Fissues\u002F5\n    - find a way to set a max cost at which to crash if exceeded during a query, probably via the price callback\n    - anki_profile should be able to be a path\n    - store wdoc's version and indexing timestamp in the metadata of the document\n    - arg --oneoff that does not trigger the chat after replying. Allowing to not hog all the RAM if run in multiple terminals for example through SSH\n    - add a (high) token threshold above which two texts are not combined but just concatenated in the semantic order. It would avoid it losing context. Use a --- separator\n    - compute the cost of whisper and deepgram\n    - use a pydantic basemodel for output instead of a dict\n        - same for summaries, it should at least contain the method to substitute the sources and then back\n    - investigate storing the vectors in a sqlite3 file\n    - make a plugin to llm that looks like file-to-prompt from simonw\n    - Always bind a user metadata to litellm for langfuse etc\n        - Add more metadata to each request to make langfuse more informative\n    - add a reranker to better sort the output of the retrievers. 
Right now with the multiquery it returns way too many and I'm thinking it might be a bad idea to just crop at top_k as I'm doing currently\n    - add a status argument that just outputs the logs location and size, the cache location and size, the number of documents etc\n    - add the python magic of the file as a file metadata\n    - add an env var to specify the threshold for relevant document by the query eval llm\n    - find a way to return the evaluations for each document also\n    - move retrievers.py in an embeddings folder\n    - stop using lambda functions in the chains because it makes the code barely readable\n    - when doing recursive summary: tell the model that if it's really sure that there are no modifications to do: it should just reply \"EXIT\" and it would save time and money instead of waiting for it to copy back the exact content\n    - add image parsing as base64 metadata from pdf\n    - use multiple small chains instead of one large and complicated and hard to maintain\n    - add an arg to bypass query combine, useful for small models\n    - tell the llm to write a special message if the parsing failed or we got a 404 or paywall etc\n        - catch this text and crash\n    - add check that all metadata is only made of int float and str\n    - move the code that filters embeddings inside the embeddings.py file\n        - this way we can dynamically refilter using the chat prompt\n    - task summary then query should keep in context both the full text and the summary\n    - if there's only one intermediate answer, pass it as answer without trying to recombine\n    - filter_metadata should support an OR syntax\n    - add a --show_models argument to display the list of available models\n    - add a way to open the documents automatically, based on platform dirs etc. For ex if okular is installed, open pdfs directly at the right page\n        - the best way would be to create opener.py that does a bit like loader but for all filetypes and platforms\n    - add an image filetype: it will be either OCR'd using format and\u002For will be captioned using a multimodal llm, for example gpt4o mini\n        - nanollava is a 0.5b that probably can be used for that with proper prompting\n    - add a key\u002Fval arg to specify the trust we have in a doc, call this metadata context in the prompt\n    - add an arg to return just the dict of all documents and embeddings. 
Notably useful to debug documents\n    - use a class for the cli prompt, instead of a dumb function\n    - arg to disable eval llm filtering\n        - just answer 1 directly if no eval llm is set\n    - display the number of documents and tokens in the bottom toolbar\n    - add a demo gif\n    - investigate asking the LLM to add leading emojis to the bullet point for quicker reading of summaries\n    - see how easy or hard it is to use an async chain\n    - ability to cap the search documents by a number of tokens instead of a number of documents\n    - for anki, allow using a query instead of loading with ankipandas\n    - add a "try_all" filetype that will try each filetype and keep the first that works\n    - add textract extractor: https:\u002F\u002Ftextract.readthedocs.io\u002Fen\u002Fstable\u002F\n    - write a langchain compatible tool for agents\n    - add bespoke-minicheck from ollama to fact check when using RAG: https:\u002F\u002Follama.com\u002Flibrary\u002Fbespoke-minicheck\n        - or via their API directly: https:\u002F\u002Fdocs.bespokelabs.ai\u002Fbespoke-minicheck\u002Fapi but they don't seem to properly disclose what they do with the data\n\u003C!-- END_TODO -->\n\n\u003C\u002Fdetails>","# wdoc 快速上手指南\n\nwdoc 是一个强大的 RAG（检索增强生成）系统，专为处理海量异构文档（PDF、EPUB、音频、视频、Anki 卡片等）而设计。它能提供带确切来源引用的精准回答和深度摘要，适合研究人员、学生及需要处理大量信息的开发者。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**：Linux, macOS, Windows (WSL 推荐)\n- **Python 版本**：Python 3.11 或更高版本（推荐 3.13）\n- **依赖工具**：`pip`, `git`\n\n### 前置依赖\nwdoc 依赖多个后端库（如 LangChain, LiteLLM）。为确保安装顺利，建议先更新包管理工具并安装基础编译依赖（特别是在 Linux 环境下）：\n\n```bash\n# Ubuntu\u002FDebian 示例\nsudo apt-get update\nsudo apt-get install -y python3-pip python3-dev build-essential libssl-dev libffi-dev\n\n# 确保 pip 为最新版本\npip install --upgrade pip setuptools wheel\n```\n\n> **国内加速提示**：建议使用国内镜像源加速 Python 包下载。\n> ```bash\n> pip config set global.index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 安装步骤\n\n### 方式一：直接安装（推荐）\n通过 PyPI 直接安装最新稳定版（建议安装功能完整的 full 版本）：\n\n```bash\npip install -U wdoc[full]\n```\n\n若需体验最新功能（开发版），可安装 GitHub 主分支：\n```bash\npip install git+https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc.git@main\n```\n\n### 方式二：Docker 部署（可选）\n如果你希望使用图形化界面（Web UI）且避免配置本地环境，可以使用 Docker：\n\n```bash\n# 拉取并运行（具体参数参考官方 docker\u002FREADME.md）\ndocker run -p 7860:7860 thiswillbeyourgithub\u002Fwdoc:latest\n```\n*启动后访问 `http:\u002F\u002Flocalhost:7860` 即可使用基于 Gradio 的网页界面。*\n\n## 基本使用\n\n在使用前，请确保已配置好 LLM API Key（支持 OpenAI, Claude, Ollama, OpenRouter 等）。通常通过环境变量设置，例如：\n```bash\nexport OPENROUTER_API_KEY="your_api_key_here"\n```\n\n### 1. 文档问答 (Query)\n对单个在线 PDF 进行提问，系统将自动下载、切片、检索并生成带来源引用的回答。\n\n```zsh\nlink="https:\u002F\u002Fsituational-awareness.ai\u002Fwp-content\u002Fuploads\u002F2024\u002F06\u002Fsituationalawareness.pdf"\n\nwdoc --path=$link --task=query --filetype="online_pdf" --query="What does it say about alphago?" --query_retrievers='basic_multiquery' --top_k=auto_200_500\n```\n\n**执行逻辑简述：**\n1. 解析链接并下载 PDF。\n2. 文本分块并向量化。\n3. 使用多查询策略检索最相关的 200-500 个文档片段。\n4. 利用轻量级模型过滤无关内容，再由强力模型提取信息。\n5. 语义聚类合并答案，输出最终带引用的 Markdown 回答。\n\n### 2. 文档摘要 (Summarize)\n生成保留作者推理过程和细节的深度摘要，而非泛泛而谈的结论。\n\n```zsh\nlink="https:\u002F\u002Fsituational-awareness.ai\u002Fwp-content\u002Fuploads\u002F2024\u002F06\u002Fsituationalawareness.pdf"\n\nwdoc --path=$link --task=summarize --filetype="online_pdf"\n```\n\n**执行逻辑简述：**\n1. 将文本分块。\n2. 调用强力 LLM 对每个片段进行低层级（含细节）的要点总结。\n3. 上下文关联：处理新片段时参考前一片段内容。\n4. 拼接所有片段总结并返回。\n\n### 3. 
组合任务 (Summarize then Query)\n先总结文档，随后允许用户基于总结内容继续提问：\n\n```zsh\nwdoc --path=$link --task=summarize_then_query --filetype=\"online_pdf\"\n```\n\n### 4. 网络搜索 (Web Search)\n利用 DuckDuckGo 进行实时网络信息检索与回答：\n\n```zsh\nwdoc web \"How is Nvidia today this month?\"\n```\n\n> **提示**：更多高级用法（如递归总结、混合文件类型索引、本地私有化部署）请参考项目仓库中的 `examples.md` 或 `SKILL.md` 文档。","某医学研究员需要同时分析数百份格式各异的资料（包括讲座录音、PDF 论文、EPUB 教科书及 Anki 卡片），以撰写一份关于新型抗抑郁药副作用的综述报告。\n\n### 没有 wdoc 时\n- **格式割裂严重**：必须手动打开不同软件分别处理音频、视频和文档，无法在同一语境下交叉验证信息。\n- **检索效率低下**：面对海量资料，传统关键词搜索难以定位具体段落，常遗漏关键细节或混淆相似概念。\n- **溯源困难重重**：AI 生成的总结往往缺乏明确出处，研究员需花费数小时人工核对原文，以防“幻觉”误导结论。\n- **总结流于表面**：通用工具仅能提取泛泛的要点，无法还原作者深层的推导逻辑和临床思考过程。\n\n### 使用 wdoc 后\n- **异构数据融合**：wdoc 直接 ingest 所有类型的文件，将录音、课件和文本统一索引，实现跨媒介的无缝查询。\n- **高精度语义召回**：利用高级 RAG 技术，wdoc 能从数万份文档中精准锁定相关片段，并自动聚合语义批次生成答案。\n- **答案自带溯源**：输出的 Markdown 回答不仅逻辑清晰，还精确标注了引用来源的具体段落，让每一条结论都可信可查。\n- **深度逻辑提炼**：wdoc 生成的摘要能还原作者的思维路径，帮助研究员快速掌握复杂的临床论证而非仅仅获取结论。\n\nwdoc 通过将杂乱的多源异构数据转化为可追溯、有深度的知识网络，极大提升了专业领域从信息收集到决策制定的效率。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fthiswillbeyourgithub_wdoc_a6a0b76b.png","thiswillbeyourgithub","Olivier Cornelis","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fthiswillbeyourgithub_bb017cbe.png","🧠 + 🩺 + 💻 = 🧑‍⚕️me🧑‍⚕️\r\n\r\nMore details at olicorne.org","olicorp","Paris, France",null,"https:\u002F\u002Folicorne.org","https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub",[82,86],{"name":83,"color":84,"percentage":85},"Python","#3572A5",97.9,{"name":87,"color":88,"percentage":89},"Shell","#89e051",2.1,514,41,"2026-04-01T10:26:00","AGPL-3.0","Linux, macOS, Windows","非必需（支持本地 LLM 及云端 API，具体取决于所选模型后端）","未说明（处理大量文档时建议较大内存）",{"notes":98,"python":99,"dependencies":100},"该工具主要作为 RAG 系统依赖外部 LLM 提供商（如 OpenAI, Claude, Ollama 等）或本地部署的模型，自身不强制要求特定 GPU 型号。支持通过 Docker 部署 Web UI。可处理 15+ 种文件格式，包括音频、视频、PDF、EPUB 和 Anki 卡片等。若使用本地私有模式，需自行配置本地模型环境。","未说明",[101,102,103,104,105,106],"LangChain","LiteLLM","Gradio","ddgs","scipy","Docker (可选)",[35,14],[109,110,111,112,113,114,115,116,117],"langchain","llm","news","python","question-answering","summarizer","parser","pdf","rag","2026-03-27T02:49:30.150509","2026-04-08T09:59:14.314478",[121,126,131,136,140,145],{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},23988,"运行查询任务时提示需要安装 `LogseqMarkdownParser`，即使已经安装了该怎么办？","这通常是因为 `pip` 安装失败，只创建了 `.dist-info` 文件夹而没有实际安装包文件。解决方法是手动修复该依赖包：\n1. 找到 `LogseqMarkdownParser` 的 `setup.py` 文件。\n2. 删除其中 `package_dir={\"\": \"src\"},` 这一行。\n3. 将 `src\u002FLogseqMarkdownParser` 文件夹移动到项目根目录。\n4. 在该目录下重新运行 `pip install .`。\n完成后 Python 应该能正确识别该包。","https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fissues\u002F5",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},23989,"运行时出现 `ImportError: failed to find libmagic` 错误如何解决？","该错误表明系统缺少 `libmagic` 库，这是 `python-magic` 包所依赖的底层库。\n- **Windows 用户**：通常需要安装 `python-magic-bin` 而不是 `python-magic`，或者确保安装了包含 `magic1.dll` 的环境。\n- **Linux\u002FMac 用户**：需要通过系统包管理器安装 libmagic（例如 Ubuntu 上运行 `sudo apt-get install libmagic1`，Mac 上运行 `brew install libmagic`）。\n维护者已在后续版本中优化了相关日志输出，仅在 verbose 模式下显示此类警告。","https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fissues\u002F3",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},23990,"如何通过 pip 安装开发版（dev branch）并正确运行模块？","可以通过以下命令安装开发分支版本：\n`pip install git+https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002FDocToolsLLM.git@dev`\n\n安装后如果直接运行 `python -m DocToolsLLM` 提示找不到模块，可以尝试以下几种运行方式：\n1. 直接在控制台运行命令（如果已添加脚本路径）：`DocToolsLLM`\n2. 
确保当前目录在 PYTHONPATH 中，克隆仓库后切换到 dev 分支，运行：`PYTHONPATH=\".\" python -m DocToolsLLM`\n3. 如果是通过 git clone 安装的，请确保在项目根目录下运行 `pip install .`。","https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fissues\u002F1",{"id":137,"question_zh":138,"answer_zh":139,"source_url":135},23991,"加载 txt 文档时报错 `string indices must be integers, not 'str'` 是什么原因？","这是一个已修复的 Bug，通常发生在特定版本的解析逻辑中处理字符串索引错误时。维护者已发布修复版本。\n解决方法：\n1. 升级到最新版本（特别是 dev 分支或最新 release）。\n2. 如果问题依旧，尝试重新克隆仓库并执行 `pip install .` 以确保本地代码是最新的。\n3. 检查是否混用了不同版本的依赖包，建议重建虚拟环境。",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},23992,"如何将 WDoc 集成到 Open-WebUI 中作为 \"Pipe\" 使用？","可以将 WDoc 配置为 Open-WebUI 的自定义工具（Pipe）。\n关键配置步骤：\n1. 设置环境变量以启用原生导入模式：`os.environ[\"WDOC_IMPORT_TYPE\"] = \"native\"`。\n2. 在代码中添加 try-except 块来处理可选依赖（如 `torchaudio`）导入失败的情况，防止程序崩溃。\n3. 确保 API Key（如 Anthropic Key）正确传递给工具参数。\n目前已有可用的演示版本推送到 Open-WebUI 社区，具体实现代码可参考相关仓库中的 `wdoc_tools.py` 文件。","https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002Fwdoc\u002Fissues\u002F4",{"id":146,"question_zh":147,"answer_zh":148,"source_url":130},23993,"为什么运行时会出现关于 `ftlangdetect` 和 `langdetect` 的警告信息？","系统使用了两种语言检测包：`ftlangdetect`（快但难安装）和 `langdetect`（慢但兼容性好）。\n程序逻辑是优先尝试导入快速的 `ftlangdetect`，如果失败则回退到 `langdetect` 并打印警告。\n- 如果你看到 `Couldn't import optional package 'ftlangdetect'...` 的警告，说明快速检测不可用，系统正在使用较慢的替代方案，这不影响功能正常使用。\n- 在最新版本中，这类警告默认只在开启详细模式（verbose）时才会打印，以减少干扰。",[150,155,160,165,170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245],{"id":151,"version":152,"summary_zh":153,"released_at":154},145539,"5.0.0","# 新增内容\n\n## 重大发布：Docker Web 界面、Python 3.13 支持及架构改进\n\n本次重大发布引入了基于 Docker 的实验性 Web 界面，升级了 Python 版本要求，迁移到现代的 LangChain 模块，并包含许可证更新带来的破坏性变更。\n\n### ✨ 功能\n\n- **Docker Web 界面** [7da29d19, 214835bb, 6a4715be, 3f42424e, cf9d937e, 8690723f, 6dcf7d2b, 2ab086ed, 92564b2c, a2403f91, a8fa5cdc, 5475b8b5, cdcbe71d, 94912ad2, 87fef832, 56ffc4cd, 24a9d8c4, 0ce26490, c8ab4041, e6de6277, 2d5816ca, 12cc4a70, 4f84c0e1, f3399668, 756c7cab, c16d5ac1, af9c7fce, d1f0fedd, 9128bec7, 9fb2c7e7, 90de7d8f]\n  - 为 Docker 部署添加基于 Gradio 的实验性 Web UI\n  - 支持通过 Web 界面配置环境变量\n  - 动态文件类型参数和高级设置折叠面板\n  - 日志捕获与显示，采用等宽字体格式\n  - 启用 PWA 模式，提供双标签界面\n  - 顺序处理队列和非 root 用户安全机制\n  - 查询 GUI 中新增加载\u002F保存嵌入功能\n\n- **性能改进** [770a5245, a28a4bee, 6caa0113, 6e54b787, e3327a09]\n  - 通过缓存避免重复计算文档的 token 长度\n  - 根据 token 数量和文档限制动态创建批次\n  - 启用并发处理文档回答功能\n\n### 🔧 重构与破坏性变更\n\n- **Python 版本升级** [126026f0, 2d1f8ab5, b476dba4]\n  - 要求 Python 3.13 及以上版本（破坏性变更）\n  - 更新至 Python 3.13.5\n  - 添加适用于 Python 3.13+ 的 audioop-lts 安装后脚本\n\n- **LangChain 迁移** [830edd47, 4eb303c4, 4763b552, 908d536f, e190d601, 5846ae05]\n  - 迁移到 langchain_core 和 langchain_text_splitters 模块\n  - 更新从过时 LangChain 模块的导入语句\n  - 要求 LangChain ≥ 1.2.0\n  - 更新 CacheBackedEmbeddings 的导入路径\n\n- **许可证变更** [f30fcda8, b9e8eb24]\n  - 从 GPLv3 切换到 AGPLv3（破坏性变更）\n\n- **异步操作** [c4122339]\n  - 使用 asyncio tqdm 替代普通 tqdm，以更好地支持异步操作\n\n### 📚 文档\n\n- **Docker 文档** [10be9b05, 69a3ed5a, 555e1d5e, 09cf6567, 87d6442e, ba75fd0d, 44d419cf, def3f979]\n  - 添加包含安装说明的完整 Docker README\n  - 包含 Gradio 界面截图\n  - 增加权限问题的故障排除部分\n  - 将 Docker 文档整合到在线文档中\n\n- **流程图** [6bbb1069, abbb8ed3, f54d3804, 0c705087, 9842b090, 14018a68, 4576785d]\n  - 更新带有标题的工作流图\n  - 添加包含所有三张流程图的合并图片\n  - 在 README 中并排展示流程图\n  - 使用子图改进 Mermaid 流程图结构\n\n- **通用文档** [701645c3, 19a4555e, 2938f369, 1a4cf2da, 019c6407, a96acd91, 0a6af6d2, 4273ed1a, 2cf0be4f, 5ea630d5, b747f054, 817380ce, 0acca059, 659dcf86, 082b016f, 86aeda4b, 5af143a6, 0abd2683]\n  - 添加网站链接和作者信息\n  - 使用 uvx 命令澄清安装说明\n  - 移除重复的文档章节\n  - 修复图片 
l","2026-01-04T15:08:22",{"id":156,"version":157,"summary_zh":158,"released_at":159},145540,"4.1.2","# 新增内容\n\n# 新增内容\n\n此补丁版本修复了一个可选依赖项的安装问题，并提升了哈希性能。\n\n## 🐛 修复\n\n- 修复了 chonkie 可选安装依赖项的名称错误（应为 `semantic` 而非 `semantics`）[[18a2f9d5](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002FWDoc\u002Fcommit\u002F18a2f9d5)]\n\n## ⚡ 性能\n\n- 将哈希算法从 SHA256 切换为 BLAKE3，以提升速度 [[835e43dc](https:\u002F\u002Fgithub.com\u002Fthiswillbeyourgithub\u002FWDoc\u002Fcommit\u002F835e43dc)]\n  - 更新了 `setup.py`、`tests\u002Ftest_parsing.py` 和 `wdoc\u002Futils\u002Fmisc.py`\n\n## 自上一版本以来的提交详情\n\n\n- [cb522e6b] 由 @thiswillbeyourgithub 提交，10 秒前：\n将版本号从 4.1.1 升级至 4.1.2\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [18a2f9d5] 由 @thiswillbeyourgithub 提交，30 分钟前：\n修复：chonkie 的可选安装依赖项名为 semantic，而非 semantics\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nsetup.py\n\n\n\n- [835e43dc] 由 @thiswillbeyourgithub 提交，31 分钟前：\n性能优化：使用 blake3 替代 sha256\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nsetup.py\ntests\u002Ftest_parsing.py\nwdoc\u002Futils\u002Fmisc.py\n","2025-10-28T09:12:13",{"id":161,"version":162,"summary_zh":163,"released_at":164},145541,"4.1.1","# 新增内容\n\n本次发布重点在于集成 Chonkie 进行语义分块，提升测试可靠性，并通过全面的代码 lint 检查来优化代码质量。\n\n## 功能\n\n- **Chonkie 语义分块集成**\n  - 使用带有记忆功能的语义分块实现了 `ChonkieSemanticSplitter` ([081e81aa])\n  - 为 `ChonkieSemanticSplitter` 添加了 `transform_documents` 方法 ([534cc909])\n  - 在 `summarize.py` 中将 `RecursiveCharacterTextSplitter` 替换为 `ChonkieSemanticSplitter` ([77f16520])\n  - 将 chonkie 添加到依赖项中 ([7234f86d])\n  - 合并 chonkie 分支到 dev 分支 ([f89390ac])\n\n## 修复\n\n- **日志与显示**\n  - 修复了 loguru 中颜色不显示的问题 ([99502e76])\n  - 修复了 stdout 颜色逻辑错误 ([83e7fb9c])\n  \n- **解析与类型提示**\n  - 允许 LLM 在思考过程中提到“思考”一词 ([d2bca844])\n  - 修复了解析“思考”时的错误信息 ([a50ec42f])\n  - 修复了 topk 自动增加的类型提示错误 ([615828ae])\n\n## 重构\n\n- 将批处理文件加载器拆分为两个文件 ([a0420fd7])\n- 在整个代码库中全面运行 ruff linter ([d9f7eac2])\n- 从 black 切换到 ruff ([2d8a51b9])\n- 使 ruff 配置更加宽松 ([e04fc8d7])\n\n## 测试\n\n- **DDG 测试改进**\n  - 终于修复了 DDG 错误无法捕获输出的问题 ([e1b2a87e])\n  - 正确捕获 DDG 输出 ([d9c5ae9f])\n  - 将 DDG 最大结果数设置为 10，以减少失败次数 ([ccdffd16])\n  - 在错误信息之前打印输出 ([66bf47ce])\n  - 改进了输出打印方式 ([96c51869])\n  - 不再使用 grep 的别名 ([a02ffbc2])\n\n## 杂项\n\n- 更新了 pyfiglet 字体 ([fd49cca5])\n- 清理了 README 中已完成的 TODO 项 ([113c0082], [ea4a99b0], [86563279], [7206234c], [4a4f4e8e])\n- 其他小幅改进 ([b8734694], [ca55c139])\n\n## 自上一次发布以来的提交详情\n\n\n- [766c3737] 由 @thiswillbeyourgithub 提交，4 秒前：\n版本号从 4.1.0 升级到 4.1.1\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [e1b2a87e] 由 @thiswillbeyourgithub 提交，66 分钟前：\n测试：终于修复了 DDG 无法捕获输出的错误\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Ftest_cli.sh\n\n\n\n- [a02ffbc2] 由 @thiswillbeyourgithub 提交，2 小时前：\n测试：不要使用 grep 的别名\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Ftest_cli.sh\n\n\n\n- [66bf47ce] 由 @thiswillbeyourgithub 提交，2 小时前：\n测试：在错误信息之前打印输出\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Ftest_cli.sh\n\n\n\n- [4a4f4e8e] 由 @thiswillbeyourgithub 提交，2 小时前：\nTODO：完成了 Chonkie 集成\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nREADME.md\n\n\n\n- [d9c5ae9f] 由 @thiswillbeyourgithub 提交，2 小时前：\n测试：捕获 DDG 输出\n签署人：thiswillbeyourgithub 
\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Ftest_cli.sh\n\n\n\n- [96c51869] 由 @thiswillbeyourgithub 提交，2 小时前：\n测试：","2025-10-27T18:11:34",{"id":166,"version":167,"summary_zh":168,"released_at":169},145542,"4.1.0","# 新增内容\n\n# 新增内容\n\n本次发布主要集中在增强系统的鲁棒性，特别是在语言检测、文件加载和错误处理方面。\n\n## 功能\n\n- **任务类型系统**：引入基于数据类的任务类型存储，以提高类型安全性 [7c95e3cb]\n- **源标签日志记录**：在源标签日志中增加了失败次数和成功率的跟踪功能 [69dca457]\n\n## 修复\n\n- **PowerPoint 加载器**：修复了加载 PowerPoint 文件时出现的 TypeError 错误 [ebfc66c0]\n- **Anki 加载器**：解决了前向引用错误 [73924e10]\n- **语言检测**：修复了一个潜在的边缘情况问题 [2d928ab2]\n- **无限循环检测**：\n  - 将简单的循环计数器替换为基于哈希的检测机制 [bb147b3d]\n  - 调整了循环计数器的阈值 [fcf9ca55]\n\n## 改进\n\n- **语言检测改进**：\n  - 更好的异常处理 [0b9c6da1]\n  - 降低了调试日志的详细程度 [d7589cc0]\n  - 其他通用改进 [c0e2ce79]\n- **批量文件加载器**：减少了进度日志的详细程度 [d207d986]\n- **测试**：改进了模型检测逻辑 [5257c5a7]\n- **安装后操作**：在安装过程中使用 logger.error 代替 print [c0795e9e]\n\n## 重构\n\n- **wdoc 类**：添加了动态的 interaction_settings 属性 [f806b986]\n- **类型提示**：在多个模块中改进了类型注解 [a94a8891, 920e5d31]\n\n## 文档\n\n- **帮助文本**：修正了 PowerPoint 文件类型的文档，其中错误地提到了 .doc\u002F.docx 而不是 .ppt\u002F.pptx [e9b29ebf]\n\n## 依赖项\n\n- 升级了 litellm，以支持最新的 OpenRouter 定价 [577e6f63]\n\n## 维护\n\n- 移除了调试打印语句 [80f7f32b]\n- 提供了更清晰的警告信息 [faa5d3b5]\n- 修复了 setup.py 中的日志使用问题 [4a672c1f]\n\n## 自上一版本以来的提交详情\n\n\n- [5adc87e6] 由 @thiswillbeyourgithub 提交，72秒前：\n将版本号从 4.0.4 更新至 4.1.0\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [0b9c6da1] 由 @thiswillbeyourgithub 提交，2小时前：\n增强：在语言检测中加入更好的异常捕获机制\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [80f7f32b] 由 @thiswillbeyourgithub 提交，3小时前：\n移除一条调试打印语句\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Floaders\u002F__init__.py\n\n\n\n- [d7589cc0] 由 @thiswillbeyourgithub 提交，3小时前：\n增强：语言检测器减少调试日志输出\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [c0e2ce79] 由 @thiswillbeyourgithub 提交，3小时前：\n增强：语言检测器\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [2d928ab2] 由 @thiswillbeyourgithub 提交，3小时前：\n修复：语言检测中的潜在边缘情况问题\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [69dca457] 由 @thiswillbeyourgithub 提交，4小时前：\n新增功能：添加失败计数","2025-10-21T18:25:57",{"id":171,"version":172,"summary_zh":173,"released_at":174},145543,"4.0.2","# 新增内容\n\n# 新增内容\n\n本次发布主要集中在与文档存储过滤和检索器功能相关的错误修复、性能改进以及代码清理。\n\n## 🐛 修复\n\n- **文档存储过滤改进**\n  - 修复了调用 `filter_docstore` 时缺少参数的问题 ([1a2442d0])\n  - 修复了 `filter_docstore` ([d96c2f3a]) 和 `create_filter_metadata` ([ee9cc6f2]) 的类型提示\n  - 纠正了文档存储的序列化行为 ([9c1d967b])\n  \n- **检索器修复**\n  - 修复了从嵌入向量加载时的父级检索器问题 ([cf9171d3])\n  - 修复了在边缘情况下检索器的类型提示问题 ([39951a8e])\n\n## ⚡ 性能\n\n- 不再存储或序列化未过滤的文档存储 ([d29d3a56], [a9a0a359])\n  - 为清晰起见，将 `filter_docstore` 重命名为 `filter_vectorstore`\n\n## ✨ 功能\n\n- 添加了对文档存储序列化和删除操作的计时测量 ([375c1a1d])\n\n## 🧹 杂项\n\n- 移除了未使用的导入 ([de18c658], [65254648])\n- 版本号从 4.0.1 升级到 4.0.2 ([83f23ddf])\n\n## 自上一版本以来的提交详情\n\n\n- [83f23ddf] 由 @thiswillbeyourgithub 提交，9 秒前：\n将版本号从 4.0.1 升级到 4.0.2\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [a9a0a359] 由 @thiswillbeyourgithub 提交，17 分钟前：\n将 `filter_docstore` 重命名为 `filter_vectorstore`\n签署人：thiswillbeyourgithub 
\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Ffilters.py\nwdoc\u002Fwdoc.py\n\n\n\n- [d29d3a56] 由 @thiswillbeyourgithub 提交，18 分钟前：\n性能优化：不再存储或序列化未过滤的文档存储\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Ffilters.py\nwdoc\u002Fwdoc.py\n\n\n\n- [cf9171d3] 由 @thiswillbeyourgithub 提交，24 分钟前：\n修复：从嵌入向量加载时的父级检索器问题\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fretrievers.py\n\n\n\n- [de18c658] 由 @thiswillbeyourgithub 提交，37 分钟前：\n移除未使用的导入\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fretrievers.py\n\n\n\n- [39951a8e] 由 @thiswillbeyourgithub 提交，37 分钟前：\n修复：在边缘情况下检索器的类型提示问题\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fretrievers.py\n\n\n\n- [375c1a1d] 由 @thiswillbeyourgithub 提交，45 分钟前：\n新增功能：为文档存储的序列化和删除操作添加计时测量\n共同作者：aider (openrouter\u002Fanthropic\u002Fclaude-sonnet-4.5) \u003Caider@aider.chat>\n\nwdoc\u002Futils\u002Ffilters.py\n\n\n\n- [65254648] 由 @thiswillbeyourgithub 提交，48 分钟前：\n移除未使用的导入\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Ffilters.py\n\n\n\n- [9c1d967b] 由 @thiswillbeyourgithub 提交，48 分钟前：\n修复：实际上未过滤的文档存储会被序列化\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Ffilters.py\nwdoc\u002Fwdoc.py\n\n\n\n- [ee9cc6f2] 由 @thiswillbeyourgithub 提交，67 分钟前：\n修复","2025-10-15T08:51:31",{"id":176,"version":177,"summary_zh":178,"released_at":179},145544,"4.0.1","# 新增内容\n\n# 新增内容\n\n本次发布重点在于兼容 LangFuse v3 以及改进错误处理。\n\n## 🐛 修复\n\n- **LangFuse v3 兼容性**\n  - [89f5132c] 更新 LangFuse v3 的回调导入\n  - [07257e0c] 在 v3 中使用 LangFuse OpenTelemetry\n\n- **文档加载的健壮性**\n  - [3039bcfd] 防止在 `transform_documents` 后无文档剩余时程序崩溃\n  - [101c7f77] 添加断言以验证是否找到文档\n\n## 📝 文档\n\n- [56866d18] 添加警告：建议使用 Whisper 或 Deepgram 而非 YouTube 音频后端\n\n## 🔧 维护\n\n- [fb49e60d] 版本号从 4.0.0 升级至 4.0.1\n\n## 自上一版本以来的提交详情\n\n\n- [fb49e60d] 由 @thiswillbeyourgithub 提交，13 秒前：\n将版本号从 4.0.0 升级至 4.0.1\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [07257e0c] 由 @thiswillbeyourgithub 提交，3 分钟前：\n修复：在 v3 中使用 LangFuse OpenTelemetry\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [56866d18] 由 @thiswillbeyourgithub 提交，11 分钟前：\n文档更新：添加关于使用 YouTube 音频后端而非 Whisper 或 Deepgram 的警告\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Floaders\u002Fyoutube.py\n\n\n\n- [89f5132c] 由 @thiswillbeyourgithub 提交，14 分钟前：\n修复：LangFuse v3 的回调导入已更改\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [3039bcfd] 由 @thiswillbeyourgithub 提交，20 分钟前：\n修复：当运行 `transform_documents` 后无文档时不会导致程序崩溃\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Floaders\u002F__init__.py\n\n\n\n- [101c7f77] 由 @thiswillbeyourgithub 提交，29 分钟前：\n添加断言以确保找到文档\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Floaders\u002F__init__.py\n","2025-10-07T23:02:45",{"id":181,"version":182,"summary_zh":183,"released_at":184},145545,"4.0.0","# 新增内容\n\n# 新增内容\n\n本次发布重点在于通过惰性加载和延迟导入实现**重大性能提升**，进行**大规模代码重构**以提高可维护性，以及**完善测试基础设施**。\n\n## 
⚡ 性能\n\n- 通过延迟导入和惰性加载，**显著缩短启动时间** [52985d59, dce3c244, 3ffaec3a]\n  - 将 litellm 的导入移至仅在需要时执行 [52985d59]\n  - 延迟 requests 模块的导入 [0b4c2fb0]\n  - 移除 `__init__.py` 文件中的立即导入 [306d4ca4]\n  - 将加载器、嵌入模块和核心模块中的导入移至惰性加载 [de1ceccf, 08b92066, fd2dcba5, 1838e0f6, 1bd4ced6, f1740c40, 2b3d9e8c, 6fbe51d2, 6c74d8eb, f306325f]\n  - 为文档加载器添加了基于 `WDOC_LAZY_LOAD` 环境变量的惰性加载功能 [7fc5fad2, ce10c4bc]\n\n## 🔧 修复\n\n- 修复了多个模块中的前向引用类型提示问题 [fd6a7e79, 22b44b4b, 15a27461]\n- 修复了 parse 函数的签名包装问题 [29dbf5db]\n- 修复了 DuckDuckGo 和 OpenRouter 的 API 测试问题 [8b9ebc27, 8f511dd4, 32e036d6, 048f99ec]\n- 修复了边缘情况下缺失的文件类型处理问题 [0422dec0]\n- 修复了 Word 文档加载错误 [8cad00d4]\n- 修复了惰性加载逻辑（原逻辑相反）[a35446ff]\n- 修复了 query_task 和 search_task 的输出处理问题 [6f633e8b, 8b95a81f]\n- 修复了使用管道时摘要未输出到文件的错误 [2a85a6b1]\n- 修复了加载器中的导入问题 [ebd45580, af85343c, 986abd21, 4e61a6fe]\n- 为 Python 3.13+ 添加了缺失的 `audioop-lts` 依赖 [56bd634f]\n\n## ♻️ 重构\n\n- **模块化加载器**：将单体加载器文件拆分为独立模块 [df1a0ad1, d3ed8735, f0a3fcef, b2490688, 984a8d37, def441f3, fb421cc5]\n  - 为 PDF、Anki、URL、音频、HTML 等加载器创建了专用文件\n  - 启用了加载器模块的惰性加载 [7fc5fad2]\n- **提取任务相关函数**至独立模块：\n  - 将 `parse_doc` 移至 `utils\u002Ftasks\u002Fparse.py` [1c7c6e4a]\n  - 将查询\u002F搜索检索逻辑移至任务模块 [79820513, c2e61427]\n  - 将 `evaluate_doc_chain` 移至 `shared_query_search.py` [8965c483]\n  - 将查询拆分逻辑提取至共享工具中 [4bb54a5f]\n  - 将 `source_replace` 移至 query.py [0ce5f4f6]\n  - 将 `autoincrease_top_k` 移至 query.py [38e82b44]\n- 使用更好的类型提示拆分了搜索和查询任务方法 [1d946440, 824f3959, 319b8ebe]\n- 将 `debug_exceptions` 移至日志模块 [99cc99fb]\n- 将 VectorStore 过滤代码移至 filters.py [de4ce57f]\n- 添加了用于类型提示的 `wdocSummary` 数据类 [9fc51c02, 92f5c47d]\n- 为 `all_texts` 属性添加了惰性缓存 [79b16616, 7b459480]\n- 移除了已废弃的 `import_tricks.py` [51166167]\n\n## 🧪 测试\n\n- 改进了测试清理和临时文件删除操作 [a7686421, 35ef63ed, c149f5d5, 913378aa]\n- 在成本测试中提供了更详细的输出信息 [342ad3f6]\n- 在 OpenRouter API 测试中使用 Mistral（零数据留存）[8f511dd4]\n- 添加了基于 Shell 的 CLI 测试脚本，以提高测试的可靠性 [cc74a84a, 41705670]\n- 添加了对 `wdoc[full]` 安装情况的检查 [7cb9a3ca]\n- 更新了 Ollama 嵌入测试","2025-10-05T16:36:08",{"id":186,"version":187,"summary_zh":188,"released_at":189},145546,"3.3.1","# 新增内容\n\n本次发布重点在于通过全面的类型提示修复和增强的测试基础设施来提升代码质量。\n\n## 🔧 修复\n- **类型提示**：对整个代码库进行了全面的类型提示改进\n  - 二进制 FAISS 向量存储的类型提示（[e46ed4a2]、[ac02a65b]、[da864cd2]、[95c37056]、[81b36cb8]、[be3f3523]）\n  - 加载器函数的类型提示（[e6fcad8b]、[e65abadf]、[b6243737]）\n  - 语义批处理的类型提示（[f3a52893]、[dd6ad29c]）\n  - 提示模板的类型提示（[1b4ec86b]、[bc56beb2]）\n  - 通用类型提示修复（[d4e99fdb]）\n\n- **模型兼容性**：修复了某些模型将 `\u003Canswer>` 视为暗示 `\u003C\u002Fthink>` 的问题（[09684bb6]）\n- **Langchain 集成**：通过创建不带装饰器的可运行对象，修复了 callable_chain 的兼容性问题（[0c89cac3]）\n\n## ✨ 增强\n- **类型检查**：用导入钩子系统替换了手动类型检查（[56b353af]、[6b3ddab3]）\n- **日志记录**：降低了 litellm 日志的详细程度（[9a4a69cd]）\n- **搜索功能**：为 DuckDuckGo 搜索结果添加了重复项检查（[f68c8a42]）\n\n## 🧪 测试\n- 添加了对 DuckDuckGo 搜索功能的全面测试（[7dbd3c25]）\n- 修复了现有的 CLI 测试（[781f6d69]）\n\n## 📦 版本\n- 将版本号从 3.3.0 升级到 3.3.1（[0690df9c]）\n\n## 自上一版本以来的提交详情\n\n\n- [0690df9c] 由 @thiswillbeyourgithub 提交，41 秒前：\n将版本号从 3.3.0 升级到 3.3.1\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [7dbd3c25] 由 @thiswillbeyourgithub 提交，8 小时前：\n测试：添加对 DuckDuckGo 搜索功能的测试\n共同作者：aider (openrouter\u002Fanthropic\u002Fclaude-sonnet-4) \u003Caider@aider.chat>\n\ntests\u002Ftest_wdoc.py\n\n\n\n- [781f6d69] 由 @thiswillbeyourgithub 提交，9 小时前：\n修复：测试\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Ftest_cli.py\n\n\n\n- [e46ed4a2] 由 @thiswillbeyourgithub 提交，14 小时前：\n修复：边缘得分搜索的类型提示\n签署人：thiswillbeyourgithub 
\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fcustoms\u002Fbinary_faiss_vectorstore.py\n\n\n\n- [ac02a65b] 由 @thiswillbeyourgithub 提交，18 小时前：\n修复：二进制 FAISS 的类型提示\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fcustoms\u002Fbinary_faiss_vectorstore.py\n\n\n\n- [f68c8a42] 由 @thiswillbeyourgithub 提交，19 小时前：\n添加对 DDG 搜索结果重复项的检查\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fbatch_file_loader.py\n\n\n\n- [da864cd2] 由 @thiswillbeyourgithub 提交，19 小时前：\n修复：二进制 FAISS 的类型提示\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fcustoms\u002Fbinary_faiss_vectorstore.py\n\n\n\n- [9a4a69cd] 由 @thiswillbeyourgithub 提交，19 小时前：\n增强：降低 litellm 的日志详细程度\n签署人：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fwdoc.py\n\n\n\n- [e6fcad8b] 由 @thiswillbeyourgithub 提交，19 小时前：\n修复：lo 的类型提示","2025-07-26T08:24:33",{"id":191,"version":192,"summary_zh":193,"released_at":194},145547,"3.3.0","# 新增内容\n\n本次发布重点在于添加 DuckDuckGo 网页搜索功能，并引入二进制嵌入支持，以实现更高效的向量存储。\n\n## ✨ 新特性\n\n### DuckDuckGo 网页搜索集成\n- **[372fe577]** 添加支持 DuckDuckGo 搜索，并实现 URL 提取与元数据处理\n- **[273195e0]** 支持 `wdoc wdb \"your query\"` 简写形式进行网页搜索\n- **[03bfe083]** 增加 DuckDuckGo 搜索的测试用例及文档说明\n\n### 二进制嵌入支持\n- **[c528bad6]** 添加对二进制嵌入的支持，内存占用降低 8 倍\n- **[8f651979]** 默认启用 FAISS 向量存储压缩\n- **[37ebd97e]** 创建带有 zlib 压缩的 CompressedFAISS 子类\n\n## 🐛 Bug 修复\n\n### 核心功能\n- **[0d72efd0]** 修复 `load_one_doc` 使用错误装饰器的问题\n- **[edcf6711]** 修正 `ddg_region` 的类型（应为字符串而非整数）\n- **[66ab1775]** 修复 `ddg_safesearch` 和 `loading_failure` 的类型注解\n- **[957936cf]** 调用 wdoc 时改用关键字参数，而非 Fire 命令行工具\n\n### 测试环境\n- **[d3de58e5]** 修复 pytest 环境中管道输入输出的处理问题\n- **[42ff5166]** 防止在 pytest 环境中使用管道\n- **[c78dc0b7]** 添加 pytest 环境检测功能\n\n## 🧪 测试改进\n\n- **[1b099968]** 修复 `run_all_test` 脚本\n- **[8ed1d0cb]** 增加全面的 DuckDuckGo 搜索功能测试\n- **[b184177b]** 将 CLI 测试拆分至单独的 `test_cli.py` 文件\n- **[9d7fe9c0]** 将解析相关测试拆分至独立的 `test_parsing.py` 文件\n- **[12b012d1]** 将向量存储测试移至专用测试文件\n\n## 📚 文档更新\n\n- **[d7d6b043]** 在 README 中说明如何运行测试\n- **[dc150010]** 清晰解释如何禁用并行处理\n- **[df4b79f9]** 记录调试模式对 `loading_failure` 默认值的影响\n- **[18322991]** 增加 DuckDuckGo 使用的 Shell 示例\n\n## 🔧 功能增强\n\n### CLI\u002FUX 改进\n- **[7e994a6d]** 将 `parse_file` 函数重命名为 `parse_doc`\n- **[4aa247e6]** 当 CLI 输入为空查询时，重新提示用户输入\n- **[57d5d5fa]** 修复 Fire 命令行工具在 CLI 中的分页器问题\n\n### 性能与可靠性提升\n- **[68d4c757]** 升级 LiteLLM 至最新版本，以改善启动时间\n- **[ab9c5e92]** 为 Whisper 音频分割功能新增并行处理选项\n- **[6b130442]** 为递归文件处理添加循环计数器和崩溃保护机制\n\n## 🔄 版本更新\n\n- **[64351334]** 版本号从 3.2.5 升级至 3.3.0\n\n## 自上一版本以来的提交详情\n\n\n- [64351334] 由 @thiswillbeyourgithub 提交，36 分钟前：\n将版本号从 3.2.5 升级至 3.3.0\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [1b099968] 由 @thiswillbeyourgithub 提交，24 小时前：\n测试：修复 run_all_test 脚本\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Frun_all_tests.sh\n\n\n\n- [d7d6b043] 由 @thiswillbeyourgithub 提交，24 小时前：\n文档：说明如何运行测试\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nREADME.md\n\n\n\n- [62cc2ce0] 由 @thiswillbeyourgithub 提交，24 小时前","2025-07-19T12:44:49",{"id":196,"version":197,"summary_zh":198,"released_at":199},145548,"3.2.5","# 新增内容\n\n此版本带来了命令行参数处理和文件类型检测的多项改进，以及关键的错误修复和构建流程优化。\n\n### ✨ 功能\n\n*   **CLI 与文件类型检测：**\n    *   通过利用文件类型检测器推断适当的操作，增强了对多个隐式 CLI 参数的处理能力（[ab76610f]）。\n    *   
引入了特定异常，以便在无法推断文件类型时提供更清晰的错误报告，包括针对无法检测到的文件类型的全新错误（[520f4ce8], [39af2232]）。\n*   **构建流程：**\n    *   通过 `.readthedocs.yaml` 中的预构建作业，将 `sphinx-apidoc` 集成到 ReadTheDocs 构建流程中（[cc86c7be]）。\n\n### 🐛 修复\n\n*   修复了 `sys.argv` 处理不当导致参数重复的问题（[e7cf1855]）。\n*   更新了 `litellm` 依赖项，以解决 Windows 环境下出现的崩溃问题（[cfff0aca]），详情参见 #20。\n\n### 🛠️ 改进与重构\n\n*   **文件类型检测内部机制：**\n    *   将文件类型检测逻辑重构为专用函数，以提高模块化程度（[b4537485]）。\n    *   在文件类型检测器中添加了调试日志，以帮助排查问题（[05966c65]）。\n*   **代码质量：**\n    *   通过为自定义异常类添加文档字符串，提升了文档质量（[8e6ca1a0]）。\n\n### 杂项\n\n*   版本号升级至 3.2.5（[82b7f815]）。\n\n## 自上一版本以来的提交详情\n\n\n- [82b7f815] 由 @thiswillbeyourgithub 提交，19 分钟前：\n将版本号从 3.2.4 升级至 3.2.5\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [e7cf1855] 由 @thiswillbeyourgithub 提交，3 分钟前：\n修复：`sys.argv` 处理不当导致参数重复\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002F__main__.py\n\n\n\n- [cfff0aca] 由 @thiswillbeyourgithub 提交，19 分钟前：\n修复：更新 `litellm` 版本以避免 Windows 下崩溃\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nsetup.py\n\n\n\n- [cc86c7be] 由 @thiswillbeyourgithub (aider) 提交，2 天前：\n新增功能：在 `.readthedocs.yaml` 中添加预构建作业以运行 `sphinx-apidoc`\n\n.readthedocs.yaml\n\n\n\n- [ab76610f] 由 @thiswillbeyourgithub 提交，2 天前：\n新增功能：当 CLI 存在多个隐式参数时，使用文件类型检测器推断应执行的操作\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002F__main__.py\n\n\n\n- [05966c65] 由 @thiswillbeyourgithub 提交，2 天前：\n增强功能：在文件类型检测器中添加调试打印信息\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fbatch_file_loader.py\n\n\n\n- [520f4ce8] 由 @thiswillbeyourgithub 提交，2 天前：\n新增功能：当无法推断文件类型时，使用特定异常\n签名者：thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fbatch_file_loader.py\n\n\n\n- [8e6ca1a0] 由 @thiswillbeyourgithub 提交，2 天前：\n为部分异常添加文档字符串\n签名者：thiswillbeyourgithub\n\u003C26625900+thiswillbeyourg","2025-05-17T18:35:26",{"id":201,"version":202,"summary_zh":203,"released_at":204},145549,"3.2.4","# What's new\n\nThis release primarily focuses on significant documentation enhancements, crucial bug fixes for stability and build processes, and introduces updated dependencies and tokenization.\n\n### ✨ New Features\n*   Upgraded default token estimation to use `gpt-4o-mini` tokenizer, replacing `gpt-3.5-turbo` ([6d418175]).\n*   Integrated the latest `yt-dlp` for YouTube downloads ([ab207b44]).\n*   Environment variable documentation is now automatically added to the `EnvDataclass` class `__doc__` ([ed9dd385]).\n\n### 🐛 Bug Fixes\n*   Resolved a crash on ReadTheDocs caused by missing `yt-dlp` dependency ([f5068a30]).\n*   Fixed an issue where accessing `env.__class__` on ReadTheDocs could cause a crash ([4e180f02]).\n*   Corrected relative import paths in `wdoc` that were preventing Sphinx API documentation builds ([ade59302]).\n*   Fixed issues with the Sphinx API command in the FAQ section of the README ([38008aa0], [ff093a29]).\n*   Ensured collapsible bars in documentation function correctly ([3cef8336]).\n\n### 📚 Documentation & Refinements\n*   Extensive updates and fixes to Sphinx documentation generation and content:\n    *   Addressed outdated Sphinx documentation files ([90bde99e]).\n    *   Improved API autodoc parameters for clearer documentation ([243de666]).\n    *   Excluded private and special members from documentation ([7abedd4d]).\n    *   Added Sphinx command to FAQ in README ([1e6602e9]) and removed private 
members from it ([11ae11b4]).\n    *   Updated copyright year to 2025 ([bd7e3c5a]).\n*   Streamlined documentation structure and configuration:\n    *   Removed unused make files (`Makefile`, `make.bat`) for documentation ([07b0a7d9]).\n    *   Removed unused argument for theme flyout display ([17bc5e6e]).\n    *   Removed unused templates path ([6bffa20f]) and CSS ([712df086]).\n    *   Removed duplicate README from the documentation source ([2b931624]).\n    *   Added a documentation table to the main index ([1dfe2b31]).\n\n### ⚙️ Build & Chores\n*   Bumped version to 3.2.4 ([ed7a9c75]).\n\n## Commits details since the last release\n\n\n- [ed7a9c75] by @thiswillbeyourgithub, 20 seconds ago:\nbump version 3.2.3 -> 3.2.4\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [f5068a30] by @thiswillbeyourgithub, 13 minutes ago:\nfix: missing yt-dlp makes readthedock crash\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nsetup.py\n\n\n\n- [17bc5e6e] by @thiswillbeyourgithub, 19 minutes ago:\nremove unused argument for theme flyout display\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ndocs\u002Fsource\u002Fconf.py\n\n\n\n- [4e180f02] by @thiswillbeyourgithub, 22 minutes ago:\nfix: __class__ attribute of env is accessed by readthedocks and should not crash\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fenv.py\n\n\n\n- [243de666] by @thiswillbeyourgithub, 2 hours ago:\nsaner api autodoc parameters\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ndocs\u002Fsource\u002Fconf.py\n\n\n\n- [ed9dd385] by @thiswillbeyourgithub, 3 hours ago:\nnew: add the environment variable documentation to the __doc__ of the EnvDataclass class\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fdocs\u002Fhelp.md\nwdoc\u002Futils\u002Fenv.py\n\n\n\n- [07b0a7d9] by @thiswillbeyourgithub, 3 hours ago:\ndoc: remove unused make files for doc\nSigned-off-by: thiswillbeyourgithub\n\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ndocs\u002FMakefile\ndocs\u002Fmake.bat\n\n\n\n- [7abedd4d] by @thiswillbeyourgithub, 4 hours ago:\ndoc: dont include private nor special\nSigned-off-by: thiswillbeyourgithub\n\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ndocs\u002Fsource\u002Fconf.py\n\n\n\n- [38008aa0] by @thiswillbeyourgithub, 2 hours ago:\nfix: sphinx api command of faq\nSigned-off-by: thiswillbeyourgithub\n\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\nSigned-off-by: thiswillbeyourgithub\n\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nREADME.md\n\n\n\n- [11ae11b4] by @thiswillbeyourgithub, 4 hours ago:\nremove private from sphinx command\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nREADME.md\n\n\n\n- [90bde99e] by @thiswillbeyourgithub, 4 hours ago:\nfix outdated sphinx doc\nSigned-off-by: thiswillbeyourgithub 
\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ndocs\u002Fsource\u002Fwdoc.rst\ndocs\u002Fsource\u002Fwdoc.utils.batch_file_loader.rst\ndocs\u002Fsource\u002Fwdoc.utils.customs.compressed_embeddings_cache.rst\ndocs\u002Fsource\u002Fwdoc.utils.customs.fix_llm_caching.rst\ndocs\u002Fsource\u002Fwdoc.utils.customs.rst\ndocs\u002Fsource\u002Fwdoc.utils.embeddings.rst\ndocs\u002Fsource\u002Fwdoc.utils.env.rst\ndocs\u002Fsource\u002Fwdoc.utils.errors.rst\ndocs\u002Fsource\u002Fwdoc.utils.flags.rst\ndocs\u002Fsource\u002Fwdoc.utils.import_tricks.rst\ndocs\u002Fsource\u002Fwdoc.utils.interact.rst\ndocs\u002Fsource\u002Fwdoc.utils.llm.rst\ndocs\u002Fsource\u002Fwdoc.utils","2025-05-15T14:55:37",{"id":206,"version":207,"summary_zh":208,"released_at":209},145550,"3.2.3","# What's new\n\nThis release primarily focuses on enhancing context management for embedding models, improving debugging utilities, and updating documentation for better clarity. It also includes several important bug fixes and feature additions.\n\n### ✨ Features\n- Introduced a new environment variable `WDOC_MAX_EMBED_CONTEXT` to allow capping the context size for embedding models (`[d9e200f8]`)\n  - Documentation for this new variable has been added (`[a2408fd0]`)\n- Enhanced debugging by ensuring debug prints are always active when `md_printer` is used. This helps in retrieving LLM answers from logs if they weren't saved to a file (`[69db1916]`)\n- Added the current date to summary metadata and headers to help reduce potential LLM hallucinations (`[64ca4665]`)\n\n### 🐛 Fixes\n- **Text Splitting & Context Handling:**\n  - Addressed an issue where large language models have more context than embedding models by setting a `max_tokens` limit for the text splitter (`[dac6802d]`)\n  - Fixed an edge case where the `wdoc max chunk` setting could be ignored (`[196b3a00]`)\n  - Corrected an old variable name within the text splitting logic (`[767bc754]`)\n- Updated the default model to `gemini 2.5 preview` to reflect its renaming on OpenRouter (`[22978609]`)\n- Improved the mechanism for ignoring initial \"breathing\" or placeholder lines in summaries (`[4dbcf158]`)\n\n### 📚 Documentation\n- **Clarity and Enhancements:**\n  - Clarified the usage of `save` and `load` functionalities (`[9d9642d4]`) and specifically advised against using them simultaneously (`[5270c350]`)\n  - Made multiple clarifications to the README for better understanding (`[9284ff54]`, `[cb4cb519]`, `[f677e5a2]`, `[39e0da55]`)\n  - Updated Ollama examples to recommend `snowflake-arctic-embed2` instead of `bge-m3` (`[d045702b]`)\n  - Added documentation for the `WDOC_MAX_EMBED_CONTEXT` environment variable (`[a2408fd0]`)\n- Removed a documentation file (`summary_rag.md`) that was not yet ready for release (`[6d20c220]`)\n\n### ⚙️ Chore & Maintenance\n- Version bumped to `3.2.3` (following an earlier bump to `3.2.2` [`[71ac503c]`]) (`[f62a2322]`)\n- **README Updates:**\n  - Updated TODO items (`[8f2cbfd7]`, `[5d090421]`)\n  - Added a PyPI badge for better project visibility (`[60ef4112]`)\n\n## Commits details since the last release\n\n\n- [f62a2322] by @thiswillbeyourgithub, 46 seconds ago:\nbump version 3.2.2 -> 3.2.3\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [6d20c220] by @thiswillbeyourgithub, 76 seconds ago:\ndoc: removed file not yet ready\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nsummary_rag.md\n\n\n\n- [71ac503c] by @thiswillbeyourgithub, 4 
minutes ago:\nbump version 3.2.1 -> 3.2.2\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [8f2cbfd7] by @thiswillbeyourgithub, 3 minutes ago:\ntodo\nSigned-off-by: thiswillbeyourgithub\n\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nREADME.md\n\n\n\n- [69db1916] by @thiswillbeyourgithub, 40 minutes ago:\nnew: now debug print is used anyway when md_printer is used\nthis is to make you able to go to the logs to fetch and answer form the\nLLM if you have forgotten to store it to a file\n\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\nwdoc\u002Fwdoc.py\n\n\n\n- [a2408fd0] by @thiswillbeyourgithub (aider), 66 minutes ago:\ndocs: Add documentation for WDOC_MAX_EMBED_CONTEXT variable\n\nwdoc\u002Fdocs\u002Fhelp.md\n\n\n\n- [d9e200f8] by @thiswillbeyourgithub, 66 minutes ago:\nfeat: add new env var to cap the context size for embedding models\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fenv.py\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [196b3a00] by @thiswillbeyourgithub, 72 minutes ago:\nfix: edge case where wdoc max chunk would be ignored\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [dac6802d] by @thiswillbeyourgithub, 76 minutes ago:\nfix: set a limit to max_tokens for the text splitter as large LLM have more context than embeddings models nowadays\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [767bc754] by @thiswillbeyourgithub, 80 minutes ago:\nfix: forgot to rename an old variable name\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [22978609] by @thiswillbeyourgithub, 86 minutes ago:\nfix: set default model to gemini 2.5 preview without date timestamp\nopenrouter renamed that model apparently\n\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nREADME.md\nwdoc\u002Futils\u002Fenv.py\n\n\n\n- [9d9642d4] by @thiswillbeyourgithub, 22 hours ago:\ndoc: clarify save and load\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fdocs\u002Fhelp.md\n\n\n\n- [5270c350] by @thiswillbeyourgithub, 22 hours ago:\ndoc: clarify that","2025-05-13T16:35:05",{"id":211,"version":212,"summary_zh":213,"released_at":214},145551,"3.2.1","# What's new\n\nThis small patch release primarily focuses on integrating OpenRouter for model pricing\u002Fmetadata and refining cost calculations.\n\n### ✨ Features\n\n- Set default models to use OpenRouter ([915699c1]).\n- Fetch model prices and metadata automatically from OpenRouter, improving reliability ([7f840b7a]); see the sketch after the Tests section below.\n\n### 🐛 Fixes & Enhancements\n\n- Much improved price calculation and handling:\n  - Reworked computation to account for internal thinking costs ([c0b90d89]).\n  - Enhanced method for retrieving model prices, supported parameters and maximum context size ([a17b41c9]).\n  - Clearer error message when model prices are missing ([2b29a9d3]).\n- Updated `litellm` dependency ([179b5891]).\n\n### 🧪 Tests\n\n- API integration tests now fail faster if an underlying API call fails ([9a0c8565]).
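The automatic price fetching above amounts to downloading OpenRouter's public model listing and caching each model's per-token prices. Below is a minimal Python sketch of that idea, assuming OpenRouter's documented public `/api/v1/models` endpoint; it is an illustration, not wdoc's actual implementation in `wdoc/utils/misc.py`:

```python
import requests  # assumed available for this sketch


def fetch_openrouter_prices() -> dict:
    """Return {model_id: {"prompt": ..., "completion": ...}} prices."""
    # OpenRouter publishes its model catalog, including pricing, without auth.
    resp = requests.get("https://openrouter.ai/api/v1/models", timeout=30)
    resp.raise_for_status()
    prices = {}
    for model in resp.json()["data"]:
        # Pricing fields arrive as strings (USD per token); convert to floats.
        pricing = model.get("pricing", {})
        prices[model["id"]] = {
            "prompt": float(pricing.get("prompt") or 0),
            "completion": float(pricing.get("completion") or 0),
        }
    return prices


if __name__ == "__main__":
    table = fetch_openrouter_prices()
    print(f"fetched prices for {len(table)} models")
```

Pulling the table directly is what commit [7f840b7a] describes: it avoids waiting for `litellm` to ship updated pricing metadata.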
\n\n## Commits details since the last release\n\n\n- [03aeab2d] by @thiswillbeyourgithub, 2 minutes ago:\nbump version 3.2.0 -> 3.2.1\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [915699c1] by @thiswillbeyourgithub, 6 minutes ago:\nnew: set the default models to use openrouter\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nREADME.md\nwdoc\u002Futils\u002Fenv.py\n\n\n\n- [c0b90d89] by @thiswillbeyourgithub, 64 minutes ago:\nfix: reworked how pricing are computed to take internal thinking into account\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fllm.py\nwdoc\u002Futils\u002Fmisc.py\nwdoc\u002Futils\u002Ftasks\u002Fsummarize.py\nwdoc\u002Fwdoc.py\n\n\n\n- [a17b41c9] by @thiswillbeyourgithub, 80 minutes ago:\nenh: better way to get the model prices\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\nwdoc\u002Fwdoc.py\n\n\n\n- [9a0c8565] by @thiswillbeyourgithub, 22 minutes ago:\ntest: crash early if one api crash fails\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Frun_all_tests.sh\n\n\n\n- [7f840b7a] by @thiswillbeyourgithub, 2 hours ago:\nfeat: automatically fetch the price and metadata from openrouter instead of waiting for litellm\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\nwdoc\u002Fwdoc.py\n\n\n\n- [2b29a9d3] by @thiswillbeyourgithub, 2 hours ago:\nfix: error message on missing model price\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [179b5891] by @thiswillbeyourgithub, 2 hours ago:\nbump litellm version\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nsetup.py\n","2025-05-03T17:08:20",{"id":216,"version":217,"summary_zh":218,"released_at":219},145552,"3.2.0","# What's new\n\nThis release focuses on improving the command-line interface (especially handling piped input\u002Foutput), enhancing language model interactions (switching defaults, better caching, Ollama support), and increasing overall stability through numerous bug fixes and testing improvements.\n\n## ✨ Features\n\n*   Added arguments to set specific keyword arguments (`kwargs`) for language models (`--model_kwargs`, `--query_eval_model_kwargs`) ([13925539]).\n*   Introduced `WDOC_LLM_REQUEST_TIMEOUT` environment variable for LLM request timeouts (default 600s), useful for Ollama ([ec3c0c59]).\n*   Switched default models from Claude Sonnet\u002FHaiku to Gemini 1.5 Pro\u002FFlash ([82ef10d0]).\n*   Unified LLM handling to primarily use `ChatLiteLLM`, removing direct `ChatOpenAI` usage ([30a0f0c8]).\n*   Enabled cost tracking for queries, storing the cost in the output ([e7753af0]).\n*   Added automatic download of `nltk punkt` tokenizer during post-installation ([44f5bf8e]).\n*   Overhauled Command Line Interface (CLI) argument parsing for `wdoc` and `wdoc parse` using `fire` ([7c51ed2b], [2f4748d7]).\n*   Removed the `--pipe` argument, relying on automatic stdin\u002Fstdout detection ([b03e79a3], [2e6c1dd2], [838f1640]).\n*   Removed the separate `wdoc_parse_file` entry point; use `wdoc parse` instead ([2e878d20]).\n*   Added a new script `media_url_finder.py` ([beaf8fa9]).\n\n## 🐛 Fixes\n\n*   **LLM & Caching:**\n    *   Resolved issues 
with LLM caching, including invalidation when `kwargs` change and LangChain's SQLite cache ([cb785daf], [3e3e753d]).\n    *   Fixed edge cases in thinking block parsing for models like Gemini and updated tags (`\u003Cthinking>` -> `\u003Cthink>`) ([e111bdb1], [d0ae21a2], [ca9245bc], [99ed332b]).\n    *   Corrected underflow errors in cost calculation due to low LLM prices ([3f18f5df], [95a19846]).\n    *   Addressed issues specific to Ollama: API key requirement relaxation, price assumption (zero), `litellm` naming (`ollama_chat` -> `ollama`), and context window estimation ([d2f92a39], [5784b25c], [43c63408], [c3c15e12]).\n    *   Fixed handling of `testing\u002Ftesting` models and associated parameters ([b995197e], [91b5e67d], [7cf840c3], [9a7b95b2]).\n    *   Fixed `query_retrievers` parsing ([02d74125]).\n    *   Pinned `litellm` version for stability ([1b17c78f]).\n*   **CLI & Piping:**\n    *   Improved detection and handling of piped input\u002Foutput ([2e6c1dd2], [509626aa], [db2fa0f1]); see the sketch after the Testing section below.\n    *   Fixed crashes and hangs when using pipes, especially with long inputs or specific test commands ([f59f34b5], [414de8d0], [b95b1252], [826e7aa3], [b6f7fd71], [177be6b1]).\n    *   Corrected argument parsing issues affecting the `--help` command ([c909337d]).\n    *   Ensured logs are not colorized and Markdown rendering is disabled when outputting to a pipe ([f1d63cd7], [fe2665c1]).\n    *   Fixed issues where debug prints or warnings were incorrectly suppressed or handled ([64fcd60e], [a7724ff4]).\n*   **General:**\n    *   Fixed various bugs in task execution, parameter handling, and attribute declarations ([27a8d353], [91d8df34], [a0eaf512], [a6effc08], [5dce2f3f], [4623fcc9], [b17f5676], [8cc91906], [e91ed3b8], [c3649ab8]).\n    *   Corrected import path in `__main__` ([0ef5e4d5]).\n    *   Suppressed excessive INFO logs from `faiss` ([a17a8d18]).\n    *   Handled `BrokenPipeError` gracefully ([b40832bb]).\n\n## 🧪 Testing\n\n*   Improved test setup for caching, using separate directories and disabling cache where necessary ([9104f860], [89f48598], [085a87eb], [6935fe7c]).\n*   Added tests for OpenRouter\u002Fdefault models, piping functionality, summary\u002Fquery tasks with testing models, and environment variable handling ([06e35b0b], [bbb83717], [caae34ce], [cb9d2376], [eaafafde], [1f835ebc]).\n*   Refactored pipe tests to use `subprocess` explicitly and fixed related issues (stderr redirection, pytest capture, shell usage) ([38a35716], [7f3249a4], [573acf92]). (Note: Some pipe tests were later commented out ([45cf4193])). 
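With the `--pipe` flag removed, the piping behaviors above (reading piped input, disabling colors and Markdown rendering, surviving `BrokenPipeError`) all hinge on whether stdin and stdout are attached to a terminal. A minimal sketch of that detection pattern using only the Python standard library; wdoc's real handling is more involved and this is not its actual code:

```python
import sys

# A stream that is not a tty is connected to a pipe (or a file).
input_is_piped = not sys.stdin.isatty()
output_is_piped = not sys.stdout.isatty()

text = sys.stdin.read() if input_is_piped else "no piped input"

try:
    if output_is_piped:
        # Machine-friendly output: no ANSI colors, no Markdown rendering.
        sys.stdout.write(text + "\n")
    else:
        # Interactive terminal: ANSI styling is safe here.
        sys.stdout.write("\033[1m" + text + "\033[0m\n")
except BrokenPipeError:
    # Downstream process (e.g. `head`) closed the pipe early; exit quietly.
    sys.exit(0)
```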
\n\n## ⚡ Enhancements\n\n*   Reworked logic for detecting and modifying model parameters based on the task ([564c4f90]).\n*   Improved `load_media` function to handle online media more robustly by finding and clicking appropriate buttons ([049c9cba], [67772f81], [c5828d3c]).\n*   Added checks to prevent exceeding total token limits during summarization ([9bdcabcf]).\n*   Refined logging levels and Markdown printing logic ([edfec82b], [4ca394c3], [895a60fc]).\n\n## 📚 Documentation\n\n*   Updated examples for Ollama arguments, model usage (Gemma -> Qwen2), and general clarity ([00871179], [49437ecb], [4083dda1], [404bbe49]).\n*   Clarified behavior related to LLM caching and model `kwargs` in help documentation ([c3e0219c], [3e3e753d], [13925539], [7db844f6]).\n*   Updated README and help files reflecting changes in default models, CLI arguments, and entry points ([82ef10d0], [b03e79a3], [2e878d20], [a30bccf7]).\n\n## ⚙️ Build & Chore\n\n*   Bumped version to 3.2.0 ([7d69d794]).\n*   Added `nltk` to dependencies ([44f5bf8e]).","2025-05-03T14:05:11",{"id":221,"version":222,"summary_zh":223,"released_at":224},145553,"3.1.0","# What's new\n\nThis release primarily focuses on enhancing logging capabilities and fixing issues related to piping behavior.\n\nVersion bump to `3.1.0` (`[e93dcad6]`).\n\n## ✨ New Features\n\n*   **Logging:**\n    *   Always display the default log location (`[2fe2c431]`).\n    *   Set log level to debug for log files and critical when used in a pipe (`[130058a1]`).\n\n## 🚀 Enhancements\n\n*   **Logging:**\n    *   Improved log format (`[61465aff]`, `[dc06ccfd]`).\n    *   Increased probability of early logger initialization (`[01f01ac7]`).\n    *   Clearer error messages from python-magic (`[c846dafa]`).\n\n## 🐛 Fixes\n\n*   **Piping:**\n    *   Resolved confusion between input and output during piping (`[e175b7d5]`).\n    *   Corrected initialization of `is_piped` variable (`[e4532d30]`).\n*   **Logging & Environment:**\n    *   Fixed default handler issue in logger (`[43c859dd]`).\n    *   Prevented potential crash related to environment variable handling (`[d3b1e2bc]`).\n\n## 🧹 Minor Changes\n\n*   Removed unused imports (`[f3c05962]`).\n*   Adjusted test imports structure (`[69738119]`).\n*   Removed commented code (`[86b51102]`).\n*   Removed unused `disable_md_printing` argument (`[b3af430e]`).\n\n## ✅ Testing\n\n*   Added test for exception handling (`[dfbfad54]`).\n*   Added environment variable tests (`[0fba8a13]`).\n\n---\n\n## Commits details since the last release\n\n\n- [e93dcad6] by @thiswillbeyourgithub, 10 minutes ago:\nbump version 3.0.2 -> 3.1.0\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [e175b7d5] by @thiswillbeyourgithub, 31 minutes ago:\nfix: piping behavior was confusing input and output\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fbatch_file_loader.py\nwdoc\u002Futils\u002Fenv.py\nwdoc\u002Futils\u002Floaders.py\nwdoc\u002Futils\u002Flogger.py\nwdoc\u002Futils\u002Fmisc.py\nwdoc\u002Fwdoc.py\n\n\n\n- [b3af430e] by @thiswillbeyourgithub, 34 minutes ago:\nforgot to remove the arg disable_md_printing\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fdocs\u002Fhelp.md\nwdoc\u002Fwdoc.py\n\n\n\n- [61465aff] by @thiswillbeyourgithub, 36 minutes ago:\nenh: better log format\nSigned-off-by: thiswillbeyourgithub 
\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [2fe2c431] by @thiswillbeyourgithub, 37 minutes ago:\nnew: print the default log location always\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [43c859dd] by @thiswillbeyourgithub, 37 minutes ago:\nfix: default handler\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [e4532d30] by @thiswillbeyourgithub, 47 minutes ago:\nfix: is_piped variable was wrong\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fenv.py\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [01f01ac7] by @thiswillbeyourgithub, 66 minutes ago:\nenh: increase chances of logger beint initialized asap\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002F__init__.py\nwdoc\u002F__main__.py\n\n\n\n- [dc06ccfd] by @thiswillbeyourgithub, 89 minutes ago:\nbetter log format\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [f3c05962] by @thiswillbeyourgithub, 2 hours ago:\nremove unused imports\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [e637c2ff] by @thiswillbeyourgithub, 2 hours ago:\nnew: the log level now is always at debug level for the logfile and using --debug only modifyed the stdout of user\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [86b51102] by @thiswillbeyourgithub, 2 hours ago:\nminor: remove commented line\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [130058a1] by @thiswillbeyourgithub, 2 hours ago:\nnew: if wdoc is used in a pipe, we set the log level to critical\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fenv.py\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [dfbfad54] by @thiswillbeyourgithub, 2 hours ago:\ntest: add test for exception handling\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Ftest_wdoc.py\n\n\n\n- [69738119] by @thiswillbeyourgithub, 2 hours ago:\nminor: move the test imports higher up\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Ftest_wdoc.py\n\n\n\n- [0fba8a13] by @thiswillbeyourgithub, 2 hours ago:\ntest: add an unexpected env variable to test that it works\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Ftest_wdoc.py\n\n\n\n- [d3b1e2bc] by @thiswillbeyourgithub, 2 hours ago:\nfix: env variable handling could cause a crash\nSigned-off-","2025-04-18T13:19:50",{"id":226,"version":227,"summary_zh":228,"released_at":229},145554,"3.0.2","# What's new\n\n## Fixes\n\n- **Error Message Stability**  \n  - Fixed a crash caused by error messages in batch file loader\n    - Files affected:\n      - `wdoc\u002Futils\u002Fbatch_file_loader.py`\n      - `wdoc\u002Futils\u002Floaders.py`\n  - Commit hash: [4af7dc6f]\n  - Author: @thiswillbeyourgithub\n\n## Version 
Bump\n\n- **Version Update**  \n  - Updated version from `3.0.1` to `3.0.2` for better stability and minor enhancements\n    - Files affected for version bump:\n      - `bumpver.toml`\n      - `docs\u002Fsource\u002Fconf.py`\n      - `setup.py`\n      - `wdoc\u002Fwdoc.py`\n  - Commit hash: [504b5c9a]\n  - Author: @thiswillbeyourgithub\n\n### Note\nThese changes keep error reporting itself from crashing or interrupting a run, for a smoother user experience; the version bump marks an incremental release containing only internal fixes.\n\n## Commits details since the last release\n\n\n- [504b5c9a] by @thiswillbeyourgithub, 6 seconds ago:\nbump version 3.0.1 -> 3.0.2\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [4af7dc6f] by @thiswillbeyourgithub, 10 seconds ago:\nfix: error message was causing a crash\nSigned-off-by: thiswillbeyourgithub\n\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fbatch_file_loader.py\nwdoc\u002Futils\u002Floaders.py\n","2025-04-18T10:13:27",{"id":231,"version":232,"summary_zh":233,"released_at":234},145555,"3.0.1","# What's new\n\n## Version 3.0.1 - April 18, 2025\n\n### Chores & Housekeeping\n- **Version Bump:**\n  - Bumped version from 3.0.0 to 3.0.1. \n    - Commit: [3341823c] by @thiswillbeyourgithub\n\n### Bug Fixes\n- **Error Message Fix:**\n  - Resolved issue where error message was causing a crash.\n    - Location: `wdoc\u002Futils\u002Floaders.py`\n    - Commit: [20b5ccd4] by @thiswillbeyourgithub\n\n### Documentation\n- **Companion Tool Mention:**\n  - Updated README to mention that a companion tool might be needed.\n    - Location: `README.md`\n    - Commit: [75bc42c0] by @thiswillbeyourgithub\n\n### Testing\n- **Test Script Modification:**\n  - Changed script to use `rm` instead of `trash`. \n    - Location: `tests\u002Frun_all_tests.sh`\n    - Commit: [75a21ee8] by @thiswillbeyourgithub\n\n---\n\n## Commits details since the last release\n\n\n- [3341823c] by @thiswillbeyourgithub, 6 seconds ago:\nbump version 3.0.0 -> 3.0.1\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [20b5ccd4] by @thiswillbeyourgithub, 44 seconds ago:\nfix: error message was causing a crash\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Floaders.py\n\n\n\n- [75bc42c0] by @thiswillbeyourgithub, 16 hours ago:\ndoc: mention the companion tool might be needed\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nREADME.md\n\n\n\n- [75a21ee8] by @thiswillbeyourgithub, 18 hours ago:\ntest: use rm instead of trash\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Frun_all_tests.sh\n","2025-04-18T10:07:21",{"id":236,"version":237,"summary_zh":238,"released_at":239},145556,"3.0.0","# What's new\n\n- **Version Update 3.0.0**  \n- **Environment Variable Handling**  \n  - Completely revamped the approach to handling environment variables.  \n    - Commits: [fe49bd9e], [02d2f842]\n- **Logging Improvements**  \n  - Integrated `loguru` for more comprehensive logging (see the sketch after this list).  \n    - Commits: [31ade201]\n  - Altered default log level based on debug status.  \n    - Commits: [2e9b7f8d]
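To make the logging items above concrete, here is a minimal `loguru` configuration in the spirit of these commits, combined with the pipe-aware console level described in the 3.1.0 notes earlier; the file name, rotation size, and flag handling are hypothetical, not the actual contents of `wdoc/utils/logger.py`:

```python
import sys

from loguru import logger  # loguru is the backend adopted in [31ade201]

logger.remove()  # drop loguru's default stderr sink

# The logfile always records everything at DEBUG level.
logger.add("wdoc.log", level="DEBUG", rotation="10 MB")

# Console verbosity depends on context: near-silent when piped,
# DEBUG when the user asked for it, INFO otherwise.
is_debug = "--debug" in sys.argv  # hypothetical flag handling
if not sys.stdout.isatty():
    console_level = "CRITICAL"
elif is_debug:
    console_level = "DEBUG"
else:
    console_level = "INFO"
logger.add(sys.stderr, level=console_level)

logger.debug("always lands in wdoc.log; shown on screen only with --debug")
```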
\n\n## Enhancements\n- **Documentation and Scripts**  \n  - Updated README and other documentation to reflect recent changes.  \n    - Commits: [247ef4d5], [11cc311e], [e9fa7c17]\n  - Improved test scripts for more efficient testing.  \n    - Commits: [94963efd]\n- **Code Optimization**  \n  - Streamlined environment variable checking for redundancy optimization.  \n    - Commits: [3f6fcd63], [d4fd1d71]\n  - Various small optimizations across multiple files led to better semantic batch handling.  \n    - Commits: [04d2b92c], [6ac351f6], [cb1024dd]\n\n## Commits details since the last release\n\n\n- [51bbc55b] by @thiswillbeyourgithub, 29 minutes ago:\nbump version 2.9.0 -> 3.0.0\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [94963efd] by @thiswillbeyourgithub, 21 minutes ago:\nbetter test script\nSigned-off-by: thiswillbeyourgithub\n\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Frun_all_tests.sh\n\n\n\n- [247ef4d5] by @thiswillbeyourgithub, 30 minutes ago:\ndoc: update todo list\nSigned-off-by: thiswillbeyourgithub\n\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nREADME.md\n\n\n\n- [8f516e41] by @thiswillbeyourgithub, 45 minutes ago:\nfix: wrongly setting env vars to True instead of \"true\"\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fenv.py\nwdoc\u002Fwdoc.py\n\n\n\n- [3491b6ae] by @thiswillbeyourgithub, 47 minutes ago:\nfix: main was still using flags instead of env\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002F__main__.py\n\n\n\n- [c06ec641] by @thiswillbeyourgithub, 62 minutes ago:\nnew: compulsively check for unexpected values in env var\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fenv.py\n\n\n\n- [f2165468] by @thiswillbeyourgithub, 67 minutes ago:\nreplace a print by a logger.warning\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fenv.py\n\n\n\n- [f1335036] by @thiswillbeyourgithub, 70 minutes ago:\nremove weird handling of md_printing_disabled\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [138ef37f] by @thiswillbeyourgithub, 71 minutes ago:\nuse loguru in main instead of print\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002F__main__.py\n\n\n\n- [d4fd1d71] by @thiswillbeyourgithub, 73 minutes ago:\nnew: stop using flags.py to store something that should be stored in env.py\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fdocs\u002Fhelp.md\nwdoc\u002Futils\u002F__init__.py\nwdoc\u002Futils\u002Fbatch_file_loader.py\nwdoc\u002Futils\u002Fembeddings.py\nwdoc\u002Futils\u002Fenv.py\nwdoc\u002Futils\u002Fflags.py\nwdoc\u002Futils\u002Fllm.py\nwdoc\u002Futils\u002Floaders.py\nwdoc\u002Futils\u002Fmisc.py\nwdoc\u002Futils\u002Ftasks\u002Fquery.py\nwdoc\u002Fwdoc.py\n\n\n\n- [a06b09f5] by @thiswillbeyourgithub, 2 hours ago:\nminor: explanatory comment\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fenv.py\n\n\n\n- [b9904af7] by @thiswillbeyourgithub, 2 hours ago:\ntypo\nSigned-off-by: 
thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fdocs\u002Fhelp.md\n\n\n\n- [177e81ab] by @thiswillbeyourgithub, 2 hours ago:\nfix: unbounlocalerror incomprehenssible unless I reimport logger\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fwdoc.py\n\n\n\n- [69a60133] by @thiswillbeyourgithub, 2 hours ago:\nminor: move cache dir declaration misc.py\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\nwdoc\u002Futils\u002Fmisc.py\nwdoc\u002Fwdoc.py\n\n\n\n- [2e9b7f8d] by @thiswillbeyourgithub, 2 hours ago:\nswtich default log level depending on if is_debug is set\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [31ade201] by @thiswillbeyourgithub, 2 hours ago:\nfeat: switch logging backend to loguru\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002F__main__.py\nwdoc\u002Futils\u002Fbatch_file_loader.py\nwdoc\u002Futils\u002Fembeddings.py\nwdoc\u002Futils\u002Finteract.py\nwdoc\u002Futils\u002Fllm.py\nwdoc\u002Futils\u002Floaders.py\nwdoc\u002Futils\u002Flogger.py\nwdoc\u002Futils\u002Fmisc.py\nwdoc\u002Futils\u002Fprompts.py\nwdoc\u002Futils\u002Ftasks\u002Fquery.py\nwdoc\u002Futils\u002Ftasks\u002Fsummarize.py\nwdoc\u002Fwdoc.py\n\n\n\n- [d362034f] by @thiswillbeyourgithub, 3 hours ago:\nminor: pass the youtube playlist title metadata to docs\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002F","2025-04-16T16:56:08",{"id":241,"version":242,"summary_zh":243,"released_at":244},145557,"2.9.0","# What's new\n\n### New Features\n\n- **Shell Command Testing**\n  - Added shell command test for NYTimes parsing with content validation. [d3832f2d]\n\n### Fixes\n\n- **Intermediate Merging**\n  - Fixed error during batch merging, ensuring intermediate answers are handled properly when too large. [d06cbb36]\n  - Improved logic for retry attempts during batch merging. [573e15fa]\n- **Parsing Reliability**\n  - Prevent return of intermediately parsed output on parsing failure. [78c93640]\n  - Adapted handling to retry when encountering unparseable output or `finish_reason=length`. [763e9b4b]\n- **Backend and Output**\n  - Resolved backend errors and edge case issues breaking summary generation. [bf00ced0, 89c01de3]\n  - Corrected handling of test outputs to avoid crashing in specific import modes. [ee416ece]\n- **Testing Corrections**\n  - Corrected various test functions and ensured expected outcomes align with API changes. [bfcba71a, 89c01de3, 292ce90b]\n\n### Documentation\n\n- **General Updates**\n  - Enhanced walkthrough formatting for improved readability in documentation. [2ec7fad9]\n  - Provided context about the origins of wdoc in README. [31e6c5d5]\n- **Example and Help Docs**\n  - Updated examples documentation to remove confusing import_mode arguments. [cef5cdf1]\n  - Improved query and summary help documentation to reflect recent changes. [a289759c, f1a6294b]\n\n### Improvements\n\n- **Configuration and Setup Adjustments**\n  - Simplified the post-install script to handle dependencies via `uv` if present. [7a551432, 4ed0a350]\n- **Performance and Debugging Enhancements**\n  - Bumped max_tokens for intermediate answers to accommodate larger outputs during processing. 
[df059b61]\n  - Stored original strings before parsing for effective debugging. [6ec39576]\n  - Addressed deprecation warnings to keep up with latest standards. [15e4793c]\n\n### Minor Changes\n\n- **Code and Debug Tune-ups**\n  - Streamlined testing arguments and outputs for precise validation. [7e5e4ce7, b7f1a1f0, d79b4c51]\n  - Fine-tuned post-install scripts and functional debug outputs for better clarity. [d620f879]\n- **Enhanced wdoc Docs Via SVG Files (WIP)**\n  - Create SVG diagram and documentation for summary algorithm. [b5f49a96]\n  - Added SVG visualization and improved design and intuitive flow representation for better understanding. [9489a8af, 703dde08]\n\n\n## Commits details since the last release\n\n\n- [d06cbb36] by @thiswillbeyourgithub, 34 minutes ago:\nfix: error when merging batch when intermediate answers got so large the model can't merge them anymore\nWe just concatennate them using semantic order and that will be good\nenough, the alternative is too expensive\n\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fwdoc.py\n\n\n\n- [573e15fa] by @thiswillbeyourgithub, 35 minutes ago:\nfix: one more trial given to merge batches\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fwdoc.py\n\n\n\n- [7edbe1f8] by @thiswillbeyourgithub, 54 minutes ago:\ndoc: add helpful debug message if abrupt message tail\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [df059b61] by @thiswillbeyourgithub, 55 minutes ago:\nnew: bump max_token for intermediate answer from 1000 to 4000\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fwdoc.py\n\n\n\n- [78c93640] by @thiswillbeyourgithub, 3 hours ago:\nfix: don't return intermediately parsed output if parsing fails\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [6ec39576] by @thiswillbeyourgithub, 3 hours ago:\nminor: store the original string before parsing to help debugging\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Fmisc.py\n\n\n\n- [ee1c8571] by @thiswillbeyourgithub, 3 hours ago:\nminor: better order of the output price prints\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fwdoc.py\n\n\n\n- [cc49037a] by @thiswillbeyourgithub, 3 hours ago:\nfix: out_file test\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Ftest_wdoc.py\n\n\n\n- [cef5cdf1] by @thiswillbeyourgithub, 3 hours ago:\nfix: forgot to remove import_mode args from examples\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fdocs\u002Fexamples.md\n\n\n\n- [d3832f2d] by @thiswillbeyourgithub (aider), 3 hours ago:\nfeat: Add shell command test for NYTimes parsing with content validation\n\ntests\u002Ftest_wdoc.py\n\n\n\n- [ee416ece] by @thiswillbeyourgithub, 3 hours ago:\nnew: don't crash if using import_mode at the same time as out_file\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fwdoc.py\n\n\n\n- [32ecdbde] by @thiswillbeyourgithub, 4 hours ago:\ntest: remove unused debug and verbose 
argsc\nSigned-of","2025-03-17T19:25:42",{"id":246,"version":247,"summary_zh":248,"released_at":249},145558,"2.8.0","# What's new\n\n# New Features\n\n- Add an environment variable to control invalid document evaluation behavior\n- Add `WDOC_APPLY_ASYNCIO_PATCH` env variable to manage asyncio patching\n- Specify name of `LocalFileStore` for better logging\n- Add a decorator for more useful debug logs\n\n# Improvements\n\n## Logging\n- Introduced better format for debug prints\n- Harmonized default environment value presentation in the documentation\n\n## Refactoring\n- Replaced hash-based source identifiers with a consistent format\n- Improved source identifier handling for single document cases\n\n# Bug Fixes\n\n- **MAJOR** Resolved error where sources were not properly referenced\n- **MAJOR** Addressed problems with cluster detection in text analysis\n- Applied patch before running tests to resolve buggy processes\n- Corrected issues with concurrency setting causing unexpected behavior\n- Fixed an obsolete script referencing an outdated environment variable\n\n# Documentation\n\n- Removed outdated mention of winston doc, replacing with current references\n\n# Dependency Management\n\n- To run tests, `pytest-xdist` must be installed\n- Bumped `PersistDict` to the latest version\n\n# Minor Changes\n\n- Various minor code and logic corrections throughout the codebase\n\n## Commits details since the last release\n\n\n- [bf143b9c] by @thiswillbeyourgithub, 16 minutes ago:\nbump version 2.7.1 -> 2.8.0\n\nbumpver.toml\ndocs\u002Fsource\u002Fconf.py\nsetup.py\nwdoc\u002Fwdoc.py\n\n\n\n- [f97935e3] by @thiswillbeyourgithub, 3 minutes ago:\nRevert \"tests: remove the fixture from tests as they are bugging some tests\"\nThis reverts commit 34adb42a1f79f49711b50b0cd812a6d5cab993ca.\n\ntests\u002Fconftest.py\n\n\n\n- [6515373d] by @thiswillbeyourgithub, 6 minutes ago:\nfix: apply patch before running tests\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Fconftest.py\ntests\u002Frun_all_tests.sh\n\n\n\n- [34adb42a] by @thiswillbeyourgithub, 14 minutes ago:\ntests: remove the fixture from tests as they are bugging some tests\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Fconftest.py\n\n\n\n- [10633a61] by @thiswillbeyourgithub, 61 minutes ago:\nfix: cant use xdist for the api tests apparently\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Frun_all_tests.sh\n\n\n\n- [5e7233b2] by @thiswillbeyourgithub, 77 minutes ago:\nfix: obsolete script was using an old import env var name\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nscripts\u002FAnkiFiltered\u002FAnkiFilteredDeckCreator.py\n\n\n\n- [7265e000] by @thiswillbeyourgithub, 2 hours ago:\nfix: to run the tests we must install pytest-xdist\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Frun_all_tests.sh\n\n\n\n- [ed119347] by @thiswillbeyourgithub, 2 hours ago:\nfix: to run the tests we need to patch asyncio\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\ntests\u002Frun_all_tests.sh\n\n\n\n- [f382e173] by @thiswillbeyourgithub, 4 hours ago:\nnew: better format for debug prints\nSigned-off-by: 
thiswillbeyourgithub\n\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Flogger.py\n\n\n\n- [e213351b] by @thiswillbeyourgithub, 4 hours ago:\nminor: remove a mention of winston doc and replace by wdoc\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fdocs\u002Fhelp.md\n\n\n\n- [11b5c955] by @thiswillbeyourgithub, 4 hours ago:\nfix: set default concurrency to 1 actually because it is causing issues\nSigned-off-by: thiswillbeyourgithub\n\u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fdocs\u002Fhelp.md\nwdoc\u002Futils\u002Fenv.py\n\n\n\n- [dca2b477] by @thiswillbeyourgithub (aider), 7 hours ago:\nfeat: Add environment variable to control invalid document evaluation behavior\n\nwdoc\u002Fdocs\u002Fhelp.md\nwdoc\u002Futils\u002Fenv.py\nwdoc\u002Futils\u002Ftasks\u002Fquery.py\n\n\n\n- [ffa2d67e] by @thiswillbeyourgithub, 7 hours ago:\ndocs: harmonize default env valuee presentation\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fdocs\u002Fhelp.md\n\n\n\n- [95ec3aa0] by @thiswillbeyourgithub, 7 hours ago:\nfix: set default llm concurrency to 5 instead of 15\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fdocs\u002Fhelp.md\nwdoc\u002Futils\u002Fenv.py\n\n\n\n- [c056ba4e] by @thiswillbeyourgithub, 7 hours ago:\nfix: exit code should have been 0 not 1\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fwdoc.py\n\n\n\n- [26ba02b6] by @thiswillbeyourgithub, 7 hours ago:\nfix: litellm debugging\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Fwdoc.py\n\n\n\n- [9ac4f327] by @thiswillbeyourgithub, 7 hours ago:\ntypo\nSigned-off-by: thiswillbeyourgithub \u003C26625900+thiswillbeyourgithub@users.noreply.github.com>\n\nwdoc\u002Futils\u002Ftasks\u002Fquery.py\n\n\n\n- [3d3006e3] by @thiswillbeyourgithub (aider), 31 hours ago","2025-03-15T01:19:05"]