[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-re-search--DocProduct":3,"tool-re-search--DocProduct":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",143909,2,"2026-04-07T11:33:18",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 
助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":77,"owner_email":78,"owner_twitter":77,"owner_website":77,"owner_url":79,"languages":80,"stars":93,"forks":94,"last_commit_at":95,"license":96,"difficulty_score":97,"env_os":98,"env_gpu":99,"env_ram":100,"env_deps":101,"category_tags":110,"github_topics":111,"view_count":32,"oss_zip_url":77,"oss_zip_packed_at":77,"status":17,"created_at":123,"updated_at":124,"faqs":125,"releases":161},5144,"re-search\u002FDocProduct","DocProduct","Medical Q&A with Deep Language Models","DocProduct 是一个基于深度语言模型的医疗问答开源项目，旨在探索人工智能在医学信息检索与生成方面的潜力。它主要解决了如何从海量非结构化医疗数据中快速提取相关信息，并生成自然流畅回答的难题。该项目抓取了来自 Reddit、HealthTap 等平台的 70 万条医疗问答数据，通过结合 BERT 和 GPT-2 两大前沿模型，实现了对用户医学问题的智能响应：利用微调后的 BioBERT 精准理解问题语义，再借助 GPT-2 生成针对性的解答。\n\n技术上，DocProduct 的创新之处在于巧妙融合了 Transformer 架构、潜在向量搜索（Faiss）、负采样技术以及生成式预训练，并在 TensorFlow 2.0 框架下完成了从零构建。值得注意的是，该项目曾入围\"#PoweredByTF 2.0\"挑战赛六强，展示了强大的工程落地能力。\n\nDocProduct 非常适合 NLP 
研究人员、AI 开发者以及对医疗大模型应用感兴趣的技术爱好者使用，可作为研究医学文本编码与检索的优质参考案例。但请务必注意，项目明确声明其仅用于学术探索与技术验证，绝不可作为实际的医疗诊断或治疗建议依据","DocProduct 是一个基于深度语言模型的医疗问答开源项目，旨在探索人工智能在医学信息检索与生成方面的潜力。它主要解决了如何从海量非结构化医疗数据中快速提取相关信息，并生成自然流畅回答的难题。该项目抓取了来自 Reddit、HealthTap 等平台的 70 万条医疗问答数据，通过结合 BERT 和 GPT-2 两大前沿模型，实现了对用户医学问题的智能响应：利用微调后的 BioBERT 精准理解问题语义，再借助 GPT-2 生成针对性的解答。\n\n技术上，DocProduct 的创新之处在于巧妙融合了 Transformer 架构、潜在向量搜索（Faiss）、负采样技术以及生成式预训练，并在 TensorFlow 2.0 框架下完成了从零构建。值得注意的是，该项目曾入围\"#PoweredByTF 2.0\"挑战赛六强，展示了强大的工程落地能力。\n\nDocProduct 非常适合 NLP 研究人员、AI 开发者以及对医疗大模型应用感兴趣的技术爱好者使用，可作为研究医学文本编码与检索的优质参考案例。但请务必注意，项目明确声明其仅用于学术探索与技术验证，绝不可作为实际的医疗诊断或治疗建议依据。","![python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython%20-3.7.1-brightgreen.svg) ![tensorflow](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftensorflow-2.0.0--alpha0-orange.svg) ![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)\n\n# Doc Product: Medical Q&A with Deep Language Models\n\nCollaboration between Santosh Gupta, Alex Sheng, and Junpeng Ye\n\nDownload trained models and embedding file [here](https:\u002F\u002F1drv.ms\u002Fu\u002Fs!An_n1-LB8-2dgoAYrXYnnBSA4d5dsg?e=i3mnFH).\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fre-search_DocProduct_readme_e7620f1b72bc.jpg\">\n\u003C\u002Fp>\n\nWinner Top 6 Finalist of the ⚡#PoweredByTF 2.0 Challenge! https:\u002F\u002Fdevpost.com\u002Fsoftware\u002Fnlp-doctor . Doc Product will be presented to the Tensorflow Engineering Team at Tensorflow Connect. Stay tuned for details. 
\n\nWe wanted to use TensorFlow 2.0 to explore how well state-of-the-art natural language processing models like [BERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04805) and [GPT-2](https:\u002F\u002Fopenai.com\u002Fblog\u002Fbetter-language-models\u002F) could respond to medical questions by retrieving and conditioning on relevant medical data, and this is the result.\n\n## DISCLAIMER\n\nThe purpose of this project is to explore the capabilities of deep learning language models for scientific encoding and retrieval. IT SHOULD NOT BE USED FOR ACTIONABLE MEDICAL ADVICE.\n\n## How we built Doc Product\n\nAs a group of friends with diverse backgrounds ranging from broke undergrads to data scientists to top-tier NLP researchers, we drew inspiration for our design from various different areas of machine learning. By combining the power of [transformer architectures](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.03762), [latent vector search](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffaiss), [negative sampling](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf), and [generative pre-training](https:\u002F\u002Fd4mucfpksywv.cloudfront.net\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf) within TensorFlow 2.0's flexible deep learning framework, we were able to come up with a novel solution to a difficult problem that at first seemed like a herculean task.\n\n\u003Cdiv style=\"text-align:center\">\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FwzWt039.png\" \u002F>\u003C\u002Fdiv>\n\n- 700,000 medical questions and answers scraped from Reddit, HealthTap, WebMD, and several other sites\n- Fine-tuned TF 2.0 [BERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04805) with [pre-trained BioBERT weights](https:\u002F\u002Farxiv.org\u002Fabs\u002F1901.08746) for extracting representations from text\n- Fine-tuned TF 2.0 
[GPT-2](https:\u002F\u002Fd4mucfpksywv.cloudfront.net\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf) with OpenAI's GPT-2-117M parameters for generating answers to new questions\n- Network heads for mapping question and answer embeddings to metric space, made with a Keras.Model feedforward network\n- Over a terabyte of TFRECORDS, CSV, and CKPT data\n\nIf you're interested in the whole story of how we built *Doc Product* and the details of our architecture, [take a look at our GitHub README](https:\u002F\u002Fgithub.com\u002FSantosh-Gupta\u002FDocProduct)!\n\n## Challenges\n\nOur project was fraught with too many challenges to count, from compressing astronomically large datasets, to re-implementing the entirety of BERT in TensorFlow 2.0, to running GPT-2 with 117 million parameters in Colaboratory, to rushing to get the last parts of our project ready with a few hours left until the submission deadline. Oddly enough, the biggest challenges were often when we had disagreements about the direction the project should be headed in. However, although we'd disagree about what the best course of action was, in the end we all had the same goal of building something meaningful and potentially valuable for a lot of people. 
That being said, we would always eventually be able to sit down and come to an agreement and, with each other's support and late-night pep talks over Google Hangouts, rise to the challenges and overcome them together.\n\n## What's next?\n\nAlthough *Doc Product* isn't ready for widespread commercial use, its surprisingly good performance shows that advancements in general language models like [BERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04805) and [GPT-2](https:\u002F\u002Fd4mucfpksywv.cloudfront.net\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf) have made previously intractable problems like medical information processing accessible to deep NLP-based approaches. Thus, we hope that our work serves to inspire others to tackle these problems and explore the newly open NLP frontier themselves.\n\nNevertheless, we still plan to continue work on *Doc Product*, specifically expanding it to take advantage of the 345M, 762M, and 1.5B parameter versions of GPT-2 as OpenAI releases them as part of their [staged release program](https:\u002F\u002Fopenai.com\u002Fblog\u002Fbetter-language-models\u002F#update). We also intend to continue training the model, since we still have quite a bit more data to go through.\n\nNOTE: We are currently working on research in scientific\u002Fmedical NLP and information retrieval. **If you're interested in collaborating, shoot us an e-mail at Research2Vec@gmail.com!**\n\n## Try it out!\n\n### Install from pip\n\nYou can install *Doc Product* directly from pip and run it on your local machine. 
Here's the code to install *Doc Product*, along with TensorFlow 2.0 and FAISS:\n\n```\n!wget  https:\u002F\u002Fanaconda.org\u002Fpytorch\u002Ffaiss-cpu\u002F1.2.1\u002Fdownload\u002Flinux-64\u002Ffaiss-cpu-1.2.1-py36_cuda9.0.176_1.tar.bz2\n#To use GPU FAISS use\n# !wget  https:\u002F\u002Fanaconda.org\u002Fpytorch\u002Ffaiss-gpu\u002F1.2.1\u002Fdownload\u002Flinux-64\u002Ffaiss-gpu-1.2.1-py36_cuda9.0.176_1.tar.bz2\n!tar xvjf faiss-cpu-1.2.1-py36_cuda9.0.176_1.tar.bz2\n!cp -r lib\u002Fpython3.6\u002Fsite-packages\u002F* \u002Fusr\u002Flocal\u002Flib\u002Fpython3.6\u002Fdist-packages\u002F\n!pip install mkl\n\n!pip install tensorflow-gpu==2.0.0-alpha0\nimport tensorflow as tf\n!pip install https:\u002F\u002Fgithub.com\u002FSantosh-Gupta\u002FDocProduct\u002Farchive\u002Fmaster.zip\n```\n \nOur repo contains scripts for generating **.tfrecords** data, training *Doc Product* on your own Q&A data, and running *Doc Product* to get answers for medical questions. Please see the **Google Colaboratory demos** section below for code samples to load data\u002Fweights and run our models.\n\n### Colaboratory demos\n\n[Take a look at our Colab demos!](https:\u002F\u002Fdrive.google.com\u002Fopen?id=1kXqsE4N0MgfktsEJpZZn470yOZ3UQ10F) We plan on adding more demos as we go, allowing users to explore more of the functionalities of *Doc Product*. 
All new demos will be added to the same Google Drive folder.\n\nThe demos include code for installing *Doc Product* via pip, downloading\u002Floading pre-trained weights, and running *Doc Product*'s retrieval functions and fine-tuning on your own Q&A data.\n\n#### Run our interactive retrieval model to get answers to your medical questions\nhttps:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F11hAr1qo7VCSmIjWREFwyTFblU2LVeh1R\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FZ8DOXuJ.png\">\n\u003C\u002Fp>\n\n#### Train your own medical Q&A retrieval model\nhttps:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1Rz2rzkwWrVEXcjiQqTXhxzLCW5cXi7xA\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fsnag.gy\u002FWPdV5T.jpg\">\n\u003C\u002Fp>\n\n#### [Experimental] Run the full Doc Product pipeline with BERT, FCNN, FAISS, and GPT-2 to get your medical questions answered by state-of-the-art AI.\nThe end-to-end *Doc Product* demo is still **experimental**, but feel free to try it out!\nhttps:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1Bv7bpPxIImsMG4YWB_LWjDRgUHvi7pxx\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fsnag.gy\u002FWU1YPE.jpg\">\n\u003C\u002Fp>\n\n## What it does\n\nOur BERT has been trained to encode medical questions and medical information. A user can type in a medical question, and our model will retrieve the medical information most relevant to that question.\n\n## Data\n\nWe created datasets from several medical question-and-answer forums. The forums are WebMD, HealthTap, eHealthForums, iClinic, Question Doctors, and Reddit.com\u002Fr\u002FAskDocs.\n\n## Architecture \n\nThe architecture consists of a fine-tuned bioBert (same for both questions and answers) to convert text input to an embedding representation. The embedding is then input into an FCNN (a different one for the questions and answers) to develop an embedding that is used for similarity lookup. 
The top similar questions and answers are then used by GPT-2 to generate an answer. The full architecture is shown below. \n\nLet's take a look at the first half of the diagram above in more detail, the training of the BERT and the FCNNs. A detailed figure of this part is shown below\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FIRCyKIL.jpg?1\">\n\u003C\u002Fp>\n\nDuring training, we take a batch of medical questions and their corresponding medical answers, and convert them to bioBert embeddings. The same Bert weights are used for both the questions and answers. \n\n![DoctorBert](https:\u002F\u002Fi.imgur.com\u002FLpjjcvk.jpg)\n\nThese embeddings are then input into an FCNN layer. There are separate FCNN layers for the question and answer embeddings. To recap, we use the same weights in the Bert layer, but the questions and answers each have their own separate FCNN layer. \n\n![DoctorBert](https:\u002F\u002Fi.imgur.com\u002F6HwikW2.jpg)\n\nNow here's where things get a little tricky. Usually embedding similarity training involves negative samples, like how word2vec uses NCE loss. However, we cannot use NCE loss in our case since the embeddings are generated during each step, and the weights change during each training step. \n\nSo instead of NCE loss, what we did was compute the dot product for every combination of the question and answer embeddings within our batch. This is shown in the figure below\n\n![DoctorBert](https:\u002F\u002Fi.imgur.com\u002FKOyiCJU.jpg)\n\nThen, a softmax is taken across the rows; for each question, all of its answer combinations are softmaxed. \n\n![DoctorBert](https:\u002F\u002Fi.imgur.com\u002FX6N84Gd.jpg)\n\nFinally, the loss used is cross entropy loss. The softmaxed matrix is compared to a ground truth matrix; the correct combinations of questions and answers are labeled with a '1', and all the other combinations are labeled with a '0'. 
\n\n## Technical Obstacles we ran into\n\n### Data Gathering and Wrangling\n\nThe data gathering was tricky because the formatting of all of the different medical sites was significantly different. Custom work needed to be done for each site in order to pull questions and answers from the correct portion of the HTML tags. Some of the sites also had the possibility of multiple doctors responding to a single question, so we needed a method of gathering multiple responses to individual questions. In order to deal with this, we created multiple rows for every question-answer pair. From here we needed to run the model through BERT and store the outputs from one of the end layers in order to make BioBERT embeddings we could pass through the dense layers of our feed-forward neural network (FFNN). 768-dimensional vectors were stored for both the questions and answers and concatenated with the corresponding text in a CSV file. We tried various different formats for more compact and faster loading and sharing, but CSV ended up being the easiest and most flexible method. After the BioBERT embeddings were created and stored, the similarity training process was done, and then FFNN embeddings were created that would capture the similarity of questions to answers. These were also stored along with the BioBERT embeddings and source text for later visualization and querying.\n\n### Combining Models Built in TF 1.X and TF 2.0\n\nThe embedding models are built in TF 2.0, which utilizes the flexibility of eager execution. However, the GPT-2 model that we use is built in TF 1.X. Luckily, we can train the two models separately. During inference, we need to disable eager execution with tf.compat.v1.disable_eager_execution and maintain two separate sessions. 
We also need to take care of the GPU memory of the two sessions to avoid OOM.\n\n## Accomplishments that we're proud of\n\n### Robust Model with Careful Loss and Architecture Design\n\nOne obvious approach to retrieving answers based on a user’s questions is to use a powerful encoder (BERT) to encode input questions and the questions in our database and do a similarity search. There is no training involved, and the performance of this approach relies entirely on the encoder. Instead, we use separate feed-forward networks for questions and answers and calculate the cosine similarity between them. Inspired by the negative sampling of the word2vec paper, we treat other answers in the same batch as negative samples and calculate a cross-entropy loss. This approach makes the question and answer embeddings in one pair as close as possible in terms of Euclidean distance. It turns out that this approach yields more robust results than doing a similarity search directly on the BERT embedding vectors.\n\n### High-performance Input Pipeline\n\nThe preprocessing for BERT is complicated, and we have around 333K QA pairs and over 30 million tokens in total. Considering that shuffling is very important in our training, we need the shuffle buffer to be sufficiently large to properly train our model. It took over 10 minutes to preprocess the data before starting to train the model in each epoch. So we used tf.data and TFRecords to build a high-performance input pipeline. After the optimization, it only took around 20 seconds to start training, with no GPU idle time. \n\nAnother problem with BERT preprocessing is that it pads all data to a fixed length. Therefore, for short sequences, a lot of computation and GPU memory are wasted. This is especially important with big models like BERT. 
So we rewrote the BERT preprocessing code and made use of [tf.data.experimental.bucket_by_sequence_length](https:\u002F\u002Fwww.tensorflow.org\u002Fversions\u002Fr2.0\u002Fapi_docs\u002Fpython\u002Ftf\u002Fdata\u002Fexperimental\u002Fbucket_by_sequence_length) to bucket sequences of different lengths and dynamically pad them. By doing this, we achieved a longer max sequence length and faster training.\n\n### Imperative BERT Model\n\nAfter some modification, Keras-Bert is able to run in a TF 2.0 environment. However, when we tried to use Keras-Bert as a sub-model in our embedding models, we found the following two problems:\n- It uses the functional API. The functional API is very flexible; however, it’s still symbolic. That means that even though eager execution is enabled, we still cannot use traditional Python debugging methods at run time. In order to fully utilize the power of eager execution, we need to build the model using tf.keras.Model.\n- We are not directly using the input layer of Keras-Bert and ran into this [issue](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftensorflow\u002Fissues\u002F27543). It’s not easy to avoid this bug without changing our input pipeline.\n\nAs a result, we decided to re-implement an imperative version of BERT. We used some components of Keras-Bert (multi-head attention, checkpoint weight loading, etc.) and wrote the call method of Bert ourselves. Our implementation is easier to debug and compatible with both the flexible eager mode and the high-performance static graph mode.\n\n### Answer Generation with Auxiliary Inputs\n\nUsers may experience multiple symptoms in various conditions, which means the ideal answer might be a combination of multiple answers. To tackle that, we make use of the powerful GPT-2 model and feed the model the questions from users along with the Top K auxiliary answers that we retrieved from our data. The GPT-2 model then conditions on the question and the Top K answers to generate a better answer. 
To properly train the GPT-2 model, we create the training data as follows: we take every question in our dataset, do a similarity search to obtain the top K+1 answers, and use the original answer as the target and the other answers as auxiliary inputs. By doing this, we get the same amount of GPT-2 training data as embedding model training data.\n\n## What we learned\n\nBert is fantastic for encoding medical questions and answers, and developing robust vector representations of those questions\u002Fanswers. \n\nWe trained a fine-tuned version of our model which was initialized with Naver's bioBert. We also trained a version where the bioBert weights were frozen, and only trained the two FCNNs for the questions and answers. While we expected the fine-tuned version to work well, we were surprised at how robust the latter was. This suggests that bioBert has an innate ability to encode the meaning of medical questions and answers. \n\n## What's next for Information Retrieval w\u002FBERT for Medical Question Answering \n\nExplore if there's any practical use of this project outside of research\u002Fexploratory purposes. A model like this should not be used by the public for obtaining medical information. But perhaps it can be used by trained\u002Flicensed medical professionals to gather information for vetting. \n\n\nExplore applying the same method to other domains (e.g. history information retrieval, engineering information retrieval, etc.).\n\nExplore how the recently released sciBert (from Allen AI) compares against Naver's bioBert. \n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fre-search_DocProduct_readme_f5d118389aed.jpg\">\n\u003C\u002Fp>\n\n## Thanks!\n\nWe give our thanks to the TensorFlow team for providing the #PoweredByTF2.0 Challenge as a platform through which we could share our work with others, and a special thanks to Dr. 
Llion Jones, whose insights and guidance had an important impact on the direction of our project.\n","![python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython%20-3.7.1-brightgreen.svg) ![tensorflow](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Ftensorflow-2.0.0--alpha0-orange.svg) ![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)\n\n# 文档产品：基于深度语言模型的医疗问答系统\n\n桑托什·古普塔、亚历克斯·盛和叶俊鹏的合作成果\n\n已训练好的模型及嵌入文件可在此下载：[点击这里](https:\u002F\u002F1drv.ms\u002Fu\u002Fs!An_n1-LB8-2dgoAYrXYnnBSA4d5dsg?e=i3mnFH)。\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fre-search_DocProduct_readme_e7620f1b72bc.jpg\">\n\u003C\u002Fp>\n\n荣获⚡#PoweredByTF 2.0挑战赛前六名决赛选手！详情请访问：https:\u002F\u002Fdevpost.com\u002Fsoftware\u002Fnlp-doctor。Doc Product将在TensorFlow Connect上向TensorFlow工程团队进行展示，敬请期待后续详情。\n\n我们希望利用TensorFlow 2.0来探索当前最先进的自然语言处理模型，如[BERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04805)和[GPT-2](https:\u002F\u002Fopenai.com\u002Fblog\u002Fbetter-language-models\u002F)，在检索并基于相关医疗数据进行条件生成的情况下，对医学问题作出响应的能力。这就是我们的成果。\n\n## 免责声明\n\n本项目旨在探索深度学习语言模型在科学编码与信息检索方面的潜力，但绝不应被用于提供可执行的医疗建议。\n\n## 我们如何构建Doc Product\n\n作为一群背景各异的朋友——从囊中羞涩的本科生到数据科学家，再到顶尖的自然语言处理研究人员——我们在机器学习的多个领域汲取灵感，设计出了这一方案。通过将[Transformer架构](https:\u002F\u002Farxiv.org\u002Fabs\u002F1706.03762)、[潜在向量搜索](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffaiss)、[负采样](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf)以及[TensorFlow 2.0灵活的深度学习框架中的生成式预训练](https:\u002F\u002Fd4mucfpksywv.cloudfront.net\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf)相结合，我们成功地为一个看似艰巨的任务提供了一种新颖的解决方案。\n\n\u003Cdiv style=\"text-align:center\">\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FwzWt039.png\" \u002F>\u003C\u002Fdiv>\n\n- 从Reddit、HealthTap、WebMD等多个网站抓取的70万条医学问答\n- 
使用[预训练的BioBERT权重](https:\u002F\u002Farxiv.org\u002Fabs\u002F1901.08746)对TF 2.0 [BERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04805)进行微调，以提取文本表示\n- 使用OpenAI的GPT-2-117M参数对TF 2.0 [GPT-2](https:\u002F\u002Fd4mucfpksywv.cloudfront.net\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf)进行微调，用于生成新问题的答案\n- 利用Keras.Model前馈网络构建的网络头部，将问题和答案的嵌入映射到度量空间\n- 超过一太字节的TFRECORDS、CSV和CKPT数据\n\n如果您想了解我们构建*Doc Product*的完整过程及其架构细节，请查看我们的GitHub自述文件：[点击这里](https:\u002F\u002Fgithub.com\u002FSantosh-Gupta\u002FDocProduct)！\n\n## 挑战\n\n我们的项目面临了数不胜数的挑战，从压缩天文数字般庞大的数据集，到在TensorFlow 2.0中重新实现整个BERT模型，再到在Colaboratory中运行拥有1.17亿参数的GPT-2，以及在提交截止时间仅剩几小时时赶工完成项目的最后部分。奇怪的是，最大的挑战往往出现在我们对项目发展方向产生分歧的时候。然而，尽管我们有时会就最佳行动方案争执不下，但最终我们都怀着共同的目标——打造一件对许多人有意义且可能具有价值的作品。因此，我们总能坐下来达成一致，并在彼此的支持和深夜Google Hangouts上的鼓励下，携手克服这些挑战。\n\n## 下一步计划\n\n虽然*Doc Product*目前尚不具备广泛商业化的条件，但其令人惊喜的良好表现表明，像[BERT](https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04805)和[GPT-2](https:\u002F\u002Fd4mucfpksywv.cloudfront.net\u002Fbetter-language-models\u002Flanguage_models_are_unsupervised_multitask_learners.pdf)这样的通用语言模型的进步，已经使得以往难以解决的医学信息处理等问题，可以通过深度NLP方法得以实现。因此，我们希望自己的工作能够激励更多人投身于这些问题的研究，并亲自探索这片全新的NLP前沿。\n\n尽管如此，我们仍计划继续推进*Doc Product*的开发，特别是随着OpenAI按照其[分阶段发布计划](https:\u002F\u002Fopenai.com\u002Fblog\u002Fbetter-language-models\u002F#update)陆续推出3.45亿、7.62亿和15亿参数版本的GPT-2，我们将进一步扩展该系统以充分利用这些新版本的功能。此外，我们还打算继续训练模型，因为我们仍有大量数据有待处理。\n\n注：我们目前正在从事科学\u002F医学领域的NLP和信息检索研究。**如果您有意合作，请发送邮件至Research2Vec@gmail.com！**\n\n## 体验一下吧！\n\n### 通过pip安装\n\n您可以直接从pip安装*Doc Product*，并在本地机器上运行。以下是安装*Doc Product*以及TensorFlow 2.0和FAISS的代码：\n\n```\n!wget  https:\u002F\u002Fanaconda.org\u002Fpytorch\u002Ffaiss-cpu\u002F1.2.1\u002Fdownload\u002Flinux-64\u002Ffaiss-cpu-1.2.1-py36_cuda9.0.176_1.tar.bz2\n#若需使用GPU FAISS，请使用：\n# !wget  https:\u002F\u002Fanaconda.org\u002Fpytorch\u002Ffaiss-gpu\u002F1.2.1\u002Fdownload\u002Flinux-64\u002Ffaiss-gpu-1.2.1-py36_cuda9.0.176_1.tar.bz2\n!tar xvjf faiss-cpu-1.2.1-py36_cuda9.0.176_1.tar.bz2\n!cp 
-r lib\u002Fpython3.6\u002Fsite-packages\u002F* \u002Fusr\u002Flocal\u002Flib\u002Fpython3.6\u002Fdist-packages\u002F\n!pip install mkl\n\n!pip install tensorflow-gpu==2.0.0-alpha0\nimport tensorflow as tf\n!pip install https:\u002F\u002Fgithub.com\u002FSantosh-Gupta\u002FDocProduct\u002Farchive\u002Fmaster.zip\n```\n\n我们的仓库包含用于生成**.tfrecords**数据、基于您自己的问答数据训练*Doc Product*，以及运行*Doc Product*以获取医学问题答案的脚本。请参阅下方的**Google Colaboratory演示**部分，获取加载数据\u002F权重并运行我们模型的代码示例。\n\n### Colaboratory 演示\n\n[请查看我们的 Colab 演示！](https:\u002F\u002Fdrive.google.com\u002Fopen?id=1kXqsE4N0MgfktsEJpZZn470yOZ3UQ10F) 我们计划在后续不断添加更多演示，以便用户能够探索 *Doc Product* 的更多功能。所有新演示都将被添加到同一个 Google Drive 文件夹中。\n\n这些演示包括通过 pip 安装 *Doc Product* 的代码、下载\u002F加载预训练权重，以及运行 *Doc Product* 的检索功能并在您自己的问答数据上进行微调。\n\n#### 运行我们的交互式检索模型，解答您的医学问题\nhttps:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F11hAr1qo7VCSmIjWREFwyTFblU2LVeh1R\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FZ8DOXuJ.png\">\n\u003C\u002Fp>\n\n#### 训练您自己的医学问答检索模型\nhttps:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1Rz2rzkwWrVEXcjiQqTXhxzLCW5cXi7xA\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fsnag.gy\u002FWPdV5T.jpg\">\n\u003C\u002Fp>\n\n#### [实验性] 使用 BERT、FCNN、FAISS 和 GPT-2 运行完整的 Doc Product 流程，让最先进的 AI 回答您的医学问题。\n端到端的 *Doc Product* 演示目前仍处于 **实验阶段**，但欢迎您尝试！\nhttps:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1Bv7bpPxIImsMG4YWB_LWjDRgUHvi7pxx\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fsnag.gy\u002FWU1YPE.jpg\">\n\u003C\u002Fp>\n\n## 功能简介\n\n我们的 BERT 经过训练，能够对医学问题和医学信息进行编码。用户只需输入一个医学问题，我们的模型便会检索出与该问题最相关的医学信息。\n\n## 数据\n\n我们从多个医学问答论坛中创建了数据集。这些论坛包括 WebMD、HealthTap、eHealthForums、iClinic、Question Doctors 以及 Reddit.com\u002Fr\u002FAskDocs。\n\n## 架构\n\n该架构由一个经过微调的 bioBert 组成（问题和答案共用同一模型），用于将文本输入转换为嵌入表示。随后，这些嵌入会被输入到 FCNN 中（问题和答案分别使用不同的 FCNN），以生成用于相似度查找的嵌入。最后，GPT-2 会利用检索到的最相似问题和答案来生成最终答案。完整架构如下所示。\n\n让我们更详细地看一下上图的前半部分——BERT 和 FCNN 的训练过程。下图展示了这一部分的详细结构：\n\n\u003Cp 
align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FIRCyKIL.jpg?1\">\n\u003C\u002Fp>\n\n在训练过程中，我们会选取一批医学问题及其对应的医学答案，并将其转换为 bioBert 嵌入。问题和答案均使用相同的 Bert 权重。\n\n![DoctorBert](https:\u002F\u002Fi.imgur.com\u002FLpjjcvk.jpg)\n\n这些嵌入随后会被输入到 FCNN 层中。问题和答案的嵌入分别对应不同的 FCNN 层。总结一下，我们在 BERT 层中使用相同的权重，但问题和答案各自拥有独立的 FCNN 层。\n\n![DoctorBert](https:\u002F\u002Fi.imgur.com\u002F6HwikW2.jpg)\n\n接下来的情况就有点复杂了。通常，嵌入相似度训练会涉及负样本，比如 word2vec 就使用 NCE 损失函数。然而，在我们的场景中无法使用 NCE 损失，因为嵌入是在每一步生成的，而权重也会在每次训练步骤中发生变化。\n\n因此，我们没有采用 NCE 损失，而是计算了批次内所有问题和答案嵌入之间的点积。如下图所示：\n\n![DoctorBert](https:\u002F\u002Fi.imgur.com\u002FKOyiCJU.jpg)\n\n接着，我们对每一行进行 softmax 处理：对于每个问题，其所有可能的答案组合都会被 softmax 化。\n\n![DoctorBert](https:\u002F\u002Fi.imgur.com\u002FX6N84Gd.jpg)\n\n最后，我们使用的损失函数是交叉熵损失。我们将 softmax 后的矩阵与真实标签矩阵进行比较：正确的问题-答案组合标记为“1”，其他组合则标记为“0”。\n\n## 遇到的技术难题\n\n### 数据收集与清洗\n\n数据收集的过程颇具挑战，因为不同医学网站的格式差异很大。我们需要针对每个网站进行定制化处理，才能从正确的 HTML 标签部分提取问题和答案。此外，有些网站允许多位医生对同一问题作出回复，因此我们也需要一种方法来收集单个问题的多条回复。为此，我们为每一对问题-答案创建了多行记录。之后，我们需要将数据输入 BERT 模型，并保存其中间层的输出，从而生成可用于全连接神经网络（FFNN）密集层的 BioBERT 嵌入。我们为问题和答案分别存储了 768 维向量，并将其与相应的文本一起保存到 CSV 文件中。尽管我们尝试过多种更紧凑、加载和共享速度更快的格式，但最终发现 CSV 是最简单且最具灵活性的方式。在 BioBERT 嵌入生成并存储完成后，我们进行了相似度训练，进而生成了能够捕捉问题与答案之间相似性的 FFNN 嵌入。这些嵌入也与 BioBERT 嵌入及原始文本一同保存，以便后续可视化和查询。\n\n### 结合 TF 1.X 和 TF 2.0 构建的模型\n\n嵌入模型是基于 TF 2.0 构建的，充分利用了 TF 2.0 的动态执行特性。然而，我们使用的 GPT-2 模型却是基于 TF 1.X 构建的。幸运的是，我们可以分别训练这两个模型。在推理阶段，我们需要通过 tf.compat.v1.disable_eager_execution 关闭动态执行，并维持两个独立的会话。同时，我们还需要合理管理两个会话的 GPU 内存，以避免出现内存不足（OOM）的情况。\n\n## 我们的自豪之处\n\n### 精心设计的鲁棒模型与损失函数\n\n一种常见的基于用户提问检索答案的方法是使用强大的编码器（BERT）对输入问题和数据库中的问题进行编码，然后进行相似度搜索。这种方法无需训练，性能完全依赖于编码器。相比之下，我们分别为问题和答案构建了独立的前馈神经网络，并计算它们之间的余弦相似度。受 word2vec 论文中的负采样启发，我们将同一批次中的其他答案视为负样本，计算交叉熵损失。这种方法使得同一配对中的问题嵌入和答案嵌入在欧几里得距离上尽可能接近。实践证明，这种方式比直接使用 BERT 嵌入向量进行相似度搜索更为稳健。\n\n### 高性能输入管道\n\nBERT 的预处理过程较为复杂，我们的数据集包含约 33.3 万个问答对和超过 3000 万个词元。考虑到在训练过程中打乱数据顺序非常重要，我们需要一个足够大的打乱缓冲区，以确保模型能够得到充分的随机化训练。在每个 epoch 开始训练之前，数据预处理需要花费超过 10 分钟。因此，我们采用了 tf.data 和 TFRecords 来构建高性能的输入管道。经过优化后，启动训练的时间缩短至约 
Another problem with BERT preprocessing is that it pads all data to a fixed length, which wastes a lot of compute and GPU memory on short sequences. This matters especially with large models like BERT. We therefore rewrote BERT's preprocessing to bucket sequences of different lengths with [tf.data.experimental.bucket_by_sequence_length](https://www.tensorflow.org/versions/r2.0/api_docs/python/tf/data/experimental/bucket_by_sequence_length) and pad dynamically. This let us support a longer maximum sequence length while also significantly speeding up training.

### An imperative BERT model

With some modifications, Keras-Bert runs on TensorFlow 2.0. But when we tried to use Keras-Bert as a submodule of our embedding model, we hit two problems:
- It uses the functional API. The functional API is flexible, but it is still symbolic: even with eager execution enabled, we cannot use ordinary Python debugging at runtime. To get the full benefit of eager execution, we need to build the model with tf.keras.Model.
- We do not use Keras-Bert's input layers directly, which triggers [this issue](https://github.com/tensorflow/tensorflow/issues/27543). The bug is hard to avoid without changing our input pipeline.

For these reasons we decided to reimplement an imperative version of BERT. We reused some components from Keras-Bert (multi-head attention, checkpoint weight loading, and so on) and wrote BERT's call method ourselves. Our implementation is easier to debug and works both in the flexible eager mode and in the high-performance graph mode.

### Answer generation with auxiliary inputs

Users may experience many combinations of symptoms, so the ideal answer may be a combination of several answers. To address this, we use the powerful GPT-2 model: we feed it the user's question together with the top-K auxiliary answers retrieved from our dataset, and GPT-2 generates a better answer conditioned on both. To train GPT-2 effectively, we built its training data as follows: for every question in the dataset, we retrieve the top K+1 answers by similarity search, treat the original answer as the target, and use the remaining answers as auxiliary inputs. This gives us as much GPT-2 training data as we have for the embedding model.

## What we learned

BERT is excellent at encoding medical-domain questions and answers, producing robust vector representations for both.

We trained one version fine-tuned from Naver's BioBERT initialization, and another version that froze the BioBERT weights and trained only the two fully connected networks for questions and answers. We expected the fine-tuned version to work better, but surprisingly the frozen version was also very robust. This suggests BioBERT has a natural ability to encode medical-domain questions and answers effectively.

## What's next for BERT in medical Q&A information retrieval

Explore whether a project like this has practical value beyond research and exploration. A model like this should not be used by the general public to obtain medical information; it could, however, perhaps be used by trained or licensed medical professionals to gather information for further review.

Explore applying the same approach to other domains (historical information retrieval, engineering information retrieval, and so on).

Compare the performance of the recently released SciBERT (from Allen AI) with Naver's BioBERT.

<p align="center">
  <img src="https://oss.gittoolsai.com/images/re-search_DocProduct_readme_f5d118389aed.jpg">
</p>

## Thanks!

We thank the TensorFlow team for the #PoweredByTF2.0 challenge, which gave us a platform to share our work. Special thanks to Dr. Llion Jones, whose insight and guidance shaped the direction of this project.

# DocProduct Quick Start Guide

DocProduct is a medical question-answering retrieval system built on deep language models (BERT and GPT-2).
It answers users' medical questions by retrieving relevant medical data and generating a response.

> **⚠️ Important disclaimer**: This project explores what deep language models can do for scientific encoding and retrieval. It **must not be used to provide actionable medical advice or to replace a professional diagnosis**.

## Prerequisites

Before starting, make sure your development environment meets the following requirements:

*   **Operating system**: Linux (Ubuntu recommended) or macOS. Windows users should use WSL2 or Docker.
*   **Python**: 3.7.1 (the version the project was tested on).
*   **Deep learning framework**: TensorFlow 2.0.0-alpha0.
*   **Vector search library**: FAISS (Facebook AI Similarity Search).
*   **Hardware**: a GPU is recommended for inference and training, but the retrieval functionality also runs on CPU.

## Installation

You can run locally or on Google Colab. The steps below install via `pip` in a local environment.

### 1. Install the FAISS dependency
First download and set up the FAISS library (CPU build shown here; see the comment for the GPU build):

```bash
!wget  https://anaconda.org/pytorch/faiss-cpu/1.2.1/download/linux-64/faiss-cpu-1.2.1-py36_cuda9.0.176_1.tar.bz2
# To use GPU FAISS use
# !wget  https://anaconda.org/pytorch/faiss-gpu/1.2.1/download/linux-64/faiss-gpu-1.2.1-py36_cuda9.0.176_1.tar.bz2
!tar xvjf faiss-cpu-1.2.1-py36_cuda9.0.176_1.tar.bz2
!cp -r lib/python3.6/site-packages/* /usr/local/lib/python3.6/dist-packages/
!pip install mkl
```

### 2. Install TensorFlow and DocProduct
Install the pinned TensorFlow version and the DocProduct source package:

```bash
!pip install tensorflow-gpu==2.0.0-alpha0
import tensorflow as tf
!pip install https://github.com/Santosh-Gupta/DocProduct/archive/master.zip
```

### 3. Get the pretrained models
The trained models and embedding files are large; download them from the link below and place them in the project directory:
*   [Download the pretrained models and embedding file](https://1drv.ms/u/s!An_n1-LB8-2dgoAYrXYnnBSA4d5dsg?e=i3mnFH)

## Basic usage

Once installed, you can load the pretrained weights and run the interactive retrieval model to answer medical questions.

### Run the interactive retrieval demo

The code below shows how to initialize the model and retrieve and generate answers for an input medical question. For the best experience, run the official demo notebook directly in **Google Colab**:

*   **Interactive retrieval demo**: [Run Interactive Retrieval Model](https://colab.research.google.com/drive/11hAr1qo7VCSmIjWREFwyTFblU2LVeh1R)

If you run in a local Python environment instead, the core logic looks like this (make sure the weight files are downloaded first):
```python
import doc_product  # assumed module name after pip install

# 1. Load the pretrained BERT and GPT-2 weights.
# Note: because TF 1.x (GPT-2) and TF 2.x (BERT) are mixed, sessions must be
# managed carefully at inference time.
import tensorflow as tf
tf.compat.v1.disable_eager_execution()

# 2. Initialize the retriever (pseudocode; see the source for the actual class names).
# retriever = DocProductRetriever(model_path="path/to/downloaded/models")

# 3. Ask a medical question and get an answer.
question = "What are the symptoms of flu?"
# answer = retriever.generate_answer(question)
# print(answer)
```

### Going further: train your own model
If you have your own medical Q&A dataset (CSV format), you can fine-tune the model with this demo notebook:
*   **Training demo**: [Train Your Own Model](https://colab.research.google.com/drive/1Rz2rzkwWrVEXcjiQqTXhxzLCW5cXi7xA)

### Experimental end-to-end pipeline
Try the full pipeline with BERT, FCNN, FAISS, and GPT-2 (experimental):
*   **End-to-end demo**: [Full Pipeline Demo](https://colab.research.google.com/drive/1Bv7bpPxIImsMG4YWB_LWjDRgUHvi7pxx)

A medical-tech startup is building a patient-facing triage assistant and needs to prototype a system that understands complex symptom descriptions and generates professional replies quickly.

### Without DocProduct
- The team has to collect and clean a large medical Q&A corpus from scratch; assembling a training set takes weeks, and the data quality is uneven.
- Without a pretrained model tuned for the medical domain, general NLP models badly misread terms like "palpitations" or "premature ventricular contractions", and answers often contain basic factual errors.
- Hand-written rules or a conventional retrieval system cannot cover long-tail conditions; a slight rephrasing of the question and the system finds nothing relevant.
- Fine-tuning models with tens of billions of parameters on limited compute is extremely difficult; repeated out-of-memory failures interrupt experiments and stall development.

### With DocProduct
- The built-in dataset of 700K high-quality medical Q&A pairs, already scraped and cleaned from Reddit, WebMD, and other sources, lets the team start model training the same day.
- Based on the fine-tuned BioBERT and GPT-2 models, the system extracts symptom features accurately and generates medically coherent answers with a significantly lower hallucination rate.
- With latent-vector search, DocProduct retrieves the most relevant medical knowledge and generates targeted suggestions even when users describe symptoms colloquially.
- On TensorFlow 2.0's efficient architecture, the team can run fine-tuning smoothly on a single GPU, cutting the prototype cycle from months to days.

By combining domain-specific data with a modern deep learning architecture, DocProduct lets small teams build trustworthy medical Q&A systems at low cost.

## Project metadata

*   **Author**: re.Search (https://github.com/re-search), research2vec@gmail.com
*   **Languages**: Jupyter Notebook 55.8%, HTML 42%, Python 2.1%
*   **License**: MIT
*   **Operating system**: Linux
*   **GPU**: optional (both CPU and GPU are supported). For GPU, you need an NVIDIA card to run
tensorflow-gpu; the install script example pins CUDA 9.0.176. When running GPT-2 (117M) on Colab, watch the GPU memory limit to avoid OOM.
*   **Memory**: not specified (but handling the 1 TB+ dataset and running the large models suggests high memory).
*   **Python**: 3.7.1
*   **Dependencies**: tensorflow==2.0.0-alpha0, faiss-cpu==1.2.1 (or faiss-gpu), mkl, BERT (BioBERT weights), GPT-2 (OpenAI 117M parameters)
*   **Notes**: 1. The project is built on TensorFlow 2.0 Alpha and includes a GPT-2 component that must run under TF 1.X; at inference time, disable eager execution and manage two separate sessions. 2. The FAISS install links are binaries built for CUDA 9.0. 3. The dataset is very large (over 1 TB), so allow ample storage for local runs. 4. Explicitly research-only; not for actual medical advice. 5. Google Colab demos are provided for users with limited resources.

## FAQ

**Q: Running the code raises a TypeError mentioning the TensorFlow version. How do I fix it?**

A: This usually means an incompatible TensorFlow version. The project is built on TensorFlow 2.0 alpha and does not support TensorFlow 1.x (e.g. 1.9). Create a fresh environment and install TensorFlow 2.0 alpha or a later compatible version. (Source: https://github.com/re-search/DocProduct/issues/8)

**Q: Unzipping qa_embeddings/bertffn_crossentropy.zip fails with "Zip file structure invalid". What now?**

A: This is a known issue, usually caused by a failed automatic download from OneDrive. Download the file manually in a browser instead. If the original link is dead, retry after a few hours or use the maintainer's backup link: https://1drv.ms/u/s!An_n1-LB8-2dgoAYrXYnnBSA4d5dsg?e=FSgTNu. After downloading, consider uploading it to your own cloud storage for later use. (Source: https://github.com/re-search/DocProduct/issues/7)

**Q: How do I get the data-scraping scripts or examples used for training?**

A: The team's scraping code can be found in the project's notebooks
directory, specifically https://github.com/re-search/DocProduct/blob/master/notebooks/webmd_data_gather.ipynb. You can use that script as a reference for scraping medical Q&A sites. (Source: https://github.com/re-search/DocProduct/issues/26)

**Q: I get ImportError: Unable to find a usable engine for read_parquet (pyarrow/fastparquet). How do I fix it?**

A: The engine needed to read parquet files is missing. Run `pip install pyarrow` in your terminal or notebook; once it is installed, the embedding files load normally. (Source: https://github.com/re-search/DocProduct/issues/21)

**Q: On Google Colab I get AttributeError: module 'tensorflow' has no attribute 'compat'. How do I fix it?**

A: A TensorFlow update deprecated some functions. Reinstall a specific tensorflow-estimator version by running the following in Colab cells:
1. `!pip uninstall -y tensorflow`
2. `!pip uninstall -y tensorflow-estimator`
3. `!pip install tensorflow-estimator==2.1`
Then restart the runtime and re-import tensorflow. (Source: https://github.com/re-search/DocProduct/issues/28)

**Q: How do I generate the FFNNEmbed CSV file for a new dataset?**

A: The exact steps are not fully visible in the truncated comment, but from context you would use the project's feature-extraction tooling: extract features with `keras-bert`, or use the `train_data_to_embedding` module to convert new Q&A pairs into embedding vectors and save them as CSV for the model. See the project's notebooks and scripts on embedding generation. (Source: https://github.com/re-search/DocProduct/issues/9)

**Q: Installing tensorflow-gpu==2.0.0-alpha0 fails with ImportError: cannot import name 'export_saved_model'. What should I do?**

A: This is usually caused by internal TensorFlow version changes or an environment conflict rather than by this project's code; the maintainer notes the repository has not been tested for a while. Try uninstalling TensorFlow and related packages, cleaning the environment, and installing a stable TensorFlow 2.x; or check for a `tensorflow-estimator` version mismatch and align the estimator version with the main package. (Source: https://github.com/re-search/DocProduct/issues/25)
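The embedding-CSV layout mentioned in the answers above can be sketched as follows. This is illustrative only: `embed` is a hypothetical stand-in for the real BioBERT/FFNN encoder, and the column layout is an assumption, not the project's exact schema.

```python
import csv
import numpy as np

rng = np.random.default_rng(0)

def embed(text: str) -> np.ndarray:
    """Hypothetical stand-in for the BioBERT/FFNN encoder (768-d output)."""
    return rng.normal(size=768)

pairs = [("What causes migraines?", "Common triggers include stress and sleep loss.")]

# One row per Q&A pair: the text columns followed by the two 768-d vectors,
# each vector serialized as a comma-joined string inside a single CSV field.
with open("ffnn_embeddings.csv", "w", newline="") as f:
    writer = csv.writer(f)
    writer.writerow(["question", "answer", "q_embedding", "a_embedding"])
    for q, a in pairs:
        writer.writerow([q, a,
                         ",".join(f"{x:.6f}" for x in embed(q)),
                         ",".join(f"{x:.6f}" for x in embed(a))])
```

Storing text and vectors side by side in one CSV mirrors the project's stated choice of CSV as the simplest, most flexible interchange format for its embeddings.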