[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-MilaNLProc--contextualized-topic-models":3,"tool-MilaNLProc--contextualized-topic-models":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":79,"owner_website":79,"owner_url":80,"languages":81,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":23,"env_os":94,"env_gpu":95,"env_ram":94,"env_deps":96,"category_tags":102,"github_topics":103,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":116,"updated_at":117,"faqs":118,"releases":146},3634,"MilaNLProc\u002Fcontextualized-topic-models","contextualized-topic-models","A python package to run contextualized topic modeling. CTMs combine contextualized embeddings (e.g., BERT) with topic models to get coherent topics. Published at EACL and ACL 2021 (Bianchi et al.). 
","contextualized-topic-models 是一款专为文本主题挖掘设计的 Python 开源库。它巧妙地将 BERT 等预训练语言模型生成的“上下文嵌入”与传统主题模型相结合，旨在解决传统方法仅依赖词袋模型而忽略词语语境，导致提取出的主题连贯性不足、语义模糊的痛点。\n\n这款工具特别适合自然语言处理领域的研究人员、数据科学家以及需要深入分析大规模文本数据的开发者使用。其核心亮点在于打破了传统模型对词汇表的严格限制：一方面，它提供的 CombinedTM 模式融合了上下文语义与词频统计，能生成更精准、易懂的主题；另一方面，其 ZeroShotTM 模式支持零样本学习，不仅能处理训练集中未出现的新词，还具备强大的跨语言能力，只需使用多语言嵌入即可轻松实现多语种主题分析。此外，项目独有的\"Kitty\"子模块引入了人机协作机制，帮助用户快速对文档进行分类和命名聚类。得益于灵活的架构设计，用户可以随时接入最新的预训练模型以提升效果，让主题建模更加智能高效。","===========================\nContextualized Topic Models\n===========================\n\n.. image:: https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fcontextualized_topic_models.svg\n        :target: https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fcontextualized_topic_models\n\n.. image:: https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fworkflows\u002FPython%20package\u002Fbadge.svg\n        :target: https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Factions\n\n.. image:: https:\u002F\u002Freadthedocs.org\u002Fprojects\u002Fcontextualized-topic-models\u002Fbadge\u002F?version=latest\n        :target: https:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002F?badge=latest\n        :alt: Documentation Status\n\n.. image:: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002FMilaNLProc\u002Fcontextualized-topic-models\n        :target: https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fgraphs\u002Fcontributors\u002F\n        :alt: Contributors\n\n.. image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-blue.svg\n        :target: https:\u002F\u002Flbesson.mit-license.org\u002F\n        :alt: License\n\n.. image:: https:\u002F\u002Fpepy.tech\u002Fbadge\u002Fcontextualized-topic-models\n        :target: https:\u002F\u002Fpepy.tech\u002Fproject\u002Fcontextualized-topic-models\n        :alt: Downloads\n\n.. 
image:: https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\n    :target: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing\n    :alt: Open In Colab\n\n.. image:: https:\u002F\u002Fraw.githubusercontent.com\u002Faleen42\u002Fbadges\u002Fmaster\u002Fsrc\u002Fmedium.svg\n    :target: https:\u002F\u002Fmedium.com\u002Ftowards-data-science\u002Fcontextualized-topic-modeling-with-python-eacl2021-eacf6dfa576\n    :alt: Medium Blog Post\n\n.. image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fyoutube-video-red\n        :target: https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=n1_G8K07KoM\n        :alt: Video Tutorial\n\n\nContextualized Topic Models (CTM) are a family of topic models that use pre-trained representations of language (e.g., BERT) to\nsupport topic modeling. See the papers for details:\n\n* Bianchi, F., Terragni, S., & Hovy, D. (2021). `Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence`. ACL. https:\u002F\u002Faclanthology.org\u002F2021.acl-short.96\u002F\n* Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). `Cross-lingual Contextualized Topic Models with Zero-shot Learning`. EACL. https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2021.eacl-main.143\u002F\n\n\n.. 
image:: https:\u002F\u002Fraw.githubusercontent.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fmaster\u002Fimg\u002Flogo.png\n   :align: center\n   :width: 200px\n\n\nTopic Modeling with Contextualized Embeddings\n---------------------------------------------\n\nOur new topic modeling family supports many different languages (i.e., the ones supported by HuggingFace models) and comes in two versions: **CombinedTM** combines contextual embeddings with the good old bag of words to make more coherent topics; **ZeroShotTM** is the perfect topic model for tasks in which you might have missing words in the test data; if trained with multilingual embeddings, it also inherits the property of being a multilingual topic model!\n\nThe big advantage is that you can use different embeddings for CTMs. Thus, when a new\nembedding method comes out, you can use it in the code and improve your results. We are not limited\nby the BoW anymore.\n\nWe also have `Kitty \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fkitty.html>`_: a new submodule that can be used to create a human-in-the-loop\nclassifier to quickly classify your documents and create named clusters.\n\n.. image:: https:\u002F\u002Fraw.githubusercontent.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fmaster\u002Fimg\u002Flogo_kitty.png\n   :align: center\n   :width: 200px\n\n\nTutorials\n---------\n\nYou can look at our `medium`_ blog post or start from one of our Colab Tutorials:\n\n\n.. |colab1_2| image:: https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\n    :target: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing\n    :alt: Open In Colab\n\n.. 
|colab2_2| image:: https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\n    :target: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1bfWUYEypULFk_4Tfff-Pb_n7-tSjEe9v?usp=sharing\n    :alt: Open In Colab\n\n.. |colab3_3| image:: https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\n    :target: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1upTRu4zSm1VMbl633n9qkIDA526l22E_?usp=sharing\n    :alt: Open In Colab\n\n.. |kitty_colab| image:: https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\n    :target: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F18mKzaKnmBlBOHb1oiS5MtaTSyq47ys2X?usp=sharing\n    :alt: Open In Colab\n\n+--------------------------------------------------------------------------------+------------------+\n| Name                                                                           | Link             |\n+================================================================================+==================+\n| Combined TM on Wikipedia Data (Preproc+Saving+Viz) (stable **v2.3.0**)         | |colab1_2|       |\n+--------------------------------------------------------------------------------+------------------+\n| Zero-Shot Cross-lingual Topic Modeling (Preproc+Viz) (stable **v2.3.0**)       | |colab2_2|       |\n+--------------------------------------------------------------------------------+------------------+\n| Kitty: Human in the loop Classifier (High-level usage) (stable **v2.2.0**)     | |kitty_colab|    |\n+--------------------------------------------------------------------------------+------------------+\n| SuperCTM and  β-CTM (High-level usage) (stable **v2.2.0**)                     | |colab3_3|       |\n+--------------------------------------------------------------------------------+------------------+\n\nOverview\n--------\n\nTL;DR\n~~~~~\n\n+ In CTMs we have two models. 
CombinedTM and ZeroShotTM, which have different use cases.\n+ CTMs work better when the size of the bag of words **has been restricted to a number of terms** that does not go over **2000 elements**. This is because we have a neural model that reconstructs the input bag of words. Moreover, in CombinedTM we project the contextualized embedding to the vocab space: the bigger the vocab, the more parameters you get, making training more difficult and prone to bad fitting. This is **NOT** a strict limit; however, consider preprocessing your dataset. We have a preprocessing_ pipeline that can help you in dealing with this.\n+ Check the contextual model you are using: a **multilingual model used on English data might not give results as good** as a model trained purely on English.\n+ **Preprocessing is key**. If you give a contextual model like BERT preprocessed text, it might be difficult to get out a good representation. What we usually do is use the preprocessed text for the bag-of-words creation and use the NOT preprocessed text for BERT embeddings. Our preprocessing_ class can take care of this for you.\n+ CTM uses `SBERT`_; you should check it out to better understand how we create embeddings. SBERT allows us to use any embedding model. You might want to check things like `max length \u003Chttps:\u002F\u002Fwww.sbert.net\u002Fexamples\u002Fapplications\u002Fcomputing-embeddings\u002FREADME.html#input-sequence-length>`_.\n\nInstalling\n~~~~~~~~~~\n\n**Important**: If you want to use CUDA you need to install the correct version of\nCUDA that matches your system; see pytorch_.\n\nInstall the package using pip:\n\n.. 
code-block:: bash\n\n    pip install -U contextualized_topic_models\n\nModels\n~~~~~~\n\nAn important aspect to take into account is which network you want to use:\nthe one that combines contextualized embeddings\nand the BoW (`CombinedTM \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fcombined.html>`_) or the one that just uses contextualized embeddings (`ZeroShotTM \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fzeroshot.html>`_)\n\nBut remember that you can do zero-shot cross-lingual topic modeling only with the `ZeroShotTM \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fzeroshot.html>`_ model.\n\nContextualized Topic Models also support supervision (SuperCTM). You can read more about this on the `documentation \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fintroduction.html>`_.\n\n.. image:: https:\u002F\u002Fraw.githubusercontent.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fmaster\u002Fimg\u002Fctm_both.jpeg\n   :align: center\n   :width: 800px\n\nWe also have `Kitty \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fkitty.html>`_: a utility you can use to do a simpler human in the loop classification of your\ndocuments. This can be very useful to do document filtering. 
It also works in cross-lingual setting and\nthus you might be able to filter documents in a language you don't know!\n\nReferences\n----------\n\nIf you find this useful you can cite the following papers :)\n\n**ZeroShotTM**\n\n::\n\n    @inproceedings{bianchi-etal-2021-cross,\n        title = \"Cross-lingual Contextualized Topic Models with Zero-shot Learning\",\n        author = \"Bianchi, Federico and Terragni, Silvia and Hovy, Dirk  and\n          Nozza, Debora and Fersini, Elisabetta\",\n        booktitle = \"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume\",\n        month = apr,\n        year = \"2021\",\n        address = \"Online\",\n        publisher = \"Association for Computational Linguistics\",\n        url = \"https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2021.eacl-main.143\",\n        pages = \"1676--1683\",\n    }\n\n**CombinedTM**\n\n::\n\n    @inproceedings{bianchi-etal-2021-pre,\n        title = \"Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence\",\n        author = \"Bianchi, Federico  and\n          Terragni, Silvia  and\n          Hovy, Dirk\",\n        booktitle = \"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)\",\n        month = aug,\n        year = \"2021\",\n        address = \"Online\",\n        publisher = \"Association for Computational Linguistics\",\n        url = \"https:\u002F\u002Faclanthology.org\u002F2021.acl-short.96\",\n        doi = \"10.18653\u002Fv1\u002F2021.acl-short.96\",\n        pages = \"759--766\",\n    }\n\n\nLanguage-Specific and Multilingual\n----------------------------------\n\nSome of the examples below use a multilingual embedding model\n:code:`paraphrase-multilingual-mpnet-base-v2`.\nThis means that the representations you are going to use are 
multilingual.\nHowever, you might need a broader coverage of languages or just one specific language.\nRefer to the page in the documentation to see how to choose a model for another language.\nIn that case, you can check `SBERT`_ to find the perfect model to use.\n\nHere, you can read more about `language-specific and multilingual \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Flanguage.html>`_.\n\nQuick Overview\n--------------\n\nYou should definitely take a look at the `documentation \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fintroduction.html>`_\nto better understand how these topic models work.\n\nCombined Topic Model\n~~~~~~~~~~~~~~~~~~~~\n\nHere is how you can use the CombinedTM. This is a standard topic model that also uses contextualized embeddings. The good thing about CombinedTM is that it makes your topics much more coherent (see the paper https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.03974).\nn_components=50 specifies the number of topics.\n\n.. 
code-block:: python\n\n    from contextualized_topic_models.models.ctm import CombinedTM\n    from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation\n    from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file\n\n    qt = TopicModelDataPreparation(\"all-mpnet-base-v2\")\n\n    training_dataset = qt.fit(text_for_contextual=list_of_unpreprocessed_documents, text_for_bow=list_of_preprocessed_documents)\n\n    ctm = CombinedTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50) # 50 topics\n\n    ctm.fit(training_dataset) # run the model\n\n    ctm.get_topics(2)\n\n\n**Advanced Notes:** Combined TM combines the BoW with SBERT, a process that seems to increase\nthe coherence of the predicted topics (https:\u002F\u002Farxiv.org\u002Fpdf\u002F2004.03974.pdf).\n\nZero-Shot Topic Model\n~~~~~~~~~~~~~~~~~~~~~\n\nOur ZeroShotTM can be used for zero-shot topic modeling. It can handle words that are not used during the training phase.\nMore interestingly, this model can be used for cross-lingual topic modeling (See next sections)! See the paper (https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2021.eacl-main.143)\n\n.. 
code-block:: python\n\n    from contextualized_topic_models.models.ctm import ZeroShotTM\n    from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation\n    from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file\n\n    text_for_contextual = [\n        \"hello, this is unpreprocessed text you can give to the model\",\n        \"have fun with our topic model\",\n    ]\n\n    text_for_bow = [\n        \"hello unpreprocessed give model\",\n        \"fun topic model\",\n    ]\n\n    qt = TopicModelDataPreparation(\"paraphrase-multilingual-mpnet-base-v2\")\n\n    training_dataset = qt.fit(text_for_contextual=text_for_contextual, text_for_bow=text_for_bow)\n\n    ctm = ZeroShotTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50)\n\n    ctm.fit(training_dataset) # run the model\n\n    ctm.get_topics(2)\n\n\nAs you can see, the high-level API to handle the text is pretty easy to use;\n**text_for_contextual** should be used to pass to the model a list of documents that are not preprocessed,\nwhile to **text_for_bow** you should pass the preprocessed text used to build the BoW.\n\n**Advanced Notes:** in this way, SBERT can use all the information in the text to generate the representations.\n\nUsing The Topic Models\n----------------------\n\nGetting The Topics\n~~~~~~~~~~~~~~~~~~\n\nOnce the model is trained, it is very easy to get the topics!\n\n.. 
code-block:: python\n\n    ctm.get_topics()\n\nPredicting Topics For Unseen Documents\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe **transform** method will take care of most things for you, for example the generation\nof a corresponding BoW by considering only the words that the model has seen in training.\nHowever, this comes with some bumps when dealing with the ZeroShotTM, as we will see in the next section.\n\nYou can, however, manually load the embeddings if you like (see the Advanced part of this documentation).\n\nMono-Lingual Topic Modeling\n===========================\n\nIf you use **CombinedTM** you need to include the test text for the BoW:\n\n.. code-block:: python\n\n    testing_dataset = qt.transform(text_for_contextual=testing_text_for_contextual, text_for_bow=testing_text_for_bow)\n\n    # n_samples: how many times to sample the distribution (see the doc)\n    ctm.get_doc_topic_distribution(testing_dataset, n_samples=20) # returns a (n_documents, n_topics) matrix with the topic distribution of each document\n\nIf you use **ZeroShotTM** you do not need to use the `testing_text_for_bow` because if you are using\na different set of test documents, this will create a BoW of a different size. Thus, the best\nway to do this is to pass just the text that is going to be given in input to the contextual model:\n\n.. code-block:: python\n\n    testing_dataset = qt.transform(text_for_contextual=testing_text_for_contextual)\n\n    # n_samples: how many times to sample the distribution (see the doc)\n    ctm.get_doc_topic_distribution(testing_dataset, n_samples=20)\n\n\nCross-Lingual Topic Modeling\n============================\n\nOnce you have trained the ZeroShotTM model with multilingual embeddings,\nyou can use this simple pipeline to predict the topics for documents in a different language (as long as this language\nis covered by **paraphrase-multilingual-mpnet-base-v2**).\n\n.. 
code-block:: python\n\n    # here we have a Spanish document\n    testing_text_for_contextual = [\n        \"hola, bienvenido\",\n    ]\n\n    # since we are doing multilingual topic modeling, we do not need the BoW in\n    # ZeroShotTM when doing cross-lingual experiments (it does not make sense, since we trained with an English BoW\n    # to use the Spanish BoW)\n    testing_dataset = qt.transform(testing_text_for_contextual)\n\n    # n_samples: how many times to sample the distribution (see the doc)\n    ctm.get_doc_topic_distribution(testing_dataset, n_samples=20) # returns a (n_documents, n_topics) matrix with the topic distribution of each document\n\n**Advanced Notes:** We do not need to pass the Spanish bag of words: the bags of words of the two languages would not be comparable! We are passing it to the model for compatibility reasons, but you cannot get\nthe output of the model (i.e., the predicted BoW of the trained language) and compare it with the testing language one.\n\nMore Advanced Stuff\n-------------------\n\n\n\nPreprocessing\n~~~~~~~~~~~~~\n\nDo you need a quick script to run the preprocessing pipeline? We got you covered! Load your documents\nand then use our WhiteSpacePreprocessing class. It will automatically filter infrequent words and remove documents\nthat are empty after preprocessing. The preprocess method will return the preprocessed and the unpreprocessed documents.\nWe generally use the unpreprocessed for BERT and the preprocessed for the bag of words.\n\n.. 
code-block:: python\n\n    from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing\n\n    documents = [line.strip() for line in open(\"unpreprocessed_documents.txt\").readlines()]\n    sp = WhiteSpacePreprocessing(documents, \"english\")\n    preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()\n\nUsing Custom Embeddings with Kitty\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nDo you have custom embeddings and want to use them for faster results? Just give them to Kitty!\n\n.. code-block:: python\n\n    from contextualized_topic_models.models.kitty_classifier import Kitty\n    import numpy as np\n\n    # read the training data\n    training_data = list(map(lambda x: x.strip(), open(\"train_data\").readlines()))\n    custom_embeddings = np.load('custom_embeddings.npy')\n\n    kt = Kitty()\n    kt.train(training_data, custom_embeddings=custom_embeddings, stopwords_list=[\"stopwords\"])\n\n    print(kt.pretty_print_word_classes())\n\n\nNote: Custom embeddings must be numpy arrays.\n\nDevelopment Team\n----------------\n\n* `Federico Bianchi`_ \u003Cf.bianchi@unibocconi.it> Bocconi University\n* `Silvia Terragni`_ \u003Cs.terragni4@campus.unimib.it> University of Milan-Bicocca\n* `Dirk Hovy`_ \u003Cdirk.hovy@unibocconi.it> Bocconi University\n\n\nSoftware Details\n----------------\n\n* Free software: MIT license\n* Documentation: https:\u002F\u002Fcontextualized-topic-models.readthedocs.io.\n* Super big shout-out to `Stephen Carrow`_ for creating the awesome https:\u002F\u002Fgithub.com\u002Festebandito22\u002FPyTorchAVITM package from which we constructed the foundations of this package. 
We are happy to redistribute this software again under the MIT License.\n\n\n\nCredits\n-------\n\n\nThis package was created with Cookiecutter_ and the `audreyr\u002Fcookiecutter-pypackage`_ project template.\nTo ease the use of the library we have also included the `rbo`_ package, all the rights reserved to the author of that package.\n\nNote\n----\n\nRemember that this is a research tool :)\n\n.. _pytorch: https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F\n.. _Cookiecutter: https:\u002F\u002Fgithub.com\u002Faudreyr\u002Fcookiecutter\n.. _preprocessing: https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models#preprocessing\n.. _cross-lingual-topic-modeling: https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models#cross-lingual-topic-modeling\n.. _`audreyr\u002Fcookiecutter-pypackage`: https:\u002F\u002Fgithub.com\u002Faudreyr\u002Fcookiecutter-pypackage\n.. _`Stephen Carrow` : https:\u002F\u002Fgithub.com\u002Festebandito22\n.. _`rbo` : https:\u002F\u002Fgithub.com\u002Fdlukes\u002Frbo\n.. _Federico Bianchi: https:\u002F\u002Ffedericobianchi.io\n.. _Silvia Terragni: https:\u002F\u002Fsilviatti.github.io\u002F\n.. _Dirk Hovy: https:\u002F\u002Fdirkhovy.com\u002F\n.. _SBERT: https:\u002F\u002Fwww.sbert.net\u002Fdocs\u002Fpretrained_models.html\n.. _HuggingFace: https:\u002F\u002Fhuggingface.co\u002Fmodels\n.. _UmBERTo: https:\u002F\u002Fhuggingface.co\u002FMusixmatch\u002Fumberto-commoncrawl-cased-v1\n.. _medium: https:\u002F\u002Ffbvinid.medium.com\u002Fcontextualized-topic-modeling-with-python-eacl2021-eacf6dfa576\n\n","===========================\n上下文主题模型\n===========================\n\n.. image:: https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fcontextualized_topic_models.svg\n        :target: https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fcontextualized_topic_models\n\n.. 
image:: https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fworkflows\u002FPython%20package\u002Fbadge.svg\n        :target: https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Factions\n\n.. image:: https:\u002F\u002Freadthedocs.org\u002Fprojects\u002Fcontextualized-topic-models\u002Fbadge\u002F?version=latest\n        :target: https:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002F?badge=latest\n        :alt: Documentation Status\n\n.. image:: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002FMilaNLProc\u002Fcontextualized-topic-models\n        :target: https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fgraphs\u002Fcontributors\u002F\n        :alt: Contributors\n\n.. image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-blue.svg\n        :target: https:\u002F\u002Flbesson.mit-license.org\u002F\n        :alt: License\n\n.. image:: https:\u002F\u002Fpepy.tech\u002Fbadge\u002Fcontextualized-topic-models\n        :target: https:\u002F\u002Fpepy.tech\u002Fproject\u002Fcontextualized-topic-models\n        :alt: Downloads\n\n.. image:: https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\n    :target: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing\n    :alt: Open In Colab\n\n.. image:: https:\u002F\u002Fraw.githubusercontent.com\u002Faleen42\u002Fbadges\u002Fmaster\u002Fsrc\u002Fmedium.svg\n    :target: https:\u002F\u002Fmedium.com\u002Ftowards-data-science\u002Fcontextualized-topic-modeling-with-python-eacl2021-eacf6dfa576\n    :alt: Medium Blog Post\n\n.. 
image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fyoutube-video-red\n        :target: https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=n1_G8K07KoM\n        :alt: Video Tutorial\n\n\n上下文主题模型（CTM）是一类利用预训练语言表示（例如BERT）来支持主题建模的主题模型。详细信息请参阅以下论文：\n\n* Bianchi, F., Terragni, S., & Hovy, D. (2021). `预训练是热门话题：上下文文档嵌入提升主题一致性`。ACL。https:\u002F\u002Faclanthology.org\u002F2021.acl-short.96\u002F\n* Bianchi, F., Terragni, S., Hovy, D., Nozza, D., & Fersini, E. (2021). `零样本学习的跨语言上下文主题模型`。EACL。https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2021.eacl-main.143\u002F\n\n\n.. image:: https:\u002F\u002Fraw.githubusercontent.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fmaster\u002Fimg\u002Flogo.png\n   :align: center\n   :width: 200px\n\n\n使用上下文嵌入进行主题建模\n---------------------------------------------\n\n我们新推出的主题模型系列支持多种语言（即HuggingFace模型所支持的语言），并提供两种版本：**CombinedTM**将上下文嵌入与传统的词袋模型相结合，以生成更连贯的主题；**ZeroShotTM**则是处理测试数据中可能存在缺失词汇的任务的理想选择，同时，如果使用多语言嵌入进行训练，它还具备多语言主题模型的特性！\n\n最大的优势在于，您可以为上下文主题模型使用不同的嵌入方法。因此，当新的嵌入方法发布时，您可以在代码中直接应用，从而进一步提升模型效果。我们不再受限于传统的词袋模型。\n\n此外，我们还推出了`Kitty \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fkitty.html>`_！这是一个全新的子模块，可用于构建人机协作分类器，快速对文档进行分类并创建命名聚类。\n\n.. image:: https:\u002F\u002Fraw.githubusercontent.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fmaster\u002Fimg\u002Flogo_kitty.png\n   :align: center\n   :width: 200px\n\n\n教程\n---------\n\n您可以阅读我们的`Medium`博客文章，或者从我们的Colab教程开始：\n\n.. |colab1_2| image:: https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\n    :target: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1fXJjr_rwqvpp1IdNQ4dxqN4Dp88cxO97?usp=sharing\n    :alt: Open In Colab\n\n.. |colab2_2| image:: https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\n    :target: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1bfWUYEypULFk_4Tfff-Pb_n7-tSjEe9v?usp=sharing\n    :alt: Open In Colab\n\n.. 
|colab3_3| image:: https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\n    :target: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1upTRu4zSm1VMbl633n9qkIDA526l22E_?usp=sharing\n    :alt: Open In Colab\n\n.. |kitty_colab| image:: https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg\n    :target: https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F18mKzaKnmBlBOHb1oiS5MtaTSyq47ys2X?usp=sharing\n    :alt: Open In Colab\n\n+--------------------------------------------------------------------------------+------------------+\n| 名称                                                                           | 链接             |\n+================================================================================+==================+\n| 在维基百科数据上使用Combined TM（预处理+保存+可视化）（稳定 **v2.3.0**）         | |colab1_2|       |\n+--------------------------------------------------------------------------------+------------------+\n| 零样本跨语言主题建模（预处理+可视化）（稳定 **v2.3.0**）                       | |colab2_2|       |\n+--------------------------------------------------------------------------------+------------------+\n| Kitty：人机协作分类器（高级用法）（稳定 **v2.2.0**）                           | |kitty_colab|    |\n+--------------------------------------------------------------------------------+------------------+\n| SuperCTM 和 β-CTM（高级用法）（稳定 **v2.2.0**）                               | |colab3_3|       |\n+--------------------------------------------------------------------------------+------------------+\n\n概述\n--------\n\n简而言之\n~~~~~\n\n+ 在 CTMs 中，我们有两种模型：CombinedTM 和 ZeroShotTM，它们适用于不同的场景。\n+ 当词袋的大小**限制在不超过 2000 个词项**时，CTMs 的效果会更好。这是因为我们有一个神经网络模型用于重建输入的词袋表示。此外，在 CombinedTM 中，我们会将上下文嵌入投影到词汇空间中，词汇表越大，参数数量越多，训练难度也会增加，并且更容易出现过拟合。不过，这**并不是一个严格的限制**，因此建议您对数据集进行预处理。我们提供了一个预处理流水线，可以帮助您解决这个问题。\n+ 请检查您使用的上下文模型，**在英文数据上使用的多语言模型可能无法达到纯英文训练模型那样的效果**。\n+ **预处理是关键**。如果您将经过预处理的文本输入到像 BERT 这样的上下文模型中，可能难以获得良好的表示。通常的做法是使用预处理后的文本来构建词袋，而使用未预处理的文本来生成 BERT 
嵌入。我们的预处理类可以为您完成这一任务。\n+ CTM 使用 `SBERT`_，您可以查看它以更好地理解我们如何创建嵌入。SBERT 允许我们使用任何嵌入模型。您可能需要关注一些设置，例如 `最大长度 \u003Chttps:\u002F\u002Fwww.sbert.net\u002Fexamples\u002Fapplications\u002Fcomputing-embeddings\u002FREADME.html#input-sequence-length>`_。\n\n安装\n~~~~~~\n\n**重要提示**：如果您想使用 CUDA，需要安装与您的系统相匹配的 CUDA 版本，具体请参阅 PyTorch_ 的安装说明。\n\n使用 pip 安装本包：\n\n.. code-block:: bash\n\n    pip install -U contextualized_topic_models\n\n模型\n~~~~~~\n\n需要考虑的一个重要方面是您希望使用哪种模型：结合上下文嵌入和词袋的模型（`CombinedTM \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fcombined.html>`_），或仅使用上下文嵌入的模型（`ZeroShotTM \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fzeroshot.html>`_）。\n\n但请记住，只有 `ZeroShotTM \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fzeroshot.html>`_ 模型才能进行零样本跨语言主题建模。\n\n上下文主题模型还支持监督学习（SuperCTM）。您可以在 `文档 \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fintroduction.html>`_ 中了解更多相关信息。\n\n.. 
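\n\n两种模型的构造与训练接口是一致的，下面给出一个极简示意（假设性草图，参数值沿用本文其余示例，并非额外的官方 API；实际的 ``bow_size`` 应取 ``len(qt.vocab)``）：\n\n.. code-block:: python\n\n    from contextualized_topic_models.models.ctm import CombinedTM, ZeroShotTM\n\n    # 两个模型接受相同的核心构造参数，切换时只需替换类名\n    common_kwargs = dict(bow_size=2000, contextual_size=768, n_components=50)\n\n    ctm = CombinedTM(**common_kwargs)      # 需要词袋 + 上下文嵌入\n    # ctm = ZeroShotTM(**common_kwargs)    # 仅需上下文嵌入，可零样本跨语言\n\n.. 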
image:: https:\u002F\u002Fraw.githubusercontent.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fmaster\u002Fimg\u002Fctm_both.jpeg\n   :align: center\n   :width: 800px\n\n我们还有 `Kitty \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fkitty.html>`_：一个工具，可用于更简便地进行人工参与的文档分类。这对于文档筛选非常有用。它也支持跨语言操作，因此即使您不懂某种语言，也能用它来筛选该语言的文档！\n\n参考文献\n----------\n\n如果您觉得这些内容有用，可以引用以下论文 :)\n\n**ZeroShotTM**\n\n::\n\n    @inproceedings{bianchi-etal-2021-cross,\n        title = \"Cross-lingual Contextualized Topic Models with Zero-shot Learning\",\n        author = \"Bianchi, Federico and Terragni, Silvia and Hovy, Dirk  and\n          Nozza, Debora and Fersini, Elisabetta\",\n        booktitle = \"Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume\",\n        month = apr,\n        year = \"2021\",\n        address = \"Online\",\n        publisher = \"Association for Computational Linguistics\",\n        url = \"https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2021.eacl-main.143\",\n        pages = \"1676--1683\",\n    }\n\n**CombinedTM**\n\n::\n\n    @inproceedings{bianchi-etal-2021-pre,\n        title = \"Pre-training is a Hot Topic: Contextualized Document Embeddings Improve Topic Coherence\",\n        author = \"Bianchi, Federico  and\n          Terragni, Silvia  and\n          Hovy, Dirk\",\n        booktitle = \"Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)\",\n        month = aug,\n        year = \"2021\",\n        address = \"Online\",\n        publisher = \"Association for Computational Linguistics\",\n        url = \"https:\u002F\u002Faclanthology.org\u002F2021.acl-short.96\",\n        doi = \"10.18653\u002Fv1\u002F2021.acl-short.96\",\n        pages = \"759--766\",\n    
}\n\n\n特定语言与多语言\n----------------------------------\n\n以下示例中使用了多语言嵌入模型 :code:`paraphrase-multilingual-mpnet-base-v2`。这意味着您将使用的表示是多语言的。然而，您可能需要更广泛的语言覆盖，或者只需要某一特定语言。请参考文档页面了解如何为其他语言选择模型；也可以查看 `SBERT`_，以找到最适合的模型。\n\n在这里，您可以阅读更多关于 `特定语言与多语言 \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Flanguage.html>`_ 的信息。\n\n快速概览\n--------------\n\n为了更好地理解这些主题模型的工作原理，请务必查看 `文档 \u003Chttps:\u002F\u002Fcontextualized-topic-models.readthedocs.io\u002Fen\u002Flatest\u002Fintroduction.html>`_。\n\n组合主题模型\n~~~~~~~~~~~~~~~~~~~~\n\n以下是使用 CombinedTM 的方法。这是一种标准的主题模型，同时也利用了上下文嵌入。CombinedTM 的优点在于它能使主题更加连贯（参见论文 https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.03974）。:code:`n_components=50` 指定了主题的数量。\n\n.. code-block:: python\n\n    from contextualized_topic_models.models.ctm import CombinedTM\n    from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation\n    from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file\n\n    qt = TopicModelDataPreparation(\"all-mpnet-base-v2\")\n\n    training_dataset = qt.fit(text_for_contextual=list_of_unpreprocessed_documents, text_for_bow=list_of_preprocessed_documents)\n\n    ctm = CombinedTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50) # 50 个主题\n\n    ctm.fit(training_dataset) # 运行模型\n\n    ctm.get_topics(2)\n\n\n**进阶说明**：CombinedTM 将词袋与 SBERT 嵌入相结合，这一做法可以提高预测主题的连贯性（https:\u002F\u002Farxiv.org\u002Fpdf\u002F2004.03974.pdf）。\n\n零样本主题模型\n~~~~~~~~~~~~~~~~~~~~~\n\n我们的 ZeroShotTM 可以用于零样本主题建模。它可以处理在训练阶段未出现过的词汇。更有趣的是，该模型还可以用于跨语言主题建模（详见后续章节）！请参阅相关论文（https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2021.eacl-main.143）。\n\n.. 
code-block:: python\n\n    from contextualized_topic_models.models.ctm import ZeroShotTM\n    from contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation\n    from contextualized_topic_models.utils.data_preparation import bert_embeddings_from_file\n\n    text_for_contextual = [\n        \"你好，这是未经预处理的文本，可以直接输入模型\",\n        \"祝你使用我们的主题模型玩得开心\",\n    ]\n\n    text_for_bow = [\n        \"你好 未预处理 给 模型\",\n        \"开心 主题 模型\",\n    ]\n\n    qt = TopicModelDataPreparation(\"paraphrase-multilingual-mpnet-base-v2\")\n\n    training_dataset = qt.fit(text_for_contextual=text_for_contextual, text_for_bow=text_for_bow)\n\n    ctm = ZeroShotTM(bow_size=len(qt.vocab), contextual_size=768, n_components=50)\n\n    ctm.fit(training_dataset) # 运行模型\n\n    ctm.get_topics(2)\n\n\n正如你所见，用于处理文本的高级 API 非常易于使用：\n**text_for_contextual** 用于向模型传递未经预处理的文档列表，\n而 **text_for_bow** 则传入用于构建词袋模型的已预处理文本。\n\n**进阶说明：** 这样，SBERT 可以利用文本中的所有信息来生成文本表示。\n\n使用主题模型\n--------------\n\n获取主题\n~~~~~~~~~~\n\n一旦模型训练完成，获取主题就非常简单了！\n\n.. code-block:: python\n\n    ctm.get_topics()\n\n为未见文档预测主题\n~~~~~~~~~~~~~~~~~~~~~\n\n**transform** 方法会为你处理大部分工作，例如仅根据模型在训练中见过的词汇生成相应的词袋表示。然而，在使用 ZeroShotTM 时，这会遇到一些问题，我们将在下一节中看到。\n\n不过，如果你愿意，也可以手动加载嵌入（参见本文档的进阶部分）。\n\n单语主题建模\n=============\n\n如果你使用 **CombinedTM**，则需要包含用于词袋模型的测试文本：\n\n.. code-block:: python\n\n    testing_dataset = qt.transform(text_for_contextual=testing_text_for_contextual, text_for_bow=testing_text_for_bow)\n\n    # n_samples 表示从分布中采样的次数（详见文档）\n    ctm.get_doc_topic_distribution(testing_dataset, n_samples=20) # 返回一个 (n_documents, n_topics) 矩阵，表示每个文档的主题分布\n\n如果你使用 **ZeroShotTM**，则不需要传入 `testing_text_for_bow`，因为不同的测试文档会生成不同大小的词袋。因此，最好的做法是只传递将输入到上下文模型的文本：\n\n.. 
code-block:: python\n\n    testing_dataset = qt.transform(text_for_contextual=testing_text_for_contextual)\n\n    # n_samples 表示从分布中采样的次数（详见文档）\n    ctm.get_doc_topic_distribution(testing_dataset, n_samples=20)\n\n\n跨语言主题建模\n===============\n\n当你使用多语言嵌入训练完 ZeroShotTM 模型后，你可以通过这个简单的流程来预测其他语言文档的主题（只要该语言被 **paraphrase-multilingual-mpnet-base-v2** 支持）。\n\n.. code-block:: python\n\n    # 这里我们有一篇西班牙语文档\n    testing_text_for_contextual = [\n        \"hola, bienvenido\",\n    ]\n\n    # 在跨语言实验中，ZeroShotTM 不需要词袋模型（这样做也没有意义：我们是用英语词袋训练的，无法套用西班牙语词袋）。\n    testing_dataset = qt.transform(testing_text_for_contextual)\n\n    # n_samples 表示从分布中采样的次数（详见文档）\n    ctm.get_doc_topic_distribution(testing_dataset, n_samples=20) # 返回一个 (n_documents, n_topics) 矩阵，表示每个文档的主题分布\n\n**进阶说明：** 我们不需要传递西班牙语的词袋：两种语言的词袋是无法比较的！即使出于兼容性将其传给模型，也不能将模型的输出（即训练语言的预测词袋）与测试语言的词袋进行比较。\n\n更多进阶内容\n-------------\n\n预处理\n~~~~~~~\n\n你需要一个快速脚本来运行预处理流程吗？我们已经为你准备好了！加载你的文档，然后使用我们的 WhiteSpacePreprocessing 类。它会自动过滤低频词，并移除预处理后变为空的文档。preprocess 方法会同时返回预处理过的文档和未预处理的文档。通常，我们会将未预处理的文档用于 BERT，而将预处理后的文档用于词袋模型。\n\n.. code-block:: python\n\n    from contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessing\n\n    documents = [line.strip() for line in open(\"unpreprocessed_documents.txt\").readlines()]\n    sp = WhiteSpacePreprocessing(documents, \"english\")\n    preprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()\n\n使用自定义嵌入与 Kitty\n~~~~~~~~~~~~~~~~~~~~~\n\n你有自定义嵌入，并希望用它们来加快结果生成速度吗？只需将它们交给 Kitty 即可！\n\n.. 
code-block:: python\n\n    from contextualized_topic_models.models.kitty_classifier import Kitty\n    import numpy as np\n\n    # 读取训练数据\n    training_data = list(map(lambda x : x.strip(), open(\"train_data\").readlines()))\n    custom_embeddings = np.load('custom_embeddings.npy')\n\n    kt = Kitty()\n    kt.train(training_data, custom_embeddings=custom_embeddings, stopwords_list=[\"stopwords\"])\n\n    print(kt.pretty_print_word_classes())\n\n\n注意：自定义嵌入必须是 numpy 数组。\n\n开发团队\n--------\n\n* `Federico Bianchi`_ \u003Cf.bianchi@unibocconi.it> 博科尼大学\n* `Silvia Terragni`_ \u003Cs.terragni4@campus.unimib.it> 米兰-比可卡大学\n* `Dirk Hovy`_ \u003Cdirk.hovy@unibocconi.it> 博科尼大学\n\n\n软件详情\n--------\n\n* 自由软件：MIT 许可证\n* 文档：https:\u002F\u002Fcontextualized-topic-models.readthedocs.io\n* 特别感谢 `Stephen Carrow`_ 创建了出色的 https:\u002F\u002Fgithub.com\u002Festebandito22\u002FPyTorchAVITM 包，本项目正是构建在该包的基础之上。我们很高兴能够在 MIT 许可证下再次发布这款软件。\n\n\n\n致谢\n----\n\n\n本项目使用 Cookiecutter_ 和 `audreyr\u002Fcookiecutter-pypackage`_ 项目模板创建。为了方便用户使用，我们还引入了 `rbo`_ 包，其所有权利均归该包作者所有。\n\n注\n----\n\n请记住，这是一个研究工具 :)\n\n.. _PyTorch: https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F\n.. _Cookiecutter: https:\u002F\u002Fgithub.com\u002Faudreyr\u002Fcookiecutter\n.. _预处理: https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models#preprocessing\n.. _跨语言主题建模: https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models#cross-lingual-topic-modeling\n.. _`audreyr\u002Fcookiecutter-pypackage`: https:\u002F\u002Fgithub.com\u002Faudreyr\u002Fcookiecutter-pypackage\n.. _`Stephen Carrow` : https:\u002F\u002Fgithub.com\u002Festebandito22\n.. _`rbo` : https:\u002F\u002Fgithub.com\u002Fdlukes\u002Frbo\n.. _`Federico Bianchi`: https:\u002F\u002Ffedericobianchi.io\n.. _`Silvia Terragni`: https:\u002F\u002Fsilviatti.github.io\u002F\n.. _`Dirk Hovy`: https:\u002F\u002Fdirkhovy.com\u002F\n.. _SBERT: https:\u002F\u002Fwww.sbert.net\u002Fdocs\u002Fpretrained_models.html\n.. 
_Hugging Face: https:\u002F\u002Fhuggingface.co\u002Fmodels\n.. _UmBERTo: https:\u002F\u002Fhuggingface.co\u002FMusixmatch\u002Fumberto-commoncrawl-cased-v1\n.. _Medium: https:\u002F\u002Ffbvinid.medium.com\u002Fcontextualized-topic-modeling-with-python-eacl2021-eacf6dfa576","# Contextualized Topic Models (CTM) 快速上手指南\n\nContextualized Topic Models (CTM) 是一个利用预训练语言模型（如 BERT）的上下文嵌入来增强主题建模效果的 Python 库。它支持多语言，并提供了两种核心模型：**CombinedTM**（结合词袋与上下文嵌入，主题连贯性更强）和 **ZeroShotTM**（仅使用上下文嵌入，支持零样本跨语言主题建模）。\n\n## 环境准备\n\n*   **操作系统**: Linux, macOS, Windows\n*   **Python 版本**: 建议 Python 3.7+\n*   **前置依赖**:\n    *   PyTorch\n    *   Hugging Face Transformers \u002F Sentence Transformers (SBERT)\n*   **GPU 加速 (可选)**: 如需使用 CUDA 加速，请确保已安装与当前 PyTorch 版本匹配的 CUDA 驱动及 toolkit。\n\n## 安装步骤\n\n推荐使用 pip 进行安装。国内用户可使用清华或阿里镜像源加速下载。\n\n**标准安装：**\n```bash\npip install -U contextualized_topic_models\n```\n\n**使用国内镜像源加速安装：**\n```bash\npip install -U contextualized_topic_models -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 基本使用\n\n以下示例展示如何使用 **CombinedTM** 模型进行主题建模。该流程包含数据预处理、模型训练及主题提取。\n\n### 1. 导入依赖与准备数据\nCTM 的核心优势在于同时利用“原始文本”生成上下文嵌入，以及利用“预处理文本”构建词袋（BoW）。\n\n```python\nfrom contextualized_topic_models.models.ctm import CombinedTM\nfrom contextualized_topic_models.utils.data_preparation import TopicModelDataPreparation\n\n# 假设你已有两份文档列表：\n# list_of_unpreprocessed_documents: 原始文本列表（用于 BERT 嵌入）\n# list_of_preprocessed_documents: 分词\u002F去停用词后的文本列表（用于词袋模型）\n\n# 初始化数据预处理工具，指定预训练嵌入模型\n# 注意：all-mpnet-base-v2 是英文模型；如需多语言支持，可改用 paraphrase-multilingual-mpnet-base-v2\nqt = TopicModelDataPreparation(\"all-mpnet-base-v2\")\n\n# 拟合数据：生成训练数据集\n# text_for_contextual: 传入原始文本\n# text_for_bow: 传入预处理后的文本\ntraining_dataset = qt.fit(\n    text_for_contextual=list_of_unpreprocessed_documents,\n    text_for_bow=list_of_preprocessed_documents\n)\n```\n\n### 2. 
初始化并训练模型\n`n_components` 参数用于设定想要提取的主题数量。注意：为了获得最佳效果，建议将词袋词汇量限制在 2000 个词项以内。\n\n```python\n# 初始化 CombinedTM 模型\n# bow_size: 词袋词汇表大小\n# contextual_size: 上下文嵌入维度 (all-mpnet-base-v2 为 768)\n# n_components: 主题数量\nctm = CombinedTM(\n    bow_size=len(qt.vocab),\n    contextual_size=768,\n    n_components=50\n)\n\n# 训练模型\nctm.fit(training_dataset)\n```\n\n### 3. 获取主题结果\n训练完成后，即可查看生成的主题及其对应的关键词。\n\n```python\n# 获取每个主题的前 2 个关键词\ntopics = ctm.get_topics(2)\nprint(topics)\n```\n\n> **提示**：若需进行跨语言零样本主题建模，请将 `CombinedTM` 替换为 `ZeroShotTM`，其余调用方式类似。更多高级用法（如 Kitty 人机交互分类器、SuperCTM 监督模式）请参考官方文档。","某跨国电商公司的数据团队需要分析全球用户的海量商品评论，以挖掘不同语言市场中的潜在消费趋势和产品痛点。\n\n### 没有 contextualized-topic-models 时\n- **语义理解浅层化**：传统模型仅依赖词袋（BoW）统计词频，无法识别“屏幕清晰”与“显示效果好”的语义相似性，导致生成的主题支离破碎。\n- **多语言处理困难**：面对英、法、德等多语种评论，团队需为每种语言单独训练模型并人工对齐主题，耗时且难以保证标准统一。\n- **新词与冷启动问题**：对于新兴网络用语或测试集中未出现的词汇，传统模型完全无法处理，导致大量包含关键信息的近期评论被忽略。\n- **主题连贯性差**：提取出的主题往往包含无关高频词（如“但是”、“非常”），业务人员难以直接从中提炼出可执行的洞察。\n\n### 使用 contextualized-topic-models 后\n- **深度语义融合**：利用 BERT 等预训练上下文嵌入，模型能精准捕捉词语在特定语境下的含义，将语义相近但措辞不同的评论自动归入同一连贯主题。\n- **零样本跨语言迁移**：借助 ZeroShotTM 功能，只需使用多语言嵌入训练一次，即可直接处理未见过的语言数据，无需重新训练即可实现全球话题对齐。\n- **动态适应新表达**：基于上下文的表示方法不再受限于固定词表，即使评论中出现新的缩写或流行语，模型也能根据其语境向量将其正确分类。\n- **高可解释性输出**：生成的主题关键词高度相关且逻辑清晰，业务团队可直接据此发现“电池续航焦虑”或“包装破损”等具体问题，大幅缩短决策路径。\n\ncontextualized-topic-models 通过引入上下文感知能力，将原本杂乱的多语言文本转化为结构清晰、语义连贯的商业洞察，彻底解决了传统主题模型在复杂语境下的失效难题。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMilaNLProc_contextualized-topic-models_d8f5f83a.png","MilaNLProc","MilaNLP","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FMilaNLProc_e554d0fe.png","",null,"https:\u002F\u002Fgithub.com\u002FMilaNLProc",[82,86],{"name":83,"color":84,"percentage":85},"Python","#3572A5",97.8,{"name":87,"color":88,"percentage":89},"Makefile","#427819",2.2,1266,151,"2026-03-20T06:42:12","MIT","未说明","可选。若需使用 CUDA 加速，需安装与系统环境匹配的 CUDA 版本（具体版本需参考 PyTorch 要求），否则可使用 CPU 运行。",{"notes":97,"python":94,"dependencies":98},"1. 
该工具支持多种语言，依赖 HuggingFace 提供的预训练模型（如 BERT、SBERT）。2. 为获得最佳效果，词袋（BoW）大小建议限制在 2000 个术语以内，以避免神经网络重构困难和过拟合。3. 数据预处理至关重要：建议使用未预处理文本生成上下文嵌入（BERT），使用预处理文本构建词袋（BoW）。4. 若使用多语言模型处理英语数据，效果可能不如专用英语模型。5. 包含一个名为'Kitty'的子模块，用于人机协作分类。",[99,100,101],"torch","sentence-transformers (SBERT)","transformers (HuggingFace)",[13,51,26],[104,105,106,107,108,109,110,111,112,113,114,115],"topic-modeling","bert","transformer","embeddings","text-as-data","topic-coherence","multilingual-topic-models","multilingual-models","neural-topic-models","nlp","nlp-library","nlp-machine-learning","2026-03-27T02:49:30.150509","2026-04-06T11:30:58.446867",[119,124,129,134,138,142],{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},16664,"如何使用 OpenAI、Cohere 等高级模型获取主题表示（embedding）？","可以使用 `bert_embeddings_from_list` 函数来生成上下文嵌入。代码示例如下：\n```python\nfrom contextualized_topic_models.utils.data_preparation import bert_embeddings_from_list\ntrain_contextualized_embeddings = bert_embeddings_from_list(unpreprocessed_corpus, \"bert-base-nli-mean-tokens\", max_seq_length=200)\n```\n注意：`max_seq_length` 参数应根据所选模型的具体要求进行调整。对于双词（bi-grams）训练，需配合 `CountVectorizer(ngram_range=(2,2))` 使用。","https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fissues\u002F128",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},16665,"运行评估分数时遇到 'numpy.ndarray size changed' 或 'C-API incompatibility' 错误怎么办？","这通常是由于 Google Colab 环境中 numpy 版本与已编译模块不兼容导致的。解决方法是：\n1. 在安装 `contextualized-topic-models` (ctm) 包之后，务必**重启 Colab 运行时环境**（Runtime -> Restart runtime）。\n2. 
如果问题依旧，尝试强制升级 numpy：`!pip install -U numpy --upgrade`，然后再次重启环境并重新导入库。","https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fissues\u002F126",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},16666,"如何处理超过 3GB 的大型数据集以避免内存分配错误？","在处理大型数据集（如 7.3GB）时，如果遇到内存错误，建议尝试使用更轻量级的模型来创建上下文嵌入。例如，使用 `SentenceTransformer(\"bert-base-nli-mean-tokens\")` 替代较大的模型已被证实可以在大数据集上成功运行。虽然这可能不是性能最优的选择，但能有效解决内存不足的问题。","https:\u002F\u002Fgithub.com\u002FMilaNLProc\u002Fcontextualized-topic-models\u002Fissues\u002F129",{"id":135,"question_zh":136,"answer_zh":137,"source_url":123},16667,"如何正确配置代码以训练双词（bi-grams）主题模型？","训练双词模型需要结合 `CountVectorizer` 和预处理步骤。参考代码如下：\n```python\nfrom sklearn.feature_extraction.text import CountVectorizer\nfrom contextualized_topic_models.utils.preprocessing import WhiteSpacePreprocessingStopwords\n\n# 预处理\nsp = WhiteSpacePreprocessingStopwords(docs, stopwords_list=stopwords)\npreprocessed_documents, unpreprocessed_corpus, vocab, retained_indices = sp.preprocess()\n\n# 配置双词向量器\nvectorizer = CountVectorizer(ngram_range=(2,2))\ntrain_bow_embeddings = vectorizer.fit_transform(preprocessed_documents)\nvocab = vectorizer.get_feature_names_out()\nid2token = {k: v for k, v in zip(range(0, len(vocab)), vocab)}\n\n# 生成上下文嵌入\ntrain_contextualized_embeddings = bert_embeddings_from_list(unpreprocessed_corpus, \"bert-base-nli-mean-tokens\", max_seq_length=200)\n```",{"id":139,"question_zh":140,"answer_zh":141,"source_url":128},16668,"在 Google Colab 中导入库时出现 RuntimeError (API version mismatch) 如何解决？","该错误表明模块编译的 API 版本与当前安装的 numpy 版本不匹配。最有效的解决步骤是：\n1. 执行 `!pip install -U contextualized-topic-models` 确保安装最新版本。\n2. **关键步骤**：在安装完成后，必须点击菜单栏的 \"Runtime\" (运行时) 选择 \"Restart runtime\" (重启运行时)。\n3. 
重启后再次运行导入代码，不要直接在同一个单元格中连续执行安装和导入操作。",{"id":143,"question_zh":144,"answer_zh":145,"source_url":123},16669,"使用不同预训练模型生成嵌入时，最大序列长度（max_seq_length）应该如何设置？","`max_seq_length` 的值取决于所使用的具体模型。在使用 `bert_embeddings_from_list` 函数时，需要根据模型的限制进行调整。例如，对于 `bert-base-nli-mean-tokens` 模型，设置为 200 通常是可行的，但更换其他模型时请查阅该模型的文档以确认其支持的最大序列长度，避免因截断或报错影响效果。",[]]