[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-MaartenGr--BERTopic":3,"tool-MaartenGr--BERTopic":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":75,"owner_website":82,"owner_url":83,"languages":84,"stars":93,"forks":94,"last_commit_at":95,"license":96,"difficulty_score":97,"env_os":98,"env_gpu":99,"env_ram":100,"env_deps":101,"category_tags":113,"github_topics":114,"view_count":10,"oss_zip_url":124,"oss_zip_packed_at":124,"status":16,"created_at":125,"updated_at":126,"faqs":127,"releases":158},982,"MaartenGr\u002FBERTopic","BERTopic","Leveraging BERT and c-TF-IDF to create easily interpretable topics. 
","BERTopic 是一种结合 BERT 和 c-TF-IDF 技术的主题建模工具，能够生成易于解释的主题，同时保留主题描述中的重要词汇。它通过将文本嵌入与聚类算法结合，帮助用户从大量文档中提取出清晰且有意义的主题结构。\n\n在传统主题建模方法中，主题的可解释性和灵活性往往不足。BERTopic 解决了这一问题，不仅支持多种建模方式（如引导式、监督式、动态主题等），还允许用户根据需求自定义主题或合并模型。无论是处理静态数据还是动态变化的文本流，它都能提供强大的支持。此外，新引入的零样本学习和多模态功能进一步扩展了其适用范围。\n\n这款工具特别适合需要进行文本分析的研究人员和开发者，尤其是那些希望快速构建高质量主题模型的人。对于熟悉 Python 的用户来说，安装和使用都非常便捷。BERTopic 的独特亮点在于其对 🤗 Transformers 的深度集成，以及对多种高级功能的支持，例如结合大型语言模型生成主题描述。这些特性使其成为探索复杂文本数据的理想选择。\n\n如果你正在寻找一个灵活、强大且易于使用的主题建模解决方案，BERTopic 无疑是一个值得尝试的选择。","[![PyPI Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMaartenGr_BERTopic_readme_40d304df6d24.png)](https:\u002F\u002Fpepy.tech\u002Fprojects\u002Fbertopic)\n[![PyPI - Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-v3.10+-blue.svg)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fbertopic\u002F)\n[![Build](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002FMaartenGr\u002FBERTopic\u002Ftesting.yml?branch=master)](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Factions)\n[![docs](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-Passing-green.svg)](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002F)\n[![PyPI - PyPi](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002FBERTopic)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fbertopic\u002F)\n[![PyPI - License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green.svg)](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FVLAC\u002Fblob\u002Fmaster\u002FLICENSE)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2203.05794-\u003CCOLOR>.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.05794)\n\n\n# BERTopic\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMaartenGr_BERTopic_readme_b279bc330ad2.png\" width=\"35%\" align=\"right\" \u002F> \n\nBERTopic is a topic modeling technique that leverages 🤗 transformers and c-TF-IDF to create dense clusters\nallowing for easily interpretable 
topics whilst keeping important words in the topic descriptions.\n\nBERTopic supports all kinds of topic modeling techniques:  \n\u003Ctable>\n  \u003Ctr>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fguided\u002Fguided.html\">Guided\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fsupervised\u002Fsupervised.html\">Supervised\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fsemisupervised\u002Fsemisupervised.html\">Semi-supervised\u003C\u002Fa>\u003C\u002Ftd>\n \u003C\u002Ftr>\n   \u003Ctr>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmanual\u002Fmanual.html\">Manual\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fdistribution\u002Fdistribution.html\">Multi-topic distributions\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fhierarchicaltopics\u002Fhierarchicaltopics.html\">Hierarchical\u003C\u002Fa>\u003C\u002Ftd>\n \u003C\u002Ftr>\n \u003Ctr>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftopicsperclass\u002Ftopicsperclass.html\">Class-based\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftopicsovertime\u002Ftopicsovertime.html\">Dynamic\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fonline\u002Fonline.html\">Online\u002FIncremental\u003C\u002Fa>\u003C\u002Ftd>\n \u003C\u002Ftr>\n \u003Ctr>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultimodal\u002Fmultimodal.html\">Multimodal\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultiaspect\u002Fmultiaspect.html\">Multi-aspect\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Fllm.html\">Text Generation\u002FLLM\u003C\u002Fa>\u003C\u002Ftd>\n \u003C\u002Ftr>\n \u003Ctr>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fzeroshot\u002Fzeroshot.html\">Zero-shot \u003Cb>(new!)\u003C\u002Fb>\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmerge\u002Fmerge.html\">Merge Models \u003Cb>(new!)\u003C\u002Fb>\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fseed_words\u002Fseed_words.html\">Seed Words \u003Cb>(new!)\u003C\u002Fb>\u003C\u002Fa>\u003C\u002Ftd>\n \u003C\u002Ftr>\n\u003C\u002Ftable>\n\nCorresponding medium posts can be found [here](https:\u002F\u002Fmedium.com\u002Fdata-science\u002Ftopic-modeling-with-bert-779f7db187e6?sk=0b5a470c006d1842ad4c8a3057063a99\n), [here](https:\u002F\u002Fmedium.com\u002Fdata-science\u002Fusing-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d\n) and [here](https:\u002F\u002Fmedium.com\u002Fdata-science\u002Finteractive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4\n). For a more detailed overview, you can read the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.05794) or see a [brief overview](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Falgorithm\u002Falgorithm.html). 
\n\n## Installation\n\nInstallation, with sentence-transformers, can be done using [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F):\n\n```bash\nuv add bertopic\n```\n\nor with [pip](https:\u002F\u002Fgithub.com\u002Fpypa\u002Fpip):\n\n```bash\npip install bertopic\n```\n\nIf you want to install BERTopic with other embedding models, you can choose one of the following:\n\n```bash\n# Choose an embedding backend\npip install bertopic[flair,gensim,spacy,use]\n\n# Topic modeling with images\npip install bertopic[vision]\n```\n\nFor a *light-weight installation* without transformers, UMAP and\u002For HDBSCAN (for training with Model2Vec or inference), see [this tutorial](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftips_and_tricks\u002Ftips_and_tricks.html#lightweight-installation).\n\n## Getting Started\nFor an in-depth overview of the features of BERTopic \nyou can check the [**full documentation**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002F) or you can follow along \nwith one of the examples below:\n\n| Name  | Link  |\n|---|---|\n| Start Here - **Best Practices in BERTopic**  | [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing)  |\n| **🆕 New!** - Topic Modeling on Large Data (GPU Acceleration)  | [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing)  |\n| **🆕 New!** - Topic Modeling with Llama 2 🦙 | [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing)  |\n| **🆕 New!** - Topic Modeling with Quantized LLMs | [![Open In 
Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1DdSHvVPJA3rmNfBWjCo2P1E9686xfxFx?usp=sharing)  |\n| Topic Modeling with BERTopic  | [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing)  |\n| (Custom) Embedding Models in BERTopic  | [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F18arPPe50szvcCp_Y6xS56H2tY0m-RLqv?usp=sharing) |\n| Advanced Customization in BERTopic  |  [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing) |\n| (semi-)Supervised Topic Modeling with BERTopic  |  [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing)  |\n| Dynamic Topic Modeling with Trump's Tweets  | [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing)  |\n| Topic Modeling arXiv Abstracts | [![Kaggle](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?style=for-the-badge&message=Kaggle&color=222222&logo=Kaggle&logoColor=20BEFF&label=)](https:\u002F\u002Fwww.kaggle.com\u002Fmaartengr\u002Ftopic-modeling-arxiv-abstract-with-bertopic) |\n\n\n## Quick Start\nWe start by extracting topics from the well-known 20 newsgroups dataset containing English documents:\n\n```python\nfrom bertopic import BERTopic\nfrom sklearn.datasets import fetch_20newsgroups\n \ndocs = fetch_20newsgroups(subset='all',  
remove=('headers', 'footers', 'quotes'))['data']\n\ntopic_model = BERTopic()\ntopics, probs = topic_model.fit_transform(docs)\n```\n\nAfter generating topics and their probabilities, we can access all of the topics together with their topic representations:\n\n```python\n>>> topic_model.get_topic_info()\n\nTopic\tCount\tName\n-1\t4630\t-1_can_your_will_any\n0\t693\t49_windows_drive_dos_file\n1\t466\t32_jesus_bible_christian_faith\n2\t441\t2_space_launch_orbit_lunar\n3\t381\t22_key_encryption_keys_encrypted\n...\n```\n\nThe `-1` topic refers to all outlier documents and is typically ignored. Each word in a topic describes the underlying theme of that topic and can be used \nfor interpreting that topic. Next, let's take a look at the most frequent topic that was generated:\n\n```python\n>>> topic_model.get_topic(0)\n\n[('windows', 0.006152228076250982),\n ('drive', 0.004982897610645755),\n ('dos', 0.004845038866360651),\n ('file', 0.004140142872194834),\n ('disk', 0.004131678774810884),\n ('mac', 0.003624848635985097),\n ('memory', 0.0034840976976789903),\n ('software', 0.0034415334250699077),\n ('email', 0.0034239554442333257),\n ('pc', 0.003047105930670237)]\n```  \n\nUsing `.get_document_info`, we can also extract information on a document level, such as their corresponding topics, probabilities, whether they are representative documents for a topic, etc.:\n\n```python\n>>> topic_model.get_document_info(docs)\n\nDocument                               Topic\tName\t                        Top_n_words                     Probability    ...\nI am sure some bashers of Pens...\t0\t0_game_team_games_season\tgame - team - games...\t        0.200010       ...\nMy brother is in the market for...      -1     -1_can_your_will_any\t        can - your - will...\t        0.420668       ...\nFinally you said what you dream...\t-1     -1_can_your_will_any\t        can - your - will...            0.807259       ...\nThink! 
It's the SCSI card doing...\t49     49_windows_drive_dos_file\twindows - drive - dos...\t0.071746       ...\n1) I have an old Jasmine drive...\t49     49_windows_drive_dos_file\twindows - drive - dos...\t0.038983       ...\n```\n\n**`🔥 Tip`**: Use `BERTopic(language=\"multilingual\")` to select a model that supports 50+ languages. \n\n## Fine-tune Topic Representations\n\nIn BERTopic, there are a number of different [topic representations](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html) that we can choose from. They are all quite different from one another and give interesting perspectives and variations of topic representations. A great start is `KeyBERTInspired`, which for many users increases the coherence and reduces stopwords from the resulting topic representations:\n\n ```python\nfrom bertopic.representation import KeyBERTInspired\n\n# Fine-tune your topic representations\nrepresentation_model = KeyBERTInspired()\ntopic_model = BERTopic(representation_model=representation_model)\n```\n\nHowever, you might want to use something more powerful to describe your clusters. You can even use ChatGPT or other models from OpenAI to generate labels, summaries, phrases, keywords, and more:\n\n```python\nimport openai\nfrom bertopic.representation import OpenAI\n\n# Fine-tune topic representations with GPT\nclient = openai.OpenAI(api_key=\"sk-...\")\nrepresentation_model = OpenAI(client, model=\"gpt-4o-mini\", chat=True)\ntopic_model = BERTopic(representation_model=representation_model)\n```\n\n**`🔥 Tip`**: Instead of iterating over all of these different topic representations, you can model them simultaneously with [multi-aspect topic representations](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultiaspect\u002Fmultiaspect.html) in BERTopic. 
\n\n\n## Visualizations\nAfter having trained our BERTopic model, we can iteratively go through hundreds of topics to get a good \nunderstanding of the topics that were extracted. However, that takes quite some time and lacks a global representation. Instead, we can use one of the [many visualization options](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fvisualization\u002Fvisualization.html) in BERTopic. \nFor example, we can visualize the topics that were generated in a way very similar to \n[LDAvis](https:\u002F\u002Fgithub.com\u002Fcpsievert\u002FLDAvis):\n\n```python\ntopic_model.visualize_topics()\n``` \n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMaartenGr_BERTopic_readme_4c245aa0b80c.gif\" width=\"80%\" align=\"center\" \u002F>\n\n## Modularity\nBy default, the [main steps](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Falgorithm\u002Falgorithm.html) for topic modeling with BERTopic are sentence-transformers, UMAP, HDBSCAN, and c-TF-IDF run in sequence. However, it assumes some independence between these steps which makes BERTopic quite modular. In other words, BERTopic not only allows you to build your own topic model but to explore several topic modeling techniques on top of your customized topic model:\n\nhttps:\u002F\u002Fuser-images.githubusercontent.com\u002F25746895\u002F218420473-4b2bb539-9dbe-407a-9674-a8317c7fb3bf.mp4\n\nYou can swap out any of these models or even remove them entirely. The following steps are completely modular:\n\n1. [Embedding](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fembeddings\u002Fembeddings.html) documents\n2. [Reducing dimensionality](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fdim_reduction\u002Fdim_reduction.html) of embeddings\n3. 
[Clustering](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fclustering\u002Fclustering.html) reduced embeddings into topics\n4. [Tokenization](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fvectorizers\u002Fvectorizers.html) of topics\n5. [Weight](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fctfidf\u002Fctfidf.html) tokens\n6. [Represent topics](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html) with one or [multiple](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultiaspect\u002Fmultiaspect.html) representations\n\n\n## Functionality\nBERTopic has many functions that can quickly become overwhelming. To alleviate this issue, you will find an overview \nof all methods and a short description of their purpose. \n\n### Common\nBelow, you will find an overview of common functions in BERTopic. \n\n| Method | Code  | \n|-----------------------|---|\n| Fit the model    |  `.fit(docs)` |\n| Fit the model and predict documents  |  `.fit_transform(docs)` |\n| Predict new documents    |  `.transform([new_doc])` |\n| Access single topic   | `.get_topic(topic=12)`  |   \n| Access all topics     |  `.get_topics()` |\n| Get topic freq    |  `.get_topic_freq()` |\n| Get all topic information|  `.get_topic_info()` |\n| Get all document information|  `.get_document_info(docs)` |\n| Get representative docs per topic |  `.get_representative_docs()` |\n| Update topic representation | `.update_topics(docs, n_gram_range=(1, 3))` |\n| Generate topic labels | `.generate_topic_labels()` |\n| Set topic labels | `.set_topic_labels(my_custom_labels)` |\n| Merge topics | `.merge_topics(docs, topics_to_merge)` |\n| Reduce nr of topics | `.reduce_topics(docs, nr_topics=30)` |\n| Reduce outliers | `.reduce_outliers(docs, topics)` |\n| Find topics | `.find_topics(\"vehicle\")` |\n| Save model    |  
`.save(\"my_model\", serialization=\"safetensors\")` |\n| Load model    |  `BERTopic.load(\"my_model\")` |\n| Get parameters |  `.get_params()` |\n\n\n### Attributes\nAfter having trained your BERTopic model, several attributes are saved within your model. These attributes, in part, \nrefer to how model information is stored on an estimator during fitting. The attributes that you see below all end in `_` and are \npublic attributes that can be used to access model information. \n\n| Attribute | Description |\n|------------------------|---------------------------------------------------------------------------------------------|\n| `.topics_`               | The topics that are generated for each document after training or updating the topic model. |\n| `.probabilities_` | The probabilities that are generated for each document if HDBSCAN is used. |\n| `.topic_sizes_`           | The size of each topic                                                                      |\n| `.topic_mapper_`          | A class for tracking topics and their mappings anytime they are merged\u002Freduced.             |\n| `.topic_representations_` | The top *n* terms per topic and their respective c-TF-IDF values.                           |\n| `.c_tf_idf_`              | The topic-term matrix as calculated through c-TF-IDF.                                       |\n| `.topic_aspects_`          | The different aspects, or representations, of each topic.                                  |\n| `.topic_labels_`          | The default labels for each topic.                                                          |\n| `.custom_labels_`         | Custom labels for each topic as generated through `.set_topic_labels`.                      |\n| `.topic_embeddings_`      | The embeddings for each topic if `embedding_model` was used.                                |\n| `.representative_docs_`   | The representative documents for each topic if HDBSCAN is used.                             
|\n\n\n### Variations\nThere are many different use cases in which topic modeling can be used. As such, several variations of BERTopic have been developed such that one package can be used across many use cases.\n\n| Method | Code  | \n|-----------------------|---|\n| [Topic Distribution Approximation](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fdistribution\u002Fdistribution.html) | `.approximate_distribution(docs)` |\n| [Online Topic Modeling](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fonline\u002Fonline.html) | `.partial_fit(doc)` |\n| [Semi-supervised Topic Modeling](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fsemisupervised\u002Fsemisupervised.html) | `.fit(docs, y=y)` |\n| [Supervised Topic Modeling](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fsupervised\u002Fsupervised.html) | `.fit(docs, y=y)` |\n| [Manual Topic Modeling](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmanual\u002Fmanual.html) | `.fit(docs, y=y)` |\n| [Multimodal Topic Modeling](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultimodal\u002Fmultimodal.html) | ``.fit(docs, images=images)`` |\n| [Topic Modeling per Class](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftopicsperclass\u002Ftopicsperclass.html) | `.topics_per_class(docs, classes)` |\n| [Dynamic Topic Modeling](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftopicsovertime\u002Ftopicsovertime.html) | `.topics_over_time(docs, timestamps)` |\n| [Hierarchical Topic Modeling](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fhierarchicaltopics\u002Fhierarchicaltopics.html) | `.hierarchical_topics(docs)` |\n| [Guided Topic Modeling](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fguided\u002Fguided.html) | 
`BERTopic(seed_topic_list=seed_topic_list)` |\n| [Zero-shot Topic Modeling](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fzeroshot\u002Fzeroshot.html) | `BERTopic(zeroshot_topic_list=zeroshot_topic_list)` |\n| [Merge Multiple Models](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmerge\u002Fmerge.html) | `BERTopic.merge_models([topic_model_1, topic_model_2])` |\n\n\n### Visualizations\nEvaluating topic models can be rather difficult due to the somewhat subjective nature of evaluation. \nVisualizing different aspects of the topic model helps in understanding the model and makes it easier \nto tweak the model to your liking. \n\n| Method | Code  | \n|-----------------------|---|\n| Visualize Topics    |  `.visualize_topics()` |\n| Visualize Documents    |  `.visualize_documents()` |\n| Visualize Document Hierarchy    |  `.visualize_hierarchical_documents()` |\n| Visualize Topic Hierarchy    |  `.visualize_hierarchy()` |\n| Visualize Topic Tree   |  `.get_topic_tree(hierarchical_topics)` |\n| Visualize Topic Terms    |  `.visualize_barchart()` |\n| Visualize Topic Similarity  |  `.visualize_heatmap()` |\n| Visualize Term Score Decline  |  `.visualize_term_rank()` |\n| Visualize Topic Probability Distribution    |  `.visualize_distribution(probs[0])` |\n| Visualize Topics over Time   |  `.visualize_topics_over_time(topics_over_time)` |\n| Visualize Topics per Class | `.visualize_topics_per_class(topics_per_class)` | \n\n\n## Citation\nTo cite the [BERTopic paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.05794), please use the following BibTeX reference:\n\n```bibtex\n@article{grootendorst2022bertopic,\n  title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},\n  author={Grootendorst, Maarten},\n  journal={arXiv preprint arXiv:2203.05794},\n  year={2022}\n}\n```\n","[![PyPI 
下载量](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMaartenGr_BERTopic_readme_40d304df6d24.png)](https:\u002F\u002Fpepy.tech\u002Fprojects\u002Fbertopic)\n[![PyPI - Python](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fpython-v3.10+-blue.svg)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fbertopic\u002F)\n[![构建状态](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Factions\u002Fworkflow\u002Fstatus\u002FMaartenGr\u002FBERTopic\u002Ftesting.yml?branch=master)](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Factions)\n[![文档](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fdocs-Passing-green.svg)](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002F)\n[![PyPI - 版本](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002FBERTopic)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fbertopic\u002F)\n[![PyPI - 许可证](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Flicense-MIT-green.svg)](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FVLAC\u002Fblob\u002Fmaster\u002FLICENSE)\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2203.05794-\u003CCOLOR>.svg)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.05794)\n\n\n# BERTopic\n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMaartenGr_BERTopic_readme_b279bc330ad2.png\" width=\"35%\" align=\"right\" \u002F> \n\nBERTopic（一种基于 Transformer 的主题建模技术）是一种主题建模（Topic Modeling）方法，它利用 🤗 Transformer 模型库和 c-TF-IDF 生成密集聚类（dense clusters），既能创建易于理解的主题，又能保留主题描述中的关键术语。\n\nBERTopic 支持多种主题建模技术：  \n\u003Ctable>\n  \u003Ctr>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fguided\u002Fguided.html\">引导式建模\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fsupervised\u002Fsupervised.html\">监督式建模\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fsemisupervised\u002Fsemisupervised.html\">半监督式建模\u003C\u002Fa>\u003C\u002Ftd>\n \u003C\u002Ftr>\n   \u003Ctr>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmanual\u002Fmanual.html\">手动建模\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fdistribution\u002Fdistribution.html\">多主题分布\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fhierarchicaltopics\u002Fhierarchicaltopics.html\">层次化建模\u003C\u002Fa>\u003C\u002Ftd>\n \u003C\u002Ftr>\n \u003Ctr>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftopicsperclass\u002Ftopicsperclass.html\">基于类别的建模\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftopicsovertime\u002Ftopicsovertime.html\">动态主题建模\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fonline\u002Fonline.html\">在线\u002F增量建模\u003C\u002Fa>\u003C\u002Ftd>\n \u003C\u002Ftr>\n \u003Ctr>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultimodal\u002Fmultimodal.html\">多模态建模\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultiaspect\u002Fmultiaspect.html\">多维度建模\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Fllm.html\">文本生成\u002F大语言模型（LLM）\u003C\u002Fa>\u003C\u002Ftd>\n \u003C\u002Ftr>\n \u003Ctr>\n    \u003Ctd>\u003Ca 
href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fzeroshot\u002Fzeroshot.html\">零样本建模 \u003Cb>(新增！)\u003C\u002Fb>\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmerge\u002Fmerge.html\">模型合并 \u003Cb>(新增！)\u003C\u002Fb>\u003C\u002Fa>\u003C\u002Ftd>\n    \u003Ctd>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fseed_words\u002Fseed_words.html\">种子词 \u003Cb>(新增！)\u003C\u002Fb>\u003C\u002Fa>\u003C\u002Ftd>\n \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n对应的 Medium 文章可在此处[查看](https:\u002F\u002Fmedium.com\u002Fdata-science\u002Ftopic-modeling-with-bert-779f7db187e6?sk=0b5a470c006d1842ad4c8a3057063a99\n)、[此处](https:\u002F\u002Fmedium.com\u002Fdata-science\u002Fusing-whisper-and-bertopic-to-model-kurzgesagts-videos-7d8a63139bdf?sk=b1e0fd46f70cb15e8422b4794a81161d\n)和[此处](https:\u002F\u002Fmedium.com\u002Fdata-science\u002Finteractive-topic-modeling-with-bertopic-1ea55e7d73d8?sk=03c2168e9e74b6bda2a1f3ed953427e4\n)。如需更详细说明，可阅读[论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.05794)或查看[简要概述](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Falgorithm\u002Falgorithm.html)。\n\n## 安装\n\n使用 [uv](https:\u002F\u002Fdocs.astral.sh\u002Fuv\u002F) 安装（含 sentence-transformers）：\n\n```bash\nuv add bertopic\n```\n\n或使用 [pip](https:\u002F\u002Fgithub.com\u002Fpypa\u002Fpip)：\n\n```bash\npip install bertopic\n```\n\n如需安装支持其他嵌入模型（embedding models）的 BERTopic，可选择以下任一方式：\n\n```bash\n# 选择嵌入后端\npip install bertopic[flair,gensim,spacy,use]\n\n# 图像主题建模\npip install bertopic[vision]\n```\n\n如需*轻量级安装*（不含 transformers、UMAP 和\u002F或 HDBSCAN，适用于 Model2Vec 训练或推理），请参阅[本教程](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftips_and_tricks\u002Ftips_and_tricks.html#lightweight-installation)。\n\n## 快速入门\n如需深入了解 BERTopic 的功能特性，可查阅[**完整文档**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002F)，或参考以下示例：\n\n| 名称  
| 链接  |\n|---|---|\n| 从此开始 - **BERTopic 最佳实践**  | [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1BoQ_vakEVtojsd2x_U6-_x52OOuqruj2?usp=sharing)  |\n| **🆕 新增！** - 大规模数据主题建模（GPU 加速）  | [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1W7aEdDPxC29jP99GGZphUlqjMFFVKtBC?usp=sharing)  |\n| **🆕 新增！** - 使用 Llama 2 🦙 进行主题建模 | [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1QCERSMUjqGetGGujdrvv_6_EeoIcd_9M?usp=sharing)  |\n| **🆕 新增！** - 使用量化大语言模型（LLMs）进行主题建模 | [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1DdSHvVPJA3rmNfBWjCo2P1E9686xfxFx?usp=sharing)  |\n| 使用 BERTopic 进行主题建模  | [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1FieRA9fLdkQEGDIMYl0I3MCjSUKVF8C-?usp=sharing)  |\n| BERTopic 中的（自定义）嵌入模型  | [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F18arPPe50szvcCp_Y6xS56H2tY0m-RLqv?usp=sharing) |\n| BERTopic 高级自定义  |  [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1ClTYut039t-LDtlcd-oQAdXWgcsSGTw9?usp=sharing) |\n| BERTopic 的（半）监督主题建模  |  [![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1bxizKzv5vfxJEB29sntU__ZC7PBSIPaQ?usp=sharing)  |\n| 基于特朗普推文的动态主题建模  | [![在 Colab 
中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1un8ooI-7ZNlRoK0maVkYhmNRl0XGK88f?usp=sharing)  |\n| arXiv 摘要主题建模 | [![Kaggle](https:\u002F\u002Fimg.shields.io\u002Fstatic\u002Fv1?style=for-the-badge&message=Kaggle&color=222222&logo=Kaggle&logoColor=20BEFF&label=)](https:\u002F\u002Fwww.kaggle.com\u002Fmaartengr\u002Ftopic-modeling-arxiv-abstract-with-bertopic) |\n\n## 快速开始\n我们首先从著名的 20 newsgroups（20 个新闻组）数据集中提取主题，该数据集包含英文文档：\n\n```python\nfrom bertopic import BERTopic\nfrom sklearn.datasets import fetch_20newsgroups\n \ndocs = fetch_20newsgroups(subset='all',  remove=('headers', 'footers', 'quotes'))['data']\n\ntopic_model = BERTopic()\ntopics, probs = topic_model.fit_transform(docs)\n```\n\n生成主题及其概率后，我们可以访问所有主题及其主题表示：\n\n```python\n>>> topic_model.get_topic_info()\n\n主题\t计数\t名称\n-1\t4630\t-1_can_your_will_any\n0\t693\t49_windows_drive_dos_file\n1\t466\t32_jesus_bible_christian_faith\n2\t441\t2_space_launch_orbit_lunar\n3\t381\t22_key_encryption_keys_encrypted\n...\n```\n\n`-1` 主题指代所有离群文档，通常被忽略。主题中的每个词描述了该主题的潜在主题，可用于解释该主题。接下来，我们查看生成的最频繁主题：\n\n```python\n>>> topic_model.get_topic(0)\n\n[('windows', 0.006152228076250982),\n ('drive', 0.004982897610645755),\n ('dos', 0.004845038866360651),\n ('file', 0.004140142872194834),\n ('disk', 0.004131678774810884),\n ('mac', 0.003624848635985097),\n ('memory', 0.0034840976976789903),\n ('software', 0.0034415334250699077),\n ('email', 0.0034239554442333257),\n ('pc', 0.003047105930670237)]\n```  \n\n使用 `.get_document_info`，我们还可以提取文档级别的信息，例如其对应的主题、概率、是否为该主题的代表性文档等：\n\n```python\n>>> topic_model.get_document_info(docs)\n\n文档                                   主题\t名称\t                        Top_n_words                     概率    ...\nI am sure some bashers of Pens...\t0\t0_game_team_games_season\tgame - team - games...\t        0.200010       ...\nMy brother is in the market for...      
-1     -1_can_your_will_any\t        can - your - will...\t        0.420668       ...\nFinally you said what you dream...\t-1     -1_can_your_will_any\t        can - your - will...            0.807259       ...\nThink! It's the SCSI card doing...\t49     49_windows_drive_dos_file\twindows - drive - docs...\t0.071746       ...\n1) I have an old Jasmine drive...\t49     49_windows_drive_dos_file\twindows - drive - docs...\t0.038983       ...\n```\n\n**`🔥 提示`**: 使用 `BERTopic(language=\"multilingual\")` 选择支持 50 多种语言的模型。\n\n## 微调主题表示\n在 BERTopic 中，我们可以选择多种不同的[主题表示方法](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html)。它们彼此差异较大，为主题表示提供了有趣的视角和变体。一个很好的起点是 `KeyBERTInspired`（基于 KeyBERT 启发的主题表示方法），对许多用户而言，它能提高主题连贯性并减少停用词：\n\n ```python\nfrom bertopic.representation import KeyBERTInspired\n\n# 微调主题表示\nrepresentation_model = KeyBERTInspired()\ntopic_model = BERTopic(representation_model=representation_model)\n```\n\n然而，您可能希望使用更强大的方法来描述聚类。您甚至可以使用 ChatGPT 或其他 OpenAI 模型生成标签、摘要、短语、关键词等：\n\n```python\nimport openai\nfrom bertopic.representation import OpenAI\n\n# 使用 GPT 微调主题表示\nclient = openai.OpenAI(api_key=\"sk-...\")\nrepresentation_model = OpenAI(client, model=\"gpt-4o-mini\", chat=True)\ntopic_model = BERTopic(representation_model=representation_model)\n```\n\n**`🔥 提示`**: 无需逐一尝试这些不同的主题表示方法，您可以在 BERTopic 中使用[多角度主题表示](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultiaspect\u002Fmultiaspect.html)同时建模。\n\n## 可视化\n训练完 BERTopic 模型后，我们可以逐个检查数百个主题以充分理解提取的主题。但这非常耗时且缺乏全局视角。相反，我们可以使用 BERTopic 中的[多种可视化选项](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fvisualization\u002Fvisualization.html)。例如，我们可以以类似 [LDAvis](https:\u002F\u002Fgithub.com\u002Fcpsievert\u002FLDAvis) 的方式可视化生成的主题：\n\n```python\ntopic_model.visualize_topics()\n``` \n\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMaartenGr_BERTopic_readme_4c245aa0b80c.gif\" width=\"80%\" 
align=\"center\" \u002F>\n\n## 模块化\n默认情况下，使用 BERTopic 进行主题建模的[主要步骤](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Falgorithm\u002Falgorithm.html)是 sentence-transformers（句子嵌入模型）、UMAP（降维算法）、HDBSCAN（聚类算法）和 c-TF-IDF（主题表示算法）按顺序执行。但这些步骤之间具有一定的独立性，使 BERTopic 非常模块化。换言之，BERTopic 不仅允许您构建自己的主题模型，还能在自定义主题模型基础上探索多种主题建模技术：\n\nhttps:\u002F\u002Fuser-images.githubusercontent.com\u002F25746895\u002F218420473-4b2bb539-9dbe-407a-9674-a8317c7fb3bf.mp4\n\n您可以替换其中任何模型，甚至完全移除它们。以下步骤完全模块化：\n\n1. [嵌入](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fembeddings\u002Fembeddings.html)文档\n2. [降低嵌入维度](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fdim_reduction\u002Fdim_reduction.html)\n3. 将降维后的嵌入[聚类](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fclustering\u002Fclustering.html)为主题\n4. [主题向量化](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fvectorizers\u002Fvectorizers.html)\n5. [加权](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fctfidf\u002Fctfidf.html)主题词\n6. 
使用一种或[多种](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultiaspect\u002Fmultiaspect.html)表示方法[表示主题](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html)\n\n## 功能\nBERTopic 拥有许多功能，可能令人眼花缭乱。为解决此问题，您将找到所有方法的概述及其用途的简要说明。\n\n### 常见功能\n以下是对BERTopic（神经主题建模工具）中常见功能的概述。\n\n| 方法 | 代码  |\n|-----------------------|---|\n| 拟合模型    |  `.fit(docs)` |\n| 拟合模型并预测文档  |  `.fit_transform(docs)` |\n| 预测新文档    |  `.transform([new_doc])` |\n| 访问单个主题   | `.get_topic(topic=12)`  |   \n| 访问所有主题     |  `.get_topics()` |\n| 获取主题频率    |  `.get_topic_freq()` |\n| 获取所有主题信息|  `.get_topic_info()` |\n| 获取所有文档信息|  `.get_document_info(docs)` |\n| 获取每个主题的代表性文档 |  `.get_representative_docs()` |\n| 更新主题表示 | `.update_topics(docs, n_gram_range=(1, 3))` |\n| 生成主题标签 | `.generate_topic_labels()` |\n| 设置主题标签 | `.set_topic_labels(my_custom_labels)` |\n| 合并主题 | `.merge_topics(docs, topics_to_merge)` |\n| 减少主题数量 | `.reduce_topics(docs, nr_topics=30)` |\n| 减少离群点 | `.reduce_outliers(docs, topics)` |\n| 查找主题 | `.find_topics(\"vehicle\")` |\n| 保存模型    |  `.save(\"my_model\", serialization=\"safetensors\")` |\n| 加载模型    |  `BERTopic.load(\"my_model\")` |\n| 获取参数 |  `.get_params()` |\n\n\n### 属性\n训练完BERTopic模型后，模型内部会保存多个属性。这些属性部分描述了模型在拟合过程中如何存储信息。以下所有属性均以`_`结尾，是可用于访问模型信息的公共属性。\n\n| 属性 | 描述 |\n|------------------------|---------------------------------------------------------------------------------------------|\n| `.topics_`               | 训练或更新主题模型后，为每个文档生成的主题。 |\n| `.probabilities_` | 如果使用HDBSCAN（分层密度聚类算法），则为每个文档生成的概率。 |\n| `.topic_sizes_`           | 每个主题的大小                                                                      |\n| `.topic_mapper_`          | 用于在主题合并\u002F减少时跟踪主题及其映射的类。             |\n| `.topic_representations_` | 每个主题的前*n*个术语及其相应的c-TF-IDF（类TF-IDF）值。                           |\n| `.c_tf_idf_`              | 通过c-TF-IDF（类TF-IDF）计算得到的主题-术语矩阵。                                       |\n| `.topic_aspects_`  
        | 每个主题的不同方面或表示。                                  |\n| `.topic_labels_`          | 每个主题的默认标签。                                                          |\n| `.custom_labels_`         | 通过`.set_topic_labels`生成的每个主题的自定义标签。                      |\n| `.topic_embeddings_`      | 如果使用了`embedding_model`（嵌入模型），则为每个主题的嵌入向量。                                |\n| `.representative_docs_`   | 如果使用HDBSCAN（分层密度聚类算法），则为每个主题的代表性文档。                            |\n\n\n### 变体\n主题建模可用于多种不同的使用场景。因此，BERTopic 提供了多种变体，使同一个包可以适用于多种用例。\n\n| 方法 | 代码  |\n|-----------------------|---|\n| [主题分布近似](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fdistribution\u002Fdistribution.html) | `.approximate_distribution(docs)` |\n| [在线主题建模](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fonline\u002Fonline.html) | `.partial_fit(doc)` |\n| [半监督主题建模](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fsemisupervised\u002Fsemisupervised.html) | `.fit(docs, y=y)` |\n| [监督主题建模](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fsupervised\u002Fsupervised.html) | `.fit(docs, y=y)` |\n| [手动主题建模](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmanual\u002Fmanual.html) | `.fit(docs, y=y)` |\n| [多模态主题建模](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultimodal\u002Fmultimodal.html) | ``.fit(docs, images=images)`` |\n| [按类别主题建模](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftopicsperclass\u002Ftopicsperclass.html) | `.topics_per_class(docs, classes)` |\n| [动态主题建模](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftopicsovertime\u002Ftopicsovertime.html) | `.topics_over_time(docs, timestamps)` |\n| [分层主题建模](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fhierarchicaltopics\u002Fhierarchicaltopics.html) | `.hierarchical_topics(docs)` |\n| 
[引导式主题建模](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fguided\u002Fguided.html) | `BERTopic(seed_topic_list=seed_topic_list)` |\n| [零样本主题建模](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fzeroshot\u002Fzeroshot.html) | `BERTopic(zeroshot_topic_list=zeroshot_topic_list)` |\n| [合并多个模型](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmerge\u002Fmerge.html) | `BERTopic.merge_models([topic_model_1, topic_model_2])` |\n\n\n### 可视化\n主题模型的评估可能相当困难，因为评估本身带有一定的主观性。可视化主题模型的不同方面有助于理解模型，并使其更容易根据您的喜好进行调整。\n\n| 方法 | 代码  |\n|-----------------------|---|\n| 可视化主题    |  `.visualize_topics()` |\n| 可视化文档    |  `.visualize_documents()` |\n| 可视化文档层次结构    |  `.visualize_hierarchical_documents()` |\n| 可视化主题层次结构    |  `.visualize_hierarchy()` |\n| 可视化主题树   |  `.get_topic_tree(hierarchical_topics)` |\n| 可视化主题术语    |  `.visualize_barchart()` |\n| 可视化主题相似度  |  `.visualize_heatmap()` |\n| 可视化术语分数下降  |  `.visualize_term_rank()` |\n| 可视化主题概率分布    |  `.visualize_distribution(probs[0])` |\n| 可视化主题随时间变化   |  `.visualize_topics_over_time(topics_over_time)` |\n| 可视化按类别的主题 | `.visualize_topics_per_class(topics_per_class)` | \n\n\n## 引用\n要引用[BERTopic论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2203.05794)，请使用以下 BibTeX 参考文献：\n\n```bibtex\n@article{grootendorst2022bertopic,\n  title={BERTopic: Neural topic modeling with a class-based TF-IDF procedure},\n  author={Grootendorst, Maarten},\n  journal={arXiv preprint arXiv:2203.05794},\n  year={2022}\n}\n```","# BERTopic 快速上手指南\n\n## 环境准备\n- **系统要求**：Python 3.10 或更高版本\n- **前置依赖**：确保已安装基础 Python 环境（推荐使用虚拟环境）。若需 GPU 加速，需配置 CUDA 环境（非必需）\n- **国内加速建议**：安装时优先使用清华镜像源，显著提升下载速度\n\n## 安装步骤\n推荐使用国内镜像源安装（速度更快）：\n```bash\npip install bertopic -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n如需安装额外功能模块（按需选择）：\n```bash\n# 安装 Flair\u002FGensim\u002FSpacy 等嵌入模型支持\npip install bertopic[flair,gensim,spacy,use] -i 
https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n\n# 安装图像主题建模支持\npip install bertopic[vision] -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 基本使用\n以下是最简示例，使用内置数据集快速体验主题建模：\n\n```python\nfrom bertopic import BERTopic\nfrom sklearn.datasets import fetch_20newsgroups\n\n# 加载示例数据（20 Newsgroups 英文文档集）\ndocs = fetch_20newsgroups(subset='all', remove=('headers', 'footers', 'quotes'))['data']\n\n# 训练主题模型\ntopic_model = BERTopic()\ntopics, probs = topic_model.fit_transform(docs)\n\n# 查看主题统计信息\nprint(topic_model.get_topic_info())\n\n# 查看指定主题的关键词（示例：主题ID=0）\nprint(topic_model.get_topic(0))\n```\n\n**关键说明**：\n1. `fit_transform()` 自动完成文档嵌入、聚类和主题生成\n2. 输出中 `-1` 表示离群文档（通常可忽略）\n3. 主题名称格式为 `ID_关键词1_关键词2_...`（如 `0_game_team_games`）\n4. **多语言支持**：初始化时添加参数 `BERTopic(language=\"multilingual\")` 可处理 50+ 语言\n\n> 提示：首次运行会自动下载预训练模型（约 400MB），建议在网络稳定的环境下执行。实际应用时，直接替换 `docs` 为您的中文文本列表即可（需确保文本已分句处理）。","某大型电商平台的数据分析师团队每周需处理50万条用户评论，以识别产品缺陷和客户痛点，支撑产品迭代决策。\n\n### 没有 BERTopic 时\n- 传统LDA模型生成的主题词孤立且模糊，例如\"电池\"同时关联手机和笔记本电脑，导致团队误判30%的充电问题归属。\n- 手动分类评论耗时费力，5人小组需3天完成分析，常错过促销活动前的优化窗口期。\n- 新产品评论涌入时模型无法动态更新，主题漂移严重，如智能手表评论被错误归入\"耳机\"类别。\n- 主题描述冗长难懂，如\"与电力相关的设备组件问题\"，管理层难以快速理解核心问题。\n\n### 使用 BERTopic 后\n- BERTopic的BERT嵌入精准捕捉语义上下文，主题自动区分\"手机电池续航短\"和\"笔记本充电故障\"，误判率降至5%以下。\n- 自动化流程结合c-TF-IDF生成简洁主题名（如\"充电问题\"），分析时间压缩至2小时内，实时支持当日运营调整。\n- 在线增量学习功能动态吸收新数据，智能手表评论立即归入专属主题，主题漂移问题彻底解决。\n- 可解释的主题描述直接输出业务语言，管理层5分钟内掌握关键痛点，决策效率提升50%。\n\nBERTopic让海量文本洞察从耗时负担转化为即时业务驱动力，真正实现数据驱动的产品优化。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FMaartenGr_BERTopic_b279bc33.png","MaartenGr","Maarten Grootendorst","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FMaartenGr_0047c7ce.jpg","AI @google-deepmind  | Author of \"Hands-On Large Language 
Models\"","@google-deepmind","Netherlands","maartengrootendorst@gmail.com","https:\u002F\u002Fnewsletter.maartengrootendorst.com\u002F","https:\u002F\u002Fgithub.com\u002FMaartenGr",[85,89],{"name":86,"color":87,"percentage":88},"Python","#3572A5",99.9,{"name":90,"color":91,"percentage":92},"Makefile","#427819",0.1,7508,886,"2026-04-05T13:30:15","MIT",1,"","可选，用于加速；无具体显卡要求","未说明",{"notes":102,"python":103,"dependencies":104},"基础安装需下载预训练模型（约100MB-1GB）；支持多种嵌入后端（flair\u002Fspacy等）；大型数据集建议启用GPU加速；提供轻量级安装选项（可排除transformers\u002FUMAP\u002FHDBSCAN）","3.10+",[105,106,107,108,109,110,111,112],"sentence-transformers","umap-learn","hdbscan","torch","transformers","scikit-learn","numpy","pandas",[13,26],[115,109,116,117,118,119,120,121,122,123],"bert","topic-modeling","sentence-embeddings","nlp","machine-learning","topic","ldavis","topic-modelling","topic-models",null,"2026-03-27T02:49:30.150509","2026-04-06T08:35:15.166391",[128,133,138,143,148,153],{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},4368,"安装 BERTopic 时出现 'Could not build wheels for hdbscan' 错误如何解决？","在 Ubuntu 系统上，运行命令 `sudo apt-get install -y build-essential python3-dev python3-venv` 安装必要依赖即可解决此问题。该方案由用户验证有效，适用于因缺少构建工具导致的 hdbscan 安装失败。","https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F816",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},4369,"为何在相同数据集上多次训练 BERTopic 会得到不同数量的主题？","这通常由 UMAP 的随机性引起。维护者建议尝试从特定提交安装 UMAP，参考此 issue 评论：https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1817#issuecomment-1953547246。同时，确保在初始化 BERTopic 时设置 `umap_model=UMAP(random_state=42)` 以提高可重复性。","https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F461",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},4370,"如何计算新词集与主题词的相似度以进行主题分类？","使用 `approximate_distribution` 方法获取主题分布，例如 `topic_distr, topic_token_distr = model.approximate_distribution(new_doc, calculate_tokens=True)`。返回的 DataFrame 包含每个词属于特定主题的概率及文档中的词，可通过分析 
`topic_token_distr` 实现词级主题分配。","https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1292",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},4371,"BERTopic 是否支持带聚类的 2D UMAP 可视化？","当前版本不支持此功能。维护者表示由于不想过度依赖 plotly（未来可能更换绘图库），暂时不会添加，但会考虑在将来实现。建议关注官方文档更新以获取新功能支持。","https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F584",{"id":149,"question_zh":150,"answer_zh":151,"source_url":152},4372,"训练超大规模数据集（如 150 万句子）时如何避免内存溢出？","维护者虽未直接回复此问题，但在相关 issue 中提到 hdbscan 更新可能影响内存使用。建议先确保 hdbscan 为最新版本，并参考 FAQ 中的安装修复方案。对于大数据集，可尝试分批次处理或调整 UMAP 参数（如 `n_neighbors`）降低内存消耗。","https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F151",{"id":154,"question_zh":155,"answer_zh":156,"source_url":157},4373,"BERTopic 的 fit_transform 和 transform 方法是否分别用于训练和预测？","是的，`fit_transform` 用于训练模型并转换输入数据，`transform` 用于对新数据进行预测。维护者在其他讨论中确认此设计，建议训练时使用完整数据集，预测时直接调用 `transform` 方法处理新文档，无需重新训练。","https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F278",[159,164,169,174,179,184,189,194,199,204,209,214,219,224,229,234,239,244,249,254],{"id":160,"version":161,"summary_zh":162,"released_at":163},113469,"v0.17.4","\u003Ch2>Highlights:\u003C\u002Fh2>\r\n\r\n* Add `.delete_topics` by [@shuanglovesdata](https:\u002F\u002Fgithub.com\u002Fshuanglovesdata) in [#2322](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2322)\r\n* Allow execution without plotly by [@luismavs](https:\u002F\u002Fgithub.com\u002Fluismavs) in [#2401](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2401)\r\n* Add tqdm to _litellm.py by [@NFrnk](https:\u002F\u002Fgithub.com\u002FNFrnk) in [#2408](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2408)\r\n* Drop support for python 3.9 by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2419](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2419)\r\n* Make UMAP's init default to random on 
visualize_topics for reproducible visualization by [@makramab](https:\u002F\u002Fgithub.com\u002Fmakramab) in [#2412](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2412)\r\n\r\n\u003Ch2>cuML:\u003C\u002Fh2>\r\n\r\nPreparing for MEGA!-scale BERTopic with [Multi-GPU UMAP](https:\u002F\u002Fgithub.com\u002Frapidsai\u002Fcuml\u002Fissues\u002F3330#issuecomment-3376951158) and the following necessary updates:\r\n\r\n* Update installation instructions for cuML with BERTopic by [@csadorf](https:\u002F\u002Fgithub.com\u002Fcsadorf) in [#2446](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2446)\r\n* Speed up `._create_topic_vectors` by replacing DataFrame .loc with NumPy masking [@jinsolp](https:\u002F\u002Fgithub.com\u002Fjinsolp) in [#2406](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2406)\r\n* Modify _reduce_dimensionality to use fit_transform by [@betatim](https:\u002F\u002Fgithub.com\u002Fbetatim) in [#2416](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2416)\r\n\r\n\u003Ch2>Fixes:\u003C\u002Fh2>\r\n\r\n* Fix incorrect label in zero-shot svg in documentation by [@huisman](https:\u002F\u002Fgithub.com\u002Fhuisman) in [#2448](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2448)\r\n* Enable ruff rule RUF by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2457](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2457)\r\n* CI: bump github actions versions by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2427](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2427)\r\n* CI: prefer action-pre-commit-uv for lint job by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2434](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2434)\r\n* CI: switch to uv based project installation by 
[@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2445](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2445)\r\n* Chore: update pre-commit hooks by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2414](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2414) and [#2443](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2443)\r\n* Chore: remove obsolete `version_info` check by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2444](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2444)\r\n","2025-12-03T14:22:33",{"id":165,"version":166,"summary_zh":167,"released_at":168},113470,"v0.17.3","BERTopic now fully supports `uv`! You can install BERTopic with uv as follows:\r\n\r\n```bash\r\nuv add bertopic\r\n```","2025-07-08T10:56:22",{"id":170,"version":171,"summary_zh":172,"released_at":173},113471,"v0.17.1","\u003Ch2>Highlights:\u003C\u002Fh2>\r\n\r\n* Added FastEmbed backend by [@nickprock](https:\u002F\u002Fgithub.com\u002Fnickprock) in [#2213](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2312) \r\n* Added LangChain backend by [@regaltsui](https:\u002F\u002Fgithub.com\u002Fregaltsui) in [#2303](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2303) \r\n* Pass precomputed embeddings to KeyBERTInspired.extract_topics [@saikumaru](https:\u002F\u002Fgithub.com\u002Fsaikumaru) in [#2368](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2368)\r\n\r\n\u003Ch2>Fixes:\u003C\u002Fh2>\r\n\r\n* Merge models without pytorch (using safetensors) by [@MaartenGr](https:\u002F\u002Fgithub.com\u002FMaartenGr) in [#2329](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2329) \r\n* Fix installation issue with uv by [@MaartenGr](https:\u002F\u002Fgithub.com\u002FMaartenGr) in 
[#2328](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2328)\r\n* Fix incorrect comparison in update_topics by [@uply23333](https:\u002F\u002Fgithub.com\u002Fuply23333) in [#2336](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2336) \r\n* Add missing comma under Exploration subsection by [@angelonazzaro](https:\u002F\u002Fgithub.com\u002Fangelonazzaro) in [#2374](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2374) \r\n* Fix typo in Lightweight installation under tips_and_tricks by [@angelonazzaro](https:\u002F\u002Fgithub.com\u002Fangelonazzaro) in [#2375](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2375) \r\n* Fix IndexError in zero-shot topic modeling by [@MaartenGr](https:\u002F\u002Fgithub.com\u002FMaartenGr) in [#2267](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2267)","2025-07-08T09:04:53",{"id":175,"version":176,"summary_zh":177,"released_at":178},113472,"v0.17.0","\u003Ch3>\u003Cb>Highlights:\u003C\u002Fb>\u003C\u002Fh3>\r\n\r\n* [Light-weight installation](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftips_and_tricks\u002Ftips_and_tricks.html#lightweight-installation) without UMAP and HDBSCAN by [@MaartenGr](https:\u002F\u002Fgithub.com\u002FMaartenGr) in [#2289](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2289)\r\n* Add [Model2Vec](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fembeddings\u002Fembeddings.html#model2vec) as an embedding backend by [@MaartenGr](https:\u002F\u002Fgithub.com\u002FMaartenGr) in [#2245](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2245)\r\n* Add LiteLLM as a representation model by [@MaartenGr](https:\u002F\u002Fgithub.com\u002FMaartenGr) in [#2213](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2213)\r\n* [Interactive 
DataMapPlot](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fvisualization\u002Fvisualize_documents.html#interactive-datamapplot) by [@MaartenGr](https:\u002F\u002Fgithub.com\u002FMaartenGr) in [#2287](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2287)\r\n\r\n\u003Ch3>\u003Cb>Fixes:\u003C\u002Fb>\u003C\u002Fh3>\r\n\r\n* Lightweight installation: use safetensors without torch by [@hedgeho](https:\u002F\u002Fgithub.com\u002Fhedgeho) in [#2306](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2306)\r\n* Fix missing links by [@MaartenGr](https:\u002F\u002Fgithub.com\u002FMaartenGr) in [#2305](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2305)\r\n* Set up pre-commit hooks by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2283](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2283)\r\n* Fix handling OpenAI returning None objects by [@jeaninejuliettes](https:\u002F\u002Fgithub.com\u002Fjeaninejuliettes) in [#2280](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2280)\r\n* Add support for python 3.13 by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2173](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2173)\r\n* Added system prompts by [@Leo-LiHao](https:\u002F\u002Fgithub.com\u002FLeo-LiHao) in [#2145](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2145)\r\n* More documentation for topic reduction by [@MaartenGr](https:\u002F\u002Fgithub.com\u002FMaartenGr) in [#2260](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2260)\r\n* Drop support for python 3.8 by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2243](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2243)\r\n* Fixed online topic modeling on GPU by 
[@SSivakumar12](https:\u002F\u002Fgithub.com\u002FSSivakumar12) in [#2181](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2181)\r\n* Fixed hierarchical cluster visualization by [@PipaFlores](https:\u002F\u002Fgithub.com\u002FPipaFlores) in [#2191](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2191)\r\n* Remove duplicated phrase by [@AndreaFrancis](https:\u002F\u002Fgithub.com\u002FAndreaFrancis) in [#2197](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2197)\r\n\r\n![2025-03-1916-30-14online-video-cutter com-ezgif com-optimize (2)](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002Fb5f4dfdd-c376-4ec3-be61-2d4504f48691)\r\n\r\n## Model2Vec\r\n\r\nWith Model2Vec, we now have a very interesting pipeline for light-weight embeddings. Combined with the light-weight installation, you can now run BERTopic without using pytorch!\r\n\r\nInstallation is straightforward:\r\n\r\n```\r\npip install --no-deps bertopic\r\npip install --upgrade numpy pandas scikit-learn tqdm plotly pyyaml\r\n```\r\n\r\nThis will install BERTopic even without UMAP or HDBSCAN, so you can use other techniques instead. If these are not installed, then it uses PCA with scikit-learn's HDBSCAN instead. 
You can install them, together with Model2Vec:\r\n\r\n```\r\npip install model2vec umap-learn hdbscan\r\n```\r\n\r\nThen, creating a BERTopic model is as straightforward as you are used to:\r\n\r\n```python\r\nfrom bertopic import BERTopic\r\nfrom model2vec import StaticModel\r\n\r\n# Model2Vec\r\nembedding_model = StaticModel.from_pretrained(\"minishlab\u002Fpotion-base-8M\")\r\n\r\n# BERTopic\r\ntopic_model = BERTopic(embedding_model=embedding_model)\r\n```\r\n\r\n## DataMapPlot\r\n\r\nTo use the interactive version of DataMapPlot, you only need to run the following:\r\n\r\n```python\r\nfrom umap import UMAP\r\n\r\n# Reduce your embeddings to 2-dimensions\r\nreduced_embeddings = UMAP(n_neighbors=10, n_components=2, min_dist=0.0, metric='cosine').fit_transform(embeddings)\r\n\r\n# Create an interactive DataMapPlot figure\r\ntopic_model.visualize_document_datamap(docs, reduced_embeddings=reduced_embeddings, interactive=True)\r\n```","2025-03-19T17:02:35",{"id":180,"version":181,"summary_zh":182,"released_at":183},113473,"v0.16.4","\u003Ch1>\u003Cb>Fixes\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Fix ValueError in Guided Topic Modeling by [@RTChou](https:\u002F\u002Fgithub.com\u002FRTChou) in [#2115](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2115)\r\n* Fix saving BERTopic when c-TF-IDF is None by [@sete39](https:\u002F\u002Fgithub.com\u002Fsete39) in [#2112](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2112)\r\n* Fix `KeyError: 'topics_from'` in [#2101](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2101)\r\n* Fix issues related to Zero-shot Topic Modeling by [@ianrandman](https:\u002F\u002Fgithub.com\u002Fianrandman) in [#2105](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2105)\r\n* Fix regex matching being used in PartOfSpeech representation model by [@woranov](https:\u002F\u002Fgithub.com\u002Fworanov) in 
[#2138](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2138)\r\n* Update typo by [@saikumaru](https:\u002F\u002Fgithub.com\u002Fsaikumaru) in [#2162](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2162)\r\n\r\n\r\n","2024-10-09T10:57:08",{"id":185,"version":186,"summary_zh":187,"released_at":188},113474,"v0.16.3","\u003Ch1>\u003Cb>Highlights\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Simplify zero-shot topic modeling by [@ianrandman](https:\u002F\u002Fgithub.com\u002Fianrandman) in [#2060](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2060)\r\n* Option to choose between c-TF-IDF and Topic Embeddings in many functions by [@azikoss](https:\u002F\u002Fgithub.com\u002Fazikoss) in [#1894](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1894)\r\n    * Use the `use_ctfidf` parameter in the following functions to choose between c-TF-IDF and topic embeddings:\r\n        * `hierarchical_topics`, `reduce_topics`, `visualize_hierarchy`, `visualize_heatmap`, `visualize_topics`\r\n* Linting with Ruff by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2033](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2033)\r\n* Switch from setup.py to pyproject.toml by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#1978](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1978)\r\n* In multi-aspect context, allow Main model to be chained by [@ddicato](https:\u002F\u002Fgithub.com\u002Fddicato) in [#2002](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2002)\r\n\r\n\u003Ch1>\u003Cb>Fixes\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Added templates for [issues](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Ftree\u002Fmaster\u002F.github\u002FISSUE_TEMPLATE) and [pull 
requests](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fblob\u002Fmaster\u002F.github\u002FPULL_REQUEST_TEMPLATE.md)\r\n* Update River documentation example by [@Proteusiq](https:\u002F\u002Fgithub.com\u002FProteusiq) in [#2004](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2004)\r\n* Fix PartOfSpeech reproducibility by [@Greenpp](https:\u002F\u002Fgithub.com\u002FGreenpp) in [#1996](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1996)\r\n* Fix PartOfSpeech ignoring first word by [@Greenpp](https:\u002F\u002Fgithub.com\u002FGreenpp) in [#2024](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2024)\r\n* Make sklearn embedding backend auto-select more cautious by [@freddyheppell](https:\u002F\u002Fgithub.com\u002Ffreddyheppell) in [#1984](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1984)\r\n* Fix typos by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#1974](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1974)\r\n* Fix hierarchical_topics(...) 
when the distances between three clusters are the same by [@azikoss](https:\u002F\u002Fgithub.com\u002Fazikoss) in [#1929](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1929)\r\n* Fixes to chain strategy example in outlier_reduction.md by [@reuning](https:\u002F\u002Fgithub.com\u002Freuning) in [#2065](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2065)\r\n* Remove obsolete flake8 config and update line length by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#2066](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F2066)\r\n\r\n","2024-07-22T08:25:06",{"id":190,"version":191,"summary_zh":192,"released_at":193},113475,"v0.16.2","\u003Ch1>\u003Cb>Fixes:\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Fix issue with zeroshot topic modeling missing outlier [#1957](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1957)\r\n* Bump github actions versions by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#1941](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1941)\r\n* Drop support for python 3.7 by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#1949](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1949)\r\n* Add testing python 3.10+ in Github actions by [@afuetterer](https:\u002F\u002Fgithub.com\u002Fafuetterer) in [#1968](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1968)\r\n* Speed up fitting CountVectorizer by [@dannywhuang](https:\u002F\u002Fgithub.com\u002Fdannywhuang) in [#1938](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1938)\r\n* Fix `transform` when using cuML HDBSCAN by [@beckernick](https:\u002F\u002Fgithub.com\u002Fbeckernick) in [#1960](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1960)\r\n* Fix wrong link in algorithm documentation by 
[@naeyn](https:\u002F\u002Fgithub.com\u002Fnaeyn) in [#1970](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1970)","2024-05-12T09:32:29",{"id":195,"version":196,"summary_zh":197,"released_at":198},113476,"v0.16.1","\r\n\u003Ch1>\u003Cb>Highlights:\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Add Quantized [LLM Tutorial](https:\u002F\u002Fcolab.research.google.com\u002Fdrive\u002F1DdSHvVPJA3rmNfBWjCo2P1E9686xfxFx?usp=sharing)\r\n* Add optional [datamapplot](https:\u002F\u002Fgithub.com\u002FTutteInstitute\u002Fdatamapplot) visualization using `topic_model.visualize_document_datamap` by [@lmcinnes](https:\u002F\u002Fgithub.com\u002Flmcinnes) in [#1750](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1750)\r\n* Migrated OpenAIBackend to openai>=1 by [@peguerosdc](https:\u002F\u002Fgithub.com\u002Fpeguerosdc) in [#1724](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1724)\r\n* Add automatic height scaling and font resize by [@ir2718](https:\u002F\u002Fgithub.com\u002Fir2718) in [#1863](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1863)\r\n* Use `[KEYWORDS]` tags with the LangChain representation model by [@mcantimmy](https:\u002F\u002Fgithub.com\u002Fmcantimmy) in [#1871](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1871)\r\n\r\n\r\n\u003Ch3>\u003Cb>Fixes:\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh3>\r\n\r\n* Fixed issue with `.merge_models` seemingly skipping topic [#1898](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1898)\r\n* Fixed Cohere client.embed TypeError [#1904](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1904)\r\n* Fixed `AttributeError: 'TextGeneration' object has no attribute 'random_state'` [#1870](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1870)\r\n* Fixed topic embeddings not properly updated if all outliers were removed 
[#1838](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1838)\r\n* Fixed issue with representation models not properly merging [#1762](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1762)\r\n* Fixed Embeddings not ordered correctly when using `.merge_models` [#1804](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1804)\r\n* Fixed Outlier topic not in the 0th position when using zero-shot topic modeling causing prediction issues (amongst others) [#1804](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1804)\r\n* Fixed Incorrect label in ZeroShot doc SVG [#1732](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1732)\r\n* Fixed MultiModalBackend throws error with clip-ViT-B-32-multilingual-v1 [#1670](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1670)\r\n* Fixed AuthenticationError while using OpenAI() [#1678](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1678)\r\n* Update FAQ on Apple Silicon by [@benz0li](https:\u002F\u002Fgithub.com\u002Fbenz0li) in [#1901](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1901)\r\n* Add documentation DataMapPlot + FAQ for running on Apple Silicon by [@dkapitan](https:\u002F\u002Fgithub.com\u002Fdkapitan) in [#1854](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1854)\r\n* Remove commas from pip install reference in readme by [@luisoala](https:\u002F\u002Fgithub.com\u002Fluisoala) in [#1850](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1850)\r\n* Spelling corrections by [@joouha](https:\u002F\u002Fgithub.com\u002Fjoouha) in [#1801](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1801)\r\n* Replacing the deprecated `text-ada-001` model with the latest `text-embedding-3-small` from OpenAI by [@atmb4u](https:\u002F\u002Fgithub.com\u002Fatmb4u) in 
[#1800](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1800)\r\n* Prevent invalid empty input error when retrieving embeddings with openai backend by [@liaoelton](https:\u002F\u002Fgithub.com\u002Fliaoelton) in [#1827](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1827)\r\n* Remove spurious warning about missing embedding model by [@sliedes](https:\u002F\u002Fgithub.com\u002Fsliedes) in [#1774](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1774)\r\n* Fix type hint in ClassTfidfTransformer constructor [@snape](https:\u002F\u002Fgithub.com\u002Fsnape) in [#1803](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1803)\r\n* Fix typo and simplify wording in OnlineCountVectorizer docstring by [@chrisji](https:\u002F\u002Fgithub.com\u002Fchrisji) in [#1802](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1802)\r\n* Fixed warning when saving a topic model without an embedding model by [@zilch42](https:\u002F\u002Fgithub.com\u002Fzilch42) in [#1740](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1740)\r\n* Fix bug in `TextGeneration` by [@manveersadhal](https:\u002F\u002Fgithub.com\u002Fmanveersadhal) in [#1726](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1726)\r\n* Fix an incorrect link to usecases.md by [@nicholsonjf](https:\u002F\u002Fgithub.com\u002Fnicholsonjf) in [#1731](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1731)\r\n* Prevent `model` argument being passed twice when using `generator_kwargs` in OpenAI by [@ninavandiermen](https:\u002F\u002Fgithub.com\u002Fninavandiermen) in [#1733](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1733)\r\n* Several fixes to the docstrings by [@arpadikuma](https:\u002F\u002Fgithub.com\u002Farpadikuma) in [#1719](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1719)\r\n* 
Remove unused `cluster_df` variable in `hierarchical_topics` by [@shadiakiki1986](https:\u002F\u002Fgithub.com\u002Fshadiakiki1986) in [#1701](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1701)\r\n* Removed redundant quotation mark by [@LawrenceFulton](https:\u002F\u002Fgithub.com\u002FLawrenceFulton) in [#1695](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1695)\r\n* Fix typo in merge models docs by [@zilch42](https:\u002F\u002Fgithub.com\u002Fzilch42) in [#1660](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1660)\r\n","2024-04-21T14:42:38",{"id":200,"version":201,"summary_zh":202,"released_at":203},113477,"v0.16.0","\u003Ch1>\u003Cb>Highlights:\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Merge pre-trained BERTopic models with [**`.merge_models`**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmerge\u002Fmerge.html)\r\n    * Combine models with different representations together!\r\n    * Use this for *incremental\u002Fonline topic modeling* to detect new incoming topics\r\n    * First step towards *federated learning* with BERTopic\r\n* [**Zero-shot**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fzeroshot\u002Fzeroshot.html) Topic Modeling\r\n    * Use a predefined list of topics to assign documents\r\n    * If needed, allows for further exploration of undefined topics\r\n* [**Seed (domain-specific) words**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fseed_words\u002Fseed_words.html) with `ClassTfidfTransformer`\r\n    * Make sure selected words are more likely to end up in the representation without influencing the clustering process\r\n* Added params to [**truncate documents**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Fllm.html#truncating-documents) to length when using LLMs\r\n* Added 
[**LlamaCPP**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Fllm.html#llamacpp) as a representation model\r\n* LangChain: Support for **LCEL Runnables** by [@joshuasundance-swca](https:\u002F\u002Fgithub.com\u002Fjoshuasundance-swca) in [#1586](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1586)\r\n* Added `topics` parameter to `.topics_over_time` to select a subset of documents and topics\r\n* Documentation:\r\n    * [Best practices Guide](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fbest_practices\u002Fbest_practices.html)\r\n    * [Llama 2 Tutorial](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Fllm.html#llama-2)\r\n    * [Zephyr Tutorial](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Fllm.html#zephyr-mistral-7b)\r\n    * Improved [embeddings guidance](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fembeddings\u002Fembeddings.html#sentence-transformers) (MTEB)\r\n    * Improved logging throughout the package\r\n* Added support for **Cohere's Embed v3**:\r\n```python\r\ncohere_model = CohereBackend(\r\n    client,\r\n    embedding_model=\"embed-english-v3.0\",\r\n    embed_kwargs={\"input_type\": \"clustering\"}\r\n)\r\n```\r\n\r\n\u003Ch1>\u003Cb>Fixes:\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Fixed n-gram Keywords need delimiting in OpenAI() [#1546](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1546)\r\n* Fixed OpenAI v1.0 issues [#1629](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1629)\r\n* Improved documentation\u002Flogging to address [#1589](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1589), [#1591](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1591)\r\n* Fixed engine support for Azure 
OpenAI embeddings [#1577](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1577)\r\n* Fixed OpenAI Representation: KeyError: 'content' [#1570](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1570)\r\n* Fixed Loading topic model with multiple topic aspects changes their format [#1487](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1487)\r\n*  Fix expired link in algorithm.md by [@burugaria7](https:\u002F\u002Fgithub.com\u002Fburugaria7) in [#1396](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1396)\r\n*  Fix guided topic modeling in cuML's UMAP by [@stevetracvc](https:\u002F\u002Fgithub.com\u002Fstevetracvc) in [#1326](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1326)\r\n*  OpenAI: Allow retrying on Service Unavailable errors by [@agamble](https:\u002F\u002Fgithub.com\u002Fagamble) in [#1407](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1407)\r\n*  Fixed parameter naming for HDBSCAN in best practices by [@rnckp](https:\u002F\u002Fgithub.com\u002Frnckp) in [#1408](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1408)\r\n*  Fixed typo in tips_and_tricks.md by [@aronnoordhoek](https:\u002F\u002Fgithub.com\u002Faronnoordhoek) in [#1446](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1446)\r\n*  Fix typos in documentation by [@bobchien](https:\u002F\u002Fgithub.com\u002Fbobchien) in [#1481](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1481)\r\n*  Fix IndexError when all outliers are removed by reduce_outliers by [@Aratako](https:\u002F\u002Fgithub.com\u002FAratako) in [#1466](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1466)\r\n*  Fix TypeError on reduce_outliers \"probabilities\" by [@ananaphasia](https:\u002F\u002Fgithub.com\u002Fananaphasia) in 
[#1501](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1501)\r\n*  Add new line to fix markdown bullet point formatting by [@saeedesmaili](https:\u002F\u002Fgithub.com\u002Fsaeedesmaili) in [#1519](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1519)\r\n*  Update typo in topicrepresentation.md by [@oliviercaron](https:\u002F\u002Fgithub.com\u002Foliviercaron) in [#1537](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1537)\r\n*  Fix typo in FAQ by [@sandijou](https:\u002F\u002Fgithub.com\u002Fsandijou) in [#1542](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1542)\r\n*  Fixed typos in best practices documentation by [@poomkusa](https:\u002F\u002Fgithub.com\u002Fpoomkusa) in [#1557](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1557)\r\n*  Correct TopicMapper doc example by [@chrisji](https:\u002F\u002Fgithub.com\u002Fchrisji) in [#1637](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1637)\r\n* Fix typing in hierarchical_topics by [@dschwalm](htt","2023-11-27T08:06:29",{"id":205,"version":206,"summary_zh":207,"released_at":208},113478,"v0.15.0","\u003Ch1>\u003Cb>Highlights:\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* [**Multimodal**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultimodal\u002Fmultimodal.html) Topic Modeling\r\n    * Train your topic modeling on text, images, or images and text!\r\n    * Use the `bertopic.backend.MultiModalBackend` to embed images, text, both or even caption images!\r\n* [**Multi-Aspect**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultiaspect\u002Fmultiaspect.html) Topic Modeling\r\n    * Create multiple topic representations simultaneously \r\n* Improved [**Serialization**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fserialization\u002Fserialization.html) options\r\n   
 * Push your model to the HuggingFace Hub with `.push_to_hf_hub`\r\n    * Safer, smaller and more flexible serialization options with `safetensors`\r\n    * Thanks to a great collaboration with HuggingFace and the authors of [BERTransfer](https:\u002F\u002Fgithub.com\u002Fopinionscience\u002FBERTransfer)!\r\n* Added new embedding models\r\n    * OpenAI: `bertopic.backend.OpenAIBackend`\r\n    * Cohere: `bertopic.backend.CohereBackend`\r\n* Added example of [summarizing topics](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html#summarization) with OpenAI's GPT-models\r\n* Added `nr_docs` and `diversity` parameters to OpenAI and Cohere representation models\r\n* Use `custom_labels=\"Aspect1\"` to use the aspect labels for visualizations instead\r\n* Added cuML support for probability calculation in `.transform`\r\n* Updated **topic embeddings**\r\n    * Centroids by default and c-TF-IDF weighted embeddings for `partial_fit` and `.update_topics`\r\n* Added `exponential_backoff` parameter to `OpenAI` model\r\n\r\n\u003Ch2>\u003Cb>Fixes:\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh2>\r\n\r\n* Fixed custom prompt not working in `TextGeneration` \r\n* Fixed [#1142](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1142)\r\n* Add additional logic to handle cupy arrays by [@metasyn](https:\u002F\u002Fgithub.com\u002Fmetasyn) in [#1179](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1179)\r\n* Fix hierarchy viz and handle any form of distance matrix by [@elashrry](https:\u002F\u002Fgithub.com\u002Felashrry) in [#1173](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1173)\r\n* Updated languages list by [@sam9111](https:\u002F\u002Fgithub.com\u002Fsam9111) in [#1099](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1099)\r\n* Added level_scale argument to visualize_hierarchical_documents by 
[@zilch42](https:\u002F\u002Fgithub.com\u002Fzilch42) in [#1106](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1106)\r\n* Fix inconsistent naming by [@rolanderdei](https:\u002F\u002Fgithub.com\u002Frolanderdei) in [#1073](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1073)\r\n\r\n\u003Ch1>\u003Cb>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultimodal\u002Fmultimodal.html\">Multimodal Topic Modeling\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\nWith v0.15, we can now perform multimodal topic modeling in BERTopic! The most basic example of multimodal topic modeling in BERTopic is when you have images that accompany your documents. This means that it is expected that each document has an image and vice versa. Instagram pictures, for example, almost always have some description attached to them. \r\n\r\nIn this example, we are going to use images from `flickr` that each have a caption associated with it: \r\n\r\n```python\r\n# NOTE: This requires the `datasets` package which you can \r\n# install with `pip install datasets`\r\nfrom datasets import load_dataset\r\n\r\nds = load_dataset(\"maderix\u002Fflickr_bw_rgb\")\r\nimages = ds[\"train\"][\"image\"]\r\ndocs = ds[\"train\"][\"caption\"]\r\n```\r\n\r\nThe `docs` variable contains the captions for each image in `images`. 
We can now use these variables to run our multimodal example:\r\n\r\n```python\r\nfrom bertopic import BERTopic\r\nfrom bertopic.representation import VisualRepresentation\r\n\r\n# Additional ways of representing a topic\r\nvisual_model = VisualRepresentation()\r\n\r\n# Make sure to add the `visual_model` to a dictionary\r\nrepresentation_model = {\r\n   \"Visual_Aspect\":  visual_model,\r\n}\r\ntopic_model = BERTopic(representation_model=representation_model, verbose=True)\r\n```\r\n\r\nWe can now access our image representations for each topic with `topic_model.topic_aspects_[\"Visual_Aspect\"]`.\r\nIf you want an overview of the topic images together with their textual representations in jupyter, you can run the following:\r\n\r\n```python\r\nimport base64\r\nfrom io import BytesIO\r\nfrom IPython.display import HTML\r\n\r\ndef image_base64(im):\r\n    if isinstance(im, str):\r\n        im = get_thumbnail(im)\r\n    with BytesIO() as buffer:\r\n        im.save(buffer, 'jpeg')\r\n        return base64.b64encode(buffer.getvalue()).decode()\r\n\r\n\r\ndef image_formatter(im):\r\n    return f'\u003Cimg src=\"data:image\u002Fjpeg;base64,{image_base64(im)}\">'\r\n\r\n# Extract dataframe\r\ndf = topic_model.get_topic_info().drop(\"Representative_Docs\", 1).drop(\"Name\", 1)\r\n\r\n# Visualize the images\r\nHTML(df.to_html(formatters={'Visual_Aspect': image_formatter}, escape=False))\r\n```\r\n\r\n![images_and_text](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fassets\u002F25746895\u002F3a741e2b-5810-4865-9664-0c6bb24ca3f9)\r\n\r\n\r\n\u003Ch1>\u003Cb>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmultiaspect\u002Fmultiaspect.html\">","2023-05-30T16:49:53",{"id":210,"version":211,"summary_zh":212,"released_at":213},113479,"v0.14.1","\u003Ch1>\u003Cb>Features\u002FFixes\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Use 
[**ChatGPT**](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html#chatgpt) to create topic representations!\r\n* Added `delay_in_seconds` parameter to OpenAI and Cohere representation models for throttling the API\r\n    * Setting this between 5 and 10 allows trial users to use the API more easily without hitting RateLimitErrors\r\n* Fixed missing `title` param to visualization methods\r\n* Fixed probabilities not correctly aligning ([#1024](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F1024))\r\n* Fix typo in textgenerator by [@dkopljar27](https:\u002F\u002Fgithub.com\u002Fdkopljar27) in [#1002](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F1002)\r\n\r\n\u003Ch1>\u003Cb>ChatGPT\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\nWithin OpenAI's API, the ChatGPT models use a different API structure compared to the GPT-3 models. \r\nIn order to use ChatGPT with BERTopic, we need to define the model and make sure to set `chat=True`:\r\n\r\n```python\r\nimport openai\r\nfrom bertopic import BERTopic\r\nfrom bertopic.representation import OpenAI\r\n\r\n# Create your representation model\r\nopenai.api_key = MY_API_KEY\r\nrepresentation_model = OpenAI(model=\"gpt-3.5-turbo\", delay_in_seconds=10, chat=True)\r\n\r\n# Use the representation model in BERTopic on top of the default pipeline\r\ntopic_model = BERTopic(representation_model=representation_model)\r\n```\r\n\r\nPrompting with ChatGPT is very satisfying and can be customized in BERTopic by using certain tags. \r\nThere are currently two tags, namely `\"[KEYWORDS]\"` and `\"[DOCUMENTS]\"`. \r\nThese tags indicate where in the prompt they are to be replaced with a topic's keywords and top 4 most representative documents respectively. 
\r\nFor example, if we have the following prompt:\r\n\r\n```python\r\nprompt = \"\"\"\r\nI have a topic that contains the following documents: \\n[DOCUMENTS]\r\nThe topic is described by the following keywords: [KEYWORDS]\r\n\r\nBased on the information above, extract a short topic label in the following format:\r\ntopic: \u003Ctopic label>\r\n\"\"\"\r\n```\r\n\r\nthen that will be rendered as follows and passed to OpenAI's API:\r\n\r\n```python\r\n\"\"\"\r\nI have a topic that contains the following documents: \r\n- Our videos are also made possible by your support on patreon.co.\r\n- If you want to help us make more videos, you can do so on patreon.com or get one of our posters from our shop.\r\n- If you want to help us make more videos, you can do so there.\r\n- And if you want to support us in our endeavor to survive in the world of online video, and make more videos, you can do so on patreon.com.\r\n\r\nThe topic is described by the following keywords: videos video you our support want this us channel patreon make on we if facebook to patreoncom can for and more watch \r\n\r\nBased on the information above, extract a short topic label in the following format:\r\ntopic: \u003Ctopic label>\r\n\"\"\"\r\n```\r\n\r\n> **Note**\r\n> Whenever you create a custom prompt, it is important to add \r\n> ```\r\n> Based on the information above, extract a short topic label in the following format:\r\n> topic: \u003Ctopic label>\r\n> ```\r\n> at the end of your prompt as BERTopic extracts everything that comes after `topic: `. Having \r\n> said that, if `topic: ` is not in the output, then it will simply extract the entire response, so \r\n> feel free to experiment with the prompts. 
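The tag substitution described above can be sketched in a few lines of plain Python. This is an illustration only, not BERTopic's internal code; the `render_prompt` helper and the bulleted document formatting are assumptions made for the sketch:

```python
# Simplified sketch of how BERTopic-style prompt tags could be filled in.
# Illustrative only -- not BERTopic's actual implementation.
def render_prompt(prompt, keywords, documents):
    # Keywords become a single comma-separated string
    keyword_str = ", ".join(keywords)
    # Representative documents are rendered as a bulleted list
    doc_str = "\n".join(f"- {doc}" for doc in documents)
    return prompt.replace("[KEYWORDS]", keyword_str).replace("[DOCUMENTS]", doc_str)

prompt = """I have a topic that contains the following documents:
[DOCUMENTS]
The topic is described by the following keywords: [KEYWORDS]

Based on the information above, extract a short topic label in the following format:
topic: <topic label>"""

rendered = render_prompt(
    prompt,
    keywords=["videos", "video", "support", "patreon"],
    documents=["Our videos are also made possible by your support on patreon.co."],
)
print(rendered)
```

Both tags are replaced before the prompt is sent to the API, which is why a custom prompt without them would send no topic information at all.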
\r\n\r\n","2023-03-02T13:19:22",{"id":215,"version":216,"summary_zh":217,"released_at":218},113480,"v0.14.0","\u003Ch1>\u003Cb>Highlights\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Fine-tune [topic representations](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html) with `bertopic.representation`\r\n    * Diverse range of models, including KeyBERT, MMR, POS, Transformers, OpenAI, and more!\r\n    * Create your own prompts for text generation models, like GPT3:\r\n        * Use `\"[KEYWORDS]\"` and `\"[DOCUMENTS]\"` in the prompt to decide where the keywords and set of representative documents need to be inserted.\r\n    * Chain models to perform fine-grained fine-tuning\r\n    * Create and customize your representation model\r\n* Improved the topic reduction technique when using `nr_topics=int`\r\n* Added `title` parameters for all graphs ([#800](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F800))\r\n\r\n\r\n\u003Ch1>\u003Cb>Fixes\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Improve documentation ([#837](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F837), [#769](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F769), [#954](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F954), [#912](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F912), [#911](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F911))\r\n* Bump pyyaml ([#903](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F903))\r\n* Fix large number of representative docs ([#965](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F965))\r\n* Prevent stochastic behavior in `.visualize_topics` ([#952](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F952))\r\n* Add custom labels parameter to 
`.visualize_topics` ([#976](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fissues\u002F976))\r\n* Fix cuML HDBSCAN type checks by [@FelSiq](https:\u002F\u002Fgithub.com\u002FFelSiq) in [#981](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FBERTopic\u002Fpull\u002F981)\r\n\r\n\u003Ch2>\u003Cb>API Changes\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh2>\r\n\r\n* The `diversity` parameter was removed in favor of `bertopic.representation.MaximalMarginalRelevance`\r\n* The `representation_model` parameter was added to `bertopic.BERTopic`\r\n\r\n\u003Cbr>  \r\n\r\n\u003Ch1>\u003Cb>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html\">Representation Models\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\nFine-tune the c-TF-IDF representation with a variety of models. Whether that is through a KeyBERT-Inspired model or GPT-3, the choice is up to you! \r\n\r\n[\u003Ciframe width=\"1200\" height=\"500\" src=\"https:\u002F\u002Fuser-images.githubusercontent.com\u002F25746895\u002F218417067-a81cc179-9055-49ba-a2b0-f2c1db535159.mp4\r\n\" title=\"BERTopic Overview\" frameborder=\"0\" allow=\"accelerometer; autoplay; clipboard-write; encrypted-media; gyroscope; picture-in-picture\" allowfullscreen>\u003C\u002Fiframe>](https:\u002F\u002Fuser-images.githubusercontent.com\u002F25746895\u002F218417067-a81cc179-9055-49ba-a2b0-f2c1db535159.mp4)\r\n\r\n\u003Cbr>  \r\n\r\n\r\n\u003Ch2>\u003Cb>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html#keybertinspired\">KeyBERTInspired\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh2>\r\n\r\nThe algorithm follows some principles of [KeyBERT](https:\u002F\u002Fgithub.com\u002FMaartenGr\u002FKeyBERT) but does some optimization in order to speed up inference. 
Usage is straightforward:\r\n\r\n![keybertinspired](https:\u002F\u002Fuser-images.githubusercontent.com\u002F25746895\u002F216336376-d2c4e5d6-6cf7-435c-904c-fc195aae7dcd.svg)\r\n\r\n```python\r\nfrom bertopic.representation import KeyBERTInspired\r\nfrom bertopic import BERTopic\r\n# Create your representation model\r\nrepresentation_model = KeyBERTInspired()\r\n# Use the representation model in BERTopic on top of the default pipeline\r\ntopic_model = BERTopic(representation_model=representation_model)\r\n```\r\n\r\n![keybert](https:\u002F\u002Fuser-images.githubusercontent.com\u002F25746895\u002F218417161-bfd5980e-43c7-498a-904a-b6018ba58d45.svg)\r\n\r\n\u003Ch2>\u003Cb>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html#partofspeech\">PartOfSpeech\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh2>\r\n\r\nOur candidate topics, as extracted with c-TF-IDF, do not take into account a keyword's part of speech as extracting noun-phrases from all documents can be computationally quite expensive. Instead, we can leverage c-TF-IDF to perform part of speech on a subset of keywords and documents that best represent a topic. 
\r\n\r\n![partofspeech](https:\u002F\u002Fuser-images.githubusercontent.com\u002F25746895\u002F216336534-48ff400e-72e1-4c50-9030-414576bac01e.svg)\r\n\r\n\r\n```python\r\nfrom bertopic.representation import PartOfSpeech\r\nfrom bertopic import BERTopic\r\n# Create your representation model\r\nrepresentation_model = PartOfSpeech(\"en_core_web_sm\")\r\n# Use the representation model in BERTopic on top of the default pipeline\r\ntopic_model = BERTopic(representation_model=representation_model)\r\n```\r\n\r\n![pos](https:\u002F\u002Fuser-images.githubusercontent.com\u002F25746895\u002F218417198-41c19b5c-251f-43c1-bfe2-0a480731565a.svg)\r\n\r\n\r\n\u003Ch2>\u003Cb>\u003Ca href=\"https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Frepresentation\u002Frepresentation.html#maximalmarginalrelevance\">MaximalMarginalRelevance\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh2>\r\n\r\nWhen we calculate the weights of keywords, we typically do not consider whether we already have similar keywords in our topic. Words like \"car\" and \"cars\" \r\nessentially represent the same information and are often redundant. 
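The idea behind maximal marginal relevance can be sketched with a toy implementation: each candidate keyword is scored by its relevance to the topic minus a weighted penalty for similarity to keywords already picked. This follows the standard MMR formulation rather than BERTopic's exact code, and the keyword vectors below are made up for illustration:

```python
# Toy sketch of Maximal Marginal Relevance (MMR) for keyword selection.
# Not BERTopic's implementation; vectors are invented for illustration.
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def mmr(topic_vec, candidates, top_n=2, diversity=0.5):
    # candidates: dict of keyword -> embedding vector
    selected = []
    while len(selected) < top_n and len(selected) < len(candidates):
        best_word, best_score = None, -float("inf")
        for word, vec in candidates.items():
            if word in selected:
                continue
            relevance = cosine(vec, topic_vec)
            # Penalize words similar to already-selected keywords
            redundancy = max(
                (cosine(vec, candidates[w]) for w in selected), default=0.0
            )
            score = (1 - diversity) * relevance - diversity * redundancy
            if score > best_score:
                best_word, best_score = word, score
        selected.append(best_word)
    return selected

# "car" and "cars" point in nearly the same direction, so once one of
# them is chosen, MMR prefers the distinct keyword "engine".
candidates = {
    "car": [1.0, 0.0],
    "cars": [0.99, 0.05],
    "engine": [0.4, 0.9],
}
print(mmr(topic_vec=[1.0, 0.2], candidates=candidates, top_n=2))
# → ['cars', 'engine']
```

Raising `diversity` pushes the selection further toward dissimilar keywords; lowering it approaches plain relevance ranking.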
We can use `MaximalMarginalRelevance` to improve diversi","2023-02-14T13:48:22",{"id":220,"version":221,"summary_zh":222,"released_at":223},113481,"v0.13.0","\r\n\u003Ch1>\u003Cb>Highlights\u003C\u002Fa>\u003C\u002Fb>\u003C\u002Fh1>\r\n\r\n* Calculate [topic distributions](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fdistribution\u002Fdistribution.html) with `.approximate_distribution` regardless of the cluster model used\r\n    * Generates topic distributions on a document- and token-levels\r\n    * Can be used for any document regardless of its size!\r\n* [Fully supervised BERTopic](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fsupervised\u002Fsupervised.html)\r\n    * You can now use a classification model for the clustering step instead to create a fully supervised topic model\r\n* [Manual topic modeling](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Fmanual\u002Fmanual.html)\r\n    * Generate topic representations from labels directly\r\n    * Allows for skipping the embedding and clustering steps in order to go directly to the topic representation step\r\n* [Reduce outliers](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Foutlier_reduction\u002Foutlier_reduction.html) with 4 different strategies using `.reduce_outliers`\r\n* Install BERTopic without `SentenceTransformers` for a [lightweight package](https:\u002F\u002Fmaartengr.github.io\u002FBERTopic\u002Fgetting_started\u002Ftips_and_tricks\u002Ftips_and_tricks.html#lightweight-installation):\r\n    * `pip install --no-deps bertopic`\r\n    * `pip install --upgrade numpy hdbscan umap-learn pandas scikit-learn tqdm plotly pyyaml`\r\n* Get meta data of trained documents such as topics and probabilities using `.get_document_info(docs)`\r\n* Added more support for cuML's HDBSCAN\r\n    * Calculate and predict probabilities during `fit_transform`  and `transform` respectively\r\n    * This 
should give a major speed-up when setting `calculate_probabilities=True`
* More images in the documentation, along with many changes, updates, and clarifications
* Get representative documents for non-HDBSCAN models by comparing document and topic c-TF-IDF representations
* Sklearn Pipeline [Embedder](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#scikit-learn-embeddings) by [@koaning](https://github.com/koaning) in [#791](https://github.com/MaartenGr/BERTopic/pull/791)

## Fixes

* Improved `.partial_fit` documentation ([#837](https://github.com/MaartenGr/BERTopic/issues/837))
* Fixed scipy linkage usage ([#807](https://github.com/MaartenGr/BERTopic/issues/807))
* Fixed shifted heatmap ([#782](https://github.com/MaartenGr/BERTopic/issues/782))
* Fixed SpaCy backend ([#744](https://github.com/MaartenGr/BERTopic/issues/744))
* Fixed representative docs for small clusters (<3 documents) ([#703](https://github.com/MaartenGr/BERTopic/issues/703))
* Typo fixed by [@timpal0l](https://github.com/timpal0l) in [#734](https://github.com/MaartenGr/BERTopic/pull/734)
* Typo fixed by [@srulikbd](https://github.com/srulikbd) in [#842](https://github.com/MaartenGr/BERTopic/pull/842)
* Corrected iframe URLs by [@Mustapha-AJEGHRIR](https://github.com/Mustapha-AJEGHRIR) in [#798](https://github.com/MaartenGr/BERTopic/pull/798)
* Refactored embedding methods by [@zachschillaci27](https://github.com/zachschillaci27) in
[#855](https://github.com/MaartenGr/BERTopic/pull/855)
* Added diversity parameter to the `update_topics()` function by [@anubhabdaserrr](https://github.com/anubhabdaserrr) in [#887](https://github.com/MaartenGr/BERTopic/pull/887)

## [Documentation](https://maartengr.github.io/BERTopic/algorithm/algorithm.html)

Personally, I believe that documentation can be seen as a feature and is an often underestimated aspect of open source. So I went a bit overboard 😅... and created an animation about the three pillars of BERTopic using Manim. There are many other visualizations added, one for each variation of BERTopic, and many smaller changes.

[BERTopic Overview (video)](https://user-images.githubusercontent.com/25746895/205490350-cd9833e7-9cd5-44fa-8752-407d748de633.mp4)

## [Topic Distributions](https://maartengr.github.io/BERTopic/getting_started/distribution/distribution.html)

The difficulty with a cluster-based topic modeling technique is that it does not directly consider that documents may contain multiple topics. With the new release, we can now model the distributions of topics! We even consider that a single word might be related to multiple topics. If a document is a mixture of topics, what is preventing a single word from being the same?
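
To build intuition for how such a distribution could be computed, here is a toy numpy sketch (an illustration only, using random stand-in embeddings and a hypothetical helper, not BERTopic's actual implementation; the real entry point is `.approximate_distribution`): slide a window over the token vectors, score each window against every topic vector with cosine similarity, and sum the scores per topic into a normalized distribution.

```python
import numpy as np

def toy_topic_distribution(token_vecs, topic_vecs, window=3):
    """Toy sketch: slide a window over token vectors, compare each window's
    mean vector to every topic vector via cosine similarity, and sum the
    non-negative similarities per topic into a distribution."""
    scores = np.zeros(len(topic_vecs))
    for start in range(len(token_vecs) - window + 1):
        w = token_vecs[start:start + window].mean(axis=0)
        for i, t in enumerate(topic_vecs):
            sim = w @ t / (np.linalg.norm(w) * np.linalg.norm(t))
            scores[i] += max(sim, 0.0)  # ignore negative similarities
    total = scores.sum()
    return scores / total if total > 0 else scores

rng = np.random.default_rng(0)
token_vecs = rng.normal(size=(10, 8))  # 10 tokens with toy 8-dim embeddings
topic_vecs = rng.normal(size=(3, 8))   # 3 toy topic vectors
dist = toy_topic_distribution(token_vecs, topic_vecs)
print(dist)  # three non-negative weights summing to 1
```

In BERTopic itself this is a single call to `topic_model.approximate_distribution(docs)` on a fitted model.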

![approximate_distribution](https://user-images.githubusercontent.com/25746895/201884093-0855c4fd-0eec-4ea4-bb83-2d109d75c3b4.svg)

To do so, we approximate the distribution of topics in a document by calculating and summing the similarities of tokensets (obtained by applying a sliding window) with the top…

# v0.12.0 (released 2022-09-11)

## Highlights

* Perform [online/incremental topic modeling](https://maartengr.github.io/BERTopic/getting_started/online/online.html) with `.partial_fit`
* Expose the [c-TF-IDF model](https://maartengr.github.io/BERTopic/getting_started/ctfidf/ctfidf.html) for customization with `bertopic.vectorizers.ClassTfidfTransformer`
    * The parameters `bm25_weighting` and `reduce_frequent_words` were added to potentially improve representations
* Expose attributes for easier access to internal data
* Added many tests with the intention of making development a bit more stable

## Documentation

* Major changes to the [Algorithm](https://maartengr.github.io/BERTopic/algorithm/algorithm.html) page of the documentation, which now contains three overviews of the algorithm:
    * [Visual Overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#visual-overview)
    * [Code Overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#code-overview)
    * [Detailed Overview](https://maartengr.github.io/BERTopic/algorithm/algorithm.html#detailed-overview)
* Added an [example](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html#keybert-bertopic) of combining BERTopic with KeyBERT

## Fixes

* Fixed iteratively merging topics ([#632](https://github.com/MaartenGr/BERTopic/issues/632) and [#648](https://github.com/MaartenGr/BERTopic/issues/648))
* Fixed 0th topic not showing up in visualizations ([#667](https://github.com/MaartenGr/BERTopic/issues/667))
* Fixed lowercasing not being optional ([#682](https://github.com/MaartenGr/BERTopic/issues/682))
* Fixed spelling ([#664](https://github.com/MaartenGr/BERTopic/issues/664) and [#673](https://github.com/MaartenGr/BERTopic/issues/673))
* Fixed 0th topic not being shown in `.get_topic_info` by [@oxymor0n](https://github.com/oxymor0n) in [#660](https://github.com/MaartenGr/BERTopic/pull/660)
* Fixed spelling by [@domenicrosati](https://github.com/domenicrosati) in [#674](https://github.com/MaartenGr/BERTopic/pull/674)
* Added custom labels and title options to the barchart by [@leloykun](https://github.com/leloykun) in [#694](https://github.com/MaartenGr/BERTopic/pull/694)

## Online/incremental topic modeling

Online topic modeling (sometimes called "incremental topic modeling") is the ability to learn incrementally from mini-batches of instances. Essentially, it is a way to update your topic model with data on which it was not trained before. In Scikit-Learn, this technique is often exposed through a `.partial_fit` function, which is also used in BERTopic.

At a minimum, the cluster model needs to support a `.partial_fit` function in order to use this feature. The default HDBSCAN model will not work, as it does not support online updating.
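
As a quick sanity check before swapping in a replacement (a small sketch, not part of the release notes), you can simply test a candidate scikit-learn model for a `partial_fit` method; `MiniBatchKMeans` exposes one, while plain `KMeans` does not:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans

# Online learning requires a cluster model that exposes `partial_fit`
for model in (MiniBatchKMeans(n_clusters=5), KMeans(n_clusters=5)):
    name = type(model).__name__
    print(f"{name}: supports partial_fit = {hasattr(model, 'partial_fit')}")
```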

```python
from sklearn.datasets import fetch_20newsgroups
from sklearn.cluster import MiniBatchKMeans
from sklearn.decomposition import IncrementalPCA
from bertopic.vectorizers import OnlineCountVectorizer
from bertopic import BERTopic

# Prepare documents
all_docs = fetch_20newsgroups(subset="all", remove=('headers', 'footers', 'quotes'))["data"]
doc_chunks = [all_docs[i:i+1000] for i in range(0, len(all_docs), 1000)]

# Prepare sub-models that support online learning
umap_model = IncrementalPCA(n_components=5)
cluster_model = MiniBatchKMeans(n_clusters=50, random_state=0)
vectorizer_model = OnlineCountVectorizer(stop_words="english", decay=.01)

topic_model = BERTopic(umap_model=umap_model,
                       hdbscan_model=cluster_model,
                       vectorizer_model=vectorizer_model)

# Incrementally fit the topic model by training on 1000 documents at a time
for docs in doc_chunks:
    topic_model.partial_fit(docs)
```

Only the topics for the most recent batch of documents are tracked.
If you want to use online topic modeling not for a streaming setting but merely for low-memory use cases, it is advised to also update the `.topics_` attribute, as variations such as hierarchical topic modeling will not work afterward:

```python
# Incrementally fit the topic model by training on 1000 documents at a time
# and tracking the topics in each iteration
topics = []
for docs in doc_chunks:
    topic_model.partial_fit(docs)
    topics.extend(topic_model.topics_)

topic_model.topics_ = topics
```

## c-TF-IDF

Explicitly define, use, and adjust the `ClassTfidfTransformer` with the new parameters `bm25_weighting` and `reduce_frequent_words` to potentially improve the topic representation:

```python
from bertopic import BERTopic
from bertopic.vectorizers import ClassTfidfTransformer

ctfidf_model = ClassTfidfTransformer(bm25_weighting=True)
topic_model = BERTopic(ctfidf_model=ctfidf_model)
```

## Attributes

After having fitted your BERTopic instance, you can use a number of attributes for quick access to certain information, such as the topic assignment for each document in `topic_model.topics_`.

# v0.11.0 (released 2022-07-11)

## Highlights

* Perform [hierarchical topic modeling](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html) with `.hierarchical_topics`
* Visualize [hierarchical topic representations](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html#visualizations) with `.visualize_hierarchy`
* Extract a [text-based hierarchical topic representation](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html#visualizations) with `.get_topic_tree`
* Visualize [2D
documents](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-documents) with `.visualize_documents()`
* Visualize [2D hierarchical documents](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-hierarchical-documents) with `.visualize_hierarchical_documents()`
* Create [custom labels](https://maartengr.github.io/BERTopic/getting_started/topicrepresentation/topicrepresentation.html#custom-labels) for the topics throughout most visualizations
* Manually [merge topics](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html#merge-topics) with `.merge_topics()`
* Added native [Hugging Face transformers](https://maartengr.github.io/BERTopic/getting_started/embeddings/embeddings.html#hugging-face-transformers) support

## Documentation

* Added an example of finding similar topics between two models to the [tips & tricks](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html) page
* Added a multi-modal example to the [tips & tricks](https://maartengr.github.io/BERTopic/getting_started/tips_and_tricks/tips_and_tricks.html) page

## Fixes

* Fixed support for k-Means in `.visualize_heatmap` ([#532](https://github.com/MaartenGr/BERTopic/issues/532))
* Fixed missing topic 0 in `.visualize_topics` ([#533](https://github.com/MaartenGr/BERTopic/issues/533))
* Fixed inconsistencies in `.get_topic_info` ([#572](https://github.com/MaartenGr/BERTopic/issues/572) and [#581](https://github.com/MaartenGr/BERTopic/issues/581))
* Added `optimal_ordering` parameter to
`.visualize_hierarchy` by [@rafaelvalero](https://github.com/rafaelvalero) in [#390](https://github.com/MaartenGr/BERTopic/pull/390)
* Fixed RuntimeError when used as an sklearn estimator by [@simonfelding](https://github.com/simonfelding) in [#448](https://github.com/MaartenGr/BERTopic/pull/448)
* Fixed typo in visualization documentation by [@dwhdai](https://github.com/dwhdai) in [#475](https://github.com/MaartenGr/BERTopic/pull/475)
* Fixed typos in docstrings by [@xwwwwww](https://github.com/xwwwwww) in [#549](https://github.com/MaartenGr/BERTopic/pull/549)
* Added support for higher Flair versions

## Visualization examples

### Visualize [hierarchical topic representations](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html#visualizations) with `.visualize_hierarchy`:

![image](https://user-images.githubusercontent.com/25746895/175038726-4f67a97a-75f7-473e-bad2-387777d2417b.png)

### Extract a [text-based hierarchical topic representation](https://maartengr.github.io/BERTopic/getting_started/hierarchicaltopics/hierarchicaltopics.html#visualizations) with `.get_topic_tree`:

```text
.
└─atheists_atheism_god_moral_atheist
     ├─atheists_atheism_god_atheist_argument
     │    ├─■──atheists_atheism_god_atheist_argument ── Topic: 21
     │    └─■──br_god_exist_genetic_existence ── Topic: 124
     └─■──moral_morality_objective_immoral_morals ── Topic: 29
```

### Visualize [2D documents](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-documents) with `.visualize_documents()`:

![visualize_documents](https://user-images.githubusercontent.com/25746895/176111881-b8908286-e62b-46fd-9e92-016a4c1a216a.gif)

### Visualize [2D hierarchical documents](https://maartengr.github.io/BERTopic/getting_started/visualization/visualization.html#visualize-hierarchical-documents) with `.visualize_hierarchical_documents()`:

![visualize_hierarchical_documents](https://user-images.githubusercontent.com/25746895/176176556-fe053088-ea9c-43d2-9d42-959e52eefafa.gif)

# v0.10.0 (released 2022-04-30)

## Highlights

* Use any dimensionality reduction technique instead of UMAP:

```python
from bertopic import BERTopic
from sklearn.decomposition import PCA

dim_model = PCA(n_components=5)
topic_model = BERTopic(umap_model=dim_model)
```

* Use any clustering technique instead of HDBSCAN:

```python
from bertopic import BERTopic
from sklearn.cluster import KMeans

cluster_model = KMeans(n_clusters=50)
topic_model = BERTopic(hdbscan_model=cluster_model)
```

## Documentation

* Added a CountVectorizer page with tips and tricks on how to create topic representations that fit your use case
* Added pages on how to use other dimensionality reduction and clustering algorithms
* Added instructions on how to reduce outliers to the FAQ:

```python
import numpy as np

probability_threshold = 0.01
new_topics = [np.argmax(prob) if max(prob) >= probability_threshold else -1 for prob in probs]
```

## Fixes

* Fixed `None` being returned for probabilities when transforming unseen documents
* Replaced all instances of `arg:` with `Arguments:` for consistency
* Before saving a fitted BERTopic instance, we remove the stopwords in the fitted CountVectorizer model, as it can get quite large due to
the number of words that end up in the stopword list when `min_df` is set to a value larger than 1
* Set `"hdbscan>=0.8.28"` to prevent numpy issues
    * Although this was already fixed by the new release of HDBSCAN, it is technically still possible to install 0.8.27 with BERTopic, which leads to these numpy issues
* Updated the gensim dependency to `>=4.0.0` (#371)
* Fixed topic 0 not appearing in visualizations (#472)
* Fix #506
* Fix #429

# v0.9.4 (released 2021-12-14)

A number of fixes, documentation updates, and small features:

## Highlights

* Exposed the diversity parameter
    * Use `BERTopic(diversity=0.1)` to change how diverse the words in a topic representation are (ranges from 0 to 1)
* Improved stability of topic reduction by only computing the cosine similarity within the c-TF-IDF representations and not the topic embeddings
* Added a property to c-TF-IDF that all IDF values should be positive ([#351](https://github.com/MaartenGr/BERTopic/issues/351))
* Major [documentation](https://maartengr.github.io/BERTopic/) overhaul (mkdocs, tutorials, FAQ, images, etc.) ([#330](https://github.com/MaartenGr/BERTopic/issues/330))
* Additional logging for `.transform` ([#356](https://github.com/MaartenGr/BERTopic/issues/356))

## Fixes

* Dropped Python 3.6 ([#333](https://github.com/MaartenGr/BERTopic/issues/333))
* Relaxed the plotly dependency ([#88](https://github.com/MaartenGr/BERTopic/issues/88))
* Improved stability of `.visualize_barchart()` and `.visualize_hierarchy()`

# v0.9.3 (released 2021-10-17)

Fixes for #282, #285, and #288.

## Fixes

* #282
    * As it turns out, the old implementation of topic mapping was still present in the `transform` function
* #285
    * Fixed getting all representative docs
* #288
    * A recent issue with the `pyyaml` package that can occur in Google Colab
* Removed the `YAMLLoadWarning` shown each time BERTopic is imported:

```python
import yaml
yaml._warnings_enabled["YAMLLoadWarning"] = False
```

# v0.9.2 (released 2021-10-12)

A release focused on algorithmic optimization and fixing several issues.

## Highlights

* Updated the non-multilingual `paraphrase-*` models to the `all-*` models due to improved [performance](https://www.sbert.net/docs/pretrained_models.html)
* Reduced the RAM needed for c-TF-IDF top-30-word [extraction](https://stackoverflow.com/questions/49207275/finding-the-top-n-values-in-a-row-of-a-scipy-sparse-matrix)

## Fixes

* Fixed topic mapping
    * When reducing the number of topics, these need to be mapped to the correct input/output, which had some issues in the previous version
    * A new class was created to track these mappings regardless of how many times they are executed
    * In other words, you can iteratively reduce the number of topics after training the model without needing to continuously retrain it
* Fixed typo on the embeddings page ([#200](https://github.com/MaartenGr/BERTopic/issues/200))
* Fixed link in README ([#233](https://github.com/MaartenGr/BERTopic/issues/233))
* Fixed documentation for `.visualize_term_rank()` ([#253](https://github.com/MaartenGr/BERTopic/issues/253))
* Fixed getting correct representative docs ([#258](https://github.com/MaartenGr/BERTopic/issues/258))
* Updated the [memory FAQ](https://maartengr.github.io/BERTopic/faq.html#i-am-facing-memory-issues-help) with the [HDBSCAN PR](https://github.com/MaartenGr/BERTopic/issues/151)

# v0.9.1 (released 2021-09-01)

## Fixes

* Fixed TypeError when auto-reducing topics ([#210](https://github.com/MaartenGr/BERTopic/issues/210))
* Fixed mapping representative docs when reducing topics ([#208](https://github.com/MaartenGr/BERTopic/issues/208))
* Fixed visualization issues with probabilities ([#205](https://github.com/MaartenGr/BERTopic/issues/205))
* Fixed missing `normalize_frequency` param in plots ([#213](https://github.com/MaartenGr/BERTopic/issues/213))