[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-gregversteeg--corex_topic":3,"tool-gregversteeg--corex_topic":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":98,"env_os":99,"env_gpu":100,"env_ram":100,"env_deps":101,"category_tags":106,"github_topics":107,"view_count":23,"oss_zip_url":82,"oss_zip_packed_at":82,"status":16,"created_at":113,"updated_at":114,"faqs":115,"releases":146},3440,"gregversteeg\u002Fcorex_topic","corex_topic","Hierarchical unsupervised and semi-supervised topic models for sparse count data with CorEx","corex_topic 是一款基于 CorEx（相关性解释）算法的层次化主题建模工具，专为处理稀疏计数数据而设计。它能够帮助用户从大量文档中自动提取出信息量丰富且结构清晰的主题，有效解决了传统模型在缺乏标注数据时难以捕捉深层语义关联，或无法灵活融入专家知识的痛点。\n\n这款工具非常适合自然语言处理研究人员、数据科学家以及需要深入挖掘文本数据的开发者使用。其核心亮点在于极高的灵活性：既支持完全无监督的自动探索，也允许通过“锚点词”进行半监督引导，让用户能将领域知识轻松融入模型，从而精准控制主题的方向与可解释性。此外，corex_topic 还能构建层次化的主题结构，揭示话题间的从属关系，并提供量化指标帮助用户科学地确定最佳主题数量。遵循 scikit-learn 的使用习惯，corex_topic 让复杂的主题分析变得简单高效，是探索文本数据内在结构的得力助手。","# Anchored CorEx: Hierarchical Topic Modeling with Minimal Domain Knowledge\n\n**Cor**relation **Ex**planation (CorEx) is a topic model that yields rich topics that are maximally informative about a set of documents. 
The advantage of using CorEx versus other topic models is that it can be easily run as an unsupervised, semi-supervised, or hierarchical topic model depending on a user's needs. For semi-supervision, CorEx allows a user to integrate their domain knowledge via \"anchor words.\"  This integration is flexible and allows the user to guide the topic model in the direction of those words. This allows for creative strategies that promote topic representation, separability, and aspects. More generally, this implementation of CorEx is good for clustering any sparse binary data.\n\nIf you use this code, please cite the following:\n\n> Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. \"[Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge](https:\u002F\u002Fwww.transacl.org\u002Fojs\u002Findex.php\u002Ftacl\u002Farticle\u002Fview\u002F1244).\" *Transactions of the Association for Computational Linguistics (TACL)*, 2017.\n\n## Getting Started\n\n### Install\n\nPython code for the CorEx topic model can be installed via pip:\n\n```\npip install corextopic\n```\n\n### Example Notebook\n\nFull details on how to retrieve and interpret output from the CorEx topic model are given in the [example notebook](https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic\u002Fblob\u002Fmaster\u002Fcorextopic\u002Fexample\u002Fcorex_topic_example.ipynb). Below we describe how to get CorEx running as an unsupervised, semi-supervised, or hierarchical topic model.\n\n### Running the CorEx Topic Model\n\nGiven a doc-word matrix, the CorEx topic model is easy to run. 
The code follows the scikit-learn fit\u002Ftransform conventions.\n\n```python\nimport numpy as np\nimport scipy.sparse as ss\nfrom corextopic import corextopic as ct\n\n# Define a matrix where rows are samples (docs) and columns are features (words)\nX = np.array([[0,0,0,1,1],\n              [1,1,1,0,0],\n              [1,1,1,1,1]], dtype=int)\n# Sparse matrices are also supported\nX = ss.csr_matrix(X)\n# Word labels for each column can be provided to the model\nwords = ['dog', 'cat', 'fish', 'apple', 'orange']\n# Document labels for each row can be provided\ndocs = ['fruit doc', 'animal doc', 'mixed doc']\n\n# Train the CorEx topic model\ntopic_model = ct.Corex(n_hidden=2)  # Define the number of latent (hidden) topics to use.\ntopic_model.fit(X, words=words, docs=docs)\n```\n\nOnce the model is trained, we can get topics using the ```get_topics()``` function.\n\n```python\ntopics = topic_model.get_topics()\nfor topic_n,topic in enumerate(topics):\n    # w: word, mi: mutual information, s: sign\n    topic = [(w,mi,s) if s > 0 else ('~'+w,mi,s) for w,mi,s in topic]\n    # Unpack the info about the topic\n    words,mis,signs = zip(*topic)    \n    # Print topic\n    topic_str = str(topic_n+1)+': '+', '.join(words)\n    print(topic_str)\n```\n\nSimilarly, the most probable documents for each topic can be accessed through the ``get_top_docs()`` function.\n\n```python\ntop_docs = topic_model.get_top_docs()\nfor topic_n, topic_docs in enumerate(top_docs):\n    docs,probs = zip(*topic_docs)\n    topic_str = str(topic_n+1)+': '+', '.join(docs)\n    print(topic_str)\n```\n\nSummary files and visualizations can be outputted from ```vis_topic.py```.\n\n```python\nfrom corextopic import vis_topic as vt\nvt.vis_rep(topic_model, column_label=words, prefix='topic-model-example')\n```\n\n\n### Choosing the Number of Topics\n\nEach topic explains a certain portion of the *total correlation* (TC). 
We can access the topic TCs through the `tcs` attribute, as well as the overall TC (the sum of the topic TCs) through the `tc` attribute. To determine how many topics we should use, we can look at the distribution of `tcs`. If adding additional topics contributes little to the overall TC, then the topics already explain a large portion of the information in the documents. If this is the case, then we likely do not need more topics in our topic model. So, as a general rule of thumb, continue adding topics until the overall TC plateaus.\n\nWe can also restart the CorEx topic model from several different initializations. This allows CorEx to explore different parts of the topic space and potentially find more informative topics. If we want to follow a strictly quantitative approach to choosing which of the multiple topic model runs we should use, then we can choose the topic model that has the highest TC (the one that explains the most information about the documents).\n\n\n\n\n## Semi-Supervised Topic Modeling\n\n### Using Anchor Words\n\nAnchored CorEx allows a user to integrate their domain knowledge through \"anchor words.\" Anchoring encourages (but does not force) CorEx to search for topics that are related to the anchor words. This helps us find topics of interest, enforce separability of topics, and find aspects around topics.\n\nIf ```words``` is initialized, then it is easy to use anchor words:\n\n```python\ntopic_model.fit(X, words=words, anchors=[['dog','cat'], 'apple'], anchor_strength=2)\n```\n\nThis anchors \"dog\" and \"cat\" to the first topic, and \"apple\" to the second topic. The `anchor_strength` is the amount of weight given to an anchor word relative to all the other words. For example, if `anchor_strength=2`, then CorEx will place twice as much weight on the anchor word when searching for relevant topics. The `anchor_strength` should always be set above 1. 
The choice of `anchor_strength` beyond that depends on the size of the vocabulary and the task at hand. We encourage users to experiment with ```anchor_strength``` to find what is useful for their own purposes.\n\nIf ```words``` is not initialized, we can anchor by specifying the column indices of the document-term matrix that we wish to anchor on. For example,\n\n```python\ntopic_model.fit(X, anchors=[[0, 2], 1], anchor_strength=2)\n```\n\nanchors the words of columns 0 and 2 to the first topic, and word 1 to the second topic.\n\n### Anchoring Strategies\n\nThere are a number of strategies we can use with anchored CorEx. Below we provide just a handful of examples.\n\n1. *Anchoring a single set of words to a single topic*. This can help promote a topic that did not naturally emerge when running an unsupervised instance of the CorEx topic model. For example, we might anchor words like \"snow,\" \"cold,\" and \"avalanche\" to a topic if we suspect there should be a snow avalanche topic within a set of disaster relief articles.\n\n```python\ntopic_model.fit(X, words=words, anchors=[['snow', 'cold', 'avalanche']], anchor_strength=4)\n```\n\n2. *Anchoring single sets of words to multiple topics*. This can help find different aspects of a topic that may be discussed in several different contexts. For example, we might anchor \"protest\" to three topics and \"riot\" to three other topics to understand different framings that arise from tweets about political protests.\n\n```python\ntopic_model.fit(X, words=words, anchors=['protest', 'protest', 'protest', 'riot', 'riot', 'riot'], anchor_strength=2)\n```\n\n3. *Anchoring different sets of words to multiple topics.* This can help enforce topic separability if there appear to be \"chimera\" topics that are not well-separated. 
For example, we might anchor \"mountain,\" \"Bernese,\" and \"dog\" to one topic and \"mountain,\" \"rocky,\" and \"colorado\" to another topic to help separate topics that merge discussion of Bernese Mountain Dogs and the Rocky Mountains.\n\n```python\ntopic_model.fit(X, words=words, anchors=[['bernese', 'mountain', 'dog'], ['mountain', 'rocky', 'colorado']], anchor_strength=2)\n```\n\nThe [example notebook](https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic\u002Fblob\u002Fmaster\u002Fcorextopic\u002Fexample\u002Fcorex_topic_example.ipynb) details other examples of using anchored CorEx. We encourage domain experts to experiment with other anchoring strategies that suit their needs.\n\n**Note:** when running unsupervised CorEx, the topics are returned and sorted according to how much total correlation they each explain. When running anchored CorEx, the topics are not sorted by total correlation, and the first *n* topics will correspond to the *n* anchored topics in the order given by the model input.\n\n\n\n## Hierarchical Topic Modeling\n\n### Building a Hierarchical Topic Model\n\nFor the CorEx topic model, topics are latent factors that can be expressed or not in each document. 
We can use the matrices of these topic expressions as input for another layer of the CorEx topic model, yielding a hierarchical topic model.\n\n```python\n# Train the first layer\ntopic_model = ct.Corex(n_hidden=100)\ntopic_model.fit(X)\n\n# Train successive layers\ntm_layer2 = ct.Corex(n_hidden=10)\ntm_layer2.fit(topic_model.labels)\n\ntm_layer3 = ct.Corex(n_hidden=1)\ntm_layer3.fit(tm_layer2.labels)\n```\n\nVisualizations of the hierarchical topic model can be accessed through ```vis_topic.py```.\n\n```python\nvt.vis_hierarchy([topic_model, tm_layer2, tm_layer3], column_label=words, max_edges=300, prefix='topic-model-example')\n```\n\n\n\n## Technical notes\n\n### Binarization of Documents\n\nFor speed reasons, this version of the CorEx topic model works only on binary data and produces binary latent factors. Despite this limitation, our work demonstrates CorEx produces coherent topics that are as good as or better than those produced by LDA for short to medium length documents. However, you may wish to consider additional preprocessing for working with longer documents. We have several strategies for handling text data.\n\n0. Naive binarization. This will be good for documents of similar length and especially short- to medium-length documents.\n\n1. Average binary bag of words. We split documents into chunks, compute the binary bag of words for each document, and then average. This implicitly weights all documents equally.\n\n2. All binary bag of words. Split documents into chunks and consider each chunk as its own binary bag of words document. This changes the number of documents, so it may take some work to match the ids back, if desired. Implicitly, this will weight longer documents more heavily. Generally this seems like the most theoretically justified method. Ideally, you could aggregate the latent factors over sub-documents to get 'counts' of latent factors at the higher layers.\n\n3. Fractional counts. 
This converts counts into a fraction of the background rate, with 1 as the max. Short documents tend to stay binary, and words in long documents are weighted according to their frequency with respect to the background in the corpus. This seems to work OK in tests. It requires no preprocessing of count data and it uses the full range of possible inputs. However, this approach is not very rigorous or well tested.\n\nFor the Python API, for strategies 1 and 2, you can use the functions in ```vis_topic``` to process data or do the same yourself. Naive binarization is specified through the Python API with count='binarize' and fractional counts with count='fraction'. While fractional counts may work theoretically, their usage in the CorEx topic model has not been adequately tested.\n\n### Single Membership of Words in Topics\n\nAlso for speed reasons, the CorEx topic model enforces single membership of words in topics. If a user anchors a word to multiple topics, the single membership will be overridden.\n\n\n## References\nIf you use this code, please cite the following:\n\n> Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. \"[Anchored Correlation Explanation: Topic Modeling with Minimal Domain Knowledge](https:\u002F\u002Fwww.transacl.org\u002Fojs\u002Findex.php\u002Ftacl\u002Farticle\u002Fview\u002F1244).\" *Transactions of the Association for Computational Linguistics (TACL)*, 2017.\n\nSee the following papers if you are interested in how CorEx works generally beyond sparse binary data.\n\n> [*Discovering Structure in High-Dimensional Data Through Correlation Explanation*](http:\u002F\u002Farxiv.org\u002Fabs\u002F1406.1222), Ver Steeg and Galstyan, NIPS 2014. 
\u003Cbr>\n\n>[*Maximally Informative Hierarchical Representations of High-Dimensional Data*](http:\u002F\u002Farxiv.org\u002Fabs\u002F1410.7404), Ver Steeg and Galstyan, AISTATS 2015.\n","# 锚定 CorEx：只需最少领域知识的层次主题建模\n\n**Cor**relation **Ex**planation (CorEx) 是一种主题模型，能够生成对一组文档最具信息量的丰富主题。与其他主题模型相比，CorEx 的优势在于，它可以根据用户需求轻松地作为无监督、半监督或层次主题模型运行。在半监督学习中，CorEx 允许用户通过“锚词”整合其领域知识。这种整合方式灵活，使用户能够引导主题模型朝这些词语的方向发展。这为促进主题表示、可分离性和方面提取提供了创造性的策略。更广泛地说，CorEx 的这一实现非常适合对任何稀疏二值数据进行聚类。\n\n如果您使用此代码，请引用以下文献：\n\n> Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. “[锚定相关解释：只需最少领域知识的主题建模](https:\u002F\u002Fwww.transacl.org\u002Fojs\u002Findex.php\u002Ftacl\u002Farticle\u002Fview\u002F1244).” *计算语言学协会汇刊 (TACL)*, 2017.\n\n## 快速入门\n\n### 安装\n\n可以通过 pip 安装 CorEx 主题模型的 Python 代码：\n\n```\npip install corextopic\n```\n\n### 示例笔记本\n\n有关如何获取并解释 CorEx 主题模型输出的完整说明，请参阅 [示例笔记本](https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic\u002Fblob\u002Fmaster\u002Fcorextopic\u002Fexample\u002Fcorex_topic_example.ipynb)。下面我们介绍如何将 CorEx 运行为无监督、半监督或层次主题模型。\n\n### 运行 CorEx 主题模型\n\n给定一个文档-词矩阵，CorEx 主题模型非常容易运行。代码遵循 scikit-learn 的 fit\u002Ftransform 约定。\n\n```python\nimport numpy as np\nimport scipy.sparse as ss\nfrom corextopic import corextopic as ct\n\n# 定义一个矩阵，其中行代表样本（文档），列代表特征（词）\nX = np.array([[0,0,0,1,1],\n              [1,1,1,0,0],\n              [1,1,1,1,1]], dtype=int)\n# 也支持稀疏矩阵\nX = ss.csr_matrix(X)\n# 可以为每一列提供词标签\nwords = ['狗', '猫', '鱼', '苹果', '橙子']\n# 可以为每一行提供文档标签\ndocs = ['水果文档', '动物文档', '混合文档']\n\n# 训练 CorEx 主题模型\ntopic_model = ct.Corex(n_hidden=2)  # 定义要使用的潜在（隐藏）主题数量。\ntopic_model.fit(X, words=words, docs=docs)\n```\n\n模型训练完成后，我们可以使用 `get_topics()` 函数获取主题。\n\n```python\ntopics = topic_model.get_topics()\nfor topic_n,topic in enumerate(topics):\n    # w: 词，mi: 互信息，s: 符号\n    topic = [(w,mi,s) if s > 0 else ('~'+w,mi,s) for w,mi,s in topic]\n    # 解包主题信息\n    words,mis,signs = zip(*topic)    \n    # 打印主题\n    topic_str = str(topic_n+1)+': '+', '.join(words)\n    
print(topic_str)\n```\n\n同样，每个主题最可能对应的文档可以通过 `get_top_docs()` 函数访问。\n\n```python\ntop_docs = topic_model.get_top_docs()\nfor topic_n, topic_docs in enumerate(top_docs):\n    docs,probs = zip(*topic_docs)\n    topic_str = str(topic_n+1)+': '+', '.join(docs)\n    print(topic_str)\n```\n\n总结文件和可视化结果可以由 `vis_topic.py` 输出。\n\n```python\nfrom corextopic import vis_topic as vt\nvt.vis_rep(topic_model, column_label=words, prefix='topic-model-example')\n```\n\n\n### 选择主题数量\n\n每个主题解释了 *总相关性* (TC) 的一部分。我们可以通过 `tcs` 属性访问各主题的 TC，也可以通过 `tc` 属性访问总体 TC（即所有主题 TC 的总和）。为了确定应使用多少个主题，我们可以查看 `tcs` 的分布情况。如果添加更多主题对总体 TC 的贡献很小，则说明现有主题已经解释了文档中的大部分信息。在这种情况下，我们的主题模型可能不需要更多的主题。因此，作为一般经验法则，可以继续添加主题，直到总体 TC 达到平稳状态。\n\n我们还可以从多个不同的初始化点重新启动 CorEx 主题模型。这样可以使 CorEx 探索主题空间的不同区域，并有可能找到更具信息量的主题。如果我们希望采用严格的定量方法来选择使用哪一次主题模型运行的结果，那么可以选择具有最高 TC 的模型（即能够解释最多文档信息的模型）。\n\n\n\n\n## 半监督主题建模\n\n### 使用锚词\n\n锚定 CorEx 允许用户通过“锚词”整合其领域知识。锚定会鼓励（但不强制）CorEx 搜索与锚词相关的主题。这有助于我们找到感兴趣的主题、增强主题之间的可分离性，并围绕主题发现各个方面。\n\n如果已初始化 `words`，则可以轻松使用锚词：\n\n```python\ntopic_model.fit(X, words=words, anchors=[['狗','猫'], '苹果'], anchor_strength=2)\n```\n\n这会将“狗”和“猫”锚定到第一个主题，“苹果”锚定到第二个主题。`anchor_strength` 是相对于所有其他词而言，给予锚词的相对权重。例如，如果 `anchor_strength=2`，那么 CorEx 在搜索相关主题时会将锚词的权重提高两倍。`anchor_strength` 应始终设置为大于 1。在此基础上的具体取值取决于词汇表的大小和具体任务。我们鼓励用户尝试不同的 `anchor_strength` 值，以找到最适合自身需求的设置。\n\n如果未初始化 `words`，我们可以通过指定文档-术语矩阵中要锚定的列索引来进行锚定。例如，\n\n```python\ntopic_model.fit(X, anchors=[[0, 2], 1], anchor_strength=2)\n```\n\n将第 0 列和第 2 列的词锚定到第一个主题，将第 1 列的词锚定到第二个主题。\n\n### 锚定策略\n\n我们可以使用多种策略来应用锚定 CorEx 模型。以下仅提供几个示例。\n\n1. *将一组词语锚定到单个主题*。这有助于引入在无监督 CorEx 主题模型运行中未自然出现的主题。例如，如果我们怀疑一组灾害救援文章中应包含雪崩主题，可以将“snow”、“cold”和“avalanche”等词语锚定到某个主题。\n\n```python\ntopic_model.fit(X, words=words, anchors=[['snow', 'cold', 'avalanche']], anchor_strength=4)\n```\n\n2. 
*将多组词语分别锚定到多个主题*。这可以帮助发现一个主题在不同上下文中可能涉及的不同方面。例如，我们可能将“protest”锚定到三个主题，“riot”锚定到另外三个主题，以理解关于政治抗议的推文所呈现的不同话语框架。\n\n```python\ntopic_model.fit(X, words=words, anchors=['protest', 'protest', 'protest', 'riot', 'riot', 'riot'], anchor_strength=2)\n```\n\n3. *将不同的词语集分别锚定到多个主题*。如果存在一些分离度较差的“嵌合体”主题，这种方法可以帮助增强主题的可区分性。例如，我们可以将“mountain”、“Bernese”和“dog”锚定到一个主题，同时将“mountain”、“rocky”和“colorado”锚定到另一个主题，以更好地分离讨论伯恩山犬和落基山脉的主题。\n\n```python\ntopic_model.fit(X, words=words, anchors=[['bernese', 'mountain', 'dog'], ['mountain', 'rocky', 'colorado']], anchor_strength=2)\n```\n\n[示例笔记本](https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic\u002Fblob\u002Fmaster\u002Fcorextopic\u002Fexample\u002Fcorex_topic_example.ipynb)详细介绍了其他使用锚定 CorEx 的示例。我们鼓励领域专家根据自身需求尝试不同的锚定策略。\n\n**注意**：在运行无监督 CorEx 时，主题会按照其解释的总相关性大小进行排序并返回。而在运行锚定 CorEx 时，主题不会按总相关性排序，前 *n* 个主题将对应于模型输入中给出顺序的前 *n* 个锚定主题。\n\n\n\n## 层次化主题建模\n\n### 构建层次化主题模型\n\n对于 CorEx 主题模型而言，主题是潜在因子，可以在每篇文档中出现或不出现。我们可以将这些主题表达矩阵作为另一层 CorEx 主题模型的输入，从而构建层次化主题模型。\n\n```python\n# 训练第一层\ntopic_model = ct.Corex(n_hidden=100)\ntopic_model.fit(X)\n\n# 训练后续各层\ntm_layer2 = ct.Corex(n_hidden=10)\ntm_layer2.fit(topic_model.labels)\n\ntm_layer3 = ct.Corex(n_hidden=1)\ntm_layer3.fit(tm_layer2.labels)\n```\n\n可以通过 ```vis_topic.py``` 访问层次化主题模型的可视化结果。\n\n```python\nvt.vis_hierarchy([topic_model, tm_layer2, tm_layer3], column_label=words, max_edges=300, prefix='topic-model-example')\n```\n\n\n\n## 技术说明\n\n### 文档二值化\n\n出于速度考虑，本版本的 CorEx 主题模型仅适用于二值数据，并生成二值潜在因子。尽管存在这一限制，我们的研究表明，对于短至中等长度的文档，CorEx 所生成的主题连贯性与 LDA 相当甚至更好。然而，处理较长文档时，您可能需要考虑额外的预处理步骤。我们提供了几种文本数据处理策略。\n\n0. 简单二值化。此方法适用于长度相近的文档，尤其适合短至中等长度的文档。\n\n1. 平均二值词袋。我们将文档分割成若干块，计算每块的二值词袋，然后取平均值。这种方法会隐式地对所有文档赋予相等的权重。\n\n2. 全部二值词袋。将文档分割成若干块，并将每块视为独立的二值词袋文档。这样做会改变文档数量，因此如果需要，可能需要重新匹配文档 ID。隐式地，这种方法会更重视较长的文档。总体而言，这似乎是最具理论依据的方法。理想情况下，您可以汇总子文档的潜在因子，以获得更高层的潜在因子“计数”。\n\n3. 
分数计数。此方法将计数转换为相对于背景频率的比例，最大值为 1。短文档通常保持二值状态，而长文档中的词语则会根据其在语料库中的背景频率进行加权。该方法在测试中表现尚可。它无需对计数数据进行预处理，且能充分利用所有可能的输入范围。然而，这种方法尚未经过严格或充分的测试。\n\n对于 Python API，针对第 1 和第 2 种方法，您可以使用 ```vis_topic``` 中的函数来处理数据，也可以自行完成。简单二值化可通过 count='binarize' 参数指定，分数计数则通过 count='fraction' 指定。尽管分数计数在理论上可行，但其在 CorEx 主题模型中的应用尚未得到充分验证。\n\n### 单一主题归属\n\n同样出于速度考虑，CorEx 主题模型强制要求每个词语只能属于一个主题。如果用户将某个词语锚定到多个主题，则单一归属规则将被覆盖。\n\n\n## 参考文献\n如果您使用此代码，请引用以下文献：\n\n> Gallagher, R. J., Reing, K., Kale, D., and Ver Steeg, G. “[锚定相关性解释：基于最小领域知识的主题建模](https:\u002F\u002Fwww.transacl.org\u002Fojs\u002Findex.php\u002Ftacl\u002Farticle\u002Fview\u002F1244).” *计算语言学协会汇刊 (TACL)*, 2017.\n\n如果您想了解 CorEx 在稀疏二值数据之外的一般工作原理，请参阅以下论文：\n\n> [*通过相关性解释发现高维数据中的结构*](http:\u002F\u002Farxiv.org\u002Fabs\u002F1406.1222), Ver Steeg 和 Galstyan, NIPS 2014. \u003Cbr>\n\n>[*高维数据的最大信息量层次表示*](http:\u002F\u002Farxiv.org\u002Fabs\u002F1410.7404), Ver Steeg 和 Galstyan, AISTATS 2015.","# corex_topic 快速上手指南\n\n**corex_topic** 是一个基于“锚定相关性解释”（Anchored CorEx）算法的主题模型工具。与传统的 LDA 等模型相比，它支持无监督、半监督（通过“锚点词”融入领域知识）以及分层主题建模，特别适用于稀疏二值数据的聚类分析。\n\n## 环境准备\n\n- **操作系统**：Linux, macOS, Windows\n- **Python 版本**：建议 Python 3.6 及以上\n- **前置依赖**：\n  - `numpy`\n  - `scipy`\n  - `scikit-learn` (部分功能兼容)\n\n确保已安装基础科学计算库，若未安装可先行执行：\n```bash\npip install numpy scipy\n```\n\n## 安装步骤\n\n推荐使用 pip 直接安装官方发布版本。国内用户如遇下载缓慢，可使用清华或阿里云镜像源加速。\n\n**标准安装：**\n```bash\npip install corextopic\n```\n\n**使用国内镜像源加速安装（推荐）：**\n```bash\npip install corextopic -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 基本使用\n\n以下示例演示如何构建一个最简单的无监督主题模型。代码遵循 scikit-learn 的 `fit\u002Ftransform` 风格。\n\n### 1. 
准备数据与训练模型\n\n创建一个文档 - 词矩阵（支持稠密数组或稀疏矩阵），定义词汇表和文档标签，然后初始化并训练模型。\n\n```python\nimport numpy as np\nimport scipy.sparse as ss\nfrom corextopic import corextopic as ct\n\n# 定义矩阵：行代表样本（文档），列代表特征（词）\n# 1 表示该词出现在文档中，0 表示未出现（二值化数据）\nX = np.array([[0,0,0,1,1],\n              [1,1,1,0,0],\n              [1,1,1,1,1]], dtype=int)\n\n# 支持稀疏矩阵格式\nX = ss.csr_matrix(X)\n\n# 定义每列对应的词标签\nwords = ['dog', 'cat', 'fish', 'apple', 'orange']\n\n# 定义每行对应的文档标签\ndocs = ['fruit doc', 'animal doc', 'mixed doc']\n\n# 初始化模型：n_hidden 指定潜在主题的数量\ntopic_model = ct.Corex(n_hidden=2)\n\n# 训练模型\ntopic_model.fit(X, words=words, docs=docs)\n```\n\n### 2. 获取主题结果\n\n训练完成后，使用 `get_topics()` 提取每个主题下的关键词及其互信息（MI）。\n\n```python\ntopics = topic_model.get_topics()\nfor topic_n, topic in enumerate(topics):\n    # 格式化输出：正相关显示词，负相关显示 ~词\n    topic_formatted = [(w, mi, s) if s > 0 else ('~'+w, mi, s) for w, mi, s in topic]\n    \n    # 解包数据\n    words_list, mis, signs = zip(*topic_formatted)    \n    \n    # 打印主题关键词\n    topic_str = str(topic_n+1) + ': ' + ', '.join(words_list)\n    print(topic_str)\n```\n\n### 3. 获取典型文档\n\n使用 `get_top_docs()` 查看每个主题下概率最高的文档。\n\n```python\ntop_docs = topic_model.get_top_docs()\nfor topic_n, topic_docs in enumerate(top_docs):\n    docs_list, probs = zip(*topic_docs)\n    topic_str = str(topic_n+1) + ': ' + ', '.join(docs_list)\n    print(topic_str)\n```\n\n### 4. 
可视化（可选）\n\n工具内置了可视化模块，可生成主题关系图或层级结构图。\n\n```python\nfrom corextopic import vis_topic as vt\n\n# 生成主题表示可视化文件\nvt.vis_rep(topic_model, column_label=words, prefix='topic-model-example')\n```\n\n> **提示**：若需进行半监督建模（指定锚点词）或分层建模，只需在 `fit` 函数中传入 `anchors` 参数，或将上一层的 `labels` 作为下一层的输入即可。详细用法请参考官方 Example Notebook。","某电商客服团队需要从数万条稀疏的用户投诉文本中，快速梳理出核心问题层级并定位特定业务线的异常。\n\n### 没有 corex_topic 时\n- 传统主题模型（如 LDA）生成的主题往往词汇重叠严重，难以区分“物流延迟”与“商品破损”等细粒度差异，导致分类模糊。\n- 分析师明知“退款”、“缺货”是关键业务指标，却无法将这些领域知识强制融入模型，只能被动接受算法输出的无关主题。\n- 面对海量数据，只能得到扁平的主题列表，无法自动构建从“售后服务”到“具体退换货流程”的层级结构，人工整理耗时极长。\n- 对于短文本或稀疏的计数数据，模型效果大幅下降，大量包含关键信息的文档被错误归类或忽略。\n\n### 使用 corex_topic 后\n- 利用 CorEx 的最大信息量原则，生成的主题词汇互斥性更强，清晰分离出“支付失败”、“发货慢”等独立且高纯度的问题簇。\n- 通过“锚点词（Anchor Words）”功能，直接将“优惠券”、“积分”设为引导词，迫使模型优先提取营销相关的潜在主题，实现半监督精准挖掘。\n- 自动构建分层主题模型，顶层展示“物流”、“质量”等大方向，下层自动展开具体子问题，直观呈现问题根源的树状结构。\n- 专为稀疏计数数据优化，即使是在只有几个关键词的简短投诉中，也能准确捕捉文档间的关联，显著提升召回率。\n\ncorex_topic 通过引入极少的人工先验知识，将原本黑盒的无监督聚类转化为可解释、可引导的层级化洞察工具，极大缩短了从原始数据到业务决策的路径。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fgregversteeg_corex_topic_e7c2f35d.png","gregversteeg","Greg Ver Steeg","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fgregversteeg_02d7b01e.jpg","Associate professor at UCR","University of California Riverside","Los Angeles, CA","gversteeg@gmail.com",null,"http:\u002F\u002Fgregversteeg.com","https:\u002F\u002Fgithub.com\u002Fgregversteeg",[86,90],{"name":87,"color":88,"percentage":89},"Python","#3572A5",55,{"name":91,"color":92,"percentage":93},"Jupyter Notebook","#DA5B0B",45,639,118,"2026-03-20T06:46:49","Apache-2.0",1,"","未说明",{"notes":102,"python":100,"dependencies":103},"该工具通过 pip 安装（包名为 corextopic），主要处理稀疏二值数据。代码遵循 scikit-learn 的 fit\u002Ftransform 规范。虽然支持半监督和分层主题建模，但 README 中未明确指定具体的操作系统、GPU、内存或 Python 
版本要求。对于长文档，建议进行额外的预处理（如分块或二值化策略）。",[104,105],"numpy","scipy",[13],[108,109,110,111,112],"python","machine-learning","unsupervised-learning","topic-modeling","information-theory","2026-03-27T02:49:30.150509","2026-04-06T05:35:43.607365",[116,121,126,131,136,141],{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},15790,"如何使用训练好的模型对新数据进行预测？","在测试新句子时，必须复用原始数据训练时使用的同一个向量化器（vectorizer），以确保字典一致。如果直接使用新的 CountVectorizer 会导致维度不匹配或报错。请确保对新数据使用 `original_vectorizer.transform(new_data)` 而不是 `fit_transform`。","https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic\u002Fissues\u002F24",{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},15791,"运行 vis_hierarchy 时出现 'NameError: name 'unicode' is not defined' 错误怎么办？","这是 Python 3 兼容性问题。该问题已在 Pull Request #21 中修复。如果您是通过 pip 安装的旧版本，请运行 `pip install corextopic --upgrade` 升级到最新版本以解决此错误。如果尚未更新到 pip，可能需要从源码安装最新代码。","https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic\u002Fissues\u002F19",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},15792,"get_topics() 返回的主题词为什么没有按互信息（MI）降序排列？","这是因为某些词的缺失比存在更能提供信息（负相关）。新版本已更新 `get_topics()` 方法：现在返回的是一个三元组 (单词，MI 值，符号)，其中符号表示该词的存在或缺失是否具有信息量。此外，新增了 `weighted_rank` 选项，设置为 True 时可返回 alpha*MI 作为排序依据，这通常能提供更合理的排名。","https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic\u002Fissues\u002F35",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},15793,"如果锚定词（anchor words）不在词汇表中，程序会报错停止吗？","在新版本中，如果锚定词不在词汇表中，CorEx 不再抛出错误终止程序，而是发出警告并继续处理其他有效的锚定词。请确保通过 `pip install corextopic --upgrade` 将包升级到最新版本以获取此功能。","https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic\u002Fissues\u002F23",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},15794,"如何为不同的主题设置不同的锚定词列表？未锚定的主题该如何表示？","您应该为每个主题提供一个列表。对于想要锚定的主题，填入对应的词列表；对于不需要锚定的主题，只需在 `anchors` 列表中对应位置留空或不填（具体取决于实现，通常建议显式传递空列表或仅列出需要锚定的部分，但根据维护者建议，格式应为 `anchors=[['word1', 'word2'], [], 
['word3']]`，其中空列表代表该主题无锚定词）。不要为未锚定的主题重复填充无关数据。","https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic\u002Fissues\u002F48",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},15795,"如何在 CorEx 中计算或获取主题的一致性分数（Coherence Scores）？","CorEx 类本身目前没有直接输出一致性分数的内置方法。用户需要自行编写代码计算，通常结合 `sklearn.feature_extraction.text.CountVectorizer` 和外部一致性计算逻辑。您可以参考社区提供的解决方案，手动提取主题词并与原始文档矩阵结合来计算一致性指标。","https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic\u002Fissues\u002F36",[]]