[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-piskvorky--gensim-data":3,"tool-piskvorky--gensim-data":64},[4,23,32,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65628,1,"2026-04-05T10:10:46",[20,18,14],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":10,"last_commit_at":38,"category_tags":39,"status":22},3364,"keras","keras-team\u002Fkeras","Keras 是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[20,14,18],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":10,"last_commit_at":46,"category_tags":47,"status":22},2403,"crawl4ai","unclecode\u002Fcrawl4ai","Crawl4AI 是一款专为大语言模型（LLM）设计的开源网络爬虫与数据提取工具。它的核心使命是将纷繁复杂的网页内容转化为干净、结构化的 Markdown 格式，直接服务于检索增强生成（RAG）、智能体构建及各类数据管道，让 AI 能更轻松地“读懂”互联网。\n\n传统爬虫往往面临反爬机制拦截、动态内容加载困难以及输出格式杂乱等痛点，导致后续数据处理成本高昂。Crawl4AI 通过内置自动化的三级反机器人检测、代理升级策略以及对 Shadow DOM 的深度支持，有效突破了这些障碍。它能智能移除同意弹窗，处理深层链接，并具备长任务崩溃恢复能力，确保数据采集的稳定与高效。\n\n这款工具特别适合开发者、AI 研究人员及数据工程师使用。无论是需要为本地模型构建知识库，还是搭建大规模自动化信息采集流程，Crawl4AI 都提供了极高的可控性与灵活性。作为 GitHub 上备受瞩目的开源项目，它完全免费开放，无需繁琐的注册或昂贵的 API 费用，让用户能够专注于数据价值本身而非采集难题。",63242,"2026-04-02T22:29:19",[14,17],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":10,"last_commit_at":54,"category_tags":55,"status":22},193,"meilisearch","meilisearch\u002Fmeilisearch","Meilisearch 是一个开源的极速搜索服务，专为现代应用和网站打造，开箱即用。它能帮助开发者快速集成高质量的搜索功能，无需复杂的配置或额外的数据预处理。传统搜索方案往往需要大量调优才能实现准确结果，而 Meilisearch 内置了拼写容错、同义词识别、即时响应等实用特性，并支持 AI 驱动的混合搜索（结合关键词与语义理解），显著提升用户查找信息的体验。\n\nMeilisearch 特别适合 Web 开发者、产品团队或初创公司使用，尤其适用于需要快速上线搜索功能的场景，如电商网站、内容平台或 SaaS 应用。它提供简洁的 RESTful API 和多种语言 SDK，部署简单，资源占用低，本地开发或生产环境均可轻松运行。对于希望在不依赖大型云服务的前提下，为用户提供流畅、智能搜索体验的团队来说，Meilisearch 
是一个高效且友好的选择。",56972,"2026-04-05T22:34:33",[13,17,14,20,16,18],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":22},223,"Made-With-ML","GokuMohandas\u002FMade-With-ML","Made-With-ML 是一个面向实战的开源项目，旨在帮助开发者系统掌握从设计、开发到部署和迭代生产级机器学习应用的完整流程。它解决了许多人在学习机器学习时“会训练模型但不会上线”的痛点，强调将软件工程最佳实践与 ML 技术结合，构建可靠、可维护的端到端系统。\n\n该项目特别适合三类人群：一是希望将模型真正落地的开发者（包括软件工程师、数据科学家）；二是刚毕业、想补齐工业界所需技能的学生；三是需要理解技术边界以更好推动产品的技术管理者或产品经理。\n\nMade-With-ML 的亮点在于注重第一性原理讲解，避免盲目调包；同时覆盖 MLOps 关键环节（如实验跟踪、模型测试、服务部署、CI\u002FCD 等），并支持在 Python 生态内平滑扩展训练与推理任务，无需切换语言或复杂基础设施。课程内容结构清晰，配有详细代码示例和视频导览，兼顾理论深度与工程实用性。",47120,"2026-04-05T22:16:18",[19,18,14,16,20],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":89,"forks":90,"last_commit_at":91,"license":92,"difficulty_score":29,"env_os":93,"env_gpu":93,"env_ram":93,"env_deps":94,"category_tags":98,"github_topics":99,"view_count":107,"oss_zip_url":81,"oss_zip_packed_at":81,"status":22,"created_at":108,"updated_at":109,"faqs":110,"releases":140},954,"piskvorky\u002Fgensim-data","gensim-data","Data repository for pretrained NLP models and NLP corpora.","gensim-data 是一个专注于自然语言处理（NLP）领域的开源数据存储库，提供预训练模型和文本语料库。它解决了研究数据集经常消失、格式复杂或难以使用的问题，通过标准化的 API 和长期支持，为非结构化文本处理提供了稳定可靠的数据来源。\n\n这个存储库非常适合 NLP 研究人员和开发者使用，尤其是那些需要快速获取高质量语料或预训练模型的人。用户无需直接与存储库交互，只需通过 Gensim 的下载 API 即可轻松加载所需资源，例如词向量模型或文本数据集。所有数据会自动保存到本地 `~\u002Fgensim-data` 文件夹中，方便管理。\n\ngensim-data 的独特之处在于其设计的可持续性：每个数据集都有独立的版本发布，并附带详细的使用示例和许可证说明，确保透明性和合规性。此外，数据以 GitHub 发布附件的形式存储，保证了数据的不可变性和长期可用性。\n\n无论是加载预训练模型（如 GloVe 词向量）还是获取语料库（如 Wikipedia 或 text8），gensim-data 都能显著简化工作流程，让研究人员和开发者专注于核心任务，而不必为数据准备耗费精力。","# What is Gensim-data for?\n\nResearch datasets regularly disappear, change over time, become obsolete or come without a sane implementation to handle the data format reading and processing.\n\nFor this reason, [Gensim](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim) launched its own dataset storage, committed to long-term support, a sane standardized usage API and focused on datasets for **unstructured text processing** (no images or audio). This [Gensim-data](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data) repository serves as that storage.\n\n**There's no need for you to use this repository directly**. Instead, simply install Gensim and use its download API (see the Quickstart below). It will \"talk\" to this repository automagically.\n\n💡 When you use the Gensim download API, all data is stored in your `~\u002Fgensim-data` home folder.\n\nRead more about the project rationale and design decisions in this article: [New Download API for Pretrained NLP Models and Datasets](https:\u002F\u002Frare-technologies.com\u002Fnew-download-api-for-pretrained-nlp-models-and-datasets-in-gensim\u002F).\n\n# How does it work?\n\nTechnically, the actual (sometimes large) corpora and model files are being stored as [release attachments](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Freleases) here on Github. 
Each dataset (and each new version of each dataset) gets its own release, forever immutable.\n\nEach release is accompanied by a usage example and release notes, for example: [Corpus of USPTO Patents from 2017](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Freleases\u002Ftag\u002Fpatent-2017); [English Wikipedia from 2017 with plaintext section](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Freleases\u002Ftag\u002Fwiki-english-20171001).\n\n🔴 **Each dataset comes with its own license, which the users should study carefully before using the dataset!**\n\n----\n\n## Quickstart\n\nTo load a model or corpus, use either the Python or command line interface of [Gensim](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim) (you'll need Gensim installed first):\n\n- **Python API**\n\n  Example: load a pre-trained model (GloVe word vectors):\n\n  ```python\n  import gensim.downloader as api\n\n  info = api.info()  # show info about available models\u002Fdatasets\n  model = api.load(\"glove-twitter-25\")  # download the model and return as object ready for use\n  model.most_similar(\"cat\")\n\n  \"\"\"\n  output:\n\n  [(u'dog', 0.9590819478034973),\n   (u'monkey', 0.9203578233718872),\n   (u'bear', 0.9143137335777283),\n   (u'pet', 0.9108031392097473),\n   (u'girl', 0.8880630135536194),\n   (u'horse', 0.8872727155685425),\n   (u'kitty', 0.8870542049407959),\n   (u'puppy', 0.886769711971283),\n   (u'hot', 0.8865255117416382),\n   (u'lady', 0.8845518827438354)]\n\n  \"\"\"\n  ```\n\n  Example: load a corpus and use it to train a Word2Vec model:\n\n  ```python\n  from gensim.models.word2vec import Word2Vec\n  import gensim.downloader as api\n\n  corpus = api.load('text8')  # download the corpus and return it opened as an iterable\n  model = Word2Vec(corpus)  # train a model from the corpus\n  model.wv.most_similar(\"car\")  # query the trained vectors via model.wv (works on both gensim 3.x and 4.x)\n\n  \"\"\"\n  output:\n\n  [(u'driver', 0.8273754119873047),\n   (u'motorcycle', 0.769528865814209),\n   (u'cars', 0.7356342077255249),\n   (u'truck', 0.7331641912460327),\n   (u'taxi', 0.718338131904602),\n   (u'vehicle', 0.7177008390426636),\n   (u'racing', 0.6697118878364563),\n   (u'automobile', 0.6657308340072632),\n   (u'passenger', 0.6377975344657898),\n   (u'glider', 0.6374964714050293)]\n\n  \"\"\"\n  ```\n\n  Example: **only** download a dataset and return the local file path (no opening):\n\n  ```python\n  import gensim.downloader as api\n\n  print(api.load(\"20-newsgroups\", return_path=True))  # output: \u002Fhome\u002Fuser\u002Fgensim-data\u002F20-newsgroups\u002F20-newsgroups.gz\n  print(api.load(\"glove-twitter-25\", return_path=True))  # output: \u002Fhome\u002Fuser\u002Fgensim-data\u002Fglove-twitter-25\u002Fglove-twitter-25.gz\n  ```\n\n - The same operations, but from **CLI, command line interface**:\n\n   ```bash\n   python -m gensim.downloader --info  # show info about available models\u002Fdatasets\n   python -m gensim.downloader --download text8  # download text8 dataset to ~\u002Fgensim-data\u002Ftext8\n   python -m gensim.downloader --download glove-twitter-25  # download model to ~\u002Fgensim-data\u002Fglove-twitter-25\u002F\n   ```\n\n----\n\n## Available data\n\n### Datasets\n\n| name | file size | read_more | description | license |\n|------|-----------|-----------|-------------|---------|\n| 20-newsgroups | 13 MB | \u003Cul>\u003Cli>http:\u002F\u002Fqwone.com\u002F~jason\u002F20Newsgroups\u002F\u003C\u002Fli>\u003C\u002Ful> | The notorious collection of approximately 20,000 newsgroup posts, 
partitioned (nearly) evenly across 20 different newsgroups. | not found |\n| fake-news | 19 MB | \u003Cul>\u003Cli>https:\u002F\u002Fwww.kaggle.com\u002Fmrisdal\u002Ffake-news\u003C\u002Fli>\u003C\u002Ful> | News dataset, contains text and metadata from 244 websites and represents 12,999 posts in total from a specific window of 30 days. The data was pulled using the webhose.io API, and because it's coming from their crawler, not all websites identified by their BS Detector are present in this dataset. Data sources that were missing a label were simply assigned a label of 'bs'. There are (ostensibly) no genuine, reliable, or trustworthy news sources represented in this dataset (so far), so don't trust anything you read. | https:\u002F\u002Fcreativecommons.org\u002Fpublicdomain\u002Fzero\u002F1.0\u002F |\n| patent-2017 | 2944 MB | \u003Cul>\u003Cli>http:\u002F\u002Fpatents.reedtech.com\u002Fpgrbft.php\u003C\u002Fli>\u003C\u002Ful> | Patent Grant Full Text. Contains the full text including tables, sequence data and 'in-line' mathematical expressions of each patent grant issued in 2017. | not found |\n| quora-duplicate-questions | 20 MB | \u003Cul>\u003Cli>https:\u002F\u002Fdata.quora.com\u002FFirst-Quora-Dataset-Release-Question-Pairs\u003C\u002Fli>\u003C\u002Ful> | Over 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line contains a duplicate pair or not. | probably https:\u002F\u002Fwww.quora.com\u002Fabout\u002Ftos |\n| semeval-2016-2017-task3-subtaskA-unannotated | 223 MB | \u003Cul>\u003Cli>http:\u002F\u002Falt.qcri.org\u002Fsemeval2016\u002Ftask3\u002F\u003C\u002Fli> \u003Cli>http:\u002F\u002Falt.qcri.org\u002Fsemeval2016\u002Ftask3\u002Fdata\u002Fuploads\u002Fsemeval2016-task3-report.pdf\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues\u002F18\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002FWitiko\u002Fsemeval-2016_2017-task3-subtaskA-unannotated-english\u003C\u002Fli>\u003C\u002Ful> | SemEval 2016 \u002F 2017 Task 3 Subtask A unannotated dataset contains 189,941 questions and 1,894,456 comments in English collected from the Community Question Answering (CQA) web forum of Qatar Living. These can be used as a corpus for language modelling. | These datasets are free for general research use. |\n| semeval-2016-2017-task3-subtaskBC | 6 MB | \u003Cul>\u003Cli>http:\u002F\u002Falt.qcri.org\u002Fsemeval2017\u002Ftask3\u002F\u003C\u002Fli> \u003Cli>http:\u002F\u002Falt.qcri.org\u002Fsemeval2017\u002Ftask3\u002Fdata\u002Fuploads\u002Fsemeval2017-task3.pdf\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues\u002F18\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002FWitiko\u002Fsemeval-2016_2017-task3-subtaskB-english\u003C\u002Fli>\u003C\u002Ful> | SemEval 2016 \u002F 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the task paper http:\u002F\u002Falt.qcri.org\u002Fsemeval2016\u002Ftask3\u002Fdata\u002Fuploads\u002Fsemeval2016-task3-report.pdf linked in section “Papers” of https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues\u002F18. 
| All files released for the task are free for general research use |\n| text8 | 31 MB | \u003Cul>\u003Cli>http:\u002F\u002Fmattmahoney.net\u002Fdc\u002Ftextdata.html\u003C\u002Fli>\u003C\u002Ful> | First 100,000,000 bytes of plain text from Wikipedia. Used for testing purposes; see wiki-english-* for proper full Wikipedia datasets. | not found |\n| wiki-english-20171001 | 6214 MB | \u003Cul>\u003Cli>https:\u002F\u002Fdumps.wikimedia.org\u002Fenwiki\u002F20171001\u002F\u003C\u002Fli>\u003C\u002Ful> | Extracted Wikipedia dump from October 2017. Produced by `python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-en.gz` | https:\u002F\u002Fdumps.wikimedia.org\u002Flegal.html |\n\n### Models\n\n| name | num vectors | file size | base dataset | read_more  | description | parameters | preprocessing | license |\n|------|-------------|-----------|--------------|------------|-------------|------------|---------------|---------|\n| conceptnet-numberbatch-17-06-300 | 1917247 | 1168 MB | ConceptNet, word2vec, GloVe, and OpenSubtitles 2016 | \u003Cul>\u003Cli>http:\u002F\u002Faaai.org\u002Focs\u002Findex.php\u002FAAAI\u002FAAAI17\u002Fpaper\u002Fview\u002F14972\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002Fcommonsense\u002Fconceptnet-numberbatch\u003C\u002Fli> \u003Cli>http:\u002F\u002Fconceptnet.io\u002F\u003C\u002Fli>\u003C\u002Ful> | ConceptNet Numberbatch consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning. ConceptNet Numberbatch is part of the ConceptNet open data project. ConceptNet provides lots of ways to compute with word meanings, one of which is word embeddings. ConceptNet Numberbatch is a snapshot of just the word embeddings. It is built using an ensemble that combines data from ConceptNet, word2vec, GloVe, and OpenSubtitles 2016, using a variation on retrofitting. | \u003Cul>\u003Cli>dimension - 300\u003C\u002Fli>\u003C\u002Ful> | - | https:\u002F\u002Fgithub.com\u002Fcommonsense\u002Fconceptnet-numberbatch\u002Fblob\u002Fmaster\u002FLICENSE.txt |\n| fasttext-wiki-news-subwords-300 | 999999 | 958 MB | Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens) | \u003Cul>\u003Cli>https:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fenglish-vectors.html\u003C\u002Fli> \u003Cli>https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.09405\u003C\u002Fli> \u003Cli>https:\u002F\u002Farxiv.org\u002Fabs\u002F1607.01759\u003C\u002Fli>\u003C\u002Ful> | 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens). | \u003Cul>\u003Cli>dimension - 300\u003C\u002Fli>\u003C\u002Ful> | - | https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-sa\u002F3.0\u002F |\n| glove-twitter-100 | 1193514 | 387 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | Pre-trained vectors based on  2B tweets, 27B tokens, 1.2M vocab, uncased (https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F) | \u003Cul>\u003Cli>dimension - 100\u003C\u002Fli>\u003C\u002Ful> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-twitter-100.txt`. 
| http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-twitter-200 | 1193514 | 758 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F). | \u003Cul>\u003Cli>dimension - 200\u003C\u002Fli>\u003C\u002Ful> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-twitter-200.txt`. | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-twitter-25 | 1193514 | 104 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F). | \u003Cul>\u003Cli>dimension - 25\u003C\u002Fli>\u003C\u002Ful> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-twitter-25.txt`. | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-twitter-50 | 1193514 | 199 MB | Twitter (2B tweets, 27B tokens, 1.2M vocab, uncased) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | Pre-trained vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased (https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F) | \u003Cul>\u003Cli>dimension - 50\u003C\u002Fli>\u003C\u002Ful> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-twitter-50.txt`. | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-wiki-gigaword-100 | 400000 | 128 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | Pre-trained vectors based on Wikipedia 2014 + Gigaword 5, 6B tokens, 400K vocab, uncased (https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F). | \u003Cul>\u003Cli>dimension - 100\u003C\u002Fli>\u003C\u002Ful> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-wiki-gigaword-100.txt`. | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-wiki-gigaword-200 | 400000 | 252 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | Pre-trained vectors based on Wikipedia 2014 + Gigaword 5, 6B tokens, 400K vocab, uncased (https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F). | \u003Cul>\u003Cli>dimension - 200\u003C\u002Fli>\u003C\u002Ful> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-wiki-gigaword-200.txt`. 
| http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-wiki-gigaword-300 | 400000 | 376 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | Pre-trained vectors based on Wikipedia 2014 + Gigaword 5, 6B tokens, 400K vocab, uncased (https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F). | \u003Cul>\u003Cli>dimension - 300\u003C\u002Fli>\u003C\u002Ful> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-wiki-gigaword-300.txt`. | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-wiki-gigaword-50 | 400000 | 65 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, uncased) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | Pre-trained vectors based on Wikipedia 2014 + Gigaword 5, 6B tokens, 400K vocab, uncased (https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F). | \u003Cul>\u003Cli>dimension - 50\u003C\u002Fli>\u003C\u002Ful> | Converted to w2v format with `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-wiki-gigaword-50.txt`. | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| word2vec-google-news-300 | 3000000 | 1662 MB | Google News (about 100 billion words) | \u003Cul>\u003Cli>https:\u002F\u002Fcode.google.com\u002Farchive\u002Fp\u002Fword2vec\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Farxiv.org\u002Fabs\u002F1301.3781\u003C\u002Fli> \u003Cli>https:\u002F\u002Farxiv.org\u002Fabs\u002F1310.4546\u003C\u002Fli> \u003Cli>https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Flinguistic-regularities-in-continuous-space-word-representations\u002F?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf\u003C\u002Fli>\u003C\u002Ful> | Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains 300-dimensional vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in 'Distributed Representations of Words and Phrases and their Compositionality' (https:\u002F\u002Fcode.google.com\u002Farchive\u002Fp\u002Fword2vec\u002F). | \u003Cul>\u003Cli>dimension - 300\u003C\u002Fli>\u003C\u002Ful> | - | not found |\n| word2vec-ruscorpora-300 | 184973 | 198 MB | Russian National Corpus (about 250M words) | \u003Cul>\u003Cli>https:\u002F\u002Fwww.academia.edu\u002F24306935\u002FWebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models\u003C\u002Fli> \u003Cli>http:\u002F\u002Frusvectores.org\u002Fen\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues\u002F3\u003C\u002Fli>\u003C\u002Ful> | Word2vec Continuous Skipgram vectors trained on full Russian National Corpus (about 250M words). The model contains 185K words. 
| \u003Cul>\u003Cli>dimension - 300\u003C\u002Fli> \u003Cli>window_size - 10\u003C\u002Fli>\u003C\u002Ful> | The corpus was lemmatized and tagged with Universal PoS | https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby\u002F4.0\u002Fdeed.en |\n\n(generated by [generate_table.py](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fblob\u002Fmaster\u002Fgenerate_table.py) based on [list.json](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fblob\u002Fmaster\u002Flist.json))\n\n\n----\n\n# Want to add a new corpus or model?\n\n1. Compress your data set using gzip or bz2.\n\n2. Share the compressed file on any file-sharing service.\n\n3. Create a [new issue](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues) and give us the dataset link. Add a **detailed description** on **why** and **how** you created the dataset, any related papers or research, plus how you expect other users to use it. Include a code example where relevant.\n\n----------------\n\n`Gensim-data` is open source software released under the [LGPL 2.1 license](https:\u002F\u002Fgithub.com\u002Frare-technologies\u002Fgensim-data\u002Fblob\u002Fmaster\u002FLICENSE).\n\nCopyright (c) 2018 [RARE Technologies](https:\u002F\u002Frare-technologies.com\u002F).\n","# Gensim-data 是用来做什么的？\n\n研究数据集经常会消失、随时间变化、变得过时，或者没有合理的实现来处理数据格式的读取和处理。\n\n基于这个原因，[Gensim](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim) 推出了自己的数据集存储库，致力于长期支持、提供标准化的使用 API，并专注于**非结构化文本处理**（不包括图像或音频）的数据集。这个 [Gensim-data](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data) 仓库就充当了这样的存储库。\n\n**您无需直接使用此存储库**。相反，只需安装 Gensim 并使用其下载 API（见下文快速入门）。它会自动与该存储库进行交互。\n\n💡 当您使用 Gensim 下载 API 时，所有数据都会存储在您的 `~\u002Fgensim-data` 主目录文件夹中。\n\n有关项目原理和设计决策的更多信息，请阅读这篇文章：[New Download API for Pretrained NLP Models and Datasets](https:\u002F\u002Frare-technologies.com\u002Fnew-download-api-for-pretrained-nlp-models-and-datasets-in-gensim\u002F)。\n\n# 它是如何工作的？\n\n从技术上讲，实际的（有时非常大的）语料库和模型文件以 [release attachments](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Freleases) 的形式存储在 Github 上。每个数据集（以及每个数据集的新版本）都有自己的发布版本，且永远不可更改。\n\n每次发布都附带一个使用示例和发布说明，例如：[2017 年 USPTO 专利语料库](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Freleases\u002Ftag\u002Fpatent-2017); [2017 年英文维基百科，包含纯文本部分](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Freleases\u002Ftag\u002Fwiki-english-20171001)。\n\n🔴 **每个数据集都有自己的许可证，用户在使用数据集之前应仔细阅读！**\n\n----\n\n## 快速入门\n\n要加载模型或语料库，可以使用 [Gensim](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim) 的 Python 或命令行接口（首先需要安装 Gensim）：\n\n- **Python API**\n\n  示例：加载预训练模型（GloVe 词向量）：\n\n  ```python\n  import gensim.downloader as api\n\n  info = api.info()  # 显示可用模型\u002F数据集的信息\n  model = api.load(\"glove-twitter-25\")  # 下载模型并返回可直接使用的对象\n  model.most_similar(\"cat\")\n\n  \"\"\"\n  输出：\n\n  [(u'dog', 0.9590819478034973),\n   (u'monkey', 0.9203578233718872),\n   (u'bear', 0.9143137335777283),\n   (u'pet', 0.9108031392097473),\n   (u'girl', 0.8880630135536194),\n   (u'horse', 0.8872727155685425),\n   (u'kitty', 0.8870542049407959),\n   (u'puppy', 0.886769711971283),\n   (u'hot', 0.8865255117416382),\n   (u'lady', 0.8845518827438354)]\n\n  \"\"\"\n  ```\n\n  示例：加载语料库并用它训练 Word2Vec 模型：\n\n  ```python\n  from gensim.models.word2vec import Word2Vec\n  import gensim.downloader as api\n\n  corpus = api.load('text8')  # 下载语料库并返回打开后的可迭代对象\n  model = Word2Vec(corpus)  # 使用语料库训练模型\n  
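# 通过 model.wv 访问训练好的词向量（此写法在 gensim 3.x 与 4.x 中均适用）\n  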
model.most_similar(\"car\")\n\n  \"\"\"\n  输出：\n\n  [(u'driver', 0.8273754119873047),\n   (u'motorcycle', 0.769528865814209),\n   (u'cars', 0.7356342077255249),\n   (u'truck', 0.7331641912460327),\n   (u'taxi', 0.718338131904602),\n   (u'vehicle', 0.7177008390426636),\n   (u'racing', 0.6697118878364563),\n   (u'automobile', 0.6657308340072632),\n   (u'passenger', 0.6377975344657898),\n   (u'glider', 0.6374964714050293)]\n\n  \"\"\"\n  ```\n\n  示例：**仅**下载数据集并返回本地文件路径（不打开）：\n\n  ```python\n  import gensim.downloader as api\n\n  print(api.load(\"20-newsgroups\", return_path=True))  # 输出: \u002Fhome\u002Fuser\u002Fgensim-data\u002F20-newsgroups\u002F20-newsgroups.gz\n  print(api.load(\"glove-twitter-25\", return_path=True))  # 输出: \u002Fhome\u002Fuser\u002Fgensim-data\u002Fglove-twitter-25\u002Fglove-twitter-25.gz\n  ```\n\n - 同样的操作，但通过**CLI，命令行界面**完成：\n\n   ```bash\n   python -m gensim.downloader --info  # 显示可用模型\u002F数据集的信息\n   python -m gensim.downloader --download text8  # 将 text8 数据集下载到 ~\u002Fgensim-data\u002Ftext8\n   python -m gensim.downloader --download glove-twitter-25  # 将模型下载到 ~\u002Fgensim-data\u002Fglove-twitter-50\u002F\n   ```\n\n----\n\n## 可用数据\n\n### 数据集\n\n| 名称 | 文件大小 | 了解更多 | 描述 | 许可 |\n|------|-----------|-----------|-------------|---------|\n| 20-newsgroups | 13 MB | \u003Cul>\u003Cli>http:\u002F\u002Fqwone.com\u002F~jason\u002F20Newsgroups\u002F\u003C\u002Fli>\u003C\u002Ful> | 臭名昭著的约20,000篇新闻组帖子集合，几乎均匀分布在20个不同的新闻组中。 | 未找到 |\n| fake-news | 19 MB | \u003Cul>\u003Cli>https:\u002F\u002Fwww.kaggle.com\u002Fmrisdal\u002Ffake-news\u003C\u002Fli>\u003C\u002Ful> | 新闻数据集，包含来自244个网站的文本和元数据，总计代表了在特定30天窗口内的12,999篇帖子。数据通过webhose.io API抓取，由于数据来源于其爬虫，并非所有被BS Detector（垃圾站点检测器）识别的网站都出现在此数据集中。缺少标签的数据源被简单地标记为'bs'。该数据集中没有（据称）任何真实、可靠或值得信赖的新闻来源（至少目前如此），所以请不要相信你读到的内容。 | https:\u002F\u002Fcreativecommons.org\u002Fpublicdomain\u002Fzero\u002F1.0\u002F |\n| patent-2017 | 2944 MB | \u003Cul>\u003Cli>http:\u002F\u002Fpatents.reedtech.com\u002Fpgrbft.php\u003C\u002Fli>\u003C\u002Ful> | 专利授权全文。包含2017年发布的每项专利授权的完整文本，包括表格、序列数据和“内联”数学表达式。 | 未找到 |\n| quora-duplicate-questions | 20 MB | \u003Cul>\u003Cli>https:\u002F\u002Fdata.quora.com\u002FFirst-Quora-Dataset-Release-Question-Pairs\u003C\u002Fli>\u003C\u002Ful> | 超过400,000行潜在的重复问题对。每一行包含每对问题的ID、每个问题的完整文本以及一个二进制值，指示该行是否包含重复对。 | 可能是 https:\u002F\u002Fwww.quora.com\u002Fabout\u002Ftos |\n| semeval-2016-2017-task3-subtaskA-unannotated | 223 MB | \u003Cul>\u003Cli>http:\u002F\u002Falt.qcri.org\u002Fsemeval2016\u002Ftask3\u002F\u003C\u002Fli> \u003Cli>http:\u002F\u002Falt.qcri.org\u002Fsemeval2016\u002Ftask3\u002Fdata\u002Fuploads\u002Fsemeval2016-task3-report.pdf\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues\u002F18\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002FWitiko\u002Fsemeval-2016_2017-task3-subtaskA-unannotated-english\u003C\u002Fli>\u003C\u002Ful> | SemEval 2016 \u002F 2017任务3子任务A未标注数据集包含从卡塔尔生活社区问答（CQA）网络论坛收集的189,941个问题和1,894,456条评论。这些可以用作语言建模的语料库。 | 这些数据集可供一般研究用途免费使用。 |\n| semeval-2016-2017-task3-subtaskBC | 6 MB | \u003Cul>\u003Cli>http:\u002F\u002Falt.qcri.org\u002Fsemeval2017\u002Ftask3\u002F\u003C\u002Fli> \u003Cli>http:\u002F\u002Falt.qcri.org\u002Fsemeval2017\u002Ftask3\u002Fdata\u002Fuploads\u002Fsemeval2017-task3.pdf\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues\u002F18\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002FWitiko\u002Fsemeval-2016_2017-task3-subtaskB-english\u003C\u002Fli>\u003C\u002Ful> | 
SemEval 2016 \u002F 2017任务3子任务B和C数据集包含英语的训练+开发集（317个原始问题，3,169个相关问题和31,690条评论）以及测试数据集。任务描述和收集的数据详见任务论文的第3节和第4.1节 http:\u002F\u002Falt.qcri.org\u002Fsemeval2016\u002Ftask3\u002Fdata\u002Fuploads\u002Fsemeval2016-task3-report.pdf，链接位于 https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues\u002F18 的“Papers”部分。 | 为该任务发布的所有文件均可供一般研究用途免费使用。 |\n| text8 | 31 MB | \u003Cul>\u003Cli>http:\u002F\u002Fmattmahoney.net\u002Fdc\u002Ftextdata.html\u003C\u002Fli>\u003C\u002Ful> | 来自维基百科的前100,000,000字节纯文本。用于测试目的；完整的维基百科数据集请参见wiki-english-*。 | 未找到 |\n| wiki-english-20171001 | 6214 MB | \u003Cul>\u003Cli>https:\u002F\u002Fdumps.wikimedia.org\u002Fenwiki\u002F20171001\u002F\u003C\u002Fli>\u003C\u002Ful> | 2017年10月提取的维基百科数据转储。通过 `python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-en.gz` 生成。 | https:\u002F\u002Fdumps.wikimedia.org\u002Flegal.html |\n\n### 模型\n\n| 名称 | 向量数量 | 文件大小 | 基础数据集 | 了解更多 | 描述 | 参数 | 预处理 | 许可协议 |\n|------|-------------|-----------|--------------|------------|-------------|------------|---------------|---------|\n| conceptnet-numberbatch-17-06-300 | 1917247 | 1168 MB | ConceptNet, word2vec, GloVe 和 OpenSubtitles 2016 | \u003Cul>\u003Cli>http:\u002F\u002Faaai.org\u002Focs\u002Findex.php\u002FAAAI\u002FAAAI17\u002Fpaper\u002Fview\u002F14972\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002Fcommonsense\u002Fconceptnet-numberbatch\u003C\u002Fli> \u003Cli>http:\u002F\u002Fconceptnet.io\u002F\u003C\u002Fli>\u003C\u002Ful> | ConceptNet Numberbatch 包含最先进的语义向量（也称为词嵌入，word embeddings），可以直接用作单词意义的表示，也可以作为进一步机器学习的起点。ConceptNet Numberbatch 是 ConceptNet 开放数据项目的一部分。ConceptNet 提供了多种计算单词意义的方法，其中之一是词嵌入。ConceptNet Numberbatch 是这些词嵌入的一个快照。它通过一种集成方法构建，结合了来自 ConceptNet、word2vec、GloVe 和 OpenSubtitles 2016 的数据，并使用了一种改进的 retrofitting 方法。 | \u003Cul>\u003Cli>维度 - 300\u003C\u002Fli>\u003C\u002Ful> | - | https:\u002F\u002Fgithub.com\u002Fcommonsense\u002Fconceptnet-numberbatch\u002Fblob\u002Fmaster\u002FLICENSE.txt |\n| fasttext-wiki-news-subwords-300 | 999999 | 958 MB | Wikipedia 2017, UMBC webbase 语料库和 statmt.org 新闻数据集 (16B tokens) | \u003Cul>\u003Cli>https:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fenglish-vectors.html\u003C\u002Fli> \u003Cli>https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.09405\u003C\u002Fli> \u003Cli>https:\u002F\u002Farxiv.org\u002Fabs\u002F1607.01759\u003C\u002Fli>\u003C\u002Ful> | 在 Wikipedia 2017、UMBC webbase 语料库和 statmt.org 新闻数据集（16B tokens）上训练的 100 万个词向量。 | \u003Cul>\u003Cli>维度 - 300\u003C\u002Fli>\u003C\u002Ful> | - | https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-sa\u002F3.0\u002F |\n| glove-twitter-100 | 1193514 | 387 MB | Twitter (2B 推文, 27B tokens, 1.2M 词汇, 不区分大小写) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | 基于 2B 推文、27B tokens、1.2M 词汇、不区分大小写的预训练向量（https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F）。 | \u003Cul>\u003Cli>维度 - 100\u003C\u002Fli>\u003C\u002Ful> | 使用 `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-twitter-100.txt` 转换为 w2v 格式。 | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-twitter-200 | 1193514 | 758 MB | Twitter (2B 推文, 27B tokens, 1.2M 词汇, 不区分大小写) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | 基于 2B 推文、27B 
tokens、1.2M 词汇、不区分大小写的预训练向量（https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F）。 | \u003Cul>\u003Cli>维度 - 200\u003C\u002Fli>\u003C\u002Ful> | 使用 `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-twitter-200.txt` 转换为 w2v 格式。 | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-twitter-25 | 1193514 | 104 MB | Twitter (2B 推文, 27B tokens, 1.2M 词汇, 不区分大小写) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | 基于 2B 推文、27B tokens、1.2M 词汇、不区分大小写的预训练向量（https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F）。 | \u003Cul>\u003Cli>维度 - 25\u003C\u002Fli>\u003C\u002Ful> | 使用 `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-twitter-25.txt` 转换为 w2v 格式。 | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-twitter-50 | 1193514 | 199 MB | Twitter (2B 推文, 27B tokens, 1.2M 词汇, 不区分大小写) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | 基于 2B 推文、27B tokens、1.2M 词汇、不区分大小写的预训练向量（https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F）。 | \u003Cul>\u003Cli>维度 - 50\u003C\u002Fli>\u003C\u002Ful> | 使用 `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-twitter-50.txt` 转换为 w2v 格式。 | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-wiki-gigaword-100 | 400000 | 128 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, 不区分大小写) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | 基于 Wikipedia 2014 + Gigaword 5 的预训练向量，6B tokens，400K 词汇，不区分大小写（https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F）。 | \u003Cul>\u003Cli>维度 - 100\u003C\u002Fli>\u003C\u002Ful> | 使用 `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-wiki-gigaword-100.txt` 转换为 w2v 格式。 | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-wiki-gigaword-200 | 400000 | 252 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, 不区分大小写) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | 基于 Wikipedia 2014 + Gigaword 5 的预训练向量，6B tokens，400K 词汇，不区分大小写（https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F）。 | \u003Cul>\u003Cli>维度 - 200\u003C\u002Fli>\u003C\u002Ful> | 使用 `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-wiki-gigaword-200.txt` 转换为 w2v 格式。 | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-wiki-gigaword-300 | 400000 | 376 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, 不区分大小写) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | 基于 Wikipedia 2014 + Gigaword 5 的预训练向量，6B tokens，400K 词汇，不区分大小写（https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F）。 | \u003Cul>\u003Cli>维度 - 300\u003C\u002Fli>\u003C\u002Ful> | 使用 `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-wiki-gigaword-300.txt` 转换为 w2v 格式。 | 
http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| glove-wiki-gigaword-50 | 400000 | 65 MB | Wikipedia 2014 + Gigaword 5 (6B tokens, 不区分大小写) | \u003Cul>\u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\u003C\u002Fli>\u003C\u002Ful> | 基于 Wikipedia 2014 + Gigaword 5 的预训练向量，6B tokens，400K 词汇，不区分大小写（https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F）。 | \u003Cul>\u003Cli>维度 - 50\u003C\u002Fli>\u003C\u002Ful> | 使用 `python -m gensim.scripts.glove2word2vec -i \u003Cfname> -o glove-wiki-gigaword-50.txt` 转换为 w2v 格式。 | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F |\n| word2vec-google-news-300 | 3000000 | 1662 MB | Google News (约 1000 亿单词) | \u003Cul>\u003Cli>https:\u002F\u002Fcode.google.com\u002Farchive\u002Fp\u002Fword2vec\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Farxiv.org\u002Fabs\u002F1301.3781\u003C\u002Fli> \u003Cli>https:\u002F\u002Farxiv.org\u002Fabs\u002F1310.4546\u003C\u002Fli> \u003Cli>https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Flinguistic-regularities-in-continuous-space-word-representations\u002F?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf\u003C\u002Fli>\u003C\u002Ful> | 基于 Google News 数据集（约 1000 亿单词）部分训练的预训练向量。该模型包含 300 万单词和短语的 300 维向量。短语是通过 'Distributed Representations of Words and Phrases and their Compositionality' 中描述的一种简单数据驱动方法获得的（https:\u002F\u002Fcode.google.com\u002Farchive\u002Fp\u002Fword2vec\u002F）。 | \u003Cul>\u003Cli>维度 - 300\u003C\u002Fli>\u003C\u002Ful> | - | 未找到 |\n| word2vec-ruscorpora-300 | 184973 | 198 MB | 俄罗斯国家语料库（约 2.5 亿单词） | \u003Cul>\u003Cli>https:\u002F\u002Fwww.academia.edu\u002F24306935\u002FWebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models\u003C\u002Fli> \u003Cli>http:\u002F\u002Frusvectores.org\u002Fen\u002F\u003C\u002Fli> \u003Cli>https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues\u002F3\u003C\u002Fli>\u003C\u002Ful> | 在完整俄罗斯国家语料库（约 2.5 亿单词）上训练的 Word2vec Continuous Skipgram 向量。该模型包含 18.5 万单词。 | \u003Cul>\u003Cli>维度 - 300\u003C\u002Fli> \u003Cli>窗口大小 - 10\u003C\u002Fli>\u003C\u002Ful> | 语料库经过词形还原并标注了通用词性标签 | https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby\u002F4.0\u002Fdeed.en |\n\n（由 [generate_table.py](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fblob\u002Fmaster\u002Fgenerate_table.py) 根据 [list.json](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fblob\u002Fmaster\u002Flist.json) 生成）\n\n\n----\n\n\n\n# 想要添加新的语料库或模型吗？\n\n1. 使用 gzip 或 bz2 压缩你的数据集。\n\n2. 在任何文件共享服务上分享压缩后的文件。\n\n3. 
创建一个 [新问题](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues)，并提供数据集的链接。附上一份**详细描述**，说明你**为什么**以及**如何**创建该数据集，相关的论文或研究内容，以及其他用户应如何使用它。在适当的情况下，请包含代码示例。\n\n----------------\n\n`Gensim-data` 是开源软件，遵循 [LGPL 2.1 许可证](https:\u002F\u002Fgithub.com\u002Frare-technologies\u002Fgensim-data\u002Fblob\u002Fmaster\u002FLICENSE) 发布。\n\n版权所有 (c) 2018 [RARE Technologies](https:\u002F\u002Frare-technologies.com\u002F)。","# gensim-data 快速上手指南\n\n`gensim-data` 是 Gensim 提供的一个长期支持的开源数据存储库，专注于**非结构化文本处理**相关的数据集和预训练模型。通过 Gensim 的下载 API，用户可以轻松获取并使用这些资源。\n\n---\n\n## 环境准备\n\n### 系统要求\n- 操作系统：Windows、macOS 或 Linux\n- Python 版本：Python 3.6 及以上\n\n### 前置依赖\n在使用 `gensim-data` 之前，请确保已安装 Gensim 库：\n```bash\npip install gensim\n```\n\n如果在中国大陆地区，建议使用国内镜像源加速安装：\n```bash\npip install gensim -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n---\n\n## 安装步骤\n\n无需单独安装 `gensim-data`，只需安装 Gensim 即可。所有数据集和模型会通过 Gensim 的下载 API 自动获取。\n\n---\n\n## 基本使用\n\n### 使用 Python API\n\n#### 示例 1：加载预训练模型（如 GloVe 词向量）\n以下代码加载预训练的 `glove-twitter-25` 模型，并查询与“cat”最相似的词：\n```python\nimport gensim.downloader as api\n\ninfo = api.info()  # 查看可用模型\u002F数据集信息\nmodel = api.load(\"glove-twitter-25\")  # 下载并加载模型\nprint(model.most_similar(\"cat\"))\n```\n\n#### 示例 2：加载语料库并训练 Word2Vec 模型\n以下代码加载 `text8` 语料库，并用其训练一个 Word2Vec 模型：\n```python\nfrom gensim.models.word2vec import Word2Vec\nimport gensim.downloader as api\n\ncorpus = api.load('text8')  # 下载并加载语料库\nmodel = Word2Vec(corpus)  # 训练模型\nprint(model.most_similar(\"car\"))\n```\n\n#### 示例 3：仅下载数据集并获取本地路径\n以下代码仅下载数据集并返回其本地文件路径：\n```python\nimport gensim.downloader as api\n\nprint(api.load(\"20-newsgroups\", return_path=True))  # 输出：\u002Fhome\u002Fuser\u002Fgensim-data\u002F20-newsgroups\u002F20-newsgroups.gz\nprint(api.load(\"glove-twitter-25\", return_path=True))  # 输出：\u002Fhome\u002Fuser\u002Fgensim-data\u002Fglove-twitter-25\u002Fglove-twitter-25.gz\n```\n\n---\n\n### 使用命令行接口 (CLI)\n\n#### 查看可用数据集和模型信息\n```bash\npython -m gensim.downloader --info\n```\n\n#### 下载数据集或模型\n以下命令分别下载 `text8` 数据集和 `glove-twitter-25` 模型到本地目录 `~\u002Fgensim-data`：\n```bash\npython -m gensim.downloader --download text8\npython -m gensim.downloader --download glove-twitter-25\n```\n\n---\n\n## 注意事项\n1. 所有数据默认存储在 `~\u002Fgensim-data` 目录下。\n2. 每个数据集都有自己的许可证，请在使用前仔细阅读相关条款。\n\n更多详细信息，请参考 [Gensim-data GitHub 仓库](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data)。","一位数据科学家正在开发一个基于自然语言处理的问答系统，需要使用预训练的词向量模型来提升语义理解能力。\n\n### 没有 gensim-data 时\n- 需要手动搜索和下载预训练模型，比如从不同网站获取 GloVe 或 Word2Vec 模型，过程繁琐且容易出错。\n- 下载的文件格式不统一，可能需要额外编写代码来解析和加载模型，增加了开发时间。\n- 数据集或模型可能因为链接失效或版本更新而无法长期使用，导致项目维护困难。\n- 缺乏对数据集许可证的清晰说明，可能在无意中违反使用条款，带来法律风险。\n- 文件存储位置分散，团队协作时难以共享和管理资源。\n\n### 使用 gensim-data 后\n- 只需调用简单的 API（如 `api.load(\"glove-twitter-25\")`），即可自动下载并加载所需模型，大幅简化操作流程。\n- 所有数据集和模型都经过标准化处理，开箱即用，无需额外编写解析代码，节省开发时间。\n- 数据集和模型存储在本地 `~\u002Fgensim-data` 文件夹中，版本固定且长期可用，确保项目的稳定性和可维护性。\n- 每个数据集附带明确的许可证信息，帮助开发者合规使用，降低法律风险。\n- 团队成员可以通过相同的 API 轻松获取资源，文件路径统一，便于协作和资源共享。\n\ngensim-data 的核心价值在于为自然语言处理任务提供了便捷、可靠的数据和模型获取方式，显著提升了开发效率和项目稳定性。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fpiskvorky_gensim-data_e6c628e8.png","piskvorky","Radim Řehůřek","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fpiskvorky_5588d153.jpg","Creator of Gensim. Founder and CTO at pii-tools.com. 
I love history and beginnings in general.","@pii-tools","Prague",null,"https:\u002F\u002Ftwitter.com\u002Fradimrehurek","https:\u002F\u002Fgithub.com\u002Fpiskvorky",[85],{"name":86,"color":87,"percentage":88},"Python","#3572A5",100,1048,142,"2026-03-22T19:51:50","LGPL-2.1","未说明",{"notes":95,"python":93,"dependencies":96},"需要安装 Gensim 库并通过其 API 下载数据集或模型，所有数据存储在 ~\u002Fgensim-data 文件夹中。部分数据集和模型文件较大（如 wiki-english-20171001 为 6214 MB），建议确保磁盘空间充足。",[97],"gensim",[14],[100,97,101,102,103,104,105,106],"dataset","word2vec-model","corpora","pretrained-models","lsi-model","glove-model","lda-model",3,"2026-03-27T02:49:30.150509","2026-04-06T07:14:22.180384",[111,116,121,125,130,135],{"id":112,"question_zh":113,"answer_zh":114,"source_url":115},4209,"为什么使用 gensim.downloader 加载的 FastText 模型无法处理未登录词（OOV）？","通过 gensim.downloader 加载的 FastText 模型返回的是 KeyedVectors 对象，而不是完整的 FastText 模型，因此无法利用 FastText 的子词能力来处理 OOV。如果需要此功能，请尝试直接使用 Facebook 的 fastText 实现：https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText\u002Ftree\u002Fmaster\u002Fpython。","https:\u002F\u002Fgithub.com\u002Fpiskvorky\u002Fgensim-data\u002Fissues\u002F34",{"id":117,"question_zh":118,"answer_zh":119,"source_url":120},4210,"如何正确加载 FastText 模型以支持未登录词（OOV）？","通过 gensim 下载的 FastText 模型是以 word2vec 格式存储的，因此不支持 OOV。若需要支持 OOV 的模型，请使用 Facebook 的 fastText Python 包：https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText\u002Ftree\u002Fmaster\u002Fpython。","https:\u002F\u002Fgithub.com\u002Fpiskvorky\u002Fgensim-data\u002Fissues\u002F26",{"id":122,"question_zh":123,"answer_zh":124,"source_url":120},4211,"gensim 中的 FastText 是否支持监督学习模式？","Gensim 的 FastText 实现仅支持无监督学习模式。如果需要监督学习模式，请使用 Facebook 的 fastText 实现：https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText\u002Ftree\u002Fmaster\u002Fpython。",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},4212,"为什么加载 GloVe 向量时会报 'unicode' 未定义的错误？","该问题可能与 Python 版本有关，Python 3 中不再支持 `unicode` 类型。检查是否在代码中重新定义了 `any2unicode` 或其他相关函数。此外，确保运行环境（如 Google Colab）中没有冲突的自定义实现。","https:\u002F\u002Fgithub.com\u002Fpiskvorky\u002Fgensim-data\u002Fissues\u002F24",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},4213,"为什么使用 gensim.downloader 时会出现 'No module named gensim.downloader' 错误？","这可能是由于 Gensim 版本问题或安装不完整导致的。请确保安装的是最新版本的 Gensim，并检查是否正确安装了所有依赖项。如果问题仍然存在，可以尝试降级到较早版本（如 3.8.1）进行测试。","https:\u002F\u002Fgithub.com\u002Fpiskvorky\u002Fgensim-data\u002Fissues\u002F11",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},4214,"如何离线使用已下载的 Gensim 数据集？","在 Gensim 2545 号拉取请求中修复了此问题。现在，用户可以在本地下载数据集后完全离线使用。如果遇到需要在线检查的情况，可以通过添加“持久化”标志来避免网络依赖。","https:\u002F\u002Fgithub.com\u002Fpiskvorky\u002Fgensim-data\u002Fissues\u002F23",[141,146,151,156,161,166,171,176,181,186,191,196,201,206,211,216,221,226,231,236],{"id":142,"version":143,"summary_zh":144,"released_at":145},103657,"fasttext-wiki-news-subwords-300","Pre-trained FastText 1 million word vectors trained on Wikipedia 2017, UMBC webbase corpus and statmt.org news dataset (16B tokens).\r\n\r\nFeature | Description\r\n------------ | -------------\r\nFile size | 959MB\r\nNumber of vectors | 999999\r\nDimension | 300\r\nLicense | https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby-sa\u002F3.0\u002F\r\n\r\n\r\nRead more:\r\n- https:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fenglish-vectors.html\r\n- [Tomas Mikolov, Edouard Grave, Piotr Bojanowski, Christian Puhrsch, Armand Joulin: \"Advances in Pre-Training Distributed Word Representations\"](https:\u002F\u002Farxiv.org\u002Fabs\u002F1712.09405)\r\n- [Armand Joulin 
Edouard Grave Piotr Bojanowski Tomas Mikolov: \"Bag of Tricks for Efficient Text Classification\"](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1607.01759.pdf)\r\n\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\n\r\nmodel = api.load(\"fasttext-wiki-news-subwords-300\")\r\nmodel.most_similar(positive=[\"russia\", \"river\"])\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n[(u'russias', 0.6939424276351929),\r\n (u'danube', 0.6881916522979736),\r\n (u'river.', 0.6683923006057739),\r\n (u'crimea', 0.6638611555099487),\r\n (u'rhine', 0.6632323861122131),\r\n (u'rivermouth', 0.6602864265441895),\r\n (u'wester', 0.6586191058158875),\r\n (u'finland', 0.6585439443588257),\r\n (u'volga', 0.6576792001724243),\r\n (u'ukraine', 0.6569074392318726)]\r\n\r\n\"\"\"\r\n```\r\n","2018-03-16T12:50:24",{"id":147,"version":148,"summary_zh":149,"released_at":150},103658,"semeval-2016-2017-task3-subtaskBC","SemEval 2016 \u002F 2017 Task 3 Subtask B and C datasets contain train+development (317 original questions, 3,169 related questions, and 31,690 comments), and test datasets in English. The description of the tasks and the collected data is given in sections 3 and 4.1 of the [2016 task paper](http:\u002F\u002Falt.qcri.org\u002Fsemeval2016\u002Ftask3\u002Fdata\u002Fuploads\u002Fsemeval2016-task3-report.pdf) linked in section “Papers” of #18.\r\n\r\nRelated issue #18 \r\n\r\n| attribute | value |\r\n|------------|---------|\r\n| File size | 6MB|\r\n| Number of records | 4 (upper level) |\r\n\r\nRead more:\r\n- [SemEval task 3: community question answering](http:\u002F\u002Falt.qcri.org\u002Fsemeval2017\u002Ftask3\u002F)\r\n- [Preslav Nakov, Doris Hoogeveen, Llu´ıs Marquez, Alessandro Moschitti, Hamdy Mubarak, Timothy Baldwin, Karin Verspoor: \"SemEval-2017 Task 3: Community Question Answering\"](http:\u002F\u002Falt.qcri.org\u002Fsemeval2017\u002Ftask3\u002Fdata\u002Fuploads\u002Fsemeval2017-task3.pdf)\r\n\r\nProduced by: https:\u002F\u002Fgithub.com\u002FWitiko\u002Fsemeval-2016_2017-task3-subtaskB-english\r\n\r\nExample:\r\n```python\r\nimport gensim.downloader as api\r\nfrom gensim.corpora import Dictionary\r\nfrom gensim.similarities import MatrixSimilarity\r\nfrom gensim.utils import simple_preprocess\r\nimport numpy as np\r\n\r\n\r\ndef read_corpus():\r\n    for thread in api.load(\"semeval-2016-2017-task3-subtaskA-unannotated\"):\r\n        yield simple_preprocess(thread[\"RelQuestion\"][\"RelQSubject\"])\r\n        yield simple_preprocess(thread[\"RelQuestion\"][\"RelQBody\"])\r\n        for relcomment in thread[\"RelComments\"]:\r\n            yield simple_preprocess(relcomment[\"RelCText\"])\r\n\r\n\r\ndictionary = Dictionary(read_corpus())\r\ndatasets = api.load(\"semeval-2016-2017-task3-subtaskBC\")\r\n\r\n\r\ndef produce_test_data(dataset):\r\n    for orgquestion in datasets[dataset]:\r\n        relquestions = [\r\n            (\r\n                dictionary.doc2bow(simple_preprocess(thread[\"RelQuestion\"][\"RelQSubject\"]) + simple_preprocess(thread[\"RelQuestion\"][\"RelQBody\"])),\r\n                thread[\"RelQuestion\"][\"RELQ_RELEVANCE2ORGQ\"] in (\"PerfectMatch\", \"Relevant\")\r\n            )\r\n            for thread in orgquestion[\"Threads\"]\r\n        ]\r\n\r\n        relcomments = [\r\n            (\r\n                dictionary.doc2bow(simple_preprocess(relcomment[\"RelCText\"])),\r\n                relcomment[\"RELC_RELEVANCE2ORGQ\"] == \"Good\"\r\n            )\r\n            for thread in orgquestion[\"Threads\"] for relcomment in thread[\"RelComments\"]\r\n        ]\r\n\r\n  
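      # note: rebind orgquestion to its bag-of-words vector (subject + body), used as the similarity query below\r\n  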
      orgquestion = dictionary.doc2bow(simple_preprocess(orgquestion[\"OrgQSubject\"]) + simple_preprocess(orgquestion[\"OrgQBody\"]))\r\n        yield orgquestion, dict(subtaskB=relquestions, subtaskC=relcomments)\r\n\r\n\r\ndef average_precision(similarities, relevance):\r\n    precision = [\r\n        (num_correct + 1) \u002F (num_total + 1) \\\r\n        for num_correct, num_total in enumerate(\r\n            num_total for num_total, (_, relevant) in enumerate(\r\n                sorted(zip(similarities, relevance), reverse=True)\r\n            )\r\n            if relevant)\r\n        ]\r\n\r\n    return np.mean(precision) if precision else 0.0\r\n\r\n\r\ndef evaluate(dataset, subtask):\r\n    results = []\r\n    for orgquestion, subtasks in produce_test_data(dataset):\r\n        documents, relevance = zip(*subtasks[subtask])\r\n        index = MatrixSimilarity(documents, num_features=len(dictionary))\r\n        similarities = index[orgquestion]\r\n        results.append(average_precision(similarities, relevance))\r\n\r\n    return np.mean(results) * 100.0\r\n\r\n\r\nfor dataset in (\"2016-dev\", \"2016-test\", \"2017-test\"):\r\n    print(\"MAP score on the {} dataset:\\t{:.2f} (Subtask B)\\t{:.2f} (Subtask C)\".format(dataset, evaluate(dataset, \"subtaskB\"), evaluate(dataset, \"subtaskC\")))\r\n\r\n\r\n\r\n\"\"\"\r\nOutput:\r\n\r\nMAP score on the 2016-dev dataset:\t41.89 (Subtask B)\t3.33 (Subtask C)\r\nMAP score on the 2016-test dataset:\t51.42 (Subtask B)\t5.59 (Subtask C)\r\nMAP score on the 2017-test dataset:\t23.65 (Subtask B)\t0.74 (Subtask C)\r\n\"\"\"\r\n```","2018-02-05T18:42:45",{"id":152,"version":153,"summary_zh":154,"released_at":155},103659,"semeval-2016-2017-task3-subtaskA-unannotated","SemEval 2016 \u002F 2017 Task 3 Subtask A unannotated dataset contains 189,941 questions and 1,894,456 comments in English collected from the Community Question Answering (CQA) web forum of Qatar Living. These can be used as a corpus for language modelling.\r\n\r\nRelated issue #18 \r\n\r\n| attribute | value |\r\n|------------|---------|\r\n| File size | 224MB|\r\n| Number of records | 189941 |\r\n\r\nRead more:\r\n- http:\u002F\u002Falt.qcri.org\u002Fsemeval2017\u002Ftask3\u002F\r\n- http:\u002F\u002Falt.qcri.org\u002Fsemeval2017\u002Ftask3\u002Fdata\u002Fuploads\u002Fsemeval2017-task3.pdf\r\n\r\nProduced by: https:\u002F\u002Fgithub.com\u002FWitiko\u002Fsemeval-2016_2017-task3-subtaskA-unannotated-english\r\n\r\nExample:\r\n```python\r\nimport gensim.downloader as api\r\n\r\n\r\nfor thread in api.load(\"semeval-2016-2017-task3-subtaskA-unannotated\"):\r\n    print(\"Question subjects: {}\\n\".format(thread[\"RelQuestion\"][\"RelQSubject\"]))\r\n    print(\"Question body: {}\\n\".format(thread[\"RelQuestion\"][\"RelQBody\"]))\r\n    print(\"Relevant comments: \")\r\n    for idx, relcomment in enumerate(thread[\"RelComments\"]):\r\n        print(\"\\t#{}: {}\\n\".format(idx + 1, relcomment[\"RelCText\"]))\r\n    break\r\n\r\n\"\"\"\r\nOutput:\r\n\r\nQuestion subjects: Thailand:IT Minsitry blocks CNN; Facebook;\r\n\r\nQuestion body: The state of Internet in Thailand:IT Minsitry blocks CNN; Facebook; Yahoo; Flickr Thai Immigration website listed as dangerousFull story: http:\u002F\u002Fwww.thaivisa.com\u002Fforum\u002FThai-Govt-Blocks-Cnn-Yahoo-Financ-t321851.html\r\n\r\nRelevant comments: \r\n\t#1: have they blocked porn??? 
\u003Cimg src=\"http:\u002F\u002Fwww.qatarliving.com\u002Ffiles\u002Fimages\u002FDa.gif\">\r\n\r\n\t#2: like trying to contain a tsunami with a hand towel ************************************ I'm Jack's complete lack of surprise\r\n\r\n\t#3: oops double post.. ----------------- \"HE WHO DARES WINS\" Derek Edward Trotter\r\n\r\n\t#4: What next they gonna ban all *** tourist from entering the country? ----------------- \"HE WHO DARES WINS\" Derek Edward Trotter\r\n\r\n\t#5: Or you can always make your own there with some thai babys Rules are a guideline for intelligent people; but they must be adhered to by idiots.\r\n\r\n\t#6: why CNN? they want to die ignorant of what happens around?\r\n\"\"\"\r\n\r\n```\r\n","2018-02-05T16:35:24",{"id":157,"version":158,"summary_zh":159,"released_at":160},103660,"patent-2017","Raw full text and metadata of patent grants, from the [US Patent and Trademark Office](https:\u002F\u002Fwww.uspto.gov\u002F) (USPTO), as distributed by [Reed Tech](http:\u002F\u002Fpatents.reedtech.com\u002F).\r\n\r\nContains the full text including tables, [International Patent Classification](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FInternational_Patent_Classification) (IPC) and [Cooperative Patent Classification](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCooperative_Patent_Classification) (CPC), sequence data and 'in-line' mathematical expressions of **each patent grant issued in 2017**.\r\n\r\nRead more about dataset history, usage and conditions:\r\n- http:\u002F\u002Fpatents.reedtech.com\u002Fpgrbft.php\r\n\r\n| attribute | value |\r\n|------------|---------|\r\n| File size | 3GB |\r\n| Number of patents | 353,197 |\r\n\r\nFor alternative patent datasets, see the discussion in issue #8.\r\n\r\nExample:\r\n\r\n```python\r\nimport gensim.downloader as api\r\nimport json\r\n\r\ndataset = api.load(\"patent-2017\")\r\nfor idx, document in enumerate(dataset):\r\n    print(json.dumps(document, indent=2))\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n{\r\n  \"description\": {\r\n    \"p\": [\r\n      \"The present application claims the benefit under 35 U.S.C. \\u00a7119 to U.S. provisional patent application Ser. No. 61\u002F768,295, filed Feb. 22, 2013. The foregoing application is hereby incorporated by reference into the present application in its entirety.\", \r\n      \"The present inventions relate to tissue stimulation systems, and more particularly, to systems and methods for adjusting the stimulation provided to tissue to minimize the energy requirements of the systems.\", \r\n      \"Implantable neurostimulation systems have proven therapeutic in a wide variety of diseases and disorders. Pacemakers and Implantable Cardiac Defibrillators (ICDs) have proven highly effective in the treatment of a number of cardiac conditions (e.g., arrhythmias). Spinal Cord Stimulation (SCS) systems have long been accepted as a therapeutic modality for the treatment of chronic pain syndromes, and the application of spinal stimulation has begun to expand to additional applications, such as angina pectoris and incontinence. Deep Brain Stimulation (DBS) has also been applied therapeutically for well over a decade for the treatment of refractory Parkinson's Disease, and DBS has also recently been applied in additional areas, such as essential tremor and epilepsy. 
Further, in recent investigations, Peripheral Nerve Stimulation (PNS) systems have demonstrated efficacy in the treatment of chronic pain syndromes and incontinence, and a number of additional applications are currently under investigation. Furthermore, Functional Electrical Stimulation (FES) systems such as the Freehand system by NeuroControl (Cleveland, Ohio) have been applied to restore some functionality to paralyzed extremities in spinal cord injury patients.\", \r\n      \"Each of these implantable neurostimulation systems typically includes one or more electrode carrying stimulation leads, which are implanted at the desired stimulation site, and a neurostimulation device implanted remotely from the stimulation site, but coupled either directly to the stimulation lead(s) or indirectly to the stimulation lead(s) via a lead extension. Thus, electrical pulses can be delivered from the neurostimulation device to the electrode(s) to activate a volume of tissue in accordance with a set of stimulation parameters and provide the desired efficacious therapy to the patient. In particular, electrical energy conveyed between at least one cathodic electrode and at least one anodic electrode creates an electrical field, which when strong enough, depolarizes (or \\u201cstimulates\\u201d) the neurons beyond a threshold level, thereby evoking action potentials (APs) that propagate along the neural fibers. A typical stimulation parameter set may include the electrodes that are sourcing (anodes) or returning (cathodes) the modulating current at any given time, as well as the amplitude, duration, and rate of the stimulation pulses.\", \r\n      \"The neurostimulation system may further comprise a handheld patient programmer to remotely instruct the neurostimulation device to generate electrical stimulation pulses in accordance with selected stimulation parameters. The handheld programmer in the form of a remote control (RC) may, itself, be programmed by a clinician, for example, by using a clinician's programmer (CP), which typically includes a general purpose computer, such as a laptop, with a programming software package installed thereon.\", \r\n      \"Of course, neurostimulation devices are active devices requiring energy for operation, and thus, the neurostimulation system may oftentimes includes an external charger to recharge a neurostimulation device, so that a surgical procedure to replace a power depleted neurostimulation device can be avoided. 
To wirelessly convey energy between the external charger and the implanted neurostimulation device, the charger typically includes an alternating current (AC) charging coil that supplies energy to a s","2017-12-28T10:38:18",{"id":162,"version":163,"summary_zh":164,"released_at":165},103661,"conceptnet-numberbatch-17-06-300","[ConceptNet Numberbatch](https:\u002F\u002Fgithub.com\u002Fcommonsense\u002Fconceptnet-numberbatch) consists of state-of-the-art semantic vectors (also known as word embeddings) that can be used directly as a representation of word meanings or as a starting point for further machine learning.\r\n\r\nRelated issue #9.\r\n\r\n| attribute | value |\r\n|------------|---------|\r\n| File size | 1.14GB |\r\n| Number of vectors | 1917247 |\r\n| Dimension | 300 |\r\n| License | https:\u002F\u002Fgithub.com\u002Fcommonsense\u002Fconceptnet-numberbatch\u002Fblob\u002Fmaster\u002FLICENSE.txt | \r\n\r\nRead more:\r\n- http:\u002F\u002Faaai.org\u002Focs\u002Findex.php\u002FAAAI\u002FAAAI17\u002Fpaper\u002Fview\u002F14972\r\n- https:\u002F\u002Fgithub.com\u002Fcommonsense\u002Fconceptnet-numberbatch\r\n- http:\u002F\u002Fconceptnet.io\u002F\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\n\r\nmodel = api.load(\"conceptnet-numberbatch-17-06-300\")\r\nfor word, distance in model.most_similar(\"\u002Fc\u002Fen\u002Fbeer\"):\r\n    print(u\"{}: {:4f}\".format(word, distance))\r\n\r\n\"\"\"\r\noutput:\r\n\r\n\u002Fc\u002Fca\u002Fbirra: 0.995633\r\n\u002Fc\u002Feu\u002Fzerbeza: 0.995058\r\n\u002Fc\u002Fhi\u002Fबियर: 0.994754\r\n\u002Fc\u002Fja\u002Fビア: 0.994656\r\n\u002Fc\u002Fja\u002Fビヤ: 0.994406\r\n\u002Fc\u002Fja\u002Fビーア: 0.994406\r\n\u002Fc\u002Feu\u002Fgaragardo: 0.994178\r\n\u002Fc\u002Fku\u002Fبیرە: 0.993689\r\n\u002Fc\u002Feu\u002Fbiera: 0.993634\r\n\u002Fc\u002Fsh\u002Fпиво: 0.992218\r\n\"\"\"\r\n\r\n```","2017-12-18T12:27:49",{"id":167,"version":168,"summary_zh":169,"released_at":170},103662,"word2vec-ruscorpora-300","Word2vec Continuous Skipgram vectors trained on the full Russian National Corpus (about 250M words).\r\n\r\nRelated issue https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data\u002Fissues\u002F3.\r\n\r\n| attribute | value |\r\n|------------|---------|\r\n| File size | 199MB |\r\n| Number of vectors | 184973 |\r\n| Preprocessing | The corpus (used for training) was lemmatized and tagged with Universal PoS |\r\n| Window size | 10 |\r\n| Dimension | 300 |\r\n| License | https:\u002F\u002Fcreativecommons.org\u002Flicenses\u002Fby\u002F4.0\u002Fdeed.en | \r\n\r\nRead more:\r\n- https:\u002F\u002Fwww.academia.edu\u002F24306935\u002FWebVectors_a_Toolkit_for_Building_Web_Interfaces_for_Vector_Semantic_Models\r\n- http:\u002F\u002Frusvectores.org\u002Fen\u002F\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\n\r\nmodel = api.load(\"word2vec-ruscorpora-300\")\r\nfor word, distance in model.most_similar(u\"кот_NOUN\"):  \r\n    print(u\"{}: {:.3f}\".format(word, distance))\r\n  \r\n\"\"\"\r\noutput:\r\n\r\nкошка_NOUN: 0.757\r\nкотенок_NOUN: 0.668\r\nпес_NOUN: 0.563\r\nмяукать_VERB: 0.562\r\nтобик_NOUN: 0.559\r\nфоксик_NOUN: 0.557\r\nсобака_NOUN: 0.557\r\nмяучать_VERB: 0.554\r\nхарлашка_NOUN: 0.552\r\nкотяра_NOUN: 0.551\r\n\"\"\"\r\n\r\n```","2017-12-18T08:56:15",{"id":172,"version":173,"summary_zh":174,"released_at":175},103663,"wiki-english-20171001","Plaintext extracted from raw XML Wikipedia dump from October 2017. 
Each article is split into its constituent sections and their headlines (see the `section_texts` and `section_titles` attributes of each record).\r\n\r\nattribute | value\r\n------------|--------\r\nFile size | 6.3GB\r\nNumber of articles | 4,924,894\r\nTotal number of sections | 23,179,735\r\n\r\nRead more:\r\n- https:\u002F\u002Fdumps.wikimedia.org\u002Fbackup-index.html\r\n- https:\u002F\u002Fdumps.wikimedia.org\u002Fenwiki\u002F20171001\r\n\r\nProduced by \r\n`python -m gensim.scripts.segment_wiki -f enwiki-20171001-pages-articles.xml.bz2 -o wiki-english-20171001.gz`\r\n\r\nExample:\r\n\r\n```python\r\nimport gensim.downloader as api\r\n\r\ndata = api.load(\"wiki-english-20171001\")\r\nfor article in data:\r\n    for section_title, section_text in zip(article['section_titles'], article['section_texts']):\r\n        print(\"Section title: %s\" % section_title)\r\n        print(\"Section text: %s\" % section_text)\r\n    break\r\n\r\n\"\"\"\r\nSection title: Introduction\r\nSection text: \r\n\r\n\r\n\r\n\r\n'''Anarchism''' is a political philosophy that advocates self-governed societies based on voluntary institutions. These are often described as stateless societies, although several authors have defined them more specifically as institutions based on non-hierarchical free associations. Anarchism holds the state to be undesirable, unnecessary and harmful.\r\n\r\nWhile anti-statism is central, anarchism specifically entails opposing authority or hierarchical organisation in the conduct of all human relations, including—but not limited to—the state system. Anarchism is usually considered a far-left ideology and much of anarchist economics and anarchist legal philosophy reflects anti-authoritarian interpretations of communism, collectivism, syndicalism, mutualism or participatory economics.\r\n\r\nAnarchism does not offer a fixed body of doctrine from a single particular world view, instead fluxing and flowing as a philosophy. Many types and traditions of anarchism exist, not all of which are mutually exclusive. Anarchist schools of thought can differ fundamentally, supporting anything from extreme individualism to complete collectivism. Strains of anarchism have often been divided into the categories of social and individualist anarchism or similar dual classifications.\r\n\r\nSection title: Etymology and terminology\r\nSection text: \r\n\r\nThe word ''anarchism'' is composed from the word ''anarchy'' and the suffix ''-ism'', themselves derived respectively from the Greek , i.e. ''anarchy'' (from , ''anarchos'', meaning \"one without rulers\"; from the privative prefix ἀν- (''an-'', i.e. \"without\") and , ''archos'', i.e. \"leader\", \"ruler\"; (cf. ''archon'' or , ''arkhē'', i.e. \"authority\", \"sovereignty\", \"realm\", \"magistracy\")) and the suffix  or  (''-ismos'', ''-isma'', from the verbal infinitive suffix -ίζειν, ''-izein''). The first known use of this word was in 1539. Various factions within the French Revolution labelled opponents as anarchists (as Robespierre did the Hébertists) although few shared many views of later anarchists.  There would be many revolutionaries of the early nineteenth century who contributed to the anarchist doctrines of the next generation, such as William Godwin and Wilhelm Weitling, but they did not use the word ''anarchist'' or ''anarchism'' in describing themselves or their beliefs.\r\n\r\nThe first political philosopher to call himself an anarchist was Pierre-Joseph Proudhon, marking the formal birth of anarchism in the mid-nineteenth century. 
Since the 1890s, and beginning in France, the term \"libertarianism\" has often been used as a synonym for anarchism and was used almost exclusively in this sense until the 1950s in the United States; its use as a synonym is still common outside the United States. On the other hand, some use libertarianism to refer to individualistic free-market philosophy only, referring to free-market anarchism as libertarian anarchism.\r\n\r\nSection title: History\r\nSection text: \r\n\r\n===Origins===\r\nWoodcut from a Diggers document by William Everard\r\n\r\nThe earliest anarchist themes can be found in the 6th century BC among the works of Taoist philosopher Laozi and in later centuries by Zhuangzi and Bao Jingyan. Zhuangzi's philosophy has been described by various sources as anarchist. Zhuangzi wrote: \"A petty thief is put in jail. A great brigand becomes a ruler of a Nation\". Diogenes of Sinope and the Cynics, as well as their contemporary Zeno of Citium, the founder of Stoicism, also introduced similar topics. Jesus is sometimes considered the first anarchist in the Christian anarchist tradition. Georges Lechartier wrote: \"The true founder of anarchy was Jesus Christ and ... the first anarchist society was that of the apostles\". In early Islamic history, some manifestations of anarchic thought are found during the Islamic civil war over the Caliphate, where the Kharijites insisted that the imamate is a right for each individual within the Islamic society.\r\n\r\nThe French renaissance political philosopher Étienne de La Boétie wr","2017-11-10T02:38:02",{"id":177,"version":178,"summary_zh":179,"released_at":180},103664,"quora-duplicate-questions","Over 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line contains a duplicate pair or not.\r\n\r\nattribute | value\r\n------------|--------\r\nFile size | 21MB\r\nNumber of pairs | 404290\r\nLicense | probably https:\u002F\u002Fwww.quora.com\u002Fabout\u002Ftos\r\n\r\nRead more:\r\n- https:\u002F\u002Fdata.quora.com\u002FFirst-Quora-Dataset-Release-Question-Pairs\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\nimport json\r\n\r\ndata = api.load(\"quora-duplicate-questions\")\r\n\r\nfor question_pair in data:\r\n    print(json.dumps(question_pair, indent=4))\r\n    break\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n{\r\n    \"qid1\": \"1\",\r\n    \"question2\": \"What is the step by step guide to invest in share market?\",\r\n    \"qid2\": \"2\",\r\n    \"is_duplicate\": \"0\",\r\n    \"question1\": \"What is the step by step guide to invest in share market in india?\",\r\n    \"id\": \"0\"\r\n}\r\n\"\"\"\r\n```\r\n","2017-11-14T05:08:21",{"id":182,"version":183,"summary_zh":184,"released_at":185},103665,"word2vec-google-news-300","Pre-trained vectors trained on a part of the Google News dataset (about 100 billion words). The model contains vectors for 3 million words and phrases. The phrases were obtained using a simple data-driven approach described in [\"Distributed Representations of Words and Phrases and their Compositionality\"](https:\u002F\u002Farxiv.org\u002Fabs\u002F1310.4546).\r\n\r\n
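Gensim implements a comparable data-driven collocation detector in `gensim.models.phrases`; the toy sketch below is an editorial illustration of the idea (not how the Google vectors themselves were built), promoting the consistently adjacent pair \"new york\" to a single token:\r\n\r\n```python\r\nfrom gensim.models.phrases import Phrases\r\n\r\n# Tiny toy corpus: \"new\" and \"york\" only ever occur together, so their\r\n# normalized PMI is 1.0 and the pair clears the threshold below.\r\nsentences = [\r\n    [\"i\", \"love\", \"new\", \"york\"],\r\n    [\"new\", \"york\", \"is\", \"huge\"],\r\n    [\"i\", \"like\", \"pizza\"],\r\n    [\"we\", \"love\", \"pizza\"],\r\n] * 20  # repeat so every bigram clears min_count\r\n\r\nbigram = Phrases(sentences, min_count=5, threshold=0.8, scoring=\"npmi\")\r\nprint(bigram[[\"i\", \"love\", \"new\", \"york\"]])  # ['i', 'love', 'new_york']\r\n```\r\n\r\n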
Feature | Description\r\n------------ | -------------\r\nFile size | 1.6GB\r\nNumber of vectors | 3000000\r\nDimension | 300\r\n\r\n\r\nRead more:\r\n- https:\u002F\u002Fcode.google.com\u002Farchive\u002Fp\u002Fword2vec\u002F\r\n- [Efficient Estimation of Word Representations in Vector Space](https:\u002F\u002Farxiv.org\u002Fabs\u002F1301.3781)\r\n- [Distributed Representations of Words and Phrases and their Compositionality](https:\u002F\u002Farxiv.org\u002Fabs\u002F1310.4546)\r\n- [Linguistic Regularities in Continuous Space Word Representations](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Fresearch\u002Fpublication\u002Flinguistic-regularities-in-continuous-space-word-representations\u002F?from=http%3A%2F%2Fresearch.microsoft.com%2Fpubs%2F189726%2Frvecs.pdf)\r\n\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\n\r\nmodel = api.load(\"word2vec-google-news-300\")\r\nmodel.most_similar(positive=[\"king\", \"woman\"], negative=[\"man\"])\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n[(u'queen', 0.7118192911148071),\r\n (u'monarch', 0.6189674139022827),\r\n (u'princess', 0.5902431011199951),\r\n (u'crown_prince', 0.5499460697174072),\r\n (u'prince', 0.5377321243286133),\r\n (u'kings', 0.5236844420433044),\r\n (u'Queen_Consort', 0.5235945582389832),\r\n (u'queens', 0.518113374710083),\r\n (u'sultan', 0.5098593235015869),\r\n (u'monarchy', 0.5087411999702454)]\r\n\r\n\"\"\"\r\n```\r\n","2017-11-09T08:50:18",{"id":187,"version":188,"summary_zh":189,"released_at":190},103666,"__testing_word2vec-matrix-synopsis",":exclamation: For testing purposes only :exclamation:\r\nThis is a word2vec model of [matrix-synopsis](http:\u002F\u002Fwww.imdb.com\u002Ftitle\u002Ftt0133093\u002Fplotsummary?ref_=ttpl_pl_syn#synopsis).
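\r\n\r\nExample (an editorial sketch; the release itself ships no example, and the model is assumed to expose `most_similar` like the loaders above, with a tiny vocabulary drawn from the lowercased synopsis text):\r\n\r\n```python\r\nimport gensim.downloader as api\r\n\r\n# Load the miniature word2vec model; handy for unit tests and demos.\r\nmodel = api.load(\"__testing_word2vec-matrix-synopsis\")\r\nprint(model.most_similar(\"neo\", topn=3))  # assumes \"neo\" is in the toy vocabulary\r\n```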
","2017-10-29T17:45:16",{"id":192,"version":193,"summary_zh":194,"released_at":195},103667,"__testing_multipart-matrix-synopsis",":exclamation: For testing purposes only :exclamation:\r\n\r\nSource: [matrix-synopsis](http:\u002F\u002Fwww.imdb.com\u002Ftitle\u002Ftt0133093\u002Fplotsummary?ref_=ttpl_pl_syn#synopsis)\r\n","2017-11-08T10:24:16",{"id":197,"version":198,"summary_zh":199,"released_at":200},103668,"__testing_matrix-synopsis",":exclamation: For testing purposes only :exclamation:\r\n\r\nSource: [matrix-synopsis](http:\u002F\u002Fwww.imdb.com\u002Ftitle\u002Ftt0133093\u002Fplotsummary?ref_=ttpl_pl_syn#synopsis)","2017-10-29T17:23:06",{"id":202,"version":203,"summary_zh":204,"released_at":205},103669,"glove-twitter-200","Pre-trained glove vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased.\r\n\r\nattribute | value\r\n----------- | -------\r\nFile size | 759MB\r\nNumber of vectors | 1193514\r\nDimension | 200\r\nLicense | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F\r\n\r\nRead more:\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\r\n\r\nExample:\r\n```python\r\nimport gensim.downloader as api\r\n\r\nmodel = api.load(\"glove-twitter-200\")\r\nmodel.most_similar(positive=['king', 'woman'], negative=['man'], topn=1)\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n[('queen', 0.6820898056030273)]\r\n\"\"\"\r\n```","2017-10-30T18:16:28",{"id":207,"version":208,"summary_zh":209,"released_at":210},103670,"glove-wiki-gigaword-300","Pre-trained glove vectors based on Wikipedia 2014 + Gigaword 5 (6B tokens), uncased\r\n\r\nattribute | value\r\n----------- | -------\r\nFile size | 376MB\r\nNumber of vectors | 400000\r\nDimension | 300\r\nLicense | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F\r\n\r\nRead more:\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\r\n\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\nmodel = api.load(\"glove-wiki-gigaword-300\")\r\nmodel.most_similar(positive=['mature', 'boy'], topn=1)\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n[('girl', 0.6623601913452148)]\r\n\"\"\"\r\n```","2017-10-28T16:40:02",{"id":212,"version":213,"summary_zh":214,"released_at":215},103671,"glove-wiki-gigaword-200","Pre-trained glove vectors based on Wikipedia 2014 + Gigaword 5 (6B tokens), uncased\r\n\r\nattribute | value\r\n----------- | -------\r\nFile size | 252MB\r\nNumber of vectors | 400000\r\nDimension | 200\r\nLicense | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F\r\n\r\nRead more:\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\nmodel = api.load(\"glove-wiki-gigaword-200\")\r\nmodel.most_similar(positive=['tomato'], negative=['fruit'], topn=1)\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n[('marinara', 0.48418283462524414)]\r\n\"\"\"\r\n```","2017-10-28T16:14:23",{"id":217,"version":218,"summary_zh":219,"released_at":220},103672,"glove-wiki-gigaword-100","Pre-trained glove vectors based on Wikipedia 2014 + Gigaword 5 (6B tokens), uncased\r\n\r\nattribute | value\r\n----------- | -------\r\nFile size | 128MB\r\nNumber of vectors | 400000\r\nDimension | 100\r\nLicense | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F\r\n\r\nRead more:\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\n\r\nmodel = api.load(\"glove-wiki-gigaword-100\")\r\nmodel.most_similar(positive=['highest', 'mountain'], topn=1)\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n[('peak', 0.7558295726776123)]\r\n\"\"\"\r\n```","2017-10-28T15:57:42",{"id":222,"version":223,"summary_zh":224,"released_at":225},103673,"glove-twitter-50","Pre-trained glove vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased.\r\n\r\nattribute | value\r\n----------- | -------\r\nFile size | 200MB\r\nNumber of vectors | 1193514\r\nDimension | 50\r\nLicense | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F\r\n\r\nRead more:\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\n\r\nmodel = api.load(\"glove-twitter-50\")\r\nmodel.most_similar(positive=['human', 'crime'], negative=['party'], topn=1)\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n[('disease', 0.7200273871421814)]\r\n\"\"\"\r\n```\r\n","2017-10-28T17:22:20",{"id":227,"version":228,"summary_zh":229,"released_at":230},103674,"glove-twitter-25","Pre-trained glove vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased.\r\n\r\nattribute | value\r\n----------- | -------\r\nFile size | 105MB\r\nNumber of vectors | 1193514\r\nDimension | 25\r\nLicense | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F\r\n\r\nRead more:\r\n- 
https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\nmodel = api.load(\"glove-twitter-25\")\r\nmodel.most_similar(positive=['fruit', 'flower'], topn=1)\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n[('cherry', 0.9183273911476135)]\r\n\"\"\"\r\n```","2017-10-28T16:55:00",{"id":232,"version":233,"summary_zh":234,"released_at":235},103675,"glove-twitter-100","Pre-trained glove vectors based on 2B tweets, 27B tokens, 1.2M vocab, uncased.\r\n\r\nattribute | value\r\n----------- | -------\r\nFile size | 387MB\r\nNumber of vectors | 1193514\r\nDimension | 100\r\nLicense | http:\u002F\u002Fopendatacommons.org\u002Flicenses\u002Fpddl\u002F\r\n\r\nRead more:\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F\r\n- https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf\r\n\r\n\r\nExample\r\n\r\n```python\r\nimport gensim.downloader as api\r\n\r\nmodel = api.load(\"glove-twitter-100\")\r\nmodel.most_similar(positive=['table', 'chair'], topn=1)\r\n\r\n\"\"\"\r\nOutput:\r\n\r\n[('desk', 0.8098949790000916)]\r\n\"\"\"\r\n```","2017-10-28T17:33:41",{"id":237,"version":238,"summary_zh":239,"released_at":240},103676,"20-newsgroups","The notorious collection of newsgroup posts partitioned (nearly) evenly across 20 different newsgroups.\r\n\r\nattribute | value\r\n----------- | -------\r\nFile size | 14MB\r\nNumber of posts | 18846\r\n\r\n\r\nRead more:\r\n- http:\u002F\u002Fqwone.com\u002F~jason\u002F20Newsgroups\u002F\r\n\r\nExample\r\n```python\r\nimport gensim.downloader as api\r\nimport json\r\n\r\nnewsgroups_dataset = api.load(\"20-newsgroups\")\r\nfor doc in newsgroups_dataset:\r\n\tprint(json.dumps(doc, indent=4))\r\n\tbreak\r\n\r\n\"\"\"\r\nOutput:\r\n{\r\n    \"set\": \"train\",\r\n    \"data\": \"From: db7n+@andrew.cmu.edu (D. Andrew Byler)\\nSubject: Re: Serbian genocide Work of God?\\nOrganization: Freshman, Civil Engineering, Carnegie Mellon, Pittsburgh, PA\\nLines: 61\\n\\nVera Shanti Noyes writes;\\n\\n>this is what indicates to me that you may believe in predestination.\\n>am i correct?  i do not believe in predestination -- i believe we all\\n>choose whether or not we will accept God's gift of salvation to us.\\n>again, fundamental difference which can't really be resolved.\\n\\nOf course I believe in Predestination.  It's a very biblical doctrine as\\nRomans 8.28-30 shows (among other passages).  Furthermore, the Church\\nhas always taught predestination, from the very beginning.  But to say\\nthat I believe in Predestination does not mean I do not believe in free\\nwill.  Men freely choose the course of their life, which is also\\naffected by the grace of God.  However, unlike the Calvinists and\\nJansenists, I hold that grace is resistable, otherwise you end up with\\nthe idiocy of denying the universal saving will of God (1 Timothy 2.4). \\nFor God must give enough grace to all to be saved.  But only the elect,\\nwho he foreknew, are predestined and receive the grace of final\\nperserverance, which guarantees heaven.  
This does not mean that those\\nwithout that grace can't be saved, it just means that god foreknew their\\nobstinacy and chose not to give it to them, knowing they would not need\\nit, as they had freely chosen hell.\\n\\t\\t\\t\\t\\t\\t\\t  ^^^^^^^^^^^\\nPeople who are saved are saved by the grace of God, and not by their own\\neffort, for it was God who disposed them to Himself, and predestined\\nthem to become saints.  But those who perish in everlasting fire perish\\nbecause they hardened their heart and chose to perish.  Thus, they were\\ndeserving of God;s punishment, as they had rejected their Creator, and\\nsinned against the working of the Holy Spirit.\\n\\n>yes, it is up to God to judge.  but he will only mete out that\\n>punishment at the last judgement. \\n\\nWell, I would hold that as God most certainly gives everybody some\\nblessing for what good they have done (even if it was only a little),\\nfor those He can't bless in the next life, He blesses in this one.  And\\nthose He will not punish in the next life, will be chastised in this one\\nor in Purgatory for their sins.  Every sin incurs some temporal\\npunishment, thus, God will punish it unless satisfaction is made for it\\n(cf. 2 Samuel 12.13-14, David's sin of Adultery and Murder were\\nforgiven, but he was still punished with the death of his child.)  And I\\nneed not point out the idea of punishment because of God's judgement is\\nquite prevelant in the Bible.  Sodom and Gommorrah, Moses barred from\\nthe Holy Land, the slaughter of the Cannanites, Annias and Saphira,\\nJerusalem in 70 AD, etc.\\n\\n> if jesus stopped the stoning of an adulterous woman (perhaps this is\\nnot a >good parallel, but i'm going to go with it anyway), why should we\\nnot >stop the murder and violation of people who may (or may not) be more\\n>innocent?\\n\\nWe should stop the slaughter of the innocent (cf Proverbs 24.11-12), but\\ndoes that mean that Christians should support a war in Bosnia with the\\nU.S. or even the U.N. involved?  I do not think so, but I am an\\nisolationist, and disagree with foreign adventures in general.  But in\\nthe case of Bosnia, I frankly see no excuse for us getting militarily\\ninvolved, it would not be a \\\"just war.\\\"  \\\"Blessed\\\" after all, \\\"are the\\npeacemakers\\\" was what Our Lord said, not the interventionists.  Our\\nactions in Bosnia must be for peace, and not for a war which is\\nunrelated to anything to justify it for us.\\n\\nAndy Byler\\n\",\r\n    \"id\": \"21408\",\r\n    \"topic\": \"soc.religion.christian\"\r\n}\r\n\"\"\"\r\n```","2017-10-28T18:04:22"]