[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-keon--awesome-nlp":3,"tool-keon--awesome-nlp":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",147882,2,"2026-04-09T11:32:47",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108111,"2026-04-08T11:23:26",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 
助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":78,"owner_twitter":79,"owner_website":80,"owner_url":81,"languages":82,"stars":83,"forks":84,"last_commit_at":85,"license":86,"difficulty_score":87,"env_os":88,"env_gpu":89,"env_ram":89,"env_deps":90,"category_tags":93,"github_topics":94,"view_count":32,"oss_zip_url":82,"oss_zip_packed_at":82,"status":17,"created_at":103,"updated_at":104,"faqs":105,"releases":106},5852,"keon\u002Fawesome-nlp","awesome-nlp",":book: A curated list of resources dedicated to Natural Language Processing (NLP)","awesome-nlp 是一个精心整理的自然语言处理（NLP）资源清单，旨在为从业者和学习者提供一站式的信息导航。面对 NLP 领域海量且分散的论文、代码库、数据集及教程，用户往往难以快速定位高质量内容，awesome-nlp 正是为解决这一痛点而生。它系统性地汇集了从基础理论到前沿趋势的各类资源，涵盖深度学习在 NLP 中的应用综述、全球顶尖实验室动态、多语言（包括中文、韩语、阿拉伯语等）专项资源，以及针对 Python、Java、C++ 等多种编程语言的实用工具库。\n\n这份清单特别适合 NLP 研究人员、算法工程师、学生以及对人工智能感兴趣的开发者使用。无论是需要追踪最新学术进展（如 ACL、EMNLP 
会议亮点），还是寻找具体的标注工具、开源框架或特定语种的数据集，都能在这里找到经过筛选的优质链接。其独特的技术亮点在于不仅关注通用的英语资源，还广泛收录了多种小语种的 NLP 项目，极大地促进了多语言技术的普及与交流。作为一份持续更新的社区驱动型指南，awesome-nlp 以开放和专业的态度，帮助用户高效构建知识体系，降低入门门槛，是探索自然语言处理世界不可或缺的得力助手。","# awesome-nlp\n\n[![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome)\n\nA curated list of resources dedicated to Natural Language Processing\n\n![Awesome NLP Logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fkeon_awesome-nlp_readme_2033b3bda8b0.jpg)\n\nRead this in [English](.\u002FREADME.md), [Traditional Chinese](.\u002FREADME-ZH-TW.md)\n\n_Please read the [contribution guidelines](contributing.md) before contributing. Please add your favourite NLP resource by raising a [pull request](https:\u002F\u002Fgithub.com\u002Fkeonkim\u002Fawesome-nlp\u002Fpulls)_\n\n## Contents\n\n* [Research Summaries and Trends](#research-summaries-and-trends)\n* [Prominent NLP Research Labs](#prominent-nlp-research-labs)\n* [Tutorials](#tutorials)\n  * [Reading Content](#reading-content)\n  * [Videos and Courses](#videos-and-online-courses)\n  * [Books](#books)\n* [Libraries](#libraries)\n  * [Node.js](#node-js)\n  * [Python](#python)\n  * [C++](#c++)\n  * [Java](#java)\n  * [Kotlin](#kotlin)\n  * [Scala](#scala)\n  * [R](#R)\n  * [Clojure](#clojure)\n  * [Ruby](#ruby)\n  * [Rust](#rust)\n  * [NLP++](#NLP++)\n  * [Julia](#julia)\n* [Services](#services)\n* [Annotation Tools](#annotation-tools)\n* [Datasets](#datasets)\n* [NLP in Korean](#nlp-in-korean)\n* [NLP in Arabic](#nlp-in-arabic)\n* [NLP in Chinese](#nlp-in-chinese)\n* [NLP in German](#nlp-in-german)\n* [NLP in Polish](#nlp-in-polish)\n* [NLP in Spanish](#nlp-in-spanish)\n* [NLP in Indic Languages](#nlp-in-indic-languages)\n* [NLP in Thai](#nlp-in-thai)\n* [NLP in Danish](#nlp-in-danish)\n* [NLP in Vietnamese](#nlp-in-vietnamese)\n* [NLP for Dutch](#nlp-for-dutch)\n* [NLP 
in Indonesian](#nlp-in-indonesian)\n* [NLP in Urdu](#nlp-in-urdu)\n* [NLP in Persian](#nlp-in-persian)\n* [NLP in Ukrainian](#nlp-in-ukrainian)\n* [NLP in Hungarian](#nlp-in-hungarian)\n* [NLP in Portuguese](#nlp-in-portuguese)\n* [Other Languages](#other-languages)\n* [Citation](#citation)\n* [Credits](#credits)\n\n## Research Summaries and Trends\n\n* [NLP-Overview](https:\u002F\u002Fnlpoverview.com\u002F) is an up-to-date overview of deep learning techniques applied to NLP, including theory, implementations, applications, and state-of-the-art results. This is a great Deep NLP Introduction for researchers.\n* [NLP-Progress](https:\u002F\u002Fnlpprogress.com\u002F) tracks the progress in Natural Language Processing, including the datasets and the current state-of-the-art for the most common NLP tasks\n* [NLP's ImageNet moment has arrived](https:\u002F\u002Fthegradient.pub\u002Fnlp-imagenet\u002F)\n* [ACL 2018 Highlights: Understanding Representation and Evaluation in More Challenging Settings](http:\u002F\u002Fruder.io\u002Facl-2018-highlights\u002F)\n* [Four deep learning trends from ACL 2017. Part One: Linguistic Structure and Word Embeddings](https:\u002F\u002Fwww.abigailsee.com\u002F2017\u002F08\u002F30\u002Ffour-deep-learning-trends-from-acl-2017-part-1.html)\n* [Four deep learning trends from ACL 2017. 
Part Two: Interpretability and Attention](https:\u002F\u002Fwww.abigailsee.com\u002F2017\u002F08\u002F30\u002Ffour-deep-learning-trends-from-acl-2017-part-2.html)\n* [Highlights of EMNLP 2017: Exciting Datasets, Return of the Clusters, and More!](http:\u002F\u002Fblog.aylien.com\u002Fhighlights-emnlp-2017-exciting-datasets-return-clusters\u002F)\n* [Deep Learning for Natural Language Processing (NLP): Advancements & Trends](https:\u002F\u002Ftryolabs.com\u002Fblog\u002F2017\u002F12\u002F12\u002Fdeep-learning-for-nlp-advancements-and-trends-in-2017\u002F?utm_campaign=Revue%20newsletter&utm_medium=Newsletter&utm_source=The%20Wild%20Week%20in%20AI)\n* [Survey of the State of the Art in Natural Language Generation](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.09902)\n\n## Prominent NLP Research Labs\n[Back to Top](#contents)\n\n* [The Berkeley NLP Group](http:\u002F\u002Fnlp.cs.berkeley.edu\u002Findex.shtml) - Notable contributions include a tool to reconstruct long-dead languages, referenced [here](https:\u002F\u002Fwww.bbc.com\u002Fnews\u002Fscience-environment-21427896), built by taking corpora from 637 languages currently spoken in Asia and the Pacific and reconstructing their ancestral forms.\n* [Language Technologies Institute, Carnegie Mellon University](http:\u002F\u002Fwww.cs.cmu.edu\u002F~nasmith\u002Fnlp-cl.html) - Notable projects include the [Avenue Project](http:\u002F\u002Fwww.cs.cmu.edu\u002F~avenue\u002F), a syntax-driven machine translation system for endangered languages like Quechua and Aymara, and previously [Noah's Ark](http:\u002F\u002Fwww.cs.cmu.edu\u002F~ark\u002F), which created [AQMAR](http:\u002F\u002Fwww.cs.cmu.edu\u002F~ark\u002FAQMAR\u002F) to improve NLP tools for Arabic.\n* [NLP research group, Columbia University](http:\u002F\u002Fwww1.cs.columbia.edu\u002Fnlp\u002Findex.cgi) - Responsible for creating BOLT (interactive error handling for speech translation systems) and an unnamed project to characterize laughter in dialogue.\n* [The Center for 
Language and Speech Processing, Johns Hopkins University](http:\u002F\u002Fclsp.jhu.edu\u002F) - Recently in the news for developing speech recognition software to create a diagnostic test for Parkinson's Disease, [here](https:\u002F\u002Fwww.clsp.jhu.edu\u002F2019\u002F03\u002F27\u002Fspeech-recognition-software-and-machine-learning-tools-are-being-used-to-create-diagnostic-test-for-parkinsons-disease\u002F#.XNFqrIkzYdU).\n* [Computational Linguistics and Information Processing Group, University of Maryland](https:\u002F\u002Fwiki.umiacs.umd.edu\u002Fclip\u002Findex.php\u002FMain_Page) - Notable contributions include [Human-Computer Cooperation for Word-by-Word Question Answering](http:\u002F\u002Fwww.umiacs.umd.edu\u002F~jbg\u002Fprojects\u002FIIS-1652666) and modeling the development of phonetic representations.\n* [Penn Natural Language Processing, University of Pennsylvania](https:\u002F\u002Fnlp.cis.upenn.edu\u002F) - Famous for creating the [Penn Treebank](https:\u002F\u002Fwww.seas.upenn.edu\u002F~pdtb\u002F).\n* [The Stanford Natural Language Processing Group](https:\u002F\u002Fnlp.stanford.edu\u002F) - One of the top NLP research labs in the world, notable for creating [Stanford CoreNLP](https:\u002F\u002Fnlp.stanford.edu\u002Fsoftware\u002Fcorenlp.shtml) and their [coreference resolution system](https:\u002F\u002Fnlp.stanford.edu\u002Fsoftware\u002Fdcoref.shtml).\n\n\n## Tutorials\n[Back to Top](#contents)\n\n### Reading Content\n\nGeneral Machine Learning\n\n* [Machine Learning 101](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k\u002Fedit?usp=sharing) from Google's Senior Creative Engineer explains Machine Learning for engineers and executives alike\n* [AI Playbook](https:\u002F\u002Faiplaybook.a16z.com\u002F) - the a16z AI playbook is a great link to forward to your managers or content for your presentations\n* [Ruder's Blog](http:\u002F\u002Fruder.io\u002F#open) by [Sebastian 
Ruder](https:\u002F\u002Ftwitter.com\u002Fseb_ruder) for commentary on the best of NLP Research\n* [How To Label Data](https:\u002F\u002Fwww.lighttag.io\u002Fhow-to-label-data\u002F) guide to managing larger linguistic annotation projects\n* [Depends on the Definition](https:\u002F\u002Fwww.depends-on-the-definition.com\u002F) collection of blog posts covering a wide array of NLP topics with detailed implementation\n\nIntroductions and Guides to NLP\n\n* [Understand & Implement Natural Language Processing](https:\u002F\u002Fwww.analyticsvidhya.com\u002Fblog\u002F2017\u002F01\u002Fultimate-guide-to-understand-implement-natural-language-processing-codes-in-python\u002F)\n* [NLP in Python](http:\u002F\u002Fgithub.com\u002FNirantK\u002Fnlp-python-deep-learning) - Collection of Github notebooks\n* [Natural Language Processing: An Introduction](https:\u002F\u002Facademic.oup.com\u002Fjamia\u002Farticle\u002F18\u002F5\u002F544\u002F829676) - Oxford\n* [Deep Learning for NLP with Pytorch](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fbeginner\u002Fdeep_learning_nlp_tutorial.html)\n* [Hands-On NLTK Tutorial](https:\u002F\u002Fgithub.com\u002Fhb20007\u002Fhands-on-nltk-tutorial) - NLTK Tutorials, Jupyter notebooks\n* [Natural Language Processing with Python – Analyzing Text with the Natural Language Toolkit](https:\u002F\u002Fwww.nltk.org\u002Fbook\u002F) - An online and print book introducing NLP concepts using NLTK. 
The book's authors also wrote the NLTK library.\n* [Train a new language model from scratch](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fhow-to-train) - Hugging Face 🤗\n* [The Super Duper NLP Repo (SDNLPR)](https:\u002F\u002Fnotebooks.quantumstat.com\u002F): Collection of Colab notebooks covering a wide array of NLP task implementations.\n* [Advanced NLP with spaCy](https:\u002F\u002Fcourse.spacy.io\u002Fen\u002F) - Free online course covering text processing, large-scale data analysis, processing pipelines, and training neural network models for custom NLP tasks.\n* [Kaggle NLP Learning Guide](https:\u002F\u002Fwww.kaggle.com\u002Flearn-guide\u002Fnatural-language-processing) - Beginner-friendly tutorials including getting started guides, deep learning for NLP, and visual explanations of techniques like BERT, GloVe, and TF-IDF.\n\nBlogs and Newsletters\n\n* [Deep Learning, NLP, and Representations](https:\u002F\u002Fcolah.github.io\u002Fposts\u002F2014-07-NLP-RNNs-Representations\u002F)\n* [The Illustrated BERT, ELMo, and co. 
(How NLP Cracked Transfer Learning)](https:\u002F\u002Fjalammar.github.io\u002Fillustrated-bert\u002F) and [The Illustrated Transformer](https:\u002F\u002Fjalammar.github.io\u002Fillustrated-transformer\u002F)\n* [Natural Language Processing](https:\u002F\u002Fnlpers.blogspot.com\u002F) by Hal Daumé III\n* [arXiv: Natural Language Processing (Almost) from Scratch](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1103.0398.pdf)\n* [Karpathy's The Unreasonable Effectiveness of Recurrent Neural Networks](https:\u002F\u002Fkarpathy.github.io\u002F2015\u002F05\u002F21\u002Frnn-effectiveness)\n* [Machine Learning Mastery: Deep Learning for Natural Language Processing](https:\u002F\u002Fmachinelearningmastery.com\u002Fcategory\u002Fnatural-language-processing)\n* [Visual NLP Paper Summaries](https:\u002F\u002Famitness.com\u002Fcategories\u002F#nlp)\n\n### Videos and Online Courses\n[Back to Top](#contents)\n\n* [Advanced Natural Language Processing](https:\u002F\u002Fpeople.cs.umass.edu\u002F~miyyer\u002Fcs685_f20\u002F) - CS 685, UMass Amherst CS\n* [Deep Natural Language Processing](https:\u002F\u002Fgithub.com\u002Foxford-cs-deepnlp-2017\u002Flectures) - Lecture series from Oxford\n* [Deep Learning for Natural Language Processing (cs224-n)](https:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fcs224n\u002F) - Richard Socher and Christopher Manning's Stanford Course\n* [Neural Networks for NLP](http:\u002F\u002Fphontron.com\u002Fclass\u002Fnn4nlp2017\u002F) - Carnegie Mellon Language Technologies Institute\n* [Deep NLP Course](https:\u002F\u002Fgithub.com\u002Fyandexdataschool\u002Fnlp_course) by Yandex Data School, covering important ideas from text embedding to machine translation, including sequence modeling, language models, and so on.\n* [fast.ai Code-First Intro to Natural Language Processing](https:\u002F\u002Fwww.fast.ai\u002F2019\u002F07\u002F08\u002Ffastai-nlp\u002F) - This covers a blend of traditional NLP topics (including regex, SVD, naive Bayes, tokenization) and 
recent neural network approaches (including RNNs, seq2seq, GRUs, and the Transformer), as well as addressing urgent ethical issues, such as bias and disinformation. Find the Jupyter Notebooks [here](https:\u002F\u002Fgithub.com\u002Ffastai\u002Fcourse-nlp)\n* [Machine Learning University - Accelerated Natural Language Processing](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL8P_Z6C4GcuWfAq8Pt6PBYlck4OprHXsw) - Lectures go from an introduction to NLP and text processing to Recurrent Neural Networks and Transformers.\nMaterial can be found [here](https:\u002F\u002Fgithub.com\u002Faws-samples\u002Faws-machine-learning-university-accelerated-nlp).\n* [Applied Natural Language Processing](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLH-xYrxjfO2WyR3pOAB006CYMhNt4wTqp) - Lecture series from IIT Madras covering everything from the basics through to autoencoders. The GitHub notebooks for this course are also available [here](https:\u002F\u002Fgithub.com\u002FRamaseshanr\u002Fanlp)\n* [DeepLearning.AI Natural Language Processing Specialization](https:\u002F\u002Fwww.deeplearning.ai\u002Fcourses\u002Fnatural-language-processing-specialization\u002F) - 4-course program covering sentiment analysis, word embeddings, RNNs, LSTMs, attention mechanisms, and Transformer models like BERT and T5 for tasks including machine translation and summarization.\n\n\n### Books\n\n* [Speech and Language Processing](https:\u002F\u002Fweb.stanford.edu\u002F~jurafsky\u002Fslp3\u002F) - free, by Prof. Dan Jurafsky\n* [Natural Language Processing](https:\u002F\u002Fgithub.com\u002Fjacobeisenstein\u002Fgt-nlp-class) - free, NLP notes by Dr. 
Jacob Eisenstein at GeorgiaTech\n* [NLP with PyTorch](https:\u002F\u002Fgithub.com\u002Fjoosthub\u002FPyTorchNLPBook) - Delip Rao & Brian McMahan\n* [Text Mining in R](https:\u002F\u002Fwww.tidytextmining.com)\n* [Natural Language Processing with Python](https:\u002F\u002Fwww.nltk.org\u002Fbook\u002F)\n* [Practical Natural Language Processing](https:\u002F\u002Fwww.oreilly.com\u002Flibrary\u002Fview\u002Fpractical-natural-language\u002F9781492054047\u002F)\n* [Natural Language Processing with Spark NLP](https:\u002F\u002Fwww.oreilly.com\u002Flibrary\u002Fview\u002Fnatural-language-processing\u002F9781492047759\u002F)\n* [Deep Learning for Natural Language Processing](https:\u002F\u002Fwww.manning.com\u002Fbooks\u002Fdeep-learning-for-natural-language-processing) - by Stephan Raaijmakers\n* [Real-World Natural Language Processing](https:\u002F\u002Fwww.manning.com\u002Fbooks\u002Freal-world-natural-language-processing) - by Masato Hagiwara\n* [Natural Language Processing in Action, Second Edition](https:\u002F\u002Fwww.manning.com\u002Fbooks\u002Fnatural-language-processing-in-action-second-edition) - by Hobson Lane and Maria Dyshel\n* [Transformers in Action](https:\u002F\u002Fwww.manning.com\u002Fbooks\u002Ftransformers-in-action) - by Nicole Koenigstein\n* [The Math Behind Artificial Intelligence](https:\u002F\u002Fwww.freecodecamp.org\u002Fnews\u002Fthe-math-behind-artificial-intelligence-book) - by Tiago Monteiro | A free freeCodeCamp book teaching the math behind AI in plain English from an engineering point of view. 
It covers linear algebra, calculus, probability & statistics, and optimization theory with analogies, real-life applications, and Python code examples.\n  \n## Libraries\n\n[Back to Top](#contents)\n\n* \u003Ca id=\"node-js\">**Node.js and JavaScript** - Node.js Libraries for NLP\u003C\u002Fa> | [Back to Top](#contents)\n  * [Twitter-text](https:\u002F\u002Fgithub.com\u002Ftwitter\u002Ftwitter-text) - A JavaScript implementation of Twitter's text processing library\n  * [Knwl.js](https:\u002F\u002Fgithub.com\u002Fbenhmoore\u002FKnwl.js) - A Natural Language Processor in JS\n  * [Retext](https:\u002F\u002Fgithub.com\u002Fretextjs\u002Fretext) - Extensible system for analyzing and manipulating natural language\n  * [NLP Compromise](https:\u002F\u002Fgithub.com\u002Fspencermountain\u002Fcompromise) - Natural language processing in the browser\n  * [Natural](https:\u002F\u002Fgithub.com\u002FNaturalNode\u002Fnatural) - general natural language facilities for Node\n  * [Poplar](https:\u002F\u002Fgithub.com\u002Fsynyi\u002Fpoplar) - A web-based annotation tool for natural language processing (NLP)\n  * [NLP.js](https:\u002F\u002Fgithub.com\u002Faxa-group\u002Fnlp.js) - An NLP library for building bots\n  * [node-question-answering](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fnode-question-answering) - Fast and production-ready question answering w\u002F DistilBERT in Node.js\n\n* \u003Ca id=\"python\"> **Python** - Python NLP Libraries\u003C\u002Fa> | [Back to Top](#contents)\n  - [sentimental-onix](https:\u002F\u002Fgithub.com\u002Fsloev\u002Fsentimental-onix) - Sentiment models for spaCy using ONNX\n  - [TextAttack](https:\u002F\u002Fgithub.com\u002FQData\u002FTextAttack) - Adversarial attacks, adversarial training, and data augmentation in NLP\n  - [TextBlob](http:\u002F\u002Ftextblob.readthedocs.org\u002F) - Providing a consistent API for diving into common natural language processing (NLP) tasks. 
Stands on the giant shoulders of [Natural Language Toolkit (NLTK)](https:\u002F\u002Fwww.nltk.org\u002F) and [Pattern](https:\u002F\u002Fgithub.com\u002Fclips\u002Fpattern), and plays nicely with both :+1:\n  - [spaCy](https:\u002F\u002Fgithub.com\u002Fexplosion\u002FspaCy) - Industrial-strength NLP with Python and Cython :+1:\n  - [Speedster](https:\u002F\u002Fgithub.com\u002Fnebuly-ai\u002Fnebullvm\u002Ftree\u002Fmain\u002Fapps\u002Faccelerate\u002Fspeedster) - Automatically apply SOTA optimization techniques to achieve the maximum inference speed-up on your hardware\n  - [textacy](https:\u002F\u002Fgithub.com\u002Fchartbeat-labs\u002Ftextacy) - Higher-level NLP built on spaCy\n  - [gensim](https:\u002F\u002Fradimrehurek.com\u002Fgensim\u002Findex.html) - Python library to conduct unsupervised semantic modelling from plain text :+1:\n  - [scattertext](https:\u002F\u002Fgithub.com\u002FJasonKessler\u002Fscattertext) - Python library to produce d3 visualizations of how language differs between corpora\n  - [GluonNLP](https:\u002F\u002Fgithub.com\u002Fdmlc\u002Fgluon-nlp) - A deep learning toolkit for NLP, built on MXNet\u002FGluon, for research prototyping and industrial deployment of state-of-the-art models on a wide range of NLP tasks.\n  - [AllenNLP](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fallennlp) - An NLP research library, built on PyTorch, for developing state-of-the-art deep learning models on a wide variety of linguistic tasks.\n  - [PyTorch-NLP](https:\u002F\u002Fgithub.com\u002FPetrochukM\u002FPyTorch-NLP) - NLP research toolkit designed to support rapid prototyping with better data loaders, word vector loaders, neural network layer representations, and common NLP metrics such as BLEU\n  - [Rosetta](https:\u002F\u002Fgithub.com\u002Fcolumbia-applied-data-science\u002Frosetta) - Text processing tools and wrappers (e.g. Vowpal Wabbit)\n  - [PyNLPl](https:\u002F\u002Fgithub.com\u002Fproycon\u002Fpynlpl) - Python Natural Language Processing Library. 
General purpose NLP library for Python, handles some specific formats like ARPA language models, Moses phrasetables, GIZA++ alignments.\n  - [foliapy](https:\u002F\u002Fgithub.com\u002Fproycon\u002Ffoliapy) - Python library for working with [FoLiA](https:\u002F\u002Fproycon.github.io\u002Ffolia\u002F), an XML format for linguistic annotation.\n  - [PySS3](https:\u002F\u002Fgithub.com\u002Fsergioburdisso\u002Fpyss3) - Python package that implements a novel white-box machine learning model for text classification, called SS3. Since SS3 has the ability to visually explain its rationale, this package also comes with easy-to-use interactive visualizations tools ([online demos](http:\u002F\u002Ftworld.io\u002Fss3\u002F)).\n  - [jPTDP](https:\u002F\u002Fgithub.com\u002Fdatquocnguyen\u002FjPTDP) - A toolkit for joint part-of-speech (POS) tagging and dependency parsing. jPTDP provides pre-trained models for 40+ languages.\n  - [BigARTM](https:\u002F\u002Fgithub.com\u002Fbigartm\u002Fbigartm) - a fast library for topic modelling\n  - [Snips NLU](https:\u002F\u002Fgithub.com\u002Fsnipsco\u002Fsnips-nlu) - A production ready library for intent parsing\n  - [Chazutsu](https:\u002F\u002Fgithub.com\u002Fchakki-works\u002Fchazutsu) - A library for downloading&parsing standard NLP research datasets\n  - [Word Forms](https:\u002F\u002Fgithub.com\u002Fgutfeeling\u002Fword_forms) - Word forms can accurately generate all possible forms of an English word\n  - [Multilingual Latent Dirichlet Allocation (LDA)](https:\u002F\u002Fgithub.com\u002FArtificiAI\u002FMultilingual-Latent-Dirichlet-Allocation-LDA) - A multilingual and extensible document clustering pipeline\n  - [Natural Language Toolkit (NLTK)](https:\u002F\u002Fwww.nltk.org\u002F) - A library containing a wide variety of NLP functionality, supporting over 50 corpora.\n  - [NLP Architect](https:\u002F\u002Fgithub.com\u002FNervanaSystems\u002Fnlp-architect) - A library for exploring the state-of-the-art deep learning topologies and 
techniques for NLP and NLU\n  - [Flair](https:\u002F\u002Fgithub.com\u002Fzalandoresearch\u002Fflair) - A very simple framework for state-of-the-art multilingual NLP built on PyTorch. Includes BERT, ELMo and Flair embeddings.\n  - [Kashgari](https:\u002F\u002Fgithub.com\u002FBrikerMan\u002FKashgari) - Simple, Keras-powered multilingual NLP framework, allows you to build your models in 5 minutes for named entity recognition (NER), part-of-speech tagging (PoS) and text classification tasks. Includes BERT and word2vec embedding.\n  - [FARM](https:\u002F\u002Fgithub.com\u002Fdeepset-ai\u002FFARM) - Fast & easy transfer learning for NLP. Harvesting language models for the industry. Focus on Question Answering.\n  - [Haystack](https:\u002F\u002Fgithub.com\u002Fdeepset-ai\u002Fhaystack) - End-to-end Python framework for building natural language search interfaces to data. Leverages Transformers and the State-of-the-Art of NLP. Supports DPR, Elasticsearch, HuggingFace’s Modelhub, and much more!\n  - [PraisonAI](https:\u002F\u002Fgithub.com\u002FMervinPraison\u002FPraisonAI) - Multi-AI Agents framework with 100+ LLM support via LiteLLM, MCP integration, agentic workflows, and built-in memory for NLP tasks.\n  - [Rita DSL](https:\u002F\u002Fgithub.com\u002Fzaibacu\u002Frita-dsl) - a DSL, loosely based on [RUTA on Apache UIMA](https:\u002F\u002Fuima.apache.org\u002Fruta.html). 
Allows you to define language patterns (rule-based NLP) which are then translated into [spaCy](https:\u002F\u002Fspacy.io\u002F) patterns, or, if you prefer fewer features and a lightweight approach, regex patterns.\n  - [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) - Natural Language Processing for TensorFlow 2.0 and PyTorch.\n  - [Tokenizers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers) - Tokenizers optimized for Research and Production.\n  - [fairSeq](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ffairseq) - Facebook AI Research implementations of SOTA seq2seq models in PyTorch.\n  - [corex_topic](https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic) - Hierarchical Topic Modeling with Minimal Domain Knowledge\n  - [Sockeye](https:\u002F\u002Fgithub.com\u002Fawslabs\u002Fsockeye) - Neural Machine Translation (NMT) toolkit that powers Amazon Translate.\n  - [DL Translate](https:\u002F\u002Fgithub.com\u002Fxhlulu\u002Fdl-translate) - A deep learning-based translation library for 50 languages, built on `transformers` and Facebook's mBART Large.\n  - [Jury](https:\u002F\u002Fgithub.com\u002Fobss\u002Fjury) - Evaluation of NLP model outputs offering various automated metrics.\n  - [python-ucto](https:\u002F\u002Fgithub.com\u002Fproycon\u002Fpython-ucto) - Unicode-aware regular-expression based tokenizer for various languages. 
Python binding to C++ library, supports [FoLiA format](https://proycon.github.io/folia).
  - [Pearmut](https://github.com/zouharvi/pearmut) - Human annotation tool for multilingual NLP tasks, such as machine translation.

- <a id="c++">**C++** - C++ Libraries</a> | [Back to Top](#contents)
  - [InsNet](https://github.com/chncwang/InsNet) - A neural network library for building instance-dependent NLP models with padding-free dynamic batching.
  - [MIT Information Extraction Toolkit](https://github.com/mit-nlp/MITIE) - C, C++, and Python tools for named entity recognition and relation extraction
  - [CRF++](https://taku910.github.io/crfpp/) - Open source implementation of Conditional Random Fields (CRFs) for segmenting/labeling sequential data & other Natural Language Processing tasks.
  - [CRFsuite](http://www.chokkan.org/software/crfsuite/) - An implementation of Conditional Random Fields (CRFs) for labeling sequential data.
  - [BLLIP Parser](https://github.com/BLLIP/bllip-parser) - BLLIP Natural Language Parser (also known as the Charniak-Johnson parser)
  - [colibri-core](https://github.com/proycon/colibri-core) - C++ library, command line tools, and Python binding for extracting and working with basic linguistic constructions such as n-grams and skipgrams in a quick and memory-efficient way.
  - [ucto](https://github.com/LanguageMachines/ucto) - Unicode-aware regular-expression based tokenizer for various languages. Tool and C++ library. Supports FoLiA format.
  - [libfolia](https://github.com/LanguageMachines/libfolia) - C++ library for the [FoLiA format](https://proycon.github.io/folia/)
  - [frog](https://github.com/LanguageMachines/frog) - Memory-based NLP suite developed for Dutch: PoS tagger, lemmatiser, dependency parser, NER, shallow parser, morphological analyzer.
  - [MeTA](https://github.com/meta-toolkit/meta) - [MeTA: ModErn Text Analysis](https://meta-toolkit.org/) is a C++ data sciences toolkit that facilitates mining big text data.
  - [Mecab (Japanese)](https://taku910.github.io/mecab/)
  - [Moses](http://statmt.org/moses/)
  - [StarSpace](https://github.com/facebookresearch/StarSpace) - A library from Facebook for creating embeddings at the word, paragraph, and document level, and for text classification
  - [QSMM](http://qsmm.org) - Adaptive probabilistic top-down and bottom-up parsers

- <a id="java">**Java** - Java NLP Libraries</a> | [Back to Top](#contents)
  - [Stanford NLP](https://nlp.stanford.edu/software/index.shtml)
  - [OpenNLP](https://opennlp.apache.org/)
  - [NLP4J](https://emorynlp.github.io/nlp4j/)
  - [Word2vec in Java](https://deeplearning4j.org/docs/latest/deeplearning4j-nlp-word2vec)
  - [ReVerb](https://github.com/knowitall/reverb/) - Web-scale open information extraction
  - [OpenRegex](https://github.com/knowitall/openregex) - An efficient and flexible token-based regular expression language and engine.
  - [CogcompNLP](https://github.com/CogComp/cogcomp-nlp) - Core libraries developed in the University of Illinois' Cognitive Computation Group.
  - [MALLET](http://mallet.cs.umass.edu/) - MAchine Learning for LanguagE Toolkit - package for statistical natural language processing, document classification, clustering, topic modeling, information extraction, and other machine learning applications to text.
  - [RDRPOSTagger](https://github.com/datquocnguyen/RDRPOSTagger) - A robust POS tagging toolkit available in both Java & Python, together with pre-trained models for 40+ languages.

- <a id="kotlin">**Kotlin** - Kotlin NLP Libraries</a> | [Back to Top](#contents)
  - [Lingua](https://github.com/pemistahl/lingua/) - A language detection library for Kotlin and Java, suitable for long and short text alike
  - [Kotidgy](https://github.com/meiblorn/kotidgy) - An index-based text data generator written in Kotlin

- <a id="scala">**Scala** - Scala NLP Libraries</a> | [Back to Top](#contents)
  - [Saul](https://github.com/CogComp/saul) - Library for developing NLP systems, including built-in modules like SRL, POS, etc.
  - [ATR4S](https://github.com/ispras/atr4s) - Toolkit with state-of-the-art [automatic term recognition](https://en.wikipedia.org/wiki/Terminology_extraction) methods.
  - [tm](https://github.com/ispras/tm) - Implementation of topic modeling based on regularized multilingual [PLSA](https://en.wikipedia.org/wiki/Probabilistic_latent_semantic_analysis).
  - [word2vec-scala](https://github.com/Refefer/word2vec-scala) - Scala interface to the word2vec model; includes operations on vectors like word-distance and word-analogy.
  - [Epic](https://github.com/dlwh/epic) - A high-performance statistical parser written in Scala, along with a framework for building complex structured prediction models.
  - [Spark NLP](https://github.com/JohnSnowLabs/spark-nlp) - A natural language processing
library built on top of Apache Spark ML that provides simple, performant & accurate NLP annotations for machine learning pipelines that scale easily in a distributed environment.

- <a id="R">**R** - R NLP Libraries</a> | [Back to Top](#contents)
  - [text2vec](https://github.com/dselivanov/text2vec) - Fast vectorization, topic modeling, distances and GloVe word embeddings in R.
  - [wordVectors](https://github.com/bmschmidt/wordVectors) - An R package for creating and exploring word2vec and other word embedding models
  - [RMallet](https://github.com/mimno/RMallet) - R package to interface with the Java machine learning tool MALLET
  - [dfr-browser](https://github.com/agoldst/dfr-browser) - Creates d3 visualizations for browsing topic models of text in a web browser.
  - [dfrtopics](https://github.com/agoldst/dfrtopics) - R package for exploring topic models of text.
  - [sentiment_classifier](https://github.com/kevincobain2000/sentiment_classifier) - Sentiment classification using word sense disambiguation and WordNet Reader
  - [jProcessing](https://github.com/kevincobain2000/jProcessing) - Japanese Natural Language Processing libraries, with Japanese sentiment classification
  - [corporaexplorer](https://kgjerde.github.io/corporaexplorer/) - An R package for dynamic exploration of text collections
  - [tidytext](https://github.com/juliasilge/tidytext) - Text mining using tidy tools
  - [spacyr](https://github.com/quanteda/spacyr) - R wrapper to spaCy NLP
  - [CRAN Task View: Natural Language Processing](https://github.com/cran-task-views/NaturalLanguageProcessing/)

- <a id="clojure">**Clojure**</a> | [Back to Top](#contents)
  - [Clojure-openNLP](https://github.com/dakrone/clojure-opennlp) - Natural Language Processing in Clojure (opennlp)
  - [Inflections-clj](https://github.com/r0man/inflections-clj) - Rails-like inflection library for Clojure and ClojureScript
  - [postagga](https://github.com/fekr/postagga) - A library to parse natural language in Clojure and ClojureScript

- <a id="ruby">**Ruby**</a> | [Back to Top](#contents)
  - Kevin Dias's [collection of Natural Language Processing (NLP) Ruby libraries, tools and software](https://github.com/diasks2/ruby-nlp)
  - [Practical Natural Language Processing done in Ruby](https://github.com/arbox/nlp-with-ruby)

- <a id="rust">**Rust**</a> | [Back to Top](#contents)
  - [adk-rust](https://github.com/zavora-ai/adk-rust) - Production-ready AI agent development kit with model-agnostic design (Gemini, OpenAI, Anthropic), multiple agent types, and MCP support
  - [whatlang](https://github.com/greyblake/whatlang-rs) - Natural language recognition library based on trigrams
  - [snips-nlu-rs](https://github.com/snipsco/snips-nlu-rs) - A production-ready library for intent parsing
  - [rust-bert](https://github.com/guillaume-be/rust-bert) - Ready-to-use NLP pipelines and Transformer-based models

- <a id="NLP++">**NLP++** - NLP++ Language</a> | [Back to Top](#contents)
  - [VSCode Language Extension](https://marketplace.visualstudio.com/items?itemName=dehilster.nlp) - NLP++ language extension for VSCode
  - [nlp-engine](https://github.com/VisualText/nlp-engine) - NLP++ engine to run NLP++ code on Linux, including a full English parser
  - [VisualText](http://visualtext.org) - Homepage for the NLP++ language
  - [NLP++ Wiki](http://wiki.naturalphilosophy.org/index.php?title=NLP%2B%2B) - Wiki entry for the NLP++ language

- <a id="julia">**Julia**</a> | [Back to Top](#contents)
  - [CorpusLoaders](https://github.com/JuliaText/CorpusLoaders.jl) - A variety of loaders for various NLP corpora
  - [Languages](https://github.com/JuliaText/Languages.jl) - A package for working with human languages
  - [TextAnalysis](https://github.com/JuliaText/TextAnalysis.jl) - Julia package for text analysis
  - [TextModels](https://github.com/JuliaText/TextModels.jl) - Neural network based models for Natural Language Processing
  - [WordTokenizers](https://github.com/JuliaText/WordTokenizers.jl) - High performance tokenizers for natural language processing and other related tasks
  - [Word2Vec](https://github.com/JuliaText/Word2Vec.jl) - Julia interface to word2vec

### Services

NLP as an API with higher-level functionality such as NER, topic tagging and so on | [Back to Top](#contents)

- [Vedika API](https://vedika.io) - AI-powered Vedic astrology API with multi-agent swarm intelligence
- [Wit-ai](https://github.com/wit-ai/wit) - Natural language interface for apps and devices
- [IBM Watson's Natural Language Understanding](https://github.com/watson-developer-cloud/natural-language-understanding-nodejs) - API and GitHub demo
- [Amazon Comprehend](https://aws.amazon.com/comprehend/) - NLP and ML suite that covers most common tasks like NER, tagging, and sentiment analysis
- [Google Cloud Natural Language API](https://cloud.google.com/natural-language/) - Syntax analysis, NER, sentiment analysis, and content tagging in at least 9 languages, including English and Chinese (Simplified and Traditional).
- [ParallelDots](https://www.paralleldots.com/text-analysis-apis) - High-level text analysis API service ranging from sentiment analysis to intent analysis
- [Microsoft Cognitive Service](https://azure.microsoft.com/en-us/services/cognitive-services/text-analytics/)
- [TextRazor](https://www.textrazor.com/)
- [Rosette](https://www.rosette.com/)
- [Textalytic](https://www.textalytic.com) - Natural Language Processing in the browser with sentiment analysis, named entity extraction, POS tagging, word frequencies, topic modeling, word clouds, and more
- [NLP Cloud](https://nlpcloud.io) - spaCy NLP models (custom and pre-trained ones) served through a RESTful API for named entity recognition (NER), POS tagging, and more.
- [Cloudmersive](https://cloudmersive.com/nlp-api) - Unified and free NLP APIs that perform actions such as speech tagging, text rephrasing, language translation/detection, and sentence parsing

### Annotation Tools

- [GATE](https://gate.ac.uk/overview.html) - General Architecture for Text Engineering; 15+ years old, free and open source
- [Anafora](https://github.com/weitechen/anafora) - A free and open-source, web-based raw text annotation tool
- [brat](https://brat.nlplab.org/) - brat rapid annotation tool, an online environment for collaborative text annotation
- [doccano](https://github.com/chakki-works/doccano) - doccano is free, open-source, and provides annotation features for text classification, sequence labeling, and sequence-to-sequence tasks
- [INCEpTION](https://inception-project.github.io) - A semantic annotation platform offering intelligent assistance and knowledge management
- [tagtog](https://www.tagtog.net/) - Team-first web tool to find, create, maintain, and share datasets - costs $
- [prodigy](https://prodi.gy/) - An annotation tool powered by active learning, costs $
- [LightTag](https://lighttag.io) - Hosted and managed text annotation tool for teams, costs $
- [rstWeb](https://corpling.uis.georgetown.edu/rstweb/info/) - Open-source local or online tool for discourse tree annotations
- [GitDox](https://corpling.uis.georgetown.edu/gitdox/) - Open-source server annotation tool with GitHub version control and validation for XML data and collaborative spreadsheet grids
- [Label Studio](https://www.heartex.ai/) - Hosted and managed text annotation tool for teams, freemium based, costs $
- [Datasaur](https://datasaur.ai/) - Supports various NLP tasks for individuals or teams, freemium based
- [Konfuzio](https://konfuzio.com/en/) - Team-first hosted and on-prem text, image and PDF annotation tool powered by active learning, freemium based, costs $
- [UBIAI](https://ubiai.tools/) - Easy-to-use text annotation tool for teams with comprehensive auto-annotation features. Supports NER, relations and document classification, as well as OCR annotation for invoice labeling, costs $
- [Shoonya](https://github.com/AI4Bharat/Shoonya-Backend) - Free and open-source data annotation platform with a wide variety of organization- and workspace-level management features. Shoonya is data agnostic and can be used by teams to annotate data at scale with various levels of verification stages.
- [Annotation Lab](https://www.johnsnowlabs.com/annotation-lab/) - Free end-to-end no-code platform for text annotation and DL model training/tuning. Out-of-the-box support for Named Entity Recognition, Classification, Relation extraction and Assertion Status Spark NLP models. Unlimited support for users, teams, projects, and documents. Not FOSS.
- [FLAT](https://github.com/proycon/flat) - FLAT is a web-based linguistic annotation environment based around the [FoLiA format](http://proycon.github.io/folia), a rich XML-based format for linguistic annotation. Free and open source.

## Techniques

### Text Embeddings

#### Word Embeddings

- Thumb Rule: **fastText >> GloVe > word2vec**

- [word2vec](https://papers.nips.cc/paper/5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) - [implementation](https://code.google.com/archive/p/word2vec/) - [explainer blog](http://colah.github.io/posts/2014-07-NLP-RNNs-Representations/)
- [GloVe](https://nlp.stanford.edu/pubs/glove.pdf) - [explainer blog](https://blog.acolyer.org/2016/04/22/glove-global-vectors-for-word-representation/)
- fastText - [implementation](https://github.com/facebookresearch/fastText) - [paper](https://arxiv.org/abs/1607.04606) - [explainer blog](https://towardsdatascience.com/fasttext-under-the-hood-11efc57b2b3)

#### Sentence and Language Model Based Word Embeddings

[Back to Top](#contents)

- ELMo - [Deep Contextualized Word Representations](https://arxiv.org/abs/1802.05365) - [PyTorch implementation](https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md) - [TF implementation](https://github.com/allenai/bilm-tf)
- ULMFiT - [Universal Language Model Fine-tuning for Text Classification](https://arxiv.org/abs/1801.06146) by Jeremy Howard and Sebastian Ruder
- InferSent - [Supervised Learning of Universal Sentence Representations from Natural Language Inference Data](https://arxiv.org/abs/1705.02364) by Facebook
- CoVe - [Learned in Translation: Contextualized Word Vectors](https://arxiv.org/abs/1708.00107)
- Paragraph vectors - from [Distributed Representations of Sentences and Documents](https://cs.stanford.edu/~quocle/paragraph_vector.pdf). See the [doc2vec tutorial at gensim](https://rare-technologies.com/doc2vec-tutorial/)
- [sense2vec](https://arxiv.org/abs/1511.06388) - on word sense disambiguation
- [Skip Thought Vectors](https://arxiv.org/abs/1506.06726) - sentence representation method
- [Adaptive skip-gram](https://arxiv.org/abs/1502.07257) - similar approach, with adaptive properties
- [Sequence to Sequence Learning](https://papers.nips.cc/paper/5346-sequence-to-sequence-learning-with-neural-networks.pdf) - word vectors for machine translation

### Question Answering and Knowledge Extraction

[Back to Top](#contents)

- [DrQA](https://github.com/facebookresearch/DrQA) - Open-domain question answering work by Facebook Research on Wikipedia data
- [Document-QA](https://github.com/allenai/document-qa) - Simple and Effective Multi-Paragraph Reading Comprehension by AllenAI
- [Template-Based Information Extraction without the Templates](https://www.usna.edu/Users/cs/nchamber/pubs/acl2011-chambers-templates.pdf)
- [Privee: An Architecture for Automatically Analyzing Web Privacy Policies](https://www.sebastianzimmeck.de/zimmeckAndBellovin2014Privee.pdf)

## Datasets

[Back to Top](#contents)

- [nlp-datasets](https://github.com/niderhoff/nlp-datasets) - Great collection of NLP datasets
- [gensim-data](https://github.com/RaRe-Technologies/gensim-data) - Data repository for pretrained NLP models and NLP corpora.
- [tiny_qa_benchmark_pp](https://github.com/vincentkoc/tiny_qa_benchmark_pp/) - Repository of tiny multilingual NLP QA datasets and a library to generate your own synthetic copies.

## Multilingual NLP Frameworks

[Back to Top](#contents)

- [UDPipe](https://github.com/ufal/udpipe) - A trainable pipeline for tokenizing, tagging, lemmatizing and parsing Universal Treebanks and other CoNLL-U files. Primarily written in C++, it offers a fast and reliable solution for multilingual NLP processing.
- [NLP-Cube](https://github.com/adobe/NLP-Cube) - Natural language processing pipeline: sentence splitting, tokenization, lemmatization, part-of-speech tagging and dependency parsing. A new platform, written in Python with DyNet 2.0. Offers standalone (CLI/Python bindings) and server functionality (REST API).
- [UralicNLP](https://github.com/mikahama/uralicNLP) - An NLP library mostly for endangered Uralic languages such as the Sami, Mordvin, Mari and Komi languages. Some non-endangered languages such as Finnish are also supported, together with non-Uralic languages such as Swedish and Arabic.
UralicNLP can do morphological analysis, generation, lemmatization and disambiguation.

## NLP in Korean

[Back to Top](#contents)

### Libraries

- [KoNLPy](http://konlpy.org) - Python package for Korean natural language processing.
- [Mecab (Korean)](https://eunjeon.blogspot.com/) - C++ library for Korean NLP
- [KoalaNLP](https://koalanlp.github.io/koalanlp/) - Scala library for Korean natural language processing.
- [KoNLP](https://cran.r-project.org/package=KoNLP) - R package for Korean natural language processing

### Blogs and Tutorials

- [dsindex's blog](https://dsindex.github.io/)
- [Kangwon University's NLP course in Korean](http://cs.kangwon.ac.kr/~leeck/NLP/)

### Datasets

- [KAIST Corpus](http://semanticweb.kaist.ac.kr/home/index.php/KAIST_Corpus) - A corpus in Korean from the Korea Advanced Institute of Science and Technology.
- [Naver Sentiment Movie Corpus in Korean](https://github.com/e9t/nsmc/)
- [Chosun Ilbo archive](http://srchdb1.chosun.com/pdf/i_archive/) - Dataset in Korean from the Chosun Ilbo, one of the major newspapers in South Korea.
- [Chat data](https://github.com/songys/Chatbot_data) - Chatbot data in Korean
- [Petitions](https://github.com/akngs/petitions) - Collection of expired petition data from the Blue House National Petition Site.
- [Korean Parallel corpora](https://github.com/j-min/korean-parallel-corpora) - Neural Machine Translation (NMT) dataset for **Korean to French** & **Korean to English**
- [KorQuAD](https://korquad.github.io/) - Korean SQuAD dataset with Wiki HTML source. Mentions both v1.0 and v2.1 at the time of adding to Awesome NLP.

## NLP in Arabic

[Back to Top](#contents)

### Libraries

- [goarabic](https://github.com/01walid/goarabic) - Go package for Arabic text processing
- [jsastem](https://github.com/ejtaal/jsastem) - JavaScript for Arabic stemming
- [PyArabic](https://pypi.org/project/PyArabic/) - Python libraries for Arabic
- [RFTokenizer](https://github.com/amir-zeldes/RFTokenizer) - Trainable Python segmenter for Arabic, Hebrew and Coptic

### Datasets

- [Multidomain Datasets](https://github.com/hadyelsahar/large-arabic-sentiment-analysis-resouces) - Largest available multi-domain resources for Arabic sentiment analysis
- [LABR](https://github.com/mohamedadaly/labr) - LArge Arabic Book Reviews dataset
- [Arabic Stopwords](https://github.com/mohataher/arabic-stop-words) - A list of Arabic stopwords from various resources

## NLP in Chinese

[Back to Top](#contents)

### Libraries

- [jieba](https://github.com/fxsjy/jieba#jieba-1) - Python package for Chinese word segmentation utilities
- [SnowNLP](https://github.com/isnowfy/snownlp) - Python package for Chinese NLP
- [FudanNLP](https://github.com/FudanNLP/fnlp) - Java library for Chinese text processing
- [HanLP](https://github.com/hankcs/HanLP) - The multilingual NLP library

### Anthology

- [funNLP](https://github.com/fighting41love/funNLP) - Collection of NLP tools and resources, mainly for Chinese

## NLP in German

- [German-NLP](https://github.com/adbar/German-NLP) - Curated list of open-access/open-source/off-the-shelf resources and tools developed with a particular focus on German

## NLP in Polish

- [Polish-NLP](https://github.com/ksopyla/awesome-nlp-polish) - A curated list of resources dedicated to Natural Language Processing (NLP) in Polish: models, tools, datasets.

## NLP in Spanish

[Back to Top](#contents)

### Libraries

- [spanlp](https://github.com/jfreddypuentes/spanlp) - Python library to detect, censor and clean profanity, vulgarities, hateful words, racism, xenophobia and bullying in texts written in Spanish. It contains data from 21 Spanish-speaking countries.

### Data

- [Colombian Political Speeches](https://github.com/dav009/LatinamericanTextResources)
- [Copenhagen Treebank](https://mbkromann.github.io/copenhagen-dependency-treebank/)
- [Spanish Billion Words Corpus with Word2Vec embeddings](https://github.com/crscardellino/sbwce)
- [Compilation of Spanish Unannotated Corpora](https://github.com/josecannete/spanish-unannotated-corpora)

### Word and Sentence Embeddings

- [Spanish Word Embeddings Computed with Different Methods and from Different Corpora](https://github.com/dccuchile/spanish-word-embeddings)
- [Spanish Word Embeddings Computed from Large Corpora and Different Sizes Using fastText](https://github.com/BotCenter/spanishWordEmbeddings)
- [Spanish Sentence Embeddings Computed from Large Corpora Using sent2vec](https://github.com/BotCenter/spanishSent2Vec)
- [Beto - BERT for Spanish](https://github.com/dccuchile/beto)

## NLP in Indic languages

[Back to Top](#contents)

### Data, Corpora and Treebanks

- [Hindi Dependency Treebank](https://ltrc.iiit.ac.in/treebank_H2014/) - A multi-representational multi-layered treebank for Hindi and Urdu
- [Universal Dependencies Treebank in Hindi](https://universaldependencies.org/treebanks/hi_hdtb/index.html)
  - [Parallel Universal Dependencies Treebank in Hindi](http://universaldependencies.org/treebanks/hi_pud/index.html) - A smaller part of the above-mentioned treebank.
- [ISI FIRE Stopwords List (Hindi and Bangla)](https://www.isical.ac.in/~fire/data/)
- [Peter Graham's Stopwords List](https://github.com/6/stopwords-json)
- [NLTK Corpus](https://www.nltk.org/book/ch02.html) - 60k POS-tagged words in Bangla, Hindi, Marathi and Telugu
- [Hindi Movie Reviews Dataset](https://github.com/goru001/nlp-for-hindi) - ~1k samples, 3 polarity classes
- [BBC News Hindi Dataset](https://github.com/NirantK/hindi2vec/releases/tag/bbc-hindi-v0.1) - 4.3k samples, 14 classes
- [IIT Patna Hindi ABSA Dataset](https://github.com/pnisarg/ABSA) - 5.4k samples, 12 domains, 4k aspect terms, aspect- and sentence-level polarity in 4 classes
- [Bangla ABSA](https://github.com/AtikRahman/Bangla_Datasets_ABSA) - 5.5k samples, 2 domains, 10 aspect terms
- [IIT Patna Movie Review Sentiment Dataset](https://www.iitp.ac.in/~ai-nlp-ml/resources.html) - 2k samples, 3 polarity labels

#### Corpora/datasets that need a login (access can be gained via email)

- [SAIL 2015](http://amitavadas.com/SAIL/) - Twitter and Facebook labelled sentiment samples in Hindi, Bengali, Tamil and Telugu.
- [IIT Bombay NLP Resources](http://www.cfilt.iitb.ac.in/Sentiment_Analysis_Resources.html) - Sentiwordnet, movie and tourism parallel labelled corpora, polarity-labelled sense-annotated corpus, Marathi polarity-labelled corpus.
- [TDIL-IC](https://tdil-dc.in/index.php?option=com_catalogue&task=viewTools&id=83&lang=en) - Aggregates a lot of useful resources and provides access to otherwise gated datasets

### Language Models and Word Embeddings

- [Hindi2Vec](https://nirantk.com/hindi2vec/) and [nlp-for-hindi](https://github.com/goru001/nlp-for-hindi) - ULMFiT-style language models
- [IIT Patna Bilingual Word Embeddings Hi-En](https://www.iitp.ac.in/~ai-nlp-ml/resources.html)
- [fastText word embeddings in a whole bunch of languages, trained on Common Crawl](https://fasttext.cc/docs/en/crawl-vectors.html)
- [Hindi and Bengali Word2Vec](https://github.com/Kyubyong/wordvectors)
- [Hindi and Urdu ELMo models](https://github.com/HIT-SCIR/ELMoForManyLangs)
- [Sanskrit ALBERT](https://huggingface.co/surajp/albert-base-sanskrit) - Trained on Sanskrit Wikipedia and the OSCAR corpus

### Libraries and Tooling

- [Multi-Task Deep Morphological Analyzer](https://github.com/Saurav0074/mt-dma) - Deep network based morphological parser for Hindi and Urdu
- [Indic NLP Library](https://github.com/anoopkunchukuttan/indic_nlp_library) by Anoop Kunchukuttan - 18 languages, with a whole host of features from tokenization to translation
- [SivaReddy's Dependency Parser](http://sivareddy.in/downloads) - Dependency parser and POS tagger for Kannada, Hindi and Telugu.
[Python3 Port](https://github.com/CalmDownKarm/sivareddydependencyparser)
- [iNLTK](https://github.com/goru001/inltk) - A Natural Language Toolkit for Indic languages (Indian subcontinent languages) built on top of PyTorch/fastai, which aims to provide out-of-the-box support for common NLP tasks.

## NLP in Thai

[Back to Top](#contents)

### Libraries

- [PyThaiNLP](https://github.com/PyThaiNLP/pythainlp) - Thai NLP in a Python package
- [JTCC](https://github.com/wittawatj/jtcc) - A character cluster library in Java
- [CutKum](https://github.com/pucktada/cutkum) - Word segmentation with deep learning in TensorFlow
- [Thai Language Toolkit](https://pypi.python.org/pypi/tltk/) - Based on a 2002 paper by Wirote Aroonmanakun, with an included dataset
- [SynThai](https://github.com/KenjiroAI/SynThai) - Word segmentation and POS tagging using deep learning in Python

### Data

- [Inter-BEST](https://www.nectec.or.th/corpus/index.php?league=pm) - A text corpus with 5 million words with word segmentation
- [Prime Minister 29](https://github.com/PyThaiNLP/lexicon-thai/tree/master/thai-corpus/Prime%20Minister%2029) - Dataset containing speeches of the current Prime Minister of Thailand

## NLP in Danish

- [Named Entity Recognition for Danish](https://github.com/ITUnlp/daner)
- [DaNLP](https://github.com/alexandrainst/danlp) - NLP resources in Danish
- [Awesome Danish](https://github.com/fnielsen/awesome-danish) - A curated list of awesome resources for Danish language technology

## NLP in Vietnamese

### Libraries

- [underthesea](https://github.com/undertheseanlp/underthesea) - Vietnamese NLP toolkit
- [vn.vitk](https://github.com/phuonglh/vn.vitk) - A Vietnamese text processing toolkit
- [VnCoreNLP](https://github.com/vncorenlp/VnCoreNLP) - A Vietnamese natural language processing toolkit
- [PhoBERT](https://github.com/VinAIResearch/PhoBERT) - Pre-trained language models for Vietnamese
- [pyvi](https://github.com/trungtv/pyvi) - Python Vietnamese Core NLP Toolkit
- [VieNeu-TTS](https://github.com/pnnbao97/VieNeu-TTS) - An advanced on-device Vietnamese text-to-speech system with instant voice cloning.

### Data

- [Vietnamese treebank](https://vlsp.hpda.vn/demo/?page=resources&lang=en) - 10,000 sentences for the constituency parsing task
- [BKTreeBank](https://arxiv.org/pdf/1710.05519.pdf) - A Vietnamese dependency treebank
- [UD_Vietnamese](https://github.com/UniversalDependencies/UD_Vietnamese-VTB) - Vietnamese Universal Dependency treebank
- [VIVOS](https://ailab.hcmus.edu.vn/vivos/) - A free Vietnamese speech corpus consisting of 15 hours of recorded speech, by AILab
- [VNTQcorpus(big).txt](http://viet.jnlp.org/download-du-lieu-tu-vung-corpus) - 1.75 million sentences in news
- [ViText2SQL](https://github.com/VinAIResearch/ViText2SQL) - A dataset for Vietnamese Text-to-SQL semantic parsing (EMNLP 2020 Findings)
- [EVB Corpus](https://github.com/qhungngo/EVBCorpus) - 20 million words from 15 bilingual books, 100 parallel English-Vietnamese / Vietnamese-English texts, 250 parallel law and ordinance texts, 5,000 news articles, and 2,000 film subtitles.

## NLP for Dutch

[Back to Top](#contents)

- [python-frog](https://github.com/proycon/python-frog) - Python binding to Frog, an NLP suite for Dutch
(pos tagging, lemmatisation, dependency parsing, NER)\n- [SimpleNLG_NL](https:\u002F\u002Fgithub.com\u002Frfdj\u002FSimpleNLG-NL) - Dutch surface realiser used for Natural Language Generation in Dutch, based on the SimpleNLG implementation for English and French.\n- [Alpino](https:\u002F\u002Fgithub.com\u002Frug-compling\u002Falpino) - Dependency parser for Dutch (also does PoS tagging and Lemmatisation).\n- [Kaldi NL](https:\u002F\u002Fgithub.com\u002Fopensource-spraakherkenning-nl\u002FKaldi_NL) - Dutch Speech Recognition models based on [Kaldi](http:\u002F\u002Fkaldi-asr.org\u002F).\n- [spaCy](https:\u002F\u002Fspacy.io\u002F) - [Dutch model](https:\u002F\u002Fspacy.io\u002Fmodels\u002Fnl) available. - Industrial strength NLP with Python and Cython. \n\n\n## NLP in Indonesian\n\n### Datasets\n- Kompas and Tempo collections at [ILPS](http:\u002F\u002Filps.science.uva.nl\u002Fresources\u002Fbahasa\u002F)\n- [PANL10N for PoS tagging](http:\u002F\u002Fwww.panl10n.net\u002Fenglish\u002Foutputs\u002FIndonesia\u002FUI\u002F0802\u002FUI-1M-tagged.zip): 39K sentences and 900K word tokens\n- [IDN for PoS tagging](https:\u002F\u002Fgithub.com\u002Ffamrashel\u002Fidn-tagged-corpus): This corpus contains 10K sentences and 250K word tokens\n- [Indonesian Treebank](https:\u002F\u002Fgithub.com\u002Ffamrashel\u002Fidn-treebank) and [Universal Dependencies-Indonesian](https:\u002F\u002Fgithub.com\u002FUniversalDependencies\u002FUD_Indonesian-GSD)\n- [IndoSum](https:\u002F\u002Fgithub.com\u002Fkata-ai\u002Findosum) for text summarization and classification both\n- [Wordnet-Bahasa](http:\u002F\u002Fwn-msa.sourceforge.net\u002F) - large, free, semantic dictionary\n- IndoBenchmark [IndoNLU](https:\u002F\u002Fgithub.com\u002Findobenchmark\u002Findonlu) includes pre-trained language model (IndoBERT), FastText model, Indo4B corpus, and several NLU benchmark datasets\n\n### Libraries & Embedding\n- Natural language toolkit 
[bahasa](https:\u002F\u002Fgithub.com\u002Fkangfend\u002Fbahasa)\n- [Indonesian Word Embedding](https:\u002F\u002Fgithub.com\u002Fgaluhsahid\u002Findonesian-word-embedding)\n- Pretrained [Indonesian fastText Text Embedding](https:\u002F\u002Fs3-us-west-1.amazonaws.com\u002Ffasttext-vectors\u002Fwiki.id.zip) trained on Wikipedia\n- IndoBenchmark [IndoNLU](https:\u002F\u002Fgithub.com\u002Findobenchmark\u002Findonlu) includes pretrained language model (IndoBERT), FastText model, Indo4B corpus, and several NLU benchmark datasets\n\n## NLP in Urdu\n\n### Datasets\n- [Collection of Urdu datasets](https:\u002F\u002Fgithub.com\u002Fmirfan899\u002FUrdu) for POS, NER and NLP tasks\n\n### Libraries\n- [Natural Language Processing library](https:\u002F\u002Fgithub.com\u002Furduhack\u002Furduhack) for ( 🇵🇰)Urdu language\n\n## NLP in Persian\n\n[Back to Top](#contents)\n\n### Libraries\n- [Hazm](https:\u002F\u002Fgithub.com\u002Froshan-research\u002Fhazm) - Persian NLP Toolkit.\n- [Parsivar](https:\u002F\u002Fgithub.com\u002FICTRC\u002FParsivar): A Language Processing Toolkit for Persian\n- [Perke](https:\u002F\u002Fgithub.com\u002FAlirezaTheH\u002Fperke): Perke is a Python keyphrase extraction package for Persian language. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models.\n- [Perstem](https:\u002F\u002Fgithub.com\u002Fjonsafari\u002Fperstem): Persian stemmer, morphological analyzer, transliterator, and partial part-of-speech tagger\n- [ParsiAnalyzer](https:\u002F\u002Fgithub.com\u002FNarimanN2\u002FParsiAnalyzer): Persian Analyzer For Elasticsearch\n- [virastar](https:\u002F\u002Fgithub.com\u002Faziz\u002Fvirastar): Cleaning up Persian text!\n\n### Datasets\n- [Bijankhan Corpus](https:\u002F\u002Fdbrg.ut.ac.ir\u002Fبیژن%E2%80%8Cخان\u002F): Bijankhan corpus is a tagged corpus that is suitable for natural language processing research on the Persian (Farsi) language. 
This collection is gathered from daily news and common texts. All documents are categorized into different subjects such as politics, culture, and so on; in total, there are 4,300 different subjects. The Bijankhan collection contains about 2.6 million manually tagged words with a tag set of 40 Persian POS tags.\n- [Uppsala Persian Corpus (UPC)](https:\u002F\u002Fsites.google.com\u002Fsite\u002Fmojganserajicom\u002Fhome\u002Fupc): The Uppsala Persian Corpus (UPC) is a large, freely available Persian corpus. It is a modified version of the Bijankhan corpus with additional sentence segmentation and consistent tokenization, containing 2,704,028 tokens annotated with 31 part-of-speech tags. The part-of-speech tags are listed with explanations in [this table](https:\u002F\u002Fsites.google.com\u002Fsite\u002Fmojganserajicom\u002Fhome\u002Fupc\u002FTable_tag.pdf).\n- [Large-Scale Colloquial Persian](http:\u002F\u002Fhdl.handle.net\u002F11234\u002F1-3195): The Large Scale Colloquial Persian Dataset (LSCP) is hierarchically organized in a semantic taxonomy that treats multi-task informal Persian language understanding as a comprehensive problem. LSCP includes 120M sentences from 27M casual Persian tweets, with dependency relations as syntactic annotation, part-of-speech tags, sentiment polarity, and automatic translations of the original Persian sentences into English (EN), German (DE), Czech (CS), Italian (IT) and Hindi (HI). Learn more about this project at the [LSCP webpage](https:\u002F\u002Fiasbs.ac.ir\u002F~ansari\u002Flscp\u002F).\n- [ArmanPersoNERCorpus](https:\u002F\u002Fgithub.com\u002FHaniehP\u002FPersianNER): The dataset includes 250,015 tokens and 7,682 Persian sentences in total. It is available in 3 folds to be used in turn as training and test sets. Each file contains one token, along with its manually annotated named-entity tag, per line; sentences are separated by a newline. 
The NER tags are in IOB format.\n- [FarsiYar PersianNER](https:\u002F\u002Fgithub.com\u002FText-Mining\u002FPersian-NER): The dataset includes about 25,000,000 tokens and about 1,000,000 Persian sentences in total, based on the [Persian Wikipedia Corpus](https:\u002F\u002Fgithub.com\u002FText-Mining\u002FPersian-Wikipedia-Corpus). The NER tags are in IOB format. More than 1,000 volunteers contributed tag improvements to this dataset via a web panel or Android app, and updated tags are released every two weeks.\n- [PERLEX](http:\u002F\u002Ffarsbase.net\u002FPERLEX.html): The first Persian dataset for relation extraction, an expert-translated version of the “SemEval-2010 Task 8” dataset. Link to the relevant publication.\n- [Persian Syntactic Dependency Treebank](http:\u002F\u002Fdadegan.ir\u002Fcatalog\u002Fperdt): This treebank is supplied free for noncommercial use; for commercial use, contact the maintainers. It contains 29,982 annotated sentences, including samples from almost all verbs of the Persian valency lexicon.\n- [Uppsala Persian Dependency Treebank (UPDT)](http:\u002F\u002Fstp.lingfil.uu.se\u002F~mojgan\u002FUPDT.html): Dependency-based syntactically annotated corpus.\n- [Hamshahri](https:\u002F\u002Fdbrg.ut.ac.ir\u002Fhamshahri\u002F): The Hamshahri collection is a standard, reliable Persian text collection that was used at the Cross Language Evaluation Forum (CLEF) in 2008 and 2009 to evaluate Persian information retrieval systems.\n\n\n## NLP in Ukrainian\n\n[Back to Top](#contents)\n\n- [awesome-ukrainian-nlp](https:\u002F\u002Fgithub.com\u002Fasivokon\u002Fawesome-ukrainian-nlp) - a curated list of Ukrainian NLP datasets, models, etc.\n- [UkrainianLT](https:\u002F\u002Fgithub.com\u002FHelsinki-NLP\u002FUkrainianLT) - another curated list with a focus on machine translation and speech processing\n\n\n## NLP in Hungarian\n\n[Back to Top](#contents)\n\n- 
[awesome-hungarian-nlp](https:\u002F\u002Fgithub.com\u002Foroszgy\u002Fawesome-hungarian-nlp): A curated list of free resources dedicated to Hungarian Natural Language Processing.\n\n## NLP in Portuguese\n\n[Back to Top](#contents)\n\n- [Portuguese-nlp](https:\u002F\u002Fgithub.com\u002Fajdavidl\u002FPortuguese-NLP) - a list of resources and tools developed with a focus on Portuguese.\n\n## Other Languages\n\n- Russian: [pymorphy2](https:\u002F\u002Fgithub.com\u002Fkmike\u002Fpymorphy2) - a good pos-tagger for Russian\n- Asian Languages: Thai, Lao, Chinese, Japanese, and Korean [ICU Tokenizer](https:\u002F\u002Fwww.elastic.co\u002Fguide\u002Fen\u002Felasticsearch\u002Fplugins\u002Fcurrent\u002Fanalysis-icu-tokenizer.html) implementation in Elasticsearch\n- Ancient Languages: [CLTK](https:\u002F\u002Fgithub.com\u002Fcltk\u002Fcltk): The Classical Language Toolkit is a Python library and collection of texts for doing NLP in ancient languages\n- Hebrew: [NLPH_Resources](https:\u002F\u002Fgithub.com\u002FNLPH\u002FNLPH_Resources) - A collection of papers, corpora and linguistic resources for NLP in Hebrew\n\n[Back to Top](#contents)\n\n## Citation\n\nIf you find this repository useful, please consider citing this list:\n\n```bibtex\n@misc{awesome-nlp,\n  title  = {Awesome NLP},\n  author = {Kim, Keon and Chelikavada, Krish},\n  year   = {2018},\n  url    = {https:\u002F\u002Fgithub.com\u002Fkeon\u002Fawesome-nlp},\n  note   = {GitHub repository}\n}\n```\n\n### Core Contributors and Maintainers\n\n- [Krish Chelikavada](https:\u002F\u002Flinkedin.com\u002Fin\u002Fcskc1)\n- [Keon Kim](https:\u002F\u002Flinkedin.com\u002Fin\u002Fkeon)\n\n[Credits](.\u002FCREDITS.md) for initial curators and sources\n\n## License\n[License](.\u002FLICENSE) - CC0\n\n# 
令人惊叹的自然语言处理\n\n[![Awesome](https:\u002F\u002Fcdn.rawgit.com\u002Fsindresorhus\u002Fawesome\u002Fd7305f38d29fed78fa85652e3a63e154dd8e8829\u002Fmedia\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fsindresorhus\u002Fawesome)\n\n一份精心整理的自然语言处理资源列表\n\n![Awesome NLP Logo](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fkeon_awesome-nlp_readme_2033b3bda8b0.jpg)\n\n请阅读 [英文版](.\u002FREADME.md)，[繁体中文版](.\u002FREADME-ZH-TW.md)\n\n_在贡献之前，请先阅读[贡献指南](contributing.md)。请通过提交[拉取请求](https:\u002F\u002Fgithub.com\u002Fkeonkim\u002Fawesome-nlp\u002Fpulls)来添加您喜爱的NLP资源_\n\n## 目录\n\n* [研究综述与趋势](#research-summaries-and-trends)\n* [知名NLP研究实验室](#prominent-nlp-research-labs)\n* [教程](#tutorials)\n  * [阅读材料](#reading-content)\n  * [视频与在线课程](#videos-and-online-courses)\n  * [书籍](#books)\n* [库](#libraries)\n  * [Node.js](#node-js)\n  * [Python](#python)\n  * [C++](#c++)\n  * [Java](#java)\n  * [Kotlin](#kotlin)\n  * [Scala](#scala)\n  * [R](#R)\n  * [Clojure](#clojure)\n  * [Ruby](#ruby)\n  * [Rust](#rust)\n  * [NLP++](#NLP++)\n  * [Julia](#julia)\n* [服务](#services)\n* [标注工具](#annotation-tools)\n* [数据集](#datasets)\n* [韩语中的NLP](#nlp-in-korean)\n* [阿拉伯语中的NLP](#nlp-in-arabic)\n* [中文中的NLP](#nlp-in-chinese)\n* [德语中的NLP](#nlp-in-german)\n* [波兰语中的NLP](#nlp-in-polish)\n* [西班牙语中的NLP](#nlp-in-spanish)\n* [印地语系语言中的NLP](#nlp-in-indic-languages)\n* [泰语中的NLP](#nlp-in-thai)\n* [丹麦语中的NLP](#nlp-in-danish)\n* [越南语中的NLP](#nlp-in-vietnamese)\n* [荷兰语中的NLP](#nlp-for-dutch)\n* [印尼语中的NLP](#nlp-in-indonesian)\n* [乌尔都语中的NLP](#nlp-in-urdu)\n* [波斯语中的NLP](#nlp-in-persian)\n* [乌克兰语中的NLP](#nlp-in-ukrainian)\n* [匈牙利语中的NLP](#nlp-in-hungarian)\n* [葡萄牙语中的NLP](#nlp-in-portuguese)\n* [其他语言](#other-languages)\n* [引用](#citation)\n* [致谢](#credits)\n\n## 研究综述与趋势\n\n* [NLP-Overview](https:\u002F\u002Fnlpoverview.com\u002F) 是一个关于深度学习技术在NLP领域应用的最新概述，涵盖了理论、实现、应用以及最前沿的研究成果。对于研究人员来说，这是一个极佳的深度NLP入门资料。\n* [NLP-Progress](https:\u002F\u002Fnlpprogress.com\u002F) 跟踪自然语言处理领域的进展，包括常用NLP任务的数据集和当前最先进的技术水平。\n* 
[NLP的ImageNet时刻已经到来](https:\u002F\u002Fthegradient.pub\u002Fnlp-imagenet\u002F)\n* [ACL 2018亮点：在更具挑战性的场景中理解表示与评估](http:\u002F\u002Fruder.io\u002Facl-2018-highlights\u002F)\n* [ACL 2017的四大深度学习趋势。第一部分：语言结构与词嵌入](https:\u002F\u002Fwww.abigailsee.com\u002F2017\u002F08\u002F30\u002Ffour-deep-learning-trends-from-acl-2017-part-1.html)\n* [ACL 2017的四大深度学习趋势。第二部分：可解释性与注意力机制](https:\u002F\u002Fwww.abigailsee.com\u002F2017\u002F08\u002F30\u002Ffour-deep-learning-trends-from-acl-2017-part-2.html)\n* [EMNLP 2017亮点：令人兴奋的数据集、聚类的回归等！](http:\u002F\u002Fblog.aylien.com\u002Fhighlights-emnlp-2017-exciting-datasets-return-clusters\u002F)\n* [自然语言处理（NLP）中的深度学习：进展与趋势](https:\u002F\u002Ftryolabs.com\u002Fblog\u002F2017\u002F12\u002F12\u002Fdeep-learning-for-nlp-advancements-and-trends-in-2017\u002F?utm_campaign=Revue%20newsletter&utm_medium=Newsletter&utm_source=The%20Wild%20Week%20in%20AI)\n* [自然语言生成领域的现状综述](https:\u002F\u002Farxiv.org\u002Fabs\u002F1703.09902)\n\n## 知名NLP研究实验室\n[返回顶部](#contents)\n\n* [伯克利NLP小组](http:\u002F\u002Fnlp.cs.berkeley.edu\u002Findex.shtml) - 其著名贡献包括开发了一种用于重建已消亡语言的工具，该工具被[此处](https:\u002F\u002Fwww.bbc.com\u002Fnews\u002Fscience-environment-21427896)提及，并且他们还从亚洲和太平洋地区现存的637种语言中收集语料，重建这些语言的后代语言。\n* [卡内基梅隆大学语言技术研究所](http:\u002F\u002Fwww.cs.cmu.edu\u002F~nasmith\u002Fnlp-cl.html) - 其著名项目包括[Avenue项目](http:\u002F\u002Fwww.cs.cmu.edu\u002F~avenue\u002F)，这是一套基于语法的机器翻译系统，专为像克丘亚语和艾马拉语这样的濒危语言设计；此外，他们还曾开发过[诺亚方舟项目](http:\u002F\u002Fwww.cs.cmu.edu\u002F~ark\u002F)，该项目创建了[AQMAR](http:\u002F\u002Fwww.cs.cmu.edu\u002F~ark\u002FAQMAR\u002F)，旨在改进针对阿拉伯语的NLP工具。\n* [哥伦比亚大学NLP研究组](http:\u002F\u002Fwww1.cs.columbia.edu\u002Fnlp\u002Findex.cgi) - 该研究组负责开发BOLT（语音翻译系统的交互式错误处理系统）以及一项未命名的项目，用于描述对话中的笑声特征。\n* [约翰霍普金斯大学语言与语音处理中心](http:\u002F\u002Fclsp.jhu.edu\u002F) - 
最近因开发用于帕金森病诊断测试的语音识别软件而备受关注，详情见[这里](https:\u002F\u002Fwww.clsp.jhu.edu\u002F2019\u002F03\u002F27\u002Fspeech-recognition-software-and-machine-learning-tools-are-being-used-to-create-diagnostic-test-for-parkinsons-disease\u002F#.XNFqrIkzYdU)。\n* [马里兰大学计算语言学与信息处理小组](https:\u002F\u002Fwiki.umiacs.umd.edu\u002Fclip\u002Findex.php\u002FMain_Page) - 其著名贡献包括[人机协作式的逐字问答系统](http:\u002F\u002Fwww.umiacs.umd.edu\u002F~jbg\u002Fprojects\u002FIIS-1652666)以及对语音表征发展的建模研究。\n* [宾夕法尼亚大学自然语言处理小组](https:\u002F\u002Fnlp.cis.upenn.edu\u002F) - 以创建[Penn Treebank](https:\u002F\u002Fwww.seas.upenn.edu\u002F~pdtb\u002F)而闻名。\n* [斯坦福大学自然语言处理小组](https:\u002F\u002Fnlp.stanford.edu\u002F) - 作为全球顶尖的NLP研究实验室之一，其著名成就包括开发[Stanford CoreNLP](https:\u002F\u002Fnlp.stanford.edu\u002Fsoftware\u002Fcorenlp.shtml)及其[核心指代消解系统](https:\u002F\u002Fnlp.stanford.edu\u002Fsoftware\u002Fdcoref.shtml)\n\n\n## 教程\n[返回顶部](#contents)\n\n### 阅读内容\n\n通用机器学习\n\n* Google高级创意工程师提供的[机器学习101](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1kSuQyW5DTnkVaZEjGYCkfOxvzCqGEFzWBy4e9Uedd9k\u002Fedit?usp=sharing)，为工程师和高管们讲解机器学习基础知识\n* [AI行动指南](https:\u002F\u002Faiplaybook.a16z.com\u002F) - a16z的AI行动指南是非常适合转发给管理层或用于演示文稿的内容\n* [塞巴斯蒂安·鲁德尔](https:\u002F\u002Ftwitter.com\u002Fseb_ruder)的[Ruder博客](http:\u002F\u002Fruder.io\u002F#open)，提供关于自然语言处理研究前沿的评论\n* [数据标注指南](https:\u002F\u002Fwww.lighttag.io\u002Fhow-to-label-data\u002F)：管理大型语言标注项目的指南\n* [取决于定义](https:\u002F\u002Fwww.depends-on-the-definition.com\u002F)：涵盖广泛NLP主题并附有详细实现的博客文章合集\n\n自然语言处理入门与指南\n\n* [理解并实现自然语言处理](https:\u002F\u002Fwww.analyticsvidhya.com\u002Fblog\u002F2017\u002F01\u002Fultimate-guide-to-understand-implement-natural-language-processing-codes-in-python\u002F)\n* [Python中的NLP](http:\u002F\u002Fgithub.com\u002FNirantK\u002Fnlp-python-deep-learning)：GitHub上的笔记本合集\n* [自然语言处理导论](https:\u002F\u002Facademic.oup.com\u002Fjamia\u002Farticle\u002F18\u002F5\u002F544\u002F829676) - 牛津大学出版\n* 
[使用PyTorch进行NLP深度学习](https:\u002F\u002Fpytorch.org\u002Ftutorials\u002Fbeginner\u002Fdeep_learning_nlp_tutorial.html)\n* [动手实践NLTK教程](https:\u002F\u002Fgithub.com\u002Fhb20007\u002Fhands-on-nltk-tutorial)：NLTK教程，包含Jupyter笔记本\n* [用Python进行自然语言处理——利用自然语言工具包分析文本](https:\u002F\u002Fwww.nltk.org\u002Fbook\u002F)：一本在线及纸质书籍，介绍使用NLTK进行NLP的概念。该书作者同时也是NLTK库的开发者。\n* [从零开始训练新的语言模型](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fhow-to-train) - Hugging Face 🤗\n* [超级NLP资源库（SDNLPR）](https:\u002F\u002Fnotebooks.quantumstat.com\u002F)：涵盖多种NLP任务实现的Colab笔记本合集\n* [使用spaCy进阶NLP](https:\u002F\u002Fcourse.spacy.io\u002Fen\u002F)：免费在线课程，涵盖文本处理、大规模数据分析、处理流水线以及针对自定义NLP任务训练神经网络模型等内容\n* [Kaggle NLP学习指南](https:\u002F\u002Fwww.kaggle.com\u002Flearn-guide\u002Fnatural-language-processing)：适合初学者的教程，包括入门指南、NLP深度学习以及对BERT、GloVe和TF-IDF等技术的可视化解释。\n\n博客与通讯\n\n* [深度学习、NLP与表示学习](https:\u002F\u002Fcolah.github.io\u002Fposts\u002F2014-07-NLP-RNNs-Representations\u002F)\n* [图解BERT、ELMo等（NLP如何破解迁移学习）](https:\u002F\u002Fjalammar.github.io\u002Fillustrated-bert\u002F) 和 [图解Transformer](https:\u002F\u002Fjalammar.github.io\u002Fillustrated-transformer\u002F)\n* [自然语言处理](https:\u002F\u002Fnlpers.blogspot.com\u002F) 由Hal Daumé III撰写\n* [arXiv：从零开始的自然语言处理（几乎）](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1103.0398.pdf)\n* [卡帕西的《循环神经网络的不合理有效性》](https:\u002F\u002Fkarpathy.github.io\u002F2015\u002F05\u002F21\u002Frnn-effectiveness)\n* [机器学习精通：自然语言处理的深度学习](https:\u002F\u002Fmachinelearningmastery.com\u002Fcategory\u002Fnatural-language-processing)\n* [NLP论文可视化摘要](https:\u002F\u002Famitness.com\u002Fcategories\u002F#nlp)\n\n### 视频与在线课程\n[返回顶部](#contents)\n\n* [高级自然语言处理](https:\u002F\u002Fpeople.cs.umass.edu\u002F~miyyer\u002Fcs685_f20\u002F) - 马萨诸塞大学阿默斯特分校CS 685课程\n* [深度自然语言处理](https:\u002F\u002Fgithub.com\u002Foxford-cs-deepnlp-2017\u002Flectures) - 牛津大学系列讲座\n* [斯坦福大学CS224n：自然语言处理的深度学习](https:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fcs224n\u002F) - 由理查德·索彻和克里斯托弗·曼宁主讲的斯坦福课程\n* 
[卡内基梅隆大学语言技术研究所的NLP神经网络课程](http:\u002F\u002Fphontron.com\u002Fclass\u002Fnn4nlp2017\u002F) - 专注于NLP领域的神经网络应用\n* [Yandex数据学院的深度NLP课程](https:\u002F\u002Fgithub.com\u002Fyandexdataschool\u002Fnlp_course)，涵盖从文本嵌入到机器翻译的重要概念，包括序列建模、语言模型等\n* [fast.ai代码优先的自然语言处理入门](https:\u002F\u002Fwww.fast.ai\u002F2019\u002F07\u002F08\u002Ffastai-nlp\u002F)：本课程结合了传统NLP主题（如正则表达式、奇异值分解、朴素贝叶斯、分词）和最新的神经网络方法（如RNN、seq2seq、GRU以及Transformer），同时探讨偏见和虚假信息等紧迫的伦理问题。相关Jupyter笔记本可在[此处](https:\u002F\u002Fgithub.com\u002Ffastai\u002Fcourse-nlp)找到\n* [机器学习大学——加速自然语言处理](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL8P_Z6C4GcuWfAq8Pt6PBYlck4OprHXsw)：课程内容从NLP和文本处理的基础知识，逐步深入到循环神经网络和Transformer模型\n资料可在此处找到：[AWS机器学习大学加速NLP课程](https:\u002F\u002Fgithub.com\u002Faws-samples\u002Faws-machine-learning-university-accelerated-nlp)\n* [马德拉斯印度理工学院的应用自然语言处理](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLH-xYrxjfO2WyR3pOAB006CYMhNt4wTqp)：系列讲座从基础概念讲起，一直延伸到自编码器等内容。该课程的GitHub笔记本也可在此处获取：[Ramaseshanr的ANLP课程](https:\u002F\u002Fgithub.com\u002FRamaseshanr\u002Fanlp)\n* [DeepLearning.AI自然语言处理专项课程](https:\u002F\u002Fwww.deeplearning.ai\u002Fcourses\u002Fnatural-language-processing-specialization\u002F)：四门课程组成的专项计划，涵盖情感分析、词嵌入、RNN、LSTM、注意力机制以及BERT和T5等Transformer模型，适用于机器翻译和文本摘要等任务。\n\n### 书籍\n\n* [语音与语言处理](https:\u002F\u002Fweb.stanford.edu\u002F~jurafsky\u002Fslp3\u002F) - 免费，由丹·朱拉夫斯基教授编写\n* [自然语言处理](https:\u002F\u002Fgithub.com\u002Fjacobeisenstein\u002Fgt-nlp-class) - 免费，乔治亚理工学院雅各布·艾森斯坦博士的NLP笔记\n* [PyTorch自然语言处理](https:\u002F\u002Fgithub.com\u002Fjoosthub\u002FPyTorchNLPBook) - 布赖恩 & 德利普·拉奥\n* [R语言文本挖掘](https:\u002F\u002Fwww.tidytextmining.com)\n* [用Python进行自然语言处理](https:\u002F\u002Fwww.nltk.org\u002Fbook\u002F)\n* [实用自然语言处理](https:\u002F\u002Fwww.oreilly.com\u002Flibrary\u002Fview\u002Fpractical-natural-language\u002F9781492054047\u002F)\n* [使用Spark NLP进行自然语言处理](https:\u002F\u002Fwww.oreilly.com\u002Flibrary\u002Fview\u002Fnatural-language-processing\u002F9781492047759\u002F)\n* 
[自然语言处理的深度学习](https:\u002F\u002Fwww.manning.com\u002Fbooks\u002Fdeep-learning-for-natural-language-processing) 斯蒂芬·拉伊马克尔斯著\n* [真实世界的自然语言处理](https:\u002F\u002Fwww.manning.com\u002Fbooks\u002Freal-world-natural-language-processing) - 萩原正人著\n* [自然语言处理实战（第二版）](https:\u002F\u002Fwww.manning.com\u002Fbooks\u002Fnatural-language-processing-in-action-second-edition) - 霍布森·莱恩和玛丽亚·迪谢尔著\n* [Transformer实战](https:\u002F\u002Fwww.manning.com\u002Fbooks\u002Ftransformers-in-action) - 妮可·科恩格斯坦著\n* [人工智能背后的数学](https:\u002F\u002Fwww.freecodecamp.org\u002Fnews\u002Fthe-math-behind-artificial-intelligence-book) - 蒂亚戈·蒙特罗著 | 一本免费的FreeCodeCamp书籍，从工程角度以通俗易懂的语言讲解AI背后的数学知识。内容涵盖线性代数、微积分、概率统计以及最优化理论，并配有类比、实际应用和Python代码示例。\n\n## 库\n\n[返回顶部](#contents)\n\n* \u003Ca id=\"node-js\">**Node.js和JavaScript** - 用于NLP的Node.js库\u003C\u002Fa> | [返回顶部](#contents)\n  * [Twitter-text](https:\u002F\u002Fgithub.com\u002Ftwitter\u002Ftwitter-text) - Twitter文本处理库的JavaScript实现\n  * [Knwl.js](https:\u002F\u002Fgithub.com\u002Fbenhmoore\u002FKnwl.js) - JavaScript中的自然语言处理器\n  * [Retext](https:\u002F\u002Fgithub.com\u002Fretextjs\u002Fretext) - 用于分析和操作自然语言的可扩展系统\n  * [NLP Compromise](https:\u002F\u002Fgithub.com\u002Fspencermountain\u002Fcompromise) - 浏览器端的自然语言处理工具\n  * [Natural](https:\u002F\u002Fgithub.com\u002FNaturalNode\u002Fnatural) - Node.js中通用的自然语言处理工具\n  * [Poplar](https:\u002F\u002Fgithub.com\u002Fsynyi\u002Fpoplar) - 一款基于Web的自然语言处理（NLP）标注工具\n  * [NLP.js](https:\u002F\u002Fgithub.com\u002Faxa-group\u002Fnlp.js) - 用于构建聊天机器人的NLP库\n  * [node-question-answering](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fnode-question-answering) - 在Node.js中使用DistilBERT实现快速且可用于生产的问答系统\n\n* \u003Ca id=\"python\"> **Python** - Python NLP 库\u003C\u002Fa> | [返回顶部](#contents)\n  - [sentimental-onix](https:\u002F\u002Fgithub.com\u002Fsloev\u002Fsentimental-onix) - 使用 ONNX 的 spaCy 情感分析模型\n  - [TextAttack](https:\u002F\u002Fgithub.com\u002FQData\u002FTextAttack) - NLP 中的对抗攻击、对抗训练和数据增强\n  - 
[TextBlob](http:\u002F\u002Ftextblob.readthedocs.org\u002F) - 提供一致的 API，用于处理常见的自然语言处理任务。基于 [Natural Language Toolkit (NLTK)](https:\u002F\u002Fwww.nltk.org\u002F) 和 [Pattern](https:\u002F\u002Fgithub.com\u002Fclips\u002Fpattern) 构建，并与两者良好兼容 :+1:\n  - [spaCy](https:\u002F\u002Fgithub.com\u002Fexplosion\u002FspaCy) - 基于 Python 和 Cython 的工业级 NLP 工具 :+1:\n  - [Speedster](https:\u002F\u002Fgithub.com\u002Fnebuly-ai\u002Fnebullvm\u002Ftree\u002Fmain\u002Fapps\u002Faccelerate\u002Fspeedster) - 自动应用 SOTA 优化技术，以在您的硬件上实现最大的推理加速\n  - [textacy](https:\u002F\u002Fgithub.com\u002Fchartbeat-labs\u002Ftextacy) - 基于 spaCy 构建的更高层次的 NLP 工具\n  - [gensim](https:\u002F\u002Fradimrehurek.com\u002Fgensim\u002Findex.html) - 用于从纯文本中进行无监督语义建模的 Python 库 :+1:\n  - [scattertext](https:\u002F\u002Fgithub.com\u002FJasonKessler\u002Fscattertext) - 用于生成语言在不同语料库之间差异的 d3 可视化效果的 Python 库\n  - [GluonNLP](https:\u002F\u002Fgithub.com\u002Fdmlc\u002Fgluon-nlp) - 基于 MXNet\u002FGluon 的深度学习 NLP 工具包，适用于广泛 NLP 任务的研究原型开发和工业部署。\n  - [AllenNLP](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fallennlp) - 基于 PyTorch 的 NLP 研究库，用于开发各种语言任务中的最先进深度学习模型。\n  - [PyTorch-NLP](https:\u002F\u002Fgithub.com\u002FPetrochukM\u002FPyTorch-NLP) - NLP 研究工具包，旨在支持快速原型开发，提供更好的数据加载器、词向量加载器、神经网络层表示以及 BLEU 等常见 NLP 指标。\n  - [Rosetta](https:\u002F\u002Fgithub.com\u002Fcolumbia-applied-data-science\u002Frosetta) - 文本处理工具和封装器（例如 Vowpal Wabbit）\n  - [PyNLPl](https:\u002F\u002Fgithub.com\u002Fproycon\u002Fpynlpl) - Python 自然语言处理库。通用 NLP 库，可处理 ARPA 语言模型、Moses 短语表、GIZA++ 对齐等特定格式。\n  - [foliapy](https:\u002F\u002Fgithub.com\u002Fproycon\u002Ffoliapy) - 用于处理 [FoLiA](https:\u002F\u002Fproycon.github.io\u002Ffolia\u002F) 的 Python 库，这是一种用于语言学标注的 XML 格式。\n  - [PySS3](https:\u002F\u002Fgithub.com\u002Fsergioburdisso\u002Fpyss3) - 实现了一种新颖的白盒机器学习文本分类模型 SS3 的 Python 包。由于 SS3 能够直观地解释其推理过程，该包还附带易于使用的交互式可视化工具（[在线演示](http:\u002F\u002Ftworld.io\u002Fss3\u002F)）。\n  - [jPTDP](https:\u002F\u002Fgithub.com\u002Fdatquocnguyen\u002FjPTDP) - 
用于联合词性标注和依存句法分析的工具包。jPTDP 提供 40 多种语言的预训练模型。\n  - [BigARTM](https:\u002F\u002Fgithub.com\u002Fbigartm\u002Fbigartm) - 一种快速的主题建模库\n  - [Snips NLU](https:\u002F\u002Fgithub.com\u002Fsnipsco\u002Fsnips-nlu) - 一个可用于生产的意图解析库\n  - [Chazutsu](https:\u002F\u002Fgithub.com\u002Fchakki-works\u002Fchazutsu) - 用于下载和解析标准 NLP 研究数据集的库\n  - [Word Forms](https:\u002F\u002Fgithub.com\u002Fgutfeeling\u002Fword_forms) - 可准确生成英语单词的所有可能形式\n  - [多语言潜在狄利克雷分配 (LDA)](https:\u002F\u002Fgithub.com\u002FArtificiAI\u002FMultilingual-Latent-Dirichlet-Allocation-LDA) - 一个多语言且可扩展的文档聚类流水线\n  - [Natural Language Toolkit (NLTK)](https:\u002F\u002Fwww.nltk.org\u002F) - 包含多种 NLP 功能的库，支持超过 50 种语料库。\n  - [NLP Architect](https:\u002F\u002Fgithub.com\u002FNervanaSystems\u002Fnlp-architect) - 用于探索 NLP 和 NLU 领域最先进深度学习拓扑和技术的库\n  - [Flair](https:\u002F\u002Fgithub.com\u002Fzalandoresearch\u002Fflair) - 基于 PyTorch 的非常简单的最先进的多语言 NLP 框架。包括 BERT、ELMo 和 Flair 嵌入。\n  - [Kashgari](https:\u002F\u002Fgithub.com\u002FBrikerMan\u002FKashgari) - 简单、基于 Keras 的多语言 NLP 框架，允许您在 5 分钟内为命名实体识别 (NER)、词性标注 (PoS) 和文本分类任务构建模型。包含 BERT 和 word2vec 嵌入。\n  - [FARM](https:\u002F\u002Fgithub.com\u002Fdeepset-ai\u002FFARM) - 快速且易于使用的 NLP 迁移学习。为行业提取语言模型。专注于问答任务。\n  - [Haystack](https:\u002F\u002Fgithub.com\u002Fdeepset-ai\u002Fhaystack) - 用于构建数据自然语言搜索界面的端到端 Python 框架。利用 Transformer 和 NLP 最先进技术。支持 DPR、Elasticsearch、HuggingFace’s Modelhub 等！\n  - [PraisonAI](https:\u002F\u002Fgithub.com\u002FMervinPraison\u002FPraisonAI) - 多 AI 代理框架，通过 LiteLLM 支持 100 多个 LLM，集成 MCP，支持代理工作流，并内置记忆功能，适用于 NLP 任务。\n  - [Rita DSL](https:\u002F\u002Fgithub.com\u002Fzaibacu\u002Frita-dsl) - 一种基于 [RUTA on Apache UIMA](https:\u002F\u002Fuima.apache.org\u002Fruta.html) 的 DSL。允许定义语言模式（基于规则的 NLP），然后将其转换为 [spaCy](https:\u002F\u002Fspacy.io\u002F)，或者如果您更喜欢功能较少、轻量级的正则表达式模式。\n  - [Transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) - 面向 TensorFlow 2.0 和 PyTorch 的自然语言处理库。\n  - [Tokenizers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftokenizers) - 
针对研究和生产优化的分词器。\n  - [fairSeq](https:\u002F\u002Fgithub.com\u002Fpytorch\u002Ffairseq) Facebook AI Research 在 PyTorch 中实现的 SOTA 序列到序列模型。\n  - [corex_topic](https:\u002F\u002Fgithub.com\u002Fgregversteeg\u002Fcorex_topic) - 无需太多领域知识的层次化主题建模\n  - [Sockeye](https:\u002F\u002Fgithub.com\u002Fawslabs\u002Fsockeye) - 亚马逊 Translate 后台运行的神经机器翻译 (NMT) 工具包。\n  - [DL Translate](https:\u002F\u002Fgithub.com\u002Fxhlulu\u002Fdl-translate) - 基于 `transformers` 和 Facebook 的 mBART Large 构建的 50 种语言的深度学习翻译库。\n  - [Jury](https:\u002F\u002Fgithub.com\u002Fobss\u002Fjury) - 用于评估 NLP 模型输出的各种自动化指标。\n  - [python-ucto](https:\u002F\u002Fgithub.com\u002Fproycon\u002Fpython-ucto) - 一种支持 Unicode 的基于正则表达式的多语言分词器。它是 C++ 库的 Python 绑定，支持 [FoLiA 格式](https:\u002F\u002Fproycon.github.io\u002Ffolia)。\n  - [Pearmut](https:\u002F\u002Fgithub.com\u002Fzouharvi\u002Fpearmut) - 用于多语言 NLP 任务（如机器翻译）的人工标注工具。\n\n- \u003Ca id=\"c++\">**C++** - C++ 库\u003C\u002Fa> | [返回顶部](#contents)\n  - [InsNet](https:\u002F\u002Fgithub.com\u002Fchncwang\u002FInsNet) - 一个用于构建实例相关自然语言处理模型的神经网络库，支持无填充的动态批处理。\n  - [MIT 信息抽取工具包](https:\u002F\u002Fgithub.com\u002Fmit-nlp\u002FMITIE) - 用于命名实体识别和关系抽取的 C、C++ 和 Python 工具。\n  - [CRF++](https:\u002F\u002Ftaku910.github.io\u002Fcrfpp\u002F) - 条件随机场（CRFs）的开源实现，用于序列数据的分段\u002F标注及其他自然语言处理任务。\n  - [CRFsuite](http:\u002F\u002Fwww.chokkan.org\u002Fsoftware\u002Fcrfsuite\u002F) - CRFsuite 是条件随机场（CRFs）的实现，用于序列数据的标注。\n  - [BLLIP 解析器](https:\u002F\u002Fgithub.com\u002FBLLIP\u002Fbllip-parser) - BLLIP 自然语言解析器（也称为 Charniak-Johnson 解析器）。\n  - [colibri-core](https:\u002F\u002Fgithub.com\u002Fproycon\u002Fcolibri-core) - C++ 库、命令行工具及 Python 绑定，用于以快速且内存高效的方式提取和处理基本的语言学结构，如 n-gram 和 skipgram。\n  - [ucto](https:\u002F\u002Fgithub.com\u002FLanguageMachines\u002Fucto) - 支持多种语言的 Unicode 感知正则表达式分词器。工具及 C++ 库。支持 FoLiA 格式。\n  - [libfolia](https:\u002F\u002Fgithub.com\u002FLanguageMachines\u002Flibfolia) - 用于 [FoLiA 格式](https:\u002F\u002Fproycon.github.io\u002Ffolia\u002F) 的 C++ 库。\n  - 
[frog](https:\u002F\u002Fgithub.com\u002FLanguageMachines\u002Ffrog) - 为荷兰语开发的基于记忆的 NLP 套件：词性标注器、词形还原器、依存句法分析器、NER、浅层分析器、形态分析器。\n  - [MeTA](https:\u002F\u002Fgithub.com\u002Fmeta-toolkit\u002Fmeta) - [MeTA : ModErn Text Analysis](https:\u002F\u002Fmeta-toolkit.org\u002F) 是一个 C++ 数据科学工具包，便于挖掘大型文本数据。\n  - [Mecab（日语）](https:\u002F\u002Ftaku910.github.io\u002Fmecab\u002F)\n  - [Moses](http:\u002F\u002Fstatmt.org\u002Fmoses\u002F)\n  - [StarSpace](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FStarSpace) - Facebook 提供的用于创建单词级、段落级、文档级嵌入以及文本分类的库。\n  - [QSMM](http:\u002F\u002Fqsmm.org) - 自适应概率自顶向下和自底向上的解析器。\n\n- \u003Ca id=\"java\">**Java** - Java NLP 库\u003C\u002Fa> | [返回顶部](#contents)\n  - [斯坦福 NLP](https:\u002F\u002Fnlp.stanford.edu\u002Fsoftware\u002Findex.shtml)\n  - [OpenNLP](https:\u002F\u002Fopennlp.apache.org\u002F)\n  - [NLP4J](https:\u002F\u002Femorynlp.github.io\u002Fnlp4j\u002F)\n  - [Java 中的 Word2vec](https:\u002F\u002Fdeeplearning4j.org\u002Fdocs\u002Flatest\u002Fdeeplearning4j-nlp-word2vec)\n  - [ReVerb](https:\u002F\u002Fgithub.com\u002Fknowitall\u002Freverb\u002F) 网络规模开放信息抽取。\n  - [OpenRegex](https:\u002F\u002Fgithub.com\u002Fknowitall\u002Fopenregex) 一种高效灵活的基于标记的正则表达式语言和引擎。\n  - [CogcompNLP](https:\u002F\u002Fgithub.com\u002FCogComp\u002Fcogcomp-nlp) - 伊利诺伊大学认知计算小组开发的核心库。\n  - [MALLET](http:\u002F\u002Fmallet.cs.umass.edu\u002F) - MAchine Learning for LanguagE Toolkit - 用于统计自然语言处理、文档分类、聚类、主题建模、信息抽取以及其他文本机器学习应用的软件包。\n  - [RDRPOSTagger](https:\u002F\u002Fgithub.com\u002Fdatquocnguyen\u002FRDRPOSTagger) - 一套鲁棒的词性标注工具，提供 Java 和 Python 版本，并附带 40 多种语言的预训练模型。\n\n- \u003Ca id=\"kotlin\">**Kotlin** - Kotlin NLP 库\u003C\u002Fa> | [返回顶部](#contents)\n  - [Lingua](https:\u002F\u002Fgithub.com\u002Fpemistahl\u002Flingua\u002F) - 适用于 Kotlin 和 Java 的语言检测库，适合长短文本。\n  - [Kotidgy](https:\u002F\u002Fgithub.com\u002Fmeiblorn\u002Fkotidgy) - 用 Kotlin 编写的基于索引的文本数据生成器。\n\n- \u003Ca id=\"scala\">**Scala** - Scala NLP 库\u003C\u002Fa> | [返回顶部](#contents)\n  - 
[Saul](https:\u002F\u002Fgithub.com\u002FCogComp\u002Fsaul) - 用于开发 NLP 系统的库，内置 SRL、POS 等模块。\n  - [ATR4S](https:\u002F\u002Fgithub.com\u002Fispras\u002Fatr4s) - 包含最先进 [自动术语识别](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTerminology_extraction) 方法的工具包。\n  - [tm](https:\u002F\u002Fgithub.com\u002Fispras\u002Ftm) - 基于正则化多语言 [PLSA](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FProbabilistic_latent_semantic_analysis) 的主题建模实现。\n  - [word2vec-scala](https:\u002F\u002Fgithub.com\u002FRefefer\u002Fword2vec-scala) - Word2vec 模型的 Scala 接口；包含词距离、词类比等向量操作。\n  - [Epic](https:\u002F\u002Fgithub.com\u002Fdlwh\u002Fepic) - Epic 是用 Scala 编写的高性能统计句法分析器，并配有用于构建复杂结构化预测模型的框架。\n  - [Spark NLP](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Fspark-nlp) - Spark NLP 是基于 Apache Spark ML 构建的自然语言处理库，为可在分布式环境中轻松扩展的机器学习管道提供简单、高效且准确的 NLP 注释。\n\n- \u003Ca id=\"R\">**R** - R NLP 库\u003C\u002Fa> | [返回顶部](#contents)\n  - [text2vec](https:\u002F\u002Fgithub.com\u002Fdselivanov\u002Ftext2vec) - 在 R 中实现快速向量化、主题建模、距离计算以及 GloVe 词嵌入。\n  - [wordVectors](https:\u002F\u002Fgithub.com\u002Fbmschmidt\u002FwordVectors) - 用于创建和探索 word2vec 及其他词嵌入模型的 R 包。\n  - [RMallet](https:\u002F\u002Fgithub.com\u002Fmimno\u002FRMallet) - 与 Java 机器学习工具 MALLET 对接的 R 包。\n  - [dfr-browser](https:\u002F\u002Fgithub.com\u002Fagoldst\u002Fdfr-browser) - 在网页浏览器中创建 d3 可视化，用于浏览文本的主题模型。\n  - [dfrtopics](https:\u002F\u002Fgithub.com\u002Fagoldst\u002Fdfrtopics) - 用于探索文本主题模型的 R 包。\n  - [sentiment_classifier](https:\u002F\u002Fgithub.com\u002Fkevincobain2000\u002Fsentiment_classifier) - 使用词义消歧和 WordNet 阅读器进行情感分类。\n  - [jProcessing](https:\u002F\u002Fgithub.com\u002Fkevincobain2000\u002FjProcessing) - 日语自然语言处理库，包含日语情感分类功能。\n  - [corporaexplorer](https:\u002F\u002Fkgjerde.github.io\u002Fcorporaexplorer\u002F) - 用于动态探索文本集合的 R 包。\n  - [tidytext](https:\u002F\u002Fgithub.com\u002Fjuliasilge\u002Ftidytext) - 使用整洁工具进行文本挖掘。\n  - [spacyr](https:\u002F\u002Fgithub.com\u002Fquanteda\u002Fspacyr) - spaCy NLP 的 R 封装。\n  - [CRAN 
任务视图：自然语言处理](https:\u002F\u002Fgithub.com\u002Fcran-task-views\u002FNaturalLanguageProcessing\u002F)\n\n- \u003Ca id=\"clojure\">**Clojure**\u003C\u002Fa> | [返回顶部](#contents)\n  - [Clojure-openNLP](https:\u002F\u002Fgithub.com\u002Fdakrone\u002Fclojure-opennlp) - Clojure 中的自然语言处理（opennlp）。\n  - [Inflections-clj](https:\u002F\u002Fgithub.com\u002Fr0man\u002Finflections-clj) - 类似 Rails 的词形变化库，适用于 Clojure 和 ClojureScript。\n  - [postagga](https:\u002F\u002Fgithub.com\u002Ffekr\u002Fpostagga) - 用于在 Clojure 和 ClojureScript 中解析自然语言的库。\n\n- \u003Ca id=\"ruby\">**Ruby**\u003C\u002Fa> | [返回顶部](#contents)\n  - Kevin Dias 的 [自然语言处理（NLP）Ruby 库、工具和软件合集](https:\u002F\u002Fgithub.com\u002Fdiasks2\u002Fruby-nlp)\n  - [用 Ruby 实践自然语言处理](https:\u002F\u002Fgithub.com\u002Farbox\u002Fnlp-with-ruby)\n\n- \u003Ca id=\"rust\">**Rust**\u003C\u002Fa> | [返回顶部](#contents)\n  - [adk-rust](https:\u002F\u002Fgithub.com\u002Fzavora-ai\u002Fadk-rust) - 生产就绪的AI智能体开发工具包，具有模型无关的设计（支持Gemini、OpenAI、Anthropic等），支持多种智能体类型及MCP协议\n  - [whatlang](https:\u002F\u002Fgithub.com\u002Fgreyblake\u002Fwhatlang-rs) - 基于三元语法（trigram）的自然语言识别库\n  - [snips-nlu-rs](https:\u002F\u002Fgithub.com\u002Fsnipsco\u002Fsnips-nlu-rs) - 用于意图解析的生产级库\n  - [rust-bert](https:\u002F\u002Fgithub.com\u002Fguillaume-be\u002Frust-bert) - 即用型NLP流水线和基于Transformer的模型\n\n- \u003Ca id=\"NLP++\">**NLP++** - NLP++语言\u003C\u002Fa> | [返回顶部](#contents)\n  - [VSCode语言扩展](https:\u002F\u002Fmarketplace.visualstudio.com\u002Fitems?itemName=dehilster.nlp) - 适用于VSCode的NLP++语言扩展\n  - [nlp-engine](https:\u002F\u002Fgithub.com\u002FVisualText\u002Fnlp-engine) - 在Linux上运行NLP++代码的引擎，包含完整的英语语法分析器\n  - [VisualText](http:\u002F\u002Fvisualtext.org) - NLP++语言的主页\n  - [NLP++维基](http:\u002F\u002Fwiki.naturalphilosophy.org\u002Findex.php?title=NLP%2B%2B) - NLP++语言的维基条目\n\n- \u003Ca id=\"julia\">**Julia**\u003C\u002Fa> | [返回顶部](#contents)\n  - [CorpusLoaders](https:\u002F\u002Fgithub.com\u002FJuliaText\u002FCorpusLoaders.jl) - 用于各种NLP语料库的加载器\n  - 
[Languages](https:\u002F\u002Fgithub.com\u002FJuliaText\u002FLanguages.jl) - 处理人类语言的软件包\n  - [TextAnalysis](https:\u002F\u002Fgithub.com\u002FJuliaText\u002FTextAnalysis.jl) - Julia用于文本分析的软件包\n  - [TextModels](https:\u002F\u002Fgithub.com\u002FJuliaText\u002FTextModels.jl) - 基于神经网络的自然语言处理模型\n  - [WordTokenizers](https:\u002F\u002Fgithub.com\u002FJuliaText\u002FWordTokenizers.jl) - 高性能的分词工具，适用于自然语言处理及其他相关任务\n  - [Word2Vec](https:\u002F\u002Fgithub.com\u002FJuliaText\u002FWord2Vec.jl) - Julia对word2vec的接口\n\n\n\n### 服务\n\n以API形式提供的NLP服务，具备更高层次的功能，如命名实体识别、主题标注等 | [返回顶部](#contents)\n\n- [Vedika API](https:\u002F\u002Fvedika.io) - 基于AI的吠陀占星学API，采用多智能体群集智能\n- [Wit-ai](https:\u002F\u002Fgithub.com\u002Fwit-ai\u002Fwit) - 应用程序和设备的自然语言交互界面\n- [IBM Watson的自然语言理解](https:\u002F\u002Fgithub.com\u002Fwatson-developer-cloud\u002Fnatural-language-understanding-nodejs) - API及GitHub演示\n- [Amazon Comprehend](https:\u002F\u002Faws.amazon.com\u002Fcomprehend\u002F) - NLP和ML套件，涵盖最常见的任务，如NER、标签化和情感分析\n- [Google Cloud自然语言API](https:\u002F\u002Fcloud.google.com\u002Fnatural-language\u002F) - 支持至少9种语言的句法分析、NER、情感分析和内容标签化，包括英语、简体中文和繁体中文。\n- [ParallelDots](https:\u002F\u002Fwww.paralleldots.com\u002Ftext-analysis-apis) - 高层次文本分析API服务，从情感分析到意图分析\n- [Microsoft认知服务](https:\u002F\u002Fazure.microsoft.com\u002Fen-us\u002Fservices\u002Fcognitive-services\u002Ftext-analytics\u002F)\n- [TextRazor](https:\u002F\u002Fwww.textrazor.com\u002F)\n- [Rosette](https:\u002F\u002Fwww.rosette.com\u002F)\n- [Textalytic](https:\u002F\u002Fwww.textalytic.com) - 浏览器端的自然语言处理服务，提供情感分析、命名实体提取、词性标注、词频统计、主题建模、词云等功能\n- [NLP Cloud](https:\u002F\u002Fnlpcloud.io) - 通过RESTful API提供SpaCy NLP模型（自定义和预训练模型），用于命名实体识别、词性标注等任务。\n- [Cloudmersive](https:\u002F\u002Fcloudmersive.com\u002Fnlp-api) - 统一且免费的NLP API，可执行诸如语音标注、文本改写、语言翻译\u002F检测以及句子解析等操作。\n\n### 标注工具\n\n- [GATE](https:\u002F\u002Fgate.ac.uk\u002Foverview.html) - 通用架构与文本工程系统，已有15年以上历史，免费且开源\n- [Anafora](https:\u002F\u002Fgithub.com\u002Fweitechen\u002Fanafora) 
是一款免费开源的基于Web的原始文本标注工具\n- [brat](https:\u002F\u002Fbrat.nlplab.org\u002F) - brat快速标注工具是一个用于协作式文本标注的在线环境\n- [doccano](https:\u002F\u002Fgithub.com\u002Fchakki-works\u002Fdoccano) - doccano是免费开源的，提供文本分类、序列标注及序列到序列任务的标注功能\n- [INCEpTION](https:\u002F\u002Finception-project.github.io) - 语义标注平台，提供智能辅助和知识管理\n- [tagtog](https:\u002F\u002Fwww.tagtog.net\u002F)，一款以团队为核心的Web工具，用于查找、创建、维护和共享数据集 - 需付费\n- [prodigy](https:\u002F\u002Fprodi.gy\u002F) 是一款基于主动学习的标注工具，需付费\n- [LightTag](https:\u002F\u002Flighttag.io) - 托管式文本标注工具，专为团队设计，需付费\n- [rstWeb](https:\u002F\u002Fcorpling.uis.georgetown.edu\u002Frstweb\u002Finfo\u002F) - 开源的本地或在线工具，用于话语树结构的标注\n- [GitDox](https:\u002F\u002Fcorpling.uis.georgetown.edu\u002Fgitdox\u002F) - 开源的服务器端标注工具，结合GitHub版本控制和XML数据验证，适用于协作式电子表格网格\n- [Label Studio](https:\u002F\u002Fwww.heartex.ai\u002F) - 托管式的文本标注工具，专为团队设计，采用免费增值模式，需付费\n- [Datasaur](https:\u002F\u002Fdatasaur.ai\u002F) 支持个人或团队的各种NLP任务，采用免费增值模式\n- [Konfuzio](https:\u002F\u002Fkonfuzio.com\u002Fen\u002F) - 一款以团队为核心的托管式文本、图像和PDF标注工具，基于主动学习，采用免费增值模式，需付费\n- [UBIAI](https:\u002F\u002Fubiai.tools\u002F) - 易于使用的团队文本标注工具，具备最全面的自动标注功能。支持NER、关系抽取和文档分类，以及发票标注的OCR标注，需付费\n- [Shoonya](https:\u002F\u002Fgithub.com\u002FAI4Bharat\u002FShoonya-Backend) - Shoonya是一款免费开源的数据标注平台，拥有丰富的组织和工作空间管理系统。Shoonya不依赖特定数据格式，可供团队大规模标注数据，并设置不同级别的验证阶段。\n- [Annotation Lab](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fannotation-lab\u002F) - 免费的端到端无代码平台，用于文本标注和深度学习模型的训练\u002F调优。开箱即用，支持命名实体识别、分类、关系抽取和断言状态等Spark NLP模型。不限用户、团队、项目和文档数量。非开源。\n- [FLAT](https:\u002F\u002Fgithub.com\u002Fproycon\u002Fflat) - FLAT是一个基于[FoLiA格式](http:\u002F\u002Fproycon.github.io\u002Ffolia)的Web语言学标注环境，FoLiA是一种丰富的基于XML的语言学标注格式。免费且开源。\n\n\n## 技术\n\n### 文本嵌入\n\n#### 词嵌入\n\n- 经验法则：**fastText >> GloVe > word2vec**\n\n- [word2vec](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F5021-distributed-representations-of-words-and-phrases-and-their-compositionality.pdf) - [实现](https:\u002F\u002Fcode.google.com\u002Farchive\u002Fp\u002Fword2vec\u002F) - 
[解释性博客](http:\u002F\u002Fcolah.github.io\u002Fposts\u002F2014-07-NLP-RNNs-Representations\u002F)\n- [glove](https:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf) - [解释性博客](https:\u002F\u002Fblog.acolyer.org\u002F2016\u002F04\u002F22\u002Fglove-global-vectors-for-word-representation\u002F)\n- fasttext - [实现](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText) - [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F1607.04606) - [解释性博客](https:\u002F\u002Ftowardsdatascience.com\u002Ffasttext-under-the-hood-11efc57b2b3)\n\n#### 基于句子和语言模型的词嵌入\n\n[返回顶部](#contents)\n\n- ElMo - [深度上下文相关的词表示](https:\u002F\u002Farxiv.org\u002Fabs\u002F1802.05365) - [PyTorch实现](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fallennlp\u002Fblob\u002Fmaster\u002Ftutorials\u002Fhow_to\u002Felmo.md) - [TF实现](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fbilm-tf)\n- ULMFiT - [用于文本分类的通用语言模型微调](https:\u002F\u002Farxiv.org\u002Fabs\u002F1801.06146) 由Jeremy Howard和Sebastian Ruder提出\n- InferSent - [基于自然语言推理数据的有监督学习通用句子表示](https:\u002F\u002Farxiv.org\u002Fabs\u002F1705.02364) 由Facebook提出\n- CoVe - [在翻译中学习：上下文相关的词向量](https:\u002F\u002Farxiv.org\u002Fabs\u002F1708.00107)\n- 段落向量 - 来自[句子和文档的分布式表示](https:\u002F\u002Fcs.stanford.edu\u002F~quocle\u002Fparagraph_vector.pdf)。参见[Gensim中的doc2vec教程](https:\u002F\u002Frare-technologies.com\u002Fdoc2vec-tutorial\u002F)\n- [sense2vec](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.06388) - 用于词义消歧\n- [Skip Thought Vectors](https:\u002F\u002Farxiv.org\u002Fabs\u002F1506.06726) - 词表示方法\n- [自适应skip-gram](https:\u002F\u002Farxiv.org\u002Fabs\u002F1502.07257) - 类似的方法，具有自适应特性\n- [序列到序列学习](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F5346-sequence-to-sequence-learning-with-neural-networks.pdf) - 用于机器翻译的词向量\n\n### 问答与知识抽取\n\n[返回顶部](#contents)\n\n- [DrQA](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FDrQA) - Facebook Research基于维基百科数据的开放域问答工作\n- [Document-QA](https:\u002F\u002Fgithub.com\u002Fallenai\u002Fdocument-qa) - AllenAI的简单有效的多段落阅读理解\n- 
[无模板的基于模板的信息抽取](https:\u002F\u002Fwww.usna.edu\u002FUsers\u002Fcs\u002Fnchamber\u002Fpubs\u002Facl2011-chambers-templates.pdf)\n- [Privee：一种自动分析网络隐私政策的架构](https:\u002F\u002Fwww.sebastianzimmeck.de\u002FzimmeckAndBellovin2014Privee.pdf)\n\n## 数据集\n\n[返回顶部](#contents)\n\n- [nlp-datasets](https:\u002F\u002Fgithub.com\u002Fniderhoff\u002Fnlp-datasets) 优秀的NLP数据集集合\n- [gensim-data](https:\u002F\u002Fgithub.com\u002FRaRe-Technologies\u002Fgensim-data) - 预训练NLP模型和NLP语料库的数据仓库。\n- [tiny_qa_benchmark_pp](https:\u002F\u002Fgithub.com\u002Fvincentkoc\u002Ftiny_qa_benchmark_pp\u002F) - 小型多语言问答数据集存储库，以及生成您自己的合成数据的库。\n\n## 多语言NLP框架\n\n[返回顶部](#contents)\n\n- [UDPipe](https:\u002F\u002Fgithub.com\u002Fufal\u002Fudpipe) 是一个可训练的流水线，用于分词、词性标注、词形还原和解析通用树库及其他CoNLL-U文件。主要用C++编写，为多语言NLP处理提供快速可靠解决方案。\n- [NLP-Cube](https:\u002F\u002Fgithub.com\u002Fadobe\u002FNLP-Cube)：自然语言处理流水线 - 句子分割、分词、词形还原、词性标注和依存句法分析。新平台，用Python和Dynet 2.0编写。提供独立（CLI\u002FPython绑定）和服务器功能（REST API）。\n- [UralicNLP](https:\u002F\u002Fgithub.com\u002Fmikahama\u002FuralicNLP) 是一个主要用于许多濒危乌拉尔语系语言的NLP库，如萨米语、莫尔多瓦语、马里语、科米语等。同时也支持一些非濒危语言，如芬兰语，以及非乌拉尔语系的语言，如瑞典语和阿拉伯语。UralicNLP可以进行形态分析、生成、词形还原和消歧。\n\n## 韩语NLP\n\n[返回顶部](#contents)\n\n### 库\n\n- [KoNLPy](http:\u002F\u002Fkonlpy.org) - 用于韩语自然语言处理的Python包。\n- [Mecab (韩语)](https:\u002F\u002Feunjeon.blogspot.com\u002F) - 用于韩语NLP的C++库\n- [KoalaNLP](https:\u002F\u002Fkoalanlp.github.io\u002Fkoalanlp\u002F) - 用于韩语自然语言处理的Scala库。\n- [KoNLP](https:\u002F\u002Fcran.r-project.org\u002Fpackage=KoNLP) - 用于韩语自然语言处理的R包\n\n### 博客和教程\n\n- [dsindex的博客](https:\u002F\u002Fdsindex.github.io\u002F)\n- [江原大学的韩语NLP课程](http:\u002F\u002Fcs.kangwon.ac.kr\u002F~leeck\u002FNLP\u002F)\n\n### 数据集\n\n- [KAIST语料库](http:\u002F\u002Fsemanticweb.kaist.ac.kr\u002Fhome\u002Findex.php\u002FKAIST_Corpus) - 来自韩国科学技术院的韩语语料库。\n- [Naver情感电影语料库](https:\u002F\u002Fgithub.com\u002Fe9t\u002Fnsmc\u002F)\n- [朝鲜日报档案](http:\u002F\u002Fsrchdb1.chosun.com\u002Fpdf\u002Fi_archive\u002F) - 来自韩国主要报纸之一朝鲜日报的韩语数据集。\n- 
[聊天数据](https:\u002F\u002Fgithub.com\u002Fsongys\u002FChatbot_data) - 韩语聊天机器人数据\n- [请愿书](https:\u002F\u002Fgithub.com\u002Fakngs\u002Fpetitions) - 收集来自青瓦台国家请愿网站的已过期请愿数据。\n- [韩英平行语料库](https:\u002F\u002Fgithub.com\u002Fj-min\u002Fkorean-parallel-corpora) - 用于**韩语到法语**及**韩语到英语**的神经机器翻译(NMT)数据集\n- [KorQuAD](https:\u002F\u002Fkorquad.github.io\u002F) - 韩语SQuAD数据集，包含Wiki HTML源。在添加到Awesome NLP时提到了v1.0和v2.1版本。\n\n## 阿拉伯语NLP\n\n[返回顶部](#contents)\n\n### 库\n\n- [goarabic](https:\u002F\u002Fgithub.com\u002F01walid\u002Fgoarabic) - 用于阿拉伯语文本处理的Go包\n- [jsastem](https:\u002F\u002Fgithub.com\u002Fejtaal\u002Fjsastem) - 用于阿拉伯语词干提取的Javascript\n- [PyArabic](https:\u002F\u002Fpypi.org\u002Fproject\u002FPyArabic\u002F) - 用于阿拉伯语的Python库\n- [RFTokenizer](https:\u002F\u002Fgithub.com\u002Famir-zeldes\u002FRFTokenizer) - 可训练的Python分词器，适用于阿拉伯语、希伯来语和科普特语。\n\n### 数据集\n\n- [多领域数据集](https:\u002F\u002Fgithub.com\u002Fhadyelsahar\u002Flarge-arabic-sentiment-analysis-resouces) - 当前可用的最大规模阿拉伯语情感分析多领域资源\n- [LABR](https:\u002F\u002Fgithub.com\u002Fmohamedadaly\u002Flabr) - 大型阿拉伯语书籍评论数据集\n- [阿拉伯语停用词](https:\u002F\u002Fgithub.com\u002Fmohataher\u002Farabic-stop-words) - 来自各种资源的阿拉伯语停用词列表\n\n## 中文NLP\n\n[返回顶部](#contents)\n\n### 库\n\n- [jieba](https:\u002F\u002Fgithub.com\u002Ffxsjy\u002Fjieba#jieba-1) - 用于中文分词的Python包\n- [SnowNLP](https:\u002F\u002Fgithub.com\u002Fisnowfy\u002Fsnownlp) - 用于中文NLP的Python包\n- [FudanNLP](https:\u002F\u002Fgithub.com\u002FFudanNLP\u002Ffnlp) - 用于中文文本处理的Java库\n- [HanLP](https:\u002F\u002Fgithub.com\u002Fhankcs\u002FHanLP) - 多语言NLP库\n\n### 文集\n- [funNLP](https:\u002F\u002Fgithub.com\u002Ffighting41love\u002FfunNLP) - 主要针对中文的自然语言处理工具和资源集合\n\n## 德语中的NLP\n\n- [German-NLP](https:\u002F\u002Fgithub.com\u002Fadbar\u002FGerman-NLP) - 一个精心整理的开放获取\u002F开源\u002F现成资源和工具列表，特别关注德语。\n\n## 波兰语中的NLP\n\n- [Polish-NLP](https:\u002F\u002Fgithub.com\u002Fksopyla\u002Fawesome-nlp-polish) - 一个专门用于波兰语自然语言处理（NLP）的精选资源列表。包括模型、工具和数据集。\n\n## 西班牙语中的NLP\n\n[返回顶部](#contents)\n\n### 库\n\n- 
[spanlp](https:\u002F\u002Fgithub.com\u002Fjfreddypuentes\u002Fspanlp) - 一个Python库，用于检测、审查和清理西班牙语文本中的脏话、粗俗用语、仇恨言论、种族主义、仇外心理和欺凌行为。它包含了21个西班牙语国家的数据。\n\n### 数据\n\n- [哥伦比亚政治演讲](https:\u002F\u002Fgithub.com\u002Fdav009\u002FLatinamericanTextResources)\n- [哥本哈根依存树库](https:\u002F\u002Fmbkromann.github.io\u002Fcopenhagen-dependency-treebank\u002F)\n- [带有Word2Vec词向量的西班牙语十亿词语料库](https:\u002F\u002Fgithub.com\u002Fcrscardellino\u002Fsbwce)\n- [西班牙未标注语料汇编](https:\u002F\u002Fgithub.com\u002Fjosecannete\u002Fspanish-unannotated-corpora)\n\n### 词和句子嵌入\n\n- [使用不同方法和来自不同语料库计算的西班牙语词嵌入](https:\u002F\u002Fgithub.com\u002Fdccuchile\u002Fspanish-word-embeddings)\n- [使用fastText从大型语料库中计算的不同大小的西班牙语词嵌入](https:\u002F\u002Fgithub.com\u002FBotCenter\u002FspanishWordEmbeddings)\n- [使用sent2vec从大型语料库计算的西班牙语句子嵌入](https:\u002F\u002Fgithub.com\u002FBotCenter\u002FspanishSent2Vec)\n- [Beto - 面向西班牙语的BERT](https:\u002F\u002Fgithub.com\u002Fdccuchile\u002Fbeto)\n\n\n## 印度语言中的NLP\n\n[返回顶部](#contents)\n\n### 数据、语料库和树库\n\n- [印地语依存树库](https:\u002F\u002Fltrc.iiit.ac.in\u002Ftreebank_H2014\u002F) - 一个面向印地语和乌尔都语的多表征、多层树库。\n- [印地语通用依存树库](https:\u002F\u002Funiversaldependencies.org\u002Ftreebanks\u002Fhi_hdtb\u002Findex.html)\n  - [印地语平行通用依存树库](http:\u002F\u002Funiversaldependencies.org\u002Ftreebanks\u002Fhi_pud\u002Findex.html) - 上述树库的一个较小部分。\n- [ISI FIRE停用词表（印地语和孟加拉语）](https:\u002F\u002Fwww.isical.ac.in\u002F~fire\u002Fdata\u002F)\n- [Peter Graham的停用词表](https:\u002F\u002Fgithub.com\u002F6\u002Fstopwords-json)\n- [NLTK语料库](https:\u002F\u002Fwww.nltk.org\u002Fbook\u002Fch02.html) 6万词的词性标注，包括孟加拉语、印地语、马拉地语和泰卢固语。\n- [印地语电影评论数据集](https:\u002F\u002Fgithub.com\u002Fgoru001\u002Fnlp-for-hindi) 约1000个样本，3个极性类别。\n- [BBC新闻印地语数据集](https:\u002F\u002Fgithub.com\u002FNirantK\u002Fhindi2vec\u002Freleases\u002Ftag\u002Fbbc-hindi-v0.1) 4300个样本，14个类别。\n- [IIT Patna印地语ABSA数据集](https:\u002F\u002Fgithub.com\u002Fpnisarg\u002FABSA) 5400个样本，12个领域，4000个方面术语，方面和句子级别的极性分为4类。\n- 
[孟加拉语ABSA](https:\u002F\u002Fgithub.com\u002FAtikRahman\u002FBangla_Datasets_ABSA) 5500个样本，2个领域，10个方面术语。\n- [IIT Patna电影评论情感数据集](https:\u002F\u002Fwww.iitp.ac.in\u002F~ai-nlp-ml\u002Fresources.html) 2000个样本，3种极性标签。\n\n#### 需要登录\u002F访问权限的语料库\u002F数据集可通过电子邮件获取\n\n- [SAIL 2015](http:\u002F\u002Famitavadas.com\u002FSAIL\u002F) 推特和脸书上印地语、孟加拉语、泰米尔语和泰卢固语的情感标注样本。\n- [IIT Bombay NLP资源](http:\u002F\u002Fwww.cfilt.iitb.ac.in\u002FSentiment_Analysis_Resources.html) 包括Sentiwordnet、电影和旅游领域的平行标注语料库、带极性标注的语义标注语料库以及马拉地语的极性标注语料库。\n- [TDIL-IC汇集了许多有用的资源，并提供对原本受限制数据集的访问](https:\u002F\u002Ftdil-dc.in\u002Findex.php?option=com_catalogue&task=viewTools&id=83&lang=en)\n\n### 语言模型和词嵌入\n\n- [Hindi2Vec](https:\u002F\u002Fnirantk.com\u002Fhindi2vec\u002F) 和 [nlp-for-hindi](https:\u002F\u002Fgithub.com\u002Fgoru001\u002Fnlp-for-hindi) ULMFIT风格的语言模型。\n- [IIT Patna双语词嵌入Hi-En](https:\u002F\u002Fwww.iitp.ac.in\u002F~ai-nlp-ml\u002Fresources.html)\n- [Fasttext在多种语言上的词嵌入，基于Common Crawl训练](https:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fcrawl-vectors.html)\n- [印地语和孟加拉语Word2Vec](https:\u002F\u002Fgithub.com\u002FKyubyong\u002Fwordvectors)\n- [印地语和乌尔都语Elmo模型](https:\u002F\u002Fgithub.com\u002FHIT-SCIR\u002FELMoForManyLangs)\n- [梵语Albert](https:\u002F\u002Fhuggingface.co\u002Fsurajp\u002Falbert-base-sanskrit) 在梵语维基百科和OSCAR语料库上训练。\n\n### 库和工具\n\n- [多任务深度形态分析器](https:\u002F\u002Fgithub.com\u002FSaurav0074\u002Fmt-dma) 基于深度网络的印地语和乌尔都语形态解析器。\n- [Anoop Kunchukuttan](https:\u002F\u002Fgithub.com\u002Fanoopkunchukuttan\u002Findic_nlp_library) 支持18种语言，涵盖从分词到翻译的多种功能。\n- [SivaReddy的依存句法分析器](http:\u002F\u002Fsivareddy.in\u002Fdownloads) 可用于坎那达语、印地语和泰卢固语的依存句法分析和词性标注。[Python3移植版](https:\u002F\u002Fgithub.com\u002FCalmDownKarm\u002Fsivareddydependencyparser)。\n- [iNLTK](https:\u002F\u002Fgithub.com\u002Fgoru001\u002Finltk) - 一个基于Pytorch\u002FFastai构建的印度语言自然语言工具包，旨在为常见的NLP任务提供开箱即用的支持。\n\n## 泰语中的NLP\n\n[返回顶部](#contents)\n\n### 库\n\n- [PyThaiNLP](https:\u002F\u002Fgithub.com\u002FPyThaiNLP\u002Fpythainlp) - 
泰语NLP的Python包。\n- [JTCC](https:\u002F\u002Fgithub.com\u002Fwittawatj\u002Fjtcc) - 一个Java字符聚类库。\n- [CutKum](https:\u002F\u002Fgithub.com\u002Fpucktada\u002Fcutkum) - 使用TensorFlow进行深度学习的分词工具。\n- [泰国语言工具包](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Ftltk\u002F) - 基于Wirote Aroonmanakun在2002年发表的一篇论文，并附带数据集。\n- [SynThai](https:\u002F\u002Fgithub.com\u002FKenjiroAI\u002FSynThai) - 使用Python深度学习进行分词和词性标注。\n\n### 数据\n\n- [Inter-BEST](https:\u002F\u002Fwww.nectec.or.th\u002Fcorpus\u002Findex.php?league=pm) - 一个包含500万词且已分词的文本语料库。\n- [Prime Minister 29](https:\u002F\u002Fgithub.com\u002FPyThaiNLP\u002Flexicon-thai\u002Ftree\u002Fmaster\u002Fthai-corpus\u002FPrime%20Minister%2029) - 包含泰国现任总理演讲的数据集。\n\n## 丹麦语中的NLP\n\n- [丹麦语命名实体识别](https:\u002F\u002Fgithub.com\u002FITUnlp\u002Fdaner)\n- [DaNLP](https:\u002F\u002Fgithub.com\u002Falexandrainst\u002Fdanlp) - 丹麦语NLP资源。\n- [Awesome Danish](https:\u002F\u002Fgithub.com\u002Ffnielsen\u002Fawesome-danish) - 一个精心挑选的丹麦语语言技术优秀资源列表。\n\n## 越南语中的NLP\n\n### 库\n\n- [underthesea](https:\u002F\u002Fgithub.com\u002Fundertheseanlp\u002Funderthesea) - 越南语NLP工具包。\n- [vn.vitk](https:\u002F\u002Fgithub.com\u002Fphuonglh\u002Fvn.vitk) - 一个越南语文本处理工具包。\n- [VnCoreNLP](https:\u002F\u002Fgithub.com\u002Fvncorenlp\u002FVnCoreNLP) - 一个越南语自然语言处理工具包。\n- [PhoBERT](https:\u002F\u002Fgithub.com\u002FVinAIResearch\u002FPhoBERT) - 预训练的越南语语言模型。\n- [pyvi](https:\u002F\u002Fgithub.com\u002Ftrungtv\u002Fpyvi) - Python越南语核心NLP工具包。\n- [VieNeu-TTS](https:\u002F\u002Fgithub.com\u002Fpnnbao97\u002FVieNeu-TTS) - 一款先进的设备端越南语文本转语音系统，支持即时语音克隆。\n\n### 数据\n\n- [越南语树库](https:\u002F\u002Fvlsp.hpda.vn\u002Fdemo\u002F?page=resources&lang=en) - 用于句法分析任务的10,000个句子\n- [BKTreeBank](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1710.05519.pdf) - 越南语依存树库\n- [UD_Vietnamese](https:\u002F\u002Fgithub.com\u002FUniversalDependencies\u002FUD_Vietnamese-VTB) - 越南语通用依存树库\n- [VIVOS](https:\u002F\u002Failab.hcmus.edu.vn\u002Fvivos\u002F) - 免费的越南语语音语料库，包含AILab录制的15小时语音数据\n- 
[VNTQcorpus(big).txt](http:\u002F\u002Fviet.jnlp.org\u002Fdownload-du-lieu-tu-vung-corpus) - 新闻领域的175万句文本\n- [ViText2SQL](https:\u002F\u002Fgithub.com\u002FVinAIResearch\u002FViText2SQL) - 用于越南语文本到SQL语义解析的数据集（EMNLP-2020 Findings）\n- [EVB语料库](https:\u002F\u002Fgithub.com\u002Fqhungngo\u002FEVBCorpus) - 包含来自15本双语书籍的2000万词、100篇英越\u002F越英平行文本、250篇法律和条例的平行文本、5000篇新闻文章以及2000条电影字幕。\n\n\n## 荷兰语自然语言处理\n\n[返回顶部](#contents)\n\n- [python-frog](https:\u002F\u002Fgithub.com\u002Fproycon\u002Fpython-frog) - Frog的Python绑定，Frog是针对荷兰语的NLP工具包。（词性标注、词形还原、依存句法分析、命名实体识别）\n- [SimpleNLG_NL](https:\u002F\u002Fgithub.com\u002Frfdj\u002FSimpleNLG-NL) - 基于SimpleNLG的英语和法语实现，用于荷兰语自然语言生成的荷兰语表面生成器。\n- [Alpino](https:\u002F\u002Fgithub.com\u002Frug-compling\u002Falpino) - 荷兰语依存句法分析器（同时进行词性标注和词形还原）。\n- [Kaldi NL](https:\u002F\u002Fgithub.com\u002Fopensource-spraakherkenning-nl\u002FKaldi_NL) - 基于[Kaldi](http:\u002F\u002Fkaldi-asr.org\u002F)的荷兰语语音识别模型。\n- [spaCy](https:\u002F\u002Fspacy.io\u002F) - 提供荷兰语模型。[荷兰语模型](https:\u002F\u002Fspacy.io\u002Fmodels\u002Fnl) - 使用Python和Cython实现的工业级NLP技术。\n\n\n## 印尼语自然语言处理\n\n### 数据集\n- Kompas和Tempo语料，可在[ILPS](http:\u002F\u002Filps.science.uva.nl\u002Fresources\u002Fbahasa\u002F)获取\n- [PANL10N用于词性标注](http:\u002F\u002Fwww.panl10n.net\u002Fenglish\u002Foutputs\u002FIndonesia\u002FUI\u002F0802\u002FUI-1M-tagged.zip): 39,000个句子和900,000个词项\n- [IDN用于词性标注](https:\u002F\u002Fgithub.com\u002Ffamrashel\u002Fidn-tagged-corpus): 该语料包含10,000个句子和250,000个词项\n- [印尼语树库](https:\u002F\u002Fgithub.com\u002Ffamrashel\u002Fidn-treebank)和[通用依存-印尼语](https:\u002F\u002Fgithub.com\u002FUniversalDependencies\u002FUD_Indonesian-GSD)\n- [IndoSum](https:\u002F\u002Fgithub.com\u002Fkata-ai\u002Findosum)可用于文本摘要和分类任务\n- [Wordnet-Bahasa](http:\u002F\u002Fwn-msa.sourceforge.net\u002F) - 大型免费语义词典\n- IndoBenchmark [IndoNLU](https:\u002F\u002Fgithub.com\u002Findobenchmark\u002Findonlu)包括预训练的语言模型（IndoBERT）、FastText模型、Indo4B语料库以及多个NLU基准数据集\n\n### 库与词嵌入\n- 
自然语言工具包[bahasa](https:\u002F\u002Fgithub.com\u002Fkangfend\u002Fbahasa)\n- [印尼语词嵌入](https:\u002F\u002Fgithub.com\u002Fgaluhsahid\u002Findonesian-word-embedding)\n- 预训练的[印尼语fastText文本嵌入](https:\u002F\u002Fs3-us-west-1.amazonaws.com\u002Ffasttext-vectors\u002Fwiki.id.zip)，基于维基百科训练而成\n- IndoBenchmark [IndoNLU](https:\u002F\u002Fgithub.com\u002Findobenchmark\u002Findonlu)包括预训练的语言模型（IndoBERT）、FastText模型、Indo4B语料库以及多个NLU基准数据集\n\n## 乌尔都语自然语言处理\n\n### 数据集\n- [乌尔都语数据集集合](https:\u002F\u002Fgithub.com\u002Fmirfan899\u002FUrdu)适用于词性标注、命名实体识别及其他NLP任务。\n\n### 库\n- [自然语言处理库](https:\u002F\u002Fgithub.com\u002Furduhack\u002Furduhack)专为（巴基斯坦）乌尔都语设计。\n\n## 波斯语自然语言处理\n\n[返回顶部](#contents)\n\n### 库\n- [Hazm](https:\u002F\u002Fgithub.com\u002Froshan-research\u002Fhazm) - 波斯语NLP工具箱。\n- [Parsivar](https:\u002F\u002Fgithub.com\u002FICTRC\u002FParsivar): 波斯语语言处理工具箱\n- [Perke](https:\u002F\u002Fgithub.com\u002FAlirezaTheH\u002Fperke): Perke是一个用于波斯语的关键短语提取Python包。它提供了一个端到端的关键短语提取流程，其中每个组件都可以轻松修改或扩展以开发新模型。\n- [Perstem](https:\u002F\u002Fgithub.com\u002Fjonsafari\u002Fperstem): 波斯语词干提取器、形态分析器、转写器以及部分词性标注器\n- [ParsiAnalyzer](https:\u002F\u002Fgithub.com\u002FNarimanN2\u002FParsiAnalyzer): 用于Elasticsearch的波斯语分析器\n- [virastar](https:\u002F\u002Fgithub.com\u002Faziz\u002Fvirastar): 用于清理波斯语文本！\n\n### 数据集\n- [Bijankhan语料库](https:\u002F\u002Fdbrg.ut.ac.ir\u002Fبیژن%E2%80%8Cخان\u002F)：Bijankhan语料库是一个标注过的语料库，适用于波斯语（法尔西语）的自然语言处理研究。该语料库由日常新闻和常用文本组成，所有文档被分类为政治、文化等不同主题。总共包含4300个不同的主题。Bijankhan语料库包含约260万个手动标注的词语，使用的标签集包含40个波斯语词性标签。\n- [乌普萨拉波斯语语料库（UPC）](https:\u002F\u002Fsites.google.com\u002Fsite\u002Fmojganserajicom\u002Fhome\u002Fupc)：乌普萨拉波斯语语料库（UPC）是一个大型且免费开放的波斯语语料库。该语料库是Bijankhan语料库的修改版本，增加了句子切分和一致的分词，共包含2,704,028个词素，并用31个词性标签进行了标注。词性标签及其解释列于[此表格](https:\u002F\u002Fsites.google.com\u002Fsite\u002Fmojganserajicom\u002Fhome\u002Fupc\u002FTable_tag.pdf)中。\n- 
[大规模口语波斯语](http:\u002F\u002Fhdl.handle.net\u002F11234\u002F1-3195)：大规模口语波斯语数据集（LSCP）按照语义分类体系进行层次化组织，旨在将非正式波斯语理解作为一个综合性问题来解决。LSCP包含来自2700万条波斯语推文的1.2亿个句子，这些句子带有句法标注中的依存关系、词性标签、情感极性，以及原始波斯语文本到英语（EN）、德语（DE）、捷克语（CS）、意大利语（IT）和印地语（HI）的自动翻译。更多关于该项目的信息，请访问[LSCP网页](https:\u002F\u002Fiasbs.ac.ir\u002F~ansari\u002Flscp\u002F)。\n- [ArmanPersoNERCorpus](https:\u002F\u002Fgithub.com\u002FHaniehP\u002FPersianNER)：该数据集共包含250,015个词素和7,682句波斯语文本。数据以三折形式提供，可轮流作为训练集和测试集使用。每行包含一个词素及其手动标注的命名实体标签，句子之间以换行符分隔。NER标签采用IOB格式。\n- [FarsiYar PersianNER](https:\u002F\u002Fgithub.com\u002FText-Mining\u002FPersian-NER)：该数据集基于[波斯语维基百科语料库](https:\u002F\u002Fgithub.com\u002FText-Mining\u002FPersian-Wikipedia-Corpus)，共包含约2500万个词素和约100万句波斯语文本。NER标签采用IOB格式。超过1000名志愿者通过网络平台或安卓应用为该数据集的标签改进做出了贡献，他们每两周发布一次更新后的标签。\n- [PERLEX](http:\u002F\u002Ffarsbase.net\u002FPERLEX.html)：首个用于关系抽取的波斯语数据集，它是“Semeval-2010-Task-8”数据集的专家翻译版本。相关论文链接。\n- [波斯语句法依存树库](http:\u002F\u002Fdadegan.ir\u002Fcatalog\u002Fperdt)：该树库供非商业用途免费使用。如需商业用途，请随时联系我们。已标注的句子数量为29,982句，涵盖了波斯语价性词典中几乎所有动词的样本。\n- [乌普萨拉波斯语依存树库（UPDT）](http:\u002F\u002Fstp.lingfil.uu.se\u002F~mojgan\u002FUPDT.html)：基于依存关系的句法标注语料库。\n- [Hamshahri](https:\u002F\u002Fdbrg.ut.ac.ir\u002Fhamshahri\u002F)：Hamshahri语料库是一套标准且可靠的波斯语文本集合，在2008年和2009年期间曾被用于跨语言评估论坛（CLEF），以评估波斯语信息检索系统。\n\n\n## 乌克兰语NLP\n\n[返回顶部](#contents)\n\n- [awesome-ukrainian-nlp](https:\u002F\u002Fgithub.com\u002Fasivokon\u002Fawesome-ukrainian-nlp) - 一份精选的乌克兰语NLP数据集、模型等列表。\n- [UkrainianLT](https:\u002F\u002Fgithub.com\u002FHelsinki-NLP\u002FUkrainianLT) - 另一份精选列表，专注于机器翻译和语音处理。\n\n\n## 匈牙利语NLP\n\n[返回顶部](#contents)\n\n- [awesome-hungarian-nlp](https:\u002F\u002Fgithub.com\u002Foroszgy\u002Fawesome-hungarian-nlp)：一份专门针对匈牙利语自然语言处理的免费资源精选列表。\n\n## 葡萄牙语NLP\n\n[返回顶部](#contents)\n\n- [Portuguese-nlp](https:\u002F\u002Fgithub.com\u002Fajdavidl\u002FPortuguese-NLP) - 一份专注于葡萄牙语开发的资源和工具列表。\n\n## 其他语言\n\n- 俄语：[pymorphy2](https:\u002F\u002Fgithub.com\u002Fkmike\u002Fpymorphy2) - 一款优秀的俄语词性标注工具。\n- 
亚洲语言：泰语、老挝语、汉语、日语和韩语可在Elasticsearch中使用[ICU分词器](https:\u002F\u002Fwww.elastic.co\u002Fguide\u002Fen\u002Felasticsearch\u002Fplugins\u002Fcurrent\u002Fanalysis-icu-tokenizer.html)实现。\n- 古代语言：[CLTK](https:\u002F\u002Fgithub.com\u002Fcltk\u002Fcltk)：古典语言工具包是一个Python库及文本集合，用于古代语言的NLP研究。\n- 希伯来语：[NLPH_Resources](https:\u002F\u002Fgithub.com\u002FNLPH\u002FNLPH_Resources) - 一份包含论文、语料库和语言学资源的集合，用于希伯来语的NLP研究。\n\n[返回顶部](#contents)\n\n## 引用\n如果您觉得本仓库有用，请考虑引用此列表：\n\n```bibtex\n@misc{awesome-nlp,\n  title  = {Awesome NLP},\n  author = {Kim, Keon and Chelikavada, Krish},\n  year   = {2018},\n  url    = {https:\u002F\u002Fgithub.com\u002Fkeon\u002Fawesome-nlp},\n  note   = {GitHub仓库}\n}\n```\n\n### 核心贡献者和维护者\n\n- [Krish Chelikavada](https:\u002F\u002Flinkedin.com\u002Fin\u002Fcskc1)\n- [Keon Kim](https:\u002F\u002Flinkedin.com\u002Fin\u002Fkeon)\n\n[致谢](.\u002FCREDITS.md)列出最初的策划者和资料来源。\n\n## 许可证\n[许可证](.\u002FLICENSE) - CC0","# awesome-nlp 快速上手指南\n\n`awesome-nlp` 并非一个可直接安装的软件库或框架，而是一个**精选的自然语言处理（NLP）资源清单**。它汇集了全球顶尖的研究论文、教程、开源库、数据集及服务，旨在帮助开发者快速找到适合特定编程语言或任务的 NLP 工具。\n\n本指南将指导你如何利用该清单，在中国网络环境下高效获取并启动主流的 NLP 开发环境（以目前最流行的 Python 生态为例）。\n\n## 环境准备\n\n在开始探索 `awesome-nlp` 推荐的工具前，请确保你的开发环境满足以下基础要求：\n\n*   **操作系统**：Windows 10\u002F11, macOS, 或 Linux (Ubuntu\u002FCentOS 推荐)。\n*   **Python 版本**：建议安装 **Python 3.8 - 3.10**（大多数 NLP 库对最新版本的兼容性尚在测试中）。\n*   **包管理器**：已安装 `pip` 或 `conda`。\n*   **国内加速配置**（强烈推荐）：\n    由于 NLP 模型和依赖包体积较大，建议配置国内镜像源以提升下载速度。\n    \n    *临时使用（单次命令）：*\n    ```bash\n    pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple \u003Cpackage_name>\n    ```\n    \n    *永久配置：*\n    ```bash\n    pip config set global.index-url https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n    ```\n\n## 安装步骤\n\n由于 `awesome-nlp` 是资源列表，你需要根据需求选择具体的库进行安装。以下是清单中推荐的几个核心 Python 库的安装方法：\n\n### 1. 
安装基础处理库 (NLTK & spaCy)\n适用于文本分词、词性标注等基础任务。\n\n```bash\n# 安装 NLTK\npip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple nltk\n\n# 安装 spaCy (包含中文模型)\npip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple spacy\npython -m spacy download zh_core_web_sm\n```\n\n### 2. 安装深度学习框架 (PyTorch & Transformers)\n适用于使用 BERT、GPT 等预训练模型进行高级 NLP 任务。清单中多次提及 Hugging Face 生态。\n\n```bash\n# 安装 PyTorch (此处使用官方 CUDA 11.8 轮子源；纯 CPU 环境可省略 --index-url 参数)\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n\n# 安装 Hugging Face Transformers 库\npip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple transformers datasets accelerate\n```\n\n### 3. 克隆资源清单 (可选)\n如果你想本地浏览完整的资源列表或贡献代码：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fkeon\u002Fawesome-nlp.git\ncd awesome-nlp\n```\n\n## 基本使用\n\n以下示例展示如何使用 `awesome-nlp` 清单中推荐的主流工具进行最简单的中文文本处理。\n\n### 示例 1：使用 spaCy 进行中文分词与实体识别\n基于清单中推荐的 [Advanced NLP with spaCy](https:\u002F\u002Fcourse.spacy.io\u002Fen\u002F) 教程。\n\n```python\nimport spacy\n\n# 加载中文小模型\nnlp = spacy.load(\"zh_core_web_sm\")\n\n# 待处理文本\ntext = \"马云于 1999 年在杭州创立了阿里巴巴集团。\"\ndoc = nlp(text)\n\n# 输出分词结果\nprint(\"分词结果：\", [token.text for token in doc])\n\n# 输出命名实体识别结果\nprint(\"实体识别：\")\nfor ent in doc.ents:\n    print(f\"{ent.text}: {ent.label_}\")\n```\n\n### 示例 2：使用 Transformers 加载预训练模型进行情感分析\n基于清单中推荐的 [Hugging Face](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fhow-to-train) 资源。\n\n```python\nimport os\n\n# 如果下载缓慢，可设置 HF_ENDPOINT 环境变量指向国内镜像\n# (需在导入 transformers 之前设置才会可靠生效)\nos.environ['HF_ENDPOINT'] = 'https:\u002F\u002Fhf-mirror.com'\n\nfrom transformers import pipeline\n\n# 创建一个中文情感分析管道 (自动下载模型)\nclassifier = pipeline(\"sentiment-analysis\", model=\"uer\u002Froberta-base-finetuned-jd-binary-chinese\")\n\ntext = \"这家餐厅的味道非常好，服务也很周到！\"\nresult = classifier(text)\n\nprint(f\"文本：{text}\")\nprint(f\"分析结果：{result}\")\n```\n\n### 下一步探索\n访问 `awesome-nlp` 的 **Contents** 目录，根据你的具体需求查找更多资源：\n*   **Datasets**: 寻找特定领域的中文数据集。\n*   **NLP in Chinese**: 
专门针对中文处理的工具和论文。\n*   **Libraries**: 查看 Node.js, C++, Java 等其他语言的实现方案。","某初创公司的算法工程师团队正着手开发一款支持多语种（特别是中文和越南语）的智能客服情感分析系统，急需寻找合适的开源库和标注工具。\n\n### 没有 awesome-nlp 时\n- **资源检索如大海捞针**：工程师需在 Google、GitHub 和学术论文库间反复切换搜索，难以区分过时的教程与最新的前沿成果。\n- **小众语言支持难寻**：针对越南语和中文的特定 NLP 库分散在各处，花费数天仍无法确认哪些库维护活跃且兼容当前技术栈。\n- **技术选型风险高**：缺乏权威的评测趋势参考，团队误选了一个已停止维护的标注工具，导致后期数据格式迁移成本巨大。\n- **学习曲线陡峭**：新入职成员缺乏系统性的入门指引，只能零散阅读文档，耗时两周才理清基础概念和主流框架。\n\n### 使用 awesome-nlp 后\n- **一站式获取精选资源**：团队直接通过\"Libraries\"和\"Tutorials\"分类，快速锁定了 Python 和 R 语言下评分最高的几个主流库及配套视频课程。\n- **精准定位多语种方案**：利用\"NLP in Chinese\"和\"NLP in Vietnamese\"专属章节，立即找到了经过社区验证的特定语言处理工具和数据集。\n- **紧跟前沿避免踩坑**：参考\"Research Summaries and Trends\"中的状态追踪，团队避开了过时方案，直接采用了当前 SOTA（最先进）的模型架构。\n- **高效统一团队认知**：新人通过推荐的书籍和综述文章，在两天内便掌握了核心知识体系，迅速投入到实际编码工作中。\n\nawesome-nlp 将原本需要数周的碎片化调研工作压缩至几天，让团队能专注于核心算法创新而非资源筛选。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fkeon_awesome-nlp_2033b3bd.jpg","keon","Keon","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fkeon_e7eb5051.jpg","Co-founder @ Om Labs,\r\nex-Uber","Om Labs","New York City","kwk236@gmail.com","keonwkim","https:\u002F\u002Fkeon.kim","https:\u002F\u002Fgithub.com\u002Fkeon",null,18400,2776,"2026-04-09T04:47:27","CC0-1.0",1,"","未说明",{"notes":91,"python":89,"dependencies":92},"awesome-nlp 是一个自然语言处理（NLP）资源的精选列表，而非具体的软件工具或代码库。它主要包含论文、教程、课程、书籍、其他开源库链接及数据集的索引。因此，该项目本身没有特定的操作系统、GPU、内存、Python 版本或依赖库要求。用户需根据列表中引用的具体子项目（如 Hugging Face Transformers、spaCy、NLTK 等）查阅其各自的运行环境需求。",[],[35,14],[95,96,97,98,99,100,101,102],"natural-language-processing","deep-learning","machine-learning","language","awesome","awesome-list","nlp","text-mining","2026-03-27T02:49:30.150509","2026-04-09T21:10:06.608049",[],[]]