[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-kk7nc--Text_Classification":3,"tool-kk7nc--Text_Classification":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":23,"env_os":94,"env_gpu":95,"env_ram":95,"env_deps":96,"category_tags":101,"github_topics":102,"view_count":10,"oss_zip_url":82,"oss_zip_packed_at":82,"status":16,"created_at":123,"updated_at":124,"faqs":125,"releases":151},746,"kk7nc\u002FText_Classification","Text_Classification","Text Classification Algorithms: A Survey","Text_Classification 是一份专注于文本分类算法的开源综述资源，旨在为用户提供从数据预处理到模型构建的全流程指导。它主要解决自然语言处理中原始文本噪声大、特征难以提取的问题，例如停用词干扰、拼写错误或格式混乱，这些都会显著影响分类效果。通过系统讲解文本清洗、分词、大小写标准化及特征提取（如词嵌入和加权词）等方法，Text_Classification 为构建高效分类器提供了清晰的技术路径。\n\n这份资源非常适合 NLP 领域的研究人员、算法工程师以及正在学习机器学习的学生。它不仅包含理论概述，还附带了基于 NLTK 等库的 Python 代码示例，涵盖分词实现和停用词过滤等实际操作。对于需要快速搭建文本分析 pipeline 或深入理解分类算法原理的用户来说，Text_Classification 是一个兼具实用性与学术价值的参考宝库，能帮助团队减少重复造轮子的时间，专注于核心业务逻辑的实现。","\n################################################\nText Classification Algorithms: A Survey\n################################################\n\n|UniversityCube| |DOI| |Best| |medium| |mendeley| |contributions-welcome| |arXiv| |ansicolortags| |contributors| |twitter|\n  \n  \n.. figure:: docs\u002Fpic\u002FWordArt.png \n \n \n Referenced paper : `Text Classification Algorithms: A Survey \u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F1904.08067>`__    \n \n|BPW|  \n\n\n\n##################\nTable of Contents\n##################\n.. contents::\n  :local:\n  :depth: 4\n\n============\nIntroduction\n============\n\n.. figure:: docs\u002Fpic\u002FOverviewTextClassification.png \n \n    \n    \n====================================\nText and Document Feature Extraction\n====================================\n\n----\n\n\nText feature extraction and pre-processing for classification algorithms are very significant. In this section, we start to talk about text cleaning since most of documents contain a lot of noise. 
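\n\nAs a quick preview of the steps covered below, a minimal cleaning pass might tokenize, lowercase, and strip stop words in a few lines (a sketch only; it assumes NLTK plus its 'punkt' and 'stopwords' data are installed):\n\n.. code:: python\n\n  # Minimal preview of the pre-processing steps discussed in this section.\n  # Assumes nltk.download('punkt') and nltk.download('stopwords') have been run.\n  from nltk.corpus import stopwords\n  from nltk.tokenize import word_tokenize\n\n  def clean(text):\n      stop_words = set(stopwords.words('english'))\n      tokens = word_tokenize(text.lower())               # tokenize and lowercase\n      return [t for t in tokens\n              if t.isalpha() and t not in stop_words]    # drop punctuation and stop words\n\n  print(clean('After sleeping for four hours, he decided to sleep for another four.'))\n\n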
In this part, we discuss two primary methods of text feature extractions- word embedding and weighted word.\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nText Cleaning and Pre-processing\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nIn Natural Language Processing (NLP), most of the text and documents contain many words that are redundant for text classification, such as stopwords, miss-spellings, slangs, and etc. In this section, we briefly explain some techniques and methods for text cleaning and pre-processing text documents. In many algorithms like statistical and probabilistic learning methods, noise and unnecessary features can negatively affect the overall perfomance. So, elimination of these features are extremely important.\n\n\n-------------\nTokenization\n-------------\n\nTokenization is the process of breaking down a stream of text into words, phrases, symbols, or any other meaningful elements called tokens. The main goal of this step is to extract individual words in a sentence. Along with text classifcation, in text mining, it is necessay to incorporate a parser in the pipeline which performs the tokenization of the documents; for example:\n\nsentence:\n\n.. code::\n\n  After sleeping for four hours, he decided to sleep for another four\n\n\nIn this case, the tokens are as follows:\n\n.. code::\n\n    {'After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided', 'to', 'sleep', 'for', 'another', 'four'}\n\n\nHere is python code for Tokenization:\n\n.. code:: python\n\n  from nltk.tokenize import word_tokenize\n  text = \"After sleeping for four hours, he decided to sleep for another four\"\n  tokens = word_tokenize(text)\n  print(tokens)\n\n-----------\nStop words\n-----------\n\n\nText and document classification over social media, such as Twitter, Facebook, and so on is usually affected by the noisy nature (abbreviations, irregular forms) of the text corpuses.\n\nHere is an exmple from  `geeksforgeeks \u003Chttps:\u002F\u002Fwww.geeksforgeeks.org\u002Fremoving-stop-words-nltk-python\u002F>`__\n\n.. code:: python\n\n  from nltk.corpus import stopwords\n  from nltk.tokenize import word_tokenize\n\n  example_sent = \"This is a sample sentence, showing off the stop words filtration.\"\n\n  stop_words = set(stopwords.words('english'))\n\n  word_tokens = word_tokenize(example_sent)\n\n  filtered_sentence = [w for w in word_tokens if not w in stop_words]\n\n  filtered_sentence = []\n\n  for w in word_tokens:\n      if w not in stop_words:\n          filtered_sentence.append(w)\n\n  print(word_tokens)\n  print(filtered_sentence)\n\n\n\nOutput:\n\n.. code:: python \n\n  ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', \n  'off', 'the', 'stop', 'words', 'filtration', '.']\n  ['This', 'sample', 'sentence', ',', 'showing', 'stop',\n  'words', 'filtration', '.']\n\n\n---------------\nCapitalization\n---------------\n\nSentences can contain a mixture of uppercase and lower case letters. Multiple sentences make up a text document. To reduce the problem space, the most common approach is to reduce everything to lower case. This brings all words in a document in same space, but it often changes the meaning of some words, such as \"US\" to \"us\" where first one represents the United States of America and second one is a pronoun. To solve this, slang and abbreviation converters can be applied.\n\n.. code:: python\n\n  text = \"The United States of America (USA) or America, is a federal republic composed of 50 states\"\n  print(text)\n  print(text.lower())\n\nOutput:\n\n.. 
code:: python\n\n  \"The United States of America (USA) or America, is a federal republic composed of 50 states\"\n  \"the united states of america (usa) or america, is a federal republic composed of 50 states\"\n\n-------------------------\nSlangs and Abbreviations\n-------------------------\n\nSlangs and abbreviations can cause problems during the pre-processing steps. An abbreviation is a shortened form of a word or phrase, such as SVM standing for Support Vector Machine. Slang is informal language whose meaning differs from the literal words; for example, \"lost the plot\" essentially means 'they have gone mad'. A common way to deal with these words is to convert them to formal language.\n\n---------------\nNoise Removal\n---------------\n\n\nAnother issue of text cleaning as a pre-processing step is noise removal. Text documents generally contain characters such as punctuation marks or special characters that are not necessary for text mining or classification purposes. Although punctuation is critical for understanding the meaning of a sentence, it can affect the classification algorithms negatively.\n\n\nHere is a simple snippet that removes standard (HTML-style) noise from text:\n\n\n.. code:: python\n\n  import re\n\n  def text_cleaner(text):\n      rules = [\n          {r'>\\s+': u'>'},  # remove spaces after a tag opens or closes\n          {r'\\s+': u' '},  # replace consecutive spaces\n          {r'\\s*\u003Cbr\\s*\u002F?>\\s*': u'\\n'},  # newline after a \u003Cbr>\n          {r'\u003C\u002F(div)\\s*>\\s*': u'\\n'},  # newline after a \u003C\u002Fdiv>\n          {r'\u003C\u002F(p|h\\d)\\s*>\\s*': u'\\n\\n'},  # blank line after \u003C\u002Fp> and headings\n          {r'\u003Chead>.*\u003C\\s*(\u002Fhead|body)[^>]*>': u''},  # remove \u003Chead> to \u003C\u002Fhead>\n          {r'\u003Ca\\s+href=\"([^\"]+)\"[^>]*>.*\u003C\u002Fa>': r'\\1'},  # show links instead of texts\n          {r'[ \\t]*\u003C[^\u003C]*?\u002F?>': u''},  # remove remaining tags\n          {r'^\\s+': u''}  # remove spaces at the beginning\n      ]\n      for rule in rules:\n          for (k, v) in rule.items():\n              regex = re.compile(k)\n              text = regex.sub(v, text)\n      text = text.rstrip()\n      return text.lower()\n\n\n-------------------\nSpelling Correction\n-------------------\n\n\nAn optional part of the pre-processing step is correcting misspelled words. Different techniques have been introduced to tackle this issue, such as hashing-based and context-sensitive spelling correction, or spelling correction using a trie together with the Damerau-Levenshtein distance over bigrams.\n\n\n.. code:: python\n\n  from autocorrect import spell\n\n  print(spell('caaaar'))\n  print(spell('mussage'))\n  print(spell('survice'))\n  print(spell('hte'))\n\nResult:\n\n.. code::\n\n    caesar\n    message\n    service\n    the\n\n\n------------\nStemming\n------------\n\n\nText stemming modifies a word to obtain its variants through different linguistic processes such as affixation (the addition of affixes). For example, the stem of the word \"studying\" is \"study\", to which the suffix -ing has been attached.\n\n\nHere is an example of Stemming from `NLTK \u003Chttps:\u002F\u002Fpythonprogramming.net\u002Fstemming-nltk-tutorial\u002F>`__\n\n.. 
code:: python\n\n    from nltk.stem import PorterStemmer\n\n    ps = PorterStemmer()\n\n    example_words = [\"python\",\"pythoner\",\"pythoning\",\"pythoned\",\"pythonly\"]\n\n    for w in example_words:\n        print(ps.stem(w))\n\n\nResult:\n\n.. code::\n\n  python\n  python\n  python\n  python\n  pythonli\n\n-------------\nLemmatization\n-------------\n\n\nText lemmatization is the process of removing the inflectional prefix or suffix of a word and extracting its base form (the lemma).\n\n\n.. code:: python\n\n  from nltk.stem import WordNetLemmatizer\n\n  lemmatizer = WordNetLemmatizer()\n\n  print(lemmatizer.lemmatize(\"cats\"))\n\n~~~~~~~~~~~~~~\nWord Embedding\n~~~~~~~~~~~~~~\n\nDifferent word embedding procedures have been proposed to translate these unigrams into consumable input for machine learning algorithms. A very simple way to perform such an embedding is term frequency (TF), where each word is mapped to a number corresponding to the number of occurrences of that word in the whole corpus. Other term frequency functions have also been used that represent word frequency as a Boolean or a logarithmically scaled number. Here, each document is converted to a vector of the same length containing the frequency of the words in that document. Although this approach is intuitive, it suffers from the fact that words used very commonly in the language can dominate such representations.\n\n.. image:: docs\u002Fpic\u002FCBOW.png\n\n\n--------\nWord2Vec\n--------\n\nOriginal from https:\u002F\u002Fcode.google.com\u002Fp\u002Fword2vec\u002F\n\nThe referenced GitHub mirror exists to apply and track community patches, starting with support for Mac OS X compilation:\n\n-  **makefile and some source has been modified for Mac OS X compilation** See https:\u002F\u002Fcode.google.com\u002Fp\u002Fword2vec\u002Fissues\u002Fdetail?id=1#c5\n-  **memory patch for word2vec has been applied** See https:\u002F\u002Fcode.google.com\u002Fp\u002Fword2vec\u002Fissues\u002Fdetail?id=2\n-  Project file layout altered\n\nThere seems to be a segfault in the compute-accuracy utility.\n\nTo get started:\n\n::\n\n   cd scripts && .\u002Fdemo-word.sh\n\nOriginal README text follows:\n\nThis tool provides an efficient implementation of the continuous bag-of-words and skip-gram architectures for computing vector representations of words. These representations can subsequently be used in many natural language processing applications and for further research purposes.\n\nThe code provides an implementation of the Continuous Bag-of-Words (CBOW) and the Skip-gram (SG) models, as well as several demo scripts.\n\nGiven a text corpus, the word2vec tool learns a vector for every word in the vocabulary using the Continuous Bag-of-Words or the Skip-Gram neural network architecture. The user should specify the following:\n\n-  desired vector dimensionality\n-  size of the context window for either the Skip-Gram or the Continuous Bag-of-Words model\n-  training algorithm (hierarchical softmax and\u002For negative sampling)\n-  threshold for downsampling the frequent words\n-  number of threads to use\n-  format of the output word vector file (text or binary)\n\nUsually, other hyper-parameters, such as the learning rate, do not need to be tuned for different training sets.\n\nThe script demo-word.sh downloads a small (100MB) text corpus from the web, and trains a small word vector model.
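\n\nFor readers who prefer to stay in Python rather than build the C tool, a roughly equivalent training run can be sketched with the gensim library (an assumption on our part; the survey itself references the original C implementation). The corpus, vector size, and window below are illustrative only:\n\n.. code:: python\n\n  # Sketch: CBOW\u002Fskip-gram training with gensim instead of the original C tool.\n  from gensim.models import Word2Vec\n\n  sentences = [['after', 'sleeping', 'for', 'four', 'hours'],\n               ['he', 'decided', 'to', 'sleep', 'for', 'another', 'four']]\n\n  model = Word2Vec(sentences, vector_size=100, window=5, min_count=1, sg=1)  # sg=1 selects skip-gram\n  print(model.wv.most_similar('sleep', topn=3))  # inspect nearest neighbours of a word\n\n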
After the training is\nfinished, users can interactively explore the similarity of the\nwords.\n\nMore information about the scripts is provided at\nhttps:\u002F\u002Fcode.google.com\u002Fp\u002Fword2vec\u002F\n\n\n----------------------------------------------\nGlobal Vectors for Word Representation (GloVe)\n----------------------------------------------\n\n.. image:: \u002Fdocs\u002Fpic\u002FGlove.PNG\n\nAn implementation of the GloVe model for learning word representations is provided, and describe how to download web-dataset vectors or train your own. See the  `project page \u003Chttp:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F>`__  or the   `paper \u003Chttp:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf>`__  for more information on glove vectors.\n\n\n------------------------------------\nContextualized Word Representations\n------------------------------------\n\nELMo is a deep contextualized word representation that models both (1) complex characteristics of word use (e.g., syntax and semantics), and (2) how these uses vary across linguistic contexts (i.e., to model polysemy). These word vectors are learned functions of the internal states of a deep bidirectional language model (biLM), which is pre-trained on a large text corpus. They can be easily added to existing models and significantly improve the state of the art across a broad range of challenging NLP problems, including question answering, textual entailment and sentiment analysis.\n\n\n**ELMo representations are:**\n\n-  **Contextual:** The representation for each word depends on the entire context in which it is used.\n-  **Deep:** The word representations combine all layers of a deep pre-trained neural network.\n-  **Character based:** ELMo representations are purely character based, allowing the network to use morphological clues to form robust representations for out-of-vocabulary tokens unseen in training.\n\n\n**Tensorflow implementation**\n\nTensorflow implementation of the pretrained biLM used to compute ELMo representations from `\"Deep contextualized word representations\" \u003Chttp:\u002F\u002Farxiv.org\u002Fabs\u002F1802.05365>`__.\n\nThis repository supports both training biLMs and using pre-trained models for prediction.\n\nWe also have a pytorch implementation available in `AllenNLP \u003Chttp:\u002F\u002Fallennlp.org\u002F>`__.\n\nYou may also find it easier to use the version provided in `Tensorflow Hub \u003Chttps:\u002F\u002Fwww.tensorflow.org\u002Fhub\u002Fmodules\u002Fgoogle\u002Felmo\u002F2>`__ if you just like to make predictions.\n\n**pre-trained models:**\n\nWe have got several pre-trained English language biLMs available for use. Each model is specified with two separate files, a JSON formatted \"options\" file with hyperparameters and a hdf5 formatted file with the model weights. Links to the pre-trained models are available `here \u003Chttps:\u002F\u002Fallennlp.org\u002Felmo>`__.\n\nThere are three ways to integrate ELMo representations into a downstream task, depending on your use case.\n\n1. Compute representations on the fly from raw text using character input. This is the most general method and will handle any input text. It is also the most computationally expensive.\n2. Precompute and cache the context independent token representations, then compute context dependent representations using the biLSTMs for input data. This method is less computationally expensive then #1, but is only applicable with a fixed, prescribed vocabulary.\n3. 
Precompute the representations for your entire dataset and save to a file.\n\nWe have used all of these methods in the past for various use cases. #1 is necessary for evaluating at test time on unseen data (e.g. public SQuAD leaderboard). #2 is a good compromise for large datasets where the size of the file in is unfeasible (SNLI, SQuAD). #3 is a good choice for smaller datasets or in cases where you'd like to use ELMo in other frameworks.\n\nIn all cases, the process roughly follows the same steps. First, create a ``Batcher`` (or ``TokenBatcher`` for #2) to translate tokenized strings to numpy arrays of character (or token) ids. Then, load the pretrained ELMo model (class ``BidirectionalLanguageModel``). Finally, for steps #1 and #2 use ``weight_layers`` to compute the final ELMo representations. For #3, use ``BidirectionalLanguageModel`` to write all the intermediate layers to a file.\n\n\n\n.. figure:: docs\u002Fpic\u002Fngram_cnn_highway_1.png \nArchitecture of the language model applied to an example sentence [Reference:  `arXiv paper \u003Chttps:\u002F\u002Farxiv.org\u002Fpdf\u002F1508.06615.pdf>`__]. \n\n\n.. figure:: docs\u002Fpic\u002FGlove_VS_DCWE.png \n\n--------\nFastText\n--------\n\n.. figure:: docs\u002Fpic\u002Ffasttext-logo-color-web.png\n\nfastText is a library for efficient learning of word representations and sentence classification.\n\n**Github:**  `facebookresearch\u002FfastText \u003Chttps:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText>`__\n\n**Models**\n\n-  Recent state-of-the-art `English word vectors \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fenglish-vectors.html>`__.\n-  Word vectors for `157 languages trained on Wikipedia and Crawl \u003Chttps:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText\u002Fblob\u002Fmaster\u002Fdocs\u002Fcrawl-vectors.md>`__.\n-  Models for `language identification \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Flanguage-identification.html#content>`__ and `various supervised tasks \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fsupervised-models.html#content>`__.\n\n**Supplementary data :**\n\n\n-  The preprocessed `YFCC100M data \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fdataset.html#content>`__ .\n\n**FAQ**\n\nYou can find `answers to frequently asked questions \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Ffaqs.html#content>`__ on Their project `website \u003Chttps:\u002F\u002Ffasttext.cc\u002F>`__.\n\n**Cheatsheet**\n\nAlso a `cheatsheet \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fcheatsheet.html#content>`__ is provided full of useful one-liners.\n\n\n\n~~~~~~~~~~~~~~\nWeighted Words\n~~~~~~~~~~~~~~\n\n\n--------------\nTerm frequency\n--------------\n\nTerm frequency is Bag of words that is one of the simplest techniques of text feature extraction. This method is based on counting number of the words in each document and assign it to feature space.\n\n\n-----------------------------------------\nTerm Frequency-Inverse Document Frequency\n-----------------------------------------\nThe mathematical representation of weight of a term in a document by Tf-idf is given:\n\n.. image:: docs\u002Feq\u002Ftf-idf.gif\n   :width: 10px\n   \nWhere N is number of documents and df(t) is the number of documents containing the term t in the corpus. The first part would improve recall and the later would improve the precision of the word embedding. 
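\n\nAs a small worked illustration of the formula (a sketch on a toy three-document corpus; the variant shown is raw term frequency multiplied by log(N \u002F df), one of several weighting schemes in use):\n\n.. code:: python\n\n  # Toy tf-idf computation: weight(t, d) = tf(t, d) * log(N \u002F df(t)).\n  import math\n\n  docs = [['the', 'cat', 'sat'], ['the', 'dog', 'sat'], ['the', 'cat', 'ran']]\n  N = len(docs)\n\n  def tf_idf(term, doc):\n      tf = doc.count(term)                    # term frequency in this document\n      df = sum(1 for d in docs if term in d)  # document frequency across the corpus\n      return tf * math.log(N \u002F df)\n\n  print(tf_idf('the', docs[0]))   # 0.0    -> appears in every document, carries no signal\n  print(tf_idf('cat', docs[0]))   # ~0.405 -> rarer term, weighted higher\n\n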
Although tf-idf tries to overcome the problem of common terms in document, it still suffers from some other descriptive limitations. Namely, tf-idf cannot account for the similarity between words in the document since each word is presented as an index. In the recent years, with development of more complex models, such as neural nets, new methods has been presented that can incorporate concepts, such as similarity of words and part of speech tagging. This work uses, word2vec and Glove, two of the most common methods that have been successfully used for deep learning techniques.\n\n\n.. code:: python\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    def loadData(X_train, X_test,MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\",str(np.array(X_train).shape[1]),\"features\")\n        return (X_train,X_test)\n   \n   \n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nComparison of Feature Extraction Techniques\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|                **Model**              |                                                                        **Advantages**                                                                    |                                                   **Limitation**                                               |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|            **Weighted Words**         |  * Easy to compute                                                                                                                                       |  * It does not capture the position in the text (syntactic)                                                    |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Easy to compute the similarity between 2 documents using it                                                                                           |  * It does not capture meaning in the text (semantics)                                                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Basic metric to extract the most descriptive terms in a document                                                                                      |  
                                                                                                              |\n|                                       |                                                                                                                                                          |  * Common words effect on the results (e.g., “am”, “is”, etc.)                                                 |\n|                                       |  * Works with an unknown word (e.g., New words in languages)                                                                                             |                                                                                                                |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|            **TF-IDF**                 |  * Easy to compute                                                                                                                                       |  * It does not capture the position in the text (syntactic)                                                    |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Easy to compute the similarity between 2 documents using it                                                                                           |  * It does not capture meaning in the text (semantics)                                                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Basic metric to extract the most descriptive terms in a document                                                                                      |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                    
                   |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Common words do not affect the results due to IDF (e.g., “am”, “is”, etc.)                                                                            |                                                                                                                |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|               **Word2Vec**            |  * It captures the position of the words in the text (syntactic)                                                                                         |  * It cannot capture the meaning of the word from the text (fails to capture polysemy)                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * It captures meaning in the words (semantics)                                                                                                          |  * It cannot capture out-of-vocabulary words from corpus                                                       |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|         **GloVe (Pre-Trained)**       |  * It captures the position of the words in the text (syntactic)                                                                                         |  * It cannot capture the meaning of the word from  the text (fails to capture polysemy)                        |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * It captures meaning in the words (semantics)                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Memory consumption for storage                                                                              |\n|                                       |  * Trained on huge corpus                                                                                        
                                        |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * It cannot capture out-of-vocabulary words from corpus                                                       |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|           **GloVe (Trained)**         |  * It is very straightforward, e.g., to enforce the word vectors to capture sub-linear relationships in the vector space (performs better than Word2vec) |  * Memory consumption for storage                                                                              |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * Lower weight for highly frequent word pairs, such as stop words like “am”, “is”, etc. 
Will not dominate training progress                             |  * Needs huge corpus to learn                                                                                  |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * It cannot capture out-of-vocabulary words from the corpus                                                   |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * It cannot capture the meaning of the word from  the text (fails to capture polysemy)                        |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|               **FastText**            |  * Works for rare words (rare in their character n-grams which are still shared with other words                                                         |  * It cannot capture the meaning of the word from the text (fails to capture polysemy)                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Memory consumption for storage                                                                              |\n|                                       |  * Solves out of vocabulary words with n-gram in character level                                                                                         |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Computationally is more expensive in comparing with GloVe and Word2Vec                                      
|\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|**Contextualized Word Representations**|  * It captures the meaning of the word from the text (incorporates context, handling polysemy)                                                           |  * Memory consumption for storage                                                                              |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Improves performance notably on downstream tasks. Computationally is more expensive in comparison to others |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Needs another word embedding for all LSTM and feedforward layers                                            |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * It cannot capture out-of-vocabulary words from a corpus                                                     |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * Works only sentence and document level (it cannot work for individual word level)                           
|\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n\n\n========================\nDimensionality Reduction\n========================\n\n----\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nPrincipal Component Analysis (PCA)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nPrinciple component analysis~(PCA) is the most popular technique in multivariate analysis and dimensionality reduction. PCA is a method to identify a subspace in which the data approximately lies. This means finding new variables that are uncorrelated and maximizing the variance to preserve as much variability as possible.\n\n\nExample of PCA on text dataset (20newsgroups) from  tf-idf with 75000 features to 2000 components:\n\n.. code:: python\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n\n    def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n        return (X_train, X_test)\n\n\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train,X_test = TFIDF(X_train,X_test)\n\n    from sklearn.decomposition import PCA\n    pca = PCA(n_components=2000)\n    X_train_new = pca.fit_transform(X_train)\n    X_test_new = pca.transform(X_test)\n\n    print(\"train with old features: \",np.array(X_train).shape)\n    print(\"train with new features:\" ,np.array(X_train_new).shape)\n    \n    print(\"test with old features: \",np.array(X_test).shape)\n    print(\"test with new features:\" ,np.array(X_test_new).shape)\n\noutput:\n\n.. code:: python\n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 2000)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 2000)\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nLinear Discriminant Analysis (LDA)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\nLinear Discriminant Analysis (LDA) is another commonly used technique for data classification and dimensionality reduction. LDA is particularly helpful where the within-class frequencies are unequal and their performances have been evaluated on randomly generated test data. Class-dependent and class-independent transformation are two approaches in LDA where the ratio of between-class-variance to within-class-variance and the ratio of the overall-variance to within-class-variance are used respectively. \n\n\n\n.. 
code:: python\n\n\n  from sklearn.feature_extraction.text import TfidfVectorizer\n  import numpy as np\n  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n\n\n  def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n      vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n      X_train = vectorizer_x.fit_transform(X_train).toarray()\n      X_test = vectorizer_x.transform(X_test).toarray()\n      print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n      return (X_train, X_test)\n\n\n  from sklearn.datasets import fetch_20newsgroups\n\n  newsgroups_train = fetch_20newsgroups(subset='train')\n  newsgroups_test = fetch_20newsgroups(subset='test')\n  X_train = newsgroups_train.data\n  X_test = newsgroups_test.data\n  y_train = newsgroups_train.target\n  y_test = newsgroups_test.target\n\n  X_train,X_test = TFIDF(X_train,X_test)\n\n\n\n  LDA = LinearDiscriminantAnalysis(n_components=15)\n  X_train_new = LDA.fit(X_train,y_train)\n  X_train_new =  LDA.transform(X_train)\n  X_test_new = LDA.transform(X_test)\n\n  print(\"train with old features: \",np.array(X_train).shape)\n  print(\"train with new features:\" ,np.array(X_train_new).shape)\n\n  print(\"test with old features: \",np.array(X_test).shape)\n  print(\"test with new features:\" ,np.array(X_test_new).shape)\n\n\noutput:\n\n.. code:: \n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 15)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 15)\n    \n    \n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nNon-negative Matrix Factorization (NMF)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n.. code:: python\n\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn.decomposition import NMF\n\n\n    def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n        return (X_train, X_test)\n\n\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train,X_test = TFIDF(X_train,X_test)\n\n\n\n    NMF_ = NMF(n_components=2000)\n    X_train_new = NMF_.fit(X_train)\n    X_train_new =  NMF_.transform(X_train)\n    X_test_new = NMF_.transform(X_test)\n\n    print(\"train with old features: \",np.array(X_train).shape)\n    print(\"train with new features:\" ,np.array(X_train_new).shape)\n\n    print(\"test with old features: \",np.array(X_test).shape)\n    print(\"test with new features:\" ,np.array(X_test_new))\n\noutput:\n\n.. code:: \n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 2000)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 2000)\n    \n    \n\n~~~~~~~~~~~~~~~~~\nRandom Projection\n~~~~~~~~~~~~~~~~~\nRandom projection or random feature is a dimensionality reduction technique mostly used for very large volume dataset or very high dimensional feature space. 
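\n\nA quick way to gauge how many random components are needed to keep pairwise distances within a chosen distortion is the Johnson-Lindenstrauss bound (a sketch; the sample count below simply mirrors the 20newsgroups training split used in these examples):\n\n.. code:: python\n\n  # Minimum number of random components that preserves pairwise distances\n  # within a relative error eps, per the Johnson-Lindenstrauss lemma.\n  from sklearn.random_projection import johnson_lindenstrauss_min_dim\n\n  for eps in (0.1, 0.3, 0.5):\n      print(eps, johnson_lindenstrauss_min_dim(n_samples=11314, eps=eps))\n\n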
Text and document, especially with weighted feature extraction, can contain a huge number of underlying features.\nMany researchers addressed Random Projection for text data for text mining, text classification and\u002For dimensionality reduction.\nWe start to review some random projection techniques. \n\n\n.. image:: docs\u002Fpic\u002FRandom%20Projection.png\n\n.. code:: python\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n\n    def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n        return (X_train, X_test)\n\n\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train,X_test = TFIDF(X_train,X_test)\n\n    from sklearn import random_projection\n\n    RandomProjection = random_projection.GaussianRandomProjection(n_components=2000)\n    X_train_new = RandomProjection.fit_transform(X_train)\n    X_test_new = RandomProjection.transform(X_test)\n\n    print(\"train with old features: \",np.array(X_train).shape)\n    print(\"train with new features:\" ,np.array(X_train_new).shape)\n\n    print(\"test with old features: \",np.array(X_test).shape)\n    print(\"test with new features:\" ,np.array(X_test_new).shape)\n\noutput:\n\n.. code:: python\n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 2000)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 2000)\n    \n~~~~~~~~~~~\nAutoencoder\n~~~~~~~~~~~\n\n\nAutoencoder is a neural network technique that is trained to attempt to map its input to its output. The autoencoder as dimensional reduction methods have achieved great success via the powerful reprehensibility of neural networks. The main idea is, one hidden layer between the input and output layers with fewer neurons can be used to reduce the dimension of feature space. Specially for texts, documents, and sequences that contains many features, autoencoder could help to process data faster and more efficiently.\n\n\n.. image:: docs\u002Fpic\u002FAutoencoder.png\n\n\n\n.. 
code:: python\n\n  from keras.layers import Input, Dense\n  from keras.models import Model\n\n  # dimensionality of the input features (should match the feature matrix, e.g. the tf-idf vocabulary size)\n  n = 75000\n  # this is the size of our encoded representations\n  encoding_dim = 1500\n\n  # this is our input placeholder\n  input_layer = Input(shape=(n,))\n  # \"encoded\" is the encoded representation of the input\n  encoded = Dense(encoding_dim, activation='relu')(input_layer)\n  # \"decoded\" is the lossy reconstruction of the input\n  decoded = Dense(n, activation='sigmoid')(encoded)\n\n  # this model maps an input to its reconstruction\n  autoencoder = Model(input_layer, decoded)\n\n  # this model maps an input to its encoded representation\n  encoder = Model(input_layer, encoded)\n\n  encoded_input = Input(shape=(encoding_dim,))\n  # retrieve the last layer of the autoencoder model\n  decoder_layer = autoencoder.layers[-1]\n  # create the decoder model\n  decoder = Model(encoded_input, decoder_layer(encoded_input))\n\n  autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')\n\n\nTrain the autoencoder on the prepared feature matrices (``x_train`` and ``x_test``, e.g. tf-idf vectors):\n\n\n.. code:: python\n\n  autoencoder.fit(x_train, x_train,\n                  epochs=50,\n                  batch_size=256,\n                  shuffle=True,\n                  validation_data=(x_test, x_test))\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nT-distributed Stochastic Neighbor Embedding (T-SNE)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n\nT-distributed Stochastic Neighbor Embedding (T-SNE) is a nonlinear dimensionality reduction technique for embedding high-dimensional data, and is mostly used for visualization in a low-dimensional space. This approach is based on the work of `G. Hinton and S.T. Roweis \u003Chttps:\u002F\u002Fwww.cs.toronto.edu\u002F~fritz\u002Fabsps\u002Fsne.pdf>`__. SNE works by converting the high-dimensional Euclidean distances into conditional probabilities which represent similarities.\n\n`Example \u003Chttp:\u002F\u002Fscikit-learn.org\u002Fstable\u002Fmodules\u002Fgenerated\u002Fsklearn.manifold.TSNE.html>`__:\n\n\n.. code:: python\n\n   import numpy as np\n   from sklearn.manifold import TSNE\n   X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])\n   X_embedded = TSNE(n_components=2, perplexity=3).fit_transform(X)\n   print(X_embedded.shape)\n\n\nExample of GloVe and T-SNE for text:\n\n.. image:: docs\u002Fpic\u002FTSNE.png\n\n================================\nText Classification Techniques\n================================\n\n----\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nRocchio classification\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nThe first version of the Rocchio algorithm was introduced by Rocchio in 1971 to use relevance feedback in querying full-text databases. Since then, many researchers have addressed and developed this technique for text and document classification. This method uses TF-IDF weights for each informative word instead of a set of Boolean features. Using a training set of documents, Rocchio's algorithm builds a prototype vector for each class, which is the average of all training document vectors belonging to that class. It then assigns each test document to the class whose prototype vector it is most similar to.\n\n\nWhen a nearest centroid classifier is applied to text data represented as tf-idf vectors, it is known as the Rocchio classifier.\n\n.. 
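\n\nBefore the full pipeline, the mechanics can be sketched directly: average the tf-idf rows of each class into a prototype (centroid) vector, then assign a new document to the class whose prototype is closest (a toy sketch with made-up vectors; cosine similarity is one common choice):\n\n.. code:: python\n\n  # Toy sketch of Rocchio \u002F nearest-centroid classification on dense vectors.\n  import numpy as np\n\n  X_train = np.array([[1.0, 0.0, 0.2], [0.9, 0.1, 0.0],    # class 0 documents\n                      [0.0, 1.0, 0.3], [0.1, 0.8, 0.2]])   # class 1 documents\n  y_train = np.array([0, 0, 1, 1])\n\n  # one prototype (centroid) per class\n  prototypes = np.array([X_train[y_train == c].mean(axis=0) for c in (0, 1)])\n\n  def cosine(a, b):\n      return a @ b \u002F (np.linalg.norm(a) * np.linalg.norm(b))\n\n  x_test = np.array([0.05, 0.9, 0.25])\n  print(np.argmax([cosine(x_test, p) for p in prototypes]))  # -> 1\n\nUsing scikit-learn, the same idea is available as ``NearestCentroid`` inside a tf-idf pipeline:\n\n.. 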
code:: python\n\n    from sklearn.neighbors.nearest_centroid import NearestCentroid\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', NearestCentroid()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\n\n\nOutput:\n\n.. code:: python\n\n                  precision    recall  f1-score   support\n\n              0       0.75      0.49      0.60       319\n              1       0.44      0.76      0.56       389\n              2       0.75      0.68      0.71       394\n              3       0.71      0.59      0.65       392\n              4       0.81      0.71      0.76       385\n              5       0.83      0.66      0.74       395\n              6       0.49      0.88      0.63       390\n              7       0.86      0.76      0.80       396\n              8       0.91      0.86      0.89       398\n              9       0.85      0.79      0.82       397\n             10       0.95      0.80      0.87       399\n             11       0.94      0.66      0.78       396\n             12       0.40      0.70      0.51       393\n             13       0.84      0.49      0.62       396\n             14       0.89      0.72      0.80       394\n             15       0.55      0.73      0.63       398\n             16       0.68      0.76      0.71       364\n             17       0.97      0.70      0.81       376\n             18       0.54      0.53      0.53       310\n             19       0.58      0.39      0.47       251\n\n    avg \u002F total       0.74      0.69      0.70      7532\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nBoosting and Bagging\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n---------\nBoosting\n---------\n\n.. image:: docs\u002Fpic\u002FBoosting.PNG\n\n\n**Boosting** is a Ensemble learning meta-algorithm for primarily reducing variance in supervised learning. It is basically a family of machine learning algorithms that convert weak learners to strong ones. Boosting is based on the question posed by `Michael Kearns \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMichael_Kearns_(computer_scientist)>`__  and Leslie Valiant (1988, 1989) Can a set of weak learners create a single strong learner? A weak learner is defined to be a Classification that is only slightly correlated with the true classification (it can label examples better than random guessing). In contrast, a strong learner is a classifier that is arbitrarily well-correlated with the true classification.\n\n\n\n\n.. 
code:: python\n\n  from sklearn.ensemble import GradientBoostingClassifier\n  from sklearn.pipeline import Pipeline\n  from sklearn import metrics\n  from sklearn.feature_extraction.text import CountVectorizer\n  from sklearn.feature_extraction.text import TfidfTransformer\n  from sklearn.datasets import fetch_20newsgroups\n\n  newsgroups_train = fetch_20newsgroups(subset='train')\n  newsgroups_test = fetch_20newsgroups(subset='test')\n  X_train = newsgroups_train.data\n  X_test = newsgroups_test.data\n  y_train = newsgroups_train.target\n  y_test = newsgroups_test.target\n\n  text_clf = Pipeline([('vect', CountVectorizer()),\n                       ('tfidf', TfidfTransformer()),\n                       ('clf', GradientBoostingClassifier(n_estimators=100)),\n                       ])\n\n  text_clf.fit(X_train, y_train)\n\n\n  predicted = text_clf.predict(X_test)\n\n  print(metrics.classification_report(y_test, predicted))\n\n\nOutput:\n \n.. code:: python\n\n               precision    recall  f1-score   support\n            0       0.81      0.66      0.73       319\n            1       0.69      0.70      0.69       389\n            2       0.70      0.68      0.69       394\n            3       0.64      0.72      0.68       392\n            4       0.79      0.79      0.79       385\n            5       0.83      0.64      0.72       395\n            6       0.81      0.84      0.82       390\n            7       0.84      0.75      0.79       396\n            8       0.90      0.86      0.88       398\n            9       0.90      0.85      0.88       397\n           10       0.93      0.86      0.90       399\n           11       0.90      0.81      0.85       396\n           12       0.33      0.69      0.45       393\n           13       0.87      0.72      0.79       396\n           14       0.87      0.84      0.85       394\n           15       0.85      0.87      0.86       398\n           16       0.65      0.78      0.71       364\n           17       0.96      0.74      0.84       376\n           18       0.70      0.55      0.62       310\n           19       0.62      0.56      0.59       251\n\n  avg \u002F total       0.78      0.75      0.76      7532\n\n  \n-------\nBagging\n-------\n\n.. image:: docs\u002Fpic\u002FBagging.PNG\n\n\n.. code:: python\n\n    from sklearn.ensemble import BaggingClassifier\n    from sklearn.neighbors import KNeighborsClassifier\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', BaggingClassifier(KNeighborsClassifier())),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\nOutput:\n \n.. 
code:: python\n\n               precision    recall  f1-score   support\n            0       0.57      0.74      0.65       319\n            1       0.60      0.56      0.58       389\n            2       0.62      0.54      0.58       394\n            3       0.54      0.57      0.55       392\n            4       0.63      0.54      0.58       385\n            5       0.68      0.62      0.65       395\n            6       0.55      0.46      0.50       390\n            7       0.77      0.67      0.72       396\n            8       0.79      0.82      0.80       398\n            9       0.74      0.77      0.76       397\n           10       0.81      0.86      0.83       399\n           11       0.74      0.85      0.79       396\n           12       0.67      0.49      0.57       393\n           13       0.78      0.51      0.62       396\n           14       0.76      0.78      0.77       394\n           15       0.71      0.81      0.76       398\n           16       0.73      0.73      0.73       364\n           17       0.64      0.79      0.71       376\n           18       0.45      0.69      0.54       310\n           19       0.61      0.54      0.57       251\n\n  avg \u002F total       0.67      0.67      0.67      7532\n  \n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nNaive Bayes Classifier\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nNaïve Bayes text classification has been used in industry\nand academia for a long time (introduced by Thomas Bayes\nbetween 1701-1761). However, this technique\nis being studied since the 1950s for text and document categorization. Naive Bayes Classifier (NBC) is generative\nmodel which is widely used in Information Retrieval. Many researchers addressed and developed this technique\nfor their applications. We start with the most basic version\nof NBC which developed by using term-frequency (Bag of\nWord) fetaure extraction technique by counting number of\nwords in documents\n\n\n.. code:: python\n\n    from sklearn.naive_bayes import MultinomialNB\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', MultinomialNB()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n \n \nOutput:\n \n.. 
code:: python\n\n                   precision    recall  f1-score   support\n\n              0       0.80      0.52      0.63       319\n              1       0.81      0.65      0.72       389\n              2       0.82      0.65      0.73       394\n              3       0.67      0.78      0.72       392\n              4       0.86      0.77      0.81       385\n              5       0.89      0.75      0.82       395\n              6       0.93      0.69      0.80       390\n              7       0.85      0.92      0.88       396\n              8       0.94      0.93      0.93       398\n              9       0.92      0.90      0.91       397\n             10       0.89      0.97      0.93       399\n             11       0.59      0.97      0.74       396\n             12       0.84      0.60      0.70       393\n             13       0.92      0.74      0.82       396\n             14       0.84      0.89      0.87       394\n             15       0.44      0.98      0.61       398\n             16       0.64      0.94      0.76       364\n             17       0.93      0.91      0.92       376\n             18       0.96      0.42      0.58       310\n             19       0.97      0.14      0.24       251\n\n    avg \u002F total       0.82      0.77      0.77      7532\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nK-nearest Neighbor\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nR\nIn machine learning, the k-nearest neighbors algorithm (kNN)\nis a non-parametric technique used for classification.\nThis method is used in Natural-language processing (NLP)\nas a text classification technique in many researches in the past\ndecades.\n\n.. image:: docs\u002Fpic\u002FKNN.png\n\n.. code:: python\n\n    from sklearn.neighbors import KNeighborsClassifier\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', KNeighborsClassifier()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\nOutput:\n\n.. 
code:: python\n\n                   precision    recall  f1-score   support\n\n              0       0.43      0.76      0.55       319\n              1       0.50      0.61      0.55       389\n              2       0.56      0.57      0.57       394\n              3       0.53      0.58      0.56       392\n              4       0.59      0.56      0.57       385\n              5       0.69      0.60      0.64       395\n              6       0.58      0.45      0.51       390\n              7       0.75      0.69      0.72       396\n              8       0.84      0.81      0.82       398\n              9       0.77      0.72      0.74       397\n             10       0.85      0.84      0.84       399\n             11       0.76      0.84      0.80       396\n             12       0.70      0.50      0.58       393\n             13       0.82      0.49      0.62       396\n             14       0.79      0.76      0.78       394\n             15       0.75      0.76      0.76       398\n             16       0.70      0.73      0.72       364\n             17       0.62      0.76      0.69       376\n             18       0.55      0.61      0.58       310\n             19       0.56      0.49      0.52       251\n\n    avg \u002F total       0.67      0.66      0.66      7532\n\n\n\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nSupport Vector Machine (SVM)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\nThe original version of SVM was introduced by Vapnik and  Chervonenkis in 1963. The early 1990s, nonlinear version was addressed by BE. Boser et al.. Original version of SVM was designed for binary classification problem, but Many researchers have worked on multi-class problem using this authoritative technique.\n\n\nThe advantages of support vector machines are based on scikit-learn page:\n\n* Effective in high dimensional spaces.\n* Still effective in cases where number of dimensions is greater than the number of samples.\n* Uses a subset of training points in the decision function (called support vectors), so it is also memory efficient.\n* Versatile: different Kernel functions can be specified for the decision function. Common kernels are provided, but it is also possible to specify custom kernels.\n\n\nThe disadvantages of support vector machines include:\n\n* If the number of features is much greater than the number of samples, avoiding over-fitting via choosing kernel functions and regularization term is crucial.\n* SVMs do not directly provide probability estimates, these are calculated using an expensive five-fold cross-validation (see Scores and probabilities, below).\n\n\n\n.. image:: docs\u002Fpic\u002FSVM.png\n\n\n.. 
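As noted in the last point above, a linear SVM only exposes decision-function scores, not probabilities; scikit-learn recovers probabilities by fitting a calibration model on cross-validation folds. A minimal sketch of this, wrapping the same LinearSVC used in the example below (the cv=5 argument makes the cost of the five-fold fitting explicit):

.. code:: python

    from sklearn.svm import LinearSVC
    from sklearn.calibration import CalibratedClassifierCV
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.datasets import fetch_20newsgroups

    newsgroups_train = fetch_20newsgroups(subset='train')
    newsgroups_test = fetch_20newsgroups(subset='test')

    # the calibrator maps SVM scores to probabilities (e.g. Platt scaling)
    # using five cross-validation folds
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', CalibratedClassifierCV(LinearSVC(), cv=5)),
                         ])

    text_clf.fit(newsgroups_train.data, newsgroups_train.target)

    # per-class probability estimates for the first two test documents
    print(text_clf.predict_proba(newsgroups_test.data[:2]))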
code:: python\n\n\n    from sklearn.svm import LinearSVC\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', LinearSVC()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\noutput:\n\n\n.. code:: python\n\n                   precision    recall  f1-score   support\n\n              0       0.82      0.80      0.81       319\n              1       0.76      0.80      0.78       389\n              2       0.77      0.73      0.75       394\n              3       0.71      0.76      0.74       392\n              4       0.84      0.86      0.85       385\n              5       0.87      0.76      0.81       395\n              6       0.83      0.91      0.87       390\n              7       0.92      0.91      0.91       396\n              8       0.95      0.95      0.95       398\n              9       0.92      0.95      0.93       397\n             10       0.96      0.98      0.97       399\n             11       0.93      0.94      0.93       396\n             12       0.81      0.79      0.80       393\n             13       0.90      0.87      0.88       396\n             14       0.90      0.93      0.92       394\n             15       0.84      0.93      0.88       398\n             16       0.75      0.92      0.82       364\n             17       0.97      0.89      0.93       376\n             18       0.82      0.62      0.71       310\n             19       0.75      0.61      0.68       251\n\n    avg \u002F total       0.85      0.85      0.85      7532\n\n\n\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nDecision Tree\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nOne of earlier classification algorithm for text and data mining is decision tree. Decision tree classifiers (DTC's) are used successfully in many diverse areas of classification. The structure of this technique includes a hierarchical decomposition of the data space (only train dataset). Decision tree as classification task was introduced by `D. Morgan \u003Chttp:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP95-1037>`__ and developed by `JR. Quinlan \u003Chttps:\u002F\u002Fcourses.cs.ut.ee\u002F2009\u002Fbayesian-networks\u002Fextras\u002Fquinlan1986.pdf>`__. The main idea is creating trees based on the attributes of the data points, but the challenge is determining which attribute should be in parent level and which one should be in child level. To solve this problem, `De Mantaras \u003Chttps:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1023\u002FA:1022694001379>`__ introduced statistical modeling for feature selection in tree.\n\n\n.. 
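To make the attribute-selection problem concrete, the short sketch below (on a hypothetical toy node) computes the entropy-based information gain that criteria in the ID3/C4.5 family maximise when deciding which attribute belongs at the parent level. In the scikit-learn example that follows, passing criterion='entropy' to DecisionTreeClassifier selects splits by this measure instead of the default Gini impurity.

.. code:: python

    import numpy as np

    def entropy(labels):
        # Shannon entropy of a label array
        _, counts = np.unique(labels, return_counts=True)
        p = counts / counts.sum()
        return -np.sum(p * np.log2(p))

    # hypothetical parent node: ten documents, five per class
    parent = np.array([0, 0, 0, 0, 0, 1, 1, 1, 1, 1])

    # a candidate attribute splits the node into two children
    left = np.array([0, 0, 0, 0, 1])
    right = np.array([0, 1, 1, 1, 1])

    weighted_child_entropy = ((len(left) / len(parent)) * entropy(left)
                              + (len(right) / len(parent)) * entropy(right))

    # higher information gain means a better attribute for the parent level
    print(entropy(parent) - weighted_child_entropy)   # about 0.278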
code:: python\n\n    from sklearn import tree\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', tree.DecisionTreeClassifier()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\noutput:\n\n\n.. code:: python\n\n                   precision    recall  f1-score   support\n\n              0       0.51      0.48      0.49       319\n              1       0.42      0.42      0.42       389\n              2       0.51      0.56      0.53       394\n              3       0.46      0.42      0.44       392\n              4       0.50      0.56      0.53       385\n              5       0.50      0.47      0.48       395\n              6       0.66      0.73      0.69       390\n              7       0.60      0.59      0.59       396\n              8       0.66      0.72      0.69       398\n              9       0.53      0.55      0.54       397\n             10       0.68      0.66      0.67       399\n             11       0.73      0.69      0.71       396\n             12       0.34      0.33      0.33       393\n             13       0.52      0.42      0.46       396\n             14       0.65      0.62      0.63       394\n             15       0.68      0.72      0.70       398\n             16       0.49      0.62      0.55       364\n             17       0.78      0.60      0.68       376\n             18       0.38      0.38      0.38       310\n             19       0.32      0.32      0.32       251\n\n    avg \u002F total       0.55      0.55      0.55      7532\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nRandom Forest\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\nRandom forests or random decision forests technique is an ensemble learning method for text classification. This method was introduced by `T. Kam Ho \u003Chttps:\u002F\u002Fdoi.org\u002F10.1109\u002FICDAR.1995.598994>`__ in 1995 for first time which used t trees in parallel. This technique was later developed by `L. Breiman \u003Chttps:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1023\u002FA:1010933404324>`__ in 1999 that they found converged for RF as a margin measure.\n\n\n.. image:: docs\u002Fpic\u002FRF.png\n\n.. 
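The two sources of randomness described above map directly onto scikit-learn parameters: bootstrap resampling of the training documents per tree and a random subspace of features considered at each split. A hedged sketch (restricted to two newsgroups so it runs quickly) that also reports the out-of-bag accuracy estimate:

.. code:: python

    from sklearn.ensemble import RandomForestClassifier
    from sklearn.pipeline import Pipeline
    from sklearn.feature_extraction.text import CountVectorizer
    from sklearn.feature_extraction.text import TfidfTransformer
    from sklearn.datasets import fetch_20newsgroups

    train = fetch_20newsgroups(subset='train',
                               categories=['sci.space', 'rec.autos'])

    forest = RandomForestClassifier(n_estimators=100,    # trees grown in parallel
                                    bootstrap=True,      # resample documents per tree
                                    max_features='sqrt', # random feature subspace per split
                                    oob_score=True)      # out-of-bag accuracy estimate

    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', forest),
                         ])

    text_clf.fit(train.data, train.target)
    print("out-of-bag accuracy:", forest.oob_score_)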
code:: python\n\n    from sklearn.ensemble import RandomForestClassifier\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', RandomForestClassifier(n_estimators=100)),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\noutput:\n\n\n.. code:: python\n\n\n                    precision    recall  f1-score   support\n\n              0       0.69      0.63      0.66       319\n              1       0.56      0.69      0.62       389\n              2       0.67      0.78      0.72       394\n              3       0.67      0.67      0.67       392\n              4       0.71      0.78      0.74       385\n              5       0.78      0.68      0.73       395\n              6       0.74      0.92      0.82       390\n              7       0.81      0.79      0.80       396\n              8       0.90      0.89      0.90       398\n              9       0.80      0.89      0.84       397\n             10       0.90      0.93      0.91       399\n             11       0.89      0.91      0.90       396\n             12       0.68      0.49      0.57       393\n             13       0.83      0.65      0.73       396\n             14       0.81      0.88      0.84       394\n             15       0.68      0.91      0.78       398\n             16       0.67      0.86      0.75       364\n             17       0.93      0.78      0.85       376\n             18       0.86      0.48      0.61       310\n             19       0.79      0.31      0.45       251\n\n    avg \u002F total       0.77      0.76      0.75      7532\n\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nConditional Random Field (CRF)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nConditional Random Field (CRF) is an undirected graphical model as shown in figure. CRFs state the conditional probability of a label sequence *Y* give a sequence of observation *X* *i.e.* P(Y|X). CRFs can incorporate complex features of observation sequence without violating the independence assumption by modeling the conditional probability of the label sequences rather than the joint probability P(X,Y). The concept of clique which is a fully connected subgraph and clique potential are used for computing P(X|Y). Considering one potential function for each clique of the graph, the probability of a variable configuration corresponds to the product of a series of non-negative potential function. The value computed by each potential function is equivalent to the probability of the variables in its corresponding clique taken on a particular configuration.\n\n\n.. image:: docs\u002Fpic\u002FCRF.png\n\n\nExample from `Here \u003Chttp:\u002F\u002Fsklearn-crfsuite.readthedocs.io\u002Fen\u002Flatest\u002Ftutorial.html>`__\nLet’s use CoNLL 2002 data to build a NER system\nCoNLL2002 corpus is available in NLTK. 
We use Spanish data.\n\n\n.. code:: python\n\n      import nltk\n      import sklearn_crfsuite\n      from sklearn_crfsuite import metrics\n      nltk.corpus.conll2002.fileids()\n      train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))\n      test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))\n      \n      \nsklearn-crfsuite (and python-crfsuite) supports several feature formats; here we use feature dicts.\n\n.. code:: python\n\n      def word2features(sent, i):\n          word = sent[i][0]\n          postag = sent[i][1]\n\n          features = {\n              'bias': 1.0,\n              'word.lower()': word.lower(),\n              'word[-3:]': word[-3:],\n              'word[-2:]': word[-2:],\n              'word.isupper()': word.isupper(),\n              'word.istitle()': word.istitle(),\n              'word.isdigit()': word.isdigit(),\n              'postag': postag,\n              'postag[:2]': postag[:2],\n          }\n          if i > 0:\n              word1 = sent[i-1][0]\n              postag1 = sent[i-1][1]\n              features.update({\n                  '-1:word.lower()': word1.lower(),\n                  '-1:word.istitle()': word1.istitle(),\n                  '-1:word.isupper()': word1.isupper(),\n                  '-1:postag': postag1,\n                  '-1:postag[:2]': postag1[:2],\n              })\n          else:\n              features['BOS'] = True\n\n          if i \u003C len(sent)-1:\n              word1 = sent[i+1][0]\n              postag1 = sent[i+1][1]\n              features.update({\n                  '+1:word.lower()': word1.lower(),\n                  '+1:word.istitle()': word1.istitle(),\n                  '+1:word.isupper()': word1.isupper(),\n                  '+1:postag': postag1,\n                  '+1:postag[:2]': postag1[:2],\n              })\n          else:\n              features['EOS'] = True\n\n          return features\n\n\n      def sent2features(sent):\n          return [word2features(sent, i) for i in range(len(sent))]\n\n      def sent2labels(sent):\n          return [label for token, postag, label in sent]\n\n      def sent2tokens(sent):\n          return [token for token, postag, label in sent]\n\n      X_train = [sent2features(s) for s in train_sents]\n      y_train = [sent2labels(s) for s in train_sents]\n\n      X_test = [sent2features(s) for s in test_sents]\n      y_test = [sent2labels(s) for s in test_sents]\n\n\nTo see all possible CRF parameters check its docstring. Here we are useing L-BFGS training algorithm (it is default) with Elastic Net (L1 + L2) regularization.\n\n\n\n.. code:: python\n\n      crf = sklearn_crfsuite.CRF(\n          algorithm='lbfgs',\n          c1=0.1,\n          c2=0.1,\n          max_iterations=100,\n          all_possible_transitions=True\n      )\n      crf.fit(X_train, y_train)\n\n\nEvaluation\n\n\n.. code:: python\n\n      y_pred = crf.predict(X_test)\n      print(metrics.flat_classification_report(\n          y_test, y_pred,  digits=3\n      ))\n\n\nOutput:\n\n.. 
code:: python\n\n                     precision    recall  f1-score   support\n\n            B-LOC      0.810     0.784     0.797      1084\n           B-MISC      0.731     0.569     0.640       339\n            B-ORG      0.807     0.832     0.820      1400\n            B-PER      0.850     0.884     0.867       735\n            I-LOC      0.690     0.637     0.662       325\n           I-MISC      0.699     0.589     0.639       557\n            I-ORG      0.852     0.786     0.818      1104\n            I-PER      0.893     0.943     0.917       634\n                O      0.992     0.997     0.994     45355\n\n      avg \u002F total      0.970     0.971     0.971     51533\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nDeep Learning\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n-----------------------------------------\nDeep Neural Networks\n-----------------------------------------\n\nDeep Neural Networks architectures are designed to learn through multiple connection of layers where each single layer only receives connection from previous and provides connections only to the next layer in hidden part. The input is a connection of feature space (As discussed in Section Feature_extraction with first hidden layer. For Deep Neural Networks (DNN), input layer could be tf-ifd, word embedding, or etc. as shown in standard DNN in Figure. The output layer houses neurons equal to the number of classes for multi-class classification and only one neuron for binary classification. But our main contribution in this paper is that we have many trained DNNs to serve different purposes. Here, we have multi-class DNNs where each learning model is generated randomly (number of nodes in each layer as well as the number of layers are randomly assigned). Our implementation of Deep Neural Network (DNN) is basically a discriminatively trained model that uses standard back-propagation algorithm and sigmoid or ReLU as activation functions. The output layer for multi-class classification should use Softmax.\n\n\n.. image:: docs\u002Fpic\u002FDNN.png\n\nimport packages:\n\n.. code:: python\n\n    from sklearn.datasets import fetch_20newsgroups\n    from keras.layers import  Dropout, Dense\n    from keras.models import Sequential\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn import metrics\n\n\nconvert text to TF-IDF:\n\n.. code:: python\n\n    def TFIDF(X_train, X_test,MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\",str(np.array(X_train).shape[1]),\"features\")\n        return (X_train,X_test)\n\n\nBuild a DNN Model for Text:\n\n.. 
code:: python\n\n    def Build_Model_DNN_Text(shape, nClasses, dropout=0.5):\n        \"\"\"\n        Build_Model_DNN_Text(shape, nClasses, dropout)\n        Build Deep neural networks Model for text classification\n        Shape is input feature space\n        nClasses is number of classes\n        \"\"\"\n        model = Sequential()\n        node = 512 # number of nodes\n        nLayers = 4 # number of hidden layers\n\n        model.add(Dense(node,input_dim=shape,activation='relu'))\n        model.add(Dropout(dropout))\n        for i in range(0,nLayers):\n            model.add(Dense(node,input_dim=node,activation='relu'))\n            model.add(Dropout(dropout))\n        model.add(Dense(nClasses, activation='softmax'))\n\n        model.compile(loss='sparse_categorical_crossentropy',\n                      optimizer='adam',\n                      metrics=['accuracy'])\n\n        return model\n\n\n\nLoad text dataset (20newsgroups):\n\n.. code:: python\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n\n\nrun DNN and see our result:\n\n\n.. code:: python\n\n    X_train_tfidf,X_test_tfidf = TFIDF(X_train,X_test)\n    model_DNN = Build_Model_DNN_Text(X_train_tfidf.shape[1], 20)\n    model_DNN.fit(X_train_tfidf, y_train,\n                                  validation_data=(X_test_tfidf, y_test),\n                                  epochs=10,\n                                  batch_size=128,\n                                  verbose=2)\n\n    predicted = model_DNN.predict_classes(X_test_tfidf)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\nModel summary:\n\n.. 
code:: python \n\n    _________________________________________________________________\n    Layer (type)                 Output Shape              Param #   \n    =================================================================\n    dense_1 (Dense)              (None, 512)               38400512  \n    _________________________________________________________________\n    dropout_1 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_2 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_2 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_3 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_3 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_4 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_4 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_5 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_5 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_6 (Dense)              (None, 20)                10260     \n    =================================================================\n    Total params: 39,461,396\n    Trainable params: 39,461,396\n    Non-trainable params: 0\n    _________________________________________________________________\n\n\n\nOutput:\n\n.. 
code:: python \n\n        Train on 11314 samples, validate on 7532 samples\n        Epoch 1\u002F10\n         - 16s - loss: 2.7553 - acc: 0.1090 - val_loss: 1.9330 - val_acc: 0.3184\n        Epoch 2\u002F10\n         - 15s - loss: 1.5330 - acc: 0.4222 - val_loss: 1.1546 - val_acc: 0.6204\n        Epoch 3\u002F10\n         - 15s - loss: 0.7438 - acc: 0.7257 - val_loss: 0.8405 - val_acc: 0.7499\n        Epoch 4\u002F10\n         - 15s - loss: 0.2967 - acc: 0.9020 - val_loss: 0.9214 - val_acc: 0.7767\n        Epoch 5\u002F10\n         - 15s - loss: 0.1557 - acc: 0.9543 - val_loss: 0.8965 - val_acc: 0.7917\n        Epoch 6\u002F10\n         - 15s - loss: 0.1015 - acc: 0.9705 - val_loss: 0.9427 - val_acc: 0.7949\n        Epoch 7\u002F10\n         - 15s - loss: 0.0595 - acc: 0.9835 - val_loss: 0.9893 - val_acc: 0.7995\n        Epoch 8\u002F10\n         - 15s - loss: 0.0495 - acc: 0.9866 - val_loss: 0.9512 - val_acc: 0.8079\n        Epoch 9\u002F10\n         - 15s - loss: 0.0437 - acc: 0.9867 - val_loss: 0.9690 - val_acc: 0.8117\n        Epoch 10\u002F10\n         - 15s - loss: 0.0443 - acc: 0.9880 - val_loss: 1.0004 - val_acc: 0.8070\n\n\n                       precision    recall  f1-score   support\n\n                  0       0.76      0.78      0.77       319\n                  1       0.67      0.80      0.73       389\n                  2       0.82      0.63      0.71       394\n                  3       0.76      0.69      0.72       392\n                  4       0.65      0.86      0.74       385\n                  5       0.84      0.75      0.79       395\n                  6       0.82      0.87      0.84       390\n                  7       0.86      0.90      0.88       396\n                  8       0.95      0.91      0.93       398\n                  9       0.91      0.92      0.92       397\n                 10       0.98      0.92      0.95       399\n                 11       0.96      0.85      0.90       396\n                 12       0.71      0.69      0.70       393\n                 13       0.95      0.70      0.81       396\n                 14       0.86      0.91      0.88       394\n                 15       0.85      0.90      0.87       398\n                 16       0.79      0.84      0.81       364\n                 17       0.99      0.77      0.87       376\n                 18       0.58      0.75      0.65       310\n                 19       0.52      0.60      0.55       251\n\n        avg \u002F total       0.82      0.81      0.81      7532\n\n\n-----------------------------------------\nRecurrent Neural Networks (RNN)\n-----------------------------------------\n\n.. image:: docs\u002Fpic\u002FRNN.png\n\nAnother neural network architecture that is addressed by the researchers for text miming and classification is Recurrent Neural Networks (RNN). RNN assigns more weights to the previous data points of sequence. Therefore, this technique is a powerful method for text, string and sequential data classification. Moreover, this technique could be used for image classification as we did in this work. In RNN, the neural net considers the information of previous nodes in a very sophisticated method which allows for better semantic analysis of the structures in the dataset. \n\n\nGated Recurrent Unit (GRU)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nGated Recurrent Unit (GRU) is a gating mechanism for RNN which was introduced by  `J. Chung et al. \u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F1412.3555>`__ and `K.Cho et al. 
\u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F1406.1078>`__. GRU is a simplified variant of the LSTM architecture, but there are differences as follows: GRU contains two gates and does not possess any internal memory (as shown in Figure; and finally, a second non-linearity is not applied (tanh in Figure).\n\n.. image:: docs\u002Fpic\u002FLSTM.png\n\nLong Short-Term Memory (LSTM)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nLong Short-Term Memory~(LSTM) was introduced by `S. Hochreiter and J. Schmidhuber \u003Chttps:\u002F\u002Fwww.mitpressjournals.org\u002Fdoi\u002Fabs\u002F10.1162\u002Fneco.1997.9.8.1735>`__  and developed by many research scientists.\n\nTo deal with these problems Long Short-Term Memory (LSTM) is a special type of RNN that preserves long term dependency in a more effective way compared to the basic RNNs. This is particularly useful to overcome vanishing gradient problem. Although LSTM has a chain-like structure similar to RNN, LSTM uses multiple gates to carefully regulate the amount of information that will be allowed into each node state. Figure shows the basic cell of a LSTM model.\n\n\n\nimport packages:\n\n.. code:: python\n\n\n    from keras.layers import Dropout, Dense, GRU, Embedding\n    from keras.models import Sequential\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn import metrics\n    from keras.preprocessing.text import Tokenizer\n    from keras.preprocessing.sequence import pad_sequences\n    from sklearn.datasets import fetch_20newsgroups\n\nconvert text to word embedding (Using GloVe):\n\n.. code:: python\n\n    def loadData_Tokenizer(X_train, X_test,MAX_NB_WORDS=75000,MAX_SEQUENCE_LENGTH=500):\n        np.random.seed(7)\n        text = np.concatenate((X_train, X_test), axis=0)\n        text = np.array(text)\n        tokenizer = Tokenizer(num_words=MAX_NB_WORDS)\n        tokenizer.fit_on_texts(text)\n        sequences = tokenizer.texts_to_sequences(text)\n        word_index = tokenizer.word_index\n        text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n        print('Found %s unique tokens.' % len(word_index))\n        indices = np.arange(text.shape[0])\n        # np.random.shuffle(indices)\n        text = text[indices]\n        print(text.shape)\n        X_train = text[0:len(X_train), ]\n        X_test = text[len(X_train):, ]\n        embeddings_index = {}\n        f = open(\".\\\\Glove\\\\glove.6B.50d.txt\", encoding=\"utf8\")\n        for line in f:\n\n            values = line.split()\n            word = values[0]\n            try:\n                coefs = np.asarray(values[1:], dtype='float32')\n            except:\n                pass\n            embeddings_index[word] = coefs\n        f.close()\n        print('Total %s word vectors.' % len(embeddings_index))\n        return (X_train, X_test, word_index,embeddings_index)\n\nBuild a RNN Model for Text:\n\n.. 
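The builder that follows stacks GRU layers; since this section also introduces LSTM cells, a hedged sketch of an LSTM variant is shown first for comparison. It assumes the word_index and embeddings_index returned by loadData_Tokenizer above, and the function name is hypothetical.

.. code:: python

    from keras.layers import Dropout, Dense, LSTM, Embedding
    from keras.models import Sequential
    import numpy as np

    def Build_Model_LSTM_Text(word_index, embeddings_index, nclasses,
                              MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):
        # hypothetical LSTM counterpart of the GRU builder below
        embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))
        for word, i in word_index.items():
            embedding_vector = embeddings_index.get(word)
            if embedding_vector is not None:
                embedding_matrix[i] = embedding_vector

        model = Sequential()
        model.add(Embedding(len(word_index) + 1, EMBEDDING_DIM,
                            weights=[embedding_matrix],
                            input_length=MAX_SEQUENCE_LENGTH,
                            trainable=True))
        # return_sequences=True keeps one output per time step so the
        # next recurrent layer still receives a full sequence
        model.add(LSTM(64, return_sequences=True, recurrent_dropout=0.2))
        model.add(Dropout(dropout))
        model.add(LSTM(64, recurrent_dropout=0.2))
        model.add(Dropout(dropout))
        model.add(Dense(nclasses, activation='softmax'))
        model.compile(loss='sparse_categorical_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])
        return model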
code:: python\n\n\n    def Build_Model_RNN_Text(word_index, embeddings_index, nclasses,  MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n        \"\"\"\n        def buildModel_RNN(word_index, embeddings_index, nclasses,  MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n        word_index in word index ,\n        embeddings_index is embeddings index, look at data_helper.py\n        nClasses is number of classes,\n        MAX_SEQUENCE_LENGTH is maximum lenght of text sequences\n        \"\"\"\n\n        model = Sequential()\n        hidden_layer = 3\n        gru_node = 32\n\n        embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))\n        for word, i in word_index.items():\n            embedding_vector = embeddings_index.get(word)\n            if embedding_vector is not None:\n                # words not found in embedding index will be all-zeros.\n                if len(embedding_matrix[i]) != len(embedding_vector):\n                    print(\"could not broadcast input array from shape\", str(len(embedding_matrix[i])),\n                          \"into shape\", str(len(embedding_vector)), \" Please make sure your\"\n                                                                    \" EMBEDDING_DIM is equal to embedding_vector file ,GloVe,\")\n                    exit(1)\n                embedding_matrix[i] = embedding_vector\n        model.add(Embedding(len(word_index) + 1,\n                                    EMBEDDING_DIM,\n                                    weights=[embedding_matrix],\n                                    input_length=MAX_SEQUENCE_LENGTH,\n                                    trainable=True))\n\n\n        print(gru_node)\n        for i in range(0,hidden_layer):\n            model.add(GRU(gru_node,return_sequences=True, recurrent_dropout=0.2))\n            model.add(Dropout(dropout))\n        model.add(GRU(gru_node, recurrent_dropout=0.2))\n        model.add(Dropout(dropout))\n        model.add(Dense(256, activation='relu'))\n        model.add(Dense(nclasses, activation='softmax'))\n\n\n        model.compile(loss='sparse_categorical_crossentropy',\n                          optimizer='adam',\n                          metrics=['accuracy'])\n        return model\n\n\n\n\nrun RNN and see our result:\n\n\n.. code:: python\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test)\n\n\n    model_RNN = Build_Model_RNN_Text(word_index,embeddings_index, 20)\n\n    model_RNN.fit(X_train_Glove, y_train,\n                                  validation_data=(X_test_Glove, y_test),\n                                  epochs=10,\n                                  batch_size=128,\n                                  verbose=2)\n\n    predicted = model_RNN.predict_classes(X_test_Glove)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\nModel summary:\n\n.. 
code:: python \n\n    _________________________________________________________________\n    Layer (type)                 Output Shape              Param #   \n    =================================================================\n    embedding_1 (Embedding)      (None, 500, 50)           8960500   \n    _________________________________________________________________\n    gru_1 (GRU)                  (None, 500, 256)          235776    \n    _________________________________________________________________\n    dropout_1 (Dropout)          (None, 500, 256)          0         \n    _________________________________________________________________\n    gru_2 (GRU)                  (None, 500, 256)          393984    \n    _________________________________________________________________\n    dropout_2 (Dropout)          (None, 500, 256)          0         \n    _________________________________________________________________\n    gru_3 (GRU)                  (None, 500, 256)          393984    \n    _________________________________________________________________\n    dropout_3 (Dropout)          (None, 500, 256)          0         \n    _________________________________________________________________\n    gru_4 (GRU)                  (None, 256)               393984    \n    _________________________________________________________________\n    dense_1 (Dense)              (None, 20)                5140      \n    =================================================================\n    Total params: 10,383,368\n    Trainable params: 10,383,368\n    Non-trainable params: 0\n    _________________________________________________________________\n\n\n\nOutput:\n\n.. code:: python \n\n    Train on 11314 samples, validate on 7532 samples\n    Epoch 1\u002F20\n     - 268s - loss: 2.5347 - acc: 0.1792 - val_loss: 2.2857 - val_acc: 0.2460\n    Epoch 2\u002F20\n     - 271s - loss: 1.6751 - acc: 0.3999 - val_loss: 1.4972 - val_acc: 0.4660\n    Epoch 3\u002F20\n     - 270s - loss: 1.0945 - acc: 0.6072 - val_loss: 1.3232 - val_acc: 0.5483\n    Epoch 4\u002F20\n     - 269s - loss: 0.7761 - acc: 0.7312 - val_loss: 1.1009 - val_acc: 0.6452\n    Epoch 5\u002F20\n     - 269s - loss: 0.5513 - acc: 0.8112 - val_loss: 1.0395 - val_acc: 0.6832\n    Epoch 6\u002F20\n     - 269s - loss: 0.3765 - acc: 0.8754 - val_loss: 0.9977 - val_acc: 0.7086\n    Epoch 7\u002F20\n     - 270s - loss: 0.2481 - acc: 0.9202 - val_loss: 1.0485 - val_acc: 0.7270\n    Epoch 8\u002F20\n     - 269s - loss: 0.1717 - acc: 0.9463 - val_loss: 1.0269 - val_acc: 0.7394\n    Epoch 9\u002F20\n     - 269s - loss: 0.1130 - acc: 0.9644 - val_loss: 1.1498 - val_acc: 0.7369\n    Epoch 10\u002F20\n     - 269s - loss: 0.0640 - acc: 0.9808 - val_loss: 1.1442 - val_acc: 0.7508\n    Epoch 11\u002F20\n     - 269s - loss: 0.0567 - acc: 0.9828 - val_loss: 1.2318 - val_acc: 0.7414\n    Epoch 12\u002F20\n     - 268s - loss: 0.0472 - acc: 0.9858 - val_loss: 1.2204 - val_acc: 0.7496\n    Epoch 13\u002F20\n     - 269s - loss: 0.0319 - acc: 0.9910 - val_loss: 1.1895 - val_acc: 0.7657\n    Epoch 14\u002F20\n     - 268s - loss: 0.0466 - acc: 0.9853 - val_loss: 1.2821 - val_acc: 0.7517\n    Epoch 15\u002F20\n     - 271s - loss: 0.0269 - acc: 0.9917 - val_loss: 1.2869 - val_acc: 0.7557\n    Epoch 16\u002F20\n     - 271s - loss: 0.0187 - acc: 0.9950 - val_loss: 1.3037 - val_acc: 0.7598\n    Epoch 17\u002F20\n     - 268s - loss: 0.0157 - acc: 0.9959 - val_loss: 1.2974 - val_acc: 0.7638\n    Epoch 18\u002F20\n     - 270s - loss: 0.0121 - acc: 0.9966 - val_loss: 1.3526 
- val_acc: 0.7602\n    Epoch 19\u002F20\n     - 269s - loss: 0.0262 - acc: 0.9926 - val_loss: 1.4182 - val_acc: 0.7517\n    Epoch 20\u002F20\n     - 269s - loss: 0.0249 - acc: 0.9918 - val_loss: 1.3453 - val_acc: 0.7638\n\n\n                   precision    recall  f1-score   support\n\n              0       0.71      0.71      0.71       319\n              1       0.72      0.68      0.70       389\n              2       0.76      0.62      0.69       394\n              3       0.67      0.58      0.62       392\n              4       0.68      0.67      0.68       385\n              5       0.75      0.73      0.74       395\n              6       0.82      0.74      0.78       390\n              7       0.83      0.83      0.83       396\n              8       0.81      0.90      0.86       398\n              9       0.92      0.90      0.91       397\n             10       0.91      0.94      0.93       399\n             11       0.87      0.76      0.81       396\n             12       0.57      0.70      0.63       393\n             13       0.81      0.85      0.83       396\n             14       0.74      0.93      0.82       394\n             15       0.82      0.83      0.83       398\n             16       0.74      0.78      0.76       364\n             17       0.96      0.83      0.89       376\n             18       0.64      0.60      0.62       310\n             19       0.48      0.56      0.52       251\n\n    avg \u002F total       0.77      0.76      0.76      7532\n\n-----------------------------------------\nConvolutional Neural Networks (CNN)\n-----------------------------------------\n\nAnother deep learning architecture that is employed for hierarchical document classification is  Convolutional Neural Networks (CNN) . Although originally built for image processing  with architecture similar to the visual cortex, CNNs have also been effectively used for text classification. In a basic CNN for image processing, an image tensor is convolved with a set of kernels of size *d by d*. These convolution layers are called feature maps and can be stacked to provide multiple filters on the input. To reduce the computational complexity, CNNs use pooling which reduces the size of the output from one layer to the next in the network. Different pooling techniques are used to reduce outputs while preserving important features.\n\nThe most common pooling method is max pooling where the maximum element is selected from the pooling window. In order to feed the pooled output from stacked featured maps to the next layer, the maps are flattened into one column. The final layers in a CNN are typically fully connected dense layers.\nIn general, during the back-propagation step of a convolutional neural network not only the weights are adjusted but also the feature detector filters. A potential problem of CNN used for text is the number of 'channels', *Sigma* (size of the feature space). This might be very large (e.g. 50K), for text but for images this is less of a problem (e.g. only 3 channels of RGB). This means the dimensionality of the CNN for text is very high.\n\n\n.. image:: docs\u002Fpic\u002FCNN.png\n\nimport packages:\n\n.. 
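The effect of convolution and pooling on the sequence length can be checked with a little arithmetic. With the 500-token inputs used below, a Conv1D with kernel size 2 and 'valid' padding yields 499 positions, and MaxPooling1D(5) then reduces them to 99; the model summary further down shows exactly these shapes for all five parallel filter sizes. A small sketch of this shape bookkeeping (plain Python, independent of Keras):

.. code:: python

    def conv1d_output_length(length, kernel_size):
        # 'valid' 1-D convolution keeps one position per full window
        return length - kernel_size + 1

    def maxpool1d_output_length(length, pool_size):
        # non-overlapping max pooling keeps floor(length / pool_size) positions
        return length // pool_size

    seq_len = 500                     # MAX_SEQUENCE_LENGTH used below
    for kernel_size in range(2, 7):   # the five parallel filter sizes 2..6
        after_conv = conv1d_output_length(seq_len, kernel_size)
        after_pool = maxpool1d_output_length(after_conv, 5)
        print(kernel_size, after_conv, after_pool)   # e.g. 2 -> 499 -> 99

Concatenating the five pooled maps along the sequence axis gives 5 x 99 = 495 positions, which matches the concatenate layer in the summary. The packages for the survey's CNN implementation, which uses these same filter sizes, are imported next.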
code:: python\n\n\n    from keras.layers import Dropout, Dense,Input,Embedding,Flatten, MaxPooling1D, Conv1D\n    from keras.models import Sequential,Model\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn import metrics\n    from keras.preprocessing.text import Tokenizer\n    from keras.preprocessing.sequence import pad_sequences\n    from sklearn.datasets import fetch_20newsgroups\n    from keras.layers.merge import Concatenate\n\n\n\nconvert text to word embedding (Using GloVe):\n\n.. code:: python\n\n    def loadData_Tokenizer(X_train, X_test,MAX_NB_WORDS=75000,MAX_SEQUENCE_LENGTH=500):\n        np.random.seed(7)\n        text = np.concatenate((X_train, X_test), axis=0)\n        text = np.array(text)\n        tokenizer = Tokenizer(num_words=MAX_NB_WORDS)\n        tokenizer.fit_on_texts(text)\n        sequences = tokenizer.texts_to_sequences(text)\n        word_index = tokenizer.word_index\n        text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n        print('Found %s unique tokens.' % len(word_index))\n        indices = np.arange(text.shape[0])\n        # np.random.shuffle(indices)\n        text = text[indices]\n        print(text.shape)\n        X_train = text[0:len(X_train), ]\n        X_test = text[len(X_train):, ]\n        embeddings_index = {}\n        f = open(\".\\\\Glove\\\\glove.6B.50d.txt\", encoding=\"utf8\")\n        for line in f:\n            values = line.split()\n            word = values[0]\n            try:\n                coefs = np.asarray(values[1:], dtype='float32')\n            except:\n                pass\n            embeddings_index[word] = coefs\n        f.close()\n        print('Total %s word vectors.' % len(embeddings_index))\n        return (X_train, X_test, word_index,embeddings_index)\n\n\nBuild a CNN Model for Text:\n\n.. 
code:: python\n\n    def Build_Model_CNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n\n        \"\"\"\n            def buildModel_CNN(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n            word_index in word index ,\n            embeddings_index is embeddings index, look at data_helper.py\n            nClasses is number of classes,\n            MAX_SEQUENCE_LENGTH is maximum lenght of text sequences,\n            EMBEDDING_DIM is an int value for dimention of word embedding look at data_helper.py\n        \"\"\"\n\n        model = Sequential()\n        embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))\n        for word, i in word_index.items():\n            embedding_vector = embeddings_index.get(word)\n            if embedding_vector is not None:\n                # words not found in embedding index will be all-zeros.\n                if len(embedding_matrix[i]) !=len(embedding_vector):\n                    print(\"could not broadcast input array from shape\",str(len(embedding_matrix[i])),\n                                     \"into shape\",str(len(embedding_vector)),\" Please make sure your\"\n                                     \" EMBEDDING_DIM is equal to embedding_vector file ,GloVe,\")\n                    exit(1)\n\n                embedding_matrix[i] = embedding_vector\n\n        embedding_layer = Embedding(len(word_index) + 1,\n                                    EMBEDDING_DIM,\n                                    weights=[embedding_matrix],\n                                    input_length=MAX_SEQUENCE_LENGTH,\n                                    trainable=True)\n\n        # applying a more complex convolutional approach\n        convs = []\n        filter_sizes = []\n        layer = 5\n        print(\"Filter  \",layer)\n        for fl in range(0,layer):\n            filter_sizes.append((fl+2))\n\n        node = 128\n        sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')\n        embedded_sequences = embedding_layer(sequence_input)\n\n        for fsz in filter_sizes:\n            l_conv = Conv1D(node, kernel_size=fsz, activation='relu')(embedded_sequences)\n            l_pool = MaxPooling1D(5)(l_conv)\n            #l_pool = Dropout(0.25)(l_pool)\n            convs.append(l_pool)\n\n        l_merge = Concatenate(axis=1)(convs)\n        l_cov1 = Conv1D(node, 5, activation='relu')(l_merge)\n        l_cov1 = Dropout(dropout)(l_cov1)\n        l_pool1 = MaxPooling1D(5)(l_cov1)\n        l_cov2 = Conv1D(node, 5, activation='relu')(l_pool1)\n        l_cov2 = Dropout(dropout)(l_cov2)\n        l_pool2 = MaxPooling1D(30)(l_cov2)\n        l_flat = Flatten()(l_pool2)\n        l_dense = Dense(1024, activation='relu')(l_flat)\n        l_dense = Dropout(dropout)(l_dense)\n        l_dense = Dense(512, activation='relu')(l_dense)\n        l_dense = Dropout(dropout)(l_dense)\n        preds = Dense(nclasses, activation='softmax')(l_dense)\n        model = Model(sequence_input, preds)\n\n        model.compile(loss='sparse_categorical_crossentropy',\n                      optimizer='adam',\n                      metrics=['accuracy'])\n\n\n\n        return model\n\n\n\nrun CNN and see our result:\n\n\n.. 
code:: python\n\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test)\n\n\n    model_CNN = Build_Model_CNN_Text(word_index,embeddings_index, 20)\n\n\n    model_CNN.summary()\n\n    model_CNN.fit(X_train_Glove, y_train,\n                                  validation_data=(X_test_Glove, y_test),\n                                  epochs=15,\n                                  batch_size=128,\n                                  verbose=2)\n\n    predicted = model_CNN.predict(X_test_Glove)\n\n    predicted = np.argmax(predicted, axis=1)\n\n\n    print(metrics.classification_report(y_test, predicted))\n\n\nModel:\n\n.. code:: python \n\n    __________________________________________________________________________________________________\n    Layer (type)                    Output Shape         Param #     Connected to                     \n    ==================================================================================================\n    input_1 (InputLayer)            (None, 500)          0                                            \n    __________________________________________________________________________________________________\n    embedding_1 (Embedding)         (None, 500, 50)      8960500     input_1[0][0]                    \n    __________________________________________________________________________________________________\n    conv1d_1 (Conv1D)               (None, 499, 128)     12928       embedding_1[0][0]                \n    __________________________________________________________________________________________________\n    conv1d_2 (Conv1D)               (None, 498, 128)     19328       embedding_1[0][0]                \n    __________________________________________________________________________________________________\n    conv1d_3 (Conv1D)               (None, 497, 128)     25728       embedding_1[0][0]                \n    __________________________________________________________________________________________________\n    conv1d_4 (Conv1D)               (None, 496, 128)     32128       embedding_1[0][0]                \n    __________________________________________________________________________________________________\n    conv1d_5 (Conv1D)               (None, 495, 128)     38528       embedding_1[0][0]                \n    __________________________________________________________________________________________________\n    max_pooling1d_1 (MaxPooling1D)  (None, 99, 128)      0           conv1d_1[0][0]                   \n    __________________________________________________________________________________________________\n    max_pooling1d_2 (MaxPooling1D)  (None, 99, 128)      0           conv1d_2[0][0]                   \n    __________________________________________________________________________________________________\n    max_pooling1d_3 (MaxPooling1D)  (None, 99, 128)      0           conv1d_3[0][0]                   \n    __________________________________________________________________________________________________\n    max_pooling1d_4 (MaxPooling1D)  (None, 99, 128)      0           conv1d_4[0][0]                   \n    __________________________________________________________________________________________________\n    
max_pooling1d_5 (MaxPooling1D)  (None, 99, 128)      0           conv1d_5[0][0]                   \n    __________________________________________________________________________________________________\n    concatenate_1 (Concatenate)     (None, 495, 128)     0           max_pooling1d_1[0][0]            \n                                                                     max_pooling1d_2[0][0]            \n                                                                     max_pooling1d_3[0][0]            \n                                                                     max_pooling1d_4[0][0]            \n                                                                     max_pooling1d_5[0][0]            \n    __________________________________________________________________________________________________\n    conv1d_6 (Conv1D)               (None, 491, 128)     82048       concatenate_1[0][0]              \n    __________________________________________________________________________________________________\n    dropout_1 (Dropout)             (None, 491, 128)     0           conv1d_6[0][0]                   \n    __________________________________________________________________________________________________\n    max_pooling1d_6 (MaxPooling1D)  (None, 98, 128)      0           dropout_1[0][0]                  \n    __________________________________________________________________________________________________\n    conv1d_7 (Conv1D)               (None, 94, 128)      82048       max_pooling1d_6[0][0]            \n    __________________________________________________________________________________________________\n    dropout_2 (Dropout)             (None, 94, 128)      0           conv1d_7[0][0]                   \n    __________________________________________________________________________________________________\n    max_pooling1d_7 (MaxPooling1D)  (None, 3, 128)       0           dropout_2[0][0]                  \n    __________________________________________________________________________________________________\n    flatten_1 (Flatten)             (None, 384)          0           max_pooling1d_7[0][0]            \n    __________________________________________________________________________________________________\n    dense_1 (Dense)                 (None, 1024)         394240      flatten_1[0][0]                  \n    __________________________________________________________________________________________________\n    dropout_3 (Dropout)             (None, 1024)         0           dense_1[0][0]                    \n    __________________________________________________________________________________________________\n    dense_2 (Dense)                 (None, 512)          524800      dropout_3[0][0]                  \n    __________________________________________________________________________________________________\n    dropout_4 (Dropout)             (None, 512)          0           dense_2[0][0]                    \n    __________________________________________________________________________________________________\n    dense_3 (Dense)                 (None, 20)           10260       dropout_4[0][0]                  \n    ==================================================================================================\n    Total params: 10,182,536\n    Trainable params: 10,182,536\n    Non-trainable params: 0\n    __________________________________________________________________________________________________\n\n\nOutput:\n\n\n.. 
code:: python \n\n    Train on 11314 samples, validate on 7532 samples\n    Epoch 1\u002F15\n     - 6s - loss: 2.9329 - acc: 0.0783 - val_loss: 2.7628 - val_acc: 0.1403\n    Epoch 2\u002F15\n     - 4s - loss: 2.2534 - acc: 0.2249 - val_loss: 2.1715 - val_acc: 0.4007\n    Epoch 3\u002F15\n     - 4s - loss: 1.5643 - acc: 0.4326 - val_loss: 1.7846 - val_acc: 0.5052\n    Epoch 4\u002F15\n     - 4s - loss: 1.1771 - acc: 0.5662 - val_loss: 1.4949 - val_acc: 0.6131\n    Epoch 5\u002F15\n     - 4s - loss: 0.8880 - acc: 0.6797 - val_loss: 1.3629 - val_acc: 0.6256\n    Epoch 6\u002F15\n     - 4s - loss: 0.6990 - acc: 0.7569 - val_loss: 1.2013 - val_acc: 0.6624\n    Epoch 7\u002F15\n     - 4s - loss: 0.5037 - acc: 0.8200 - val_loss: 1.0674 - val_acc: 0.6807\n    Epoch 8\u002F15\n     - 4s - loss: 0.4050 - acc: 0.8626 - val_loss: 1.0223 - val_acc: 0.6863\n    Epoch 9\u002F15\n     - 4s - loss: 0.2952 - acc: 0.8968 - val_loss: 0.9045 - val_acc: 0.7120\n    Epoch 10\u002F15\n     - 4s - loss: 0.2314 - acc: 0.9217 - val_loss: 0.8574 - val_acc: 0.7326\n    Epoch 11\u002F15\n     - 4s - loss: 0.1778 - acc: 0.9436 - val_loss: 0.8752 - val_acc: 0.7270\n    Epoch 12\u002F15\n     - 4s - loss: 0.1475 - acc: 0.9524 - val_loss: 0.8299 - val_acc: 0.7355\n    Epoch 13\u002F15\n     - 4s - loss: 0.1089 - acc: 0.9657 - val_loss: 0.8034 - val_acc: 0.7491\n    Epoch 14\u002F15\n     - 4s - loss: 0.1047 - acc: 0.9666 - val_loss: 0.8172 - val_acc: 0.7463\n    Epoch 15\u002F15\n     - 4s - loss: 0.0749 - acc: 0.9774 - val_loss: 0.8511 - val_acc: 0.7313\n     \n     \n                   precision    recall  f1-score   support\n\n              0       0.75      0.61      0.67       319\n              1       0.63      0.74      0.68       389\n              2       0.74      0.54      0.62       394\n              3       0.49      0.76      0.60       392\n              4       0.60      0.70      0.64       385\n              5       0.79      0.57      0.66       395\n              6       0.73      0.76      0.74       390\n              7       0.83      0.74      0.78       396\n              8       0.86      0.88      0.87       398\n              9       0.95      0.78      0.86       397\n             10       0.93      0.93      0.93       399\n             11       0.92      0.77      0.84       396\n             12       0.55      0.72      0.62       393\n             13       0.76      0.85      0.80       396\n             14       0.86      0.83      0.84       394\n             15       0.91      0.73      0.81       398\n             16       0.75      0.65      0.70       364\n             17       0.95      0.86      0.90       376\n             18       0.60      0.49      0.54       310\n             19       0.37      0.60      0.46       251\n\n    avg \u002F total       0.76      0.73      0.74      7532\n\n\n\n\n-----------------------------------------\nHierarchical Attention Networks\n-----------------------------------------\n\n.. image:: docs\u002Fpic\u002FHAN.png\n\n---------------------------------------------\nRecurrent Convolutional Neural Networks (RCNN)\n---------------------------------------------\n\nRecurrent Convolutional Neural Networks (RCNN) is also used for text classification. The main idea of this technique is capturing contextual information with the recurrent structure and constructing the representation of text using a convolutional neural network. This architecture is a combination of RNN and CNN to use advantages of both technique in a model.\n\n\n\nimport packages:\n\n.. 
code:: python \n\n      from keras.preprocessing import sequence\n      from keras.models import Sequential\n      from keras.layers import Dense, Dropout, Activation\n      from keras.layers import Embedding\n      from keras.layers import GRU, LSTM\n      from keras.layers import Conv1D, MaxPooling1D\n      from keras.datasets import imdb\n      from sklearn.datasets import fetch_20newsgroups\n      import numpy as np\n      from sklearn import metrics\n      from keras.preprocessing.text import Tokenizer\n      from keras.preprocessing.sequence import pad_sequences\n\n\n\nConvert text to word embedding (Using GloVe):\n\n.. code:: python \n\n      def loadData_Tokenizer(X_train, X_test, MAX_NB_WORDS=75000, MAX_SEQUENCE_LENGTH=500):\n          np.random.seed(7)\n          text = np.concatenate((X_train, X_test), axis=0)\n          text = np.array(text)\n          tokenizer = Tokenizer(num_words=MAX_NB_WORDS)\n          tokenizer.fit_on_texts(text)\n          sequences = tokenizer.texts_to_sequences(text)\n          word_index = tokenizer.word_index\n          text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n          print('Found %s unique tokens.' % len(word_index))\n          indices = np.arange(text.shape[0])\n          # np.random.shuffle(indices)\n          text = text[indices]\n          print(text.shape)\n          X_train = text[0:len(X_train), ]\n          X_test = text[len(X_train):, ]\n          embeddings_index = {}\n          # path to a local copy of the pre-trained 50-dimensional GloVe vectors\n          f = open(\"C:\\\\Users\\\\kamran\\\\Documents\\\\GitHub\\\\RMDL\\\\Examples\\\\Glove\\\\glove.6B.50d.txt\", encoding=\"utf8\")\n          for line in f:\n              values = line.split()\n              word = values[0]\n              try:\n                  coefs = np.asarray(values[1:], dtype='float32')\n              except ValueError:\n                  # skip malformed lines instead of reusing the previous word's vector\n                  continue\n              embeddings_index[word] = coefs\n          f.close()\n          print('Total %s word vectors.' % len(embeddings_index))\n          return (X_train, X_test, word_index, embeddings_index)
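\n\nThe loader above points at an absolute local path for glove.6B.50d.txt. A small optional tweak (the GLOVE_PATH name is only illustrative, not part of the original script) makes that location configurable instead of hard-coded:\n\n.. code:: python \n\n      import os\n\n      # Illustrative: resolve the GloVe file from an environment variable,\n      # falling back to a copy in the working directory.\n      GLOVE_PATH = os.environ.get(\"GLOVE_PATH\", \"glove.6B.50d.txt\")\n\n      # inside loadData_Tokenizer, the hard-coded path would then become:\n      # f = open(GLOVE_PATH, encoding=\"utf8\")\n\n\n.. 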
code:: python \n\n      def Build_Model_RCNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50):\n\n          kernel_size = 2\n          filters = 256\n          pool_size = 2\n          gru_node = 256\n\n          embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))\n          for word, i in word_index.items():\n              embedding_vector = embeddings_index.get(word)\n              if embedding_vector is not None:\n                  # words not found in embedding index will be all-zeros.\n                  if len(embedding_matrix[i]) !=len(embedding_vector):\n                      print(\"could not broadcast input array from shape\",str(len(embedding_matrix[i])),\n                                       \"into shape\",str(len(embedding_vector)),\" Please make sure your\"\n                                       \" EMBEDDING_DIM is equal to embedding_vector file ,GloVe,\")\n                      exit(1)\n\n                  embedding_matrix[i] = embedding_vector\n\n\n\n          model = Sequential()\n          model.add(Embedding(len(word_index) + 1,\n                                      EMBEDDING_DIM,\n                                      weights=[embedding_matrix],\n                                      input_length=MAX_SEQUENCE_LENGTH,\n                                      trainable=True))\n          model.add(Dropout(0.25))\n          model.add(Conv1D(filters, kernel_size, activation='relu'))\n          model.add(MaxPooling1D(pool_size=pool_size))\n          model.add(Conv1D(filters, kernel_size, activation='relu'))\n          model.add(MaxPooling1D(pool_size=pool_size))\n          model.add(Conv1D(filters, kernel_size, activation='relu'))\n          model.add(MaxPooling1D(pool_size=pool_size))\n          model.add(Conv1D(filters, kernel_size, activation='relu'))\n          model.add(MaxPooling1D(pool_size=pool_size))\n          model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))\n          model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))\n          model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))\n          model.add(LSTM(gru_node, recurrent_dropout=0.2))\n          model.add(Dense(1024,activation='relu'))\n          model.add(Dense(nclasses))\n          model.add(Activation('softmax'))\n\n          model.compile(loss='sparse_categorical_crossentropy',\n                        optimizer='adam',\n                        metrics=['accuracy'])\n\n          return model\n\n\n.. code:: python \n\n      newsgroups_train = fetch_20newsgroups(subset='train')\n      newsgroups_test = fetch_20newsgroups(subset='test')\n      X_train = newsgroups_train.data\n      X_test = newsgroups_test.data\n      y_train = newsgroups_train.target\n      y_test = newsgroups_test.target\n\n      X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test)\n\n\nRun RCNN :\n\n\n.. 
code:: python \n\n\n      model_RCNN = Build_Model_RCNN_Text(word_index, embeddings_index, 20)\n\n\n      model_RCNN.summary()\n\n      model_RCNN.fit(X_train_Glove, y_train,\n                                    validation_data=(X_test_Glove, y_test),\n                                    epochs=15,\n                                    batch_size=128,\n                                    verbose=2)\n\n      predicted = model_RCNN.predict(X_test_Glove)\n\n      predicted = np.argmax(predicted, axis=1)\n      print(metrics.classification_report(y_test, predicted))\n\n\nsummary of the model:\n\n\n.. code:: python \n\n      _________________________________________________________________\n      Layer (type)                 Output Shape              Param #   \n      =================================================================\n      embedding_1 (Embedding)      (None, 500, 50)           8960500   \n      _________________________________________________________________\n      dropout_1 (Dropout)          (None, 500, 50)           0         \n      _________________________________________________________________\n      conv1d_1 (Conv1D)            (None, 499, 256)          25856     \n      _________________________________________________________________\n      max_pooling1d_1 (MaxPooling1 (None, 249, 256)          0         \n      _________________________________________________________________\n      conv1d_2 (Conv1D)            (None, 248, 256)          131328    \n      _________________________________________________________________\n      max_pooling1d_2 (MaxPooling1 (None, 124, 256)          0         \n      _________________________________________________________________\n      conv1d_3 (Conv1D)            (None, 123, 256)          131328    \n      _________________________________________________________________\n      max_pooling1d_3 (MaxPooling1 (None, 61, 256)           0         \n      _________________________________________________________________\n      conv1d_4 (Conv1D)            (None, 60, 256)           131328    \n      _________________________________________________________________\n      max_pooling1d_4 (MaxPooling1 (None, 30, 256)           0         \n      _________________________________________________________________\n      lstm_1 (LSTM)                (None, 30, 256)           525312    \n      _________________________________________________________________\n      lstm_2 (LSTM)                (None, 30, 256)           525312    \n      _________________________________________________________________\n      lstm_3 (LSTM)                (None, 30, 256)           525312    \n      _________________________________________________________________\n      lstm_4 (LSTM)                (None, 256)               525312    \n      _________________________________________________________________\n      dense_1 (Dense)              (None, 1024)              263168    \n      _________________________________________________________________\n      dense_2 (Dense)              (None, 20)                20500     \n      _________________________________________________________________\n      activation_1 (Activation)    (None, 20)                0         \n      =================================================================\n      Total params: 11,765,256\n      Trainable params: 11,765,256\n      Non-trainable params: 0\n      _________________________________________________________________
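\n\nThe stacked recurrent block summarized above uses LSTM layers even though the unit count is named gru_node; a GRU block, as the name suggests, is a natural drop-in alternative. A minimal sketch of that substitution (illustrative only, not the configuration used for the training log below):\n\n.. code:: python \n\n      # Illustrative variant: replace the four stacked LSTM layers in\n      # Build_Model_RCNN_Text with GRU layers of the same size.\n      model.add(GRU(gru_node, return_sequences=True, recurrent_dropout=0.2))\n      model.add(GRU(gru_node, return_sequences=True, recurrent_dropout=0.2))\n      model.add(GRU(gru_node, return_sequences=True, recurrent_dropout=0.2))\n      model.add(GRU(gru_node, recurrent_dropout=0.2))\n\n\nOutput:\n\n.. 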
code:: python \n\n      Train on 11314 samples, validate on 7532 samples\n      Epoch 1\u002F15\n       - 28s - loss: 2.6624 - acc: 0.1081 - val_loss: 2.3012 - val_acc: 0.1753\n      Epoch 2\u002F15\n       - 22s - loss: 2.1142 - acc: 0.2224 - val_loss: 1.9168 - val_acc: 0.2669\n      Epoch 3\u002F15\n       - 22s - loss: 1.7465 - acc: 0.3290 - val_loss: 1.8257 - val_acc: 0.3412\n      Epoch 4\u002F15\n       - 22s - loss: 1.4730 - acc: 0.4356 - val_loss: 1.5433 - val_acc: 0.4436\n      Epoch 5\u002F15\n       - 22s - loss: 1.1800 - acc: 0.5556 - val_loss: 1.2973 - val_acc: 0.5467\n      Epoch 6\u002F15\n       - 22s - loss: 0.9910 - acc: 0.6281 - val_loss: 1.2530 - val_acc: 0.5797\n      Epoch 7\u002F15\n       - 22s - loss: 0.8581 - acc: 0.6854 - val_loss: 1.1522 - val_acc: 0.6281\n      Epoch 8\u002F15\n       - 22s - loss: 0.7058 - acc: 0.7428 - val_loss: 1.2385 - val_acc: 0.6033\n      Epoch 9\u002F15\n       - 22s - loss: 0.6792 - acc: 0.7515 - val_loss: 1.0200 - val_acc: 0.6775\n      Epoch 10\u002F15\n       - 22s - loss: 0.5782 - acc: 0.7948 - val_loss: 1.0961 - val_acc: 0.6577\n      Epoch 11\u002F15\n       - 23s - loss: 0.4674 - acc: 0.8341 - val_loss: 1.0866 - val_acc: 0.6924\n      Epoch 12\u002F15\n       - 23s - loss: 0.4284 - acc: 0.8512 - val_loss: 0.9880 - val_acc: 0.7096\n      Epoch 13\u002F15\n       - 22s - loss: 0.3883 - acc: 0.8670 - val_loss: 1.0190 - val_acc: 0.7151\n      Epoch 14\u002F15\n       - 22s - loss: 0.3334 - acc: 0.8874 - val_loss: 1.0025 - val_acc: 0.7232\n      Epoch 15\u002F15\n       - 22s - loss: 0.2857 - acc: 0.9038 - val_loss: 1.0123 - val_acc: 0.7331\n\n\n                   precision    recall  f1-score   support\n\n                0       0.64      0.73      0.68       319\n                1       0.45      0.83      0.58       389\n                2       0.81      0.64      0.71       394\n                3       0.64      0.57      0.61       392\n                4       0.55      0.78      0.64       385\n                5       0.77      0.52      0.62       395\n                6       0.84      0.77      0.80       390\n                7       0.87      0.79      0.83       396\n                8       0.85      0.90      0.87       398\n                9       0.98      0.84      0.90       397\n               10       0.93      0.96      0.95       399\n               11       0.92      0.79      0.85       396\n               12       0.59      0.53      0.56       393\n               13       0.82      0.82      0.82       396\n               14       0.84      0.84      0.84       394\n               15       0.83      0.89      0.86       398\n               16       0.68      0.86      0.76       364\n               17       0.97      0.86      0.91       376\n               18       0.66      0.50      0.57       310\n               19       0.53      0.31      0.40       251\n\n      avg \u002F total       0.77      0.75      0.75      7532\n\n\n\n-----------------------------------------\nRandom Multimodel Deep Learning (RMDL)\n-----------------------------------------\n\n\nReferenced paper : `RMDL: Random Multimodel Deep Learning for\nClassification \u003Chttps:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F324922651_RMDL_Random_Multimodel_Deep_Learning_for_Classification>`__\n\n\nA new ensemble, deep learning approach for classification. 
Deep\nlearning models have achieved state-of-the-art results across many domains.\nRMDL solves the problem of finding the best deep learning structure\nand architecture while simultaneously improving robustness and accuracy\nthrough ensembles of different deep learning architectures. RMDL can accept\na variety of data as input, including text, video, images, and symbols.\n\n\n|RMDL|\n\nRandom Multimodel Deep Learning (RMDL) architecture for classification.\nRMDL includes 3 random models: one DNN classifier at the left, one Deep CNN\nclassifier in the middle, and one Deep RNN classifier at the right (each unit could be an LSTM or GRU).\n\n\nInstallation\n\nRMDL can be installed either with pip or from the git repository:\n\nUsing pip\n\n\n.. code:: bash\n\n        pip install RMDL\n\nUsing git\n\n.. code:: bash\n\n    git clone --recursive https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FRMDL.git\n\nThe primary requirements for this package are Python 3 with Tensorflow. The requirements.txt file\ncontains a listing of the required `Python packages \u003Chttps:\u002F\u002Fwww.scaler.com\u002Ftopics\u002Fpython\u002Fpython-packages\u002F>`__; to install them all, run one of the following:\n\n.. code:: bash\n\n    pip install -r requirements.txt\n\nOr:\n\n.. code:: bash\n\n    pip3 install -r requirements.txt\n\nOr:\n\n.. code:: bash\n\n    conda install --file requirements.txt\n\nDocumentation:\n\n\nThe exponential growth in the number of complex datasets every year requires more enhancement in\nmachine learning methods to provide robust and accurate data classification. Lately, deep learning\napproaches have achieved better results compared to previous machine learning algorithms\non tasks like image classification, natural language processing, face recognition, etc. The\nsuccess of these deep learning algorithms relies on their capacity to model complex and non-linear\nrelationships within the data. However, finding suitable structures for these models has been a challenge\nfor researchers. This paper introduces Random Multimodel Deep Learning (RMDL): a new ensemble, deep learning\napproach for classification. RMDL aims to solve the problem of finding the best deep learning architecture while simultaneously improving the robustness and accuracy through ensembles of multiple deep\nlearning architectures. In short, RMDL trains multiple models of Deep Neural Networks (DNN),\nConvolutional Neural Networks (CNN) and Recurrent Neural Networks (RNN) in parallel and combines\ntheir results to produce better results than any of those models individually. To create these models,\neach deep learning model is constructed in a random fashion regarding the number of layers and\nnodes in its neural network structure. The resulting RMDL model can be used in various domains such\nas text, video, images, and symbolic data. In this project, we describe the RMDL model in depth and show the results\nfor image and text classification as well as face recognition. For image classification, we compared our\nmodel with some of the available baselines using MNIST and CIFAR-10 datasets. 
Similarly, we used four\ndatasets namely, WOS, Reuters, IMDB, and 20newsgroup, and compared our results with available baselines.\nWeb of Science (WOS) has been collected by authors and consists of three sets~(small, medium, and large sets).\nLastly, we used ORL dataset to compare the performance of our approach with other face recognition methods.\nThese test results show that the RDML model consistently outperforms standard methods over a broad range of\ndata types and classification problems.\n\n--------------------------------------------\nHierarchical Deep Learning for Text (HDLTex)\n--------------------------------------------\n\nRefrenced paper : `HDLTex: Hierarchical Deep Learning for Text\nClassification \u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F1709.08267>`__\n\n\n|HDLTex|\n\nDocumentation:\n\nIncreasingly large document collections require improved information processing methods for searching, retrieving, and organizing  text documents. Central to these information processing methods is document classification, which has become an important task supervised learning aims to solve. Recently, the performance of traditional supervised classifiers has degraded as the number of documents has increased. This exponential growth of document volume has also increated the number of categories. This paper approaches this problem differently from current document classification methods that view the problem as multi-class classification. Instead we perform hierarchical classification using an approach we call Hierarchical Deep Learning for Text classification (HDLTex). HDLTex employs stacks of deep learning architectures to provide hierarchical understanding of the documents.\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nComparison Text Classification Algorithms\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Model**                          | **Advantages**                                                                                                                                           | **Disadvantages**                                                                                                                       |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Rocchio Algorithm**              |  * Easy to implement                                                                                                                                     |  * The user can only retrieve a few relevant documents                                                                                  |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    | 
 * Computationally is very cheap                                                                                                                         |  * Rocchio often misclassifies the type for multimodal class                                                                            |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Relevance feedback mechanism (benefits to ranking documents as  not relevant)                                                                         |  * This technique is not very robust                                                                                                   |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |                                                                                                                                                          |  * linear combination in this algorithm is not good for multi-class datasets                                                            |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Boosting and Bagging**           |  * Improves the stability and accuracy (takes the advantage of ensemble learning where in multiple weak learner outperform a single strong learner.)     |  * Computational complexity                                                                                                             |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Reducing variance which helps to avoid overfitting problems.                                                                                          |  * loss of interpretability (if the number of models is hight, understanding the model is very difficult)                               |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |                                                                                                                                                          |  * Requires careful tuning of different hyper-parameters.              
                                                                 |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Logistic Regression**            |  * Easy to implement                                                                                                                                     |  * it cannot solve non-linear problems                                                                                                  |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * does not require too many computational resources                                                                                                     |  * prediction requires that each data point be independent                                                                              |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * it does not require input features to be scaled (pre-processing)                                                                                      |  * attempting to predict outcomes based on a set of independent variables                                                               |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * It does not require any tuning                                                                                                                        |                                                                                                                                         |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Naive Bayes Classifier**         |  * It works very well with text data                                                                                                                     |  *  A strong assumption about the shape of the data distribution                                                                        |\n|                                    |                                                                                                                       
                                   |                                                                                                                                         |\n|                                    |  * Easy to implement                                                                                                                                     |  * limited by data scarcity for which any possible value in feature space, a likelihood value must be estimated by a frequentist        |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Fast in comparing to other algorithms                                                                                                                 |                                                                                                                                         |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **K-Nearest Neighbor**             |  * Effective for text datasets                                                                                                                           |  * computational of this model is very expensive                                                                                        |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * non-parametric                                                                                                                                        |  * diffcult to find optimal value of k                                                                                                  |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * More local characteristics of text or document are considered                                                                                         |  * Constraint for large search problem to find nearest neighbors                                                                        |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 
Naturally handles multi-class datasets                                                                                                                |  * Finding a meaningful distance function is difficult for text datasets                                                                |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Support Vector Machine (SVM)**   |  * SVM can model non-linear decision boundaries                                                                                                          |  * lack of transparency in results caused by a high number of dimensions (especially for text data).                                    |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Performs similarly to logistic regression when linear separation                                                                                      |  * Choosing an efficient kernel function is difficult (Susceptible to overfitting\u002Ftraining issues depending on kernel)                  |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Robust against overfitting problems~(especially for text dataset due to high-dimensional space)                                                       |  * Memory complexity                                                                                                                    |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Decision Tree**                  |  * Can easily handle qualitative (categorical) features                                                                                                  |  * Issues with diagonal decision boundaries                                                                                             |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Works well with decision boundaries parellel to the feature axis                                                                                      |  * Can be easily overfit                                            
                                                                    |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Decision tree is a very fast algorithm for both learning and prediction                                                                               |  * extremely sensitive to small perturbations in the data                                                                               |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |                                                                                                                                                          |  * Problems with out-of-sample prediction                                                                                               |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Conditional Random Field (CRF)** |  * Its feature design is flexible                                                                                                                        |  * High computational complexity of the training step                                                                                   |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Since CRF computes the conditional probability of global optimal output nodes, it overcomes the drawbacks of label bias                               |  * this algorithm does not perform with unknown words                                                                                   |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Combining the advantages of classification and graphical modeling which combining the ability to compactly model multivariate data                    |  * Problem about online learning (It makes it very difficult to re-train the model when newer data becomes available.)                  
|\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Random Forest**                  |  * Ensembles of decision trees are very fast to train in comparison to other techniques                                                                  |  * Quite slow to create predictions once trained                                                                                        |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Reduced variance (relative to regular trees)                                                                                                          |  * more trees in forest increases time complexity in the prediction step                                                                |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Not require preparation and pre-processing of the input data                                                                                          |  * Not as easy to visually interpret                                                                                                    |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |                                                                                                                                                          |  * Overfitting can easily occur                                                                                                         |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |                                                                                                                                                          |  * Need to choose the number of trees at forest                                                                                         
|\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Deep Learning**                  |  * Flexible with features design (Reduces the need for feature engineering, one of the most time-consuming parts of machine learning practice.)          |  * Requires a large amount of data (if you only have small sample text data, deep learning is unlikely to outperform other approaches.  |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Architecture that can be adapted to new problems                                                                                                      |  * Is extremely computationally expensive to train.                                                                                     |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  *  Can deal with complex input-output mappings                                                                                                          |  * Model Interpretability is most important problem of deep learning~(Deep learning in most of the time is black-box)                   |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Can easily handle online learning (It makes it very easy to re-train the model when newer data becomes available.)                                    
|  * Finding an efficient architecture and structure is still the main challenge of this technique                                        |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * Parallel processing capability (It can perform more than one job at the same time)                                                                    |                                                                                                                                         |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n\n\n\n==========\nEvaluation\n==========\n\n----\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nF1 Score\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n.. image:: docs\u002Fpic\u002FF1.png\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nMatthew correlation coefficient (MCC)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\nCompute the Matthews correlation coefficient (MCC)\n\nThe Matthews correlation coefficient is used in machine learning as a measure of the quality of binary (two-class) classification problems. It takes into account of true and false positives and negatives and is generally regarded as a balanced measure which can be used even if the classes are of very different sizes. The MCC is in essence a correlation coefficient value between -1 and +1. A coefficient of +1 represents a perfect prediction, 0 an average random prediction and -1 an inverse prediction. The statistic is also known as the phi coefficient. \n\n\n.. code:: python\n\n    from sklearn.metrics import matthews_corrcoef\n    y_true = [+1, +1, +1, -1]\n    y_pred = [+1, -1, +1, +1]\n    matthews_corrcoef(y_true, y_pred)  \n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nReceiver operating characteristics (ROC)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nROC curves are typically used in binary classification to study the output of a classifier. In order to extend ROC curve and ROC area to multi-class or multi-label classification, it is necessary to binarize the output. One ROC curve can be drawn per label, but one can also draw a ROC curve by considering each element of the label indicator matrix as a binary prediction (micro-averaging).\n\nAnother evaluation measure for multi-class classification is macro-averaging, which gives equal weight to the classification of each label. [`sources  \u003Chttp:\u002F\u002Fscikit-learn.org\u002Fstable\u002Fauto_examples\u002Fmodel_selection\u002Fplot_roc.html>`__] \n\n.. 
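code:: python\n\n    # Both averaging schemes described above can also be computed directly with\n    # sklearn.metrics.roc_auc_score, assuming binarized labels y_test and per-class\n    # scores y_score of shape (n_samples, n_classes), as constructed in the worked\n    # example that follows.\n    from sklearn.metrics import roc_auc_score\n\n    print(roc_auc_score(y_test, y_score, average=\"micro\"))\n    print(roc_auc_score(y_test, y_score, average=\"macro\"))\n\nThe full multi-class example below builds those arrays step by step:\n\n.. 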
code:: python\n\n    import numpy as np\n    import matplotlib.pyplot as plt\n    from itertools import cycle\n\n    from sklearn import svm, datasets\n    from sklearn.metrics import roc_curve, auc\n    from sklearn.model_selection import train_test_split\n    from sklearn.preprocessing import label_binarize\n    from sklearn.multiclass import OneVsRestClassifier\n    from scipy import interp\n\n    # Import some data to play with\n    iris = datasets.load_iris()\n    X = iris.data\n    y = iris.target\n\n    # Binarize the output\n    y = label_binarize(y, classes=[0, 1, 2])\n    n_classes = y.shape[1]\n\n    # Add noisy features to make the problem harder\n    random_state = np.random.RandomState(0)\n    n_samples, n_features = X.shape\n    X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]\n\n    # shuffle and split training and test sets\n    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,\n                                                        random_state=0)\n\n    # Learn to predict each class against the other\n    classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,\n                                     random_state=random_state))\n    y_score = classifier.fit(X_train, y_train).decision_function(X_test)\n\n    # Compute ROC curve and ROC area for each class\n    fpr = dict()\n    tpr = dict()\n    roc_auc = dict()\n    for i in range(n_classes):\n        fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])\n        roc_auc[i] = auc(fpr[i], tpr[i])\n\n    # Compute micro-average ROC curve and ROC area\n    fpr[\"micro\"], tpr[\"micro\"], _ = roc_curve(y_test.ravel(), y_score.ravel())\n    roc_auc[\"micro\"] = auc(fpr[\"micro\"], tpr[\"micro\"])\n   \n\n\nPlot of a ROC curve for a specific class\n\n\n.. code:: python\n\n    plt.figure()\n    lw = 2\n    plt.plot(fpr[2], tpr[2], color='darkorange',\n             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])\n    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n    plt.xlim([0.0, 1.0])\n    plt.ylim([0.0, 1.05])\n    plt.xlabel('False Positive Rate')\n    plt.ylabel('True Positive Rate')\n    plt.title('Receiver operating characteristic example')\n    plt.legend(loc=\"lower right\")\n    plt.show()\n\n\n.. image:: \u002Fdocs\u002Fpic\u002Fsphx_glr_plot_roc_001.png\n\n\n~~~~~~~~~~~~~~~~~~~~~~~\nArea Under Curve (AUC)\n~~~~~~~~~~~~~~~~~~~~~~~\n\nArea  under ROC curve (AUC) is a summary metric that measures the entire area underneath the ROC curve. AUC holds helpful properties, such as  increased  sensitivity in the analysis of variance (ANOVA) tests, independence of decision threshold, invariance to a priori class probability and the indication of how well negative and positive classes are regarding decision index.\n\n\n.. code:: python\n\n      import numpy as np\n      from sklearn import metrics\n      fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)\n      metrics.auc(fpr, tpr)\n\n\n\n\n==========================\nText and Document Datasets\n==========================\n\n----\n\n~~~~~\nIMDB\n~~~~~\n\n- `IMDB Dataset \u003Chttp:\u002F\u002Fai.stanford.edu\u002F~amaas\u002Fdata\u002Fsentiment\u002F>`__\n\nDataset of 25,000 movies reviews from IMDB, labeled by sentiment (positive\u002Fnegative). Reviews have been preprocessed, and each review is encoded as a sequence of word indexes (integers). 
For convenience, words are indexed by overall frequency in the dataset, so that for instance the integer \"3\" encodes the 3rd most frequent word in the data. This allows for quick filtering operations, such as \"only consider the top 10,000 most common words, but eliminate the top 20 most common words\".\n\nAs a convention, \"0\" does not stand for a specific word, but instead is used to encode any unknown word.\n\n\n.. code:: python\n\n\n  from keras.datasets import imdb\n\n  (x_train, y_train), (x_test, y_test) = imdb.load_data(path=\"imdb.npz\",\n                                                        num_words=None,\n                                                        skip_top=0,\n                                                        maxlen=None,\n                                                        seed=113,\n                                                        start_char=1,\n                                                        oov_char=2,\n                                                        index_from=3)\n\n~~~~~~~~~~~~~\nReuters-21578\n~~~~~~~~~~~~~\n\n- `Reters-21578 Dataset \u003Chttps:\u002F\u002Fkeras.io\u002Fdatasets\u002F>`__\n\n\nDataset of 11,228 newswires from Reuters, labeled over 46 topics. As with the IMDB dataset, each wire is encoded as a sequence of word indexes (same conventions).\n\n\n.. code:: python\n\n  from keras.datasets import reuters\n\n  (x_train, y_train), (x_test, y_test) = reuters.load_data(path=\"reuters.npz\",\n                                                           num_words=None,\n                                                           skip_top=0,\n                                                           maxlen=None,\n                                                           test_split=0.2,\n                                                           seed=113,\n                                                           start_char=1,\n                                                           oov_char=2,\n                                                           index_from=3)\n                                                         \n                                                         \n~~~~~~~~~~~~~\n20Newsgroups\n~~~~~~~~~~~~~\n\n- `20Newsgroups Dataset \u003Chttps:\u002F\u002Farchive.ics.uci.edu\u002Fml\u002Fdatasets\u002FTwenty+Newsgroups>`__\n\nThe 20 newsgroups dataset comprises around 18000 newsgroups posts on 20 topics split in two subsets: one for training (or development) and the other one for testing (or for performance evaluation). The split between the train and test set is based upon messages posted before and after a specific date.\n\nThis module contains two loaders. The first one, sklearn.datasets.fetch_20newsgroups, returns a list of the raw texts that can be fed to text feature extractors, such as sklearn.feature_extraction.text.CountVectorizer with custom parameters so as to extract feature vectors. The second one, sklearn.datasets.fetch_20newsgroups_vectorized, returns ready-to-use features, i.e., it is not necessary to use a feature extractor.\n\n\n.. 
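code:: python\n\n  # The second loader mentioned above returns ready-to-use (tf-idf) features,\n  # so no separate feature extractor is needed.\n  from sklearn.datasets import fetch_20newsgroups_vectorized\n\n  newsgroups_vec = fetch_20newsgroups_vectorized(subset='train')\n  print(newsgroups_vec.data.shape)    # sparse matrix of shape (n_samples, n_features)\n  print(newsgroups_vec.target.shape)\n\nThe first loader returns the raw texts and the list of topic names:\n\n.. 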
code:: python\n\n  from sklearn.datasets import fetch_20newsgroups\n  newsgroups_train = fetch_20newsgroups(subset='train')\n\n  from pprint import pprint\n  pprint(list(newsgroups_train.target_names))\n\n  ['alt.atheism',\n   'comp.graphics',\n   'comp.os.ms-windows.misc',\n   'comp.sys.ibm.pc.hardware',\n   'comp.sys.mac.hardware',\n   'comp.windows.x',\n   'misc.forsale',\n   'rec.autos',\n   'rec.motorcycles',\n   'rec.sport.baseball',\n   'rec.sport.hockey',\n   'sci.crypt',\n   'sci.electronics',\n   'sci.med',\n   'sci.space',\n   'soc.religion.christian',\n   'talk.politics.guns',\n   'talk.politics.mideast',\n   'talk.politics.misc',\n   'talk.religion.misc']\n\n\n~~~~~~~~~~~~~~~~~~~~~~\nWeb of Science Dataset\n~~~~~~~~~~~~~~~~~~~~~~\n\nDescription of Dataset:\n\nThere are three datasets: WOS-11967, WOS-46985, and WOS-5736.\nEach folder contains:\n\n- X.txt\n- Y.txt\n- YL1.txt\n- YL2.txt\n\nX is the input data (text sequences)\nY is the target value\nYL1 is the level-one target value (parent label)\nYL2 is the level-two target value (child label)\n\nMeta-data:\nThis folder contains one data file with the following attributes:\nY1 Y2 Y Domain area keywords Abstract\n\nAbstract is the input data, containing the text of 46,985 published papers\nY is the target value\nYL1 is the level-one target value (parent label)\nYL2 is the level-two target value (child label)\nDomain is the major domain, with 7 labels: {Computer Science, Electrical Engineering, Psychology, Mechanical Engineering, Civil Engineering, Medical Science, Biochemistry}\narea is the subdomain or area of the paper, such as CS -> computer graphics, with 134 labels in total\nkeywords are the authors' keywords of the papers\n\n-  Web of Science Dataset `WOS-11967 \u003Chttp:\u002F\u002Fdx.doi.org\u002F10.17632\u002F9rw3vkcfy4.2>`__\n..\n\n  This dataset contains 11,967 documents with 35 categories, including 7 parent categories.\n\n-  Web of Science Dataset `WOS-46985 \u003Chttp:\u002F\u002Fdx.doi.org\u002F10.17632\u002F9rw3vkcfy4.2>`__\n\n..\n\n  This dataset contains 46,985 documents with 134 categories, including 7 parent categories.\n\n-  Web of Science Dataset `WOS-5736 \u003Chttp:\u002F\u002Fdx.doi.org\u002F10.17632\u002F9rw3vkcfy4.2>`__\n\n..\n\n  This dataset contains 5,736 documents with 11 categories, including 3 parent categories.\n\nReferenced paper: HDLTex: Hierarchical Deep Learning for Text Classification\n\n================================\nText Classification Applications\n================================\n\n\n----\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~\nInformation Retrieval\n~~~~~~~~~~~~~~~~~~~~~~\nInformation retrieval is finding documents of an unstructured nature that satisfy an information need from within large collections of documents. With the rapid growth of online information, particularly in text format, text classification has become a significant technique for managing this type of data. Some of the important methods used in this area are Naive Bayes, SVM, decision tree, J48, k-NN and IBK. One of the most challenging applications for document and text dataset processing is applying document categorization methods for information retrieval.\n\n- 🎓 `Introduction to information retrieval \u003Chttp:\u002F\u002Feprints.bimcoordinator.co.uk\u002F35\u002F>`__ Manning, C., Raghavan, P., & Schütze, H. (2010).\n\n- 🎓 `Web forum retrieval and text analytics: A survey \u003Chttp:\u002F\u002Fwww.nowpublishers.com\u002Farticle\u002FDetails\u002FINR-062>`__ Hoogeveen, Doris, et al. 
(2018).\n\n- 🎓 `Automatic Text Classification in Information retrieval: A Survey \u003Chttps:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=2905191>`__ Dwivedi, Sanjay K., and Chandrakala Arya.. (2016).\n\n~~~~~~~~~~~~~~~~~~~~~~\nInformation Filtering\n~~~~~~~~~~~~~~~~~~~~~~\nInformation filtering refers to selection of relevant information or rejection of irrelevant information from a stream of incoming data. Information filtering systems are typically used to measure and forecast users' long-term interests. Probabilistic models, such as Bayesian inference network, are commonly used in information filtering systems. Bayesian inference networks employ recursive inference to propagate values through the inference network and return documents with the highest ranking. Chris used vector space model with iterative refinement for filtering task.\n \n\n- 🎓 `Search engines: Information retrieval in practice \u003Chttp:\u002F\u002Flibrary.mpib-berlin.mpg.de\u002Ftoc\u002Fz2009_2465.pdf\u002F>`__ Croft, W. B., Metzler, D., & Strohman, T. (2010).\n\n- 🎓 `Implementation of the SMART information retrieval system \u003Chttps:\u002F\u002Fecommons.cornell.edu\u002Fbitstream\u002Fhandle\u002F1813\u002F6526\u002F85-686.pdf?sequence=1>`__ Buckley, Chris\n\n~~~~~~~~~~~~~~~~~~~~~~\nSentiment Analysis\n~~~~~~~~~~~~~~~~~~~~~~\nSentiment analysis is a computational approach toward identifying opinion, sentiment, and subjectivity in text. Sentiment classification methods classify a document associated with an opinion to be positive or negative. The assumption is that document d is expressing an opinion on a single entity e and opinions are formed via a single opinion holder h. Naive Bayesian classification and SVM are some of the most popular supervised learning methods that have been used for sentiment classification. Features such as terms and their respective frequency, part of speech, opinion words and phrases, negations and syntactic dependency have been used in sentiment classification techniques.\n\n- 🎓 `Opinion mining and sentiment analysis \u003Chttp:\u002F\u002Fwww.nowpublishers.com\u002Farticle\u002FDetails\u002FINR-011>`__ Pang, Bo, and Lillian Lee. (2008).\n\n- 🎓 `A survey of opinion mining and sentiment analysis \u003Chttps:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-1-4614-3223-4_13>`__ Liu, Bing, and Lei Zhang. (2010).\n\n- 🎓 `Thumbs up?: sentiment classification using machine learning techniques \u003Chttps:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=1118704>`__ Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. \n\n~~~~~~~~~~~~~~~~~~~~~~\nRecommender Systems\n~~~~~~~~~~~~~~~~~~~~~~\nContent-based recommender systems suggest items to users based on the description of an item and a profile of the user's interests. \nA user's profile can be learned from user feedback (history of the search queries or self reports) on items as well as self-explained features~(filter or conditions on the queries) in one's profile. \nIn this way, input to such recommender systems can be semi-structured such that some attributes are extracted from free-text field while others are directly specified. Many different types of text classification methods, such as decision trees, nearest neighbor methods, Rocchio's algorithm, linear classifiers, probabilistic methods, and Naive Bayes, have been used to model user's preference.\n\n- 🎓 `Content-based recommender systems \u003Chttps:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-319-29659-3_4>`__ Aggarwal, Charu C. 
(2016).\n\n- 🎓 `Content-based recommendation systems \u003Chttps:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-540-72079-9_10>`__ Pazzani, Michael J., and Daniel Billsus.\n\n~~~~~~~~~~~~~~~~~~~~~~\nKnowledge Management\n~~~~~~~~~~~~~~~~~~~~~~\nTextual databases are significant sources of information and knowledge. A large percentage of corporate information (nearly 80%) exists in textual data formats (unstructured). In knowledge distillation, patterns or knowledge are inferred from intermediate forms that can be semi-structured (e.g., a conceptual graph representation) or structured (e.g., a relational data representation). A given intermediate form can be document-based, such that each entity represents an object or concept of interest in a particular domain. Document categorization is one of the most common methods for mining document-based intermediate forms. In other work, text classification has been used to find the relationship between railroad accidents' causes and the corresponding descriptions in reports.\n\n- 🎓 `Text mining: concepts, applications, tools and issues-an overview \u003Chttp:\u002F\u002Fciteseerx.ist.psu.edu\u002Fviewdoc\u002Fdownload?doi=10.1.1.403.2426&rep=rep1&type=pdf>`__ Sumathy, K. L., and M. Chidambaram. (2013).\n\n- 🎓 `Analysis of Railway Accidents' Narratives Using Deep Learning \u003Chttps:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8614260\u002F>`__ Heidarysafa, Mojtaba, et al. (2018).\n\n~~~~~~~~~~~~~~~~~~~~~~\nDocument Summarization\n~~~~~~~~~~~~~~~~~~~~~~\nText classification is also used for document summarization, where the summary of a document may employ words or phrases that do not appear in the original document. Multi-document summarization has likewise become necessary due to the rapid growth of online information, so many researchers approach this task with text classification to extract the important features of a document.\n\n- 🎓 `Advances in automatic text summarization \u003Chttps:\u002F\u002Fbooks.google.com\u002Fbooks?hl=en&lr=&id=YtUZQaKDmzEC&oi=fnd&pg=PA215&dq=Advances+in+automatic+text+summarization&ots=ZpvCsrG-dC&sig=8ecTDTrQR4mMzDnKvI58sowh3Fg>`__ Mani, Inderjeet. \n\n- 🎓 `Improving Multi-Document Summarization via Text Classification. \u003Chttps:\u002F\u002Fwww.aaai.org\u002Focs\u002Findex.php\u002FAAAI\u002FAAAI17\u002Fpaper\u002FviewPaper\u002F14525>`__ Cao, Ziqiang, et al. (2017).\n\n================================\nText Classification Support\n================================\n\n~~~~~~~~~~~~~~~~~~~~~~\nHealth\n~~~~~~~~~~~~~~~~~~~~~~\nMost textual information in the medical domain is presented in an unstructured or narrative form with ambiguous terms and typographical errors. Such information needs to be available instantly throughout the patient-physician encounters in different stages of diagnosis and treatment. Medical coding, which consists of assigning medical diagnoses to specific class values obtained from a large set of categories, is an area of healthcare applications where text classification techniques can be highly valuable. In other research, J. Zhang et al. introduced Patient2Vec to learn an interpretable deep representation of longitudinal electronic health record (EHR) data which is personalized for each patient. Patient2Vec is a novel technique of text dataset feature embedding that can learn a personalized interpretable deep representation of EHR data based on recurrent neural networks and the attention mechanism. 
Text classification has also been applied in the development of Medical Subject Headings (MeSH) and Gene Ontology (GO). \n\n\n- 🎓 `Patient2Vec: A Personalized Interpretable Deep Representation of the Longitudinal Electronic Health Record \u003Chttps:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8490816\u002F>`__ Zhang, Jinghe, et al. (2018)\n\n- 🎓 `Combining Bayesian text classification and shrinkage to automate healthcare coding: A data quality analysis \u003Chttps:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=2063506>`__ Lauría, Eitel JM, and Alan D. March. (2011).\n\n- 🎓 `MeSH Up: effective MeSH text classification for improved document retrieval \u003Chttps:\u002F\u002Facademic.oup.com\u002Fbioinformatics\u002Farticle-abstract\u002F25\u002F11\u002F1412\u002F333120>`__ Trieschnigg, Dolf, et al.\n\n~~~~~~~~~~~~~~~~~~~~~~\nSocial Sciences\n~~~~~~~~~~~~~~~~~~~~~~\nText classification and document categorization have increasingly been applied to understanding human behavior in recent decades. Recent data-driven efforts in human behavior research have focused on mining language contained in informal notes and text datasets, including short message service (SMS), clinical notes, social media, etc. These studies have mostly focused on using approaches based on frequencies of word occurrence (i.e. how often a word appears in a document) or features based on Linguistic Inquiry Word Count (LIWC), a well-validated lexicon of categories of words with psychological relevance.\n\n- 🎓 `Identification of imminent suicide risk among young adults using text messages \u003Chttps:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=3173987>`__ Nobles, Alicia L., et al. (2018).\n\n- 🎓 `Textual Emotion Classification: An Interoperability Study on Cross-Genre Data Sets \u003Chttps:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-319-63004-5_21>`__ Ofoghi, Bahadorreza, and Karin Verspoor. (2017).\n\n- 🎓 `Social Monitoring for Public Health \u003Chttps:\u002F\u002Fwww.morganclaypool.com\u002Fdoi\u002Fabs\u002F10.2200\u002FS00791ED1V01Y201707ICR060>`__ Paul, Michael J., and Mark Dredze (2017).\n\n~~~~~~~~~~~~~~~~~~~~~~\nBusiness and Marketing\n~~~~~~~~~~~~~~~~~~~~~~\nProfitable companies and organizations are increasingly using social media for marketing purposes. Opinion mining from social media such as Facebook, Twitter, and so on is a main target for companies seeking to rapidly increase their profits. Text and document classification is a powerful tool for companies to find their customers more easily than ever. \n\n- 🎓 `Opinion mining using ensemble text hidden Markov models for text classification \u003Chttps:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS0957417417304979>`__ Kang, Mangi, Jaelim Ahn, and Kichun Lee. (2018).\n\n- 🎓 `Classifying business marketing messages on Facebook \u003Chttps:\u002F\u002Fwww.researchgate.net\u002Fprofile\u002FBei_Yu2\u002Fpublication\u002F236246670_Classifying_Business_Marketing_Messages_on_Facebook\u002Flinks\u002F56bcb34408ae6cc737c6335b.pdf>`__ Yu, Bei, and Linchi Kwok.\n\n~~~~~~~~~~~~~~~~~~~~~~\nLaw\n~~~~~~~~~~~~~~~~~~~~~~\nHuge volumes of legal text information and documents have been generated by governmental institutions. 
Retrieving this information and automatically classifying it can not only help lawyers but also their clients.\nIn the United States, the law is derived from five sources: constitutional law, statutory law, treaties, administrative regulations, and the common law. Also, many new legal documents are created each year. Categorization of these documents is the main challenge of the lawyer community.\n\n- 🎓 `Represent yourself in court: How to prepare & try a winning case \u003Chttps:\u002F\u002Fbooks.google.com\u002Fbooks?hl=en&lr=&id=-lodDQAAQBAJ&oi=fnd&pg=PP1&dq=Represent+yourself+in+court:+How+to+prepare+%5C%26+try+a+winning+case&ots=tgJ8Q2MkH_&sig=9o3ILDn3LfO30BZKsyI2Ou7Q8Qs>`__ Bergman, Paul, and Sara J. Berman. (2016)\n\n- 🎓 `Text retrieval in the legal world \u003Chttps:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002FBF00877694>`__ Turtle, Howard.\n\n==========\nCitations:\n==========\n\n----\n\n.. code::\n\n    @ARTICLE{Kowsari2018Text_Classification,\n        title={Text Classification Algorithms: A Survey},\n        author={Kowsari, Kamran and Jafari Meimandi, Kiana and Heidarysafa, Mojtaba and Mendu, Sanjana and Barnes, Laura E. and Brown, Donald E.},\n        journal={Information},\n        VOLUME = {10},  \n        YEAR = {2019},\n        NUMBER = {4},\n        ARTICLE-NUMBER = {150},\n        URL = {http:\u002F\u002Fwww.mdpi.com\u002F2078-2489\u002F10\u002F4\u002F150},\n        ISSN = {2078-2489},\n        publisher={Multidisciplinary Digital Publishing Institute}\n    }\n\n.. |RMDL| image:: docs\u002Fpic\u002FRMDL.jpg\n.. |line| image:: docs\u002Fpic\u002Fline.png\n          :alt: Foo\n.. |HDLTex| image:: docs\u002Fpic\u002FHDLTex.png\n\n\n.. |twitter| image:: https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl\u002Fhttp\u002Fshields.io.svg?style=social\n    :target: https:\u002F\u002Ftwitter.com\u002Fintent\u002Ftweet?text=Text%20Classification%20Algorithms:%20A%20Survey%0aGitHub:&url=https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification&hashtags=Text_Classification,classification,MachineLearning,Categorization,NLP,NATURAL,LANGUAGE,PROCESSING\n    \n.. |contributions-welcome| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcontributions-welcome-brightgreen.svg?style=flat\n    :target: https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fpulls\n.. |ansicolortags| image:: https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fl\u002Fansicolortags.svg\n      :target: https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fblob\u002Fmaster\u002FLICENSE\n.. |contributors| image:: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002Fkk7nc\u002FText_Classification.svg\n      :target: https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fgraphs\u002Fcontributors \n\n.. |arXiv| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-1904.08067-red.svg?style=flat\n   :target: https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.08067\n   \n.. |DOI| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDOI-10.3390\u002Finfo10040150-blue.svg?style=flat\n   :target: https:\u002F\u002Fdoi.org\u002F10.3390\u002Finfo10040150\n   \n   \n.. |medium| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMedium-Text%20Classification-blueviolet.svg\n    :target: https:\u002F\u002Fmedium.com\u002Ftext-classification-algorithms\u002Ftext-classification-algorithms-a-survey-a215b7ab7e2d\n\n.. 
|UniversityCube| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FUniversityCube-Follow%20us%20for%20the%20Latest%20News!-blue.svg\n    :target: https:\u002F\u002Fwww.universitycube.net\u002Fnews\n\n\n.. |mendeley| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMendeley-Add%20to%20Library-critical.svg\n    :target: https:\u002F\u002Fwww.mendeley.com\u002Fimport\u002F?url=https:\u002F\u002Fdoi.org\u002F10.3390\u002Finfo10040150\n    \n.. |Best| image::     https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAward-Best%20Paper%20Award%202019-brightgreen\n    :target: https:\u002F\u002Fwww.mdpi.com\u002Fjournal\u002Finformation\u002Fawards\n       \n.. |BPW| image:: docs\u002Fpic\u002FBPW.png\n    :target: https:\u002F\u002Fwww.mdpi.com\u002Fjournal\u002Finformation\u002Fawards\n","################################################\n文本分类算法：综述\n################################################\n\n|UniversityCube| |DOI| |Best| |medium| |mendeley| |contributions-welcome| |arXiv| |ansicolortags| |contributors| |twitter|\n  \n  \n.. figure:: docs\u002Fpic\u002FWordArt.png \n \n \n 引用论文 : `文本分类算法：综述 \u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F1904.08067>`__    \n \n|BPW|  \n\n\n\n##################\n目录\n##################\n.. contents::\n  :local:\n  :depth: 4\n\n============\n简介\n============\n\n.. figure:: docs\u002Fpic\u002FOverviewTextClassification.png \n \n    \n    \n====================================\n文本与文档特征提取\n====================================\n\n----\n\n\n对于分类算法而言，文本特征提取和预处理非常重要。在本节中，我们开始讨论文本清洗，因为大多数文档包含大量噪声。在这一部分，我们讨论两种主要的文本特征提取方法——词嵌入（word embedding）和加权词（weighted word）。\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n文本清洗与预处理\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n在自然语言处理（Natural Language Processing, NLP）中，大多数文本和文档包含许多对文本分类冗余的词汇，例如停用词（stopwords）、拼写错误、俚语等。在本节中，我们将简要解释一些用于文本清洗和预处理文本文档的技术和方法。在许多算法中，如统计和概率学习方法，噪声和不必要的特征会对整体性能产生负面影响。因此，消除这些特征极其重要。\n\n\n-------------\n分词\n-------------\n\n分词（Tokenization）是将文本流分解为单词、短语、符号或任何其他称为 token 的有意义元素的过程。这一步的主要目标是提取句子中的单个单词。除了文本分类外，在文本挖掘中，有必要在管道（pipeline）中集成一个解析器（parser）来执行文档的分词；例如：\n\n句子：\n\n.. code::\n\n  After sleeping for four hours, he decided to sleep for another four\n\n\n在这种情况下，tokens 如下所示：\n\n.. code::\n\n    {'After', 'sleeping', 'for', 'four', 'hours', 'he', 'decided', 'to', 'sleep', 'for', 'another', 'four'}\n\n\n以下是 Python 代码用于分词：\n\n.. code:: python\n\n  from nltk.tokenize import word_tokenize\n  text = \"After sleeping for four hours, he decided to sleep for another four\"\n  tokens = word_tokenize(text)\n  print(tokens)\n\n-----------\n停用词\n-----------\n\n\n社交媒体上的文本和文档分类，如 Twitter、Facebook 等，通常受到文本语料库噪声性质（缩写、不规则形式）的影响。\n\n这是来自 `geeksforgeeks \u003Chttps:\u002F\u002Fwww.geeksforgeeks.org\u002Fremoving-stop-words-nltk-python\u002F>`__ 的一个示例。\n\n.. code:: python\n\n  from nltk.corpus import stopwords\n  from nltk.tokenize import word_tokenize\n\n  example_sent = \"This is a sample sentence, showing off the stop words filtration.\"\n\n  stop_words = set(stopwords.words('english'))\n\n  word_tokens = word_tokenize(example_sent)\n\n  filtered_sentence = [w for w in word_tokens if not w in stop_words]\n\n  filtered_sentence = []\n\n  for w in word_tokens:\n      if w not in stop_words:\n          filtered_sentence.append(w)\n\n  print(word_tokens)\n  print(filtered_sentence)\n\n\n\n输出：\n\n.. 
code:: python \n\n  ['This', 'is', 'a', 'sample', 'sentence', ',', 'showing', \n  'off', 'the', 'stop', 'words', 'filtration', '.']\n  ['This', 'sample', 'sentence', ',', 'showing', 'stop',\n  'words', 'filtration', '.']\n\n\n---------------\n大小写转换\n---------------\n\n句子可以包含大写字母和小写字母的混合。多个句子组成一个文本文档。为了减少问题空间，最常见的方法是将所有内容转换为小写。这将文档中的所有单词置于同一空间中，但通常会改变某些单词的含义，例如将 \"US\" 变为 \"us\"，前者代表美利坚合众国，后者是代词。为了解决这个问题，可以应用俚语和缩写转换器。\n\n.. code:: python\n\n  text = \"The United States of America (USA) or America, is a federal republic composed of 50 states\"\n  print(text)\n  print(text.lower())\n\n输出：\n\n.. code:: python\n\n  \"The United States of America (USA) or America, is a federal republic composed of 50 states\"\n  \"the united states of america (usa) or america, is a federal republic composed of 50 states\"\n\n-----------------------\n俚语与缩写\n-----------------------\n\n俚语和缩写在执行预处理步骤时可能会导致问题。缩写是单词的缩短形式，例如 SVM 代表支持向量机（Support Vector Machine）。俚语是一种描绘非正式对话或具有不同含义文本的语言版本，例如 \"lost the plot\"，它本质上意味着“他们疯了”。处理这些词的常见方法是将其转换为正式语言。\n\n---------------\n噪声去除\n---------------\n\n\n作为预处理步骤的文本清洗的另一个问题是噪声去除。文本文档通常包含标点符号或特殊字符等字符，它们对于文本挖掘或分类目的来说不是必需的。虽然标点符号对于理解句子的含义至关重要，但它可能会负面地影响分类算法。\n\n\n以下是从文本中删除标准噪声的简单代码：\n\n\n.. code:: python\n\n  import re\n\n  def text_cleaner(text):\n      rules = [\n          {r'>\\s+': u'>'},  # remove spaces after a tag opens or closes\n          {r'\\s+': u' '},  # replace consecutive spaces\n          {r'\\s*\u003Cbr\\s*\u002F?>\\s*': u'\\n'},  # newline after a \u003Cbr>\n          {r'\u003C\u002F(div)\\s*>\\s*': u'\\n'},  # newline after \u003C\u002Fp> and \u003C\u002Fdiv> and \u003Ch1\u002F>...\n          {r'\u003C\u002F(p|h\\d)\\s*>\\s*': u'\\n\\n'},  # newline after \u003C\u002Fp> and \u003C\u002Fdiv> and \u003Ch1\u002F>...\n          {r'\u003Chead>.*\u003C\\s*(\u002Fhead|body)[^>]*>': u''},  # remove \u003Chead> to \u003C\u002Fhead>\n          {r'\u003Ca\\s+href=\"([^\"]+)\"[^>]*>.*\u003C\u002Fa>': r'\\1'},  # show links instead of texts\n          {r'[ \\t]*\u003C[^\u003C]*?\u002F?>': u''},  # remove remaining tags\n          {r'^\\s+': u''}  # remove spaces at the beginning\n      ]\n      for rule in rules:\n          for (k, v) in rule.items():\n              regex = re.compile(k)\n              text = regex.sub(v, text)\n      text = text.rstrip()\n      return text.lower()\n\n\n-------------------\n拼写纠正\n-------------------\n\n\n预处理步骤的一个可选部分是纠正拼写错误的单词。不同的技术已被引入来解决这个问题，例如基于哈希和上下文敏感的拼写纠正技术，或使用 Trie（字典树）和 Damerau-Levenshtein 距离二元组（bigram）的拼写纠正。\n\n\n.. code:: python\n\n  # 旧版 autocorrect 提供 spell 函数（较新版本改用 Speller 类）\n  from autocorrect import spell\n\n  print(spell('caaaar'))\n  print(spell(u'mussage'))\n  print(spell(u'survice'))\n  print(spell(u'hte'))\n\n结果：\n\n.. code::\n\n    caesar\n    message\n    service\n    the\n\n\n--------------------\n词干提取 (Stemming)\n--------------------\n\n\n文本词干提取 (Text Stemming) 是通过不同的语言过程（如附加法，即词缀的添加）修改单词以获得其变体的过程。例如，单词 \"studying\" 的词干是 \"study\"，其后附加了后缀 -ing。\n\n这是来自 `NLTK \u003Chttps:\u002F\u002Fpythonprogramming.net\u002Fstemming-nltk-tutorial\u002F>`__ 的一个词干提取示例。\n\n.. code:: python\n\n    from nltk.stem import PorterStemmer\n    from nltk.tokenize import sent_tokenize, word_tokenize\n\n    ps = PorterStemmer()\n\n    example_words = [\"python\",\"pythoner\",\"pythoning\",\"pythoned\",\"pythonly\"]\n    \n    for w in example_words:\n        print(ps.stem(w))\n\n\n结果：\n\n.. code::\n\n  python\n  python\n  python\n  python\n  pythonli\n\n--------------------------\n词形还原 (Lemmatization)\n--------------------------\n\n\n文本词形还原 (Text lemmatization) 是消除单词的冗余前缀或后缀并提取基础词（lemma）的过程。\n\n\n.. 
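\n\n提示：WordNetLemmatizer 依赖 WordNet 语料，首次使用前通常需要先执行 nltk.download('wordnet')。下面是一个最小示例：\n\n.. 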
code:: python\n\n  from nltk.stem import WordNetLemmatizer\n\n  lemmatizer = WordNetLemmatizer()\n\n  print(lemmatizer.lemmatize(\"cats\"))\n\n~~~~~~~~~~~~~~\n词嵌入 (Word Embedding)\n~~~~~~~~~~~~~~\n\n不同的词嵌入 (Word Embedding) 过程已被提出，用于将这些一元词转换为机器学习算法可消费的输入。执行此类嵌入的一种非常简单的方法是词频~(TF)，其中每个单词将被映射到一个数字，对应于该单词在整个语料库中出现的次数。其他词频函数也被使用，将词频表示为布尔值或对数缩放数字。在这里，每个文档将被转换为包含该文档中单词频率的相同长度的向量。虽然这种方法可能看起来很直观，但它存在一个问题，即语言文献中使用非常频繁的特定单词可能会主导这种词表示。\n\n.. image:: docs\u002Fpic\u002FCBOW.png\n\n\n--------\nWord2Vec\n--------\n\n原始来源 https:\u002F\u002Fcode.google.com\u002Fp\u002Fword2vec\u002F\n\n我已将其复制到一个 GitHub 项目中，以便我可以应用和跟踪社区补丁（包括支持 Mac OS X 编译的功能）。\n\n-  **makefile 和一些源代码已针对 Mac OS X 编译进行了修改**。参见\n   https:\u002F\u002Fcode.google.com\u002Fp\u002Fword2vec\u002Fissues\u002Fdetail?id=1#c5\n-  **已应用 word2vec 的内存补丁**。参见\n   https:\u002F\u002Fcode.google.com\u002Fp\u002Fword2vec\u002Fissues\u002Fdetail?id=2\n-  项目文件布局已更改\n\ncompute-accuracy 实用程序中似乎存在段错误 (segfault)。\n\n开始操作：\n\n::\n\n   cd scripts && .\u002Fdemo-word.sh\n\n以下是原始 README 文本：\n\n此工具提供了用于计算词向量表示的连续词袋模型 (Continuous Bag-of-Words) 和跳字模型 (Skip-gram) 架构的高效实现。这些表示随后可以用于许多自然语言处理应用程序以及进一步的研究目的。\n\n此代码提供了连续词袋模型 (CBOW) 和跳字模型 (SG) 的实现，以及几个演示脚本。\n\n给定一个文本语料库，word2vec 工具使用连续词袋模型或跳字模型神经网络架构为词汇表中的每个单词学习一个向量。用户应指定以下内容：所需的向量维度（跳字模型或连续词袋模型的上下文窗口大小），训练算法（层次 Softmax 和\u002F或负采样），对频繁词进行下采样的阈值，使用的线程数，输出词向量文件的格式（文本或二进制）。\n\n通常，其他超参数（如学习率）不需要针对不同训练集进行调整。\n\n脚本 demo-word.sh 从网络下载一个小型（100MB）文本语料库，并训练一个小型词向量模型。训练完成后，用户可以交互式地探索单词的相似度。\n\n有关脚本的更多信息请参见\nhttps:\u002F\u002Fcode.google.com\u002Fp\u002Fword2vec\u002F\n\n\n----------------------------------------------\n词的全局向量表示 (Global Vectors for Word Representation \u002F GloVe)\n----------------------------------------------\n\n.. image:: \u002Fdocs\u002Fpic\u002FGlove.PNG\n\n提供了 GloVe 模型的学习词表示的实现，并描述了如何下载网络数据集向量或自行训练。有关 GloVe 向量的更多信息，请参阅 `项目页面 \u003Chttp:\u002F\u002Fnlp.stanford.edu\u002Fprojects\u002Fglove\u002F>`__ 或 `论文 \u003Chttp:\u002F\u002Fnlp.stanford.edu\u002Fpubs\u002Fglove.pdf>`__。\n\n\n------------------------------------\n上下文词表示 (Contextualized Word Representations)\n------------------------------------\n\nELMo 是一种深度上下文词表示，它建模了 (1) 单词使用的复杂特征（例如句法和语义），以及 (2) 这些用法如何在不同语言环境中变化（即建模多义性）。这些词向量是深度双向语言模型 (biLM) 内部状态的学习函数，该模型已在大型文本语料库上预训练。它们可以轻松添加到现有模型中，并在广泛的具有挑战性的自然语言处理问题（包括问答、文本蕴含和情感分析）中显著提高技术水平。\n\n**ELMo 表示具有以下特点：**\n\n-  **上下文相关：** 每个单词的表示取决于其使用的整个上下文。\n-  **深度：** 词表示结合了深度预训练神经网络的所有层。\n-  **基于字符：** ELMo 表示完全基于字符，允许网络利用形态学线索为训练时未见过的未登录词 (out-of-vocabulary tokens) 形成鲁棒的表示。\n\n\n**Tensorflow 实现**\n\n用于从 `\"Deep contextualized word representations\" \u003Chttp:\u002F\u002Farxiv.org\u002Fabs\u002F1802.05365>`__ 计算 ELMo 表示的预训练 biLM 的 Tensorflow 实现。\n\n此存储库支持训练 biLM 和使用预训练模型进行预测。\n\n我们还在 `AllenNLP \u003Chttp:\u002F\u002Fallennlp.org\u002F>`__ 中提供了 PyTorch 实现版本。\n\n如果您只想进行预测，使用 `Tensorflow Hub \u003Chttps:\u002F\u002Fwww.tensorflow.org\u002Fhub\u002Fmodules\u002Fgoogle\u002Felmo\u002F2>`__ 提供的版本可能会更容易。\n\n**预训练模型：**\n\n我们有几个可用的预训练英语语言 biLM 模型供使用。每个模型由两个单独的文件指定，一个是包含超参数的 JSON 格式“选项”文件，另一个是包含模型权重的 hdf5 格式文件。预训练模型的链接可在 `此处 \u003Chttps:\u002F\u002Fallennlp.org\u002Felmo>`__ 找到。\n\n根据您的用例，有三种方式将 ELMo 表示集成到下游任务中。\n\n1. 使用字符输入即时从原始文本计算表示。这是最通用的方法，能够处理任何输入文本。但其计算成本最高。\n2. 预计算并缓存与上下文无关的 token 表示，随后使用 biLSTMs（双向长短期记忆网络）为输入数据计算上下文相关的表示。此方法的计算成本低于 #1，但仅适用于固定且指定的词汇表。\n3. 
预计算整个数据集的表示并保存至文件。\n\n我们过去在各种用例中都使用了所有这些方法。#1 对于在测试阶段评估未见过的数据是必要的（例如公共 SQuAD leaderboard）。#2 对于大型数据集是一个很好的折衷方案，此时文件大小不可行（SNLI, SQuAD）。#3 对于较小的数据集，或者您希望在其他框架中使用 ELMo（语言模型嵌入）的情况，是一个不错的选择。\n\n在所有情况下，流程大致遵循相同的步骤。首先，创建一个 ``Batcher``（对于 #2 使用 ``TokenBatcher``）将分词后的字符串转换为字符（或 token）id 的 numpy 数组。然后，加载预训练的 ELMo 模型（类 ``BidirectionalLanguageModel``）。最后，对于步骤 #1 和 #2，使用 ``weight_layers`` 计算最终的 ELMo 表示。对于 #3，使用 ``BidirectionalLanguageModel`` 将所有中间层写入文件。\n\n\n\n.. figure:: docs\u002Fpic\u002Fngram_cnn_highway_1.png \n应用于示例句子的语言模型架构 [参考：`arXiv paper \u003Chttps:\u002F\u002Farxiv.org\u002Fpdf\u002F1508.06615.pdf>`__]. \n\n\n.. figure:: docs\u002Fpic\u002FGlove_VS_DCWE.png \n\n--------\nFastText\n--------\n\n.. figure:: docs\u002Fpic\u002Ffasttext-logo-color-web.png\n\nfastText 是一个用于高效学习单词表示和句子分类的库。\n\n**Github:**  `facebookresearch\u002FfastText \u003Chttps:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText>`__\n\n**Models**\n\n- 最新的 `英语单词向量 \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fenglish-vectors.html>`__。\n- 针对 `157 种语言的单词向量，基于维基百科和网络爬取数据训练 \u003Chttps:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FfastText\u002Fblob\u002Fmaster\u002Fdocs\u002Fcrawl-vectors.md>`__。\n- 用于 `语言识别 \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Flanguage-identification.html#content>`__ 和 `各种监督任务 \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fsupervised-models.html#content>`__ 的模型。\n\n**Supplementary data :**\n\n\n- 预处理过的 `YFCC100M 数据 \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fdataset.html#content>`__ .\n\n**FAQ**\n\n您可以在其项目 `网站 \u003Chttps:\u002F\u002Ffasttext.cc\u002F>`__ 上找到 `常见问题解答 \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Ffaqs.html#content>`__。\n\n**Cheatsheet**\n\n此外，还提供了充满实用单行代码的 `速查表 \u003Chttps:\u002F\u002Ffasttext.cc\u002Fdocs\u002Fen\u002Fcheatsheet.html#content>`__。\n\n\n\n~~~~~~~~~~~~~~\nWeighted Words\n~~~~~~~~~~~~~~\n\n\n--------------\nTerm frequency\n--------------\n\n词频是“词袋模型”（Bag of words），这是一种最简单的文本特征提取技术之一。该方法基于统计每个文档中的单词数量并将其分配给特征空间。\n\n\n-----------------------------------------\nTerm Frequency-Inverse Document Frequency\n-----------------------------------------\nTf-idf 给出的文档中术语权重的数学表示如下：\n\n.. image:: docs\u002Feq\u002Ftf-idf.gif\n   :width: 10px\n   \n其中 N 是文档数量，df(t) 是语料库中包含术语 t 的文档数量。第一部分可以提高召回率，而第二部分可以提高词嵌入（word embedding）的精确度。尽管 Tf-idf 试图克服文档中常见术语的问题，但它仍然受到一些其他描述性限制的影响。即，Tf-idf 无法解释文档中单词之间的相似性，因为每个单词都作为一个索引呈现。近年来，随着更复杂模型（如神经网络）的发展，出现了新的方法，可以整合诸如单词相似性和词性标注（part of speech tagging）等概念。本工作使用了 word2vec 和 GloVe，这两种是最常用于深度学习技术的常见方法。\n\n\n.. 
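\n\n举一个简单的数值例子（假设使用以 10 为底的对数）：若语料库共有 N = 1000 篇文档，词 t 出现在 df(t) = 10 篇文档中，且在某篇文档中出现 5 次，则该词在这篇文档中的 tf-idf 权重为 5 × log(1000\u002F10) = 5 × 2 = 10。下面的代码使用 scikit-learn 的 TfidfVectorizer 提取 tf-idf 特征：\n\n.. 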
code:: python\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    def loadData(X_train, X_test,MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\",str(np.array(X_train).shape[1]),\"features\")\n        return (X_train,X_test)\n   \n   \n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nComparison of Feature Extraction Techniques\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|                **模型**               |                                                                        **优势**                                                                    |                                                   **局限性**                                               |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|            **加权词**                 |  * 易于计算                                                                                                                                       |  * 无法捕捉文本中的位置信息 (syntactic \u002F 句法)                                                    |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * 使用该方法易于计算两个文档之间的相似度                                                                                           |  * 无法捕捉文本中的含义 (semantics \u002F 语义)                                                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * 提取文档中最具描述性术语的基本指标                                                                                      |                                                                                                                |\n|                                       |                                                                                                                                                          |  * 常见词对结果的影响（例如，“am”, “is”等）                                                 |\n|                                       |  * 适用于未知词汇（例如语言中的新词）                                                                                             |                                                                                                                
|\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|            **TF-IDF**                 |  * 易于计算                                                                                                                                       |  * 无法捕捉文本中的位置信息 (syntactic \u002F 句法)                                                    |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * 使用该方法易于计算两个文档之间的相似度                                                                                           |  * 无法捕捉文本中的含义 (semantics \u002F 语义)                                                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * 提取文档中最具描述性术语的基本指标                                                                                      |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * 由于 IDF，常见词不会影响结果（例如，“am”, “is”等）                                                                            |                                                                                                                
|\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|               **Word2Vec**            |  * 捕捉文本中单词的位置 (syntactic \u002F 句法)                                                                                         |  * 无法从文本中捕捉单词的含义（无法捕捉 polysemy \u002F 多义性）                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * 捕捉单词中的含义 (semantics \u002F 语义)                                                                                                          |  * 无法捕捉 corpus (语料库) 中的未登录词                                                       |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|         **GloVe (预训练)**       |  * 捕捉文本中单词的位置 (syntactic \u002F 句法)                                                                                         |  * 无法从文本中捕捉单词的含义（无法捕捉 polysemy \u002F 多义性）                        |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * 捕捉单词中的含义 (semantics \u002F 语义)                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * 存储内存消耗                                                                              |\n|                                       |  * 在大型 corpus (语料库) 上训练                                                                                                                                |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * 无法捕捉 corpus (语料库) 中的未登录词                                                       
|\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|           **GloVe (训练)**         |  * 非常直接，例如，强制词向量捕捉向量空间中的亚线性关系（表现优于 Word2Vec） |  * 存储内存消耗                                                                              |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |  * 高频词对的权重较低，如 stop words (停用词) 如“am”, “is”等。不会主导训练进程                             |  * 需要大型 corpus (语料库) 进行学习                                                                                  |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * 无法捕捉 corpus (语料库) 中的未登录词                                                   |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * 无法从文本中捕捉单词的含义（无法捕捉 polysemy \u002F 多义性）                        |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|               **FastText**            |  * 适用于稀有词（其字符 n-grams (n-gram) 与其他词共享）                                                         |  * 无法从文本中捕捉单词的含义（无法捕捉 polysemy \u002F 多义性）                         |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * 存储内存消耗                                                                              |\n|                                       |  * 通过字符级别的 n-gram 解决未登录词问题                                                                                         |                                                                              
                                  |\n|                                       |                                                                                                                                                          |  * 与 GloVe 和 Word2Vec 相比，计算成本更高                                      |\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n|**上下文相关的词表示**|  * 从文本中捕捉单词的含义（结合上下文，处理 polysemy \u002F 多义性）                                                           |  * 存储内存消耗                                                                              |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * 显著改善 downstream tasks (下游任务) 的性能。与其他方法相比计算成本更高 |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * 需要为所有 LSTM 和 feedforward layers (前馈层) 提供另一个词嵌入                                            |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * 无法捕捉 corpus (语料库) 中的未登录词                                                     |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |                                                                                                                |\n|                                       |                                                                                                                                                          |  * 仅适用于句子和文档级别（无法用于单个词级别）                           
|\n+---------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+----------------------------------------------------------------------------------------------------------------+\n\n========================\n降维\n========================\n\n----\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n主成分分析 (PCA)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n主成分分析（Principal Component Analysis，简称 PCA）是多变量分析和降维中最流行的技术。PCA 是一种识别数据近似所在的子空间的方法。这意味着找到不相关的新变量，并最大化方差以尽可能保留更多的变异性。\n\n\n在文本数据集（20newsgroups）上对 tf-idf（词频 - 逆文档频率，具有 75000 个特征）进行 PCA 降至 2000 个分量的示例：\n\n.. code:: python\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n\n    def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n        return (X_train, X_test)\n\n\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train,X_test = TFIDF(X_train,X_test)\n\n    from sklearn.decomposition import PCA\n    pca = PCA(n_components=2000)\n    X_train_new = pca.fit_transform(X_train)\n    X_test_new = pca.transform(X_test)\n\n    print(\"train with old features: \",np.array(X_train).shape)\n    print(\"train with new features:\" ,np.array(X_train_new).shape)\n    \n    print(\"test with old features: \",np.array(X_test).shape)\n    print(\"test with new features:\" ,np.array(X_test_new).shape)\n\n输出：\n\n.. code:: python\n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 2000)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 2000)\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n线性判别分析 (LDA)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n线性判别分析（Linear Discriminant Analysis，简称 LDA）是另一种常用于数据分类和降维的技术。当类内频率不相等时，LDA 特别有用，并且其性能已在随机生成的测试数据上进行了评估。类依赖和类独立转换是 LDA 中的两种方法，分别使用类间方差与类内方差的比率以及总体方差与类内方差的比率。 \n\n\n\n.. 
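\n\n补充说明：在 scikit-learn 的 LinearDiscriminantAnalysis 中，n_components 不能超过 min(类别数 - 1, 特征数)；对于包含 20 个类别的 20newsgroups 数据集，上限为 19，下面的示例选取 15 个判别分量。\n\n.. 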
code:: python\n\n\n  from sklearn.feature_extraction.text import TfidfVectorizer\n  import numpy as np\n  from sklearn.discriminant_analysis import LinearDiscriminantAnalysis\n\n\n  def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n      vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n      X_train = vectorizer_x.fit_transform(X_train).toarray()\n      X_test = vectorizer_x.transform(X_test).toarray()\n      print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n      return (X_train, X_test)\n\n\n  from sklearn.datasets import fetch_20newsgroups\n\n  newsgroups_train = fetch_20newsgroups(subset='train')\n  newsgroups_test = fetch_20newsgroups(subset='test')\n  X_train = newsgroups_train.data\n  X_test = newsgroups_test.data\n  y_train = newsgroups_train.target\n  y_test = newsgroups_test.target\n\n  X_train,X_test = TFIDF(X_train,X_test)\n\n\n\n  LDA = LinearDiscriminantAnalysis(n_components=15)\n  LDA.fit(X_train, y_train)\n  X_train_new = LDA.transform(X_train)\n  X_test_new = LDA.transform(X_test)\n\n  print(\"train with old features: \",np.array(X_train).shape)\n  print(\"train with new features:\" ,np.array(X_train_new).shape)\n\n  print(\"test with old features: \",np.array(X_test).shape)\n  print(\"test with new features:\" ,np.array(X_test_new).shape)\n\n\n输出：\n\n.. code:: \n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 15)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 15)\n    \n    \n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n非负矩阵分解 (NMF)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n.. code:: python\n\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn.decomposition import NMF\n\n\n    def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n        return (X_train, X_test)\n\n\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train,X_test = TFIDF(X_train,X_test)\n\n\n\n    NMF_ = NMF(n_components=2000)\n    NMF_.fit(X_train)\n    X_train_new = NMF_.transform(X_train)\n    X_test_new = NMF_.transform(X_test)\n\n    print(\"train with old features: \",np.array(X_train).shape)\n    print(\"train with new features:\" ,np.array(X_train_new).shape)\n\n    print(\"test with old features: \",np.array(X_test).shape)\n    print(\"test with new features:\" ,np.array(X_test_new).shape)\n\n输出：\n\n.. code:: \n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 2000)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 2000)\n    \n    \n\n~~~~~~~~~~~~~~~~~\n随机投影\n~~~~~~~~~~~~~~~~~\n随机投影（Random Projection）或随机特征是一种主要用于超大规模数据集或非常高维特征空间的降维技术。文本和文档，特别是带有加权特征提取时，可能包含大量的潜在特征。\n许多研究人员针对文本数据挖掘、文本分类和\u002F或降维应用了随机投影技术。\n下面我们回顾一些随机投影技术。 \n\n\n.. image:: docs\u002Fpic\u002FRandom%20Projection.png\n\n.. 
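\n\n除了下面示例中使用的 GaussianRandomProjection 之外，sklearn.random_projection 模块还提供 SparseRandomProjection，后者在超高维稀疏特征上通常更节省内存。\n\n.. 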
code:: python\n\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n\n    def TFIDF(X_train, X_test, MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\", str(np.array(X_train).shape[1]), \"features\")\n        return (X_train, X_test)\n\n\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train,X_test = TFIDF(X_train,X_test)\n\n    from sklearn import random_projection\n\n    RandomProjection = random_projection.GaussianRandomProjection(n_components=2000)\n    X_train_new = RandomProjection.fit_transform(X_train)\n    X_test_new = RandomProjection.transform(X_test)\n\n    print(\"train with old features: \",np.array(X_train).shape)\n    print(\"train with new features:\" ,np.array(X_train_new).shape)\n\n    print(\"test with old features: \",np.array(X_test).shape)\n    print(\"test with new features:\" ,np.array(X_test_new).shape)\n\n输出：\n\n.. code:: python\n\n    tf-idf with 75000 features\n    train with old features:  (11314, 75000)\n    train with new features: (11314, 2000)\n    test with old features:  (7532, 75000)\n    test with new features: (7532, 2000)\n    \n~~~~~~~~~~~~~~~~~~~~~~~~~\nAutoencoder（自编码器）\n~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\nAutoencoder（自编码器）是一种神经网络技术，旨在尝试将其输入映射到其输出。作为降维方法，Autoencoder 通过神经网络的强大表示能力取得了巨大成功。主要思想是，在输入层和输出层之间使用一个神经元较少的隐藏层来降低特征空间的维度。特别是对于包含许多特征的文本、文档和序列，Autoencoder 可以帮助更快、更高效地处理数据。\n\n\n.. image:: docs\u002Fpic\u002FAutoencoder.png\n\n\n\n.. code:: python\n\n  from keras.layers import Input, Dense\n  from keras.models import Model\n\n  # this is the size of our encoded representations\n  encoding_dim = 1500\n\n  # n is the dimensionality of the input features\n  # (e.g. the number of tf-idf features used above)\n  n = 75000\n\n  # this is our input placeholder\n  input_layer = Input(shape=(n,))\n  # \"encoded\" is the encoded representation of the input\n  encoded = Dense(encoding_dim, activation='relu')(input_layer)\n  # \"decoded\" is the lossy reconstruction of the input\n  decoded = Dense(n, activation='sigmoid')(encoded)\n\n  # this model maps an input to its reconstruction\n  autoencoder = Model(input_layer, decoded)\n\n  # this model maps an input to its encoded representation\n  encoder = Model(input_layer, encoded)\n\n\n  encoded_input = Input(shape=(encoding_dim,))\n  # retrieve the last layer of the autoencoder model\n  decoder_layer = autoencoder.layers[-1]\n  # create the decoder model\n  decoder = Model(encoded_input, decoder_layer(encoded_input))\n\n  autoencoder.compile(optimizer='adadelta', loss='binary_crossentropy')\n\n\n训练模型（输入与重构目标均为 x_train）：\n\n\n.. code:: python\n\n  autoencoder.fit(x_train, x_train,\n                  epochs=50,\n                  batch_size=256,\n                  shuffle=True,\n                  validation_data=(x_test, x_test))\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nT-分布随机邻域嵌入 (T-SNE)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n\nT-分布随机邻域嵌入 (T-SNE) 是一种用于嵌入高维数据的非线性降维技术，主要用于在低维空间中进行可视化。该方法基于 `G. Hinton and ST. 
Roweis \u003Chttps:\u002F\u002Fwww.cs.toronto.edu\u002F~fritz\u002Fabsps\u002Fsne.pdf>`__ 。SNE 通过将高维欧氏距离转换为表示相似度的条件概率来工作。\n\n `示例 \u003Chttp:\u002F\u002Fscikit-learn.org\u002Fstable\u002Fmodules\u002Fgenerated\u002Fsklearn.manifold.TSNE.html>`__:\n\n\n.. code:: python\n\n   import numpy as np\n   from sklearn.manifold import TSNE\n   X = np.array([[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 1]])\n   X_embedded = TSNE(n_components=2).fit_transform(X)\n   X_embedded.shape\n\n\nGlove 和 T-SNE 在文本上的示例：\n\n.. image:: docs\u002Fpic\u002FTSNE.png\n\n===============================\n文本分类技术\n===============================\n\n----\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nRocchio 分类法\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nRocchio 算法的第一个版本由 rocchio 于 1971 年引入，用于查询全文数据库时的相关性反馈。此后，许多研究人员针对文本和文档分类解决并开发了该技术。该方法使用每个信息词的 TF-IDF 权重代替一组布尔特征。通过使用文档训练集，Rocchio 算法为每个类别构建一个原型向量，该向量属于特定类别的所有训练文档向量的平均向量。然后，它将每个测试文档分配给与每个原型向量之间具有最大相似度的类别。\n\n\n当在最近的质心分类器中使用时，我们将 tf-idf 向量用作文本分类的输入数据，该分类器被称为 Rocchio 分类器。\n\n.. code:: python\n\n    from sklearn.neighbors.nearest_centroid import NearestCentroid\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', NearestCentroid()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\n\n\n输出：\n\n.. code:: python\n\n                  precision    recall  f1-score   support\n\n              0       0.75      0.49      0.60       319\n              1       0.44      0.76      0.56       389\n              2       0.75      0.68      0.71       394\n              3       0.71      0.59      0.65       392\n              4       0.81      0.71      0.76       385\n              5       0.83      0.66      0.74       395\n              6       0.49      0.88      0.63       390\n              7       0.86      0.76      0.80       396\n              8       0.91      0.86      0.89       398\n              9       0.85      0.79      0.82       397\n             10       0.95      0.80      0.87       399\n             11       0.94      0.66      0.78       396\n             12       0.40      0.70      0.51       393\n             13       0.84      0.49      0.62       396\n             14       0.89      0.72      0.80       394\n             15       0.55      0.73      0.63       398\n             16       0.68      0.76      0.71       364\n             17       0.97      0.70      0.81       376\n             18       0.54      0.53      0.53       310\n             19       0.58      0.39      0.47       251\n\navg \u002F total       0.74      0.69      0.70      7532\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n提升与装袋\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n---------\n提升\n---------\n\n.. 
image:: docs\u002Fpic\u002FBoosting.PNG\n\n\n**Boosting（提升）** 是一种主要用于减少监督学习中方差的集成学习（Ensemble learning）元算法。它基本上是一族将弱学习器转化为强学习器的机器学习算法。提升基于 `Michael Kearns \u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMichael_Kearns_(computer_scientist)>`__ 和 Leslie Valiant (1988, 1989) 提出的问题：一组弱学习器能否创建一个单一的强学习器？弱学习器被定义为一种仅与真实分类略有相关的分类器（它可以比随机猜测更好地标记示例）。相比之下，强学习器是一种与真实分类任意高度相关的分类器。\n\n\n\n\n.. code:: python\n\n  from sklearn.ensemble import GradientBoostingClassifier\n  from sklearn.pipeline import Pipeline\n  from sklearn import metrics\n  from sklearn.feature_extraction.text import CountVectorizer\n  from sklearn.feature_extraction.text import TfidfTransformer\n  from sklearn.datasets import fetch_20newsgroups\n\n  newsgroups_train = fetch_20newsgroups(subset='train')\n  newsgroups_test = fetch_20newsgroups(subset='test')\n  X_train = newsgroups_train.data\n  X_test = newsgroups_test.data\n  y_train = newsgroups_train.target\n  y_test = newsgroups_test.target\n\n  text_clf = Pipeline([('vect', CountVectorizer()),\n                       ('tfidf', TfidfTransformer()),\n                       ('clf', GradientBoostingClassifier(n_estimators=100)),\n                       ])\n\n  text_clf.fit(X_train, y_train)\n\n\n  predicted = text_clf.predict(X_test)\n\n  print(metrics.classification_report(y_test, predicted))\n\n\n输出：\n \n.. code:: python\n\n               precision    recall  f1-score   support\n            0       0.81      0.66      0.73       319\n            1       0.69      0.70      0.69       389\n            2       0.70      0.68      0.69       394\n            3       0.64      0.72      0.68       392\n            4       0.79      0.79      0.79       385\n            5       0.83      0.64      0.72       395\n            6       0.81      0.84      0.82       390\n            7       0.84      0.75      0.79       396\n            8       0.90      0.86      0.88       398\n            9       0.90      0.85      0.88       397\n           10       0.93      0.86      0.90       399\n           11       0.90      0.81      0.85       396\n           12       0.33      0.69      0.45       393\n           13       0.87      0.72      0.79       396\n           14       0.87      0.84      0.85       394\n           15       0.85      0.87      0.86       398\n           16       0.65      0.78      0.71       364\n           17       0.96      0.74      0.84       376\n           18       0.70      0.55      0.62       310\n           19       0.62      0.56      0.59       251\n\n  avg \u002F total       0.78      0.75      0.76      7532\n\n  \n-------\n装袋\n-------\n\n.. image:: docs\u002Fpic\u002FBagging.PNG\n\n\n.. 
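\n\n**Bagging（装袋）** 与提升不同：它在训练集的多个自助采样（bootstrap）子集上分别训练同一种基学习器，再通过投票或平均来聚合各个预测结果，主要用于降低方差。下面的示例沿用前文的 20newsgroups 流水线，将 KNeighborsClassifier 作为基学习器传入 BaggingClassifier：\n\n.. 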
code:: python\n\n    from sklearn.ensemble import BaggingClassifier\n    from sklearn.neighbors import KNeighborsClassifier\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', BaggingClassifier(KNeighborsClassifier())),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\n输出：\n \n.. code:: python\n\n               precision    recall  f1-score   support\n            0       0.57      0.74      0.65       319\n            1       0.60      0.56      0.58       389\n            2       0.62      0.54      0.58       394\n            3       0.54      0.57      0.55       392\n            4       0.63      0.54      0.58       385\n            5       0.68      0.62      0.65       395\n            6       0.55      0.46      0.50       390\n            7       0.77      0.67      0.72       396\n            8       0.79      0.82      0.80       398\n            9       0.74      0.77      0.76       397\n           10       0.81      0.86      0.83       399\n           11       0.74      0.85      0.79       396\n           12       0.67      0.49      0.57       393\n           13       0.78      0.51      0.62       396\n           14       0.76      0.78      0.77       394\n           15       0.71      0.81      0.76       398\n           16       0.73      0.73      0.73       364\n           17       0.64      0.79      0.71       376\n           18       0.45      0.69      0.54       310\n           19       0.61      0.54      0.57       251\n\n  avg \u002F total       0.67      0.67      0.67      7532\n  \n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n朴素贝叶斯分类器\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n朴素贝叶斯文本分类在工业界和学术界已使用很长时间（由 Thomas Bayes 在 1701-1761 年间引入）。然而，该技术自 1950 年代起就被用于文本和文档分类研究。朴素贝叶斯分类器（NBC）是一种生成模型，广泛用于信息检索。许多研究人员针对其应用解决并开发了这项技术。我们从 NBC 的最基础版本开始，该版本通过使用词频（词袋）特征提取技术，通过统计文档中的单词数量来实现。\n\n\n.. code:: python\n\n    from sklearn.naive_bayes import MultinomialNB\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', MultinomialNB()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n \n \n输出：\n \n.. 
code:: python\n\n精确率    召回率  F1 分数   支持度\n\n              0       0.80      0.52      0.63       319\n              1       0.81      0.65      0.72       389\n              2       0.82      0.65      0.73       394\n              3       0.67      0.78      0.72       392\n              4       0.86      0.77      0.81       385\n              5       0.89      0.75      0.82       395\n              6       0.93      0.69      0.80       390\n              7       0.85      0.92      0.88       396\n              8       0.94      0.93      0.93       398\n              9       0.92      0.90      0.91       397\n             10       0.89      0.97      0.93       399\n             11       0.59      0.97      0.74       396\n             12       0.84      0.60      0.70       393\n             13       0.92      0.74      0.82       396\n             14       0.84      0.89      0.87       394\n             15       0.44      0.98      0.61       398\n             16       0.64      0.94      0.76       364\n             17       0.93      0.91      0.92       376\n             18       0.96      0.42      0.58       310\n             19       0.97      0.14      0.24       251\n\n    平均\u002F总计       0.82      0.77      0.77      7532\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nK 近邻算法 (K-nearest Neighbor)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nR\n在机器学习 (machine learning) 中，k 近邻算法 (k-nearest neighbors algorithm, kNN)\n是一种用于分类的非参数技术 (non-parametric technique)。\n该方法在过去几十年的许多研究中被用作自然语言处理 (Natural-language processing, NLP)\n中的文本分类 (text classification) 技术。\n\n.. image:: docs\u002Fpic\u002FKNN.png\n\n.. code:: python\n\n    from sklearn.neighbors import KNeighborsClassifier\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', KNeighborsClassifier()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n输出：\n\n.. 
code:: python\n\n                   precision    recall  f1-score   support\n\n              0       0.43      0.76      0.55       319\n              1       0.50      0.61      0.55       389\n              2       0.56      0.57      0.57       394\n              3       0.53      0.58      0.56       392\n              4       0.59      0.56      0.57       385\n              5       0.69      0.60      0.64       395\n              6       0.58      0.45      0.51       390\n              7       0.75      0.69      0.72       396\n              8       0.84      0.81      0.82       398\n              9       0.77      0.72      0.74       397\n             10       0.85      0.84      0.84       399\n             11       0.76      0.84      0.80       396\n             12       0.70      0.50      0.58       393\n             13       0.82      0.49      0.62       396\n             14       0.79      0.76      0.78       394\n             15       0.75      0.76      0.76       398\n             16       0.70      0.73      0.72       364\n             17       0.62      0.76      0.69       376\n             18       0.55      0.61      0.58       310\n             19       0.56      0.49      0.52       251\n\n    avg \u002F total       0.67      0.66      0.66      7532\n\n\n\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n支持向量机 (Support Vector Machine, SVM)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\nSVM 的原始版本由 Vapnik 和 Chervonenkis 于 1963 年提出。20 世纪 90 年代初，非线性版本由 BE. Boser 等人提出。SVM 的原始版本是为二分类问题 (binary classification problem) 设计的，但许多研究人员利用这一权威技术进行了多分类问题 (multi-class problem) 的研究。\n\n\n支持向量机的优势基于 scikit-learn 页面：\n\n* 在高维空间 (high dimensional spaces) 中有效。\n* 在维度数量大于样本数量的情况下仍然有效。\n* 在决策函数 (decision function) 中使用训练点的一个子集（称为支持向量 (support vectors)），因此它也具有内存效率 (memory efficient)。\n* 多功能：可以为决策函数指定不同的核函数 (Kernel functions)。提供了常见的核函数，但也可能指定自定义核 (custom kernels)。\n\n\n支持向量机的缺点包括：\n\n* 如果特征 (features) 数量远大于样本 (samples) 数量，通过选择核函数和正则化项 (regularization term) 来避免过拟合 (over-fitting) 至关重要。\n* SVM 不直接提供概率估计 (probability estimates)，这些是通过昂贵的五折交叉验证 (five-fold cross-validation) 计算得出的（见下面的“得分与概率”）。\n\n\n\n.. image:: docs\u002Fpic\u002FSVM.png\n\n\n.. code:: python\n\n\n    from sklearn.svm import LinearSVC\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', LinearSVC()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\n输出：\n\n\n.. 
code:: python\n\n                   precision    recall  f1-score   support\n\n0       0.82      0.80      0.81       319\n              1       0.76      0.80      0.78       389\n              2       0.77      0.73      0.75       394\n              3       0.71      0.76      0.74       392\n              4       0.84      0.86      0.85       385\n              5       0.87      0.76      0.81       395\n              6       0.83      0.91      0.87       390\n              7       0.92      0.91      0.91       396\n              8       0.95      0.95      0.95       398\n              9       0.92      0.95      0.93       397\n             10       0.96      0.98      0.97       399\n             11       0.93      0.94      0.93       396\n             12       0.81      0.79      0.80       393\n             13       0.90      0.87      0.88       396\n             14       0.90      0.93      0.92       394\n             15       0.84      0.93      0.88       398\n             16       0.75      0.92      0.82       364\n             17       0.97      0.89      0.93       376\n             18       0.82      0.62      0.71       310\n             19       0.75      0.61      0.68       251\n\n    avg \u002F total       0.85      0.85      0.85      7532\n\n\n\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n决策树\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n文本和数据挖掘中较早的分类算法之一是决策树。决策树分类器（Decision Tree Classifiers, DTC's）在许多不同的分类领域都得到了成功应用。该技术的结构包括数据空间的层次分解（仅针对训练数据集）。决策树作为分类任务由 `D. Morgan \u003Chttp:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP95-1037>`__ 引入，并由 `JR. Quinlan \u003Chttps:\u002F\u002Fcourses.cs.ut.ee\u002F2009\u002Fbayesian-networks\u002Fextras\u002Fquinlan1986.pdf>`__ 开发。主要思想是基于数据点的属性创建树，但挑战在于确定哪个属性应处于父级，哪个应处于子级。为了解决这个问题，`De Mantaras \u003Chttps:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1023\u002FA:1022694001379>`__ 引入了用于树中特征选择的统计建模。\n\n\n.. code:: python\n\n    from sklearn import tree\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', tree.DecisionTreeClassifier()),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\n输出：\n\n\n.. 
code:: python\n\n                   precision    recall  f1-score   support\n\n              0       0.51      0.48      0.49       319\n              1       0.42      0.42      0.42       389\n              2       0.51      0.56      0.53       394\n              3       0.46      0.42      0.44       392\n              4       0.50      0.56      0.53       385\n              5       0.50      0.47      0.48       395\n              6       0.66      0.73      0.69       390\n              7       0.60      0.59      0.59       396\n              8       0.66      0.72      0.69       398\n              9       0.53      0.55      0.54       397\n             10       0.68      0.66      0.67       399\n             11       0.73      0.69      0.71       396\n             12       0.34      0.33      0.33       393\n             13       0.52      0.42      0.46       396\n             14       0.65      0.62      0.63       394\n             15       0.68      0.72      0.70       398\n             16       0.49      0.62      0.55       364\n             17       0.78      0.60      0.68       376\n             18       0.38      0.38      0.38       310\n             19       0.32      0.32      0.32       251\n\n    avg \u002F total       0.55      0.55      0.55      7532\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n随机森林\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n随机森林（Random Forests）或随机决策森林技术是一种用于文本分类的集成学习方法（Ensemble Learning Method）。该方法于 1995 年由 `T. Kam Ho \u003Chttps:\u002F\u002Fdoi.org\u002F10.1109\u002FICDAR.1995.598994>`__ 首次提出，当时使用了 t 棵树并行运行。后来该技术由 `L. Breiman \u003Chttps:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1023\u002FA:1010933404324>`__ 在 1999 年进一步发展，他们发现其作为随机森林（RF）的边界度量是收敛的。\n\n\n.. image:: docs\u002Fpic\u002FRF.png\n\n.. code:: python\n\n    from sklearn.ensemble import RandomForestClassifier\n    from sklearn.pipeline import Pipeline\n    from sklearn import metrics\n    from sklearn.feature_extraction.text import CountVectorizer\n    from sklearn.feature_extraction.text import TfidfTransformer\n    from sklearn.datasets import fetch_20newsgroups\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    text_clf = Pipeline([('vect', CountVectorizer()),\n                         ('tfidf', TfidfTransformer()),\n                         ('clf', RandomForestClassifier(n_estimators=100)),\n                         ])\n\n    text_clf.fit(X_train, y_train)\n\n\n    predicted = text_clf.predict(X_test)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\n输出：\n\n\n.. 
code:: python\n\n\n                    precision    recall  f1-score   support\n\n0       0.69      0.63      0.66       319\n              1       0.56      0.69      0.62       389\n              2       0.67      0.78      0.72       394\n              3       0.67      0.67      0.67       392\n              4       0.71      0.78      0.74       385\n              5       0.78      0.68      0.73       395\n              6       0.74      0.92      0.82       390\n              7       0.81      0.79      0.80       396\n              8       0.90      0.89      0.90       398\n              9       0.80      0.89      0.84       397\n             10       0.90      0.93      0.91       399\n             11       0.89      0.91      0.90       396\n             12       0.68      0.49      0.57       393\n             13       0.83      0.65      0.73       396\n             14       0.81      0.88      0.84       394\n             15       0.68      0.91      0.78       398\n             16       0.67      0.86      0.75       364\n             17       0.93      0.78      0.85       376\n             18       0.86      0.48      0.61       310\n             19       0.79      0.31      0.45       251\n\n    avg \u002F total       0.77      0.76      0.75      7532\n\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n条件随机场（CRF）\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n条件随机场（CRF）是一种无向图模型，如图所示。CRF 定义了给定观测序列 *X* 的标签序列 *Y* 的条件概率，即 P(Y|X)。CRF 通过对标签序列的条件概率而非联合概率 P(X,Y) 进行建模，可以在不违反独立性假设的情况下整合观测序列的复杂特征。计算 P(X|Y) 时使用了“团”（clique，即完全连接的子图）的概念以及团势。考虑到图中每个团都有一个势函数，变量配置的概率对应于一系列非负势函数的乘积。每个势函数计算出的值等同于其对应团中的变量采取特定配置的概率。\n\n\n.. image:: docs\u002Fpic\u002FCRF.png\n\n\n示例来自 `此处 \u003Chttp:\u002F\u002Fsklearn-crfsuite.readthedocs.io\u002Fen\u002Flatest\u002Ftutorial.html>`__\n让我们使用 CoNLL 2002 数据来构建一个命名实体识别（NER）系统\nCoNLL2002 语料库可在 NLTK 中找到。我们使用西班牙语数据。\n\n\n.. code:: python\n\n      import nltk\n      import sklearn_crfsuite\n      from sklearn_crfsuite import metrics\n      nltk.corpus.conll2002.fileids()\n      train_sents = list(nltk.corpus.conll2002.iob_sents('esp.train'))\n      test_sents = list(nltk.corpus.conll2002.iob_sents('esp.testb'))\n      \n      \nsklearn-crfsuite（以及 python-crfsuite）支持多种特征格式；此处我们使用特征字典。\n\n.. 
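\n\n例如（示意），西班牙语句首的词 'Melbourne'（词性标注为 'NP'）大致会被映射成如下的特征字典：\n\n.. code:: python\n\n      {'bias': 1.0,\n       'word.lower()': 'melbourne',\n       'word[-3:]': 'rne',\n       'word[-2:]': 'ne',\n       'word.isupper()': False,\n       'word.istitle()': True,\n       'word.isdigit()': False,\n       'postag': 'NP',\n       'postag[:2]': 'NP',\n       'BOS': True}\n\n生成这类特征的函数定义如下：\n\n.. 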
code:: python\n\n      def word2features(sent, i):\n          word = sent[i][0]\n          postag = sent[i][1]\n\n          features = {\n              'bias': 1.0,\n              'word.lower()': word.lower(),\n              'word[-3:]': word[-3:],\n              'word[-2:]': word[-2:],\n              'word.isupper()': word.isupper(),\n              'word.istitle()': word.istitle(),\n              'word.isdigit()': word.isdigit(),\n              'postag': postag,\n              'postag[:2]': postag[:2],\n          }\n          if i > 0:\n              word1 = sent[i-1][0]\n              postag1 = sent[i-1][1]\n              features.update({\n                  '-1:word.lower()': word1.lower(),\n                  '-1:word.istitle()': word1.istitle(),\n                  '-1:word.isupper()': word1.isupper(),\n                  '-1:postag': postag1,\n                  '-1:postag[:2]': postag1[:2],\n              })\n          else:\n              features['BOS'] = True\n\n          if i \u003C len(sent)-1:\n              word1 = sent[i+1][0]\n              postag1 = sent[i+1][1]\n              features.update({\n                  '+1:word.lower()': word1.lower(),\n                  '+1:word.istitle()': word1.istitle(),\n                  '+1:word.isupper()': word1.isupper(),\n                  '+1:postag': postag1,\n                  '+1:postag[:2]': postag1[:2],\n              })\n          else:\n              features['EOS'] = True\n\n          return features\n\n\n      def sent2features(sent):\n          return [word2features(sent, i) for i in range(len(sent))]\n\n      def sent2labels(sent):\n          return [label for token, postag, label in sent]\n\n      def sent2tokens(sent):\n          return [token for token, postag, label in sent]\n\n      X_train = [sent2features(s) for s in train_sents]\n      y_train = [sent2labels(s) for s in train_sents]\n\n      X_test = [sent2features(s) for s in test_sents]\n      y_test = [sent2labels(s) for s in test_sents]\n\n\n要查看所有可能的 CRF 参数，请检查其文档字符串。此处我们使用 L-BFGS 训练算法（默认）配合 Elastic Net（L1 + L2）正则化。\n\n\n\n.. code:: python\n\n      crf = sklearn_crfsuite.CRF(\n          algorithm='lbfgs',\n          c1=0.1,\n          c2=0.1,\n          max_iterations=100,\n          all_possible_transitions=True\n      )\n      crf.fit(X_train, y_train)\n\n\n评估\n\n\n.. code:: python\n\n      y_pred = crf.predict(X_test)\n      print(metrics.flat_classification_report(\n          y_test, y_pred,  digits=3\n      ))\n\n\n输出：\n\n.. 
code:: python\n\n                     precision    recall  f1-score   support\n\n            B-LOC      0.810     0.784     0.797      1084\n           B-MISC      0.731     0.569     0.640       339\n            B-ORG      0.807     0.832     0.820      1400\n            B-PER      0.850     0.884     0.867       735\n            I-LOC      0.690     0.637     0.662       325\n           I-MISC      0.699     0.589     0.639       557\n            I-ORG      0.852     0.786     0.818      1104\n            I-PER      0.893     0.943     0.917       634\n                O      0.992     0.997     0.994     45355\n\n      avg \u002F total      0.970     0.971     0.971     51533\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n深度学习\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n-----------------------------------------\n深度神经网络\n-----------------------------------------\n\n深度神经网络（Deep Neural Networks）架构设计为通过多层连接进行学习，其中每一层仅在隐藏部分接收来自前一层的连接，并仅提供给下一层连接。输入是特征空间（feature space）的连接（如“特征提取”部分所述，与第一个隐藏层相关）。对于深度神经网络（DNN），输入层可以是 tf-idf、词嵌入（word embedding）等，如图中的标准 DNN 所示。输出层包含的神经元（neurons）数量等于多分类（multi-class classification）中的类别数，而二分类（binary classification）中仅有一个神经元。但本文的主要贡献在于我们拥有许多经过训练的 DNN 以服务于不同的目的。在此，我们有多分类 DNN，其中每个学习模型是随机生成的（每层的节点数以及层数都是随机分配的）。我们对深度神经网络（DNN）的实现基本上是一个判别式训练模型，使用标准的反向传播算法（back-propagation algorithm）以及 Sigmoid 或 ReLU 作为激活函数（activation functions）。用于多分类的输出层应使用 Softmax。\n\n.. image:: docs\u002Fpic\u002FDNN.png\n\n导入包：\n\n.. code:: python\n\n    from sklearn.datasets import fetch_20newsgroups\n    from keras.layers import  Dropout, Dense\n    from keras.models import Sequential\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn import metrics\n\n\n将文本转换为 TF-IDF：\n\n.. code:: python\n\n    def TFIDF(X_train, X_test,MAX_NB_WORDS=75000):\n        vectorizer_x = TfidfVectorizer(max_features=MAX_NB_WORDS)\n        X_train = vectorizer_x.fit_transform(X_train).toarray()\n        X_test = vectorizer_x.transform(X_test).toarray()\n        print(\"tf-idf with\",str(np.array(X_train).shape[1]),\"features\")\n        return (X_train,X_test)\n\n\n构建用于文本的 DNN 模型：\n\n.. code:: python\n\n    def Build_Model_DNN_Text(shape, nClasses, dropout=0.5):\n        \"\"\"\n        buildModel_DNN_Tex(shape, nClasses,dropout)\n        Build Deep neural networks Model for text classification\n        Shape is input feature space\n        nClasses is number of classes\n        \"\"\"\n        model = Sequential()\n        node = 512 # number of nodes\n        nLayers = 4 # number of  hidden layer\n\n        model.add(Dense(node,input_dim=shape,activation='relu'))\n        model.add(Dropout(dropout))\n        for i in range(0,nLayers):\n            model.add(Dense(node,input_dim=node,activation='relu'))\n            model.add(Dropout(dropout))\n        model.add(Dense(nClasses, activation='softmax'))\n\n        model.compile(loss='sparse_categorical_crossentropy',\n                      optimizer='adam',\n                      metrics=['accuracy'])\n\n        return model\n\n\n\n加载文本数据集（20newsgroups）：\n\n.. code:: python\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n\n\n运行 DNN 并查看我们的结果：\n\n\n.. 
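code:: python\n\n    # 注意（示意）：下面代码中的 model_DNN.predict_class 并非 Keras 提供的方法；旧版 Keras 中对应的是\n    # predict_classes，且该方法在较新的 TensorFlow 版本中已被移除。较稳妥的写法是用 predict 加 argmax：\n    # predicted = np.argmax(model_DNN.predict(X_test_tfidf), axis=1)\n\n原文的训练与评估代码如下：\n\n.. 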
code:: python\n\n    X_train_tfidf,X_test_tfidf = TFIDF(X_train,X_test)\n    model_DNN = Build_Model_DNN_Text(X_train_tfidf.shape[1], 20)\n    model_DNN.fit(X_train_tfidf, y_train,\n                                  validation_data=(X_test_tfidf, y_test),\n                                  epochs=10,\n                                  batch_size=128,\n                                  verbose=2)\n\n    predicted = model_DNN.predict_class(X_test_tfidf)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\n模型摘要：\n\n.. code:: python \n\n    _________________________________________________________________\n    Layer (type)                 Output Shape              Param #   \n    =================================================================\n    dense_1 (Dense)              (None, 512)               38400512  \n    _________________________________________________________________\n    dropout_1 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_2 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_2 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_3 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_3 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_4 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_4 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_5 (Dense)              (None, 512)               262656    \n    _________________________________________________________________\n    dropout_5 (Dropout)          (None, 512)               0         \n    _________________________________________________________________\n    dense_6 (Dense)              (None, 20)                10260     \n    =================================================================\n    Total params: 39,461,396\n    Trainable params: 39,461,396\n    Non-trainable params: 0\n    _________________________________________________________________\n\n\n\n输出：\n\n.. 
code:: python \n\n        Train on 11314 samples, validate on 7532 samples\n        Epoch 1\u002F10\n         - 16s - loss: 2.7553 - acc: 0.1090 - val_loss: 1.9330 - val_acc: 0.3184\n        Epoch 2\u002F10\n         - 15s - loss: 1.5330 - acc: 0.4222 - val_loss: 1.1546 - val_acc: 0.6204\n        Epoch 3\u002F10\n         - 15s - loss: 0.7438 - acc: 0.7257 - val_loss: 0.8405 - val_acc: 0.7499\n        Epoch 4\u002F10\n         - 15s - loss: 0.2967 - acc: 0.9020 - val_loss: 0.9214 - val_acc: 0.7767\n        Epoch 5\u002F10\n         - 15s - loss: 0.1557 - acc: 0.9543 - val_loss: 0.8965 - val_acc: 0.7917\n        Epoch 6\u002F10\n         - 15s - loss: 0.1015 - acc: 0.9705 - val_loss: 0.9427 - val_acc: 0.7949\n        Epoch 7\u002F10\n         - 15s - loss: 0.0595 - acc: 0.9835 - val_loss: 0.9893 - val_acc: 0.7995\n        Epoch 8\u002F10\n         - 15s - loss: 0.0495 - acc: 0.9866 - val_loss: 0.9512 - val_acc: 0.8079\n        Epoch 9\u002F10\n         - 15s - loss: 0.0437 - acc: 0.9867 - val_loss: 0.9690 - val_acc: 0.8117\n        Epoch 10\u002F10\n         - 15s - loss: 0.0443 - acc: 0.9880 - val_loss: 1.0004 - val_acc: 0.8070\n\n\n                       precision    recall  f1-score   support\n\n0       0.76      0.78      0.77       319\n                  1       0.67      0.80      0.73       389\n                  2       0.82      0.63      0.71       394\n                  3       0.76      0.69      0.72       392\n                  4       0.65      0.86      0.74       385\n                  5       0.84      0.75      0.79       395\n                  6       0.82      0.87      0.84       390\n                  7       0.86      0.90      0.88       396\n                  8       0.95      0.91      0.93       398\n                  9       0.91      0.92      0.92       397\n                 10       0.98      0.92      0.95       399\n                 11       0.96      0.85      0.90       396\n                 12       0.71      0.69      0.70       393\n                 13       0.95      0.70      0.81       396\n                 14       0.86      0.91      0.88       394\n                 15       0.85      0.90      0.87       398\n                 16       0.79      0.84      0.81       364\n                 17       0.99      0.77      0.87       376\n                 18       0.58      0.75      0.65       310\n                 19       0.52      0.60      0.55       251\n\n        avg \u002F total       0.82      0.81      0.81      7532\n\n\n-----------------------------------------\n循环神经网络 (RNN)\n-----------------------------------------\n\n.. image:: docs\u002Fpic\u002FRNN.png\n\n研究人员针对文本模仿和分类提出的另一种**神经网络**（Neural Network）架构是**循环神经网络**（Recurrent Neural Networks，简称 RNN）。RNN 为序列中先前的数据点分配更高的权重。因此，该技术是一种强大的文本、字符串和序列数据分类方法。此外，正如本工作所示，该技术也可用于图像分类。在 RNN 中，神经网络以一种非常复杂的方法考虑先前节点的信息，这使得对数据集中的结构进行更好的语义分析成为可能。\n\n\n门控循环单元 (GRU)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n**门控循环单元**（Gated Recurrent Unit，简称 GRU）是 RNN 的一种门控机制，由 `J. Chung et al. \u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F1412.3555>`__ 和 `K.Cho et al. \u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F1406.1078>`__ 提出。GRU 是 **LSTM**（长短期记忆网络）架构的简化变体，但存在以下区别：GRU 包含两个门，且不拥有任何内部记忆（如图示所示）；最后，不应用第二个非线性激活函数（图中的 tanh）。\n\n.. image:: docs\u002Fpic\u002FLSTM.png\n\n长短期记忆网络 (LSTM)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n**长短期记忆网络**（Long Short-Term Memory，简称 LSTM）由 `S. Hochreiter and J. 
Schmidhuber \u003Chttps:\u002F\u002Fwww.mitpressjournals.org\u002Fdoi\u002Fabs\u002F10.1162\u002Fneco.1997.9.8.1735>`__ 提出，并由许多研究科学家开发。\n\n为了应对这些问题，长短期记忆网络（LSTM）是一种特殊的 RNN，与基本 RNN 相比，它能以更有效的方式保留长期依赖关系。这对于克服**梯度消失问题**（vanishing gradient problem）特别有用。虽然 LSTM 具有类似于 RNN 的链式结构，但 LSTM 使用多个门来仔细调节允许进入每个节点状态的信息量。图示展示了 LSTM 模型的基本单元。\n\n\n\n导入包：\n\n.. code:: python\n\n\n    from keras.layers import Dropout, Dense, GRU, Embedding\n    from keras.models import Sequential\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn import metrics\n    from keras.preprocessing.text import Tokenizer\n    from keras.preprocessing.sequence import pad_sequences\n    from sklearn.datasets import fetch_20newsgroups\n\n将文本转换为词嵌入（使用 GloVe）：\n\n.. code:: python\n\n    def loadData_Tokenizer(X_train, X_test,MAX_NB_WORDS=75000,MAX_SEQUENCE_LENGTH=500):\n        np.random.seed(7)\n        text = np.concatenate((X_train, X_test), axis=0)\n        text = np.array(text)\n        tokenizer = Tokenizer(num_words=MAX_NB_WORDS)\n        tokenizer.fit_on_texts(text)\n        sequences = tokenizer.texts_to_sequences(text)\n        word_index = tokenizer.word_index\n        text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n        print('Found %s unique tokens.' % len(word_index))\n        indices = np.arange(text.shape[0])\n        # np.random.shuffle(indices)\n        text = text[indices]\n        print(text.shape)\n        X_train = text[0:len(X_train), ]\n        X_test = text[len(X_train):, ]\n        embeddings_index = {}\n        f = open(\".\\\\Glove\\\\glove.6B.50d.txt\", encoding=\"utf8\")\n        for line in f:\n\n            values = line.split()\n            word = values[0]\n            try:\n                coefs = np.asarray(values[1:], dtype='float32')\n            except:\n                pass\n            embeddings_index[word] = coefs\n        f.close()\n        print('Total %s word vectors.' % len(embeddings_index))\n        return (X_train, X_test, word_index,embeddings_index)\n\n构建用于文本的 RNN 模型：\n\n.. 
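code:: python\n\n    # 示意说明：下面的函数先用 GloVe 向量初始化 Embedding 层，再堆叠 hidden_layer 个\n    # return_sequences=True 的 GRU 层，随后接一个普通 GRU 层、一个全连接层和 softmax 输出层；\n    # 运行前需先下载 glove.6B.50d.txt，并把上面 loadData_Tokenizer 中打开的 GloVe 文件路径改为本地实际位置。\n\n.. 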
code:: python\n\n\n    def Build_Model_RNN_Text(word_index, embeddings_index, nclasses,  MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n        \"\"\"\n        def buildModel_RNN(word_index, embeddings_index, nclasses,  MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n        word_index in word index ,\n        embeddings_index is embeddings index, look at data_helper.py\n        nClasses is number of classes,\n        MAX_SEQUENCE_LENGTH is maximum lenght of text sequences\n        \"\"\"\n\n        model = Sequential()\n        hidden_layer = 3\n        gru_node = 32\n\n        embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))\n        for word, i in word_index.items():\n            embedding_vector = embeddings_index.get(word)\n            if embedding_vector is not None:\n                # words not found in embedding index will be all-zeros.\n                if len(embedding_matrix[i]) != len(embedding_vector):\n                    print(\"could not broadcast input array from shape\", str(len(embedding_matrix[i])),\n                          \"into shape\", str(len(embedding_vector)), \" Please make sure your\"\n                                                                    \" EMBEDDING_DIM is equal to embedding_vector file ,GloVe,\")\n                    exit(1)\n                embedding_matrix[i] = embedding_vector\n        model.add(Embedding(len(word_index) + 1,\n                                    EMBEDDING_DIM,\n                                    weights=[embedding_matrix],\n                                    input_length=MAX_SEQUENCE_LENGTH,\n                                    trainable=True))\n\nprint(gru_node)\n        for i in range(0,hidden_layer):\n            model.add(GRU(gru_node,return_sequences=True, recurrent_dropout=0.2))\n            model.add(Dropout(dropout))\n        model.add(GRU(gru_node, recurrent_dropout=0.2))\n        model.add(Dropout(dropout))\n        model.add(Dense(256, activation='relu'))\n        model.add(Dense(nclasses, activation='softmax'))\n\n\n        model.compile(loss='sparse_categorical_crossentropy',\n                          optimizer='adam',\n                          metrics=['accuracy'])\n        return model\n\n\n\n\n运行 RNN（循环神经网络）并查看我们的结果：\n\n\n.. code:: python\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test)\n\n\n    model_RNN = Build_Model_RNN_Text(word_index,embeddings_index, 20)\n\n    model_RNN.fit(X_train_Glove, y_train,\n                                  validation_data=(X_test_Glove, y_test),\n                                  epochs=10,\n                                  batch_size=128,\n                                  verbose=2)\n\n    predicted = model_RNN.predict_classes(X_test_Glove)\n\n    print(metrics.classification_report(y_test, predicted))\n\n\n模型摘要：\n\n.. 
code:: python \n\n    _________________________________________________________________\n    Layer (type)                 Output Shape              Param #   \n    =================================================================\n    embedding_1 (Embedding)      (None, 500, 50)           8960500   \n    _________________________________________________________________\n    gru_1 (GRU)                  (None, 500, 256)          235776    \n    _________________________________________________________________\n    dropout_1 (Dropout)          (None, 500, 256)          0         \n    _________________________________________________________________\n    gru_2 (GRU)                  (None, 500, 256)          393984    \n    _________________________________________________________________\n    dropout_2 (Dropout)          (None, 500, 256)          0         \n    _________________________________________________________________\n    gru_3 (GRU)                  (None, 500, 256)          393984    \n    _________________________________________________________________\n    dropout_3 (Dropout)          (None, 500, 256)          0         \n    _________________________________________________________________\n    gru_4 (GRU)                  (None, 256)               393984    \n    _________________________________________________________________\n    dense_1 (Dense)              (None, 20)                5140      \n    =================================================================\n    Total params: 10,383,368\n    Trainable params: 10,383,368\n    Non-trainable params: 0\n    _________________________________________________________________\n\n\n\n输出：\n\n.. code:: python \n\n    Train on 11314 samples, validate on 7532 samples\n    Epoch 1\u002F20\n     - 268s - loss: 2.5347 - acc: 0.1792 - val_loss: 2.2857 - val_acc: 0.2460\n    Epoch 2\u002F20\n     - 271s - loss: 1.6751 - acc: 0.3999 - val_loss: 1.4972 - val_acc: 0.4660\n    Epoch 3\u002F20\n     - 270s - loss: 1.0945 - acc: 0.6072 - val_loss: 1.3232 - val_acc: 0.5483\n    Epoch 4\u002F20\n     - 269s - loss: 0.7761 - acc: 0.7312 - val_loss: 1.1009 - val_acc: 0.6452\n    Epoch 5\u002F20\n     - 269s - loss: 0.5513 - acc: 0.8112 - val_loss: 1.0395 - val_acc: 0.6832\n    Epoch 6\u002F20\n     - 269s - loss: 0.3765 - acc: 0.8754 - val_loss: 0.9977 - val_acc: 0.7086\n    Epoch 7\u002F20\n     - 270s - loss: 0.2481 - acc: 0.9202 - val_loss: 1.0485 - val_acc: 0.7270\n    Epoch 8\u002F20\n     - 269s - loss: 0.1717 - acc: 0.9463 - val_loss: 1.0269 - val_acc: 0.7394\n    Epoch 9\u002F20\n     - 269s - loss: 0.1130 - acc: 0.9644 - val_loss: 1.1498 - val_acc: 0.7369\n    Epoch 10\u002F20\n     - 269s - loss: 0.0640 - acc: 0.9808 - val_loss: 1.1442 - val_acc: 0.7508\n    Epoch 11\u002F20\n     - 269s - loss: 0.0567 - acc: 0.9828 - val_loss: 1.2318 - val_acc: 0.7414\n    Epoch 12\u002F20\n     - 268s - loss: 0.0472 - acc: 0.9858 - val_loss: 1.2204 - val_acc: 0.7496\n    Epoch 13\u002F20\n     - 269s - loss: 0.0319 - acc: 0.9910 - val_loss: 1.1895 - val_acc: 0.7657\n    Epoch 14\u002F20\n     - 268s - loss: 0.0466 - acc: 0.9853 - val_loss: 1.2821 - val_acc: 0.7517\n    Epoch 15\u002F20\n     - 271s - loss: 0.0269 - acc: 0.9917 - val_loss: 1.2869 - val_acc: 0.7557\n    Epoch 16\u002F20\n     - 271s - loss: 0.0187 - acc: 0.9950 - val_loss: 1.3037 - val_acc: 0.7598\n    Epoch 17\u002F20\n     - 268s - loss: 0.0157 - acc: 0.9959 - val_loss: 1.2974 - val_acc: 0.7638\n    Epoch 18\u002F20\n     - 270s - loss: 0.0121 - acc: 0.9966 - val_loss: 1.3526 - 
val_acc: 0.7602\n    Epoch 19\u002F20\n     - 269s - loss: 0.0262 - acc: 0.9926 - val_loss: 1.4182 - val_acc: 0.7517\n    Epoch 20\u002F20\n     - 269s - loss: 0.0249 - acc: 0.9918 - val_loss: 1.3453 - val_acc: 0.7638\n\n\n                   precision    recall  f1-score   support\n\n              0       0.71      0.71      0.71       319\n              1       0.72      0.68      0.70       389\n              2       0.76      0.62      0.69       394\n              3       0.67      0.58      0.62       392\n              4       0.68      0.67      0.68       385\n              5       0.75      0.73      0.74       395\n              6       0.82      0.74      0.78       390\n              7       0.83      0.83      0.83       396\n              8       0.81      0.90      0.86       398\n              9       0.92      0.90      0.91       397\n             10       0.91      0.94      0.93       399\n             11       0.87      0.76      0.81       396\n             12       0.57      0.70      0.63       393\n             13       0.81      0.85      0.83       396\n             14       0.74      0.93      0.82       394\n             15       0.82      0.83      0.83       398\n             16       0.74      0.78      0.76       364\n             17       0.96      0.83      0.89       376\n             18       0.64      0.60      0.62       310\n             19       0.48      0.56      0.52       251\n\n    avg \u002F total       0.77      0.76      0.76      7532\n\n-----------------------------------------\n卷积神经网络 (CNN)\n-----------------------------------------\n\n另一种应用于层次化文档分类的深度学习架构是卷积神经网络 (Convolutional Neural Networks, CNN)。虽然最初是为图像处理构建的，其架构类似于视觉皮层，但 CNN 也被有效地用于文本分类。在用于图像处理的基础 CNN 中，图像张量与一组大小为 *d 乘 d* 的卷积核进行卷积运算。这些卷积层被称为特征图 (feature maps)，可以堆叠起来为输入提供多个滤波器。为了降低计算复杂度，CNN 使用池化 (pooling) 技术，从而减小网络中从一层传递到下一层的输出尺寸。使用不同的池化技术可以在减少输出的同时保留重要特征。\n\n最常见的池化方法是最大池化 (max pooling)，即从池化窗口中选择最大元素。为了将堆叠特征图的池化输出馈送到下一层，这些图被展平为一列。CNN 中的最终层通常是全连接密集层 (fully connected dense layers)。\n\n一般来说，在卷积神经网络的反向传播 (back-propagation) 步骤中，不仅权重 (weights) 会被调整，特征检测器滤波器 (feature detector filters) 也会被调整。用于文本的 CNN 的一个潜在问题是“通道”(channels) 的数量，*Sigma*（特征空间的大小）。对于文本而言，这个数字可能非常大（例如 50K），而对于图像则问题较小（例如仅有 3 个 RGB 通道）。这意味着用于文本的 CNN 的维度 (dimensionality) 非常高。\n\n.. image:: docs\u002Fpic\u002FCNN.png\n\n导入包：\n\n.. code:: python\n\n\n    from keras.layers import Dropout, Dense,Input,Embedding,Flatten, MaxPooling1D, Conv1D\n    from keras.models import Sequential,Model\n    from sklearn.feature_extraction.text import TfidfVectorizer\n    import numpy as np\n    from sklearn import metrics\n    from keras.preprocessing.text import Tokenizer\n    from keras.preprocessing.sequence import pad_sequences\n    from sklearn.datasets import fetch_20newsgroups\n    from keras.layers.merge import Concatenate\n\n\n\n将文本转换为词嵌入 (使用 GloVe)：\n\n.. code:: python\n\n    def loadData_Tokenizer(X_train, X_test,MAX_NB_WORDS=75000,MAX_SEQUENCE_LENGTH=500):\n        np.random.seed(7)\n        text = np.concatenate((X_train, X_test), axis=0)\n        text = np.array(text)\n        tokenizer = Tokenizer(num_words=MAX_NB_WORDS)\n        tokenizer.fit_on_texts(text)\n        sequences = tokenizer.texts_to_sequences(text)\n        word_index = tokenizer.word_index\n        text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n        print('Found %s unique tokens.' 
% len(word_index))\n        indices = np.arange(text.shape[0])\n        # np.random.shuffle(indices)\n        text = text[indices]\n        print(text.shape)\n        X_train = text[0:len(X_train), ]\n        X_test = text[len(X_train):, ]\n        embeddings_index = {}\n        f = open(\".\\\\Glove\\\\glove.6B.50d.txt\", encoding=\"utf8\")\n        for line in f:\n            values = line.split()\n            word = values[0]\n            try:\n                coefs = np.asarray(values[1:], dtype='float32')\n            except:\n                pass\n            embeddings_index[word] = coefs\n        f.close()\n        print('Total %s word vectors.' % len(embeddings_index))\n        return (X_train, X_test, word_index,embeddings_index)\n\n\n构建用于文本的 CNN 模型：\n\n.. code:: python\n\n    def Build_Model_CNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n\n        \"\"\"\n            def buildModel_CNN(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50, dropout=0.5):\n            word_index in word index ,\n            embeddings_index is embeddings index, look at data_helper.py\n            nClasses is number of classes,\n            MAX_SEQUENCE_LENGTH is maximum lenght of text sequences,\n            EMBEDDING_DIM is an int value for dimention of word embedding look at data_helper.py\n        \"\"\"\n\n        model = Sequential()\n        embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))\n        for word, i in word_index.items():\n            embedding_vector = embeddings_index.get(word)\n            if embedding_vector is not None:\n                # words not found in embedding index will be all-zeros.\n                if len(embedding_matrix[i]) !=len(embedding_vector):\n                    print(\"could not broadcast input array from shape\",str(len(embedding_matrix[i])),\n                                     \"into shape\",str(len(embedding_vector)),\" Please make sure your\"\n                                     \" EMBEDDING_DIM is equal to embedding_vector file ,GloVe,\")\n                    exit(1)\n\n                embedding_matrix[i] = embedding_vector\n\n        embedding_layer = Embedding(len(word_index) + 1,\n                                    EMBEDDING_DIM,\n                                    weights=[embedding_matrix],\n                                    input_length=MAX_SEQUENCE_LENGTH,\n                                    trainable=True)\n\n        # applying a more complex convolutional approach\n        convs = []\n        filter_sizes = []\n        layer = 5\n        print(\"Filter  \",layer)\n        for fl in range(0,layer):\n            filter_sizes.append((fl+2))\n\n        node = 128\n        sequence_input = Input(shape=(MAX_SEQUENCE_LENGTH,), dtype='int32')\n        embedded_sequences = embedding_layer(sequence_input)\n\n        for fsz in filter_sizes:\n            l_conv = Conv1D(node, kernel_size=fsz, activation='relu')(embedded_sequences)\n            l_pool = MaxPooling1D(5)(l_conv)\n            #l_pool = Dropout(0.25)(l_pool)\n            convs.append(l_pool)\n\n        l_merge = Concatenate(axis=1)(convs)\n        l_cov1 = Conv1D(node, 5, activation='relu')(l_merge)\n        l_cov1 = Dropout(dropout)(l_cov1)\n        l_pool1 = MaxPooling1D(5)(l_cov1)\n        l_cov2 = Conv1D(node, 5, activation='relu')(l_pool1)\n        l_cov2 = Dropout(dropout)(l_cov2)\n        l_pool2 = MaxPooling1D(30)(l_cov2)\n        l_flat = Flatten()(l_pool2)\n      
  l_dense = Dense(1024, activation='relu')(l_flat)\n        l_dense = Dropout(dropout)(l_dense)\n        l_dense = Dense(512, activation='relu')(l_dense)\n        l_dense = Dropout(dropout)(l_dense)\n        preds = Dense(nclasses, activation='softmax')(l_dense)\n        model = Model(sequence_input, preds)\n\n        model.compile(loss='sparse_categorical_crossentropy',\n                      optimizer='adam',\n                      metrics=['accuracy'])\n\n\n\n        return model\n\n\n\n运行 CNN 并查看我们的结果：\n\n\n.. code:: python\n\n\n    newsgroups_train = fetch_20newsgroups(subset='train')\n    newsgroups_test = fetch_20newsgroups(subset='test')\n    X_train = newsgroups_train.data\n    X_test = newsgroups_test.data\n    y_train = newsgroups_train.target\n    y_test = newsgroups_test.target\n\n    X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test)\n\n\n    model_CNN = Build_Model_CNN_Text(word_index,embeddings_index, 20)\n\nmodel_CNN.summary()\n\n    model_CNN.fit(X_train_Glove, y_train,\n                                  validation_data=(X_test_Glove, y_test),\n                                  epochs=15,\n                                  batch_size=128,\n                                  verbose=2)\n\n    predicted = model_CNN.predict(X_test_Glove)\n\n    predicted = np.argmax(predicted, axis=1)\n\n\n    print(metrics.classification_report(y_test, predicted))\n\n\n模型：\n\n.. code:: python \n\n    __________________________________________________________________________________________________\n    Layer (type)                    Output Shape         Param #     Connected to                     \n    ==================================================================================================\n    input_1 (InputLayer)            (None, 500)          0                                            \n    __________________________________________________________________________________________________\n    embedding_1 (Embedding)         (None, 500, 50)      8960500     input_1[0][0]                    \n    __________________________________________________________________________________________________\n    conv1d_1 (Conv1D)               (None, 499, 128)     12928       embedding_1[0][0]                \n    __________________________________________________________________________________________________\n    conv1d_2 (Conv1D)               (None, 498, 128)     19328       embedding_1[0][0]                \n    __________________________________________________________________________________________________\n    conv1d_3 (Conv1D)               (None, 497, 128)     25728       embedding_1[0][0]                \n    __________________________________________________________________________________________________\n    conv1d_4 (Conv1D)               (None, 496, 128)     32128       embedding_1[0][0]                \n    __________________________________________________________________________________________________\n    conv1d_5 (Conv1D)               (None, 495, 128)     38528       embedding_1[0][0]                \n    __________________________________________________________________________________________________\n    max_pooling1d_1 (MaxPooling1D)  (None, 99, 128)      0           conv1d_1[0][0]                   \n    __________________________________________________________________________________________________\n    max_pooling1d_2 (MaxPooling1D)  (None, 99, 128)      0           conv1d_2[0][0]                   \n    
__________________________________________________________________________________________________\n    max_pooling1d_3 (MaxPooling1D)  (None, 99, 128)      0           conv1d_3[0][0]                   \n    __________________________________________________________________________________________________\n    max_pooling1d_4 (MaxPooling1D)  (None, 99, 128)      0           conv1d_4[0][0]                   \n    __________________________________________________________________________________________________\n    max_pooling1d_5 (MaxPooling1D)  (None, 99, 128)      0           conv1d_5[0][0]                   \n    __________________________________________________________________________________________________\n    concatenate_1 (Concatenate)     (None, 495, 128)     0           max_pooling1d_1[0][0]            \n                                                                     max_pooling1d_2[0][0]            \n                                                                     max_pooling1d_3[0][0]            \n                                                                     max_pooling1d_4[0][0]            \n                                                                     max_pooling1d_5[0][0]            \n    __________________________________________________________________________________________________\n    conv1d_6 (Conv1D)               (None, 491, 128)     82048       concatenate_1[0][0]              \n    __________________________________________________________________________________________________\n    dropout_1 (Dropout)             (None, 491, 128)     0           conv1d_6[0][0]                   \n    __________________________________________________________________________________________________\n    max_pooling1d_6 (MaxPooling1D)  (None, 98, 128)      0           dropout_1[0][0]                  \n    __________________________________________________________________________________________________\n    conv1d_7 (Conv1D)               (None, 94, 128)      82048       max_pooling1d_6[0][0]            \n    __________________________________________________________________________________________________\n    dropout_2 (Dropout)             (None, 94, 128)      0           conv1d_7[0][0]                   \n    __________________________________________________________________________________________________\n    max_pooling1d_7 (MaxPooling1D)  (None, 3, 128)       0           dropout_2[0][0]                  \n    __________________________________________________________________________________________________\n    flatten_1 (Flatten)             (None, 384)          0           max_pooling1d_7[0][0]            \n    __________________________________________________________________________________________________\n    dense_1 (Dense)                 (None, 1024)         394240      flatten_1[0][0]                  \n    __________________________________________________________________________________________________\n    dropout_3 (Dropout)             (None, 1024)         0           dense_1[0][0]                    \n    __________________________________________________________________________________________________\n    dense_2 (Dense)                 (None, 512)          524800      dropout_3[0][0]                  \n    __________________________________________________________________________________________________\n    dropout_4 (Dropout)             (None, 512)          0           dense_2[0][0]                    \n    
__________________________________________________________________________________________________\n    dense_3 (Dense)                 (None, 20)           10260       dropout_4[0][0]                  \n    ==================================================================================================\n    Total params: 10,182,536\n    Trainable params: 10,182,536\n    Non-trainable params: 0\n    __________________________________________________________________________________________________\n\n\n输出：\n\n\n.. code:: python\n\n在 11314 个样本上训练，在 7532 个样本上验证\n    Epoch 1\u002F15\n     - 6s - 损失 (loss): 2.9329 - 准确率 (acc): 0.0783 - 验证集损失 (val_loss): 2.7628 - 验证集准确率 (val_acc): 0.1403\n    Epoch 2\u002F15\n     - 4s - 损失：2.2534 - 准确率：0.2249 - 验证集损失：2.1715 - 验证集准确率：0.4007\n    Epoch 3\u002F15\n     - 4s - 损失：1.5643 - 准确率：0.4326 - 验证集损失：1.7846 - 验证集准确率：0.5052\n    Epoch 4\u002F15\n     - 4s - 损失：1.1771 - 准确率：0.5662 - 验证集损失：1.4949 - 验证集准确率：0.6131\n    Epoch 5\u002F15\n     - 4s - 损失：0.8880 - 准确率：0.6797 - 验证集损失：1.3629 - 验证集准确率：0.6256\n    Epoch 6\u002F15\n     - 4s - 损失：0.6990 - 准确率：0.7569 - 验证集损失：1.2013 - 验证集准确率：0.6624\n    Epoch 7\u002F15\n     - 4s - 损失：0.5037 - 准确率：0.8200 - 验证集损失：1.0674 - 验证集准确率：0.6807\n    Epoch 8\u002F15\n     - 4s - 损失：0.4050 - 准确率：0.8626 - 验证集损失：1.0223 - 验证集准确率：0.6863\n    Epoch 9\u002F15\n     - 4s - 损失：0.2952 - 准确率：0.8968 - 验证集损失：0.9045 - 验证集准确率：0.7120\n    Epoch 10\u002F15\n     - 4s - 损失：0.2314 - 准确率：0.9217 - 验证集损失：0.8574 - 验证集准确率：0.7326\n    Epoch 11\u002F15\n     - 4s - 损失：0.1778 - 准确率：0.9436 - 验证集损失：0.8752 - 验证集准确率：0.7270\n    Epoch 12\u002F15\n     - 4s - 损失：0.1475 - 准确率：0.9524 - 验证集损失：0.8299 - 验证集准确率：0.7355\n    Epoch 13\u002F15\n     - 4s - 损失：0.1089 - 准确率：0.9657 - 验证集损失：0.8034 - 验证集准确率：0.7491\n    Epoch 14\u002F15\n     - 4s - 损失：0.1047 - 准确率：0.9666 - 验证集损失：0.8172 - 验证集准确率：0.7463\n    Epoch 15\u002F15\n     - 4s - 损失：0.0749 - 准确率：0.9774 - 验证集损失：0.8511 - 验证集准确率：0.7313\n     \n     \n                   精确率 (precision)    召回率 (recall)  F1 分数 (f1-score)   支持数 (support)\n\n              0       0.75      0.61      0.67       319\n              1       0.63      0.74      0.68       389\n              2       0.74      0.54      0.62       394\n              3       0.49      0.76      0.60       392\n              4       0.60      0.70      0.64       385\n              5       0.79      0.57      0.66       395\n              6       0.73      0.76      0.74       390\n              7       0.83      0.74      0.78       396\n              8       0.86      0.88      0.87       398\n              9       0.95      0.78      0.86       397\n             10       0.93      0.93      0.93       399\n             11       0.92      0.77      0.84       396\n             12       0.55      0.72      0.62       393\n             13       0.76      0.85      0.80       396\n             14       0.86      0.83      0.84       394\n             15       0.91      0.73      0.81       398\n             16       0.75      0.65      0.70       364\n             17       0.95      0.86      0.90       376\n             18       0.60      0.49      0.54       310\n             19       0.37      0.60      0.46       251\n\n    平均 \u002F 总计       0.76      0.73      0.74      7532\n\n\n\n\n-----------------------------------------\n层次化注意力网络 (Hierarchical Attention Networks)\n-----------------------------------------\n\n.. 
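\n\n层次化注意力网络（HAN，Hierarchical Attention Networks，Yang 等人于 2016 年提出）针对文档的层次结构建模：先用词级编码器和词级注意力将每个句子中的词聚合为句向量，再用句级编码器和句级注意力将句向量聚合为文档向量，最后接 softmax 分类层。两级注意力使模型能够分别突出对分类最重要的词和句子。其整体结构如下图所示：\n\n.. 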
image:: docs\u002Fpic\u002FHAN.png\n\n---------------------------------------------\n循环卷积神经网络 (Recurrent Convolutional Neural Networks, RCNN)\n---------------------------------------------\n\n循环卷积神经网络 (RCNN) 也用于文本分类。该技术的主要思想是利用循环结构捕捉上下文信息，并使用卷积神经网络构建文本表示。该架构结合了 RNN（循环神经网络）和 CNN（卷积神经网络）的优点，在一个模型中利用两者的优势。\n\n\n\n导入包：\n\n.. code:: python \n\n      from keras.preprocessing import sequence\n      from keras.models import Sequential\n      from keras.layers import Dense, Dropout, Activation\n      from keras.layers import Embedding\n      from keras.layers import GRU\n      from keras.layers import Conv1D, MaxPooling1D\n      from keras.datasets import imdb\n      from sklearn.datasets import fetch_20newsgroups\n      import numpy as np\n      from sklearn import metrics\n      from keras.preprocessing.text import Tokenizer\n      from keras.preprocessing.sequence import pad_sequences\n\n\n\n将文本转换为词嵌入（使用 GloVe）：\n\n.. code:: python \n\n      def loadData_Tokenizer(X_train, X_test,MAX_NB_WORDS=75000,MAX_SEQUENCE_LENGTH=500):\n          np.random.seed(7)\n          text = np.concatenate((X_train, X_test), axis=0)\n          text = np.array(text)\n          tokenizer = Tokenizer(num_words=MAX_NB_WORDS)\n          tokenizer.fit_on_texts(text)\n          sequences = tokenizer.texts_to_sequences(text)\n          word_index = tokenizer.word_index\n          text = pad_sequences(sequences, maxlen=MAX_SEQUENCE_LENGTH)\n          print('Found %s unique tokens.' % len(word_index))\n          indices = np.arange(text.shape[0])\n          # np.random.shuffle(indices)\n          text = text[indices]\n          print(text.shape)\n          X_train = text[0:len(X_train), ]\n          X_test = text[len(X_train):, ]\n          embeddings_index = {}\n          f = open(\"C:\\\\Users\\\\kamran\\\\Documents\\\\GitHub\\\\RMDL\\\\Examples\\\\Glove\\\\glove.6B.50d.txt\", encoding=\"utf8\")\n          for line in f:\n              values = line.split()\n              word = values[0]\n              try:\n                  coefs = np.asarray(values[1:], dtype='float32')\n              except:\n                  pass\n              embeddings_index[word] = coefs\n          f.close()\n          print('Total %s word vectors.' % len(embeddings_index))\n          return (X_train, X_test, word_index,embeddings_index)\n\n\n.. 
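\n\n注意（示意说明）：上面 loadData_Tokenizer 中打开的 GloVe 文件是作者本机的绝对路径，运行前需改为本地 glove.6B.50d.txt 的实际位置；另外，下面的模型构建函数用到了 LSTM 层，而前面的导入列表中只引入了 GRU，需要补充导入：\n\n.. code:: python\n\n      from keras.layers import LSTM\n\n模型构建函数先堆叠若干 Conv1D 与 MaxPooling1D 层提取局部特征，再接多层 LSTM 捕捉序列依赖：\n\n.. 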
code:: python \n\n      def Build_Model_RCNN_Text(word_index, embeddings_index, nclasses, MAX_SEQUENCE_LENGTH=500, EMBEDDING_DIM=50):\n\n          kernel_size = 2\n          filters = 256\n          pool_size = 2\n          gru_node = 256   # 循环层单元数（下面使用 LSTM，也可替换为 GRU）\n\n          # 先随机初始化嵌入矩阵；未出现在 GloVe 中的词将保留随机向量\n          embedding_matrix = np.random.random((len(word_index) + 1, EMBEDDING_DIM))\n          for word, i in word_index.items():\n              embedding_vector = embeddings_index.get(word)\n              if embedding_vector is not None:\n                  # 检查 GloVe 向量维度与 EMBEDDING_DIM 是否一致\n                  if len(embedding_matrix[i]) != len(embedding_vector):\n                      print(\"could not broadcast input array from shape\", str(len(embedding_matrix[i])),\n                            \"into shape\", str(len(embedding_vector)),\n                            \"Please make sure your EMBEDDING_DIM matches the GloVe file.\")\n                      exit(1)\n\n                  embedding_matrix[i] = embedding_vector\n\n          model = Sequential()\n          model.add(Embedding(len(word_index) + 1,\n                                      EMBEDDING_DIM,\n                                      weights=[embedding_matrix],\n                                      input_length=MAX_SEQUENCE_LENGTH,\n                                      trainable=True))\n          model.add(Dropout(0.25))\n          model.add(Conv1D(filters, kernel_size, activation='relu'))\n          model.add(MaxPooling1D(pool_size=pool_size))\n          model.add(Conv1D(filters, kernel_size, activation='relu'))\n          model.add(MaxPooling1D(pool_size=pool_size))\n          model.add(Conv1D(filters, kernel_size, activation='relu'))\n          model.add(MaxPooling1D(pool_size=pool_size))\n          model.add(Conv1D(filters, kernel_size, activation='relu'))\n          model.add(MaxPooling1D(pool_size=pool_size))\n          model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))\n          model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))\n          model.add(LSTM(gru_node, return_sequences=True, recurrent_dropout=0.2))\n          model.add(LSTM(gru_node, recurrent_dropout=0.2))\n          model.add(Dense(1024,activation='relu'))\n          model.add(Dense(nclasses))\n          model.add(Activation('softmax'))\n\n          model.compile(loss='sparse_categorical_crossentropy',\n                        optimizer='adam',\n                        metrics=['accuracy'])\n\n          return model\n\n\n.. code:: python \n\n      newsgroups_train = fetch_20newsgroups(subset='train')\n      newsgroups_test = fetch_20newsgroups(subset='test')\n      X_train = newsgroups_train.data\n      X_test = newsgroups_test.data\n      y_train = newsgroups_train.target\n      y_test = newsgroups_test.target\n\n      X_train_Glove,X_test_Glove, word_index,embeddings_index = loadData_Tokenizer(X_train,X_test)\n\n\n运行 RCNN：\n\n\n.. code:: python\n\n\n      model_RCNN = Build_Model_RCNN_Text(word_index, embeddings_index, 20)\n\n\n      model_RCNN.summary()\n\n      model_RCNN.fit(X_train_Glove, y_train,\n                                    validation_data=(X_test_Glove, y_test),\n                                    epochs=15,\n                                    batch_size=128,\n                                    verbose=2)\n\n      predicted = model_RCNN.predict(X_test_Glove)\n\n      predicted = np.argmax(predicted, axis=1)\n      print(metrics.classification_report(y_test, predicted))\n\n\n模型摘要：\n\n\n.. 
code:: python \n\n      _________________________________________________________________\n      Layer (type)                 Output Shape              Param #   \n      =================================================================\n      embedding_1 (Embedding)      (None, 500, 50)           8960500   \n      _________________________________________________________________\n      dropout_1 (Dropout)          (None, 500, 50)           0         \n      _________________________________________________________________\n      conv1d_1 (Conv1D)            (None, 499, 256)          25856     \n      _________________________________________________________________\n      max_pooling1d_1 (MaxPooling1 (None, 249, 256)          0         \n      _________________________________________________________________\n      conv1d_2 (Conv1D)            (None, 248, 256)          131328    \n      _________________________________________________________________\n      max_pooling1d_2 (MaxPooling1 (None, 124, 256)          0         \n      _________________________________________________________________\n      conv1d_3 (Conv1D)            (None, 123, 256)          131328    \n      _________________________________________________________________\n      max_pooling1d_3 (MaxPooling1 (None, 61, 256)           0         \n      _________________________________________________________________\n      conv1d_4 (Conv1D)            (None, 60, 256)           131328    \n      _________________________________________________________________\n      max_pooling1d_4 (MaxPooling1 (None, 30, 256)           0         \n      _________________________________________________________________\n      lstm_1 (LSTM)                (None, 30, 256)           525312    \n      _________________________________________________________________\n      lstm_2 (LSTM)                (None, 30, 256)           525312    \n      _________________________________________________________________\n      lstm_3 (LSTM)                (None, 30, 256)           525312    \n      _________________________________________________________________\n      lstm_4 (LSTM)                (None, 256)               525312    \n      _________________________________________________________________\n      dense_1 (Dense)              (None, 1024)              263168    \n      _________________________________________________________________\n      dense_2 (Dense)              (None, 20)                20500     \n      _________________________________________________________________\n      activation_1 (Activation)    (None, 20)                0         \n      =================================================================\n      Total params: 11,765,256\n      Trainable params: 11,765,256\n      Non-trainable params: 0\n      _________________________________________________________________\n\n\n\n输出：\n\n.. 
code:: python \n\n      Train on 11314 samples, validate on 7532 samples\n      Epoch 1\u002F15\n       - 28s - loss: 2.6624 - acc: 0.1081 - val_loss: 2.3012 - val_acc: 0.1753\n      Epoch 2\u002F15\n       - 22s - loss: 2.1142 - acc: 0.2224 - val_loss: 1.9168 - val_acc: 0.2669\n      Epoch 3\u002F15\n       - 22s - loss: 1.7465 - acc: 0.3290 - val_loss: 1.8257 - val_acc: 0.3412\n      Epoch 4\u002F15\n       - 22s - loss: 1.4730 - acc: 0.4356 - val_loss: 1.5433 - val_acc: 0.4436\n      Epoch 5\u002F15\n       - 22s - loss: 1.1800 - acc: 0.5556 - val_loss: 1.2973 - val_acc: 0.5467\n      Epoch 6\u002F15\n       - 22s - loss: 0.9910 - acc: 0.6281 - val_loss: 1.2530 - val_acc: 0.5797\n      Epoch 7\u002F15\n       - 22s - loss: 0.8581 - acc: 0.6854 - val_loss: 1.1522 - val_acc: 0.6281\n      Epoch 8\u002F15\n       - 22s - loss: 0.7058 - acc: 0.7428 - val_loss: 1.2385 - val_acc: 0.6033\n      Epoch 9\u002F15\n       - 22s - loss: 0.6792 - acc: 0.7515 - val_loss: 1.0200 - val_acc: 0.6775\n      Epoch 10\u002F15\n       - 22s - loss: 0.5782 - acc: 0.7948 - val_loss: 1.0961 - val_acc: 0.6577\n      Epoch 11\u002F15\n       - 23s - loss: 0.4674 - acc: 0.8341 - val_loss: 1.0866 - val_acc: 0.6924\n      Epoch 12\u002F15\n       - 23s - loss: 0.4284 - acc: 0.8512 - val_loss: 0.9880 - val_acc: 0.7096\n      Epoch 13\u002F15\n       - 22s - loss: 0.3883 - acc: 0.8670 - val_loss: 1.0190 - val_acc: 0.7151\n      Epoch 14\u002F15\n       - 22s - loss: 0.3334 - acc: 0.8874 - val_loss: 1.0025 - val_acc: 0.7232\n      Epoch 15\u002F15\n       - 22s - loss: 0.2857 - acc: 0.9038 - val_loss: 1.0123 - val_acc: 0.7331\n\n\n                   precision    recall  f1-score   support\n\n0       0.64      0.73      0.68       319\n                1       0.45      0.83      0.58       389\n                2       0.81      0.64      0.71       394\n                3       0.64      0.57      0.61       392\n                4       0.55      0.78      0.64       385\n                5       0.77      0.52      0.62       395\n                6       0.84      0.77      0.80       390\n                7       0.87      0.79      0.83       396\n                8       0.85      0.90      0.87       398\n                9       0.98      0.84      0.90       397\n               10       0.93      0.96      0.95       399\n               11       0.92      0.79      0.85       396\n               12       0.59      0.53      0.56       393\n               13       0.82      0.82      0.82       396\n               14       0.84      0.84      0.84       394\n               15       0.83      0.89      0.86       398\n               16       0.68      0.86      0.76       364\n               17       0.97      0.86      0.91       376\n               18       0.66      0.50      0.57       310\n               19       0.53      0.31      0.40       251\n\n      avg \u002F total       0.77      0.75      0.75      7532\n\n\n\n-----------------------------------------\n随机多模型深度学习 (RMDL)\n-----------------------------------------\n\n\n参考论文：`RMDL: Random Multimodel Deep Learning for\nClassification \u003Chttps:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F324922651_RMDL_Random_Multimodel_Deep_Learning_for_Classification>`__\n\n\n一种用于分类的新型集成 (Ensemble) 深度学习 (Deep Learning) 方法。深度学习 (Deep Learning) 模型在许多领域都取得了最先进的结果。RMDL 解决了寻找最佳深度学习 (Deep Learning) 结构和架构的问题，同时通过不同深度学习 (Deep Learning) 架构的集成提高了鲁棒性 (Robustness) 和准确率 (Accuracy)。RDML 可以接受多种数据作为输入，包括文本、视频、图像和符号。\n\n\n|RMDL|\n\n用于分类的随机多模型深度学习 (RDML) 架构。RMDL 包含 3 个随机模型，左侧为 
oneDNN 分类器，中间为深度 CNN 分类器，右侧为深度 RNN 分类器（每个单元可以是 LSTM 或 GRU）。\n\n\n安装\n\nRMDL 的安装可以使用 pip 和 git：\n\n使用 pip\n\n\n.. code:: python\n\n        pip install RMDL\n\n使用 git\n\n.. code:: bash\n\n    git clone --recursive https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FRMDL.git\n\n该软件包的主要要求是带有 Tensorflow 的 Python 3。`requirements.txt` 文件列出了安装所有要求所需的 `Python 包 \u003Chttps:\u002F\u002Fwww.scaler.com\u002Ftopics\u002Fpython\u002Fpython-packages\u002F>`__，请运行以下命令：\n\n.. code:: bash\n\n    pip -r install requirements.txt\n\n或者\n\n.. code:: bash\n\n    pip3  install -r requirements.txt\n\n或者：\n\n.. code:: bash\n\n    conda install --file requirements.txt\n\n文档：\n\n\n每年复杂数据集数量的指数级增长需要机器学习方法进一步增强，以提供稳健且准确的数据分类。最近，在图像分类、自然语言处理、人脸识别等任务上，深度学习 (Deep Learning) 方法相比以前的机器学习算法取得了更好的结果。这些深度学习 (Deep Learning) 算法的成功依赖于它们对数据内部复杂和非线性关系进行建模的能力。然而，为这些模型找到合适的结构一直是研究人员面临的挑战。本文介绍了随机多模型深度学习 (RMDL)：一种用于分类的新型集成 (Ensemble) 深度学习 (Deep Learning) 方法。RMDL 旨在解决寻找最佳深度学习 (Deep Learning) 架构的问题，同时通过多个深度学习 (Deep Learning) 架构的集成提高鲁棒性 (Robustness) 和准确率 (Accuracy)。简而言之，RMDL 并行训练多个深度神经网络 (DNN)、卷积神经网络 (CNN) 和循环神经网络 (RNN) 模型，并组合它们的结果，以产生优于任何单个模型的结果。为了创建这些模型，每个深度学习 (Deep Learning) 模型在其神经网络结构的层数和节点数方面都是随机构建的。生成的 RDML 模型可用于各种领域，如文本、视频、图像和符号。在本项目中，我们深入描述了 RMDL 模型，并展示了图像和文本分类以及人脸识别的结果。对于图像分类，我们使用 MNIST 和 CIFAR-10 数据集将我们的模型与一些可用的基线进行了比较。同样，我们使用了四个数据集，即 WOS、Reuters、IMDB 和 20newsgroup，并将我们的结果与可用的基线进行了比较。Web of Science (WOS) 由作者收集，包含三组 (~小、中、大组)。最后，我们使用 ORL 数据集将我们方法的性能与其他人脸识别方法进行了比较。这些测试结果表​​明，RDML 模型在各种数据类型和分类问题上始终优于标准方法。\n\n--------------------------------------------\n文本分层深度学习 (HDLTex)\n--------------------------------------------\n\n参考论文：`HDLTex: Hierarchical Deep Learning for Text\nClassification \u003Chttps:\u002F\u002Farxiv.org\u002Fabs\u002F1709.08267>`__\n\n\n|HDLTex|\n\n文档：\n\n日益庞大的文档集合需要改进的信息处理方法来搜索、检索和组织文本文档。这些信息处理方法的核心是文档分类，这已成为监督学习 (Supervised Learning) 旨在解决的重要任务。最近，随着文档数量的增加，传统监督分类器的性能有所下降。文档量的指数级增长也增加了类别的数量。本文以一种不同于当前文档分类方法的方式来解决这个问题，后者将问题视为多类分类 (Multi-class Classification)。相反，我们采用一种称为文本分类分层深度学习 (HDLTex) 的方法执行分层分类 (Hierarchical Classification)。HDLTex 采用堆叠的深度学习 (Deep Learning) 架构来提供对文档的分层理解。\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n文本分类算法比较\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **模型**                           | **优点**                                                                                                                                                 | **缺点**                                                                                                                        |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Rocchio 算法**                   |  * 易于实现                                                                                                                                                |  * 用户只能检索到少量相关文档                                                                                                                 |\n|                                    |                          
                                                                                                                                |                                                                                                                                         |\n|                                    |  * 计算成本非常低                                                                                                                                          |  * Rocchio 算法常对多模态类别进行错误分类                                                                                                   |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 相关性反馈机制（有助于将不相关文档的排名降低）                                                                                                          |  * 该技术鲁棒性不强                                                                                                                       |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |                                                                                                                                                          |  * 该算法中的线性组合不适用于多类数据集                                                                                                     |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **Boosting 和 Bagging**            |  * 提高了稳定性和准确性（利用集成学习 (Ensemble learning) 的优势，多个弱分类器优于单个强分类器。）                                                          |  * 计算复杂度高                                                                                                                           |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 降低方差，有助于避免过拟合 (Overfitting) 问题。                                                                                                         |  * 失去可解释性（如果模型数量过多，理解模型非常困难）                                                                                     |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |                                                   
                                                                                                       |  * 需要仔细调整不同的超参数 (Hyper-parameters)。                                                                                          |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **逻辑回归**                       |  * 易于实现                                                                                                                                                |  * 无法解决非线性问题                                                                                                                     |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 不需要太多计算资源                                                                                                                                      |  * 预测要求每个数据点相互独立                                                                                                             |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 不需要输入特征进行缩放（预处理）                                                                                                                        |  * 试图基于一组独立变量预测结果                                                                                                           |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 不需要任何调参                                                                                                                                          |                                                                                                                                         |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **朴素贝叶斯分类器**               |  * 在处理文本数据方面表现非常好                                                                                                                            |  * 对数据分布形状有强假设                                                                                                                 |\n|                                    |                                             
                                                                                                             |                                                                                                                                         |\n|                                    |  * 易于实现                                                                                                                                                |  * 受限于数据稀缺，特征空间中任何可能值的似然值必须由频率学派估计                                                                          |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 与其他算法相比速度快                                                                                                                                    |                                                                                                                                         |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **K 近邻 (KNN)**                   |  * 对文本数据集有效                                                                                                                                        |  * 该模型的计算成本非常高                                                                                                                 |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 非参数                                                                                                                                                    |  * 难以找到最优的 k 值                                                                                                                    |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 考虑了文本或文档的更多局部特征                                                                                                                          |  * 在寻找最近邻的大搜索问题中存在限制                                                                                                     |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 自然处理多类数据集            
                                                                                                                          |  * 对于文本数据集，找到一个有意义的距离函数很困难                                                                                         |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **支持向量机 (SVM)**               |  * SVM 可以建模非线性决策边界                                                                                                                              |  * 由于维度高导致结果缺乏透明度（尤其是文本数据）。                                                                                        |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 当线性可分时，表现与逻辑回归相似                                                                                                                        |  * 选择高效的核函数很困难（取决于核函数，容易受到过拟合\u002F训练问题的影响）                                                                   |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 对过拟合问题具有鲁棒性~（特别是对于文本数据集，因为高维空间）                                                                                           |  * 内存复杂度                                                                                                                             |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **决策树**                         |  * 可以轻松处理定性（类别）特征                                                                                                                            |  * 对角线决策边界存在问题                                                                                                                 |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 与平行于特征轴的决策边界配合良好                                                                                                                        |  * 容易过拟合                                                                                                                             |\n|                                    |                                                                                            
                                                              |                                                                                                                                         |\n|                                    |  * 决策树在学习和预测方面都是非常快的算法                                                                                                                  |  * 对数据中的微小扰动极其敏感                                                                                                             |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |                                                                                                                                                          |  * 样本外预测存在问题                                                                                                                     |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **条件随机场 (CRF)**               |  * 其特征设计灵活                                                                                                                                            |  * 训练步骤的计算复杂度高                                                                                                                 |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 由于 CRF 计算全局最优输出节点的条件概率，它克服了标签偏差的缺点                                                                                         |  * 该算法无法处理未知词汇                                                                                                                 |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 结合了分类和图形建模的优势，能够紧凑地建模多元数据                                                                                                      |  * 关于在线学习 (Online learning) 的问题（这使得当新数据可用时很难重新训练模型。）                                                        |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **随机森林**                       |  * 与其他技术相比，决策树的集成训练非常快                                                                                                        
            |  * 一旦训练完成，生成预测相当慢                                                                                                           |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 降低了方差（相对于普通树）                                                                                                                              |  * 森林中更多的树增加了预测步骤的时间复杂度                                                                                               |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 不需要准备和预处理输入数据                                                                                                                              |  * 不如其他方法易于直观解释                                                                                                               |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |                                                                                                                                                          |  * 容易发生过度拟合                                                                                                                       |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |                                                                                                                                                          |  * 需要选择森林中树的数量                                                                                                                 |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n| **深度学习 (Deep Learning)**       |  * 特征设计灵活（减少了对特征工程 (Feature engineering) 的需求，这是机器学习实践中最耗时的部分之一。）                                                      |  * 需要大量数据（如果你只有小样本文本数据，深度学习不太可能优于其他方法。）                                                                |\n|                                    |                                                                                                                                                          |                            
                                                                                                             |\n|                                    |  * 架构可适应新问题                                                                                                                                        |  * 训练计算成本极高。                                                                                                                     |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 可以处理复杂的输入输出映射                                                                                                                              |  * 模型可解释性是深度学习最重要的问题~（大多数时候深度学习是黑盒 (Black-box)）                                                            |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 可以轻松处理在线学习（这使得当新数据可用时很容易重新训练模型。）                                                                                      |  * 找到高效的架构和结构仍然是该技术的主要挑战                                                                                             |\n|                                    |                                                                                                                                                          |                                                                                                                                         |\n|                                    |  * 并行处理能力（它可以同时执行多项工作）                                                                                                                    |                                                                                                                                         |\n+------------------------------------+----------------------------------------------------------------------------------------------------------------------------------------------------------+-----------------------------------------------------------------------------------------------------------------------------------------+\n\n==========\n评估\n==========\n\n----\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nF1 分数 (F1 Score)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n.. image:: docs\u002Fpic\u002FF1.png\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\nMatthews 相关系数 (Matthews Correlation Coefficient, MCC)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\n\n计算 Matthews 相关系数 (MCC)\n\nMatthews 相关系数在机器学习中被用作衡量二分类（两类）问题质量的指标。它考虑了真阳性、假阳性、真阴性和假阴性，通常被认为是一种平衡的度量标准，即使类别大小差异很大也可以使用。MCC 本质上是一个介于 -1 和 +1 之间的相关系数值。+1 代表完美预测，0 代表平均随机预测，-1 代表反向预测。该统计量也被称为 phi 系数。 \n\n\n.. 
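\n\n为了更直观地理解该系数的定义，下面先用二分类混淆矩阵的四个元素手工展开计算（仅为示意代码，标签数据与变量命名均为假设的示例）：\n\n.. code:: python\n\n    import math\n\n    # 一组假设的二分类标签与预测（1 为正类，0 为负类）\n    y_true = [1, 1, 1, 0, 0, 1]\n    y_pred = [1, 0, 1, 0, 1, 1]\n\n    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)\n    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)\n    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)\n    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)\n\n    # MCC = (TP*TN - FP*FN) \u002F sqrt((TP+FP)(TP+FN)(TN+FP)(TN+FN))\n    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))\n    mcc = (tp * tn - fp * fn) \u002F denom if denom else 0.0\n    print(mcc)\n\n实际使用时可以直接调用 scikit-learn 提供的实现，如下例所示：\n\n.. 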
code:: python\n\n    from sklearn.metrics import matthews_corrcoef\n    y_true = [+1, +1, +1, -1]\n    y_pred = [+1, -1, +1, +1]\n    matthews_corrcoef(y_true, y_pred)  \n\n\n\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n接收者操作特征曲线 (Receiver Operating Characteristics, ROC)\n~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~\n\nROC 曲线通常用于二分类中以研究分类器的输出。为了将 ROC 曲线和 ROC 面积扩展到多分类或多标签分类，需要对输出进行二值化。可以为每个标签绘制一条 ROC 曲线，也可以将标签指示矩阵的每个元素视为二元预测来绘制 ROC 曲线（微平均）。\n\n多分类分类的另一种评估指标是宏平均，它为每个标签的分类赋予相等的权重。 [`sources  \u003Chttp:\u002F\u002Fscikit-learn.org\u002Fstable\u002Fauto_examples\u002Fmodel_selection\u002Fplot_roc.html>`__] \n\n.. code:: python\n\n    import numpy as np\n    import matplotlib.pyplot as plt\n    from itertools import cycle\n\n    from sklearn import svm, datasets\n    from sklearn.metrics import roc_curve, auc\n    from sklearn.model_selection import train_test_split\n    from sklearn.preprocessing import label_binarize\n    from sklearn.multiclass import OneVsRestClassifier\n    from scipy import interp\n\n    # Import some data to play with\n    iris = datasets.load_iris()\n    X = iris.data\n    y = iris.target\n\n    # Binarize the output\n    y = label_binarize(y, classes=[0, 1, 2])\n    n_classes = y.shape[1]\n\n    # Add noisy features to make the problem harder\n    random_state = np.random.RandomState(0)\n    n_samples, n_features = X.shape\n    X = np.c_[X, random_state.randn(n_samples, 200 * n_features)]\n\n    # shuffle and split training and test sets\n    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=.5,\n                                                        random_state=0)\n\n    # Learn to predict each class against the other\n    classifier = OneVsRestClassifier(svm.SVC(kernel='linear', probability=True,\n                                     random_state=random_state))\n    y_score = classifier.fit(X_train, y_train).decision_function(X_test)\n\n    # Compute ROC curve and ROC area for each class\n    fpr = dict()\n    tpr = dict()\n    roc_auc = dict()\n    for i in range(n_classes):\n        fpr[i], tpr[i], _ = roc_curve(y_test[:, i], y_score[:, i])\n        roc_auc[i] = auc(fpr[i], tpr[i])\n\n    # Compute micro-average ROC curve and ROC area\n    fpr[\"micro\"], tpr[\"micro\"], _ = roc_curve(y_test.ravel(), y_score.ravel())\n    roc_auc[\"micro\"] = auc(fpr[\"micro\"], tpr[\"micro\"])\n   \n\n特定类别的 ROC 曲线图\n\n\n.. code:: python\n\n    plt.figure()\n    lw = 2\n    plt.plot(fpr[2], tpr[2], color='darkorange',\n             lw=lw, label='ROC curve (area = %0.2f)' % roc_auc[2])\n    plt.plot([0, 1], [0, 1], color='navy', lw=lw, linestyle='--')\n    plt.xlim([0.0, 1.0])\n    plt.ylim([0.0, 1.05])\n    plt.xlabel('False Positive Rate')\n    plt.ylabel('True Positive Rate')\n    plt.title('Receiver operating characteristic example')\n    plt.legend(loc=\"lower right\")\n    plt.show()\n\n\n.. image:: \u002Fdocs\u002Fpic\u002Fsphx_glr_plot_roc_001.png\n\n\n~~~~~~~~~~~~~~~~~~~~~~~\n曲线下面积 (Area Under Curve, AUC)\n~~~~~~~~~~~~~~~~~~~~~~~\n\nROC 曲线下面积 (AUC) 是一个汇总指标，用于测量 ROC 曲线下方的整个区域。AUC 具有有用的属性，例如在方差分析 (ANOVA) 测试中增加灵敏度、独立于决策阈值、对先验类别概率不变以及对负类和正类关于决策索引的指示程度。\n\n\n.. 
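\n\n对于上一节中的多分类 ROC 示例，也可以不逐类手动计算，而是用 ``sklearn.metrics.roc_auc_score`` 一次性得到宏平均与微平均 AUC。以下为示意代码，假设沿用上文 ROC 示例中二值化后的 y_test 与分类器输出的 y_score：\n\n.. code:: python\n\n    from sklearn.metrics import roc_auc_score\n\n    # 宏平均：先分别计算每个类别的 AUC，再取简单平均\n    macro_auc = roc_auc_score(y_test, y_score, average=\"macro\")\n    # 微平均：将标签指示矩阵展平后整体计算一个 AUC\n    micro_auc = roc_auc_score(y_test, y_score, average=\"micro\")\n    print(macro_auc, micro_auc)\n\n下面的示例则针对二分类情形，直接由 roc_curve 的输出计算 AUC：\n\n.. 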
code:: python\n\n      import numpy as np\n      from sklearn import metrics\n\n      # y 为真实标签，pred 为预测得分（此处为示意用的占位变量）\n      fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=2)\n      metrics.auc(fpr, tpr)\n\n\n\n\n==========================\n文本与文档数据集\n==========================\n\n----\n\n~~~~~\nIMDB\n~~~~~\n\n- `IMDB 数据集 \u003Chttp:\u002F\u002Fai.stanford.edu\u002F~amaas\u002Fdata\u002Fsentiment\u002F>`__\n\n来自 IMDB 的 25,000 条电影评论数据集，按情感（正面\u002F负面）标记。评论已经过预处理，每条评论被编码为单词索引（整数）序列。为了方便起见，单词按其在数据集中的总体频率进行索引，例如整数 \"3\" 编码数据中第 3 个最常见的单词。这允许快速过滤操作，例如“只考虑最常见的 10,000 个单词，但剔除最常见的 20 个单词”。\n\n按照惯例，“0”不代表任何特定单词，而是用于编码所有未知单词。\n\n\n.. code:: python\n\n\n  from keras.datasets import imdb\n\n  (x_train, y_train), (x_test, y_test) = imdb.load_data(path=\"imdb.npz\",\n                                                        num_words=None,\n                                                        skip_top=0,\n                                                        maxlen=None,\n                                                        seed=113,\n                                                        start_char=1,\n                                                        oov_char=2,\n                                                        index_from=3)\n\n~~~~~~~~~~~~~\nReuters-21578\n~~~~~~~~~~~~~\n\n- `Reuters-21578 数据集 \u003Chttps:\u002F\u002Fkeras.io\u002Fdatasets\u002F>`__\n\n\n来自路透社的 11,228 条新闻电文数据集，标记了 46 个主题。与 IMDB 数据集一样，每条电文编码为单词索引序列（相同约定）。\n\n\n.. code:: python\n\n  from keras.datasets import reuters\n\n  (x_train, y_train), (x_test, y_test) = reuters.load_data(path=\"reuters.npz\",\n                                                            num_words=None,\n                                                            skip_top=0,\n                                                            maxlen=None,\n                                                            test_split=0.2,\n                                                            seed=113,\n                                                            start_char=1,\n                                                            oov_char=2,\n                                                            index_from=3)\n\n~~~~~~~~~~~~~\n20Newsgroups\n~~~~~~~~~~~~~\n\n- `20Newsgroups Dataset \u003Chttps:\u002F\u002Farchive.ics.uci.edu\u002Fml\u002Fdatasets\u002FTwenty+Newsgroups>`__\n\n20 个新闻组数据集 (20 Newsgroups Dataset) 包含约 18,000 条关于 20 个主题的帖子，分为两个子集：一个用于训练（或开发），另一个用于测试（或性能评估）。训练集与测试集的划分基于帖子是在某个特定日期之前还是之后发布。\n\n该模块包含两个加载器。第一个是 ``sklearn.datasets.fetch_20newsgroups``（scikit-learn 中获取 20 个新闻组数据的函数），它返回原始文本列表，这些文本可以馈送给文本特征提取器，例如带有自定义参数的 ``sklearn.feature_extraction.text.CountVectorizer``（词频向量器），以提取特征向量。第二个是 ``sklearn.datasets.fetch_20newsgroups_vectorized``，它返回现成的特征，即不需要使用特征提取器。\n\n\n.. 
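\n\n下面的示意代码演示第二个加载器的基本用法：它直接返回稀疏的词频特征矩阵，省去了手动调用特征提取器的步骤（具体特征维度取决于所用的 scikit-learn 版本）：\n\n.. code:: python\n\n  from sklearn.datasets import fetch_20newsgroups_vectorized\n\n  # 返回一个 Bunch 对象：data 为稀疏特征矩阵，target 为类别标签\n  bunch = fetch_20newsgroups_vectorized(subset='train')\n  print(bunch.data.shape, bunch.target.shape)\n\n第一个加载器返回原始文本，其基本用法如下：\n\n.. 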
code:: python\n\n  from sklearn.datasets import fetch_20newsgroups\n  newsgroups_train = fetch_20newsgroups(subset='train')\n\n  from pprint import pprint\n  pprint(list(newsgroups_train.target_names))\n  \n  ['alt.atheism',\n   'comp.graphics',\n   'comp.os.ms-windows.misc',\n   'comp.sys.ibm.pc.hardware',\n   'comp.sys.mac.hardware',\n   'comp.windows.x',\n   'misc.forsale',\n   'rec.autos',\n   'rec.motorcycles',\n   'rec.sport.baseball',\n   'rec.sport.hockey',\n   'sci.crypt',\n   'sci.electronics',\n   'sci.med',\n   'sci.space',\n   'soc.religion.christian',\n   'talk.politics.guns',\n   'talk.politics.mideast',\n   'talk.politics.misc',\n   'talk.religion.misc']\n \n \n~~~~~~~~~~~~~~~~~~~~~~\nWeb of Science Dataset\n~~~~~~~~~~~~~~~~~~~~~~\n\n数据集描述：\n\n这里有三个数据集，包括 WOS-11967 , WOS-46985, 和 WOS-5736。每个文件夹包含以下内容：\n\n- X.txt\n- Y.txt\n- YL1.txt\n- YL2.txt\n\nX 是包含文本序列的输入数据\nY 是目标值\nYL1 是第一层级目标值（父标签）\nYL2 是第二层级目标值（子标签）\n\n元数据：\n此文件夹包含一个数据文件，属性如下：\nY1 Y2 Y Domain area keywords Abstract\n\n摘要 (Abstract) 是包含 46,985 篇已发表论文文本序列的输入数据\nY 是目标值\nYL1 是第一层级目标值（父标签）\nYL2 是第二层级目标值（子标签）\n领域 (Domain) 是主要领域，包含 7 个标签：{计算机科学，电气工程，心理学，机械工程，土木工程，医学科学，生物化学}\n区域 (Area) 是论文的次级领域或区域，例如 CS->计算机图形学，包含 134 个标签。\n关键词 (Keywords)：是论文的作者关键词\n\n- Web of Science 数据集 `WOS-11967 \u003Chttp:\u002F\u002Fdx.doi.org\u002F10.17632\u002F9rw3vkcfy4.2>`__\n..\n\n  此数据集包含 11,967 份文档，共 35 个类别，其中包括 7 个父类别。\n\n- Web of Science 数据集 `WOS-46985 \u003Chttp:\u002F\u002Fdx.doi.org\u002F10.17632\u002F9rw3vkcfy4.2>`__\n      \n..\n\n  此数据集包含 46,985 份文档，共 134 个类别，其中包括 7 个父类别。\n\n- Web of Science 数据集 `WOS-5736 \u003Chttp:\u002F\u002Fdx.doi.org\u002F10.17632\u002F9rw3vkcfy4.2>`__\n\n..\n  \n  此数据集包含 5,736 份文档，共 11 个类别，其中包括 3 个父类别。\n\n参考文献：HDLTex: 用于文本分类的分层深度学习 (Hierarchical Deep Learning for Text Classification)\n     \n================================\n文本分类应用\n================================\n\n\n----\n\n\n\n~~~~~~~~~~~~~~~~~~~~~~\n信息检索\n~~~~~~~~~~~~~~~~~~~~~~\n信息检索 (Information retrieval) 是从大量文档集合中找到满足信息需求的非结构化数据文档的过程。随着在线信息的快速增长，特别是文本格式的信息，文本分类 (text classification) 已成为管理此类数据的重要技术。该领域使用的一些重要方法包括朴素贝叶斯 (Naive Bayes)、支持向量机 (SVM)、决策树 (decision tree)、J48、k-近邻 (k-NN) 和 IBK。文档和文本数据集处理最具挑战性的应用之一是将文档分类方法应用于信息检索。\n\n- 🎓 `信息检索简介 \u003Chttp:\u002F\u002Feprints.bimcoordinator.co.uk\u002F35\u002F>`__ Manning, C., Raghavan, P., & Schütze, H. (2010).\n     \n- 🎓 `网络论坛检索与文本分析：综述 \u003Chttp:\u002F\u002Fwww.nowpublishers.com\u002Farticle\u002FDetails\u002FINR-062>`__ Hoogeveen, Doris, et al.. (2018).\n\n- 🎓 `信息检索中的自动文本分类：综述 \u003Chttps:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=2905191>`__ Dwivedi, Sanjay K., and Chandrakala Arya.. (2016).\n\n~~~~~~~~~~~~~~~~~~~~~~\n信息过滤\n~~~~~~~~~~~~~~~~~~~~~~\n信息过滤 (Information filtering) 指的是从传入的数据流中选择相关信息或拒绝无关信息。信息过滤系统通常用于衡量和预测用户的长期兴趣。概率模型 (Probabilistic models)，如贝叶斯推理网络 (Bayesian inference network)，常用于信息过滤系统。贝叶斯推理网络采用递归推理将值传播通过推理网络，并返回排名最高的文档。Chris 使用带有迭代优化的向量空间模型 (vector space model) 进行过滤任务。\n \n\n- 🎓 `搜索引擎：实践中的信息检索 \u003Chttp:\u002F\u002Flibrary.mpib-berlin.mpg.de\u002Ftoc\u002Fz2009_2465.pdf\u002F>`__ Croft, W. B., Metzler, D., & Strohman, T. 
(2010).\n\n- 🎓 `SMART 信息检索系统的实现 \u003Chttps:\u002F\u002Fecommons.cornell.edu\u002Fbitstream\u002Fhandle\u002F1813\u002F6526\u002F85-686.pdf?sequence=1>`__ Buckley, Chris\n\n~~~~~~~~~~~~~~~~~~~~~~\n情感分析\n~~~~~~~~~~~~~~~~~~~~~~\n情感分析 (Sentiment analysis) 是一种识别文本中观点、情感和主观性的计算方法。情感分类方法将与观点相关的文档分类为正面或负面。假设文档 d 表达了一个单一实体 e 的观点，且观点是通过单一观点持有者 h 形成的。朴素贝叶斯分类和支持向量机 (SVM) 是用于情感分类的最流行的监督学习方法 (supervised learning methods) 之一。术语及其各自频率、词性 (part of speech)、观点词汇和短语、否定词 (negations) 和句法依赖 (syntactic dependency) 等特征已被用于情感分类技术中。\n\n- 🎓 `观点挖掘与情感分析 \u003Chttp:\u002F\u002Fwww.nowpublishers.com\u002Farticle\u002FDetails\u002FINR-011>`__ Pang, Bo, and Lillian Lee. (2008).\n\n- 🎓 `意见挖掘与情感分析综述 \u003Chttps:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-1-4614-3223-4_13>`__ Liu, Bing, and Lei Zhang. (2010).\n\n- 🎓 `点赞？：使用机器学习技术进行情感分类 \u003Chttps:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=1118704>`__ Pang, Bo, Lillian Lee, and Shivakumar Vaithyanathan. \n\n~~~~~~~~~~~~~~~~~~~~~~\n推荐系统\n~~~~~~~~~~~~~~~~~~~~~~\n基于内容的推荐系统 (Content-based recommender systems) 根据项目的描述和用户兴趣的个人资料向用户推荐项目。用户的个人资料可以通过用户对项目的反馈（搜索查询历史或自我报告）以及个人资料中的自解释特征~（查询的过滤器或条件）来学习。通过这种方式，此类推荐系统的输入可以是半结构化的，即某些属性从自由文本字段中提取，而其他属性则直接指定。许多不同类型的文本分类方法已被用于建模用户偏好，例如决策树、最近邻方法、Rocchio 算法、线性分类器、概率方法和朴素贝叶斯 (Naive Bayes)。\n\n- 🎓 `基于内容的推荐系统 \u003Chttps:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-319-29659-3_4>`__ Aggarwal, Charu C. (2016).\n\n- 🎓 `基于内容的推荐系统 \u003Chttps:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-540-72079-9_10>`__ Pazzani, Michael J., and Daniel Billsus.\n\n~~~~~~~~~~~~~~~~~~~~~~\n知识管理\n~~~~~~~~~~~~~~~~~~~~~~\n文本数据库是信息和知识的重要来源。很大一部分企业信息（近 80%）存在于文本数据格式（非结构化）中。在知识提炼 (knowledge distillation) 中，模式或知识是从可以直接提取的形式中推断出来的，这些形式可以是半结构化的（例如概念图表示）或结构化\u002F关系型数据表示）。给定的中间形式可以基于文档，其中每个实体代表特定领域中感兴趣的对象或概念。文档分类 (Document categorization) 是挖掘基于文档的中间形式的最常用方法之一。在其他研究中，文本分类被用于查找铁路事故原因与其在报告中对应描述之间的关系。\n\n- 🎓 `文本挖掘：概念、应用、工具和问题——概述 \u003Chttp:\u002F\u002Fciteseerx.ist.psu.edu\u002Fviewdoc\u002Fdownload?doi=10.1.1.403.2426&rep=rep1&type=pdf>`__ Sumathy, K. L., and M. Chidambaram.  (2013).\n\n- 🎓 `使用深度学习分析铁路事故叙述 \u003Chttps:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8614260\u002F>`__ Heidarysafa, Mojtaba, et al. (2018).\n\n~~~~~~~~~~~~~~~~~~~~~~\n文档摘要\n~~~~~~~~~~~~~~~~~~~~~~\n文本分类用于文档摘要，文档摘要可能会使用原始文档中未出现的单词或短语。由于在线信息迅速增加，多文档摘要 (Multi-document summarization) 也变得必要。因此，许多研究人员专注于使用文本分类从文档中提取重要特征的任务。\n\n- 🎓 `自动文本摘要进展 \u003Chttps:\u002F\u002Fbooks.google.com\u002Fbooks?hl=en&lr=&id=YtUZQaKDmzEC&oi=fnd&pg=PA215&dq=Advances+in+automatic+text+summarization&ots=ZpvCsrG-dC&sig=8ecTDTrQR4mMzDnKvI58sowh3Fg>`__ Mani, Inderjeet. \n\n- 🎓 `通过文本分类改进多文档摘要 \u003Chttps:\u002F\u002Fwww.aaai.org\u002Focs\u002Findex.php\u002FAAAI\u002FAAAI17\u002Fpaper\u002FviewPaper\u002F14525>`__ Cao, Ziqiang, et al. (2017).\n\n================================\n文本分类支持\n================================\n\n~~~~~~~~~~~~~~~~~~~~~~\n健康医疗\n~~~~~~~~~~~~~~~~~~~~~~\n医学领域中的大多数文本信息以非结构化或叙事形式呈现，包含模糊术语和拼写错误。此类信息需要在诊断和治疗的不同阶段的医患诊疗过程中随时可用。医疗编码 (Medical coding)，即将医疗诊断分配给从大量类别中获得的具体类值，是文本分类技术可能非常有价值的医疗保健应用领域之一。在其他研究中，J. 
Zhang 等人引入了 Patient2Vec，以学习纵向电子健康记录 (Electronic Health Record, EHR) 数据的可解释深度表示，该表示针对每位患者个性化。Patient2Vec 是一种新颖的数据集特征嵌入技术，可以基于循环神经网络 (recurrent neural networks) 和注意力机制 (attention mechanism) 学习基于 EHR 数据的个性化可解释深度表示。文本分类也已应用于医学主题词表 (Medical Subject Headings, MeSH) 和基因本体论 (Gene Ontology, GO) 的开发。\n\n\n- 🎓 `Patient2Vec：纵向电子健康记录的个性化可解释深度表示 \u003Chttps:\u002F\u002Fieeexplore.ieee.org\u002Fabstract\u002Fdocument\u002F8490816\u002F>`__ Zhang, Jinghe, et al. (2018)\n\n- 🎓 `结合贝叶斯文本分类和收缩以自动化医疗编码：数据质量分析 \u003Chttps:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=2063506>`__ Lauría, Eitel JM, and Alan D. March. (2011).\n\n- 🎓 `A \u003Chttp:\u002F\u002Fb\u002F>`__ 等。(2010).\n\n- 🎓 `MeSH Up：有效的 MeSH 文本分类以改善文档检索 \u003Chttps:\u002F\u002Facademic.oup.com\u002Fbioinformatics\u002Farticle-abstract\u002F25\u002F11\u002F1412\u002F333120>`__ Trieschnigg, Dolf, et al.\n\n~~~~~~~~~~~~~~~~~~~~~~\n社会科学\n~~~~~~~~~~~~~~~~~~~~~~\n过去几十年，文本分类和文档分类已越来越多地应用于理解人类行为。近年来，人类行为研究中的数据驱动工作侧重于挖掘非正式笔记和文本数据集中的语言，包括短消息服务 (SMS)、临床笔记、社交媒体等。这些研究主要侧重于使用基于单词出现频率的方法（即单词在文档中出现的频率）或基于语言调查词计数 (Linguistic Inquiry Word Count, LIWC) 的特征，LIWC 是一个经过验证的具有心理学相关性的单词类别词典。\n\n- 🎓 `使用短信识别年轻人迫在眉睫的自杀风险 \u003Chttps:\u002F\u002Fdl.acm.org\u002Fcitation.cfm?id=3173987>`__ Nobles, Alicia L., et al. (2018).\n\n- 🎓 `文本情感分类：跨数据集互操作性研究 \u003Chttps:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-319-63004-5_21>`__ Ofoghi, Bahadorreza, and Karin Verspoor. (2017).\n\n- 🎓 `公共卫生的社会监测 \u003Chttps:\u002F\u002Fwww.morganclaypool.com\u002Fdoi\u002Fabs\u002F10.2200\u002FS00791ED1V01Y201707ICR060>`__ Paul, Michael J., and Mark Dredze (2017).\n\n~~~~~~~~~~~~~~~~~~~~~~\n商业与营销\n~~~~~~~~~~~~~~~~~~~~~~\n盈利性公司和企业正越来越多地使用社交媒体进行营销目的。从社交媒体（如 Facebook、Twitter 等）中进行意见挖掘 (Opinion mining) 是公司快速增加利润的主要目标。文本和文档分类是公司比以往任何时候都更容易找到客户的强大工具。\n\n- 🎓 `Opinion mining using ensemble text hidden Markov models for text classification \u003Chttps:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS0957417417304979>`__ Kang, Mangi, Jaelim Ahn, and Kichun Lee. (2018).\n\n- 🎓 `Classifying business marketing messages on Facebook \u003Chttps:\u002F\u002Fwww.researchgate.net\u002Fprofile\u002FBei_Yu2\u002Fpublication\u002F236246670_Classifying_Business_Marketing_Messages_on_Facebook\u002Flinks\u002F56bcb34408ae6cc737c6335b.pdf>`__ Yu, Bei, and Linchi Kwok.\n\n~~~~~~~~~~~~~~~~~~~~~~\n法律\n~~~~~~~~~~~~~~~~~~~~~~\n政府机构已产生海量的法律文本信息及文档。检索这些信息并对其进行自动分类，不仅可以帮助律师，也能惠及他们的客户。\n在美国，法律源自五个来源：宪法、成文法、条约、行政法规和普通法。此外，每年都有大量新的法律文件被创建。对这些文件进行分类是法律界面临的主要挑战。\n\n- 🎓 `Represent yourself in court: How to prepare & try a winning case \u003Chttps:\u002F\u002Fbooks.google.com\u002Fbooks?hl=en&lr=&id=-lodDQAAQBAJ&oi=fnd&pg=PP1&dq=Represent+yourself+in+court:+How+to+prepare+%5C%26+try+a+winning+case&ots=tgJ8Q2MkH_&sig=9o3ILDn3LfO30BZKsyI2Ou7Q8Qs>`__ Bergman, Paul, and Sara J. Berman. (2016)\n\n- 🎓 `Text retrieval in the legal world \u003Chttps:\u002F\u002Flink.springer.com\u002Farticle\u002F10.1007\u002FBF00877694>`__ Turtle, Howard.\n\n==========\n引用：\n==========\n\n----\n\n.. code::\n\n    @ARTICLE{Kowsari2018Text_Classification,\n        title={Text Classification Algorithms: A Survey},\n        author={Kowsari, Kamran and Jafari Meimandi, Kiana and Heidarysafa, Mojtaba and Mendu, Sanjana and Barnes, Laura E. 
and Brown, Donald E.},\n        journal={Information},\n        VOLUME = {10},  \n        YEAR = {2019},\n        NUMBER = {4},\n        ARTICLE-NUMBER = {150},\n        URL = {http:\u002F\u002Fwww.mdpi.com\u002F2078-2489\u002F10\u002F4\u002F150},\n        ISSN = {2078-2489},\n        publisher={Multidisciplinary Digital Publishing Institute}\n    }\n\n.. |RMDL| image:: docs\u002Fpic\u002FRMDL.jpg\n.. |line| image:: docs\u002Fpic\u002Fline.png\n          :alt: 占位符\n.. |HDLTex| image:: docs\u002Fpic\u002FHDLTex.png\n\n\n.. |twitter| image:: https:\u002F\u002Fimg.shields.io\u002Ftwitter\u002Furl\u002Fhttp\u002Fshields.io.svg?style=social\n    :target: https:\u002F\u002Ftwitter.com\u002Fintent\u002Ftweet?text=Text%20Classification%20Algorithms:%20A%20Survey%0aGitHub:&url=https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification&hashtags=Text_Classification,classification,MachineLearning,Categorization,NLP,NATURAL,LANGUAGE,PROCESSING\n    \n.. |contributions-welcome| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcontributions-welcome-brightgreen.svg?style=flat\n    :target: https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fpulls\n.. |ansicolortags| image:: https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fl\u002Fansicolortags.svg\n      :target: https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fblob\u002Fmaster\u002FLICENSE\n.. |contributors| image:: https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fcontributors\u002Fkk7nc\u002FText_Classification.svg\n      :target: https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fgraphs\u002Fcontributors \n\n.. |arXiv| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-1904.08067-red.svg?style=flat\n   :target: https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.08067\n   \n.. |DOI| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDOI-10.3390\u002Finfo10040150-blue.svg?style=flat\n   :target: https:\u002F\u002Fdoi.org\u002F10.3390\u002Finfo10040150\n   \n   \n.. |medium| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMedium-Text%20Classification-blueviolet.svg\n    :target: https:\u002F\u002Fmedium.com\u002Ftext-classification-algorithms\u002Ftext-classification-algorithms-a-survey-a215b7ab7e2d\n\n.. |UniversityCube| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FUniversityCube-Follow%20us%20for%20the%20Latest%20News!-blue.svg\n    :target: https:\u002F\u002Fwww.universitycube.net\u002Fnews\n\n\n.. |mendeley| image:: https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FMendeley-Add%20to%20Library-critical.svg\n    :target: https:\u002F\u002Fwww.mendeley.com\u002Fimport\u002F?url=https:\u002F\u002Fdoi.org\u002F10.3390\u002Finfo10040150\n    \n.. |Best| image::     https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAward-Best%20Paper%20Award%202019-brightgreen\n    :target: https:\u002F\u002Fwww.mdpi.com\u002Fjournal\u002Finformation\u002Fawards\n       \n.. |BPW| image:: docs\u002Fpic\u002FBPW.png\n    :target: https:\u002F\u002Fwww.mdpi.com\u002Fjournal\u002Finformation\u002Fawards","# Text_Classification 快速上手指南\n\n本指南基于开源项目 **Text Classification Algorithms: A Survey** 整理，旨在帮助开发者快速搭建文本分类预处理与特征提取环境。\n\n## 1. 环境准备\n\n- **操作系统**：Linux \u002F macOS \u002F Windows（部分工具如 Word2Vec 已针对 macOS 编译优化）\n- **Python 版本**：推荐 Python 3.6+\n- **前置依赖**：Git、Python 包管理工具 pip\n\n## 2. 
安装步骤\n\n### 2.1 克隆项目\n```bash\ngit clone \u003Crepository_url>\ncd Text_Classification\n```\n\n### 2.2 安装 Python 依赖\n推荐使用国内镜像源加速下载：\n```bash\npip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple nltk autocorrect\n```\n\n### 2.3 下载 NLTK 资源\n首次运行 NLTK 相关功能前，需下载必要的语料库数据：\n```python\nimport nltk\nnltk.download('punkt')\nnltk.download('stopwords')\n```\n\n## 3. 基本使用\n\n以下示例展示了如何使用 NLTK 进行分词及停用词过滤，这是文本分类中最基础的预处理步骤。\n\n### 3.1 停用词过滤示例\n将原始文本中的常见无意义词汇（如 \"is\", \"a\", \"the\"）移除，保留核心语义信息。\n\n```python\n  from nltk.corpus import stopwords\n  from nltk.tokenize import word_tokenize\n\n  example_sent = \"This is a sample sentence, showing off the stop words filtration.\"\n\n  stop_words = set(stopwords.words('english'))\n\n  word_tokens = word_tokenize(example_sent)\n\n  filtered_sentence = [w for w in word_tokens if not w in stop_words]\n\n  filtered_sentence = []\n\n  for w in word_tokens:\n      if w not in stop_words:\n          filtered_sentence.append(w)\n\n  print(word_tokens)\n  print(filtered_sentence)\n```\n\n### 3.2 其他功能模块\n项目中还包含以下算法实现，可根据需求进一步探索：\n- **Stemming (词干提取)**：使用 PorterStemmer 还原单词变体。\n- **Lemmatization (词形还原)**：使用 WordNetLemmatizer 获取基础词形。\n- **Word Embedding**：集成 Word2Vec 和 GloVe 模型用于向量表示。\n\n> **注意**：若需使用 C++ 版本的 Word2Vec 或 GloVe 训练模型，请参考项目文档中的具体编译说明。","某电商平台客服团队每日需处理上万条用户评论，旨在快速识别问题类型以优化服务体验。\n\n### 没有 Text_Classification 时\n- 人工逐条阅读评论，效率低下且极易因疲劳导致误判，响应速度慢\n- 原始文本包含大量网络缩写和拼写错误，直接干扰关键词匹配效果\n- 缺乏标准化的分词与去噪流程，导致特征提取不准确，模型泛化能力差\n- 历史数据堆积无法结构化，难以挖掘潜在的产品改进方向，决策滞后\n\n### 使用 Text_Classification 后\n- 利用其内置的分词与停用词过滤功能，自动清理文本中的无效噪声，提升数据质量\n- 基于算法自动将评论精准归类为“产品质量”、“物流配送”或“售后服务”，准确率高\n- 标准化特征提取确保了不同来源数据的分类逻辑高度一致，便于后续分析\n- 批量处理能力使原本需数周的人工标注工作缩短至几小时内完成，释放人力\n\n通过集成文本清洗与分类算法，实现了从非结构化数据到可执行洞察的高效转化。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fkk7nc_Text_Classification_c935cae1.png","kk7nc","Kamran  Kowsari","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fkk7nc_3c3aaa3a.jpg","Kamran Kowsari is scientist at University of California, Los Angeles (UCLA).","University of California, Los Angeles","Los Angeles","kk7nc@virginia.edu",null,"https:\u002F\u002Fwww.kamrankowsari.com\u002F","https:\u002F\u002Fgithub.com\u002Fkk7nc",[86],{"name":87,"color":88,"percentage":89},"Python","#3572A5",100,1831,543,"2026-03-23T19:47:22","MIT","Linux, macOS","未说明",{"notes":97,"python":95,"dependencies":98},"本项目为文本分类算法综述仓库，整合了多种特征提取与模型实现（如 Word2Vec, GloVe, ELMo）。代码示例依赖 NLTK 和 autocorrect 库。Word2Vec 组件包含针对 Mac OS X 的编译补丁。运行演示脚本需联网下载约 100MB 训练语料。",[99,100],"nltk","autocorrect",[26,13],[103,104,105,106,107,108,109,110,111,112,113,114,115,116,117,118,119,120,121,122],"text-classification","nlp-machine-learning","document-classification","text-processing","dimensionality-reduction","rocchio-algorithm","boosting-algorithms","logistic-regression","naive-bayes-classifier","k-nearest-neighbours","support-vector-machines","decision-trees","random-forest","conditional-random-fields","deep-learning","deep-neural-network","recurrent-neural-networks","convolutional-neural-networks","deep-belief-network","hierarchical-attention-networks","2026-03-27T02:49:30.150509","2026-04-06T05:19:29.719052",[126,131,136,141,146],{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},3174,"运行时遇到“分类指标无法处理多分类和连续多输出目标混合”的错误如何解决？","当 y_test 形状为 (n,) 而 predicted 形状为 (n, classes) 时，直接比较会报错。请在 classification_report 中对预测结果取最大索引值。修改代码为：print(metrics.classification_report(y_test, np.argmax(predicted, axis = 
1)))。","https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fissues\u002F8",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},3175,"CBOW 和 Skip-gram 的示意图是否存在错误？","是的，原图存在左右混淆。根据论文 Figure 1，左侧应为 CBOW，右侧应为 Skip-gram。该图片错误已被维护者修复，请参考最新版本。","https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fissues\u002F2",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},3176,"为什么示例中选择使用 Porter Stemmer 而不是 Snowball Stemmer？","使用 Porter Stemmer 没有特殊原因，仅作为示例展示。Snowball Stemmer (PorterV2) 通常被认为更优越。你可以自行添加并使用，欢迎提交 Pull Request 丰富仓库。","https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fissues\u002F10",{"id":142,"question_zh":143,"answer_zh":144,"source_url":145},3177,"RNN 代码块中出现的变量名拼写错误如何修正？","原代码 predicted = Build_Model_RNN_Text.predict_classes(X_test_Glove) 有误。应修改为 predicted = model_RNN.predict_classes(X_test_Glove)，以匹配实际定义的模型变量名。","https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fissues\u002F7",{"id":147,"question_zh":148,"answer_zh":149,"source_url":150},3178,"CNN 章节中出现的模型类型标注笔误如何修正？","原文档存在将 CNN 误标为 RNN 的笔误。维护者已确认此问题，建议参考相关的 Pull Request 或手动修正文档中的标签以确保准确性。","https:\u002F\u002Fgithub.com\u002Fkk7nc\u002FText_Classification\u002Fissues\u002F5",[]]