[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-sreeharierk--datascience":3,"tool-sreeharierk--datascience":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":78,"owner_location":78,"owner_email":78,"owner_twitter":75,"owner_website":78,"owner_url":79,"languages":78,"stars":80,"forks":81,"last_commit_at":82,"license":83,"difficulty_score":84,"env_os":85,"env_gpu":85,"env_ram":85,"env_deps":86,"category_tags":89,"github_topics":90,"view_count":100,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":101,"updated_at":102,"faqs":103,"releases":119},739,"sreeharierk\u002Fdatascience","datascience","This repository is a compilation of free resources for learning Data Science.","datascience 是一个汇集数据科学学习资源的开源项目，旨在为初学者提供一条清晰的知识进阶路径。它解决了学习者在面对海量零散资料时难以构建系统知识体系的问题。通过这份详细的路线图，项目将数据科学的核心领域拆解为数学基础、算法复杂度及数据库原理等模块，涵盖矩阵运算、哈希函数、Big O 表示法以及 SQL 各类连接操作等关键内容。\n\n其独特的技术亮点在于图文并茂地解释抽象概念，例如用示意图展示二叉树结构或自然连接流程，并辅以具体的代码请求示例，让理论学习更加直观。这非常适合希望转型数据领域的学生、需要夯实基础的初级开发者，以及从事数据分析的研究人员。无论你是想查漏补缺还是从零开始规划学习路线，datascience 都能帮助你高效梳理核心技能，快速入门数据科学世界。","# Data-Scientist-Roadmap (2021)\n\n![roadmap-picture](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_3467d861335e.png)\n\n****\n\n# 1_ Fundamentals\n\n\n## 1_ Matrices & Algebra fundamentals\n\n### About\n\nIn mathematics, a matrix is a __rectangular array of numbers, symbols, or expressions, arranged in rows and columns__. A matrix could be reduced as a submatrix of a matrix by deleting any collection of rows and\u002For columns.\n\n![matrix-image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fb\u002Fbb\u002FMatrix.svg)\n\n### Operations\n\nThere are a number of basic operations that can be applied to modify matrices:\n\n* [Addition](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMatrix_addition)\n* [Scalar Multiplication](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FScalar_multiplication)\n* [Transposition](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTranspose)\n* [Multiplication](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMatrix_multiplication)\n\n\n## 2_ Hash function, binary tree, O(n)\n\n### Hash function\n\n#### Definition\n\nA hash function is __any function that can be used to map data of arbitrary size to data of fixed size__. One use is a data structure called a hash table, widely used in computer software for rapid data lookup. Hash functions accelerate table or database lookup by detecting duplicated records in a large file.\n\n![hash-image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F5\u002F58\u002FHash_table_4_1_1_0_0_1_0_LL.svg)\n\n### Binary tree\n\n#### Definition\n\nIn computer science, a binary tree is __a tree data structure in which each node has at most two children__, which are referred to as the left child and the right child.\n\n![binary-tree-image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Ff\u002Ff7\u002FBinary_tree.svg)\n\n### O(n)\n\n#### Definition\n\nIn computer science, big O notation is used to __classify algorithms according to how their running time or space requirements grow as the input size grows__. In analytic number theory, big O notation is often used to __express a bound on the difference between an arithmetical function and a better understood approximation__.\n\n## 3_ Relational algebra, DB basics\n\n### Definition\n\nRelational algebra is a family of algebras with a __well-founded semantics used for modelling the data stored in relational databases__, and defining queries on it.\n\nThe main application of relational algebra is providing a theoretical foundation for __relational databases__, particularly query languages for such databases, chief among which is SQL.\n\n### Natural join\n\n#### About\n\nIn SQL language, a natural junction between two tables will be done if :\n\n* At least one column has the same name in both tables\n* Theses two columns have the same data type\n    * CHAR (character)\n    * INT (integer)\n    * FLOAT (floating point numeric data)\n    * VARCHAR (long character chain)\n    \n#### mySQL request\n\n        SELECT \u003CCOLUMNS>\n        FROM \u003CTABLE_1>\n        NATURAL JOIN \u003CTABLE_2>\n\n        SELECT \u003CCOLUMNS>\n        FROM \u003CTABLE_1>, \u003CTABLE_2>\n        WHERE TABLE_1.ID = TABLE_2.ID\n\n## 4_ Inner, Outer, Cross, theta-join\n\n### Inner join\n\nThe INNER JOIN keyword selects records that have matching values in both tables.\n\n#### Request\n\n      SELECT column_name(s)\n      FROM table1\n      INNER JOIN table2 ON table1.column_name = table2.column_name;\n\n![inner-join-image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_ac4e75e16439.gif)\n\n### Outer join\n\nThe FULL OUTER JOIN keyword return all records when there is a match in either left (table1) or right (table2) table records.\n\n#### Request\n\n      SELECT column_name(s)\n      FROM table1\n      FULL OUTER JOIN table2 ON table1.column_name = table2.column_name; \n\n![outer-join-image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_78c764bdbdaf.gif)\n\n### Left join\n\nThe LEFT JOIN keyword returns all records from the left table (table1), and the matched records from the right table (table2). The result is NULL from the right side, if there is no match.\n\n#### Request\n\n      SELECT column_name(s)\n      FROM table1\n      LEFT JOIN table2 ON table1.column_name = table2.column_name;\n\n![left-join-image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_35125df03a8e.gif)\n\n### Right join\n\nThe RIGHT JOIN keyword returns all records from the right table (table2), and the matched records from the left table (table1). The result is NULL from the left side, when there is no match.\n#### Request\n\n      SELECT column_name(s)\n      FROM table1\n      RIGHT JOIN table2 ON table1.column_name = table2.column_name;\n\n![left-join-image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_02187c9d9300.gif)\n\n## 5_ CAP theorem\n\nIt is impossible for a distributed data store to simultaneously provide more than two out of the following three guarantees:\n \n* Every read receives the most recent write or an error.\n* Every request receives a (non-error) response – without guarantee that it contains the most recent write.\n* The system continues to operate despite an arbitrary number of messages being dropped (or delayed) by the network between nodes.\n\nIn other words, the CAP Theorem states that in the presence of a network partition, one has to choose between consistency and availability. Note that consistency as defined in the CAP Theorem is quite different from the consistency guaranteed in ACID database transactions.\n\n## 6_ Tabular data\n\nTabular data are __opposed to relational__ data, like SQL database.\n\nIn tabular data, __everything is arranged in columns and rows__. Every row have the same number of column (except for missing value, which could be substituted by \"N\u002FA\".\n\nThe __first line__ of tabular data is most of the time a __header__, describing the content of each column.\n\nThe most used format of tabular data in data science is __CSV___. Every column is surrounded by a character (a tabulation, a coma ..), delimiting this column from its two neighbours.\n\n## 7_ Entropy\n\nEntropy is a __measure of uncertainty__. High entropy means the data has high variance and thus contains a lot of information and\u002For noise.\n\nFor instance, __a constant function where f(x) = 4 for all x has no entropy and is easily predictable__, has little information, has no noise and can be succinctly represented . Similarly, f(x) = ~4 has some entropy while f(x) = random number is very high entropy due to noise.\n\n## 8_ Data frames & series\n\nA data frame is used for storing data tables. It is a list of vectors of equal length.\n\nA series is a series of data points ordered.\n\n## 9_ Sharding\n\n*Sharding* is **horizontal(row wise) database partitioning** as opposed to **vertical(column wise) partitioning** which is *Normalization*\n\nWhy use Sharding?\n\n1. Database systems with large data sets or high throughput applications can challenge the capacity of a single server.\n2. Two methods to address the growth : Vertical Scaling and Horizontal Scaling\n3. Vertical Scaling\n\n    * Involves increasing the capacity of a single server\n    * But due to technological and economical restrictions, a single machine may not be sufficient for the given workload.\n\n4. Horizontal Scaling\n    * Involves dividing the dataset and load over multiple servers, adding additional servers to increase capacity as required\n    * While the overall speed or capacity of a single machine may not be high, each machine handles a subset of the overall workload, potentially providing better efficiency than a single high-speed high-capacity server. \n    * Idea is to use concepts of Distributed systems to achieve scale\n    * But it comes with same tradeoffs of increased complexity that comes hand in hand with distributed systems.\n    * Many Database systems provide Horizontal scaling via Sharding the datasets.\n\n## 10_ OLAP\n\nOnline analytical processing, or OLAP, is an approach to answering multi-dimensional analytical (MDA) queries swiftly in computing. \n\nOLAP is part of the __broader category of business intelligence__, which also encompasses relational database, report writing and data mining. Typical applications of OLAP include ___business reporting for sales, marketing, management reporting, business process management (BPM), budgeting and forecasting, financial reporting and similar areas, with new applications coming up, such as agriculture__.\n\nThe term OLAP was created as a slight modification of the traditional database term online transaction processing (OLTP).\n\n## 11_ Multidimensional Data model\n\n## 12_ ETL\n\n* Extract\n  * extracting the data from the multiple heterogenous source system(s)\n  * data validation to confirm whether the data pulled has the correct\u002Fexpected values in a given domain\n\n* Transform\n  * extracted data is fed into a pipeline which applies multiple functions on top of data\n  * these functions intend to convert the data into the format which is accepted by the end system\n  * involves cleaning the data to remove noise, anamolies and redudant data\n* Load\n  * loads the transformed data into the end target\n\n## 13_ Reporting vs BI vs Analytics\n\n## 14_ JSON and XML\n\n### JSON\n\nJSON is a language-independent data format. Example describing a person:\n\t\n\t{\n\t  \"firstName\": \"John\",\n\t  \"lastName\": \"Smith\",\n\t  \"isAlive\": true,\n\t  \"age\": 25,\n\t  \"address\": {\n\t    \"streetAddress\": \"21 2nd Street\",\n\t    \"city\": \"New York\",\n\t    \"state\": \"NY\",\n\t    \"postalCode\": \"10021-3100\"\n\t  },\n\t  \"phoneNumbers\": [\n\t    {\n\t      \"type\": \"home\",\n\t      \"number\": \"212 555-1234\"\n\t    },\n\t    {\n\t      \"type\": \"office\",\n\t      \"number\": \"646 555-4567\"\n\t    },\n\t    {\n\t      \"type\": \"mobile\",\n\t      \"number\": \"123 456-7890\"\n\t    }\n\t  ],\n\t  \"children\": [],\n\t  \"spouse\": null\n\t}\n\n## XML\n\nExtensible Markup Language (XML) is a markup language that defines a set of rules for encoding documents in a format that is both human-readable and machine-readable.\n \n \t\u003CCATALOG>\n\t  \u003CPLANT>\n\t    \u003CCOMMON>Bloodroot\u003C\u002FCOMMON>\n\t    \u003CBOTANICAL>Sanguinaria canadensis\u003C\u002FBOTANICAL>\n\t    \u003CZONE>4\u003C\u002FZONE>\n\t    \u003CLIGHT>Mostly Shady\u003C\u002FLIGHT>\n\t    \u003CPRICE>$2.44\u003C\u002FPRICE>\n\t    \u003CAVAILABILITY>031599\u003C\u002FAVAILABILITY>\n\t  \u003C\u002FPLANT>\n\t  \u003CPLANT>\n\t    \u003CCOMMON>Columbine\u003C\u002FCOMMON>\n\t    \u003CBOTANICAL>Aquilegia canadensis\u003C\u002FBOTANICAL>\n\t    \u003CZONE>3\u003C\u002FZONE>\n\t    \u003CLIGHT>Mostly Shady\u003C\u002FLIGHT>\n\t    \u003CPRICE>$9.37\u003C\u002FPRICE>\n\t    \u003CAVAILABILITY>030699\u003C\u002FAVAILABILITY>\n\t  \u003C\u002FPLANT>\n\t  \u003CPLANT>\n\t    \u003CCOMMON>Marsh Marigold\u003C\u002FCOMMON>\n\t    \u003CBOTANICAL>Caltha palustris\u003C\u002FBOTANICAL>\n\t    \u003CZONE>4\u003C\u002FZONE>\n\t    \u003CLIGHT>Mostly Sunny\u003C\u002FLIGHT>\n\t    \u003CPRICE>$6.81\u003C\u002FPRICE>\n\t    \u003CAVAILABILITY>051799\u003C\u002FAVAILABILITY>\n\t  \u003C\u002FPLANT>\n\t\u003C\u002FCATALOG>\n\n## 15_ NoSQL\n\nnoSQL is oppsed to relationnal databases (stand for __N__ot __O__nly __SQL__). Data are not structured and there's no notion of keys between tables.\n\nAny kind of data can be stored in a noSQL database (JSON, CSV, ...) whithout thinking about a complex relationnal scheme.\n\n__Commonly used noSQL stacks__: Cassandra, MongoDB, Redis, Oracle noSQL ...\n\n## 16_ Regex\n\n### About\n\n__Reg__ ular __ex__ pressions (__regex__) are commonly used in informatics.\n\nIt can be used in a wide range of possibilities :\n* Text replacing\n* Extract information in a text (email, phone number, etc)\n* List files with the .txt extension ..\n\nhttp:\u002F\u002Fregexr.com\u002F is a good website for experimenting on Regex.\n\n### Utilisation\n\nTo use them in [Python](https:\u002F\u002Fdocs.python.org\u002F3\u002Flibrary\u002Fre.html), just import:\n\n    import re\n\n## 17_ Vendor landscape\n\n## 18_ Env Setup\n\n# 2_ Statistics\n\n\n[Statistics-101 for data noobs](https:\u002F\u002Fmedium.com\u002F@debuggermalhotra\u002Fstatistics-101-for-data-noobs-2e2a0e23a5dc)\n\n## 1_ Pick a dataset\n\n### Datasets repositories\n\n#### Generalists\n\n- [KAGGLE](https:\u002F\u002Fwww.kaggle.com\u002Fdatasets)\n- [Google](https:\u002F\u002Ftoolbox.google.com\u002Fdatasetsearch)\n\n#### Medical\n\n- [PMC](https:\u002F\u002Fwww.ncbi.nlm.nih.gov\u002Fpmc\u002F)\n\n#### Other languages\n\n##### French\n\n- [DATAGOUV](https:\u002F\u002Fwww.data.gouv.fr\u002Ffr\u002F)\n\n## 2_ Descriptive statistics\n\n### Mean\n\nIn probability and statistics, population mean and expected value are used synonymously to refer to one __measure of the central tendency either of a probability distribution or of the random variable__ characterized by that distribution.\n\nFor a data set, the terms arithmetic mean, mathematical expectation, and sometimes average are used synonymously to refer to a central value of a discrete set of numbers: specifically, the __sum of the values divided by the number of values__.\n\n![mean_formula](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002Fbd2f5fb530fc192e4db7a315777f5bbb5d462c90)\n\n### Median\n\nThe median is the value __separating the higher half of a data sample, a population, or a probability distribution, from the lower half__. In simple terms, it may be thought of as the \"middle\" value of a data set.\n\n### Descriptive statistics in Python\n\n[Numpy](http:\u002F\u002Fwww.numpy.org\u002F) is a python library widely used for statistical analysis.\n\n#### Installation\n\n    pip3 install numpy\n\n#### Utilization\n    \n    import numpy\n\n## 3_ Exploratory data analysis\n\nThe step includes visualization and analysis of data. \n\nRaw data may possess improper distributions of data which may lead to issues moving forward.\n\nAgain, during applications we must also know the distribution of data, for instance, the fact whether the data is linear or spirally distributed.\n\n[Guide to EDA in Python](https:\u002F\u002Ftowardsdatascience.com\u002Fdata-preprocessing-and-interpreting-results-the-heart-of-machine-learning-part-1-eda-49ce99e36655)\n\n##### Libraries in Python \n\n[Matplotlib](https:\u002F\u002Fmatplotlib.org\u002F)\n\nLibrary used to plot graphs in Python\n\n__Installation__:\n\n    pip3 install matplotlib\n\n__Utilization__:\n\n    import matplotlib.pyplot as plt\n\n[Pandas](https:\u002F\u002Fpandas.pydata.org\u002F)\n\nLibrary used to large datasets in python\n\n__Installation__:\n\n    pip3 install pandas\n\n__Utilization__:\n\n    import pandas as pd\n    \n[Seaborn](https:\u002F\u002Fseaborn.pydata.org\u002F)\n\nYet another Graph Plotting Library in Python.\n\n__Installation__:\n\n    pip3 install seaborn\n\n__Utilization__:\n\n    import seaborn as sns\n\n\n#### PCA\n\nPCA stands for principle component analysis.\n\nWe often require to shape of the data distribution as we have seen previously. We need to plot the data for the same.\n\nData can be Multidimensional, that is, a dataset can have multiple features. \n\nWe can plot only two dimensional data, so, for multidimensional data, we project the multidimensional distribution in two dimensions, preserving the principle components of the distribution, in order to get an idea of the actual distribution through the 2D plot. \n\nIt is used for dimensionality reduction also. Often it is seen that several features do not significantly contribute any important insight to the data distribution. Such features creates complexity and increase dimensionality of the data. Such features are not considered which results in decrease of the dimensionality of the data.\n\n[Mathematical Explanation](https:\u002F\u002Fmedium.com\u002Ftowards-artificial-intelligence\u002Fdemystifying-principal-component-analysis-9f13f6f681e6)\n\n[Application in Python](https:\u002F\u002Ftowardsdatascience.com\u002Fdata-preprocessing-and-interpreting-results-the-heart-of-machine-learning-part-2-pca-feature-92f8f6ec8c8)\n\n## 4_ Histograms\n\nHistograms are representation of distribution of numerical data. The procedure consists of binnng the numeric values using range divisions i.e, the entire range in which the data varies is split into several fixed intervals. Count or frequency of occurences of the numbers in the range of the bins are represented.\n\n[Histograms](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHistogram)\n\n![plot](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F1\u002F1d\u002FExample_histogram.png\u002F220px-Example_histogram.png)\n\nIn python, __Pandas__,__Matplotlib__,__Seaborn__ can be used to create Histograms.\n\n## 5_ Percentiles & outliers\n\n### Percentiles\n\nPercentiles are numberical measures in statistics, which represents how much or what percentage of data falls below a given number or instance in a numerical data distribution. \n\nFor instance, if we say 70 percentile, it represents, 70% of the data in the ditribution are below the given numerical value. \n\n[Percentiles](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPercentile)\n\n### Outliers\n\nOutliers are data points(numerical) which have significant differences with other data points. They differ from majority of points in the distribution. Such points may cause the central measures of distribution, like mean, and median. So, they need to be detected and removed.\n\n[Outliers](https:\u002F\u002Fwww.itl.nist.gov\u002Fdiv898\u002Fhandbook\u002Fprc\u002Fsection1\u002Fprc16.htm)\n\n__Box Plots__ can be used detect Outliers in the data. They can be created using __Seaborn__ library\n\n![Image_Box_Plot](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_8d903230d58c.png)\n  \n## 6_ Probability theory\n\n__Probability__ is the likelihood of an event in a Random experiment. For instance, if a coin is tossed, the chance of getting a head is 50% so, probability is 0.5.\n\n__Sample Space__: It is the set of all possible outcomes of a Random Experiment. \n__Favourable Outcomes__: The set of outcomes we are looking for in a Random Experiment\n\n__Probability = (Number of Favourable Outcomes) \u002F (Sample Space)__\n\n__Probability theory__ is a branch of mathematics that is associated with the concept of probability.\n\n[Basics of Probability](https:\u002F\u002Ftowardsdatascience.com\u002Fbasic-probability-theory-and-statistics-3105ab637213)\n\n## 7_ Bayes theorem\n\n### Conditional Probability:\n\nIt is the probability of one event occurring, given that another event has already occurred. So, it gives a sense of relationship between two events and the probabilities of the occurences of those events.\n\nIt is given by:\n\n__P( A | B )__ : Probability of occurence of A, after B occured.\n\nThe formula is given by: \n\n![formula](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F74cbddb93db29a62d522cd6ab266531ae295a0fb)\n\nSo, P(A|B) is equal to Probablity of occurence of A and B, divided by Probability of occurence of B.\n\n[Guide to Conditional Probability](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FConditional_probability)\n\n### Bayes Theorem\n\nBayes theorem provides a way to calculate conditional probability. Bayes theorem is widely used in machine learning most in Bayesian Classifiers.  \n\nAccording to Bayes theorem the probability of A, given that B has already occurred is given by Probability of A multiplied by the probability of B given A has already occurred divided by the probability of B.\n\n__P(A|B) =  P(A).P(B|A) \u002F P(B)__\n\n\n[Guide to Bayes Theorem](https:\u002F\u002Fmachinelearningmastery.com\u002Fbayes-theorem-for-machine-learning\u002F)\n\n\n## 8_ Random variables\n\nRandom variable are the numeric outcome of an experiment or random events. They are normally a set of values. \n\nThere are two main types of Random Variables:\n\n__Discrete Random Variables__: Such variables take only a finite number of distinct values\n\n__Continous Random Variables__: Such variables can take an infinite number of possible values.\n\n\n## 9_ Cumul Dist Fn (CDF)\n\nIn probability theory and statistics, the cumulative distribution function (CDF) of a real-valued random variable __X__, or just distribution function of __X__, evaluated at __x__, is the probability that __X__ will take a value less than or equal to __x__.\n\nThe cumulative distribution function of a real-valued random variable X is the function given by:\n\n![CDF](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002Ff81c05aba576a12b4e05ee3f4cba709dd16139c7)\n\nResource:\n\n[Wikipedia](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCumulative_distribution_function)\n\n## 10_ Continuous distributions\n\nA continuous distribution describes the probabilities of the possible values of a continuous random variable. A continuous random variable is a random variable with a set of possible values (known as the range) that is infinite and uncountable.\n\n## 11_ Skewness\n\nSkewness is the measure of assymetry in the data distribution or a random variable distribution about its mean. \n\nSkewness can be positive, negative or zero. \n\n![skewed image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002Ff\u002Ff8\u002FNegative_and_positive_skew_diagrams_%28English%29.svg\u002F446px-Negative_and_positive_skew_diagrams_%28English%29.svg.png)\n\n__Negative skew__: Distribution Concentrated in the right, left tail is longer.\n\n__Positive skew__: Distribution Concentrated in the left, right tail is longer.\n\nVariation of central tendency measures are shown below.\n\n\n![cet](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002Fc\u002Fcc\u002FRelationship_between_mean_and_median_under_different_skewness.png\u002F434px-Relationship_between_mean_and_median_under_different_skewness.png)\n\nData Distribution are often Skewed which may cause trouble during processing the data. __Skewed Distribution can be converted to Symmetric Distribution, taking Log of the distribution__.\n\n##### Skew Distribution\n\n![Skew](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_12cb775e50a2.png)\n\n##### Log of the Skew Distribution.\n\n![log](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_af42cb82b664.png)\n\n\n[Guide to Skewness](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSkewness)\n\n\n## 12_ ANOVA\n\nANOVA stands for __analysis of variance__. \n\nIt is used to compare among groups of data distributions.\n\nOften we are provided with huge data. They are too huge to work with. The total data is called the __Population__.\n\nIn order to work with them, we pick random smaller groups of data. They are called __Samples__.\n\nANOVA is used to compare the variance among these groups or samples. \n\nVariance of  group is given by:\n\n![var](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_bdfacf6f5aee.png)\n\nThe differences in the collected samples are observed using the differences between the means of the groups. We often use the __t-test__ to compare the means and also to check if the samples belong to the same population,\n\nNow, t-test can only be possible among two groups. But, often we get more groups or samples.\n\nIf we try to use t-test for more than two groups we have to perform t-tests multiple times, once for each pair. This is where ANOVA is used.\n\nANOVA has two components:\n\n__1.Variation within each group__\n\n__2.Variation between groups__\n\nIt works on a ratio called the  __F-Ratio__\n\nIt is given by:\n\n![F-ratio](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_485efa4fa145.png)\n\nF ratio shows how much of the total variation comes from the variation between groups and how much comes from the variation within groups. If much of the variation comes from the variation between groups, it is more likely that the mean of groups are different. However, if most of the variation comes from the variation within groups, then we can conclude the elements in a group are different rather than entire groups. The larger the F ratio, the more likely that the groups have different means.\n\n\nResources:\n\n[Defnition](https:\u002F\u002Fstatistics.laerd.com\u002Fstatistical-guides\u002Fone-way-anova-statistical-guide.php)\n\n[GUIDE 1](https:\u002F\u002Ftowardsdatascience.com\u002Fanova-analysis-of-variance-explained-b48fee6380af)\n\n[Details](https:\u002F\u002Fmedium.com\u002F@StepUpAnalytics\u002Fanova-one-way-vs-two-way-6b3ff87d3a94)\n\n\n## 13_ Prob Den Fn (PDF)\n\nIt stands for probability density function. \n\n__In probability theory, a probability density function (PDF), or density of a continuous random variable, is a function whose value at any given sample (or point) in the sample space (the set of possible values taken by the random variable) can be interpreted as providing a relative likelihood that the value of the random variable would equal that sample.__\n\nThe probability density function (PDF) P(x) of a continuous distribution is defined as the derivative of the (cumulative) distribution function D(x).\n\nIt is given by the integral of the function over a given range.\n\n![PDF](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F45fd7691b5fbd323f64834d8e5b8d4f54c73a6f8)\n\n## 14_ Central Limit theorem\n\n## 15_ Monte Carlo method\n\n## 16_ Hypothesis Testing\n\n### Types of curves\n\nWe need to know about two distribution curves first.\n\nDistribution curves reflect the probabilty of finding an instance or a sample of a population at a certain value of the distribution.\n\n__Normal Distribution__\n\n![normal distribution](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_3f69f968f470.jpg)\n\nThe normal distribution represents how the data is distributed. In this case, most of the data samples in the distribution are scattered at and around the mean of the distribution. A few instances are scattered or present at the long tail ends of the distribution.\n\nFew points about Normal Distributions are:\n\n1. The curve is always Bell-shaped. This is because most of the data is found around the mean, so the proababilty of finding a sample at the mean or central value is more.\n\n2. The curve is symmetric\n\n3. The area under the curve is always 1. This is because all the points of the distribution must be present under the curve\n\n4. For Normal Distribution, Mean and Median lie on the same line in the distribution. \n\n__Standard Normal Distribution__\n\nThis type of distribution are normal distributions which following conditions.\n\n1. Mean of the distribution is 0\n\n2. The Standard Deviation of the distribution is equal to 1.\n\nThe idea of Hypothesis Testing works completely on the data distributions.\n\n### Hypothesis Testing\n\nHypothesis testing is a statistical method that is used in making statistical decisions using experimental data. Hypothesis Testing is basically an assumption that we make about the population parameter.\n\nFor example, say, we take the hypothesis that boys in a class are taller than girls. \n\nThe above statement is just an assumption on the population of the class.\n\n__Hypothesis__ is just an assumptive proposal or statement made on the basis of observations made on a set of information or data. \n\nWe initially propose two mutually exclusive statements based on the population of the sample data. \n\nThe initial one is called __NULL HYPOTHESIS__. It is denoted by H0.\n\nThe second one is called __ALTERNATE HYPOTHESIS__. It is denoted by H1 or Ha. It is used as a contrary to Null Hypothesis. \n\nBased on the instances of the population we accept or reject the NULL Hypothesis and correspondingly we reject or accept the ALTERNATE Hypothesis.\n \n#### Level of Significance\n\nIt is the degree which we consider to decide whether to accept or reject the NULL hypothesis. When we consider a hypothesis on a population, it is not the case that 100% or all instances of the population abides the assumption, so we decide a __level of significance as a cutoff degree, i.e, if our level of significance is 5%, and (100-5)% = 95% of the data abides by the assumption, we accept the Hypothesis.__\n\n__It is said with 95% confidence, the hypothesis is accepted__\n\n![curve](https:\u002F\u002Fi.stack.imgur.com\u002Fd8iHd.png)\n\nThe non-reject region is called __acceptance region or beta region__. The rejection regions are called __critical or alpha regions__. __alpha__ denotes the __level of significance__.\n\nIf level of significance is 5%. the two alpha regions have (2.5+2.5)% of the population and the beta region has the 95%. \n\nThe acceptance and rejection gives rise to two kinds of errors:\n\n__Type-I Error:__ NULL Hypothesis is true, but wrongly Rejected.\n\n__Type-II Error:__ NULL Hypothesis if false but is wrongly accepted.\n\n![hypothesis](https:\u002F\u002Fmicrobenotes.com\u002Fwp-content\u002Fuploads\u002F2020\u002F07\u002FGraphical-representation-of-type-1-and-type-2-errors.jpg)\n\n### Tests for Hypothesis\n\n__One Tailed Test__: \n\n![One-tailed](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_acc48a5819d2.png)\n\nThis is a test for Hypothesis, where the rejection region is only one side of the sampling distribution. The rejection region may be in right tail end or in the left tail end.\n\nThe idea is if we say our level of significance is 5% and we consider a hypothesis \"Hieght of Boys in a class is \u003C=6 ft\". We consider the hypothesis true if atmost 5% of our population are more than 6 feet. So, this will be one-tailed as the test condition only restricts one tail end, the end with hieght > 6ft. \n\n![Two Tailed](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_88e6dd52a2b7.png)\n\nIn this case, the rejection region extends at both tail ends of the distribution.\n\nThe idea is if we say our level of significance is 5% and we consider a hypothesis \"Hieght of Boys in a class is !=6 ft\".\n\nHere, we can accept the NULL hyposthesis iff atmost 5% of the population is less than or greater than 6 feet. So, it is evident that the crirtical region will be at both tail ends and the region is 5% \u002F 2 = 2.5% at both ends of the distribution. \n\n\n\n## 17_ p-Value\n\nBefore we jump into P-values we need to look at another important topic in the context: Z-test.\n\n### Z-test\n\nWe need to know two terms: __Population and Sample.__\n\n__Population__ describes the entire available data distributed. So, it refers to all records provided in the dataset.\n\n__Sample__ is said to be a group of data points randomly picked from a population or a given distribution. The size of the sample can be any number of data points, given by __sample size.__\n\n__Z-test__ is simply used to determine if a given sample distribution belongs to a given population. \n\nNow,for Z-test we have to use __Standard Normal Form__ for the standardized comparison measures.\n\n![std1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_d64d52a25cae.png)\n\nAs we already have seen, standard normal form is a normal form with mean=0 and standard deviation=1.\n\nThe __Standard Deviation__ is a measure of how much differently the points are distributed around the mean.\n\n![std2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_a2c6afc944fa.png)\n\nIt states that approximately 68% , 95% and 99.7% of the data lies within 1, 2 and 3 standard deviations of a normal distribution respectively.\n\nNow, to convert the normal distribution to standard normal distribution we need a standard score called Z-Score.\nIt is given by:\n\n![Z-score](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_310333194ca3.png)\n\nx = value that we want to standardize\n\nµ = mean of the distribution of x\n\nσ = standard deviation of the distribution of x\n\nWe need to know another concept __Central Limit Theorem__.\n\n##### Central Limit Theorem \n\n_The theorem states that the mean of the sampling distribution of the sample means is equal to the population mean irrespective if the distribution of population where sample size is greater than 30._\n\nAnd\n\n_The sampling distribution of sampling mean will also follow the normal distribution._\n\nSo, it states, if we pick several samples from a distribution with the size above 30, and pick the static sample means and use the sample means to create a distribution, the mean of the newly created sampling distribution is equal to the original population mean.\n\nAccording to the theorem, if we draw samples of size N, from a population with population mean μ and population standard deviation σ, the condition stands:\n\n![std3](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_989de647b0b1.png)\n\ni.e, mean of the distribution of sample means is equal to the sample means.\n\nThe standard deviation of the sample means is give by:\n\n![std4](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_b42526ed4f4c.png)\n\nThe above term is also called standard error.\n\nWe use the theory discussed above for Z-test. If the sample mean lies close to the population mean, we say that the sample belongs to the population and if it lies at a distance from the population mean, we say the sample is taken from a different population.\n\nTo do this we use a formula and check if the z statistic is greater than or less than 1.96 (considering two tailed test, level of significance = 5%)\n\n![los](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_f6335d6d9baa.gif)\n\n![std5](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_402dc791b297.png)\n \n The above formula gives Z-static\n\nz = z statistic\n\nX̄ = sample mean\n\nμ = population mean\n\nσ = population standard deviation\n\nn = sample size\n\nNow, as the Z-score is used to standardize the distribution, it gives us an idea how the data is distributed overall.\n\n### P-values\n\nIt is used to check if the results are statistically significant based on the significance level.  \n\nSay, we perform an experiment and collect observations or data. Now, we make a hypothesis (NULL hypothesis) primary, and a second hypothesis, contradictory to the first one called the alternative hypothesis.\n\nThen we decide a level of significance which serve as a threshold for our null hypothesis. The P value actually gives the probability of the statement. Say, the p-value of our alternative hypothesis is 0.02, it means the probability of alternate hypothesis happenning is 2%. \n\nNow, the level of significance into play to decide if we can allow 2% or p-value of 0.02. It can be said as a level of endurance of the null hypothesis. If our level of significance is 5% using a two tailed test, we can allow 2.5% on both ends of the distribution, we accept the NULL hypothesis, as level of significance > p-value of alternate hypothesis. \n\nBut if the p-value is greater than level of significance, we tell that the result is __statistically significant, and we reject NULL hypothesis.__ .\n\nResources:\n\n1. https:\u002F\u002Fmedium.com\u002Fanalytics-vidhya\u002Feverything-you-should-know-about-p-value-from-scratch-for-data-science-f3c0bfa3c4cc\n\n2. https:\u002F\u002Ftowardsdatascience.com\u002Fp-values-explained-by-data-scientist-f40a746cfc8\n\n3.https:\u002F\u002Fmedium.com\u002Fanalytics-vidhya\u002Fz-test-demystified-f745c57c324c\n\n## 18_ Chi2 test\n\nChi2 test is extensively used in data science and machine learning problems for feature selection.\n\nA chi-square test is used in statistics to test the independence of two events. So, it is used to check for independence of features used. Often dependent features are used which do not convey a lot of information but adds dimensionality to a feature space.\n\nIt is one of the most common ways to examine relationships between two or more categorical variables.\n\nIt involves calculating a number, called the chi-square statistic - χ2. Which follows a chi-square distribution.\n\nIt is given as the summation of the difference of the expected values and observed value divided by the observed value.\n\n![Chi2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_4e9001423d04.png)\n\n\nResources:\n\n[Definitions](investopedia.com\u002Fterms\u002Fc\u002Fchi-square-statistic.asp)\n\n[Guide 1](https:\u002F\u002Ftowardsdatascience.com\u002Fchi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223)\n\n[Guide 2](https:\u002F\u002Fmedium.com\u002Fswlh\u002Fwhat-is-chi-square-test-how-does-it-work-3b7f22c03b01)\n\n[Example of Operation](https:\u002F\u002Fmedium.com\u002F@kuldeepnpatel\u002Fchi-square-test-of-independence-bafd14028250)\n\n\n## 19_ Estimation\n\n## 20_ Confid Int (CI)\n\n## 21_ MLE\n\n## 22_ Kernel Density estimate\n\nIn statistics, kernel density estimation (KDE) is a non-parametric way to estimate the probability density function of a random variable. Kernel density estimation is a fundamental data smoothing problem where inferences about the population are made, based on a finite data sample.\n\nKernel Density estimate can be regarded as another way to represent the probability distribution. \n\n![KDE1](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F2\u002F2a\u002FKernel_density.svg\u002F250px-Kernel_density.svg.png)\n\nIt consists of choosing a kernel function. There are mostly three used.\n\n1. Gaussian \n\n2. Box\n\n3. Tri\n\nThe kernel function depicts the probability of finding a data point. So, it is highest at the centre and decreases as we move away from the point.\n\nWe assign a kernel function over all the data points and finally calculate the density of the functions, to get the density estimate of the distibuted data points. It practically adds up the Kernel function values at a particular point on the axis. It is as shown below.\n\n![KDE 2](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F4\u002F41\u002FComparison_of_1D_histogram_and_KDE.png\u002F500px-Comparison_of_1D_histogram_and_KDE.png)\n\nNow, the kernel function is given by:\n\n![kde3](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002Ff3b09505158fb06033aabf9b0116c8c07a68bf31)\n\nwhere K is the kernel — a non-negative function — and h > 0 is a smoothing parameter called the bandwidth. \n\nThe 'h' or the bandwidth is the parameter, on which the curve varies.\n\n![kde4](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002Fe\u002Fe5\u002FComparison_of_1D_bandwidth_selectors.png\u002F220px-Comparison_of_1D_bandwidth_selectors.png)\n\nKernel density estimate (KDE) with different bandwidths of a random sample of 100 points from a standard normal distribution. Grey: true density (standard normal). Red: KDE with h=0.05. Black: KDE with h=0.337. Green: KDE with h=2.\n\nResources:\n\n[Basics](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=x5zLaWT5KPs)\n\n[Advanced](https:\u002F\u002Fjakevdp.github.io\u002FPythonDataScienceHandbook\u002F05.13-kernel-density-estimation.html)\n\n## 23_ Regression\n\nRegression tasks deal with predicting the value of a __dependent variable__ from a set of __independent variables.__\n\nSay, we want to predict the price of a car. So, it becomes a dependent variable say Y, and the features like engine capacity, top speed, class, and company become the independent variables, which helps to frame the equation to obtain the price.\n\nIf there is one feature say x. If the dependent variable y is linearly dependent on x, then it can be given by __y=mx+c__, where the m is the coefficient of the independent in the equation, c is the intercept or bias.\n\nThe image shows the types of regression\n\n![types](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_14eb56f13f69.png)\n\n[Guide to Regression](https:\u002F\u002Ftowardsdatascience.com\u002Fa-deep-dive-into-the-concept-of-regression-fb912d427a2e)\n\n## 24_ Covariance\n\n### Variance\nThe variance is a measure of how dispersed or spread out the set is. If it is said that the variance is zero, it means all the elements in the dataset are same. If the variance is low, it means the data are slightly dissimilar. If the variance is very high, it means the data in the dataset are largely dissimilar. \n\nMathematically, it is a measure of how far each value in the data set is from the mean.\n\nVariance (sigma^2) is given by summation of the square of distances of each point from the mean, divided by the number of points\n\n![formula var](https:\u002F\u002Fcdn.sciencebuddies.org\u002FFiles\u002F474\u002F9\u002FDefVarEqn.jpg)\n\n### Covariance\n\nCovariance gives us an idea about the degree of association between two considered random variables. Now, we know random variables create distributions. Distribution are a set of values or data points which the variable takes and we can easily represent as vectors in the vector space.\n\nFor vectors covariance is defined as the dot product of two vectors. The value of covariance can vary from positive infinity to negative infinity. If the two distributions or vectors grow in the same direction the covariance is positive and vice versa. The Sign gives the direction of variation and the Magnitude gives the amount of variation.  \n\nCovariance is given by:\n\n![cov_form](https:\u002F\u002Fcdn.corporatefinanceinstitute.com\u002Fassets\u002Fcovariance1.png)\n\nwhere Xi and Yi denotes the i-th point of the two distributions and X-bar and Y-bar represent the mean values of both the distributions, and n represents the number of values or data points in the distribution. \n\n## 25_ Correlation\n\nCovariance measures the total relation of the variables namely both direction and magnitude. Correlation is a scaled measure of covariance. It is dimensionless and independent of scale. It just shows the strength of variation for both the variables.\n\nMathematically, if we represent the distribution using vectors, correlation is said to be the cosine angle between the vectors. The value of correlation varies from +1 to -1. +1 is said to be a strong positive correlation and -1 is said to be a strong negative correlation. 0 implies no correlation, or the two variables are independent of each other. \n\nCorrelation is given by:\n\n![corr](https:\u002F\u002Fcdn.corporatefinanceinstitute.com\u002Fassets\u002Fcovariance3.png)\n\nWhere:\n\nρ(X,Y) – the correlation between the variables X and Y\n\nCov(X,Y) – the covariance between the variables X and Y\n\nσX – the standard deviation of the X-variable\n\nσY – the standard deviation of the Y-variable\n\nStandard deviation is given by square roo of variance.\n\n## 26_ Pearson coeff\n\n## 27_ Causation\n\n## 28_ Least2-fit\n\n## 29_ Euclidian Distance\n\n__Eucladian Distance is the most used and standard measure for the distance between two points.__\n\nIt is given as the square root of sum of squares of the difference between coordinates of two points.\n\n__The Euclidean distance between two points in Euclidean space is a number, the length of a line segment between the two points. It can be calculated from the Cartesian coordinates of the points using the Pythagorean theorem, and is occasionally called the Pythagorean distance.__\n\n__In the Euclidean plane, let point p have Cartesian coordinates (p_{1},p_{2}) and let point q have coordinates (q_{1},q_{2}). Then the distance between p and q is given by:__\n\n![eucladian](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F9c0157084fd89f5f3d462efeedc47d3d7aa0b773)\n\n\n# 3_ Programming\n\n## 1_ Python Basics\n\n### About\n\nPython is a high-level programming langage. I can be used in a wide range of works.\n\nCommonly used in data-science, [Python](https:\u002F\u002Fwww.python.org\u002F)  has a huge set of libraries, helpful to quickly do something.\n\nMost of informatics systems already support Python, without installing anything.\n\n### Execute a script\n\n* Download the .py file on your computer\n* Make it executable (_chmod +x file.py_ on Linux)\n* Open a terminal and go to the directory containing the python file\n* _python file.py_ to run with Python2 or _python3 file.py_ with Python3\n\n## 2_ Working in excel\n\n## 3_ R setup \u002F R studio\n\n### About\n\nR is a programming language specialized in statistics and mathematical visualizations.\n\nIt can be used with manually created scripts using the terminal, or directly in the R console.\n\n### Installation\n\n#### Linux\n\n\tsudo apt-get install r-base\n\t\n\tsudo apt-get install r-base-dev\n\n#### Windows\n\nDownload the .exe setup available on [CRAN](https:\u002F\u002Fcran.rstudio.com\u002Fbin\u002Fwindows\u002Fbase\u002F) website.\n\n### R-studio\n\nRstudio is a graphical interface for R. It is available for free on [their website](https:\u002F\u002Fwww.rstudio.com\u002Fproducts\u002Frstudio\u002Fdownload\u002F).\n\nThis interface is divided in 4 main areas :\n\n![rstudio](https:\u002F\u002Fowi.usgs.gov\u002FR\u002Ftraining-curriculum\u002Fintro-curriculum\u002Fstatic\u002Fimg\u002Frstudio.png)\n\n* The top left is the script you are working on (highlight code you want to execute and press Ctrl + Enter)\n* The bottom left is the console to instant-execute some lines of codes\n* The top right is showing your environment (variables, history, ...)\n* The bottom right show figures you plotted, packages, help ... The result of code execution\n\n## 4_ R basics\n\nR is an open source programming language and software environment for statistical computing and graphics that is supported by the R Foundation for Statistical Computing.\n\nThe R language is widely used among statisticians and data miners for developing statistical software and data analysis.\n\nPolls, surveys of data miners, and studies of scholarly literature databases show that R's popularity has increased substantially in recent years.\n\n## 5_ Expressions\n\n## 6_ Variables\n\n## 7_ IBM SPSS\n\n## 8_ Rapid Miner\n\n## 9_ Vectors\n\n## 10_ Matrices\n\n## 11_ Arrays\n\n## 12_ Factors\n\n## 13_ Lists\n\n## 14_ Data frames\n\n## 15_ Reading CSV data\n\nCSV is a format of __tabular data__ comonly used in data science. Most of structured data will come in such a format.\n\nTo __open a CSV file__ in Python, just open the file as usual :\n\t\n\traw_file = open('file.csv', 'r')\n\t\n* 'r': Reading, no modification on the file is possible\n* 'w': Writing, every modification will erease the file \n* 'a': Adding, every modification will be made at the end of the file\n\n### How to read it ?\n\nMost of the time, you will parse this file line by line and do whatever you want on this line. If you want to store data to use them later, build lists or dictionnaries.\n\nTo read such a file row by row, you can use :\n\n* Python [library csv](https:\u002F\u002Fdocs.python.org\u002F3\u002Flibrary\u002Fcsv.html)\n* Python [function open](https:\u002F\u002Fdocs.python.org\u002F2\u002Flibrary\u002Ffunctions.html#open)\n\n## 16_ Reading raw data\n\n## 17_ Subsetting data\n\n## 18_ Manipulate data frames\n\n## 19_ Functions\n\nA function is helpful to execute redondant actions.\n\nFirst, define the function:\n\n\tdef MyFunction(number):\n\t\t\"\"\"This function will multiply a number by 9\"\"\"\n\t\tnumber = number * 9\n\t\treturn number\n\n## 20_ Factor analysis\n\n## 21_ Install PKGS\n\nPython actually has two mainly used distributions. Python2 and python3.\n\n### Install pip\n\nPip is a library manager for Python. Thus, you can easily install most of the packages with a one-line command. To install pip, just go to a terminal and do:\n\t\n\t# __python2__\n\tsudo apt-get install python-pip\n\t# __python3__\n\tsudo apt-get install python3-pip\n\t\nYou can then install a library with [pip](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fpip?) via a terminal doing:\n\n\t# __python2__ \n\tsudo pip install [PCKG_NAME]\n\t# __python3__ \n\tsudo pip3 install [PCKG_NAME]\n\nYou also can install it directly from the core (see 21_install_pkgs.py)\n\n\n# 4_ Machine learning\n\n## 1_ What is ML ?\n\n### Definition\n\nMachine Learning is part of the Artificial Intelligences study. It concerns the conception, devloppement and implementation of sophisticated methods, allowing a machine to achieve really hard tasks, nearly impossible to solve with classic algorithms.\n\nMachine learning mostly consists of three algorithms:\n\n![ml](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_e265e9615e06.png)\n\n### Utilisation examples\n\n* Computer vision\n* Search engines\n* Financial analysis\n* Documents classification\n* Music generation\n* Robotics ...\n\n## 2_ Numerical var\n\nVariables which can take continous integer or real values. They can take infinite values.\n\nThese types of variables are mostly used for features which involves measurements. For example, hieghts of all students in a class.\n\n## 3_ Categorical var\n\nVariables that take finite discrete values. They take a fixed set of values, in order to classify a data item.\n\nThey act like assigned labels. For example: Labelling the students of a class according to gender: 'Male' and 'Female'\n\n## 4_ Supervised learning\n\nSupervised learning is the machine learning task of inferring a function from __labeled training data__. \n\nThe training data consist of a __set of training examples__. \n\nIn supervised learning, each example is a pair consisting of an input object (typically a vector) and a desired output value (also called the supervisory signal). \n\nA supervised learning algorithm analyzes the training data and produces an inferred function, which can be used for mapping new examples. \n\nIn other words:\n\nSupervised Learning learns from a set of labeled examples. From the instances and the labels, supervised learning models try to find the correlation among the features, used to describe an instance, and learn how each feature contributes to the label corresponding to an instance. On receiving an unseen instance, the goal of supervised learning is to label the instance based on its feature correctly.\n\n__An optimal scenario will allow for the algorithm to correctly determine the class labels for unseen instances__.\n\n## 5_ Unsupervised learning\n\nUnsupervised machine learning is the machine learning task of inferring a function to describe hidden structure __from \"unlabeled\" data__ (a classification or categorization is not included in the observations). \n\nSince the examples given to the learner are unlabeled, there is no evaluation of the accuracy of the structure that is output by the relevant algorithm—which is one way of distinguishing unsupervised learning from supervised learning and reinforcement learning.\n\nUnsupervised learning deals with data instances only. This approach tries to group data and form clusters based on the similarity of features. If two instances have similar features and placed in close proximity in feature space, there are high chances the two instances will belong to the same cluster. On getting an unseen instance, the algorithm will try to find, to which cluster the instance should belong based on its feature.\n\nResource:\n\n[Guide to unsupervised learning](https:\u002F\u002Ftowardsdatascience.com\u002Fa-dive-into-unsupervised-learning-bf1d6b5f02a7)\n\n## 6_ Concepts, inputs and attributes\n\nA machine learning problem takes in the features of a dataset as input.\n\nFor supervised learning, the model trains on the data and then it is ready to perform. So, for supervised learning, apart from the features we also need to input  the corresponding labels of the data points to let the model train on them.\n\nFor unsupervised learning, the models simply perform by just citing complex relations among data items and grouping them accordingly. So, unsupervised learning do not need a labelled dataset. The input is only the feature section of the dataset.\n\n## 7_ Training and test data\n\nIf we train a supervised machine learning model using a dataset, the model captures the dependencies of that particular data set very deeply. So, the model will always perform well on the data and it won't be proper measure of how well the model performs. \n\nTo know how well the model performs, we must train and test the model on different datasets. The dataset we train the model on is called Training set, and the dataset we test the model on is called the test set.\n\nWe normally split the provided dataset to create the training and test set. The ratio of splitting is majorly: 3:7 or 2:8 depending on the data, larger being the trining data.\n\n#### sklearn.model_selection.train_test_split is used for splitting the data.\n\nSyntax:\n\n    from sklearn.model_selection import train_test_split\n    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n  \n[Sklearn docs](https:\u002F\u002Fscikit-learn.org\u002Fstable\u002Fmodules\u002Fgenerated\u002Fsklearn.model_selection.train_test_split.html)\n\n## 8_ Classifiers\n\nClassification is the most important and most common machine learning problem. Classification problems can be both suprvised and unsupervised problems.\n\nThe classification problems involve labelling data points to belong to a particular class based on the feature set corresponding to the particluar data point.\n\nClassification tasks can be performed using both machine learning and deep learning techniques.\n\nMachine learning classification techniques involve: Logistic Regressions, SVMs, and Classification trees. The models used to perform the classification are called classifiers.\n\n## 9_ Prediction\n\nThe output generated by a machine learning models for a particuolar problem is called its prediction. \n\nThere are majorly two kinds of predictions corresponding to two types of problen: \n\n1. Classification\n\n2. Regression\n\nIn classiication, the prediction is mostly a class or label, to which a data points belong\n\nIn regression, the prediction is a number, a continous a numeric value, because regression problems deal with predicting the value. For example, predicting the price of a house.\n\n## 10_ Lift\n\n## 11_ Overfitting\n\nOften we train our model so much or make our model so complex that our model fits too tghtly with the training data.\n\nThe training data often contains outliers or represents misleading patterns in the data. Fitting the training data with such irregularities to deeply cause the model to lose its generalization. The model performs very well on the training set but not so good on the test set. \n\n![overfitting](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_9e198f756486.png)\n\nAs we can see on training further a point the training error decreases and testing error increases.\n\nA hypothesis h1 is said to overfit iff there exists another hypothesis h where h gives more error than h1 on training data and less error than h1 on the test data\n\n## 12_ Bias & variance\n\nBias is the difference between the average prediction of our model and the correct value which we are trying to predict. Model with high bias pays very little attention to the training data and oversimplifies the model. It always leads to high error on training and test data.\n\nVariance is the variability of model prediction for a given data point or a value which tells us spread of our data. Model with high variance pays a lot of attention to training data and does not generalize on the data which it hasn’t seen before. As a result, such models perform very well on training data but has high error rates on test data.\n\n\nBasically High variance causes overfitting and high bias causes underfitting. We want our model to have low bias and low variance to perform perfectly. We need to avoid a model with higher variance and high bias\n\n![bias&variance](https:\u002F\u002Fcommunity.alteryx.com\u002Ft5\u002Fimage\u002Fserverpage\u002Fimage-id\u002F52874iE986B6E19F3248CF?v=1.0)\n\nWe can see that for Low bias and Low Variance our model predicts all the data points correctly. Again in the last image having high bias and high variance the model predicts no data point correctly.\n\n![B&v2](https:\u002F\u002Fadolfoeliazat.com\u002Fwp-content\u002Fuploads\u002F2020\u002F07\u002FBias-Variance-tradeoff-in-Machine-Learning.png)\n\nWe can see from the graph that rge Error increases when the complex is either too complex or the model is too simple. The bias increases with simpler model and Variance increases with complex models.\n\nThis is one of the most important tradeoffs in machine learning\n\n\n\n## 13_ Tree and classification\n\nWe have previously talked about classificaion. We have seen the most used methods are Logistic Regression, SVMs and decision trees. Now, if the decision boundary is linear the methods like logistic regression and SVM serves best, but its a complete scenerio when the decision boundary is non linear, this is where decision tree is used.\n\n![tree](https:\u002F\u002Fwww.researchgate.net\u002Fprofile\u002FZena_Hira\u002Fpublication\u002F279274803\u002Ffigure\u002Ffig4\u002FAS:324752402075653@1454438414424\u002FLinear-versus-nonlinear-classification-problems.png)\n\nThe first image shows linear decision boundary and second image shows non linear decision boundary.\n\nIh the cases, for non linear boundaries, the decision trees condition based approach work very well for classification problems. The algorithm creates conditions on features to drive and reach a decision, so is independent of functions.\n\n![tree2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_bc8e8d92fd7c.png)\n\nDecision tree approach for classification\n\n## 14_ Classification rate\n\n## 15_ Decision tree\n\nDecision Trees are some of the most used machine learning algorithms. They are used for both classification and Regression. They can be used for both linear and non-linear data, but they are mostly used for non-linear data. Decision Trees as the name suggests works on a set of decisions derived from the data and its behavior. It does not use a linear classifier or regressor, so its performance is independent of the linear nature of the data. \n\nOne of the other most important reasons to use tree models is that they are very easy to interpret.\n\nDecision Trees can be used for both classification and regression. The methodologies are a bit different, though principles are the same. The decision trees use the CART algorithm (Classification and Regression Trees)\n\nResource:\n\n[Guide to Decision Tree](https:\u002F\u002Ftowardsdatascience.com\u002Fa-dive-into-decision-trees-a128923c9298)\n\n\n## 16_ Boosting\n\n#### Ensemble Learning\n\nIt is the method used to enhance the performance of the Machine learning models by combining several number of models or weak learners. They provide improved efficiency.\n\nThere are two types of ensemble learning:\n\n__1. Parallel ensemble learning or bagging method__\n\n__2. Sequential ensemble learning or boosting method__\n\nIn parallel method or bagging technique, several weak classifiers are created in parallel. The training datasets are created randomly on a bootstrapping basis from the original dataset. The datasets used for the training and creation phases are weak classifiers. Later during predictions, the reults from all the classifiers are bagged together to provide the final results.\n\n![bag](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_e35b812a46d7.png)\n\nEx: Random Forests\n\nIn sequential learning or boosting weak learners are created one after another and the data sample set are weighted in such a manner that during creation, the next learner focuses on the samples that were wrongly predicted by the previous classifier. So, at each step, the classifier improves and learns from its previous mistakes or misclassifications.\n\n![boosting](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_45908f0b9604.jpg)\n\nThere are mostly three types of boosting algorithm:\n\n__1. Adaboost__\n\n__2. Gradient Boosting__\n\n__3. XGBoost__\n\n__Adaboost__ algorithm works in the exact way describe. It creates a weak learner, also known as stumps, they are not full grown trees, but contain a single node based on which the classification is done. The misclassifications are observed and they are weighted more than the correctly classified ones while training the next weak learner. \n\n__sklearn.ensemble.AdaBoostClassifier__ is used for the application of the classifier on real data in python.\n\n![adaboost](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_c97e86a66999.jpg)\n\nReources:\n\n[Understanding](https:\u002F\u002Fblog.paperspace.com\u002Fadaboost-optimizer\u002F#:~:text=AdaBoost%20is%20an%20ensemble%20learning,turn%20them%20into%20strong%20ones.)\n\n\n__Gradient Boosting__ algorithm starts with a node giving 0.5 as output for both classification and regression. It serves as the first stump or weak learner. We then observe the Errors in predictions. Now, we create other learners or decision trees to actually predict the errors based on the conditions. The errors are called Residuals. Our final output is:\n\n__0.5 (Provided by the first learner) + The error provided by the second tree or learner.__\n\nNow, if we use this method, it learns the predictions too tightly, and loses generalization. In order to avoid that gradient boosting uses a learning parameter _alpha_. \n\nSo, the final results after two learners is obtained as:\n\n__0.5 (Provided by the first learner) + _alpha_ X (The error provided by the second tree or learner.)__\n\nWe can see that using the added portion we take a small leap towards the correct results. We continue adding learners until the point we are very close to the actual value given by the training set.\n\nOverall the equation becomes:\n\n\n__0.5 (Provided by the first learner) + _alpha_ X (The error provided by the second tree or learner.)+ _alpha_ X (The error provided by the third tree or learner.)+.............__\n\n\n__sklearn.ensemble.GradientBoostingClassifier__ used to apply gradient boosting in python\n\n![GBM](https:\u002F\u002Fwww.elasticfeed.com\u002Fwp-content\u002Fuploads\u002F09cc1168a39db0c0d6ea1c66d27ecfd3.jpg)\n\nResource:\n\n[Guide](https:\u002F\u002Fmedium.com\u002Fmlreview\u002Fgradient-boosting-from-scratch-1e317ae4587d) \n\n## 17_ Naïves Bayes classifiers\n\nThe Naive Bayes classifiers are a collection of classification algorithms based on __Bayes’ Theorem.__\n\nBayes theorem describes the probability of an event, based on prior knowledge of conditions that might be related to the event. It is given by:\n\n![bayes](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F87c061fe1c7430a5201eef3fa50f9d00eac78810)\n\nWhere P(A|B) is the probabaility of occurrence of A knowing B already occurred and P(B|A) is the probability of occurrence of B knowing A occurred.\n\n[Scikit-learn Guide](https:\u002F\u002Fgithub.com\u002Fabr-98\u002Fdata-scientist-roadmap\u002Fedit\u002Fmaster\u002F04_Machine-Learning\u002FREADME.md)\n\nThere are mostly two types of Naive Bayes:\n\n__1. Gaussian Naive Bayes__\n\n__2. Multinomial Naive Bayes.__\n\n#### Multinomial Naive Bayes\n\nThe method is used mostly for document classification. For example, classifying an article as sports article or say film magazine. It is also used for differentiating actual mails from spam mails. It uses the frequency of words used in different magazine to make a decision.\n\nFor example, the word \"Dear\" and \"friends\" are used a lot in actual mails and \"offer\" and \"money\" are used a lot in \"Spam\" mails. It calculates the prorbability of the occurrence of the words in case of actual mails and spam mails using the training examples. So, the probability of occurrence of \"money\" is much higher in case of spam mails and so on. \n\nNow, we calculate the probability of a mail being a spam mail using the occurrence of words in it. \n\n#### Gaussian Naive Bayes\n\nWhen the predictors take up a continuous value and are not discrete, we assume that these values are sampled from a gaussian distribution.\n\n![gnb](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_018b143af3e4.gif)\n\nIt links guassian distribution and Bayes theorem. \n\nResources:\n\n[GUIDE](https:\u002F\u002Fyoutu.be\u002FH3EjCKtlVog)\n\n## 18_ K-Nearest neighbor\n\nK-nearest neighbour algorithm is the most basic and still essential algorithm. It is a memory based approach and not a model based one. \n\nKNN is used in both supervised and unsupervised learning. It simply locates the data points across the feature space and used distance as a similarity metrics.\n\nLesser the distance between two data points, more similar the points are. \n\nIn K-NN classification algorithm, the point to classify is plotted on the feature space and classified as the class of its nearest K-neighbours. K is the user parameter. It gives the measure of how many points we should consider while deciding the label of the point concerned. If K is more than 1 we consider the label that is in majority.\n\nIf the dataset is very large, we can use a large k. The large k is less effected by noise and generates smooth boundaries. For small dataset, a small k must be used. A small k helps to notice the variation in boundaries better.\n\n![knn](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_9483bd7d8e73.jpg)\n\nResource:\n\n[GUIDE](https:\u002F\u002Ftowardsdatascience.com\u002Fmachine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761)\n\n## 19_ Logistic regression\n\nRegression is one of the most important concepts used in machine learning.\n\n[Guide to regression](https:\u002F\u002Ftowardsdatascience.com\u002Fa-deep-dive-into-the-concept-of-regression-fb912d427a2e)\n\nLogistic Regression is the most used classification algorithm for linearly seperable datapoints. Logistic Regression is used when the dependent variable is categorical. \n\nIt uses the linear regression equation:\n\n__Y= w1x1+w2x2+w3x3……..wkxk__\n\nin a modified format:\n\n__Y= 1\u002F 1+e^-(w1x1+w2x2+w3x3……..wkxk)__\n\nThis modification ensures the value always stays between 0 and 1. Thus, making it feasible to be used for classification.\n\nThe above equation is called __Sigmoid__ function. The function looks like:\n\n![Logreg](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_0f27d141e12c.png)\n\nThe loss fucnction used is called logloss or binary cross-entropy.\n\n__Loss= —Y_actual. log(h(x)) —(1 — Y_actual.log(1 — h(x)))__\n\nIf Y_actual=1, the first part gives the error, else the second part.\n\n![loss](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_39730dbde797.png)\n\nLogistic Regression is used for multiclass classification also. It uses softmax regresssion or One-vs-all logistic regression.\n\n[Guide to logistic Regression](https:\u002F\u002Ftowardsdatascience.com\u002Flogistic-regression-detailed-overview-46c4da4303bc)\n\n\n__sklearn.linear_model.LogisticRegression__ is used to apply logistic Regression in python.\n\n## 20_ Ranking\n\n## 21_ Linear regression\n\nRegression tasks deal with predicting the value of a dependent variable from a set of independent variables i.e, the provided features. Say, we want to predict the price of a car. So, it becomes a dependent variable say Y, and the features like engine capacity, top speed, class, and company become the independent variables, which helps to frame the equation to obtain the price.\n\n\nNow, if there is one feature say x. If the dependent variable y is linearly dependent on x, then it can be given by y=mx+c, where the m is the coefficient of the feature in the equation, c is the intercept or bias. Both M and C are the model parameters.\n\nWe use a loss function or cost function called Mean Square error of (MSE). It is given by the square of the difference between the actual and the predicted value of the dependent variable.\n\n__MSE=1\u002F2m * (Y_actual — Y_pred)²__\n\nIf we observe the function we will see its a parabola, i.e, the function is convex in nature. This convex function is the principle used in Gradient Descent to obtain the value of the model parameters\n\n![loss](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_e1dfc17ebb6a.png)\n\nThe image shows the loss function.\n\nTo get the correct estimate of the model parameters we use the method of __Gradient Descent__\n\n[Guide to Gradient Descent](https:\u002F\u002Ftowardsdatascience.com\u002Fan-introduction-to-gradient-descent-and-backpropagation-81648bdb19b2)\n\n[Guide to linear Regression](https:\u002F\u002Ftowardsdatascience.com\u002Flinear-regression-detailed-view-ea73175f6e86)\n\n__sklearn.linear_model.LinearRegression__ is used to apply linear regression in python\n\n## 22_ Perceptron\n\nThe perceptron has been the first model described in the 50ies.\n\nThis is a __binary classifier__, ie it can't separate more than 2 groups, and thoses groups have to be __linearly separable__.\n\nThe perceptron __works like a biological neuron__. It calculate an activation value, and if this value if positive, it returns 1, 0 otherwise.\n\n## 23_ Hierarchical clustering\n\nThe hierarchical algorithms are so-called because they create tree-like structures to create clusters. These algorithms also use a distance-based approach for cluster creation.\n\nThe most popular algorithms are:\n\n__Agglomerative Hierarchical clustering__\n\n__Divisive Hierarchical clustering__\n\n__Agglomerative Hierarchical clustering__: In this type of hierarchical clustering, each point initially starts as a cluster, and slowly the nearest or similar most clusters merge to create one cluster.\n\n__Divisive Hierarchical Clustering__: The type of hierarchical clustering is just the opposite of Agglomerative clustering. In this type, all the points start as one large cluster and slowly the clusters get divided into smaller clusters based on how large the distance or less similarity is between the two clusters. We keep on dividing the clusters until all the points become individual clusters.\n\nFor agglomerative clustering, we keep on merging the clusters which are nearest or have a high similarity score to one cluster. So, if we define a cut-off or threshold score for the merging we will get multiple clusters instead of a single one. For instance, if we say the threshold similarity metrics score is 0.5, it means the algorithm will stop merging the clusters if no two clusters are found with a similarity score less than 0.5, and the number of clusters present at that step will give the final number of clusters that need to be created to the clusters.\n\nSimilarly, for divisive clustering, we divide the clusters based on the least similarity scores. So, if we define a score of 0.5, it will stop dividing or splitting if the similarity score between two clusters is less than or equal to 0.5. We will be left with a number of clusters and it won’t reduce to every point of the distribution.\n\nThe process is as shown below:\n\n![HC](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_31d239255082.png)\n\nOne of the most used methods for the measuring distance and applying cutoff is the dendrogram method.\n\nThe dendogram for above clustering is:\n\n![Dend](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_d927a9f3404b.png)\n\n[Guide](https:\u002F\u002Ftowardsdatascience.com\u002Funderstanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec)\n\n## 24_ K-means clustering\n\nThe algorithm initially creates K clusters randomly using N data points and finds the mean of all the point values in a cluster for each cluster. So, for each cluster we find a central point or centroid calculating the mean of the values of the cluster. Then the algorithm calculates the sum of squared error (SSE) for each cluster. SSE is used to measure the quality of clusters. If a cluster has large distances between the points and the center, then the SSE will be high and if we check the interpretation it allows only points in the close vicinity to create clusters.\n\nThe algorithm works on the principle that the points lying close to a center of a cluster should be in that cluster. So, if a point x is closer to the center of cluster A than cluster B, then x will belong to cluster A. Thus a point enters a cluster and as even a single point moves from one cluster to another, the centroid changes and so does the SSE. We keep doing this until the SSE decreases and the centroid does not change anymore. After a certain number of shifts, the optimal clusters are found and the shifting stops as the centroids don’t change any more.\n\nThe initial number of clusters ‘K’ is a user parameter.\n\nThe image shows the method\n\n![Kmeans](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_92940bd6fb81.png)\n\nWe have seen that for this type of clustering technique we need a user-defined parameter ‘K’ which defines the number of clusters that need to be created. Now, this is a very important parameter. To, find this parameter a number of methods are used. The most important and used method is the elbow method.\nFor smaller datasets, k=(N\u002F2)^(1\u002F2) or the square root of half of the number of points in the distribution.\n\n[Guide](https:\u002F\u002Ftowardsdatascience.com\u002Funderstanding-k-means-clustering-in-machine-learning-6a6e67336aa1)\n\n## 25_ Neural networks\n\nNeural Networks are a set of interconnected layers of artificial neurons or nodes. They are frameworks that are modeled keeping in mind, the structure and working of the human brain. They are meant for predictive modeling and applications where they can be trained via a dataset. They are based on self-learning algorithms and predict based on conclusions and complex relations derived from their training sets of information.\n\nA typical Neural Network has a number of layers. The First Layer is called the Input Layer and The Last layer is called the Output Layer. The layers between the Input and Output layers are called Hidden Layers. It basically functions like a Black Box for prediction and classification. All the layers are interconnected and consist of numerous artificial neurons called Nodes.\n\n[Guide to nueral Networks](https:\u002F\u002Fmedium.com\u002Fai-in-plain-english\u002Fneural-networks-overview-e6ea484a474e)\n\nNeural networks are too complex to work on Gradient Descent algorithms, so it works on the principles of Backproapagations and Optimizers.\n\n[Guide to Backpropagation](https:\u002F\u002Ftowardsdatascience.com\u002Fan-introduction-to-gradient-descent-and-backpropagation-81648bdb19b2)\n\n[Guide to optimizers](https:\u002F\u002Ftowardsdatascience.com\u002Fintroduction-to-gradient-descent-weight-initiation-and-optimizers-ee9ae212723f)\n\n## 26_ Sentiment analysis\n\nText Classification and sentiment analysis is a very common machine learning problem and is used in a lot of activities like product predictions, movie recommendations, and several others.\n\nText classification problems like sentimental analysis can be achieved in a number of ways using a number of algorithms. These are majorly divided into two main categories:\n\nA bag of Word model: In this case, all the sentences in our dataset are tokenized to form a bag of words that denotes our vocabulary. Now each individual sentence or sample in our dataset is represented by that bag of words vector. This vector is called the feature vector. For example, ‘It is a sunny day’, and ‘The Sun rises in east’ are two sentences. The bag of words would be all the words in both the sentences uniquely.\n\nThe second method is based on a time series approach: Here each word is represented by an Individual vector. So, a sentence is represented as a vector of vectors.\n\n[Guide to sentimental analysis](https:\u002F\u002Ftowardsdatascience.com\u002Fa-guide-to-text-classification-and-sentiment-analysis-2ab021796317)\n\n## 27_ Collaborative filtering\n\nWe all have used services like Netflix, Amazon, and Youtube. These services use very sophisticated systems to recommend the best items to their users to make their experiences great. \n\nRecommenders mostly have 3 components mainly, out of which, one of the main component is Candidate generation. This method is responsible for generating smaller subsets of candidates to recommend to a user, given a huge pool of thousands of items.\n\nTypes of Candidate Generation Systems:\n\n__Content-based filtering System__\n\n__Collaborative filtering System__\n\n__Content-based filtering system__: Content-Based recommender system tries to guess the features or behavior of a user given the item’s features, he\u002Fshe reacts positively to.\n\n__Collaborative filtering System__: Collaborative does not need the features of the items to be given. Every user and item is described by a feature vector or embedding.\n\nIt creates embedding for both users and items on its own. It embeds both users and items in the same embedding space.\n\nIt considers other users’ reactions while recommending a particular user. It notes which items a particular user likes and also the items that the users with behavior and likings like him\u002Fher likes, to recommend items to that user.\n\nIt collects user feedbacks on different items and uses them for recommendations.\n\n[Guide to collaborative filtering](https:\u002F\u002Ftowardsdatascience.com\u002Fintroduction-to-recommender-systems-1-971bd274f421)\n\n## 28_ Tagging\n\n## 29_ Support Vector Machine\n\nSupport vector machines are used for both Classification and Regressions. \n\nSVM uses a margin around its classifier or regressor. The margin provides an extra robustness and accuracy to the model and its performance.  \n\n![SVM](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F7\u002F72\u002FSVM_margin.png\u002F300px-SVM_margin.png)\n\nThe above image describes a SVM classifier. The Red line is the actual classifier and the dotted lines show the boundary. The points that lie on the boundary actually decide the Margins. They support the classifier margins, so they are called __Support Vectors__.\n\nThe distance between the classifier and the nearest points is called __Marginal Distance__.\n\nThere can be several classifiers possible but we choose the one with the maximum marginal distance. So, the marginal distance and the support vectors help to choose the best classifier.\n\n[Official Documentation from Sklearn](https:\u002F\u002Fscikit-learn.org\u002Fstable\u002Fmodules\u002Fsvm.html)\n\n[Guide to SVM](https:\u002F\u002Ftowardsdatascience.com\u002Fsupport-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47)\n\n## 30_Reinforcement Learning\n\n“Reinforcement learning (RL) is an area of machine learning concerned with how software agents ought to take actions in an environment in order to maximize the notion of cumulative reward.”\n\nTo play a game, we need to make multiple choices and predictions during the course of the game to achieve success, so they can be called a multiple decision processes. This is where we need a type of algorithm called reinforcement learning algorithms. The class of algorithm is based on decision-making chains which let such algorithms to support multiple decision processes.\n\nThe reinforcement algorithm can be used to reach a goal state from a starting state making decisions accordingly. \n\nThe reinforcement learning involves an agent which learns on its own. If it makes a correct or good move that takes it towards the goal, it is positively rewarded, else not. This way the agent learns.\n\n![reinforced](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_a9b482400aa0.png)\n\nThe above image shows reinforcement learning setup.\n\n[WIKI](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FReinforcement_learning#:~:text=Reinforcement%20learning%20(RL)%20is%20an,supervised%20learning%20and%20unsupervised%20learning.)\n\n# 5_ Text Mining\n\n## 1_ Corpus\n\n## 2_ Named Entity Recognition\n\n## 3_ Text Analysis\n\n## 4_ UIMA\n\n## 5_ Term Document matrix\n\n## 6_ Term frequency and Weight\n\n## 7_ Support Vector Machines (SVM)\n\n## 8_ Association rules\n\n## 9_ Market based analysis\n\n## 10_ Feature extraction\n\n## 11_ Using mahout\n\n## 12_ Using Weka\n\n## 13_ Using NLTK\n\n## 14_ Classify text\n\n## 15_ Vocabulary mapping\n\n# 6_ Data Visualization\n\nOpen .R scripts in Rstudio for line-by-line execution.\n\nSee [10_ Toolbox\u002F3_ R, Rstudio, Rattle](https:\u002F\u002Fgithub.com\u002FMrMimic\u002Fdata-scientist-roadmap\u002Ftree\u002Fmaster\u002F10_Toolbox#3_-r-rstudio-rattle) for installation.\n\n## 1_ Data exploration in R\n\nIn mathematics, the graph of a function f is the collection of all ordered pairs (x, f(x)). If the function input x is a scalar, the graph is a two-dimensional graph, and for a continuous function is a curve. If the function input x is an ordered pair (x1, x2) of real numbers, the graph is the collection of all ordered triples (x1, x2, f(x1, x2)), and for a continuous function is a surface.\n\n## 2_ Uni, bi and multivariate viz\n\n### Univariate\n\nThe term is commonly used in statistics to distinguish a distribution of one variable from a distribution of several variables, although it can be applied in other ways as well. For example, univariate data are composed of a single scalar component. In time series analysis, the term is applied with a whole time series as the object referred to: thus a univariate time series refers to the set of values over time of a single quantity.\n\n### Bivariate\n\nBivariate analysis is one of the simplest forms of quantitative (statistical) analysis.[1] It involves the analysis of two variables (often denoted as X, Y), for the purpose of determining the empirical relationship between them.\n\n### Multivariate\n\nMultivariate analysis (MVA) is based on the statistical principle of multivariate statistics, which involves observation and analysis of more than one statistical outcome variable at a time. In design and analysis, the technique is used to perform trade studies across multiple dimensions while taking into account the effects of all variables on the responses of interest.\n\n## 3_ ggplot2\n\n### About\n\nggplot2 is a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the bad parts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphics that makes it easy to produce complex multi-layered graphics.\n\n[http:\u002F\u002Fggplot2.org\u002F](http:\u002F\u002Fggplot2.org\u002F)\n\n### Documentation\n\n### Examples\n\n[http:\u002F\u002Fr4stats.com\u002Fexamples\u002Fgraphics-ggplot2\u002F](http:\u002F\u002Fr4stats.com\u002Fexamples\u002Fgraphics-ggplot2\u002F)\n\n## 4_ Histogram and pie (Uni)\n\n### About\n\nHistograms and pie are 2 types of graphes used to visualize frequencies. \n\nHistogram is showing the distribution of these frequencies over classes, and pie the relative proportion of this frequencies in a 100% circle.\n\n## 5_ Tree & tree map\n\n### About\n\n[Treemaps](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTreemapping) display hierarchical (tree-structured) data as a set of nested rectangles.\nEach branch of the tree is given a rectangle, which is then tiled with smaller rectangles representing sub-branches.\nA leaf node’s rectangle has an area proportional to a specified dimension of the data.\nOften the leaf nodes are colored to show a separate dimension of the data.\n\n### When to use it ?\n\n- Less than 10 branches.\n- Positive values.\n- Space for visualisation is limited.\n\n### Example\n\n![treemap-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_efbe0ca145c1.png)\n\nThis treemap describes volume for each product universe with corresponding surface. Liquid products are more sold than others.\nIf you want to explore more, we can go into products “liquid” and find which shelves are prefered by clients.\n\n### More information\n\n[Matplotlib Series 5: Treemap](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series5-treemap\u002F)\n\n## 6_ Scatter plot\n\n### About\n\nA [scatter plot](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FScatter_plot) (also called a scatter graph, scatter chart, scattergram, or scatter diagram) is a type of plot or mathematical diagram using Cartesian coordinates to display values for typically two variables for a set of data.\n\n### When to use it ?\n\nScatter plots are used when you want to show the relationship between two variables.\nScatter plots are sometimes called correlation plots because they show how two variables are correlated.\n\n### Example\n\n![scatter-plot-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_bcb0e1a441cb.png)\n\nThis plot describes the positive relation between store’s surface and its turnover(k euros), which is reasonable: for stores, the larger it is, more clients it can accept, more turnover it will generate.\n\n### More information\n\n[Matplotlib Series 4: Scatter plot](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series4-scatter-plot\u002F)\n\n## 7_ Line chart\n\n### About\n\nA [line chart](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLine_chart) or line graph is a type of chart which displays information as a series of data points called ‘markers’ connected by straight line segments. A line chart is often used to visualize a trend in data over intervals of time – a time series – thus the line is often drawn chronologically.\n\n### When to use it ?\n\n- Track changes over time.\n- X-axis displays continuous variables.\n- Y-axis displays measurement.\n\n### Example\n\n![line-chart-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_1ffcdf965ff5.png)\n\nSuppose that the plot above describes the turnover(k euros) of ice-cream’s sales during one year.\nAccording to the plot, we can clearly find that the sales reach a peak in summer, then fall from autumn to winter, which is logical.\n\n### More information\n\n[Matplotlib Series 2: Line chart](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series2-line-chart\u002F)\n\n## 8_ Spatial charts\n\n## 9_ Survey plot\n\n## 10_ Timeline\n\n## 11_ Decision tree\n\n## 12_ D3.js\n\n### About\n\nThis is a JavaScript library, allowing you to create a huge number of different figure easily.\n\nhttps:\u002F\u002Fd3js.org\u002F\n\n    D3.js is a JavaScript library for manipulating documents based on data. \n    D3 helps you bring data to life using  HTML, SVG, and CSS. \n    D3’s emphasis on web standards gives you the full capabilities of modern browsers without tying yourself to a proprietary framework, combining powerful visualization components and a data-driven approach to DOM manipulation. \n\n### Examples\n\nThere is many examples of chars using D3.js on [D3's Github](https:\u002F\u002Fgithub.com\u002Fd3\u002Fd3\u002Fwiki\u002FGallery).\n\n## 13_ InfoVis\n\n## 14_ IBM ManyEyes\n\n## 15_ Tableau\n\n## 16_ Venn diagram\n\n### About\n\nA [venn diagram](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FVenn_diagram) (also called primary diagram, set diagram or logic diagram) is a diagram that shows all possible logical relations between a finite collection of different sets.\n\n### When to use it ?\n\nShow logical relations between different groups (intersection, difference, union).\n\n### Example\n\n![venn-diagram-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_45aa3a103bef.png)\n\nThis kind of venn diagram can usually be used in retail trading.\nAssuming that we need to study the popularity of cheese and red wine, and 2500 clients answered our questionnaire.\nAccording to the diagram above, we find that among 2500 clients, 900 clients(36%) prefer cheese, 1200 clients(48%) prefer red wine, and 400 clients(16%) favor both product.\n\n### More information\n\n[Matplotlib Series 6: Venn diagram](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series6-venn-diagram\u002F)\n\n## 17_ Area chart\n\n### About\n\nAn [area chart](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FArea_chart) or area graph displays graphically quantitative data.\nIt is based on the line chart. The area between axis and line are commonly emphasized with colors, textures and hatchings.\n\n### When to use it ?\n\nShow or compare a quantitative progression over time.\n\n### Example\n\n![area-chart-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_ce048fe913d8.png)\n\nThis stacked area chart displays the amounts’ changes in each account, their contribution to total amount (in term of value) as well.\n\n### More information\n\n[Matplotlib Series 7: Area chart](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series7-area-chart\u002F)\n\n## 18_ Radar chart\n\n### About\n\nThe [radar chart](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FRadar_chart) is a chart and\u002For plot that consists of a sequence of equi-angular spokes, called radii, with each spoke representing one of the variables. The data length of a spoke is proportional to the magnitude of the variable for the data point relative to the maximum magnitude of the variable across all data points. A line is drawn connecting the data values for each spoke. This gives the plot a star-like appearance and the origin of one of the popular names for this plot.\n\n### When to use it ?\n\n- Comparing two or more items or groups on various features or characteristics.\n- Examining the relative values for a single data point.\n- Displaying less than ten factors on one radar chart.\n\n### Example\n\n![radar-chart-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_37065c0c9783.png)\n\nThis radar chart displays the preference of 2 clients among 4.\nClient c1 favors chicken and bread, and doesn’t like cheese that much.\nNevertheless, client c2 prefers cheese to other 4 products and doesn’t like beer.\nWe can have an interview with these 2 clients, in order to find the weakness of products which are out of preference.\n\n### More information\n\n[Matplotlib Series 8: Radar chart](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series8-radar-chart\u002F)\n\n## 19_ Word cloud\n\n### About\n\nA [word cloud](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTag_cloud) (tag cloud, or weighted list in visual design) is a novelty visual representation of text data. Tags are usually single words, and the importance of each tag is shown with font size or color. This format is useful for quickly perceiving the most prominent terms and for locating a term alphabetically to determine its relative prominence.\n\n### When to use it ?\n\n- Depicting keyword metadata (tags) on websites.\n- Delighting and provide emotional connection.\n\n### Example\n\n![word-cloud-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_5d25b51921d9.png)\n\nAccording to this word cloud, we can globally know that data science employs techniques and theories drawn from many fields within the context of mathematics, statistics, information science, and computer science. It can be used for business analysis, and called “The Sexiest Job of the 21st Century”.\n\n### More information\n\n[Matplotlib Series 9: Word cloud](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series9-word-cloud\u002F)\n\n\n# 7_ Big Data\n\n## 1_ Map Reduce fundamentals\n\n## 2_ Hadoop Ecosystem\n\n## 3_ HDFS\n\n## 4_ Data replications Principles\n\n## 5_ Setup Hadoop\n\n## 6_ Name & data nodes\n\n## 7_ Job & task tracker\n\n## 8_ M\u002FR\u002FSAS programming\n\n## 9_ Sqop: Loading data in HDFS\n\n## 10_ Flume, Scribe\n\n## 11_ SQL with Pig\n\n## 12_ DWH with Hive\n\n## 13_ Scribe, Chukwa for Weblog\n\n## 14_ Using Mahout\n\n## 15_ Zookeeper Avro\n\n## 16_ Lambda Architecture \n\n## 17_ Storm: Hadoop Realtime\n\n## 18_ Rhadoop, RHIPE\n\n## 19_ RMR\n\n## 20_ NoSQL Databases (MongoDB, Neo4j)\n\n## 21_ Distributed Databases and Systems (Cassandra)\n\n\n# 8_ Data Ingestion\n\n## 1_ Summary of data formats\n\n## 2_ Data discovery\n\n## 3_ Data sources & Acquisition\n\n## 4_ Data integration\n\n## 5_ Data fusion\n\n## 6_ Transformation & enrichment\n\n## 7_ Data survey\n\n## 8_ Google OpenRefine\n\n## 9_ How much data ?\n\n## 10_ Using ETL\n# 9_ Data Munging\n\n## 1_ Dim. and num. reduction\n\n## 2_ Normalization\n\n## 3_ Data scrubbing\n\n## 4_ Handling missing Values\n\n## 5_ Unbiased estimators\n\n## 6_ Binning Sparse Values\n\n## 7_ Feature extraction\n\n## 8_ Denoising\n\n## 9_ Sampling\n\n## 10_ Stratified sampling\n\n## 11_ PCA\n\n# 10_ Toolbox\n\n## 1_ MS Excel with Analysis toolpack\n\n## 2_ Java, Python\n\n## 3_ R, Rstudio, Rattle\n\n## 4_ Weka, Knime, RapidMiner\n\n## 5_ Hadoop dist of choice\n\n## 6_ Spark, Storm\n\n## 7_ Flume, Scibe, Chukwa\n\n## 8_ Nutch, Talend, Scraperwiki\n\n## 9_ Webscraper, Flume, Sqoop\n\n## 10_ tm, RWeka, NLTK\n\n## 11_ RHIPE\n\n## 12_ D3.js, ggplot2, Shiny\n\n## 13_ IBM Languageware\n\n## 14_ Cassandra, MongoDB\n\n## 13_ Microsoft Azure, AWS, Google Cloud\n\n## 14_ Microsoft Cognitive API\n\n## 15_ Tensorflow\n\nhttps:\u002F\u002Fwww.tensorflow.org\u002F\n\nTensorFlow is an open source software library for numerical computation using data flow graphs. \n\nNodes in the graph represent mathematical operations, while the graph edges represent the multidimensional data arrays (tensors) communicated between them. \n\nThe flexible architecture allows you to deploy computation to one or more CPUs or GPUs in a desktop, server, or mobile device with a single API. \n\nTensorFlow was originally developed by researchers and engineers working on the Google Brain Team within Google's Machine Intelligence research organization for the purposes of conducting machine learning and deep neural networks research, but the system is general enough to be applicable in a wide variety of other domains as well. \n\n\n\n\n\n# OTHER FREE COURSES \n\n### Artificial Intelligence\n\n- [CS 188 - Introduction to Artificial Intelligence, UC Berkeley - Spring 2015](http:\u002F\u002Fwww.infocobuild.com\u002Feducation\u002Faudio-video-courses\u002Fcomputer-science\u002Fcs188-spring2015-berkeley.html)\n- [6.034 Artificial Intelligence, MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Felectrical-engineering-and-computer-science\u002F6-034-artificial-intelligence-fall-2010\u002Flecture-videos\u002F)\n- [CS221: Artificial Intelligence: Principles and Techniques - Autumn 2019 - Stanford University](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rO1NB9TD4iUZ3qghGEGtqNX)\n- [15-780 - Graduate Artificial Intelligence, Spring 14, CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~zkolter\u002Fcourse\u002F15-780-s14\u002Flectures.html)\n- [CSE 592 Applications of Artificial Intelligence, Winter 2003 - University of Washington](https:\u002F\u002Fcourses.cs.washington.edu\u002Fcourses\u002Fcsep573\u002F03wi\u002Flectures\u002Findex.htm)\n- [CS322 - Introduction to Artificial Intelligence, Winter 2012-13 - UBC](http:\u002F\u002Fwww.cs.ubc.ca\u002F~mack\u002FCS322\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLDPnGbm0sUmpzvcGvktbz446SLdFbfZVU))\n- [CS 4804: Introduction to Artificial Intelligence, Fall 2016](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLUenpfvlyoa1iiSbGy9BBewgiXjzxVgBd)\n- [CS 5804: Introduction to Artificial Intelligence, Spring 2015](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLUenpfvlyoa0PB6_kqJ9WU7m6i6z1RhfJ)\n- [Artificial Intelligence - IIT Kharagpur](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106105077\u002F)\n- [Artificial Intelligence - IIT Madras](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106126\u002F)\n- [Artificial Intelligence(Prof.P.Dasgupta) - IIT Kharagpur](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106105079\u002F)\n- [MOOC - Intro to Artificial Intelligence - Udacity](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPlqMkzr4xyuD6cXTIgPuzgn)\n- [MOOC - Artificial Intelligence for Robotics - Udacity](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPkCSYXw6-a_aAoXVKLDwnHK)\n- [Graduate Course in Artificial Intelligence, Autumn 2012 - University of Washington](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbQ3Aya0VERDoDdbMogU9EASJGWris9qG)\n- [Agent-Based Systems 2015\u002F16- University of Edinburgh](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fabs.htm)\n- [Informatics 2D - Reasoning and Agents 2014\u002F15- University of Edinburgh](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2014\u002Finf2d.htm)\n- [Artificial Intelligence - Hochschule Ravensburg-Weingarten](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL39B5D3AFC249556A)\n- [Deductive Databases and Knowledge-Based Systems - Technische Universität Braunschweig, Germany](http:\u002F\u002Fwww.ifis.cs.tu-bs.de\u002Fteaching\u002Fws-1516\u002FKBS)\n- [Artificial Intelligence: Knowledge Representation and Reasoning - IIT Madras](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106140\u002F)\n- [Semantic Web Technologies by Dr. Harald Sack - HPI](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoOmvuyo5UAeihlKcWpzVzB51rr014TwD)\n- [Knowledge Engineering with Semantic Web Technologies by Dr. Harald Sack - HPI](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoOmvuyo5UAcBXlhTti7kzetSsi1PpJGR)\n\n--------------\n\n### Machine Learning\n\n- **Introduction to Machine Learning**\n\n\t- [MOOC Machine Learning Andrew Ng - Coursera\u002FStanford](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN) ([Notes](http:\u002F\u002Fwww.holehouse.org\u002Fmlclass\u002F))\n\t- [Introduction to Machine Learning for Coders](https:\u002F\u002Fcourse.fast.ai\u002Fml.html)\n\t- [MOOC - Statistical Learning, Stanford University](http:\u002F\u002Fwww.dataschool.io\u002F15-hours-of-expert-machine-learning-videos\u002F)\n\t- [Foundations of Machine Learning Boot Camp, Berkeley Simons Institute](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgKuh-lKre11GbZWneln-VZDLHyejO7YD)\n\t- [CS155 - Machine Learning & Data Mining, 2017 - Caltech](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLuz4CTPOUNi6BfMrltePqMAHdl5W33-bC) ([Notes](http:\u002F\u002Fwww.yisongyue.com\u002Fcourses\u002Fcs155\u002F2017_winter\u002F)) ([2016](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL5HdMttxBY0BVTP9y7qQtzTgmcjQ3P0mb))\n\t- [CS 156 - Learning from Data, Caltech](https:\u002F\u002Fwork.caltech.edu\u002Flectures.html)\n\t- [10-601 - Introduction to Machine Learning (MS) - Tom Mitchell - 2015, CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~ninamf\u002Fcourses\u002F601sp15\u002Flectures.shtml) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAJ0alZrN8rD63LD0FkzKFiFgkOmEtltQ))\n\t- [10-601 Machine Learning | CMU | Fall 2017](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL7k0r4t5c10-g7CWCnHfZOAxLaiNinChk)\n\t- [10-701 - Introduction to Machine Learning (PhD) - Tom Mitchell, Spring 2011, CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~tom\u002F10701_sp11\u002Flectures.shtml) ([Fall 2014](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL7y-1rk2cCsDZCVz2xS7LrExqidHpJM3B)) ([Spring 2015 by Alex Smola](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLZSO_6-bSqHTTV7w9u7grTXBHMH-mw3qn))\n\t- [10 - 301\u002F601 - Introduction to Machine Learning - Spring 2020 - CMU](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLpqQKYIU-snAPM89YPPwyQ9xdaiAdoouk)\n\t- [CMS 165 Foundations of Machine Learning and Statistical Inference - 2020 - Caltech](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLVNifWxslHCDlbyitaLLYBOAEPbmF1AHg)\n\t- [Microsoft Research - Machine Learning Course](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL34iyE0uXtxo7vPXGFkmm6KbgZQwjf9Kf)\n\t- [CS 446 - Machine Learning, Spring 2019, UIUC](https:\u002F\u002Fcourses.engr.illinois.edu\u002Fcs446\u002Fsp2019\u002FAGS\u002F_site\u002F)([ Fall 2016 Lectures](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLQcasX5-oG91TgY6A_gz-IW7YSpwdnD2O))\n\t- [undergraduate machine learning at UBC 2012, Nando de Freitas](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLE6Wd9FR--Ecf_5nCbnSQMHqORpiChfJf)\n\t- [CS 229 - Machine Learning - Stanford University](https:\u002F\u002Fsee.stanford.edu\u002FCourse\u002FCS229) ([Autumn 2018](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU))\n\t- [CS 189\u002F289A Introduction to Machine Learning, Prof Jonathan Shewchuk - UCBerkeley](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~jrs\u002F189\u002F)\n\t- [CPSC 340: Machine Learning and Data Mining (2018) - UBC](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLWmXHcz_53Q02ZLeAxigki1JZFfCO6M-b)\n\t- [CS4780\u002F5780 Machine Learning, Fall 2013 - Cornell University](http:\u002F\u002Fwww.cs.cornell.edu\u002Fcourses\u002Fcs4780\u002F2013fa\u002F)\n\t- [CS4780\u002F5780 Machine Learning, Fall 2018 - Cornell University](http:\u002F\u002Fwww.cs.cornell.edu\u002Fcourses\u002Fcs4780\u002F2018fa\u002Fpage18\u002Findex.html) ([Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLl8OlHZGYOQ7bkVbuRthEsaLr7bONzbXS))\n\t- [CSE474\u002F574 Introduction to Machine Learning - SUNY University at Buffalo](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLEQDy5tl3xkMzk_zlo2DPzXteCquHA8bQ)\n\t- [CS 5350\u002F6350 - Machine Learning, Fall 2016, University of Utah](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbuogVdPnkCozRSsdueVwX7CF9N4QWL0B)\n\t- [ECE 5984 Introduction to Machine Learning, Spring 2015 - Virginia Tech](https:\u002F\u002Ffilebox.ece.vt.edu\u002F~s15ece5984\u002F)\n\t- [CSx824\u002FECEx242 Machine Learning, Bert Huang, Fall 2015 - Virginia Tech](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLUenpfvlyoa0rMoE5nXA8kdctBKE9eSob)\n\t- [STA 4273H - Large Scale Machine Learning, Winter 2015 - University of Toronto](http:\u002F\u002Fwww.cs.toronto.edu\u002F~rsalakhu\u002FSTA4273_2015\u002Flectures.html)\n\t- [CS 485\u002F685 Machine Learning, Shai Ben-David, University of Waterloo](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUCR4_akQ1HYMUcDszPQ6jh8Q\u002Fvideos)\n\t- [STAT 441\u002F841 Classification Winter 2017 , Waterloo](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG)\n\t- [10-605 - Machine Learning with Large Datasets, Fall 2016 - CMU](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUCIE4UdPoCJZMAZrTLuq-CPQ\u002Fvideos)\n\t- [Information Theory, Pattern Recognition, and Neural Networks - University of Cambridge](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLruBu5BI5n4aFpG32iMbdWoRVAA-Vcso6)\n\t- [Python and machine learning - Stanford Crowd Course Initiative](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLVxFQjPUB2cnYGZPAGG52OQc9SpWVKjjB)\n\t- [MOOC - Machine Learning Part 1a - Udacity\u002FGeorgia Tech](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo) ([Part 1b](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPlkESDcHD-0oqVx5sAIgz7O) [Part 2](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPmaHhu-Lz3mhLSj-YH-JnG7) [Part 3](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPnidDwo9e2c7ixIsu_pdSNp))\n\t- [Machine Learning and Pattern Recognition 2015\u002F16- University of Edinburgh](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fmlpr.htm)\n\t- [Introductory Applied Machine Learning 2015\u002F16- University of Edinburgh](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fiaml.htm)\n\t- [Pattern Recognition Class (2012)- Universität Heidelberg](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLuRaSnb3n4kRDZVU6wxPzGdx1CN12fn0w)\n\t- [Introduction to Machine Learning and Pattern Recognition - CBCSL OSU](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLcXJymqaE9PPGGtFsTNoDWKl-VNVX5d6b)\n\t- [Introduction to Machine Learning - IIT Kharagpur](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106105152\u002F)\n\t- [Introduction to Machine Learning - IIT Madras](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106139\u002F)\n\t- [Pattern Recognition - IISC Bangalore](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F117108048\u002F)\n\t- [Pattern Recognition and Application - IIT Kharagpur](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F117105101\u002F)\n\t- [Pattern Recognition - IIT Madras](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106046\u002F)\n\t- [Machine Learning Summer School 2013 - Max Planck Institute for Intelligent Systems Tübingen](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqJm7Rc5-EXFv6RXaPZzzlzo93Hl0v91E)\n\t- [Machine Learning - Professor Kogan (Spring 2016) - Rutgers](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLauepKFT6DK_1_plY78bXMDj-bshv7UsQ)\n\t- [CS273a: Introduction to Machine Learning](http:\u002F\u002Fsli.ics.uci.edu\u002FClasses\u002F2015W-273a) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkWzaBlA7utJMRi89i9FAKMopL0h0LBMk))\n\t- [Machine Learning Crash Course 2015](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLyGKBDfnk-iD5dK8N7UBUFVVDBBtznenR)\n\t- [COM4509\u002FCOM6509 Machine Learning and Adaptive Intelligence 2015-16](http:\u002F\u002Finverseprobability.com\u002Fmlai2015\u002F)\n\t- [10715 Advanced Introduction to Machine Learning](https:\u002F\u002Fsites.google.com\u002Fsite\u002F10715advancedmlintro2017f\u002Flectures)\n\t- [Introduction to Machine Learning - Spring 2018 - ETH Zurich](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLzn6LN6WhlN273tsqyfdrBUsA-o5nUESV)\n\t- [Machine Learning - Pedro Domingos- University of Washington](https:\u002F\u002Fwww.youtube.com\u002Fuser\u002FUWCSE\u002Fplaylists?view=50&sort=dd&shelf_id=16)\n\t- [Advanced Machine Learning - 2019 - ETH Zürich](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLY-OA_xnxFwSe98pzMGVR4bjAZZYrNT7L)\n\t- [Machine Learning (COMP09012)](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLyH-5mHPFffFwz7Twap0XuVeUJ8vuco9t)\n\t- [Probabilistic Machine Learning 2020 - University of Tübingen](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL05umP7R6ij1tHaOFY96m5uX3J21a6yNd)\n\t- [Statistical Machine Learning 2020 - Ulrike von Luxburg - University of Tübingen](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL05umP7R6ij2XCvrRzLokX6EoHWaGA2cC)\n\t- [COMS W4995 - Applied Machine Learning - Spring 2020 - Columbia University](https:\u002F\u002Fwww.cs.columbia.edu\u002F~amueller\u002Fcomsw4995s20\u002Fschedule\u002F)\n\t\n- **Data Mining**\n\n\t- [CSEP 546, Data Mining - Pedro Domingos, Sp 2016 - University of Washington](https:\u002F\u002Fcourses.cs.washington.edu\u002Fcourses\u002Fcsep546\u002F16sp\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLTPQEx-31JXgtDaC6-3HxWcp7fq4N8YGr))\n\t- [CS 5140\u002F6140 - Data Mining, Spring 2016, University of Utah](https:\u002F\u002Fwww.cs.utah.edu\u002F~jeffp\u002Fteaching\u002Fcs5140.html) ([Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbuogVdPnkCpXfb43Wvc7s5fXWzedwTPB))\n\t- [CS 5955\u002F6955 - Data Mining, University of Utah](http:\u002F\u002Fwww.cs.utah.edu\u002F~jeffp\u002Fteaching\u002Fcs5955.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUCcrlwW88yMcXujhGjSP2WBg\u002Fvideos))\n\t- [Statistics 202 - Statistical Aspects of Data Mining, Summer 2007 - Google](http:\u002F\u002Fwww.stats202.com\u002Foriginal_index.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLFE776F2C513A744E))\n\t- [MOOC - Text Mining and Analytics by ChengXiang Zhai](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLLssT5z_DsK8Xwnh_0bjN4KNT81bekvtt)\n\t- [Information Retrieval SS 2014, iTunes - HPI](https:\u002F\u002Fitunes.apple.com\u002Fus\u002Fitunes-u\u002Finformation-retrieval-ss-2014\u002Fid874200291)\n\t- [MOOC - Data Mining with Weka](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLm4W7_iX_v4NqPUjceOGd-OKNVO4c_cPD)\n\t- [CS 290 DataMining Lectures](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLB4CCA346A5741C4C)\n\t- [CS246 - Mining Massive Data Sets, Winter 2016, Stanford University](https:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fcs246\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUC_Oao2FYkLAUlUVkBfze4jg\u002Fvideos))\n\t- [Data Mining: Learning From Large Datasets - Fall 2017 - ETH Zurich](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLY-OA_xnxFwRHZO6L6yT253VPgrZazQs6)\n\t- [Information Retrieval - Spring 2018 - ETH Zurich](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLzn6LN6WhlN1ktkDvNurPSDwTQ_oGQisn)\n\t- [CAP6673 - Data Mining and Machine Learning - FAU](http:\u002F\u002Fwww.cse.fau.edu\u002F~taghi\u002Fclasses\u002Fcap6673\u002F)([Video lectures](https:\u002F\u002Fvimeo.com\u002Falbum\u002F1505953))\n\t- [Data Warehousing and Data Mining Techniques - Technische Universität Braunschweig, Germany](http:\u002F\u002Fwww.ifis.cs.tu-bs.de\u002Fteaching\u002Fws-1617\u002Fdwh)\n- **Data Science**\n\t- [Data 8: The Foundations of Data Science - UC Berkeley](http:\u002F\u002Fdata8.org\u002F) ([Summer 17](http:\u002F\u002Fdata8.org\u002Fsu17\u002F))\n\t- [CSE519 - Data Science Fall 2016 - Skiena, SBU](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLOtl7M3yp-DVBdLYatrltDJr56AKZ1qXo)\n\t- [CS 109 Data Science, Harvard University](http:\u002F\u002Fcs109.github.io\u002F2015\u002Fpages\u002Fvideos.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLb4G5axmLqiuneCqlJD2bYFkBwHuOzKus))\n\t- [6.0002 Introduction to Computational Thinking and Data Science - MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Felectrical-engineering-and-computer-science\u002F6-0002-introduction-to-computational-thinking-and-data-science-fall-2016\u002Flecture-videos\u002F)\n\t- [Data 100 - Summer 19- UC Berkeley](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLPHXc20GewP8J56CisONS_mFZWZAfa7jR)\n\t- [Distributed Data Analytics (WT 2017\u002F18) - HPI University of Potsdam](https:\u002F\u002Fwww.tele-task.de\u002Fseries\u002F1179\u002F)\n\t- [Statistics 133 - Concepts in Computing with Data, Fall 2013 - UC Berkeley](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL-XXv-cvA_iDsSPnMJlnhIyADGUmikoIO)\n\t- [Data Profiling and Data Cleansing (WS 2014\u002F15) - HPI University of Potsdam](https:\u002F\u002Fwww.tele-task.de\u002Fseries\u002F1027\u002F)\n\t- [AM 207 - Stochastic Methods for Data Analysis, Inference and Optimization, Harvard University](http:\u002F\u002Fam207.github.io\u002F2016\u002Findex.html)\n\t- [CS 229r - Algorithms for Big Data, Harvard University](http:\u002F\u002Fpeople.seas.harvard.edu\u002F~minilek\u002Fcs229r\u002Ffall15\u002Flec.html) ([Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL2SOU6wwxB0v1kQTpqpuu5kEJo2i-iUyf))\n\t- [Algorithms for Big Data - IIT Madras](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106142\u002F)\n- **Probabilistic Graphical Modeling**\n\t- [MOOC - Probabilistic Graphical Models - Coursera](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLvfF4UFg6Ejj6SX-ffw-O4--SPbB9P7eP)\n\t- [CS 6190 - Probabilistic Modeling, Spring 2016, University of Utah](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbuogVdPnkCpvxdF-Gy3gwaBObx7AnQut)\n\t- [10-708 - Probabilistic Graphical Models, Carnegie Mellon University](https:\u002F\u002Fwww.cs.cmu.edu\u002F~epxing\u002FClass\u002F10708-20\u002Flectures.html)\n\t- [Probabilistic Graphical Models, Daphne Koller, Stanford University](http:\u002F\u002Fopenclassroom.stanford.edu\u002FMainFolder\u002FCoursePage.php?course=ProbabilisticGraphicalModels)\n\t- [Probabilistic Models - UNIVERSITY OF HELSINKI](https:\u002F\u002Fwww.cs.helsinki.fi\u002Fen\u002Fcourses\u002F582636\u002F2015\u002FK\u002FK\u002F1)\n\t- [Probabilistic Modelling and Reasoning 2015\u002F16- University of Edinburgh](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fpmr.htm)\n\t- [Probabilistic Graphical Models, Spring 2018 - Notre Dame](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLd-PuDzW85AcV4bgdu7wHPL37hm60W4RM)\n- **Deep Learning**\n\t- [6.S191: Introduction to Deep Learning - MIT](http:\u002F\u002Fintrotodeeplearning.com\u002F)\n\t- [Deep Learning CMU](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUC8hYZGEkI2dDO8scT8C5UQA\u002Fvideos)\n\t- [Part 1: Practical Deep Learning for Coders, v3 - fast.ai](https:\u002F\u002Fcourse.fast.ai\u002F)\n\t- [Part 2: Deep Learning from the Foundations - fast.ai](https:\u002F\u002Fcourse.fast.ai\u002Fpart2)\n\t- [Deep learning at Oxford 2015 - Nando de Freitas](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu)\n\t- [6.S094: Deep Learning for Self-Driving Cars - MIT](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLrAXtmErZgOeiKm4sgNOknGvNjby9efdf)\n\t- [CS294-129 Designing, Visualizing and Understanding Deep Neural Networks](https:\u002F\u002Fbcourses.berkeley.edu\u002Fcourses\u002F1453965\u002Fpages\u002Fcs294-129-designing-visualizing-and-understanding-deep-neural-networks) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIxopmdJF_CLNqG3QuDFHQUm))\n\t- [CS230: Deep Learning - Autumn 2018 - Stanford University](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rOABXSygHTsbvUz4G_YQhOb)\n\t- [STAT-157 Deep Learning 2019 - UC Berkeley ](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLZSO_6-bSqHQHBCoGaObUljoXAyyqhpFW)\n\t- [Full Stack DL Bootcamp 2019 - UC Berkeley](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL_Ig1a5kxu5645uORPL8xyvHr91Lg8G1l)\n\t- [Deep Learning, Stanford University](http:\u002F\u002Fopenclassroom.stanford.edu\u002FMainFolder\u002FCoursePage.php?course=DeepLearning)\n\t- [MOOC - Neural Networks for Machine Learning, Geoffrey Hinton 2016 - Coursera](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoRl3Ht4JOcdU872GhiYWf6jwrk_SNhz9)\n\t- [Deep Unsupervised Learning -- Berkeley Spring 2020](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLwRJQ4m4UJjPiJP3691u-qWwPGVKzSlNP)\n\t- [Stat 946 Deep Learning - University of Waterloo](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLehuLRPyt1Hyi78UOkMPWCGRxGcA9NVOE)\n\t- [Neural networks class - Université de Sherbrooke](http:\u002F\u002Finfo.usherbrooke.ca\u002Fhlarochelle\u002Fneural_networks\u002Fcontent.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH))\n\t- [CS294-158 Deep Unsupervised Learning SP19](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUCf4SX8kAZM_oGcZjMREsU9w\u002Fvideos)\n\t- [DLCV - Deep Learning for Computer Vision - UPC Barcelona](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL-5eMc3HQTBavDoZpFcX-bff5WgQqSLzR)\n\t- [DLAI - Deep Learning for Artificial Intelligence @ UPC Barcelona](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL-5eMc3HQTBagIUjKefjcTbnXC0wXC_vd)\n\t- [Neural Networks and Applications - IIT Kharagpur](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F117105084\u002F)\n\t- [UVA DEEP LEARNING COURSE](http:\u002F\u002Fuvadlc.github.io\u002F#lecture)\n\t- [Nvidia Machine Learning Class](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLTIkHmXc-7an8xbwhAJX-LQ4D4Uf-ar5I)\n\t- [Deep Learning - Winter 2020-21 - Tübingen Machine Learning](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL05umP7R6ij3NTWIdtMbfvX7Z-4WEXRqD)\n- **Reinforcement Learning**\n\t- [CS234: Reinforcement Learning - Winter 2019 - Stanford University](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u)\n\t- [Introduction to reinforcement learning - UCL](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ)\n\t- [Advanced Deep Learning & Reinforcement Learning - UCL](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqYmG7hTraZDNJre23vqCGIVpfZ_K2RZs)\n\t- [Reinforcement Learning - IIT Madras](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLyqSpQzTE6M_FwzHFAyf4LSkz_IjMyjD9)\n\t- [CS885 Reinforcement Learning - Spring 2018 - University of Waterloo](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLdAoL1zKcqTXFJniO3Tqqn6xMBBL07EDc)\n\t- [CS 285 - Deep Reinforcement Learning- UC Berkeley](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A)\n\t- [CS 294 112 - Reinforcement Learning](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37)\n\t- [NUS CS 6101 - Deep Reinforcement Learning](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLllwxvcS7ca5wOmRLKm6ri-OaC0INYehv)\n\t- [ECE 8851: Reinforcement Learning](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL_Nk3YvgORJs1tCLQnlnSRsOJArj_cP9u)\n\t- [CS294-112, Deep Reinforcement Learning Sp17](http:\u002F\u002Frll.berkeley.edu\u002Fdeeprlcourse\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX))\n\t- [UCL Course 2015 on Reinforcement Learning by David Silver from DeepMind](http:\u002F\u002Fwww0.cs.ucl.ac.uk\u002Fstaff\u002Fd.silver\u002Fweb\u002FTeaching.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=2pWv7GOvuf0))\n\t- [Deep RL Bootcamp - Berkeley Aug 2017](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdeep-rl-bootcamp\u002Flectures)\n\t- [Reinforcement Learning - IIT Madras](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLyqSpQzTE6M_FwzHFAyf4LSkz_IjMyjD9)\n- **Advanced Machine Learning**\n\t- [Machine Learning 2013 - Nando de Freitas, UBC](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLE6Wd9FR--EdyJ5lbFl8UuGjecvVw66F6)\n\t- [Machine Learning, 2014-2015, University of Oxford](https:\u002F\u002Fwww.cs.ox.ac.uk\u002Fpeople\u002Fnando.defreitas\u002Fmachinelearning\u002F)\n\t- [10-702\u002F36-702 - Statistical Machine Learning - Larry Wasserman, Spring 2016, CMU](https:\u002F\u002Fwww.stat.cmu.edu\u002F~ryantibs\u002Fstatml\u002F) ([Spring 2015](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLjbUi5mgii6BWEUZf7He6nowWvGne_Y8r))\n\t- [10-715 Advanced Introduction to Machine Learning - CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~bapoczos\u002FClasses\u002FML10715_2015Fall\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL4DwY1suLMkcu-wytRDbvBNmx57CdQ2pJ))\n\t- [CS 281B - Scalable Machine Learning, Alex Smola, UC Berkeley](http:\u002F\u002Falex.smola.org\u002Fteaching\u002Fberkeley2012\u002Fsyllabus.html)\n\t- [18.409 Algorithmic Aspects of Machine Learning Spring 2015 - MIT](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLB3sDpSRdrOvI1hYXNsa6Lety7K8FhPpx)\n\t- [CS 330 - Deep Multi-Task and Meta Learning - Fall 2019 - Stanford University](https:\u002F\u002Fcs330.stanford.edu\u002F) ([Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rMC6zfYmnD7UG3LVvwaITY5))\n- **ML based Natural Language Processing and Computer Vision**\n\t- [CS 224d - Deep Learning for Natural Language Processing, Stanford University](http:\u002F\u002Fcs224d.stanford.edu\u002Fsyllabus.html) ([Lectures - Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLCJlDcMjVoEdtem5GaohTC1o9HTTFtK7_))\n\t- [CS 224N - Natural Language Processing, Stanford University](http:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fcs224n\u002F) ([Lecture videos](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgtM85Maly3n2Fp1gJVvqb0bTC39CPn1N))\n\t- [CS 124 - From Languages to Information - Stanford University](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUC_48v322owNVtORXuMeRmpA\u002Fplaylists?view=50&sort=dd&shelf_id=2)\n\t- [MOOC - Natural Language Processing, Dan Jurafsky & Chris Manning - Coursera](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL6397E4B26D00A269)\n\t- [fast.ai Code-First Intro to Natural Language Processing](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9) ([Github](https:\u002F\u002Fgithub.com\u002Ffastai\u002Fcourse-nlp))\n\t- [MOOC - Natural Language Processing - Coursera, University of Michigan](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLLssT5z_DsK8BdawOVCCaTCO99Ya58ryR)\n\t- [CS 231n - Convolutional Neural Networks for Visual Recognition, Stanford University](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv)\n\t- [CS224U: Natural Language Understanding - Spring 2019 - Stanford University](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rObpMCir6rNNUlFAn56Js20)\n\t- [Deep Learning for Natural Language Processing, 2017 - Oxford University](https:\u002F\u002Fgithub.com\u002Foxford-cs-deepnlp-2017\u002Flectures)\n\t- [Machine Learning for Robotics and Computer Vision, WS 2013\u002F2014 - TU München](https:\u002F\u002Fvision.in.tum.de\u002Fteaching\u002Fws2013\u002Fml_ws13) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLTBdjV_4f-EIiongKlS9OKrBEp8QR47Wl))\n\t- [Informatics 1 - Cognitive Science 2015\u002F16- University of Edinburgh](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Finf1cs.htm)\n\t- [Informatics 2A - Processing Formal and Natural Languages 2016-17 - University of Edinburgh](http:\u002F\u002Fwww.inf.ed.ac.uk\u002Fteaching\u002Fcourses\u002Finf2a\u002Fschedule.html)\n\t- [Computational Cognitive Science 2015\u002F16- University of Edinburgh](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fccs.htm)\n\t- [Accelerated Natural Language Processing 2015\u002F16- University of Edinburgh](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fanlp.htm)\n\t- [Natural Language Processing - IIT Bombay](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106101007\u002F)\n\t- [NOC:Deep Learning For Visual Computing - IIT Kharagpur](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F108\u002F105\u002F108105103\u002F)\n\t- [CS 11-747 - Neural Nets for NLP - 2019 - CMU](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL8PYTP1V4I8Ajj7sY6sdtmjgkt7eo2VMs)\n\t- [Natural Language Processing - Michael Collins - Columbia University](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLA212ij5XG8OTDRl8IWFiJgHR9Ve2k9pv)\n\t- [Deep Learning for Computer Vision - University of Michigan](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r)\n\t- [CMU CS11-737 - Multilingual Natural Language Processing](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL8PYTP1V4I8CHhppU6n1Q9-04m96D9gt5)\n- **Time Series Analysis**\n\t- [02417 Time Series Analysis](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLtiTxpFJ4k6TZ0g496fVcQpt_-XJRNkbi)\n\t- [Applied Time Series Analysis](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLl0FT6O_WWDBm-4W-eoK34omYmEMseQDX)\n- **Misc Machine Learning Topics**\n\t- [EE364a: Convex Optimization I - Stanford University](http:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fee364a\u002Fvideos.html)\n\t- [CS 6955 - Clustering, Spring 2015, University of Utah](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbuogVdPnkCpRvi-qSMCdOwyn4UYoPxTI)\n\t- [Info 290 - Analyzing Big Data with Twitter, UC Berkeley school of information](http:\u002F\u002Fblogs.ischool.berkeley.edu\u002Fi290-abdt-s12\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLE8C1256A28C1487F))\n\t- [10-725 Convex Optimization, Spring 2015 - CMU](http:\u002F\u002Fwww.stat.cmu.edu\u002F~ryantibs\u002Fconvexopt-S15\u002F)\n\t- [10-725 Convex Optimization: Fall 2016 - CMU](http:\u002F\u002Fwww.stat.cmu.edu\u002F~ryantibs\u002Fconvexopt\u002F)\n\t- [CAM 383M - Statistical and Discrete Methods for Scientific Computing, University of Texas](http:\u002F\u002Fgranite.ices.utexas.edu\u002Fcoursewiki\u002Findex.php\u002FMain_Page)\n\t- [9.520 - Statistical Learning Theory and Applications, Fall 2015 - MIT](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLyGKBDfnk-iDj3FBd0Avr_dLbrU8VG73O)\n\t- [Reinforcement Learning - UCL](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLacBNHqv7n9gp9cBMrA6oDbzz_8JqhSKo)\n\t- [Regularization Methods for Machine Learning 2016](http:\u002F\u002Facademictorrents.com\u002Fdetails\u002F493251615310f9b6ae1f483126292378137074cd) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbF0BXX_6CPJ20Gf_KbLFnPWjFTvvRwCO))\n\t- [Statistical Inference in Big Data - University of Toronto](http:\u002F\u002Ffields2015bigdata2inference.weebly.com\u002Fmaterials.html)\n\t- [10-725 Optimization Fall 2012 - CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~ggordon\u002F10725-F12\u002Fschedule.html)\n\t- [10-801 Advanced Optimization and Randomized Methods - CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~suvrit\u002Fteach\u002Faopt.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLjTcdlvIS6cjdA8WVXNIk56X_SjICxt0d))\n\t- [Reinforcement Learning 2015\u002F16- University of Edinburgh](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Frl.htm)\n\t- [Reinforcement Learning - IIT Madras](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106143\u002F)\n\t- [Statistical Rethinking Winter 2015 - Richard McElreath](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLDcUM9US4XdMdZOhJWJJD4mDBMnbTWw_z)\n\t- [Music Information Retrieval - University of Victoria, 2014](http:\u002F\u002Fmarsyas.cs.uvic.ca\u002FmirBook\u002Fcourse\u002F)\n\t- [PURDUE Machine Learning Summer School 2011](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL2A65507F7D725EFB)\n\t- [Foundations of Machine Learning - Blmmoberg Edu](https:\u002F\u002Fbloomberg.github.io\u002Ffoml\u002F#home)\n\t- [Introduction to reinforcement learning - UCL](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ)\n\t- [Advanced Deep Learning & Reinforcement Learning - UCL](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqYmG7hTraZDNJre23vqCGIVpfZ_K2RZs)\n\t- [Web Information Retrieval (Proff. L. Becchetti - A. Vitaletti)](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAQopGWlIcya-9yzQ8c8UtPOuCv0mFZkr)\n\t- [Big Data Systems (WT 2019\u002F20) - Prof. Dr. Tilmann Rabl - HPI](https:\u002F\u002Fwww.tele-task.de\u002Fseries\u002F1286\u002F)\n\t- [Distributed Data Analytics (WT 2017\u002F18) - Dr. Thorsten Papenbrock - HPI](https:\u002F\u002Fwww.tele-task.de\u002Fseries\u002F1179\u002F)\n\n- **Probability & Statistics**\n\n\t- [6.041 Probabilistic Systems Analysis and Applied Probability - MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Felectrical-engineering-and-computer-science\u002F6-041sc-probabilistic-systems-analysis-and-applied-probability-fall-2013\u002F)\n\t- [Statistics 110 - Probability - Harvard University](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL2SOU6wwxB0uwwH80KTQ6ht66KWxbzTIo)\n\t- [STAT 2.1x: Descriptive Statistics | UC Berkeley](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL_Ig1a5kxu56TfFnGlRlH2YpOBWGiYsQD)\n\t- [STAT 2.2x: Probability | UC Berkeley](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL_Ig1a5kxu57qPZnHm-ie-D7vs9g7U-Cl)\n\t- [MOOC - Statistics: Making Sense of Data, Coursera](http:\u002F\u002Facademictorrents.com\u002Fdetails\u002Fa0cbaf3e03e0893085b6fbdc97cb6220896dddf2)\n\t- [MOOC - Statistics One - Coursera](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLycnP7USbo1V3jlyjAzWUB201cLxPq4NP)\n\t- [Probability and Random Processes - IIT Kharagpur](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F117105085\u002F)\n\t- [MOOC - Statistical Inference - Coursera](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgIPpm6tJZoSvrYM54BUqJJ4CWrYeGO40)\n\t- [131B - Introduction to Probability and Statistics, UCI](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqOZ6FD_RQ7k-j-86QUC2_0nEu0QOP-Wy)\n\t- [STATS 250 - Introduction to Statistics and Data Analysis, UMichigan](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL432AB57AF9F43D4F)\n\t- [Sets, Counting and Probability - Harvard](http:\u002F\u002Fmatterhorn.dce.harvard.edu\u002Fengage\u002Fui\u002Findex.html#\u002F1999\u002F01\u002F82347)\n\t- [Opinionated Lessons in Statistics](http:\u002F\u002Fwww.opinionatedlessons.org\u002F) ([Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLUAHeOPjkJseXJKbuk9-hlOfZU9Wd6pS0))\n\t- [Statistics - Brandon Foltz](https:\u002F\u002Fwww.youtube.com\u002Fuser\u002FBCFoltz\u002Fplaylists)\n\t- [Statistical Rethinking: A Bayesian Course Using R and Stan](https:\u002F\u002Fgithub.com\u002Frmcelreath\u002Fstatrethinking_winter2019) ([Lectures - Aalto University](https:\u002F\u002Faalto.cloud.panopto.eu\u002FPanopto\u002FPages\u002FSessions\u002FList.aspx#folderID=%22f0ec3a25-9e23-4935-873b-a9f401646812%22)) ([Book](http:\u002F\u002Fwww.stat.columbia.edu\u002F~gelman\u002Fbook\u002F))\n\t - [02402 Introduction to Statistics E12 - Technical University of Denmark](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLMn2aW3wpAtPC8tZHQy6nwWsFG7P6sPqw) ([F17](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgowegO9Se58_BnUNnaARajEE_bX-GJEz))\n- **Linear Algebra**\n\t- [18.06 - Linear Algebra, Prof. Gilbert Strang, MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Fmathematics\u002F18-06sc-linear-algebra-fall-2011\u002F)\n\t- [18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning - MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Fmathematics\u002F18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018\u002Fvideo-lectures\u002F)\n\t- [Linear Algebra (Princeton University)](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLGqzsq0erqU7w7ZrTZ-pWWk4-AOkiGEGp)\n\t- [MOOC: Coding the Matrix: Linear Algebra through Computer Science Applications - Coursera](http:\u002F\u002Facademictorrents.com\u002Fdetails\u002F54cd86f3038dfd446b037891406ba4e0b1200d5a)\n\t- [CS 053 - Coding the Matrix - Brown University](http:\u002F\u002Fcs.brown.edu\u002Fcourses\u002Fcs053\u002Fcurrent\u002Flectures.htm) ([Fall 14 videos](https:\u002F\u002Fcs.brown.edu\u002Fvideo\u002Fchannels\u002Fcoding-matrix-fall-2014\u002F))\n\t- [Linear Algebra Review - CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~zkolter\u002Fcourse\u002Flinalg\u002Foutline.html)\n\t- [A first course in Linear Algebra - N J Wildberger - UNSW](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL44B6B54CBF6A72DF)\n\t- [INTRODUCTION TO MATRIX ALGEBRA](http:\u002F\u002Fma.mathforcollege.com\u002Fyoutube\u002Findex.html)\n\t- [Computational Linear Algebra - fast.ai](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLtmWHNX-gukIc92m1K0P6bIOnZb-mg0hY) ([Github](https:\u002F\u002Fgithub.com\u002Ffastai\u002Fnumerical-linear-algebra))\n- [10-600 Math Background for ML - CMU](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL7y-1rk2cCsA339crwXMWUaBRuLBvPBCg)\n- [MIT 18.065 Matrix Methods in Data Analysis, Signal Processing, and Machine Learning](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Fmathematics\u002F18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018\u002Fvideo-lectures\u002F)\n- [36-705 - Intermediate Statistics - Larry Wasserman, CMU](http:\u002F\u002Fwww.stat.cmu.edu\u002F~larry\u002F=stat705\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLcW8xNfZoh7eI7KSWneVWq-7wr8ffRtHF))\n- [Combinatorics - IISC Bangalore](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106108051\u002F)\n- [Advanced Engineering Mathematics - Notre Dame](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLd-PuDzW85Ae4pzlylMLzq_a-RHPx8ryA)\n- [Statistical Computing for Scientists and Engineers - Notre Dame](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLd-PuDzW85AeltIRcjDY7Z4q49NEAuMcA)\n- [Statistical Computing, Fall 2017 - Notre Dame](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLd-PuDzW85AcSgNGnT5TUHt85SrCljT3V)\n- [Mathematics for Machine Learning, Lectures by Ulrike von Luxburg - Tübingen Machine Learning](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL05umP7R6ij1a6KdEy8PVE9zoCv6SlHRS)\n\n\n-------------------------\n\n### Robotics\n\n- [CS 223A - Introduction to Robotics, Stanford University](https:\u002F\u002Fsee.stanford.edu\u002FCourse\u002FCS223A)\n- [6.832 Underactuated Robotics - MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Felectrical-engineering-and-computer-science\u002F6-832-underactuated-robotics-spring-2009\u002F)\n- [CS287 Advanced Robotics at UC Berkeley Fall 2019 -- Instructor: Pieter Abbeel](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLwRJQ4m4UJjNBPJdt8WamRAt4XKc639wF)\n- [CS 287 - Advanced Robotics, Fall 2011, UC Berkeley](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~pabbeel\u002Fcs287-fa11\u002F) ([Videos](http:\u002F\u002Frll.berkeley.edu\u002Fcs287\u002Flecture_videos\u002F))\n- [CS235 - Applied Robot Design for Non-Robot-Designers - Stanford University](https:\u002F\u002Fwww.youtube.com\u002Fuser\u002FStanfordCS235\u002Fvideos)\n- [Lecture: Visual Navigation for Flying Robots](https:\u002F\u002Fvision.in.tum.de\u002Fteaching\u002Fss2012\u002Fvisnav2012) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLTBdjV_4f-EKeki5ps2WHqJqyQvxls4ha))\n- [CS 205A: Mathematical Methods for Robotics, Vision, and Graphics (Fall 2013)](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLQ3UicqQtfNvQ_VzflHYKhAqZiTxOkSwi)\n- [Robotics 1, Prof. De Luca, Università di Roma](http:\u002F\u002Fwww.dis.uniroma1.it\u002F~deluca\u002Frob1_en\u002Fmaterial_rob1_en_2014-15.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAQopGWlIcyaqDBW1zSKx7lHfVcOmWSWt))\n- [Robotics 2, Prof. De Luca, Università di Roma](http:\u002F\u002Fwww.diag.uniroma1.it\u002F~deluca\u002Frob2_en\u002Fmaterial_rob2_en.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAQopGWlIcya6LnIF83QlJTqvpYmJXnDm))\n- [Robot Mechanics and Control, SNU](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkjy3Accn-E7mlbuSF4aajcMMckG4wLvW)\n- [Introduction to Robotics Course - UNCC](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL4847E1D1C121292F)\n- [SLAM Lectures](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLpUPoM7Rgzi_7YWn14Va2FODh7LzADBSm)\n- [Introduction to Vision and Robotics 2015\u002F16- University of Edinburgh](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fivr.htm)\n- [ME 597 – Autonomous Mobile Robotics – Fall 2014](http:\u002F\u002Fwavelab.uwaterloo.ca\u002Findex6ea9.html?page_id=267)\n- [ME 780 – Perception For Autonomous Driving – Spring 2017](http:\u002F\u002Fwavelab.uwaterloo.ca\u002Findexaef8.html?page_id=481)\n- [ME780 – Nonlinear State Estimation for Robotics and Computer Vision – Spring 2017](http:\u002F\u002Fwavelab.uwaterloo.ca\u002Findexe9a5.html?page_id=533)\n- [METR 4202\u002F7202 -- Robotics & Automation - University of Queensland](http:\u002F\u002Frobotics.itee.uq.edu.au\u002F~metr4202\u002Flectures.html)\n- [Robotics - IIT Bombay](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F112101099\u002F)\n- [Introduction to Machine Vision](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL1pxneANaikCO1-Z0XTaljLR3SE8tgRXY)\n- [6.834J Cognitive Robotics - MIT OCW ](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Faeronautics-and-astronautics\u002F16-412j-cognitive-robotics-spring-2016\u002F)\n- [Hello (Real) World with ROS – Robot Operating System - TU Delft](https:\u002F\u002Focw.tudelft.nl\u002Fcourses\u002Fhello-real-world-ros-robot-operating-system\u002F)\n- [Programming for Robotics (ROS) - ETH Zurich](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLE-BQwvVGf8HOvwXPgtDfWoxd4Cc6ghiP)\n- [Mechatronic System Design - TU Delft](https:\u002F\u002Focw.tudelft.nl\u002Fcourses\u002Fmechatronic-system-design\u002F)\n- [CS 206 Evolutionary Robotics Course Spring 2020](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAuiGdPEdw0inlKisMbjDypCbvcb_GBN9)\n- [Foundations of Robotics - UTEC 2018-I](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoWGuY2dW-Acmc8V5NYSAXBxADMm1rE4p)\n- [Robotics - Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL_onPhFCkVQhuPiUxUW2lFHB39QsavEEA)\n- [Robotics and Control: Theory and Practice IIT Roorkee](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLLy_2iUCG87AjAXKbNMiKJZ2T9vvGpMB0)\n- [Mechatronics](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLtuwVtW88fOeTFS_szBWif0Mcc0lfNWaz)\n- [ME142 - Mechatronics Spring 2020 - UC Merced](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL-euleXgwWUNQ80DGq6qopHBmHcQyEyNQ)\n- [Mobile Sensing and Robotics - Bonn University](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgnQpQtFTOGQJXx-x0t23RmRbjp_yMb4v)\n- [MSR2 - Sensors and State Estimation Course (2020) - Bonn University](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgnQpQtFTOGQh_J16IMwDlji18SWQ2PZ6)\n- [SLAM Course (2013) - Bonn University](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgnQpQtFTOGQrZ4O5QzbIHgl3b1JHimN_)\n- [ENGR486 Robot Modeling and Control (2014W)](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLJzZfbLAMTelwaLxFXteeblbY2ytU2AxX)\n- [Robotics by Prof. D K Pratihar - IIT Kharagpur](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbRMhDVUMngcdUbBySzyzcPiFTYWr4rV_)\n- [Introduction to Mobile Robotics - SS 2019 - Universität Freiburg](http:\u002F\u002Fais.informatik.uni-freiburg.de\u002Fteaching\u002Fss19\u002Frobotics\u002F)\n- [Robot Mapping - WS 2018\u002F19 - Universität Freiburg](http:\u002F\u002Fais.informatik.uni-freiburg.de\u002Fteaching\u002Fws18\u002Fmapping\u002F)\n- [Mechanism and Robot Kinematics - IIT Kharagpur](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F112\u002F105\u002F112105236\u002F)\n- [Self-Driving Cars - Cyrill Stachniss - Winter 2020\u002F21 - University of Bonn) ](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgnQpQtFTOGQo2Z_ogbonywTg8jxCI9pD)\n- [Mobile Sensing and Robotics 1 – Part Stachniss (Jointly taught with PhoRS) - University of Bonn](https:\u002F\u002Fwww.ipb.uni-bonn.de\u002Fmsr1-2020\u002F)\n- [Mobile Sensing and Robotics 2 – Stachniss & Klingbeil\u002FHolst - University of Bonn](https:\u002F\u002Fwww.ipb.uni-bonn.de\u002Fmsr2-2020\u002F)\n\n\n----------------------------------\n\n## 500 + 𝗔𝗿𝘁𝗶𝗳𝗶𝗰𝗶𝗮𝗹 𝗜𝗻𝘁𝗲𝗹𝗹𝗶𝗴𝗲𝗻𝗰𝗲 𝗣𝗿𝗼𝗷𝗲𝗰𝘁 𝗟𝗶𝘀𝘁 𝘄𝗶𝘁𝗵 𝗰𝗼𝗱𝗲\n\n*500 AI Machine learning Deep learning Computer vision NLP Projects with code*\n\n***This list is continuously updated.*** - You can take pull request and contribute.\n\n| Sr No | Name                                                                  | Link                                |\n| ----- | --------------------------------------------------------------------- | ----------------------------------- |\n| 1     | 180 Machine learning Project                                          | [is.gd\u002FMLtyGk](http:\u002F\u002Fis.gd\u002FMLtyGk) |\n| 2     | 12 Machine learning Object Detection                                  | [is.gd\u002FjZMP1A](http:\u002F\u002Fis.gd\u002FjZMP1A) |\n| 3     | 20 NLP Project with Python                                            | [is.gd\u002FjcMvjB](http:\u002F\u002Fis.gd\u002FjcMvjB) |\n| 4     | 10 Machine Learning Projects on Time Series Forecasting               | [is.gd\u002FdOR66m](http:\u002F\u002Fis.gd\u002FdOR66m) |\n| 5     | 20 Deep Learning Projects Solved and Explained with Python            | [is.gd\u002F8Cv5EP](http:\u002F\u002Fis.gd\u002F8Cv5EP) |\n| 6     | 20 Machine learning Project                                           | [is.gd\u002FLZTF0J](http:\u002F\u002Fis.gd\u002FLZTF0J) |\n| 7     | 30 Python Project Solved and Explained                                | [is.gd\u002FxhT36v](http:\u002F\u002Fis.gd\u002FxhT36v) |\n| 8     | Machine learning Course for Free                                      | https:\u002F\u002Flnkd.in\u002FekCY8xw             |\n| 9     | 5 Web Scraping Projects with Python                                   | [is.gd\u002F6XOTSn](http:\u002F\u002Fis.gd\u002F6XOTSn) |\n| 10    | 20 Machine Learning Projects on Future Prediction with Python         | [is.gd\u002FxDKDkl](http:\u002F\u002Fis.gd\u002FxDKDkl) |\n| 11    | 4 Chatbot Project With Python                                         | [is.gd\u002FLyZfXv](http:\u002F\u002Fis.gd\u002FLyZfXv) |\n| 12    | 7 Python Gui project                                                  | [is.gd\u002F0KPBvP](http:\u002F\u002Fis.gd\u002F0KPBvP) |\n| 13    | All Unsupervised learning Projects                                    | [is.gd\u002Fcz11Kv](http:\u002F\u002Fis.gd\u002Fcz11Kv) |\n| 14    | 10 Machine learning Projects for Regression Analysis                  | [is.gd\u002Fk8faV1](http:\u002F\u002Fis.gd\u002Fk8faV1) |\n| 15    | 10 Machine learning Project for Classification with Python            | [is.gd\u002FBJQjMN](http:\u002F\u002Fis.gd\u002FBJQjMN) |\n| 16    | 6 Sentimental Analysis Projects with python                           | [is.gd\u002FWeiE5p](http:\u002F\u002Fis.gd\u002FWeiE5p) |\n| 17    | 4 Recommendations Projects with Python                                | [is.gd\u002FpPHAP8](http:\u002F\u002Fis.gd\u002FpPHAP8) |\n| 18    | 20 Deep learning Project with python                                  | [is.gd\u002Fl3OCJs](http:\u002F\u002Fis.gd\u002Fl3OCJs) |\n| 19    | 5 COVID19 Projects with Python                                        | [is.gd\u002FxFCnYi](http:\u002F\u002Fis.gd\u002FxFCnYi) |\n| 20    | 9 Computer Vision Project with python                                 | [is.gd\u002FlrNybj](http:\u002F\u002Fis.gd\u002FlrNybj) |\n| 21    | 8 Neural Network Project with python                                  | [is.gd\u002FFCyOOf](is.gd\u002FFCyOOf)        |\n| 22    | 5 Machine learning Project for healthcare                             | https:\u002F\u002Fbit.ly\u002F3b86bOH              |\n| 23    | 5 NLP Project with Python                                             | https:\u002F\u002Fbit.ly\u002F3hExtNS              |\n| 24    | 47 Machine Learning Projects for 2021                                 | https:\u002F\u002Fbit.ly\u002F356bjiC              |\n| 25    | 19 Artificial Intelligence Projects for 2021                          | https:\u002F\u002Fbit.ly\u002F38aLgsg              |\n| 26    | 28 Machine learning Projects for 2021                                 | https:\u002F\u002Fbit.ly\u002F3bguRF1              |\n| 27    | 16 Data Science Projects with Source Code for 2021                    | https:\u002F\u002Fbit.ly\u002F3oa4zYD              |\n| 28    | 24 Deep learning Projects with Source Code for 2021                   | https:\u002F\u002Fbit.ly\u002F3rQrOsU              |\n| 29    | 25 Computer Vision Projects with Source Code for 2021                 | https:\u002F\u002Fbit.ly\u002F2JDMO4I              |\n| 30    | 23 Iot Projects with Source Code for 2021                             | https:\u002F\u002Fbit.ly\u002F354gT53              |\n| 31    | 27 Django Projects with Source Code for 2021                          | https:\u002F\u002Fbit.ly\u002F2LdRPRZ              |\n| 32    | 37 Python Fun Projects with Code for 2021                             | https:\u002F\u002Fbit.ly\u002F3hBHzz4              |\n| 33    | 500 + Top Deep learning Codes                                         | https:\u002F\u002Fbit.ly\u002F3n7AkAc              |\n| 34    | 500 + Machine learning Codes                                          | https:\u002F\u002Fbit.ly\u002F3b32n13              |\n| 35    | 20+ Machine Learning Datasets & Project Ideas                         | https:\u002F\u002Fbit.ly\u002F3b2J48c              |\n| 36    | 1000+ Computer vision codes                                           | https:\u002F\u002Fbit.ly\u002F2LiX1nv              |\n| 37    | 300 + Industry wise Real world projects with code                     | https:\u002F\u002Fbit.ly\u002F3rN7lVR              |\n| 38    | 1000 + Python Project Codes                                           | https:\u002F\u002Fbit.ly\u002F3oca2xM              |\n| 39    | 363 + NLP Project with Code                                           | https:\u002F\u002Fbit.ly\u002F3b442DO              |\n| 40    | 50 + Code ML Models (For iOS 11) Projects                             | https:\u002F\u002Fbit.ly\u002F389dB2s              |\n| 41    | 180 + Pretrained Model Projects for Image, text, Audio and Video      | https:\u002F\u002Fbit.ly\u002F3hFyQMw              |\n| 42    | 50 + Graph Classification Project List                                | https:\u002F\u002Fbit.ly\u002F3rOYFhH              |\n| 43    | 100 + Sentence Embedding(NLP Resources)                               | https:\u002F\u002Fbit.ly\u002F355aS8c              |\n| 44    | 100 + Production Machine learning Projects                            | https:\u002F\u002Fbit.ly\u002F353ckI0              |\n| 45    | 300 + Machine Learning Resources Collection                           | https:\u002F\u002Fbit.ly\u002F3b2LjIE              |\n| 46    | 70 + Awesome AI                                                       | https:\u002F\u002Fbit.ly\u002F3hDIXkD              |\n| 47    | 150 + Machine learning Project Ideas with code                        | https:\u002F\u002Fbit.ly\u002F38bfpbg              |\n| 48    | 100 + AutoML Projects with code                                       | https:\u002F\u002Fbit.ly\u002F356zxZX              |\n| 49    | 100 + Machine Learning Model Interpretability Code Frameworks         | https:\u002F\u002Fbit.ly\u002F3n7FaNB              |\n| 50    | 120 + Multi Model Machine learning Code Projects                      | https:\u002F\u002Fbit.ly\u002F38QRI76              |\n| 51    | Awesome Chatbot Projects                                              | https:\u002F\u002Fbit.ly\u002F3rQyxmE              |\n| 52    | Awesome ML Demo Project with iOS                                      | https:\u002F\u002Fbit.ly\u002F389hZOY              |\n| 53    | 100 + Python based Machine learning Application Projects              | https:\u002F\u002Fbit.ly\u002F3n9zLWv              |\n| 54    | 100 + Reproducible Research Projects of ML and DL                     | https:\u002F\u002Fbit.ly\u002F2KQ0J8C              |\n| 55    | 25 + Python Projects                                                  | https:\u002F\u002Fbit.ly\u002F353fRpK              |\n| 56    | 8 + OpenCV Projects                                                   | https:\u002F\u002Fbit.ly\u002F389mj0B              |\n| 57    | 1000 + Awesome Deep learning Collection                               | https:\u002F\u002Fbit.ly\u002F3b0a9Jj              |\n| 58    | 200 + Awesome NLP learning Collection                                 | https:\u002F\u002Fbit.ly\u002F3b74b9o              |\n| 59    | 200 + The Super Duper NLP Repo                                        | https:\u002F\u002Fbit.ly\u002F3hDNnbd              |\n| 60    | 100 + NLP dataset for your Projects                                   | https:\u002F\u002Fbit.ly\u002F353h2Wc              |\n| 61    | 364 + Machine Learning Projects definition                            | https:\u002F\u002Fbit.ly\u002F2X5QRdb              |\n| 62    | 300+ Google Earth Engine Jupyter Notebooks to Analyze Geospatial Data | https:\u002F\u002Fbit.ly\u002F387JwjC              |\n| 63    | 1000 + Machine learning Projects Information                          | https:\u002F\u002Fbit.ly\u002F3rMGk4N              |\n| 64.   | 11 Computer Vision Projects with code                                 | https:\u002F\u002Fbit.ly\u002F38gz2OR              |\n| 65.   | 13 Computer Vision Projects with Code                                 | https:\u002F\u002Fbit.ly\u002F3hMJdhh              |\n| 66.   | 13 Cool Computer Vision GitHub Projects To Inspire You                | https:\u002F\u002Fbit.ly\u002F2LrSv6d              |\n| 67.   | Open-Source Computer Vision Projects (With Tutorials)                 | https:\u002F\u002Fbit.ly\u002F3pUss6U              |\n| 68.   | OpenCV Computer Vision Projects with Python                           | https:\u002F\u002Fbit.ly\u002F38jmGpn              |\n| 69.   | 100 + Computer vision Algorithm Implementation                        | https:\u002F\u002Fbit.ly\u002F3rWgrzF              |\n| 70.   | 80 + Computer vision Learning code                                    | https:\u002F\u002Fbit.ly\u002F3hKCpkm              |\n| 71.   | Deep learning Treasure                                                | https:\u002F\u002Fbit.ly\u002F359zLQb              |\n\n[#100+ Free Machine Learning Books](https:\u002F\u002Fwww.theinsaneapp.com\u002F2020\u002F12\u002Fdownload-free-machine-learning-books.html)\n\n\n#ALL THE CREDITS GOES TO THE RESPECTIVE CREATORS AND THESE RESOURCES ARE COMBINED TOGETHER TO MAKE A WONDERFUL AND COMPACT LEARNING RESOURCE FOR THE DATASCIENCE ENTHUSIASTS\n\nPart 1:- [Roadmap](https:\u002F\u002Fgithub.com\u002FMrMimic\u002Fdata-scientist-roadmap) \n \nPart 2:- [Free Online Courses](https:\u002F\u002Fgithub.com\u002FDeveloper-Y)\n\nPart 3:- [500 Datascience Projects](https:\u002F\u002Fgithub.com\u002Fashishpatel26\u002F500-AI-Machine-learning-Deep-learning-Computer-vision-NLP-Projects-with-code)\n\nPart 4:- [100+ Free Machine Learning Books](https:\u002F\u002Fwww.theinsaneapp.com\u002F2020\u002F12\u002Fdownload-free-machine-learning-books.html)\n\nPart 5:- [10 Machine Learning Books for Beginners](https:\u002F\u002Fwww.appliedaicourse.com\u002Fblog\u002Fmachine-learning-books\u002F)\n","# 数据科学家路线图 (2021)\n\n![roadmap-picture](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_3467d861335e.png)\n\n****\n\n# 1_ 基础\n\n\n## 1_ 矩阵与代数基础\n\n### 关于\n\n在数学中，矩阵是__按行和列排列的数字、符号或表达式的矩形数组__。矩阵可以通过删除任意数量的行和\u002F或列简化为子矩阵。\n\n![matrix-image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fb\u002Fbb\u002FMatrix.svg)\n\n### 操作\n\n有许多基本操作可用于修改矩阵：\n\n* [加法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMatrix_addition)\n* [标量乘法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FScalar_multiplication)\n* [转置](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTranspose)\n* [乘法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMatrix_multiplication)\n\n\n## 2_ 哈希函数、二叉树、O(n)\n\n### 哈希函数 (Hash function)\n\n#### 定义\n\n哈希函数是__任何可以将任意大小的数据映射到固定大小数据的函数__。一种用途是称为哈希表的数据结构，广泛用于计算机软件中的快速数据查找。哈希函数通过检测大型文件中的重复记录来加速表或数据库查找。\n\n![hash-image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002F5\u002F58\u002FHash_table_4_1_1_0_0_1_0_LL.svg)\n\n### 二叉树\n\n#### 定义\n\n在计算机科学中，二叉树是__一种每个节点最多有两个子节点的树数据结构__，分别称为左子节点和右子节点。\n\n![binary-tree-image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Ff\u002Ff7\u002FBinary_tree.svg)\n\n### O(n)\n\n#### 定义\n\n在计算机科学中，大 O 表示法 (Big O notation) 用于__根据算法的运行时间或空间需求随输入规模增长的方式对算法进行分类__。在解析数论中，大 O 表示法通常用于__表达算术函数与其更好理解的近似值之间的界限__。\n\n## 3_ 关系代数、数据库基础\n\n### 定义\n\n关系代数 (Relational algebra) 是一族具有__用于建模存储在关系数据库中的数据并定义其上查询的良好语义的代数__。\n\n关系代数的主要应用是为__关系数据库__提供理论基础，特别是此类数据库的查询语言，其中最主要的是 SQL。\n\n### 自然连接\n\n#### 关于\n\n在 SQL 语言中，如果满足以下条件，两个表之间将进行自然连接：\n\n* 至少有一列在两个表中名称相同\n* 这两列具有相同的数据类型\n    * CHAR (字符)\n    * INT (整数)\n    * FLOAT (浮点数值数据)\n    * VARCHAR (长字符串)\n    \n#### MySQL 请求\n\n        SELECT \u003CCOLUMNS>\n        FROM \u003CTABLE_1>\n        NATURAL JOIN \u003CTABLE_2>\n\n        SELECT \u003CCOLUMNS>\n        FROM \u003CTABLE_1>, \u003CTABLE_2>\n        WHERE TABLE_1.ID = TABLE_2.ID\n\n## 4_ 内连接、外连接、交叉连接、theta 连接\n\n### 内连接\n\nINNER JOIN 关键字选择两个表中具有匹配值的记录。\n\n#### 查询语句\n\n      SELECT column_name(s)\n      FROM table1\n      INNER JOIN table2 ON table1.column_name = table2.column_name;\n\n![inner-join-image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_ac4e75e16439.gif)\n\n### 外连接\n\nFULL OUTER JOIN 关键字在左表 (table1) 或右表 (table2) 的记录中存在匹配时返回所有记录。\n\n#### 查询语句\n\n      SELECT column_name(s)\n      FROM table1\n      FULL OUTER JOIN table2 ON table1.column_name = table2.column_name; \n\n![outer-join-image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_78c764bdbdaf.gif)\n\n### 左连接\n\nLEFT JOIN 关键字返回左表 (table1) 中的所有记录，以及右表 (table2) 中的匹配记录。如果没有匹配，右侧的结果为 NULL。\n\n#### 查询语句\n\n      SELECT column_name(s)\n      FROM table1\n      LEFT JOIN table2 ON table1.column_name = table2.column_name;\n\n![left-join-image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_35125df03a8e.gif)\n\n### 右连接\n\nRIGHT JOIN 关键字返回右表 (table2) 中的所有记录，以及左表 (table1) 中的匹配记录。当没有匹配时，左侧的结果为 NULL。\n#### 查询语句\n\n      SELECT column_name(s)\n      FROM table1\n      RIGHT JOIN table2 ON table1.column_name = table2.column_name;\n\n![left-join-image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_02187c9d9300.gif)\n\n## 5_ CAP 定理\n\n分布式数据存储无法同时提供以下三个保证中的两个以上：\n \n* 每个读取都收到最新的写入或错误。\n* 每个请求都收到（非错误）响应——但不保证包含最新写入。\n* 尽管节点之间的网络丢弃（或延迟）了任意数量的消息，系统仍继续运行。\n\n换句话说，CAP 定理指出，在网络分区存在的情况下，必须在一致性和可用性之间做出选择。注意，CAP 定理中定义的一致性不同于 ACID 数据库事务中保证的一致性。\n\n## 6_ 表格数据\n\n表格数据与__关系__数据（如 SQL 数据库）__相对__。\n\n在表格数据中，__所有内容都按列和行排列__。每一行都有相同数量的列（缺失值除外，可用\"N\u002FA\"代替）。\n\n表格数据的__第一行__通常是__标题__，描述每列的内容。\n\n数据科学中最常用的表格数据格式是__CSV___。每列都被一个字符（制表符、逗号等）包围，将此列与其两个邻居分隔开。\n\n## 7_ 熵\n\n熵是__不确定性的度量__。高熵意味着数据具有高方差，因此包含大量信息和\u002F或噪声。\n\n例如，__对于所有 x 都为 f(x) = 4 的常数函数没有熵且易于预测__，信息量少，没有噪声，可以简洁表示。类似地，f(x) = ~4 有一些熵，而 f(x) = 随机数由于噪声而具有非常高的熵。\n\n## 8_ 数据框与序列\n\n数据框用于存储数据表。它是长度相等的向量列表。\n\n序列是有序的数据点系列。\n\n## 9_ 分片 (Sharding)\n\n*分片* (**Sharding**) 是**水平（按行）数据库分区**，与**垂直（按列）分区**（即*规范化*）相对。\n\n为什么要使用分片？\n\n1. 拥有大型数据集或高吞吐量应用程序的数据库系统可能会挑战单台服务器的容量。\n2. 应对增长的两种方法：垂直扩展和水平扩展。\n3. 垂直扩展 (Vertical Scaling)\n\n    * 涉及增加单台服务器的容量。\n    * 但由于技术和经济限制，单台机器可能不足以应付给定的工作负载。\n\n4. 水平扩展 (Horizontal Scaling)\n    * 涉及将数据集和负载分散到多台服务器上，根据需要添加额外的服务器以增加容量。\n    * 虽然单台机器的整体速度或容量可能不高，但每台机器处理整体工作负载的一个子集，可能比单台高速大容量服务器提供更好的效率。 \n    * 理念是利用分布式系统的概念来实现扩展。\n    * 但它也伴随着分布式系统固有的复杂性增加的权衡。\n    * 许多数据库系统通过分片数据集提供水平扩展。\n\n## 10_ 联机分析处理 (OLAP)\n\n联机分析处理，或称 OLAP，是一种在计算中快速回答多维分析 (MDA) 查询的方法。 \n\nOLAP 属于 __更广泛的商业智能类别__，其中包括关系型数据库、报告编写和数据挖掘。OLAP 的典型应用包括 ___销售业务报告、营销、管理报告、业务流程管理 (BPM)、预算和预测、财务报告及类似领域，新的应用也在不断涌现，例如农业__。\n\nOLAP 一词是对传统数据库术语在线事务处理 (OLTP) 的轻微修改。\n\n## 11_ 多维数据模型\n\n## 12_ ETL\n\n* 抽取 (Extract)\n  * 从多个异构源系统中提取数据。\n  * 数据验证，以确认拉取的数据在给定的域中具有正确\u002F预期的值。\n\n* 转换 (Transform)\n  * 提取的数据被送入管道，对数据应用多种函数。\n  * 这些函数的意图是将数据转换为最终系统接受的格式。\n  * 涉及清理数据以去除噪声、异常值和冗余数据。\n* 加载 (Load)\n  * 将转换后的数据加载到最终目标。\n\n## 13_ 报表 vs 商业智能 vs 分析\n\n## 14_ JSON 和 XML\n\n### JSON\n\nJSON 是一种语言无关的数据格式。描述一个人的示例：\n\t\n\t{\n\t  \"firstName\": \"John\",\n\t  \"lastName\": \"Smith\",\n\t  \"isAlive\": true,\n\t  \"age\": 25,\n\t  \"address\": {\n\t    \"streetAddress\": \"21 2nd Street\",\n\t    \"city\": \"New York\",\n\t    \"state\": \"NY\",\n\t    \"postalCode\": \"10021-3100\"\n\t  },\n\t  \"phoneNumbers\": [\n\t    {\n\t      \"type\": \"home\",\n\t      \"number\": \"212 555-1234\"\n\t    },\n\t    {\n\t      \"type\": \"office\",\n\t      \"number\": \"646 555-4567\"\n\t    },\n\t    {\n\t      \"type\": \"mobile\",\n\t      \"number\": \"123 456-7890\"\n\t    }\n\t  ],\n\t  \"children\": [],\n\t  \"spouse\": null\n\t}\n\n## XML\n\n可扩展标记语言 (XML) 是一种标记语言，它定义了一组规则，用于将文档编码为既对人类可读又对机器可读的格式。\n \n \t\u003CCATALOG>\n\t  \u003CPLANT>\n\t    \u003CCOMMON>Bloodroot\u003C\u002FCOMMON>\n\t    \u003CBOTANICAL>Sanguinaria canadensis\u003C\u002FBOTANICAL>\n\t    \u003CZONE>4\u003C\u002FZONE>\n\t    \u003CLIGHT>Mostly Shady\u003C\u002FLIGHT>\n\t    \u003CPRICE>$2.44\u003C\u002FPRICE>\n\t    \u003CAVAILABILITY>031599\u003C\u002FAVAILABILITY>\n\t  \u003C\u002FPLANT>\n\t  \u003CPLANT>\n\t    \u003CCOMMON>Columbine\u003C\u002FCOMMON>\n\t    \u003CBOTANICAL>Aquilegia canadensis\u003C\u002FBOTANICAL>\n\t    \u003CZONE>3\u003C\u002FZONE>\n\t    \u003CLIGHT>Mostly Shady\u003C\u002FLIGHT>\n\t    \u003CPRICE>$9.37\u003C\u002FPRICE>\n\t    \u003CAVAILABILITY>030699\u003C\u002FAVAILABILITY>\n\t  \u003C\u002FPLANT>\n\t  \u003CPLANT>\n\t    \u003CCOMMON>Marsh Marigold\u003C\u002FCOMMON>\n\t    \u003CBOTANICAL>Caltha palustris\u003C\u002FBOTANICAL>\n\t    \u003CZONE>4\u003C\u002FZONE>\n\t    \u003CLIGHT>Mostly Sunny\u003C\u002FLIGHT>\n\t    \u003CPRICE>$6.81\u003C\u002FPRICE>\n\t    \u003CAVAILABILITY>051799\u003C\u002FAVAILABILITY>\n\t  \u003C\u002FPLANT>\n\t\u003C\u002FCATALOG>\n\n## 15_ NoSQL\n\nNoSQL 与关系型数据库相对（代表 __N__ot __O__nly __SQL__）。数据未结构化，且表之间没有键的概念。\n\n任何类型的数据都可以存储在 NoSQL 数据库中（JSON, CSV, ...），而无需考虑复杂的关系模式。\n\n__常用的 NoSQL 技术栈__：Cassandra, MongoDB, Redis, Oracle noSQL ...\n\n## 16_ 正则表达式 (Regex)\n\n### 简介\n\n__正__则__表__达式 (__regex__) 常用于计算机科学。\n\n它可以用于广泛的可能性：\n* 文本替换\n* 从文本中提取信息（电子邮件、电话号码等）\n* 列出具有 .txt 扩展名的文件 ..\n\nhttp:\u002F\u002Fregexr.com\u002F 是一个用于实验正则表达式的优秀网站。\n\n### 使用方法\n\n要在 [Python](https:\u002F\u002Fdocs.python.org\u002F3\u002Flibrary\u002Fre.html) 中使用它们，只需导入：\n\n    import re\n\n## 17_ 供应商格局\n\n## 18_ 环境设置\n\n# 2_ 统计学\n\n\n[面向数据新手的统计学 101](https:\u002F\u002Fmedium.com\u002F@debuggermalhotra\u002Fstatistics-101-for-data-noobs-2e2a0e23a5dc)\n\n## 1_ 选择一个数据集\n\n### 数据集仓库\n\n#### 通用\n\n- [KAGGLE](https:\u002F\u002Fwww.kaggle.com\u002Fdatasets)\n- [Google](https:\u002F\u002Ftoolbox.google.com\u002Fdatasetsearch)\n\n#### 医疗\n\n- [PMC](https:\u002F\u002Fwww.ncbi.nlm.nih.gov\u002Fpmc\u002F)\n\n#### 其他语言\n\n##### 法语\n\n- [DATAGOUV](https:\u002F\u002Fwww.data.gouv.fr\u002Ffr\u002F)\n\n## 2_ 描述性统计\n\n### 均值 (Mean)\n\n在概率论和统计学中，总体均值和期望值通常互换使用，指代由该分布表征的概率分布或随机变量的__集中趋势度量之一__。\n\n对于数据集，术语算术平均数、数学期望，有时也称为平均值，互换使用以指代离散数值集的中央值：具体来说，是__数值之和除以数值个数__。\n\n![mean_formula](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002Fbd2f5fb530fc192e4db7a315777f5bbb5d462c90)\n\n### 中位数 (Median)\n\n中位数是__分隔数据样本、总体或概率分布的上半部分与下半部分的值__。简单来说，它可以被视为数据集的“中间”值。\n\n### Python 中的描述性统计\n\n[Numpy](http:\u002F\u002Fwww.numpy.org\u002F) 是一个广泛用于统计分析的 Python 库。\n\n#### 安装\n\n    pip3 install numpy\n\n#### 使用\n    \n    import numpy\n\n## 3_ 探索性数据分析\n\n此步骤包括数据的可视化和分析。\n\n原始数据可能存在不合理的分布，这可能会导致后续问题。\n\n此外，在应用过程中，我们也必须了解数据的分布情况，例如数据是线性分布还是螺旋状分布。\n\n[Guide to EDA in Python](https:\u002F\u002Ftowardsdatascience.com\u002Fdata-preprocessing-and-interpreting-results-the-heart-of-machine-learning-part-1-eda-49ce99e36655)\n\n##### Python 库 \n\n[Matplotlib](https:\u002F\u002Fmatplotlib.org\u002F)\n\n用于在 Python 中绘制图表的库\n\n__安装__:\n\n    pip3 install matplotlib\n\n__使用方法__:\n\n    import matplotlib.pyplot as plt\n\n[Pandas](https:\u002F\u002Fpandas.pydata.org\u002F)\n\n用于处理 Python 中大数据集的库\n\n__安装__:\n\n    pip3 install pandas\n\n__使用方法__:\n\n    import pandas as pd\n    \n[Seaborn](https:\u002F\u002Fseaborn.pydata.org\u002F)\n\nPython 中的又一个图形绘制库。\n\n__安装__:\n\n    pip3 install seaborn\n\n__使用方法__:\n\n    import seaborn as sns\n\n\n#### PCA\n\nPCA 代表主成分分析（Principal Component Analysis）。\n\n正如我们之前所见，我们经常需要了解数据分布的形状。我们需要为此绘制数据图。\n\n数据可以是多维的，也就是说，一个数据集可以拥有多个特征。 \n\n我们只能绘制二维数据，因此，对于多维数据，我们将多维分布投影到两个维度上，保留分布的主成分，以便通过二维图了解实际分布的情况。 \n\n它也用于降维。通常可以看到，某些特征并没有对数据分布提供重要的见解。这些特征增加了复杂性并提高了数据的维度。不考虑这些特征会导致数据维度的降低。\n\n[Mathematical Explanation](https:\u002F\u002Fmedium.com\u002Ftowards-artificial-intelligence\u002Fdemystifying-principal-component-analysis-9f13f6f681e6)\n\n[Application in Python](https:\u002F\u002Ftowardsdatascience.com\u002Fdata-preprocessing-and-interpreting-results-the-heart-of-machine-learning-part-2-pca-feature-92f8f6ec8c8)\n\n## 4_ 直方图\n\n直方图是数值数据分布的表示。该过程包括使用范围划分对数值进行分箱，即数据变化的整个范围被分成几个固定的区间。表示的是落在分箱范围内的数字出现的次数或频率。\n\n[Histograms](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHistogram)\n\n![plot](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F1\u002F1d\u002FExample_histogram.png\u002F220px-Example_histogram.png)\n\n在 Python 中，__Pandas__、__Matplotlib__、__Seaborn__ 可以用来创建直方图。\n\n## 5_ 百分位数与异常值\n\n### 百分位数\n\n百分位数是统计学中的数值度量，表示在数值数据分布中，有多少或百分之多少的数据低于给定的数值或实例。 \n\n例如，如果我们说第 70 百分位数，它表示分布中 70% 的数据低于给定的数值。 \n\n[Percentiles](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPercentile)\n\n### 异常值\n\n异常值是与其他数据点有显著差异的数据点（数值型）。它们与分布中的大多数点不同。这样的点可能会扭曲分布的中心度量，如均值和中位数。因此，需要检测并移除它们。\n\n[Outliers](https:\u002F\u002Fwww.itl.nist.gov\u002Fdiv898\u002Fhandbook\u002Fprc\u002Fsection1\u002Fprc16.htm)\n\n__箱线图__可用于检测数据中的异常值。可以使用 __Seaborn__ 库创建它们。\n\n![Image_Box_Plot](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_8d903230d58c.png)\n  \n## 6_ 概率论\n\n__概率__是随机实验中事件发生的可能性。例如，如果抛硬币，得到正面的几率是 50%，所以概率是 0.5。\n\n__样本空间__：它是随机实验所有可能结果的集合。 \n__有利结果__：我们在随机实验中寻找的结果集合\n\n__概率 = (有利结果的数量) \u002F (样本空间)__\n\n__概率论__是与概率概念相关的数学分支。\n\n[Basics of Probability](https:\u002F\u002Ftowardsdatascience.com\u002Fbasic-probability-theory-and-statistics-3105ab637213)\n\n## 7_ 贝叶斯定理\n\n### 条件概率：\n\n这是在另一个事件已经发生的情况下，某个事件发生的概率。因此，它给出了两个事件之间的关系以及这些事件发生概率的概念。\n\n公式如下：\n\n__P( A | B )__：B 发生后 A 发生的概率。\n\n公式为： \n\n![formula](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F74cbddb93db29a62d522cd6ab266531ae295a0fb)\n\n因此，P(A|B) 等于 A 和 B 同时发生的概率除以 B 发生的概率。\n\n[Guide to Conditional Probability](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FConditional_probability)\n\n### 贝叶斯定理\n\n贝叶斯定理提供了一种计算条件概率的方法。贝叶斯定理在机器学习中被广泛使用，尤其是在贝叶斯分类器中。  \n\n根据贝叶斯定理，在 B 已经发生的情况下 A 的概率，由 A 的概率乘以给定 A 已发生的情况下 B 的概率，再除以 B 的概率得出。\n\n__P(A|B) =  P(A).P(B|A) \u002F P(B)__\n\n\n[Guide to Bayes Theorem](https:\u002F\u002Fmachinelearningmastery.com\u002Fbayes-theorem-for-machine-learning\u002F)\n\n\n## 8_ 随机变量\n\n随机变量是实验或随机事件的数值结果。它们通常是一组值。 \n\n主要有两种类型的随机变量：\n\n__离散随机变量__：此类变量仅取有限个不同的值。\n\n__连续随机变量__：此类变量可以取无限个可能的值。\n\n\n## 9_ 累积分布函数 (CDF)\n\n在概率论和统计学中，实值随机变量 __X__ 的累积分布函数（CDF），或简称 __X__ 的分布函数，在 __x__ 处的取值，是 __X__ 取值小于或等于 __x__ 的概率。\n\n实值随机变量 X 的累积分布函数是由以下函数给出的：\n\n![CDF](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002Ff81c05aba576a12b4e05ee3f4cba709dd16139c7)\n\n资源：\n\n[Wikipedia](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCumulative_distribution_function)\n\n## 10_ 连续分布\n\n连续分布描述了连续随机变量可能值的概率。连续随机变量是一个具有无限且不可数的一组可能值（称为范围）的随机变量。\n\n## 11_ Skewness (偏度)\n\n偏度 (Skewness) 是衡量数据分布或随机变量分布关于其均值不对称程度的指标。\n\n偏度可以是正的、负的或零。\n\n![skewed image](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002Ff\u002Ff8\u002FNegative_and_positive_skew_diagrams_%28English%29.svg\u002F446px-Negative_and_positive_skew_diagrams_%28English%29.svg.png)\n\n__负偏__: 分布集中在右侧，左侧尾部较长。\n\n__正偏__: 分布集中在左侧，右侧尾部较长。\n\n集中趋势度量的变化如下所示。\n\n\n![cet](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002Fc\u002Fcc\u002FRelationship_between_mean_and_median_under_different_skewness.png\u002F434px-Relationship_between_mean_and_median_under_different_skewness.png)\n\n数据分布通常存在偏斜，这可能在数据处理过程中造成麻烦。__可以通过对分布取对数将偏态分布转换为对称分布__。\n\n##### Skew Distribution (偏态分布)\n\n![Skew](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_12cb775e50a2.png)\n\n##### Log of the Skew Distribution. (偏态分布的对数)\n\n![log](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_af42cb82b664.png)\n\n\n[Guide to Skewness](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSkewness)\n\n\n## 12_ ANOVA (方差分析)\n\nANOVA 代表 __方差分析 (Analysis of Variance)__。 \n\n它用于比较不同组的数据分布。\n\n通常我们会获得海量数据。这些数据太大难以直接处理。总数据被称为 __总体 (Population)__。\n\n为了处理它们，我们选取随机的较小数据组。它们被称为 __样本 (Samples)__。\n\nANOVA 用于比较这些组或样本之间的方差。 \n\n组的方差由下式给出：\n\n![var](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_bdfacf6f5aee.png)\n\n通过观察组均值之间的差异来观察收集到的样本的差异。我们经常使用 __t 检验 (t-test)__ 来比较均值，并检查样本是否属于同一总体，\n\n现在，t 检验仅适用于两组之间。但是，我们通常会得到更多的组或样本。\n\n如果我们尝试对超过两组的组使用 t 检验，我们必须多次执行 t 检验，每对一次。这就是 ANOVA 发挥作用的地方。\n\nANOVA 有两个组成部分：\n\n__1.Variation within each group__ (每个组内的变异)\n\n__2.Variation between groups__ (组间的变异)\n\n它基于一个称为 __F 比率 (F-Ratio)__ 的比率工作。\n\n它由下式给出：\n\n![F-ratio](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_485efa4fa145.png)\n\nF 比率显示了总变异中有多少来自组间变异，有多少来自组内变异。如果大部分变异来自组间变异，则组均值不同的可能性更大。然而，如果大部分变异来自组内变异，那么我们可以得出结论，组内的元素是不同的，而不是整个组。F 比率越大，组具有不同均值的可能性就越大。\n\n\nResources:\n\n[Defnition](https:\u002F\u002Fstatistics.laerd.com\u002Fstatistical-guides\u002Fone-way-anova-statistical-guide.php)\n\n[GUIDE 1](https:\u002F\u002Ftowardsdatascience.com\u002Fanova-analysis-of-variance-explained-b48fee6380af)\n\n[Details](https:\u002F\u002Fmedium.com\u002F@StepUpAnalytics\u002Fanova-one-way-vs-two-way-6b3ff87d3a94)\n\n\n## 13_ Prob Den Fn (PDF) (概率密度函数)\n\n它代表概率密度函数。 \n\n__在概率论中，概率密度函数（PDF），或连续随机变量的密度，是一个函数，其在样本空间（随机变量可能取值集合）中任何给定样本（或点）的值可以被解释为提供随机变量值等于该样本的相对可能性。__\n\n连续分布的概率密度函数 (PDF) P(x) 定义为 (累积) 分布函数 D(x) 的导数。\n\n它由函数在给定范围内的积分给出。\n\n![PDF](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F45fd7691b5fbd323f64834d8e5b8d4f54c73a6f8)\n\n## 14_ Central Limit Theorem (中心极限定理)\n\n## 15_ Monte Carlo Method (蒙特卡洛方法)\n\n## 16_ Hypothesis Testing (假设检验)\n\n### Types of curves (曲线类型)\n\n我们需要先了解两种分布曲线。\n\n分布曲线反映了在分布的某个值处找到总体实例或样本的概率。\n\n__Normal Distribution (正态分布)__\n\n![normal distribution](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_3f69f968f470.jpg)\n\n正态分布代表了数据的分布情况。在这种情况下，分布中的大多数数据样本散布在分布均值及其周围。少数实例散布在分布的长尾末端。\n\n关于正态分布的几个要点是：\n\n1. 曲线总是钟形的。这是因为大多数数据都围绕均值发现，所以在均值或中心值找到样本的概率更高。\n\n2. 曲线是对称的\n\n3. 曲线下方的面积始终为 1。这是因为分布的所有点都必须存在于曲线下方\n\n4. 对于正态分布，均值和中位数位于分布的同一条线上。 \n\n__Standard Normal Distribution (标准正态分布)__\n\n这种类型的分布是满足以下条件的正态分布。\n\n1. 分布的均值为 0\n\n2. 分布的标准差等于 1。\n\n假设检验的思想完全基于数据分布。\n\n### 假设检验\n\n假设检验（Hypothesis Testing）是一种统计方法，用于利用实验数据进行统计决策。假设检验基本上是我们对总体参数所做的一个假设。\n\n例如，假设我们提出一个假设：班级里的男孩比女孩高。 \n\n上述陈述只是对班级总体的一个假设。\n\n__假设__（Hypothesis）仅仅是基于对一组信息或数据的观察而提出的推测性提议或陈述。 \n\n我们最初根据样本数据的总体提出两个互斥的陈述。 \n\n第一个称为__零假设__（NULL HYPOTHESIS）。它用 H0 表示。\n\n第二个称为__备择假设__（ALTERNATE HYPOTHESIS）。它用 H1 或 Ha 表示。它用作零假设的对立面。 \n\n基于总体的实例，我们接受或拒绝零假设，并相应地拒绝或接受备择假设。\n \n#### 显著性水平\n\n这是我们决定是否接受或拒绝零假设的程度。当我们对总体进行假设时，并非 100% 或所有总体实例都符合该假设，因此我们设定一个__显著性水平__（Level of Significance）作为截止程度，即，如果我们的显著性水平是 5%，且 (100-5)% = 95% 的数据符合假设，我们就接受该假设。\n\n__这意味着在 95% 的置信度下，该假设被接受__\n\n![curve](https:\u002F\u002Fi.stack.imgur.com\u002Fd8iHd.png)\n\n非拒绝区域称为__接受区域或 beta 区域__。拒绝区域称为__临界区域或 alpha 区域__。__alpha__ 表示__显著性水平__。\n\n如果显著性水平为 5%，则两个 alpha 区域包含 (2.5+2.5)% 的总体，beta 区域包含 95%。 \n\n接受和拒绝会产生两种类型的错误：\n\n__第一类错误__（Type-I Error）：零假设为真，但被错误地拒绝。\n\n__第二类错误__（Type-II Error）：零假设为假，但被错误地接受。\n\n![hypothesis](https:\u002F\u002Fmicrobenotes.com\u002Fwp-content\u002Fuploads\u002F2020\u002F07\u002FGraphical-representation-of-type-1-and-type-2-errors.jpg)\n\n### 假设检验测试\n\n__单尾检验__（One Tailed Test）: \n\n![One-tailed](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_acc48a5819d2.png)\n\n这是一种假设检验，其中拒绝区域仅位于抽样分布的一侧。拒绝区域可能在右尾端或左尾端。\n\n其思路是，如果我们说显著性水平是 5%，并且考虑一个假设“班级里男孩的身高 \u003C= 6 英尺”。如果至多 5% 的人口身高超过 6 英尺，我们就认为该假设成立。因此，这将是单尾的，因为测试条件仅限制了一端，即身高 > 6 英尺的那一端。 \n\n![Two Tailed](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_88e6dd52a2b7.png)\n\n在这种情况下，拒绝区域延伸到分布的两端尾部。\n\n其思路是，如果我们说显著性水平是 5%，并且考虑一个假设“班级里男孩的身高 != 6 英尺”。\n\n在这里，仅当至多 5% 的人口身高小于或大于 6 英尺时，我们才能接受零假设。因此，显然临界区域将位于两端，且分布两端的区域各为 5% \u002F 2 = 2.5%。 \n\n\n\n## 17_P 值\n\n在我们深入探讨 P 值（P-value）之前，我们需要先看一下上下文中的另一个重要主题：Z 检验（Z-test）。\n\n### Z 检验\n\n我们需要了解两个术语：__总体__（Population）和__样本__（Sample）。\n\n__总体__描述了整个可用的数据分布。因此，它指的是数据集中提供的所有记录。\n\n__样本__是指从总体或给定分布中随机选取的一组数据点。样本的大小可以是任意数量的数据点，由__样本量__（sample size）给出。\n\n__Z 检验__简单地用于确定给定的样本分布是否属于给定的总体。 \n\n现在，对于 Z 检验，我们必须使用__标准正态形式__（Standard Normal Form）来进行标准化比较度量。\n\n![std1](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_d64d52a25cae.png)\n\n正如我们已经看到的，标准正态形式是一个均值为 0、标准差为 1 的正态形式。\n\n__标准差__（Standard Deviation）是衡量点围绕均值分布差异程度的指标。\n\n![std2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_a2c6afc944fa.png)\n\n它指出，大约 68%、95% 和 99.7% 的数据分别位于正态分布的 1、2 和 3 个标准差范围内。\n\n现在，为了将正态分布转换为标准正态分布，我们需要一个称为__Z 分数__（Z-Score）的标准分数。它由以下公式给出：\n\n![Z-score](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_310333194ca3.png)\n\nx = 我们要标准化的值\n\nµ = x 分布的均值\n\nσ = x 分布的标准差\n\n我们需要了解另一个概念__中心极限定理__（Central Limit Theorem）。\n\n##### 中心极限定理 \n\n_该定理指出，无论总体分布如何，只要样本量大于 30，样本均值的抽样分布的均值等于总体均值。_\n\n并且\n\n_样本均值的抽样分布也将遵循正态分布。_\n\n因此，它指出，如果我们要从一个大小超过 30 的分布中选择几个样本，并选择样本均值并使用样本均值创建一个分布，那么新创建的抽样分布的均值等于原始总体均值。\n\n根据该定理，如果我们从具有总体均值 μ 和总体标准差 σ 的总体中抽取大小为 N 的样本，则条件如下：\n\n![std3](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_989de647b0b1.png)\n\n即，样本均值分布的均值等于总体均值。\n\n样本均值的标准差由以下公式给出：\n\n![std4](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_b42526ed4f4c.png)\n\n上述项也称为__标准误__（standard error）。\n\n我们使用上述理论进行 Z 检验。如果样本均值接近总体均值，我们说该样本属于该总体；如果它与总体均值相距甚远，我们说该样本取自不同的总体。\n\n为此，我们使用一个公式并检查 z 统计量是否大于或小于 1.96（考虑双尾检验，显著性水平 = 5%）\n\n![los](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_f6335d6d9baa.gif)\n\n![std5](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_402dc791b297.png)\n \n上述公式给出 Z 统计量\n\nz = z 统计量\n\nX̄ = 样本均值\n\nμ = 总体均值\n\nσ = 总体标准差\n\nn = 样本量\n\n现在，由于 Z 分数用于标准化分布，它为我们提供了数据整体分布情况的概览。\n\n### P 值 (P-values)\n\n它用于根据显著性水平检查结果是否具有统计显著性。\n\n例如，我们进行一项实验并收集观察值或数据。现在，我们提出一个主要假设（零假设 (NULL hypothesis)），以及第二个与第一个相反的假设，称为备择假设 (alternative hypothesis)。\n\n然后我们确定一个显著性水平 (significance level)，作为零假设的阈值。P 值实际上给出了该陈述的概率。例如，如果我们备择假设的 p 值为 0.02，这意味着备择假设发生的概率是 2%。\n\n现在，显著性水平发挥作用，决定我们是否允许 2% 或 0.02 的 p 值。这可以被视为零假设的耐受水平。如果使用双尾检验 (two tailed test)，我们的显著性水平为 5%，我们可以允许分布两端的各 2.5%，我们接受零假设，因为显著性水平 > 备择假设的 p 值。\n\n但如果 p 值大于显著性水平，我们就说结果是__具有统计显著性的 (statistically significant)，并且我们拒绝零假设 (NULL hypothesis)。__。\n\n资源：\n\n1. https:\u002F\u002Fmedium.com\u002Fanalytics-vidhya\u002Feverything-you-should-know-about-p-value-from-scratch-for-data-science-f3c0bfa3c4cc\n\n2. https:\u002F\u002Ftowardsdatascience.com\u002Fp-values-explained-by-data-scientist-f40a746cfc8\n\n3. https:\u002F\u002Fmedium.com\u002Fanalytics-vidhya\u002Fz-test-demystified-f745c57c324c\n\n## 18_ 卡方检验 (Chi2 test)\n\n卡方检验 (Chi2 test) 广泛用于数据科学和机器学习问题中的特征选择 (feature selection)。\n\n卡方检验 (chi-square test) 在统计学中用于测试两个事件的独立性。因此，它用于检查所使用的特征的独立性。经常使用相关的特征，这些特征没有传达太多信息，但增加了特征空间的维度。\n\n它是检查两个或多个分类变量 (categorical variables) 之间关系的最常用方法之一。\n\n它涉及计算一个数字，称为卡方统计量 (chi-square statistic) - χ2。它遵循卡方分布 (chi-square distribution)。\n\n它被表示为期望值与观测值之差除以观测值的求和。\n\n![Chi2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_4e9001423d04.png)\n\n\n资源：\n\n[定义](investopedia.com\u002Fterms\u002Fc\u002Fchi-square-statistic.asp)\n\n[指南 1](https:\u002F\u002Ftowardsdatascience.com\u002Fchi-square-test-for-feature-selection-in-machine-learning-206b1f0b8223)\n\n[指南 2](https:\u002F\u002Fmedium.com\u002Fswlh\u002Fwhat-is-chi-square-test-how-does-it-work-3b7f22c03b01)\n\n[操作示例](https:\u002F\u002Fmedium.com\u002F@kuldeepnpatel\u002Fchi-square-test-of-independence-bafd14028250)\n\n\n## 19_ 估计 (Estimation)\n\n## 20_ 置信区间 (Confid Int (CI))\n\n## 21_ 最大似然估计 (MLE)\n\n## 22_ 核密度估计 (Kernel Density estimate)\n\n在统计学中，核密度估计 (Kernel Density estimation, KDE) 是一种非参数 (non-parametric) 方法来估计随机变量的概率密度函数 (probability density function)。核密度估计是一个基本的数据平滑问题，基于有限的数据样本对总体进行推断。\n\n核密度估计可以被视为表示概率分布的另一种方式。 \n\n![KDE1](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F2\u002F2a\u002FKernel_density.svg\u002F250px-Kernel_density.svg.png)\n\n它包括选择一个核函数 (kernel function)。主要有三种被使用。\n\n1. 高斯 (Gaussian) \n\n2. 箱型 (Box)\n\n3. 三角 (Tri)\n\n核函数描绘了找到一个数据点的概率。因此，它在中心最高，随着远离该点而降低。\n\n我们在所有数据点上分配一个核函数，最后计算函数的密度，以获得分布数据点的密度估计。它实际上是在轴上的某一点累加核函数的值。如下图所示。\n\n![KDE 2](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F4\u002F41\u002FComparison_of_1D_histogram_and_KDE.png\u002F500px-Comparison_of_1D_histogram_and_KDE.png)\n\n现在，核函数由下式给出：\n\n![kde3](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002Ff3b09505158fb06033aabf9b0116c8c07a68bf31)\n\n其中 K 是核——一个非负函数——h > 0 是一个称为带宽 (bandwidth) 的平滑参数。 \n\n'h' 或带宽是曲线变化的参数。\n\n![kde4](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002Fe\u002Fe5\u002FComparison_of_1D_bandwidth_selectors.png\u002F220px-Comparison_of_1D_bandwidth_selectors.png)\n\n来自标准正态分布的 100 个点的随机样本的不同带宽的核密度估计 (KDE)。灰色：真实密度（标准正态）。红色：h=0.05 的 KDE。黑色：h=0.337 的 KDE。绿色：h=2 的 KDE。\n\n资源：\n\n[基础](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=x5zLaWT5KPs)\n\n[高级](https:\u002F\u002Fjakevdp.github.io\u002FPythonDataScienceHandbook\u002F05.13-kernel-density-estimation.html)\n\n## 23_ 回归 (Regression)\n\n回归任务涉及从一组__自变量 (independent variables)__预测__因变量 (dependent variable)__的值。\n\n例如，我们要预测汽车的价格。因此，它成为因变量，设为 Y，而特征如发动机容量、最高速度、类别和公司成为自变量，这有助于构建方程以获得价格。\n\n如果有一个特征，设为 x。如果因变量 y 与 x 线性相关，那么它可以表示为__y=mx+c__，其中 m 是方程中自变量的系数，c 是截距或偏差 (bias)。\n\n图片显示了回归的类型\n\n![types](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_14eb56f13f69.png)\n\n[回归指南](https:\u002F\u002Ftowardsdatascience.com\u002Fa-deep-dive-into-the-concept-of-regression-fb912d427a2e)\n\n## 24_ 协方差 (Covariance)\n\n### 方差 (Variance)\n方差是衡量集合分散或散布程度的指标。如果说方差为零，意味着数据集中的所有元素都相同。如果方差很低，意味着数据略有不同。如果方差非常高，意味着数据集中的数据在很大程度上是不同的。 \n\n数学上，它是衡量数据集中每个值与均值 (mean) 距离的指标。\n\n方差 (sigma^2) 由每个点到均值的距离平方的总和除以点数给出。\n\n![formula var](https:\u002F\u002Fcdn.sciencebuddies.org\u002FFiles\u002F474\u002F9\u002FDefVarEqn.jpg)\n\n### 协方差 (Covariance)\n\n协方差 (Covariance) 让我们了解两个随机变量 (Random Variables) 之间的关联程度。现在，我们知道随机变量会形成分布 (Distributions)。分布是变量所取的一组值或数据点，我们可以很容易地在向量空间 (Vector Space) 中将其表示为向量 (Vectors)。\n\n对于向量，协方差定义为两个向量的点积 (Dot Product)。协方差的值可以从正无穷 (Positive Infinity) 到负无穷 (Negative Infinity) 变化。如果两个分布或向量朝同一方向增长，则协方差为正，反之亦然。符号 (Sign) 给出变化的方向，大小 (Magnitude) 给出变化的幅度。\n\n协方差由下式给出：\n\n![cov_form](https:\u002F\u002Fcdn.corporatefinanceinstitute.com\u002Fassets\u002Fcovariance1.png)\n\n其中 Xi 和 Yi 表示两个分布的第 i 个点，X-bar 和 Y-bar 代表两个分布的均值 (Mean Values)，n 代表分布中的数值或数据点的数量。\n\n## 25_ 相关性 (Correlation)\n\n协方差衡量变量的总体关系，即方向和幅度。相关性 (Correlation) 是协方差的缩放度量。它是无量纲的，且与尺度无关。它仅显示两个变量变化的强度。\n\n从数学上讲，如果我们用向量表示分布，相关性被称为向量之间的夹角余弦 (Cosine Angle)。相关性的值在 +1 到 -1 之间变化。+1 被称为强正相关 (Strong Positive Correlation)，-1 被称为强负相关 (Strong Negative Correlation)。0 意味着无相关性，或者两个变量相互独立 (Independent)。\n\n相关性由下式给出：\n\n![corr](https:\u002F\u002Fcdn.corporatefinanceinstitute.com\u002Fassets\u002Fcovariance3.png)\n\n其中：\n\nρ(X,Y) – 变量 X 和 Y 之间的相关性\n\nCov(X,Y) – 变量 X 和 Y 之间的协方差\n\nσX – X 变量的标准差 (Standard Deviation)\n\nσY – Y 变量的标准差\n\n标准差是方差 (Variance) 的平方根。\n\n## 26_ 皮尔逊系数 (Pearson coeff)\n\n## 27_ 因果关系 (Causation)\n\n## 28_ 最小二乘法拟合 (Least2-fit)\n\n## 29_ 欧几里得距离 (Euclidian Distance)\n\n__欧几里得距离是最常用和标准的两点间距离度量。__\n\n它被定义为两点坐标之差的平方和的平方根。\n\n__欧几里得空间中的两点间的欧几里得距离是一个数，即两点之间线段的长度。它可以使用勾股定理从点的笛卡尔坐标 (Cartesian Coordinates) 计算得出，有时也被称为毕达哥拉斯距离 (Pythagorean Distance)。__\n\n__在欧几里得平面中，设点 p 具有笛卡尔坐标 (p_{1},p_{2})，设点 q 具有坐标 (q_{1},q_{2})。那么 p 和 q 之间的距离由下式给出：__\n\n![eucladian](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F9c0157084fd89f5f3d462efeedc47d3d7aa0b773)\n\n\n# 3_ 编程\n\n## 1_ Python 基础\n\n### 简介\n\nPython 是一种高级编程语言 (High-level Programming Language)。它可以用于广泛的工作领域。\n\n常用于数据科学 (Data-science)，[Python](https:\u002F\u002Fwww.python.org\u002F) 拥有庞大的库 (Libraries) 集合，有助于快速完成工作。\n\n大多数信息系统已经支持 Python，无需安装任何内容。\n\n### 执行脚本\n\n* 将 .py 文件下载到你的计算机上\n* 使其可执行（在 Linux 上使用 _chmod +x file.py_）\n* 打开终端 (Terminal) 并进入包含 Python 文件的目录\n* 使用 _python file.py_ 运行 Python2 或使用 _python3 file.py_ 运行 Python3\n\n## 2_ 在 Excel 中工作\n\n## 3_ R 设置 \u002F R Studio\n\n### 简介\n\nR 是一门专门用于统计和数学可视化的编程语言。\n\n它可以通过终端使用手动创建的脚本，也可以直接在 R 控制台 (R Console) 中使用。\n\n### 安装\n\n#### Linux\n\t\n\tsudo apt-get install r-base\n\t\n\tsudo apt-get install r-base-dev\n\n#### Windows\n\n下载 [CRAN](https:\u002F\u002Fcran.rstudio.com\u002Fbin\u002Fwindows\u002Fbase\u002F) 网站上可用的 .exe 安装包。\n\n### R-studio\n\nRStudio 是 R 的图形界面。它在 [其网站](https:\u002F\u002Fwww.rstudio.com\u002Fproducts\u002Frstudio\u002Fdownload\u002F) 上免费提供。\n\n该界面分为 4 个主要区域：\n\n![rstudio](https:\u002F\u002Fowi.usgs.gov\u002FR\u002Ftraining-curriculum\u002Fintro-curriculum\u002Fstatic\u002Fimg\u002Frstudio.png)\n\n* 左上角是你正在编辑的脚本（高亮你想执行的代码并按 Ctrl + Enter）\n* 左下角是控制台，用于即时执行某些代码行\n* 右上角显示你的环境 (Environment)（变量、历史记录等）\n* 右下角显示你绘制的图表、包 (Packages)、帮助... 代码执行的结果\n\n## 4_ R 基础\n\nR 是由 R 统计计算基金会支持的用于统计计算和图形的开源编程语言和软件环境。\n\nR 语言在统计学家和数据挖掘者中被广泛用于开发统计软件和分析数据。\n\n民意调查、数据挖掘者调查以及学术文献数据库研究表明，近年来 R 的受欢迎程度显著增加。\n\n## 5_ 表达式 (Expressions)\n\n## 6_ 变量 (Variables)\n\n## 7_ IBM SPSS\n\n## 8_ Rapid Miner\n\n## 9_ 向量 (Vectors)\n\n## 10_ 矩阵 (Matrices)\n\n## 11_ 数组 (Arrays)\n\n## 12_ 因子 (Factors)\n\n## 13_ 列表 (Lists)\n\n## 14_ 数据框 (Data frames)\n\n## 15_ 读取 CSV 数据\n\nCSV 是一种在数据科学中常用的__表格数据 (Tabular Data)__格式。大多数结构化数据都将以此格式呈现。\n\n要在 Python 中__打开 CSV 文件__，只需像往常一样打开文件：\n\t\n\traw_file = open('file.csv', 'r')\n\t\n* 'r': 读取，无法修改文件\n* 'w': 写入，每次修改都会擦除文件 \n* 'a': 追加，每次修改将在文件末尾进行\n\n### 如何读取？\n\n大多数情况下，你将逐行解析此文件并对该行执行任何操作。如果你想存储数据以便稍后使用，请构建列表或字典 (Dictionaries)。\n\n要逐行读取此类文件，你可以使用：\n\n* Python [csv 库](https:\u002F\u002Fdocs.python.org\u002F3\u002Flibrary\u002Fcsv.html)\n* Python [open 函数](https:\u002F\u002Fdocs.python.org\u002F2\u002Flibrary\u002Ffunctions.html#open)\n\n## 16_ 读取原始数据\n\n## 17_ 子集数据 (Subsetting data)\n\n## 18_ 操作数据框\n\n## 19_ 函数 (Functions)\n\n函数有助于执行重复操作。\n\n首先，定义函数：\n\n\tdef MyFunction(number):\n\t\t\"\"\"This function will multiply a number by 9\"\"\"\n\t\tnumber = number * 9\n\t\treturn number\n\n## 20_ 因子分析 (Factor analysis)\n\n## 21_ 安装包 (Install PKGS)\n\nPython 实际上有两个主要使用的发行版。Python2 和 Python3。\n\n### 安装 pip\n\nPip 是 Python 的库管理器。因此，你可以轻松地使用一行命令安装大多数包 (Packages)。要安装 pip，只需进入终端并执行：\n\t\n\t# __python2__\n\tsudo apt-get install python-pip\n\t# __python3__\n\tsudo apt-get install python3-pip\n\t\nYou can then install a library with [pip](https:\u002F\u002Fpypi.python.org\u002Fpypi\u002Fpip?) via a terminal doing:\n\n\t# __python2__ \n\tsudo pip install [PCKG_NAME]\n\t# __python3__ \n\tsudo pip3 install [PCKG_NAME]\n\nYou also can install it directly from the core (see 21_install_pkgs.py)\n\n\n# 4_ 机器学习 (Machine learning)\n\n## 1_ 什么是机器学习 (ML)？\n\n### 定义\n\n机器学习是人工智能研究的一部分。它涉及复杂方法的构思、开发及实现，使机器能够完成极其困难的任务，这些任务用经典算法几乎无法解决。\n\n机器学习主要由三种算法组成：\n\n![ml](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_e265e9615e06.png)\n\n### 应用示例\n\n* 计算机视觉\n* 搜索引擎\n* 金融分析\n* 文档分类\n* 音乐生成\n* 机器人技术 ...\n\n## 2_ 数值变量\n\n可以取连续整数或实数值的变量。它们可以取无限多个值。\n\n这类变量主要用于涉及测量的特征。例如，班级中所有学生的身高。\n\n## 3_ 类别变量\n\n取有限离散值的变量。它们取一组固定值，用于对数据项进行分类。\n\n它们类似于分配的标签。例如：根据性别标记班级学生：'Male' 和 'Female'\n\n## 4_ 监督学习\n\n监督学习是从__标记的训练数据__推断函数的机器学习任务。 \n\n训练数据由__一组训练示例__组成。 \n\n在监督学习中，每个示例是一个包含输入对象（通常是向量）和期望输出值（也称为监督信号）的对。 \n\n监督学习算法分析训练数据并生成一个推断函数，可用于映射新示例。 \n\n换句话说：\n\n监督学习从一组标记的示例中学习。从实例和标签中，监督学习模型试图找到用于描述实例的特征之间的相关性，并学习每个特征如何贡献于对应实例的标签。当接收到未见过的实例时，监督学习的目标是根据其特征正确标记该实例。\n\n__最佳情况将允许算法正确确定未见实例的类别标签__。\n\n## 5_ 无监督学习\n\n无监督机器学习是从__“未标记”数据__推断函数以描述隐藏结构的机器学习任务（观察中不包含分类或归类）。 \n\n由于提供给学习者的示例是未标记的，因此无法评估相关算法输出的结构的准确性——这是区分无监督学习与监督学习和强化学习的一种方式。\n\n无监督学习仅处理数据实例。这种方法尝试根据特征的相似性对数据进行分组并形成聚类。如果两个实例具有相似的特征并在特征空间中彼此靠近，那么这两个实例属于同一聚类的可能性很高。当获得一个未见过的实例时，算法将尝试根据其特征找出该实例应属于哪个聚类。\n\n资源：\n\n[无监督学习指南](https:\u002F\u002Ftowardsdatascience.com\u002Fa-dive-into-unsupervised-learning-bf1d6b5f02a7)\n\n## 6_ 概念、输入和属性\n\n机器学习问题将数据集的特征作为输入。\n\n对于监督学习，模型在数据上进行训练，然后准备执行。因此，对于监督学习，除了特征外，我们还需要输入数据点对应的标签，以便让模型在这些标签上进行训练。\n\n对于无监督学习，模型只需引用数据项之间的复杂关系并相应地对它们进行分组即可执行。因此，无监督学习不需要标记的数据集。输入仅是数据集的特征部分。\n\n## 7_ 训练和测试数据\n\n如果我们使用数据集训练监督机器学习模型，模型会非常深入地捕捉该特定数据集的依赖关系。因此，模型在该数据上的表现总是很好，但这并不能正确衡量模型的实际表现。 \n\n为了了解模型的表现如何，我们必须使用不同的数据集来训练和测试模型。我们用来训练模型的数据集称为训练集，我们用来测试模型的数据集称为测试集。\n\n我们通常分割提供的数据集以创建训练集和测试集。分割比例主要是：3:7 或 2:8，取决于数据，较大的是训练数据。\n\n#### sklearn.model_selection.train_test_split 用于分割数据。\n\n语法：\n\n    from sklearn.model_selection import train_test_split\n    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.33, random_state=42)\n  \n[Sklearn 文档](https:\u002F\u002Fscikit-learn.org\u002Fstable\u002Fmodules\u002Fgenerated\u002Fsklearn.model_selection.train_test_split.html)\n\n## 8_ 分类器\n\n分类是最重要的也是最常见的机器学习问题。分类问题既可以是监督问题也可以是无监督问题。\n\n分类问题涉及根据特定数据点对应的特征集，将数据点标记为属于特定类别。\n\n分类任务可以使用机器学习和深度学习技术执行。\n\n机器学习分类技术包括：逻辑回归、SVM（支持向量机）以及分类树。用于执行分类的模型称为分类器。\n\n## 9_ 预测\n\n机器学习模型针对特定问题生成的输出称为其预测。 \n\n主要有两种类型的预测对应两种类型的问题： \n\n1. 分类\n\n2. 回归\n\n在分类中，预测通常是一个类别或标签，数据点属于该类。\n\n在回归中，预测是一个数字，一个连续的数值，因为回归问题涉及预测数值。例如，预测房屋价格。\n\n## 10_ 提升\n\n## 11_ 过拟合\n\n我们经常过度训练模型或使模型过于复杂，导致模型与训练数据拟合得过于紧密。\n\n训练数据通常包含异常值或代表数据中的误导模式。如此深度地拟合包含此类不规则性的训练数据会导致模型失去泛化能力。模型在训练集上表现非常好，但在测试集上表现不佳。 \n\n![overfitting](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_9e198f756486.png)\n\n正如我们所见，随着训练进一步深入，训练误差减小而测试误差增加。\n\n如果存在另一个假设 h，使得 h 在训练数据上的误差比 h1 大，而在测试数据上的误差比 h1 小，则称假设 h1 过拟合。\n\n## 12_ 偏差与方差\n\n偏差 (Bias) 是指模型的平均预测值与我们试图预测的正确值之间的差异。高偏差 (High Bias) 模型对训练数据关注很少，并且过度简化了模型。这总是导致在训练数据和测试数据上出现高误差。\n\n方差 (Variance) 是给定数据点或值的模型预测的变异性，它告诉我们数据的分布情况。高方差 (High Variance) 模型非常关注训练数据，并且在未见过的数据上无法泛化。因此，这类模型在训练数据上表现很好，但在测试数据上的错误率很高。\n\n基本上，高方差会导致过拟合 (Overfitting)，高偏差会导致欠拟合 (Underfitting)。我们希望我们的模型具有低偏差和低方差，以表现完美。我们需要避免具有高方差和高偏差的模型。\n\n![bias&variance](https:\u002F\u002Fcommunity.alteryx.com\u002Ft5\u002Fimage\u002Fserverpage\u002Fimage-id\u002F52874iE986B6E19F3248CF?v=1.0)\n\n我们可以看到，对于低偏差和低方差，我们的模型能正确预测所有数据点。同样，在最后那张图中，由于存在高偏差和高方差，模型没有正确预测任何数据点。\n\n![B&v2](https:\u002F\u002Fadolfoeliazat.com\u002Fwp-content\u002Fuploads\u002F2020\u002F07\u002FBias-Variance-tradeoff-in-Machine-Learning.png)\n\n从图中可以看出，当模型过于复杂或过于简单时，误差都会增加。偏差随着模型变简单而增加，方差随着模型变复杂而增加。\n\n这是机器学习 (Machine Learning) 中最重要的权衡之一。\n\n## 13_ 树与分类\n\n我们之前讨论过分类问题。我们看到最常用的方法是逻辑回归 (Logistic Regression)、支持向量机 (SVMs) 和决策树。现在，如果决策边界 (Decision Boundary) 是线性的，像逻辑回归和支持向量机这样的方法效果最好，但当决策边界是非线性时，这是一个完全不同的场景，这时就会使用决策树。\n\n![tree](https:\u002F\u002Fwww.researchgate.net\u002Fprofile\u002FZena_Hira\u002Fpublication\u002F279274803\u002Ffigure\u002Ffig4\u002FAS:324752402075653@1454438414424\u002FLinear-versus-nonlinear-classification-problems.png)\n\n第一张图显示了线性决策边界，第二张图显示了非线性决策边界。\n\n在这些情况下，对于非线性边界，基于条件的决策树方法在分类问题上工作得很好。该算法根据特征创建条件来驱动并做出决策，因此独立于函数形式。\n\n![tree2](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_6afa84ccc72a.png)\n\n用于分类的决策树方法\n\n## 14_ 分类率\n\n## 15_ 决策树\n\n决策树是最常用的机器学习 (Machine Learning) 算法之一。它们既用于分类也用于回归。它们可用于线性和非线性数据，但主要用于非线性数据。顾名思义，决策树基于从数据及其行为中得出的一组决策进行工作。它不使用线性分类器或回归器，因此其性能独立于数据的线性性质。\n\n使用树模型的另一个最重要原因是它们非常容易解释。\n\n决策树既可用于分类也可用于回归。方法论略有不同，但原理相同。决策树使用 CART 算法（分类与回归树，Classification and Regression Trees）。\n\n资源：\n\n[Guide to Decision Tree](https:\u002F\u002Ftowardsdatascience.com\u002Fa-dive-into-decision-trees-a128923c9298)\n\n## 16_ Boosting（提升法）\n\n#### 集成学习 (Ensemble Learning)\n\n这是一种通过组合多个模型或弱学习器 (weak learners) 来增强机器学习 (Machine learning) 模型性能的方法。它们提供了改进的效率。\n\n集成学习主要有两种类型：\n\n__1. 并行集成学习 (Parallel ensemble learning) 或 Bagging 方法__\n\n__2. 顺序集成学习 (Sequential ensemble learning) 或 Boosting 方法__\n\n在并行方法或 Bagging 技术中，多个弱分类器是并行创建的。训练数据集是基于自助采样法 (bootstrapping) 从原始数据集中随机创建的。用于训练和创建阶段的数据集是弱分类器。随后在预测期间，所有分类器的结果被打包在一起以提供最终结果。\n\n![bag](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_e35b812a46d7.png)\n\n示例：随机森林 (Random Forests)\n\n在顺序学习或 Boosting 中，弱学习器是一个接一个创建的，并且数据样本集的权重被调整，以便在创建过程中，下一个学习器专注于前一个分类器错误预测的样本。因此，在每个步骤中，分类器都会改进并从其之前的错误或误分类中学习。\n\n![boosting](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_45908f0b9604.jpg)\n\n主要有三种 Boosting 算法：\n\n__1. Adaboost__\n\n__2. 梯度提升 (Gradient Boosting)__\n\n__3. XGBoost__\n\n__Adaboost__ 算法的工作原理如下所述。它创建一个弱学习器，也称为桩 (stumps)，它们不是完整的树，而是包含单个节点，基于该节点进行分类。观察误分类情况，并在训练下一个弱学习器时，给予它们比正确分类的更高的权重。 \n\n__sklearn.ensemble.AdaBoostClassifier__ 用于在 Python 中将分类器应用于真实数据。\n\n![adaboost](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_c97e86a66999.jpg)\n\n资源：\n\n[理解](https:\u002F\u002Fblog.paperspace.com\u002Fadaboost-optimizer\u002F#:~:text=AdaBoost%20is%20an%20ensemble%20learning,turn%20them%20into%20strong%20ones.)\n\n\n__梯度提升 (Gradient Boosting)__ 算法从一个输出为 0.5 的节点开始，适用于分类和回归。它充当第一个桩或弱学习器。然后我们观察预测中的误差。现在，我们创建其他学习器或决策树，实际上根据条件预测误差。这些误差称为残差 (Residuals)。我们的最终输出是：\n\n__0.5（由第一个学习器提供）+ 第二个树或学习器提供的误差.__\n\n现在，如果我们使用这种方法，它会过于紧密地学习预测，并失去泛化能力 (generalization)。为了避免这种情况，梯度提升使用学习参数 _alpha_。 \n\n因此，两个学习器后的最终结果计算如下：\n\n__0.5（由第一个学习器提供）+ _alpha_ X（第二个树或学习器提供的误差。）__\n\n我们可以看到，通过使用添加的部分，我们向正确结果迈出了一小步。我们继续添加学习器，直到非常接近训练集给出的实际值。\n\n总体而言，方程变为：\n\n\n__0.5（由第一个学习器提供）+ _alpha_ X（第二个树或学习器提供的误差。）+ _alpha_ X（第三个树或学习器提供的误差。）+.............__\n\n\n__sklearn.ensemble.GradientBoostingClassifier__ 用于在 Python 中应用梯度提升\n\n![GBM](https:\u002F\u002Fwww.elasticfeed.com\u002Fwp-content\u002Fuploads\u002F09cc1168a39db0c0d6ea1c66d27ecfd3.jpg)\n\n资源：\n\n[指南](https:\u002F\u002Fmedium.com\u002Fmlreview\u002Fgradient-boosting-from-scratch-1e317ae4587d) \n\n## 17_ 朴素贝叶斯分类器 (Naive Bayes classifiers)\n\n朴素贝叶斯分类器是一组基于 __贝叶斯定理 (Bayes' Theorem)__ 的分类算法。\n\n贝叶斯定理描述了一个事件发生的概率，基于可能与该事件相关的条件的先前知识。公式如下：\n\n![bayes](https:\u002F\u002Fwikimedia.org\u002Fapi\u002Frest_v1\u002Fmedia\u002Fmath\u002Frender\u002Fsvg\u002F87c061fe1c7430a5201eef3fa50f9d00eac78810)\n\n其中 P(A|B) 是在已知 B 已经发生的情况下 A 发生的概率，P(B|A) 是在已知 A 发生的情况下 B 发生的概率。\n\n[Scikit-learn 指南](https:\u002F\u002Fgithub.com\u002Fabr-98\u002Fdata-scientist-roadmap\u002Fedit\u002Fmaster\u002F04_Machine-Learning\u002FREADME.md)\n\n主要有两种类型的朴素贝叶斯：\n\n__1. 高斯朴素贝叶斯 (Gaussian Naive Bayes)__\n\n__2. 多项式朴素贝叶斯 (Multinomial Naive Bayes).__\n\n#### 多项式朴素贝叶斯 (Multinomial Naive Bayes)\n\n该方法主要用于文档分类。例如，将文章分类为体育文章或电影杂志。它也用于区分真实邮件和垃圾邮件。它使用不同杂志中使用的单词频率来做决定。\n\n例如，单词“亲爱的”和“朋友们”在实际邮件中使用很多，而“优惠”和“钱”在“垃圾”邮件中使用很多。它使用训练示例计算实际邮件和垃圾邮件中单词出现的概率。因此，“钱”在垃圾邮件中出现的概率要高得多，以此类推。 \n\n现在，我们使用其中单词的出现来计算邮件是垃圾邮件的概率。 \n\n#### 高斯朴素贝叶斯 (Gaussian Naive Bayes)\n\n当预测变量取连续值而不是离散值时，我们假设这些值是从高斯分布 (gaussian distribution) 中采样的。\n\n![gnb](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_018b143af3e4.gif)\n\n它将高斯分布与贝叶斯定理联系起来。 \n\n资源：\n\n[指南](https:\u002F\u002Fyoutu.be\u002FH3EjCKtlVog)\n\n## 18_ K-近邻算法 (K-Nearest Neighbor)\n\nK-近邻算法是最基本且仍然重要的算法。它是一种基于内存的方法，而不是基于模型的方法。 \n\nKNN 既用于监督学习 (supervised learning) 也用于无监督学习 (unsupervised learning)。它简单地定位特征空间 (feature space) 中的数据点，并使用距离作为相似性度量标准 (similarity metrics)。\n\n两个数据点之间的距离越小，点就越相似。 \n\n在 K-NN 分类算法中，要分类的点绘制在特征空间中，并根据其最近的 K 个邻居的类别进行分类。K 是用户参数。它给出了我们在决定相关点的标签时应考虑多少点的度量。如果 K 大于 1，我们考虑多数标签。\n\n如果数据集非常大，我们可以使用较大的 k。较大的 k 受噪声影响较小，并生成平滑的边界。对于小数据集，必须使用较小的 k。较小的 k 有助于更好地注意边界的差异。\n\n![knn](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_9483bd7d8e73.jpg)\n\n资源：\n\n[指南](https:\u002F\u002Ftowardsdatascience.com\u002Fmachine-learning-basics-with-the-k-nearest-neighbors-algorithm-6a6e71d01761)\n\n## 19_ 逻辑回归\n\n回归是机器学习中最重要的概念之一。\n\n[回归指南](https:\u002F\u002Ftowardsdatascience.com\u002Fa-deep-dive-into-the-concept-of-regression-fb912d427a2e)\n\n逻辑回归是最常用于线性可分数据的分类算法。当因变量为分类变量时使用逻辑回归。\n\n它使用线性回归方程：\n\n__Y= w1x1+w2x2+w3x3……..wkxk__\n\n以修改后的格式：\n\n__Y= 1\u002F 1+e^-(w1x1+w2x2+w3x3……..wkxk)__\n\n此修改确保值始终保持在 0 和 1 之间。因此，使其可用于分类。\n\n上述方程被称为 __Sigmoid（S 形）__ 函数。该函数的样子如下：\n\n![Logreg](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_0f27d141e12c.png)\n\n使用的损失函数称为 logloss 或二元交叉熵。\n\n__Loss= —Y_actual. log(h(x)) —(1 — Y_actual.log(1 — h(x)))__\n\n如果 Y_actual=1，第一部分给出误差，否则第二部分。\n\n![loss](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_39730dbde797.png)\n\n逻辑回归也用于多分类任务。它使用 Softmax 回归或一对多（One-vs-all）逻辑回归。\n\n[逻辑回归指南](https:\u002F\u002Ftowardsdatascience.com\u002Flogistic-regression-detailed-overview-46c4da4303bc)\n\n`__sklearn.linear_model.LogisticRegression__` 用于在 Python 中应用逻辑回归。\n\n## 20_ 排序\n\n## 21_ 线性回归\n\n回归任务涉及从一组自变量（即提供的特征）预测因变量的值。例如，我们想预测汽车的价格。那么，价格成为因变量（设为 Y），而像发动机排量、最高速度、车型和厂商等特征成为自变量，这有助于构建方程以获得价格。\n\n现在，如果有一个特征，设为 x。如果因变量 y 与 x 线性相关，则可以用 y=mx+c 表示，其中 m 是方程中特征的系数，c 是截距或偏置。M 和 C 都是模型参数。\n\n我们使用一种称为均方误差（MSE）的损失函数或代价函数。它由因变量的实际值与预测值之差的平方给出。\n\n__MSE=1\u002F2m * (Y_actual — Y_pred)²__\n\n如果我们观察该函数，会发现它是一个抛物线，即该函数本质上是凸的。这种凸函数是 __梯度下降（Gradient Descent）__ 中用于获取模型参数值的原理。\n\n![loss](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_e1dfc17ebb6a.png)\n\n图片展示了损失函数。\n\n为了获得模型参数的正确估计，我们使用 __梯度下降（Gradient Descent）__ 方法。\n\n[梯度下降指南](https:\u002F\u002Ftowardsdatascience.com\u002Fan-introduction-to-gradient-descent-and-backpropagation-81648bdb19b2)\n\n[线性回归指南](https:\u002F\u002Ftowardsdatascience.com\u002Flinear-regression-detailed-view-ea73175f6e86)\n\n`__sklearn.linear_model.LinearRegression__` 用于在 Python 中应用线性回归。\n\n## 22_ 感知机\n\n感知机是 50 年代描述的第一个模型。\n\n这是一个 __二分类器（Binary Classifier）__，即它不能分离超过 2 个组，且这些组必须是 __线性可分（Linearly Separable）__ 的。\n\n感知机 __的工作原理类似于生物神经元__。它计算一个激活值，如果该值为正，则返回 1，否则返回 0。\n\n## 23_ 层次聚类\n\n层次算法之所以得名，是因为它们创建树状结构来形成簇。这些算法还使用基于距离的方法进行簇创建。\n\n最流行的算法有：\n\n__凝聚式层次聚类（Agglomerative Hierarchical clustering）__\n\n__分裂式层次聚类（Divisive Hierarchical clustering）__\n\n__凝聚式层次聚类（Agglomerative Hierarchical clustering）__：在这种类型的层次聚类中，每个点最初作为一个簇开始，然后最近或最相似的簇逐渐合并形成一个簇。\n\n__分裂式层次聚类（Divisive Hierarchical Clustering）__：这种类型的层次聚类正好与凝聚式聚类相反。在这种类型中，所有点最初作为一个大簇开始，然后根据两个簇之间的距离大小或相似度高低，簇逐渐被划分为更小的簇。我们不断划分簇，直到所有点都成为单独的簇。\n\n对于凝聚式聚类，我们不断合并最近或具有较高相似性得分的簇到一个簇中。因此，如果我们为合并定义一个截止或阈值得分，我们将得到多个簇而不是单个簇。例如，如果我们将阈值相似性指标得分设为 0.5，这意味着如果在找不到两个相似性得分低于 0.5 的簇时，算法将停止合并簇，该步骤存在的簇数量即为最终需要创建的簇的数量。\n\n同样，对于分裂式聚类，我们根据最低相似性得分来划分簇。因此，如果我们将得分定义为 0.5，如果两个簇之间的相似性得分小于或等于 0.5，它将停止划分或分割。我们将剩下一些簇，而不会减少到分布中的每一个点。\n\n过程如下图所示：\n\n![HC](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_31d239255082.png)\n\n测量距离和应用截止的最常用方法之一是树状图（Dendrogram）方法。\n\n上述聚类的树状图为：\n\n![Dend](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_d927a9f3404b.png)\n\n[指南](https:\u002F\u002Ftowardsdatascience.com\u002Funderstanding-the-concept-of-hierarchical-clustering-technique-c6e8243758ec)\n\n## 24_ K-means 聚类算法\n\n该算法最初使用 N 个数据点随机创建 K 个簇（clusters），并为每个簇计算簇内所有点值的平均值。因此，对于每个簇，我们通过计算簇中值的均值来找到一个中心点或质心（centroid）。然后，算法计算每个簇的平方误差和（Sum of Squared Error, SSE）。SSE 用于衡量簇的质量。如果一个簇中点与中心之间的距离较大，则 SSE 值会较高；如果检查其解释，它只允许在近距离范围内的点形成簇。\n\n该算法基于这样一个原则：位于簇中心附近的点应该属于该簇。因此，如果点 x 比簇 B 更接近簇 A 的中心，那么 x 将属于簇 A。于是，一个点进入一个簇，即使只有一个点从一个簇移动到另一个簇，质心也会改变，SSE 也随之改变。我们不断重复此过程，直到 SSE 减小且质心不再发生变化。经过一定次数的移动后，找到最优簇，当质心不再变化时，移动停止。\n\n初始簇的数量 'K' 是一个用户参数。\n\n图片展示了该方法：\n\n![Kmeans](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_92940bd6fb81.png)\n\n我们已经看到，对于这种类型的聚类技术，我们需要一个用户定义的参数 'K'，它定义了需要创建的簇的数量。现在，这是一个非常重要的参数。为了找到这个参数，人们使用了多种方法。最重要和最常用的方法是肘部法则（Elbow Method）。\n对于较小的数据集，k = (N\u002F2)^(1\u002F2)，即分布中点数的一半的平方根。\n\n[指南](https:\u002F\u002Ftowardsdatascience.com\u002Funderstanding-k-means-clustering-in-machine-learning-6a6e67336aa1)\n\n## 25_ 神经网络\n\n神经网络是一组相互连接的人工神经元或节点（Nodes）层。它们是框架，设计时考虑了人脑的结构和工作方式。它们旨在用于预测建模（predictive modeling）和应用，可以通过数据集进行训练。它们基于自学习算法（self-learning algorithms），并根据从训练信息集中得出的结论和复杂关系进行预测。\n\n典型的神经网络具有多层。第一层称为输入层（Input Layer），最后一层称为输出层（Output Layer）。输入层和输出层之间的层称为隐藏层（Hidden Layers）。它基本上像一个黑盒（Black Box）用于预测和分类。所有层都是相互连接的，由许多称为节点的（Nodes）人工神经元组成。\n\n[神经网络指南](https:\u002F\u002Fmedium.com\u002Fai-in-plain-english\u002Fneural-networks-overview-e6ea484a474e)\n\n神经网络过于复杂，无法直接在梯度下降（Gradient Descent）算法上工作，因此它基于反向传播（Backpropagation）和优化器（Optimizers）的原理工作。\n\n[反向传播指南](https:\u002F\u002Ftowardsdatascience.com\u002Fan-introduction-to-gradient-descent-and-backpropagation-81648bdb19b2)\n\n[优化器指南](https:\u002F\u002Ftowardsdatascience.com\u002Fintroduction-to-gradient-descent-weight-initiation-and-optimizers-ee9ae212723f)\n\n## 26_ 情感分析\n\n文本分类和情感分析是一个非常常见的机器学习问题，广泛应用于产品预测、电影推荐等多种活动。\n\n像情感分析这样的文本分类问题可以通过多种方式实现，使用多种算法。这些主要分为两大类：\n\n词袋模型（Bag of Words Model）：在这种情况下，数据集中的所有句子都被分词化，形成一个表示词汇的词袋。现在，数据集中的每个单独的句子或样本都由该词袋向量表示。这个向量称为特征向量（feature vector）。例如，“今天阳光明媚”和“太阳从东方升起”是两个句子。词袋将是这两个句子中所有唯一的单词。\n\n第二种方法基于时间序列方法（time series approach）：这里每个单词都表示为一个单独的向量。因此，句子被表示为向量的向量。\n\n[情感分析指南](https:\u002F\u002Ftowardsdatascience.com\u002Fa-guide-to-text-classification-and-sentiment-analysis-2ab021796317)\n\n## 27_ 协同过滤\n\n我们都使用过 Netflix、Amazon 和 Youtube 等服务。这些服务使用非常复杂的系统来为用户推荐最佳项目，以提升他们的体验。\n\n推荐系统主要由 3 个组件组成，其中主要组件之一是候选生成（Candidate generation）。该方法负责从数千个项目的巨大池中生成较小的候选子集以推荐给某个用户。\n\n候选生成系统的类型：\n\n__基于内容的过滤系统__\n\n__协同过滤系统__\n\n__基于内容的过滤系统__：基于内容的推荐系统试图根据用户积极反应的项目的特征，猜测用户的特征或行为。\n\n__协同过滤系统__：协同过滤不需要给出项目的特征。每个用户和项目都由特征向量或嵌入（embedding）描述。\n\n它为所有用户和项目创建嵌入。它将用户和项目嵌入到同一个嵌入空间中。\n\n它在推荐特定用户时考虑其他用户的反应。它记录特定用户喜欢哪些项目，以及那些行为和喜好与该用户相似的用户喜欢哪些项目，从而向该用户推荐项目。\n\n它收集用户对不同项目的反馈，并将其用于推荐。\n\n[协同过滤指南](https:\u002F\u002Ftowardsdatascience.com\u002Fintroduction-to-recommender-systems-1-971bd274f421)\n\n## 28_ 标签\n\n## 29_ 支持向量机\n\n支持向量机（Support Vector Machine）既用于分类也用于回归（Regression）。\n\nSVM 在其分类器或回归器周围使用间隔（Margin）。间隔为模型及其性能提供了额外的鲁棒性和准确性。\n\n![SVM](https:\u002F\u002Fupload.wikimedia.org\u002Fwikipedia\u002Fcommons\u002Fthumb\u002F7\u002F72\u002FSVM_margin.png\u002F300px-SVM_margin.png)\n\n上图描述了一个 SVM 分类器。红线是实际分类器，虚线显示边界。位于边界上的点实际上决定了间隔。它们支撑分类器的间隔，因此被称为__支持向量__（Support Vectors）。\n\n分类器与最近点之间的距离称为__间隔距离__（Marginal Distance）。\n\n可能存在多个分类器，但我们选择具有最大间隔距离的那个。因此，间隔距离和支持向量有助于选择最佳分类器。\n\n[来自 Sklearn 的官方文档](https:\u002F\u002Fscikit-learn.org\u002Fstable\u002Fmodules\u002Fsvm.html)\n\n[SVM 指南](https:\u002F\u002Ftowardsdatascience.com\u002Fsupport-vector-machine-introduction-to-machine-learning-algorithms-934a444fca47)\n\n## 30_强化学习\n\n“强化学习（Reinforcement Learning，简称 RL）是机器学习的一个领域，关注软件代理应如何在环境中采取行动，以最大化累积奖励的概念。”\n\n为了赢得游戏，我们需要在游戏过程中做出多次选择和预测以取得成功，因此它们可以被称为多决策过程。这就是我们需要一种称为强化学习算法的算法类型的原因。这类算法基于决策链，使此类算法能够支持多决策过程。\n\n强化算法可用于从起始状态到达目标状态，并据此做出决策。\n\n强化学习涉及一个自我学习的智能体。如果它做出了正确或良好的移动使其朝向目标，它会得到正向奖励，否则不会。通过这种方式，智能体进行学习。\n\n![reinforced](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_a9b482400aa0.png)\n\n上图展示了强化学习的设置。\n\n[WIKI](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FReinforcement_learning#:~:text=Reinforcement%20learning%20(RL)%20is%20an,supervised%20learning%20and%20unsupervised%20learning.)\n\n# 5_ 文本挖掘\n\n## 1_ 语料库\n\n## 2_ 命名实体识别\n\n## 3_ 文本分析\n\n## 4_ UIMA\n\n## 5_ 词文档矩阵\n\n## 6_ 词频与权重\n\n## 7_ 支持向量机 (SVM)\n\n## 8_ 关联规则\n\n## 9_ 基于市场的分析\n\n## 10_ 特征提取\n\n## 11_ 使用 Mahout\n\n## 12_ 使用 Weka\n\n## 13_ 使用 NLTK\n\n## 14_ 分类文本\n\n## 15_ 词汇映射\n\n# 6_ 数据可视化\n\n在 Rstudio 中打开 .R 脚本以逐行执行。\n\n有关安装信息，请参见 [10_工具箱\u002F3_R, Rstudio, Rattle](https:\u002F\u002Fgithub.com\u002FMrMimic\u002Fdata-scientist-roadmap\u002Ftree\u002Fmaster\u002F10_Toolbox#3_-r-rstudio-rattle)。\n\n## 1_ R 中的数据探索\n\n在数学中，函数 f 的图像是所有有序对 (x, f(x)) 的集合。如果函数输入 x 是标量，则图像是一个二维图，对于连续函数则是一条曲线。如果函数输入 x 是实数的有序对 (x1, x2)，则图像是所有有序三元组 (x1, x2, f(x1, x2)) 的集合，对于连续函数则是一个曲面。\n\n## 2_ 单变量、双变量与多变量可视化\n\n### 单变量\n\n该术语常用于统计学中，以区分单个变量的分布与多个变量的分布，尽管它也可以在其他方面应用。例如，单变量数据由单个标量分量组成。在时间序列分析中，该术语应用于整个时间序列作为所指对象：因此，单变量时间序列指的是单一数量随时间变化的值集。\n\n### 双变量\n\n双变量分析是最简单的定量（统计）分析形式之一 [1]。它涉及对两个变量（通常表示为 X, Y）的分析，以确定它们之间的经验关系。\n\n### 多变量\n\n多变量分析（MVA）基于多变量统计学的统计原理，涉及同时观察和分析多个统计结果变量。在设计和分析中，该技术用于在考虑所有变量对感兴趣响应的影响的同时，跨多个维度执行权衡研究。\n\n## 3_ ggplot2\n\n### 简介\n\nggplot2 是 R 语言的绘图系统，基于图形语法（grammar of graphics），它试图吸取基础图形和 lattice 图形的优点，摒弃其缺点。它处理了许多使绘图变得繁琐的细节（如图例绘制），同时也提供了一种强大的图形模型，使得生成复杂的多层图形变得容易。\n\n[http:\u002F\u002Fggplot2.org\u002F](http:\u002F\u002Fggplot2.org\u002F)\n\n### 文档\n\n### 示例\n\n[http:\u002F\u002Fr4stats.com\u002Fexamples\u002Fgraphics-ggplot2\u002F](http:\u002F\u002Fr4stats.com\u002Fexamples\u002Fgraphics-ggplot2\u002F)\n\n## 4_ 直方图与饼图（单变量）\n\n### 简介\n\n直方图和饼图是两种用于可视化频率的图表类型。\n\n直方图显示这些频率在各类别中的分布，而饼图显示这些频率在 100% 圆中的相对比例。\n\n## 5_ 树图与树状地图\n\n### 简介\n\n[树状地图](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTreemapping) 将分层（树形结构）数据显示为一组嵌套矩形。\n树的每个分支都被分配一个矩形，然后该矩形被平铺以代表子分支的较小矩形。\n叶节点的矩形面积与数据的指定维度成比例。\n通常，叶节点会着色以显示数据的另一个维度。\n\n### 何时使用？\n\n- 分支少于 10 个。\n- 正值。\n- 可视化空间有限。\n\n### 示例\n\n![treemap-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_efbe0ca145c1.png)\n\n此树状地图描述了每个产品宇宙的总体积及其对应的表面积。液体产品的销量高于其他产品。\n如果您想进一步探索，我们可以进入“液体”产品类别，找出客户更偏好哪些货架。\n\n### 更多信息\n\n[Matplotlib Series 5: Treemap](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series5-treemap\u002F)\n\n## 6_ 散点图\n\n### 简介\n\n[散点图](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FScatter_plot)（也称为散点图、散点图、散点图或散点图）是一种使用笛卡尔坐标系来显示一组数据中通常两个变量值的图表或数学图。\n\n### 何时使用？\n\n当您想要显示两个变量之间的关系时使用散点图。\n散点图有时被称为相关性图，因为它们显示了两个变量是如何相关的。\n\n### 示例\n\n![scatter-plot-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_bcb0e1a441cb.png)\n\n该图描述了商店表面积与其营业额（千欧元）之间的正相关关系，这是合理的：对于商店来说，店面越大，能接受的客户越多，产生的营业额也就越高。\n\n### 更多信息\n\n[Matplotlib Series 4: Scatter plot](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series4-scatter-plot\u002F)\n\n## 7_ 折线图\n\n### 简介\n\n[折线图](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLine_chart) 或线形图是一种图表类型，它以一系列称为“标记”的数据点的形式显示信息，并通过直线段连接。折线图通常用于可视化随时间间隔变化的趋势——即时间序列——因此线条通常是按时间顺序绘制的。\n\n### 何时使用？\n\n- 跟踪随时间的变化。\n- X 轴显示连续变量。\n- Y 轴显示测量值。\n\n### 示例\n\n![line-chart-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_1ffcdf965ff5.png)\n\n假设上图描述了一年中冰淇淋销售额的营业额（千欧元）。\n根据图表，我们可以清楚地发现销售额在夏季达到峰值，然后从秋季到冬季下降，这是合乎逻辑的。\n\n### 更多信息\n\n[Matplotlib 系列 2：折线图](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series2-line-chart\u002F)\n\n## 8_ 空间图表\n\n## 9_ 调查图\n\n## 10_ 时间线\n\n## 11_ 决策树\n\n## 12_ D3.js\n\n### 关于\n\n这是一个 JavaScript 库，允许您轻松创建大量不同的图表。\n\nhttps:\u002F\u002Fd3js.org\u002F\n\n    D3.js 是一个基于数据操作文档的 JavaScript 库。 \n    D3 帮助您使用 HTML、SVG 和 CSS 让数据活起来。 \n    D3 对网络标准的强调使您能够充分利用现代浏览器的全部功能，而无需绑定到专有框架，结合了强大的可视化组件和数据驱动的 DOM 操作方法。 \n\n### 示例\n\n在 [D3 的 Github](https:\u002F\u002Fgithub.com\u002Fd3\u002Fd3\u002Fwiki\u002FGallery) 上有很多使用 D3.js 的图表示例。\n\n## 13_ 信息可视化\n\n## 14_ IBM ManyEyes\n\n## 15_ Tableau\n\n## 16_ 维恩图\n\n### 关于\n\n一个 [维恩图 (Venn diagram)](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FVenn_diagram)（也称为主要图、集合图或逻辑图）是一种显示有限个不同集合之间所有可能逻辑关系的图表。\n\n### 何时使用？\n\n显示不同组之间的逻辑关系（交集、差集、并集）。\n\n### 示例\n\n![venn-diagram-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_45aa3a103bef.png)\n\n这种维恩图通常可用于零售交易。\n假设我们需要研究奶酪和红葡萄酒的受欢迎程度，并且有 2500 位客户回答了我们的问卷。\n根据上图，我们发现，在 2500 位客户中，900 位客户 (36%) 喜欢奶酪，1200 位客户 (48%) 喜欢红葡萄酒，400 位客户 (16%) 同时喜欢这两种产品。\n\n### 更多信息\n\n[Matplotlib 系列 6：维恩图](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series6-venn-diagram\u002F)\n\n## 17_ 面积图\n\n### 关于\n\n一个 [面积图 (Area chart)](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FArea_chart) 或面积图以图形方式显示定量数据。\n它基于折线图。轴和线之间的区域通常用颜色、纹理和阴影来强调。\n\n### 何时使用？\n\n显示或比较随时间变化的定量进展。\n\n### 示例\n\n![area-chart-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_ce048fe913d8.png)\n\n此堆叠面积图显示了每个账户的金额变化，以及它们对总金额（按价值计算）的贡献。\n\n### 更多信息\n\n[Matplotlib 系列 7：面积图](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series7-area-chart\u002F)\n\n## 18_ 雷达图\n\n### 关于\n\n[雷达图 (Radar chart)](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FRadar_chart) 是一种由一系列等角辐条（称为半径）组成的图表，每个辐条代表一个变量。辐条的数据长度与该数据点相对于所有数据点中该变量的最大幅度的变量幅度成正比。绘制一条线连接每个辐条的数据值。这使图表呈现出星形外观，这也是该图表流行名称之一的由来。\n\n### 何时使用？\n\n- 在各种特征或特性上比较两个或多个项目或组。\n- 检查单个数据点的相对值。\n- 在一个雷达图上显示少于十个因素。\n\n### 示例\n\n![radar-chart-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_37065c0c9783.png)\n\n此雷达图显示了 2 位客户在 4 种产品中的偏好。\n客户 c1 喜欢鸡肉和面包，不太喜欢奶酪。\n然而，客户 c2 比其他 4 种产品更喜欢奶酪，不喜欢啤酒。\n我们可以采访这 2 位客户，以找出那些不受青睐的产品的弱点。\n\n### 更多信息\n\n[Matplotlib 系列 8：雷达图](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series8-radar-chart\u002F)\n\n## 19_ 词云\n\n### 关于\n\n一个 [词云 (Word cloud)](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTag_cloud)（标签云，或视觉设计中的加权列表）是文本数据的创新视觉表示。标签通常是单个单词，每个标签的重要性通过字体大小或颜色显示。这种格式有助于快速感知最突出的术语，并按字母顺序定位术语以确定其相对突出程度。\n\n### 何时使用？\n\n- 描绘网站上的关键词元数据（标签）。\n- 令人愉悦并提供情感连接。\n\n### 示例\n\n![word-cloud-example](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_readme_5d25b51921d9.png)\n\n根据这个词云，我们可以总体上了解到，数据科学采用了从数学、统计学、信息科学和计算机科学等领域汲取的技术和理论。它可以用于商业分析，并被称为“21 世纪最性感的职业”。\n\n### 更多信息\n\n[Matplotlib 系列 9：词云](https:\u002F\u002Fjingwen-z.github.io\u002Fdata-viz-with-matplotlib-series9-word-cloud\u002F)\n\n\n# 7_ 大数据 (Big Data)\n\n## 1_ MapReduce (映射归约) 基础\n\n## 2_ Hadoop 生态系统\n\n## 3_ HDFS (Hadoop 分布式文件系统)\n\n## 4_ 数据复制原则\n\n## 5_ 设置 Hadoop\n\n## 6_ NameNode 与 DataNode\n\n## 7_ JobTracker 与 TaskTracker\n\n## 8_ M\u002FR\u002FSAS 编程\n\n## 9_ Sqop (Sqoop)：将数据加载到 HDFS\n\n## 10_ Flume, Scribe\n\n## 11_ Pig SQL\n\n## 12_ Hive DWH\n\n## 13_ 用于 Weblog 的 Scribe, Chukwa\n\n## 14_ 使用 Mahout\n\n## 15_ Zookeeper, Avro\n\n## 16_ Lambda 架构\n\n## 17_ Storm：Hadoop 实时处理\n\n## 18_ Rhadoop, RHIPE\n\n## 19_ RMR\n\n## 20_ NoSQL 数据库 (MongoDB, Neo4j)\n\n## 21_ 分布式数据库和系统 (Cassandra)\n\n\n# 8_ 数据摄入 (Data Ingestion)\n\n## 1_ 数据格式总结\n\n## 2_ 数据发现\n\n## 3_ 数据源与获取\n\n## 4_ 数据集成\n\n## 5_ 数据融合\n\n## 6_ 转换与增强\n\n## 7_ 数据调查\n\n## 8_ Google OpenRefine\n\n## 9_ 多少数据？\n\n## 10_ 使用 ETL\n\n# 9_ 数据清洗 (Data Munging)\n\n## 1_ 维度与数值降维 (Dim. and num. reduction)\n\n## 2_ 标准化\n\n## 3_ 数据清理\n\n## 4_ 处理缺失值\n\n## 5_ 无偏估计量\n\n## 6_ 稀疏值分箱\n\n## 7_ 特征提取\n\n## 8_ 去噪\n\n## 9_ 采样\n\n## 10_ 分层采样\n\n## 11_ PCA (主成分分析)\n\n# 10_ 工具箱 (Toolbox)\n\n## 1_ 带分析工具包的 MS Excel\n\n## 2_ Java, Python\n\n## 3_ R, Rstudio, Rattle\n\n## 4_ Weka, Knime, RapidMiner\n\n## 5_ 首选的 Hadoop 发行版\n\n## 6_ Spark, Storm\n\n## 7_ Flume, Scibe (Scribe), Chukwa\n\n## 8_ Nutch, Talend, Scraperwiki\n\n## 9_ Webscraper, Flume, Sqoop\n\n## 10_ tm, RWeka, NLTK\n\n## 11_ RHIPE\n\n## 12_ D3.js, ggplot2, Shiny\n\n## 13_ IBM Languageware\n\n## 14_ Cassandra, MongoDB\n\n## 13_ Microsoft Azure, AWS, Google Cloud\n\n## 14_ Microsoft Cognitive API\n\n## 15_ TensorFlow\n\nhttps:\u002F\u002Fwww.tensorflow.org\u002F\n\nTensorFlow 是一个开源软件库，用于使用**数据流图 (data flow graphs)** 进行数值计算。\n\n图中的节点代表数学运算，而图的边代表它们之间通信的多维数据数组（**张量 (tensors)**）。\n\n灵活的架构允许你通过单一 **API (应用程序编程接口)** 将计算部署到桌面、服务器或移动设备中的一个或多个 CPU 或 GPU 上。\n\nTensorFlow 最初由谷歌机器学习研究组织内的 Google Brain 团队的研究人员和工程师开发，旨在进行**机器学习 (machine learning)** 和**深度神经网络 (deep neural networks)** 研究，但该系统足够通用，可适用于各种其他领域。\n\n\n\n\n\n# 其他免费课程 \n\n### 人工智能\n\n- [CS 188 - 人工智能导论，加州大学伯克利分校 - 2015 年春季](http:\u002F\u002Fwww.infocobuild.com\u002Feducation\u002Faudio-video-courses\u002Fcomputer-science\u002Fcs188-spring2015-berkeley.html)\n- [6.034 人工智能，麻省理工学院开放课程](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Felectrical-engineering-and-computer-science\u002F6-034-artificial-intelligence-fall-2010\u002Flecture-videos\u002F)\n- [CS221: 人工智能：原理与技术 - 2019 年秋季 - 斯坦福大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rO1NB9TD4iUZ3qghGEGtqNX)\n- [15-780 - 研究生人工智能，春季 14，卡内基梅隆大学](http:\u002F\u002Fwww.cs.cmu.edu\u002F~zkolter\u002Fcourse\u002F15-780-s14\u002Flectures.html)\n- [CSE 592 人工智能应用，冬季 2003 - 华盛顿大学](https:\u002F\u002Fcourses.cs.washington.edu\u002Fcourses\u002Fcsep573\u002F03wi\u002Flectures\u002Findex.htm)\n- [CS322 - 人工智能导论，冬季 2012-13 - 不列颠哥伦比亚大学](http:\u002F\u002Fwww.cs.ubc.ca\u002F~mack\u002FCS322\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLDPnGbm0sUmpzvcGvktbz446SLdFbfZVU))\n- [CS 4804: 人工智能导论，秋季 2016](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLUenpfvlyoa1iiSbGy9BBewgiXjzxVgBd)\n- [CS 5804: 人工智能导论，春季 2015](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLUenpfvlyoa0PB6_kqJ9WU7m6i6z1RhfJ)\n- [人工智能 - 印度理工学院卡拉格普尔分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106105077\u002F)\n- [人工智能 - 印度理工学院马德拉斯分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106126\u002F)\n- [人工智能 (P. Dasgupta 教授) - 印度理工学院卡拉格普尔分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106105079\u002F)\n- [MOOC - 人工智能入门 - Udacity](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPlqMkzr4xyuD6cXTIgPuzgn)\n- [MOOC - 机器人学人工智能 - Udacity](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPkCSYXw6-a_aAoXVKLDwnHK)\n- [研究生人工智能课程，秋季 2012 - 华盛顿大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbQ3Aya0VERDoDdbMogU9EASJGWris9qG)\n- [基于代理的系统 2015\u002F16 - 爱丁堡大学](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fabs.htm)\n- [信息学 2D - 推理与代理 2014\u002F15 - 爱丁堡大学](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2014\u002Finf2d.htm)\n- [人工智能 - 拉文斯堡魏因加滕应用技术大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL39B5D3AFC249556A)\n- [演绎数据库与知识库系统 - 德国布伦瑞克工业大学](http:\u002F\u002Fwww.ifis.cs.tu-bs.de\u002Fteaching\u002Fws-1516\u002FKBS)\n- [人工智能：知识表示与推理 - 印度理工学院马德拉斯分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106140\u002F)\n- [语义网技术 - Dr. Harald Sack - HPI](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoOmvuyo5UAeihlKcWpzVzB51rr014TwD)\n- [使用语义网技术的知识工程 - Dr. Harald Sack - HPI](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoOmvuyo5UAcBXlhTti7kzetSsi1PpJGR)\n\n--------------\n\n### 机器学习\n\n- **机器学习导论**\n\n- [大规模开放在线课程 (MOOC) 机器学习 (Machine Learning) Andrew Ng - Coursera\u002F斯坦福大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLLssT5z_DsK-h9vYZkQkYNWcItqhlRJLN) ([笔记](http:\u002F\u002Fwww.holehouse.org\u002Fmlclass\u002F))\n\t- [面向程序员的机器学习导论](https:\u002F\u002Fcourse.fast.ai\u002Fml.html)\n\t- [大规模开放在线课程 (MOOC) - 统计学习，斯坦福大学](http:\u002F\u002Fwww.dataschool.io\u002F15-hours-of-expert-machine-learning-videos\u002F)\n\t- [机器学习基础训练营，伯克利西蒙斯研究所](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgKuh-lKre11GbZWneln-VZDLHyejO7YD)\n\t- [CS155 - 机器学习与数据挖掘，2017 - 加州理工学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLuz4CTPOUNi6BfMrltePqMAHdl5W33-bC) ([笔记](http:\u002F\u002Fwww.yisongyue.com\u002Fcourses\u002Fcs155\u002F2017_winter\u002F)) ([2016 年](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL5HdMttxBY0BVTP9y7qQtzTgmcjQ3P0mb))\n\t- [CS 156 - 从数据中学习，加州理工学院](https:\u002F\u002Fwork.caltech.edu\u002Flectures.html)\n\t- [10-601 - 机器学习导论 (硕士) - Tom Mitchell - 2015, 卡内基梅隆大学](http:\u002F\u002Fwww.cs.cmu.edu\u002F~ninamf\u002Fcourses\u002F601sp15\u002Flectures.shtml) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAJ0alZrN8rD63LD0FkzKFiFgkOmEtltQ))\n\t- [10-601 机器学习 | 卡内基梅隆大学 | 2017 年秋季](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL7k0r4t5c10-g7CWCnHfZOAxLaiNinChk)\n\t- [10-701 - 机器学习导论 (博士) - Tom Mitchell, 2011 年春季，卡内基梅隆大学](http:\u002F\u002Fwww.cs.cmu.edu\u002F~tom\u002F10701_sp11\u002Flectures.shtml) ([2014 年秋季](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL7y-1rk2cCsDZCVz2xS7LrExqidHpJM3B)) ([2015 年春季 Alex Smola](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLZSO_6-bSqHTTV7w9u7grTXBHMH-mw3qn))\n\t- [10 - 301\u002F601 - 机器学习导论 - 2020 年春季 - 卡内基梅隆大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLpqQKYIU-snAPM89YPPwyQ9xdaiAdoouk)\n\t- [CMS 165 机器学习与统计推断基础 - 2020 - 加州理工学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLVNifWxslHCDlbyitaLLYBOAEPbmF1AHg)\n\t- [微软研究院 - 机器学习课程](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL34iyE0uXtxo7vPXGFkmm6KbgZQwjf9Kf)\n\t- [CS 446 - 机器学习，2019 年春季，UIUC](https:\u002F\u002Fcourses.engr.illinois.edu\u002Fcs446\u002Fsp2019\u002FAGS\u002F_site\u002F)([ 2016 年秋季讲座](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLQcasX5-oG91TgY6A_gz-IW7YSpwdnD2O))\n\t- [UBC 2012 年本科机器学习，Nando de Freitas](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLE6Wd9FR--Ecf_5nCbnSQMHqORpiChfJf)\n\t- [CS 229 - 机器学习 - 斯坦福大学](https:\u002F\u002Fsee.stanford.edu\u002FCourse\u002FCS229) ([2018 年秋季](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rMiGQp3WXShtMGgzqpfVfbU))\n\t- [CS 189\u002F289A 机器学习导论，Jonathan Shewchuk 教授 - UC Berkeley](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~jrs\u002F189\u002F)\n\t- [CPSC 340: 机器学习与数据挖掘 (2018) - UBC](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLWmXHcz_53Q02ZLeAxigki1JZFfCO6M-b)\n\t- [CS4780\u002F5780 机器学习，2013 年秋季 - 康奈尔大学](http:\u002F\u002Fwww.cs.cornell.edu\u002Fcourses\u002Fcs4780\u002F2013fa\u002F)\n\t- [CS4780\u002F5780 机器学习，2018 年秋季 - 康奈尔大学](http:\u002F\u002Fwww.cs.cornell.edu\u002Fcourses\u002Fcs4780\u002F2018fa\u002Fpage18\u002Findex.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLl8OlHZGYOQ7bkVbuRthEsaLr7bONzbXS))\n\t- [CSE474\u002F574 机器学习导论 - 纽约州立大学布法罗分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLEQDy5tl3xkMzk_zlo2DPzXteCquHA8bQ)\n\t- [CS 5350\u002F6350 - 机器学习，2016 年秋季，犹他大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbuogVdPnkCozRSsdueVwX7CF9N4QWL0B)\n\t- [ECE 5984 机器学习导论，2015 年春季 - 弗吉尼亚理工大学](https:\u002F\u002Ffilebox.ece.vt.edu\u002F~s15ece5984\u002F)\n\t- [CSx824\u002FECEx242 机器学习，Bert Huang, 2015 年秋季 - 弗吉尼亚理工大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLUenpfvlyoa0rMoE5nXA8kdctBKE9eSob)\n\t- [STA 4273H - 大规模机器学习，2015 年冬季 - 多伦多大学](http:\u002F\u002Fwww.cs.toronto.edu\u002F~rsalakhu\u002FSTA4273_2015\u002Flectures.html)\n\t- [CS 485\u002F685 机器学习，Shai Ben-David, 滑铁卢大学](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUCR4_akQ1HYMUcDszPQ6jh8Q\u002Fvideos)\n\t- [STAT 441\u002F841 分类 2017 年冬季，滑铁卢](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLehuLRPyt1HzXDemu7K4ETcF0Ld_B5adG)\n\t- [10-605 - 大数据集机器学习，2016 年秋季 - 卡内基梅隆大学](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUCIE4UdPoCJZMAZrTLuq-CPQ\u002Fvideos)\n\t- [信息论、模式识别与神经网络 - 剑桥大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLruBu5BI5n4aFpG32iMbdWoRVAA-Vcso6)\n\t- [Python 与机器学习 - 斯坦福大众课程计划](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLVxFQjPUB2cnYGZPAGG52OQc9SpWVKjjB)\n\t- [大规模开放在线课程 (MOOC) - 机器学习 Part 1a - Udacity\u002F佐治亚理工学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPl0N6-e1GvyLp5-MUMUjOKo) ([Part 1b](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPlkESDcHD-0oqVx5sAIgz7O) [Part 2](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPmaHhu-Lz3mhLSj-YH-JnG7) [Part 3](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAwxTw4SYaPnidDwo9e2c7ixIsu_pdSNp))\n\t- [机器学习与模式识别 2015\u002F16- 爱丁堡大学](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fmlpr.htm)\n\t- [应用机器学习导论 2015\u002F16- 爱丁堡大学](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fiaml.htm)\n\t- [模式识别课程 (2012)- 海德堡大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLuRaSnb3n4kRDZVU6wxPzGdx1CN12fn0w)\n\t- [机器学习与模式识别导论 - CBCSL 俄亥俄州立大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLcXJymqaE9PPGGtFsTNoDWKl-VNVX5d6b)\n\t- [机器学习导论 - IIT 卡拉格普尔](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106105152\u002F)\n\t- [机器学习导论 - IIT 马德拉斯](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106139\u002F)\n\t- [模式识别 - IISC 班加罗尔](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F117108048\u002F)\n\t- [模式识别与应用 - IIT 卡拉格普尔](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F117105101\u002F)\n\t- [模式识别 - IIT 马德拉斯](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106046\u002F)\n\t- [2013 年机器学习暑期学校 - 德国蒂宾根马克斯·普朗克智能系统研究所](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqJm7Rc5-EXFv6RXaPZzzlzo93Hl0v91E)\n\t- [机器学习 - Professor Kogan (2016 年春季) - 罗格斯大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLauepKFT6DK_1_plY78bXMDj-bshv7UsQ)\n\t- [CS273a: 机器学习导论](http:\u002F\u002Fsli.ics.uci.edu\u002FClasses\u002F2015W-273a) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkWzaBlA7utJMRi89i9FAKMopL0h0LBMk))\n\t- [2015 年机器学习速成课](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLyGKBDfnk-iD5dK8N7UBUFVVDBBtznenR)\n\t- [COM4509\u002FCOM6509 机器学习与自适应智能 2015-16](http:\u002F\u002Finverseprobability.com\u002Fmlai2015\u002F)\n\t- [10715 机器学习高级导论](https:\u002F\u002Fsites.google.com\u002Fsite\u002F10715advancedmlintro2017f\u002Flectures)\n\t- [机器学习导论 - 2018 年春季 - 苏黎世联邦理工学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLzn6LN6WhlN273tsqyfdrBUsA-o5nUESV)\n\t- [机器学习 - Pedro Domingos- 华盛顿大学](https:\u002F\u002Fwww.youtube.com\u002Fuser\u002FUWCSE\u002Fplaylists?view=50&sort=dd&shelf_id=16)\n\t- [高级机器学习 - 2019 - 苏黎世联邦理工学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLY-OA_xnxFwSe98pzMGVR4bjAZZYrNT7L)\n\t- [机器学习 (COMP09012)](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLyH-5mHPFffFwz7Twap0XuVeUJ8vuco9t)\n\t- [概率机器学习 2020 - 蒂宾根大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL05umP7R6ij1tHaOFY96m5uX3J21a6yNd)\n\t- [统计机器学习 2020 - Ulrike von Luxburg - 蒂宾根大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL05umP7R6ij2XCvrRzLokX6EoHWaGA2cC)\n\t- [COMS W4995 - 应用机器学习 - 2020 年春季 - 哥伦比亚大学](https:\u002F\u002Fwww.cs.columbia.edu\u002F~amueller\u002Fcomsw4995s20\u002Fschedule\u002F)\n\t\n- **数据挖掘 (Data Mining)**\n\n- [CSEP 546，数据挖掘 (Data Mining) - Pedro Domingos, 2016 年春季 - 华盛顿大学](https:\u002F\u002Fcourses.cs.washington.edu\u002Fcourses\u002Fcsep546\u002F16sp\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLTPQEx-31JXgtDaC6-3HxWcp7fq4N8YGr))\n\t- [CS 5140\u002F6140 - 数据挖掘，2016 年春季，犹他大学](https:\u002F\u002Fwww.cs.utah.edu\u002F~jeffp\u002Fteaching\u002Fcs5140.html) ([Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbuogVdPnkCpXfb43Wvc7s5fXWzedwTPB))\n\t- [CS 5955\u002F6955 - 数据挖掘，犹他大学](http:\u002F\u002Fwww.cs.utah.edu\u002F~jeffp\u002Fteaching\u002Fcs5955.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUCcrlwW88yMcXujhGjSP2WBg\u002Fvideos))\n\t- [统计学 202 - 数据挖掘的统计方面，2007 年夏季 - 谷歌](http:\u002F\u002Fwww.stats202.com\u002Foriginal_index.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLFE776F2C513A744E))\n\t- [慕课 (MOOC) - 文本挖掘与分析 by ChengXiang Zhai](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLLssT5z_DsK8Xwnh_0bjN4KNT81bekvtt)\n\t- [信息检索 SS 2014, iTunes - HPI](https:\u002F\u002Fitunes.apple.com\u002Fus\u002Fitunes-u\u002Finformation-retrieval-ss-2014\u002Fid874200291)\n\t- [慕课 (MOOC) - 使用 Weka 进行数据挖掘](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLm4W7_iX_v4NqPUjceOGd-OKNVO4c_cPD)\n\t- [CS 290 数据挖掘讲座](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLB4CCA346A5741C4C)\n\t- [CS246 - 挖掘海量数据集，2016 年冬季，斯坦福大学](https:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fcs246\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUC_Oao2FYkLAUlUVkBfze4jg\u002Fvideos))\n\t- [数据挖掘：从大型数据集学习 - 2017 年秋季 - 苏黎世联邦理工学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLY-OA_xnxFwRHZO6L6yT253VPgrZazQs6)\n\t- [信息检索 - 2018 年春季 - 苏黎世联邦理工学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLzn6LN6WhlN1ktkDvNurPSDwTQ_oGQisn)\n\t- [CAP6673 - 数据挖掘与机器学习 - FAU](http:\u002F\u002Fwww.cse.fau.edu\u002F~taghi\u002Fclasses\u002Fcap6673\u002F)([视频讲座](https:\u002F\u002Fvimeo.com\u002Falbum\u002F1505953))\n\t- [数据仓库与数据挖掘技术 - 德国不伦瑞克工业大学](http:\u002F\u002Fwww.ifis.cs.tu-bs.de\u002Fteaching\u002Fws-1617\u002Fdwh)\n- **数据科学 (Data Science)**\n\t- [Data 8：数据科学基础 - 加州大学伯克利分校](http:\u002F\u002Fdata8.org\u002F) ([2017 年夏季](http:\u002F\u002Fdata8.org\u002Fsu17\u002F))\n\t- [CSE519 - 数据科学 2016 年秋季 - Skiena, 石溪大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLOtl7M3yp-DVBdLYatrltDJr56AKZ1qXo)\n\t- [CS 109 数据科学，哈佛大学](http:\u002F\u002Fcs109.github.io\u002F2015\u002Fpages\u002Fvideos.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLb4G5axmLqiuneCqlJD2bYFkBwHuOzKus))\n\t- [6.0002 计算思维与数据科学导论 - MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Felectrical-engineering-and-computer-science\u002F6-0002-introduction-to-computational-thinking-and-data-science-fall-2016\u002Flecture-videos\u002F)\n\t- [Data 100 - 2019 年夏季 - 加州大学伯克利分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLPHXc20GewP8J56CisONS_mFZWZAfa7jR)\n\t- [分布式数据分析 (2017\u002F18 冬季学期) - 波茨坦大学哈恩研究所](https:\u002F\u002Fwww.tele-task.de\u002Fseries\u002F1179\u002F)\n\t- [统计学 133 - 数据计算概念，2013 年秋季 - 加州大学伯克利分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL-XXv-cvA_iDsSPnMJlnhIyADGUmikoIO)\n\t- [数据概况与数据清洗 (2014\u002F15 冬季学期) - 波茨坦大学哈恩研究所](https:\u002F\u002Fwww.tele-task.de\u002Fseries\u002F1027\u002F)\n\t- [AM 207 - 数据分析、推断与优化的随机方法，哈佛大学](http:\u002F\u002Fam207.github.io\u002F2016\u002Findex.html)\n\t- [CS 229r - 大数据算法，哈佛大学](http:\u002F\u002Fpeople.seas.harvard.edu\u002F~minilek\u002Fcs229r\u002Ffall15\u002Flec.html) ([Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL2SOU6wwxB0v1kQTpqpuu5kEJo2i-iUyf))\n\t- [大数据算法 - 印度理工学院马德拉斯分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106142\u002F)\n- **概率图建模 (Probabilistic Graphical Modeling)**\n\t- [慕课 (MOOC) - 概率图模型 - Coursera](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLvfF4UFg6Ejj6SX-ffw-O4--SPbB9P7eP)\n\t- [CS 6190 - 概率建模，2016 年春季，犹他大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbuogVdPnkCpvxdF-Gy3gwaBObx7AnQut)\n\t- [10-708 - 概率图模型，卡内基梅隆大学](https:\u002F\u002Fwww.cs.cmu.edu\u002F~epxing\u002FClass\u002F10708-20\u002Flectures.html)\n\t- [概率图模型，Daphne Koller, 斯坦福大学](http:\u002F\u002Fopenclassroom.stanford.edu\u002FMainFolder\u002FCoursePage.php?course=ProbabilisticGraphicalModels)\n\t- [概率模型 - 赫尔辛基大学](https:\u002F\u002Fwww.cs.helsinki.fi\u002Fen\u002Fcourses\u002F582636\u002F2015\u002FK\u002FK\u002F1)\n\t- [概率建模与推理 2015\u002F16 - 爱丁堡大学](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fpmr.htm)\n\t- [概率图模型，2018 年春季 - 圣母大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLd-PuDzW85AcV4bgdu7wHPL37hm60W4RM)\n- **深度学习 (Deep Learning)**\n\t- [6.S191：深度学习导论 - MIT](http:\u002F\u002Fintrotodeeplearning.com\u002F)\n\t- [深度学习 CMU](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUC8hYZGEkI2dDO8scT8C5UQA\u002Fvideos)\n\t- [第一部分：面向程序员的实用深度学习，v3 - fast.ai](https:\u002F\u002Fcourse.fast.ai\u002F)\n\t- [第二部分：从基础开始的深度学习 - fast.ai](https:\u002F\u002Fcourse.fast.ai\u002Fpart2)\n\t- [2015 年牛津大学深度学习 - Nando de Freitas](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLE6Wd9FR--EfW8dtjAuPoTuPcqmOV53Fu)\n\t- [6.S094：自动驾驶汽车的深度学习 - MIT](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLrAXtmErZgOeiKm4sgNOknGvNjby9efdf)\n\t- [CS294-129 设计、可视化与理解深度神经网络](https:\u002F\u002Fbcourses.berkeley.edu\u002Fcourses\u002F1453965\u002Fpages\u002Fcs294-129-designing-visualizing-and-understanding-deep-neural-networks) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIxopmdJF_CLNqG3QuDFHQUm))\n\t- [CS230：深度学习 - 2018 年秋季 - 斯坦福大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rOABXSygHTsbvUz4G_YQhOb)\n\t- [STAT-157 深度学习 2019 - 加州大学伯克利分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLZSO_6-bSqHQHBCoGaObUljoXAyyqhpFW)\n\t- [全栈深度学习训练营 2019 - 加州大学伯克利分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL_Ig1a5kxu5645uORPL8xyvHr91Lg8G1l)\n\t- [深度学习，斯坦福大学](http:\u002F\u002Fopenclassroom.stanford.edu\u002FMainFolder\u002FCoursePage.php?course=DeepLearning)\n\t- [慕课 (MOOC) - 机器学习的神经网络，Geoffrey Hinton 2016 - Coursera](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoRl3Ht4JOcdU872GhiYWf6jwrk_SNhz9)\n\t- [深度无监督学习 -- 伯克利 2020 年春季](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLwRJQ4m4UJjPiJP3691u-qWwPGVKzSlNP)\n\t- [Stat 946 深度学习 - 滑铁卢大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLehuLRPyt1Hyi78UOkMPWCGRxGcA9NVOE)\n\t- [神经网络课程 - 舍布鲁克大学](http:\u002F\u002Finfo.usherbrooke.ca\u002Fhlarochelle\u002Fneural_networks\u002Fcontent.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL6Xpj9I5qXYEcOhn7TqghAJ6NAPrNmUBH))\n\t- [CS294-158 深度无监督学习 SP19](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUCf4SX8kAZM_oGcZjMREsU9w\u002Fvideos)\n\t- [DLCV - 计算机视觉深度学习 - 巴塞罗那理工大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL-5eMc3HQTBavDoZpFcX-bff5WgQqSLzR)\n\t- [DLAI - 人工智能深度学习 @ 巴塞罗那理工大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL-5eMc3HQTBagIUjKefjcTbnXC0wXC_vd)\n\t- [神经网络与应用 - 印度理工学院卡拉格普尔分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F117105084\u002F)\n\t- [UVA 深度学习课程](http:\u002F\u002Fuvadlc.github.io\u002F#lecture)\n\t- [英伟达机器学习课程](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLTIkHmXc-7an8xbwhAJX-LQ4D4Uf-ar5I)\n\t- [深度学习 - 2020-21 年冬季 - 蒂宾根机器学习](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL05umP7R6ij3NTWIdtMbfvX7Z-4WEXRqD)\n- **强化学习 (Reinforcement Learning)**\n\t- [CS234：强化学习 - 2019 年冬季 - 斯坦福大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rOSOPzutgyCTapiGlY2Nd8u)\n\t- [强化学习导论 - 伦敦大学学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ)\n\t- [高级深度学习与强化学习 - 伦敦大学学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqYmG7hTraZDNJre23vqCGIVpfZ_K2RZs)\n\t- [强化学习 - 印度理工学院马德拉斯分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLyqSpQzTE6M_FwzHFAyf4LSkz_IjMyjD9)\n\t- [CS885 强化学习 - 2018 年春季 - 滑铁卢大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLdAoL1zKcqTXFJniO3Tqqn6xMBBL07EDc)\n\t- [CS 285 - 深度强化学习 - 加州大学伯克利分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIwhWJpGazJ9VSj9CFMkb79A)\n\t- [CS 294 112 - 强化学习](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIxJMR-j5A1mkxK26gh_qg37)\n\t- [NUS CS 6101 - 深度强化学习](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLllwxvcS7ca5wOmRLKm6ri-OaC0INYehv)\n\t- [ECE 8851：强化学习](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL_Nk3YvgORJs1tCLQnlnSRsOJArj_cP9u)\n\t- [CS294-112，深度强化学习 Sp17](http:\u002F\u002Frll.berkeley.edu\u002Fdeeprlcourse\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX))\n\t- [2015 年大卫·希尔弗 (David Silver) 在 DeepMind 讲授的强化学习课程 - 伦敦大学学院](http:\u002F\u002Fwww0.cs.ucl.ac.uk\u002Fstaff\u002Fd.silver\u002Fweb\u002FTeaching.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=2pWv7GOvuf0))\n\t- [深度强化学习训练营 - 伯克利 2017 年 8 月](https:\u002F\u002Fsites.google.com\u002Fview\u002Fdeep-rl-bootcamp\u002Flectures)\n\t- [强化学习 - 印度理工学院马德拉斯分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLyqSpQzTE6M_FwzHFAyf4LSkz_IjMyjD9)\n- **高级机器学习 (Advanced Machine Learning)**\n\t- [机器学习 2013 - Nando de Freitas, 不列颠哥伦比亚大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLE6Wd9FR--EdyJ5lbFl8UuGjecvVw66F6)\n\t- [机器学习，2014-2015，牛津大学](https:\u002F\u002Fwww.cs.ox.ac.uk\u002Fpeople\u002Fnando.defreitas\u002Fmachinelearning\u002F)\n\t- [10-702\u002F36-702 - 统计机器学习 - Larry Wasserman, 2016 年春季，CMU](https:\u002F\u002Fwww.stat.cmu.edu\u002F~ryantibs\u002Fstatml\u002F) ([2015 年春季](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLjbUi5mgii6BWEUZf7He6nowWvGne_Y8r))\n\t- [10-715 机器学习高级导论 - CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~bapoczos\u002FClasses\u002FML10715_2015Fall\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL4DwY1suLMkcu-wytRDbvBNmx57CdQ2pJ))\n\t- [CS 281B - 可扩展机器学习，Alex Smola, 加州大学伯克利分校](http:\u002F\u002Falex.smola.org\u002Fteaching\u002Fberkeley2012\u002Fsyllabus.html)\n\t- [18.409 机器学习的算法方面 2015 年春季 - MIT](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLB3sDpSRdrOvI1hYXNsa6Lety7K8FhPpx)\n\t- [CS 330 - 深度多任务与元学习 - 2019 年秋季 - 斯坦福大学](https:\u002F\u002Fcs330.stanford.edu\u002F) ([Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rMC6zfYmnD7UG3LVvwaITY5))\n- **基于机器学习的自然语言处理和计算机视觉**\n\t- [CS 224d - 自然语言处理的深度学习，斯坦福大学](http:\u002F\u002Fcs224d.stanford.edu\u002Fsyllabus.html) ([讲座 - Youtube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLCJlDcMjVoEdtem5GaohTC1o9HTTFtK7_))\n\t- [CS 224N - 自然语言处理，斯坦福大学](http:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fcs224n\u002F) ([讲座视频](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgtM85Maly3n2Fp1gJVvqb0bTC39CPn1N))\n\t- [CS 124 - 从语言到信息 - 斯坦福大学](https:\u002F\u002Fwww.youtube.com\u002Fchannel\u002FUC_48v322owNVtORXuMeRmpA\u002Fplaylists?view=50&sort=dd&shelf_id=2)\n\t- [慕课 (MOOC) - 自然语言处理，Dan Jurafsky & Chris Manning - Coursera](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL6397E4B26D00A269)\n\t- [fast.ai 代码优先的自然语言处理入门](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLtmWHNX-gukKocXQOkQjuVxglSDYWsSh9) ([Github](https:\u002F\u002Fgithub.com\u002Ffastai\u002Fcourse-nlp))\n\t- [慕课 (MOOC) - 自然语言处理 - Coursera, 密歇根大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLLssT5z_DsK8BdawOVCCaTCO99Ya58ryR)\n\t- [CS 231n - 用于视觉识别的卷积神经网络，斯坦福大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL3FW7Lu3i5JvHM8ljYj-zLfQRF3EO8sYv)\n\t- [CS224U：自然语言理解 - 2019 年春季 - 斯坦福大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoROMvodv4rObpMCir6rNNUlFAn56Js20)\n\t- [自然语言处理的深度学习，2017 - 牛津大学](https:\u002F\u002Fgithub.com\u002Foxford-cs-deepnlp-2017\u002Flectures)\n\t- [机器人与计算机视觉的机器学习，2013\u002F2014 冬季学期 - 慕尼黑工业大学](https:\u002F\u002Fvision.in.tum.de\u002Fteaching\u002Fws2013\u002Fml_ws13) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLTBdjV_4f-EIiongKlS9OKrBEp8QR47Wl))\n\t- [信息学 1 - 认知科学 2015\u002F16 - 爱丁堡大学](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Finf1cs.htm)\n\t- [信息学 2A - 处理形式与自然语言 2016-17 - 爱丁堡大学](http:\u002F\u002Fwww.inf.ed.ac.uk\u002Fteaching\u002Fcourses\u002Finf2a\u002Fschedule.html)\n\t- [计算认知科学 2015\u002F16 - 爱丁堡大学](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fccs.htm)\n\t- [加速自然语言处理 2015\u002F16 - 爱丁堡大学](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fanlp.htm)\n\t- [自然语言处理 - 印度理工学院孟买分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106101007\u002F)\n\t- [NOC：视觉计算的深度学习 - 印度理工学院卡拉格普尔分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F108\u002F105\u002F108105103\u002F)\n\t- [CS 11-747 - 神经网路用于 NLP - 2019 - CMU](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL8PYTP1V4I8Ajj7sY6sdtmjgkt7eo2VMs)\n\t- [自然语言处理 - Michael Collins - 哥伦比亚大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLA212ij5XG8OTDRl8IWFiJgHR9Ve2k9pv)\n\t- [计算机视觉的深度学习 - 密歇根大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL5-TkQAfAZFbzxjBHtzdVCWE0Zbhomg7r)\n\t- [CMU CS11-737 - 多语言自然语言处理](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL8PYTP1V4I8CHhppU6n1Q9-04m96D9gt5)\n- **时间序列分析**\n\t- [02417 时间序列分析](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLtiTxpFJ4k6TZ0g496fVcQpt_-XJRNkbi)\n\t- [应用时间序列分析](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLl0FT6O_WWDBm-4W-eoK34omYmEMseQDX)\n- **其他机器学习主题**\n\t- [EE364a：凸优化 I - 斯坦福大学](http:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fee364a\u002Fvideos.html)\n\t- [CS 6955 - 聚类，2015 年春季，犹他大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbuogVdPnkCpRvi-qSMCdOwyn4UYoPxTI)\n\t- [Info 290 - 使用 Twitter 分析大数据，加州大学伯克利分校信息学院](http:\u002F\u002Fblogs.ischool.berkeley.edu\u002Fi290-abdt-s12\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLE8C1256A28C1487F))\n\t- [10-725 凸优化，2015 年春季 - CMU](http:\u002F\u002Fwww.stat.cmu.edu\u002F~ryantibs\u002Fconvexopt-S15\u002F)\n\t- [10-725 凸优化：2016 年秋季 - CMU](http:\u002F\u002Fwww.stat.cmu.edu\u002F~ryantibs\u002Fconvexopt\u002F)\n\t- [CAM 383M - 科学计算的统计与离散方法，德克萨斯大学](http:\u002F\u002Fgranite.ices.utexas.edu\u002Fcoursewiki\u002Findex.php\u002FMain_Page)\n\t- [9.520 - 统计学习理论与应用，2015 年秋季 - MIT](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLyGKBDfnk-iDj3FBd0Avr_dLbrU8VG73O)\n\t- [强化学习 - 伦敦大学学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLacBNHqv7n9gp9cBMrA6oDbzz_8JqhSKo)\n\t- [机器学习的正则化方法 2016](http:\u002F\u002Facademictorrents.com\u002Fdetails\u002F493251615310f9b6ae1f483126292378137074cd) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbF0BXX_6CPJ20Gf_KbLFnPWjFTvvRwCO))\n\t- [大数据中的统计推断 - 多伦多大学](http:\u002F\u002Ffields2015bigdata2inference.weebly.com\u002Fmaterials.html)\n\t- [10-725 优化 2012 年秋季 - CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~ggordon\u002F10725-F12\u002Fschedule.html)\n\t- [10-801 高级优化与随机方法 - CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~suvrit\u002Fteach\u002Faopt.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLjTcdlvIS6cjdA8WVXNIk56X_SjICxt0d))\n\t- [强化学习 2015\u002F16 - 爱丁堡大学](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Frl.htm)\n\t- [强化学习 - 印度理工学院马德拉斯分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106106143\u002F)\n\t- [统计重构 2015 年冬季 - Richard McElreath](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLDcUM9US4XdMdZOhJWJJD4mDBMnbTWw_z)\n\t- [音乐信息检索 - 维多利亚大学，2014](http:\u002F\u002Fmarsyas.cs.uvic.ca\u002FmirBook\u002Fcourse\u002F)\n\t- [普渡大学 2011 年机器学习暑期学校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL2A65507F7D725EFB)\n\t- [机器学习基础 - Blmmoberg Edu](https:\u002F\u002Fbloomberg.github.io\u002Ffoml\u002F#home)\n\t- [强化学习导论 - 伦敦大学学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqYmG7hTraZDM-OYHWgPebj2MfCFzFObQ)\n\t- [高级深度学习与强化学习 - 伦敦大学学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqYmG7hTraZDNJre23vqCGIVpfZ_K2RZs)\n\t- [网络信息检索 (Proff. L. Becchetti - A. Vitaletti)](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAQopGWlIcya-9yzQ8c8UtPOuCv0mFZkr)\n\t- [大数据系统 (2019\u002F20 冬季学期) - Prof. Dr. Tilmann Rabl - HPI](https:\u002F\u002Fwww.tele-task.de\u002Fseries\u002F1286\u002F)\n\t- [分布式数据分析 (2017\u002F18 冬季学期) - Dr. Thorsten Papenbrock - HPI](https:\u002F\u002Fwww.tele-task.de\u002Fseries\u002F1179\u002F)\n\n- **概率与统计**\n\n\t- [6.041 概率系统分析与实用概率 - MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Felectrical-engineering-and-computer-science\u002F6-041sc-probabilistic-systems-analysis-and-applied-probability-fall-2013\u002F)\n\t- [统计学 110 - 概率论 - 哈佛大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL2SOU6wwxB0uwwH80KTQ6ht66KWxbzTIo)\n\t- [统计学 2.1x：描述性统计 | 加州大学伯克利分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL_Ig1a5kxu56TfFnGlRlH2YpOBWGiYsQD)\n\t- [统计学 2.2x：概率论 | 加州大学伯克利分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL_Ig1a5kxu57qPZnHm-ie-D7vs9g7U-Cl)\n\t- [MOOC - 统计学：理解数据，Coursera](http:\u002F\u002Facademictorrents.com\u002Fdetails\u002Fa0cbaf3e03e0893085b6fbdc97cb6220896dddf2)\n\t- [MOOC - 统计学基础 - Coursera](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLycnP7USbo1V3jlyjAzWUB201cLxPq4NP)\n\t- [概率论与随机过程 - 印度理工学院卡拉格普尔分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F117105085\u002F)\n\t- [MOOC - 统计推断 - Coursera](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgIPpm6tJZoSvrYM54BUqJJ4CWrYeGO40)\n\t- [131B - 概率与统计导论，加州大学欧文分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLqOZ6FD_RQ7k-j-86QUC2_0nEu0QOP-Wy)\n\t- [统计学 250 - 统计与数据分析导论，密歇根大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL432AB57AF9F43D4F)\n\t- [集合、计数与概率 - 哈佛大学](http:\u002F\u002Fmatterhorn.dce.harvard.edu\u002Fengage\u002Fui\u002Findex.html#\u002F1999\u002F01\u002F82347)\n\t- [观点鲜明的统计课程](http:\u002F\u002Fwww.opinionatedlessons.org\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLUAHeOPjkJseXJKbuk9-hlOfZU9Wd6pS0))\n\t- [统计学 - Brandon Foltz](https:\u002F\u002Fwww.youtube.com\u002Fuser\u002FBCFoltz\u002Fplaylists)\n\t- [统计重构：使用 R 和 Stan 的贝叶斯课程](https:\u002F\u002Fgithub.com\u002Frmcelreath\u002Fstatrethinking_winter2019) ([讲座 - 阿尔托大学](https:\u002F\u002Faalto.cloud.panopto.eu\u002FPanopto\u002FPages\u002FSessions\u002FList.aspx#folderID=%22f0ec3a25-9e23-4935-873b-a9f401646812%22)) ([书籍](http:\u002F\u002Fwww.stat.columbia.edu\u002F~gelman\u002Fbook\u002F))\n\t- [02402 统计学导论 E12 - 丹麦技术大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLMn2aW3wpAtPC8tZHQy6nwWsFG7P6sPqw) ([F17](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgowegO9Se58_BnUNnaARajEE_bX-GJEz))\n- **线性代数**\n\t- [18.06 - 线性代数，Gilbert Strang 教授，MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Fmathematics\u002F18-06sc-linear-algebra-fall-2011\u002F)\n\t- [18.065 数据分析、信号处理与机器学习中的矩阵方法 - MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Fmathematics\u002F18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018\u002Fvideo-lectures\u002F)\n\t- [线性代数（普林斯顿大学）](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLGqzsq0erqU7w7ZrTZ-pWWk4-AOkiGEGp)\n\t- [MOOC: 矩阵编程：通过计算机科学应用学习线性代数 - Coursera](http:\u002F\u002Facademictorrents.com\u002Fdetails\u002F54cd86f3038dfd446b037891406ba4e0b1200d5a)\n\t- [CS 053 - 矩阵编程 - 布朗大学](http:\u002F\u002Fcs.brown.edu\u002Fcourses\u002Fcs053\u002Fcurrent\u002Flectures.htm) ([2014 秋季视频](https:\u002F\u002Fcs.brown.edu\u002Fvideo\u002Fchannels\u002Fcoding-matrix-fall-2014\u002F))\n\t- [线性代数复习 - CMU](http:\u002F\u002Fwww.cs.cmu.edu\u002F~zkolter\u002Fcourse\u002Flinalg\u002Foutline.html)\n\t- [线性代数入门课程 - N J Wildberger - 新南威尔士大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL44B6B54CBF6A72DF)\n\t- [矩阵代数导论](http:\u002F\u002Fma.mathforcollege.com\u002Fyoutube\u002Findex.html)\n\t- [计算线性代数 - fast.ai](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLtmWHNX-gukIc92m1K0P6bIOnZb-mg0hY) ([GitHub](https:\u002F\u002Fgithub.com\u002Ffastai\u002Fnumerical-linear-algebra))\n- [10-600 机器学习数学基础 - CMU](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL7y-1rk2cCsA339crwXMWUaBRuLBvPBCg)\n- [MIT 18.065 数据分析、信号处理与机器学习中的矩阵方法](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Fmathematics\u002F18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018\u002Fvideo-lectures\u002F)\n- [36-705 - 中级统计学 - Larry Wasserman, CMU](http:\u002F\u002Fwww.stat.cmu.edu\u002F~larry\u002F=stat705\u002F) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLcW8xNfZoh7eI7KSWneVWq-7wr8ffRtHF))\n- [组合数学 - 印度科学研究所班加罗尔分校](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F106108051\u002F)\n- [高级工程数学 - 圣母大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLd-PuDzW85Ae4pzlylMLzq_a-RHPx8ryA)\n- [科学家与工程师的统计计算 - 圣母大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLd-PuDzW85AeltIRcjDY7Z4q49NEAuMcA)\n- [统计计算，2017 年秋季 - 圣母大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLd-PuDzW85AcSgNGnT5TUHt85SrCljT3V)\n- [机器学习数学，Ulrike von Luxburg 讲座 - 蒂宾根机器学习](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL05umP7R6ij1a6KdEy8PVE9zoCv6SlHRS)\n\n\n-------------------------\n\n### 机器人学\n\n- [CS 223A - 机器人学导论，斯坦福大学](https:\u002F\u002Fsee.stanford.edu\u002FCourse\u002FCS223A)\n- [6.832 欠驱动机器人学 - MIT OCW](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Felectrical-engineering-and-computer-science\u002F6-832-underactuated-robotics-spring-2009\u002F)\n- [CS287 伯克利高级机器人学 2019 秋季学期 -- 讲师：Pieter Abbeel](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLwRJQ4m4UJjNBPJdt8WamRAt4XKc639wF)\n- [CS 287 - 高级机器人学，2011 年秋季，UC Berkeley](https:\u002F\u002Fpeople.eecs.berkeley.edu\u002F~pabbeel\u002Fcs287-fa11\u002F) ([视频](http:\u002F\u002Frll.berkeley.edu\u002Fcs287\u002Flecture_videos\u002F))\n- [CS235 - 面向非机器人设计师的应用机器人设计 - 斯坦福大学](https:\u002F\u002Fwww.youtube.com\u002Fuser\u002FStanfordCS235\u002Fvideos)\n- [讲座：飞行机器人的视觉导航](https:\u002F\u002Fvision.in.tum.de\u002Fteaching\u002Fss2012\u002Fvisnav2012) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLTBdjV_4f-EKeki5ps2WHqJqyQvxls4ha))\n- [CS 205A：机器人、视觉与图形的数学方法 (2013 年秋季)](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLQ3UicqQtfNvQ_VzflHYKhAqZiTxOkSwi)\n- [机器人学 1，De Luca 教授，罗马大学](http:\u002F\u002Fwww.dis.uniroma1.it\u002F~deluca\u002Frob1_en\u002Fmaterial_rob1_en_2014-15.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAQopGWlIcyaqDBW1zSKx7lHfVcOmWSWt))\n- [机器人学 2，De Luca 教授，罗马大学](http:\u002F\u002Fwww.diag.uniroma1.it\u002F~deluca\u002Frob2_en\u002Fmaterial_rob2_en.html) ([YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAQopGWlIcya6LnIF83QlJTqvpYmJXnDm))\n- [机器人力学与控制，首尔大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkjy3Accn-E7mlbuSF4aajcMMckG4wLvW)\n- [机器人学导论课程 - 北卡罗来纳大学夏洛特分校 (UNCC)](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL4847E1D1C121292F)\n- [SLAM（同步定位与建图）讲座](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLpUPoM7Rgzi_7YWn14Va2FODh7LzADBSm)\n- [视觉与机器人学导论 2015\u002F16 - 爱丁堡大学](http:\u002F\u002Fgroups.inf.ed.ac.uk\u002Fvision\u002FVIDEO\u002F2015\u002Fivr.htm)\n- [ME 597 – 自主移动机器人学 – 2014 年秋季](http:\u002F\u002Fwavelab.uwaterloo.ca\u002Findex6ea9.html?page_id=267)\n- [ME 780 – 自动驾驶感知 – 2017 年春季](http:\u002F\u002Fwavelab.uwaterloo.ca\u002Findexaef8.html?page_id=481)\n- [ME780 – 机器人和计算机视觉的非线性状态估计 – 2017 年春季](http:\u002F\u002Fwavelab.uwaterloo.ca\u002Findexe9a5.html?page_id=533)\n- [METR 4202\u002F7202 -- 机器人学与自动化 - 昆士兰大学](http:\u002F\u002Frobotics.itee.uq.edu.au\u002F~metr4202\u002Flectures.html)\n- [机器人学 - 印度孟买理工学院 (IIT Bombay)](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F112101099\u002F)\n- [机器视觉导论](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL1pxneANaikCO1-Z0XTaljLR3SE8tgRXY)\n- [6.834J 认知机器人学 - MIT OCW ](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Faeronautics-and-astronautics\u002F16-412j-cognitive-robotics-spring-2016\u002F)\n- [Hello (真实) 世界与 ROS（机器人操作系统） - 代尔夫特理工大学](https:\u002F\u002Focw.tudelft.nl\u002Fcourses\u002Fhello-real-world-ros-robot-operating-system\u002F)\n- [机器人编程 (ROS) - 苏黎世联邦理工学院](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLE-BQwvVGf8HOvwXPgtDfWoxd4Cc6ghiP)\n- [机电一体化系统设计 - 代尔夫特理工大学](https:\u002F\u002Focw.tudelft.nl\u002Fcourses\u002Fmechatronic-system-design\u002F)\n- [CS 206 进化机器人学课程 2020 年春季](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLAuiGdPEdw0inlKisMbjDypCbvcb_GBN9)\n- [机器人学基础 - UTEC 2018-I](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLoWGuY2dW-Acmc8V5NYSAXBxADMm1rE4p)\n- [机器人学 - YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL_onPhFCkVQhuPiUxUW2lFHB39QsavEEA)\n- [机器人学与控制：理论与实践 IIT 鲁尔基](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLLy_2iUCG87AjAXKbNMiKJZ2T9vvGpMB0)\n- [机电一体化](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLtuwVtW88fOeTFS_szBWif0Mcc0lfNWaz)\n- [ME142 - 机电一体化 2020 年春季 - 加州大学默塞德分校](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PL-euleXgwWUNQ80DGq6qopHBmHcQyEyNQ)\n- [移动传感与机器人学 - 波恩大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgnQpQtFTOGQJXx-x0t23RmRbjp_yMb4v)\n- [MSR2 - 传感器与状态估计课程 (2020) - 波恩大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgnQpQtFTOGQh_J16IMwDlji18SWQ2PZ6)\n- [SLAM（同步定位与建图）课程 (2013) - 波恩大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgnQpQtFTOGQrZ4O5QzbIHgl3b1JHimN_)\n- [ENGR486 机器人建模与控制 (2014 冬季)](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLJzZfbLAMTelwaLxFXteeblbY2ytU2AxX)\n- [D K Pratihar 教授主讲的机器人学 - IIT 卡拉格普尔](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLbRMhDVUMngcdUbBySzyzcPiFTYWr4rV_)\n- [移动机器人学导论 - SS 2019 - 弗赖堡大学](http:\u002F\u002Fais.informatik.uni-freiburg.de\u002Fteaching\u002Fss19\u002Frobotics\u002F)\n- [机器人地图构建 - WS 2018\u002F19 - 弗赖堡大学](http:\u002F\u002Fais.informatik.uni-freiburg.de\u002Fteaching\u002Fws18\u002Fmapping\u002F)\n- [机构学与机器人运动学 - IIT 卡拉格普尔](https:\u002F\u002Fnptel.ac.in\u002Fcourses\u002F112\u002F105\u002F112105236\u002F)\n- [自动驾驶汽车 - Cyrill Stachniss - 2020\u002F21 冬季 - 波恩大学](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLgnQpQtFTOGQo2Z_ogbonywTg8jxCI9pD)\n- [移动传感与机器人学 1 – Stachniss 部分（与 PhoRS 联合授课） - 波恩大学](https:\u002F\u002Fwww.ipb.uni-bonn.de\u002Fmsr1-2020\u002F)\n- [移动传感与机器人学 2 – Stachniss & Klingbeil\u002FHolst - 波恩大学](https:\u002F\u002Fwww.ipb.uni-bonn.de\u002Fmsr2-2020\u002F)\n\n\n----------------------------------\n\n## 500 + 人工智能项目列表（含代码）\n\n*500 个 AI（人工智能）、机器学习、深度学习、计算机视觉、NLP（自然语言处理）项目，均附带代码*\n\n***此列表持续更新。*** - 您可以提交 Pull Request（拉取请求）并参与贡献。\n\n| 序号 | 名称 | 链接 |\n| ----- | --------------------------------------------------------------------- | ----------------------------------- |\n| 1 | 180 个机器学习项目 | [is.gd\u002FMLtyGk](http:\u002F\u002Fis.gd\u002FMLtyGk) |\n| 2 | 12 个机器学习目标检测项目 | [is.gd\u002FjZMP1A](http:\u002F\u002Fis.gd\u002FjZMP1A) |\n| 3 | 20 个使用 Python 的 NLP (自然语言处理) 项目 | [is.gd\u002FjcMvjB](http:\u002F\u002Fis.gd\u002FjcMvjB) |\n| 4 | 10 个时间序列预测机器学习项目 | [is.gd\u002FdOR66m](http:\u002F\u002Fis.gd\u002FdOR66m) |\n| 5 | 20 个使用 Python 解决并解释的深度学习项目 | [is.gd\u002F8Cv5EP](http:\u002F\u002Fis.gd\u002F8Cv5EP) |\n| 6 | 20 个机器学习项目 | [is.gd\u002FLZTF0J](http:\u002F\u002Fis.gd\u002FLZTF0J) |\n| 7 | 30 个解决并解释的 Python 项目 | [is.gd\u002FxhT36v](http:\u002F\u002Fis.gd\u002FxhT36v) |\n| 8 | 免费机器学习课程 | https:\u002F\u002Flnkd.in\u002FekCY8xw |\n| 9 | 5 个使用 Python 的网络爬虫项目 | [is.gd\u002F6XOTSn](http:\u002F\u002Fis.gd\u002F6XOTSn) |\n| 10 | 20 个使用 Python 进行未来预测的机器学习项目 | [is.gd\u002FxDKDkl](http:\u002F\u002Fis.gd\u002FxDKDkl) |\n| 11 | 4 个使用 Python 的聊天机器人项目 | [is.gd\u002FLyZfXv](http:\u002F\u002Fis.gd\u002FLyZfXv) |\n| 12 | 7 个 Python GUI (图形用户界面) 项目 | [is.gd\u002F0KPBvP](http:\u002F\u002Fis.gd\u002F0KPBvP) |\n| 13 | 所有无监督学习项目 | [is.gd\u002Fcz11Kv](http:\u002F\u002Fis.gd\u002Fcz11Kv) |\n| 14 | 10 个用于回归分析的机器学习项目 | [is.gd\u002Fk8faV1](http:\u002F\u002Fis.gd\u002Fk8faV1) |\n| 15 | 10 个使用 Python 进行分类的机器学习项目 | [is.gd\u002FBJQjMN](http:\u002F\u002Fis.gd\u002FBJQjMN) |\n| 16 | 6 个使用 Python 的情感分析项目 | [is.gd\u002FWeiE5p](http:\u002F\u002Fis.gd\u002FWeiE5p) |\n| 17 | 4 个使用 Python 的推荐系统项目 | [is.gd\u002FpPHAP8](http:\u002F\u002Fis.gd\u002FpPHAP8) |\n| 18 | 20 个使用 Python 的深度学习项目 | [is.gd\u002Fl3OCJs](http:\u002F\u002Fis.gd\u002Fl3OCJs) |\n| 19 | 5 个使用 Python 的 COVID-19 项目 | [is.gd\u002FxFCnYi](http:\u002F\u002Fis.gd\u002FxFCnYi) |\n| 20 | 9 个使用 Python 的计算机视觉项目 | [is.gd\u002FlrNybj](http:\u002F\u002Fis.gd\u002FlrNybj) |\n| 21 | 8 个使用 Python 的神经网络项目 | [is.gd\u002FFCyOOf](is.gd\u002FFCyOOf) |\n| 22 | 5 个用于医疗健康的机器学习项目 | https:\u002F\u002Fbit.ly\u002F3b86bOH |\n| 23 | 5 个使用 Python 的 NLP 项目 | https:\u002F\u002Fbit.ly\u002F3hExtNS |\n| 24 | 47 个 2021 年机器学习项目 | https:\u002F\u002Fbit.ly\u002F356bjiC |\n| 25 | 2021 年 19 个人工智能项目 | https:\u002F\u002Fbit.ly\u002F38aLgsg |\n| 26 | 2021 年 28 个机器学习项目 | https:\u002F\u002Fbit.ly\u002F3bguRF1 |\n| 27 | 2021 年 16 个带源代码的数据科学项目 | https:\u002F\u002Fbit.ly\u002F3oa4zYD |\n| 28 | 2021 年 24 个带源代码的深度学习项目 | https:\u002F\u002Fbit.ly\u002F3rQrOsU |\n| 29 | 2021 年 25 个带源代码的计算机视觉项目 | https:\u002F\u002Fbit.ly\u002F2JDMO4I |\n| 30 | 2021 年 23 个带源代码的 IoT (物联网) 项目 | https:\u002F\u002Fbit.ly\u002F354gT53 |\n| 31 | 2021 年 27 个带源代码的 Django 项目 | https:\u002F\u002Fbit.ly\u002F2LdRPRZ |\n| 32 | 2021 年 37 个带代码的趣味 Python 项目 | https:\u002F\u002Fbit.ly\u002F3hBHzz4 |\n| 33 | 500+ 顶级深度学习方法代码 | https:\u002F\u002Fbit.ly\u002F3n7AkAc |\n| 34 | 500+ 机器学习代码 | https:\u002F\u002Fbit.ly\u002F3b32n13 |\n| 35 | 20+ 机器学习数据集与项目创意 | https:\u002F\u002Fbit.ly\u002F3b2J48c |\n| 36 | 1000+ 计算机视觉代码 | https:\u002F\u002Fbit.ly\u002F2LiX1nv |\n| 37 | 300+ 按行业划分的真实世界带代码项目 | https:\u002F\u002Fbit.ly\u002F3rN7lVR |\n| 38 | 1000+ Python 项目代码 | https:\u002F\u002Fbit.ly\u002F3oca2xM |\n| 39 | 363+ 带代码的 NLP 项目 | https:\u002F\u002Fbit.ly\u002F3b442DO |\n| 40 | 50+ 代码 ML 模型 (适用于 iOS 11) 项目 | https:\u002F\u002Fbit.ly\u002F389dB2s |\n| 41 | 180+ 图像、文本、音频和视频的预训练模型项目 | https:\u002F\u002Fbit.ly\u002F3hFyQMw |\n| 42 | 50+ 图分类项目列表 | https:\u002F\u002Fbit.ly\u002F3rOYFhH |\n| 43 | 100+ 句子嵌入 (NLP 资源) | https:\u002F\u002Fbit.ly\u002F355aS8c |\n| 44 | 100+ 生产级机器学习项目 | https:\u002F\u002Fbit.ly\u002F353ckI0 |\n| 45 | 300+ 机器学习资源合集 | https:\u002F\u002Fbit.ly\u002F3b2LjIE |\n| 46 | 70+ 精彩 AI 资源 | https:\u002F\u002Fbit.ly\u002F3hDIXkD |\n| 47 | 150+ 带代码的机器学习项目创意 | https:\u002F\u002Fbit.ly\u002F38bfpbg |\n| 48 | 100+ 带代码的 AutoML (自动机器学习) 项目 | https:\u002F\u002Fbit.ly\u002F356zxZX |\n| 49 | 100+ 机器学习模型可解释性代码框架 | https:\u002F\u002Fbit.ly\u002F3n7FaNB |\n| 50 | 120+ 多模型机器学习代码项目 | https:\u002F\u002Fbit.ly\u002F38QRI76 |\n| 51 | 精彩的聊天机器人项目 | https:\u002F\u002Fbit.ly\u002F3rQyxmE |\n| 52 | 带 iOS 的精美 ML 演示项目 | https:\u002F\u002Fbit.ly\u002F389hZOY |\n| 53 | 100+ 基于 Python 的机器学习应用项目 | https:\u002F\u002Fbit.ly\u002F3n9zLWv |\n| 54 | 100+ 机器学习和深度学习 (ML 和 DL) 的可复现研究项目 | https:\u002F\u002Fbit.ly\u002F2KQ0J8C |\n| 55 | 25+ Python 项目 | https:\u002F\u002Fbit.ly\u002F353fRpK |\n| 56 | 8+ OpenCV 项目 | https:\u002F\u002Fbit.ly\u002F389mj0B |\n| 57 | 1000+ 精彩深度学习合集 | https:\u002F\u002Fbit.ly\u002F3b0a9Jj |\n| 58 | 200+ 精彩 NLP 学习合集 | https:\u002F\u002Fbit.ly\u002F3b74b9o |\n| 59 | 200+ 超级 NLP 仓库 | https:\u002F\u002Fbit.ly\u002F3hDNnbd |\n| 60 | 100+ 用于你项目的 NLP 数据集 | https:\u002F\u002Fbit.ly\u002F353h2Wc |\n| 61 | 364+ 机器学习项目定义 | https:\u002F\u002Fbit.ly\u002F2X5QRdb |\n| 62 | 300+ Google Earth Engine Jupyter 笔记本用于分析地理空间数据 | https:\u002F\u002Fbit.ly\u002F387JwjC |\n| 63 | 1000+ 机器学习项目信息 | https:\u002F\u002Fbit.ly\u002F3rMGk4N |\n| 64. | 11 个带代码的计算机视觉项目 | https:\u002F\u002Fbit.ly\u002F38gz2OR |\n| 65. | 13 个带代码的计算机视觉项目 | https:\u002F\u002Fbit.ly\u002F3hMJdhh |\n| 66. | 13 个激发灵感的酷炫计算机视觉 GitHub 项目 | https:\u002F\u002Fbit.ly\u002F2LrSv6d |\n| 67. | 开源计算机视觉项目 (含教程) | https:\u002F\u002Fbit.ly\u002F3pUss6U |\n| 68. | 使用 Python 的 OpenCV 计算机视觉项目 | https:\u002F\u002Fbit.ly\u002F38jmGpn |\n| 69. | 100+ 计算机视觉算法实现 | https:\u002F\u002Fbit.ly\u002F3rWgrzF |\n| 70. | 80+ 计算机视觉学习代码 | https:\u002F\u002Fbit.ly\u002F3hKCpkm |\n| 71. | 深度学习宝藏 | https:\u002F\u002Fbit.ly\u002F359zLQb |\n\n[#100+ 免费机器学习书籍](https:\u002F\u002Fwww.theinsaneapp.com\u002F2020\u002F12\u002Fdownload-free-machine-learning-books.html)\n\n# 所有荣誉均归于各自的创作者，这些资源经过整合，旨在为数据科学爱好者打造一个精彩且紧凑的学习资源库。\n\n第一部分：[路线图](https:\u002F\u002Fgithub.com\u002FMrMimic\u002Fdata-scientist-roadmap)\n\n第二部分：[免费在线课程](https:\u002F\u002Fgithub.com\u002FDeveloper-Y)\n\n第三部分：[500 个数据科学项目](https:\u002F\u002Fgithub.com\u002Fashishpatel26\u002F500-AI-Machine-learning-Deep-learning-Computer-vision-NLP-Projects-with-code)\n\n第四部分：[100+ 本免费机器学习书籍](https:\u002F\u002Fwww.theinsaneapp.com\u002F2020\u002F12\u002Fdownload-free-machine-learning-books.html)\n\n第五部分：[10 本面向初学者的机器学习书籍](https:\u002F\u002Fwww.appliedaicourse.com\u002Fblog\u002Fmachine-learning-books\u002F)","# datascience (Data-Scientist-Roadmap) 快速上手指南\n\n本项目是一个全面的数据科学学习路线图资源库，涵盖了从基础数学、统计学到数据库原理及大数据架构的核心知识点。适合希望系统构建数据科学知识体系的开发者参考。\n\n## 1. 环境准备\n\n本工具主要为文档与资源集合，无需复杂的运行时环境，但建议配置以下基础工具以便查阅和学习：\n\n*   **操作系统**: Windows \u002F macOS \u002F Linux\n*   **版本控制**: Git (用于克隆项目)\n*   **浏览器**: Chrome \u002F Firefox \u002F Edge (用于查看在线链接和资源)\n*   **文本编辑器**: VS Code \u002F Sublime Text (可选，用于本地阅读 Markdown 文件)\n\n> **提示**: 在中国大陆访问 GitHub 时，如遇网络延迟，建议使用国内加速服务或代理工具进行克隆。\n\n## 2. 安装步骤\n\n由于本项目为知识库仓库，通过 Git 克隆即可获取全部资料。\n\n```bash\ngit clone [仓库地址]\ncd datascience\n```\n\n克隆完成后，项目将包含完整的 Markdown 文档及相关图片资源，可直接在本地打开阅读。\n\n## 3. 基本使用\n\n本项目采用模块化结构组织内容，主要包含两大核心板块：**Fundamentals** (基础) 和 **Statistics** (统计)。\n\n### 浏览核心模块\n\n进入项目目录后，重点阅读以下章节以建立知识框架：\n\n*   **1_ Fundamentals (基础篇)**\n    *   **Matrices & Algebra**: 了解矩阵运算（Addition, Multiplication 等）。\n    *   **DB basics**: 掌握关系代数与自然连接（Natural join）概念。\n    *   **SQL Joins**: 熟悉 Inner, Outer, Left, Right join 的语法示例：\n        ```sql\n        SELECT column_name(s)\n        FROM table1\n        INNER JOIN table2 ON table1.column_name = table2.column_name;\n        ```\n    *   **NoSQL & ETL**: 理解非关系型数据库架构及数据抽取流程。\n    *   **Regex**: 正则表达式在文本处理中的应用（Python 示例）：\n        ```python\n        import re\n        ```\n\n*   **2_ Statistics (统计篇)**\n    *   **Pick a dataset**: 访问 Kaggle 或 Google Dataset Search 获取数据集。\n    *   **Descriptive statistics**: 学习均值（Mean）等描述性统计指标。\n\n### 实践建议\n\n1.  **按需查阅**: 根据当前学习阶段，跳转到对应编号的章节（如 `1_ Matrices & Algebra fundamentals`）。\n2.  **外部链接**: 文中包含大量维基百科（Wikipedia）及官方文档链接，点击可深入阅读定义。\n3.  **代码复现**: 对于文中提供的 SQL 请求（Request）或 Python 导入语句，可在本地数据库或 IDE 中尝试运行以加深理解。","背景：某电商公司数据团队的新成员小李，需要在入职首周快速掌握数据处理所需的数学原理与 SQL 查询基础，以便顺利接手业务报表。\n\n### 没有 datascience 时\n- 网络资源杂乱无章，难以辨别矩阵运算与线性代数基础知识的正确性\n- 经常搞混自然连接与内连接的实际执行逻辑及最终返回结果的差异\n- 面对哈希函数映射和大 O 复杂度等抽象概念，缺乏直观的图文辅助说明\n- 学习路径完全靠个人摸索，导致大量宝贵时间耗费在筛选无效资料上\n\n### 使用 datascience 后\n- 依据 Roadmap 规划，按顺序攻克从基础代数到数据库查询的系统课程\n- 直接复用仓库中提供的标准 SQL 语句模板，大幅减少日常语法编写错误\n- 借助 Wiki 风格的图文说明，快速理解二叉树结构与算法运行复杂度\n- 建立统一的术语标准，有效避免在不同文档间切换造成的认知割裂感\n\n核心价值：datascience 通过整合高质量免费资源，显著降低了数据科学入门者的知识构建门槛与时间成本，助力新人快速上手。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fsreeharierk_datascience_0f9d554a.png","sreeharierk","Sreehari S","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fsreeharierk_c6037c98.png",null,"https:\u002F\u002Fgithub.com\u002Fsreeharierk",5154,531,"2026-04-04T15:29:12","GPL-3.0",1,"未说明",{"notes":87,"python":85,"dependencies":88},"该 README 内容为数据科学学习路线图（Data-Scientist-Roadmap）文档，属于理论知识与资源整理，并非可执行的软件工具或代码库。文中虽提及 Python 正则库示例及多种数据库技术概念，但未提供具体的软件安装步骤、运行环境配置、硬件性能要求或依赖包列表。",[],[51,14,13,26,54],[91,92,93,94,95,96,97,98,99],"data-science","machine-learning","machine-learning-algorithms","deeplearning","artificial-intelligence","computer-vision","natural-language-processing","neural-networks","datascienceproject",4,"2026-03-27T02:49:30.150509","2026-04-06T09:44:29.461060",[104,109,114],{"id":105,"question_zh":106,"answer_zh":107,"source_url":108},3134,"如何提交数据科学路线图建议？","通过提交 Issue 进行建议，维护者收到后会确认并在确认后回复你。","https:\u002F\u002Fgithub.com\u002Fsreeharierk\u002Fdatascience\u002Fissues\u002F25",{"id":110,"question_zh":111,"answer_zh":112,"source_url":113},3135,"统计学中的中心极限定理是如何定义的？","该定理指出，无论总体分布如何，只要样本量大于 30，样本均值的抽样分布的均值等于总体均值。此外，样本均值的抽样分布也将遵循正态分布。","https:\u002F\u002Fgithub.com\u002Fsreeharierk\u002Fdatascience\u002Fissues\u002F20",{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},3136,"README 和仓库描述中存在什么拼写错误？","发现了 '_Repositary_' 的拼写错误，在 README 和仓库描述中应更正为 '_Repository_'。","https:\u002F\u002Fgithub.com\u002Fsreeharierk\u002Fdatascience\u002Fissues\u002F18",[]]