[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-mbadry1--CS231n-2017-Summary":3,"tool-mbadry1--CS231n-2017-Summary":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":79,"owner_website":82,"owner_url":83,"languages":84,"stars":89,"forks":90,"last_commit_at":91,"license":92,"difficulty_score":93,"env_os":94,"env_gpu":95,"env_ram":95,"env_deps":96,"category_tags":99,"github_topics":100,"view_count":23,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":105,"updated_at":106,"faqs":107,"releases":118},2936,"mbadry1\u002FCS231n-2017-Summary","CS231n-2017-Summary","After watching all the videos of the famous Standford's CS231n course that took place in 2017, i decided to take summary of the whole course to help me to remember and to anyone who would like to know about it. 
# Stanford CS231n 2017 Summary

After watching all the videos of Stanford's famous [CS231n](http://cs231n.stanford.edu/) course from 2017, I decided to write a summary of the whole course, both to help me remember it and for anyone who would like to learn about it. I've skipped some content in some lectures that wasn't important to me.

## Table of contents

* [Stanford CS231n 2017 Summary](#stanford-cs231n-2017-summary)
   * [Table of contents](#table-of-contents)
   * [Course Info](#course-info)
   * [01. Introduction to CNN for visual recognition](#01-introduction-to-cnn-for-visual-recognition)
   * [02. Image classification](#02-image-classification)
   * [03. Loss function and optimization](#03-loss-function-and-optimization)
   * [04. Introduction to Neural network](#04-introduction-to-neural-network)
   * [05. Convolutional neural networks (CNNs)](#05-convolutional-neural-networks-cnns)
   * [06. Training neural networks I](#06-training-neural-networks-i)
   * [07. Training neural networks II](#07-training-neural-networks-ii)
   * [08. Deep learning software](#08-deep-learning-software)
   * [09. CNN architectures](#09-cnn-architectures)
   * [10. Recurrent Neural networks](#10-recurrent-neural-networks)
   * [11. Detection and Segmentation](#11-detection-and-segmentation)
   * [12. Visualizing and Understanding](#12-visualizing-and-understanding)
   * [13. Generative models](#13-generative-models)
   * [14. Deep reinforcement learning](#14-deep-reinforcement-learning)
   * [15. Efficient Methods and Hardware for Deep Learning](#15-efficient-methods-and-hardware-for-deep-learning)
   * [16. Adversarial Examples and Adversarial Training](#16-adversarial-examples-and-adversarial-training)

## Course Info

- Website: http://cs231n.stanford.edu/

- Lectures link: https://www.youtube.com/playlist?list=PLC1qU-LWwrF64f4QKQT-Vg5Wr4qEE1Zxk

- Full syllabus link: http://cs231n.stanford.edu/syllabus.html

- Assignments solutions: https://github.com/Burton2000/CS231n-2017

- Number of lectures: **16**

- Course description:

  - > Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization and detection. Recent developments in neural network (aka “deep learning”) approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This course is a deep dive into details of the deep learning architectures with a focus on learning end-to-end models for these tasks, particularly image classification. During the 10-week course, students will learn to implement, train and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. The final assignment will involve training a multi-million parameter convolutional neural network and applying it on the largest image classification dataset (ImageNet). We will focus on teaching how to set up the problem of image recognition, the learning algorithms (e.g. backpropagation), practical engineering tricks for training and fine-tuning the networks and guide the students through hands-on assignments and a final course project. Much of the background and materials of this course will be drawn from the [ImageNet Challenge](http://image-net.org/challenges/LSVRC/2014/index).
## 01. Introduction to CNN for visual recognition

- A brief history of computer vision, from the late 1960s to 2017.
- Computer vision problems include image classification, object localization, object detection, and scene understanding.
- [ImageNet](http://www.image-net.org/) is one of the biggest image classification datasets available right now.
- Since 2012, CNNs (convolutional neural networks) have won the ImageNet competition every year.
- The CNN was actually invented back in 1998 by [Yann LeCun](http://ieeexplore.ieee.org/document/726791/).

## 02. Image classification

- The image classification problem has many challenges, such as illumination and viewpoint changes.
  - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_1ee9457872f7.jpeg)
- Image classification can be attacked with **K nearest neighbors** (KNN), but it solves the problem poorly. The properties of KNN are:
  - Its hyperparameters are `k` and the distance measure.
  - `k` is the number of neighbors we compare to.
  - Distance measures include:
    - L2 distance (Euclidean distance)
      - Best for non-coordinate points.
    - L1 distance (Manhattan distance)
      - Best for coordinate points.
- Hyperparameters can be optimized using cross-validation as follows (in our case we are trying to pick `k`):
  1. Split your dataset into `f` folds.
  2. For each candidate hyperparameter setting, train your algorithm on `f-1` folds and test it on the remaining fold; repeat this for every fold.
  3. Choose the hyperparameters that give the best validation value (averaged over all folds).
- A **linear SVM** classifier is an option for solving the image classification problem, but the curse of dimensionality makes it stop improving at some point.
- **Logistic regression** is also a candidate, but the image classification problem is non-linear!
- Linear classifiers compute the equation `Y = wX + b`
  - The shape of `w` matches `x`, and the shape of `b` is 1.
- We can append a 1 to the `X` vector and fold the bias in, so that `Y = wX`
  - The shape of `x` becomes `oldX+1`, and `w` has the same shape as `x`.
- We need to find the `w`'s and `b`'s that make the classifier perform best.
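The cross-validation recipe above can be sketched in a few lines. This is a minimal illustration, not the course's code: `kfold_indices`, `cross_validate`, and the `score` callback (standing in for "train on `f-1` folds, evaluate on the held-out fold") are hypothetical names of mine.

```python
def kfold_indices(n, f):
    """Split indices 0..n-1 into f folds, as evenly as possible."""
    folds, start = [], 0
    for i in range(f):
        size = n // f + (1 if i < n % f else 0)
        folds.append(list(range(start, start + size)))
        start += size
    return folds

def cross_validate(candidates, folds, score):
    """Return the candidate hyperparameter with the best average score.
    score(k, train_idx, val_idx) stands in for training on the train folds
    and evaluating on the held-out fold."""
    best_k, best_avg = None, float("-inf")
    for k in candidates:
        scores = []
        for i, val in enumerate(folds):
            train = [j for fi, fold in enumerate(folds) if fi != i for j in fold]
            scores.append(score(k, train, val))
        avg = sum(scores) / len(scores)
        if avg > best_avg:
            best_k, best_avg = k, avg
    return best_k

folds = kfold_indices(10, 5)
# Toy usage: pretend validation accuracy peaks at k = 3.
best = cross_validate([1, 3, 5, 7], folds, lambda k, train, val: -abs(k - 3))
print(best)  # 3
```

The same loop works for any hyperparameter, not just `k`; only the `score` callback changes.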
## 03. Loss function and optimization

- In the last section we talked about the linear classifier, but we didn't discuss how to **train** the parameters of that model to get the best `w`'s and `b`'s.

- We need a loss function to measure how good or bad our current parameters are.

  - ```python
    Loss = L[i] = Li(f(X[i],W), Y[i])                # Loss for one example
    Loss_for_all = 1/N * Sum(Li(f(X[i],W), Y[i]))    # The average over all examples
    ```

- Then we find a way to minimize the loss function given some parameters. This is called **optimization**.

- Loss function for a linear **SVM** classifier:

  - `L[i] = Sum over all classes except the true class y[i] of max(0, s[j] - s[y[i]] + 1)`
  - We call this ***the hinge loss***.
  - The loss is zero when the true class scores higher than every other class by at least the margin; otherwise we accumulate an error for each violated margin of 1.
  - Example:
    - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_7c204cd1010c.jpg)
    - Given this example we want to compute the loss of this image.
    - `L = max(0, 437.9 - (-96.8) + 1) + max(0, 61.95 - (-96.8) + 1) = max(0, 535.7) + max(0, 159.75) = 695.45`
    - The final loss of 695.45 is large, reflecting that the cat score should be the highest over all classes while it is currently the lowest. We need to minimize that loss.
  - It's OK for the margin to be 1, but it's a hyperparameter too.

- If your loss function gives you zero, are those parameter values unique? No, there are many parameter settings that achieve the best score.

- You'll sometimes hear people instead using the squared hinge loss SVM (or L2-SVM), which penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but on some datasets the squared hinge loss can work better.

- We add **regularization** to the loss function so that the discovered model doesn't overfit the data.

  - ```python
    Loss = L = 1/N * Sum(Li(f(X[i],W), Y[i])) + lambda * R(W)
    ```

  - Where `R` is the regularizer and `lambda` is the regularization strength.

- There are different regularization techniques:

  - | Regularizer           | Equation                                  | Comments                    |
    | --------------------- | ----------------------------------------- | --------------------------- |
    | L2                    | `R(W) = Sum(W^2)`                         | Sum of all squared Ws       |
    | L1                    | `R(W) = Sum(abs(W))`                      | Sum of all absolute values  |
    | Elastic net (L1 + L2) | `R(W) = beta * Sum(W^2) + Sum(abs(W))`    |                             |
    | Dropout               |                                           | No equation                 |

- Regularization prefers smaller `W`s over big `W`s.

- L2 regularization is also called weight decay. Biases should not be included in regularization.

- Softmax loss (like logistic regression, but for more than 2 classes):

  - Softmax function:

    - ```python
      A[L] = e^(score[L]) / sum(e^(score[L]) over all classes)
      ```

  - The resulting vector sums to 1.

  - Softmax loss:

    - ```python
      Loss = -log P(Y = y[i] | X = x[i])
      ```

    - The log of the probability of the correct class. We want that probability to be near 1, which is why we add the minus sign. Softmax loss is also called cross-entropy loss.

  - Consider this numerical problem when you are computing softmax:

    - ```python
      f = np.array([123, 456, 789]) # example with 3 classes, each having large scores
      p = np.exp(f) / np.sum(np.exp(f)) # Bad: numeric problem, potential blowup

      # instead: first shift the values of f so that the highest number is 0:
      f -= np.max(f) # f becomes [-666, -333, 0]
      p = np.exp(f) / np.sum(np.exp(f)) # safe to do, gives the correct answer
      ```
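The hinge-loss arithmetic in the SVM example above can be checked in a few lines. The scores are taken from that example; `svm_hinge_loss` is just an illustrative name of mine.

```python
def svm_hinge_loss(scores, correct, margin=1.0):
    """Multiclass SVM (hinge) loss for one example:
    sum over all classes j != correct of max(0, s[j] - s[correct] + margin)."""
    return sum(max(0.0, s - scores[correct] + margin)
               for j, s in enumerate(scores) if j != correct)

# Scores from the example above: cat (true class) = -96.8, car = 437.9, frog = 61.95
loss = svm_hinge_loss([-96.8, 437.9, 61.95], correct=0)
print(round(loss, 2))  # 695.45, matching the hand computation
```

If the true class instead scored comfortably highest, both margins would be satisfied and the loss would be exactly zero.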
- **Optimization**:

  - How can we optimize the loss functions we discussed?
  - Strategy one:
    - Sample random parameters, evaluate the loss for each, and keep the best. It's a bad idea.
  - Strategy two:
    - Follow the slope.
      - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_1eb15957ba4a.png)
      - Image [source](https://rasbt.github.io/mlxtend/user_guide/general_concepts/gradient-optimization_files/ball.png).

    - Our goal is to compute the gradient with respect to each parameter we have.
      - **Numerical gradient**: approximate, slow, easy to write.   (But useful for debugging.)
      - **Analytic gradient**: exact, fast, error-prone.   (Always used in practice.)

    - After we compute the gradients of our parameters, we apply the gradient descent update:
      - ```python
        W = W - learning_rate * W_grad
        ```

    - The learning rate is such an important hyperparameter that you should tune it before all the other hyperparameters.

    - Stochastic gradient descent (SGD):
      - Instead of using all the data, use a mini batch of examples (32/64/128 are commonly used) for faster results.

## 04. Introduction to Neural network

- Computing the analytic gradient for arbitrarily complex functions:

  - What is a computational graph?

    - It can represent any function, with nodes for the operations.
    - Computational graphs lead us naturally to a technique called back-propagation, even for complex models like CNNs and RNNs.

  - Back-propagation simple example:

    - Suppose we have `f(x,y,z) = (x+y)z`

    - Then the graph can be represented this way:

    - ```
      X         
        \
         (+)--> q ---(*)--> f
        /           /
      Y            /
                  /
                 /
      Z---------/
      ```

    - We introduce an intermediate variable `q` to hold the value of `x+y`

    - Then we have:

      - ```python
        q = (x+y)              # dq/dx = 1 , dq/dy = 1
        f = qz                 # df/dq = z , df/dz = q
        ```

    - Then:

      - ```python
        df/dq = z
        df/dz = q
        df/dx = df/dq * dq/dx = z * 1 = z       # Chain rule
        df/dy = df/dq * dq/dy = z * 1 = z       # Chain rule
        ```

  - So in a computational graph we call each operation `f`. For each `f` we calculate the local gradient first, and then during back-propagation we compute the gradients with respect to the loss function using the chain rule.

  - In a computational graph you can split each operation into pieces as simple as you like, but you will end up with many nodes; if you prefer fewer, larger nodes, make sure you can still compute the gradient of each node.
  - A bigger example:

    - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_16f775ff9bfb.png)
    - Hint: when two branches flow back into one node, the back-propagated gradients are added.

  - Modularized implementation: forward/backward API (example: multiply gate):

    - ```python
      class MultiplyGate(object):
          """
          x, y are scalars
          """
          def forward(self, x, y):
              z = x * y
              self.x = x  # Cache x and y, because the local
              self.y = y  # derivatives depend on them.
              return z
          def backward(self, dz):
              dx = self.y * dz   # local gradient dz/dx = y
              dy = self.x * dz   # local gradient dz/dy = x
              return [dx, dy]
      ```

  - If you look at a deep learning framework you will find it follows this modularized implementation, where each class defines a forward and a backward. For example:

    - Multiplication
    - Max
    - Plus
    - Minus
    - Sigmoid
    - Convolution

- So to define a neural network as a function:

  - (Before) Linear score function: `f = Wx`
  - (Now) 2-layer neural network:    `f = W2*max(0, W1*x)`
    - Where max is the ReLU non-linearity.
  - (Now) 3-layer neural network:    `f = W3*max(0, W2*max(0, W1*x))`
  - And so on..

- A neural network is a stack of simple operations that together form a complex operation.
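To close the loop on this section: the analytic gradients derived in the `f(x,y,z) = (x+y)z` example above (`df/dx = df/dy = z`, `df/dz = x + y`) can be verified against a slow numerical gradient, exactly the debugging use mentioned in section 03. A minimal sketch; the helper names are mine, not the course's.

```python
def f(x, y, z):
    # The example function from this section: f(x, y, z) = (x + y) * z
    return (x + y) * z

def numerical_grad(fn, args, i, h=1e-6):
    """Centered finite difference with respect to argument i
    (the approximate, slow, debug-only gradient)."""
    plus = list(args);  plus[i] += h
    minus = list(args); minus[i] -= h
    return (fn(*plus) - fn(*minus)) / (2 * h)

x, y, z = -2.0, 5.0, -4.0
# Analytic gradients via the chain rule: df/dx = z, df/dy = z, df/dz = x + y
for i, analytic in enumerate([z, z, x + y]):
    assert abs(analytic - numerical_grad(f, [x, y, z], i)) < 1e-4
```

The same check scales to any computational graph: compare each backward-pass gradient against a finite difference before trusting the analytic code.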
## 05. Convolutional neural networks (CNNs)

- Neural networks history:
  - The first perceptron machine was developed by Frank Rosenblatt in 1957. It was used to recognize letters of the alphabet. Back-propagation hadn't been developed yet.
  - The multilayer perceptron came around 1960 (Widrow and Hoff's Adaline/Madaline). Back-propagation still hadn't been developed.
  - Back-propagation was developed in 1986 by Rumelhart.
  - Then there was a period in which nothing new happened with neural networks, because of limited computing resources and data.
  - In [2006](https://www.cs.toronto.edu/~fritz/absps/netflix.pdf) Hinton released a paper showing that we can train a deep neural network by using Restricted Boltzmann machines to initialize the weights, followed by back-propagation.
  - The first strong results came in 2012 from Hinton's group, in [speech recognition](http://ieeexplore.ieee.org/document/6296526/) and with [AlexNet](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), the convolutional neural network that won ImageNet in 2012.
  - Since then, neural networks have been widely used in various applications.
- Convolutional neural networks history:
  - Hubel & Wiesel's experiments on cat cortex (1959 to 1968) found that there is a topographical mapping in the cortex and that the neurons have a hierarchical organization from simple to complex.
  - In 1998, Yann LeCun published [Gradient-based learning applied to document recognition](http://ieeexplore.ieee.org/document/726791/), which introduced convolutional neural networks. It was good for recognizing zip-code digits but couldn't scale to more complex examples.
  - In 2012, [AlexNet](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) used essentially the same LeCun architecture and won the ImageNet challenge. The difference from 1998 is that we now have large datasets, and the power of GPUs solved many performance problems.
  - Starting from 2012, CNNs have been used for various tasks. Some applications:
    - Image classification.
    - Image retrieval.
      - Extract features using a NN and then do similarity matching.
    - Object detection.
    - Segmentation.
      - Each pixel in an image gets a label.
    - Face recognition.
    - Pose recognition.
    - Medical images.
    - Playing Atari games with reinforcement learning.
    - Galaxy classification.
    - Street sign recognition.
    - Image captioning.
    - Deep dream.
- ConvNet architectures make the explicit assumption that the inputs are images, which allows us to encode certain properties into the architecture.
- There are a few distinct types of layers in a ConvNet (e.g. CONV/FC/RELU/POOL are by far the most popular).
- Each layer may or may not have parameters (e.g. CONV/FC do, RELU/POOL don't).
- Each layer may or may not have additional hyperparameters (e.g. CONV/FC/POOL do, RELU doesn't).
- How do convolutional neural networks work?
  - A fully connected layer is a layer in which all the neurons are connected. Sometimes we call it a dense layer.
    - If the input shape is `(X, M)`, the weight shape for this layer will be `(NoOfHiddenNeurons, X)`.
  - A convolution layer is a layer that preserves the spatial structure of the input by sliding a filter over the whole image.
    - We do this with a dot product: `W.T*X + b` (this equation uses broadcasting).
    - So we need to learn the values of `W` and `b`.
    - We usually treat the filter (`W`) as a vector, not a matrix.
  - We call the output of a convolution an activation map. We need multiple activation maps.
    - Example with 6 filters; here are the shapes:
      - Input image                        `(32,32,3)`
      - Filter size                              `(5,5,3)`
        - We apply 6 filters. The depth must be 3 because the input map has a depth of 3.
      - Output of conv                 `(28,28,6)`
        - With one filter it would be   `(28,28,1)`
      - After RELU                          `(28,28,6)`
      - Another filter                     `(5,5,6)` (applied 10 times)
      - Output of conv                 `(24,24,10)`
  - It turns out that ConvNets learn low-level features in the first layers, then mid-level features, and then high-level features.
  - After the conv layers we can attach a linear classifier for a classification task.
  - In convolutional neural networks we usually have some (Conv ==> Relu) blocks and then apply a pooling operation to downsample the activations.
- What is the stride when doing convolution?
  - In a conv layer we have choices to make regarding the stride. Some examples:
  - Stride is how far the filter skips while sliding. By default it is 1.
  - Given a matrix of shape `(7,7)` and a filter of shape `(3,3)`:
    - If the stride is `1` the output shape will be `(5,5)`              `# 2 are dropped`
    - If the stride is `2` the output shape will be `(3,3)`             `# 4 are dropped`
    - If the stride is `3` it doesn't work.
  - A general formula: `O = ((N-F)/stride) + 1`
    - If stride is `1` then `O = ((7-3)/1) + 1 = 4 + 1 = 5`
    - If stride is `2` then `O = ((7-3)/2) + 1 = 2 + 1 = 3`
    - If stride is `3` then `O = ((7-3)/3) + 1 = 1.33 + 1 = 2.33`        `# not an integer, doesn't work`
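A one-line helper makes the stride formula above concrete. This is my own sketch (it also accepts the zero padding discussed next, padding both sides, and returns `None` when the stride doesn't divide evenly):

```python
def conv_output_size(n, f, stride=1, pad=0):
    """Output width for input width n and filter width f:
    ((n + 2*pad - f) / stride) + 1, with zero padding on both sides."""
    span = n + 2 * pad - f
    if span % stride != 0:
        return None  # the filter doesn't tile the input evenly
    return span // stride + 1

print(conv_output_size(7, 3, stride=1))  # 5
print(conv_output_size(7, 3, stride=2))  # 3
print(conv_output_size(7, 3, stride=3))  # None: stride 3 doesn't work here
```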
- In practice it's common to zero-pad the border.   `# Padding from both sides.`
  - Given a stride of `1`, it's common to pad according to `(F-1)/2`, where `F` is the filter size:
    - Example: `F = 3` ==> zero-pad with `1`
    - Example: `F = 5` ==> zero-pad with `2`
  - If we pad this way we call it a "same" convolution.
  - Adding zeros gives artificial features to the edges; that's why there are other padding techniques, such as repeating the border values instead of zeros, but in practice zeros work!
  - We do this to maintain the full size of the input. If we didn't, the input would shrink too fast and we would lose a lot of data.
- Example:
  - If we have an input of shape `(32,32,3)` and ten filters of shape `(5,5)` with stride `1` and pad `2`:
    - The output size will be `(32,32,10)`                       `# We maintain the size.`
  - Number of parameters per filter `= 5*5*3 + 1 = 76`
  - All parameters `= 76 * 10 = 760`
- The number of filters is usually a power of 2.           `# To vectorize well.`
- So here are the hyperparameters of a conv layer:
  - Number of filters `K`.
    - Usually a power of 2.
  - Spatial filter size `F`.
    - 3, 5, 7, ...
  - The stride `S`.
    - Usually 1 or 2. (A big stride downsamples, though differently from pooling.)
  - Amount of padding.
    - If we want the output shape to equal the input shape: pad 1 if `F` is 3, pad 2 if `F` is 5, and so on.
- Pooling makes the representation smaller and more manageable.
- Pooling operates over each activation map independently.
- An example of pooling is max pooling.
  - The parameters of max pooling are the filter size and the stride.
    - Example: `2x2` with stride `2`.                     `# Usually the two parameters are the same: 2, 2`
- Another example of pooling is average pooling.
  - In this case it might be learnable.
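The parameter arithmetic in the example above generalizes directly; a small sketch (the helper name is mine):

```python
def conv_layer_params(filter_size, in_depth, num_filters):
    """Learnable parameters of a conv layer: each filter has
    filter_size * filter_size * in_depth weights plus one bias."""
    per_filter = filter_size * filter_size * in_depth + 1
    return per_filter * num_filters

# The example above: ten 5x5 filters over a depth-3 input.
print(conv_layer_params(5, 3, 10))  # 760  (76 parameters per filter)
```

Note that the count is independent of the input's spatial size; that weight sharing is what keeps conv layers so much smaller than fully connected ones.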
## 06. Training neural networks I

- As a revision, here are the steps of the mini-batch stochastic gradient descent algorithm:

  - Loop:
    1. Sample a batch of data.
    2. Forward-prop it through the graph (network) and get the loss.
    3. Backprop to calculate the gradients.
    4. Update the parameters using the gradients.

- Activation functions:

  - Choices of activation function include Sigmoid, tanh, RELU, Leaky RELU, Maxout, and ELU.

  - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_89e6268a8f7a.png)

  - Sigmoid:

    - Squashes numbers into the range [0,1].
    - Historically motivated as a firing rate, like in human brains.
    - `Sigmoid(x) = 1 / (1 + e^-x)`
    - Problems with sigmoid:
      - Saturated neurons ***kill*** the gradients.
        - For big or small inputs the gradient is near 0, which kills the updates if the graph/network is large.
      - Not zero-centered.
        - It doesn't produce zero-mean outputs.
      - `exp()` is a bit compute-expensive.
        - Just worth mentioning; we have more expensive operations in deep learning, like convolution.

  - Tanh:

    - Squashes numbers into the range [-1,1].
    - Zero-centered.
    - Saturated neurons still "kill" the gradients.
    - `Tanh(x)` is the equation.
    - Proposed by Yann LeCun in 1991.

  - RELU (Rectified linear unit):

    - `RELU(x) = max(0,x)`
    - Doesn't saturate in the positive region, so it doesn't kill gradients there.
      - Only negative inputs are killed: the gradient is zero on that half.
    - Computationally efficient.
    - Converges much faster than sigmoid and tanh `(6x)`.
    - More biologically plausible than sigmoid.
    - Popularized by Alex Krizhevsky (University of Toronto) in 2012. (AlexNet)
    - Problems:
      - Not zero-centered.
    - If the weights aren't initialized well, maybe 75% of the neurons will be dead, which is wasted computation. But it still works. This is an active area of research.
    - To reduce the issue mentioned above, people sometimes initialize all the biases to 0.01.
  - Leaky RELU:

    - `leaky_RELU(x) = max(0.01*x, x)`
    - Doesn't kill the gradients on either side.
    - Computationally efficient.
    - Converges much faster than sigmoid and tanh (6x).
    - Will not "die".
    - PRELU replaces the 0.01 with a variable alpha that is learned as a parameter.

  - Exponential linear units (ELU):

    - ```
      ELU(x) = { x                       if x > 0
                 alpha * (exp(x) - 1)    if x <= 0
               }
      # alpha is a parameter
      ```

    - It has all the benefits of RELU.

    - Closer to zero-mean outputs, and adds some robustness to noise.

    - Problems:

      - `exp()` is a bit compute-expensive.

  - Maxout activations:

    - `maxout(x) = max(w1.T*x + b1, w2.T*x + b2)`
    - Generalizes RELU and Leaky RELU.
    - Doesn't die!
    - Problems:
      - Doubles the number of parameters per neuron.

  - In practice:

    - Use RELU. Be careful with your learning rates.
    - Try out Leaky RELU / Maxout / ELU.
    - Try out tanh, but don't expect much.
    - Don't use sigmoid!

- **Data preprocessing**:

  - Normalize the data:

  - ```python
    # Zero-center the data (subtract the mean of every input dimension).
    # One of the reasons we do this is that we want the data to span both
    # positive and negative values, not be all positive or all negative.
    X -= np.mean(X, axis=1)

    # Then divide by the standard deviation. Hint: for images we usually skip this step.
    X /= np.std(X, axis=1)
    ```

  - To normalize images:

    - Subtract the mean image (e.g. AlexNet).
      - The mean image has the same shape as the input images.
    - Or subtract the per-channel mean.
      - That is, calculate the mean of each channel over all images. The resulting shape is `3` (one mean per channel).
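The two image-normalization options above can be sketched in NumPy. This assumes a batch laid out as `(N, H, W, C)`; the array names are mine:

```python
import numpy as np

rng = np.random.default_rng(0)
images = rng.uniform(0, 255, size=(8, 4, 4, 3))  # toy batch: (N, H, W, C)

# Option 1: subtract the mean image (same shape as one input image).
mean_image = images.mean(axis=0)
centered = images - mean_image

# Option 2: subtract the per-channel mean (shape (3,), one value per channel).
channel_mean = images.mean(axis=(0, 1, 2))
centered_per_channel = images - channel_mean

assert mean_image.shape == (4, 4, 3)
assert channel_mean.shape == (3,)
assert abs(centered.mean()) < 1e-9  # the batch is now zero-centered
```

Whichever statistics you use, remember to compute them on the training set only and reuse them at test time.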
- **Weight initialization**:

  - What happens when we initialize all the `W`s with zeros?

    - All the neurons will do exactly the same thing: they have the same gradient and get the same update.
    - The same happens whenever all the `W`s of a specific layer are equal.

  - A first idea is to initialize the `w`s with small random numbers:

    - ```python
      W = 0.01 * np.random.randn(D, H)
      # Works OK for small networks, but causes problems in deeper networks!
      ```

    - The standard deviation of the activations shrinks toward zero in deeper networks, so the gradients vanish sooner.

    - ```python
      W = 1.0 * np.random.randn(D, H)
      # Also problematic in deeper networks!
      ```

    - The network will explode with big numbers!

  - ***Xavier initialization***:

    - ```python
      W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in)
      ```

    - It works because we want the variance of the output to match the variance of the input.

    - But it has an issue: it breaks when you are using RELU.

  - ***He initialization*** (solution for the RELU issue):

    - ```python
      W = np.random.randn(fan_in, fan_out) / np.sqrt(fan_in / 2)
      ```

    - Solves the issue with RELU; recommended when you are using RELU.

  - Proper initialization is an active area of research.

- **Batch normalization**:

  - A technique to provide any layer in a neural network with inputs that have zero mean and unit variance.
  - It speeds up training. You want to use it a lot.
    - Introduced by Sergey Ioffe and Christian Szegedy in 2015.
  - We make the activations in each layer Gaussian by calculating the mean and the variance.
  - Usually inserted after fully connected or convolutional layers and before the nonlinearity.
  - Steps (for each output of a layer):
    1. Compute the mean and variance of the batch for each feature.
    2. Normalize by subtracting the mean and dividing by the square root of (variance + epsilon).
       - epsilon avoids division by zero.
    3. Then apply a scale and shift: `Result = gamma * normalizedX + beta`
       - gamma and beta are learnable parameters.
       - This makes it possible for a layer to say "I don't want zero-mean/unit-variance input, give me back the raw input - it's better for me."
       - In other words: shift and scale by whatever works, not necessarily the batch mean and variance!
  - The algorithm makes each layer flexible (it chooses which distribution it wants).
  - We initialize the BatchNorm parameters to transform the input to a zero-mean/unit-variance distribution, but during training they can learn that some other distribution might be better.
  - During training we also maintain a running global mean and global variance for each layer (a weighted average over batches), for use at test time.
  - <u>Benefits of batch normalization</u>:
    - Networks train faster.
    - Allows higher learning rates.
    - Helps reduce sensitivity to the initial weights.
    - Makes more activation functions viable.
    - Provides some regularization.
      - Because the mean and variance are computed per batch, which adds a slight noise/regularization effect.
  - In conv layers, we have one mean and one variance per activation map.
  - Batch normalization has worked best for conv and regular deep NNs; for recurrent NNs and reinforcement learning it's still an active research area.
    - It's challenging in reinforcement learning because the batch is small.
Disable the regularization again and take a small number of data and try to train the loss and reach zero loss.\n     - You should overfit perfectly for small datasets.\n  6. Take your full training data, and small regularization then try some value of learning rate.\n     - If loss is barely changing, then the learning rate is small.\n     - If you got `NAN` then your NN exploded and your learning rate is high.\n     - Get your learning rate range by trying the min value (That can change) and the max value that doesn't explode the network.\n  7. Do Hyperparameters optimization to get the best hyperparameters values.\n\n- Hyperparameter Optimization\n\n  - Try Cross validation strategy.\n    - Run with a few ephocs, and try to optimize the ranges.\n  - Its best to optimize in log space.\n  - Adjust your ranges and try again.\n  - Its better to try random search instead of grid searches (In log space)\n\n\n\n## 07. Training neural networks II\n\n- **Optimization algorithms**:\n\n  - Problems with stochastic gradient descent:\n\n    - if loss quickly in one direction and slowly in another (For only two variables), you will get very slow progress along shallow dimension, jitter along steep direction. 
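A minimal numpy sketch of this zigzag behaviour, under an assumed toy quadratic loss (the curvatures 10 and 0.1 and the learning rate are illustrative, not from the lecture):

```python
import numpy as np

# Illustrative loss: loss(w) = 0.5 * (10 * w[0]**2 + 0.1 * w[1]**2)
# w[0] is the steep direction, w[1] the shallow one.
def grad(w):
    return np.array([10.0 * w[0], 0.1 * w[1]])

w = np.array([1.0, 1.0])
lr = 0.18  # near the stability limit 2/10 set by the steep direction

steep_signs = []
for _ in range(100):
    w -= lr * grad(w)
    steep_signs.append(np.sign(w[0]))

# The steep coordinate flips sign every step (jitter) while shrinking;
# the shallow coordinate creeps toward zero very slowly.
```

After 100 steps the steep coordinate has essentially converged (while oscillating), but the shallow coordinate has barely moved; this is exactly the slow-progress-plus-jitter pathology that momentum methods address.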
With the millions of parameters in a real network, this problem becomes even more pronounced.
    - Local minima and saddle points:
      - If SGD reaches a local minimum, it gets stuck there because the gradient is zero.
      - At saddle points the gradient is also zero, so we get stuck there too.
      - At a saddle point:
        - Moving along some directions increases the loss.
        - Moving along other directions decreases the loss.
        - This happens much more often in high dimensions (100 million dimensions, for example).
      - For deep NNs, the problem is more about saddle points than local minima, because deep NNs have very high-dimensional parameter spaces.
      - Mini-batch gradients are noisy because the gradient is not computed over the whole dataset.

  - **SGD + momentum**:

    - Build up velocity as a running mean of gradients:

    - ```python
      # Exponentially weighted average. rho is typically in the range [0.9, 0.99]
      v[t+1] = rho * v[t] + dx
      x[t+1] = x[t] - learning_rate * v[t+1]
      ```

    - `v[0]` is zero.

    - Helps with the saddle point and local minimum problems.

    - It can overshoot the minimum, then come back to it.

  - **Nesterov momentum**:

    - ```python
      dx = compute_gradient(x)
      old_v = v
      v = rho * v - learning_rate * dx
      x += -rho * old_v + (1 + rho) * v
      ```

    - Overshoots less, but can make slower progress than SGD + momentum.

  - **AdaGrad**

    - ```python
      grad_squared = 0
      while True:
        dx = compute_gradient(x)

        # Problem: grad_squared is never decayed, so it keeps growing
        # and the effective step size shrinks toward zero.
        grad_squared += dx * dx

        x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
      ```

  - **RMSProp**

    - ```python
      grad_squared = 0
      while True:
        dx = compute_gradient(x)

        # Solves AdaGrad's problem: grad_squared now decays instead of growing forever.
        grad_squared = decay_rate * grad_squared + (1 - decay_rate) * dx * dx

        x -= (learning_rate * dx) / (np.sqrt(grad_squared) + 1e-7)
      ```

    - People use this instead of AdaGrad.

  - **Adam**

    - Combines the momentum idea with the RMSProp idea.
    - It needs bias correction to fix the first few steps, when the running averages start at zero.
    - The best technique so far; it runs well on a lot of problems.
    - `beta1 = 0.9`, `beta2 = 0.999`, and `learning_rate = 1e-3` or `5e-4` is a great starting point for many models!

  - **Learning rate decay**

    - E.g. decay the learning rate by half every few epochs.
    - Helps the optimizer stop bouncing around near the minimum.
    - Learning rate decay is common with SGD + momentum, but less common with Adam.
    - Don't use learning rate decay from the start when choosing your hyperparameters. Train without it first and check whether you need decay.

  - All the algorithms discussed above are first-order optimization methods.

  - **Second-order optimization**

    - Use the gradient and the Hessian to form a quadratic approximation.
    - Step to the minimum of the approximation.
    - What is nice about this update?
      - Some versions don't need a learning rate.
    - But it's impractical for deep learning:
      - The Hessian has O(N^2) elements.
      - Inverting it takes O(N^3).
    - **L-BFGS** is a second-order method that avoids forming the full Hessian.
      - Works with full-batch optimization, but not with mini-batches.

  - In practice, use Adam first; if that doesn't work (and full-batch updates are affordable), try L-BFGS.

  - Some say all the famous deep architectures use **SGD + Nesterov momentum**.

- **Regularization**

  - So far we have talked about reducing the training error, but what we care about most is how our model handles unseen data!
  - What if the gap between the training error and the validation error is too large?
  - This problem is called high variance.
  - **Model ensembles**:
    - Algorithm:
      - Train multiple independent models of the same architecture with different initializations.
      - At test time, average their results.
    - It can get you an extra ~2% performance.
    - It reduces the generalization error.
    - You can also ensemble snapshots of your NN taken during training and average their results.
  - Regularization addresses the high variance problem. We have already talked about L1 and L2 regularization.
  - Some regularization techniques are designed specifically for NNs and can do better.
  - **Dropout**:
    - In each forward pass, randomly set some neurons to zero. The probability of dropping is a hyperparameter; 0.5 is used in most cases.
    - So you pick some activations and set them to zero.
    - It works because:
      - It forces the network to have a redundant representation and prevents co-adaptation of features!
      - If you think about it, it ensembles many sub-models within the same model!
    - At test time, we multiply each dropout layer's activations by the keep probability.
    - Alternatively, with "inverted dropout" we do that scaling at training time instead, so at test time nothing needs to be multiplied.
    - With dropout, training takes longer.
  - **Data augmentation**:
    - Another technique that provides regularization.
    - Change the data!
    - For example, flip the image or rotate it.
    - Example in ResNet:
      - Training: sample random crops and scales:
        1. Pick a random L in the range [256, 480].
        2. Resize the training image so its short side equals L.
        3. Sample a random 224x224 patch.
      - Testing: average over a fixed set of crops:
        1. Resize the image at 5 scales: {224, 256, 384, 480, 640}.
        2. For each size, use 10 224x224 crops: 4 corners + center, plus flips.
      - Apply color jitter or PCA-based color augmentation.
      - Translation, rotation, stretching.
  - Drop connect
    - Regularizes like the dropout idea.
    - Instead of zeroing activations, we randomly zero weights.
  - Fractional max pooling
    - A cool regularization idea.
Not commonly used.
    - Randomizes the regions over which we pool.
  - Stochastic depth
    - A newer idea.
    - Eliminates whole layers instead of neurons.
    - Has a similar effect to dropout.

- **Transfer learning**:

  - Sometimes your model overfits because your dataset is small, not because you lack regularization.

  - You need a lot of data if you want to train/use CNNs from scratch.

  - Steps of transfer learning:

    1. Train on a big dataset that shares common features with your dataset. This is called pretraining.
    2. Freeze all layers except the last one, and feed your small dataset through to train only that last layer.
    3. You don't have to retrain only the last layer; you can fine-tune any number of layers depending on how much data you have.

  - Guide to using transfer learning:

    - |                         | Very similar dataset               | Very different dataset                   |
      | ----------------------- | ---------------------------------- | ---------------------------------------- |
      | **Very little data**    | Use a linear classifier on the top layer | You're in trouble... Try a linear classifier on features from different stages |
      | **Quite a lot of data** | Fine-tune a few layers             | Fine-tune a larger number of layers      |

  - Transfer learning is the norm, not an exception.



## 08. Deep learning software

- This section of CS231n changes a lot every year due to the rapid changes in deep learning software.
- CPU vs GPU
  - The GPU (graphics card) was developed to render graphics for games, 3D media, etc.
    - NVIDIA vs AMD
      - Deep learning chose NVIDIA over AMD because NVIDIA pushes deep learning research forward and makes its architecture more suitable for deep learning.
  - A CPU has fewer cores, but each core is much faster and more capable; great for sequential tasks. A GPU has many more cores, but each core is much slower and "dumber"; great for parallel tasks.
  - GPU cores need to work together, and the GPU has its own memory.
  - Matrix multiplication is one of the operations well suited to GPUs: it consists of MxN independent dot products that can be computed in parallel.
  - The convolution operation can also be parallelized because it consists of independent operations.
  - GPU programming frameworks:
    - **CUDA** (NVIDIA only)
      - Write C-like code that runs directly on the GPU.
      - It's hard to write well-optimized GPU code yourself; that's why NVIDIA provides higher-level APIs.
      - Higher-level APIs: cuBLAS, cuDNN, etc.
      - **cuDNN** implements backprop, convolution, recurrent layers, and a lot more for you!
      - In practice you won't write parallel code yourself. You will use code implemented and optimized by others!
    - **OpenCL**
      - Similar to CUDA, but runs on any GPU.
      - Usually slower.
      - Doesn't have much support yet from deep learning software.
  - There are many courses for learning parallel programming.
  - If you aren't careful, training can bottleneck on reading data from disk and transferring it to the GPU. Solutions:
    - Read all the data into RAM, if possible.
    - Use an SSD instead of an HDD.
    - Use multiple CPU threads to prefetch data!
      - While the GPU is computing, a CPU thread fetches the data for you.
      - Many frameworks implement this for you, because it's a little painful to do yourself!
- **Deep learning frameworks**
  - The field moves super fast!
  - Currently available frameworks:
    - Tensorflow (Google)
    - Caffe (UC Berkeley)
    - Caffe2 (Facebook)
    - Torch (NYU / Facebook)
    - PyTorch (Facebook)
    - Theano (U Montreal)
    - Paddle (Baidu)
    - CNTK (Microsoft)
    - MXNet (Amazon)
  - The instructor thinks you should focus on Tensorflow and PyTorch.
  - The point of deep learning frameworks:
    - Easily build big computational graphs.
    - Easily compute gradients in computational graphs.
    - Run it all efficiently on the GPU (cuDNN, cuBLAS).
  - Numpy doesn't run on the GPU.
  - Most frameworks try to look like numpy in the forward pass, and then compute the gradients for you.
- **Tensorflow (Google)**
  - The code has two parts:
    1. Define the computational graph.
    2. Run the graph, reusing it many times.
  - Tensorflow uses a static graph architecture.
  - Tensorflow variables live in the graph, while placeholders are fed on each run.
  - The global initializer function initializes the variables that live in the graph.
  - Use the predefined optimizers and losses.
  - You can create a full layer with the `layers.dense` function.
  - **Keras** (high-level wrapper):
    - Keras is a layer on top of Tensorflow that makes common things easy to do.
    - So popular!
    - Trains a full deep NN in a few lines of code.
  - There are a lot of high-level wrappers:
    - Keras
    - TFLearn
    - TensorLayer
    - tf.layers   `# Ships with tensorflow`
    - tf-Slim   `# Ships with tensorflow`
    - tf.contrib.learn   `# Ships with tensorflow`
    - Sonnet `# New, from DeepMind`
  - Tensorflow has pretrained models that you can use for transfer learning.
  - Tensorboard adds logging to record losses and stats. Run the server and get pretty graphs!
  - It has distributed-execution support if you want to split your graph across several nodes.
  - Tensorflow was actually inspired by Theano; it has the same ideas and structure.


- **PyTorch (Facebook)**

  - Has three layers of abstraction:
    - Tensor: `ndarray` that runs on the GPU     `# Like numpy arrays in tensorflow`
    - Variable: node in a computational graph; stores data and gradient `# Like Tensor, Variable, Placeholder in tensorflow`
    - Module: a NN layer; may store state or learnable weights `# Like tf.layers in tensorflow`
  - In PyTorch the graph runs in the same loop you are executing, which makes debugging easier. This is called a dynamic graph.
  - In PyTorch you can define your own autograd functions by writing forward and backward for tensors. Most of the time this is already implemented for you.
  - torch.nn is a high-level API, like Keras in tensorflow.
You can create models and keep going from there.
    - You can define your own nn module!
  - PyTorch also contains optimizers, like tensorflow.
  - It contains a data loader that wraps a Dataset and provides minibatches, shuffling, and multithreading.
  - PyTorch contains the best and easiest-to-use pretrained models.
  - PyTorch has Visdom, which is like Tensorboard, but Tensorboard seems to be more powerful.
  - PyTorch is new and still evolving compared to Torch. It's still in a beta state.
  - PyTorch is best for research.

- Tensorflow builds the graph once, then runs it many times (called a static graph).

- In each PyTorch iteration we build a new graph (called a dynamic graph).

- **Static vs dynamic graphs**:

  - Optimization:

    - With static graphs, the framework can optimize the graph for you before it runs.

  - Serialization:

    - **Static**: once the graph is built, it can be serialized and run without the code that built it, e.g. run the graph from C++.

    - **Dynamic**: you always need to keep the code around.

  - Conditionals:

    - Easier in dynamic graphs; more complicated in static graphs.

  - Loops:

    - Easier in dynamic graphs; more complicated in static graphs.

- Tensorflow Fold makes dynamic-graph-style code easier to express in Tensorflow through dynamic batching.

- Dynamic graph applications include recurrent networks and recursive networks.

- Caffe2 uses static graphs; it can train models in Python and also works on iOS and Android.

- Tensorflow/Caffe2 are used a lot in production, especially on mobile.

## 09. CNN architectures

- This section covers the famous CNN architectures.
It focuses on the CNN architectures that have won the [ImageNet](www.image-net.org/) competition since 2012.

  - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_487c3026ca9c.png)

- These architectures include: [AlexNet](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf), [VGG](https://arxiv.org/abs/1409.1556), [GoogLeNet](https://research.google.com/pubs/pub43022.html), and [ResNet](https://arxiv.org/abs/1512.03385).

- We will also discuss some other interesting architectures as we go.

- The first ConvNet was [LeNet-5](http://ieeexplore.ieee.org/document/726791/), by Yann LeCun in 1998.

  - The architecture is: `CONV-POOL-CONV-POOL-FC-FC-FC`
    - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_1cc3fc4ed869.jpg)
  - Each conv filter was `5x5`, applied at stride 1.
  - Each pool was `2x2`, applied at stride `2`.
  - It was useful for digit recognition.
  - Its key insight was that image features are distributed across the entire image, and convolutions with learnable parameters are an effective way to extract similar features at multiple locations with few parameters.
  - It contains exactly **<u>5</u>** layers.

- In [2010](https://arxiv.org/abs/1003.0358), Dan Claudiu Ciresan and Jurgen Schmidhuber published one of the very first GPU implementations of neural nets. This implementation had both the forward and backward passes running on an NVIDIA GTX 280 graphics processor, for networks of up to 9 layers.

- [**AlexNet**](https://papers.nips.cc/paper/4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf) (2012):

  - The ConvNet that started the (r)evolution; it won ImageNet in 2012.
  - The architecture is: `CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8`
  - Contains exactly **<u>8</u>** layers: the first 5 are convolutional and the last 3 are fully connected.
  - AlexNet's error rate was `16.4%`.
  - For example, if the input is 227 x 227 x 3, these are the output shapes at each layer:
    - CONV1 (96 11 x 11 filters at stride 4, pad 0)
      - Output shape `(55,55,96)`,   number of weights is `(11*11*3*96)+96 = 34944`
    - MAXPOOL1 (3 x 3 filters applied at stride 2)
      - Output shape `(27,27,96)`,   no weights
    - NORM1
      - Output shape `(27,27,96)`,   we don't use norm layers any more
    - CONV2 (256 5 x 5 filters at stride 1, pad 2)
    - MAXPOOL2 (3 x 3 filters at stride 2)
    - NORM2
    - CONV3 (384 3 x 3 filters at stride 1, pad 1)
    - CONV4 (384 3 x 3 filters at stride 1, pad 1)
    - CONV5 (256 3 x 3 filters at stride 1, pad 1)
    - MAXPOOL3 (3 x 3 filters at stride 2)
      - Output shape `(6,6,256)`
    - FC6 (4096 neurons)
    - FC7 (4096 neurons)
    - FC8 (1000 neurons for the class scores)
  - Some other details:
    - First use of ReLU.
    - Used norm layers (not used any more).
    - Heavy data augmentation.
    - Dropout `0.5`.
    - Batch size `128`.
    - SGD momentum `0.9`.
    - Learning rate `1e-2`, reduced by 10 at certain iterations.
    - An ensemble of 7 CNNs!
  - AlexNet was trained on GTX 580 GPUs with only 3 GB of memory, which wasn't enough to train on one device, so the feature maps were split in half across two GPUs. The first AlexNet was distributed!
  - It is still used for transfer learning in a lot of tasks.
  - The total number of parameters is `60 million`.

- [**ZFNet**](https://arxiv.org/abs/1311.2901) (2013)

  - Won in 2013 with an error of 11.7%.
  - It has the same general structure as AlexNet, with some hyperparameter changes to improve the results.
  - Also contains **<u>8</u>** layers.
  - AlexNet but:
    - `CONV1`: changed from (11 x 11 stride 4) to (7 x 7 stride 2).
    - `CONV3,4,5`: instead of 384, 384, 256 filters, uses 512, 1024, 512.

- [OverFeat](https://arxiv.org/abs/1312.6229) (2013)

  - Won the ImageNet localization task in 2013.
  - From the paper: "We show how a multiscale and sliding window approach can be efficiently implemented within a ConvNet. We also introduce a novel deep learning approach to localization by learning to predict object boundaries."

- [**VGGNet**](https://arxiv.org/pdf/1409.1556) (2014) (Oxford)

  - A deeper network with more layers.
  - Contains 19 layers.
  - Placed in 2014 alongside GoogLeNet, with an error of 7.3%.
  - Smaller filters with deeper layers.
  - The great insight of VGG was that multiple 3 x 3 convolutions in sequence can emulate the effect of larger receptive fields, for example 5 x 5 and 7 x 7.
  - Uses simple 3 x 3 convs all through the network.
    - Three stacked (3 x 3) filters have the same receptive field as one 7 x 7 filter.
  - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_2617c7ddb8a3.png)
  - The architecture stacks several CONV layers followed by a POOL layer, repeated 5 times, and then the fully connected layers.
  - It needs a total of about 96MB of memory per image for the forward pass alone!
    - Most of the memory is in the earlier layers.
  - The total number of parameters is 138 million.
    - Most of the parameters are in the fully connected layers.
  - Training details are similar to AlexNet's.
For example, it uses momentum and dropout.
  - VGG19 is an upgrade of VGG16 that is slightly better but uses more memory.
    - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_4032840ebf4b.png)

- [**GoogLeNet**](https://research.google.com/pubs/pub43022.html) (2014)

  - A deeper network with more layers.
  - Contains 22 layers.
  - Has the efficient **<u>Inception</u>** module.
  - Only 5 million parameters! 12x fewer than AlexNet.
  - Won in 2014 (alongside VGGNet), with an error of 6.7%.
  - Inception module:
    - Design a good local network topology (a network within a network, NiN) and then stack these modules on top of each other.
    - It consists of:
      - Parallel filter operations applied to the input from the previous layer:
        - Multiple convs of sizes (1 x 1, 3 x 3, 5 x 5)
          - Padding is added to maintain the spatial size.
        - A pooling operation (max pooling).
          - Padding is added to maintain the spatial size.
      - All filter outputs are concatenated together depth-wise.
    - For example:
      - The input to the inception module is 28 x 28 x 256.
      - Then the parallel filters are applied:
        - (1 x 1), 128 filters               `# output shape (28,28,128)`
        - (3 x 3), 192 filters               `# output shape (28,28,192)`
        - (5 x 5), 96 filters                `# output shape (28,28,96)`
        - (3 x 3) max pooling                `# output shape (28,28,256)`
      - After concatenation the output is `(28,28,672)`.
    - This design - call it the naive version - has a big computational cost.
      - The example above requires:
        - [1 x 1 conv, 128] ==> 28 * 28 * 128 * 1 * 1 * 256 = 25 million ops approx.
        - [3 x 3 conv, 192] ==> 28 * 28 * 192 * 3 * 3 * 256 = 346 million ops approx.
        - [5 x 5 conv, 96] ==> 28 * 28 * 96 * 5 * 5 * 256 = 482 million ops approx.
        - In total, around 854 million operations!
    - Solution: **bottleneck** layers that use 1x1 convolutions to reduce the feature depth.
      - Inspired by NiN ([Network in Network](https://arxiv.org/abs/1312.4400)).
    - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_0500f562a609.png)
    - The bottleneck version needs a total of 358M operations on this example, which is good compared with the naive implementation.
  - So GoogLeNet stacks this inception module many times to get a full network that solves the problem without fully connected layers.
  - It uses an average pooling layer at the end, before the classification step.
  - Full architecture:
    - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_3b26c7d5ea78.png)
  - In February 2015, batch-normalized Inception was introduced as Inception V2. Batch normalization computes the mean and standard deviation of all feature maps at the output of a layer and normalizes their responses with these values.
  - In December [2015](https://arxiv.org/abs/1512.00567) they published "Rethinking the Inception Architecture for Computer Vision", which explains the older inception models well and introduces a new version, V3.

- The first GoogLeNet and VGG predate the invention of batch normalization, so they needed some hacks to train and converge well.

- [**ResNet**](https://arxiv.org/abs/1512.03385) (2015) (Microsoft Research)

  - A 152-layer model for ImageNet.
It won with a 3.57% error, which is better than human-level error.

  - This was also the very first time that a network of more than a hundred - even a thousand - layers was trained.

  - It swept all the classification and detection competitions in ILSVRC'15 and COCO'15!

  - What happens when we keep stacking deeper layers on a "plain" convolutional neural network?

    - The deeper model performs worse, but it's not caused by overfitting!
    - The deeper model underperforms because deeper plain networks are harder to optimize!

  - The deeper model should be able to perform at least as well as the shallower model.

  - A solution by construction is to copy the learned layers from the shallower model and set the additional layers to the identity mapping.

  - Residual block:

    - Microsoft came up with the residual block, which has this architecture:

      - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_0f4177522b50.png)

    - ```python
      # Instead of trying to learn a whole new representation, we learn only the residual
      Y = (W2 * RELU(W1*x + b1) + b2) + x
      ```

    - Say you have a network of depth N layers. You only want to add a new layer if you get something extra out of adding it.

    - One way to ensure the new (N+1)th layer learns something new is to also provide the input x, without any transformation, to the output of the (N+1)th layer. This essentially drives the new layer to learn something different from what the input has already encoded.

    - The other advantage is that such connections help with the vanishing gradient problem in very deep networks.

  - With the residual block, we can now train a deep NN of any depth without fearing that we can't optimize the network.

  - ResNets with a large number of layers use a bottleneck layer, similar to the Inception bottleneck, to reduce the dimensions.

    - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_7cc0455ef1d0.jpg)

  - **<u>Full ResNet architecture</u>**:

    - Stack residual blocks.
      - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_7230e72d1b3c.png)
    - Every residual block has two 3 x 3 conv layers.
    - An additional conv layer at the beginning.
    - No FC layers at the end (only an FC 1000 to output the class scores).
    - Periodically double the number of filters and downsample spatially using stride 2 (/2 in each dimension).
    - Training ResNet in practice:
      - Batch normalization after every CONV layer.
      - Xavier/2 initialization from He et al.
      - SGD + momentum (`0.9`).
      - Learning rate 0.1, divided by 10 when the validation error plateaus.
      - Mini-batch size `256`.
      - Weight decay of `1e-5`.
      - No dropout used.

- [Inception-v4](https://arxiv.org/abs/1602.07261): ResNet + Inception, introduced in 2016.

- Complexity comparison across all the architectures:

  - ![](https://oss.gittoolsai.com/images/mbadry1_CS231n-2017-Summary_readme_75a149c1148c.png)
  - VGG: highest memory, most operations.
  - GoogLeNet: most efficient.

- **ResNet improvements**:

  - ([2016](https://arxiv.org/abs/1603.05027)) <u>Identity Mappings in Deep Residual Networks</u>
    - From the creators of ResNet.
    - Gives better performance.
  - ([2016](https://arxiv.org/abs/1605.07146)) <u>Wide Residual Networks</u>
    - Argues that the residuals are the important factor, not the depth.
    - A 50-layer wide ResNet outperforms the 152-layer original ResNet.
    - Increasing width instead of depth is more computationally efficient (parallelizable).
  - ([2016](https://arxiv.org/abs/1603.09382)) Deep Networks with Stochastic Depth
    - Motivation: reduce vanishing gradients and training time by using short networks during training.
    - Randomly drop a subset of layers during each training pass.
    - Use the full deep network at test time.

- **Beyond ResNets**:

  - ([2017](https://arxiv.org/abs/1605.07648)) <u>FractalNet: Ultra-Deep Neural Networks without Residuals</u>
    - Argues that the key is transitioning effectively from shallow to deep, and that residual representations are not necessary.
    - Trained by dropping out sub-paths.
    - Uses the full network at test time.
  - ([2017](https://arxiv.org/abs/1608.06993)) <u>Densely Connected Convolutional Networks</u>
  - ([2017](https://arxiv.org/abs/1602.07360)) SqueezeNet: AlexNet-level Accuracy With 50x Fewer Parameters and <0.5Mb Model Size
    - Good for production.
    - It is a re-hash of many concepts from ResNet and Inception, and shows that, after all, a better architecture design can deliver small network sizes and parameter counts without needing complex compression algorithms.

- Conclusion:

  - ResNet is the current best default.
  - Trend toward extremely deep networks.
  - In the last couple of years, many models have used ResNet-style shortcut connections to let the gradients flow easily.



## 10. Recurrent Neural networks

- Vanilla neural networks ("feedforward neural networks"): an input of fixed size goes through some hidden units and then to the output.
We call it a one to one network.\n\n- Recurrent Neural Networks RNN Models:\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_52e2243e4174.png)\n  - One to many\n    - Example: Image Captioning\n      - image ==> sequence of words\n  - Many to One\n    - Example: Sentiment Classification\n      - sequence of words ==> sentiment\n  - Many to many\n    - Example: Machine Translation\n      - seq of words in one language ==> seq of words in another language\n    - Example: Video classification on frame level\n\n- RNNs can also work for Non-Sequence Data (One to One problems)\n\n  - It worked in Digit classification through taking a series of “glimpses”\n    - “[Multiple Object Recognition with Visual Attention](https:\u002F\u002Farxiv.org\u002Fabs\u002F1412.7755)”, ICLR 2015.\n  - It worked on generating images one piece at a time\n    - i.e generating a [captcha](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7966808\u002F)\n\n- So what is a recurrent neural network?\n\n  - Recurrent core cell that take an input x and that cell has an internal state that are updated each time it reads an input.\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_0063c5c33e66.png)\n\n  - The RNN block should return a vector.\n\n  - We can process a sequence of vectors x by applying a recurrence formula at every time step:\n\n    - ```python\n      h[t] = fw (h[t-1], x[t])\t\t\t# Where fw is some function with parameters W\n      ```\n\n    - The same function and the same set of parameters are used at every time step.\n\n  - (Vanilla) Recurrent Neural Network:\n\n    - ```\n      h[t] = tanh (W[h,h]*h[t-1] + W[x,h]*x[t])    # Then we save h[t]\n      y[t] = W[h,y]*h[t]\n      ```\n\n    - This is the simplest example of a RNN.\n\n  - RNN works on a sequence of related data.\n\n- Recurrent NN Computational graph:\n\n  - 
![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_250f43bf2a46.png)\n  - `h0` is initialized to zero.\n  - The gradient of `W` is the sum of all the `W` gradients computed at each time step!\n  - A many to many graph:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_8ce5788ed639.png)\n    - The total loss is the sum of the per-step losses, and the shared output weights are updated by summing all of their gradients!\n  - A many to one graph:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_ab2a4db83ee8.png)\n  - A one to many graph:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_dbbef229ed86.png)\n  - A sequence to sequence graph:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_7481a167a038.png)\n    - The encoder and decoder philosophy.\n\n- Examples:\n\n  - Suppose we are building words out of characters. We want a model that predicts the next character of a sequence. Let's say the characters are only `[h, e, l, o]` and the only word is \"hello\"\n    - Training:\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_b0f371b1bd3c.png)\n      - Only the third prediction here is correct. The loss needs to be optimized.\n      - We can train the network by feeding it the whole word(s).\n    - Testing time:\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_c6e3adb51ce3.png)\n      - At test time we work character by character. 
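This character-by-character sampling loop can be sketched in NumPy. The weights here are random rather than trained, and all sizes and names are illustrative (the real, trainable version is in karpathy's min-char-rnn gist linked below):

```python
import numpy as np

np.random.seed(0)
vocab = ['h', 'e', 'l', 'o']           # the four characters from the example
V, H = len(vocab), 8                   # vocab size, hidden size (illustrative)

# Randomly initialized weights -- in the lecture these would be learned.
Wxh = np.random.randn(H, V) * 0.01     # input  -> hidden
Whh = np.random.randn(H, H) * 0.01     # hidden -> hidden
Why = np.random.randn(V, H) * 0.01     # hidden -> output

def step(h_prev, x_index):
    """One application of the recurrence h[t] = tanh(W[h,h]*h[t-1] + W[x,h]*x[t])."""
    x = np.zeros(V)
    x[x_index] = 1.0                   # one-hot encode the input character
    h = np.tanh(Whh @ h_prev + Wxh @ x)
    y = Why @ h                        # unnormalized scores over the vocabulary
    return h, y

def sample(start_index, n_chars):
    """Feed each output back in as the next input, carrying h forward."""
    h, idx, out = np.zeros(H), start_index, []
    for _ in range(n_chars):
        h, y = step(h, idx)
        p = np.exp(y) / np.sum(np.exp(y))   # softmax over the scores
        idx = int(np.argmax(p))             # greedy pick (could also sample from p)
        out.append(vocab[idx])
    return ''.join(out)

generated = sample(0, 4)   # start from 'h', generate 4 characters
```

With trained weights, the same loop is what produces text at test time.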
The output character is fed back in as the next input, along with the saved hidden state.\n      - This [link](https:\u002F\u002Fgist.github.com\u002Fkarpathy\u002Fd4dee566867f8291f086) contains all the code, but uses Truncated Backpropagation through time, as we will discuss.\n\n- Backpropagation through time: forward through the entire sequence to compute the loss, then backward through the entire sequence to compute the gradient.\n\n  - But if we do this over the whole sequence it will be slow, take a lot of memory, and may never converge!\n\n- So in practice people use \"Truncated Backpropagation through time\": run forward and backward through chunks of the sequence instead of the whole sequence.\n\n  - Carry hidden states forward in time forever, but only backpropagate for some smaller number of steps.\n\n- Example on image captioning:\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_47a1e42c7b4d.png)\n  - An \u003CEnd> token signals when to stop generating.\n  - The biggest dataset for image captioning is Microsoft COCO.\n\n- Image captioning with attention is an approach in which the RNN, while generating captions, looks at a specific part of the image rather than the whole image.\n\n  - The attention technique is also used in the \"Visual Question Answering\" problem\n\n- Multilayer RNNs stack several recurrent layers, feeding the hidden states of one layer into the next. **LSTM** cells are commonly used inside such multilayer RNNs.\n\n- The backward flow of gradients in an RNN can explode or vanish. Exploding is controlled with gradient clipping. Vanishing is controlled with additive interactions (LSTM)\n\n- LSTM stands for Long Short Term Memory. 
It was designed to mitigate the vanishing gradient problem in RNNs.\n\n  - It consists of:\n    - f: Forget gate, whether to erase the cell\n    - i: Input gate, whether to write to the cell\n    - g: Gate gate (?), how much to write to the cell\n    - o: Output gate, how much to reveal the cell\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_b7691bff3d9c.png)\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_bb04a47c5dba.png)\n  - LSTM gradients flow easily through the cell state, much like in ResNet\n  - The cell state lets the LSTM keep both long- and short-term memory as it trains, meaning it can remember information from many steps back, not just the previous step.\n\n- Highway networks are something between ResNet and LSTM and are still an area of research.\n\n- Better\u002Fsimpler architectures are a hot topic of current research\n\n- Better understanding (both theoretical and empirical) is needed.\n\n- RNNs are used for problems involving sequences of related inputs, such as NLP and speech recognition.\n\n\n\n\n## 11. Detection and Segmentation\n\n- So far we have talked about the image classification problem. In this section we will talk about segmentation, localization, and detection.\n\n- **\u003Cu>Semantic Segmentation\u003C\u002Fu>**\n\n  - We want to label each pixel in the image with a category label.\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_cd2214127a63.png)\n\n  - As with the cows in the image, semantic segmentation doesn't differentiate instances; it only cares about pixel labels.\n\n  - The first idea is to use a **sliding window**. We take a small window size and slide it all over the picture. For each window we want to label the center pixel.\n\n    - It works, but it's not a good idea because it is computationally expensive!\n    - Very inefficient! 
Not reusing shared features between overlapping patches.\n    - In practice nobody uses this.\n\n  - The second idea is to design the network as a stack of convolutional layers that makes predictions for all pixels at once!\n\n    - The input is the whole image. The output is the image with each pixel labeled.\n    - We need a lot of labeled data, and such data is very expensive to collect.\n    - It needs deep conv layers.\n    - The loss is the cross entropy over each labeled pixel.\n    - Data augmentation is helpful here.\n    - The problem with this implementation is that convolutions at the original image resolution are very expensive.\n    - So in practice we don't see something like this right now.\n\n  - The third idea builds on the last one. The difference is that we downsample and then upsample inside the network.\n\n    - We downsample because processing the whole image at full resolution is very expensive. So we downsample over multiple layers and then upsample at the end.\n\n    - Downsampling is an operation like pooling or strided convolution.\n\n    - Upsampling is done with \"Nearest Neighbor\", \"Bed of Nails\", or \"Max unpooling\"\n\n      - **Nearest Neighbor** example:\n\n        - ```\n          Input:   1  2               Output:   1  1  2  2\n                   3  4                         1  1  2  2\n                                                3  3  4  4\n                                                3  3  4  4\n          ```\n\n      - **Bed of Nails** example:\n\n        - ```\n          Input:   1  2               Output:   1  0  2  0\n                   3  4                         0  0  0  0\n                                                3  0  4  0\n                                                0  0  0  0\n          ```\n\n      - **Max unpooling** depends on the earlier max pooling steps. 
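The two fixed upsampling examples above can be reproduced in a couple of NumPy lines (the 2x2 input and 2x upsampling factor match the toy examples):

```python
import numpy as np

x = np.array([[1., 2.],
              [3., 4.]])

# Nearest-neighbor upsampling: repeat every value along both axes.
nearest = np.repeat(np.repeat(x, 2, axis=0), 2, axis=1)

# Bed of nails: put each value in the top-left corner of its 2x2 cell, zeros elsewhere.
nails = np.zeros((4, 4))
nails[::2, ::2] = x
```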
You fill the position where max pooling took its maximum, and fill the other positions with zeros.\n\n    - Max unpooling seems to be the best of these upsampling ideas.\n\n    - There is also a learnable upsampling idea called \"**Transpose Convolution**\"\n\n      - Rather than performing a convolution, we perform its reverse.\n      - Also called:\n        - Upconvolution.\n        - Fractionally strided convolution\n        - Backward strided convolution\n      - To learn the arithmetic of this upsampling, refer to chapter 4 of this [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1603.07285).\n\n- **\u003Cu>Classification + Localization\u003C\u002Fu>**:\n\n  - In this problem we want to classify the main object in the image and locate it with a rectangle.\n  - We assume there is exactly one object.\n  - We create a multi-task NN with the following architecture:\n    - Convolutional layers connected to:\n      - FC layers that classify the object. `# The plain classification problem we know`\n      - FC layers that output four numbers `(x,y,w,h)`\n        - We treat localization as a regression problem.\n  - This problem has two losses:\n    - Softmax loss for the classification\n    - L2 regression loss for the localization\n  - Loss = SoftmaxLoss + L2 loss\n  - Often the conv layers come from a pretrained network like AlexNet!\n  - This technique carries over to many other problems, like human pose estimation.\n\n- **\u003Cu>Object Detection\u003C\u002Fu>**\n\n  - A core idea of computer vision. 
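The combined loss above (Loss = SoftmaxLoss + L2 loss) can be sketched in a few lines; the class scores and box coordinates below are made-up numbers, not lecture data:

```python
import numpy as np

def softmax_loss(scores, label):
    """Cross-entropy loss on unnormalized class scores."""
    shifted = scores - scores.max()                       # numerical stability
    log_probs = shifted - np.log(np.sum(np.exp(shifted)))
    return -log_probs[label]

def l2_loss(box_pred, box_true):
    """L2 regression loss on the (x, y, w, h) box."""
    return np.sum((box_pred - box_true) ** 2)

# Hypothetical outputs of the two FC heads for one image:
scores   = np.array([1.0, 3.0, 0.2])        # from the classification head
box_pred = np.array([48., 52., 100., 80.])  # (x, y, w, h) from the regression head
box_true = np.array([50., 50., 100., 80.])

total_loss = softmax_loss(scores, label=1) + l2_loss(box_pred, box_true)
```

In practice the two terms are usually weighted, since the regression loss can dominate the softmax loss.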
We will discuss this problem in detail.\n  - The difference from \"Classification + Localization\" is that here we want to detect one or more objects and their locations!\n  - The first idea is to use a sliding window\n    - It worked well for a long time.\n    - The steps are:\n      - Apply a CNN to many different crops of the image; the CNN classifies each crop as an object or background.\n    - The problem is that we need to apply the CNN to a huge number of locations and scales, which is very computationally expensive!\n    - A brute-force sliding window requires a huge number of CNN evaluations.\n  - Region proposals help us decide which regions to run our network on:\n    - Find **blobby** image regions that are likely to contain objects.\n    - Relatively fast to run; e.g. Selective Search gives 1000 region proposals in a few seconds on CPU\n  - So we can apply a region proposal method and then apply the first idea only to the proposed regions.\n  - There is another idea, called R-CNN\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_7cef3b09f66f.png)\n    - A drawback is that it takes image regions of different sizes -from the region proposals- and feeds them to the CNN after scaling them all to one size. 
Scaling distorts the regions.\n    - It's also very slow.\n  - Fast R-CNN is another idea, developed from R-CNN\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_6715901f2f67.png)\n    - It uses one CNN to do everything.\n  - Faster R-CNN makes its own region proposals by inserting a Region Proposal Network (RPN) that predicts proposals from the features.\n    - The fastest of the R-CNNs.\n  - Another idea is detection without proposals: YOLO \u002F SSD\n    - YOLO stands for \"you only look once\".\n    - YOLO and SSD are two separate algorithms.\n    - Faster, but not as accurate.\n  - Takeaways\n    - Faster R-CNN is slower but more accurate.\n    - SSD\u002FYOLO is much faster but not as accurate.\n\n- **\u003Cu>Dense Captioning\u003C\u002Fu>**\n\n  - Dense Captioning is \"Object Detection + Captioning\"\n  - A paper covering this idea can be found [here](https:\u002F\u002Farxiv.org\u002Fabs\u002F1511.07571).\n\n- **\u003Cu>Instance Segmentation\u003C\u002Fu>**\n\n  - This is like the full problem.\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_62fe79a1f73b.png)\n  - Rather than just predicting bounding boxes, we want to label every pixel and also distinguish between object instances.\n  - There are a lot of ideas.\n  - A newer idea is \"Mask R-CNN\"\n    - Like Faster R-CNN, but with a semantic segmentation branch applied inside each region\n    - There are a lot of good results out of this paper.\n    - It brings together everything we have discussed in this lecture.\n    - Its performance seems very good.\n\n\n\n## 12. Visualizing and Understanding\n\n- We want to know what's going on inside ConvNets.\n\n- People want to trust the black box (the CNN), understand exactly how it works, and confirm that it makes good decisions.\n\n- A first approach is to visualize the filters of the first layer.\n\n  - Say the first-layer filters have shape 5 x 5 x 3 and there are 16 of them. 
Then we will have 16 different \"colored\" filter images.\n  - It turns out that these filters learn primitive shapes and oriented edges, much as the human visual system does.\n  - These filters look much the same in every conv net you train, e.g. if you pull them out of AlexNet, VGG, GoogleNet, or ResNet.\n  - This tells you what the first convolution layer is looking for in the image.\n\n- We can visualize filters from the later layers, but they won't tell us much.\n\n  - Say a later-layer filter has shape 5 x 5 x 20 and there are 16 filters. Then we would have 16*20 different \"gray\" filter images.\n\n- AlexNet has some FC layers at the end. We can take the 4096-dimensional feature vector it computes for an image and collect these feature vectors over many images.\n\n  - If we run nearest neighbors between these feature vectors and look up the real images behind them, we get something much better than running KNN on the images directly!\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_e52f7f57e2cb.png)\n  - This similarity tells us that these CNNs are really capturing the semantic meaning of the images rather than matching at the pixel level!\n  - We can apply dimensionality reduction to the 4096-dimensional features and compress them to 2 dimensions.\n    - This can be done with PCA or t-SNE.\n    - t-SNE is used more often with deep learning to visualize data. 
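The PCA route can be sketched with plain NumPy via the SVD (t-SNE itself needs a library such as scikit-learn; random vectors stand in for real 4096-d features here):

```python
import numpy as np

rng = np.random.default_rng(0)
feats = rng.normal(size=(100, 4096))   # pretend FC-layer features for 100 images

# PCA via SVD: center the data, then project onto the top-2 right singular vectors.
centered = feats - feats.mean(axis=0)
U, S, Vt = np.linalg.svd(centered, full_matrices=False)
coords_2d = centered @ Vt[:2].T        # (100, 2) points, ready to scatter-plot
```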
An example can be found [here](http:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fkarpathy\u002Fcnnembed\u002F).\n\n- We can visualize the activation maps.\n\n  - For example, if the CONV5 feature map is 128 x 13 x 13, we can visualize it as 128 gray-scale images of size 13 x 13.\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_0df6c6d6ebbf.png)\n  - When one of these maps activates strongly for a given input, we know that this particular map is looking for something specific.\n  - This was done by Yosinski et al. More info is [here](http:\u002F\u002Fyosinski.com\u002Fdeepvis#toolbox).\n\n- There is a technique called **Maximally Activating Patches** that can help us visualize the intermediate features in ConvNets\n\n  - The steps are as follows:\n    - Choose a layer, then a neuron\n      - Ex. choose Conv5 in AlexNet, which is 128 x 13 x 13, then pick channel (neuron) 17\u002F128\n    - Run many images through the network, recording the values of the chosen channel.\n    - Visualize the image patches that correspond to maximal activations.\n      - We will find that each neuron is looking for a specific pattern in the image.\n      - Patches are extracted using the neuron's receptive field.\n\n- Another idea is **Occlusion Experiments**\n\n  - We mask part of the image before feeding it to the CNN and draw a heat map of the correct-class probability at each mask location\n  - This shows the parts of the image that are most important to the conv network's decision.\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_42e5cb145d50.png)\n\n- **Saliency Maps** tell us which pixels matter for classification\n\n  - Like Occlusion Experiments, but with a completely different approach\n  - We compute the gradient of the (unnormalized) class score with respect to the image pixels, take the absolute value, and max over the RGB channels. 
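Once you have that gradient, the post-processing is a single NumPy expression (a random array stands in for the real gradient here, which would come from backprop through the network):

```python
import numpy as np

rng = np.random.default_rng(1)
grad = rng.normal(size=(3, 224, 224))   # pretend d(class score)/d(pixels), one slice per RGB channel

# Saliency: absolute value, then max over the RGB channels -> one gray image.
saliency = np.abs(grad).max(axis=0)
```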
It gives us a gray image that highlights the most important areas of the image.\n  - This can sometimes be used for semantic segmentation.\n\n- (Guided) backprop does something like **Maximally Activating Patches**, but unlike it, identifies exactly which pixels the neuron cares about.\n\n  - In this technique we choose a channel, as in Maximally Activating Patches, and then compute the gradient of the neuron value with respect to the image pixels\n  - Images come out nicer if you only backprop positive gradients through each RELU (guided backprop)\n\n- **Gradient Ascent**\n\n  - Generate a synthetic image that maximally activates a neuron.\n\n  - The reverse of gradient descent: instead of minimizing, we maximize.\n\n  - We want to find the input image that maximizes the neuron. So here we are learning the image that maximizes the activation:\n\n    - ```python\n      # R(I) is a natural image regularizer, f(I) is the neuron value.\n      I* = argmax[I] (f(I) + R(I))\n      ```\n\n  - Steps of gradient ascent\n\n    - Initialize the image to zeros.\n    - Forward the image to compute the current scores.\n    - Backprop to get the gradient of the neuron value with respect to the image pixels.\n    - Make a small update to the image\n\n  - `R(I)` may simply be an L2 penalty on the generated image.\n\n  - To get better results we use a better regularizer:\n\n    - penalize the L2 norm of the image; also, periodically during optimization:\n      - Gaussian blur the image\n      - Clip pixels with small values to 0\n      - Clip pixels with small gradients to 0\n\n  - A better regularizer makes our images cleaner!\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_25a43d8c4c56.png)\n\n  - The results from the later layers seem to mean more than those from the earlier layers.\n\n- We can fool a CNN using this procedure:\n\n  - Start from an arbitrary image.\t\t\t`# Random picture based on nothing.`\n  - Pick an arbitrary class. 
`# Random class`\n  - Modify the image to maximize the class.\n  - Repeat until the network is fooled.\n\n- The results of fooling the network are pretty surprising!\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_7d5f63f9f50c.png)\n  - To human eyes the two images look the same, but adding just some noise fooled the network!\n\n- **DeepDream**: Amplify existing features\n\n  - Google released DeepDream on their website.\n  - It actually follows the same procedure as fooling the NN that we discussed, but rather than synthesizing an image to maximize a specific neuron, it tries to amplify the neuron activations at some layer in the network.\n  - Steps:\n    - Forward: compute activations at the chosen layer.\t\t`# from an input image (any image)`\n    - Set the gradient of the chosen layer equal to its activation.\n      - Equivalent to `I* = arg max[I] sum(f(I)^2)`\n    - Backward: compute the gradient on the image.\n    - Update the image.\n  - The DeepDream code is online; you can download it and check it yourself.\n\n- **Feature Inversion**\n\n  - Shows us what kinds of image elements are captured at different layers of the network.\n  - Given a CNN feature vector for an image, find a new image that:\n    - Matches the given feature vector.\n    - *looks natural* (image prior regularization)\n\n- **Texture Synthesis**\n\n  - An old problem in computer graphics.\n  - Given a sample patch of some texture, can we generate a bigger image of the same texture?\n  - There is an algorithm that doesn't depend on NNs:\n    - Wei and Levoy, Fast Texture Synthesis using Tree-structured Vector Quantization, SIGGRAPH 2000\n    - It's a really simple algorithm\n  - This is an old problem and many algorithms already solve it, but the simple algorithms don't work well on complex textures!\n  - An idea using NNs, based on gradient ascent, was proposed in 2015 and called \"Neural 
Texture Synthesis\"\n    - It depends on a construct called the Gram matrix.\n\n- Neural Style Transfer = Feature + Gram Reconstruction\n\n  - Gatys, Ecker, and Bethge, Image style transfer using Convolutional neural networks, CVPR 2016\n  - A Torch implementation is available [here](https:\u002F\u002Fgithub.com\u002Fjcjohnson\u002Fneural-style).\n\n- Style transfer requires many forward \u002F backward passes through VGG; very slow!\n\n  - Train another neural network to perform the style transfer for us!\n  - Fast Style Transfer is the solution.\n  - Johnson, Alahi, and Fei-Fei, Perceptual Losses for Real-Time Style Transfer and Super-Resolution, ECCV 2016\n  - https:\u002F\u002Fgithub.com\u002Fjcjohnson\u002Ffast-neural-style\n\n- There is a lot of ongoing work on style transfer!\n\n- Summary:\n\n  - Activations: Nearest neighbors, Dimensionality reduction, maximal patches, occlusion\n  - Gradients: Saliency maps, class visualization, fooling images, feature inversion\n  - Fun: DeepDream, Style Transfer\n\n\n\n\n## 13. Generative models\n\n- Generative models are a type of unsupervised learning.\n\n- Supervised vs Unsupervised Learning:\n\n  - |                | Supervised Learning                      | Unsupervised Learning                    |\n    | -------------- | ---------------------------------------- | ---------------------------------------- |\n    | Data structure | Data: (x, y), and x is data, y is label  | Data: x, Just data, no labels!           |\n    | Data price     | Training data is expensive in a lot of cases. | Training data is cheap!                 
|\n    | Goal           | Learn a function to map x -> y           | Learn some underlying hidden structure of the data |\n    | Examples       | Classification, regression, object detection, semantic segmentation, image captioning | Clustering, dimensionality reduction, feature learning, density estimation |\n\n- Autoencoders are a feature learning technique.\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_8ebac2172fb5.png)\n  - It contains an encoder and a decoder. The encoder downsamples the image while the decoder upsamples the features.\n  - The loss is an L2 loss.\n\n- Density estimation is where we want to learn\u002Festimate the underlying distribution of the data!\n\n- There are a lot of open research problems in unsupervised learning compared with supervised learning!\n\n- **Generative Models**\n\n  - Given training data, generate new samples from the same distribution.\n  - Addresses density estimation, a core problem in unsupervised learning.\n  - We have different ways to do this:\n    - Explicit density estimation: explicitly define and solve for the density model.\n    - Implicit density estimation: learn a model that can sample from the density without explicitly defining it.\n  - Why Generative Models?\n    - Realistic samples for artwork, super-resolution, colorization, etc\n    - Generative models of time-series data can be used for simulation and planning (reinforcement learning applications!)\n    - Training generative models can also enable inference of latent representations that can be useful as general features\n  - Taxonomy of Generative Models:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_eda42a4b948d.png)\n  - In this lecture we will discuss PixelRNN\u002FCNN, Variational Autoencoders, and GANs, as they are the popular models in research now.\n\n- **PixelRNN** and **PixelCNN**\n\n  - In a fully visible belief network we use the chain rule to decompose the 
likelihood of an image x into a product of 1-d distributions\n    - `p(x) = prod(p(x[i] | x[1]x[2]....x[i-1]))`\n    - Where p(x) is the likelihood of image x and each factor is the probability of the i'th pixel value given all previous pixels.\n  - We then maximize the likelihood of the training data, but this distribution over pixel values is very complex.\n  - We also need to define an ordering of the \u003Cu>previous pixels\u003C\u002Fu>.\n  - PixelRNN\n    - Introduced by [van den Oord et al. 2016]\n    - Dependency on previous pixels is modeled using an RNN (LSTM)\n    - Generate image pixels starting from the corner\n    - Drawback: sequential generation is slow, because you have to generate pixel by pixel!\n  - PixelCNN\n    - Also introduced by [van den Oord et al. 2016]\n    - Still generates image pixels starting from the corner.\n    - Dependency on previous pixels is now modeled using a CNN over a context region\n    - Training is faster than PixelRNN (convolutions can be parallelized, since the context region values are known from the training images)\n    - Generation must still proceed sequentially, so it is still slow.\n  - There are some tricks to improve PixelRNN & PixelCNN.\n  - PixelRNN and PixelCNN can generate good samples and are still an active area of research.\n\n- **Autoencoders**\n\n  - An unsupervised approach for learning a lower-dimensional feature representation from unlabeled training data.\n  - Consists of an encoder and a decoder.\n  - The encoder:\n    - Converts the input x to the features z. z should be smaller than x so that it captures only the important parts of the input. 
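A minimal linear sketch of this encode/decode pair with the L2 reconstruction loss (random untrained weights; all sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 784))             # a batch of 32 flattened 28x28 "images"

# Linear encoder/decoder (the earliest flavor; conv nets are used today).
W_enc = rng.normal(size=(784, 32)) * 0.01  # 784-d input -> 32-d code z
W_dec = rng.normal(size=(32, 784)) * 0.01  # 32-d code   -> 784-d reconstruction

z     = x @ W_enc                          # encoder: z is much smaller than x
x_hat = z @ W_dec                          # decoder: try to reproduce x
loss  = np.mean(np.sum((x - x_hat) ** 2, axis=1))   # L2 reconstruction loss
```

Training would adjust `W_enc` and `W_dec` to minimize `loss`.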
We can call this dimensionality reduction.\n    - The encoder can be made with:\n      - Linear or non-linear layers (early days)\n      - Deep fully connected NN (later)\n      - RELU CNN (currently used on images)\n  - The decoder:\n    - We want the decoder to map the features we have produced back to something similar to x, or the same x.\n    - The decoder can be made with the same techniques as the encoder, and currently it uses a RELU CNN.\n  - The encoder uses conv layers while the decoder uses deconv layers: first decreasing, then increasing resolution.\n  - The loss function is an L2 loss:\n    - `L[i] = |y[i] - y'[i]|^2`\n      - After training we throw away the decoder.`# Now we have the features we need`\n  - We can use the encoder we have to build a supervised model.\n    - The value of this is that the encoder has learned a good feature representation of your input.\n    - A lot of times we have only a small amount of data for a problem. One way to tackle this is to use an autoencoder that has learned to extract features from images, and train on your small dataset on top of that model.\n  - The question is: can we generate data (images) from this autoencoder?\n\n- **Variational Autoencoders (VAE)**\n\n  - A probabilistic spin on autoencoders - will let us sample from the model to generate data!\n  - We have z as the feature vector that was formed by the encoder.\n  - We then choose the prior p(z) to be simple, e.g. Gaussian.\n    - Reasonable for hidden attributes: e.g. 
pose, how much smile.\n  - The conditional p(x|z) is complex (it generates an image) => represent it with a neural network\n  - But we can't compute the integral of p(z)p(x|z)dz in the following equation:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_387a99836481.png)\n  - Working through the equations that deal with this intractable integral, we get:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_b08c578ae0ad.png)\n  - Variational autoencoders are a principled approach to generative models, but their samples are blurrier and of lower quality compared to the state of the art (GANs)\n  - Active areas of research:\n    - More flexible approximations, e.g. a richer approximate posterior instead of a diagonal Gaussian\n    - Incorporating structure in the latent variables\n\n- **Generative Adversarial Networks (GANs)**\n\n  - GANs don’t work with any explicit density function!\n\n  - Instead, they take a game-theoretic approach: learn to generate from the training distribution through a 2-player game.\n\n  - Yann LeCun, who oversees AI research at Facebook, has called GANs:\n\n    - > The coolest idea in deep learning in the last 20 years\n\n  - Problem: we want to sample from a complex, high-dimensional training distribution. There is no direct way to do this, as we have discussed!\n\n  - Solution: sample from a simple distribution, e.g. random noise. 
Learn a transformation to the training distribution.\n\n  - So we draw a noise vector from the simple distribution and feed it to a NN, which we call the generator network, and which should learn to transform the noise into samples from the distribution we want.\n\n  - Training GANs: a two-player game:\n\n    - **Generator network**: tries to fool the discriminator by generating real-looking images.\n    - **Discriminator network**: tries to distinguish between real and fake images.\n\n  - If we are able to train the discriminator well, then we can train the generator to generate the right images.\n\n  - The GAN loss function, as a minimax game, is here:\n\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_44f2d34d4d57.png)\n\n  - Generated images get the label 0 and real images get the label 1.\n\n  - To train the network we do:\n\n    - Gradient ascent on the discriminator.\n    - Gradient ascent on the generator, but with a different loss.\n\n  - You can read the full algorithm with the equations here:\n\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_bfebc105ff09.png)\n\n  - Aside: jointly training two networks is challenging and can be unstable. Choosing objectives with better loss landscapes helps training; this is an active area of research.\n\n  - Convolutional Architectures:\n\n    - The generator is an upsampling network with fractionally-strided convolutions; the discriminator is a convolutional network.\n    - Guidelines for stable deep conv GANs:\n      - Replace any pooling layers with strided convs (discriminator) and fractionally-strided convs (generator).\n      - Use batch norm in both networks.\n      - Remove fully connected hidden layers for deeper architectures.\n      - Use RELU activation in the generator for all layers except the output, which uses Tanh\n      - Use leaky RELU in the discriminator for all layers.\n\n  - 2017 is the year of the GANs! 
They have exploded, and there are some really good results.\n\n  - GANs for all kinds of applications are also an active area of research.\n\n  - The GAN zoo can be found here: https:\u002F\u002Fgithub.com\u002Fhindupuravinash\u002Fthe-gan-zoo\n\n  - Tips and tricks for using GANs: https:\u002F\u002Fgithub.com\u002Fsoumith\u002Fganhacks\n\n  - NIPS 2016 Tutorial on GANs: https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=AJVyzd0rqdc\n\n\n\n## 14. Deep reinforcement learning\n\n- This section contains a lot of math.\n- Reinforcement learning problems involve an agent interacting with an environment, which provides numeric reward signals.\n- The steps are:\n  - Environment --> State `s[t]` --> Agent --> Action `a[t]` --> Environment --> `Reward r[t]` + Next state `s[t+1]` --> Agent --> and so on..\n- Our goal is to learn how to take actions in order to maximize reward.\n- An example is Robot Locomotion:\n  - Objective: Make the robot move forward\n  - State: Angle and position of the joints\n  - Action: Torques applied on joints\n  - Reward: 1 at each time step if upright + forward movement\n- Another example is Atari Games:\n  - Deep learning achieves state-of-the-art results on this problem.\n  - Objective: Complete the game with the highest score.\n  - State: Raw pixel inputs of the game state.\n  - Action: Game controls e.g. Left, Right, Up, Down\n  - Reward: Score increase\u002Fdecrease at each time step\n- The game of Go is another example; the AlphaGo team's win last year (2016) was a big achievement for AI and deep learning, because the problem is so hard.\n- We can mathematically formulate RL (reinforcement learning) using a \u003Cu>**Markov Decision Process**\u003C\u002Fu>\n- **Markov Decision Process**\n  - Defined by (`S`, `A`, `R`, `P`, `Y`) where:\n    - `S`: set of possible states.\n    - `A`: set of possible actions\n    - `R`: distribution of reward given (state, action) pair\n    - `P`: transition probability i.e. 
distribution over the next state given a (state, action) pair\n    - `Y`: discount factor\t`# How much we value rewards coming soon versus later on.`\n  - Algorithm:\n    - At time step `t=0`, the environment samples the initial state `s[0]`\n    - Then, for t=0 until done:\n      - Agent selects action `a[t]`\n      - Environment samples reward from `R` given (`s[t]`, `a[t]`)\n      - Environment samples next state from `P` given (`s[t]`, `a[t]`)\n      - Agent receives reward `r[t]` and next state `s[t+1]`\n  - A policy `pi` is a function from S to A that specifies what action to take in each state.\n  - Objective: find the policy `pi*` that maximizes the cumulative discounted reward: `Sum(Y^t * r[t], t>0)`\n  - For example:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_33a45b9192b3.png)\n  - The solution would be:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_cf2df863ada1.png)\n- The value function at state `s` is the expected cumulative reward from following the policy from state `s`:\n  - `V[pi](s) = Sum(Y^t * r[t], t>0) given s0 = s, pi`\n- The Q-value function at state `s` and action `a` is the expected cumulative reward from taking action `a` in state `s` and then following the policy:\n  - `Q[pi](s,a) = Sum(Y^t * r[t], t>0) given s0 = s, a0 = a, pi`\n- The optimal Q-value function `Q*` is the maximum expected cumulative reward achievable from a given (state, action) pair:\n  - `Q*[s,a] = Max(for all of pi on (Sum(Y^t * r[t], t>0) given s0 = s, a0 = a, pi))`\n- Bellman equation\n  - One of the most important concepts in RL.\n  - Given any state-action pair (s,a), the value of this pair is the reward r that you are going to get plus the (discounted) value of the state that you end up in.\n  - `Q*[s,a] = r + Y * max Q*(s',a') given s,a  # Hint: there is no policy in the equation`\n  - The optimal policy `pi*` corresponds to taking the best action in any state as specified by `Q*`\n- We can get the 
optimal policy using the value iteration algorithm, which applies the Bellman equation as an iterative update:\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_b457bc6f677a.png)\n- Due to the huge state spaces of real-world applications, we use a function approximator to estimate `Q(s,a)`, e.g. a neural network! This is called **Q-learning**.\n  - Any time we have a complex function that we cannot represent directly, we use neural networks!\n- **Q-learning**\n  - The first of the two deep learning approaches to RL covered here.\n  - Use a function approximator to estimate the action-value function.\n  - If the function approximator is a deep neural network => deep Q-learning\n  - The loss function:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_c2d318005cbd.png)\n- Now let's consider the \"Playing Atari Games\" problem:\n  - The total reward is usually the score shown at the top of the screen.\n  - Q-network architecture:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_86674fe4cbdd.png)\n  - Learning from batches of consecutive samples is a problem: consecutive samples are strongly correlated, which makes learning inefficient, 
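The Bellman-equation update can be sketched in tabular form. The toy example below uses a made-up five-state corridor environment and made-up hyperparameters, and trains on randomly sampled past transitions; real deep Q-learning replaces the table with a neural network:

```python
import random

# Toy tabular Q-learning with the Bellman update and a replay memory.
# The corridor environment, hyperparameters, and names are illustrative only.
N_STATES, GOAL, GAMMA, ALPHA = 5, 4, 0.9, 0.5

def step(s, a):
    """Action 1 moves right, action 0 moves left; reaching GOAL pays reward 1."""
    s2 = max(0, min(N_STATES - 1, s + (1 if a == 1 else -1)))
    return s2, (1.0 if s2 == GOAL else 0.0), s2 == GOAL

random.seed(0)
Q = [[0.0, 0.0] for _ in range(N_STATES)]
replay = []  # transitions (s, a, r, s2, done), filled as episodes are played

for episode in range(200):
    s = 0
    for _ in range(20):
        a = random.randrange(2)            # uniform exploration, for simplicity
        s2, r, done = step(s, a)
        replay.append((s, a, r, s2, done))
        # Train on a random minibatch of past transitions, not consecutive ones.
        for bs, ba, br, bs2, bdone in random.sample(replay, min(8, len(replay))):
            target = br if bdone else br + GAMMA * max(Q[bs2])  # Bellman target
            Q[bs][ba] += ALPHA * (target - Q[bs][ba])
        s = s2
        if done:
            break

# The greedy policy should choose "right" (action 1) in every non-goal state.
policy = [max(range(2), key=lambda a: Q[s][a]) for s in range(N_STATES - 1)]
```

With discount factor 0.9, the Q-values along the optimal path approach 1, 0.9, 0.81, 0.729, matching the cumulative discounted reward objective.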
so we should use \"experience replay\" instead: the agent keeps playing the game again and again, and the network trains on randomly sampled past transitions until it masters the game.\n  - Continually update a replay memory table of transitions (`s[t]`, `a[t]`, `r[t]`, `s[t+1]`) as game (experience) episodes are played.\n  - Train the Q-network on random minibatches of transitions from the replay memory, instead of consecutive samples.\n  - The full algorithm:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_6b7edfd45f04.png)\n  - A video that demonstrates the algorithm on an Atari game can be found here: https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=V1eYniJ0Rnk\n- **Policy Gradients**\n  - The second deep learning approach to RL.\n  - The problem with the Q-function is that it can be very complicated.\n    - Example: a robot grasping an object has a very high-dimensional state.\n    - But the policy can be much simpler: just close your hand.\n  - Can we learn a policy directly, e.g. 
by finding the best policy from a collection of policies?\n  - Policy gradient equations:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_92d830c6995b.png)\n  - Gradient ascent converges to a local optimum of `J(theta)`, which is often good enough!\n  - REINFORCE is the algorithm we use to estimate the policy gradient and find the best policy.\n  - Equation and intuition of the REINFORCE algorithm:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_b472ca6b4b9b.png)\n    - The problem with this estimator is high variance. Can we reduce it?\n    - Variance reduction is an active research area!\n  - The Recurrent Attention Model (RAM) is an algorithm based on REINFORCE, used for image classification:\n    - Take a sequence of “glimpses” selectively focusing on regions of the image, to predict the class.\n      - Inspired by human perception and eye movements.\n      - Saves computational resources => scalability\n        - For a high-resolution image you can save a lot of computation.\n      - Able to ignore clutter \u002F irrelevant parts of the image.\n    - RAM is now used in a lot of tasks, including fine-grained image recognition, image captioning, and visual question answering.\n  - AlphaGo uses a mix of supervised learning and reinforcement learning; it also uses policy gradients.\n- A good course from Stanford on deep reinforcement learning:\n  - http:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fcs234\u002Findex.html\n  - https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX\n- A good course on deep reinforcement learning (2017):\n  - http:\u002F\u002Frll.berkeley.edu\u002Fdeeprlcourse\u002F\n  - https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3\n- A good article:\n  - https:\u002F\u002Fwww.kdnuggets.com\u002F2017\u002F09\u002F5-ways-get-started-reinforcement-learning.html\n\n\n\n## 15. 
Efficient Methods and Hardware for Deep Learning\n\n- The original lecture was given by Song Han, then a PhD candidate at Stanford.\n- Deep conv nets, recurrent nets, and deep reinforcement learning are shaping a lot of applications and changing a lot of our lives.\n  - Like self-driving cars, machine translation, AlphaGo, and so on.\n- But the current trend is that higher accuracy requires larger (deeper) models.\n  - The winning model size in the ImageNet competition grew 16x from 2012 to 2015 to achieve high accuracy.\n  - Deep Speech 2 needs 10x the training operations of Deep Speech 1, and that's in only one year! `# At Baidu`\n- This creates three challenges:\n  - **Model Size**\n    - It's hard to deploy larger models on our PCs, mobiles, or cars.\n  - **Speed**\n    - ResNet-152 took 1.5 weeks to train to reach its 6.16% error rate!\n    - Long training time limits ML researchers' productivity.\n  - **Energy Efficiency**\n    - AlphaGo: 1920 CPUs and 280 GPUs, a $3000 electric bill per game.\n    - Running such models on a mobile device would drain the battery.\n    - Google mentioned in their blog that if all users used Google voice search for 3 minutes a day, they would have to double their data centers!\n    - Where is the energy consumed?\n      - Larger model => more memory references => more energy\n- We can improve the efficiency of deep learning through algorithm-hardware co-design.\n  - From both the hardware and the algorithm perspectives.\n- Hardware 101: the family\n  - **General Purpose**\t\t\t`# Can run any application`\n    - CPU\t\t\t\t`# Latency oriented: a single strong thread, like one elephant`\n    - GPU\t\t\t`# Throughput oriented: many small threads, like a lot of ants`\n    - GPGPU\n  - **Specialized HW**\t\t`# Tuned for a domain of applications`\n    - FPGA\t`# Programmable logic; cheaper but less efficient`\n    - ASIC\t`# Fixed logic, designed for a certain application (can be designed for deep learning applications)`\n- Hardware 101: Number 
Representation\n  - Numbers in a computer are represented with a finite, discrete set of bits.\n  - Going from 32-bit to 16-bit floating-point operations is much cheaper and more energy efficient for hardware.\n- Part 1: **\u003Cu>Algorithms for Efficient Inference\u003C\u002Fu>**\n  - **Pruning neural networks**\n    - The idea: can we remove some of the weights\u002Fneurons and have the NN still behave the same?\n    - In 2015, Han reduced AlexNet from 60 million parameters to 6 million using pruning!\n    - Pruning can be applied to CNNs and RNNs; done iteratively, the network recovers the same accuracy as the original.\n    - Pruning actually happens in humans:\n      - Newborn (50 trillion synapses) ==> 1 year old (1000 trillion synapses) ==> Adolescent (500 trillion synapses)\n    - Algorithm:\n      1. Get a trained network.\n      2. Evaluate the importance of the neurons.\n      3. Remove the least important neuron.\n      4. Fine-tune the network.\n      5. If we want to continue pruning, go to step 2 again; otherwise stop.\n  - **Weight Sharing**\n    - The idea is to reduce the number of distinct values in our models.\n    - Trained Quantization:\n      - Example: weight values 2.09, 2.12, 1.92, 1.87 are all replaced by 2.\n      - To do that, we can run k-means clustering on a filter's weights, for example, and reduce the number of distinct values in it. 
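The k-means weight-sharing step can be sketched as follows. This is a toy pure-Python version: the function names and example weights are made up, and real deep compression also fine-tunes the shared centroids after clustering:

```python
# Weight sharing via 1-D k-means, a toy sketch of trained quantization:
# cluster a filter's weights into k centroids and replace each weight by
# its centroid, so the layer stores only small indices plus a codebook.

def kmeans_1d(values, k, iters=20):
    # Initialize centroids evenly across the value range.
    lo, hi = min(values), max(values)
    centroids = [lo + (hi - lo) * i / (k - 1) for i in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in values:                      # assign each value to nearest centroid
            j = min(range(k), key=lambda i: abs(v - centroids[i]))
            clusters[j].append(v)
        centroids = [sum(c) / len(c) if c else centroids[i]   # recompute means
                     for i, c in enumerate(clusters)]
    return centroids

def share_weights(weights, k=4):
    codebook = kmeans_1d(weights, k)
    indices = [min(range(k), key=lambda i: abs(w - codebook[i])) for w in weights]
    return codebook, indices

weights = [2.09, 2.12, 1.92, 1.87, -1.03, -0.98, 0.05, -0.02]
codebook, indices = share_weights(weights)
quantized = [codebook[i] for i in indices]    # 2.09, 2.12, 1.92, 1.87 all map to ~2.0
```

Storing 2-bit cluster indices plus a tiny codebook instead of 32-bit floats is where the compression comes from.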
By doing this we also reduce the number of distinct values involved when computing the gradients.\n      - After trained quantization, the weights are discrete.\n      - Trained quantization can significantly reduce the number of bits needed per number in each layer.\n    - Pruning + trained quantization can work together to reduce the size of the model.\n    - Huffman Coding\n      - We can use Huffman coding to compress the number of bits used for the weights.\n      - Infrequent weights: use more bits to represent them.\n      - Frequent weights: use fewer bits to represent them.\n    - Using pruning + trained quantization + Huffman coding together is called deep compression.\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_08cf0558ee95.png)\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_d6b3c4c3901c.png)\n      - **SqueezeNet**\n        - All the models we have talked about so far started from an existing pretrained architecture. 
Can we design a new architecture that saves memory and computation?\n        - SqueezeNet reaches AlexNet-level accuracy with 50x fewer parameters and a model size under 0.5 MB (with compression).\n      - SqueezeNet can be compressed even further by applying deep compression to it.\n      - These models are more energy efficient and much faster.\n      - Deep compression has been applied in industry by Facebook and Baidu.\n  - **Quantization**\n    - Algorithm (quantizing the weights and activations):\n      - Train with floats.\n      - Quantize the weights and activations:\n        - Gather the statistics of the weights and activations.\n        - Choose a proper radix point position.\n      - Fine-tune in float format.\n      - Convert to fixed-point format.\n  - **Low Rank Approximation**\n    - Another size-reduction technique used for CNNs.\n    - The idea is to decompose a conv layer into two smaller layers and use those instead.\n  - **Binary \u002F Ternary Net**\n    - Can we represent the weights of a NN with only three numbers?\n    - The model is much smaller with only -1, 0, 1.\n    - This is a recent idea published in 2017: \"Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR’17\"\n    - Works after training.\n    - Tried on AlexNet, it reaches almost the same error as the full-precision model.\n    - More operations fit per register with fewer bits: https:\u002F\u002Fxnor.ai\u002F\n  - **Winograd Transformation**\n    - Based on 3x3 Winograd convolutions, which need fewer operations than ordinary convolution.\n    - cuDNN 5 uses Winograd convolutions, which improved speed.\n- Part 2: **\u003Cu>Hardware for Efficient Inference\u003C\u002Fu>**\n  - There are a lot of ASICs that have been developed for deep learning. 
All of which share the same goal: minimizing memory access.\n    - Eyeriss (MIT)\n    - DaDianNao\n    - Google TPU (Tensor Processing Unit)\n      - It fits into the slot of a disk drive in a server.\n      - Up to 4 cards per server.\n      - It consumes far less power than a GPU, and the chip is smaller.\n    - EIE (Stanford)\n      - By Han et al., 2016 [ISCA’16]\n      - It skips zero weights and performs quantization of the numbers in hardware.\n      - EIE achieves better throughput and energy efficiency.\n- Part 3: **\u003Cu>Algorithms for Efficient Training\u003C\u002Fu>**\n  - **Parallelization**\n    - **Data Parallel** – run multiple inputs in parallel\n      - E.g. run two images at the same time!\n      - Run multiple training examples in parallel.\n      - Limited by batch size.\n      - Gradients have to be applied by a master node.\n    - **Model Parallel**\n      - Split up the model – i.e. the network.\n      - Split the model over multiple processors, by layer.\n    - Hyper-Parameter Parallel\n      - Try many alternative networks in parallel.\n      - Easy to get 16-64 GPUs training one model in parallel.\n  - **Mixed Precision** with FP16 and FP32\n    - As discussed, using 16-bit numbers throughout the model cuts the energy cost by about 4x.\n    - Can we run a model entirely in 16 bits? Partially: with mixed FP16\u002FFP32 we use 16 bits almost everywhere, but at some points we need FP32.\n    - For example, accumulating the product of two FP16 numbers requires FP32.\n    - Models trained this way get near the accuracy of famous models like AlexNet and ResNet.\n  - **Model Distillation**\n    - The question: can we use senior (well-trained) neural network(s) to guide a student (new) neural network?\n    - For more information look at Hinton et al. 
Dark knowledge \u002F Distilling the Knowledge in a Neural Network.\n  - DSD: Dense-Sparse-Dense Training\n    - Han et al. “DSD: Dense-Sparse-Dense Training for Deep Neural Networks”, ICLR 2017\n    - Provides better regularization.\n    - The idea: train the model (call this dense), then apply pruning to it (call this sparse).\n    - After those two steps, we restore the pruned connections and train them again (dense again).\n    - DSD produces the same model architecture but finds a better optimization solution, arrives at a better local minimum, and achieves higher prediction accuracy.\n    - This improves performance a lot in many deep learning models.\n- Part 4: **\u003Cu>Hardware for Efficient Training\u003C\u002Fu>**\n  - GPUs for training:\n    - Nvidia Pascal GP100 (2016)\n    - Nvidia Volta GV100 (2017)\n      - Supports mixed-precision operations!\n      - So powerful.\n      - It's the new nuclear bomb!\n  - Google announced \"Google Cloud TPU\" in May 2017!\n    - A Cloud TPU delivers up to 180 teraflops to train and run machine learning models.\n    - One of our new large-scale translation models used to take a full day to train on 32 of the best commercially-available GPUs—now it trains to the same accuracy in an afternoon using just one eighth of a TPU pod.\n- We have moved from the PC era ==> Mobile-First era ==> AI-First era.\n\n\n\n## 16. Adversarial Examples and Adversarial Training\n\n- **\u003Cu>What are adversarial examples?\u003C\u002Fu>**\n  - Since 2013, deep neural networks have matched human performance at:\n    - Face recognition\n    - Object recognition\n    - Captcha recognition\n      - Because their accuracy exceeded humans', websites had to look for alternatives to captchas.\n    - And other tasks..\n  - Before 2013, nobody was surprised to see a computer make a mistake! 
But now that deep learning works so well, it is important to understand its problems and their causes.\n  - Adversarial examples are unusual mistakes that deep learning models make.\n  - This topic only became hot once deep learning started doing better and better than humans!\n  - An adversarial example is an input that has been carefully computed to be misclassified.\n  - In a lot of cases, the adversarial image is barely changed compared to the original image from the human perspective.\n  - History of recent papers:\n    - Biggio [2013](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-642-40994-3_25): fool neural nets.\n    - Szegedy et al 2013: fool ImageNet classifiers imperceptibly.\n    - Goodfellow et al [2014](https:\u002F\u002Farxiv.org\u002Fabs\u002F1412.6572): cheap, closed form attack.\n  - So the first story was in 2013, when Szegedy had a CNN that classified images very well.\n    - He wanted to understand more about how the CNN works in order to improve it.\n    - He took an image of one object and, using gradient ascent, updated the image so that it would be classified as another object.\n    - Strangely, he found that the resulting image had barely changed from the human perspective!\n    - If you tried it you wouldn't notice any change, and you would think this is a bug! 
But it isn't: if you inspect the images' pixel values, you will find that they really are different!\n  - These mistakes can be found in almost any deep learning algorithm we have studied!\n    - It turns out that RBF (Radial Basis Function) networks can resist this.\n    - Deep models for density estimation can resist this.\n  - It's not just neural nets that can be fooled:\n    - Linear models\n      - Logistic regression\n      - Softmax regression\n      - SVMs\n    - Decision trees\n    - Nearest neighbors\n- **\u003Cu>Why do adversarial examples happen?\u003C\u002Fu>**\n  - In the process of trying to understand what is happening, in 2016 researchers thought it came from overfitting models in the high-dimensional data case.\n    - Because in such high dimensions we could have some random errors that can be found.\n    - If that were true, a model trained with different parameters should not make the same mistakes.\n    - They found that is not right: different models make the same mistakes, so it is not overfitting.\n  - That experiment showed the problem is caused by something systematic, not random.\n    - If they add a particular vector to an example, it gets misclassified by any model.\n  - Maybe adversarial examples come from underfitting, not overfitting.\n  - Modern deep nets are very piecewise linear:\n    - Rectified linear unit\n    - Carefully tuned sigmoid  `# Most of the time we operate inside the near-linear part of the curve`\n    - Maxout\n    - LSTM\n  - The relation between the parameters and the output is nonlinear, because parameters get multiplied together; that is what makes training NNs difficult. The mapping from input to output, however, is close to linear and much easier to analyze.\n- **\u003Cu>How can adversarial examples be used to compromise machine learning systems?\u003C\u002Fu>**\n  - When experimenting with how easy a NN is to fool, we want to make sure we are actually fooling it, not just legitimately changing the output class; as attackers, this is exactly the behavior we want to exploit (a security hole).\n  - When we build an adversarial example, 
we constrain the perturbation using the max norm.\n  - The fast gradient sign method:\n    - This method comes from the fact that almost all NNs use (piecewise) linear activations like ReLU, the near-linearity assumption mentioned before.\n    - No pixel can be changed by more than some amount epsilon.\n    - The fast way: take the gradient of the cost you used to train the network with respect to the input, take the sign of that gradient, and multiply it by epsilon.\n    - Equation:\n      - `Xdash = x + epsilon * (sign of the gradient)`\n      - where Xdash is the adversarial example and x is the normal example.\n    - So an adversarial example can be built using just the sign (direction) of the gradient and some epsilon.\n  - Some attacks are based on the ADAM optimizer.\n  - Adversarial examples are not random noise!\n  - NNs are trained on some distribution and behave well on that distribution. But if you shift this distribution, the NN stops giving the right answers; it becomes very easy to fool.\n  - Deep RL can also be fooled.\n  - Attacking the weights:\n    - In linear models, we can take the learned weights as an image, take the signs of that image, and add them to any example to force the weights' class to come out. See Andrej Karpathy, \"Breaking Linear Classifiers on ImageNet\".\n  - It turns out that some models perform well here (we can't easily craft adversarial examples against them):\n    - In particular, shallow RBF networks resist adversarial perturbations `# at least from the fast gradient sign method`\n      - The problem is that RBFs don't get much accuracy on these datasets, because they are shallow models; and if you try to make the model deeper, the gradients become zero in almost all the layers.\n      - RBFs are very difficult to train, even with batch normalization.\n      - Ian thinks that with better hyperparameters or a better optimization algorithm than gradient descent, we would be able to train deep RBFs and solve the adversarial problem!\n  - We can also use one model to fool another model. 
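The fast gradient sign method above can be sketched on a toy model. Here a fixed logistic-regression "network" stands in for a trained classifier; the weights, input, and epsilon are made up for illustration, and the gradient is the analytic input-gradient of the cross-entropy loss:

```python
import math

# Toy fast-gradient-sign-method demo on a fixed logistic-regression model.
# Weights, bias, input, and epsilon are illustrative, not from the lecture.
w, b = [2.0, -3.0, 1.5], 0.5

def predict(x):
    """P(class = 1) for input x under the fixed model."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def input_gradient(x, y):
    # For logistic regression with cross-entropy loss, dLoss/dx = (p - y) * w.
    p = predict(x)
    return [(p - y) * wi for wi in w]

def fgsm(x, y, epsilon):
    sign = lambda v: (v > 0) - (v < 0)
    return [xi + epsilon * sign(gi) for xi, gi in zip(x, input_gradient(x, y))]

x = [1.0, 0.1, 0.2]                 # confidently classified as class 1
x_adv = fgsm(x, y=1, epsilon=0.5)   # nudge every coordinate by +-epsilon
```

Each input coordinate moves by exactly ±epsilon (the max-norm constraint), yet the prediction flips from class 1 to class 0.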
E.g. use an SVM to fool a deep NN.\n    - For more details, follow the paper: \"Papernot 2016\"\n  - Transferability Attack\n    1. Target model with unknown weights, machine learning algorithm, training set; maybe non-differentiable.\n    2. Build your own training set: send your inputs to the target model and record its outputs.\n    3. Train your own model on that data. \"Following some table from Papernot 2016\"\n    4. Craft adversarial examples against your own model.\n    5. Use these examples against the model you are targeting.\n    6. You are very likely to fool the target!\n  - To push the probability of fooling the network toward 100%, you can craft the adversarial example against several models (maybe five) instead of one, then apply it. \"(Liu et al, 2016)\"\n  - Adversarial examples work on the human brain too! Think of images that trick your eyes; there are a lot of them on the Internet.\n  - In practice, researchers have fooled real deployed models (MetaMind, Amazon, Google).\n  - Someone uploaded a perturbed image to Facebook, and Facebook was fooled :D\n- **\u003Cu>What are the defenses?\u003C\u002Fu>**\n  - A lot of defenses Ian tried failed really badly! Including:\n    - Ensembles\n    - Weight decay\n    - Dropout\n    - Adding noise at train time or at test time\n    - Removing the perturbation with an autoencoder\n    - Generative modeling\n  - Universal approximator theorem\n    - Whatever shape we would like our classification function to have, a big enough NN can represent it.\n    - So we should be able to train a NN that detects adversarial examples!\n  - Linear models and KNN can be fooled more easily than NNs; neural nets can actually become more secure than other models. 
Adversarially trained neural nets have the best empirical success rate against adversarial examples of any machine learning model.\n    - Deep NNs could be trained with more nonlinear functions, but we would need a better optimization technique than what works for piecewise-linear activations like ReLU.\n- **\u003Cu>How to use adversarial examples to improve machine learning, even when there is no adversary?\u003C\u002Fu>**\n  - Universal engineering machine (model-based optimization)\t\t`# Ian calls this the universal engineering machine`\n    - For example:\n      - Imagine that we want to design a car that is fast.\n      - We train a NN to look at the blueprint of a car and tell us whether the blueprint will give us a fast car or not.\n      - The idea is to optimize the input to the network so that the output is maximized; this could give us the best blueprint for a car!\n    - Make new inventions by finding the input that maximizes the model’s predicted performance.\n    - Right now, using adversarial examples we just get results we don't like; but if we solve this problem, we could have the fastest car, the best GPU, the best chair, new drugs.....\n  - The whole adversarial topic is an active area of research, especially defending the network!\n- Conclusion\n  - Attacking is easy.\n  - Defending is difficult.\n  - Adversarial training provides regularization and semi-supervised learning.\n  - The out-of-domain input problem is a bottleneck for model-based optimization generally.\n- There is a GitHub library that lets you learn everything about adversarial examples through code (built on top of TensorFlow):\n  - An adversarial example library for constructing attacks, building defenses, and benchmarking both: https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Fcleverhans\n\n\n\n\n\u003Cbr>\u003Cbr>\n\u003Cbr>\u003Cbr>\nThese notes were made by [Mahmoud Badry](mailto:mma18@fayoum.edu.eg) @2017","# Stanford CS231n 
2017 Course Summary\n\nAfter watching all the videos of the famous Stanford [CS231n](http:\u002F\u002Fcs231n.stanford.edu\u002F) course held in 2017, I decided to summarize the whole course, both to help me review it and as a reference for anyone who wants to know about it. I skipped some lectures whose content was not important to me.\n\n## Table of contents\n\n* [Stanford CS231n 2017 Course Summary](#standford-cs231n-2017-summary)\n   * [Table of contents](#table-of-contents)\n   * [Course Info](#course-info)\n   * [01. Introduction to CNN for visual recognition](#01-introduction-to-cnn-for-visual-recognition)\n   * [02. Image classification](#02-image-classification)\n   * [03. Loss function and optimization](#03-loss-function-and-optimization)\n   * [04. Introduction to neural network](#04-introduction-to-neural-network)\n   * [05. Convolutional neural networks (CNNs)](#05-convolutional-neural-networks-cnns)\n   * [06. Training neural networks I](#06-training-neural-networks-i)\n   * [07. Training neural networks II](#07-training-neural-networks-ii)\n   * [08. Deep learning software](#08-deep-learning-software)\n   * [09. CNN architectures](#09-cnn-architectures)\n   * [10. Recurrent neural networks](#10-recurrent-neural-networks)\n   * [11. Detection and segmentation](#11-detection-and-segmentation)\n   * [12. Visualizing and understanding](#12-visualizing-and-understanding)\n   * [13. Generative models](#13-generative-models)\n   * [14. Deep reinforcement learning](#14-deep-reinforcement-learning)\n   * [15. Efficient methods and hardware for deep learning](#15-efficient-methods-and-hardware-for-deep-learning)\n   * [16. Adversarial examples and adversarial training](#16-adversarial-examples-and-adversarial-training)\n\n## Course Info\n\n- Website: http:\u002F\u002Fcs231n.stanford.edu\u002F\n\n- Lectures link: https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLC1qU-LWwrF64f4QKQT-Vg5Wr4qEE1Zxk\n\n- Full syllabus link: http:\u002F\u002Fcs231n.stanford.edu\u002Fsyllabus.html\n\n- Assignment solutions: https:\u002F\u002Fgithub.com\u002FBurton2000\u002FCS231n-2017\n\n- Number of lectures: **16**\n\n- Course description:\n\n  - > Computer Vision has become ubiquitous in our society, with applications in search, image understanding, apps, mapping, medicine, drones, and self-driving cars. Core to many of these applications are visual recognition tasks such as image classification, localization, and detection. Recent developments in neural network (aka "deep learning") approaches have greatly advanced the performance of these state-of-the-art visual recognition systems. This course is a deep dive into the details of deep learning architectures, with a focus on learning end-to-end models for these tasks, particularly image classification. During the 10-week course, students will learn to implement, train, and debug their own neural networks and gain a detailed understanding of cutting-edge research in computer vision. The final assignment will involve training a multi-million-parameter convolutional neural network and applying it to the largest image classification dataset (ImageNet). We will focus on how to set up the problem of image recognition, the learning algorithms (e.g. backpropagation), and practical engineering tricks for training and fine-tuning networks, guiding students through hands-on assignments and a final course project. Much of the background and materials of this course will be drawn from the [ImageNet Challenge](http:\u002F\u002Fimage-net.org\u002Fchallenges\u002FLSVRC\u002F2014\u002Findex).\n\n\n\n## 01. 
Introduction to CNN for visual recognition\n\n- A brief history of computer vision from the late 1960s to today.\n- Computer vision problems include image classification, object localization, object detection, and scene understanding.\n- [ImageNet](http:\u002F\u002Fwww.image-net.org\u002F) is one of the biggest image classification datasets available today.\n- Since 2012, CNNs have dominated the ImageNet competition.\n- CNNs were actually invented back in 1997 by [Yann LeCun](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F726791\u002F).\n\n\n\n## 02. Image classification\n\n- Image classification faces many challenges, such as changes in illumination and viewpoint.\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_1ee9457872f7.jpeg)\n- Image classification can be tackled with **K nearest neighbors** (KNN), but it usually performs poorly. Properties of KNN:\n  - Hyperparameters of KNN: k and the distance measure.\n  - k is the number of neighbors we compare against.\n  - Common distance measures:\n    - L2 distance (Euclidean distance)\n      - Best for non-coordinate points\n    - L1 distance (Manhattan distance)\n      - Best for coordinate points\n- Hyperparameters can be optimized using cross-validation to pick the best k:\n  1. Split your dataset into `f` folds.\n  2. For given hyperparameters:\n     - Train the algorithm on `f-1` folds and test it on the remaining fold, repeating until every fold has been used for testing once.\n  3. Choose the hyperparameters that give the best average score.\n- A **linear SVM** classifier can also solve image classification, but because of the curse of dimensionality its performance stops improving at some point.\n- **Logistic regression** is another option, but image classification is a non-linear problem!\n- A linear classifier computes the equation: `Y = wX + b`\n  - where `w` has the same shape as `x` and `b` has shape 1.\n- We can append a 1 to the `X` vector and drop the bias term, simplifying to: `Y = wX`\n  - where the shape of `x` becomes `oldX+1` and `w` keeps the same shape as `x`.\n- We need to find the values of `w` and `b` that make the classifier perform best.\n\n## 03. 
Loss function and optimization\n\n- In the last section we talked about linear classifiers, but we did not discuss how to **train** the parameters of the model to find the best `w` and `b`.\n\n- We need a loss function to measure how good or bad our current parameters are.\n  - ```python\n    Loss = L[i] = L(f(X[i],W),Y[i])\n    Loss_for_all = 1\u002FN * Sum(Li(f(X[i],W),Y[i]))      # The average over N examples\n    ```\n\n- Then we need a way to minimize the loss function given the parameters. This is called **optimization**.\n\n- Loss function for a linear **SVM** classifier:\n  - `L[i] = Sum over all classes except the correct class (max(0, s[j] - s[y[i]] + 1))`\n  - We call this the ***hinge loss***.\n  - The loss function means: if the highest score belongs to the true label, we are satisfied; otherwise we incur an error that depends on the margin.\n  - Example:\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_7c204cd1010c.jpg)\n    - Given this example, we want to compute the loss of this image.\n    - `L = max (0, 437.9 - (-96.8) + 1) + max(0, 61.95 - (-96.8) + 1) = max(0, 535.7) + max(0, 159.75) = 695.45`\n    - The final loss is 695.45, which is large: the cat score should be higher than the other scores, but right now it is the lowest. We need to minimize this loss.\n  - A margin of 1 works fine, but it is also a hyperparameter.\n\n- If your loss function equals zero, are those parameters the only optimal ones? No; many different parameter settings can achieve the best result.\n\n- You may also hear about the squared hinge loss SVM (or L2-SVM); it penalizes violated margins more strongly (quadratically instead of linearly). The unsquared version is more standard, but on some datasets the squared hinge loss can work better.\n\n- We add a **regularization** term to the loss function to keep the model from overfitting the data.\n  - ```python\n    Loss = L = 1\u002FN * Sum(Li(f(X[i],W),Y[i])) + lambda * R(W)\n    ```\n\n  - where `R` is the regularizer and `lambda` is the regularization strength.\n\n- Different regularization techniques:\n\n  - | Regularizer           | Equation                            | Comments               |\n    | --------------------- | ----------------------------------- | ---------------------- |\n    | L2                    | `R(W) = Sum(W^2)`                   | Sum of all squared weights  |\n    | L1                    | `R(W) = Sum(|W|)`                   | Sum of all absolute weights  |\n    | Elastic net (L1 + L2)  | `R(W) = beta * Sum(W^2) + Sum(|W|)` |                        |\n    | Dropout               |                                     | No equation            |\n\n- Regularization prefers smaller `W` over bigger `W`.\n\n- Regularization is also called weight decay. Biases should not be included in regularization.\n\n- Softmax loss (like logistic regression, but for more than two classes):\n\n  - Softmax function:\n    - ```python\n      A[L] = e^(score[L]) \u002F sum(e^(score[L]), NoOfClasses)\n      ```\n\n  - The entries of the resulting vector should sum to 1.\n\n  - Softmax loss:\n    - ```python\n      Loss = -logP(Y = y[i]|X = x[i])\n      ```\n\n    - 
i.e. the negative log of the probability of the correct class. We want that probability to be close to 1, hence the minus sign.\n    - Softmax loss is also called cross-entropy loss.\n\n  - Watch out for these numeric problems when computing softmax:\n    - ```python\n      f = np.array([123, 456, 789]) # example with 3 classes, each with large scores\n      p = np.exp(f) \u002F np.sum(np.exp(f)) # Bad: numeric problem, potential overflow\n\n      # Instead: first shift the values of f so that the highest number is 0:\n      f -= np.max(f) # f becomes [-666, -333, 0]\n      p = np.exp(f) \u002F np.sum(np.exp(f)) # safe to do, gives the correct answer\n      ```\n\n- **Optimization**:\n\n  - We have discussed that we want to minimize the loss function. What are the strategies?\n  - Strategy one:\n    - Pick random parameters, compute the loss for each set, and finally choose the set with the smallest loss. This is not a good method.\n  - Strategy two:\n    - Follow the slope (the gradient).\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_1eb15957ba4a.png)\n      - Image [source](https:\u002F\u002Frasbt.github.io\u002Fmlxtend\u002Fuser_guide\u002Fgeneral_concepts\u002Fgradient-optimization_files\u002Fball.png).\n\n    - Our goal is to compute the gradient with respect to each parameter.\n      - **Numerical gradient**: approximate, slow, easy to write. (Good for debugging.)\n      - **Analytic gradient**: exact, fast, error-prone. (What is used in practice.)\n\n    - After computing the gradients of the parameters, we do gradient descent:\n      - ```python\n        W = W - learning_rate * W_grad\n        ```\n\n    - The learning rate is a very important hyperparameter; tune it first among all your hyperparameters.\n\n    - Stochastic gradient descent:\n      - Instead of using all the data, use a minibatch of examples (32\u002F64\u002F128 are common) for faster convergence.\n\n## 04. 
Introduction to neural network\n\n- Computing the analytic gradient for arbitrary complex functions:\n\n  - What is a computational graph?\n\n    - A computational graph represents any function using nodes.\n    - Using computational graphs makes it easy to apply a technique called backpropagation, even for complex models like CNNs and RNNs.\n\n  - A simple example of backpropagation:\n\n    - Suppose we have `f(x,y,z) = (x+y)z`\n\n    - We can draw it with this graph:\n\n    - ```\n      X         \n        \\\n         (+)--> q ---(*)--> f\n        \u002F           \u002F\n      Y            \u002F\n                  \u002F\n                 \u002F\n      Z---------\u002F\n      ```\n\n    - We created an intermediate variable `q` to hold the value of `x+y`.\n\n    - Then we have:\n\n      - ```python\n        q = (x+y)              # dq\u002Fdx = 1 , dq\u002Fdy = 1\n        f = qz                 # df\u002Fdq = z , df\u002Fdz = q\n        ```\n\n    - Then:\n\n      - ```python\n        df\u002Fdq = z\n        df\u002Fdz = q\n        df\u002Fdx = df\u002Fdq * dq\u002Fdx = z * 1 = z       # Chain rule\n        df\u002Fdy = df\u002Fdq * dq\u002Fdy = z * 1 = z       # Chain rule\n        ```\n\n  - So in a computational graph we call each operation `f`. For each `f` we compute the local gradient before doing backpropagation, and then use the chain rule to get the gradient with respect to the loss.\n\n  - In a computational graph you can split each operation into the simplest pieces you want, but that creates many nodes. If you want bigger nodes, be sure you can compute the gradient of each node.\n\n  - A bigger example:\n\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_16f775ff9bfb.png)\n    - Hint: when two branches merge backward into one node, backpropagation adds the two derivatives.\n\n  - Modularized implementation: forward\u002Fbackward API (example code for a multiplication gate):\n\n    - ```python\n      class MultiplyGate(object):\n        \"\"\"\n        x,y are scalars\n        \"\"\"\n        def forward(self, x, y):\n          z = x*y\n          self.x = x  # Cache\n          self.y = y  # Cache\n          # We cache x and y because we know they appear in the derivatives.\n          return z\n        def backward(self, dz):\n          dx = self.y * dz         # the local gradient dz\u002Fdx is y\n          dy = self.x * dz         # the local gradient dz\u002Fdy is x\n          return [dx, dy]\n      ```\n\n  - If you look at deep learning frameworks, you will find they all follow this modular implementation, with a forward and a backward definition for each class, e.g.:\n\n    - Multiplication\n    - Max\n    - Plus\n    - Minus\n    - Sigmoid\n    - Convolution\n\n- So we can define a neural network as a function:\n\n  - (Before) linear score function: `f = Wx`\n  - (Now) 2-layer neural network:   `f = W2*max(0,W1*x)` \n    - where max is the ReLU non-linearity\n  - (Now) 3-layer neural network:   `f = W3*max(0,W2*max(0,W1*x))`\n  - and so on..\n\n- Neural networks stack simple operations to form a complex operation.\n\n\n\n## 05. 
Convolutional neural networks (CNNs)\n\n- Neural network history:\n  - The first perceptron was developed by Frank Rosenblatt in 1957 to recognize letters of the alphabet. Backpropagation had not been invented yet.\n  - Multilayer perceptrons appeared around 1960 with Adaline\u002FMadaline; still no backpropagation.\n  - Backpropagation was introduced in 1986 by Rumelhart et al.\n  - After that, progress in neural networks stalled for a while, mainly due to limits on compute and data.\n  - In 2006, Hinton published a paper showing that deep neural networks could be trained by initializing the weights with restricted Boltzmann machines and then applying backpropagation.\n  - In 2012, Hinton's group achieved breakthrough results in speech recognition (see the IEEE paper). The same year, Hinton's group introduced the AlexNet CNN, which won the ImageNet competition.\n  - Since then, neural networks have been used everywhere.\n- Convolutional neural network history:\n  - Between 1959 and 1968, Hubel and Wiesel ran experiments on a cat's visual cortex and found that the cortex has a topographical mapping and that its neurons are organized hierarchically, from simple to complex.\n  - In 1998, Yann LeCun published \"Gradient-based learning applied to document recognition\", introducing convolutional neural networks. The model worked well for zip code recognition but could not scale to harder problems.\n  - In 2012, AlexNet adopted LeCun's architecture and won ImageNet. Compared to 1998, we now have large datasets, and powerful GPUs have removed many performance bottlenecks.\n  - Since 2012, CNNs have been used for many tasks, for example:\n    - Image classification.\n    - Image retrieval.\n      - Extract features with a neural network, then match by similarity.\n    - Object detection.\n    - Image segmentation.\n      - Assign a label to each pixel in the image.\n    - Face recognition.\n    - Pose estimation.\n    - Medical image analysis.\n    - Playing Atari games with reinforcement learning.\n    - Galaxy classification.\n    - Traffic sign recognition.\n    - Image captioning.\n    - DeepDream.\n- CNN architectures explicitly assume that the input is an image, which lets us encode certain prior knowledge into the network structure.\n- There are several types of layers in a CNN (CONV, FC, ReLU, and POOL are the most common).\n- A layer may or may not have parameters (CONV and FC do; ReLU and POOL don't).\n- A layer may or may not have extra hyperparameters (CONV, FC, and POOL do; ReLU doesn't).\n- How CNNs work:\n  - A fully connected layer is a layer in which all neurons are connected to each other; it is sometimes called a dense layer.\n    - If the input shape is `(X, M)`, the layer's weight shape is `(NoOfHiddenNeurons, X)`.\n  - A convolution layer preserves the input structure by sliding filters over the whole image.\n    - We implement it with a dot product: `W.T*X + b`. This formula uses broadcasting.\n    - So we need to determine the values of `W` and `b`.\n    - We usually treat the filter `W` as a vector, not a matrix.\n  - The output of a convolution is called an activation map, and we need multiple activation maps.\n    - Example: with 6 filters, the shapes are as follows:\n      - Input image                        `(32,32,3)`\n      - Filter size                              `(5,5,3)`\n        - When applying 6 filters, the depth must be 3 because the input image depth is 3.\n      - Output after convolution                 `(28,28,6)` \n        - With a single filter it would be   `(28,28,1)`\n      - After ReLU                          `(28,28,6)` \n      - Apply another set of 10 filters                     `(5,5,6)`\n      - Output after convolution                 `(24,24,10)`\n  - In practice, CNNs learn low-level features in the early layers, then mid-level and finally high-level features.\n  - After the conv layers we can use a linear classifier for classification tasks.\n  - In CNNs we typically apply several (Conv ==> ReLU) blocks, then a pooling operation to shrink the activation maps.\n- What is the stride in a convolution operation?\n  - When convolving, we need to choose the stride of the slide. I will explain with an example.\n  - The stride is how far we jump at each slide; the default is 1.\n  - 
假设有一个形状为`(7,7)`的矩阵和一个形状为`(3,3)`的滤波器：\n    - 如果步幅为1，则输出形状为`(5,5)`              `# 会丢掉2行2列`\n    - 如果步幅为2，则输出形状为`(3,3)`             `# 会丢掉4行4列`\n    - 如果步幅为3，则无法正常工作。\n  - 一般公式为`((N-F)\u002Fstride +1)`。\n    - 如果步幅为1，则`O = ((7-3)\u002F1)+1 = 4 + 1 = 5`。\n    - 如果步幅为2，则`O = ((7-3)\u002F2)+1 = 2 + 1 = 3`。\n    - 如果步幅为3，则`O = ((7-3)\u002F3)+1 = 1.33 + 1 = 2.33`        `# 无法正常工作`\n\n- 在实践中，通常会对边界进行零填充。 `# 两侧填充。`\n  - 如果步幅为 `1`，常见的填充方式是使用公式 `(F-1)\u002F2`，其中 F 是卷积核的大小。\n    - 例如 `F = 3` ==> 零填充 `1`\n    - 例如 `F = 5` ==> 零填充 `2`\n  - 如果我们以这种方式填充，就称为“相同卷积”。\n  - 添加零可以为边缘提供额外的信息，因此存在不同的填充技术，比如用非零值填充角落。但在实际应用中，零填充效果很好！\n  - 我们这样做是为了保持输入的完整尺寸。如果不这样做，输入会迅速缩小，导致大量数据丢失。\n- 示例：\n  - 如果我们有一个形状为 `(32,32,3)` 的输入，以及十个形状为 `(5,5)`、步幅为 `1`、填充为 `2` 的卷积核：\n    - 输出大小将是 `(32,32,10)`                       `# 我们保持了尺寸。`\n  - 每个卷积核的参数量 `= 5*5*3 + 1 = 76`\n  - 总参数量 `= 76 * 10 = 760`\n- 卷积核的数量通常是 2 的幂次方。 `# 以便更好地向量化。`\n- 因此，卷积层的参数包括：\n  - 卷积核数量 K。\n    - 通常为 2 的幂次方。\n  - 空间卷积核大小 F。\n    - 3、5、7 等。\n  - 步幅 S。\n    - 通常为 1 或 2        （如果步幅较大，会发生下采样，但这与池化不同）\n  - 填充量\n    - 如果希望输入和输出的形状一致，则根据 F 的大小来决定：F 为 3 时填充 1，F 为 5 时填充 2，依此类推。\n- 池化可以使特征表示更小、更易于管理。\n- 池化操作独立地作用于每个激活图。\n- 池化的例子之一是最大池化。\n  - 最大池化的参数包括卷积核大小和步幅。\n    - 例如 `2x2`，步幅为 `2`                     `# 通常两个参数都是相同的 2 , 2`\n- 另一个池化的例子是平均池化。\n  - 在这种情况下，它可能是可学习的。\n\n\n\n## 06. 神经网络训练 I\n\n- 作为回顾，以下是小批量随机梯度下降算法的步骤：\n\n  - 循环：\n    1. 抽取一批数据。\n    2. 将其通过网络前向传播，得到损失。\n    3. 进行反向传播计算梯度。\n    4. 
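上面小批量 SGD 的四个步骤可以概括为如下玩具示意（损失函数、学习率均为假设，仅演示“用梯度更新参数”这一步的形式）：

```python
def sgd_step(params, grads, lr=1e-2):
    """第 4 步：沿负梯度方向原地更新每个参数（lr 为假设的学习率）。"""
    for k in params:
        params[k] -= lr * grads[k]

# 玩具问题：损失 L = 0.5*w^2，解析梯度 dL/dw = w
w = {"w": 4.0}
for _ in range(100):
    g = {"w": w["w"]}     # 对应第 2–3 步：前向得到损失、反向得到梯度
    sgd_step(w, g, lr=0.1)
```

真实训练中 `g` 来自整批数据的反向传播，这里只保留循环骨架。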
使用梯度更新参数。\n\n- 激活函数：\n\n  - 不同的激活函数选择包括 Sigmoid、tanh、RELU、Leaky RELU、Maxout 和 ELU。\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_89e6268a8f7a.png)\n\n  - Sigmoid：\n\n    - 将数值压缩到 [0,1] 范围内。\n    - 类似于人脑的发放率。\n    - `Sigmoid(x) = 1 \u002F (1 + e^-x)`\n    - Sigmoid 的问题：\n      - 大数值会导致神经元“杀死”梯度。\n        - 梯度在大多数情况下接近 0（无论是大值还是小值），这会阻止大型网络的参数更新。\n      - 不是零中心的。\n        - 无法产生零均值的数据。\n      - `exp()` 的计算成本较高。\n        - 仅作参考。深度学习中还有更复杂的操作，比如卷积。\n\n  - Tanh：\n\n    - 将数值压缩到 [-1,1] 范围内。\n    - 是零中心的。\n    - 同样，大数值会导致神经元“杀死”梯度。\n    - `Tanh(x)` 是其公式。\n    - 由 Yann LeCun 于 1991 年提出。\n\n  - RELU（修正线性单元）：\n\n    - `RELU(x) = max(0,x)`\n    - 不会杀死梯度。\n      - 只有小数值会被杀死。梯度只有一半被杀死。\n    - 计算效率高。\n    - 收敛速度比 Sigmoid 和 Tanh 快得多 `(6倍)`\n    - 比 Sigmoid 更符合生物学原理。\n    - 由 Alex Krizhevsky 于 2012 年在多伦多大学提出。（AlexNet）\n    - 问题：\n      - 不是零中心的。\n    - 如果权重初始化不好，可能有 75% 的神经元处于“死亡”状态，造成计算浪费。不过它仍然有效。目前仍在积极研究如何优化这一问题。\n    - 为了解决上述问题，人们可能会将所有偏置初始化为 0.01。\n\n  - Leaky RELU：\n\n    - `leaky_RELU(x) = max(0.01x,x)`\n    - 不会从两边杀死梯度。\n    - 计算效率高。\n    - 收敛速度比 Sigmoid 和 Tanh 快得多（6倍）。\n    - 不会“死亡”。\n    - PReLU 则将 0.01 替换为一个可学习的参数 alpha。\n\n  - 指数线性单元（ELU）：\n\n    - ```\n      ELU(x) = { x                                           若 x > 0\n      \t\t   alpah *(exp(x) -1)\t\t                   若 x \u003C= 0\n                 # alpah 是一个可学习的参数\n      }\n      ```\n\n    - 它具有 RELU 的所有优点。\n    - 输出更接近零均值，并且对噪声有一定的鲁棒性。\n    - 问题：\n      - `exp()` 的计算成本较高。\n\n  - Maxout 激活函数：\n\n    - `maxout(x) = max(w1.T*x + b1, w2.T*x + b2)`\n    - 是 RELU 和 Leaky RELU 的推广。\n    - 不会“死亡”！\n    - 问题：\n      - 会使每个神经元的参数数量翻倍。\n\n  - 实际应用：\n\n    - 使用 RELU。注意学习率的设置。\n    - 可以尝试 Leaky RELU\u002FMaxout\u002FELU。\n    - 可以尝试 tanh，但不要抱太大期望。\n    - 不要使用 Sigmoid！\n\n- **数据预处理**：\n\n  - 对数据进行归一化：\n\n  - ```python\n    # 零中心化数据。（计算每个输入的均值）。\n    # 我们这样做的原因之一是，需要数据在正负之间分布，而不是全部为正或全部为负。\n    X -= np.mean(X, axis = 1)\n\n    # 然后应用标准差。提示：对于图像，我们不这样做。\n    X \u002F= 
np.std(X, axis = 1)\n    ```\n\n  - 对图像进行归一化：\n\n    - 减去平均图像（例如 AlexNet）。\n      - 平均图像的形状与输入图像相同。\n    - 或者减去每通道的均值。\n      - 即计算所有图像每个通道的均值。形状为 3（3 个通道）。\n\n- **权重初始化**：\n\n  - 如果将所有 W 初始化为零，会发生什么？\n\n    - 所有神经元都会做完全相同的事情。它们会有相同的梯度和相同的更新。\n    - 因此，如果某一层的所有权重都相等，就会出现上述情况。\n\n  - 第一种想法是将权重初始化为小的随机数：\n\n    - ```python\n      W = 0.01 * np.random.rand(D, H)\n      # 对小型网络效果尚可，但对深层网络则会产生问题！\n      ```\n\n    - 在深层网络中，标准差会趋近于零，梯度也会更快消失。\n\n    - ```python\n      W = 1 * np.random.rand(D, H) \n      # 对小型网络效果尚可，但对深层网络则会产生问题！\n      ```\n\n    - 网络可能会因为数值过大而爆炸。\n\n  - ***Xavier 初始化***：\n\n    - ```python\n      W = np.random.rand(in, out) \u002F np.sqrt(in)\n      ```\n\n    - 它之所以有效，是因为我们希望输入的方差与输出的方差保持一致。\n    - 但它有一个问题：在使用 RELU 时会失效。\n\n  - ***He 初始化***（解决 RELU 问题）：\n\n    - ```python\n      W = np.random.rand(in, out) \u002F np.sqrt(in\u002F2)\n      ```\n\n    - 解决了 RELU 的问题。建议在使用 RELU 时采用此方法。\n\n  - 正确的权重初始化仍然是一个活跃的研究领域。\n\n- **批归一化**：\n\n- 是一种为神经网络中的任意一层提供均值为零、方差为一的输入的技术。\n  - 它可以加速训练过程。你应该经常使用它。\n    - 由谢尔盖·伊奥菲和克里斯蒂安·塞格迪于2015年提出。\n  - 我们通过计算每层的均值和方差，使激活值呈现高斯分布。\n  - 通常在（全连接层或卷积层）之后、（非线性激活函数）之前插入。\n  - 步骤（针对每一层的输出）：\n    1. 首先，我们为每个特征计算批次的均值和方差²。\n    2. 通过减去均值并除以（方差² + epsilon）的平方根来进行归一化。\n       - epsilon是为了避免除以零。\n    3. 然后，我们引入缩放和平移变量：`Result = gamma * normalizedX + beta`\n       - gamma和beta是可学习的参数。\n       - 这实际上允许模型说：“嘿！我不需要均值为零、方差为一的输入，把原始输入还给我——那样对我更好。”\n       - 可以根据需要调整平移和缩放，而不仅仅是基于均值和方差！\n  - 该算法使每一层都具有灵活性（它可以自行选择所需的分布）。\n  - 我们初始化BatchNorm参数，将输入转换为均值为零、方差为一的分布，但在训练过程中，它们可能会学习到其他分布更为合适。\n  - 在训练过程中，我们需要使用加权平均来计算每层的全局均值和全局方差。\n  - \u003Cu>批量归一化的优势\u003C\u002Fu>：\n    - 网络训练速度更快。\n    - 允许使用更高的学习率。\n    - 有助于降低对初始权重的敏感性。\n    - 使更多激活函数变得可行。\n    - 提供一定的正则化效果。\n      - 因为我们为每个批次计算均值和方差，这会带来轻微的正则化作用。\n  - 在卷积层中，每个激活图会有各自的方差和均值。\n  - 批量归一化在卷积神经网络和常规深度神经网络中表现最佳，但在循环神经网络和强化学习领域，它仍然是一个活跃的研究方向。\n    - 在强化学习中应用起来比较困难，因为批次通常较小。\n\n- **监督学习过程**\n\n  1. 数据预处理。\n  2. 选择网络架构。\n  3. 进行前向传播并检查损失（禁用正则化）。确认损失是否合理。\n  4. 添加正则化项，此时损失应该会增加！\n  5. 
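上文“深层网络中标准差趋近于零”的现象可以用一个小实验直观验证（正文代码中的 `np.random.rand` 按上下文应为高斯随机数 `randn`，这里用 `standard_normal`；网络规模与层数均为演示假设）：

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_stats(init_scale_fn, n_layers=10, dim=512):
    """把数据传过 n 层 tanh 全连接，记录各层激活值的标准差。"""
    x = rng.standard_normal((1000, dim))
    stds = []
    for _ in range(n_layers):
        W = rng.standard_normal((dim, dim)) * init_scale_fn(dim)
        x = np.tanh(x @ W)
        stds.append(x.std())
    return stds

small = forward_stats(lambda fan_in: 0.01)                  # 过小：激活迅速趋于 0
xavier = forward_stats(lambda fan_in: 1 / np.sqrt(fan_in))  # Xavier：方差大致保持
```

`small` 的末层标准差几乎为零（梯度随之消失），而 Xavier 初始化让各层方差保持在合理范围。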
再次禁用正则化，取少量数据尝试训练，直到损失降至零。\n     - 对于小数据集，你应该能够完美地过拟合。\n  6. 使用完整的训练数据，并采用较小的正则化强度，尝试不同的学习率。\n     - 如果损失几乎没有变化，则说明学习率太低。\n     - 如果出现`NAN`，则表明网络发散了，学习率过高。\n     - 通过尝试最小可能值（可能会改变）和不会导致网络发散的最大值，确定你的学习率范围。\n  7. 进行超参数优化，以找到最佳的超参数组合。\n\n- 超参数优化\n\n  - 尝试交叉验证策略。\n    - 运行几个周期，尝试优化各个范围。\n  1. 最好在对数空间中进行优化。\n  2. 调整范围后再次尝试。\n  3. 相较于网格搜索，在对数空间中进行随机搜索效果更好。\n\n\n\n\n\n## 07. 训练神经网络II\n\n- **优化算法**：\n\n  - 随机梯度下降法的问题：\n\n    - 如果在一个方向上损失下降得很快，而在另一个方向上下降得很慢（仅针对两个变量），那么在浅维度上的进展会非常缓慢，而在陡峭维度上则会出现抖动。对于拥有大量参数的神经网络来说，这个问题会更加严重。\n    - 局部极小值或鞍点\n      - 如果SGD陷入局部极小值，由于梯度为零，模型就会卡在这个点上。\n      - 同样，在鞍点处梯度也为零，模型也会陷入停滞。\n      - 鞍点意味着在某些点上：\n        - 有些梯度会使损失上升。\n        - 有些梯度会使损失下降。\n        - 这种情况在高维空间中更为常见（例如，拥有1亿个维度的情况）。\n      - 对于深度神经网络而言，其主要问题在于鞍点，而非局部极小值，因为深度网络的参数维度非常高。\n      - 小批量数据由于并非基于整个批次计算梯度，因此噪声较大。\n\n  - **SGD + 动量**：\n\n    - 通过梯度的移动平均构建速度：\n\n    - ```python\n      # 计算加权平均。rho的最佳范围是[0.9 - 0.99]\n      V[t+1] = rho * v[t] + dx\n      x[t+1] = x[t] - learningRate * V[t+1]\n      ```\n\n    - `V[0]`为零。\n\n    - 解决了鞍点和局部极小值的问题。\n\n    - 它会稍微越过问题点，然后再返回。\n\n  - **Nesterov 动量**：\n\n    - ```python\n      dx = compute_gradient(x)\n      old_v = v\n      v = rho * v - learning_rate * dx\n      x+= -rho * old_v + (1+rho) * v\n      ```\n\n    - 不会过度越过问题点，但速度比SGD + 动量稍慢。\n\n  - **AdaGrad**：\n\n    - ```python\n      grad_squared = 0\n      while(True):\n        dx = compute_gradient(x)\n        \n        # 这里存在问题，grad_squared不会衰减（会变得非常大）\n        grad_squared += dx * dx\t\t\t\n        \n        x -= (learning_rate*dx) \u002F (np.sqrt(grad_squared) + 1e-7)\n      ```\n\n  - **RMSProp**：\n\n    - ```python\n      grad_squared = 0\n      while(True):\n        dx = compute_gradient(x)\n        \n        # 解决了AdaGrad的问题\n        grad_squared = decay_rate * grad_squared + (1-grad_squared) * dx * dx  \n        \n        x -= (learning_rate*dx) \u002F (np.sqrt(grad_squared) + 1e-7)\n      ```\n\n    - 人们现在更倾向于使用RMSProp而不是AdaGrad。\n\n  - **Adam**：\n\n    - 结合了动量和RMSProp的梯度信息。\n    
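Adam 把动量（梯度的一阶矩）与 RMSProp（二阶矩）结合起来。下面是单步更新的示意实现（超参取正文推荐值，`eps` 取常用的 `1e-8`，玩具目标函数为假设）：

```python
import numpy as np

def adam_step(w, dw, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam 单步更新：t 从 1 开始计数。"""
    m = beta1 * m + (1 - beta1) * dw          # 动量：一阶矩的滑动平均
    v = beta2 * v + (1 - beta2) * dw * dw     # RMSProp：二阶矩的滑动平均
    m_hat = m / (1 - beta1 ** t)              # 偏差校正：修正初始几步估计偏小
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# 玩具问题：最小化 0.5*w^2（梯度就是 w）
w, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    w, m, v = adam_step(w, w, m, v, t, lr=1e-2)
```

注意正文 RMSProp 代码里的 `(1-grad_squared)` 按该算法应为 `(1-decay_rate)`，这里的二阶矩一行即为修正后的写法。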
- 需要进行偏差校正以修正梯度的初始状态。\n    - 到目前为止，它是表现最好的优化方法，在许多问题上都能取得优异的效果。\n    - 使用`beta1 = 0.9`、`beta2 = 0.999`以及`learning_rate = 1e-3`或`5e-4`作为许多模型的起点！\n\n  - **学习率衰减**：\n\n    - 例如，每隔几个周期将学习率减半。\n    - 这有助于防止学习率波动过大。\n    - 学习率衰减常用于SGD+动量，但不太适用于Adam。\n    - 在选择超参数时，不要一开始就使用学习率衰减。先尝试一下，再决定是否需要衰减。\n\n  - 我们讨论的所有上述算法都属于一阶优化方法。\n  - **二阶优化**：\n\n    - 使用梯度和海森矩阵构建二次近似。\n    - 沿着近似的极小值方向前进。\n    - 这种更新方式有什么优点？\n      - 在某些版本中，它不需要学习率。\n    - 但它在深度学习中并不实用。\n      - 海森矩阵有O(N^2)个元素。\n      - 逆矩阵计算需要O(N^3)的时间。\n    - **L-BFGS**是一种二阶优化方法。\n      - 它适用于批处理优化，但不适用于小批量数据。\n\n  - 实际操作中，首先使用ADAM，如果效果不佳再尝试L-BFGS。\n  - 有人认为所有著名的深度架构都使用了**SGD + Nesterov 动量**。\n\n- **正则化**\n\n- 到目前为止，我们讨论的是如何降低训练误差，但真正让我们关心的，是我们模型对未见过数据的处理能力！\n  - 如果训练数据和验证数据之间的误差差距过大，该怎么办？\n  - 这种误差被称为高方差。\n  - **模型集成**：\n    - 算法：\n      - 使用不同的初始化参数训练多个结构相同但相互独立的模型。\n      - 在测试时对它们的预测结果取平均。\n    - 这种方法通常可以提升约2%的性能。\n    - 它能够降低泛化误差。\n    - 你可以在训练过程中保存神经网络的若干快照，然后将这些快照集成起来，最终得到的结果。\n  - 正则化可以解决高方差问题。我们已经讨论过L1和L2正则化。\n  - 还有一些专门为神经网络设计的正则化技术，效果往往更好。\n  - **Dropout**：\n    - 在每次前向传播时，随机将一部分神经元的输出置为零。丢弃概率是一个超参数，大多数情况下设为0.5。\n    - 也就是说，你会随机选择一些激活值并将其置零。\n    - 它之所以有效，是因为：\n      - 它迫使网络形成冗余表示，从而防止特征之间的共适应；\n      - 从某种角度看，它实际上是在同一个模型中集成了多个子模型！\n    - 在测试时，我们可能会将每个dropout层的输出乘以丢弃概率。\n    - 有时在测试时则不进行任何缩放，直接使用原始输出。\n    - 使用dropout会增加训练时间。\n  - **数据增强**：\n    - 另一种起到正则化作用的技术。\n    - 对数据进行变换！\n    - 例如翻转图像或旋转图像。\n    - ResNet中的例子：\n      - 训练阶段：随机采样裁剪和缩放区域：\n        1. 在[256,480]范围内随机选取一个长度L。\n        2. 将训练图像短边调整为L。\n        3. 随机采样一个224x224的区域。\n      - 测试阶段：对固定的一组裁剪区域取平均：\n        1. 将图像调整为5种尺度：{224, 256, 384, 480, 640}。\n        2. 
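上文的 dropout 可以用几行代码示意。这里演示的是更常用的“反向缩放”（inverted dropout）变体：训练时除以保留概率，测试时无需任何缩放，与正文提到的“测试时乘以概率”在期望上等价（变量名为假设）：

```python
import numpy as np

rng = np.random.default_rng(1)
p = 0.5   # 保留概率（正文中的 0.5）

def dropout_forward(x, train=True):
    """训练时随机置零约一半激活并放大剩余部分；测试时原样返回。"""
    if not train:
        return x
    mask = (rng.random(x.shape) < p) / p
    return x * mask

x = np.ones((10000,))
out = dropout_forward(x, train=True)   # 期望值不变：E[out] ≈ E[x]
```

这样训练与测试阶段的激活期望一致，部署时不需要额外处理。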
对每种尺度，分别从四个角、中心以及水平\u002F垂直翻转后的位置采样10个224x224的区域。\n      - 应用颜色抖动或PCA降维。\n      - 还可以进行平移、旋转和拉伸等操作。\n  - Drop Connect：\n    - 类似于dropout的思想，也是一种正则化方法。\n    - 不是丢弃激活值，而是随机将部分权重置为零。\n  - 分数最大池化：\n    - 一种很酷的正则化思路，不过并不常用。\n    - 在池化时随机划分区域。\n  - 随机深度：\n    - 一种新提出的正则化方法。\n    - 不是丢弃神经元，而是随机跳过某些层。\n    - 其效果与dropout类似，但属于一种全新的思路。\n\n- **迁移学习**：\n\n  - 有时候，你的模型出现过拟合，并不是因为正则化不足，而是因为数据量太小。\n\n  - 如果你想训练或使用卷积神经网络，通常需要大量的数据。\n\n  - 迁移学习的步骤：\n\n    1. 先在一个包含与你的数据集相似特征的大数据集上进行预训练。\n    2. 冻结除最后一层以外的所有层，只用你的小数据集来训练最后一层。\n    3. 你不仅可以重新训练最后一层，还可以根据数据量的多少，微调任意数量的层。\n\n  - 迁移学习使用指南：\n\n    |                         | 数据集非常相似               | 数据集差异较大                   |\n      | ----------------------- | ---------------------------------- | ---------------------------------------- |\n      | **数据量极少**         | 在顶层使用线性分类器           | 比较棘手……可以尝试从不同层次提取特征使用线性分类器 |\n      | **数据量较多**         | 微调几层                       | 微调较多层                  |\n\n  - 迁移学习其实是一种常态，而非特例。\n\n## 08. 深度学习软件\n\n- 由于深度学习软件的快速迭代，CS231n课程每年这一部分都会有很大的变化。\n- CPU与GPU\n  - GPU：显卡最初是为了渲染图形、运行游戏或制作3D媒体等而开发的。\n    - NVIDIA与AMD\n      - 在深度学习中，通常选择NVIDIA而非AMD的GPU，因为NVIDIA在推动深度学习研究方面更为积极，并且其架构也更适合深度学习任务。\n  - CPU的核心数量较少，但每个核心的速度更快、功能更强，擅长处理顺序性任务。而GPU的核心数量多，但单个核心的速度较慢、功能较弱，更适合并行计算。\n  - GPU的核心需要协同工作，并且拥有独立的显存。\n  - 矩阵乘法是非常适合在GPU上执行的操作之一，因为它包含M×N个可以并行进行的独立运算。\n  - 卷积操作同样可以并行化，因为它由多个独立的子操作组成。\n  - GPU编程框架：\n    - **CUDA**（仅适用于NVIDIA显卡）\n      - 使用类似C语言的代码直接在GPU上运行。\n      - 编写高效的GPU代码较为困难，因此NVIDIA提供了高层API来简化开发。\n      - 高层API包括cuBLAS、cuDNN等。\n      - **CuDNN**已经为你实现了反向传播、卷积、循环神经网络等操作！\n      - 实际上，你并不需要自己编写并行代码，而是可以直接使用他人已经实现并优化好的代码。\n    - **OpenCL**\n      - 类似于CUDA，但可以在任何品牌的GPU上运行。\n      - 通常速度较慢。\n      - 目前尚未得到所有深度学习框架的广泛支持。\n  - 存在许多关于并行编程的课程。\n  - 如果不注意，训练过程可能会因数据读取和传输到GPU而成为瓶颈。解决方法包括：\n    - 将所有数据加载到内存中（如果可能）。\n    - 使用SSD替代HDD。\n    - 使用多个CPU线程预取数据！\n      - 当GPU正在计算时，一个CPU线程会提前将数据准备好。\n      - 许多框架已经内置了这种机制，因为手动实现较为复杂。\n- **深度学习框架**\n  - 发展非常迅速！\n  - 当前可用的框架包括：\n    - TensorFlow（Google）\n    - Caffe（UC 
Berkeley）\n    - Caffe2（Facebook）\n    - Torch（NYU \u002F Facebook）\n    - PyTorch（Facebook）\n    - Theano（蒙特利尔大学）\n    - Paddle（百度）\n    - CNTK（微软）\n    - MXNet（亚马逊）\n  - 教师认为你应该重点关注TensorFlow和PyTorch。\n  - 深度学习框架的作用：\n    - 轻松构建大型计算图。\n    - 方便地计算计算图中的梯度。\n    - 高效地在GPU上运行（利用cuDNN和cuBLAS）。\n  - NumPy无法在GPU上运行。\n  - 大多数框架在前向传播阶段都尽量模仿NumPy的使用方式，然后自动为你计算梯度。\n- **TensorFlow（Google）**\n  - 代码分为两部分：\n    1. 定义计算图。\n    2. 运行并多次复用该计算图。\n  - TensorFlow采用静态图架构。\n  - TensorFlow中的变量是计算图的一部分，而占位符则在每次运行时被喂入。\n  - 全局初始化函数用于初始化计算图中的变量。\n  - 可以使用预定义的优化器和损失函数。\n  - 使用`layers.dense`函数即可创建完整的全连接层。\n  - **Keras**（高层封装）\n    - Keras是TensorFlow之上的一个高层封装，使常见操作更加简单。\n    - 非常流行！\n    - 几行代码就能训练一个完整的深度神经网络。\n  - 存在许多高层封装：\n    - Keras\n    - TFLearn\n    - TensorLayer\n    - `tf.layers` # TensorFlow自带\n    - `tf-Slim` # TensorFlow自带\n    - `tf.contrib.learn` # TensorFlow自带\n    - Sonnet # DeepMind推出的新框架\n  - TensorFlow提供了预训练模型，可用于迁移学习。\n  - TensorBoard可以记录损失、统计信息等，并启动服务器生成可视化图表。\n  - 如果需要将计算图分布到多个节点上，TensorFlow也支持分布式计算。\n  - TensorFlow实际上受到了Theano的启发，两者在设计理念和结构上有很多相似之处。\n\n\n- **PyTorch（Facebook）**\n\n  - 采用三层抽象：\n    - Tensor：类似于`ndarray`，但可以在GPU上运行 # 类似于TensorFlow中的NumPy数组\n      - Variable：计算图中的节点，存储数据和梯度 # 类似于TensorFlow中的Tensor、Variable和Placeholder\n    - Module：神经网络层，可能包含状态或可学习的权重 # 类似于TensorFlow中的`tf.layers`\n  - 在PyTorch中，计算图是在你执行代码的同一循环中动态构建的，这使得调试更加方便，称为动态图。\n  - 在PyTorch中，你可以通过为张量编写前向和反向传播函数来定义自己的自动求导函数，不过大多数情况下这些功能都已经为你实现好了。\n  - Torch.nn是一个类似于TensorFlow中Keras的高层API，你可以用它来构建模型并继续扩展。\n    - 也可以自定义自己的神经网络模块！\n  - PyTorch还包含了与TensorFlow类似的优化器。\n  - 它提供了一个数据加载器，可以对数据集进行分批、打乱顺序和多线程处理。\n  - PyTorch拥有最好且易于使用的预训练模型。\n  - PyTorch还配备了Visdom工具，类似于TensorBoard，不过TensorBoard的功能似乎更强大。\n  - 与Torch相比，PyTorch相对较新，仍在不断发展，目前仍处于测试阶段。\n  - PyTorch更适合用于研究。\n\n- TensorFlow只构建一次计算图，然后多次运行（称为静态图）。\n\n- 在PyTorch的每次迭代中，都会重新构建一个新的计算图（称为动态图）。\n\n- **静态图与动态图**：\n\n  - 优化：\n    - 对于静态图，框架可以在运行前为你优化计算图。\n  - 序列化：\n    - **静态图**：一旦计算图构建完成，就可以将其序列化并在没有原始代码的情况下运行，例如在C++中使用该计算图。\n    - **动态图**：始终需要保留原始代码。\n  - 
条件语句：\n    - 动态图更容易实现条件逻辑，而静态图则相对复杂。\n  - 循环：\n    - 动态图中处理循环更为简单，而静态图中则较为复杂。\n  - TensorFlow的`fold`功能可以通过动态批处理使动态图的使用更加便捷。\n  - 动态图的应用场景包括：循环神经网络和递归神经网络。\n  - Caffe2采用静态图，可以用Python训练模型，同时支持iOS和Android平台。\n  - TensorFlow和Caffe2在生产环境中应用广泛，尤其是在移动端。\n\n## 09. CNN架构\n\n- 本节介绍著名的CNN架构，重点讨论自2012年以来在[ImageNet](www.image-net.org\u002F)竞赛中获胜的CNN架构。\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_487c3026ca9c.png)\n\n- 这些架构包括：[AlexNet](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)、[VGG](https:\u002F\u002Farxiv.org\u002Fabs\u002F1409.1556)、[GoogLeNet](https:\u002F\u002Fresearch.google.com\u002Fpubs\u002Fpub43022.html)和[ResNet](https:\u002F\u002Farxiv.org\u002Fabs\u002F1512.03385)。\n- 此外，我们还会在讲解过程中讨论一些有趣的架构。\n- 第一个卷积神经网络是[Yann LeCun](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F726791\u002F)于1998年提出的[LeNet-5](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F726791\u002F)架构。\n\n- 架构为：`CONV-POOL-CONV-POOL-FC-FC-FC`\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_1cc3fc4ed869.jpg)\n  - 每个卷积滤波器大小为 `5x5`，步幅为 1\n  - 每个池化层大小为 `2x2`，步幅为 2\n  - 在数字识别任务中非常有用。\n  - 特别是，图像特征分布在整个图像中，而带有可学习参数的卷积操作能够以较少的参数在多个位置有效地提取相似特征。\n  - 它恰好包含 **\u003Cu>5\u003C\u002Fu>** 层。\n\n- 2010 年，丹·克劳迪乌·西雷桑和尤尔根·施密德胡伯发表了最早的 GPU 神经网络实现之一。该实现使用 NVIDIA GTX 280 显卡，实现了多达 9 层神经网络的前向和反向传播。\n\n- [**AlexNet**](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F4824-imagenet-classification-with-deep-convolutional-neural-networks.pdf)（2012 年）：\n\n  - 这是一种卷积神经网络，开启了深度学习的发展，并在 2012 年 ImageNet 竞赛中获胜。\n  - 架构为：`CONV1-MAXPOOL1-NORM1-CONV2-MAXPOOL2-NORM2-CONV3-CONV4-CONV5-MAXPOOL3-FC6-FC7-FC8`\n  - 总共包含 **\u003Cu>8\u003C\u002Fu>** 层，其中前 5 层为卷积层，后 3 层为全连接层。\n  - AlexNet 的错误率为 `16.4%`。\n  - 例如，如果输入为 227 x 227 x 3，则各层的输出形状如下：\n    - CONV1 （96 个 11 x 11 滤波器，步幅 4，填充 0）\n      - 输出形状为 `(55,55,96)`，权重数量为 `(11*11*3*96)+96 = 34944`\n    - MAXPOOL1 
（3 x 3 滤波器，步幅 2）\n      - 输出形状为 `(27,27,96)`，无权重\n    - NORM1\n      - 输出形状为 `(27,27,96)`，但现在已经不再使用了\n    - CONV2 （256 个 5 x 5 滤波器，步幅 1，填充 2）\n    - MAXPOOL2 （3 x 3 滤波器，步幅 2）\n    - NORM2\n    - CONV3 （384 个 3 x 3 滤波器，步幅 1，填充 1）\n    - CONV4 （384 个 3 x 3 滤波器，步幅 1，填充 1）\n    - CONV5 （256 个 3 x 3 滤波器，步幅 1，填充 1）\n    - MAXPOOL3 （3 x 3 滤波器，步幅 2）\n      - 输出形状为 `(6,6,256)`\n    - FC6 （4096 个神经元）\n    - FC7 （4096 个神经元）\n    - FC8 （1000 个神经元，用于分类得分）\n  - 其他细节：\n    - 首次使用 ReLU 激活函数。\n    - 曾使用归一化层，但现在已不再采用。\n    - 大量数据增强。\n    - Dropout 概率为 `0.5`。\n    - 批量大小为 `128`。\n    - SGD 动量为 `0.9`。\n    - 初始学习率为 `1e-2`，并在某些迭代中降低为原来的十分之一。\n    - 使用了 7 个 CNN 模型的集成！\n\n  - AlexNet 是在 GTX 580 GPU 上训练的，但该显卡仅有 3 GB 显存，不足以在单机上完成训练，因此他们将特征图分成了两半进行计算。这是第一个分布式版本的 AlexNet。\n  - 直到现在，它仍然被广泛应用于迁移学习的各种任务中。\n  - 总参数量为 `6000 万`。\n\n- [**ZFNet**](https:\u002F\u002Farxiv.org\u002Fabs\u002F1311.2901)（2013 年）\n\n  - 在 2013 年以 11.7% 的错误率获胜。\n  - 它的总体结构与 AlexNet 相同，但通过调整超参数获得了更好的效果。\n  - 同样包含 **\u003Cu>8\u003C\u002Fu>** 层。\n  - 与 AlexNet 不同的是：\n    - `CONV1`：从 (11 x 11，步幅 4) 改为 (7 x 7，步幅 2)。\n    - `CONV3、4、5`：将滤波器数量从 384、384、256 分别改为 512、1024、512。\n\n- [OverFeat](https:\u002F\u002Farxiv.org\u002Fabs\u002F1312.6229)（2013 年）\n\n  - 在 2013 年的 ImageNet 竞赛中赢得了目标定位任务。\n  - 该研究展示了如何在卷积神经网络中高效地实现多尺度和滑动窗口方法。此外，还提出了一种新的深度学习方法来预测目标边界，从而实现目标定位。\n\n- [**VGGNet**](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1409.1556)（2014 年）（牛津大学）\n\n  - 更深的网络，层数更多。\n  - 包含 19 层。\n  - 在 2014 年以 7.3% 的错误率击败 GoogleNet 获胜。\n  - 使用更小的滤波器和更深的层数。\n  - VGG 的一大优势在于，连续使用多个 3 × 3 卷积可以模拟更大感受野的效果，例如 5 × 5 和 7 × 7。\n  - 整个网络都采用了简单的 3 x 3 卷积。\n    - 三个 3 x 3 卷积的效果相当于一个 7 x 7 卷积。\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_2617c7ddb8a3.png)\n  - 该架构包含多层卷积，随后是五次池化操作，最后是全连接层。\n  - 对于每张图像，仅前向传播就需要 96 MB 的内存！\n    - 大部分内存消耗在早期的卷积层。\n  - 总参数量为 1.38 亿。\n    - 大部分参数位于全连接层。\n  - 训练细节与 AlexNet 类似，例如使用动量和 dropout。\n  - VGG19 是 VGG16 的升级版，性能稍好，但需要更多的内存。\n    - 
![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_4032840ebf4b.png)\n\n- [**GoogleNet**](https:\u002F\u002Fresearch.google.com\u002Fpubs\u002Fpub43022.html)（2014 年）\n\n- 更深的网络，层数更多。\n  - 包含22层。\n  - 使用高效的**\u003Cu>Inception\u003C\u002Fu>**模块。\n  - 参数量仅500万！比AlexNet少12倍。\n  - 在2014年与VGGNet并列夺冠，错误率为6.7%。\n  - Inception模块：\n    - 设计良好的局部网络拓扑结构（即“网络中的网络”NiN），然后将这些模块堆叠在一起。\n    - 其组成包括：\n      - 对前一层的输入同时应用多个并行的卷积操作：\n        - 多个不同尺寸的卷积核（1×1、3×3、5×5），\n          - 通过填充保持输出特征图的尺寸不变。\n        - 池化操作（最大池化），\n          - 同样通过填充保持输出特征图的尺寸不变。\n      - 将所有卷积和池化的输出在深度维度上拼接起来。\n    - 例如：\n      - Inception模块的输入为28×28×256。\n      - 并行应用的滤波器如下：\n        - (1×1)，128个滤波器               `# 输出形状为(28,28,128)`\n        - (3×3)，192个滤波器                 `# 输出形状为(28,28,192)`\n        - (5×5)，96个滤波器                   `# 输出形状为(28,28,96)`\n        - (3×3)最大池化            `# 输出形状为(28,28,256)`\n      - 拼接后输出为(28,28,672)。\n    - 这种设计——我们称之为“朴素”设计——计算复杂度非常高。\n      - 以上例子中：\n        - [1×1卷积，128个] ==> 28×28×128×1×1×256 = 约2500万次运算\n        - [3×3卷积，192个] ==> 28×28×192×3×3×256 = 约3.46亿次运算\n        - [5×5卷积，96个] ==> 28×28×96×5×5×256 = 约4.82亿次运算\n        - 总计约8.54亿次运算！\n    - 解决方案：使用1×1卷积的**瓶颈层**来降低特征图的深度。\n      - 受NiN（[Network in network](https:\u002F\u002Farxiv.org\u002Fabs\u002F1312.4400)）启发。\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_0500f562a609.png)\n    - 采用瓶颈层后，该示例的总运算次数降至3.58亿次，相比朴素实现有了显著改善。\n  - 因此，GoogleNet多次堆叠Inception模块，构建出完整的网络架构，能够在不使用全连接层的情况下解决任务。\n  - 需要指出的是，在分类步骤之前，它使用了一个平均池化层。\n  - 完整架构：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_3b26c7d5ea78.png)\n  - 2015年2月，引入了批归一化版本的Inception，即Inception V2。批归一化会计算某一层输出的所有特征图的均值和标准差，并用这些统计量对特征响应进行归一化。\n  - 2015年12月，他们发表了论文《重新思考计算机视觉中的Inception架构》，不仅详细解释了早期的Inception模型，还推出了新的V3版本。\n\n- 第一代GoogleNet和VGGNet诞生于批归一化技术发明之前，因此它们在训练神经网络时采用了多种技巧以确保收敛。\n\n- **ResNet**（2015年，微软研究院）\n\n  - 
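上文 Inception“朴素”设计的运算量可以逐项核对（乘加次数 ≈ 输出位置数 × 滤波器数 × 每个位置的感受野大小）：

```python
def conv_ops(out_h, out_w, num_filters, f, depth_in):
    """一个卷积层的乘加次数估算。"""
    return out_h * out_w * num_filters * f * f * depth_in

ops_1x1 = conv_ops(28, 28, 128, 1, 256)   # 约 2500 万
ops_3x3 = conv_ops(28, 28, 192, 3, 256)   # 约 3.46 亿
ops_5x5 = conv_ops(28, 28, 96, 5, 256)    # 约 4.82 亿
total = ops_1x1 + ops_3x3 + ops_5x5       # 约 8.54 亿
```

同一个函数也能核算加入 1×1 瓶颈层后各分支的运算量，从而看到总量降到约 3.58 亿的原因。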
152层模型用于ImageNet竞赛，以3.57%的错误率获胜，这一成绩甚至低于人类水平的错误率。\n  - 这也是首次成功训练出超过百层，甚至上千层的深度神经网络。\n  - 在ILSVRC’15和COCO’15比赛中横扫所有分类和检测任务！\n\n  - 如果我们在一个“普通”的卷积神经网络上继续堆叠更深的层，会发生什么？\n    - 深度越大的模型性能反而更差，但这并非过拟合所致！\n    - 学习过程会停滞，因为深层网络更难优化！\n\n  - 理论上，更深的模型至少应该与浅层模型表现相当。\n  - 一种解决方案是直接复制浅层模型中已经学习到的参数，并将新增的层设置为恒等映射。\n  - 残差块：\n    - 微软提出了残差块结构：\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_0f4177522b50.png)\n    - ```python\n      # 我们不直接学习新的表示，而是只学习残差部分\n      Y = (W2 * RELU(W1*x + b1) + b2) + X\n      ```\n    - 假设你已经有一个深度为N层的网络，只有当你添加新层能带来额外收益时，才值得继续增加层数。\n    - 为了确保第(N+1)层能够学习到与输入不同的信息，可以将未经过变换的输入X也传递到第(N+1)层的输出端。这样可以促使新层学习到不同于输入编码的内容。\n    - 此外，这种连接方式还有助于缓解非常深的网络中出现的梯度消失问题。\n  - 有了残差块，我们现在可以构建任意深度的深层神经网络，而不必担心难以优化的问题。\n  - 拥有大量层的ResNet开始采用类似于Inception瓶颈层的设计，以降低特征图的维度。\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_7cc0455ef1d0.jpg)\n  - **\u003Cu>完整的ResNet架构\u003C\u002Fu>**：\n    - 堆叠残差块。\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_7230e72d1b3c.png)\n    - 每个残差块包含两个3×3的卷积层。\n    - 在网络开头增加一个额外的卷积层。\n    - 最后不使用全连接层（仅用一个1000维的全连接层输出类别）。\n    - 定期将滤波器数量翻倍，并通过步幅2进行空间降采样（每个维度减半）。\n    - 实际训练ResNet时：\n      - 每个卷积层后都进行批归一化。\n      - 使用He等人提出的Xavier\u002F2初始化方法。\n      - SGD结合动量（`0.9`）。\n      - 初始学习率为0.1，当验证误差趋于平稳时将其除以10。\n      - 小批量大小为`256`。\n      - 权重衰减为`1e-5`。\n      - 不使用丢弃法。\n\n- **Inception-v4**（2016年，论文编号：arXiv:1602.07261）：结合了ResNet和Inception，于2016年提出。\n\n- 各种架构的复杂度对比：\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_75a149c1148c.png)\n  - VGG：内存占用最高，计算量最大。\n  - GoogLeNet：效率最高。\n\n- **ResNet的改进**：\n  - （2016年，论文编号：arXiv:1603.05027）《深度残差网络中的恒等映射》\n    - 由ResNet的创造者提出。\n    - 能进一步提升性能。\n  - （2016年，论文编号：arXiv:1605.07146）《宽残差网络》\n    - 认为关键在于残差连接，而非网络深度。\n    - 50层的宽ResNet性能优于152层的原始ResNet。\n    - 增加宽度而非深度在计算上更为高效（可并行化）。\n  - 
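正文的残差块公式 `Y = (W2 * RELU(W1*x + b1) + b2) + X` 可以写成如下全连接版示意（维度为假设）。当新增两层的权重为零时，残差块恰好退化为恒等映射，这正是“更深的模型至少不应更差”的直观来源：

```python
import numpy as np

def residual_block(x, W1, b1, W2, b2):
    """只学习残差部分，再把未变换的输入 x 加回输出。"""
    h = np.maximum(0, W1 @ x + b1)   # ReLU(W1*x + b1)
    return (W2 @ h + b2) + x         # 残差连接

# 权重全零 → 残差块等价于恒等映射
d = 4
x = np.arange(d, dtype=float)
y = residual_block(x, np.zeros((d, d)), np.zeros(d), np.zeros((d, d)), np.zeros(d))
```

同样的捷径在反向传播中把梯度原样传回输入，因而缓解了极深网络的梯度消失。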
（2016年，论文编号：arXiv:1603.09382）《具有随机深度的深度网络》\n    - 动机：通过在训练过程中使用较短的网络来减少梯度消失问题和训练时间。\n    - 每次训练时随机跳过一部分层。\n    - 测试时则使用完整的深层网络。\n\n- **超越ResNet**：\n\n- ([2017](https:\u002F\u002Farxiv.org\u002Fabs\u002F1605.07648)) \u003Cu>FractalNet：无需残差连接的超深度神经网络\u003C\u002Fu>\n    - 认为关键在于如何有效地从浅层网络过渡到深层网络，而残差表示并非必要。\n    - 通过丢弃部分路径进行训练。\n    - 测试时使用完整网络。\n  - ([2017](https:\u002F\u002Farxiv.org\u002Fabs\u002F1608.06993)) \u003Cu>密集连接卷积网络\u003C\u002Fu>\n  - ([2017](https:\u002F\u002Farxiv.org\u002Fabs\u002F1602.07360)) SqueezeNet：参数量减少50倍、模型大小小于0.5MB即可达到AlexNet级别的准确率\n    - 非常适合生产环境。\n    - 它综合了ResNet和Inception中的多种思想，并表明只要设计得当，就能在不依赖复杂压缩算法的情况下实现小型化网络与参数量。\n\n- 结论：\n\n  - ResNet目前仍是最佳默认选择。\n  - 网络架构正朝着极深的方向发展。\n  - 近两年来，许多模型都采用了类似“ResNet”的捷径结构，以促进梯度流动。\n\n\n\n\n\n## 10. 循环神经网络\n\n- 常规神经网络“前馈神经网络”：固定大小的输入经过若干隐藏单元后输出。我们称之为“一对一”网络。\n\n- 循环神经网络RNN模型：\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_52e2243e4174.png)\n  - 一对多\n    - 示例：图像字幕生成\n      - 图像 ==> 一系列单词\n  - 多对一\n    - 示例：情感分类\n      - 一系列单词 ==> 情感\n  - 多对多\n    - 示例：机器翻译\n      - 一种语言的一系列单词 ==> 另一种语言的一系列单词\n    - 示例：视频帧级分类\n\n- RNN也可用于非序列数据处理（一对一问题）：\n\n  - 曾通过一系列“瞥视”完成数字分类任务：\n    - “[基于视觉注意力的多目标识别](https:\u002F\u002Farxiv.org\u002Fabs\u002F1412.7755)”，ICLR 2015。\n  - 也曾逐块生成图像：\n    - 例如生成[验证码](http:\u002F\u002Fieeexplore.ieee.org\u002Fdocument\u002F7966808\u002F)。\n\n- 那么什么是循环神经网络？\n\n  - 循环核心单元接收输入x，并在每次读取输入时更新其内部状态。\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_0063c5c33e66.png)\n\n  - RNN模块应返回一个向量。\n\n  - 我们可以通过在每个时间步应用递推公式来处理向量序列：\n\n    - ```python\n      h[t] = fw (h[t-1], x[t])\t\t\t# 其中fw是带有参数W的函数\n      ```\n\n    - 每个时间步都使用相同的函数和参数集。\n\n  - （经典）循环神经网络：\n\n    - ```\n      h[t] = tanh (W[h,h]*h[t-1] + W[x,h]*x[t])    # 然后保存h[t]\n      y[t] = W[h,y]*h[t]\n      ```\n\n    - 这是最简单的RNN示例。\n\n  - RNN适用于处理相关数据的序列。\n\n- 循环NN计算图：\n\n  - 
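上面（经典）RNN 的递推式 `h[t] = tanh(W[h,h]*h[t-1] + W[x,h]*x[t])` 可以写成如下单步函数（权重与维度均为演示假设；每个时间步复用同一组参数）：

```python
import numpy as np

def rnn_step(h_prev, x, Whh, Wxh, Why):
    """经典 RNN 单步：更新隐藏状态并产生输出。"""
    h = np.tanh(Whh @ h_prev + Wxh @ x)
    y = Why @ h
    return h, y

rng = np.random.default_rng(0)
H, D = 8, 4                      # 隐藏维度、输入维度（假设）
Whh = rng.standard_normal((H, H))
Wxh = rng.standard_normal((H, D))
Why = rng.standard_normal((D, H))
h = np.zeros(H)                  # h0 初始化为零
for t in range(5):
    h, y = rnn_step(h, rng.standard_normal(D), Whh, Wxh, Why)
```

对 `W` 的梯度是各时间步梯度之和，因为同一组参数在每一步都被复用。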
![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_250f43bf2a46.png)\n  - `h0`初始化为零。\n  - `W`的梯度是所有已计算出的`W`梯度之和！\n  - 多对多图：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_8ce5788ed639.png)\n    - 最终损失也是所有损失之和，Y的权重为1，并通过累加所有梯度来更新！\n  - 多对一图：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_ab2a4db83ee8.png)\n  - 一对多图：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_dbbef229ed86.png)\n  - 序列到序列图：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_7481a167a038.png)\n    - 编码器-解码器思想。\n\n- 示例：\n\n  - 假设我们用字符构建单词，希望模型能够预测序列中的下一个字符。假设字符仅为`[h, e, l, o]`，单词为[hello]。\n    - 训练：\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_b0f371b1bd3c.png)\n      - 此处只有第三次预测正确。需要优化损失。\n      - 我们可以将整个单词作为输入来训练网络。\n    - 测试时：\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_c6e3adb51ce3.png)\n      - 测试时逐字符进行处理，输出字符将成为下一个输入，同时保留之前的隐藏状态。\n      - 此[链接](https:\u002F\u002Fgist.github.com\u002Fkarpathy\u002Fd4dee566867f8291f086)包含全部代码，但使用的是截断式时间反向传播，我们稍后会讨论。\n\n- 时间反向传播：先向前遍历整个序列计算损失，再向后遍历整个序列计算梯度。\n\n  - 但如果采用整个序列，速度会非常慢，占用大量内存，且可能无法收敛！\n\n- 因此，在实践中通常采用“截断式时间反向传播”：我们只对序列的一部分进行前向和反向传播，而不是整个序列。\n\n  - 同时，始终将隐藏状态向前传递，但在反向传播时仅回溯有限的几步。\n\n- 图像字幕生成示例：\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_47a1e42c7b4d.png)\n  - 使用\u003CEnd>标记结束运行。\n  - 图像字幕生成的最大数据集是Microsoft COCO。\n\n- 带注意力机制的图像字幕生成项目中，RNN在生成字幕时只会关注图像的特定区域，而非整张图像。\n\n  - 带注意力机制的图像字幕生成技术也被应用于“视觉问答”问题。\n\n- 多层RNN通常会将某些层作为隐藏层再次输入。**LSTM**就是一种多层RNN。\n\n- RNN中的梯度反向传播可能出现梯度爆炸或梯度消失现象。梯度爆炸可通过梯度裁剪来控制，而梯度消失则可通过添加门控机制（如LSTM）来缓解。\n\n- LSTM代表长短期记忆网络。它旨在解决RNN中的梯度消失问题。\n\n  - 它由以下部分组成：\n    - f：遗忘门，决定是否清除细胞状态。\n    - i：输入门，决定是否向细胞写入信息。\n    - 
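标准 LSTM 的四个门（f、i、o、g）的单步更新可以概括为如下示意（按常见实现把四组门的线性部分合并成一次矩阵乘法；门的排列顺序、权重与维度均为假设）：

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """标准 LSTM 单步：f/i/o 用 sigmoid，g 用 tanh，c 为细胞状态。"""
    a = Wx @ x + Wh @ h_prev + b      # 一次算出 4 组门的线性部分
    H = h_prev.shape[0]
    i = sigmoid(a[0*H:1*H])           # 输入门：是否向细胞写入
    f = sigmoid(a[1*H:2*H])           # 遗忘门：是否清除细胞状态
    o = sigmoid(a[2*H:3*H])           # 输出门：释放多少细胞状态
    g = np.tanh(a[3*H:4*H])           # 候选值：写入多少信息
    c = f * c_prev + i * g            # 细胞状态按元素更新
    h = o * np.tanh(c)
    return h, c

rng = np.random.default_rng(0)
D, H = 3, 5
Wx = rng.standard_normal((4 * H, D))
Wh = rng.standard_normal((4 * H, H))
b = np.zeros(4 * H)
h, c = np.zeros(H), np.zeros(H)
h, c = lstm_step(rng.standard_normal(D), h, c, Wx, Wh, b)
```

`c = f*c_prev + i*g` 这条加法通路就是梯度能像 ResNet 捷径一样顺畅回传的原因。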
g：细胞状态门（？），决定写入细胞的信息量。\n    - o：输出门，决定释放多少细胞状态。\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_b7691bff3d9c.png)\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_bb04a47c5dba.png)\n  - LSTM的梯度计算类似于ResNet，非常简便。\n  - LSTM在训练过程中既能保持长期记忆，也能保留短期记忆，因此不仅能记住上一层的信息，还能记住更深层次的信息。\n\n- 高速路网络介于ResNet和LSTM之间，目前仍在研究中。\n\n- 更好、更简单的架构是当前研究的热点。\n- 需要更深入的理论和实证理解。\n- RNN更适合处理具有相关性输入序列的问题，例如自然语言处理和语音识别。\n\n## 11. 检测与分割\n\n- 到目前为止，我们讨论的是图像分类问题。在这一节中，我们将讨论分割、定位和检测。\n\n- **\u003Cu>语义分割\u003C\u002Fu>**\n\n  - 我们希望为图像中的每个像素分配一个类别标签。\n  \n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_cd2214127a63.png)\n\n  - 如图所示，语义分割并不区分不同的实例，只关注像素本身。\n  \n  - 第一种思路是使用**滑动窗口**。我们选取一个小窗口，在整张图片上滑动。对于每个窗口，我们只需对中心像素进行分类。\n  \n    - 这种方法虽然可行，但并不理想，因为它计算成本非常高！\n    - 效率极低！无法复用重叠区域之间的共享特征。\n    - 实际上，很少有人采用这种方法。\n\n  - 第二种思路是设计一个由多个卷积层组成的网络，一次性对所有像素进行预测！\n\n    - 输入是整张图像，输出则是每个像素都被标注的图像。\n    - 这需要大量的标注数据，而获取这些数据的成本非常高。\n    - 网络需要较深的卷积层结构。\n    - 损失函数是基于每个像素的真实标签与预测标签之间的交叉熵。\n    - 数据增强在这里非常有用。\n    - 这种实现方式的问题在于，直接在原始分辨率上进行卷积运算会非常耗时且占用大量资源。\n    - 因此，目前实际应用中很少见到这种做法。\n\n  - 第三种思路是在第二种思路的基础上改进的。不同之处在于，我们在网络内部进行了下采样和上采样操作。\n  \n    - 我们进行下采样是因为直接处理整张图像的计算成本太高。因此，网络会经过多层下采样，最后再通过上采样恢复到原尺寸。\n  \n    - 下采样操作包括池化和步幅卷积。\n  \n    - 上采样则可以采用“最近邻插值”、“钉床插值”或“最大值反池化”等方法。\n    \n      - **最近邻插值**示例：\n      \n        - ```\n          输入:   1  2               输出:   1  1  2  2\n                   3  4                         1  1  2  2\n                                                3  3  4  4\n                                                3  3  4  4\n          ```\n\n      - **钉床插值**示例：\n      \n        - ```\n          输入:   1  2               输出:   1  0  2  0\n                   3  4                         0  0  0  0\n                                                3  0  4  0\n                                                0  0  0  0\n          ```\n\n      - 
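上面“最近邻插值”与“钉床插值”两个例子可以直接用 NumPy 写出并核对：

```python
import numpy as np

def upsample_nearest(x, k=2):
    """最近邻上采样：每个值复制成 k×k 的块。"""
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def upsample_bed_of_nails(x, k=2):
    """钉床上采样：值放在每个 k×k 块的左上角，其余位置补零。"""
    out = np.zeros((x.shape[0] * k, x.shape[1] * k), dtype=x.dtype)
    out[::k, ::k] = x
    return out

x = np.array([[1, 2], [3, 4]])
nn = upsample_nearest(x)          # 与正文“最近邻插值”示例一致
bed = upsample_bed_of_nails(x)    # 与正文“钉床插值”示例一致
```

最大值反池化与钉床插值的区别仅在于：非零值放回的是前面最大值池化记录下的位置，而不是固定的左上角。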
**最大值反池化**则依赖于之前的最大值池化操作。我们会将最大值池化的位置填回原值，其余位置补零。\n  \n    - 最大值反池化似乎是上采样的最佳选择。\n  \n    - 此外，还有一种可学习的上采样方法，称为“转置卷积”。\n  \n      - 它不是常规的卷积操作，而是将其逆向执行。\n      - 也被称为：\n        - 上卷积\n        - 分数步幅卷积\n        - 反向步幅卷积\n      - 关于上采样的具体实现细节，请参考这篇论文的第4章：[arxiv.org\u002Fabs\u002F1603.07285]。\n\n- **\u003Cu>分类+定位\u003C\u002Fu>**：\n\n  - 在这个问题中，我们需要对图像中的主要目标进行分类，并用矩形框标出其位置。\n  - 假设图像中只有一个目标。\n  - 我们将构建一个多任务神经网络，其架构如下：\n    - 卷积网络层连接到：\n      - 用于分类的全连接层（即我们熟悉的普通分类问题）。\n      - 用于回归的全连接层，输出四个数值`(x, y, w, h)`。\n        - 我们将定位问题视为回归问题。\n  - 该问题有两个损失函数：\n    - 分类部分使用Softmax损失。\n    - 定位部分使用回归损失（L2损失）。\n  - 总损失 = SoftmaxLoss + L2损失。\n  - 通常，前几层卷积会使用预训练好的网络，例如AlexNet。\n  - 这种技术还可以应用于许多其他问题，比如人体姿态估计。\n\n- **\u003Cu>目标检测\u003C\u002Fu>**\n\n  - 目标检测是计算机视觉的核心问题之一，我们将在本节中详细讨论。\n  - 与“分类+定位”相比，目标检测的目标是检测一个或多个不同的目标及其位置！\n  - 第一种思路是使用滑动窗口法。\n    - 这种方法曾长期有效。\n    - 具体步骤如下：\n      - 对图像的不同裁剪区域分别应用卷积神经网络，判断每个区域是目标还是背景。\n    - 问题在于，我们需要对大量不同的位置和尺度应用CNN，计算成本极高！\n    - 粗暴的滑动窗口方法会导致重复计算成千上万次。\n  - 区域建议机制可以帮助我们决定应该在哪些区域运行神经网络：\n    - 找出可能包含目标的**团块状**图像区域。\n    - 运行速度相对较快；例如，Selective Search算法可以在CPU上几秒钟内生成1000个区域建议。\n  - 因此，我们可以先使用区域建议网络筛选出候选区域，然后再应用滑动窗口法。\n  - 另一种方法称为R-CNN。\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_7cef3b09f66f.png)\n    - 这种方法的缺点是：它会从图像中截取不同大小的区域，然后将它们缩放到统一尺寸后再输入CNN。缩放操作会引入误差。\n    - 此外，这种方法也非常慢。\n  - Fast R-CNN是在R-CNN基础上发展起来的一种方法。\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_6715901f2f67.png)\n    - 它使用一个CNN完成所有任务。\n  - Faster R-CNN则进一步改进了区域建议机制，通过插入区域建议网络（RPN）来直接从特征中预测候选区域。\n    - 这是R-CNN系列中速度最快的一种。\n  - 还有一种无需区域建议的方法：YOLO\u002FSSD。\n    - YOLO的意思是“你只需要看一次”。\n    - YOLO和SSD是两种独立的算法。\n    - 它们速度更快，但精度相对较低。\n  - 总结：\n    - Faster R-CNN速度较慢，但精度更高。\n    - SSD\u002FYOLO速度更快，但精度较低。\n\n- **\u003Cu>密集字幕生成\u003C\u002Fu>**\n\n  - 密集字幕生成是“目标检测+字幕生成”的结合。\n  - 有关这一想法的论文可以在这里找到：[arxiv.org\u002Fabs\u002F1511.07571]。\n\n- **\u003Cu>实例分割\u003C\u002Fu>**\n\n  - 
实例分割可以看作是一个更全面的问题。\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_62fe79a1f73b.png)\n  - 与仅预测边界框不同，实例分割不仅需要确定每个像素的类别，还需要区分不同的实例。\n  - 目前有许多不同的方法。\n  - 其中一种新方法是“Mask R-CNN”。\n    - 它类似于R-CNN，但在内部集成了语义分割模块。\n    - 这篇论文取得了许多优秀的实验结果。\n    - Mask R-CNN综合了本节课中讨论的所有内容。\n    - 其性能表现良好。\n\n\n\n## 12. 可视化与理解\n\n- 我们希望了解卷积神经网络内部究竟发生了什么？\n  \n- 人们希望能够信任这个“黑箱”模型（CNN），并清楚地知道它是如何运作并做出准确决策的。\n  \n- 一种初步的方法是可视化第一层的滤波器。\n\n- 也许第一层卷积核的形状是5×5×3，卷积核数量为16个。这样我们就会得到16张不同“颜色”的卷积核图像。\n  - 结果发现，这些卷积核学习到的是类似于人脑的原始形状和方向性边缘。\n  - 无论你训练哪种卷积神经网络，比如AlexNet、VGG、GoogleNet或ResNet，这些卷积核看起来都差不多。\n  - 这就能告诉我们第一层卷积操作在图像中寻找的是什么。\n\n- 我们也可以可视化后续层的卷积核，但它们并不能给我们提供太多信息。\n\n  - 比如说，如果第一层卷积核的形状是5×5×20，卷积核数量仍然是16个，那么我们就会得到16×20张不同的“灰色”卷积核图像。\n\n- 在AlexNet的最后几层有一些全连接层。如果我们提取一张图像的4096维特征向量，并收集这些特征向量，\n  - 然后在这些特征向量之间进行最近邻搜索，找到与之最相似的真实图像，效果会比直接对原始图像运行KNN算法好得多！\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_e52f7f57e2cb.png)\n  - 这种相似性表明，这些卷积神经网络真正捕捉到了图像的语义信息，而不仅仅是像素级别的细节！\n  - 我们还可以对这4096维的特征向量进行降维，将其压缩到2维空间。\n    - 可以使用主成分分析（PCA）或t-SNE来实现。\n    - t-SNE更常用于深度学习中的数据可视化。示例可以参考[这里](http:\u002F\u002Fcs.stanford.edu\u002Fpeople\u002Fkarpathy\u002Fcnnembed\u002F)。\n\n- 我们可以可视化激活图。\n\n  - 比如说，如果CONV5的特征图大小是128×13×13，我们可以将其可视化为128张13×13的灰度图像。\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_0df6c6d6ebbf.png)\n  - 其中某些特征图会对输入图像产生强烈的响应，这就说明该特征图正在寻找特定的模式。\n  - 这项技术由Yosinski等人提出，更多信息可以参见[这里](http:\u002F\u002Fyosinski.com\u002Fdeepvis#toolbox)。\n\n- 还有一种叫做**最大激活补丁**的技术，可以帮助我们可视化卷积神经网络中的中间特征。\n\n  - 具体步骤如下：\n    - 首先选择某一层和某个神经元。\n      - 比如在AlexNet中选择Conv5层，其特征图大小为128×13×13，然后挑选第17个通道（神经元）。\n    - 将大量图像输入网络，记录所选通道的激活值。\n    - 可视化那些对应于最大激活值的图像区域。\n      - 我们会发现每个神经元都在关注图像中的特定部分。\n      - 提取的图像区域是通过感受野确定的。\n\n- 另一种方法是**遮挡实验**。\n\n  - 在将图像输入卷积神经网络之前，我们先遮挡住图像的一部分，并绘制出在不同遮挡位置下模型输出概率（即预测正确的概率）的热力图。\n  - 这样就能找出卷积神经网络最关注的图像区域。\n  - 
![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_42e5cb145d50.png)\n\n- **显著性图**可以告诉我们哪些像素对分类结果至关重要。\n\n  - 这种方法与遮挡实验类似，但采用了完全不同的思路。\n  - 我们计算未归一化的类别得分关于图像像素的梯度，取绝对值并在RGB三个通道上取最大值，最终得到一张灰度图像，它代表了图像中最关键的区域。\n  - 有时这种方法也可以用于语义分割任务。\n\n- （引导式）反向传播类似于**最大激活补丁**，但它能够定位我们真正关心的像素区域。\n\n  - 在这项技术中，我们像最大激活补丁一样选择一个通道，然后计算该神经元值关于图像像素的梯度。\n  - 如果只对每个ReLU单元的正梯度进行反向传播（引导式反向传播），生成的图像会更加清晰。\n\n- **梯度上升法**\n\n  - 通过这种方法可以生成一张能够最大程度激活某个神经元的合成图像。\n  - 它是梯度下降法的逆过程：不是寻找最小值，而是寻找最大值。\n  - 我们希望用输入图像来最大化某个神经元的激活值。因此，这里的目标是找到一张能够使该神经元激活程度最高的图像：\n\n    - ```python\n      # R(I) 是自然图像正则化项，f(I) 是神经元的激活值。\n      I *= argmax(f(I)) + R(I)\n      ```\n\n  - 梯度上升的具体步骤如下：\n    - 将图像初始化为零。\n    - 前向传播以计算当前得分。\n    - 反向传播以获取神经元值关于图像像素的梯度。\n    - 对图像进行小幅更新。\n\n  - `R(I)` 可能等于生成图像的L2范数。\n\n  - 为了获得更好的效果，我们可以使用更复杂的正则化方法：\n    - 对图像的L2范数施加惩罚；同时在优化过程中定期执行以下操作：\n      - 对图像进行高斯模糊处理。\n      - 将像素值过小的部分置为0。\n      - 将梯度过小的像素也置为0。\n\n  - 使用更高级的正则化方法可以让生成的图像更加清晰！\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_25a43d8c4c56.png)\n  - 后面几层生成的图像似乎比前面几层更有意义。\n\n- 我们也可以利用这一方法来欺骗卷积神经网络：\n\n  - 从一张任意图像开始。`# 一张完全随机的图片`\n  - 随机选择一个类别。`# 随机选择一个类别`\n  - 不断修改图像，使其尽可能地符合所选类别。\n  - 重复这个过程，直到网络被成功欺骗。\n\n- 欺骗网络的结果往往令人惊讶！\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_7d5f63f9f50c.png)\n  - 对人类来说，这些图像看起来并无差别，但仅仅加入一些噪声，就足以让网络做出错误的判断！\n\n- **DeepDream**：放大现有特征\n\n  - Google在其官网上发布了DeepDream工具。\n  - 实际上，DeepDream的工作原理与我们之前讨论的欺骗网络的方法相同，只不过它不是为了合成一张能够最大化某个特定神经元激活的图像，而是试图放大网络中某一层的神经元激活程度。\n  - 具体步骤如下：\n    - 前向传播：计算选定层的激活值。`# 准备一张输入图像（任何图像）`\n    - 将选定层的梯度设置为该层的激活值。\n      - 相当于 `I* = arg max[I] sum(f(I)^2)`\n    - 后向传播：计算图像的梯度。\n    - 更新图像。\n  - DeepDream的代码已经公开，你可以下载并自行尝试。\n\n- **特征反演**\n\n  - 通过这种方法，我们可以了解卷积神经网络的不同层分别捕捉到了图像中的哪些内容。\n  - 给定一张图像的卷积神经网络特征向量，找到一张新的图像，使其：\n    - 与给定的特征向量匹配。\n    - 同时保持自然外观（通过图像先验正则化）。\n\n- **纹理合成**\n\n  - 这是一个计算机图形学中的经典问题。\n  - 给定一块纹理样本，我们能否生成一张更大尺寸的相同纹理图像？\n 
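前文“梯度上升”的四个步骤（零初始化 → 前向取得分 → 反向取对像素的梯度 → 小幅更新）可以写成如下极简循环（目标函数、正则化系数与步数均为玩具假设）：

```python
import numpy as np

def gradient_ascent_image(grad_fn, shape, steps=100, lr=0.1, l2_reg=1e-2):
    """最大化 f(I) - λ·||I||²：grad_fn 返回 f 对图像 I 的梯度。"""
    I = np.zeros(shape)                  # 1. 图像初始化为零
    for _ in range(steps):
        g = grad_fn(I)                   # 2–3. 前向 + 反向（此处由调用方提供）
        I += lr * (g - l2_reg * 2 * I)   # 4. 小幅更新，含 L2 正则的梯度
    return I

# 玩具“神经元”：f(I) = sum(target * I)，其对 I 的梯度恒为 target
target = np.array([1.0, -1.0, 2.0])
I = gradient_ascent_image(lambda I: target, target.shape, steps=500)
```

真实场景中 `grad_fn` 来自网络的一次反向传播；高斯模糊、小值置零等高级正则化可以穿插在循环内执行。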
 - 有一种不依赖于神经网络的算法：\n    - Wei和Levoy提出的基于树状矢量量化快速纹理合成算法，发表于SIGGRAPH 2000。\n    - 这是一种非常简单的算法。\n  - 问题在于，这是一个古老的问题，已经有许多算法解决了它，但对于复杂的纹理，简单的算法往往效果不佳！\n  - 2015年有人提出了基于梯度上升的“神经纹理合成”方法。\n    - 该方法依赖于Gram矩阵。\n\n- 神经风格迁移 = 特征 + Gram重构\n\n- Gatys、Ecker 和 Bethge，《使用卷积神经网络进行图像风格迁移》，CVPR 2016\n  - PyTorch 实现 [这里](https:\u002F\u002Fgithub.com\u002Fjcjohnson\u002Fneural-style)。\n\n- 风格迁移需要对 VGG 网络进行多次前向和反向传播，速度非常慢！\n\n  - 训练另一个神经网络来为我们执行风格迁移！\n  - 快速风格迁移就是解决方案。\n  - Johnson、Alahi 和 Fei-Fei，《用于实时风格迁移和超分辨率的感知损失》，ECCV 2016\n  - https:\u002F\u002Fgithub.com\u002Fjcjohnson\u002Ffast-neural-style\n\n- 关于风格迁移的研究非常多，并且至今仍在继续！\n\n- 总结：\n\n  - 激活值：最近邻、降维、最大特征块、遮挡实验\n  - 梯度：显著性图、类别可视化、欺骗性图像、特征反演\n  - 趣味应用：DeepDream、风格迁移\n\n\n\n\n\n\n## 13. 生成模型\n\n- 生成模型是无监督学习的一种类型。\n\n- 监督学习与无监督学习对比：\n\n  - |                | 监督学习                      | 无监督学习                    |\n    | -------------- | ---------------------------------------- | ---------------------------------------- |\n    | 数据结构       | 数据：(x, y)，其中 x 是数据，y 是标签  | 数据：x，只有数据，没有标签！           |\n    | 数据成本       | 在许多情况下，训练数据非常昂贵。       | 训练数据成本较低！                       |\n    | 目标           | 学习一个将 x 映射到 y 的函数           | 学习数据中的一些潜在隐藏结构             |\n    | 示例           | 分类、回归、目标检测、语义分割、图像字幕 | 聚类、降维、特征学习、密度估计 |\n\n- 自编码器是一种特征学习技术。\n\n  - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_8ebac2172fb5.png)\n  - 它包含编码器和解码器。编码器对图像进行下采样，而解码器则对特征进行上采样。\n  - 损失函数为 L2 损失。\n\n- 密度估计是指我们希望学习或估计数据的底层分布！\n\n- 与监督学习相比，无监督学习领域仍存在大量未解决的研究问题！\n\n- **生成模型**\n\n  - 给定训练数据，从相同分布中生成新的样本。\n  - 解决了密度估计这一无监督学习的核心问题。\n  - 我们有多种方法可以做到这一点：\n    - 显式密度估计：明确定义并求解学习模型。\n    - 学习一个无需显式定义即可从中采样的模型。\n  - 为什么需要生成模型？\n    - 可用于艺术创作、超分辨率、彩色化等领域的逼真样本。\n    - 时间序列数据的生成模型可用于模拟和规划（强化学习应用！）。\n    - 训练生成模型还可以推断出潜在表示，这些表示可用作通用特征。\n  - 生成模型分类：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_eda42a4b948d.png)\n  - 在本讲中，我们将讨论 PixelRNN\u002FCNN、变分自编码器和 GANs，因为它们是目前研究中的热门模型。\n\n- 
**PixelRNN** 和 **PixelCNN**\n\n  - 在完全可见信念网络中，我们使用链式法则将图像 x 的似然分解为一维分布的乘积：\n    - `p(x) = prod(p(x[i] | x[1]x[2]....x[i-1]))`\n    - 其中 p(x) 是图像 x 的似然，x[i] 是在给定所有先前像素的情况下第 i 个像素值的概率。\n  - 训练时，我们需要最大化训练数据的似然，但像素值的分布非常复杂。\n  - 此外，我们还需要定义\u003Cu>先前像素\u003C\u002Fu>的顺序。\n  - PixelRNN\n    - 由 van den Oord 等人于 2016 年提出\n    - 通过 RNN（LSTM）建模对先前像素的依赖关系\n    - 从图像一角开始逐像素生成\n    - 缺点：由于必须逐像素生成，因此速度较慢！\n  - PixelCNN\n    - 同样由 van den Oord 等人于 2016 年提出\n    - 仍然从图像一角开始逐像素生成。\n    - 现在使用 CNN 对上下文区域建模来处理对先前像素的依赖关系。\n    - 训练速度比 PixelRNN 快（由于上下文区域的值已知，卷积可以并行化）。\n    - 但生成过程仍然需要按顺序进行，速度依然较慢。\n  - 有一些技巧可以改进 PixelRNN 和 PixelCNN。\n  - PixelRNN 和 PixelCNN 可以生成不错的样本，并且仍然是活跃的研究领域。\n\n- **自编码器**\n\n  - 一种无监督方法，用于从无标签的训练数据中学习低维特征表示。\n  - 包括编码器和解码器。\n  - 编码器：\n    - 将输入 x 转换为特征 z。z 应该比 x 小，以便只提取输入中的重要信息。这可以称为降维。\n    - 编码器可以用以下方式构建：\n      - 线性或非线性层（早期）\n      - 深度全连接神经网络（后来）\n      - ReLU 卷积神经网络（目前我们对图像使用这种方式）\n  - 解码器：\n    - 我们希望解码器将编码器生成的特征映射回与 x 类似或相同的输出。\n    - 解码器可以采用与编码器相同的构建方式，目前也使用 ReLU 卷积神经网络。\n  - 编码器使用卷积层，而解码器使用转置卷积层：先降维，再升维。\n  - 损失函数为 L2 损失函数：\n    - `L[i] = |y[i] - y'[i]|^2`\n  - 训练完成后，我们丢弃解码器。`# 现在我们有了所需的特征`\n  - 我们可以利用这个编码器来构建一个监督模型。\n    - 这种方法的优点是可以从输入中学习到良好的特征表示。\n    - 很多时候，我们手头的数据量很少。应对这种情况的一种方法是使用自编码器来学习如何从图像中提取特征，然后在此基础上用少量数据进行训练。\n  - 问题是，我们能否从这个自编码器中生成数据（图像）？\n\n- **变分自编码器（VAE）**\n\n  - 自编码器的概率论扩展——它将使我们能够从模型中采样以生成数据！\n  - 我们有通过编码器形成的特征向量 z。\n  - 然后我们选择一个简单的先验分布 p(z)，例如高斯分布。\n    - 这对于隐藏属性来说是合理的：例如姿态、微笑程度。\n  - 条件分布 p(x|z) 很复杂（生成图像），因此用神经网络表示。\n  - 但我们无法计算积分 `∫ p(z)p(x|z)dz`，如下方程所示：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_387a99836481.png)\n  - 解决完所有方程后，我们应该得到：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_b08c578ae0ad.png)\n  - 变分自编码器是一种生成模型的方法，但其生成的样本相比最先进的 GANs 更模糊、质量更低。\n  - 当前的研究热点：\n    - 更灵活的近似方法，例如使用更丰富的近似后验分布代替对角高斯分布。\n    - 在潜在变量中引入结构信息。\n\n- **生成对抗网络（GANs）**\n\n  - GANs 不依赖任何显式的密度函数！\n\n- 相反，采用博弈论的方法：通过两人博弈来学习从训练数据分布中生成样本。\n\n  - 
负责Facebook人工智能研究的Yann LeCun将GAN称为：\n\n    - > 过去20年里深度学习中最酷的想法\n\n  - 问题：我们希望从复杂、高维的训练数据分布中采样。正如我们之前讨论过的，目前并没有直接的方法可以做到这一点！\n\n  - 解决方案：从一个简单的分布中采样，比如随机噪声，然后学习如何将其映射到训练数据分布。\n\n  - 因此，我们生成一张由简单分布采样的噪声图像，并将其输入到一个神经网络中，这个网络被称为生成器网络，它的目标是学会将噪声图像转换为我们期望的数据分布。\n\n  - 训练GAN：这是一个两人博弈的过程：\n\n    - **生成器网络**：试图通过生成看起来逼真的图像来欺骗判别器。\n    - **判别器网络**：试图区分真实图像和伪造图像。\n\n  - 如果我们能够很好地训练判别器，那么就可以进一步训练生成器，使其生成符合要求的图像。\n\n  - GAN的损失函数可以表示为一个极小极大博弈，公式如下：\n\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_44f2d34d4d57.png)\n\n  - 生成器网络的目标标签是0，而真实图像的目标标签则是1。\n\n  - 在训练过程中，我们会执行以下操作：\n\n    - 对判别器进行梯度上升；\n    - 对生成器也进行梯度上升，但使用不同的损失函数。\n\n  - 完整的算法及相应公式可以在这里查看：\n\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_bfebc105ff09.png)\n\n  - 补充说明：同时训练两个网络具有挑战性，可能会导致不稳定。选择具有更好损失曲面的目标函数有助于稳定训练，这也是当前研究的一个热点方向。\n\n  - 卷积架构：\n\n    - 生成器是一个使用分数步长卷积的上采样网络；判别器则是一个卷积网络。\n    - 为了使深层卷积GAN更加稳定，建议遵循以下准则：\n      - 将判别器中的所有池化层替换为步长卷积，并将生成器中的普通卷积替换为分数步长卷积。\n      - 两个网络都应使用批归一化。\n      - 对于更深的架构，应移除全连接隐藏层。\n      - 生成器的所有层都使用ReLU激活函数，除了输出层使用Tanh。\n      - 判别器的所有层则使用Leaky ReLU激活函数。\n\n  - 2017年可以说是GAN的爆发之年！相关研究迅速发展，取得了许多非常出色的结果。\n\n  - 目前，GAN在各种应用场景中的研究也非常活跃。\n\n  - GAN相关的资源汇总可以在这里找到：https:\u002F\u002Fgithub.com\u002Fhindupuravinash\u002Fthe-gan-zoo\n\n  - 关于GAN使用的技巧和窍门，请参阅：https:\u002F\u002Fgithub.com\u002Fsoumith\u002Fganhacks\n\n  - NIPS 2016关于GAN的教程视频：https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=AJVyzd0rqdc\n\n## 14. 
深度强化学习\n\n- 本节包含大量数学内容。\n- 强化学习问题涉及智能体与环境的交互，环境会提供数值奖励信号。\n- 其基本步骤如下：\n  - 环境 --> 状态 `s[t]` --> 智能体 --> 动作 `a[t]` --> 环境 --> 奖励 `r[t]` + 下一状态 `s[t+1]` --> 智能体 --> 以此类推。\n- 我们的目标是学习如何采取行动以最大化累积奖励。\n- 一个例子是机器人运动控制：\n  - 目标：让机器人向前移动。\n  - 状态：关节的角度和位置。\n  - 动作：施加在关节上的扭矩。\n  - 奖励：每个时间步保持直立并向前移动即可获得奖励。\n- 另一个例子是雅达利游戏：\n  - 在这个问题中，深度学习已经达到了最先进的水平。\n  - 目标：以最高分数完成游戏。\n  - 状态：游戏画面的原始像素输入。\n  - 动作：游戏操作指令，如左、右、上、下。\n  - 奖励：每个时间步的得分增减。\n- 围棋比赛是另一个例子，AlphaGo团队在2016年取得的胜利对人工智能和深度学习来说是一项重大成就，因为这个问题非常复杂。\n- 我们可以使用\u003Cu>**马尔可夫决策过程**\u003C\u002Fu>来从数学上形式化强化学习。\n- **马尔可夫决策过程**\n  - 定义为 (`S`, `A`, `R`, `P`, `Y`)，其中：\n    - `S`：可能的状态集合。\n    - `A`：可能的动作集合。\n    - `R`：给定 (状态, 动作) 对时的奖励分布。\n    - `P`：转移概率，即给定 (状态, 动作) 对时下一状态的分布。\n    - `Y`：折扣因子 `# 衡量我们对近期奖励与未来奖励的相对重视程度。`\n  - 算法：\n    - 在时间步 `t=0`，环境采样初始状态 `s[0]`。\n    - 然后，从 t=0 到结束：\n      - 智能体选择动作 `a[t]`。\n      - 环境根据 (`s[t]`, `a[t]`) 从 `R` 中采样奖励。\n      - 环境根据 (`s[t]`, `a[t]`) 从 `P` 中采样下一状态。\n      - 智能体接收奖励 `r[t]` 和下一状态 `s[t+1]`。\n  - 策略 `pi` 是一个从 S 到 A 的函数，用于指定在每个状态下应采取的动作。\n  - 目标：找到使累积折扣奖励 `Sum(Y^t * r[t], t>0)` 最大化的策略 `pi*`。\n  - 例如：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_33a45b9192b3.png)\n  - 解决方案是：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_cf2df863ada1.png)\n- 状态 `s` 处的价值函数是遵循策略从状态 `s` 开始的预期累积奖励：\n  - `V[pi](s) = Sum(Y^t * r[t], t>0) 给定 s0 = s, pi`。\n- 状态 `s` 和动作 `a` 处的 Q 值函数是从状态 `s` 采取动作 `a` 后继续遵循策略的预期累积奖励：\n  - `Q[pi](s,a) = Sum(Y^t * r[t], t>0) 给定 s0 = s, a0 = a, pi`。\n- 最优 Q 值函数 `Q*` 是给定 (状态, 动作) 对时所能达到的最大预期累积奖励：\n  - `Q*[s,a] = Max(对于所有 pi，Sum(Y^t * r[t], t>0) 给定 s0 = s, a0 = a, pi)`。\n- 贝尔曼方程\n  - 这是强化学习中的关键概念。\n  - 对于任意状态动作对 (s,a)，该对的价值等于你将获得的奖励 r 加上你最终到达的状态的价值。\n  - `Q*[s,a] = r + Y * max Q*(s',a') 给定 s,a  # 注意方程中没有策略`。\n- 最优策略 `pi*` 对应于按照 `Q*` 规定在任何状态下采取最佳动作。\n- 我们可以使用基于贝尔曼方程的迭代更新算法——值迭代算法——来得到最优策略。\n  - 
![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_b457bc6f677a.png)\n- 由于现实世界应用中的状态空间维度非常大，我们将使用函数近似器来估计 `Q(s,a)`。例如，神经网络！这种方法称为 **Q-learning**。\n  - 当我们需要表示一个复杂的函数但无法直接表达时，通常会使用神经网络。\n- **Q-learning**\n  - 第一个解决强化学习问题的深度学习算法。\n  - 使用函数近似器来估计动作价值函数。\n  - 如果函数近似器是深度神经网络，则称为深度 Q 学习。\n  - 损失函数：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_c2d318005cbd.png)\n- 现在让我们考虑“玩雅达利游戏”问题：\n  - 我们的总奖励通常是屏幕顶部显示的分数。\n  - Q 网络架构：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_86674fe4cbdd.png)\n  - 从连续样本批次中学习是一个问题：相邻样本高度相关，直接用它们训练会引入偏差、降低学习效率。因此，我们应该使用“经验回放”而不是连续样本，让神经网络反复尝试游戏，直到掌握为止。\n  - 在游戏（经验）剧集进行的过程中，不断更新一个包含转移信息（`s[t]`、`a[t]`、`r[t]`、`s[t+1]`）的回放缓冲区表。\n  - 使用来自回放缓冲区的随机小批量转移数据来训练 Q 网络，而不是连续样本。\n  - 完整算法：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_6b7edfd45f04.png)\n  - 关于该算法在雅达利游戏中的演示视频可以在这里找到：https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=V1eYniJ0Rnk\n- **策略梯度**\n  - 第二个解决强化学习问题的深度学习算法。\n  - Q 函数的问题在于它可能非常复杂。\n    - 例如：机器人抓取物体时，其状态空间维度非常高。\n    - 但策略却可以简单得多：只需闭合双手即可。\n  - 我们能否直接学习策略，例如从一组策略中找到最佳策略？\n  - 策略梯度的形式化方程：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_92d830c6995b.png)\n  - 它会收敛到 `J(θ)` 的局部最优解，通常已经足够好！\n  - REINFORCE 算法就是用来获取或预测最佳策略的算法。\n  - REINFORCE 算法的方程及直观理解：\n    - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_b472ca6b4b9b.png)\n    - 这个方程存在较高的方差问题，我们能否解决这个问题？\n    - 方差缩减目前仍是一个活跃的研究领域！\n  - 循环注意力模型 (RAM) 是一种基于 REINFORCE 算法的模型，用于图像分类问题：\n    - 通过有选择地聚焦于图像的不同区域来获取一系列“瞥视”，从而预测类别。\n      - 灵感来源于人类的感知和眼球运动。\n      - 节省计算资源 => 更好的可扩展性。\n        - 对于高分辨率图像，可以节省大量计算。\n      - 能够忽略图像中的杂乱或无关部分。\n    - RAM 现在被广泛应用于许多任务中，包括细粒度图像识别、图像字幕生成和视觉问答等。\n  - AlphaGo 同时使用了监督学习和强化学习，并且也采用了策略梯度方法。\n- 斯坦福大学关于深度强化学习的一门优秀课程：\n  - 
http:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fcs234\u002Findex.html\n  - https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIwTmSbCv9OVJB3YaO4sFwkX\n- 一门关于深度强化学习的优秀课程（2017年）：\n  - http:\u002F\u002Frll.berkeley.edu\u002Fdeeprlcourse\u002F\n  - https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLkFD6_40KJIznC9CDbVTjAF2oyt8_VAe3\n- 一篇不错的文章：\n  - https:\u002F\u002Fwww.kdnuggets.com\u002F2017\u002F09\u002F5-ways-get-started-reinforcement-learning.html\n\n\n\n## 15. 深度学习的高效方法与硬件\n\n- 原始讲座由斯坦福大学博士候选人韩松（Song Han）主讲。\n- 深度卷积网络、循环网络以及深度强化学习正在塑造众多应用，并深刻改变我们的生活。\n  - 比如自动驾驶汽车、机器翻译、AlphaGo等。\n- 然而，当前的趋势表明，若要获得高精度，就必须使用更大（更深）的模型。\n  - 在ImageNet竞赛中，从2012年到2015年，为了达到更高的准确率，模型规模扩大了16倍。\n  - Deep Speech 2的训练计算量是Deep Speech 1的10倍，而这仅仅发生在一年之内！ `# 在百度`\n- 这给我们带来了三大挑战：\n  - **模型规模**\n    - 将大型模型部署到个人电脑、手机或汽车上非常困难。\n  - **速度**\n    - ResNet152训练耗时1.5周，才将错误率降到6.16%！\n    - 长时间的训练限制了机器学习研究人员的工作效率。\n  - **能源效率**\n    - AlphaGo：使用1920个CPU和280个GPU。每场比赛电费高达3000美元。\n    - 如果在手机上运行，电池会迅速耗尽。\n    - 谷歌在其博客中提到，如果所有用户每天使用谷歌语音识别功能3分钟，他们就需要将数据中心容量翻倍！\n    - 能量究竟消耗在哪里？\n      - 更大的模型意味着更多的内存访问，从而导致更高的能耗。\n- 我们可以通过算法与硬件协同设计来提升深度学习的效率。\n  - 从硬件和算法两个角度入手。\n- 硬件入门：硬件家族\n  - **通用型**\t\t\t`# 适用于任何应用`\n    - CPU\t\t\t\t`# 注重延迟，单线程性能强大，像一头大象`\n    - GPU\t\t\t`# 注重吞吐量，拥有大量小线程，像一群蚂蚁`\n    - GPGPU\n  - **专用硬件**\t\t`# 针对特定应用领域优化`\n    - FPGA\t`# 可编程逻辑，成本较低但效率稍逊`\n    - ASIC\t`# 固定逻辑，专为特定应用设计（也可用于深度学习）`\n- 硬件入门：数值表示\n  - 计算机中的数字是通过离散的内存单元来表示的。\n  - 对于硬件而言，在浮点运算中从32位降至16位是非常高效且节能的。\n- 第一部分：**\u003Cu>高效推理的算法\u003C\u002Fu>**\n  - **神经网络剪枝**\n    - 核心思想是：能否移除部分权重或神经元，同时保持网络原有的性能？\n    - 2015年，Han利用剪枝技术将AlexNet的参数从6000万减少至600万！\n    - 剪枝既可应用于CNN，也可应用于RNN，通过迭代操作最终能达到与原始模型相同的准确率。\n    - 事实上，人类的大脑也在经历类似的过程：\n      - 新生儿（50万亿个突触） ==> 1岁儿童（1000万亿个突触） ==> 青少年（500万亿个突触）\n    - 算法步骤：\n      1. 获取已训练好的网络。\n      2. 评估各神经元的重要性。\n      3. 移除最不重要的神经元。\n      4. 对网络进行微调。\n      5. 
若需继续剪枝，则返回步骤2；否则停止。\n  - **权重共享**\n    - 核心思想是减少模型中的数值种类。\n    - 训练后量化：\n      - 例如，所有值为2.09、2.12、1.92、1.87的权重都将被替换为2。\n      - 可以通过对滤波器进行k均值聚类来实现，从而减少其中的数值种类。这样做还能降低梯度计算所需的运算次数。\n      - 经过训练后量化处理后，权重变为离散值。\n      - 训练后量化能够显著减少每一层中每个数字所需的比特数。\n    - 剪枝结合训练后量化可以协同作用，进一步压缩模型规模。\n    - 哈夫曼编码\n      - 我们可以使用哈夫曼编码来减少或压缩权重的比特数。\n      - 不常用的权重：用较多的比特表示。\n      - 常用的权重：用较少的比特表示。\n    - 将剪枝、训练后量化和哈夫曼编码结合起来的方法称为深度压缩。\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_08cf0558ee95.png)\n      - ![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_readme_d6b3c4c3901c.png)\n    - **SqueezeNet**\n      - 到目前为止，我们讨论的所有方法都基于已有的预训练模型。那么，我们能否设计一种全新的架构，以节省内存并减少计算量呢？\n      - SqueezeNet仅用50分之一的参数和一半的模型大小，就能达到AlexNet的准确率。\n      - SqueezeNet甚至可以通过深度压缩进一步压缩。\n      - 如今的模型更加节能，且速度大幅提升。\n    - 深度压缩已在Facebook和百度等公司得到实际应用。\n  - **量化**\n    - 算法（量化权重和激活值）：\n      - 使用浮点数进行训练。\n      - 量化权重和激活值：\n        - 收集权重和激活值的统计信息。\n        - 选择合适的基数点位置。\n      - 以浮点格式进行微调。\n      - 转换为定点格式。\n  - **低秩近似**\n    - 这是另一种用于CNN的尺寸缩减算法。\n    - 其核心思想是将卷积层分解，然后分别测试分解后的两个子层。\n  - **二值化\u002F三值化网络**\n    - 我们能否仅用三个数值来表示神经网络中的权重呢？\n    - 如果只使用-1、0、1，模型的规模将大大缩小。\n    - 这是一个较新的想法，发表于“Zhu, Han, Mao, Dally. Trained Ternary Quantization, ICLR’17”。\n    - 该方法在训练完成后实施。\n    - 他们在AlexNet上进行了尝试，结果误差几乎与原版AlexNet相同。\n    - 由于每个权重只占极少比特，单个寄存器内可同时完成的运算数量会增加：https:\u002F\u002Fxnor.ai\u002F\n  - **Winograd 变换**\n    - 基于3x3 Winograd卷积，其运算次数比普通卷积更少。\n    - cuDNN 5已经采用了Winograd卷积，显著提升了速度。\n- 第二部分：**\u003Cu>高效推理的硬件\u003C\u002Fu>**\n  - 我们开发了许多用于深度学习的ASIC芯片，它们的目标都是尽量减少内存访问。\n    - Eyeriss MIT\n    - DaDiannao\n    - TPU Google（张量处理单元）\n      - 它可以替代服务器中的显卡。\n      - 每台服务器最多可安装4块TPU。\n      - 相较于GPU，这种硬件的能耗更低，芯片面积也更小。\n    - EIE Stanford\n      - 由Han等人于2016年提出 [
ISCA’16]。\n      - 不保存零权重，并直接在硬件层面进行量化。\n      - 作者认为EIE具有更好的吞吐量和更高的能效。\n- 第三部分：**\u003Cu>高效训练的算法\u003C\u002Fu>**\n  - **并行化**\n    - **数据并行** – 同时运行多个输入\n      - 例如，同时处理两张图片！\n      - 并行处理多个训练样本。\n      - 受批次大小限制。\n      - 梯度需要由主节点汇总。\n    - **模型并行**\n      - 将模型——即网络——拆分成若干部分。\n      - 按层将模型分配到多个处理器上。\n    - 超参数并行\n      - 同时尝试多种不同的网络架构。\n      - 很容易就能配置16到64个GPU来并行训练一个模型。\n  - **混合精度**：使用FP16和FP32\n    - 我们已经讨论过，如果在整个模型中都使用16位实数，能耗将降低4倍。\n    - 那么，我们能否完全使用16位数字来构建模型呢？部分情况下可以采用混合FP16和FP32的方式。大部分地方使用16位，但在某些关键点仍需使用FP32。\n    - 例如，FP16与FP16相乘时，需要用FP32来存储（累加）结果。\n    - 训练完成后，模型的准确率可以接近AlexNet和ResNet等知名模型。\n  - **模型蒸馏**\n    - 问题在于，我们是否可以利用一个资深（优秀）的已训练神经网络来指导一个新手（新的）神经网络？\n    - 更多信息请参阅Hinton等人关于“暗知识”或“神经网络中的知识蒸馏”的研究。\n  - DSD：密集-稀疏-密集训练\n    - Han等人：“DSD：深度神经网络的密集-稀疏-密集训练”，ICLR 2017\n    - 具有更好的正则化效果。\n    - 其核心思想是：先以密集方式训练模型，随后对其进行剪枝，使其变为稀疏状态。\n    - 在完成上述两步后，再恢复被剪掉的连接，重新以密集方式训练。\n    - DSD生成的模型架构相同，但能找到更好的优化解，达到更优的局部极小值，并实现更高的预测准确率。\n    - 这一方法显著提升了许多深度学习模型的性能。\n- 第四部分：**\u003Cu>高效训练的硬件\u003C\u002Fu>**\n  - 用于训练的GPU：\n    - Nvidia PASCAL GP100（2016年）\n    - Nvidia Volta GV100（2017年）\n      - 支持混合精度运算！\n      - 性能极其强大。\n      - 真正的新一代“核弹”。\n  - 谷歌于2017年5月宣布推出“Google Cloud TPU”！\n    - Cloud TPU可提供高达180 teraflops的算力，用于训练和运行机器学习模型。\n    - 我们之前的一个大型翻译模型，需要用32块市面上最好的商用GPU训练整整一天；而现在，只需使用八分之一的TPU pod，就能在一个下午内达到同样的准确率。\n- 我们已经从PC时代过渡到了移动优先时代，如今正迈向AI优先时代。\n\n## 16. 
对抗样本与对抗训练\n\n- **\u003Cu>什么是对抗样本？\u003C\u002Fu>**\n  - 自2013年以来，深度神经网络在以下任务上的表现已达到甚至超越人类水平：\n    - 人脸识别\n    - 物体识别\n    - 验证码识别\n      - 由于其准确率高于人类，许多网站开始寻找替代验证码的解决方案。\n    - 以及其他任务……\n  - 在2013年之前，人们看到计算机犯错并不会感到惊讶！但如今，深度学习已经广泛应用，因此了解其存在的问题及原因显得尤为重要。\n  - 对抗样本是深度学习模型中出现的一种特殊错误现象。\n  - 这一话题直到深度学习的表现不断超越人类后才逐渐受到关注。\n  - 对抗样本是指经过精心构造，旨在使模型产生错误分类的输入数据。\n  - 在许多情况下，从人类视角来看，对抗样本与原始图像几乎没有明显差异。\n  - 近年来相关研究的历史：\n    - Biggio [2013](https:\u002F\u002Flink.springer.com\u002Fchapter\u002F10.1007\u002F978-3-642-40994-3_25)：欺骗神经网络。\n    - Szegedy等2013年：以几乎不可察觉的方式欺骗ImageNet分类器。\n    - Goodfellow等[2014](https:\u002F\u002Farxiv.org\u002Fabs\u002F1412.6572)：低成本、闭式解攻击方法。\n  - 因此，最早的相关研究可追溯到2013年。当时Szegedy训练了一个性能优异的卷积神经网络。\n    - 他希望通过深入理解CNN的工作机制来进一步优化它。\n    - 他输入一张物体的图像，并利用梯度上升法不断调整图像，使其被分类为另一类物体。\n    - 奇怪的是，最终生成的图像从人类视角来看几乎没有任何变化！\n    - 如果你亲自尝试，可能根本察觉不到任何改变，甚至会误以为这是程序错误。然而，仔细对比就会发现，这两张图实际上完全不同！\n  - 这种类型的错误几乎可以在我们所研究的任何深度学习算法中找到！\n    - 令人意外的是，RBF（径向基函数网络）能够抵御此类攻击。\n    - 用于密度估计的深度模型同样具备一定的抗干扰能力。\n  - 不仅神经网络容易被欺骗：\n    - 线性模型\n      - 逻辑回归\n      - Softmax回归\n      - SVMs\n    - 决策树\n    - 最近邻算法\n- **\u003Cu>为什么会出现对抗样本？\u003C\u002Fu>**\n  - 在试图理解这一现象的过程中，2016年曾有人认为这源于高维数据下的过拟合问题。\n    - 他们认为，在如此高维的空间中，可能会存在一些随机误差，而这些误差是可以被检测到的。\n    - 因此，如果使用不同的参数重新训练模型，应该不会犯同样的错误。\n    - 然而，实验结果表明这种观点并不正确。不同模型往往会陷入相同的错误模式，这显然不是过拟合所致。\n  - 在上述实验中，研究人员发现问题并非随机因素，而是具有系统性的。\n    - 只要向某个样本添加特定的向量，无论使用哪种模型，都会导致错误分类。\n  - 或许这些问题更多地源于欠拟合，而非过拟合。\n  - 现代深度神经网络大多由分段线性单元构成：\n    - 整流线性单元（ReLU）\n    - 经过精心调优的Sigmoid函数 `# 大多数情况下我们处于线性区间内`\n    - Maxout\n    - LSTM\n  - 参数与输出之间的关系是非线性的，因为它们是通过乘法连接的，这也使得训练神经网络变得困难；而如果输入和输出之间是线性映射，则会简单得多。\n- **\u003Cu>对抗样本如何被用来攻破机器学习系统？\u003C\u002Fu>**\n  - 当我们测试神经网络的脆弱性时，需要确保真正实现了欺骗效果，而不仅仅是改变了输出类别。而对于攻击者而言，则希望让目标模型表现出某种异常行为（即“挖洞”）。\n  - 构造对抗样本时，通常会对扰动施加最大范数约束。\n  - 快速梯度符号法：\n    - 该方法基于几乎所有神经网络都采用线性激活函数（如ReLU）这一假设。\n    - 每个像素的修改幅度不得超过某个阈值ε。\n    - 具体步骤是：计算损失函数关于输入的梯度，取其符号，再将该符号乘以ε。\n    - 公式如下：\n      - `Xdash = x + ε * (梯度的符号)`\n      - 其中，Xdash表示对抗样本，x表示正常样本。\n    - 
因此，只需利用梯度的方向和一个很小的ε值，就能成功生成对抗样本。\n  - 有些攻击则基于ADAM优化器。\n  - 对抗样本并不是随机噪声！\n  - 神经网络是在特定数据分布上进行训练的，因此在其适用范围内表现良好。但如果数据分布发生偏移，模型便难以给出正确的预测，反而更容易被欺骗。\n  - 深度强化学习同样可能被攻破。\n  - 权重攻击：\n    - 对于线性模型，可以提取其学习到的权重矩阵，取其符号，然后将其叠加到任意样本上，从而强制模型按照这些权重的指示进行分类。——安德烈·卡帕西，《破解ImageNet上的线性分类器》\n  - 事实证明，某些线性模型对对抗样本具有较强的抵抗力（较难被攻破）：\n    - 尤其是浅层RBF网络，能够抵抗快速梯度符号法构造的对抗扰动。# 但问题在于，RBF网络在大多数数据集上的表现并不理想，因为它属于浅层模型。若试图加深网络层次，各层的梯度几乎都会变为零。\n    - 即使使用批归一化等技术，RBF网络也难以有效训练。伊恩认为，如果能找到更好的超参数或更优的优化算法替代梯度下降法，就有可能成功训练RBF网络，从而解决对抗样本问题。\n  - 我们还可以利用另一种模型来欺骗当前模型。例如，用支持向量机来欺骗深度神经网络。\n    - 更多细节请参阅论文：“Papernot 2016”\n  - 转移攻击\n    1. 目标模型的权重、机器学习算法及训练数据集均未知；甚至可能是不可微分的。\n    2. 使用自己的输入数据对该模型进行采样，将数据送入目标模型并获取输出。\n    3. 基于这些数据训练自己的模型。“参照Papernot 2016中的表格”\n    4. 在自己的模型上创建对抗样本。\n    5. 将这些对抗样本应用于目标模型。\n    6. 很大概率能够取得良好效果，成功欺骗目标模型。\n  - 为了将欺骗某网络的成功率提高至100%，可以在转移攻击中构建不止一个模型，而是多达五个模型，然后依次应用。（刘等人，2016年）\n  - 对抗样本同样会影响人类大脑！例如那些会欺骗视觉的图片，在互联网上随处可见。\n  - 实际上，已有研究团队成功欺骗了MetaMind、亚马逊和谷歌的真实模型。\n  - 曾有人将对抗扰动上传至Facebook，结果Facebook真的被欺骗了 :D\n- **\u003Cu>有哪些防御措施？\u003C\u002Fu>**\n  - 伊恩尝试过的许多防御方法都以失败告终！包括：\n    - 集成方法\n    - 权重衰减\n    - Dropout\n    - 在训练或测试阶段添加噪声\n    - 使用自编码器去除扰动\n    - 生成式建模\n  - 通用逼近定理\n    - 无论我们希望分类函数呈现何种形状，只要网络规模足够大，都能实现。\n    - 因此，我们可以训练一个专门用于检测对抗样本的神经网络！\n  - 线性模型和KNN比神经网络更容易被欺骗。相比之下，神经网络实际上可能更加安全。经过对抗训练的神经网络，在应对对抗样本方面的实际成功率远高于其他机器学习模型。\n    - 深度神经网络可以使用非线性激活函数，但关键在于找到合适的优化技术，或者直接采用像“ReLU”这样的线性激活函数。\n- **\u003Cu>如何利用对抗样本改进机器学习，即使不存在对手？\u003C\u002Fu>**\n  - 通用工程机（基于模型的优化）\t\t`#伊恩称之为通用工程机`\n    - 举例来说：\n      - 假设我们想要设计一辆速度极快的汽车。\n      - 我们训练了一个神经网络，让它分析汽车的设计图纸，并判断该图纸是否能造出一辆高速车。\n      - 此处的核心思想是优化网络的输入，使输出达到最大化，从而为我们提供最佳的汽车设计方案！\n    - 通过寻找能使模型预测性能最大化的输入，来实现新发明。\n    - 目前，我们借助对抗样本往往只能得到不理想的结果。但一旦解决了这个问题，我们就有机会制造出最快的汽车、最好的GPU、最舒适的椅子，甚至是开发出全新的药物……\n  - 总体而言，对抗样本的研究仍处于活跃状态，尤其是针对网络防御方面的研究。\n- 结论\n  - 攻击相对容易\n  - 防御则较为困难\n  - 对抗训练可以起到正则化和半监督学习的作用\n  - 数据分布外的输入问题是基于模型的优化方法普遍面临的瓶颈\n- GitHub上有一个代码库，可以帮助你通过编程全面了解对抗样本的相关知识（基于TensorFlow构建）：\n  - 
一个用于构造攻击、构建防御以及对两者进行基准测试的对抗样本库：https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Fcleverhans\n\n\u003Cbr>\u003Cbr>\n\u003Cbr>\u003Cbr>\n这些笔记由[Mahmoud Badry](mailto:mma18@fayoum.edu.eg)于2017年制作。","# CS231n-2017-Summary 快速上手指南\n\n本项目是斯坦福大学 2017 年计算机视觉课程（CS231n）的笔记总结，涵盖了从图像分类、卷积神经网络（CNN）到生成模型等核心内容。它主要作为学习该课程的视频配套文字资料，而非可直接运行的软件库。\n\n## 环境准备\n\n由于本项目本质为 Markdown 格式的课程笔记与知识库，**无需安装特定的 Python 环境或依赖包**即可阅读。\n\n*   **系统要求**：任意支持现代浏览器的操作系统（Windows, macOS, Linux）。\n*   **前置依赖**：\n    *   **阅读模式**：仅需 Web 浏览器（推荐 Chrome, Firefox, Edge）。\n    *   **本地预览模式（可选）**：若希望在本地渲染 Markdown 文件，可安装 VS Code 及 `Markdown Preview Enhanced` 插件，或使用 Python 的 `mkdocs` \u002F `jupyter notebook` 环境。\n    *   **复现课程作业（可选）**：若需运行笔记中提到的算法代码或复现课程实验，建议安装 Python 3.6+ 及以下深度学习库：\n        ```bash\n        pip install numpy matplotlib jupyter notebook tensorflow pytorch\n        ```\n\n## 安装步骤\n\n本项目无需编译或复杂安装，直接克隆仓库即可。考虑到网络环境，推荐使用国内镜像加速或手动下载。\n\n### 方式一：使用 Git 克隆（推荐）\n\n如果网络条件允许，直接使用 Git 克隆：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fksopyla\u002FCS231n-2017-Summary.git\ncd CS231n-2017-Summary\n```\n\n*注：若访问 GitHub 缓慢，可使用国内代码托管平台（如 Gitee）搜索是否有同步镜像，或通过代理工具加速。*\n\n### 方式二：手动下载\n\n1.  访问项目 GitHub 页面：[https:\u002F\u002Fgithub.com\u002Fksopyla\u002FCS231n-2017-Summary](https:\u002F\u002Fgithub.com\u002Fksopyla\u002FCS231n-2017-Summary)\n2.  点击 **\"Code\"** 按钮，选择 **\"Download ZIP\"**。\n3.  解压下载的压缩包到本地目录。\n\n## 基本使用\n\n本项目主要用于辅助学习斯坦福 CS231n 课程。以下是两种主要的使用方式：\n\n### 1. 在线阅读（最直接方式）\n\n直接在 GitHub 上浏览整理好的目录结构。点击对应的章节链接（如 `05. Convolutional neural networks (CNNs)`）即可查看详细笔记。\n\n*   **课程视频配合**：建议同时打开 [YouTube 课程播放列表](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLC1qU-LWwrF64f4QKQT-Vg5Wr4qEE1Zxk)（国内用户可在 Bilibili 搜索\"CS231n 2017\"观看搬运字幕版），对照笔记章节进行学习。\n*   **核心内容速查**：\n    *   **基础理论**：查看 `02. Image classification` 和 `03. Loss function and optimization` 了解 KNN、SVM 及反向传播原理。\n    *   **架构详解**：查看 `05` 至 `09` 章节深入学习 CNN 结构与训练技巧。\n    *   **进阶主题**：参考 `10` 至 `16` 章节了解 RNN、目标检测、生成模型及对抗样本等前沿内容。\n\n### 2. 
本地离线阅读\n\n如果你已克隆或下载了项目，可以使用支持 Markdown 的编辑器进行离线阅读，体验更流畅且无广告干扰。\n\n**使用 VS Code 阅读示例：**\n\n1.  用 VS Code 打开项目文件夹：\n    ```bash\n    code CS231n-2017-Summary\n    ```\n2.  在左侧文件树中点击任意 `.md` 文件（例如 `README.md` 或具体章节文件）。\n3.  点击右上角的 **预览图标**（或按下快捷键 `Ctrl+Shift+V` \u002F `Cmd+Shift+V`）即可渲染查看包含公式和图片的完整笔记。\n\n**关联课程作业代码：**\n笔记中提到的编程作业解决方案可参考原作者关联的仓库：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002FBurton2000\u002FCS231n-2017.git\n```\n在该仓库中可找到具体的 Python 实现代码，配合本笔记的理论部分进行实践。","一名刚入门计算机视觉的算法工程师，正试图复现斯坦福 CS231n 课程中的经典模型以解决工作中的图像分类难题。\n\n### 没有 CS231n-2017-Summary 时\n- **时间成本高昂**：面对总计 16 节、每节长达 1 小时的全英文视频讲座，需要花费数周时间逐帧观看才能梳理出知识脉络。\n- **重点难以捕捉**：课程涵盖从基础损失函数到对抗样本训练等广泛内容，新手极易在海量细节中迷失，无法快速定位如\"CNN 架构演进”或“调试技巧”等核心工程知识。\n- **知识碎片化**：缺乏系统性的笔记整理，看完视频后容易遗忘关键公式推导和参数调整策略，导致在动手编写代码时频繁卡壳，不得不反复回看视频确认细节。\n\n### 使用 CS231n-2017-Summary 后\n- **极速构建框架**：直接利用其按章节整理的目录（如从图像分类到生成模型），在几小时内即可建立起完整的深度学习知识体系，跳过非必要的背景介绍。\n- **精准获取干货**：作者已预先筛选并略去了非核心内容，工程师可直奔\"Training neural networks\"或\"Deep learning software\"等实战章节，快速获取反向传播推导及调参秘籍。\n- **理论与实践闭环**：结合摘要中提供的作业解决方案链接，能迅速将理论概念映射到代码实现，大幅缩短从“看懂视频”到“跑通模型”的周期，提升开发效率。\n\nCS231n-2017-Summary 通过将百小时的视频精华浓缩为结构化笔记，让开发者能以最低的时间成本掌握计算机视觉的核心精髓。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmbadry1_CS231n-2017-Summary_1eb15957.png","mbadry1","Mahmoud Badry","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmbadry1_d5dcada2.png","A deep learning researcher, a .NET developer, and a good learner.",null,"Cairo, Egypt","mahmoud.badry100@yahoo.com","https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fmbadry1\u002F","https:\u002F\u002Fgithub.com\u002Fmbadry1",[85],{"name":86,"color":87,"percentage":88},"Python","#3572A5",100,1577,456,"2026-03-24T00:41:09","MIT",1,"","未说明",{"notes":97,"python":95,"dependencies":98},"该项目是斯坦福 CS231n 2017 课程的视频学习笔记和总结文档，并非可执行的软件工具或代码库，因此没有特定的操作系统、GPU、内存或 Python 版本等运行环境需求。文中提到的 'Assignments solutions' 指向另一个独立的 GitHub 
仓库，若需运行作业代码请参考该仓库的说明。",[],[13],[101,102,103,104],"neural-network","deep-learning","cs231n","notes","2026-03-27T02:49:30.150509","2026-04-06T07:14:08.538864",[108,113],{"id":109,"question_zh":110,"answer_zh":111,"source_url":112},13566,"我可以将此文档翻译成韩文吗？","可以，当然可以。所有内容都归你所有，如果你批准，可以通过 fork 仓库的方式进行翻译。","https:\u002F\u002Fgithub.com\u002Fmbadry1\u002FCS231n-2017-Summary\u002Fissues\u002F1",{"id":114,"question_zh":115,"answer_zh":116,"source_url":117},13565,"我可以将这些笔记翻译成中文吗？","可以，当然可以。所有内容都归你所有，你可以自由使用和翻译。","https:\u002F\u002Fgithub.com\u002Fmbadry1\u002FCS231n-2017-Summary\u002Fissues\u002F3",[]]