[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-mitmath--matrixcalc":3,"similar-mitmath--matrixcalc":57},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":9,"readme_en":10,"readme_zh":11,"quickstart_zh":12,"use_case_zh":13,"hero_image_url":14,"owner_login":15,"owner_name":16,"owner_avatar_url":17,"owner_bio":18,"owner_company":19,"owner_location":19,"owner_email":19,"owner_twitter":19,"owner_website":19,"owner_url":20,"languages":21,"stars":38,"forks":39,"last_commit_at":40,"license":19,"difficulty_score":41,"env_os":42,"env_gpu":43,"env_ram":43,"env_deps":44,"category_tags":50,"github_topics":19,"view_count":41,"oss_zip_url":19,"oss_zip_packed_at":19,"status":52,"created_at":53,"updated_at":54,"faqs":55,"releases":56},3189,"mitmath\u002Fmatrixcalc","matrixcalc","MIT IAP short course: Matrix Calculus for Machine Learning and Beyond","matrixcalc 是麻省理工学院（MIT）开设的一门关于“矩阵微积分”的开源课程资源，旨在帮助学习者掌握机器学习与大规模优化背后的核心数学工具。传统微积分课程多聚焦于标量或向量运算，而现代人工智能应用亟需处理更复杂的矩阵函数求导、自动微分算法及高维空间运算。matrixcalc 从线性代数视角重新构建微积分体系，将矩阵视为整体对象而非单纯数值数组，深入讲解矩阵逆、行列式等操作的导数计算，并揭示反向传播、伴随微分等技术的数学本质。\n\n该资源解决了进阶算法开发者在理解深度学习框架底层机制时面临的数学断层问题，尤其适合需要推导复杂模型梯度、优化自定义损失函数或研发新型自动微分系统的研究人员与工程师。课程由 Alan Edelman 和 Steven G. Johnson 教授主讲，结合 Julia 语言进行数值实践，强调“微分即线性算子”的统一观点，打破符号推导与有限差分的传统局限。其独特亮点在于将抽象数学理论与计算机科学的效率考量紧密结合，为理解现代 AI 框架（如 PyTorch、TensorFlow）的自动微分引擎提供坚实理论基础。无论是希望深化理论功底的研究生，还是从事算法落地的","matrixcalc 是麻省理工学院（MIT）开设的一门关于“矩阵微积分”的开源课程资源，旨在帮助学习者掌握机器学习与大规模优化背后的核心数学工具。传统微积分课程多聚焦于标量或向量运算，而现代人工智能应用亟需处理更复杂的矩阵函数求导、自动微分算法及高维空间运算。matrixcalc 从线性代数视角重新构建微积分体系，将矩阵视为整体对象而非单纯数值数组，深入讲解矩阵逆、行列式等操作的导数计算，并揭示反向传播、伴随微分等技术的数学本质。\n\n该资源解决了进阶算法开发者在理解深度学习框架底层机制时面临的数学断层问题，尤其适合需要推导复杂模型梯度、优化自定义损失函数或研发新型自动微分系统的研究人员与工程师。课程由 Alan Edelman 和 Steven G. Johnson 教授主讲，结合 Julia 语言进行数值实践，强调“微分即线性算子”的统一观点，打破符号推导与有限差分的传统局限。其独特亮点在于将抽象数学理论与计算机科学的效率考量紧密结合，为理解现代 AI 框架（如 PyTorch、TensorFlow）的自动微分引擎提供坚实理论基础。无论是希望深化理论功底的研究生，还是从事算法落地的技术专家，都能从中获得超越公式记忆的深刻洞察。","# Matrix Calculus for Machine Learning and Beyond\n\nThis is the course page for an **18.063 Matrix Calculus** at MIT taught in **January 2026** ([IAP](https:\u002F\u002Felo.mit.edu\u002Fiap\u002F)) by\nProfessors [Alan Edelman](https:\u002F\u002Fmath.mit.edu\u002F~edelman\u002F) and [Steven G. Johnson](https:\u002F\u002Fmath.mit.edu\u002F~stevenj\u002F).\n\n* For past versions of this course, see [Matrix Calculus in IAP 2023 (OCW)](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002F18-s096-matrix-calculus-for-machine-learning-and-beyond-january-iap-2023\u002F) on OpenCourseWare (also [on github](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fmatrixcalc\u002Ftree\u002Fiap2023), with videos [on YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLUl4u3cNGP62EaLLH92E_VCN4izBKK6OE)).  See also [Matrix Calculus in IAP 2022 (OCW)](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002F18-s096-matrix-calculus-for-machine-learning-and-beyond-january-iap-2022\u002Fpages\u002Flecture-notes-and-readings\u002F) (also [on github](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fmatrixcalc\u002Ftree\u002Fiap2022)), and [Matrix Calculus 2024 (github)](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fmatrixcalc\u002Ftree\u002Fiap2024) and [Matrix Calculus 2025 (github)](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fmatrixcalc\u002Ftree\u002Fiap2025); some previous years used the temporary 18.S096 \"special subject\" course number.\n\n**Lectures:** MWF time 11am–1pm, Jan 12–Jan 30 (except Jan 19) in room 35-310.  
3 units, *2 problem sets* due Jan 23 and Jan 30 — submitted electronically [via Canvas](https:\u002F\u002Fcanvas.mit.edu\u002Fcourses\u002F35760), no exams.\n\n**Course Notes**: [18.063 COURSE NOTES](https:\u002F\u002Fwww.dropbox.com\u002Fscl\u002Ffi\u002Fiq4plt8oqja845cuuosa4\u002FMatrix-Calculus-latest.pdf?rlkey=nsnytdu28jje41nhh1bl2dbba&st=i6lfha0r&dl=0).  Other materials to be posted below.\n\n**Piazza forum:** Online discussions at [Piazza](https:\u002F\u002Fpiazza.com\u002Fclass\u002Fmkab8649oo96qm\u002F).\n\n**Description:**\n\n> We all know that calculus courses such as 18.01 and 18.02 are univariate and vector calculus, respectively. Modern applications such as machine learning and large-scale optimization require the next big step, \"matrix calculus\" and calculus on arbitrary vector spaces.\n>\n> This class **revisits and generalizes calculus from the perspective of linear algebra**, extending it to much more general things (e.g. the derivative of matrix functions, like a matrix inverse or determinant with respect to the *matrix*, or an integral with respect to a *function*, an ODE solution with respect to ODE parameters) and connecting it to the computer science of efficient algorithms for differentiation and automatic differentiation (AD).\n>\n> We present a coherent approach to matrix calculus emphasizing matrices as holistic objects (not just as an array of scalars), we generalize and compute derivatives of important matrix factorizations and many other complicated-looking operations, and understand how differentiation formulas must be re-imagined in large-scale computing. We will discuss reverse\u002Fadjoint\u002Fbackpropagation differentiation, custom vector-Jacobian products, and how modern AD is more computer science than calculus (it is neither symbolic formulas nor finite differences).\n\n**Prerequisites:** Linear Algebra such as [18.06](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Fmathematics\u002F18-06-linear-algebra-spring-2010\u002F) and multivariate calculus such as [18.02](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Fmathematics\u002F18-02-multivariable-calculus-fall-2007\u002F).\n\nCourse will involve simple numerical computations using the [Julia language](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fjulia-mit).   Ideally install it on your own computer following [these instructions](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fjulia-mit#installing-julia-and-ijulia-on-your-own-computer), but as a fallback you can run it in the cloud here:\n[![Binder](https:\u002F\u002Fmybinder.org\u002Fbadge_logo.svg)](https:\u002F\u002Fmybinder.org\u002Fv2\u002Fgh\u002Fmitmath\u002Fbinder-env\u002Fmain)\n\n**Topics:**\n\nHere are some of the planned topics:\n\n* Derivatives as linear operators and linear approximation on arbitrary vector spaces: beyond gradients and Jacobians.\n* Derivatives of functions with matrix inputs and\u002For outputs (e.g. matrix inverses and determinants).  Kronecker products and matrix \"vectorization\".\n* Derivatives of matrix factorizations (e.g. eigenvalues\u002FSVD) and derivatives with constraints (e.g. orthogonal matrices).\n* Multidimensional chain rules, and the significance of right-to-left (\"forward\") vs. left-to-right (\"reverse\") composition.  Chain rules on computational graphs (e.g. 
neural networks).\n* Forward- and reverse-mode manual and automatic multivariate differentiation.\n* Adjoint methods (vJp\u002Fpullback rules) for derivatives of solutions of linear, nonlinear, and differential equations.\n* Application to nonlinear root-finding and optimization.  Multidimensional Newton and steepest–descent methods.\n* Applications in engineering\u002Fscientific optimization and machine learning.\n* Second derivatives, Hessian matrices, quadratic approximations, and quasi-Newton methods.\n\n## Lecture 1 (Jan 12)\n\n* part 1: overview ([slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F16uwYARbg4unaGU4Enp6uQvlBb6N21j1UINQW99om6R4\u002Fedit?usp=sharing))\n* part 2: derivatives as linear operators: matrix functions, gradients, product and chain rule\n\n Re-thinking derivatives as linear operators: f(x+dx)-f(x)=df=f′(x)[dx]. That is, f′ is the [linear operator](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLinear_map) that gives the change df in the *output* from a \"tiny\" change dx in the *inputs*, to *first order* in dx (i.e. dropping higher-order terms).   When we have a vector function f(x)∈ℝᵐ of vector inputs x∈ℝⁿ, then f'(x) is a linear operator that takes n inputs to m outputs, which we can think of as an m×n matrix called the [Jacobian matrix](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJacobian_matrix_and_determinant) (typically covered only superficially in 18.02).\n\n In the same way, we can define derivatives of matrix-valued operators as linear operators on matrices.  For example, f(X)=X² gives f'(X)[dX] = X dX + dX X.  Or f(X) = X⁻¹ gives f'(X)[dX] = –X⁻¹ dX X⁻¹.   These are perfectly good linear operators acting on matrices dX, even though they are not written in the form (Jacobian matrix)×(column vector)!   (We *could* rewrite them in the latter form by reshaping the inputs dX and the outputs df into column vectors, more formally by choosing a basis, and we will later cover how this process can be made more elegant using [Kronecker products](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FKronecker_product).  But for the most part it is neither necessary nor desirable to express all linear operators as Jacobian matrices in this way.)\n\n**Further reading**: *Course Notes* (link above), chapters 1 and 2.\n [matrixcalculus.org](http:\u002F\u002Fwww.matrixcalculus.org\u002F) (linked in the slides) is a fun site to play with derivatives of matrix and vector functions.  The [Matrix Cookbook](https:\u002F\u002Fwww.math.uwaterloo.ca\u002F~hwolkowi\u002Fmatrixcookbook.pdf) has a lot of formulas for these derivatives, but no derivations.  Some [notes on vector and matrix differentiation](https:\u002F\u002Fcdn-uploads.piazza.com\u002Fpaste\u002Fj779e63owl53k6\u002F04b2cb8c2f300212d723bea822a6b856085b28e28ca9debc75a05761a436499c\u002F6.S087_Lecture_2.pdf) were posted for 6.S087 from IAP 2021.\n\n**Further reading (fancier math)**: the perspective of derivatives as linear operators is sometimes called a [Fréchet derivative](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FFr%C3%A9chet_derivative) and you can find lots of very abstract (what I'm calling \"fancy\") presentations of this online, chock full of weird terminology whose purpose is basically to generalize the concept to weird types of vector spaces.  
The \"little-o notation\" o(δx) we're using here for \"infinitesimal asymptotics\" is closely related to the [asymptotic notation](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBig_O_notation) used in computer science, but in computer science people are typically taking the limit as the argument (often called \"n\") becomes very *large* instead of very small.  We will formalize this later, corresponding to **section 5.2** of the course notes.\n\n## Lecture 2 (Jan 14)\n\n* part 1: generalized sum and product rule, derivatives of X⁻¹ and ‖x‖² and xᵀAx; gradients ∇f of scalar-valued functions.  Blackboard + some [slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F16uwYARbg4unaGU4Enp6uQvlBb6N21j1UINQW99om6R4\u002Fedit?usp=sharing) from lecture 1.  Course notes: **chapter 2**.\n* part 1: matrix-function Jacobians via [vectorization](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FVectorization_(mathematics)) and [Kronecker products](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FKronecker_product); notes: [2×2 Matrix Jacobians (html)](https:\u002F\u002Frawcdn.githack.com\u002Fmitmath\u002Fmatrixcalc\u002F3f6758996e40c5c1070279f89f7f65e76e08003d\u002Fnotes\u002F2x2Jacobians.jl.html) [(pluto notebook source code)](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fmatrixcalc\u002Fblob\u002Fmain\u002Fnotes\u002F2x2Jacobians.jl) [(jupyter notebook)](notes\u002F2x2Jacobians.ipynb).  Course notes: **chapter 3**.\n\n **Further reading (gradients)**: We will cover more generalizations later, corresponding to **chapter 5** of the course notes. A fancy name for a row vector is a \"covector\" or [linear form](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLinear_form), and the fancy version of the relationship between row and column vectors is the [Riesz representation theorem](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FRiesz_representation_theorem), but until you get to non-Euclidean geometry you may be happier thinking of a row vector as the transpose of a column vector.\n\n## Lecture 3 (Jan 16)\n\n* part 1: the chain rule and forward vs. reverse \"mode\" differentiation: course notes **section 2.4**.  Example applications, **chapter 6**: slides on nonlinear root-finding, optimization, and adjoint-method differentiation [slides](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1U1lB5bhscjbxEuH5FcFwMl5xbHl0qIEkMf5rm0MO8uE\u002Fedit?usp=sharing)\n\n* matrix gradients via the matrix inner product (the [\"Frobenius\" inner product](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FFrobenius_inner_product))\" course notes **chapter 5**\n\n* [pset 1](psets\u002Fpset1.pdf) posted, due Friday Jan 23 at midnight.\n\n## Lecture 4 (Jan 21)\n\n* part 1: generalized gradients and inner products — [handwritten notes](https:\u002F\u002Fwww.dropbox.com\u002Fscl\u002Ffi\u002Fbyg5mpcnnk4xh9tqjbjmk\u002FInner-Products-and-Norms.pdf?rlkey=egsdhyee9go9v17iuxxqx1edj&dl=0) and course notes **chapter 5**\n    - also norms and derivatives: why a norm of the input and output are needed to *define* a derivative, and in particular to define what \"higher-order terms\" and o(δx) mean\n    - more on handling units: when the components of the vector are quantities different units, defining the inner product (and hence the norm) requires dimensional weight factors to scale the quantities.  (Using standard gradient \u002F inner product implicitly uses weights given by whatever units you are using.) 
A change of variables (to nondimensionalize the problem) is equivalent (for steepest descent) to a nondimensionalization of the inner-product\u002Fnorm, but the former is typically easier for use with off-the-shelf optimization software.   Usually, you want to use units\u002Fscaling so that all your quantities have similar scales, otherwise steepest descent may converge very slowly!\n\n\n* The [gradient of the determinant](https:\u002F\u002Frawcdn.githack.com\u002Fmitmath\u002Fmatrixcalc\u002Fb08435612045b17745707f03900e4e4187a6f489\u002Fnotes\u002Fdeterminant_and_inverse.html) is ∇(det A) = det(A)A⁻ᵀ (course notes **chapter 7**)\n\nGeneralizing **gradients** to *scalar* functions f(x) for x in arbitrary *vector spaces* x ∈ V.   The key thing is that we need not just a vector space, but an **inner product** x⋅y (a \"dot product\", also denoted ⟨x,y⟩ or ⟨x|y⟩); V is then formally called a [Hilbert space](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHilbert_space).   Then, for *any* scalar function, since df=f'(x)[dx] is a linear operator mapping dx∈V to scalars df∈ℝ (a \"[linear form](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLinear_form)\"), it turns out that it [*must* be a dot product](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FRiesz_representation_theorem) of dx with \"something\", and we call that \"something\" the gradient!  That is, once we define a dot product, then for any scalar function f(x) we can define ∇f by f'(x)[dx]=∇f⋅dx.  So ∇f is always something with the same \"shape\" as x (the [steepest-ascent](https:\u002F\u002Fmath.stackexchange.com\u002Fquestions\u002F223252\u002Fwhy-is-gradient-the-direction-of-steepest-ascent) direction).\n\nTalked about the general [requirements for an inner product](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FInner_product_space): linearity, positivity, and (conjugate) symmetry (and also mentioned the [Cauchy–Schwarz inequality](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCauchy%E2%80%93Schwarz_inequality), which follows from these properties).  Gave some examples of inner products, such as the familiar Euclidean inner product xᵀy or a weighted inner product.  Defined the most obvious inner product of m×n matrices: the [Frobenius inner product](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FFrobenius_inner_product) A⋅B=`sum(A .* B)`=trace(AᵀB)=vec(A)ᵀvec(B), the sum of the products of the matrix entries.  This also gives us the \"Frobenius norm\" ‖A‖²=A⋅A=trace(AᵀA)=‖vec(A)‖², the square root of the sum of the squares of the entries.   Using this, we can now take the derivatives of various scalar functions of matrices, e.g. we considered\n\n* f(A)=tr(A) ⥰ ∇f = I\n* f(A)=‖A‖ ⥰ ∇f = A\u002F‖A‖\n* f(A)=xᵀAy ⥰ ∇f = xyᵀ (for constant x, y)\n* f(A)=det(A) ⥰ ∇f = det(A)(A⁻¹)ᵀ = transpose of the [adjugate](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FAdjugate_matrix) of A\n\nAlso talked about the definition of a [norm](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNorm_(mathematics)) (which can be obtained from an inner product if you have one, but can also be defined by itself), and why a norm is necessary to define a derivative: it is embedded in the definition of what a higher-order term o(δx) means.   
(Although there are many possible norms, [in finite dimensions all norms are equivalent up to constant factors](https:\u002F\u002Fmath.mit.edu\u002F~stevenj\u002F18.335\u002Fnorm-equivalence.pdf), so the definition of a derivative does not depend on the choice of norm.)\n\nMade precise and derived (with the help of Cauchy–Schwarz) the well known fact that ∇f is the **steepest-ascent** direction, for *any* scalar-valued function on a vector space with an inner product (any Hilbert space), in the norm corresponding to that inner product.  That is, if you take a step δx with a fixed length ‖δx‖=s, the greatest increase in f(x) to first order is found in a direction parallel to ∇f.\n\n**Further reading (∇det)**: Course notes, chapter 7.  There are lots of discussions of the\n[derivative of a determinant](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJacobi%27s_formula) online, involving the [\"adjugate\" matrix](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FAdjugate_matrix) det(A)A⁻¹.\nNot as well documented is that the gradient of the determinant is the cofactor matrix widely used for the [Laplace expansion](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLaplace_expansion) of a determinant.\nThe formula for the [derivative of log(det A)](https:\u002F\u002Fstatisticaloddsandends.wordpress.com\u002F2018\u002F05\u002F24\u002Fderivative-of-log-det-x\u002F) is also nice, and logs of determinants appear in surprisingly many applications (from statistics to quantum field theory).  The [Matrix Cookbook](https:\u002F\u002Fwww.math.uwaterloo.ca\u002F~hwolkowi\u002Fmatrixcookbook.pdf) contains many of these formulas, but no derivations.   A nice application of d(det(A)) is solving for eigenvalues λ by applying Newton's method to det(A-λI)=0, and more generally one can solve det(M(λ))=0 for any function Μ(λ) — the resulting roots λ are called [nonlinear eigenvalues](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNonlinear_eigenproblem) (if M is nonlinear in λ), and one can [apply Newton's method](https:\u002F\u002Fwww.maths.manchester.ac.uk\u002F~ftisseur\u002Ftalks\u002FFT_talk2.pdf) using the determinant-derivative formula here.\n\n## Lecture 5 (Jan 23)\n* Directional derivatives: $f'(x)[v] = \\frac{d}{d\\alpha} f(x + \\alpha v) \\left. \\right|_{\\alpha=0}$.  Connection to \"components\" of gradient or derivative = directional derivative when $v$ is a Cartesian basis vector.  course notes **section 2.2.1**\n* Reverse-mode gradients for neural networks (NNs): [handwritten backpropagation notes](https:\u002F\u002Fwww.dropbox.com\u002Fscl\u002Ffi\u002Fbke4pbr342e1jhv9qytg1\u002FNN-Backpropagation.pdf?rlkey=b7krtzdt4hgsj63zyq9ok2gqv&dl=0), course notes **chapter 9**.\n* forward-mode automatic differentiation (AD) via [dual numbers](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDual_number) ([Julia notebook](notes\u002FAutoDiff.ipynb)) - course notes, **chapter 8**\n* [pset 1 solutions](psets\u002Fpset1sol.pdf)\n* [pset 2](psets\u002Fpset2.pdf): due midnight Jan 30\n\n**Further reading on backpropagation for NNs**:  [Strang (2019)](https:\u002F\u002Fmath.mit.edu\u002F~gs\u002Flearningfromdata\u002F) section VII.3 and [18.065 OCW lecture 27](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002F18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018\u002Fresources\u002Flecture-27-backpropagation-find-partial-derivatives\u002F). 
You can find many, many articles online about [backpropagation](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBackpropagation) in neural networks. Backpropagation for neural networks is closely related to [backpropagation\u002Fadjoint methods for recurrence relations (course notes)](https:\u002F\u002Fmath.mit.edu\u002F~stevenj\u002F18.336\u002Frecurrence2.pdf), and on [computational graphs (blog post)](https:\u002F\u002Fcolah.github.io\u002Fposts\u002F2015-08-Backprop\u002F).  We will return to computational graphs in a future lecture.\n\n**Further reading on forward AD**: Course notes, chapter 8.  Googling \"automatic differentiation\" will turn up many, many resources — this is a huge practical field these days.   [ForwardDiff.jl](https:\u002F\u002Fgithub.com\u002FJuliaDiff\u002FForwardDiff.jl) (described in detail by [this paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F1607.07892)) in Julia uses [dual number](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDual_number) arithmetic similar to the lecture to compute derivatives; see also this [AMS review article](http:\u002F\u002Fwww.ams.org\u002Fpublicoutreach\u002Ffeature-column\u002Ffc-2017-12), or google \"dual number automatic differentiation\" for many other reviews.    Adrian Hill posted some nice [lecture notes on automatic differentiation (Julia-based)](https:\u002F\u002Fadrhill.github.io\u002Fjulia-ml-course\u002FL6_Automatic_Differentiation\u002F) for an ML course at TU Berlin (Summer 2023).  [TaylorDiff.jl](https:\u002F\u002Fgithub.com\u002FJuliaDiff\u002FTaylorDiff.jl) extends this to higher-order derivatives.\n\n## Lecture 6 (Jan 26): via [Zoom (MIT only)](https:\u002F\u002Fmit.zoom.us\u002Fj\u002F98915152715?pwd=4GftZplphHYx7QIlDUL4vgiwzD7Rxc.1)\n\nDue to the snow emergency, Monday's lecture will be held via Zoom at the link above.\n\n* part 1: forward and reverse-mode automatic differentiation on computational graphs: course notes **section 8.3** and [slides](notes\u002Fgilbert_autodiff_2023.pdf) based on [\"Backpropagation through back substitution with a backslash\" (2023)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.15449).  [Julia notebook](https:\u002F\u002Fsimeonschaub.github.io\u002FReverseModePluto\u002Fnotebook.html).\n* part 2: calculus of variations: course notes **chapter 11**\n\n**Further reading (AD on graphs):** Course notes, section 8.3.  See [Prof. Edelman's poster](notes\u002Fbackprop_poster.pdf) about backpropagation on graphs, this blog post on [calculus on computational graphs](https:\u002F\u002Fcolah.github.io\u002Fposts\u002F2015-08-Backprop\u002F) for a gentle review, and these Columbia [course notes](http:\u002F\u002Fwww.cs.columbia.edu\u002F~mcollins\u002Fff2.pdf) for a more formal approach.  Implementing automatic reverse-mode AD is much more complicated than defining a new number type, unfortunately, and involves a lot more intricacies of compiler technology.  
See also Chris Rackauckas's blog post on [tradeoffs in AD](https:\u002F\u002Fwww.stochasticlifestyle.com\u002Fengineering-trade-offs-in-automatic-differentiation-from-tensorflow-and-pytorch-to-jax-and-julia\u002F), and Chris's discussion post on [AD limitations](https:\u002F\u002Fdiscourse.julialang.org\u002Ft\u002Fopen-discussion-on-the-state-of-differentiable-physics-in-julia\u002F72900\u002F2).\n\n**Further reading (Calculus of Variations)**: There are many resources on the [\"calculus of variations\"](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCalculus_of_variations), which refers to derivatives of f(u)=∫F(u,u′,…)dx for functions u(x), but we saw that it is essentially just a special case of our general rule df=f(u+du)-f(u)=f′(u)[du]=∇f⋅du when du lies in a vector space of functions.  Setting ∇f=0 to find an extremum of f(u) yields an [Euler–Lagrange equation](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEuler%E2%80%93Lagrange_equation), the most famous examples of which are probably [Lagrangian mechanics](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLagrangian_mechanics) and also the [Brachistochrone problem](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBrachistochrone_curve), but it also shows up in many other contexts such as [optimal control](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FOptimal_control).  A very readable textbook on the subject is [*Calculus of Variations* by Gelfand and Fomin](https:\u002F\u002Fstore.doverpublications.com\u002F0486414485.html).\n\n## Lecture 7 (Jan 28)\n\n* part 1: complex numbers and CR calculus - [handwritten notes](https:\u002F\u002Fwww.dropbox.com\u002Fscl\u002Ffi\u002Fcn2cmzf1q2anpeeg9s5a4\u002FCR-Calculus.pdf?rlkey=b0qd9b3r8ldc9tu0s80nxpvir&dl=0) — course notes **chapter 15**\n* part 2: derivatives with constraints, derivatives of eigenproblems [(html)](https:\u002F\u002Frawcdn.githack.com\u002Fmitmath\u002Fmatrixcalc\u002Fd11b747d70a5d9e1a3da8cdb68a7f8a220d3afae\u002Fnotes\u002Fsymeig.jl.html) [(julia source)](notes\u002Fsymeig.jl) — course notes **chapter 14**\n\n**Further reading (CR calculus)**: A well-known reference on CR calculus can be found in the UCSD notes [The Complex Gradient Operator and the CR-Calculus](https:\u002F\u002Farxiv.org\u002Fabs\u002F0906.4835) by Ken Kreutz-Delgado (2009).  These are also called [Wirtinger derivatives](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FWirtinger_derivatives).  There is some support for this in automatic differentiation packages, e.g. see the documentation on [complex functions in ChainRules.jl](https:\u002F\u002Fjuliadiff.org\u002FChainRulesCore.jl\u002Fdev\u002Fmaths\u002Fcomplex.html) or [complex functions in JAX](https:\u002F\u002Fjax.readthedocs.io\u002Fen\u002Flatest\u002Fnotebooks\u002Fautodiff_cookbook.html#complex-numbers-and-differentiation).   
The special case of \"[holomorphic](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHolomorphic_function)\" \u002F \"[analytic](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FAnalytic_function)\" functions where we have an \"ordinary\" derivative (= linear operator on dz) is the main topic of [complex analysis](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FComplex_analysis), for which there are many resources (textbooks, online tutorials, and classes like [18.04](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002F18-04-complex-variables-with-applications-spring-2018\u002F)).\n\n**Further reading (derivatives with constraints and eigenproblems)**: Computing derivatives on curved surfaces (\"manifolds\") is closely related to [tangent spaces](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTangent_space) in differential geometry.   The effect of constraints can also be expressed in terms of [Lagrange multipliers](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLagrange_multiplier), which are useful in expressing optimization problems with constraints (see also chapter 5 of [Convex Optimization](https:\u002F\u002Fweb.stanford.edu\u002F~boyd\u002Fcvxbook\u002F) by Boyd and Vandenberghe).\nIn physics, first and second derivatives of eigenvalues and first derivatives of eigenvectors are often presented as part of [\"time-independent\" perturbation theory](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPerturbation_theory_(quantum_mechanics)#Time-independent_perturbation_theory) in quantum mechanics, or as the [Hellmann–Feynmann theorem](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHellmann%E2%80%93Feynman_theorem) for the case of dλ.    The derivative of an eigenvector involves *all* of the other eigenvectors, but a much simpler \"vector–Jacobian product\" (involving only a single eigenvector and eigenvalue) can be obtained from left-to-right differentiation of a *scalar function* of an eigenvector, as reviewed in the [18.335 notes on adjoint methods](https:\u002F\u002Fgithub.com\u002Fmitmath\u002F18335\u002Fblob\u002Fspring21\u002Fnotes\u002Fadjoint\u002Fadjoint.pdf).\n\n* When differentiating eigenvalues λ of matrices A(x), a complication arises at eigenvalue crossings (multiplicity k > 1), where in general the eigenvalues (and eigenvectors) cease to be differentiable.  (More generally, this problem arises for any [implicit function](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FImplicit_function) with a repeated root.)  In this case, one option is to use an expanded definition of sensitivity analysis called a **generalized gradient** (a k×k *matrix-valued* linear operator G(x)\\[dx\\] whose *eigenvalues* are the perturbations dλ); see e.g. [Cox (1995)](https:\u002F\u002Fdoi.org\u002F10.1006\u002Fjfan.1995.1117), [Seyranian *et al.* (1994)](https:\u002F\u002Fdoi.org\u002F10.1007\u002FBF01742705), and [Stechlinski (2022)](https:\u002F\u002Fdoi.org\u002F10.1016\u002Fj.laa.2022.04.019). (Physicists call this idea [degenerate perturbation theory](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002F8-06-quantum-physics-iii-spring-2018\u002Fa0889c5ca8a479c3e56c544d646fb770_MIT8_06S18ch1.pdf).) A recent formulation of similar ideas is called a **lexicographic directional derivative**; see [Nesterov (2005)](https:\u002F\u002Fdoi.org\u002F10.1007\u002Fs10107-005-0633-0) and [Barton *et al* (2017)](https:\u002F\u002Fdoi.org\u002F10.1080\u002F10556788.2017.1374385). 
Sometimes, optimization problems involving eigenvalues can be reformulated to avoid this difficulty by using [SDP](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSemidefinite_programming) constraints [(Men *et al.*, 2014)](http:\u002F\u002Fdoi.org\u002F10.1364\u002FOE.22.022632).  For a [defective matrix](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDefective_matrix) the situation is worse: even the generalized derivatives blow up, because dλ is proportional to the *square root* of the perturbation ‖dA‖.\n\n## Lecture 9 (Jan 30)\n\n* Eigenvalue and eigenvector derivatives, continued from previous lecture.\n* CR calculus in higher dimensions and the CR gradient, continued from previous lecture.\n* Second derivatives, Hessian matrices, quadratic approximations, and applications — course notes **chapter 13** — and combinations of reverse and forward mode to compute Hessians or Hessian–vector products (notes **section 8.4.1**).\n* Some topics we *didn't* cover: differentiating ODE solutions (notes **chapter 10**), differentiating random functions (notes **chapter 12**), and other topics such as [delta functions and distributional derivatives](https:\u002F\u002Fmath.mit.edu\u002F~stevenj\u002F18.303\u002Fdelta-notes.pdf) and other generalizations of derivatives (notes **chapter 16**).\n* [pset 2 solutions](psets\u002Fpset2sol.pdf)\n\n**Further reading (second derivatives)**:\n* [Bilinear forms](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBilinear_form) are an important generalization of quadratic operations to arbitrary vector spaces, and we saw that the second derivative can be viewed as a [symmetric bilinear form](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSymmetric_bilinear_form).   This is closely related to a [quadratic form](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FQuadratic_form), which is just what we get by plugging in the same vector twice, e.g. the f''(x)[δx,δx]\u002F2 that appears in quadratic approximations for f(x+δx) is a quadratic form.  The most familiar multivariate version of f''(x) is the [Hessian matrix](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHessian_matrix); Khan academy has an elementary [introduction to quadratic approximation](https:\u002F\u002Fwww.khanacademy.org\u002Fmath\u002Fmultivariable-calculus\u002Fapplications-of-multivariable-derivatives\u002Fquadratic-approximations\u002Fa\u002Fquadratic-approximation)\n* [Positive-definite](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDefinite_matrix) Hessian matrices, or more generally [definite quadratic forms](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDefinite_quadratic_form) f″, appear at extrema (f′=0) of scalar-valued functions f(x) that are local minima; there are a lot of [more formal treatments](http:\u002F\u002Fwww.columbia.edu\u002F~md3405\u002FUnconstrained_Optimization.pdf) of the same idea, and conversely Khan academy has the [simple 2-variable version](https:\u002F\u002Fwww.khanacademy.org\u002Fmath\u002Fmultivariable-calculus\u002Fapplications-of-multivariable-derivatives\u002Foptimizing-multivariable-functions\u002Fa\u002Fsecond-partial-derivative-test) where you can check the sign of the 2×2 eigenvalues just by looking at the determinant and a single entry (or the trace).  
There's a nice [stackexchange discussion](https:\u002F\u002Fmath.stackexchange.com\u002Fquestions\u002F2285282\u002Frelating-condition-number-of-hessian-to-the-rate-of-convergence) on why an [ill-conditioned](https:\u002F\u002Fnhigham.com\u002F2020\u002F03\u002F19\u002Fwhat-is-a-condition-number\u002F) Hessian tends to make steepest descent converge slowly; some Toronto [course notes on the topic](https:\u002F\u002Fwww.cs.toronto.edu\u002F~rgrosse\u002Fcourses\u002Fcsc421_2019\u002Fslides\u002Flec07.pdf) may also be helpful.\n* See e.g. these Stanford notes on [sequential quadratic optimization](https:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fee364b\u002Flectures\u002Fseq_notes.pdf) using trust regions (sec. 2.2).  See 18.335 [notes on BFGS quasi-Newton methods](https:\u002F\u002Fgithub.com\u002Fmitmath\u002F18335\u002Fblob\u002Fspring21\u002Fnotes\u002FBFGS.pdf) (also [video](https:\u002F\u002Fmit.zoom.us\u002Frec\u002Fshare\u002FnaqcRgSkZ0VNeDp0ht8QmB566mPowuHJ8k0LcaAmZ7XxaCT1ch4j_O4Khzi-taXm.CXI8xFthag4RvvoC?startTime=1620241284000)).   The fact that a quadratic optimization problem in a sphere has [strong duality](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FStrong_duality) and hence is efficiently solvable is discussed in section 5.2.4 of the [*Convex Optimization* book](https:\u002F\u002Fweb.stanford.edu\u002F~boyd\u002Fcvxbook\u002F).  There has been a lot of work on [automatic Hessian computation](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHessian_automatic_differentiation), but for large-scale problems you can ultimately only compute Hessian–vector products efficiently in general, which are equivalent to a directional derivative of the gradient, and can be used e.g. for [Newton–Krylov methods](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNewton%E2%80%93Krylov_method).\n","# 机器学习及其他领域的矩阵微积分\n\n这是麻省理工学院于**2026年1月**（IAP学期）开设的**18.063 矩阵微积分**课程页面，由艾伦·埃德尔曼教授（[Alan Edelman](https:\u002F\u002Fmath.mit.edu\u002F~edelman\u002F)）和史蒂文·G·约翰逊教授（[Steven G. 
Johnson](https:\u002F\u002Fmath.mit.edu\u002F~stevenj\u002F)）主讲。\n\n* 如需查看本课程的往期版本，请参阅 OpenCourseWare 上的 **2023年IAP学期矩阵微积分（OCW）**（[GitHub仓库](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fmatrixcalc\u002Ftree\u002Fiap2023)，视频可在 [YouTube](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLUl4u3cNGP62EaLLH92E_VCN4izBKK6OE) 上观看）。此外，还有 **2022年IAP学期矩阵微积分（OCW）**（[GitHub仓库](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fmatrixcalc\u002Ftree\u002Fiap2022)），以及 **2024年矩阵微积分（GitHub）** 和 **2025年矩阵微积分（GitHub）**；前几年曾使用临时的18.S096“特殊课题”课程编号。\n\n**授课时间：** 周一至周五上午11点至下午1点，1月12日至1月30日（1月19日除外），地点为35-310教室。本课程共3个学分，设有*2次作业*，分别于1月23日和1月30日提交——通过Canvas在线提交（[Canvas链接](https:\u002F\u002Fcanvas.mit.edu\u002Fcourses\u002F35760)），无考试。\n\n**课程笔记：** [18.063 课程笔记](https:\u002F\u002Fwww.dropbox.com\u002Fscl\u002Ffi\u002Fiq4plt8oqja845cuuosa4\u002FMatrix-Calculus-latest.pdf?rlkey=nsnytdu28jje41nhh1bl2dbba&st=i6lfha0r&dl=0)。其他相关材料将陆续在此发布。\n\n**Piazza论坛：** 在线讨论请访问[Piazza](https:\u002F\u002Fpiazza.com\u002Fclass\u002Fmkab8649oo96qm\u002F)。\n\n**课程简介：**\n\n> 我们都知道，诸如18.01和18.02之类的微积分课程分别讲授一元微积分和多元微积分。然而，现代应用领域，如机器学习和大规模优化问题，需要更进一步的发展，即“矩阵微积分”以及在任意向量空间上的微积分。\n>\n> 本课程**从线性代数的角度重新审视并推广微积分理论**，将其扩展到更为广泛的范畴（例如，矩阵函数关于矩阵本身的导数，或关于函数的积分、关于常微分方程参数的解的导数等），并将其与高效求导算法及自动微分（AD）的计算机科学原理相结合。\n>\n> 我们将采用一种连贯的方法来讲解矩阵微积分，强调矩阵应被视为整体对象而非单纯标量数组；我们将推广并计算重要矩阵分解及其他复杂运算的导数，并探讨在大规模计算中如何重新设计求导公式。此外，我们还将讨论反向传播、自定义向量-雅可比乘积，以及现代自动微分更多地属于计算机科学而非传统微积分的本质——它既不是符号推导，也不是有限差分方法。\n\n**先修课程：** 线性代数（如[18.06](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Fmathematics\u002F18-06-linear-algebra-spring-2010\u002F)）和多元微积分（如[18.02](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002Fmathematics\u002F18-02-multivariable-calculus-fall-2007\u002F)）。\n\n本课程将涉及使用Julia语言进行简单的数值计算。建议您按照[这些说明](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fjulia-mit#installing-julia-and-ijulia-on-your-own-computer)在自己的电脑上安装Julia；作为备选方案，您也可以在此处通过云端运行：\n[![Binder](https:\u002F\u002Fmybinder.org\u002Fbadge_logo.svg)](https:\u002F\u002Fmybinder.org\u002Fv2\u002Fgh\u002Fmitmath\u002Fbinder-env\u002Fmain)\n\n**课程主题：**\n\n以下是部分计划讲授的主题：\n\n* 导数作为线性算子及在任意向量空间上的线性近似：超越梯度与雅可比矩阵。\n* 具有矩阵输入或输出的函数的导数（如矩阵逆和行列式）。克罗内克积与矩阵“向量化”。\n* 矩阵分解的导数（如特征值\u002FSVD）以及带约束条件的导数（如正交矩阵）。\n* 多维链式法则，以及右向左（“前向”）与左向右（“反向”）复合的重要性。计算图上的链式法则（如神经网络）。\n* 手动与自动多变量微分的前向模式与反向模式。\n* 针对线性、非线性及微分方程解的导数的伴随方法（vJp\u002F回传规则）。\n* 应用于非线性方程求根与优化问题。多维牛顿法与最速下降法。\n* 工程\u002F科学优化及机器学习中的应用。\n* 二阶导数、海森矩阵、二次近似以及拟牛顿法。\n\n## 第1讲（1月12日）\n\n* 第一部分：概述（[幻灯片](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F16uwYARbg4unaGU4Enp6uQvlBb6N21j1UINQW99om6R4\u002Fedit?usp=sharing)）\n* 第二部分：导数作为线性算子：矩阵函数、梯度、乘积法则与链式法则\n\n将导数重新理解为线性算子：f(x+dx)-f(x)=df=f′(x)[dx]。也就是说，f′是那个在输入发生“微小”变化dx时，给出输出变化df的[线性算子](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLinear_map)，且这种关系仅在dx的一阶近似下成立（即忽略高阶项）。当我们的函数f(x)∈ℝᵐ的输入x∈ℝⁿ为向量时，f′(x)就是一个将n维输入映射到m维输出的线性算子，我们可以将其视为一个m×n的矩阵，称为[雅可比矩阵](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJacobian_matrix_and_determinant)（通常在18.02课程中只是浅显地介绍）。\n\n同样地，我们也可以将矩阵值函数的导数定义为作用于矩阵上的线性算子。例如，f(X)=X²时，有f′(X)[dX] = X dX + dX X；而f(X) = X⁻¹时，则有f′(X)[dX] = –X⁻¹ dX 
X⁻¹。这些都是对矩阵dX进行操作的良好线性算子，尽管它们并未以“雅可比矩阵×列向量”的形式写出！（我们确实可以通过将输入dX和输出df重塑为列向量，并更正式地选择基底来实现这一形式；稍后我们会讨论如何利用[克罗内克积](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FKronecker_product)使这一过程更加优雅。但大多数情况下，没有必要也不希望将所有线性算子都以雅可比矩阵的形式表达。）\n\n**拓展阅读**：*课程笔记*（见上方链接），第1章和第2章。\n[matrixcalculus.org](http:\u002F\u002Fwww.matrixcalculus.org\u002F)（幻灯片中已链接）是一个可以用来玩转矩阵和向量函数导数的有趣网站。[矩阵手册](https:\u002F\u002Fwww.math.uwaterloo.ca\u002F~hwolkowi\u002Fmatrixcookbook.pdf)中包含大量此类导数的公式，但没有推导过程。此外，针对2021年IAP学期的6.S087课程，还发布了一些关于[向量与矩阵微分的笔记](https:\u002F\u002Fcdn-uploads.piazza.com\u002Fpaste\u002Fj779e63owl53k6\u002F04b2cb8c2f300212d723bea822a6b856085b28e28ca9debc75a05761a436499c\u002F6.S087_Lecture_2.pdf)。\n\n**进阶阅读（更高级的数学）**：将导数视为线性算子的观点有时被称为[弗雷歇导数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FFr%C3%A9chet_derivative)，网上有许多非常抽象（我称之为“高级”）的讲解，充斥着各种奇特术语，其目的基本上是将这一概念推广到各种奇异的向量空间中。“小o记号”o(δx)是我们在这里用于描述“无穷小渐近行为”的一种方式，它与计算机科学中使用的[大O记号](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBig_O_notation)密切相关，不过在计算机科学中，人们通常取的是当参数（常称为“n”）变得非常*大*时的极限，而不是非常*小*时的极限。我们将在后续内容中对此进行形式化，对应于课程笔记的**第5.2节**。\n\n## 第2讲（1月14日）\n\n* 第一部分：广义求和与乘积法则，X⁻¹、‖x‖²以及xᵀAx的导数；标量值函数的梯度∇f。黑板讲解加上一些来自第1讲的[幻灯片](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F16uwYARbg4unaGU4Enp6uQvlBb6N21j1UINQW99om6R4\u002Fedit?usp=sharing)。课程笔记：**第2章**。\n* 第二部分：通过[向量化](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FVectorization_(mathematics))和[克罗内克积](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FKronecker_product)计算矩阵函数的雅可比矩阵；相关资料：[2×2矩阵雅可比矩阵（HTML）](https:\u002F\u002Frawcdn.githack.com\u002Fmitmath\u002Fmatrixcalc\u002F3f6758996e40c5c1070279f89f7f65e76e08003d\u002Fnotes\u002F2x2Jacobians.jl.html)、[(Pluto笔记本源代码)](https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fmatrixcalc\u002Fblob\u002Fmain\u002Fnotes\u002F2x2Jacobians.jl)、[(Jupyter笔记本)](notes\u002F2x2Jacobians.ipynb)。课程笔记：**第3章**。\n\n**拓展阅读（关于梯度）**：我们将在后续内容中探讨更一般的推广，对应于课程笔记的**第5章**。行向量还有一个高大上的名字，叫做“余向量”或[线性泛函](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLinear_form)，而行向量与列向量之间关系的高级表述则是[里斯表示定理](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FRiesz_representation_theorem)，不过在接触到非欧几何之前，你可能还是把行向量看作列向量的转置会更自在。\n\n## 第3讲（1月16日）\n\n* 第一部分：链式法则以及前向与反向“模式”微分：课程笔记**第2.4节**。示例应用，**第6章**：关于非线性方程求根、优化以及伴随方法微分的幻灯片[幻灯片](https:\u002F\u002Fdocs.google.com\u002Fpresentation\u002Fd\u002F1U1lB5bhscjbxEuH5FcFwMl5xbHl0qIEkMf5rm0MO8uE\u002Fedit?usp=sharing)\n\n* 通过矩阵内积（即[弗罗贝尼乌斯内积](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FFrobenius_inner_product)）计算矩阵梯度：课程笔记**第5章**\n\n* [作业1](psets\u002Fpset1.pdf)已发布，截止日期为1月23日周五午夜。\n\n## 第4讲（1月21日）\n\n* 第一部分：广义梯度与内积 — 手写笔记[链接](https:\u002F\u002Fwww.dropbox.com\u002Fscl\u002Ffi\u002Fbyg5mpcnnk4xh9tqjbjmk\u002FInner-Products-and-Norms.pdf?rlkey=egsdhyee9go9v17iuxxqx1edj&dl=0)及课程讲义**第5章**\n    - 同时涉及范数与导数：为何需要输入和输出的范数来定义导数，尤其是用来界定“高阶项”以及o(δx)的具体含义。\n    - 更多关于单位处理的内容：当向量各分量具有不同单位时，定义内积（进而定义范数）就需要引入维度权重因子对各量进行归一化。若采用标准梯度\u002F内积，则隐含地使用了当前所用单位对应的权重。变量替换（以实现无量纲化）在最速下降法中等价于对内积\u002F范数的无量纲化处理，但前者通常更便于直接使用现成的优化软件。一般而言，应选择合适的单位或缩放方式，使所有变量的数量级相近，否则最速下降法可能会收敛得非常缓慢！\n\n\n* 行列式的梯度公式为∇(det A) = det(A)A⁻ᵀ（课程讲义**第7章**）\n  
\n将**梯度**概念推广至定义在任意*向量空间*V中的标量函数f(x)。关键在于，我们不仅需要一个向量空间，还需要一个**内积**x⋅y（即“点积”，也可记作⟨x,y⟩或⟨x|y⟩）；此时V便正式被称为[希尔伯特空间](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHilbert_space)。对于任意标量函数，由于df=f'(x)[dx]是一个将dx∈V映射到实数df∈ℝ的线性算子（即“线性泛函”），因此它必然可以表示为dx与某个“东西”的点积，而这个“东西”就是梯度！也就是说，一旦定义了内积，对于任何标量函数f(x)，我们都可以通过f'(x)[dx]=∇f⋅dx来定义∇f。因此，∇f总是与x具有相同“形状”的量（即[最速上升方向](https:\u002F\u002Fmath.stackexchange.com\u002Fquestions\u002F223252\u002Fwhy-is-gradient-the-direction-of-steepest-ascent)）。\n\n随后讨论了内积的一般**要求**：线性、正定性以及（共轭）对称性，并简要提及由此推导出的[柯西-施瓦茨不等式](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCauchy%E2%80%93Schwarz_inequality)。接着给出了一些内积的例子，如熟悉的欧几里得内积xᵀy，以及加权内积等。此外还定义了m×n矩阵最直观的内积——[弗罗贝尼乌斯内积](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FFrobenius_inner_product)A⋅B=`sum(A .* B)`=trace(AᵀB)=vec(A)ᵀvec(B)，即矩阵对应元素乘积之和。由此还可得到“弗罗贝尼乌斯范数”‖A‖²=A⋅A=trace(AᵀA)=‖vec(A)‖²，即各元素平方和的平方根。借助这一工具，我们现在可以计算各类矩阵标量函数的导数，例如：\n\n* f(A)=tr(A) ⥰ ∇f = I\n* f(A)=‖A‖ ⥰ ∇f = A\u002F‖A‖\n* f(A)=xᵀAy ⥰ ∇f = xyᵀ（其中x、y为常数）\n* f(A)=det(A) ⥰ ∇f = det(A)(A⁻¹)ᵀ = A的[伴随矩阵](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FAdjugate_matrix)的转置\n\n此外还讨论了**范数**的定义（范数可由内积导出，但也可独立定义），并说明了为何需要范数来定义导数：范数嵌入在高阶项o(δx)的定义之中。尽管存在多种可能的范数，但在有限维空间中，[所有范数都等价于相差一个常数因子](https:\u002F\u002Fmath.mit.edu\u002F~stevenj\u002F18.335\u002Fnorm-equivalence.pdf)，因此导数的定义并不依赖于具体选用的范数。\n\n最后，在柯西-施瓦茨不等式的辅助下，精确推导并证明了一个众所周知的事实：对于任意向量空间（即希尔伯特空间）上的标量函数，其梯度∇f必然是该内积所对应范数下的**最速上升方向**。也就是说，若沿某一方向迈出固定长度的步长‖δx‖=s，则f(x)在一阶近似下的最大增量将出现在与∇f同向的方向上。\n\n**拓展阅读（∇det）**：课程讲义第7章。网络上关于[行列式导数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJacobi%27s_formula)的讨论很多，其中常涉及“伴随矩阵”det(A)A⁻¹。然而较少被提及的是，行列式的梯度正是广泛用于行列式[拉普拉斯展开](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLaplace_expansion)的余子式矩阵。此外，[log(det A)的导数公式](https:\u002F\u002Fstatisticaloddsandends.wordpress.com\u002F2018\u002F05\u002F24\u002Fderivative-of-log-det-x\u002F)也颇为有趣，而行列式的对数在许多领域都有应用，从统计学到量子场论。[矩阵手册](https:\u002F\u002Fwww.math.uwaterloo.ca\u002F~hwolkowi\u002Fmatrixcookbook.pdf)中收录了许多此类公式，但并未给出推导过程。行列式导数的一个重要应用是利用牛顿法求解特征值λ，即令det(A-λI)=0。更一般地，对于任意函数Μ(λ)，均可通过求解det(M(λ))=0来获得其根λ，这些根被称为[非线性特征值](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNonlinear_eigenproblem)（若M关于λ是非线性的）。此时可借助此处的行列式导数公式，采用牛顿法进行迭代求解。\n\n## 第5讲（1月23日）\n* 方向导数：$f'(x)[v] = \\frac{d}{d\\alpha} f(x + \\alpha v) \\left. 
\\right|_{\\alpha=0}$。当 $v$ 是笛卡尔基向量时，方向导数与梯度或导数的“分量”之间存在联系。课程笔记 **第2.2.1节**\n* 神经网络（NN）的反向模式梯度：[手写反向传播笔记](https:\u002F\u002Fwww.dropbox.com\u002Fscl\u002Ffi\u002Fbke4pbr342e1jhv9qytg1\u002FNN-Backpropagation.pdf?rlkey=b7krtzdt4hgsj63zyq9ok2gqv&dl=0)，课程笔记 **第9章**。\n* 通过[对偶数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDual_number)实现的前向模式自动微分（AD）（[Julia 笔记本](notes\u002FAutoDiff.ipynb)) —— 课程笔记，**第8章**\n* [习题集1答案](psets\u002Fpset1sol.pdf)\n* [习题集2](psets\u002Fpset2.pdf)：截止日期为1月30日午夜\n\n**关于神经网络反向传播的进一步阅读**：[Strang (2019)](https:\u002F\u002Fmath.mit.edu\u002F~gs\u002Flearningfromdata\u002F) 第VII.3节以及[18.065 OCW 第27讲](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002F18-065-matrix-methods-in-data-analysis-signal-processing-and-machine-learning-spring-2018\u002Fresources\u002Flecture-27-backpropagation-find-partial-derivatives\u002F)。网上可以找到大量关于神经网络中[反向传播](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBackpropagation)的文章。神经网络中的反向传播与[递推关系的反向传播\u002F伴随方法（课程笔记）](https:\u002F\u002Fmath.mit.edu\u002F~stevenj\u002F18.336\u002Frecurrence2.pdf)以及[计算图（博客文章）](https:\u002F\u002Fcolah.github.io\u002Fposts\u002F2015-08-Backprop\u002F)密切相关。我们将在未来的课程中再次讨论计算图。\n\n**关于前向AD的进一步阅读**：课程笔记第8章。在谷歌上搜索“automatic differentiation”会找到许多资源——这如今是一个非常重要的应用领域。Julia 中的 [ForwardDiff.jl](https:\u002F\u002Fgithub.com\u002FJuliaDiff\u002FForwardDiff.jl)（由[这篇论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F1607.07892)详细描述）使用与课堂类似的[对偶数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDual_number)算术来计算导数；此外，还可以参考这篇[AMS评论文章](http:\u002F\u002Fwww.ams.org\u002Fpublicoutreach\u002Ffeature-column\u002Ffc-2017-12)，或者搜索“dual number automatic differentiation”以获取更多相关评论。Adrian Hill 曾为柏林工业大学的一门机器学习课程（2023年夏季）发布了一些不错的[关于自动微分的讲义](https:\u002F\u002Fadrhill.github.io\u002Fjulia-ml-course\u002FL6_Automatic_Differentiation\u002F)。[TaylorDiff.jl](https:\u002F\u002Fgithub.com\u002FJuliaDiff\u002FTaylorDiff.jl)则将这一方法扩展到了更高阶导数。\n\n## 第6讲（1月26日）：通过[Zoom（仅限MIT用户）](https:\u002F\u002Fmit.zoom.us\u002Fj\u002F98915152715?pwd=4GftZplphHYx7QIlDUL4vgiwzD7Rxc.1)\n\n由于暴雪紧急情况，周一的课程将通过上述 Zoom 链接进行。\n\n* 第一部分：计算图上的前向和反向模式自动微分：课程笔记 **第8.3节** 和基于[“用反斜杠进行回代的反向传播（2003）”](https:\u002F\u002Farxiv.org\u002Fabs\u002F2303.15449)的[幻灯片](notes\u002Fgilbert_autodiff_2023.pdf)。[Julia 笔记本](https:\u002F\u002Fsimeonschaub.github.io\u002FReverseModePluto\u002Fnotebook.html)。\n* 第二部分：变分法：课程笔记 **第11章**\n\n**进一步阅读（图上的AD）**：课程笔记第8.3节。请参阅[Edelman 教授的海报](notes\u002Fbackprop_poster.pdf)关于图上反向传播的内容，这篇关于[计算图上的微积分](https:\u002F\u002Fcolah.github.io\u002Fposts\u002F2015-08-Backprop\u002F)的博客文章提供了较为通俗易懂的介绍，而哥伦比亚大学的[课程笔记](http:\u002F\u002Fwww.cs.columbia.edu\u002F~mcollins\u002Fff2.pdf)则从更正式的角度进行了阐述。不幸的是，实现自动化的反向模式 AD 比定义一种新的数值类型要复杂得多，它涉及更多的编译器技术细节。此外，还可参考 Chris Rackauckas 关于[AD中的权衡](https:\u002F\u002Fwww.stochasticlifestyle.com\u002Fengineering-trade-offs-in-automatic-differentiation-from-tensorflow-and-pytorch-to-jax-and-julia\u002F)的博文，以及 Chris 在[关于Julia中可微分物理现状的讨论帖](https:\u002F\u002Fdiscourse.julialang.org\u002Ft\u002Fopen-discussion-on-the-state-of-differentiable-physics-in-julia\u002F72900\u002F2)中对AD局限性的探讨。\n\n**进一步阅读（变分法）**：关于[变分法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCalculus_of_variations)有许多资料，它指的是对函数 $u(x)$ 的泛函 $f(u)=\\int F(u,u′,…)dx$ 求导。但我们已经看到，这本质上只是我们一般规则 $df=f(u+du)-f(u)=f′(u)[du]=\\nabla f\\cdot du$ 的一个特例，其中 $du$ 属于函数空间。将 $\\nabla f$ 设为零以寻找 $f(u)$ 
的极值，便得到[欧拉-拉格朗日方程](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEuler%E2%80%93Lagrange_equation)，其最著名的例子可能是[拉格朗日力学](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLagrangian_mechanics)，还有[最速降线问题](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBrachistochrone_curve)，但它也出现在许多其他领域，如[最优控制](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FOptimal_control)。一本非常易读的教材是[*变分法*，作者为盖尔凡德和福明](https:\u002F\u002Fstore.doverpublications.com\u002F0486414485.html)。\n\n## 第7讲（1月28日）\n\n* 第一部分：复数与CR微积分 — [手写笔记](https:\u002F\u002Fwww.dropbox.com\u002Fscl\u002Ffi\u002Fcn2cmzf1q2anpeeg9s5a4\u002FCR-Calculus.pdf?rlkey=b0qd9b3r8ldc9tu0s80nxpvir&dl=0) — 课程笔记 **第15章**\n* 第二部分：带约束的导数、特征值问题的导数 [(html)](https:\u002F\u002Frawcdn.githack.com\u002Fmitmath\u002Fmatrixcalc\u002Fd11b747d70a5d9e1a3da8cdb68a7f8a220d3afae\u002Fnotes\u002Fsymeig.jl.html) [(julia源码)](notes\u002Fsymeig.jl) — 课程笔记 **第14章**\n\n**扩展阅读（CR微积分）**：关于CR微积分的一份知名参考资料是UCSD的笔记《复梯度算子与CR微积分》（作者：Ken Kreutz-Delgado，2009年），可参见[arXiv链接](https:\u002F\u002Farxiv.org\u002Fabs\u002F0906.4835)。这类导数也被称为[Wirtinger导数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FWirtinger_derivatives)。一些自动微分库对此提供了支持，例如ChainRules.jl中关于[复函数的文档](https:\u002F\u002Fjuliadiff.org\u002FChainRulesCore.jl\u002Fdev\u002Fmaths\u002Fcomplex.html)以及JAX中关于[复函数的说明](https:\u002F\u002Fjax.readthedocs.io\u002Fen\u002Flatest\u002Fnotebooks\u002Fautodiff_cookbook.html#complex-numbers-and-differentiation)。对于“全纯函数”或“解析函数”这一特殊情况——即存在“普通”导数（对dz的线性算子）——其主要研究内容属于[复分析](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FComplex_analysis)，相关资源非常丰富，包括教科书、在线教程以及类似[18.04](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002F18-04-complex-variables-with-applications-spring-2018\u002F)的课程。\n\n**扩展阅读（带约束的导数及特征值问题）**：在曲面（流形）上计算导数与微分几何中的[切空间](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTangent_space)密切相关。约束的影响也可以用[Lagrange乘子](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLagrange_multiplier)来表示，后者常用于处理带约束的优化问题（参见Boyd和Vandenberghe所著的[凸优化](https:\u002F\u002Fweb.stanford.edu\u002F~boyd\u002Fcvxbook\u002F)第5章）。  \n在物理学中，本征值的一阶和二阶导数以及本征向量的一阶导数通常作为量子力学中“不含时”微扰理论的一部分出现，或者以[dλ情况下的Hellmann–Feyn曼定理](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHellmann%E2%80%93Feynman_theorem)的形式呈现。本征向量的导数涉及所有其他本征向量，但通过从左至右对本征向量的标量函数进行求导，可以得到一个更为简单的“向量–雅可比乘积”（仅涉及单个本征向量和本征值），这一点已在[18.335课程中关于伴随方法的笔记](https:\u002F\u002Fgithub.com\u002Fmitmath\u002F18335\u002Fblob\u002Fspring21\u002Fnotes\u002Fadjoint\u002Fadjoint.pdf)中有所回顾。\n\n* 在对矩阵A(x)的本征值λ求导时，当本征值发生交叉（重数k > 1）时会出现复杂情况：此时本征值（以及本征向量）通常不再可导。更一般地，任何具有重复根的[隐函数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FImplicit_function)都会面临类似问题。在这种情况下，一种可行方案是采用一种称为**广义梯度**的扩展敏感性分析定义：它是一个k×k的*矩阵值*线性算子G(x)\\[dx\\]，其*本征值*即为perturbations dλ。相关文献包括[Cox (1995)](https:\u002F\u002Fdoi.org\u002F10.1006\u002Fjfan.1995.1117)、[Seyranian等 (1994)](https:\u002F\u002Fdoi.org\u002F10.1007\u002FBF01742705)以及[Stechlinski (2022)](https:\u002F\u002Fdoi.org\u002F10.1016\u002Fj.laa.2022.04.019)。物理学家将这一思想称为[简并微扰理论](https:\u002F\u002Focw.mit.edu\u002Fcourses\u002F8-06-quantum-physics-iii-spring-2018\u002Fa0889c5ca8a479c3e56c544d646fb770_MIT8_06S18ch1.pdf)。近年来，另一种类似的表述方式被称为**字典序方向导数**；相关文献有[Nesterov (2005)](https:\u002F\u002Fdoi.org\u002F10.1007\u002Fs10107-005-0633-0)以及[Barton等 
(2017)](https:\u002F\u002Fdoi.org\u002F10.1080\u002F10556788.2017.1374385)。有时，涉及本征值的优化问题可以通过使用[半正定规划(SDP)](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSemidefinite_programming)约束来进行重新建模，从而避免此类困难[(Men等，2014)](http:\u002F\u002Fdoi.org\u002F10.1364\u002FOE.22.022632)。而对于[亏本征矩阵](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDefective_matrix)，情况则更加糟糕：即使广义导数也会发散，因为dλ与扰动‖dA‖的*平方根*成正比。\n\n## 第9讲（1月30日）\n\n* 特征值与特征向量的导数，接续上节课内容。\n* 高维CR微积分及CR梯度，接续上节课内容。\n* 二阶导数、Hessian矩阵、二次近似及其应用——课程**第13章**——以及结合反向模式和正向模式来计算Hessian矩阵或Hessian-向量乘积（笔记**8.4.1节**）。\n* 我们未涉及的一些主题：常微分方程解的求导（笔记**第10章**）、随机函数的求导（笔记**第12章**），以及其他如[δ函数与分布导数](https:\u002F\u002Fmath.mit.edu\u002F~stevenj\u002F18.303\u002Fdelta-notes.pdf)等导数的推广形式（笔记**第16章**）。\n* [pset 2答案](psets\u002Fpset2sol.pdf)\n\n**延伸阅读（二阶导数）**：\n* [双线性形式](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBilinear_form)是将二次运算推广到任意向量空间的重要概念，我们已经看到，二阶导数可以被视为一种[对称双线性形式](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSymmetric_bilinear_form)。这与[二次型](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FQuadratic_form)密切相关，后者正是将同一个向量代入两次所得到的结果，例如在f(x+δx)的二次近似中出现的f''(x)[δx,δx]\u002F2就是一个二次型。f''(x)最熟悉的多元版本就是[Hessian矩阵](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHessian_matrix)；可汗学院提供了一个基础的[二次近似的介绍](https:\u002F\u002Fwww.khanacademy.org\u002Fmath\u002Fmultivariable-calculus\u002Fapplications-of-multivariable-derivatives\u002Fquadratic-approximations\u002Fa\u002Fquadratic-approximation)。\n* [正定](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDefinite_matrix)的Hessian矩阵，或者更一般地，[正定二次型](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDefinite_quadratic_form)f″，出现在标量函数f(x)的局部极小值点处（f′=0）；关于这一思想还有许多[更为正式的论述](http:\u002F\u002Fwww.columbia.edu\u002F~md3405\u002FUnconstrained_Optimization.pdf)。相反地，可汗学院也提供了[简单的二元版本](https:\u002F\u002Fwww.khanacademy.org\u002Fmath\u002Fmultivariable-calculus\u002Fapplications-of-multivariable-derivatives\u002Foptimizing-multivariable-functions\u002Fa\u002Fsecond-partial-derivative-test)，你只需查看行列式和其中一个元素（或迹）的符号，就能判断2×2矩阵的特征值正负。StackExchange上有一篇不错的讨论[为什么条件数不良的Hessian会使最速下降法收敛缓慢](https:\u002F\u002Fmath.stackexchange.com\u002Fquestions\u002F2285282\u002Frelating-condition-number-of-hessian-to-the-rate-of-convergence)；多伦多大学关于该主题的一些[课程笔记](https:\u002F\u002Fwww.cs.toronto.edu\u002F~rgrosse\u002Fcourses\u002Fcsc421_2019\u002Fslides\u002Flec07.pdf)也可能有所帮助。\n* 可参阅斯坦福大学关于使用信赖域的[序列二次优化](https:\u002F\u002Fweb.stanford.edu\u002Fclass\u002Fee364b\u002Flectures\u002Fseq_notes.pdf)的相关笔记（第2.2节）。此外，还有18.335课程关于BFGS拟牛顿法的[笔记](https:\u002F\u002Fgithub.com\u002Fmitmath\u002F18335\u002Fblob\u002Fspring21\u002Fnotes\u002FBFGS.pdf)（也可观看[视频](https:\u002F\u002Fmit.zoom.us\u002Frec\u002Fshare\u002FnaqcRgSkZ0VNeDp0ht8QmB566mPowuHJ8k0LcaAmZ7XxaCT1ch4j_O4Khzi-taXm.CXI8xFthag4RvvoC?startTime=1620241284000)）。书中还提到，在球体内进行的二次优化问题具有[强对偶性](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FStrong_duality)，因此可以高效求解（见《凸优化》一书第5.2.4节）。近年来，关于[自动计算Hessian矩阵](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FHessian_automatic_differentiation)的研究十分活跃，但对于大规模问题而言，通常只能高效计算Hessian-向量乘积，而这种乘积等价于梯度的方向导数，可用于例如[牛顿–克里洛夫方法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNewton%E2%80%93Krylov_method)。","# MatrixCalc 快速上手指南\n\n`matrixcalc` 是麻省理工学院（MIT）18.063 课程的配套资源，旨在通过线性代数视角重新审视并推广微积分，特别适用于机器学习、大规模优化及自动微分（AD）领域。本指南将帮助你快速搭建环境并开始学习。\n\n## 环境准备\n\n本课程主要使用 **Julia** 语言进行数值计算演示。\n\n*   **操作系统**：Windows, macOS, 或 Linux\n*   **前置知识**：\n    *   线性代数基础（参考 MIT 18.06）\n    *   多变量微积分基础（参考 MIT 18.02）\n*   **核心依赖**：\n    *   [Julia 编程语言](https:\u002F\u002Fjulialang.org\u002F) (推荐最新稳定版)\n    *   
网络访问能力（用于下载课程笔记和访问 Piazza 论坛）\n\n> **注意**：国内用户若下载 Julia 较慢，可尝试使用清华或中科大镜像源配置 Julia 包管理器，但语言本体建议从官网或 GitHub Release 页面下载。\n\n## 安装步骤\n\n你可以选择本地安装或云端运行两种方式。\n\n### 方案一：本地安装（推荐）\n\n1.  **下载并安装 Julia**\n    访问 [Julia 官网](https:\u002F\u002Fjulialang.org\u002Fdownloads\u002F) 下载对应系统的安装包并安装。\n    或者遵循 MIT 官方提供的详细安装指引：\n    ```bash\n    # 参考官方安装脚本说明\n    # https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fjulia-mit#installing-julia-and-ijulia-on-your-own-computer\n    ```\n\n2.  **克隆课程代码仓库**\n    打开终端（Terminal 或 PowerShell），执行以下命令获取课程材料：\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fmitmath\u002Fmatrixcalc.git\n    cd matrixcalc\n    ```\n\n3.  **安装必要的 Julia 包**\n    进入 Julia 交互界面，激活项目环境并实例化依赖：\n    ```julia\n    using Pkg\n    Pkg.activate(\".\")\n    Pkg.instantiate()\n    ```\n\n### 方案二：云端运行（免安装）\n\n如果不想在本地配置环境，可以直接使用 Binder 在浏览器中运行：\n\n[![Binder](https:\u002F\u002Fmybinder.org\u002Fbadge_logo.svg)](https:\u002F\u002Fmybinder.org\u002Fv2\u002Fgh\u002Fmitmath\u002Fbinder-env\u002Fmain)\n\n点击上述链接，等待环境加载完成后即可直接运行 `.jl` 和 `.ipynb` 文件。\n\n## 基本使用\n\n本课程的核心在于理解“导数作为线性算子”的概念。以下是基于课程笔记的最简单使用示例。\n\n### 1. 查阅核心课程笔记\n课程的核心内容整理在 PDF 笔记中，建议优先阅读：\n*   **下载地址**：[18.063 COURSE NOTES](https:\u002F\u002Fwww.dropbox.com\u002Fscl\u002Ffi\u002Fiq4plt8oqja845cuuosa4\u002FMatrix-Calculus-latest.pdf?rlkey=nsnytdu28jje41nhh1bl2dbba&st=i6lfha0r&dl=0)\n*   **重点章节**：\n    *   Chapter 1 & 2: 导数作为线性算子 (Derivatives as linear operators)\n    *   Chapter 3: 向量化与克罗内克积 (Vectorization and Kronecker products)\n\n### 2. Julia 代码示例：矩阵函数的导数\n\n在 Julia 环境中，我们可以验证课程中提到的矩阵求导公式。例如，对于函数 $f(X) = X^{-1}$，其导数作用于微小变化量 $dX$ 的结果应为 $-X^{-1} dX X^{-1}$。\n\n创建一个名为 `demo.jl` 的文件或直接 REPL 中输入：\n\n```julia\nusing LinearAlgebra\n\n# 定义一个随机可逆矩阵 X\nX = rand(3, 3) + 3I  # 加上单位阵确保可逆\ndX = rand(3, 3) * 1e-6  # 微小扰动\n\n# 方法 A: 数值差分近似 (Finite Difference)\nf(X) = inv(X)\nnumerical_derivative = (f(X + dX) - f(X)) \u002F 1.0 # 简化步长\n\n# 方法 B: 理论公式 (Linear Operator perspective)\n# f'(X)[dX] = -X⁻¹ * dX * X⁻¹\ntheoretical_derivative = -inv(X) * dX * inv(X)\n\n# 验证两者是否接近\nprintln(\"最大误差: \", maximum(abs.(numerical_derivative - theoretical_derivative)))\n\n# 输出标量函数的梯度示例 (f(x) = x'Ax)\nx = rand(3)\nA = rand(3, 3)\nA_sym = A + A' # 对称部分\n\n# 梯度 ∇f = (A + A')x\ngradient = A_sym * x\nprintln(\"梯度向量形状: \", size(gradient))\n```\n\n### 3. 交互式笔记本学习\n仓库中包含多个 Pluto 和 Jupyter 笔记本，用于可视化推导过程。启动方式：\n\n```julia\nusing Pluto\nPluto.run(notebook=\"notes\u002F2x2Jacobians.jl\")\n```\n或者在 Jupyter Lab 中打开 `notes\u002F2x2Jacobians.ipynb` 查看关于 $2 \\times 2$ 矩阵雅可比行列式的详细推导。\n\n### 4. 
辅助工具推荐\n*   **Matrix Calculus Online**: 访问 [matrixcalculus.org](http:\u002F\u002Fwww.matrixcalculus.org\u002F) 在线验证矩阵和向量函数的导数。\n*   **The Matrix Cookbook**: 查阅常用矩阵导数公式速查表（仅含公式无推导）。","某算法工程师正在开发一个基于矩阵分解的高维推荐系统，需要手动推导损失函数对奇异值分解（SVD）各因子的梯度以进行自定义优化。\n\n### 没有 matrixcalc 时\n- 工程师被迫将矩阵视为标量数组，逐个元素手动推导偏导数，面对复杂的矩阵逆和行列式运算极易出错。\n- 缺乏统一的线性算子视角，难以处理高阶张量或非标准向量空间的微分，导致数学建模过程断裂。\n- 为了验证公式正确性，不得不编写低效的有限差分代码进行数值核对，严重拖慢原型迭代速度。\n- 在实现反向传播时，无法直接利用矩阵整体性质生成高效的向量 - 雅可比积，只能依赖框架的黑盒自动微分，难以定制优化。\n\n### 使用 matrixcalc 后\n- 借助 matrixcalc 提供的整体矩阵微分视角，工程师直接将矩阵作为单一对象推导，快速得出 SVD 因子的解析梯度公式。\n- 利用课程中关于任意向量空间线性算子的理论，轻松将微分规则推广到复杂的自定义矩阵流形上，逻辑清晰连贯。\n- 通过 Julia 语言结合 matrixcalc 的数值示例，瞬间完成解析解与数值解的自动化比对，确保数学推导零误差。\n- 基于对伴随算子和反向微分原理的深度理解，手写了定制化的向量 - 雅可比积算法，使训练速度比通用自动微分提升数倍。\n\nmatrixcalc 的核心价值在于它将微积分从繁琐的标量计算解放为优雅的线性代数操作，让开发者能高效构建大规模机器学习所需的自定义微分算法。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmitmath_matrixcalc_39a0962d.png","mitmath","MITMath","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmitmath_ae125e5e.png","educational materials for MIT math courses",null,"https:\u002F\u002Fgithub.com\u002Fmitmath",[22,26,30,34],{"name":23,"color":24,"percentage":25},"Jupyter Notebook","#DA5B0B",89,{"name":27,"color":28,"percentage":29},"HTML","#e34c26",8.2,{"name":31,"color":32,"percentage":33},"Julia","#a270ba",1.7,{"name":35,"color":36,"percentage":37},"TeX","#3D6117",1.1,579,85,"2026-03-23T03:55:39",2,"Linux, macOS, Windows","未说明",{"notes":45,"python":46,"dependencies":47},"该项目是 MIT 的矩阵微积分课程资料，并非传统的 AI 模型库。主要运行环境为 Julia 编程语言（而非 Python）。建议在本地安装 Julia 和 IJulia，或通过 Binder 在云端浏览器中直接运行示例代码。无特定的 GPU 或大内存需求，仅需满足基础数值计算要求。","不适用 (主要使用 Julia)",[48,49],"Julia language","IJulia (optional)",[51],"其他","ready","2026-03-27T02:49:30.150509","2026-04-06T06:44:11.996401",[],[],[58,74,83,91,100,108],{"id":59,"name":60,"github_repo":61,"description_zh":62,"stars":63,"difficulty_score":41,"last_commit_at":64,"category_tags":65,"status":52},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[66,67,68,69,70,51,71,72,73],"图像","数据工具","视频","插件","Agent","语言模型","开发框架","音频",{"id":75,"name":76,"github_repo":77,"description_zh":78,"stars":79,"difficulty_score":80,"last_commit_at":81,"category_tags":82,"status":52},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[70,66,72,71,51],{"id":84,"name":85,"github_repo":86,"description_zh":87,"stars":88,"difficulty_score":80,"last_commit_at":89,"category_tags":90,"status":52},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 
解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[71,66,72,51],{"id":92,"name":93,"github_repo":94,"description_zh":95,"stars":96,"difficulty_score":97,"last_commit_at":98,"category_tags":99,"status":52},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[72,51],{"id":101,"name":102,"github_repo":103,"description_zh":104,"stars":105,"difficulty_score":97,"last_commit_at":106,"category_tags":107,"status":52},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65628,"2026-04-05T10:10:46",[72,51,67],{"id":109,"name":110,"github_repo":111,"description_zh":112,"stars":113,"difficulty_score":41,"last_commit_at":114,"category_tags":115,"status":52},3364,"keras","keras-team\u002Fkeras","Keras 是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[72,67,51]]