[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-emoen--Machine-Learning-for-Asset-Managers":3,"tool-emoen--Machine-Learning-for-Asset-Managers":61},[4,18,26,36,44,52],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",141543,2,"2026-04-06T11:32:54",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 
架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":53,"name":54,"github_repo":55,"description_zh":56,"stars":57,"difficulty_score":10,"last_commit_at":58,"category_tags":59,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,60],"视频",{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":78,"owner_email":79,"owner_twitter":80,"owner_website":81,"owner_url":82,"languages":83,"stars":88,"forks":89,"last_commit_at":90,"license":91,"difficulty_score":92,"env_os":93,"env_gpu":94,"env_ram":94,"env_deps":95,"category_tags":99,"github_topics":100,"view_count":32,"oss_zip_url":79,"oss_zip_packed_at":79,"status":17,"created_at":106,"updated_at":107,"faqs":108,"releases":138},4446,"emoen\u002FMachine-Learning-for-Asset-Managers","Machine-Learning-for-Asset-Managers","Implementation of code snippets, exercises and application to live data from Machine Learning for Asset Managers (Elements in Quantitative Finance) written by Prof. 
Marcos López de Prado.","Machine-Learning-for-Asset-Managers 是马科斯·洛佩斯·德·普拉多（Marcos López de Prado）教授著作《机器学习在资产管理中的应用》的开源代码实现库。它专注于将书中复杂的量化金融理论转化为可执行的 Python 代码，涵盖去噪、去调协方差矩阵计算、最优带宽估计及聚类分析等核心算法。\n\n该工具主要解决了金融数据中常见的“噪声”干扰问题。通过应用 Marchenko-Pastur 分布理论和特征值修正方法，它能有效清洗随机矩阵，提取真实的市场信号，从而构建更稳定的最小方差投资组合。此外，项目还针对书中部分算法（如最优聚类数算法 ONC）存在的潜在缺陷进行了探讨与修正，为用户提供了更严谨的实践参考。\n\n这款软件非常适合量化研究员、金融数据科学家以及希望深入理解机器学习在资管领域应用的开发者使用。其独特的技术亮点在于不仅复现了经典理论，还通过交叉验证等方法优化了核密度估计（KDE）中的带宽选择，并直观展示了去噪前后特征值分布的对比。对于想要从理论走向实战，或利用 Hudson & Thames 的 mlfinlab 库进行更深层次开发的从业者","Machine-Learning-for-Asset-Managers 是马科斯·洛佩斯·德·普拉多（Marcos López de Prado）教授著作《机器学习在资产管理中的应用》的开源代码实现库。它专注于将书中复杂的量化金融理论转化为可执行的 Python 代码，涵盖去噪、去调协方差矩阵计算、最优带宽估计及聚类分析等核心算法。\n\n该工具主要解决了金融数据中常见的“噪声”干扰问题。通过应用 Marchenko-Pastur 分布理论和特征值修正方法，它能有效清洗随机矩阵，提取真实的市场信号，从而构建更稳定的最小方差投资组合。此外，项目还针对书中部分算法（如最优聚类数算法 ONC）存在的潜在缺陷进行了探讨与修正，为用户提供了更严谨的实践参考。\n\n这款软件非常适合量化研究员、金融数据科学家以及希望深入理解机器学习在资管领域应用的开发者使用。其独特的技术亮点在于不仅复现了经典理论，还通过交叉验证等方法优化了核密度估计（KDE）中的带宽选择，并直观展示了去噪前后特征值分布的对比。对于想要从理论走向实战，或利用 Hudson & Thames 的 mlfinlab 库进行更深层次开发的从业者来说，这是一个极佳的学习起点和实验沙箱。","## Install Library \n\nInstall with `pip install -U git+https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers`\n\n\u003Cpre>\n>>> from Machine_Learning_for_Asset_Managers import ch2_fitKDE_find_best_bandwidth as c\n>>> import numpy as np\n>>> c.findOptimalBWidth(np.asarray([21,3]))\n{'bandwidth': 10.0}\n\u003C\u002Fpre>\n\n# Machine-Learning-for-Asset-Managers\n\nImplementation of code snippets and exercises from [Machine Learning for Asset Managers (Elements in Quantitative Finance)](https:\u002F\u002Fwww.amazon.com\u002FMachine-Learning-Managers-Elements-Quantitative\u002Fdp\u002F1108792898)\nwritten by Prof. Marcos López de Prado.\n\nThe project is for my own learning. If you want to use the concepts from the book, you should head over to Hudson & Thames. They have implemented these concepts and many more in [mlfinlab](https:\u002F\u002Fgithub.com\u002Fhudson-and-thames\u002Fmlfinlab). \n\nFor practical application see the repository: [Machine-Learning-for-Asset-Managers-Oslo-Bors](https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers-Oslo-Bors).\n\nNote: In chapter 4 there is a bug in the implementation of the \"Optimal Number of Clusters\" algorithm (ONC) in the book \n(the code from the paper - DETECTION OF FALSE INVESTMENT STRATEGIES USING UNSUPERVISED LEARNING METHODS, de Prado and Lewis (2018) - \nis different but is also incorrect):\nhttps:\u002F\u002Fquant.stackexchange.com\u002Fquestions\u002F60486\u002Fbug-found-in-optimal-number-of-clusters-algorithm-from-de-prado-and-lewis-201\n\nThe divide-and-conquer method of subspaces used by ONC can be problematic: if you embed a subspace into a space with a large eigenvalue, the larger space can distort the clusters found in the subspace. ONC does precisely that - it embeds subspaces into the space consisting of the largest \neigenvalues found in the correlation matrix. An outline describing the problem more rigorously can be found here: \nhttps:\u002F\u002Fmath.stackexchange.com\u002Fquestions\u002F4013808\u002Fmetric-on-clustering-of-correlation-matrix-using-silhouette-score\u002F4050616#4050616\n\nOther clustering algorithms, such as hierarchical clustering, should be investigated; a small sketch of that alternative follows.\n
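\nAs a rough illustration of that alternative (not part of this repository), hierarchical clustering can be run directly on a correlation-derived distance matrix with scipy; the toy correlation matrix and the choice of linkage method below are illustrative assumptions only:\n\n```\nimport numpy as np\nfrom scipy.cluster.hierarchy import linkage, fcluster\nfrom scipy.spatial.distance import squareform\n\n# toy correlation matrix; in practice use the empirical correlation of returns\ncorr = np.array([[1.0, 0.7, 0.1],\n                 [0.7, 1.0, 0.2],\n                 [0.1, 0.2, 1.0]])\n# angular distance turns the correlation matrix into a proper metric\ndist = np.sqrt(0.5 * (1.0 - corr))\n# scipy expects the condensed (upper-triangular) form of the distance matrix\nlink = linkage(squareform(dist, checks=False), method='average')\nlabels = fcluster(link, t=2, criterion='maxclust')\nprint(labels)  # one cluster label per instrument\n```\n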
\n## Chapter 2 Denoising and Detoning\n\nMarchenko-Pastur theoretical probability density function, and empirical density function:\n| ![marcenko-pastur.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_fe79fe283334.png) | \n|:--:| \n| *Figure 2.1: Marchenko-Pastur theoretical probability density function, and empirical density function* |\n\n\nDenoising a random matrix with signal using the constant residual eigenvalue method. This is done by fixing random eigenvalues. See code snippet 2.5.\n| ![eigenvalue_method.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_b47ba397c347.png) | \n|:--:| \n| *Figure 2.2: A comparison of eigenvalues before and after applying the residual eigenvalue method* |\n\nA detoned covariance matrix can be used to calculate the minimum variance portfolio. The efficient frontier is the upper portion of the minimum variance frontier, starting at the minimum variance portfolio. A denoised covariance matrix is also less unstable with respect to changes in the data.\n\nNote: Exercise 2.7: \"Extend function fitKDE in code snippet 2.2, so that it estimates through\ncross-validation the optimal value of bWidth (bandwidth)\".\n\nThe script ch2_fitKDE_find_bandwidth.py implements this procedure and produces the (green) KDE in figure 2.3:\n| ![gaussian_mp_excersize_2_7.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_3942e4b29a65.png) | \n|:--:| \n| *Figure 2.3: Calculated bandwidth (green line) together with the histogram and the pdf. The green line is smoother. Bandwidth found: 0.03511191734215131* |\n\nFrom code snippet 2.3 - a random matrix with signal: the histogram shows how the eigenvalues of a random matrix with signal are distributed. Then the variance of the theoretical probability density function is calculated using $fitKDE$ as the empirical probability density function. So a good value for the bandwidth in fitKDE is needed to find the likeliest variance of the theoretical MP pdf.\n| ![fig_2_3_mp_with_signal.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_e8309300bdd5.png) | \n|:--:| \n| *Figure 2.4: histogram and pdf of eigenvalues with signal* |\n\n
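\nThe library exposes this bandwidth search as `findOptimalBWidth` in `ch2_fitKDE_find_best_bandwidth` (see the install section above). As a self-contained sketch of the same idea - cross-validating the KDE bandwidth on the eigenvalues of a random correlation matrix - scikit-learn's grid search can be used; the sample sizes and the bandwidth grid below are arbitrary assumptions:\n\n```\nimport numpy as np\nfrom sklearn.model_selection import GridSearchCV\nfrom sklearn.neighbors import KernelDensity\n\n# eigenvalues of a purely random correlation matrix (T observations, N instruments)\nrng = np.random.default_rng(0)\nT, N = 1000, 100\nreturns = rng.normal(size=(T, N))\neigenvalues = np.linalg.eigvalsh(np.corrcoef(returns, rowvar=False))\n\n# exercise 2.7 flavour: choose the KDE bandwidth by cross-validation\ngrid = GridSearchCV(KernelDensity(kernel='gaussian'),\n                    {'bandwidth': np.logspace(-2, 0, 30)}, cv=5)\ngrid.fit(eigenvalues.reshape(-1, 1))\nprint(grid.best_params_['bandwidth'])\n```\n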
\n## Chapter 3 Distance Metrics\n\n* definition of a metric: \n   1. identity of indiscernibles d(x,y) = 0 => x=y \n   2. symmetry d(x,y) = d(y,x) \n   3. triangle inequality d(x,z) \u003C= d(x,y) + d(y,z) \n   - 1, 2, 3 => non-negativity, d(x,y) >= 0 \n* pearson correlation\n* distance correlation\n* angular distance\n* Information-theoretic codependence\u002Fentropy dependence\n    - cross-entropy:  H[X] = - &Sigma;\u003Csub>s &isin; S\u003Csub>X\u003C\u002Fsub>\u003C\u002Fsub> p[x] log (p[x])\n    - Kullback-Leibler divergence:  D\u003Csub>KL\u003C\u002Fsub>[p||q] = - &Sigma;\u003Csub>s &isin; S\u003Csub>X\u003C\u002Fsub>\u003C\u002Fsub> p[x] log (q[x]\u002Fp[x]) = &Sigma;\u003Csub>s &isin; S\u003Csub>X\u003C\u002Fsub>\u003C\u002Fsub> p[x] log (p[x]\u002Fq[x])\n    - Cross-entropy: H\u003Csub>c\u003C\u002Fsub>[p||q] = H[X] + D\u003Csub>KL\u003C\u002Fsub>[p||q]\n    - Mutual information: decrease in uncertainty in X from knowing Y: I[X,Y] = H[X] - H[X|Y] = H[X] + H[Y] - H[X,Y] = E\u003Csub>X\u003C\u002Fsub>[D\u003Csub>KL\u003C\u002Fsub>[p[y|x]||p[y]]]\n    - Variation of information: VI[X,Y] = H[X|Y] + H[Y|X] = H[X,Y] - I[X,Y]. It is the uncertainty we expect in one variable given the other variable: VI[X,Y] = 0 \u003C=> X=Y\n    - The Kullback-Leibler divergence is not a metric, while the variation of information is.\n\n```\n>>> import scipy.stats as ss\n>>> ss.entropy([1.\u002F2,1.\u002F2], base=2)\n1.0\n>>> ss.entropy([1,0], base=2)\n0.0\n>>> ss.entropy([1.\u002F3,2.\u002F3], base=2)\n0.9182958340544894\n```\n1. 1 bit of information in a fair coin toss\n2. 0 bits of information in a deterministic outcome\n3. less than 1 bit of information in an unfair coin toss\n\n\n* Angular distance: p_d = sqrt(1\u002F2 * (1 - rho(X, Y)))\n* Absolute angular distance: p_d = sqrt(1 - |rho(X, Y)|)\n* Squared angular distance: p_d = sqrt(1 - rho(X, Y)^2)\n\n![fig_3_1_angular_distance.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_6df476653b74.png)  ![fig_3_1_abs_squared_angular_distance.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_99dc7a859018.png) \nThe standard angular distance is better suited for long-only portfolio applications; the squared and absolute angular distances are better suited for long-short portfolios. \n\n## Chapter 4 Optimal Clustering\n\nUse unsupervised learning to maximize intragroup similarities and minimize intergroup similarities. Consider a matrix X of shape N x F, with N objects and F features. The features are used to compute the proximity (correlation, mutual information) between the N objects in an NxN matrix.\n\nClustering algorithms fall into two broad types, partitional and hierarchical; common approaches include:\n1. Connectivity: hierarchical clustering\n2. Centroids: like k-means\n3. Distribution: gaussians\n4. Density: search for connected dense regions, like DBSCAN, OPTICS\n5. Subspace: modeled on two dimensions, feature and observation. [Example](https:\u002F\u002Fquantdare.com\u002Fbiclustering-time-series\u002F)\n\n\nGeneration of random block correlation matrices is used to simulate instruments with correlation. The utility for doing this is in code snippet 4.3, and it uses the clustering algorithm \u003Ci>optimal number of clusters\u003C\u002Fi> (ONC), defined in snippets 4.1 and 4.2, which does not need a predefined number of clusters (unlike k-means) but uses an 'elbow method' to stop adding clusters. The optimal number of clusters is reached when there is high intra-cluster correlation and low inter-cluster correlation. The [silhouette score](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSilhouette_(clustering)) is used to minimize within-group distance and maximize between-group distance. 
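\n\nBelow is a rough, self-contained sketch of that setup using scikit-learn rather than the repository's ONC code: it builds a small block correlation matrix, converts it to an angular distance matrix, and scores k-means partitions for several cluster counts with the silhouette score. The block sizes, the intra-block correlation of 0.6 and the helper name `form_block` are illustrative assumptions:\n\n```\nimport numpy as np\nfrom scipy.linalg import block_diag\nfrom sklearn.cluster import KMeans\nfrom sklearn.metrics import silhouette_score\n\ndef form_block(size, rho):\n    # one block of instruments with constant intra-block correlation rho\n    block = np.full((size, size), rho)\n    np.fill_diagonal(block, 1.0)\n    return block\n\n# three correlated blocks, uncorrelated across blocks\ncorr = block_diag(form_block(4, 0.6), form_block(3, 0.6), form_block(3, 0.6))\ndist = np.sqrt(0.5 * (1.0 - corr))  # correlation matrix to angular distance matrix\n\n# score k-means partitions of the distance matrix for several cluster counts\nfor k in range(2, 7):\n    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(dist)\n    print(k, silhouette_score(dist, labels, metric='precomputed'))\n```\n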
\n| ![random_block_corr_matrix.jpg](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_f702338caf82.png) | \n|:--:| \n| *Random block correlation matrix. Light colors indicate a high correlation, and dark colors indicate a low correlation. In this example, the number of blocks K=6, minBlockSize=2, and the number of instruments N=30* |\n| ![fig_4_1_random_block_correlation_matrix_onc.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_f69b2a9f748e.png) | \n| *Applying the ONC algorithm to the random block correlation matrix. ONC finds all the clusters.* |\n\n## Chapter 5 Financial Labels\n\n* Fixed-Horizon method\n* Time-bar method\n* Volume-bar method\n\nThe Triple-Barrier Method involves holding a position until\n1. an unrealized profit target is achieved,\n2. an unrealized loss limit is reached, or\n3. the position has been held beyond a maximum number of bars.\n\nTrend-scanning method: the idea is to identify trends and let them run for as long and as far as they may persist, without setting any barriers. \n\n| ![fig_5_1_trend_scanning.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_415853398646.png) | \n|:--:| \n| *Example of trend-scanning labels on a sine wave with gaussian noise* |\n\n| ![fig_5_2_trend_scanning_t_values.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_ea0b29d58690.png) | \n|:--:| \n| *Trend-scanning with t-values, which show the confidence in the trend. 1 is high confidence going up and -1 is high confidence going down.* |\n\nAn alternative to the look-forward algorithm presented in the book is to look backward from the latest data point over the window sizes. E.g. if the latest data point is at index 20 and the window size is between 3 and 10 days, the look-backward algorithm will scan the window at index 17 to 20 all the way back to index 11 to 20, hence only considering the most recent information.\n\n\n| ![fig_5_2_trend_scanning_t_values2.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_6395e3bdf0c6.png) | \n|:--:| \n| *Trend-scanning with t-values using the look-backward windows* |\n\n## Chapter 6 Feature Importance Analysis\n\n\u003Ci>\"p-value does not measure the probability that neither the null nor the alternative hypothesis is true, or the significance of a result.\"\u003C\u002Fi>\n| ![fig_6_1_p_values_explanatory_vars.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_53635ccfa1d3.png) | \n|:--:| \n| *p-values computed on a set of informative, redundant, and noisy explanatory variables. The explanatory variables do not have the highest p-values.* |\n\n\"Backtesting is not a research tool. Feature importance is.\" (Lopez de Prado) The Mean Decrease Impurity (MDI) algorithm deals with 3 out of 4 problems with p-values (a small scikit-learn sketch of MDI follows the list):\n1. MDI does not impose any tree structure or algebraic specification, and does not rely on any stochastic or distributional characteristics of the residuals (e.g. y=b\u003Csub>0\u003C\u002Fsub>+b\u003Csub>1\u003C\u002Fsub>*x\u003Csub>i\u003C\u002Fsub>+&epsilon;).\n2. Betas are estimated from a single sample, whereas MDI relies on bootstrapping, so the variance can be reduced by increasing the number of trees in the random forest ensemble.\n3. In MDI the goal is not to estimate a coefficient of a given algebraic equation (b_hat_0, b_hat_1) or the probability of a null hypothesis attached to it.\n4. MDI does not correct for the fact that it is computed in-sample, as there is no cross-validation.\n
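\nAs a rough illustration of the MDI idea (scikit-learn's `feature_importances_` is an impurity-based importance), the per-tree importances of a random forest can be averaged on a synthetic data set with informative, redundant and noise features. The data-set sizes and forest parameters below are arbitrary assumptions, and this is not the repository's chapter 6 implementation:\n\n```\nimport pandas as pd\nfrom sklearn.datasets import make_classification\nfrom sklearn.ensemble import RandomForestClassifier\n\n# synthetic classification problem: 3 informative, 2 redundant, 5 noise features\nX, y = make_classification(n_samples=2000, n_features=10, n_informative=3,\n                           n_redundant=2, shuffle=False, random_state=0)\n\n# max_features=1 so that every split considers a single feature\nforest = RandomForestClassifier(n_estimators=200, max_features=1, random_state=0)\nforest.fit(X, y)\n\n# mean decrease impurity: average the per-tree importances across the ensemble\nper_tree = pd.DataFrame([tree.feature_importances_ for tree in forest.estimators_])\nmdi = pd.DataFrame({'mean': per_tree.mean(), 'std': per_tree.std()})\nprint(mdi.sort_values('mean', ascending=False))\n```\n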
\n\n| ![fig_6_2_mdi_example.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_b7d1112d7181.png) | \n|:--:| \n| *MDI algorithm example* |\n\nFigure 6.4 shows that ONC correctly recognizes that there are six relevant clusters (one cluster for each informative feature, plus one cluster of noise features), and it assigns the redundant features to the cluster that contains the informative feature from which the redundant features were derived. Given the low correlation across clusters, there is no need to replace the features with their residuals.\n| ![fig_6_4_feature_clustering.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_5a2222121888.png) | \n|:--:| \n\nNext, apply the clustered MDI method to the clustered data:\n| ![fig_6_5_clustered_MDI.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_4e7b0b5c4fa8.png) | \n|:--:| \n| *Figure 6.5 Clustered MDI* |\n\nClustered MDI works better than non-clustered MDI. Finally, apply the clustered MDA method to this data:\n| ![fig_6_6_clustered_MDA.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_53db4865a64c.png) | \n|:--:| \n| *Figure 6.6 Clustered MDA* |\n\nConclusion: C_5, which is associated with the noisy features, is not important, and all other clusters have similar importance.\n\n## Chapter 7 Portfolio Construction\n\nConvex portfolio optimization can calculate the minimum variance portfolio and the maximum Sharpe ratio portfolio.\n\nDefinition condition number: the absolute value of the ratio between the maximum and minimum eigenvalues, max(eVal) \u002F min(eVal). The condition number says something about the numerical instability caused by the covariance structure.\nDefinition trace: trace(A) = sum(diag(A)) - it's the sum of the diagonal elements.\n\nHighly correlated time series imply a high condition number of the correlation matrix.\n\n### Markowitz's curse\nThe correlation matrix C is stable only when the correlation $\\rho = 0$ - when there is no correlation.\n\nHierarchical risk parity (HRP) outperforms Markowitz in out-of-sample Monte Carlo experiments, but is sub-optimal in-sample.\n\nCode snippet 7.1 illustrates the signal-induced instability of the correlation matrix.\n```\n>>> corr0 = mc.formBlockMatrix(2, 2, .5)\n>>> corr0\narray([[1. , 0.5, 0. , 0. ],\n       [0.5, 1. , 0. , 0. ],\n       [0. , 0. , 1. , 0.5],\n       [0. , 0. , 0.5, 1. ]])\n>>> eVal, eVec = np.linalg.eigh(corr0)\n>>> print(max(eVal)\u002Fmin(eVal))\n3.0\n```\n\n| ![fig_7_1_block_diagonal.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_e713f2fd827b.png) | \n|:--:| \n| *Figure 7.1 Heatmap of a block-diagonal correlation matrix* |\n\nCode snippet 7.2 creates the same kind of block-diagonal matrix, but with one dominant block. However, the condition number is the same.\n```\n>>> corr0 = block_diag(mc.formBlockMatrix(1,2, .5))\n>>> corr1 = mc.formBlockMatrix(1,2, .0)\n>>> corr0 = block_diag(corr0, corr1)\n>>> corr0\narray([[1. , 0.5, 0. , 0. ],\n       [0.5, 1. , 0. , 0. ],\n       [0. , 0. , 1. , 0. ],\n       [0. , 0. , 0. , 1. 
]])\n>>> eVal, eVec = np.linalg.eigh(corr0)\n>>> matrix_condition_number = max(eVal)\u002Fmin(eVal)\n>>> print(matrix_condition_number)\n3.0\n```\nThis demonstrates that bringing down the intrablock correlation in only one of the two blocks doesn't reduce the condition number. It also shows that the instability in Markowitz's solution can be traced back to the dominant blocks.\n| ![fig_7_2_block_diagonal.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_d213a438e0e4.png) | \n|:--:| \n| *Figure 7.2 Heatmap of a dominant block-diagonal correlation matrix* |\n\n### The Nested Clustered Optimization Algorithm (NCO)\n\nNCO provides a strategy for addressing the effect of Markowitz's curse on an existing mean-variance allocation method.\n1. Cluster the correlation matrix.\n2. Compute the optimal intracluster allocations, using the denoised covariance matrix.\n3. Compute the optimal intercluster allocations, using the reduced covariance matrix, which is close to a diagonal matrix, so the optimization problem is close to the ideal \nMarkowitz case where $\\rho = 0$.\n\n## Chapter 8 Testing Set Overfitting\n\nBacktesting is a historical simulation of how an investment strategy would have performed in the past. Backtesting suffers from selection bias under multiple testing, as researchers run millions of tests on historical data and present the best ones (which are overfitted). This chapter studies how to measure the effect of selection bias. \n\n### Precision and recall\n### Precision and recall under multiple testing\n### The Sharpe ratio\n\nSharpe Ratio = μ\u002Fσ\n\n### The 'False Strategy' theorem\n\nA researcher may run many historical simulations and report only the best one (maximum Sharpe ratio). \nThe expected value of the maximum Sharpe ratio across many trials is not the same as the expected Sharpe ratio of a single trial. Hence selection bias under multiple testing (SBuMT).\n\n### Experimental results\n\nA Monte Carlo experiment shows that the expected maximum Sharpe ratio increases with the number of trials (here E[max(Sharpe ratio)] = 3.26) even \nwhen the expected Sharpe ratio of each trial is 0. So an investment strategy will seem promising even when there is no good strategy.\n
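\nA minimal Monte Carlo sketch of this effect (not the repository's chapter 8 code; the number of trials, the return horizon and the return volatility below are arbitrary assumptions, so the resulting E[max SR] will differ from 3.26):\n\n```\nimport numpy as np\n\nrng = np.random.default_rng(0)\nn_experiments = 200   # Monte Carlo repetitions\nn_trials = 1000       # backtests run per experiment; only the best one is reported\nn_days = 252          # one year of daily returns\n\nmax_sharpe_ratios = []\nfor _ in range(n_experiments):\n    # zero-mean returns: the true Sharpe ratio of every strategy is 0\n    returns = rng.normal(0.0, 0.01, size=(n_trials, n_days))\n    annual_sr = returns.mean(axis=1) \u002F returns.std(axis=1) * np.sqrt(n_days)\n    max_sharpe_ratios.append(annual_sr.max())\n\n# the reported (maximum) Sharpe ratio is strongly positive even though E[SR] = 0\nprint(np.mean(max_sharpe_ratios))\n```\n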
\nWhen more than one trial takes place, the expected value of the maximum Sharpe Ratio is greater than the expected value \nof the Sharpe Ratio from a random trial (when the true Sharpe Ratio = 0 and the variance > 0).\n| ![maxSR_across_uniform_strategies_8_1.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_3ce8de6de92d.png) | \n|:--:| \n| *Figure 8.1 Comparison of experimental and theoretical results from the False Strategy Theorem* |\n\n### The Deflated Sharpe Ratio\nThe main conclusion from the False Strategy Theorem is that, unless max\u003Csub>k\u003C\u002Fsub>{SR^\u003Csub>k\u003C\u002Fsub>} >> E[max\u003Csub>k\u003C\u002Fsub>{SR^\u003Csub>k\u003C\u002Fsub>}], \nthe discovered strategy is likely to be a false positive.\n\n### Type II errors under multiple testing\n### The interaction between type I and type II errors\n\n\n## Appendix A: Testing on Synthetic Data\n\nEither from resampling or Monte Carlo simulation.\n","## 安装库\n\n使用 `pip install -U git+https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers` 命令进行安装。\n\n```python\n>>> from Machine_Learning_for_Asset_Managers import ch2_fitKDE_find_best_bandwidth as c\n>>> import numpy as np\n>>> c.findOptimalBWidth(np.asarray([21,3]))\n{'bandwidth': 10.0}\n```\n\n# 资产管理者的机器学习\n\n本项目实现了Marcos López de Prado教授所著《资产管理者的机器学习（量化金融要素）》一书中的代码片段及习题。该书链接为：https:\u002F\u002Fwww.amazon.com\u002FMachine-Learning-Managers-Elements-Quantitative\u002Fdp\u002F1108792898。\n\n此项目主要用于个人学习。若希望直接应用书中的概念，建议前往Hudson & Thames公司，他们已在[mlfinlab](https:\u002F\u002Fgithub.com\u002Fhudson-and-thames\u002Fmlfinlab)中实现了这些概念及其他更多内容。\n\n如需实际应用，请参考以下仓库：[Machine-Learning-for-Asset-Managers-Oslo-Bors](https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers-Oslo-Bors)。\n\n注意：在第4章中，书中“最优聚类数”算法（ONC）的实现存在一个错误。（论文《利用无监督学习方法检测虚假投资策略》，de Prado和Lewis，2018年中的代码与此不同，但同样不正确）。相关讨论请见：https:\u002F\u002Fquant.stackexchange.com\u002Fquestions\u002F60486\u002Fbug-found-in-optimal-number-of-clusters-algorithm-from-de-prado-and-lewis-201。\n\nONC采用的子空间分治法可能存在潜在问题：当将某个子空间嵌入到具有较大特征值的空间中时，较大的空间可能会扭曲子空间中发现的聚类结构。ONC正是将子空间嵌入到由相关性矩阵中最大特征值构成的空间中，从而导致这一问题。更严谨的问题描述可参见：https:\u002F\u002Fmath.stackexchange.com\u002Fquestions\u002F4013808\u002Fmetric-on-clustering-of-correlation-matrix-using-silhouette-score\u002F4050616#4050616。\n\n因此，应进一步研究其他聚类算法，例如层次聚类。\n\n## 第2章 去噪与去偏\n\n马尔琴科-帕斯图尔理论概率密度函数与经验密度函数：\n| ![marcenko-pastur.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_fe79fe283334.png) | \n|:--:| \n| *图2.1：马尔琴科-帕斯图尔理论概率密度函数与经验密度函数：* |\n\n使用恒定残差特征值法对含信号的随机矩阵进行去噪。该方法通过固定随机特征值来实现。具体参见代码片段2.5。\n| ![eigenvalue_method.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_b47ba397c347.png) | \n|:--:| \n| *图2.2：应用残差特征值法前后特征值的对比：* |\n\n经过去偏后的协方差矩阵可用于计算最小方差组合。有效前沿是自最小方差组合开始的最小方差前沿的上半部分。而经去噪处理后的协方差矩阵则更为稳定，不易受变化影响。\n\n注：练习2.7：“扩展代码片段2.2中的fitKDE函数，使其能够通过交叉验证估计最佳带宽（bWidth）”。\n\n脚本ch2_fitKDE_find_bandwidth.py实现了上述过程，并生成了图2.3中的绿色核密度估计曲线：\n| ![gaussian_mp_excersize_2_7.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_3942e4b29a65.png) | \n|:--:| \n| *图2.3：计算得到的带宽（绿线）与直方图、PDF曲线。绿线更加平滑。找到的带宽为：0.03511191734215131* |\n\n来自代码片段2.3——含信号的随机矩阵：直方图展示了含信号的随机矩阵特征值的分布情况。随后，以fitKDE作为经验密度函数，计算理论概率密度函数的方差。因此，找到fitKDE中合适的带宽值，对于确定理论MP-PDF最可能的方差至关重要。\n| 
![fig_2_3_mp_with_signal.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_e8309300bdd5.png) | \n|:--:| \n| *图2.4：含信号特征值的直方图与PDF曲线* |\n\n\n\n## 第3章 距离度量\n\n* 距离度量的定义：\n   1. 不可区分元素同一性：d(x,y) = 0 => x=y\n   2. 对称性：d(x,y) = d(y,x)\n   3. 三角不等式。\n   - 1、2、3共同保证非负性：d(x,y) >= 0\n* 皮尔逊相关系数\n* 距离相关系数\n* 角距离\n* 信息论中的共依存\u002F熵依赖\n    - 交叉熵：H[X] = - Σ\u003Csub>s ∈ S\u003Csub>X\u003C\u002Fsub>\u003C\u002Fsub> p[x] log (p[x])\n    - 基拉巴-莱布勒散度：D\u003Csub>KL\u003C\u002Fsub>[p||q] = - Σ\u003Csub>s ∈ S\u003Csub>X\u003C\u002Fsub>\u003C\u002Fsub> p[x] log (q[x]\u002Fp[x]) = p[x] Σ\u003Csub>s ∈ S\u003C\u002Fsub> log (p[x]\u002Fq[x])\n    - 交叉熵：H\u003Csub>c\u003C\u002Fsub>[p||q] = H[x] = D\u003Csub>KL\u003C\u002Fsub>[p||q]\n    - 互信息：已知Y后X的不确定性降低：I[X,Y] = H[X] - H[X|Y] = H[X] + H[Y] - H[X,Y] = E\u003Csub>X\u003C\u002Fsub>[D\u003Csub>KL\u003C\u002Fsub>[p[y|x]||p[y]]]\n    - 信息差异：VI[X,Y] = H[X|Y] + H[Y|X] = H[X,Y] - I[X,Y]。它是给定一个变量时另一个变量的预期不确定性：VI[X,Y] = 0 \u003C=> X=Y\n    - 基拉巴-莱布勒散度并非距离度量，而信息差异则是。\n\n```python\n>>> ss.entropy([1.\u002F2,1.\u002F2], base=2)\n1.0\n>>> ss.entropy([1,0], base=2)\n0.0\n>>> ss.entropy([1.\u002F3,2.\u002F3], base=2)\n0.9182958340544894\n```\n1. 抛硬币时的信息量为1比特\n2. 确定性结果的信息量为0比特\n3. 不公平抛硬币时的信息量小于1比特\n\n\n* 角距离：p_d = sqrt(1\u002F2 - (1-rho(X, Y)))\n* 绝对角距离：p_d = sqrt(1\u002F2 - (1-|rho|(X, Y)))\n* 平方角距离：p_d = sqrt(1\u002F2 - (1-rho^2(X, Y)))\n\n![fig_3_1_angular_distance.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_6df476653b74.png)  ![fig_3_1_abs_squared_angular_distance.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_99dc7a859018.png) \n标准角距离更适合用于多头投资组合的应用场景。而平方角距离和绝对角距离则更适合多空投资组合。\n\n## 第4章 最优聚类\n\n使用无监督学习方法，最大化组内相似性并最小化组间相似性。考虑形状为N×F的矩阵X，其中N表示对象数量，F表示特征数量。利用这些特征计算N个对象之间的邻近度（相关性、互信息），形成一个N×N的邻近度矩阵。\n\n聚类算法主要分为两类：划分式和层次式：\n1. 连通性：层次聚类\n2. 质心：如K均值聚类\n3. 分布：高斯混合模型\n4. 密度：寻找连通的密集区域，如DBSCAN、OPTICS\n5. 子空间：基于特征和观测两个维度建模。[示例](https:\u002F\u002Fquantdare.com\u002Fbiclustering-time-series\u002F)\n\n\n生成随机块状相关性矩阵用于模拟具有相关性的金融工具。代码片段4.3展示了这一用途，并结合了片段4.1和4.2中定义的“最优聚类数”（ONC）算法。该算法无需预先设定聚类数量（与K均值不同），而是采用“肘部法”来确定停止增加聚类的时机。当组内相关性较高而组间相关性较低时，即可得到最优聚类数。[轮廓系数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSilhouette_(clustering))被用来最小化组内距离并最大化组间距离。\n| ![random_block_corr_matrix.jpg](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_f702338caf82.png) | \n|:--:| \n| *随机块状相关性矩阵。浅色表示高相关性，深色表示低相关性。在本例中，块数K=6，最小块大小minBlockSize=2，金融工具数量N=30* |\n| ![fig_4_1_random_block_correlation_matrix_onc.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_f69b2a9f748e.png) | \n| *将ONC算法应用于随机块状相关性矩阵。ONC能够找到所有聚类。* |\n\n## 第5章 金融标签\n\n* 固定时间窗口法\n* 时间柱法\n* 成交量柱法\n\n三重屏障法是指持仓直到：\n1. 未实现盈利目标达成\n2. 未实现亏损达到上限\n3. 
持仓超过最大柱数限制\n\n趋势扫描法的核心思想是识别趋势，并让其持续运行，直至趋势不再有效，而不设置任何障碍。\n\n| ![fig_5_1_trend_scanning.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_415853398646.png) | \n|:--:| \n| *带有高斯噪声的正弦波上的趋势扫描标签示例：* |\n\n| ![fig_5_2_trend_scanning_t_values.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_ea0b29d58690.png) | \n|:--:| \n| *使用t值进行趋势扫描，t值反映了对趋势的信心。1表示高度看涨，-1表示高度看跌。* |\n\n书中介绍的向前看算法之外，另一种方法是从最新数据点开始，向后回溯至指定窗口大小的数据范围。例如，若最新数据点位于索引20处，窗口大小为3到10天，则向后算法会从索引17至20逐次回溯至索引11至20，从而仅考虑最近期的信息。\n\n\n| ![fig_5_2_trend_scanning_t_values2.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_6395e3bdf0c6.png) | \n|:--:| \n| *使用向后回溯方式的趋势扫描及t值* |\n\n## 第6章 特征重要性分析\n\n\u003Ci>“p值并不衡量原假设或备择假设不成立的概率，也不代表结果的重要性。”\u003C\u002Fi>\n| ![fig_6_1_p_values_explanatory_vars.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_53635ccfa1d3.png) | \n|:--:| \n| *在一组包含信息性、冗余性和噪声性解释变量上计算的p值。解释变量并未呈现最高的p值。* |\n\n“回测并非研究工具，特征重要性才是。”（洛佩兹·德·普拉多）平均不纯度减少（MDI）算法解决了p值存在的四个问题中的三个：\n1. MDI不依赖于树结构、代数表达式，也不依赖于残差的随机性或分布特性（如y=b\u003Csub>0\u003C\u002Fsub>+b\u003Csub>1\u003C\u002Fsub>*x\u003Csub>i\u003C\u002Fsub>+&epsilon;）。\n2. 通常单一样本估计出的贝塔系数方差较大，而MDI通过随机森林集成中的多棵树进行自助采样，从而降低方差。\n3. 在MDI中，目标不是估计描述原假设概率的代数方程系数（b_hat_0、b_hat_1）。\n4. MDI不进行交叉验证，因此无法纠正样本内计算偏差。\n\n| ![fig_6_2_mdi_example.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_b7d1112d7181.png) | \n|:--:| \n| *MDI算法示例* |\n\n图6.4显示，ONC正确识别出六个相关聚类（每个信息性特征对应一个聚类，再加上一个噪声特征聚类），并将冗余特征归入其源自的信息性特征所在的聚类。由于各聚类间的相关性较低，无需用残差替换原有特征。\n| ![fig_6_4_feature_clustering.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_5a2222121888.png) | \n|:--:| \n\n接下来，将聚类后的MDI方法应用于聚类后的数据：\n| ![fig_6_5_clustered_MDI.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_4e7b0b5c4fa8.png) | \n|:--:| \n| *图6.5 聚类后的MDI* |\n\n相比非聚类的MDI，聚类后的MDI效果更好。最后，将聚类后的MDA方法应用于同一数据：\n| ![fig_6_6_clustered_MDA.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_53db4865a64c.png) | \n|:--:| \n| *图6.6 聚类后的MDA* |\n\n结论：与噪声特征相关的C_5聚类并不重要，其余各聚类的重要性相近。\n\n## 第7章 投资组合构建\n\n凸优化方法可用于计算最小方差投资组合和最大夏普比率投资组合。\n\n条件数的定义：最大特征值与最小特征值之比的绝对值，即A_n_n \u002F A_m_m。条件数反映了由协方差结构引起的不稳定程度。\n\n迹的定义：矩阵对角线元素之和，即trace = sum(diag(A))。\n\n高度相关的时序数据会导致相关性矩阵的条件数较高。\n\n### 马科维茨的诅咒\n相关性矩阵C只有在相关系数$\\ro = 0$——即完全不相关时——才是稳定的。\n\n分层风险平价（HRP）在样本外蒙特卡洛实验中表现优于马科维茨模型，但在样本内却并非最优。\n\n代码片段7.1展示了由信号引起的相关性矩阵不稳定现象。\n```\n>>> corr0 = mc.formBlockMatrix(2, 2, .5)\n>>> corr0\narray([[1. , 0.5, 0. , 0. ],\n       [0.5, 1. , 0. , 0. ],\n       [0. , 0. , 1. , 0.5],\n       [0. , 0. , 0.5, 1. ]])\n>>> eVal, eVec = np.linalg.eigh(corr0)\n>>> print(max(eVal)\u002Fmin(eVal))\n3.0\n```\n\n| ![fig_7_1_block_diagonal.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_e713f2fd827b.png) | \n|:--:| \n| *图7.1 块对角相关性矩阵热图* |\n\n代码片段7.2创建了相同的块对角矩阵，但其中一个区块占主导地位。然而，条件数仍然相同。\n```\n>>> corr0 = block_diag(mc.formBlockMatrix(1,2, .5))\n>>> corr1 = mc.formBlockMatrix(1,2, .0)\n>>> corr0 = block_diag(corr0, corr1)\n>>> corr0\narray([[1. , 0.5, 0. , 0. ],\n       [0.5, 1. , 0. , 0. ],\n       [0. , 0. , 1. , 0. ],\n       [0. , 0. , 0. , 1. 
]])\n>>> eVal, eVec = np.linalg.eigh(corr0)\n>>> matrix_condition_number = max(eVal)\u002Fmin(eVal)\n>>> print(matrix_condition_number)\n3.0\n```\n这表明，仅降低两个区块中的一个区块内部的相关性，并不能降低矩阵的条件数。这也说明，马科维茨解中的不稳定性可以追溯到那些占主导地位的区块。\n\n| ![fig_7_2_block_diagonal.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_d213a438e0e4.png) | \n|:--:| \n| *图7.2 占主导地位的块对角相关性矩阵热图* |\n\n### 嵌套聚类优化算法（NCO）\n\nNCO提供了一种策略，用于解决马科维茨诅咒对现有均值方差配置方法的影响。\n1. 步骤：对相关性矩阵进行聚类\n2. 步骤：使用去噪后的协方差矩阵计算簇内最优配置\n3. 步骤：使用接近对角矩阵的简化协方差矩阵计算簇间最优配置，此时优化问题接近于理想化的马科维茨情形，即当$\\ro$ = 0时\n\n## 第8章 测试集过拟合\n\n回测是对投资策略在过去表现的一种历史模拟。在多重测试的情况下，回测会受到选择偏差的影响：研究人员会在历史数据上运行数百万次测试，并报告其中表现最好的结果（即过拟合的结果）。本章研究如何衡量选择偏差的影响。\n\n### 精确率与召回率\n### 多重测试下的精确率与召回率\n### 夏普比率\n\n夏普比率 = μ\u002Fσ\n\n### “虚假策略”定理\n\n研究人员可能会运行大量历史模拟，并只报告表现最佳的一次（最大夏普比率）。然而，最大夏普比率的分布并不等于预期的夏普比率。因此，存在多重重复测试下的选择偏差（SBuMT）。\n\n### 实验结果\n\n一项蒙特卡洛实验表明，即使预期夏普比率是0（E[sharp_ratio]），最大夏普比率的期望值也会增加（E[max(sharp_ratio)] = 3.26）。这意味着，即便实际上并不存在有效的投资策略，某些策略仍可能显得颇具前景。\n\n当进行多次试验时，最大夏普比率的期望值将高于随机试验中夏普比率的期望值（当真实夏普比率=0且方差>0时）。\n| ![maxSR_across_uniform_strategies_8_1.png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_readme_3ce8de6de92d.png) | \n|:--:| \n| *图8.1 虚假策略定理的实验结果与理论结果对比* |\n\n### 通胀夏普比率\n“虚假策略”定理的主要结论是：除非$max\u003Csub>k\u003C\u002Fsub>{SR^\u003Csub>k\u003C\u002Fsub>}>>E[max\u003Csub>k\u003C\u002Fsub>{SR^\u003Csub>k\u003C\u002Fsub>}]$，否则所发现的策略很可能是假阳性。\n\n### 多重测试下的第二类错误\n### 第一类错误与第二类错误之间的相互作用\n\n\n## 附录A：基于合成数据的测试\n\n无论是通过重采样还是蒙特卡洛方法","# Machine-Learning-for-Asset-Managers 快速上手指南\n\n本指南旨在帮助中国开发者快速安装并使用 `Machine-Learning-for-Asset-Managers` 库。该库实现了 Marcos López de Prado 教授著作《Machine Learning for Asset Managers》中的代码片段和练习，专注于量化金融中的机器学习应用（如去噪、距离度量、聚类和特征重要性分析）。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**：Linux, macOS 或 Windows。\n*   **Python 版本**：推荐 Python 3.7 及以上版本。\n*   **前置依赖**：\n    *   `pip` (Python 包管理工具)\n    *   `numpy` (用于数值计算，库内部会自动处理部分依赖，但建议预先安装)\n    *   `git` (用于从 GitHub 直接安装)\n\n> **提示**：国内用户若遇到网络连接问题，可配置 pip 使用国内镜像源（如清华源、阿里源）加速下载。\n\n## 安装步骤\n\n该库未发布到 PyPI，需通过 GitHub 源码直接安装。\n\n### 1. 基础安装\n打开终端或命令行工具，执行以下命令：\n\n```bash\npip install -U git+https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers\n```\n\n### 2. 
国内加速安装（推荐）\n如果您在中国大陆地区，建议使用国内镜像源以提高下载速度和稳定性。以下命令使用清华大学开源软件镜像站：\n\n```bash\npip install -U git+https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 基本使用\n\n安装完成后，您可以直接在 Python 环境中导入模块并使用其功能。以下示例演示了如何调用第二章中的函数，通过交叉验证寻找核密度估计（KDE）的最佳带宽（bandwidth）。\n\n### 最简单的使用示例\n\n```python\n>>> from Machine_Learning_for_Asset_Managers import ch2_fitKDE_find_best_bandwidth as c\n>>> import numpy as np\n\n# 调用 findOptimalBWidth 函数计算最佳带宽\n# 输入为一个包含数据的 numpy 数组\n>>> result = c.findOptimalBWidth(np.asarray([21, 3]))\n\n# 输出结果是一个字典，包含计算出的 bandwidth 值\n>>> print(result)\n{'bandwidth': 10.0}\n```\n\n### 功能简述\n该库涵盖了书中多个核心章节的实现，主要包括：\n*   **第 2 章**：去噪与去调（Denoising and Detoning），包括 Marcenko-Pastur 分布拟合。\n*   **第 3 章**：距离度量，支持皮尔逊相关系数、角距离、互信息等。\n*   **第 4 章**：最优聚类（Optimal Clustering），实现了 ONC 算法以确定最佳聚类数量。\n*   **第 5 章**：金融标签生成，包括三重屏障法（Triple-Barrier Method）和趋势扫描法。\n*   **第 6 章**：特征重要性分析，实现了平均不纯度减少（MDI）算法。\n\n> **注意**：本项目主要用于学习和复现书中的概念。对于生产环境的量化交易应用，建议参考更成熟的 [mlfinlab](https:\u002F\u002Fgithub.com\u002Fhudson-and-thames\u002Fmlfinlab) 库。此外，书中第 4 章的 ONC 算法实现存在已知理论缺陷（在处理大特征值子空间嵌入时可能失真），使用时请结合层次聚类等替代方案进行评估。","某量化对冲基金的分析师正在构建多资产最小方差组合，试图从充满噪声的历史收益率数据中提取稳定的协方差矩阵以优化配置。\n\n### 没有 Machine-Learning-for-Asset-Managers 时\n- **噪声干扰严重**：直接使用原始样本协方差矩阵，导致随机噪声被误判为有效信号，使得投资组合对历史数据过度拟合。\n- **参数选择盲目**：在使用核密度估计（KDE）去噪时，缺乏科学的交叉验证机制来确定最佳带宽（bandwidth），只能凭经验猜测，导致概率密度函数拟合不准。\n- **组合表现不稳定**：由于输入矩阵包含大量虚假相关性，计算出的最优权重在实盘中频繁剧烈调仓，交易成本高昂且夏普比率低下。\n- **理论复现困难**：团队需手动复现 Marcos López de Prado 教授书中复杂的 Marchenko-Pastur 分布推导和特征值清洗算法，开发周期长且易出错。\n\n### 使用 Machine-Learning-for-Asset-Managers 后\n- **精准信号提取**：利用工具内置的固定残差特征值方法，自动识别并剔除随机噪声特征值，保留真实的市场信号结构。\n- **自动化参数寻优**：调用 `ch2_fitKDE_find_best_bandwidth` 模块，通过交叉验证自动计算出如 0.0351 这样的最优带宽，确保密度估计平滑且准确。\n- **稳健资产配置**：基于去噪后的协方差矩阵构建最小方差组合，有效前沿更加稳定，显著降低了实盘中的无效调仓频率。\n- **快速落地前沿理论**：直接复用书中第 2 章经过验证的代码片段，将原本数周的算法研发时间缩短至几小时，让团队能专注于策略逻辑而非底层数学实现。\n\nMachine-Learning-for-Asset-Managers 通过将顶尖量化理论转化为可执行的代码工具，帮助机构在嘈杂的市场数据中构建出真正稳健的投资组合。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Femoen_Machine-Learning-for-Asset-Managers_2b46c391.png","emoen","Endre Moen","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Femoen_dff19af5.jpg","Data scientist @ Bkk","Bkk","Bergen, Norway",null,"Endre_Moen","https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fendre-moen-3b74092\u002F","https:\u002F\u002Fgithub.com\u002Femoen",[84],{"name":85,"color":86,"percentage":87},"Python","#3572A5",100,621,189,"2026-04-05T15:43:08","Apache-2.0",1,"","未说明",{"notes":96,"python":94,"dependencies":97},"该工具是《Machine Learning for Asset Managers》一书的代码实现，主要通过 pip 从 GitHub 安装。README 中未明确指定操作系统、Python 版本、GPU 或内存的具体需求，仅展示了使用 numpy 进行计算的示例。作者指出书中第 4 章的 ONC 算法存在已知缺陷，并建议参考 Hudson & Thames 的 mlfinlab 库以获取更完善的工业级实现。",[98],"numpy",[14],[101,102,103,104,105],"correlation","machine-learning","eigenvalue","clusters","density","2026-03-27T02:49:30.150509","2026-04-06T22:51:56.686324",[109,114,119,124,129,133],{"id":110,"question_zh":111,"answer_zh":112,"source_url":113},20225,"运行 ch7_portfolio_construction.py 时出现 'AttributeError: NoneType object has no attribute labels_' 错误，如何解决？","这是因为在 ch4_optimal_clustering.py 的 clusterKMeansBase 函数中，最小聚类数量（min number of clusters）被错误地设置为 4，而实际应设置为 2。请修改该参数为 2 即可解决此问题。","https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers\u002Fissues\u002F1",{"id":115,"question_zh":116,"answer_zh":117,"source_url":118},20226,"ch6_feature_importance_analysis.py 中引用模块时报错 'ModuleNotFoundError: No module named 
ch4_optimal_clustering'，该如何修复？","当作为包使用时，需要将导入语句从绝对引用改为显式的相对引用。请将代码 `from ch4_optimal_clustering import clusterKMeansBase` 修改为 `from .ch4_optimal_clustering import clusterKMeansBase`。","https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers\u002Fissues\u002F4",{"id":120,"question_zh":121,"answer_zh":122,"source_url":123},20227,"ch6_feature_importance_analysis.py 中出现 'AttributeError: BaggingClassifier object has no attribute enumerators_' 错误，原因是什么？","这是由于代码中使用了错误的属性名称。BaggingClassifier 对象没有 `enumerators_` 属性，正确的属性名应为 `estimators_`。请将代码中的 `fit.enumerators_` 修改为 `fit.estimators_`。","https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers\u002Fissues\u002F6",{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},20228,"ch3_metrics.py 中存在拼写错误导致代码无法运行，具体有哪些需要修正的地方？","需要修正两处错误：1. 函数名拼写错误，将 `mutualInfor` 改为 `mutualInfo`；2. 变量名错误，在计算边际熵时，`np.histogram` 的 bins 参数应使用 `bXY` 而不是未定义的 `bins`。即把 `np.histogram(x, bins)` 改为 `np.histogram(x, bXY)`。","https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers\u002Fissues\u002F8",{"id":130,"question_zh":131,"answer_zh":132,"source_url":128},20229,"ch7_portfolio_construction.py 中计算簇内最优分配时使用了错误的矩阵，应该用协方差矩阵还是相关系数矩阵？","代码中存在逻辑错误，计算簇内分配时应使用去噪后的相关系数矩阵（corr1），而不是协方差矩阵（cov1）。请将代码 `minVarPort(cov1.loc[clstrs[i], clstrs[i]])` 修改为 `minVarPort(corr1.loc[clstrs[i], clstrs[i]])`。",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},20230,"ch6_feature_importance_analysis.py 中的 cMDA 函数里，计算平均分数的代码是否应该放在训练\u002F测试循环之外？","是的，该部分代码应当移出训练\u002F测试循环。因为平均分数是基于簇级别计算的交叉验证结果，需要在循环结束后统一计算均值和标准差。维护者已在相关提交中修复了此逻辑问题。","https:\u002F\u002Fgithub.com\u002Femoen\u002FMachine-Learning-for-Asset-Managers\u002Fissues\u002F3",[]]