[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-stefan-jansen--machine-learning-for-trading":3,"tool-stefan-jansen--machine-learning-for-trading":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",149489,2,"2026-04-10T11:32:46",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 
协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":77,"owner_email":75,"owner_twitter":78,"owner_website":79,"owner_url":80,"languages":81,"stars":96,"forks":97,"last_commit_at":98,"license":75,"difficulty_score":10,"env_os":99,"env_gpu":100,"env_ram":101,"env_deps":102,"category_tags":114,"github_topics":116,"view_count":32,"oss_zip_url":75,"oss_zip_packed_at":75,"status":17,"created_at":129,"updated_at":130,"faqs":131,"releases":166},6229,"stefan-jansen\u002Fmachine-learning-for-trading","machine-learning-for-trading","Code for Machine Learning for Algorithmic Trading, 2nd edition.","machine-learning-for-trading 是《机器学习算法交易（第 2 版）》一书的官方配套代码库，旨在帮助读者将机器学习技术实际应用于量化交易策略的开发。它解决了从理论到实践的跨越难题，提供了超过 150 个可执行的 Jupyter Notebook，涵盖数据获取、金融特征工程、模型训练、回测及策略评估的全流程。\n\n无论是希望深入量化领域的开发者、金融研究人员，还是对算法交易感兴趣的数据科学家，都能从中获益。该项目不仅演示了如何利用线性回归、深度学习（如 CNN、RNN）和深度强化学习等先进算法预测资产收益，还独特地展示了如何从 SEC 
文件、财报会议记录等非结构化文本数据中提取交易信号，甚至利用生成对抗网络（GAN）合成数据。\n\n通过复现前沿研究成果并构建多空策略，machine-learning-for-trading 为用户提供了一套全面且实用的工具箱。建议配合原书阅读，以便更好地理解数学统计原理与金融背景知识，从而设计出属于自己的智能化交易系统。此外，项目还拥有活跃的在线社区，方便用户交流心得与解决实战问题。","# ML for Trading - 2\u003Csup>nd\u003C\u002Fsup> Edition\n\nThis [book](https:\u002F\u002Fwww.amazon.com\u002FMachine-Learning-Algorithmic-Trading-alternative\u002Fdp\u002F1839217715?pf_rd_r=GZH2XZ35GB3BET09PCCA&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=91a679c7-f069-4a6e-bdbb-a2b3f548f0c8&pd_rd_w=2B0Q0&pd_rd_wg=GMY5S&ref_=pd_gw_ci_mcx_mr_hp_d) aims to show how ML can add value to algorithmic trading strategies in a practical yet comprehensive way. It covers a broad range of ML techniques from linear regression to deep reinforcement learning and demonstrates how to build, backtest, and evaluate a trading strategy driven by model predictions.  \n\nIn four parts with **23 chapters plus an appendix**, it covers on **over 800 pages**:\n- important aspects of data sourcing, **financial feature engineering**, and portfolio management, \n- the design and evaluation of long-short **strategies based on supervised and unsupervised ML algorithms**,\n- how to extract tradeable signals from **financial text data** like SEC filings, earnings call transcripts or financial news,\n- using **deep learning** models like CNN and RNN with market and alternative data, how to generate synthetic data with generative adversarial networks, and training a trading agent using deep reinforcement learning\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fwww.amazon.com\u002FMachine-Learning-Algorithmic-Trading-alternative\u002Fdp\u002F1839217715?pf_rd_r=GZH2XZ35GB3BET09PCCA&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=91a679c7-f069-4a6e-bdbb-a2b3f548f0c8&pd_rd_w=2B0Q0&pd_rd_wg=GMY5S&ref_=pd_gw_ci_mcx_mr_hp_d\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fstefan-jansen_machine-learning-for-trading_readme_8698c6363773.png\" 
width=\"75%\">\n\u003C\u002Fa>\n\u003C\u002Fp>\n\nThis repo contains **over 150 notebooks** that put the concepts, algorithms, and use cases discussed in the book into action. They provide numerous examples that show:\n- how to work with and extract signals from market, fundamental and alternative text and image data, \n- how to train and tune models that predict returns for different asset classes and investment horizons, including how to replicate recently published research, and \n- how to design, backtest, and evaluate trading strategies.\n\n> We **highly recommend** reviewing the notebooks while reading the book; they are usually in an executed state and often contain additional information not included due to space constraints.  \n\nIn addition to the information in this repo, the book's [website](ml4trading.io) contains chapter summaries and additional information.\n\n## Join the ML4T Community!\n\nTo make it easy for readers to ask questions about the book's content and code examples, as well as the development and implementation of their own strategies and industry developments, we are hosting an online [platform](https:\u002F\u002Fexchange.ml4trading.io\u002F).\n\nPlease [join](https:\u002F\u002Fexchange.ml4trading.io\u002F) our community and connect with fellow traders interested in leveraging ML for trading strategies, share your experience, and learn from each other! 
\n\n## What's new in the 2\u003Csup>nd\u003C\u002Fsup> Edition?\n\nFirst and foremost, this [book](https:\u002F\u002Fwww.amazon.com\u002FMachine-Learning-Algorithmic-Trading-alternative\u002Fdp\u002F1839217715?pf_rd_r=VMKJPZC4N36TTZZCWATP&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=8f331266-0d21-4c76-a3eb-d2e61d23bb31&pd_rd_w=kVGNF&pd_rd_wg=LYLKH&ref_=pd_gw_ci_mcx_mr_hp_d) demonstrates how you can extract signals from a diverse set of data sources and design trading strategies for different asset classes using a broad range of supervised, unsupervised, and reinforcement learning algorithms. It also provides relevant mathematical and statistical knowledge to facilitate the tuning of an algorithm or the interpretation of the results. Furthermore, it covers the financial background that will help you work with market and fundamental data, extract informative features, and manage the performance of a trading strategy.\n\nFrom a practical standpoint, the 2nd edition aims to equip you with the conceptual understanding and tools to develop your own ML-based trading strategies. To this end, it frames ML as a critical element in a process rather than a standalone exercise, introducing the end-to-end ML for trading workflow from data sourcing, feature engineering, and model optimization to strategy design and backtesting.\n\nMore specifically, the ML4T workflow starts with generating ideas for a well-defined investment universe, collecting relevant data, and extracting informative features. It also involves designing, tuning, and evaluating ML models suited to the predictive task. Finally, it requires developing trading strategies to act on the models' predictive signals, as well as simulating and evaluating their performance on historical data using a backtesting engine. 
Once you decide to execute an algorithmic strategy in a real market, you will find yourself iterating over this workflow repeatedly to incorporate new information and a changing environment.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FkcgItgp.png\" width=\"75%\">\n\u003C\u002Fp>\n\nThe [second edition](https:\u002F\u002Fwww.amazon.com\u002FMachine-Learning-Algorithmic-Trading-alternative\u002Fdp\u002F1839217715?pf_rd_r=GZH2XZ35GB3BET09PCCA&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=91a679c7-f069-4a6e-bdbb-a2b3f548f0c8&pd_rd_w=2B0Q0&pd_rd_wg=GMY5S&ref_=pd_gw_ci_mcx_mr_hp_d)'s emphasis on the ML4T workflow translates into a new chapter on [strategy backtesting](08_ml4t_workflow), a new [appendix](24_alpha_factor_library) describing over 100 different alpha factors, and many new practical applications. We have also rewritten most of the existing content for clarity and readability. \n\nThe trading applications now use a broader range of data sources beyond daily US equity prices, including international stocks and ETFs. It also demonstrates how to use ML for an intraday strategy with minute-frequency equity data. Furthermore, it extends the coverage of alternative data sources to include SEC filings for sentiment analysis and return forecasts, as well as satellite images to classify land use. \n\nAnother innovation of the second edition is to replicate several trading applications recently published in top journals: \n- [Chapter 18](18_convolutional_neural_nets) demonstrates how to apply convolutional neural networks to time series converted to image format for return predictions based on [Sezer and Ozbahoglu](https:\u002F\u002Fwww.researchgate.net\u002Fpublication\u002F324802031_Algorithmic_Financial_Trading_with_Deep_Convolutional_Neural_Networks_Time_Series_to_Image_Conversion_Approach) (2018). 
\n- [Chapter 20](20_autoencoders_for_conditional_risk_factors) shows how to extract risk factors conditioned on stock characteristics for asset pricing using autoencoders based on [Autoencoder Asset Pricing Models](https:\u002F\u002Fwww.aqr.com\u002FInsights\u002FResearch\u002FWorking-Paper\u002FAutoencoder-Asset-Pricing-Models) by Shihao Gu, Bryan T. Kelly, and Dacheng Xiu (2019), and \n- [Chapter 21](21_gans_for_synthetic_time_series) shows how to create synthetic training data using generative adversarial networks based on [Time-series Generative Adversarial Networks](https:\u002F\u002Fpapers.nips.cc\u002Fpaper\u002F8789-time-series-generative-adversarial-networks) by Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar (2019).\n\nAll applications now use the latest available (at the time of writing) software versions such as pandas 1.0 and TensorFlow 2.2. There is also a customized version of Zipline that makes it easy to include machine learning model predictions when designing a trading strategy.\n\n## Installation, data sources and bug reports\n\nThe code examples rely on a wide range of Python libraries from the data science and finance domains.\n\nIt is not necessary to try and install all libraries at once because this increases the likelihood of encountering version conflicts. Instead, we recommend that you install the libraries required for a specific chapter as you go along.\n\n> Update March 2022: `zipline-reloaded`, `pyfolio-reloaded`, `alphalens-reloaded`, and `empyrical-reloaded` are now available on the `conda-forge` channel. The channel `ml4t` only contains outdated versions and will soon be removed.\n\n> Update April 2021: with the update of [Zipline](https:\u002F\u002Fzipline.ml4trading.io), it is no longer necessary to use Docker. 
The installation instructions now refer to OS-specific environment files that should simplify your running of the notebooks.\n\n> Update February 2021: code sample release 2.0 updates the conda environments provided by the Docker image to Python 3.8, Pandas 1.2, and TensorFlow 1.2, among others; the Zipline backtesting environment now uses Python 3.6.\n\n- The [installation](installation\u002FREADME.md) directory contains detailed instructions on setting up and using a Docker image to run the notebooks. It also contains configuration files for setting up various `conda` environments and installing the packages used in the notebooks directly on your machine if you prefer (and, depending on your system, are prepared to go the extra mile).\n- To download and preprocess many of the data sources used in this book, see the instructions in the [README](data\u002FREADME.md) file alongside various notebooks in the [data](data) directory.\n\n> If you have any difficulties installing the environments, downloading the data or running the code, please raise a **GitHub issue** in the repo ([here](https:\u002F\u002Fgithub.com\u002Fstefan-jansen\u002Fmachine-learning-for-trading\u002Fissues)). Working with GitHub issues is described [here](https:\u002F\u002Fguides.github.com\u002Ffeatures\u002Fissues\u002F).\n\n> **Update**: You can download the **[algoseek](https:\u002F\u002Fwww.algoseek.com)** data used in the book [here](https:\u002F\u002Fwww.algoseek.com\u002Fml4t-book-data.html). See instructions for preprocessing in [Chapter 2](02_market_and_fundamental_data\u002F02_algoseek_intraday\u002FREADME.md) and an intraday example with a gradient boosting model in [Chapter 12](12_gradient_boosting_machines\u002F10_intraday_features.ipynb).  \n\n> **Update**: The [figures](figures) directory contains color versions of the charts used in the book. 
\n\n# Outline & Chapter Summary\n\nThe [book](https:\u002F\u002Fwww.amazon.com\u002FMachine-Learning-Algorithmic-Trading-alternative\u002Fdp\u002F1839217715?pf_rd_r=GZH2XZ35GB3BET09PCCA&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=91a679c7-f069-4a6e-bdbb-a2b3f548f0c8&pd_rd_w=2B0Q0&pd_rd_wg=GMY5S&ref_=pd_gw_ci_mcx_mr_hp_d) has four parts that address different challenges that arise when sourcing and working with market, fundamental and alternative data, developing ML solutions to various predictive tasks in the trading context, and designing and evaluating a trading strategy that relies on predictive signals generated by an ML model.\n\n> The directory for each chapter contains a README with additional information on content, code examples and additional resources.  \n\n[Part 1: From Data to Strategy Development](#part-1-from-data-to-strategy-development)\n* [01 Machine Learning for Trading: From Idea to Execution](#01-machine-learning-for-trading-from-idea-to-execution)\n* [02 Market & Fundamental Data: Sources and Techniques](#02-market--fundamental-data-sources-and-techniques)\n* [03 Alternative Data for Finance: Categories and Use Cases](#03-alternative-data-for-finance-categories-and-use-cases)\n* [04 Financial Feature Engineering: How to research Alpha Factors](#04-financial-feature-engineering-how-to-research-alpha-factors)\n* [05 Portfolio Optimization and Performance Evaluation](#05-portfolio-optimization-and-performance-evaluation)\n\n[Part 2: Machine Learning for Trading: Fundamentals](#part-2-machine-learning-for-trading-fundamentals)\n* [06 The Machine Learning Process](#06-the-machine-learning-process)\n* [07 Linear Models: From Risk Factors to Return Forecasts](#07-linear-models-from-risk-factors-to-return-forecasts)\n* [08 The ML4T Workflow: From Model to Strategy Backtesting](#08-the-ml4t-workflow-from-model-to-strategy-backtesting)\n* [09 Time Series Models for Volatility Forecasts and Statistical 
Arbitrage](#09-time-series-models-for-volatility-forecasts-and-statistical-arbitrage)\n* [10 Bayesian ML: Dynamic Sharpe Ratios and Pairs Trading](#10-bayesian-ml-dynamic-sharpe-ratios-and-pairs-trading)\n* [11 Random Forests: A Long-Short Strategy for Japanese Stocks](#11-random-forests-a-long-short-strategy-for-japanese-stocks)\n* [12 Boosting your Trading Strategy](#12-boosting-your-trading-strategy)\n* [13 Data-Driven Risk Factors and Asset Allocation with Unsupervised Learning](#13-data-driven-risk-factors-and-asset-allocation-with-unsupervised-learning)\n\n[Part 3: Natural Language Processing for Trading](#part-3-natural-language-processing-for-trading)\n* [14 Text Data for Trading: Sentiment Analysis](#14-text-data-for-trading-sentiment-analysis)\n* [15 Topic Modeling: Summarizing Financial News](#15-topic-modeling-summarizing-financial-news)\n* [16 Word embeddings for Earnings Calls and SEC Filings](#16-word-embeddings-for-earnings-calls-and-sec-filings)\n\n[Part 4: Deep & Reinforcement Learning](#part-4-deep--reinforcement-learning)\n* [17 Deep Learning for Trading](#17-deep-learning-for-trading)\n* [18 CNN for Financial Time Series and Satellite Images](#18-cnn-for-financial-time-series-and-satellite-images)\n* [19 RNN for Multivariate Time Series and Sentiment Analysis](#19-rnn-for-multivariate-time-series-and-sentiment-analysis)\n* [20 Autoencoders for Conditional Risk Factors and Asset Pricing](#20-autoencoders-for-conditional-risk-factors-and-asset-pricing)\n* [21 Generative Adversarial Nets for Synthetic Time Series Data](#21-generative-adversarial-nets-for-synthetic-time-series-data)\n* [22 Deep Reinforcement Learning: Building a Trading Agent](#22-deep-reinforcement-learning-building-a-trading-agent)\n* [23 Conclusions and Next Steps](#23-conclusions-and-next-steps)\n* [24 Appendix - Alpha Factor Library](#24-appendix---alpha-factor-library)\n\n## Part 1: From Data to Strategy Development\n\nThe first part provides a framework for developing 
trading strategies driven by machine learning (ML). It focuses on the data that power the ML algorithms and strategies discussed in this book, outlines how to engineer and evaluate features suitable for ML models, and explains how to manage and measure a portfolio's performance while executing a trading strategy.\n\n### 01 Machine Learning for Trading: From Idea to Execution\n\nThis [chapter](01_machine_learning_for_trading) explores industry trends that have led to the emergence of ML as a source of competitive advantage in the investment industry. We will also look at where ML fits into the investment process to enable algorithmic trading strategies. \n\nMore specifically, it covers the following topics:\n- Key trends behind the rise of ML in the investment industry\n- The design and execution of a trading strategy that leverages ML\n- Popular use cases for ML in trading\n\n### 02 Market & Fundamental Data: Sources and Techniques\n\nThis [chapter](02_market_and_fundamental_data) shows how to work with market and fundamental data and describes critical aspects of the environment that they reflect. For example, familiarity with various order types and the trading infrastructure matters not only for the interpretation of the data but also for the correct design of backtest simulations. We also illustrate how to use Python to access and manipulate trading and financial statement data.  \n\nPractical examples demonstrate how to work with NASDAQ tick data and Algoseek minute bar data with a rich set of attributes capturing the demand-supply dynamic that we will later use for an ML-based intraday strategy. 
We also cover various data provider APIs and how to source financial statement information from the SEC.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FenaSo0C.png\" title=\"Order Book\" width=\"50%\"\u002F>\n\u003C\u002Fp>\nIn particular, this chapter covers:\n\n- How market data reflects the structure of the trading environment\n- Working with intraday trade and quotes data at minute frequency\n- Reconstructing the **limit order book** from tick data using NASDAQ ITCH \n- Summarizing tick data using various types of bars\n- Working with eXtensible Business Reporting Language (XBRL)-encoded **electronic filings**\n- Parsing and combining market and fundamental data to create a P\u002FE series\n- How to access various market and fundamental data sources using Python\n\n### 03 Alternative Data for Finance: Categories and Use Cases\n\nThis [chapter](03_alternative_data) outlines categories and use cases of alternative data, describes criteria to assess the exploding number of sources and providers, and summarizes the current market landscape. \n\nIt also demonstrates how to create alternative data sets by scraping websites, such as collecting earnings call transcripts for use with natural language processing (NLP) and sentiment analysis algorithms in the third part of the book.\n \nMore specifically, this chapter covers:\n\n- Which new sources of signals have emerged during the alternative data revolution\n- How individuals, businesses, and sensors generate a diverse set of alternative data\n- Important categories and providers of alternative data\n- Evaluating how the burgeoning supply of alternative data can be used for trading\n- Working with alternative data in Python, such as by scraping the internet\n\n### 04 Financial Feature Engineering: How to research Alpha Factors\n\nIf you are already familiar with ML, you know that feature engineering is a crucial ingredient for successful predictions. 
It matters at least as much in the trading domain, where academic and industry researchers have investigated for decades what drives asset markets and prices, and which features help to explain or predict price movements.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FUCu4Huo.png\" width=\"70%\">\n\u003C\u002Fp>\n\nThis [chapter](04_alpha_factor_research) outlines the key takeaways of this research as a starting point for your own quest for alpha factors. It also presents essential tools to compute and test alpha factors, highlighting how the NumPy, pandas, and TA-Lib libraries facilitate the manipulation of data, and presents popular smoothing techniques like wavelets and the Kalman filter that help reduce noise in data. After reading it, you will know about:\n- Which categories of factors exist, why they work, and how to measure them,\n- Creating alpha factors using NumPy, pandas, and TA-Lib,\n- How to de-noise data using wavelets and the Kalman filter,\n- Using Zipline to test individual and multiple alpha factors,\n- How to use [Alphalens](https:\u002F\u002Fgithub.com\u002Fquantopian\u002Falphalens) to evaluate predictive performance.\n \n### 05 Portfolio Optimization and Performance Evaluation\n\nAlpha factors generate signals that an algorithmic strategy translates into trades, which, in turn, produce long and short positions. The returns and risk of the resulting portfolio determine whether the strategy meets the investment objectives.\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FE2h63ZB.png\" width=\"65%\">\n\u003C\u002Fp>\n\nThere are several approaches to optimizing portfolios. These include the application of machine learning (ML) to learn hierarchical relationships among assets and treat them as complements or substitutes when designing the portfolio's risk profile. 
This [chapter](05_strategy_evaluation) covers:\n- How to measure portfolio risk and return\n- Managing portfolio weights using mean-variance optimization and alternatives\n- Using machine learning to optimize asset allocation in a portfolio context\n- Simulating trades and creating a portfolio based on alpha factors using Zipline\n- How to evaluate portfolio performance using [pyfolio](https:\u002F\u002Fquantopian.github.io\u002Fpyfolio\u002F)\n\n## Part 2: Machine Learning for Trading: Fundamentals\n\nThe second part covers the fundamental supervised and unsupervised learning algorithms and illustrates their application to trading strategies. It also introduces the Quantopian platform that allows you to leverage and combine the data and ML techniques developed in this book to implement algorithmic strategies that execute trades in live markets.\n\n### 06 The Machine Learning Process\n\nThis [chapter](06_machine_learning_process) kicks off Part 2, which illustrates how you can use a range of supervised and unsupervised ML models for trading. We will explain each model's assumptions and use cases before we demonstrate relevant applications using various Python libraries. \n\nThere are several aspects that many of these models and their applications have in common. This chapter covers these common aspects so that we can focus on model-specific usage in the following chapters. It sets the stage by outlining how to formulate, train, tune, and evaluate the predictive performance of ML models as a systematic workflow. 
The content includes:\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002F5qisClE.png\" width=\"65%\">\n\u003C\u002Fp>\n\n- How supervised and unsupervised learning from data works\n- Training and evaluating supervised learning models for regression and classification tasks\n- How the bias-variance trade-off impacts predictive performance\n- How to diagnose and address prediction errors due to overfitting\n- Using cross-validation to optimize hyperparameters with a focus on time-series data\n- Why financial data requires additional attention when testing out-of-sample\n\n### 07 Linear Models: From Risk Factors to Return Forecasts\n\nLinear models are standard tools for inference and prediction in regression and classification contexts. Numerous widely used asset pricing models rely on linear regression. Regularized models like Ridge and Lasso regression often yield better predictions by limiting the risk of overfitting. Typical regression applications identify risk factors that drive asset returns to manage risks or predict returns. Classification problems, on the other hand, include directional price forecasts.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002F3Ph6jma.png\" width=\"65%\">\n\u003C\u002Fp>\n\n[Chapter 07](07_linear_models) covers the following topics:\n\n- How linear regression works and which assumptions it makes\n- Training and diagnosing linear regression models\n- Using linear regression to predict stock returns\n- Use regularization to improve the predictive performance\n- How logistic regression works\n- Converting a regression into a classification problem\n\n### 08 The ML4T Workflow: From Model to Strategy Backtesting\n\nThis [chapter](08_ml4t_workflow) presents an end-to-end perspective on designing, simulating, and evaluating a trading strategy driven by an ML algorithm. 
\nWe will demonstrate in detail how to backtest an ML-driven strategy in a historical market context using the Python libraries [backtrader](https:\u002F\u002Fwww.backtrader.com\u002F) and [Zipline](https:\u002F\u002Fzipline.ml4trading.io\u002Findex.html). \nThe ML4T workflow ultimately aims to gather evidence from historical data that helps decide whether to deploy a candidate strategy in a live market and put financial resources at risk. A realistic simulation of your strategy needs to faithfully represent how security markets operate and how trades execute. Also, several methodological aspects require attention to avoid biased results and false discoveries that will lead to poor investment decisions.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FR9O0fn3.png\" width=\"65%\">\n\u003C\u002Fp>\n\nMore specifically, after working through this chapter you will be able to:\n\n- Plan and implement end-to-end strategy backtesting\n- Understand and avoid critical pitfalls when implementing backtests\n- Discuss the advantages and disadvantages of vectorized vs event-driven backtesting engines\n- Identify and evaluate the key components of an event-driven backtester\n- Design and execute the ML4T workflow using data sources at minute and daily frequencies, with ML models trained separately or as part of the backtest\n- Use Zipline and backtrader to design and evaluate your own strategies \n\n### 09 Time Series Models for Volatility Forecasts and Statistical Arbitrage\n\nThis [chapter](09_time_series_models) focuses on models that extract signals from a time series' history to predict future values for the same time series. \nTime series models are in widespread use due to the time dimension inherent to trading. It presents tools to diagnose time series characteristics such as stationarity and extract features that capture potentially useful patterns. 
It also introduces univariate and multivariate time series models to forecast macro data and volatility patterns. \nFinally, it explains how cointegration identifies common trends across time series and shows how to develop a pairs trading strategy based on this crucial concept. \n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FcglLgJ0.png\" width=\"90%\">\n\u003C\u002Fp>\n\nIn particular, it covers:\n- How to use time-series analysis to prepare and inform the modeling process\n- Estimating and diagnosing univariate autoregressive and moving-average models\n- Building autoregressive conditional heteroskedasticity (ARCH) models to predict volatility\n- How to build multivariate vector autoregressive models\n- Using cointegration to develop a pairs trading strategy\n\n### 10 Bayesian ML: Dynamic Sharpe Ratios and Pairs Trading\n\nBayesian statistics allows us to quantify uncertainty about future events and refine estimates in a principled way as new information arrives. This dynamic approach adapts well to the evolving nature of financial markets. \nBayesian approaches to ML enable new insights into the uncertainty around statistical metrics, parameter estimates, and predictions. The applications range from more granular risk management to dynamic updates of predictive models that incorporate changes in the market environment. 
\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FqOUPIDV.png\" width=\"80%\">\n\u003C\u002Fp>\n\nMore specifically, this [chapter](10_bayesian_machine_learning) covers: \n- How Bayesian statistics applies to machine learning\n- Probabilistic programming with PyMC3\n- Defining and training machine learning models using PyMC3\n- How to run state-of-the-art sampling methods to conduct approximate inference\n- Bayesian ML applications to compute dynamic Sharpe ratios, dynamic pairs trading hedge ratios, and estimate stochastic volatility\n\n\n### 11 Random Forests: A Long-Short Strategy for Japanese Stocks\n\nThis [chapter](11_decision_trees_random_forests) applies decision trees and random forests to trading. Decision trees learn rules from data that encode nonlinear input-output relationships. We show how to train a decision tree to make predictions for regression and classification problems, visualize and interpret the rules learned by the model, and tune the model's hyperparameters to optimize the bias-variance tradeoff and prevent overfitting.\n\nThe second part of the chapter introduces ensemble models that combine multiple decision trees in a randomized fashion to produce a single prediction with a lower error. 
It concludes with a long-short strategy for Japanese equities based on trading signals generated by a random forest model.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FS4s0rou.png\" width=\"80%\">\n\u003C\u002Fp>\n\nIn short, this chapter shows how to:\n- Use decision trees for regression and classification\n- Gain insights from decision trees and visualize the rules learned from the data\n- Understand why ensemble models tend to deliver superior results\n- Use bootstrap aggregation to address the overfitting challenges of decision trees\n- Train, tune, and interpret random forests\n- Employ a random forest to design and evaluate a profitable trading strategy\n\n\n### 12 Boosting your Trading Strategy\n\nGradient boosting is an alternative tree-based ensemble algorithm that often produces better results than random forests. The critical difference is that boosting modifies the data used to train each tree based on the cumulative errors made by the model. While random forests train many trees independently using random subsets of the data, boosting proceeds sequentially and reweights the data.\nThis [chapter](12_gradient_boosting_machines) shows how state-of-the-art libraries achieve impressive performance and applies boosting to both daily and high-frequency data to backtest an intraday trading strategy. 
\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FRe0uI0H.png\" width=\"70%\">\n\u003C\u002Fp>\n\nMore specifically, we will cover the following topics:\n- How does boosting differ from bagging, and how did gradient boosting evolve from adaptive boosting,\n- Design and tune adaptive and gradient boosting models with scikit-learn,\n- Build, optimize, and evaluate gradient boosting models on large datasets with the state-of-the-art implementations XGBoost, LightGBM, and CatBoost,\n- Interpreting and gaining insights from gradient boosting models using [SHAP](https:\u002F\u002Fgithub.com\u002Fslundberg\u002Fshap) values, and\n- Using boosting with high-frequency data to design an intraday strategy.\n\n### 13 Data-Driven Risk Factors and Asset Allocation with Unsupervised Learning\n\nDimensionality reduction and clustering are the main tasks for unsupervised learning: \n- Dimensionality reduction transforms the existing features into a new, smaller set while minimizing the loss of information. A broad range of algorithms exists that differ by how they measure the loss of information, whether they apply linear or non-linear transformations or the constraints they impose on the new feature set. \n- Clustering algorithms identify and group similar observations or features instead of identifying new features. 
Algorithms differ in how they define the similarity of observations and their assumptions about the resulting groups.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FRfk7uCM.png\" width=\"70%\">\n\u003C\u002Fp>\n\nMore specifically, this [chapter](13_unsupervised_learning) covers:\n- How principal and independent component analysis (PCA and ICA) perform linear dimensionality reduction\n- Identifying data-driven risk factors and eigenportfolios from asset returns using PCA\n- Effectively visualizing nonlinear, high-dimensional data using manifold learning\n- Using T-SNE and UMAP to explore high-dimensional image data\n- How k-means, hierarchical, and density-based clustering algorithms work\n- Using agglomerative clustering to build robust portfolios with hierarchical risk parity\n\n\n## Part 3: Natural Language Processing for Trading\n\nText data are rich in content, yet unstructured in format and hence require more preprocessing so that a machine learning algorithm can extract the potential signal. The critical challenge consists of converting text into a numerical format for use by an algorithm, while simultaneously expressing the semantics or meaning of the content. \n\nThe next three chapters cover several techniques that capture language nuances readily understandable to humans so that machine learning algorithms can also interpret them.\n\n### 14 Text Data for Trading: Sentiment Analysis\n\nText data is very rich in content but highly unstructured so that it requires more preprocessing to enable an ML algorithm to extract relevant information. A key challenge consists of converting text into a numerical format without losing its meaning.\nThis [chapter](14_working_with_text_data) shows how to represent documents as vectors of token counts by creating a document-term matrix that, in turn, serves as input for text classification and sentiment analysis. 
It also introduces the naive Bayes algorithm and compares its performance to linear and tree-based models.\n\nIn particular, this chapter covers:\n- What the fundamental NLP workflow looks like\n- How to build a multilingual feature extraction pipeline using spaCy and TextBlob\n- Performing NLP tasks like part-of-speech tagging or named entity recognition\n- Converting tokens to numbers using the document-term matrix\n- Classifying news using the naive Bayes model\n- How to perform sentiment analysis using different ML algorithms\n\n### 15 Topic Modeling: Summarizing Financial News\n\nThis [chapter](15_topic_modeling) uses unsupervised learning to model latent topics and extract hidden themes from documents. These themes can generate detailed insights into a large corpus of financial reports.\nTopic models automate the creation of sophisticated, interpretable text features that, in turn, can help extract trading signals from extensive collections of texts. They speed up document review, enable the clustering of similar documents, and produce annotations useful for predictive modeling.\nApplications include identifying critical themes in company disclosures, earnings call transcripts or contracts, and annotation based on sentiment analysis or using returns of related assets. 
\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FVVSnTCa.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, it covers:\n- How topic modeling has evolved, what it achieves, and why it matters\n- Reducing the dimensionality of the document-term matrix (DTM) using latent semantic indexing\n- Extracting topics with probabilistic latent semantic analysis (pLSA)\n- How latent Dirichlet allocation (LDA) improves pLSA to become the most popular topic model\n- Visualizing and evaluating topic modeling results\n- Running LDA using scikit-learn and gensim\n- How to apply topic modeling to collections of earnings calls and financial news articles\n\n### 16 Word Embeddings for Earnings Calls and SEC Filings\n\nThis [chapter](16_word_embeddings) uses neural networks to learn a vector representation of individual semantic units like a word or a paragraph. These vectors are dense with a few hundred real-valued entries, compared to the higher-dimensional sparse vectors of the bag-of-words model. As a result, these vectors embed or locate each semantic unit in a continuous vector space.\n\nEmbeddings result from training a model to relate tokens to their context with the benefit that similar usage implies a similar vector. As a result, they encode semantic aspects like relationships among words through their relative location. 
They are powerful features that we will use with deep learning models in the following chapters.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002Fv8w9XLL.png\" width=\"80%\">\n\u003C\u002Fp>\n\nMore specifically, in this chapter, we will cover:\n- What word embeddings are and how they capture semantic information\n- How to obtain and use pre-trained word vectors\n- Which network architectures are most effective at training word2vec models\n- How to train a word2vec model using TensorFlow and gensim\n- Visualizing and evaluating the quality of word vectors\n- How to train a word2vec model on SEC filings to predict stock price moves\n- How doc2vec extends word2vec and helps with sentiment analysis\n- Why the transformer’s attention mechanism had such an impact on NLP\n- How to fine-tune pre-trained BERT models on financial data\n\n## Part 4: Deep & Reinforcement Learning\n\nPart four explains and demonstrates how to leverage deep learning for algorithmic trading. \nThe ability of deep learning algorithms to identify patterns in unstructured data makes them particularly suitable for alternative data like images and text. \n\nThe sample applications show, for example, how to combine text and price data to predict earnings surprises from SEC filings, generate synthetic time series to expand the amount of training data, and train a trading agent using deep reinforcement learning.\nSeveral of these applications replicate research recently published in top journals.\n\n### 17 Deep Learning for Trading\n\nThis [chapter](17_deep_learning) presents feedforward neural networks (NN) and demonstrates how to efficiently train large models using backpropagation while managing the risks of overfitting. 
It also shows how to use TensorFlow 2.0 and PyTorch and how to optimize an NN architecture to generate trading signals.\nIn the following chapters, we will build on this foundation to apply various architectures to different investment applications with a focus on alternative data. These include recurrent NN tailored to sequential data like time series or natural language and convolutional NN, particularly well suited to image data. We will also cover deep unsupervised learning, such as how to create synthetic data using Generative Adversarial Networks (GAN). Moreover, we will discuss reinforcement learning to train agents that interactively learn from their environment.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002F5cet0Fi.png\" width=\"70%\">\n\u003C\u002Fp>\n\nIn particular, this chapter will cover:\n- How DL solves AI challenges in complex domains\n- Key innovations that have propelled DL to its current popularity\n- How feedforward networks learn representations from data\n- Designing and training deep neural networks (NNs) in Python\n- Implementing deep NNs using Keras, TensorFlow, and PyTorch\n- Building and tuning a deep NN to predict asset returns\n- Designing and backtesting a trading strategy based on deep NN signals\n\n### 18 CNN for Financial Time Series and Satellite Images\n\nCNN architectures continue to evolve. This chapter describes building blocks common to successful applications, demonstrates how transfer learning can speed up learning, and shows how to use CNNs for object detection.\nCNNs can generate trading signals from images or time-series data. Satellite data can anticipate commodity trends via aerial images of agricultural areas, mines, or transport networks. 
Camera footage can help predict consumer activity; we show how to build a CNN that classifies economic activity in satellite images.\nCNNs can also deliver high-quality time-series classification results by exploiting their structural similarity with images, and we design a strategy based on time-series data formatted like images. \n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FPlLQV0M.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, this [chapter](18_convolutional_neural_nets) covers:\n\n- How CNNs employ several building blocks to efficiently model grid-like data\n- Training, tuning and regularizing CNNs for images and time series data using TensorFlow\n- Using transfer learning to streamline CNNs, even with fewer data\n- Designing a trading strategy using return predictions by a CNN trained on time-series data formatted like images\n- How to classify economic activity based on satellite images\n\n### 19 RNN for Multivariate Time Series and Sentiment Analysis\n\nRecurrent neural networks (RNNs) compute each output as a function of the previous output and new data, effectively creating a model with memory that shares parameters across a deeper computational graph. Prominent architectures include Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) that address the challenges of learning long-range dependencies.\nRNNs are designed to map one or more input sequences to one or more output sequences and are particularly well suited to natural language. They can also be applied to univariate and multivariate time series to predict market or fundamental data. 
This chapter covers how RNN can model alternative text data using the word embeddings that we covered in Chapter 16 to classify the sentiment expressed in documents.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FE9fOApg.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, this chapter addresses:\n- How recurrent connections allow RNNs to memorize patterns and model a hidden state\n- Unrolling and analyzing the computational graph of RNNs\n- How gated units learn to regulate RNN memory from data to enable long-range dependencies\n- Designing and training RNNs for univariate and multivariate time series in Python\n- How to learn word embeddings or use pretrained word vectors for sentiment analysis with RNNs\n- Building a bidirectional RNN to predict stock returns using custom word embeddings\n\n### 20 Autoencoders for Conditional Risk Factors and Asset Pricing\n\nThis [chapter](20_autoencoders_for_conditional_risk_factors) shows how to leverage unsupervised deep learning for trading. We also discuss autoencoders, namely, a neural network trained to reproduce the input while learning a new representation encoded by the parameters of a hidden layer. Autoencoders have long been used for nonlinear dimensionality reduction, leveraging the NN architectures we covered in the last three chapters.\nWe replicate a recent AQR paper that shows how autoencoders can underpin a trading strategy. 
We will use a deep neural network that relies on an autoencoder to extract risk factors and predict equity returns, conditioned on a range of equity attributes.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FaCmE0UD.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, in this chapter you will learn about:\n- Which types of autoencoders are of practical use and how they work\n- Building and training autoencoders using Python\n- Using autoencoders to extract data-driven risk factors that take into account asset characteristics to predict returns\n\n### 21 Generative Adversarial Nets for Synthetic Time Series Data\n\nThis chapter introduces generative adversarial networks (GAN). GANs train a generator and a discriminator network in a competitive setting so that the generator learns to produce samples that the discriminator cannot distinguish from a given class of training data. The goal is to yield a generative model capable of producing synthetic samples representative of this class.\nWhile most popular with image data, GANs have also been used to generate synthetic time-series data in the medical domain. Subsequent experiments with financial data explored whether GANs can produce alternative price trajectories useful for ML training or strategy backtests. 
We replicate the 2019 NeurIPS Time-Series GAN paper to illustrate the approach and demonstrate the results.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FW1Rp89K.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, in this chapter you will learn about:\n- How GANs work, why they are useful, and how they could be applied to trading\n- Designing and training GANs using TensorFlow 2\n- Generating synthetic financial data to expand the inputs available for training ML models and backtesting\n\n### 22 Deep Reinforcement Learning: Building a Trading Agent\n\nReinforcement Learning (RL) models goal-directed learning by an agent that interacts with a stochastic environment. RL optimizes the agent's decisions concerning a long-term objective by learning the value of states and actions from a reward signal. The ultimate goal is to derive a policy that encodes behavioral rules and maps states to actions.\nThis [chapter](22_deep_reinforcement_learning) shows how to formulate and solve an RL problem. It covers model-based and model-free methods, introduces the OpenAI Gym environment, and combines deep learning with RL to train an agent that navigates a complex environment. 
Finally, we'll show you how to adapt RL to algorithmic trading by modeling an agent that interacts with the financial market while trying to optimize an objective function.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002Flg0ofbZ.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, this chapter will show how to:\n\n- Define a Markov decision process (MDP)\n- Use value and policy iteration to solve an MDP\n- Apply Q-learning in an environment with discrete states and actions\n- Build and train a deep Q-learning agent in a continuous environment\n- Use the OpenAI Gym to design a custom market environment and train an RL agent to trade stocks\n\n### 23 Conclusions and Next Steps\n\nIn this concluding chapter, we will briefly summarize the essential tools, applications, and lessons learned throughout the book to avoid losing sight of the big picture after so much detail.\nWe will then identify areas that we did not cover but would be worth focusing on as you expand on the many machine learning techniques we introduced and become productive in their daily use.\n\nIn sum, in this chapter, we will:\n- Review key takeaways and lessons learned\n- Point out the next steps to build on the techniques in this book\n- Suggest ways to incorporate ML into your investment process\n\n### 24 Appendix - Alpha Factor Library\n\nThroughout this book, we emphasized how the smart design of features, including appropriate preprocessing and denoising, typically leads to an effective strategy. 
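This appendix synthesizes some of the lessons learned on feature engineering and provides additional information on this vital topic.\n\nTo this end, we focus on the broad range of indicators implemented by TA-Lib (see [Chapter 4](04_alpha_factor_research)) and WorldQuant's [101 Formulaic Alphas](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1601.00991.pdf) paper (Kakushadze 2016), which presents real-life quantitative trading factors used in production with an average holding period of 0.6-6.4 days.\n\nThis chapter covers: \n- How to compute several dozen technical indicators using TA-Lib and NumPy\u002Fpandas,\n- Creating the formulaic alphas described in the above paper, and\n- Evaluating the predictive quality of the results using various metrics from rank correlation and mutual information to feature importance, SHAP values and Alphalens. \n","# Machine Learning for Trading - 2nd Edition\n\nThis book aims to show, in a practical yet comprehensive way, how machine learning can add value to algorithmic trading strategies. It covers a broad range of ML techniques from linear regression to deep reinforcement learning and demonstrates in detail how to build, backtest, and evaluate a trading strategy driven by model predictions.\n\nThe book has four parts with **23 chapters plus an appendix** on over **800 pages** that cover:\n- important aspects of data sourcing, **financial feature engineering**, and portfolio management,\n- the design and evaluation of long-short **strategies** based on supervised and unsupervised ML algorithms,\n- how to extract tradeable signals from **financial text data** like SEC filings, earnings call transcripts, or financial news,\n- using **deep learning** models like CNN and RNN with market and alternative data, generating synthetic data with generative adversarial networks, and training a trading agent using deep reinforcement learning.\n\n\u003Cp align=\"center\">\n\u003Ca href=\"https:\u002F\u002Fwww.amazon.com\u002FMachine-Learning-Algorithmic-Trading-alternative\u002Fdp\u002F1839217715?pf_rd_r=GZH2XZ35GB3BET09PCCA&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=91a679c7-f069-4a6e-bdbb-a2b3f548f0c8&pd_rd_w=2B0Q0&pd_rd_wg=GMY5S&ref_=pd_gw_ci_mcx_mr_hp_d\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fstefan-jansen_machine-learning-for-trading_readme_8698c6363773.png\" width=\"75%\">\n\u003C\u002Fa>\n\u003C\u002Fp>\n\nThis repo contains **over 150 notebooks** that put the concepts, algorithms, and use cases discussed in the book into action. They provide numerous examples that show:\n- how to work with and extract signals from market, fundamental, and alternative text and image data,\n- how to train and tune models that predict returns for different asset classes and investment horizons, including how to replicate recently published research, and\n- how to design, backtest, and evaluate trading strategies.\n\n> 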
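We **highly recommend** reviewing the notebooks while reading the book; they are usually in an executed state and often contain additional information not included due to space constraints.\n\nIn addition to the information in this repo, the book's [website](ml4trading.io) contains chapter summaries and additional information.\n\n## Join the ML4T Community!\n\nTo make it easy for readers to exchange views on the book's content and code examples, the development and implementation of their own strategies, and industry developments, we are hosting an online [platform](https:\u002F\u002Fexchange.ml4trading.io\u002F).\n\nPlease [join](https:\u002F\u002Fexchange.ml4trading.io\u002F) our community now to connect with peers who are also interested in leveraging ML for trading strategies, share your experience, and learn from each other!\n\n## What's new in the 2nd Edition?\n\nFirst and foremost, this [book](https:\u002F\u002Fwww.amazon.com\u002FMachine-Learning-Algorithmic-Trading-alternative\u002Fdp\u002F1839217715?pf_rd_r=VMKJPZC4N36TTZZCWATP&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=8f331266-0d21-4c76-a3eb-d2e61d23bb31&pd_rd_w=kVGNF&pd_rd_wg=LYLKH&ref_=pd_gw_ci_mcx_mr_hp_d) demonstrates how you can extract signals from a diverse set of data sources and design trading strategies for different asset classes using a broad range of supervised, unsupervised, and reinforcement learning algorithms. It also provides relevant mathematical and statistical knowledge to help tune an algorithm or interpret the results. Furthermore, it covers the financial background that will help you work with market and fundamental data, extract informative features, and manage the performance of a trading strategy.\n\nFrom a practical standpoint, the 2nd edition aims to equip you with the conceptual understanding and tools to develop your own ML-based trading strategies. To this end, it frames ML as a critical element in a process rather than a standalone exercise, introducing the end-to-end ML for trading workflow from data sourcing, feature engineering, and model optimization to strategy design and backtesting.\n\nMore specifically, the ML4T workflow starts with generating ideas for a well-defined investment universe, collecting relevant data, and extracting informative features. It continues with designing, tuning, and evaluating ML models suited to the predictive task. Finally, it requires developing trading strategies that act on the models' predictive signals, and simulating and evaluating their performance on historical data using a backtesting engine. Once you decide to execute an algorithmic strategy in a real market, you will find yourself iterating over this workflow repeatedly to incorporate new information and adapt to a changing environment.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FkcgItgp.png\" width=\"75%\">\n\u003C\u002Fp>\n\nThe [second edition](https:\u002F\u002Fwww.amazon.com\u002FMachine-Learning-Algorithmic-Trading-alternative\u002Fdp\u002F1839217715?pf_rd_r=GZH2XZ35GB3BET09PCCA&pf_rd_p=c5b6893a-24f2-4a59-9d4b-aff5065c90ec&pd_rd_r=91a679c7-f069-4a6e-bdbb-a2b3f548f0c8&pd_rd_w=2B0Q0&pd_rd_wg=GMY5S&ref_=pd_gw_ci_mcx_mr_hp_d)'s emphasis on the ML4T workflow translates into a new chapter on [strategy backtesting](08_ml4t_workflow), a new [appendix](24_alpha_factor_library) describing over 100 different alpha factors, and many new practical applications. We have also rewritten most of the existing content for clarity and readability.\n\nThe trading applications now use a broader range of data sources beyond daily US equity prices, including international stocks and ETFs. The book also demonstrates how to use minute-frequency equity data to build an intraday strategy. Furthermore, it extends the coverage of alternative data sources to include SEC filings for sentiment analysis and return forecasts, as well as satellite images to classify land use.\n\nAnother innovation of the second edition is to replicate several trading applications recently published in top journals:\n- [Chapter 18](18_convolutional_neural_nets) demonstrates how to apply convolutional neural networks to time series converted to image format for return predictions, based on Sezer and Ozbahoglu (2018).\n- 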
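[Chapter 20](20_autoencoders_for_conditional_risk_factors) shows how to use autoencoders to extract risk factors conditioned on stock characteristics for asset pricing, based on Shihao Gu, Bryan T. Kelly, and Dacheng Xiu (2019).\n- [Chapter 21](21_gans_for_synthetic_time_series) shows how to create synthetic training data using generative adversarial networks, based on Jinsung Yoon, Daniel Jarrett, and Mihaela van der Schaar (2019).\n\nAll examples now use the latest software versions available at the time of writing, such as pandas 1.0 and TensorFlow 2.2. There is also a customized version of the Zipline library that makes it easy to include machine learning model predictions when designing a trading strategy.\n\n## Installation, data sources and bug reports\n\nThe code examples rely on a wide range of Python libraries from the data science and finance domains.\n\nIt is not necessary to try and install all libraries at once because this increases the likelihood of encountering version conflicts. Instead, we recommend that you install the libraries required for a specific chapter as you go along.\n\n> Update March 2022: `zipline-reloaded`, `pyfolio-reloaded`, `alphalens-reloaded`, and `empyrical-reloaded` are now available on the `conda-forge` channel. The `ml4t` channel only contains outdated versions and will soon be removed.\n\n> Update April 2021: with the update of [Zipline](https:\u002F\u002Fzipline.ml4trading.io), it is no longer necessary to use Docker. The installation instructions now refer to OS-specific environment files that should simplify running the notebooks.\n\n> Update February 2021: code sample release 2.0 updates the conda environments provided by the Docker image to Python 3.8, Pandas 1.2, and TensorFlow 1.2, among others; the Zipline backtesting environment uses Python 3.6.\n\n- The [installation](installation\u002FREADME.md) directory contains detailed instructions on setting up and using a Docker image to run the notebooks. It also contains configuration files for setting up various `conda` environments so that you can install the packages used in the notebooks directly on your machine if you prefer (and, depending on your system, are prepared to go the extra mile).\n- To download and preprocess many of the data sources used in this book, see the instructions in the [README](data\u002FREADME.md) file alongside the various notebooks in the [data](data) directory.\n\n> If you have any difficulties installing the environments, downloading the data, or running the code, please raise a **GitHub issue** in the repo ([here](https:\u002F\u002Fgithub.com\u002Fstefan-jansen\u002Fmachine-learning-for-trading\u002Fissues)). Working with GitHub issues is described [here](https:\u002F\u002Fguides.github.com\u002Ffeatures\u002Fissues\u002F).\n\n> **Update**: You can download the **[algoseek](https:\u002F\u002Fwww.algoseek.com)** data used in the book [here](https:\u002F\u002Fwww.algoseek.com\u002Fml4t-book-data.html). See the preprocessing instructions in [Chapter 2](02_market_and_fundamental_data\u002F02_algoseek_intraday\u002FREADME.md) and an intraday example with a gradient boosting model in [Chapter 12](12_gradient_boosting_machines\u002F10_intraday_features.ipynb).\n\n> **Update**: The [figures](figures) directory contains color versions of the charts used in the book.\n\n# Outline & Chapter Summary\n\nThe book has four parts that address the different challenges that arise when sourcing and working with market, fundamental, and alternative data, developing ML solutions to various predictive tasks in the trading context, and designing and evaluating a trading strategy that relies on predictive signals generated by an ML model.\n\n> The directory for each chapter contains a README with additional information on content, code examples, and other resources.\n\n[Part 1: From Data to Strategy Development](#part-1-from-data-to-strategy-development)\n* [01 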
Machine Learning for Trading: From Idea to Execution](#01-machine-learning-for-trading-from-idea-to-execution)\n* [02 Market & Fundamental Data: Sources and Techniques](#02-market--fundamental-data-sources-and-techniques)\n* [03 Alternative Data for Finance: Categories and Use Cases](#03-alternative-data-for-finance-categories-and-use-cases)\n* [04 Financial Feature Engineering: How to Research Alpha Factors](#04-financial-feature-engineering-how-to-research-alpha-factors)\n* [05 Portfolio Optimization and Performance Evaluation](#05-portfolio-optimization-and-performance-evaluation)\n\n[Part 2: Machine Learning for Trading: Fundamentals](#part-2-machine-learning-for-trading-fundamentals)\n* [06 The Machine Learning Process](#06-the-machine-learning-process)\n* [07 Linear Models: From Risk Factors to Return Forecasts](#07-linear-models-from-risk-factors-to-return-forecasts)\n* [08 The ML4T Workflow: From Model to Strategy Backtesting](#08-the-ml4t-workflow-from-model-to-strategy-backtesting)\n* [09 Time Series Models for Volatility Forecasts and Statistical Arbitrage](#09-time-series-models-for-volatility-forecasts-and-statistical-arbitrage)\n* [10 Bayesian ML: Dynamic Sharpe Ratios and Pairs Trading](#10-bayesian-ml-dynamic-sharpe-ratios-and-pairs-trading)\n* [11 Random Forests: A Long-Short Strategy for Japanese Stocks](#11-random-forests-a-long-short-strategy-for-japanese-stocks)\n* [12 Boosting your Trading Strategy](#12-boosting-your-trading-strategy)\n* [13 Data-Driven Risk Factors and Asset Allocation with Unsupervised Learning](#13-data-driven-risk-factors-and-asset-allocation-with-unsupervised-learning)\n\n[Part 3: Natural Language Processing for Trading](#part-3-natural-language-processing-for-trading)\n* [14 Text Data for Trading: Sentiment Analysis](#14-text-data-for-trading-sentiment-analysis)\n* [15 Topic Modeling: Summarizing Financial News](#15-topic-modeling-summarizing-financial-news)\n* [16 Word Embeddings for Earnings Calls and SEC Filings](#16-word-embeddings-for-earnings-calls-and-sec-filings)\n\n[Part 4: Deep & Reinforcement Learning](#part-4-deep--reinforcement-learning)\n* [17 Deep Learning for Trading](#17-deep-learning-for-trading)\n* [18 CNN for Financial Time Series and Satellite Images](#18-cnn-for-financial-time-series-and-satellite-images)\n* [19 RNN for Multivariate Time Series and Sentiment Analysis](#19-rnn-for-multivariate-time-series-and-sentiment-analysis)\n* [20 Autoencoders for Conditional Risk Factors and Asset Pricing](#20-autoencoders-for-conditional-risk-factors-and-asset-pricing)\n* [21 Generative Adversarial Nets for Synthetic Time Series Data](#21-generative-adversarial-nets-for-synthetic-time-series-data)\n* [22 Deep Reinforcement Learning: Building a Trading Agent](#22-deep-reinforcement-learning-building-a-trading-agent)\n* [23 Conclusions and Next Steps](#23-conclusions-and-next-steps)\n* 
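[24 Appendix - Alpha Factor Library](#24-appendix---alpha-factor-library)\n\n## Part 1: From Data to Strategy Development\n\nThe first part provides a framework for developing trading strategies driven by machine learning (ML). It focuses on the data that power the ML algorithms and strategies discussed in this book, outlines how to engineer and evaluate features for ML models, and shows how to manage and measure a portfolio's performance while executing a trading strategy.\n\n### 01 Machine Learning for Trading: From Idea to Execution\n\nThis [chapter](01_machine_learning_for_trading) explores the industry trends that have led ML to emerge as a source of competitive advantage in the investment industry. We also look at where ML fits into the investment process to enable algorithmic trading strategies.\n\nMore specifically, it covers the following topics:\n- Key trends behind the rise of ML in the investment industry\n- The design and execution of a trading strategy that leverages ML\n- Popular use cases for ML in trading\n\n### 02 Market & Fundamental Data: Sources and Techniques\n\nThis [chapter](02_market_and_fundamental_data) shows how to work with market and fundamental data and describes critical aspects of the environment that they reflect. For example, familiarity with various order types and the trading infrastructure matters not only for interpreting the data but also for correctly designing backtest simulations. We also illustrate how to use Python to access and manipulate trading and financial statement data.\n\nPractical examples demonstrate how to work with tick data from NASDAQ and minute bar data from Algoseek, which come with a rich set of attributes capturing the demand-supply dynamic that we will later use for an ML-based intraday strategy. We also cover various data provider APIs and how to source financial statement information from the SEC.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FenaSo0C.png\" title=\"Order Book\" width=\"50%\"\u002F>\n\u003C\u002Fp>\nIn particular, this chapter covers:\n\n- How market data reflects the structure of the trading environment\n- Working with intraday trade and quotes data at minute frequency\n- Reconstructing the **limit order book** from tick data using the NASDAQ ITCH protocol\n- Summarizing tick data using various types of bars\n- Working with eXtensible Business Reporting Language (XBRL)-encoded **electronic filings**\n- Parsing and combining market and fundamental data to create a P\u002FE series\n- How to access various market and fundamental data sources using Python\n\n### 03 Alternative Data for Finance: Categories and Use Cases\n\nThis [chapter](03_alternative_data) outlines categories and use cases of alternative data, describes criteria to assess the growing number of sources and providers, and summarizes the current market landscape.\n\nIt also demonstrates how to create alternative data sets by scraping websites, such as collecting earnings call transcripts for use with natural language processing (NLP) and sentiment analysis algorithms in the third part of the book.\n\nMore specifically, this chapter covers:\n\n- Which new sources of signals have emerged during the alternative data revolution\n- How individuals, businesses, and sensors generate a diverse set of alternative data\n- Important categories and providers of alternative data\n- Evaluating how the burgeoning supply of alternative data can be used for trading\n- Working with alternative data in Python, such as by scraping the internet\n\n### 04 Financial Feature Engineering: How to Research Alpha Factors\n\nIf you are already familiar with ML, you know that feature engineering is a crucial ingredient for successful predictions. It matters just as much in the trading domain, where academic and industry researchers have investigated for decades what drives asset markets and prices, and which features help to explain or predict price movements.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FUCu4Huo.png\" width=\"70%\">\n\u003C\u002Fp>\n\nThis [chapter](04_alpha_factor_research) builds on this research as a starting point for your own quest for alpha factors. It also presents essential tools to compute and test alpha factors, highlighting how the NumPy, pandas, and TA-Lib libraries facilitate the manipulation of data, and presents popular smoothing techniques like wavelets and the Kalman filter that help reduce noise in data. After reading it, you will know about:\n- Which categories of alpha factors exist, why they work, and how to measure them\n- Creating alpha factors using NumPy, pandas, and TA-Lib\n- How to denoise data using wavelets and the Kalman filter\n- Using Zipline to test individual and multiple alpha factors\n- How to use [Alphalens](https:\u002F\u002Fgithub.com\u002Fquantopian\u002Falphalens) to evaluate predictive performance\n\n### 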
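05 Portfolio Optimization and Performance Evaluation\n\nAlpha factors generate signals that an algorithmic strategy translates into trades, which, in turn, produce long and short positions. The returns and risk of the resulting portfolio determine whether the strategy meets the investment objectives.\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FE2h63ZB.png\" width=\"65%\">\n\u003C\u002Fp>\n\nThere are several approaches to optimize portfolios. These include the application of machine learning (ML) to learn hierarchical relationships among assets and treat them as complements or substitutes when designing the portfolio's risk profile. This [chapter](05_strategy_evaluation) covers:\n- How to measure portfolio risk and return\n- Managing portfolio weights using mean-variance optimization and alternatives\n- Using machine learning to optimize asset allocation in a portfolio context\n- Simulating trades and creating a portfolio based on alpha factors using Zipline\n- How to evaluate portfolio performance using [pyfolio](https:\u002F\u002Fquantopian.github.io\u002Fpyfolio\u002F)\n\n## Part 2: Machine Learning for Trading: Fundamentals\n\nThe second part covers the fundamental supervised and unsupervised learning algorithms and illustrates their application to trading strategies. It also introduces the Quantopian platform, which allows you to combine and apply the data and ML techniques developed in this book to implement algorithmic strategies that execute trades in live markets.\n\n### 06 The Machine Learning Process\n\nThis [chapter](06_machine_learning_process) kicks off Part 2, which illustrates how you can use a range of supervised and unsupervised ML models for trading. We will explain each model's assumptions and use cases before we demonstrate relevant applications using various Python libraries.\n\nMany of these models and their applications have several aspects in common. This chapter covers these common aspects so that we can focus on model-specific usage in the following chapters. It sets the stage by outlining how to formulate, train, tune, and evaluate the predictive performance of ML models as a systematic workflow. The content includes:\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002F5qisClE.png\" width=\"65%\">\n\u003C\u002Fp>\n\n- How supervised and unsupervised learning from data works\n- Training and evaluating supervised learning models for regression and classification tasks\n- How the bias-variance trade-off impacts predictive performance\n- How to diagnose and address prediction errors due to overfitting\n- Using cross-validation to optimize hyperparameters with a focus on time-series data\n- Why financial data requires additional attention when testing out-of-sample\n\n### 07 Linear Models: From Risk Factors to Return Forecasts\n\nLinear models are standard tools for inference and prediction in regression and classification contexts. Numerous widely used asset pricing models rely on linear regression. Regularized models like Ridge and Lasso regression often yield better predictions by limiting the risk of overfitting. Typical regression applications identify risk factors that drive asset returns to manage risks or predict returns. Classification problems, on the other hand, include directional price forecasts.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002F3Ph6jma.png\" width=\"65%\">\n\u003C\u002Fp>\n\n[Chapter 7](07_linear_models) covers the following topics:\n\n- How linear regression works and which assumptions it makes\n- Training and diagnosing linear regression models\n- Using linear regression to predict stock returns\n- Using regularization to improve the predictive performance\n- How logistic regression works\n- Converting a regression into a classification problem\n\n### 08 The ML4T Workflow: From Model to Strategy Backtesting\n\nThis [chapter](08_ml4t_workflow) presents an end-to-end perspective on designing, simulating, and evaluating a trading strategy driven by an ML algorithm. We will demonstrate in detail how to backtest an ML-driven strategy in a historical market context using the Python libraries [backtrader](https:\u002F\u002Fwww.backtrader.com\u002F) and [Zipline](https:\u002F\u002Fzipline.ml4trading.io\u002Findex.html). The ML4T workflow ultimately aims to gather evidence from historical data that helps decide whether to deploy a candidate strategy in a live market and put financial resources at risk. A realistic simulation of your strategy needs to faithfully represent how security markets operate and how trades execute. Also, several methodological aspects require attention to avoid biased results and false discoveries that will lead to poor investment decisions.\n\n\u003Cp align=\"center\">\n\u003Cimg 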
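src=\"https:\u002F\u002Fi.imgur.com\u002FR9O0fn3.png\" width=\"65%\">\n\u003C\u002Fp>\n\nMore specifically, after working through this chapter you will be able to:\n\n- Plan and implement end-to-end strategy backtesting\n- Understand and avoid critical pitfalls when implementing backtests\n- Discuss the advantages and disadvantages of vectorized vs event-driven backtesting engines\n- Identify and evaluate the key components of an event-driven backtester\n- Design and execute the ML4T workflow using data sources at minute and daily frequencies, with ML models trained separately or as part of the backtest\n- Use Zipline and backtrader to design and evaluate your own strategies\n\n### 09 Time Series Models for Volatility Forecasts and Statistical Arbitrage\n\nThis [chapter](09_time_series_models) focuses on models that extract signals from a time series' history to predict future values for the same time series. Time series models are in widespread use due to the time dimension inherent to trading. The chapter presents tools to diagnose time series characteristics such as stationarity and extract features that capture potentially useful patterns. It also introduces univariate and multivariate time series models to forecast macro data and volatility patterns. Finally, it explains how cointegration identifies common trends across time series and shows how to develop a pairs trading strategy based on this crucial concept.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FcglLgJ0.png\" width=\"90%\">\n\u003C\u002Fp>\n\nIn particular, it covers:\n- How to use time-series analysis to prepare and inform the modeling process\n- Estimating and diagnosing univariate autoregressive and moving-average models\n- Building autoregressive conditional heteroskedasticity (ARCH) models to predict volatility\n- How to build multivariate vector autoregressive models\n- Using cointegration to develop a pairs trading strategy\n\n### 10 Bayesian ML: Dynamic Sharpe Ratios and Pairs Trading\n\nBayesian statistics allows us to quantify uncertainty about future events and refine estimates in a principled way as new information arrives. This dynamic approach adapts well to the evolving nature of financial markets. Bayesian approaches to ML enable new insights into the uncertainty around statistical metrics, parameter estimates, and predictions. The applications range from more granular risk management to dynamic updates of predictive models that incorporate changes in the market environment.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FqOUPIDV.png\" width=\"80%\">\n\u003C\u002Fp>\n\nMore specifically, this [chapter](10_bayesian_machine_learning) covers:\n- How Bayesian statistics applies to machine learning\n- Probabilistic programming with PyMC3\n- Defining and training machine learning models using PyMC3\n- How to run state-of-the-art sampling methods to conduct approximate inference\n- Bayesian ML applications to compute dynamic Sharpe ratios, dynamic pairs trading hedge ratios, and estimate stochastic volatility\n\n### 11 Random Forests: A Long-Short Strategy for Japanese Stocks\n\nThis [chapter](11_decision_trees_random_forests) applies decision trees and random forests to trading. Decision trees learn rules from data that encode nonlinear input-output relationships. We show how to train a decision tree to make predictions for regression and classification problems, visualize and interpret the rules learned by the model, and tune the model's hyperparameters to optimize the bias-variance tradeoff and prevent overfitting.\n\nThe second part of the chapter introduces ensemble models that combine multiple decision trees in a randomized fashion to produce a single prediction with a lower error. It concludes with a long-short strategy for Japanese equities based on trading signals generated by a random forest model.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FS4s0rou.png\" width=\"80%\">\n\u003C\u002Fp>\n\nIn short, this chapter shows how to:\n- Use decision trees for regression and classification\n- Gain insights from decision trees and visualize the rules learned from the data\n- Understand why ensemble models tend to deliver superior results\n- Use bootstrap aggregation to address the overfitting challenges of decision trees\n- Train, tune, and interpret random forests\n- Employ a random forest to design and evaluate a profitable trading strategy\n\n### 12 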
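Boosting your Trading Strategy\n\nGradient boosting is a tree-based ensemble algorithm that often produces better results than random forests. The critical difference is that boosting modifies the data used to train each tree based on the cumulative errors made by the model so far. While random forests train many trees independently on random subsets of the data, boosting proceeds sequentially and reweights the data. This [chapter](12_gradient_boosting_machines) shows how state-of-the-art libraries achieve impressive performance and applies boosting to both daily and high-frequency data to backtest an intraday trading strategy.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FRe0uI0H.png\" width=\"70%\">\n\u003C\u002Fp>\n\nMore specifically, we will cover the following topics:\n- How boosting differs from bagging, and how gradient boosting evolved from adaptive boosting\n- Designing and tuning adaptive and gradient boosting models with scikit-learn\n- Building, optimizing, and evaluating gradient boosting models on large datasets with the state-of-the-art implementations XGBoost, LightGBM, and CatBoost\n- Interpreting and gaining insights from gradient boosting models using [SHAP](https:\u002F\u002Fgithub.com\u002Fslundberg\u002Fshap) values\n- Using boosting with high-frequency data to design an intraday strategy\n\n### 13 Data-Driven Risk Factors and Asset Allocation with Unsupervised Learning\n\nDimensionality reduction and clustering are the main tasks for unsupervised learning:\n- Dimensionality reduction transforms the existing features into a new, smaller set while minimizing the loss of information. A broad range of algorithms exists; they differ in how they measure the loss of information, whether they apply linear or nonlinear transformations, and the constraints they impose on the new feature set.\n- Clustering algorithms identify and group similar observations or features instead of identifying new features. Algorithms differ in how they define the similarity of observations and in their assumptions about the resulting groups.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FRfk7uCM.png\" width=\"70%\">\n\u003C\u002Fp>\n\nMore specifically, this [chapter](13_unsupervised_learning) covers:\n- How principal component analysis (PCA) and independent component analysis (ICA) perform linear dimensionality reduction\n- Identifying data-driven risk factors and eigenportfolios from asset returns using PCA\n- Effectively visualizing nonlinear, high-dimensional data using manifold learning\n- Using t-SNE and UMAP to explore high-dimensional image data\n- How k-means, hierarchical, and density-based clustering algorithms work\n- Using agglomerative clustering to build robust portfolios with hierarchical risk parity\n\n\n## Part 3: Natural Language Processing for Trading\n\nText data is rich in content yet unstructured in format, so it requires more preprocessing before a machine learning algorithm can extract the potential signal. The critical challenge is to convert text into a numerical format that an algorithm can use while preserving the semantics or meaning of the content.\n\nThe next three chapters cover several techniques that capture language nuances readily understandable to humans so that machine learning algorithms can also interpret them.\n\n### 14 Text Data for Trading: Sentiment Analysis\n\nText data is very rich in content but highly unstructured, so it requires more preprocessing to enable an ML algorithm to extract relevant information. A key challenge is to convert text into a numerical format without losing its meaning.\nThis [chapter](14_working_with_text_data) shows how to represent documents as vectors of token counts by creating a document-term matrix that, in turn, serves as input for text classification and sentiment analysis. It also introduces the Naive Bayes algorithm and compares its performance to linear and tree-based models.\n\nIn particular, this chapter covers:\n- What the fundamental NLP workflow looks like\n- How to build a multilingual feature extraction pipeline using spaCy and TextBlob\n- Performing NLP tasks such as part-of-speech tagging or named entity recognition\n- Converting tokens to numbers using the document-term matrix\n- Classifying news using the Naive Bayes model\n- How to perform sentiment analysis using different ML algorithms\n\n### 15 Topic Modeling: Summarizing Financial News\n\nThis [chapter](15_topic_modeling) uses unsupervised learning to model latent topics and extract hidden themes from documents. These themes can provide detailed insights into a large corpus of financial reports.\nTopic models automate the creation of sophisticated, interpretable text features that, in turn, can help extract trading signals from extensive collections of texts. They speed up document review, enable the clustering of similar documents, and produce annotations useful for predictive modeling. Applications include identifying critical themes in company disclosures, earnings call transcripts, or contracts, as well as annotation based on sentiment analysis or the returns of related assets.\n\n\u003Cp align=\"center\">\n\u003Cimg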
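src=\"https:\u002F\u002Fi.imgur.com\u002FVVSnTCa.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, this chapter covers:\n- How topic modeling has evolved, what it achieves, and why it matters\n- Reducing the dimensionality of the document-term matrix using latent semantic indexing\n- Extracting topics with probabilistic latent semantic analysis (pLSA)\n- How latent Dirichlet allocation (LDA) improves on pLSA to become the most popular topic model\n- Visualizing and evaluating topic modeling results\n- Running LDA using scikit-learn and gensim\n- How to apply topic modeling to collections of earnings calls and financial news articles\n\n### 16 Word Embeddings for Earnings Calls and SEC Filings\n\nThis [chapter](16_word_embeddings) uses neural networks to learn a vector representation of individual semantic units such as a word or a paragraph. These vectors are dense, with a few hundred real-valued entries, in contrast to the higher-dimensional sparse vectors of the bag-of-words model. As a result, these vectors embed each semantic unit in a continuous vector space.\n\nEmbeddings result from training a model to relate tokens to their context, with the benefit that similar usage implies a similar vector. As a result, they encode semantic aspects such as relationships among words through their relative location. They are powerful features that we will use with deep learning models in the following chapters.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002Fv8w9XLL.png\" width=\"80%\">\n\u003C\u002Fp>\n\nMore specifically, this chapter covers:\n- What word embeddings are and how they capture semantic information\n- How to obtain and use pretrained word vectors\n- Which network architectures are most effective at training word2vec models\n- How to train a word2vec model using TensorFlow and gensim\n- Visualizing and evaluating the quality of word vectors\n- How to train a word2vec model on SEC filings to predict stock price moves\n- How doc2vec extends word2vec and helps with sentiment analysis\n- Why the transformer's attention mechanism had such an impact on NLP\n- How to fine-tune pretrained BERT models on financial data\n\n## Part 4: Deep Learning and Reinforcement Learning\n\nPart four explains and demonstrates how to leverage deep learning for algorithmic trading.\nThe powerful capabilities of deep learning algorithms to identify patterns in unstructured data make them particularly suitable for alternative data such as images and text.\n\nThe sample applications show how to combine text and price data to predict earnings surprises from SEC filings, generate synthetic time series to expand the amount of training data, and train a trading agent using deep reinforcement learning. Several of these applications replicate research recently published in top journals.\n\n### 17 Deep Learning for Trading\n\nThis [chapter](17_deep_learning) presents feedforward neural networks (NN) and demonstrates how to efficiently train large models using backpropagation while managing the risk of overfitting. It also shows how to use TensorFlow 2.0 and PyTorch and how to optimize an NN architecture to generate trading signals.\n\nIn the following chapters, we build on this foundation to apply various architectures to different investment applications, with a focus on alternative data. These include recurrent NNs tailored to sequential data such as time series or natural language, and convolutional NNs, which are particularly well suited to image data. We also cover deep unsupervised learning, such as how to create synthetic data using generative adversarial networks (GAN). Moreover, we discuss reinforcement learning to train agents that learn interactively from their environment.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002F5cet0Fi.png\" width=\"70%\">\n\u003C\u002Fp>\n\nIn particular, this chapter covers:\n- How deep learning solves AI challenges in complex domains\n- Key innovations that have propelled deep learning to its current popularity\n- How feedforward networks learn representations from data\n- Designing and training deep neural networks (NNs) in Python\n- Implementing deep NNs using Keras, TensorFlow, and PyTorch\n- Building and tuning a deep NN to predict asset returns\n- Designing and backtesting a trading strategy based on deep NN signals\n\n### 18 CNN for Financial Time Series and Satellite Images\n\nCNN architectures continue to evolve. This chapter describes building blocks common to successful applications, demonstrates how transfer learning can speed up learning, and shows how to use CNNs for object detection.\n\nCNNs can generate trading signals from images or time-series data. Satellite data can anticipate commodity trends via aerial images of agricultural areas, mines, or transport networks. Camera footage can help predict consumer activity; we show how to build a CNN that classifies economic activity in satellite images.\n\nMoreover, CNNs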
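can also deliver high-quality time-series classification results by exploiting their structural similarity with images, and we design a strategy based on time-series data formatted like images.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FPlLQV0M.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, this [chapter](18_convolutional_neural_nets) covers:\n\n- How CNNs employ several building blocks to efficiently model grid-like data\n- Training, tuning, and regularizing CNNs for images and time-series data using TensorFlow\n- Using transfer learning to streamline CNN development, even with less data\n- Designing a trading strategy using return predictions from a CNN trained on time-series data formatted like images\n- How to classify economic activity based on satellite images\n\n### 19 RNN for Multivariate Time Series and Sentiment Analysis\n\nRecurrent neural networks (RNNs) compute each output as a function of the previous output and new data, effectively creating a model with memory that shares parameters across a deeper computational graph. Prominent architectures such as Long Short-Term Memory (LSTM) and Gated Recurrent Units (GRU) address the challenges of learning long-range dependencies.\n\nRNNs are designed to map one or more input sequences to one or more output sequences and are particularly well suited to natural language. They can also be applied to univariate and multivariate time series to predict market or fundamental data. This chapter covers how RNNs can model alternative text data using the word embeddings from Chapter 16 to classify the sentiment expressed in documents.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FE9fOApg.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, this chapter addresses:\n- How recurrent connections allow RNNs to memorize patterns and model a hidden state\n- Unrolling and analyzing the computational graph of RNNs\n- How gated units learn to regulate RNN memory from data to enable long-range dependencies\n- Designing and training RNNs for univariate and multivariate time series in Python\n- How to learn word embeddings or use pretrained word vectors for sentiment analysis with RNNs\n- Building a bidirectional RNN to predict stock returns using custom word embeddings\n\n### 20 Autoencoders for Conditional Risk Factors and Asset Pricing\n\nThis [chapter](20_autoencoders_for_conditional_risk_factors) shows how to leverage unsupervised deep learning for trading. We also discuss autoencoders, that is, neural networks trained to reproduce the input while learning a new representation encoded by the parameters of a hidden layer. Autoencoders have long been used for nonlinear dimensionality reduction, building on the NN architectures covered in the last three chapters.\n\nWe replicate a recent AQR paper that shows how autoencoders can underpin a trading strategy. We use a deep neural network that relies on an autoencoder to extract risk factors and predict equity returns, conditioned on a range of equity attributes.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FaCmE0UD.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, in this chapter you will learn about:\n- Which types of autoencoders are of practical use and how they work\n- Building and training autoencoders using Python\n- Using autoencoders to extract data-driven risk factors that take asset characteristics into account to predict returns\n\n### 21 Generative Adversarial Nets for Synthetic Time Series Data\n\nThis chapter introduces generative adversarial networks (GAN). GANs train a generator and a discriminator network in a competitive setting so that the generator learns to produce samples that the discriminator cannot distinguish from a given class of training data. The goal is to produce synthetic samples representative of this class.\n\nWhile most popular with image data, GANs have also been used to generate synthetic time-series data in the medical domain. Subsequent experiments with financial data explored whether GANs can produce alternative price trajectories useful for ML training or strategy backtests. We replicate the 2019 NeurIPS Time-Series GAN paper to illustrate the approach and demonstrate the results.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002FW1Rp89K.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, in this chapter you will learn about:\n- How GANs work, why they are useful, and how they can be applied to trading\n- Designing and training GANs using TensorFlow 2\n-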
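Generating synthetic financial data to expand the inputs available for training ML models and backtesting strategies\n\n### 22 Deep Reinforcement Learning: Building a Trading Agent\n\nReinforcement learning (RL) models goal-directed learning by an agent that interacts with a stochastic environment. RL optimizes the agent's decisions with respect to a long-term objective by learning the value of states and actions from a reward signal. The ultimate goal is to derive a policy that encodes behavioral rules and maps states to actions.\n\nThis [chapter](22_deep_reinforcement_learning) shows how to formulate and solve an RL problem. It covers model-based and model-free methods, introduces the OpenAI Gym environment, and combines deep learning with RL to train an agent that navigates a complex environment. Finally, we show how to adapt RL to algorithmic trading by modeling an agent that interacts with the financial market while trying to optimize an objective function.\n\n\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Fi.imgur.com\u002Flg0ofbZ.png\" width=\"60%\">\n\u003C\u002Fp>\n\nMore specifically, this chapter covers:\n\n- Defining a Markov decision process (MDP)\n- Using value iteration and policy iteration to solve an MDP\n- Applying Q-learning in an environment with discrete states and actions\n- Building and training a deep Q-learning agent with continuous states\n- Using OpenAI Gym to design a custom market environment and train an RL agent to trade stocks\n\n### 23 Conclusions and Next Steps\n\nIn this concluding chapter, we briefly summarize the essential tools, applications, and lessons learned throughout the book so that the big picture is not lost after so much detail. We then identify areas that we did not cover but that are worth focusing on as you build on the many machine learning techniques we introduced and apply them more productively in daily work.\n\nIn sum, in this chapter, we will:\n- Review key takeaways and lessons learned\n- Point out the next steps to build on the techniques in this book\n- Suggest ways to incorporate ML into your investment process\n\n### 24 Appendix: Alpha Factor Library\n\nThroughout this book, we emphasized that the smart design of features, including appropriate preprocessing and denoising, is often the key to an effective strategy. This appendix synthesizes some of the lessons learned on feature engineering and provides additional information on this vital topic.\n\nTo this end, we focus on the broad range of indicators implemented by TA-Lib (see [Chapter 4](04_alpha_factor_research)) and on WorldQuant's 2016 paper *101 Formulaic Alphas* (Kakushadze, 2016), which presents real-life quantitative trading factors used in production with an average holding period of 0.6 to 6.4 days.\n\nThis chapter covers:\n- How to compute several dozen technical indicators using TA-Lib as well as NumPy and pandas\n- Creating the formulaic alphas described in the above paper\n- Evaluating the predictive quality of the results using various metrics, from rank correlation and mutual information to feature importance, SHAP values, and Alphalens","# machine-learning-for-trading Quick Start Guide\n\nThis guide is based on the code repository that accompanies the second edition of Machine Learning for Trading. It helps developers set up an environment and run the machine learning examples for quantitative trading.\n\n## Prerequisites\n\n*   **Operating system**: Linux or macOS recommended (Windows users should consider WSL2 or Docker).\n*   **Core dependencies**:\n    *   Python 3.8+ (some backtesting environments require Python 3.6)\n    *   Conda (Miniconda or Anaconda recommended for managing environments)\n    *   Docker (optional, for isolating the more complex backtesting environments)\n*   **Background knowledge**: familiarity with the Python data analysis stack (pandas, numpy) and basic machine learning concepts.\n\n> **Note**: This project depends on many finance-specific libraries (such as `zipline-reloaded`), and installing them directly into a global environment easily leads to version conflicts. It is **strongly recommended** to create separate Conda environments per chapter as needed, or to use the provided Docker image.\n\n## Installation\n\n### Option 1: Conda environments (recommended)\n\nThe project provides separate environment files for different chapters, which avoids the conflicts caused by installing all dependencies at once.\n\n1. 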
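**Clone the repository**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002Fstefan-jansen\u002Fmachine-learning-for-trading.git\n    cd machine-learning-for-trading\n    ```\n\n2.  **Install a chapter-specific environment**\n    Look in the `installation` directory and pick the `.yml` file for the chapter you want to run. For example, to run the Chapter 12 (gradient boosting machines) examples:\n    ```bash\n    # Create an environment named ml4t_ch12 (-n overrides the name set in the file)\n    conda env create -n ml4t_ch12 -f installation\u002Fenvironment_ch12.yml\n    \n    # Activate the environment\n    conda activate ml4t_ch12\n    ```\n    *Note: check the actual listing under `installation\u002F` for the exact file names.*\n\n3.  **Update the core backtesting libraries (important change since March 2022)**\n    Make sure to install the latest reloaded finance libraries from the `conda-forge` channel; the old `ml4t` channel is deprecated:\n    ```bash\n    conda install -c conda-forge zipline-reloaded pyfolio-reloaded alphalens-reloaded empyrical-reloaded\n    ```\n\n### Option 2: Docker (least effort)\n\nIf you want to avoid configuring a local environment, you can use the Docker image provided by the project.\n\n1.  **Build and run the container**\n    ```bash\n    # Change to the installation directory, which holds the Docker instructions\n    cd installation\n    \n    # Build the image (see installation\u002FREADME.md for the exact commands)\n    docker-compose build\n    \n    # Start Jupyter Lab\n    docker-compose up\n    ```\n    Once it is running, open `http:\u002F\u002Flocalhost:8888` in a browser to use a notebook environment with all dependencies preinstalled.\n\n### Downloading and preprocessing data\n\nThe code requires specific market data and alternative data.\n\n1.  **Read the data instructions**\n    Read `data\u002FREADME.md` to learn which data sources each chapter needs.\n    ```bash\n    cat data\u002FREADME.md\n    ```\n\n2.  **Download the Algoseek data (the main data source used in the book)**\n    Visit the [Algoseek ML4T data page](https:\u002F\u002Fwww.algoseek.com\u002Fml4t-book-data.html) to download the data package.\n    \n3.  **Preprocess the data**\n    Place the downloaded data in the matching folder under the `data` directory and run the preprocessing steps (intraday data shown as an example):\n    ```bash\n    # See the preprocessing instructions in Chapter 2\n    cd 02_market_and_fundamental_data\u002F02_algoseek_intraday\n    # Run the corresponding notebook or script to clean the data\n    ```\n\n## Basic usage\n\nThe project mainly uses Jupyter Notebooks to demonstrate the complete ML trading workflow, from sourcing data and feature engineering to strategy backtesting.\n\n1.  **Start Jupyter Lab**\n    After activating the appropriate Conda environment, run:\n    ```bash\n    jupyter lab\n    ```\n\n2.  **Run an example notebook**\n    Navigate to the relevant chapter directory in the browser. For example, to run an intraday strategy example based on gradient boosting machines (Chapter 12):\n    *   Path: `12_gradient_boosting_machines\u002F10_intraday_features.ipynb`\n    *   Steps: execute the cells in order and follow the whole process of data loading, feature extraction, model training, and signal generation.\n\n3. 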
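**Reproduce models from academic research**\n    Try the chapters that replicate papers from top journals, for example using a CNN on time series formatted as images (Chapter 18):\n    *   Path: `18_convolutional_neural_nets\u002F`\n    *   Content: shows how to convert time series into an image format and use a convolutional neural network to predict returns.\n\n4.  **Backtest a strategy**\n    Use the customized version of Zipline to backtest a strategy. After defining the strategy logic in a notebook, call the backtesting engine to evaluate historical performance:\n    ```python\n    # Illustrative sketch; see the 08_ml4t_workflow chapter for complete examples\n    import pandas as pd\n    from zipline import run_algorithm\n    \n    results = run_algorithm(\n        start=pd.Timestamp('2020-01-01'),\n        end=pd.Timestamp('2020-12-31'),\n        initialize=initialize,      # placeholder: your setup function\n        handle_data=handle_data,    # placeholder: your per-bar trading logic\n        capital_base=100000,\n        bundle='my_data_bundle'     # placeholder: name of an ingested data bundle\n    )\n    ```\n\n> **Tip**: Run the notebooks while reading the corresponding book chapters; details not included in the book (such as parameter tuning details and additional charts) are usually provided in the notebooks.","A quant team wants to use earnings call transcripts and alternative data to build a multi-factor hedging strategy for technology stocks.\n\n### Without machine-learning-for-trading\n- **Blind feature engineering**: Without a mature framework for extracting financial features, the team has to write code from scratch to process SEC filings and transcripts, and struggles to extract sentiment signals from unstructured text.\n- **Difficult model validation**: Without a standardized backtesting process, the team cannot rigorously evaluate the predictive power of deep learning models (such as CNNs and RNNs) across asset classes and easily falls into the overfitting trap.\n- **Gap between research and deployment**: Algorithm research is disconnected from actual trade execution, which makes it hard to turn a complex reinforcement learning trading agent into a long-short strategy that can run live.\n- **High replication cost**: For trading ideas from cutting-edge academic papers, there is no reference code for quick replication and tuning, which stretches the development cycle considerably.\n\n### With machine-learning-for-trading\n- **Efficient signal extraction**: The team reuses the 150+ example notebooks that accompany the book to quickly learn standard methods for extracting high-value factors from financial text and image data.\n- **Rigorous strategy evaluation**: Using the complete backtesting framework, the team systematically designs and validates strategies based on supervised and unsupervised learning and checks model robustness across investment horizons.\n- **End-to-end workflow**: Following the code in the deep reinforcement learning chapter, the team trains an agent that executes trades automatically and connects the whole pipeline from data cleaning to strategy generation.\n- **Faster iteration**: The team builds on the recent research replicated in the book to turn theory into live strategies with the potential for excess returns.\n\nBy providing a systematic codebase and practical case studies, machine-learning-for-trading turns machine learning from a theoretical concept into a practical capability for quantitative trading and lowers the barrier and risk of strategy development.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fstefan-jansen_machine-learning-for-trading_b4e1da4e.png","stefan-jansen","Stefan Jansen","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fstefan-jansen_4aea4bb0.png",null,"https:\u002F\u002Fwww.applied-ai.com","Brooklyn","ml4trading","https:\u002F\u002Fwww.ml4trading.io\u002F","https:\u002F\u002Fgithub.com\u002Fstefan-jansen",[82,86,90,93],{"name":83,"color":84,"percentage":85},"Jupyter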
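Notebook","#DA5B0B",100,{"name":87,"color":88,"percentage":89},"Python","#3572A5",0,{"name":91,"color":92,"percentage":89},"JavaScript","#f1e05a",{"name":94,"color":95,"percentage":89},"Shell","#89e051",17018,5065,"2026-04-10T01:24:33","Linux, macOS, Windows","Not specified (the project covers deep learning such as CNN, RNN, and GAN, for which an NVIDIA GPU is generally advisable, but the README does not state a specific model or memory requirement)","Not specified",{"notes":103,"python":104,"dependencies":105},"1. Running via the Docker image is strongly recommended, or install dependencies per chapter with conda to avoid version conflicts. 2. The old 'ml4t' conda channel is deprecated; use 'conda-forge' to install zipline-reloaded and related packages. 3. Docker is no longer mandatory (as of the April 2021 update); environment files specific to each operating system are provided. 4. The Algoseek data used in the book must be downloaded separately and preprocessed according to the instructions.","3.6 - 3.8 (the Zipline backtesting environment uses Python 3.6; the Conda environments have been updated to Python 3.8)",[106,107,108,109,110,111,112,113],"pandas>=1.0","tensorflow>=2.2","zipline-reloaded","pyfolio-reloaded","alphalens-reloaded","empyrical-reloaded","scikit-learn","numpy",[16,115,14,13],"其他",[117,118,119,120,121,122,123,124,125,126,127,128],"machine-learning","trading","investment","finance","data-science","investment-strategies","artificial-intelligence","trading-strategies","deep-learning","synthetic-data","ml4t-workflow","trading-agent","2026-03-27T02:49:30.150509","2026-04-10T20:45:46.006334",[132,137,142,147,152,157,162],{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},28179,"What should I do about the 'unsupported pickle protocol: 5' error when running the Zipline notebooks?","This error usually occurs because an old version of the Docker image is in use. The maintainer has fixed the problem that occurred when calling `run_algorithm()` through the Zipline Jupyter extension. Make sure to download and run the latest Docker image, which includes the most recent commits that fix this error. If the problem persists, check whether an old image is still being used.","https:\u002F\u002Fgithub.com\u002Fstefan-jansen\u002Fmachine-learning-for-trading\u002Fissues\u002F113",{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},28180,"How do I resolve the 'Permission denied' error from 'zipline ingest' in a Docker environment on GCP or AWS?","The error is usually caused by an incorrect path for the mounted volume. When running the container, use `-v $(pwd):\u002Fhome\u002Fpackt\u002Fml4t` to mount the current working directory into the container. `$(pwd)` stands for the current path and can also be replaced by an absolute path. Do not change the target path inside the container, and make sure you have write permission for the mounted directory. Full example command: `docker run -it -v $(pwd):\u002Fhome\u002Fpackt\u002Fml4t -p 8888:8888 -e QUANDL_API_KEY=\u003Cyour API key> --name ml4t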
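appliedai\u002Fpackt:latest bash`.","https:\u002F\u002Fgithub.com\u002Fstefan-jansen\u002Fmachine-learning-for-trading\u002Fissues\u002F55",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},28181,"What should I do when pandas does not recognize a custom datetime index for slicing (e.g. y[:'2020-01-16'])?","Before slicing by date string, the date column must first be converted to a `DatetimeIndex`. Process the date column with `pd.to_datetime()` and set it as the index of the DataFrame (`set_index`). Once the index is a proper `DatetimeIndex`, you can slice directly with date strings.","https:\u002F\u002Fgithub.com\u002Fstefan-jansen\u002Fmachine-learning-for-trading\u002Fissues\u002F103",{"id":148,"question_zh":149,"answer_zh":150,"source_url":151},28182,"How do I resolve the 'Failed to find any assets with country_code US' error when running a Zipline backtest?","This usually means that the asset database is outdated or its metadata is incorrect. If you use a data bundle for a non-US market (such as stooq), you may need to update the country code in the SQLite database manually. Connect to the asset database file with sqlite3 and run an SQL update statement, for example `UPDATE exchanges SET country_code = 'JP';` (for the Japanese market), then commit the change. Also note that the first data ingest can take a long time (about 30 minutes).","https:\u002F\u002Fgithub.com\u002Fstefan-jansen\u002Fmachine-learning-for-trading\u002Fissues\u002F21",{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},28183,"What should I do when reading Parquet files in the storage_benchmark notebook crashes the kernel or the Mixed.csv file is missing?","The `Mixed.csv` file is not provided in advance. It is generated by a second run of the notebook that is configured to include both text data (random 10-character strings) and numerical data. As described in the book (2nd edition, pages 57\u002F58), the benchmark tests the numerical-only and the mixed-data scenarios separately. Reconfigure the data generation step as described in the notebook to create a DataFrame that contains text columns, which produces the required file.","https:\u002F\u002Fgithub.com\u002Fstefan-jansen\u002Fmachine-learning-for-trading\u002Fissues\u002F64",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},28184,"How do I handle the 'cannot set WRITEABLE flag to True of this array' error when reading data from an HDFStore?","This error is usually related to the pandas version or to the memory flags of the underlying NumPy array. The exact fix depends on the environment, but a common remedy is to make sure the installed pandas and PyTables versions match the project's dependencies. If the error appears after a specific step (such as `.unstack()` or column selection), try explicitly copying the DataFrame after reading it (using `.copy()`), or check whether the code attempts to modify a read-only array in an unsupported operation. Updating the relevant libraries in the `ml4t-dl` environment to the latest compatible versions is recommended.","https:\u002F\u002Fgithub.com\u002Fstefan-jansen\u002Fmachine-learning-for-trading\u002Fissues\u002F100",{"id":163,"question_zh":164,"answer_zh":165,"source_url":156},28185,"What should I do when a misspelled Docker image repository name prevents pulling the image?","Note that the repository name on Docker Hub is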
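`appliedai`, not `alliedai`. The correct pull command should be based on `appliedai\u002Fpackt:latest`. If the repository is private, you also need to log in to Docker first. Check that the repository name is spelled correctly in your docker pull or run command.",[167],{"id":168,"version":169,"summary_zh":170,"released_at":171},189092,"2.0","This release ships with a new Docker image that uses only two environments with updated libraries. Apart from one or two dozen notebooks that use the `backtest` environment to run backtests with Zipline 1.4.1 on Python 3.6, all other notebooks use the `ml4t` environment (Python 3.8).\n\nI have done my best to fix the various bugs reported in the months since publication and will continue to do so.","2021-02-27T02:09:47"]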