[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-borisbanushev--stockpredictionai":3,"tool-borisbanushev--stockpredictionai":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85052,2,"2026-04-08T11:03:08",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[19,14,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},5773,"cs-video-courses","Developer-Y\u002Fcs-video-courses","cs-video-courses 是一个精心整理的计算机科学视频课程清单，旨在为自学者提供系统化的学习路径。它汇集了全球知名高校（如加州大学伯克利分校、新南威尔士大学等）的完整课程录像，涵盖从编程基础、数据结构与算法，到操作系统、分布式系统、数据库等核心领域，并深入延伸至人工智能、机器学习、量子计算及区块链等前沿方向。\n\n面对网络上零散且质量参差不齐的教学资源，cs-video-courses 解决了学习者难以找到成体系、高难度大学级别课程的痛点。该项目严格筛选内容，仅收录真正的大学层级课程，排除了碎片化的简短教程或商业广告，确保用户能接触到严谨的学术内容。\n\n这份清单特别适合希望夯实计算机基础的开发者、需要补充特定领域知识的研究人员，以及渴望像在校生一样系统学习计算机科学的自学者。其独特的技术亮点在于分类极其详尽，不仅包含传统的软件工程与网络安全，还细分了生成式 AI、大语言模型、计算生物学等新兴学科，并直接链接至官方视频播放列表，让用户能一站式获取高质量的教育资源，免费享受世界顶尖大学的课堂体验。",79792,"2026-04-08T22:03:59",[18,13,14,20],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 
CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",75179,"2026-04-08T22:01:10",[19,13,20,18],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":29,"last_commit_at":63,"category_tags":64,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,"2026-04-03T21:50:24",[20,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":80,"owner_twitter":80,"owner_website":80,"owner_url":82,"languages":83,"stars":96,"forks":97,"last_commit_at":98,"license":80,"difficulty_score":99,"env_os":100,"env_gpu":101,"env_ram":100,"env_deps":102,"category_tags":111,"github_topics":80,"view_count":10,"oss_zip_url":80,"oss_zip_packed_at":80,"status":22,"created_at":112,"updated_at":113,"faqs":114,"releases":115},5768,"borisbanushev\u002Fstockpredictionai","stockpredictionai","       In this noteboook I will create a complete process for predicting stock price movements. Follow along and we will achieve some pretty good results. For that purpose we will use a Generative Adversarial Network (GAN) with LSTM, a type of Recurrent Neural Network, as generator, and a Convolutional Neural Network, CNN, as a discriminator. We use LSTM for the obvious reason that we are trying to predict time series data. Why we use GAN and specifically CNN as a discriminator? That is a good question: there are special sections on that later.","stockpredictionai 是一个基于先进人工智能技术的开源项目，旨在构建一套完整的股票价格波动预测流程。它致力于解决金融时间序列预测中数据复杂、噪声大以及传统模型难以捕捉深层规律的问题，通过融合多维度信息来提升预测准确性。\n\n该项目非常适合具备一定机器学习基础的开发者、量化研究人员以及对深度学习在金融领域应用感兴趣的数据科学家使用。其核心技术亮点在于创新性地结合了生成对抗网络（GAN）与长短期记忆网络（LSTM），利用 LSTM 作为生成器处理时间序列，卷积神经网络（CNN）作为判别器进行校验。为了克服 GAN 训练难的痛点，项目引入了贝叶斯优化和强化学习（如 Rainbow、PPO 算法）来动态调整超参数。此外，它还广泛整合了自然语言处理（BERT 情感分析）、傅里叶变换趋势提取、堆叠自编码器特征识别以及 ARIMA 等多种前沿技术，全方位挖掘市场数据中的潜在模式。基于 MXNet 框架开发的 stockpredictionai，为探索复杂的股市预测提供了极具参考价值的技术实践方案。","\n# Using the latest advancements in AI to predict stock market movements\n\n \n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In this notebook I will create a complete process for predicting stock price movements. Follow along and we will achieve some pretty good results. For that purpose we will use a **Generative Adversarial Network** (GAN) with **LSTM**, a type of Recurrent Neural Network, as generator, and a Convolutional Neural Network, **CNN**, as a discriminator. We use LSTM for the obvious reason that we are trying to predict time series data. Why we use GAN and specifically CNN as a discriminator? That is a good question: there are special sections on that later.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We will go into greater details for each step, of course, but the most difficult part is the GAN: very tricky part of successfully training a GAN is getting the right set of hyperparameters. 
For that reason we will use **Bayesian optimisation** (along with Gaussian processes) and **Reinforcement learning** (RL) for deciding when and how to change the GAN's hyperparameters (the exploration vs. exploitation dilemma). In creating the reinforcement learning we will use the most recent advancements in the field, such as **Rainbow** and **PPO**.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We will use a lot of different types of input data. Along with the stock's historical trading data and technical indicators, we will use the newest advancements in **NLP** (using 'Bidirectional Embedding Representations from Transformers', **BERT**, sort of a transfer learning for NLP) to create sentiment analysis (as a source for fundamental analysis), **Fourier transforms** for extracting overall trend directions, **Stacked autoencoders** for identifying other high-level features, **Eigen portfolios** for finding correlated assets, autoregressive integrated moving average (**ARIMA**) for the stock function approximation, and many more, in order to capture as much information, patterns, dependencies, etc, as possible about the stock. As we all know, the more (data) the merrier. Predicting stock price movements is an extremely complex task, so the more we know about the stock (from different perspectives) the higher our changes are.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the purpose of creating all neural nets we will use MXNet and its high-level API - Gluon, and train them on multiple GPUs.\n\n**Note:** _Although I try to get into details of the math and the mechanisms behind almost all algorithms and techniques, this notebook is not explicitly intended to explain how machine\u002Fdeep learning, or the stock markets, work. The purpose is rather to show how we can use different techniques and algorithms for the purpose of accurately predicting stock price movements, and to also give rationale behind the reason and usefulness of using each technique at each step._\n\n_Notebook created: January 9, 2019_.\n\n\n**Figure 1 - The overall architecture of our work**\n\n\u003Ccenter>\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_87a782e18466.jpg' width=1060>\u003C\u002Fimg>\u003C\u002Fcenter>\n\n## Table of content\n* [Introduction](#overview)\n* [Acknowledgement](#acknowledgement)\n* [The data](#thedata)\n    * [Correlated assets](#corrassets)\n    * [Technical indicators](#technicalind)\n    * [Fundamental analysis](#fundamental)\n        - [Bidirectional Embedding Representations from Transformers - BERT](#bidirnlp)\n    * [Fourier transforms for trend analysis](#fouriertransform)\n    * [ARIMA as a feature](#arimafeature)\n    * [Statistical checks](#statchecks)\n        - [Heteroskedasticity, multicollinearity, serial correlation](#hetemultiser)\n    * [Feature Engineering](#featureeng)\n        * [Feature importance with XGBoost](#xgboost)\n    * [Extracting high-level features with Stacked Autoencoders](#stacked_ae)\n        * [Activation function - GELU (Gaussian Error)](#gelu)\n        * [Eigen portfolio with PCA](#pca)\n    * [Deep Unsupervised Learning for anomaly detection in derivatives pricing](#dulfaddp)\n* [Generative Adversarial Network - GAN](#qgan)\n    * [Why GAN for stock market prediction?](#whygan)\n    * [Metropolis-Hastings GAN and Wasserstein GAN](#mhganwgan)\n    * [The Generator - One layer RNN](#thegenerator)\n        - [LSTM or GRU](#lstmorgru)\n        - [The LSTM architecture](#lstmarchitecture)\n        
- [Learning rate scheduler](#lrscheduler)\n        - [How to prevent overfitting and the bias-variance trade-off](#preventoverfitting)\n        - [Custom weights initializers and custom loss metric](#customfns)\n    * [The Discriminator - 1D CNN](#thediscriminator)\n        - [Why CNN as a discriminator?](#why_cnn_architecture)\n        - [The CNN architecture](#the_cnn_architecture)\n    * [Hyperparameters](#hyperparams)\n* [Hyperparameters optimization](#hyperparams_optim)\n    * [Reinforcement learning for hyperparameters optimization](#reinforcementlearning)\n        - [Theory](#reinforcementlearning_theory)\n            - [Rainbow](#rl_rainbow)\n            - [PPO](#rl_ppo)\n        - [Further work on Reinforcement learning](#reinforcementlearning_further)\n    * [Bayesian optimization](#bayesian_opt)\n        - [Gaussian process](#gaussprocess)\n* [The result](#theresult)\n* [What is next?](#whatisnext)\n* [Disclaimer](#disclaimer)\n\n# 1. Introduction \u003Ca class=\"anchor\" id=\"overview\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Accurately predicting the stock markets is a complex task as there are millions of events and pre-conditions for a particilar stock to move in a particular direction. So we need to be able to capture as many of these pre-conditions as possible. We also need make several important assumptions: 1) markets are not 100% random, 2) history repeats, 3) markets follow people's rational behavior, and 4) the markets are '_perfect_'. And, please, do read the **Disclaimer** at the \u003Ca href=\"#disclaimer\">bottom\u003C\u002Fa>.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We will try to predict the price movements of **Goldman Sachs** (NYSE: GS). For the purpose, we will use daily closing price from January 1st, 2010 to December 31st, 2018 (seven years for training purposes and two years for validation purposes). _We will use the terms 'Goldman Sachs' and 'GS' interchangeably_.\n\n# 2. Acknowledgement \u003Ca class=\"anchor\" id=\"acknowledgement\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Before we continue, I'd like to thank my friends \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmanganganath\">Nuwan\u003C\u002Fa> and \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FQ4living\">Thomas\u003C\u002Fa> without whose ideas and support I wouldn't have been able to create this work.\n\n# 3. The Data \u003Ca class=\"anchor\" id=\"thedata\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We need to understand what affects whether GS's stock price will move up or down. It is what people as a whole think. Hence, we need to incorporate as much information (depicting the stock from different aspects and angles) as possible. (We will use daily data - 1,585 days to train the various algorithms (70% of the data we have) and predict the next 680 days (test data). Then we will compare the predicted results with a test (hold-out) data. Each type of data (we will refer to it as _feature_) is explained in greater detail in later sections, but, as a high level overview, the features we will use are:\n\n1. **Correlated assets** - these are other assets (any type, not necessarily stocks, such as commodities, FX, indices, or even fixed income securities). A big company, such as Goldman Sachs, obviously doesn't 'live' in an isolated world - it depends on, and interacts with, many external factors, including its competitors, clients, the global economy, the geo-political situation, fiscal and monetary policies, access to capital, etc. 
The details are listed later.\n2. **Technical indicators** - a lot of investors follow technical indicators. We will include the most popular indicators as independent features. Among them: 7- and 21-day moving averages, exponential moving average, momentum, Bollinger Bands, and MACD.\n3. **Fundamental analysis** - A very important feature indicating whether a stock might move up or down. There are two features that can be used in fundamental analysis: 1) Analysing the company performance using 10-K and 10-Q reports, analysing ROE and P\u002FE, etc (we will not use this), and 2) **News** - potentially news can indicate upcoming events that can potentially move the stock in a certain direction. We will read all daily news for Goldman Sachs and extract whether the total sentiment about Goldman Sachs on that day is positive, neutral, or negative (as a score from 0 to 1). As many investors closely read the news and make investment decisions based (partially, of course) on news, there is a somewhat high chance that if, say, the news about Goldman Sachs today is extremely positive the stock will surge tomorrow. _One crucial point: we will perform feature importance (meaning how indicative a feature is for the movement of GS) on absolutely every feature (including this one) later on and decide whether we will use it. More on that later_.\u003Cbr\u002F>\n    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the purpose of creating accurate sentiment predictions we will use Natural Language Processing (**NLP**). We will use **BERT** - Google's recently announced NLP approach for transfer learning - for the sentiment classification of stock news.\n4. **Fourier transforms** - Along with the daily closing price, we will create Fourier transforms in order to generalize several long- and short-term trends. Using these transforms we will eliminate a lot of noise (random walks) and create approximations of the real stock movement. Having trend approximations can help the LSTM network pick its prediction trends more accurately.\n5. **Autoregressive Integrated Moving Average** (ARIMA) - This was one of the most popular techniques for predicting future values of time series data (in the pre-neural-network age). Let's add it and see if it comes off as an important predictive feature.\n6. **Stacked autoencoders** - most of the aforementioned features (fundamental analysis, technical analysis, etc) were found by people after decades of research. But maybe we have missed something. Maybe there are hidden correlations that people cannot comprehend due to the enormous amount of data points, events, assets, charts, etc. With stacked autoencoders (a type of neural network) we can use the power of computers and probably find new types of features that affect stock movements. Even though we will not be able to understand these features in human language, we will use them in the GAN.\n7. **Deep Unsupervised learning for anomaly detection in options pricing**. We will use one more feature - for every day we will add the price of a 90-day call option on Goldman Sachs stock. Options pricing itself combines a lot of data. The price of an options contract depends on the future value of the stock (analysts try to also predict the price in order to come up with the most accurate price for the call option). Using deep unsupervised learning (**Self-Organizing Maps**) we will try to spot anomalies in every day's pricing. 
An anomaly (such as a drastic change in pricing) might indicate an event that could be useful for the LSTM when learning the overall stock pattern.\n\nNext, having so many features, we need to perform a couple of important steps:\n1. Perform statistical checks for the 'quality' of the data. If the data we create is flawed, then no matter how sophisticated our algorithms are, the results will not be positive. The checks include making sure the data does not suffer from heteroskedasticity, multicollinearity, or serial correlation.\n2. Create feature importance. If a feature (e.g. another stock or a technical indicator) has no explanatory power for the stock we want to predict, then there is no need for us to use it in the training of the neural nets. We will use **XGBoost** (eXtreme Gradient Boosting), a type of boosted-tree regression algorithm.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As a final step of our data preparation, we will also create **Eigen portfolios** using Principal Component Analysis (**PCA**) in order to reduce the dimensionality of the features created from the autoencoders.\n\n\n```python\nfrom utils import *\n\nimport time\nimport numpy as np\nimport pandas as pd  # pandas is used explicitly below (e.g. pd.read_csv)\n\nfrom mxnet import nd, autograd, gluon\nfrom mxnet.gluon import nn, rnn\nimport mxnet as mx\nimport datetime\nimport seaborn as sns\n\nimport matplotlib.pyplot as plt\n%matplotlib inline\nfrom sklearn.decomposition import PCA\n\nimport math\n\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.metrics import mean_squared_error\nfrom sklearn.preprocessing import StandardScaler\n\nimport xgboost as xgb\nfrom sklearn.metrics import accuracy_score\n```\n\n\n```python\nimport warnings\nwarnings.filterwarnings(\"ignore\")\n```\n\n\n```python\ncontext = mx.cpu(); model_ctx=mx.cpu()\nmx.random.seed(1719)\n```\n\n**Note**: The purpose of this section (3. 
The Data) is to show the data preprocessing and to give rationale for using different sources of data, hence I will only use a subset of the full data (that is used for training).\n\n\n```python\ndef parser(x):\n    return datetime.datetime.strptime(x,'%Y-%m-%d')\n```\n\n\n```python\ndataset_ex_df = pd.read_csv('data\u002Fpanel_data_close.csv', header=0, parse_dates=[0], date_parser=parser)\n```\n\n\n```python\ndataset_ex_df[['Date', 'GS']].head(3)\n```\n\n\n\n\n\u003Cdiv>\n\u003Cstyle scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003C\u002Fstyle>\n\u003Ctable border=\"1\" class=\"dataframe\">\n  \u003Cthead>\n    \u003Ctr style=\"text-align: right;\">\n      \u003Cth>\u003C\u002Fth>\n      \u003Cth>Date\u003C\u002Fth>\n      \u003Cth>GS\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Cth>0\u003C\u002Fth>\n      \u003Ctd>2009-12-31\u003C\u002Ftd>\n      \u003Ctd>168.839996\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>1\u003C\u002Fth>\n      \u003Ctd>2010-01-04\u003C\u002Ftd>\n      \u003Ctd>173.080002\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>2\u003C\u002Fth>\n      \u003Ctd>2010-01-05\u003C\u002Ftd>\n      \u003Ctd>176.139999\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n\n\n\n```python\nprint('There are {} number of days in the dataset.'.format(dataset_ex_df.shape[0]))\n```\n\n    There are 2265 number of days in the dataset.\n\n\nLet's visualize the stock for the last nine years. The dashed vertical line represents the separation between training and test data.\n\n\n```python\nplt.figure(figsize=(14, 5), dpi=100)\nplt.plot(dataset_ex_df['Date'], dataset_ex_df['GS'], label='Goldman Sachs stock')\nplt.vlines(datetime.date(2016,4, 20), 0, 270, linestyles='--', colors='gray', label='Train\u002FTest data cut-off')\nplt.xlabel('Date')\nplt.ylabel('USD')\nplt.title('Figure 2: Goldman Sachs stock price')\nplt.legend()\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_7877aa06c14b.png)\n\n\n\n```python\nnum_training_days = int(dataset_ex_df.shape[0]*.7)\nprint('Number of training days: {}. Number of test days: {}.'.format(num_training_days, \\\n                                                                    dataset_ex_df.shape[0]-num_training_days))\n```\n\n    Number of training days: 1585. Number of test days: 680.\n\n\n## 3.1. Correlated assets \u003Ca class=\"anchor\" id=\"corrassets\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As explained earlier we will use other assets as features, not only GS.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;So what other assets would affect GS's stock movements? Good understanding of the company, its lines of businesses, competitive landscape, dependencies, suppliers and client type, etc is very important for picking the right set of correlated assets:\n- First are the **companies** similar to GS. We will add JPMorgan Chase and Morgan Stanley, among others, to the dataset.\n- As an investment bank, Goldman Sachs depends on the **global economy**. Bad or volatile economy means no M&As or IPOs, and possibly limited proprietary trading earnings. That is why we will include global economy indices. 
Also, we will include LIBOR (USD and GBP denominated) rate, as possibly shocks in the economy might be accounted for by analysts to set these rates, and other **FI** securities.\n- Daily volatility index (**VIX**) - for the reason described in the previous point.\n- **Composite indices** - such as NASDAQ and NYSE (from USA), FTSE100 (UK), Nikkei225 (Japan), Hang Seng and BSE Sensex (APAC) indices.\n- **Currencies** - global trade is many times reflected into how currencies move, ergo we'll use a basket of currencies (such as USDJPY, GBPUSD, etc) as features.\n\n#### Overall, we have 72 other assets in the dataset - daily price for every asset.\n\n## 3.2. Technical indicators \u003Ca class=\"anchor\" id=\"technicalind\">\u003C\u002Fa>\n\nWe already covered what are technical indicators and why we use them so let's jump straight to the code. We will create technical indicators only for GS.\n\n\n```python\ndef get_technical_indicators(dataset):\n    # Create 7 and 21 days Moving Average\n    dataset['ma7'] = dataset['price'].rolling(window=7).mean()\n    dataset['ma21'] = dataset['price'].rolling(window=21).mean()\n    \n    # Create MACD\n    dataset['26ema'] = pd.ewma(dataset['price'], span=26)\n    dataset['12ema'] = pd.ewma(dataset['price'], span=12)\n    dataset['MACD'] = (dataset['12ema']-dataset['26ema'])\n\n    # Create Bollinger Bands\n    dataset['20sd'] = pd.stats.moments.rolling_std(dataset['price'],20)\n    dataset['upper_band'] = dataset['ma21'] + (dataset['20sd']*2)\n    dataset['lower_band'] = dataset['ma21'] - (dataset['20sd']*2)\n    \n    # Create Exponential moving average\n    dataset['ema'] = dataset['price'].ewm(com=0.5).mean()\n    \n    # Create Momentum\n    dataset['momentum'] = dataset['price']-1\n    \n    return dataset\n```\n\n\n```python\ndataset_TI_df = get_technical_indicators(dataset_ex_df[['GS']])\n```\n\n\n```python\ndataset_TI_df.head()\n```\n\n\n\n\n\u003Cdiv>\n\u003Cstyle scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003C\u002Fstyle>\n\u003Ctable border=\"1\" class=\"dataframe\">\n  \u003Cthead>\n    \u003Ctr style=\"text-align: right;\">\n      \u003Cth>\u003C\u002Fth>\n      \u003Cth>Date\u003C\u002Fth>\n      \u003Cth>price\u003C\u002Fth>\n      \u003Cth>ma7\u003C\u002Fth>\n      \u003Cth>ma21\u003C\u002Fth>\n      \u003Cth>26ema\u003C\u002Fth>\n      \u003Cth>12ema\u003C\u002Fth>\n      \u003Cth>MACD\u003C\u002Fth>\n      \u003Cth>20sd\u003C\u002Fth>\n      \u003Cth>upper_band\u003C\u002Fth>\n      \u003Cth>lower_band\u003C\u002Fth>\n      \u003Cth>ema\u003C\u002Fth>\n      \u003Cth>momentum\u003C\u002Fth>\n      \u003Cth>log_momentum\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Cth>0\u003C\u002Fth>\n      \u003Ctd>2010-02-01\u003C\u002Ftd>\n      \u003Ctd>153.130005\u003C\u002Ftd>\n      \u003Ctd>152.374285\u003C\u002Ftd>\n      \u003Ctd>164.220476\u003C\u002Ftd>\n      \u003Ctd>160.321839\u003C\u002Ftd>\n      \u003Ctd>156.655072\u003C\u002Ftd>\n      \u003Ctd>-3.666767\u003C\u002Ftd>\n      \u003Ctd>9.607375\u003C\u002Ftd>\n      \u003Ctd>183.435226\u003C\u002Ftd>\n      \u003Ctd>145.005726\u003C\u002Ftd>\n      \u003Ctd>152.113609\u003C\u002Ftd>\n      \u003Ctd>152.130005\u003C\u002Ftd>\n      \u003Ctd>5.024735\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>1\u003C\u002Fth>\n      
\u003Ctd>2010-02-02\u003C\u002Ftd>\n      \u003Ctd>156.940002\u003C\u002Ftd>\n      \u003Ctd>152.777143\u003C\u002Ftd>\n      \u003Ctd>163.653809\u003C\u002Ftd>\n      \u003Ctd>160.014868\u003C\u002Ftd>\n      \u003Ctd>156.700048\u003C\u002Ftd>\n      \u003Ctd>-3.314821\u003C\u002Ftd>\n      \u003Ctd>9.480630\u003C\u002Ftd>\n      \u003Ctd>182.615070\u003C\u002Ftd>\n      \u003Ctd>144.692549\u003C\u002Ftd>\n      \u003Ctd>155.331205\u003C\u002Ftd>\n      \u003Ctd>155.940002\u003C\u002Ftd>\n      \u003Ctd>5.049471\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>2\u003C\u002Fth>\n      \u003Ctd>2010-02-03\u003C\u002Ftd>\n      \u003Ctd>157.229996\u003C\u002Ftd>\n      \u003Ctd>153.098572\u003C\u002Ftd>\n      \u003Ctd>162.899047\u003C\u002Ftd>\n      \u003Ctd>159.766235\u003C\u002Ftd>\n      \u003Ctd>156.783365\u003C\u002Ftd>\n      \u003Ctd>-2.982871\u003C\u002Ftd>\n      \u003Ctd>9.053702\u003C\u002Ftd>\n      \u003Ctd>181.006450\u003C\u002Ftd>\n      \u003Ctd>144.791644\u003C\u002Ftd>\n      \u003Ctd>156.597065\u003C\u002Ftd>\n      \u003Ctd>156.229996\u003C\u002Ftd>\n      \u003Ctd>5.051329\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>3\u003C\u002Fth>\n      \u003Ctd>2010-02-04\u003C\u002Ftd>\n      \u003Ctd>150.679993\u003C\u002Ftd>\n      \u003Ctd>153.069999\u003C\u002Ftd>\n      \u003Ctd>161.686666\u003C\u002Ftd>\n      \u003Ctd>158.967168\u003C\u002Ftd>\n      \u003Ctd>155.827031\u003C\u002Ftd>\n      \u003Ctd>-3.140137\u003C\u002Ftd>\n      \u003Ctd>8.940246\u003C\u002Ftd>\n      \u003Ctd>179.567157\u003C\u002Ftd>\n      \u003Ctd>143.806174\u003C\u002Ftd>\n      \u003Ctd>152.652350\u003C\u002Ftd>\n      \u003Ctd>149.679993\u003C\u002Ftd>\n      \u003Ctd>5.008500\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>4\u003C\u002Fth>\n      \u003Ctd>2010-02-05\u003C\u002Ftd>\n      \u003Ctd>154.160004\u003C\u002Ftd>\n      \u003Ctd>153.449999\u003C\u002Ftd>\n      \u003Ctd>160.729523\u003C\u002Ftd>\n      \u003Ctd>158.550196\u003C\u002Ftd>\n      \u003Ctd>155.566566\u003C\u002Ftd>\n      \u003Ctd>-2.983631\u003C\u002Ftd>\n      \u003Ctd>8.151912\u003C\u002Ftd>\n      \u003Ctd>177.033348\u003C\u002Ftd>\n      \u003Ctd>144.425699\u003C\u002Ftd>\n      \u003Ctd>153.657453\u003C\u002Ftd>\n      \u003Ctd>153.160004\u003C\u002Ftd>\n      \u003Ctd>5.031483\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n\n\nSo we have the technical indicators (including MACD, Bollinger bands, etc) for every trading day. 
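\n\n**Note**: newer versions of pandas removed ```pd.ewma``` and ```pd.stats.moments.rolling_std```, so ```get_technical_indicators()``` above will not run as-is on a recent install. Below is a minimal sketch of the same calculations using the current ```rolling```\u002F```ewm``` API (the function name ```get_technical_indicators_modern``` is only an illustrative stand-in, not part of the original notebook):\n\n\n```python\ndef get_technical_indicators_modern(dataset):\n    # 7- and 21-day moving averages\n    dataset['ma7'] = dataset['price'].rolling(window=7).mean()\n    dataset['ma21'] = dataset['price'].rolling(window=21).mean()\n\n    # MACD from the 26- and 12-day exponential moving averages\n    dataset['26ema'] = dataset['price'].ewm(span=26).mean()\n    dataset['12ema'] = dataset['price'].ewm(span=12).mean()\n    dataset['MACD'] = dataset['12ema'] - dataset['26ema']\n\n    # Bollinger Bands around the 21-day moving average\n    dataset['20sd'] = dataset['price'].rolling(window=20).std()\n    dataset['upper_band'] = dataset['ma21'] + (dataset['20sd'] * 2)\n    dataset['lower_band'] = dataset['ma21'] - (dataset['20sd'] * 2)\n\n    # Exponential moving average and the notebook's simple momentum proxy\n    dataset['ema'] = dataset['price'].ewm(com=0.5).mean()\n    dataset['momentum'] = dataset['price'] - 1\n\n    return dataset\n```\n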
We have in total 12 technical indicators.\n\nLet's visualize the last 400 days of these indicators.\n\n\n```python\ndef plot_technical_indicators(dataset, last_days):\n    plt.figure(figsize=(16, 10), dpi=100)\n    shape_0 = dataset.shape[0]\n    xmacd_ = shape_0-last_days\n    \n    dataset = dataset.iloc[-last_days:, :]\n    x_ = range(3, dataset.shape[0])\n    x_ =list(dataset.index)\n    \n    # Plot first subplot\n    plt.subplot(2, 1, 1)\n    plt.plot(dataset['ma7'],label='MA 7', color='g',linestyle='--')\n    plt.plot(dataset['price'],label='Closing Price', color='b')\n    plt.plot(dataset['ma21'],label='MA 21', color='r',linestyle='--')\n    plt.plot(dataset['upper_band'],label='Upper Band', color='c')\n    plt.plot(dataset['lower_band'],label='Lower Band', color='c')\n    plt.fill_between(x_, dataset['lower_band'], dataset['upper_band'], alpha=0.35)\n    plt.title('Technical indicators for Goldman Sachs - last {} days.'.format(last_days))\n    plt.ylabel('USD')\n    plt.legend()\n\n    # Plot second subplot\n    plt.subplot(2, 1, 2)\n    plt.title('MACD')\n    plt.plot(dataset['MACD'],label='MACD', linestyle='-.')\n    plt.hlines(15, xmacd_, shape_0, colors='g', linestyles='--')\n    plt.hlines(-15, xmacd_, shape_0, colors='g', linestyles='--')\n    plt.plot(dataset['log_momentum'],label='Momentum', color='b',linestyle='-')\n\n    plt.legend()\n    plt.show()\n```\n\n\n```python\nplot_technical_indicators(dataset_TI_df, 400)\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_4d011c141eca.png)\n\n\n## 3.3. Fundamental analysis \u003Ca class=\"anchor\" id=\"fundamental\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For fundamental analysis we will perform sentiment analysis on all daily news about GS. Using sigmoid at the end, result will be between 0 and 1. The closer the score is to 0 - the more negative the news is (closer to 1 indicates positive sentiment). For each day, we will create the average daily score (as a number between 0 and 1) and add it as a feature.\n\n### 3.3.1. Bidirectional Embedding Representations from Transformers - BERT \u003Ca class=\"anchor\" id=\"bidirnlp\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;For the purpose of classifying news as positive or negative (or neutral) we will use \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04805\">BERT\u003C\u002Fa>, which is a pre-trained language representation.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Pretrained BERT models are already available in MXNet\u002FGluon. We just need to instantiated them and add two (arbitrary number) ```Dense``` layers, going to softmax - the score is from 0 to 1.\n\n\n```python\n# just import bert\nimport bert\n```\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Going into the details of BERT and the NLP part is not in the scope of this notebook, but you have interest, do let me know - I will create a new repo only for BERT as it definitely is quite promissing when it comes to language processing tasks.\n\n## 3.4. Fourier transforms for trend analysis \u003Ca class=\"anchor\" id=\"fouriertransform\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Fourier transforms** take a function and create a series of sine waves (with different amplitudes and frames). When combined, these sine waves approximate the original function. 
Mathematically speaking, the transforms look like this:\n\n$$G(f) = \\int_{-\\infty}^\\infty g(t) e^{-i 2 \\pi f t} dt$$\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We will use Fourier transforms to extract global and local trends in the GS stock, and to also denoise it a little. So let's see how it works.\n\n\n```python\ndata_FT = dataset_ex_df[['Date', 'GS']]\n```\n\n\n```python\nclose_fft = np.fft.fft(np.asarray(data_FT['GS'].tolist()))\nfft_df = pd.DataFrame({'fft':close_fft})\nfft_df['absolute'] = fft_df['fft'].apply(lambda x: np.abs(x))\nfft_df['angle'] = fft_df['fft'].apply(lambda x: np.angle(x))\n```\n\n\n```python\nplt.figure(figsize=(14, 7), dpi=100)\nfft_list = np.asarray(fft_df['fft'].tolist())\nfor num_ in [3, 6, 9, 100]:\n    fft_list_m10= np.copy(fft_list); fft_list_m10[num_:-num_]=0\n    plt.plot(np.fft.ifft(fft_list_m10), label='Fourier transform with {} components'.format(num_))\nplt.plot(data_FT['GS'],  label='Real')\nplt.xlabel('Days')\nplt.ylabel('USD')\nplt.title('Figure 3: Goldman Sachs (close) stock prices & Fourier transforms')\nplt.legend()\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_5135aafa6ea7.png)\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As you see in Figure 3 the more components from the Fourier transform we use the closer the approximation function is to the real stock price (the 100 components transform is almost identical to the original function - the red and the purple lines almost overlap). We use Fourier transforms for the purpose of extracting long- and short-term trends so we will use the transforms with 3, 6, and 9 components. You can infer that the transform with 3 components serves as the long term trend.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Another technique used to denoise data is call **wavelets**. Wavelets and Fourier transform gave similar results so we will only use Fourier transforms.\n\n\n\n```python\nfrom collections import deque\nitems = deque(np.asarray(fft_df['absolute'].tolist()))\nitems.rotate(int(np.floor(len(fft_df)\u002F2)))\nplt.figure(figsize=(10, 7), dpi=80)\nplt.stem(items)\nplt.title('Figure 4: Components of Fourier transforms')\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_589833284e28.png)\n\n\n## 3.5. ARIMA as a feature \u003Ca class=\"anchor\" id=\"arimafeature\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**ARIMA** is a technique for predicting time series data. We will show how to use it, and althouth ARIMA will not serve as our final prediction, we will use it as a technique to denoise the stock a little and to (possibly) extract some new patters or features.\n\n\n```python\nfrom statsmodels.tsa.arima_model import ARIMA\nfrom pandas import DataFrame\nfrom pandas import datetime\n\nseries = data_FT['GS']\nmodel = ARIMA(series, order=(5, 1, 0))\nmodel_fit = model.fit(disp=0)\nprint(model_fit.summary())\n```\n\n                                 ARIMA Model Results                              \n    ==============================================================================\n    Dep. Variable:                   D.GS   No. Observations:                 2264\n    Model:                 ARIMA(5, 1, 0)   Log Likelihood               -5465.888\n    Method:                       css-mle   S.D. 
of innovations              2.706\n    Date:                Wed, 09 Jan 2019   AIC                          10945.777\n    Time:                        10:28:07   BIC                          10985.851\n    Sample:                             1   HQIC                         10960.399\n                                                                                  \n    ==============================================================================\n                     coef    std err          z      P>|z|      [0.025      0.975]\n    ------------------------------------------------------------------------------\n    const         -0.0011      0.054     -0.020      0.984      -0.106       0.104\n    ar.L1.D.GS    -0.0205      0.021     -0.974      0.330      -0.062       0.021\n    ar.L2.D.GS     0.0140      0.021      0.665      0.506      -0.027       0.055\n    ar.L3.D.GS    -0.0030      0.021     -0.141      0.888      -0.044       0.038\n    ar.L4.D.GS     0.0026      0.021      0.122      0.903      -0.039       0.044\n    ar.L5.D.GS    -0.0522      0.021     -2.479      0.013      -0.093      -0.011\n                                        Roots                                    \n    =============================================================================\n                      Real          Imaginary           Modulus         Frequency\n    -----------------------------------------------------------------------------\n    AR.1           -1.7595           -0.0000j            1.7595           -0.5000\n    AR.2           -0.5700           -1.7248j            1.8165           -0.3008\n    AR.3           -0.5700           +1.7248j            1.8165            0.3008\n    AR.4            1.4743           -1.0616j            1.8168           -0.0993\n    AR.5            1.4743           +1.0616j            1.8168            0.0993\n    -----------------------------------------------------------------------------\n\n\n\n```python\nfrom pandas.tools.plotting import autocorrelation_plot\nautocorrelation_plot(series)\nplt.figure(figsize=(10, 7), dpi=80)\nplt.show() \n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_db914aa51a20.png)\n\n\n\n    \u003CFigure size 800x560 with 0 Axes>\n\n\n\n```python\nfrom pandas import read_csv\nfrom pandas import datetime\nfrom statsmodels.tsa.arima_model import ARIMA\nfrom sklearn.metrics import mean_squared_error\n\nX = series.values\nsize = int(len(X) * 0.66)\ntrain, test = X[0:size], X[size:len(X)]\nhistory = [x for x in train]\npredictions = list()\nfor t in range(len(test)):\n    model = ARIMA(history, order=(5,1,0))\n    model_fit = model.fit(disp=0)\n    output = model_fit.forecast()\n    yhat = output[0]\n    predictions.append(yhat)\n    obs = test[t]\n    history.append(obs)\n```\n\n\n```python\nerror = mean_squared_error(test, predictions)\nprint('Test MSE: %.3f' % error)\n```\n\n    Test MSE: 10.151\n\n\n\n```python\n# Plot the predicted (from ARIMA) and real prices\n\nplt.figure(figsize=(12, 6), dpi=100)\nplt.plot(test, label='Real')\nplt.plot(predictions, color='red', label='Predicted')\nplt.xlabel('Days')\nplt.ylabel('USD')\nplt.title('Figure 5: ARIMA model on GS stock')\nplt.legend()\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_b29e30acdf6e.png)\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As we can see from Figure 5 ARIMA gives a very good approximation of the real stock price. 
We will use the predicted price through ARIMA as an input feature into the LSTM because, as we mentioned before, we want to capture as many features and patterns about Goldman Sachs as possible. We get a test MSE (mean squared error) of 10.151, which by itself is not a bad result (considering we do have a lot of test data), but still we will only use it as a feature in the LSTM.\n\n## 3.6. Statistical checks \u003Ca class=\"anchor\" id=\"statchecks\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Ensuring that the data has good quality is very important for our models. In order to make sure our data is suitable we will perform a couple of simple checks to ensure that the results we achieve and observe are indeed real, rather than compromised because the underlying data distribution suffers from fundamental errors.\n\n### 3.6.1. Heteroskedasticity, multicollinearity, serial correlation \u003Ca class=\"anchor\" id=\"hetemultiser\">\u003C\u002Fa>\n\n- **Conditional Heteroskedasticity** occurs when the error terms (the difference between a value predicted by a regression and the real value) are dependent on the data - for example, the error terms grow as the data points (along the x-axis) grow.\n- **Multicollinearity** occurs when features are highly correlated with each other - in the extreme case, when one feature is (almost) a linear function of other features.\n- **Serial correlation** (autocorrelation) occurs when the error terms (also called residuals) depend on each other - for example, when today's error is correlated with yesterday's.\n\nWe will not go into the code here as it is straightforward and our focus is more on the deep learning parts, **but the data passes these checks - it is of good quality**.\n\n## 3.7. Feature Engineering \u003Ca class=\"anchor\" id=\"featureeng\">\u003C\u002Fa>\n\n\n```python\nprint('Total dataset has {} samples, and {} features.'.format(dataset_total_df.shape[0], \\\n                                                              dataset_total_df.shape[1]))\n```\n\n    Total dataset has 2265 samples, and 112 features.\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;So, after adding all types of data (the correlated assets, technical indicators, fundamental analysis, Fourier transforms, and ARIMA) we have a total of 112 features for the 2,265 days (as mentioned before, however, only 1,585 days are for training data).\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We will also have some more features generated from the autoencoders.\n\n### 3.7.1. Feature importance with XGBoost \u003Ca class=\"anchor\" id=\"xgboost\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Having so many features, we have to consider whether all of them are really indicative of the direction GS stock will take. For example, we included USD-denominated LIBOR rates in the dataset because we think that changes in LIBOR might indicate changes in the economy, which, in turn, might indicate changes in GS's stock behavior. But we need to test. There are many ways to test feature importance, but the one we will apply uses XGBoost, because it gives one of the best results in both classification and regression problems.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Since the features dataset is quite large, for the purpose of presentation here we'll use only the technical indicators. 
During the real features importance testing all selected features proved somewhat important so we won't exclude anything when training the GAN.\n\n\n```python\ndef get_feature_importance_data(data_income):\n    data = data_income.copy()\n    y = data['price']\n    X = data.iloc[:, 1:]\n    \n    train_samples = int(X.shape[0] * 0.65)\n \n    X_train = X.iloc[:train_samples]\n    X_test = X.iloc[train_samples:]\n\n    y_train = y.iloc[:train_samples]\n    y_test = y.iloc[train_samples:]\n    \n    return (X_train, y_train), (X_test, y_test)\n```\n\n\n```python\n# Get training and test data\n(X_train_FI, y_train_FI), (X_test_FI, y_test_FI) = get_feature_importance_data(dataset_TI_df)\n```\n\n\n```python\nregressor = xgb.XGBRegressor(gamma=0.0,n_estimators=150,base_score=0.7,colsample_bytree=1,learning_rate=0.05)\n```\n\n\n```python\nxgbModel = regressor.fit(X_train_FI,y_train_FI, \\\n                         eval_set = [(X_train_FI, y_train_FI), (X_test_FI, y_test_FI)], \\\n                         verbose=False)\n```\n\n\n```python\neval_result = regressor.evals_result()\n```\n\n\n```python\ntraining_rounds = range(len(eval_result['validation_0']['rmse']))\n```\n\nLet's plot the training and validation errors in order to observe the training and check for overfitting (there isn't overfitting).\n\n\n```python\nplt.scatter(x=training_rounds,y=eval_result['validation_0']['rmse'],label='Training Error')\nplt.scatter(x=training_rounds,y=eval_result['validation_1']['rmse'],label='Validation Error')\nplt.xlabel('Iterations')\nplt.ylabel('RMSE')\nplt.title('Training Vs Validation Error')\nplt.legend()\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_713075159d3c.png)\n\n\n\n```python\nfig = plt.figure(figsize=(8,8))\nplt.xticks(rotation='vertical')\nplt.bar([i for i in range(len(xgbModel.feature_importances_))], xgbModel.feature_importances_.tolist(), tick_label=X_test_FI.columns)\nplt.title('Figure 6: Feature importance of the technical indicators.')\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_ad6b3bc67324.png)\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Not surprisingly (for those with experience in stock trading) that MA7, MACD, and BB are among the important features. \n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;I followed the same logic for performing feature importance over the whole dataset - just the training took longer and results were a little more difficult to read, as compared with just a handful of features.\n\n## 3.8. Extracting high-level features with Stacked Autoencoders \u003Ca class=\"anchor\" id=\"stacked_ae\">\u003C\u002Fa>\n\nBefore we proceed to the autoencoders, we'll explore an alternative activation function.\n\n### 3.8.1. Activation function - GELU (Gaussian Error) \u003Ca class=\"anchor\" id=\"gelu\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**GELU** - Gaussian Error Linear Unites was recently proposed - \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1606.08415.pdf\">link\u003C\u002Fa>. In the paper the authors show several instances in which neural networks using GELU outperform networks using ReLU as an activation. ```gelu``` is also used in **BERT**, the NLP approach we used for news sentiment analysis.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We will use GELU for the autoencoders.\n\n**Note**: The cell below shows the logic behind the math of GELU. 
It is not the actual implementation as an activation function. I had to implement GELU inside MXNet. If you follow the code and change ```act_type='relu'``` to ```act_type='gelu'``` it will not work, unless you change the implementation of MXNet. Make a pull request on the whole project to access the MXNet implementation of GELU.\n\n\n```python\ndef gelu(x):\n    return 0.5 * x * (1 + math.tanh(math.sqrt(2 \u002F math.pi) * (x + 0.044715 * math.pow(x, 3))))\ndef relu(x):\n    return max(x, 0)\ndef lrelu(x):\n    return max(0.01*x, x)\n```\n\nLet's visualize ```GELU```, ```ReLU```, and ```LeakyReLU``` (the last one is mainly used in GANs - we also use it).\n\n\n```python\nplt.figure(figsize=(15, 5))\nplt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=.5, hspace=None)\n\nranges_ = (-10, 3, .25)\n\nplt.subplot(1, 2, 1)\nplt.plot([i for i in np.arange(*ranges_)], [relu(i) for i in np.arange(*ranges_)], label='ReLU', marker='.')\nplt.plot([i for i in np.arange(*ranges_)], [gelu(i) for i in np.arange(*ranges_)], label='GELU')\nplt.hlines(0, -10, 3, colors='gray', linestyles='--', label='0')\nplt.title('Figure 7: GELU as an activation function for autoencoders')\nplt.ylabel('f(x) for GELU and ReLU')\nplt.xlabel('x')\nplt.legend()\n\nplt.subplot(1, 2, 2)\nplt.plot([i for i in np.arange(*ranges_)], [lrelu(i) for i in np.arange(*ranges_)], label='Leaky ReLU')\nplt.hlines(0, -10, 3, colors='gray', linestyles='--', label='0')\nplt.ylabel('f(x) for Leaky ReLU')\nplt.xlabel('x')\nplt.title('Figure 8: LeakyReLU')\nplt.legend()\n\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_8882e864a706.png)\n\n\n**Note**: In future versions of this notebook I will experiment using **U-Net** (\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1505.04597\">link\u003C\u002Fa>), and try to utilize the convolutional layer and extract (and create) even more features about the stock's underlying movement patterns. For now, we will just use a simple autoencoder made only from ```Dense``` layers.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Ok, back to the autoencoders, depicted below (the image is only schematic, it doesn't represent the real number of layers, units, etc.)\n\n**Note**: One thing that I will explore in a later version is removing the last layer in the decoder. Normally, in autoencoders the number of encoders == number of decoders. We want, however, to extract higher level features (rather than creating the same input), so we can skip the last layer in the decoder. 
We achieve this creating the encoder and decoder with same number of layers during the training, but when we create the output we use the layer next to the only one as it would contain the higher level features.\n\n\u003Ccenter>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_fbd88935fb4c.jpg\" width=428>\u003C\u002Fimg>\u003C\u002Fcenter>\n\n\n```python\nbatch_size = 64\nn_batches = VAE_data.shape[0]\u002Fbatch_size\nVAE_data = VAE_data.values\n\ntrain_iter = mx.io.NDArrayIter(data={'data': VAE_data[:num_training_days,:-1]}, \\\n                               label={'label': VAE_data[:num_training_days, -1]}, batch_size = batch_size)\ntest_iter = mx.io.NDArrayIter(data={'data': VAE_data[num_training_days:,:-1]}, \\\n                              label={'label': VAE_data[num_training_days:,-1]}, batch_size = batch_size)\n```\n\n\n```python\nmodel_ctx =  mx.cpu()\nclass VAE(gluon.HybridBlock):\n    def __init__(self, n_hidden=400, n_latent=2, n_layers=1, n_output=784, \\\n                 batch_size=100, act_type='relu', **kwargs):\n        self.soft_zero = 1e-10\n        self.n_latent = n_latent\n        self.batch_size = batch_size\n        self.output = None\n        self.mu = None\n        super(VAE, self).__init__(**kwargs)\n        \n        with self.name_scope():\n            self.encoder = nn.HybridSequential(prefix='encoder')\n            \n            for i in range(n_layers):\n                self.encoder.add(nn.Dense(n_hidden, activation=act_type))\n            self.encoder.add(nn.Dense(n_latent*2, activation=None))\n\n            self.decoder = nn.HybridSequential(prefix='decoder')\n            for i in range(n_layers):\n                self.decoder.add(nn.Dense(n_hidden, activation=act_type))\n            self.decoder.add(nn.Dense(n_output, activation='sigmoid'))\n\n    def hybrid_forward(self, F, x):\n        h = self.encoder(x)\n        #print(h)\n        mu_lv = F.split(h, axis=1, num_outputs=2)\n        mu = mu_lv[0]\n        lv = mu_lv[1]\n        self.mu = mu\n\n        eps = F.random_normal(loc=0, scale=1, shape=(self.batch_size, self.n_latent), ctx=model_ctx)\n        z = mu + F.exp(0.5*lv)*eps\n        y = self.decoder(z)\n        self.output = y\n\n        KL = 0.5*F.sum(1+lv-mu*mu-F.exp(lv),axis=1)\n        logloss = F.sum(x*F.log(y+self.soft_zero)+ (1-x)*F.log(1-y+self.soft_zero), axis=1)\n        loss = -logloss-KL\n\n        return loss\n```\n\n\n```python\nn_hidden=400 # neurons in each layer\nn_latent=2 \nn_layers=3 # num of dense layers in encoder and decoder respectively\nn_output=VAE_data.shape[1]-1 \n\nnet = VAE(n_hidden=n_hidden, n_latent=n_latent, n_layers=n_layers, n_output=n_output, batch_size=batch_size, act_type='gelu')\n```\n\n\n```python\nnet.collect_params().initialize(mx.init.Xavier(), ctx=mx.cpu())\nnet.hybridize()\ntrainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': .01})\n```\n\n\n```python\nprint(net)\n```\n\n    VAE(\n      (encoder): HybridSequential(\n        (0): Dense(None -> 400, Activation(relu))\n        (1): Dense(None -> 400, Activation(relu))\n        (2): Dense(None -> 400, Activation(relu))\n        (3): Dense(None -> 4, linear)\n      )\n      (decoder): HybridSequential(\n        (0): Dense(None -> 400, Activation(relu))\n        (1): Dense(None -> 400, Activation(relu))\n        (2): Dense(None -> 400, Activation(relu))\n        (3): Dense(None -> 11, Activation(sigmoid))\n      )\n    )\n\n\nSo we have 3 layers (with 400 neurons in each) in both 
the encoder and the decoder.\n\n\n```python\nn_epoch = 150\nprint_period = n_epoch \u002F\u002F 10\nstart = time.time()\n\ntraining_loss = []\nvalidation_loss = []\nfor epoch in range(n_epoch):\n    epoch_loss = 0\n    epoch_val_loss = 0\n\n    train_iter.reset()\n    test_iter.reset()\n\n    n_batch_train = 0\n    for batch in train_iter:\n        n_batch_train +=1\n        data = batch.data[0].as_in_context(mx.cpu())\n\n        with autograd.record():\n            loss = net(data)\n        loss.backward()\n        trainer.step(data.shape[0])\n        epoch_loss += nd.mean(loss).asscalar()\n\n    n_batch_val = 0\n    for batch in test_iter:\n        n_batch_val +=1\n        data = batch.data[0].as_in_context(mx.cpu())\n        loss = net(data)\n        epoch_val_loss += nd.mean(loss).asscalar()\n\n    epoch_loss \u002F= n_batch_train\n    epoch_val_loss \u002F= n_batch_val\n\n    training_loss.append(epoch_loss)\n    validation_loss.append(epoch_val_loss)\n\n    \"\"\"if epoch % max(print_period, 1) == 0:\n        print('Epoch {}, Training loss {:.2f}, Validation loss {:.2f}'.\\\n              format(epoch, epoch_loss, epoch_val_loss))\"\"\"\n\nend = time.time()\nprint('Training completed in {} seconds.'.format(int(end-start)))\n```\n\n    Training completed in 62 seconds.\n\n\n\n```python\ndataset_total_df['Date'] = dataset_ex_df['Date']\n```\n\n\n```python\nvae_added_df = mx.nd.array(dataset_total_df.iloc[:, :-1].values)\n```\n\n\n```python\nprint('The shape of the newly created (from the autoencoder) features is {}.'.format(vae_added_df.shape))\n```\n\n    The shape of the newly created (from the autoencoder) features is (2265, 112).\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We created 112 more features from the autoencoder. As we want to only have high level features (overall patterns) we will create an Eigen portfolio on the newly created 112 features using Principal Component Analysis (PCA). This will reduce the dimension (number of columns) of the data. The descriptive capability of the Eigen portfolio will be the same as the original 112 features.\n\n**Note** Once again, this is purely experimental. I am not 100% sure the described logic will hold. As everything else in AI and deep learning, this is art and needs experiments.\n\n### 3.8.2. Eigen portfolio with PCA \u003Ca class=\"anchor\" id=\"pca\">\u003C\u002Fa>\n\n\n```python\n# We want the PCA to create the new components to explain 80% of the variance\npca = PCA(n_components=.8)\n```\n\n\n```python\nx_pca = StandardScaler().fit_transform(vae_added_df)\n```\n\n\n```python\nprincipalComponents = pca.fit_transform(x_pca)\n```\n\n\n```python\nprincipalComponents.n_components_\n```\n\n\n\n\n    84\n\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;So, in order to explain 80% of the variance we need 84 (out of the 112) features. This is still a lot. So, for now we will not include the autoencoder created features. I will work on creating the autoencoder architecture in which we get the output from an intermediate layer (not the last one) and connect it to another ```Dense``` layer with, say, 30 neurons. Thus, we will 1) only extract higher level features, and 2) come up with significantly fewer number of columns.\n\n## 3.9. Deep Unsupervised Learning for anomaly detection in derivatives pricing \u003Ca class=\"anchor\" id=\"dulfaddp\">\u003C\u002Fa>\n\n-- To be added soon.\n\n# 4. 
Generative Adversarial Network (GAN) \u003Ca class=\"anchor\" id=\"qgan\">\u003C\u002Fa>\n\nFigure 9: Simple GAN architecture\n\n\u003Ccenter>\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_84dd102afb98.jpg' width=960>\u003C\u002Fimg>\u003C\u002Fcenter>\n\n#### How GANs work?\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As mentioned before, the purpose of this notebook is not to explain in detail the math behind deep learning but to show its applications. Of course, thorough and very solid understanding from the fundamentals down to the smallest details, in my opinion, is extremely imperative. Hence, we will try to balance and give a high-level overview of how GANs work in order for the reader to fully understand the rationale behind using GANs in predicting stock price movements. Feel free to skip this and the next section if you are experienced with GANs (and do check  section \u003Ca href=\"#wgan\">4.2.\u003C\u002Fa>).\n\nA GAN network consists of two models - a **Generator** ($G$) and **Discriminator** ($D$). The steps in training a GAN are:\n1. The Generator is, using random data (noise denoted $z$), trying to 'generate' data indistinguishable of, or extremely close to, the real data. It's purpose is to learn the distribution of the real data.\n2. Randomly, real or generated data is fitted into the Discriminator, which acts as a classifier and tries to understand whether the data is coming from the Generator or is the real data. $D$ estimates the (distributions) probabilities of the incoming sample to the real dataset. (_more info on comparing two distributions in \u003Ca href=\"#mhganwgan\">section 4.2.\u003C\u002Fa> below_).\n3. Then, the losses from $G$ and $D$ are combined and propagated back through the generator. Ergo, the generator's loss depends on both the generator and the discriminator. This is the step that helps the Generator learn about the real data distribution. If the generator doesn't do a good job at generating a realistic data (having the same distribution), the Discriminator's work will be very easy to distinguish generated from real data sets. Hence, the Discriminator's loss will be very small. Small discriminator loss will result in bigger generator loss (_see the equation below for $L(D, G)$_). This makes creating the discriminator a bit tricky, because too good of a discriminator will always result in a huge generator loss, making the generator unable to learn.\n4. The process goes on until the Discriminator can no longer distinguish generated from real data.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;When combined together, $D$ and $G$ as sort of playing a _minmax_ game (the Generator is trying to _fool_ the Discriminator making it increase the probability for on fake examples, i.e. minimize $\\mathbb{E}_{z \\sim p_{z}(z)} [\\log (1 - D(G(z)))]$. The Discriminator wants to separate the data coming from the Generator, $D(G(z))$, by maximizing $\\mathbb{E}_{x \\sim p_{r}(x)} [\\log D(x)]$). Having separated loss functions, however, it is not clear how both can converge together (that is why we use some advancements over the plain GANs, such as Wasserstein GAN). 
Overall, the combined loss function looks like:\n\n$$L(D, G) = \\mathbb{E}_{x \\sim p_{r}(x)} [\\log D(x)] + \\mathbb{E}_{z \\sim p_z(z)} [\\log(1 - D(G(z)))]$$\n\n**Note**: Really useful tips for training GANs can be found \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsoumith\u002Fganhacks\">here\u003C\u002Fa>.\n\n**Note**: I will not include the complete code behind the **GAN** and the **Reinforcement learning** parts in this notebook - only the results from the execution (the cell outputs) will be shown. Make a pull request or contact me for the code.\n\n## 4.1. Why GAN for stock market prediction? \u003Ca class=\"anchor\" id=\"whygan\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Generative Adversarial Networks** (GAN) have been recently used mainly in creating realistic images, paintings, and video clips. There aren't many applications of GANs being used for predicting time-series data as in our case. The main idea, however, should be same - we want to predict future stock movements. In the future, the pattern and behavior of GS's stock should be more or less the same (unless it starts operating in a totally different way, or the economy drastically changes). Hence, we want to 'generate' data for the future that will have similar (not absolutely the same, of course) distribution as the one we already have - the historical trading data. So, in theory, it should work.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In our case, we will use **LSTM** as a time-series generator, and **CNN** as a discriminator.\n\n## 4.2. Metropolis-Hastings GAN and Wasserstein GAN \u003Ca class=\"anchor\" id=\"mhganwgan\">\u003C\u002Fa>\n\n**Note:** _The next couple of sections assume some experience with GANs._\n\n#### **I. Metropolis-Hastings GAN**\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;A recent improvement over the traditional GANs came out from Uber's engineering team and is called **Metropolis-Hastings GAN** (MHGAN). The idea behind Uber's approach is (as they state it) somewhat similar to another approach created by Google and University of California, Berkeley called **Discriminator Rejection Sampling** (\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.06758.pdf\">DRS\u003C\u002Fa>). Basically, when we train GAN we use the Discriminator ($D$) for the sole purpose of better training the Generator ($G$). Often, after training the GAN we do not use the $D$ any more. MHGAN and DRS, however, try to use $D$ in order to choose samples generated by $G$ that are close to the real data distribution (slight difference between is that MHGAN uses Markov Chain Monte Carlo (MCMC) for sampling).\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;MHGAN takes **_K_** samples generated from the $G$ (created from independent noise inputs to the $G$ - $z_0$ to $z_K$ in the figure below). Then it sequentially runs through the **_K_** outputs ($x'_0$ to $x'_K$) and following an acceptance rule (created from the Discriminator) decides whether to accept the current sample or keep the last accepted one. The last kept output is the one considered the real output of $G$.\n\n**Note**: MHGAN is originally implemented by Uber in pytorch. I only transferred it into MXNet\u002FGluon. 
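\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As a rough illustration of the acceptance rule (my own simplified sketch, not Uber's implementation and not the code used in this notebook), the selection over the **_K_** generated samples could look like the snippet below, assuming we already have a calibrated discriminator score ```d_scores[k]``` in (0, 1) for every sample:\n\n```python\nimport numpy as np\n\ndef mh_select(samples, d_scores, rng=np.random):\n    # Walk through the K candidates and accept each one with the\n    # Metropolis-Hastings ratio of discriminator odds; the last accepted\n    # sample is kept as the generator's output.\n    eps = 1e-8\n    def odds(p):\n        return p \u002F max(1.0 - p, eps)\n    current = 0\n    for k in range(1, len(samples)):\n        accept_prob = min(1.0, odds(d_scores[k]) \u002F max(odds(d_scores[current]), eps))\n        if rng.uniform() \u003C accept_prob:\n            current = k\n    return samples[current]\n```\n\nIn the actual MHGAN the chain is initialized from a real sample's score and the discriminator probabilities are calibrated first; the sketch above only shows the core accept\u002Freject idea.\n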
\n\n\n#### **Note**: I will also upload it into Github sometime soon.\n\u003Cbr>\u003C\u002Fbr>\nFigure 10: Visual representation of MHGAN (from the original \u003Ca href=\"https:\u002F\u002Feng.uber.com\u002Fmh-gan\u002F?amp\">Uber post\u003C\u002Fa>).\n\n\u003Ccenter>\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_d2289ff8fdc1.png' width=500>\u003C\u002Fimg>\u003C\u002Fcenter>\n\n#### **II. Wasserstein GAN** \u003Ca class=\"anchor\" id=\"wgan\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Training GANs is quite difficult. Models may never converge and mode collapse can easily happen. We will use a modification of GAN called **Wasserstein** GAN - \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1701.07875.pdf\">WGAN\u003C\u002Fa>.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Again, we will not go into details, but the most notable points to make are:\n- As we know the main goal behind GANs is for the Generator to start transforming random noise into some given data that we want to mimic. Ergo, the idea of comparing the similarity between two distributions is very imperative in GANs. The two most widely used such metrics are:\n  * **KL divergence** (Kullback–Leibler) - $D_{KL}(p \\| q) = \\int_x p(x) \\log \\frac{p(x)}{q(x)} dx$. $D_{KL}$ is zero when $p(x)$ is equal to $q(x)$,\n  * **JS Divergence** (Jensen–Shannon) - $D_{JS}(p \\| q) = \\frac{1}{2} D_{KL}(p \\| \\frac{p + q}{2}) + \\frac{1}{2} D_{KL}(q \\| \\frac{p + q}{2})$. JS divergence is bounded by 0 and 1, and, unlike KL divergence, is symmetric and smoother. Significant success in GAN training was achieved when the loss was switched from KL to JS divergence.\n- WGAN uses Wasserstein distance, $W(p_r, p_g) = \\frac{1}{K} \\sup_{\\| f \\|_L \\leq K} \\mathbb{E}_{x \\sim p_r}[f(x)] - \\mathbb{E}_{x \\sim p_g}[f(x)]$ (where $sup$ stands for _supremum_), as a loss function (also called Earth Mover's distance, because it normally is interpreted as moving one pile of, say, sand to another one, both piles having different probability distributions, using minimum energy during the transformation). Compared to KL and JS divergences, Wasserstein metric gives a smooth measure (without sudden jumps in divergence). This makes it much more suitable for creating a stable learning process during the gradient descent.\n- Also, compared to KL and JS, Wasserstein distance is differentiable nearly everywhere. As we know, during backpropagation, we differentiate the loss function in order to create the gradients, which in turn update the weights. Therefore, having a differentiable loss function is quite important.\n\n#### _Hands down, this was the toughest part of this notebook. Mixing WGAN and MHGAN took me three days._\n\n## 4.4. The Generator - One layer RNN \u003Ca class=\"anchor\" id=\"thegenerator\">\u003C\u002Fa>\n\n### 4.4.1. LSTM or GRU \u003Ca class=\"anchor\" id=\"lstmorgru\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As mentioned before, the generator is a LSTM network a type of Recurrent Neural Network (RNN). RNNs are used for time-series data because because they keep track of all previous data points and can capture patterns developing through time. 
Due to their nature, RNNs often suffer from the _vanishing gradient_ problem - that is, the updates the weights receive during training become so small that the weights effectively stop changing, making the network unable to converge to a minimal loss. (The opposite problem can also be observed at times - when gradients become too big. This is called _exploding gradients_, but the solution is quite simple - clip gradients if they start exceeding some constant, i.e. gradient clipping.) Two modifications tackle this problem - the Gated Recurrent Unit (**GRU**) and Long-Short Term Memory (**LSTM**). The biggest differences between the two are: 1) GRU has 2 gates (update and reset) while LSTM has 3 (input, forget, and output), 2) LSTM maintains an internal memory state, while GRU doesn't, and 3) LSTM applies a nonlinearity (tanh) to the memory state before the output gate, GRU doesn't.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In most cases LSTM and GRU give similar results in terms of accuracy, but GRU is much less computationally intensive, as it has far fewer trainable parameters. LSTMs, however, are much more widely used.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Strictly speaking, the math behind the LSTM cell (the gates) is:\n\n\n$$g_t = \\text{tanh}(X_t W_{xg} + h_{t-1} W_{hg} + b_g),$$\n\n\n$$i_t = \\sigma(X_t W_{xi} + h_{t-1} W_{hi} + b_i),$$\n\n$$f_t = \\sigma(X_t W_{xf} + h_{t-1} W_{hf} + b_f),$$\n\n$$o_t = \\sigma(X_t W_{xo} + h_{t-1} W_{ho} + b_o),$$\n\n$$c_t = f_t \\odot c_{t-1} + i_t \\odot g_t,$$\n\n$$h_t = o_t \\odot \\text{tanh}(c_t),$$\n\nwhere $\\odot$ is an element-wise multiplication operator and, for all $x = [x_1, x_2, \\ldots, x_k]^\\top \\in R^k$, the two activation functions are:\n\n$$\\sigma(x) = \\left[\\frac{1}{1+\\exp(-x_1)}, \\ldots, \\frac{1}{1+\\exp(-x_k)}\\right]^\\top,$$\n\n$$\\text{tanh}(x) = \\left[\\frac{1-\\exp(-2x_1)}{1+\\exp(-2x_1)}, \\ldots, \\frac{1-\\exp(-2x_k)}{1+\\exp(-2x_k)}\\right]^\\top$$
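\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;To make these equations concrete, here is a purely illustrative NumPy sketch of a single LSTM cell step (my own toy example with made-up dimensions and random weights - in the actual model MXNet's ```rnn.LSTM``` handles all of this internally; ```W``` maps the input and ```U``` maps the previous hidden state, i.e. the $W_{x}$ and $W_{h}$ matrices above):\n\n```python\nimport numpy as np\n\ndef sigmoid(x):\n    return 1.0 \u002F (1.0 + np.exp(-x))\n\nnum_features, num_hidden = 112, 500\nrng = np.random.RandomState(0)\n\nX_t = rng.randn(1, num_features)      # one day of input features\nh_prev = np.zeros((1, num_hidden))    # previous hidden state h_{t-1}\nc_prev = np.zeros((1, num_hidden))    # previous cell state c_{t-1}\n\n# one weight matrix pair and bias per equation: candidate g and gates i, f, o\nW = {k: rng.randn(num_features, num_hidden) * 0.01 for k in 'gifo'}\nU = {k: rng.randn(num_hidden, num_hidden) * 0.01 for k in 'gifo'}\nb = {k: np.zeros((1, num_hidden)) for k in 'gifo'}\n\ng_t = np.tanh(X_t @ W['g'] + h_prev @ U['g'] + b['g'])   # candidate memory\ni_t = sigmoid(X_t @ W['i'] + h_prev @ U['i'] + b['i'])    # input gate\nf_t = sigmoid(X_t @ W['f'] + h_prev @ U['f'] + b['f'])    # forget gate\no_t = sigmoid(X_t @ W['o'] + h_prev @ U['o'] + b['o'])    # output gate\n\nc_t = f_t * c_prev + i_t * g_t    # element-wise, exactly as in the formula for c_t\nh_t = o_t * np.tanh(c_t)          # the cell's hidden state \u002F output for this step\n```\n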
\n### 4.4.2. The LSTM architecture \u003Ca class=\"anchor\" id=\"lstmarchitecture\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The LSTM architecture is very simple - one ```LSTM``` layer with 112 input units (as we have 112 features in the dataset) and 500 hidden units, and one ```Dense``` layer with 1 output - the price for every day. The initializer is Xavier and we will use L1 loss (which is mean absolute error loss with L1 regularization - see section 4.4.5. for more info on regularization).\n\n**Note** - In the code you can see we use ```Adam``` (with a ```learning rate``` of .01) as an optimizer. Don't pay too much attention to that now - there is a section specially dedicated to explaining which hyperparameters we use (the learning rate is excluded, as we have a learning rate scheduler - \u003Ca href=\"#lrscheduler\">section 4.4.3.\u003C\u002Fa>) and how we optimize these hyperparameters - \u003Ca href=\"#hyperparams\">section 4.6.\u003C\u002Fa>\n\n\n```python\ngan_num_features = dataset_total_df.shape[1]\nsequence_length = 17\n\nclass RNNModel(gluon.Block):\n    def __init__(self, num_embed, num_hidden, num_layers, bidirectional=False, \\\n                 sequence_length=sequence_length, **kwargs):\n        super(RNNModel, self).__init__(**kwargs)\n        self.num_hidden = num_hidden\n        with self.name_scope():\n            self.rnn = rnn.LSTM(num_hidden, num_layers, input_size=num_embed, \\\n                                bidirectional=bidirectional, layout='TNC')\n            \n            self.decoder = nn.Dense(1, in_units=num_hidden)\n    \n    def forward(self, inputs, hidden):\n        output, hidden = self.rnn(inputs, hidden)\n        decoded = self.decoder(output.reshape((-1, self.num_hidden)))\n        return decoded, hidden\n    \n    def begin_state(self, *args, **kwargs):\n        return self.rnn.begin_state(*args, **kwargs)\n    \nlstm_model = RNNModel(num_embed=gan_num_features, num_hidden=500, num_layers=1)\nlstm_model.collect_params().initialize(mx.init.Xavier(), ctx=mx.cpu())\ntrainer = gluon.Trainer(lstm_model.collect_params(), 'adam', {'learning_rate': .01})\nloss = gluon.loss.L1Loss()\n```\n\nWe will use 500 neurons in the LSTM layer and use Xavier initialization. For regularization we'll use L1. Let's see what's inside the ```LSTM``` as printed by MXNet.\n\n\n```python\nprint(lstm_model)\n```\n\n    RNNModel(\n      (rnn): LSTM(112 -> 500, TNC)\n      (decoder): Dense(500 -> 1, linear)\n    )\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;As we can see, the inputs to the LSTM are the 112 features (```dataset_total_df.shape[1]```), which go into 500 neurons in the LSTM layer and are then transformed into a single output - the stock price value.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The logic behind the LSTM is: we take 17 (```sequence_length```) days of data (again, the data being the stock price of GS for each day plus all the other features for that day - correlated assets, sentiment, etc.) and try to predict the 18th day.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;In another post I will explore whether modifications of the vanilla LSTM would be more beneficial, such as:\n- using a **bidirectional** LSTM layer - in theory, going backwards (from the end of the data set towards the beginning) might somehow help the LSTM figure out the pattern of the stock movement.\n- using a **stacked** RNN architecture - having not just one LSTM layer but two or more. This, however, might be dangerous, as we might end up overfitting the model, since we don't have a lot of data (we have just 1,585 days' worth of data).\n- exploring **GRU** - as already explained, GRU cells are much simpler.\n- adding **attention** vectors to the RNN.\n\n### 4.4.3. Learning rate scheduler \u003Ca class=\"anchor\" id=\"lrscheduler\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;One of the most important hyperparameters is the learning rate. Setting the learning rate for almost every optimizer (such as SGD, Adam, or RMSProp) is crucially important when training neural networks because it controls both the speed of convergence and the ultimate performance of the network. 
One of the simplest learning rate strategies is to have a fixed learning rate throughout the training process. Choosing a small learning rate allows the optimizer find good solutions, but this comes at the expense of limiting the initial speed of convergence. Changing the learning rate over time can overcome this tradeoff.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Recent papers, such as \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1806.01593.pdf\">this\u003C\u002Fa>, show the benefits of changing the global learning rate during training, in terms of both convergence and time.\n\n\n```python\nclass TriangularSchedule():\n    def __init__(self, min_lr, max_lr, cycle_length, inc_fraction=0.5):     \n        self.min_lr = min_lr\n        self.max_lr = max_lr\n        self.cycle_length = cycle_length\n        self.inc_fraction = inc_fraction\n        \n    def __call__(self, iteration):\n        if iteration \u003C= self.cycle_length*self.inc_fraction:\n            unit_cycle = iteration * 1 \u002F (self.cycle_length * self.inc_fraction)\n        elif iteration \u003C= self.cycle_length:\n            unit_cycle = (self.cycle_length - iteration) * 1 \u002F (self.cycle_length * (1 - self.inc_fraction))\n        else:\n            unit_cycle = 0\n        adjusted_cycle = (unit_cycle * (self.max_lr - self.min_lr)) + self.min_lr\n        return adjusted_cycle\n\nclass CyclicalSchedule():\n    def __init__(self, schedule_class, cycle_length, cycle_length_decay=1, cycle_magnitude_decay=1, **kwargs):\n        self.schedule_class = schedule_class\n        self.length = cycle_length\n        self.length_decay = cycle_length_decay\n        self.magnitude_decay = cycle_magnitude_decay\n        self.kwargs = kwargs\n    \n    def __call__(self, iteration):\n        cycle_idx = 0\n        cycle_length = self.length\n        idx = self.length\n        while idx \u003C= iteration:\n            cycle_length = math.ceil(cycle_length * self.length_decay)\n            cycle_idx += 1\n            idx += cycle_length\n        cycle_offset = iteration - idx + cycle_length\n        \n        schedule = self.schedule_class(cycle_length=cycle_length, **self.kwargs)\n        return schedule(cycle_offset) * self.magnitude_decay**cycle_idx\n```\n\n\n```python\nschedule = CyclicalSchedule(TriangularSchedule, min_lr=0.5, max_lr=2, cycle_length=500)\niterations=1500\n\nplt.plot([i+1 for i in range(iterations)],[schedule(i) for i in range(iterations)])\nplt.title('Learning rate for each epoch')\nplt.xlabel(\"Epoch\")\nplt.ylabel(\"Learning Rate\")\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_918e9b50eb45.png)\n\n\n### 4.4.4. How to prevent overfitting and the bias-variance trade-off \u003Ca class=\"anchor\" id=\"preventoverfitting\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Having a lot of features and neural networks we need to make sure we prevent overfitting and be mindful of the total loss.\n\nWe use several techniques for preventing overfitting (not only in the LSTM, but also in the CNN and the auto-encoders):\n- **Ensuring data quality**. We already performed statistical checks and made sure the data doesn't suffer from multicollinearity or serial autocorrelation. Further we performed feature importance check on each feature. Finally, the initial feature selection (e.g. selecting correlated assets, technical indicators, etc.) 
was done with some domain knowledge about the mechanics behind the way stock markets work.\n- **Regularization** (or weight penalty). The two most widely used regularization techniques are LASSO (**L1**) and Ridge (**L2**). L1 adds a penalty proportional to the absolute values of the weights to the loss, while L2 adds a penalty proportional to their squared values. Without going into too many mathematical details, the basic differences are: lasso regression (L1) does both variable selection and parameter shrinkage, whereas ridge regression (L2) only does parameter shrinkage and ends up keeping all the coefficients in the model. In the presence of correlated variables, ridge regression might be the preferred choice; it also works best in situations where the least squares estimates have higher variance. Therefore, it depends on our model objective. While both penalize large weights, L1 regularization leads to a function that is non-differentiable at zero: L2 favors smaller weights, whereas L1 favors weights that go all the way to zero. With L1 regularization you can therefore end up with a sparse model - one with fewer parameters. In both cases the parameters of the regularized model \"shrink\", but with L1 the shrinkage directly impacts the complexity (the number of parameters) of the model. L1 is also more robust to outliers and is useful when the data is sparse. We will use L1.\n- **Dropout**. Dropout layers randomly remove nodes in the hidden layers.\n- **Dense-sparse-dense training** - \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1607.04381v1.pdf\">link\u003C\u002Fa>.\n- **Early stopping**.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Another important consideration when building complex neural networks is the bias-variance trade-off. Basically, the error we get when training nets is a function of the bias, the variance, and the irreducible error - σ (error due to noise and randomness). The simplest formula of the trade-off is:\n\n$$Error = bias^{2} + variance + \\sigma$$\n\n- **Bias**. Bias measures how well an algorithm trained on the training dataset can generalize to unseen data. High bias (underfitting) means the model does not work well on unseen data.\n- **Variance**. Variance measures the sensitivity of the model to changes in the dataset. High variance corresponds to overfitting.\n\n### 4.4.5. Custom weights initializers and custom loss metric \u003Ca class=\"anchor\" id=\"customfns\">\u003C\u002Fa>\n\n#### Coming soon\n\n## 4.5. The Discriminator - One Dimensional CNN \u003Ca class=\"anchor\" id=\"thediscriminator\">\u003C\u002Fa>\n\n### 4.5.1. Why CNN as a discriminator? \u003Ca class=\"anchor\" id=\"why_cnn_architecture\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;We usually use CNNs for work related to images (classification, context extraction, etc.). They are very powerful at extracting features from features from features, etc. For example, in an image of a dog, the first convolutional layer will detect edges, the second will start detecting circles, and the third will detect a nose. In our case, data points form small trends, small trends form bigger ones, and trends in turn form patterns. 
CNNs' ability to detect features can be used for extracting information about patterns in GS's stock price movements.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Another reason for using CNNs is that they work well on spatial data - meaning that data points which are closer to each other are more related than data points spread far apart. This should hold true for time series data. In our case, each data point (for each feature) corresponds to a consecutive day. It is natural to assume that the closer two days are to each other, the more related they are to each other. One thing to consider (although not covered in this work) is seasonality and how it might change (if at all) the work of the CNN.\n\n**Note**: As with many other parts of this notebook, using a CNN for time series data is experimental. We will inspect the results, without providing mathematical or other proofs. And results might vary using different data, activation functions, etc.\n\n### 4.5.2. The CNN Architecture \u003Ca class=\"anchor\" id=\"the_cnn_architecture\">\u003C\u002Fa>\n\nFigure 11: High level overview of the CNN architecture.\n\n\u003Ccenter>\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_332d61898309.jpg' width=960>\u003C\u002Fimg>\u003C\u002Fcenter>\n\nThe code for the CNN inside the GAN looks like this:\n\n\n```python\nnum_fc = 512\n\n# ... other parts of the GAN\n\ncnn_net = gluon.nn.Sequential()\nwith cnn_net.name_scope():\n    \n    # Add the 1D Convolutional layers\n    cnn_net.add(gluon.nn.Conv1D(32, kernel_size=5, strides=2))\n    cnn_net.add(nn.LeakyReLU(0.01))\n    cnn_net.add(gluon.nn.Conv1D(64, kernel_size=5, strides=2))\n    cnn_net.add(nn.LeakyReLU(0.01))\n    cnn_net.add(nn.BatchNorm())\n    cnn_net.add(gluon.nn.Conv1D(128, kernel_size=5, strides=2))\n    cnn_net.add(nn.LeakyReLU(0.01))\n    cnn_net.add(nn.BatchNorm())\n    \n    # Add the two Fully Connected layers\n    cnn_net.add(nn.Dense(220, use_bias=False), nn.BatchNorm(), nn.LeakyReLU(0.01))\n    cnn_net.add(nn.Dense(220, use_bias=False), nn.Activation(activation='relu'))\n    cnn_net.add(nn.Dense(1))\n    \n# ... other parts of the GAN\n```\n\nLet's print the CNN.\n\n\n```python\nprint(cnn_net)\n```\n\n    Sequential(\n      (0): Conv1D(None -> 32, kernel_size=(5,), stride=(2,))\n      (1): LeakyReLU(0.01)\n      (2): Conv1D(None -> 64, kernel_size=(5,), stride=(2,))\n      (3): LeakyReLU(0.01)\n      (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=None)\n      (5): Conv1D(None -> 128, kernel_size=(5,), stride=(2,))\n      (6): LeakyReLU(0.01)\n      (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=None)\n      (8): Dense(None -> 220, linear)\n      (9): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=None)\n      (10): LeakyReLU(0.01)\n      (11): Dense(None -> 220, linear)\n      (12): Activation(relu)\n      (13): Dense(None -> 1, linear)\n    )\n\n\n## 4.6. 
Hyperparameters \u003Ca class=\"anchor\" id=\"hyperparams\">\u003C\u002Fa>\n\nThe hyperparameters that we will track and optimize are:\n- ```batch_size```: the batch size for the LSTM and CNN\n- ```cnn_lr```: the learning rate of the CNN\n- ```strides```: the stride used in the CNN\n- ```lrelu_alpha```: the alpha for the LeakyReLU in the GAN\n- ```batchnorm_momentum```: momentum for the batch normalisation in the CNN\n- ```padding```: the padding in the CNN\n- ```kernel_size```: the kernel size in the CNN\n- ```dropout```: dropout in the LSTM\n- ```filters```: the initial number of filters\n\nWe will train for 200 ```epochs```.\n\n# 5. Hyperparameters optimization \u003Ca class=\"anchor\" id=\"hyperparams_optim\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;After the GAN has trained for the 200 epochs, it will record the MAE (which is the error function of the LSTM, the $G$) and pass it as a reward value to the Reinforcement Learning agent, which will decide whether to change the hyperparameters or keep training with the same set. As described later, this approach is strictly for experimenting with RL.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;If the RL agent decides to update the hyperparameters, it will call the Bayesian optimisation library (discussed below), which will suggest the next best expected set of hyperparameters.\n\n## 5.1. Reinforcement learning for hyperparameters optimization \u003Ca class=\"anchor\" id=\"reinforcementlearning\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Why do we use reinforcement learning for hyperparameter optimization? Stock markets change all the time. Even if we manage to train our GAN and LSTM to create extremely accurate results, the results might only be valid for a certain period. This means we need to constantly optimise the whole process. To optimize the process we can:\n- Add or remove features (e.g. add new stocks or currencies that might be correlated).\n- Improve our deep learning models. One of the most important ways to improve the models is through the hyperparameters (listed in section 4.6. above). Once we have found a certain set of hyperparameters, we need to decide when to change them and when to use the already known set (exploration vs. exploitation). Also, the stock market represents a continuous space that depends on millions of parameters.\n\n**Note**: The purpose of the whole reinforcement learning part of this notebook is more research-oriented. We will explore different RL approaches using the GAN as an environment. There are many ways in which we can successfully perform hyperparameter optimization on our deep learning models without using RL. But... why not.\n\n**Note**: The next several sections assume you have some knowledge about RL - especially policy methods and Q-learning.\n\n### 5.1.1. Reinforcement Learning Theory \u003Ca class=\"anchor\" id=\"reinforcementlearning_theory\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Without explaining the basics of RL we will jump into the details of the specific approaches we implement here. We will use model-free RL algorithms for the obvious reason that we do not know the whole environment, hence there is no defined model for how the environment works - if there were, we wouldn't need to predict stock price movements - they would just follow the model. We will use the two subdivisions of model-free RL - Policy optimization and Q-learning.\n\n- **Q-learning** - in Q-learning we learn the **value** of taking an action from a given state. 
**Q-value** is the expected return after taking the action. We will use **Rainbow** which is a combination of seven Q learning algorithms.\n- **Policy Optimization** - in policy optimization we learn the action to take from a given state. (if we use methods like Actor\u002FCritic) we also learn the value of being in a given state. We will use **Proximal Policy Optimization**.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;One crucial aspect of building a RL algorithm is accurately setting the reward. It has to capture all aspects of the environment and the agent's interaction with the environment. We define the reward, **_R_**, as:\n\n$$Reward = 2*loss_G + loss_D + accuracy_G,$$\n\nwhere $loss_G$, $accuracy_G$, and $loss_D$ are the Generator's loss and accuracy, and Discriminator's loss, respectively. The environment is the GAN and the results of the LSTM training. The action the different agents can take is how to change the hyperparameters of the GAN's $D$ and $G$ nets.\n\n#### 5.1.1.1. Rainbow \u003Ca class=\"anchor\" id=\"rl_rainbow\">\u003C\u002Fa>\n\n**What is Rainbow?** \n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Rainbow (\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1710.02298.pdf\">link\u003C\u002Fa>) is a Q learning based off-policy deep reinforcement learning algorithm combining seven algorithm together:\n* **DQN**. DQN is an extension of Q learning algorithm that uses a neural network to represent the Q value. Similar to supervised (deep) learning, in DQN we train a neural network and try to minimize a loss function. We train the network by randomly sampling transitions (state, action, reward). The layers can be not only fully connected ones, but also convolutional, for example.\n* **Double Q Learning**. Double QL handles a big problem in Q learning, namely the overestimation bias.\n* **Prioritized replay**. In the vanilla DQN, all transitions are stored in a replay buffer and it uniformly samples this buffer. However, not all transitions are equally beneficial during the learning phase (which also makes learning inefficient as more episodes are required). Prioritized experience replay doesn't sample uniformly, rather it uses a distribution that gives higher probability to samples that have had higher Q loss in previous iterations.\n* **Dueling networks.** Dueling networks change the Q learning architecture a little by using two separate streams (i.e. having two different mini-neural networks). One stream is for the value and one for the _advantage_. Both of them share a convolutional encoder. The tricky part is the merging of the streams - it uses a special aggregator (_Wang et al. 2016_).\n  - _Advantage_, formula is $A(s, a) = Q(s, a) - V(s)$, generally speaking is a comparison of how good an action is compared to the average action for a specific state. Advantages are sometimes used when a 'wrong' action cannot be penalized with negative reward. So _advantage_ will try to further reward good actions from the average actions.\n* **Multi-step learning.** The big difference behind Multi-step learning is that it calculates the Q-values using N-step returns (not only the return from the next step), which naturally should be more accurate.\n* **Distributional RL**. Q learning uses average estimated Q-value as target value. However, in many cases the Q-values might not be the same in different situations. Distributional RL can directly learn (or approximate) the distribution of Q-values rather than averaging them. 
Again, the math is much more complicated than that, but for us the benefit is more accurate sampling of the Q-values.\n* **Noisy Nets**. Basic DQN implements a simple 𝜀-greedy mechanism to do exploration. This approach to exploration inefficient at times. The way Noisy Nets approach this issue is by adding a noisy linear layer. Over time, the network will learn how to ignore the noise (added as a noisy stream). But this learning comes at different rates in different parts of the space, allowing for state exploration.\n\u003Cbr>\u003C\u002Fbr>\n#### **Note**: Stay tuned - I will upload a MXNet\u002FGluon implementation on Rainbow to Github in early February 2019.\n\u003Cbr>\u003C\u002Fbr>\n\n\n\n#### 5.1.1.2. PPO \u003Ca class=\"anchor\" id=\"rl_ppo\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**Proximal Policy Optimization** (\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1707.06347.pdf\">PPO\u003C\u002Fa>) is a policy optimization model-free type of reinforcement learning. It is much simpler to implement that other algorithms and gives very good results.\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Why do we use PPO? One of the advantages of PPO is that it directly learns the policy, rather than indirectly via the values (the way Q Learning uses Q-values to learn the policy). It can work well in continuous action spaces, which is suitable in our use case and can learn (through mean and standard deviation) the distribution probabilities (if softmax is added as an output).\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The problem of policy gradient methods is that they are extremely sensitive to the step size choice - if it is small the progress takes too long (most probably mainly due to the need of a second-order derivatives matrix); if it is large, there is a lot noise which significantly reduces the performance. Input data is nonstationary due to the changes in the policy (also the distributions of the reward and observations change). As compared to supervised learning, poorly chosen step can be much more devastating as it affects the whole distribution of next visits. PPO can solve these issues. What is more, compared to some other approaches, PPO: \n* is much less complicated, for example compared to **ACER**, which requires additional code for keeping the off-policy correlations and also a replay buffer, or **TRPO** which has a constraint imposed on the surrogate objective function (the KL divergence between the old and the new policy). This constraint is used to control the policy of changing too much - which might create instability. PPO reduces the computation (created by the constraint) by utilizing a _clipped  (between [1- 𝜖, 1+𝜖]) surrogate objective function_ and modifying the objective function with a penalty for having too big of an update.\n* gives compatibility with algos that share parameters between value and policy function or auxiliary losses, as compared to TRPO (although PPO also have the gain of trust region PO).\n\n**Note**: For the purpose of our exercise we won't go too much into the research and optimization of RL approaches, PPO and the others included. Rather, we will take what is available and try to fit into our process for hyperparameter optimization for our GAN, LSTM, and CNN models. The code we will reuse and customize is created by OpenAI and is available \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fopenai\u002Fbaselines\">here\u003C\u002Fa>.\n\n### 5.1.2. 
Further work on Reinforcement learning \u003Ca class=\"anchor\" id=\"reinforcementlearning_further\">\u003C\u002Fa>\n\nSome ideas for further exploring reinforcement learning:\n- One of the first things I will introduce next is using **Augmented Random Search** (\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.07055.pdf\">link\u003C\u002Fa>) as an alternative algorithm. The authors of the algorithm (out of UC, Berkeley) have managed to achieve similar rewards results as other state of the art approaches, such as PPO, but on average 15 times faster.\n- Choosing a reward function is very important. I stated the currently used reward function above, but I will try to play with different functions as an alternative.\n- Using **Curiosity** as an exploration policy.\n- Create **multi-agent** architecture as proposed by Berkeley's AI Research team (BAIR) - \u003Ca href=\"https:\u002F\u002Fbair.berkeley.edu\u002Fblog\u002F2018\u002F12\u002F12\u002Frllib\u002F\">link\u003C\u002Fa>.\n\n## 5.2. Bayesian optimization \u003Ca class=\"anchor\" id=\"bayesian_opt\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Instead of the grid search, that can take a lot of time to find the best combination of hyperparameters, we will use **Bayesian optimization**. The library that we'll use is already implemented - \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ffmfn\u002FBayesianOptimization\">link\u003C\u002Fa>.\n\nThe next part of the code only shows the initialization.\n\n\n```python\n# Initialize the optimizer\nfrom bayes_opt import BayesianOptimization\nfrom bayes_opt import UtilityFunction\n\nutility = UtilityFunction(kind=\"ucb\", kappa=2.5, xi=0.0)\n```\n\n### 5.2.1. Gaussian process \u003Ca class=\"anchor\" id=\"gaussprocess\">\u003C\u002Fa>\n\n\u003Ccenter>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_2423b9fa7a08.png\" width=600>\u003C\u002Fimg>\u003C\u002Fcenter>\n\n# 6. The result \u003Ca class=\"anchor\" id=\"theresult\">\u003C\u002Fa>\n\n\n```python\nfrom utils import plot_prediction\n```\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Finally we will compare the output of the LSTM when the unseen (test) data is used as an input after different phases of the process.\n\n1. Plot after the first epoch.\n\n\n```python\nplot_prediction('Predicted and Real price - after first epoch.')\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_b8a904c684b2.png)\n\n\n2. Plot after 50 epochs.\n\n\n```python\nplot_prediction('Predicted and Real price - after first 50 epochs.')\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_72a7c09ed02d.png)\n\n\n\n```python\nplot_prediction('Predicted and Real price - after first 200 epochs.')\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_f95b879b7b75.png)\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;The RL run for ten episodes (we define an eposide to be one full GAN training on the 200 epochs.)\n\n\n```python\nplot_prediction('Final result.')\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_15e585ff6e27.png)\n\n\n#### As a next step, I will try to take everything separately and provide some analysis on what worked and why. Why did we receive these results and is it just by coinscidence? So stay tuned.\n\n# What is next? 
\u003Ca class=\"anchor\" id=\"whatisnext\">\u003C\u002Fa>\n\n- Next, I will try to create a RL environment for testing trading algorithms that decide when and how to trade. The output from the GAN will be one of the parameters in the environment.\n\n# About me \u003Ca class=\"anchor\" id=\"me\">\u003C\u002Fa>\n\nwww.linkedin.com\u002Fin\u002Fborisbanushev\n\n# Disclaimer \u003Ca class=\"anchor\" id=\"disclaimer\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**This notebook is entirely informative. None of the content presented in this notebook constitutes a recommendation that any particular security, portfolio of securities, transaction or investment strategy is suitable for any specific person. Futures, stocks and options trading involves substantial risk of loss and is not suitable for every investor. The valuation of futures, stocks and options may fluctuate, and, as a result, clients may lose more than their original investment.**\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**All trading strategies are used at your own risk.**\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;There are many many more details to explore - in choosing data features, in choosing algorithms, in tuning the algos, etc. This version of the notebook itself took me 2 weeks to finish. I am sure there are many unaswered parts of the process. So, any comments and suggestion - please do share. I'd be happy to add and test any ideas in the current process.\n\nThanks for reading.\n\nBest,\nBoris\n","# 利用人工智能的最新进展预测股市走势\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;在本笔记本中，我将构建一个完整的股票价格走势预测流程。跟随我的步骤，我们将取得相当不错的效果。为此，我们将使用一种**生成对抗网络**（GAN），其中以**LSTM**——一种循环神经网络——作为生成器，并以**卷积神经网络**（CNN）作为判别器。之所以选择LSTM，是因为我们要预测的是时间序列数据。那么，为什么我们要使用GAN，尤其是CNN作为判别器呢？这是一个好问题，我们将在后面专门讨论。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;当然，我们会对每一步进行更详细的说明，但最困难的部分在于GAN：成功训练GAN的关键之一是找到合适的超参数组合。因此，我们将采用**贝叶斯优化**（结合高斯过程）和**强化学习**（RL），来决定何时以及如何调整GAN的超参数（即探索与利用之间的权衡）。在构建强化学习模型时，我们将使用该领域的最新进展，例如**Rainbow**和**PPO**。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;我们将使用多种不同类型的数据作为输入。除了股票的历史交易数据和技术指标外，我们还将利用自然语言处理（NLP）领域的最新成果（如“来自Transformer的双向嵌入表示”——BERT，一种NLP中的迁移学习方法）来进行情感分析（作为基本面分析的一部分）；使用**傅里叶变换**提取整体趋势方向；利用**堆叠自编码器**识别其他高层次特征；通过**主成分分析**构建**特征投资组合**以发现相关资产；并使用自回归积分滑动平均模型（ARIMA）对股票价格函数进行近似。此外，我们还会采用更多方法，以便尽可能全面地捕捉关于股票的各种信息、模式、依赖关系等。众所周知，数据越多越好。预测股票价格走势是一项极其复杂的任务，因此我们从不同角度掌握的信息越多，预测成功的可能性就越大。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;为了构建所有神经网络，我们将使用MXNet及其高级API——Gluon，并在多块GPU上进行训练。\n\n**注：** _尽管我会尽量详细解释几乎所有算法和技术背后的数学原理和机制，但本笔记本并非旨在明确阐述机器学习\u002F深度学习或股票市场的运作方式。其主要目的是展示如何利用不同的技术和算法来准确预测股票价格走势，并为每一步所采用的技术及其合理性提供依据。_\n\n_笔记本创建日期：2019年1月9日_。\n\n\n**图1 - 我们工作的整体架构**\n\n\u003Ccenter>\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_87a782e18466.jpg' width=1060>\u003C\u002Fimg>\u003C\u002Fcenter>\n\n## 目录\n* [引言](#overview)\n* [致谢](#acknowledgement)\n* [数据](#thedata)\n    * [相关资产](#corrassets)\n    * [技术指标](#technicalind)\n    * [基本面分析](#fundamental)\n        - [来自Transformer的双向嵌入表示 - BERT](#bidirnlp)\n    * [用于趋势分析的傅里叶变换](#fouriertransform)\n    * [ARIMA作为特征](#arimafeature)\n    * [统计检验](#statchecks)\n        - [异方差性、多重共线性、序列相关性](#hetemultiser)\n    * [特征工程](#featureeng)\n        * [利用XGBoost进行特征重要性分析](#xgboost)\n    * [利用堆叠自编码器提取高层次特征](#stacked_ae)\n        * [激活函数 - GELU（高斯误差线性单元）](#gelu)\n        * [基于PCA的特征投资组合](#pca)\n    * [用于衍生品定价异常检测的深度无监督学习](#dulfaddp)\n* [生成对抗网络 - GAN](#qgan)\n    * [为什么选择GAN进行股市预测？](#whygan)\n    * [Metropolis-Hastings GAN和Wasserstein GAN](#mhganwgan)\n    * [生成器 - 
单层RNN](#thegenerator)\n        - [LSTM还是GRU](#lstmorgru)\n        - [LSTM的架构](#lstmarchitecture)\n        - [学习率调度器](#lrscheduler)\n        - [如何防止过拟合及偏差-方差权衡](#preventoverfitting)\n        - [自定义权重初始化方法和自定义损失函数](#customfns)\n    * [判别器 - 一维CNN](#thediscriminator)\n        - [为何选择CNN作为判别器？](#why_cnn_architecture)\n        - [CNN的架构](#the_cnn_architecture)\n    * [超参数](#hyperparams)\n* [超参数优化](#hyperparams_optim)\n    * [用于超参数优化的强化学习](#reinforcementlearning)\n        - [理论](#reinforcementlearning_theory)\n            - [Rainbow](#rl_rainbow)\n            - [PPO](#rl_ppo)\n        - [强化学习的进一步工作](#reinforcementlearning_further)\n    * [贝叶斯优化](#bayesian_opt)\n        - [高斯过程](#gaussprocess)\n* [结果](#theresult)\n* [下一步计划](#whatisnext)\n* [免责声明](#disclaimer)\n\n# 1. 引言 \u003Ca class=\"anchor\" id=\"overview\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;准确预测股市走势是一项复杂的任务，因为影响某只股票朝特定方向变动的因素多达数百万种。因此，我们需要尽可能多地捕捉这些先决条件。同时，我们也需要做出几个重要的假设：1) 市场并非完全随机；2) 历史会重演；3) 市场遵循人们的理性行为；4) 市场是“完美的”。请务必阅读位于页面底部的**免责声明**。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;我们将尝试预测**高盛集团**（NYSE: GS）的价格走势。为此，我们将使用2010年1月1日至2018年12月31日的每日收盘价（其中七年用于训练，两年用于验证）。_在本文中，“高盛集团”和“GS”这两个术语可以互换使用_。\n\n# 2. 致谢 \u003Ca class=\"anchor\" id=\"acknowledgement\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;在继续之前，我要感谢我的朋友\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fmanganganath\">Nuwan\u003C\u002Fa>和\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FQ4living\">Thomas\u003C\u002Fa>，没有他们的想法和支持，我不可能完成这项工作。\n\n# 3. 数据 \u003Ca class=\"anchor\" id=\"thedata\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;我们需要弄清楚哪些因素会影响高盛股价的涨跌。归根结底，这取决于市场整体的看法。因此，我们需要尽可能地纳入更多信息（从不同方面和角度刻画股票）。我们将使用每日数据——1,585天用于训练各种算法（占我们现有数据的70%），并预测接下来的680天（测试数据）。随后，我们会将预测结果与独立的测试集进行对比。每种类型的数据（我们称之为“特征”）将在后续章节中更详细地说明，但作为高层次的概述，我们将使用的特征包括：\n\n1. **相关资产**——这些是其他资产（任何类型，不一定是股票，例如大宗商品、外汇、指数，甚至固定收益证券）。像高盛这样的大型公司显然并非孤立存在，它依赖并受到许多外部因素的影响，包括竞争对手、客户、全球经济、地缘政治局势、财政和货币政策、资本获取等。具体细节将在后面列出。\n2. **技术指标**——许多投资者会关注技术指标。我们将把最流行的技术指标作为独立特征纳入模型，其中包括7日和21日移动平均线、指数移动平均线、动量指标、布林带、MACD等。\n3. **基本面分析**——这是判断股票可能上涨或下跌的一个非常重要的特征。在基本面分析中，有两种方法：1）通过10-K和10-Q报告分析公司业绩，如ROE和P\u002FE等指标（我们暂不采用）；2）**新闻**——新闻可能会预示未来可能推动股价朝某一方向变动的事件。我们将读取所有关于高盛的每日新闻，并提取当天关于高盛的整体情绪是正面、中性还是负面（以0到1之间的分数表示）。由于许多投资者密切关注新闻，并据此做出投资决策，因此如果今天关于高盛的消息极其积极，那么明天其股价很可能大幅上涨。_需要注意的是，我们稍后会对每一个特征（包括这一项）进行特征重要性分析，以确定是否将其纳入模型，具体内容将在后面详述_。\n    &nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;为了准确预测情绪，我们将使用神经语言处理（NLP）。具体来说，我们将采用谷歌最近推出的用于情感分类和股票新闻情感提取的迁移学习NLP方法——BERT。\n4. **傅里叶变换**——除了每日收盘价外，我们还将计算傅里叶变换，以便概括长短期趋势。通过这些变换，我们可以消除大量噪声（随机游走），并构建对真实股价走势的近似。有了趋势近似，LSTM网络就能更准确地捕捉预测趋势。\n5. **自回归积分滑动平均模型（ARIMA）**——在神经网络出现之前，ARIMA曾是预测时间序列数据未来值的最受欢迎的方法之一。我们将尝试加入这一方法，看看它是否能成为重要的预测特征。\n6. **堆叠自编码器**——上述大多数特征（基本面分析、技术分析等）都是经过数十年研究才被人们发现的。但也许我们遗漏了某些内容。或许存在一些隐藏的相关性，由于数据点、事件、资产、图表等数量庞大，人类难以理解。借助堆叠自编码器（一种神经网络），我们可以利用计算机的强大能力，或许能找到影响股价的新特征。即使我们无法用人类语言理解这些特征，也可以在GAN中加以应用。\n7. **深度无监督学习用于期权定价中的异常检测**——我们还将引入一项新特征：为每一天添加高盛股票90天看涨期权的价格。期权定价本身涉及大量数据。期权合约价格取决于股票未来的价值（分析师也会尝试预测该价值，以得出最准确的看涨期权价格）。通过深度无监督学习（自组织映射），我们将尝试识别每日定价中的异常情况。这种异常（如价格的剧烈变化）可能预示着某种事件，有助于LSTM网络学习整体的股价模式。\n\n接下来，在拥有如此多特征的情况下，我们需要执行几个关键步骤：\n1. 对数据的“质量”进行统计检验。如果数据本身存在问题，那么无论我们的算法多么复杂，最终的结果都不会理想。这些检验包括确保数据不存在异方差性、多重共线性或序列相关性。\n2. 
计算特征重要性。如果某个特征（例如另一只股票或某个技术指标）对我们想要预测的股票没有解释力，则无需将其用于神经网络的训练。我们将使用XGBoost（极端梯度提升）——一种基于树的增强回归算法。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;作为数据准备的最后一步，我们还将利用主成分分析（PCA）创建**特征组合**，以降低自编码器生成的特征维度。\n\n\n```python\nfrom utils import *\n\nimport time\nimport numpy as np\n\nfrom mxnet import nd, autograd, gluon\nfrom mxnet.gluon import nn, rnn\nimport mxnet as mx\nimport datetime\nimport seaborn as sns\n\nimport matplotlib.pyplot as plt\n%matplotlib inline\nfrom sklearn.decomposition import PCA\n\nimport math\n\nfrom sklearn.preprocessing import MinMaxScaler\nfrom sklearn.metrics import mean_squared_error\nfrom sklearn.preprocessing import StandardScaler\n\nimport xgboost as xgb\nfrom sklearn.metrics import accuracy_score\n```\n\n\n```python\nimport warnings\nwarnings.filterwarnings(\"ignore\")\n```\n\n\n```python\ncontext = mx.cpu(); model_ctx=mx.cpu()\nmx.random.seed(1719)\n```\n\n**注**：本节（第3节：数据）旨在展示数据预处理过程，并说明使用不同数据来源的理由，因此我仅会使用完整数据集的一部分（即用于训练的数据）来演示。\n\n\n```python\ndef parser(x):\n    return datetime.datetime.strptime(x,'%Y-%m-%d')\n```\n\n\n```python\ndataset_ex_df = pd.read_csv('data\u002Fpanel_data_close.csv', header=0, parse_dates=[0], date_parser=parser)\n```\n\n\n```python\ndataset_ex_df[['Date', 'GS']].head(3)\n```\n\n\n\n\n\u003Cdiv>\n\u003Cstyle scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n   }\n\n.dataframe thead th {\n        text-align: right;\n    }\n\u003C\u002Fstyle>\n\u003Ctable border=\"1\" class=\"dataframe\">\n  \u003Cthead>\n    \u003Ctr style=\"text-align: right;\">\n      \u003Cth>\u003C\u002Fth>\n      \u003Cth>日期\u003C\u002Fth>\n      \u003Cth>GS\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Cth>0\u003C\u002Fth>\n      \u003Ctd>2009-12-31\u003C\u002Ftd>\n      \u003Ctd>168.839996\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>1\u003C\u002Fth>\n      \u003Ctd>2010-01-04\u003C\u002Ftd>\n      \u003Ctd>173.080002\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>2\u003C\u002Fth>\n      \u003Ctd>2010-01-05\u003C\u002Ftd>\n      \u003Ctd>176.139999\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n\n\n\n```python\nprint('数据集中共有{}天的数据。'.format(dataset_ex_df.shape[0]))\n```\n\n    数据集中共有2265天的数据。\n\n\n让我们可视化过去九年的股票走势。虚线表示训练集和测试集的分界线。\n\n\n```python\nplt.figure(figsize=(14, 5), dpi=100)\nplt.plot(dataset_ex_df['Date'], dataset_ex_df['GS'], label='高盛股票')\nplt.vlines(datetime.date(2016,4, 20), 0, 270, linestyles='--', colors='gray', label='训练\u002F测试数据分割线')\nplt.xlabel('日期')\nplt.ylabel('美元')\nplt.title('图2：高盛股价')\nplt.legend()\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_7877aa06c14b.png)\n\n\n\n```python\nnum_training_days = int(dataset_ex_df.shape[0]*.7)\nprint('训练天数：{}。测试天数：{}。'.format(num_training_days, \\\n                                                                    dataset_ex_df.shape[0]-num_training_days))\n```\n\n    训练天数：1585。测试天数：680。\n\n\n\n\n## 3.1. 
相关资产 \u003Ca class=\"anchor\" id=\"corrassets\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;如前所述，我们不仅会使用GS的股票作为目标变量，还会引入其他资产作为特征。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;那么，还有哪些资产会影响GS的股价波动呢？要选择合适的相关资产，必须深入了解公司的业务领域、竞争格局、依赖关系、供应商及客户类型等：\n- 首先是与GS**业务相似的公司**，例如摩根大通和摩根士丹利等。\n- 作为一家投资银行，高盛的业绩高度依赖于**全球经济状况**。经济低迷或波动剧烈时，企业并购、首次公开募股等活动减少，自营交易收益也可能受限。因此，我们将纳入全球经济指数，并加入以美元和英镑计价的LIBOR利率，因为这些利率往往反映了经济环境的变化；此外，我们还将引入其他**金融行业**的相关证券。\n- **VIX指数**——原因同上。\n- **综合股指**——例如美国的纳斯达克指数和纽约证券交易所指数、英国的富时100指数、日本的日经225指数以及亚太地区的恒生指数和孟买敏感30指数。\n- **货币汇率**——全球贸易活动常常反映在汇率波动中，因此我们将选取一篮子货币对（如USDJPY、GBPUSD等）作为特征。\n\n#### 总之，我们的数据集中包含了72种其他资产的每日价格数据。\n\n## 3.2. 技术指标 \u003Ca class=\"anchor\" id=\"technicalind\">\u003C\u002Fa>\n\n我们已经介绍了什么是技术指标以及为什么使用它们，因此现在直接进入代码部分。我们将只为GS创建技术指标。\n\n\n```python\ndef get_technical_indicators(dataset):\n    # 计算7日和21日移动平均线\n    dataset['ma7'] = dataset['price'].rolling(window=7).mean()\n    dataset['ma21'] = dataset['price'].rolling(window=21).mean()\n    \n    # 计算MACD\n    dataset['26ema'] = pd.ewma(dataset['price'], span=26)\n    dataset['12ema'] = pd.ewma(dataset['price'], span=12)\n    dataset['MACD'] = (dataset['12ema']-dataset['26ema'])\n\n    # 计算布林带\n    dataset['20sd'] = pd.stats.moments.rolling_std(dataset['price'],20)\n    dataset['upper_band'] = dataset['ma21'] + (dataset['20sd']*2)\n    dataset['lower_band'] = dataset['ma21'] - (dataset['20sd']*2)\n    \n    # 计算指数移动平均线\n    dataset['ema'] = dataset['price'].ewm(com=0.5).mean()\n    \n    # 计算动量\n    dataset['momentum'] = dataset['price']-1\n    \n    return dataset\n```\n\n\n```python\ndataset_TI_df = get_technical_indicators(dataset_ex_df[['GS']])\n```\n\n\n```python\ndataset_TI_df.head()\n```\n\n\n\n\n\u003Cdiv>\n\u003Cstyle scoped>\n    .dataframe tbody tr th:only-of-type {\n        vertical-align: middle;\n    }\n\n    .dataframe tbody tr th {\n        vertical-align: top;\n    }\n\n    .dataframe thead th {\n        text-align: right;\n    }\n\u003C\u002Fstyle>\n\u003Ctable border=\"1\" class=\"dataframe\">\n  \u003Cthead>\n    \u003Ctr style=\"text-align: right;\">\n      \u003Cth>\u003C\u002Fth>\n      \u003Cth>Date\u003C\u002Fth>\n      \u003Cth>price\u003C\u002Fth>\n      \u003Cth>ma7\u003C\u002Fth>\n      \u003Cth>ma21\u003C\u002Fth>\n      \u003Cth>26ema\u003C\u002Fth>\n      \u003Cth>12ema\u003C\u002Fth>\n      \u003Cth>MACD\u003C\u002Fth>\n      \u003Cth>20sd\u003C\u002Fth>\n      \u003Cth>upper_band\u003C\u002Fth>\n      \u003Cth>lower_band\u003C\u002Fth>\n      \u003Cth>ema\u003C\u002Fth>\n      \u003Cth>momentum\u003C\u002Fth>\n      \u003Cth>log_momentum\u003C\u002Fth>\n    \u003C\u002Ftr>\n  \u003C\u002Fthead>\n  \u003Ctbody>\n    \u003Ctr>\n      \u003Cth>0\u003C\u002Fth>\n      \u003Ctd>2010-02-01\u003C\u002Ftd>\n      \u003Ctd>153.130005\u003C\u002Ftd>\n      \u003Ctd>152.374285\u003C\u002Ftd>\n      \u003Ctd>164.220476\u003C\u002Ftd>\n      \u003Ctd>160.321839\u003C\u002Ftd>\n      \u003Ctd>156.655072\u003C\u002Ftd>\n      \u003Ctd>-3.666767\u003C\u002Ftd>\n      \u003Ctd>9.607375\u003C\u002Ftd>\n      \u003Ctd>183.435226\u003C\u002Ftd>\n      \u003Ctd>145.005726\u003C\u002Ftd>\n      \u003Ctd>152.113609\u003C\u002Ftd>\n      \u003Ctd>152.130005\u003C\u002Ftd>\n      \u003Ctd>5.024735\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>1\u003C\u002Fth>\n      \u003Ctd>2010-02-02\u003C\u002Ftd>\n      \u003Ctd>156.940002\u003C\u002Ftd>\n      \u003Ctd>152.777143\u003C\u002Ftd>\n      \u003Ctd>163.653809\u003C\u002Ftd>\n      \u003Ctd>160.014868\u003C\u002Ftd>\n      
\u003Ctd>156.700048\u003C\u002Ftd>\n      \u003Ctd>-3.314821\u003C\u002Ftd>\n      \u003Ctd>9.480630\u003C\u002Ftd>\n      \u003Ctd>182.615070\u003C\u002Ftd>\n      \u003Ctd>144.692549\u003C\u002Ftd>\n      \u003Ctd>155.331205\u003C\u002Ftd>\n      \u003Ctd>155.940002\u003C\u002Ftd>\n      \u003Ctd>5.049471\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>2\u003C\u002Fth>\n      \u003Ctd>2010-02-03\u003C\u002Ftd>\n      \u003Ctd>157.229996\u003C\u002Ftd>\n      \u003Ctd>153.098572\u003C\u002Ftd>\n      \u003Ctd>162.899047\u003C\u002Ftd>\n      \u003Ctd>160.729523\u003C\u002Ftd>\n      \u003Ctd>158.967168\u003C\u002Ftd>\n      \u003Ctd>-2.982871\u003C\u002Ftd>\n      \u003Ctd>9.053702\u003C\u002Ftd>\n      \u003Ctd>181.006450\u003C\u002Ftd>\n      \u003Ctd>144.791644\u003C\u002Ftd>\n      \u003Ctd>156.597065\u003C\u002Ftd>\n      \u003Ctd>156.229996\u003C\u002Ftd>\n      \u003Ctd>5.051329\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>3\u003C\u002Fth>\n      \u003Ctd>2010-02-04\u003C\u002Ftd>\n      \u003Ctd>150.679993\u003C\u002Ftd>\n      \u003Ctd>153.069999\u003C\u002Ftd>\n      \u003Ctd>161.686666\u003C\u002Ftd>\n      \u003Ctd>158.967168\u003C\u002Ftd>\n      \u003Ctd>155.827031\u003C\u002Ftd>\n      \u003Ctd>-3.140137\u003C\u002Ftd>\n      \u003Ctd>8.940246\u003C\u002Ftd>\n      \u003Ctd>179.567157\u003C\u002Ftd>\n      \u003Ctd>143.806174\u003C\u002Ftd>\n      \u003Ctd>152.652350\u003C\u002Ftd>\n      \u003Ctd>149.679993\u003C\u002Ftd>\n      \u003Ctd>5.008500\u003C\u002Ftd>\n    \u003C\u002Ftr>\n    \u003Ctr>\n      \u003Cth>4\u003C\u002Fth>\n      \u003Ctd>2010-02-05\u003C\u002Ftd>\n      \u003Ctd>154.160004\u003C\u002Ftd>\n      \u003Ctd>153.449999\u003C\u002Ftd>\n      \u003Ctd>160.729523\u003C\u002Ftd>\n      \u003Ctd>158.550196\u003C\u002Ftd>\n      \u003Ctd>155.566566\u003C\u002Ftd>\n      \u003Ctd>-2.983631\u003C\u002Ftd>\n      \u003Ctd>8.151912\u003C\u002Ftd>\n      \u003Ctd>177.033348\u003C\u002Ftd>\n      \u003Ctd>144.425699\u003C\u002Ftd>\n      \u003Ctd>153.657453\u003C\u002Ftd>\n      \u003Ctd>153.160004\u003C\u002Ftd>\n      \u003Ctd>5.031483\u003C\u002Ftd>\n    \u003C\u002Ftr>\n  \u003C\u002Ftbody>\n\u003C\u002Ftable>\n\u003C\u002Fdiv>\n\n\n\n这样，我们就得到了每个交易日的技术指标（包括MACD、布林带等），总共有12个技术指标。\n\n让我们可视化这些指标的最后400天数据。\n\n\n```python\ndef plot_technical_indicators(dataset, last_days):\n    plt.figure(figsize=(16, 10), dpi=100)\n    shape_0 = dataset.shape[0]\n    xmacd_ = shape_0-last_days\n    \n    dataset = dataset.iloc[-last_days:, :]\n    x_ = range(3, dataset.shape[0])\n    x_ =list(dataset.index)\n    \n    # 绘制第一个子图\n    plt.subplot(2, 1, 1)\n    plt.plot(dataset['ma7'],label='MA 7', color='g',linestyle='--')\n    plt.plot(dataset['price'],label='收盘价', color='b')\n    plt.plot(dataset['ma21'],label='MA 21', color='r',linestyle='--')\n    plt.plot(dataset['upper_band'],label='上轨', color='c')\n    plt.plot(dataset['lower_band'],label='下轨', color='c')\n    plt.fill_between(x_, dataset['lower_band'], dataset['upper_band'], alpha=0.35)\n    plt.title('高盛公司技术指标——最近{}天'.format(last_days))\n    plt.ylabel('美元')\n    plt.legend()\n\n    # 绘制第二个子图\n    plt.subplot(2, 1, 2)\n    plt.title('MACD')\n    plt.plot(dataset['MACD'],label='MACD', linestyle='-.')\n    plt.hlines(15, xmacd_, shape_0, colors='g', linestyles='--')\n    plt.hlines(-15, xmacd_, shape_0, colors='g', linestyles='--')\n    plt.plot(dataset['log_momentum'],label='动量', color='b',linestyle='-')\n\n    plt.legend()\n    
plt.show()\n```\n\n\n```python\nplot_technical_indicators(dataset_TI_df, 400)\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_4d011c141eca.png)\n\n\n## 3.3. 基本面分析 \u003Ca class=\"anchor\" id=\"fundamental\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;对于基本面分析，我们将对所有关于GS的每日新闻进行情感分析。在最后使用Sigmoid函数，结果将介于0到1之间。分数越接近0，表示新闻越负面；越接近1，则表示正面情感。对于每一天，我们将计算其平均每日得分（一个介于0到1之间的数字），并将其作为特征添加进去。\n\n### 3.3.1. 双向Transformer嵌入表示——BERT \u003Ca class=\"anchor\" id=\"bidirnlp\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;为了将新闻分类为正面、负面或中性，我们将使用\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1810.04805\">BERT\u003C\u002Fa>,这是一种预训练的语言表示模型。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;MXNet\u002FGluon中已经提供了预训练的BERT模型。我们只需实例化它们，并添加两个任意数量的```Dense```层，然后通过softmax输出，最终得到一个介于0到1之间的得分。\n\n\n```python\n# 直接导入BERT\nimport bert\n```\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;关于BERT和自然语言处理的详细内容不在本笔记本的讨论范围内，但如果你感兴趣，请告诉我——我会专门创建一个新的仓库来介绍BERT，因为它在语言处理任务中确实非常有前景。\n\n## 3.4. 用于趋势分析的傅里叶变换 \u003Ca class=\"anchor\" id=\"fouriertransform\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**傅里叶变换**将一个函数分解为一系列正弦波（具有不同的振幅和频率）。这些正弦波组合起来可以近似原函数。从数学上看，傅里叶变换的形式如下：\n\n$$G(f) = \\int_{-\\infty}^\\infty g(t) e^{-i 2 \\pi f t} dt$$\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;我们将使用傅里叶变换来提取高盛股票的全局和局部趋势，并对其进行一定程度的去噪处理。下面我们来看看它是如何工作的。\n\n\n```python\ndata_FT = dataset_ex_df[['Date', 'GS']]\n```\n\n\n```python\nclose_fft = np.fft.fft(np.asarray(data_FT['GS'].tolist()))\nfft_df = pd.DataFrame({'fft':close_fft})\nfft_df['absolute'] = fft_df['fft'].apply(lambda x: np.abs(x))\nfft_df['angle'] = fft_df['fft'].apply(lambda x: np.angle(x))\n```\n\n\n```python\nplt.figure(figsize=(14, 7), dpi=100)\nfft_list = np.asarray(fft_df['fft'].tolist())\nfor num_ in [3, 6, 9, 100]:\n    fft_list_m10= np.copy(fft_list); fft_list_m10[num_:-num_]=0\n    plt.plot(np.fft.ifft(fft_list_m10), label='包含 {} 个分量的傅里叶变换'.format(num_))\nplt.plot(data_FT['GS'],  label='真实值')\nplt.xlabel('天数')\nplt.ylabel('美元')\nplt.title('图3：高盛（收盘价）股价与傅里叶变换')\nplt.legend()\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_5135aafa6ea7.png)\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;如图3所示，我们使用的傅里叶变换分量越多，近似函数就越接近真实的股票价格（使用100个分量的变换几乎与原始函数完全重合——红色和紫色线条几乎重叠在一起）。由于我们的目的是提取长期和短期趋势，因此我们将使用包含3、6和9个分量的傅里叶变换。可以推断出，包含3个分量的变换代表了长期趋势。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;另一种用于数据去噪的技术是**小波变换**。小波变换和傅里叶变换的结果相似，因此我们只使用傅里叶变换。\n\n\n\n```python\nfrom collections import deque\nitems = deque(np.asarray(fft_df['absolute'].tolist()))\nitems.rotate(int(np.floor(len(fft_df)\u002F2)))\nplt.figure(figsize=(10, 7), dpi=80)\nplt.stem(items)\nplt.title('图4：傅里叶变换的各分量')\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_589833284e28.png)\n\n\n## 3.5. 
ARIMA作为特征 \u003Ca class=\"anchor\" id=\"arimafeature\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**ARIMA**是一种用于预测时间序列数据的技术。我们将展示如何使用它，尽管ARIMA不会作为最终的预测模型，但我们会将其用作一种对股票数据进行去噪并可能提取一些新模式或特征的方法。\n\n\n```python\nfrom statsmodels.tsa.arima_model import ARIMA\nfrom pandas import DataFrame\nfrom pandas import datetime\n\nseries = data_FT['GS']\nmodel = ARIMA(series, order=(5, 1, 0))\nmodel_fit = model.fit(disp=0)\nprint(model_fit.summary())\n```\n\n                                 ARIMA Model Results                              \n    ==============================================================================\n    Dep. Variable:                   D.GS   No. Observations:                 2264\n    Model:                 ARIMA(5, 1, 0)   Log Likelihood               -5465.888\n    Method:                       css-mle   S.D. of innovations              2.706\n    Date:                Wed, 09 Jan 2019   AIC                          10945.777\n    Time:                        10:28:07   BIC                          10985.851\n    Sample:                             1   HQIC                         10960.399\n                                                                                  \n    ==============================================================================\n                     coef    std err          z      P>|z|      [0.025      0.975]\n    ------------------------------------------------------------------------------\n    const         -0.0011      0.054     -0.020      0.984      -0.106       0.104\n    ar.L1.D.GS    -0.0205      0.021     -0.974      0.330      -0.062       0.021\n    ar.L2.D.GS     0.0140      0.021      0.665      0.506      -0.027       0.055\n    ar.L3.D.GS    -0.0030      0.021     -0.141      0.888      -0.044       0.038\n    ar.L4.D.GS     0.0026      0.021      0.122      0.903      -0.039       0.044\n    ar.L5.D.GS    -0.0522      0.021     -2.479      0.013      -0.093      -0.011\n                                        Roots                                    \n    =============================================================================\n                      Real          Imaginary           Modulus         Frequency\n    -----------------------------------------------------------------------------\n    AR.1           -1.7595           -0.0000j            1.7595           -0.5000\n    AR.2           -0.5700           -1.7248j            1.8165           -0.3008\n    AR.3           -0.5700           +1.7248j            1.8165            0.3008\n    AR.4            1.4743           -1.0616j            1.8168           -0.0993\n    AR.5            1.4743           +1.0616j            1.8168            0.0993\n    -----------------------------------------------------------------------------\n\n\n\n```python\nfrom pandas.tools.plotting import autocorrelation_plot\nautocorrelation_plot(series)\nplt.figure(figsize=(10, 7), dpi=80)\nplt.show() \n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_db914aa51a20.png)\n\n\n\n    \u003CFigure size 800x560 with 0 Axes>\n\n\n\n```python\nfrom pandas import read_csv\nfrom pandas import datetime\nfrom statsmodels.tsa.arima_model import ARIMA\nfrom sklearn.metrics import mean_squared_error\n\nX = series.values\nsize = int(len(X) * 0.66)\ntrain, test = X[0:size], X[size:len(X)]\nhistory = [x for x in train]\npredictions = list()\nfor t in range(len(test)):\n    model = ARIMA(history, order=(5,1,0))\n    model_fit = model.fit(disp=0)\n   
 output = model_fit.forecast()\n    yhat = output[0]\n    predictions.append(yhat)\n    obs = test[t]\n    history.append(obs)\n```\n\n\n```python\nerror = mean_squared_error(test, predictions)\nprint('Test MSE: %.3f' % error)\n```\n\n    Test MSE: 10.151\n\n\n\n```python\n\n# 绘制ARIMA模型预测的价格与实际价格\n\nplt.figure(figsize=(12, 6), dpi=100)\nplt.plot(test, label='实际')\nplt.plot(predictions, color='red', label='预测')\nplt.xlabel('天数')\nplt.ylabel('美元')\nplt.title('图5：GS股票的ARIMA模型')\nplt.legend()\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_b29e30acdf6e.png)\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;从图5可以看出，ARIMA模型对实际股价的拟合效果非常好。我们将使用ARIMA预测的价格作为LSTM模型的一个输入特征，因为如前所述，我们希望尽可能地捕捉关于高盛公司的各种特征和模式。我们计算得到的均方误差（MSE）为10.151，这个结果本身并不算差（考虑到我们的测试数据量较大），但我们仍然只将其作为LSTM模型的一个特征来使用。\n\n## 3.6. 统计检验 \u003Ca class=\"anchor\" id=\"statchecks\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;确保数据质量对于我们的模型至关重要。为了确认我们的数据是可靠的，我们将进行一些简单的检查，以保证我们所获得和观察到的结果是真实的，而不是由于底层数据分布存在根本性错误而导致的偏差。\n\n### 3.6.1. 异方差性、多重共线性、序列相关性 \u003Ca class=\"anchor\" id=\"hetemultiser\">\u003C\u002Fa>\n\n- **条件异方差性**是指误差项（回归预测值与真实值之间的差异）依赖于数据本身——例如，随着数据点（沿x轴方向）增大，误差项也会增大。\n- **多重共线性**是指误差项（也称为残差）彼此相互依赖。\n- **序列相关性**是指一个数据（特征）完全由另一个特征决定或通过某种公式计算得出。\n\n这里我们不再详细展开代码，因为实现起来较为简单，而我们的重点在于深度学习部分，**但可以确定的是，数据质量是合格的**。\n\n## 3.7. 特征工程 \u003Ca class=\"anchor\" id=\"featureeng\">\u003C\u002Fa>\n\n\n```python\nprint('总数据集共有{}个样本，{}个特征。'.format(dataset_total_df.shape[0], \\\n                                                              dataset_total_df.shape[1]))\n```\n\n    总数据集共有2265个样本，112个特征。\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;因此，在整合了各类数据（相关资产、技术指标、基本面分析、傅里叶变换以及ARIMA预测）之后，我们得到了总共112个特征，覆盖2,265天的数据（不过正如之前提到的，只有1,585天用于训练）。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;此外，我们还将利用自编码器生成更多的特征。\n\n### 3.7.1. 
使用XGBoost进行特征重要性分析 \u003Ca class=\"anchor\" id=\"xgboost\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;面对如此多的特征，我们需要判断它们是否都能有效反映高盛股票的走势。例如，我们在数据集中加入了以美元计价的LIBOR利率，因为我们认为LIBOR的变化可能预示着经济形势的变动，进而影响高盛股票的表现。但这一点仍需验证。评估特征重要性的方法有很多，我们选择使用XGBoost，因为它在分类和回归问题中都能取得较好的效果。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;由于特征数据集规模较大，此处为了演示目的，我们仅以技术指标为例。在实际的特征重要性测试中，所有选定的特征都被证明具有一定的参考价值，因此在训练GAN时我们不会排除任何特征。\n\n\n```python\ndef get_feature_importance_data(data_income):\n    data = data_income.copy()\n    y = data['price']\n    X = data.iloc[:, 1:]\n    \n    train_samples = int(X.shape[0] * 0.65)\n \n    X_train = X.iloc[:train_samples]\n    X_test = X.iloc[train_samples:]\n\n    y_train = y.iloc[:train_samples]\n    y_test = y.iloc[train_samples:]\n    \n    return (X_train, y_train), (X_test, y_test)\n```\n\n\n```python\n# 获取训练和测试数据\n(X_train_FI, y_train_FI), (X_test_FI, y_test_FI) = get_feature_importance_data(dataset_TI_df)\n```\n\n\n```python\nregressor = xgb.XGBRegressor(gamma=0.0,n_estimators=150,base_score=0.7,colsample_bytree=1,learning_rate=0.05)\n```\n\n\n```python\nxgbModel = regressor.fit(X_train_FI,y_train_FI, \\\n                         eval_set = [(X_train_FI, y_train_FI), (X_test_FI, y_test_FI)], \\\n                         verbose=False)\n```\n\n\n```python\neval_result = regressor.evals_result()\n```\n\n\n```python\ntraining_rounds = range(len(eval_result['validation_0']['rmse']))\n```\n\n让我们绘制训练和验证误差曲线，以便观察训练过程并检查是否存在过拟合现象（从图中可以看出，并没有过拟合）。\n\n\n```python\nplt.scatter(x=training_rounds,y=eval_result['validation_0']['rmse'],label='训练误差')\nplt.scatter(x=training_rounds,y=eval_result['validation_1']['rmse'],label='验证误差')\nplt.xlabel('迭代次数')\nplt.ylabel('RMSE')\nplt.title('训练误差与验证误差对比')\nplt.legend()\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_713075159d3c.png)\n\n\n\n```python\nfig = plt.figure(figsize=(8,8))\nplt.xticks(rotation='vertical')\nplt.bar([i for i in range(len(xgbModel.feature_importances_))], xgbModel.feature_importances_.tolist(), tick_label=X_test_FI.columns)\nplt.title('图6：技术指标的特征重要性')\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_ad6b3bc67324.png)\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;毫不意外的是（对于有股票交易经验的人来说），MA7、MACD和BB等指标被证明是重要的特征之一。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;我采用同样的逻辑对整个数据集进行了特征重要性分析——只是训练时间更长，且结果相对复杂，不如仅针对少数几个特征时那么直观易读。\n\n## 3.8. 使用堆叠自编码器提取高层次特征 \u003Ca class=\"anchor\" id=\"stacked_ae\">\u003C\u002Fa>\n\n在进入自编码器部分之前，我们先探讨一种替代的激活函数。\n\n### 3.8.1. 
激活函数——GELU（高斯误差线性单元） \u003Ca class=\"anchor\" id=\"gelu\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**GELU**——高斯误差线性单元最近被提出——\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1606.08415.pdf\">链接\u003C\u002Fa>。论文作者指出，在多个实验中，使用GELU作为激活函数的神经网络表现优于使用ReLU的网络。`gelu`也被用于**BERT**，即我们用来进行新闻情感分析的自然语言处理方法。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;我们将在自编码器中使用GELU。\n\n**注意**：下面的单元格展示了 GELU 数学公式的逻辑推导过程，但它并不是作为激活函数的实际实现。我必须在 MXNet 中重新实现 GELU。如果你直接修改代码，将 ```act_type='relu'``` 改为 ```act_type='gelu'```，是不会生效的，除非你更改 MXNet 的实现。你可以向整个项目提交一个 Pull Request 来访问 MXNet 中 GELU 的实现。\n\n\n```python\ndef gelu(x):\n    return 0.5 * x * (1 + math.tanh(math.sqrt(2 \u002F math.pi) * (x + 0.044715 * math.pow(x, 3))))\ndef relu(x):\n    return max(x, 0)\ndef lrelu(x):\n    return max(0.01*x, x)\n```\n\n让我们可视化一下 ```GELU```, ```ReLU``` 和 ```LeakyReLU```（后者主要用于生成对抗网络——我们也会用到它）。\n\n\n```python\nplt.figure(figsize=(15, 5))\nplt.subplots_adjust(left=None, bottom=None, right=None, top=None, wspace=.5, hspace=None)\n\nranges_ = (-10, 3, .25)\n\nplt.subplot(1, 2, 1)\nplt.plot([i for i in np.arange(*ranges_)], [relu(i) for i in np.arange(*ranges_)], label='ReLU', marker='.')\nplt.plot([i for i in np.arange(*ranges_)], [gelu(i) for i in np.arange(*ranges_)], label='GELU')\nplt.hlines(0, -10, 3, colors='gray', linestyles='--', label='0')\nplt.title('图7：自编码器中的 GELU 激活函数')\nplt.ylabel('GELU 和 ReLU 的 f(x)')\nplt.xlabel('x')\nplt.legend()\n\nplt.subplot(1, 2, 2)\nplt.plot([i for i in np.arange(*ranges_)], [lrelu(i) for i in np.arange(*ranges_)], label='Leaky ReLU')\nplt.hlines(0, -10, 3, colors='gray', linestyles='--', label='0')\nplt.ylabel('Leaky ReLU 的 f(x)')\nplt.xlabel('x')\nplt.title('图8：LeakyReLU')\nplt.legend()\n\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_8882e864a706.png)\n\n\n**注意**：在本笔记本的未来版本中，我将尝试使用 **U-Net**（\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F1505.04597\">链接\u003C\u002Fa>），并利用卷积层来提取（甚至创建）更多关于股票底层运动模式的特征。目前，我们仍将只使用由 ```Dense``` 层构成的简单自编码器。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;好了，回到自编码器，如下所示（图片仅为示意图，并不反映实际的层数、单元数等）。\n\n**注意**：我在后续版本中会探索的一个做法是移除解码器的最后一层。通常情况下，自编码器的编码器和解码器层数相同。然而，我们希望提取更高层次的特征（而不是重建与输入完全相同的输出），因此可以跳过解码器的最后一层。我们在训练时仍然保持编码器和解码器具有相同的层数，但在生成最终输出时，则使用倒数第二层的输出，因为它包含了更高层次的特征。\n\n\u003Ccenter>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_fbd88935fb4c.jpg\" width=428>\u003C\u002Fimg>\u003C\u002Fcenter>\n\n\n```python\nbatch_size = 64\nn_batches = VAE_data.shape[0]\u002Fbatch_size\nVAE_data = VAE_data.values\n\ntrain_iter = mx.io.NDArrayIter(data={'data': VAE_data[:num_training_days,:-1]}, \\\n                               label={'label': VAE_data[:num_training_days, -1]}, batch_size = batch_size)\ntest_iter = mx.io.NDArrayIter(data={'data': VAE_data[num_training_days:,:-1]}, \\\n                              label={'label': VAE_data[num_training_days:,-1]}, batch_size = batch_size)\n```\n\n\n```python\nmodel_ctx =  mx.cpu()\nclass VAE(gluon.HybridBlock):\n    def __init__(self, n_hidden=400, n_latent=2, n_layers=1, n_output=784, \\\n                 batch_size=100, act_type='relu', **kwargs):\n        self.soft_zero = 1e-10\n        self.n_latent = n_latent\n        self.batch_size = batch_size\n        self.output = None\n        self.mu = None\n        super(VAE, self).__init__(**kwargs)\n        \n        with self.name_scope():\n            self.encoder = nn.HybridSequential(prefix='encoder')\n            \n            for 
i in range(n_layers):\n                self.encoder.add(nn.Dense(n_hidden, activation=act_type))\n            self.encoder.add(nn.Dense(n_latent*2, activation=None))\n\n            self.decoder = nn.HybridSequential(prefix='decoder')\n            for i in range(n_layers):\n                self.decoder.add(nn.Dense(n_hidden, activation=act_type))\n            self.decoder.add(nn.Dense(n_output, activation='sigmoid'))\n\n    def hybrid_forward(self, F, x):\n        h = self.encoder(x)\n        #print(h)\n        mu_lv = F.split(h, axis=1, num_outputs=2)\n        mu = mu_lv[0]\n        lv = mu_lv[1]\n        self.mu = mu\n\n        eps = F.random_normal(loc=0, scale=1, shape=(self.batch_size, self.n_latent), ctx=model_ctx)\n        z = mu + F.exp(0.5*lv)*eps\n        y = self.decoder(z)\n        self.output = y\n\n        KL = 0.5*F.sum(1+lv-mu*mu-F.exp(lv),axis=1)\n        logloss = F.sum(x*F.log(y+self.soft_zero)+ (1-x)*F.log(1-y+self.soft_zero), axis=1)\n        loss = -logloss-KL\n\n        return loss\n```\n\n\n```python\nn_hidden=400 # 每层的神经元数量\nn_latent=2 \nn_layers=3 # 编码器和解码器中各有多少个全连接层\nn_output=VAE_data.shape[1]-1 \n\nnet = VAE(n_hidden=n_hidden, n_latent=n_latent, n_layers=n_layers, n_output=n_output, batch_size=batch_size, act_type='gelu')\n```\n\n\n```python\nnet.collect_params().initialize(mx.init.Xavier(), ctx=mx.cpu())\nnet.hybridize()\ntrainer = gluon.Trainer(net.collect_params(), 'adam', {'learning_rate': .01})\n```\n\n\n```python\nprint(net)\n```\n\n    VAE(\n      (encoder): HybridSequential(\n        (0): Dense(None -> 400, Activation(relu))\n        (1): Dense(None -> 400, Activation(relu))\n        (2): Dense(None -> 400, Activation(relu))\n        (3): Dense(None -> 4, linear)\n      )\n      (decoder): HybridSequential(\n        (0): Dense(None -> 400, Activation(relu))\n        (1): Dense(None -> 400, Activation(relu))\n        (2): Dense(None -> 400, Activation(relu))\n        (3): Dense(None -> 11, Activation(sigmoid))\n      )\n    )\n\n\n因此，编码器和解码器各有3层，每层包含400个神经元。\n\n\n```python\nn_epoch = 150\nprint_period = n_epoch \u002F\u002F 10\nstart = time.time()\n\ntraining_loss = []\nvalidation_loss = []\nfor epoch in range(n_epoch):\n    epoch_loss = 0\n    epoch_val_loss = 0\n\n    train_iter.reset()\n    test_iter.reset()\n\n    n_batch_train = 0\n    for batch in train_iter:\n        n_batch_train +=1\n        data = batch.data[0].as_in_context(mx.cpu())\n\n        with autograd.record():\n            loss = net(data)\n        loss.backward()\n        trainer.step(data.shape[0])\n        epoch_loss += nd.mean(loss).asscalar()\n\n    n_batch_val = 0\n    for batch in test_iter:\n        n_batch_val +=1\n        data = batch.data[0].as_in_context(mx.cpu())\n        loss = net(data)\n        epoch_val_loss += nd.mean(loss).asscalar()\n\n    epoch_loss \u002F= n_batch_train\n    epoch_val_loss \u002F= n_batch_val\n\n    training_loss.append(epoch_loss)\n    validation_loss.append(epoch_val_loss)\n\n    \"\"\"if epoch % max(print_period, 1) == 0:\n        print('Epoch {}, Training loss {:.2f}, Validation loss {:.2f}'.\\\n              format(epoch, epoch_loss, epoch_val_loss))\"\"\"\n\nend = time.time()\nprint('训练完成，耗时 {} 秒。'.format(int(end-start)))\n```\n\n训练在62秒内完成。\n\n\n\n```python\ndataset_total_df['Date'] = dataset_ex_df['Date']\n```\n\n\n```python\nvae_added_df = mx.nd.array(dataset_total_df.iloc[:, :-1].values)\n```\n\n\n```python\nprint('新创建的（来自自编码器）特征的形状是 {}。'.format(vae_added_df.shape))\n```\n\n    新创建的（来自自编码器）特征的形状是 (2265, 
112)。\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;我们从自编码器中创建了112个额外的特征。由于我们只希望保留高层次特征（整体模式），我们将使用主成分分析（PCA）对这112个新特征构建一个特征组合。这将降低数据的维度（列数）。该特征组合的描述能力将与原始的112个特征相同。\n\n**注意** 这再次纯属实验性质。我并不完全确定所描述的逻辑是否成立。如同人工智能和深度学习中的其他一切一样，这也是一门艺术，需要通过实验来验证。\n\n\n\n### 3.8.2. 使用PCA的特征组合 \u003Ca class=\"anchor\" id=\"pca\">\u003C\u002Fa>\n\n\n```python\n# 我们希望PCA生成的新组件能够解释80%的方差\npca = PCA(n_components=.8)\n```\n\n\n```python\nx_pca = StandardScaler().fit_transform(vae_added_df)\n```\n\n\n```python\nprincipalComponents = pca.fit_transform(x_pca)\n```\n\n\n```python\nprincipalComponents.n_components_\n```\n\n\n\n\n    84\n\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;因此，为了解释80%的方差，我们需要84个（总共112个）特征。这仍然很多。所以，目前我们暂不使用自编码器生成的特征。我计划构建一种自编码器架构，在其中从中间层（而非最后一层）获取输出，并将其连接到另一个具有30个神经元的全连接层。这样，我们就可以1) 只提取更高层次的特征，2) 显著减少列数。\n\n## 3.9. 用于衍生品定价异常检测的深度无监督学习 \u003Ca class=\"anchor\" id=\"dulfaddp\">\u003C\u002Fa>\n\n-- 将于近期添加。\n\n# 4. 生成对抗网络（GAN） \u003Ca class=\"anchor\" id=\"qgan\">\u003C\u002Fa>\n\n图9：简单的GAN架构\n\n\u003Ccenter>\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_84dd102afb98.jpg' width=960>\u003C\u002Fimg>\u003C\u002Fcenter>\n\n#### GAN是如何工作的？\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;如前所述，本笔记本的目的并非详细讲解深度学习背后的数学原理，而是展示其应用。当然，我认为从基础到细节的全面而扎实的理解至关重要。因此，我们将尝试在高屋建瓴的层面概述GAN的工作原理，以便读者充分理解在预测股票价格变动时使用GAN的合理性。如果您对GAN已经很熟悉，请随意跳过本节及下一节（但请务必查看第4.2节\u003Ca href=\"#wgan\">“WGAN”\u003C\u002Fa>）。\n\nGAN网络由两个模型组成——**生成器**（$G$）和**判别器**（$D$）。训练GAN的步骤如下：\n1. 生成器利用随机数据（噪声$z$）尝试“生成”与真实数据难以区分或极其接近的数据。其目的是学习真实数据的分布。\n2. 判别器会随机接收真实数据或生成的数据作为输入，它充当分类器，试图判断数据是来自生成器还是真实数据。$D$会估计输入样本属于真实数据集的概率。（关于比较两个分布的更多信息，请参阅下文第4.2节\u003Ca href=\"#mhganwgan\">“GAN与WGAN”\u003C\u002Fa>）。\n3. 随后，$G$和$D$的损失会被合并，并反向传播回生成器。因此，生成器的损失同时取决于生成器和判别器。这一步有助于生成器学习真实数据的分布。如果生成器生成的数据不够逼真（分布不一致），判别器就能很容易地区分生成数据和真实数据，从而导致判别器的损失非常小。而判别器损失越小，生成器的损失就会越大（见下方$L(D, G)$的公式）。这就使得判别器的设计变得有些棘手，因为过于优秀的判别器总是会导致巨大的生成器损失，从而使生成器无法继续学习。\n4. 这一过程将持续进行，直到判别器再也无法区分生成数据和真实数据为止。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;当$D$和$G$协同工作时，它们实际上是在进行一场“极小化极大”博弈（生成器试图“欺骗”判别器，提高其对虚假样本的判断概率，即最小化$\\mathbb{E}_{z \\sim p_{z}(z)} [\\log (1 - D(G(z)))]$。而判别器则希望通过最大化$\\mathbb{E}_{x \\sim p_{r}(x)} [\\log D(x)]$来区分来自生成器的数据$D(G(z))$）。然而，由于损失函数是分开的，尚不清楚两者如何能够共同收敛（这就是为什么我们会使用一些改进版的GAN，例如Wasserstein GAN）。总体而言，联合损失函数如下：\n\n$$L(D, G) = \\mathbb{E}_{x \\sim p_{r}(x)} [\\log D(x)] + \\mathbb{E}_{z \\sim p_z(z)} [\\log(1 - D(G(z)))]$$\n\n**注**：有关训练GAN的实用技巧，请参阅\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fsoumith\u002Fganhacks\">此处\u003C\u002Fa>。\n\n**注**：在本笔记本中，我不会包含**GAN**和**强化学习**部分的完整代码——仅会展示执行结果（单元格输出）。如需代码，请提交拉取请求或与我联系。\n\n## 4.1. 为什么使用GAN进行股市预测？ \u003Ca class=\"anchor\" id=\"whygan\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**生成对抗网络**（GAN）最近主要被用于生成逼真的图像、绘画和视频片段。很少有GAN被应用于像我们这样的时间序列数据预测。不过，其核心思想应该是一致的——我们希望预测未来的股票走势。在未来，GS公司的股票走势模式和行为大致应保持不变（除非其运营方式发生根本性变化，或者经济环境发生剧烈改变）。因此，我们希望“生成”未来数据，使其分布与我们已有的历史交易数据相似（当然，不必完全相同）。理论上来说，这应该是可行的。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;在我们的案例中，我们将使用**LSTM**作为时间序列生成器，用**CNN**作为判别器。\n\n## 4.2. 
Metropolis-Hastings GAN与Wasserstein GAN \u003Ca class=\"anchor\" id=\"mhganwgan\">\u003C\u002Fa>\n\n**注：** _接下来的几节内容假定读者已具备一定的GAN相关经验。_\n\n#### **一、Metropolis-Hastings GAN**\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;近期，Uber工程团队提出了一种对传统GAN的改进方法，称为**Metropolis-Hastings GAN**（MHGAN）。Uber团队提出的这一思路与其类似，即由Google和加州大学伯克利分校共同提出的**判别器拒绝采样**（\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.06758.pdf\">DRS\u003C\u002Fa>）。其核心思想在于：在传统的GAN训练中，判别器$D$仅用于指导生成器$G$更好地学习；而在训练完成后，$D$往往不再被使用。然而，MHGAN和DRS则尝试利用$D$来筛选出更接近真实数据分布的生成样本——两者的区别在于，MHGAN采用了马尔可夫链蒙特卡洛（MCMC）进行采样。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;MHGAN首先从生成器$G$中生成**_K_**个样本（这些样本由独立的噪声输入$z_0$至$z_K$产生，如图所示）。随后，它依次遍历这**_K_**个输出$x'_0$至$x'_K$，并根据由判别器$D$定义的接受准则，决定是保留当前样本还是继续使用上一个已被接受的样本。最终被保留的输出即被视为$G$的真实生成结果。\n\n**注**：MHGAN最初由Uber使用PyTorch实现，我仅将其移植到了MXNet\u002FGluon框架中。\n\n\n#### **注**：我也会在不久的将来将其上传至GitHub。\n\u003Cbr>\u003C\u002Fbr>\n图10：MHGAN的可视化示意图（摘自Uber的原始文章\u003Ca href=\"https:\u002F\u002Feng.uber.com\u002Fmh-gan\u002F?amp\">Uber post\u003C\u002Fa>）。\n\n\u003Ccenter>\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_d2289ff8fdc1.png' width=500>\u003C\u002Fimg>\u003C\u002Fcenter>\n\n#### **二、Wasserstein GAN** \u003Ca class=\"anchor\" id=\"wgan\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;GAN的训练过程相当复杂，模型常常难以收敛，并且容易出现模式坍塌现象。为此，我们将采用一种名为**Wasserstein** GAN的改进版本——\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1701.07875.pdf\">WGAN\u003C\u002Fa>。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;此处我们不深入细节，但需要强调以下几点：\n- 如前所述，GAN的核心目标是让生成器能够将随机噪声转化为我们希望模仿的真实数据。因此，在GAN中比较两个概率分布之间的相似性至关重要。目前最常用的两种度量方式是：\n  * **KL散度**（Kullback–Leibler）：$D_{KL}(p \\| q) = \\int_x p(x) \\log \\frac{p(x)}{q(x)} dx$。当$p(x)$等于$q(x)$时，$D_{KL}$为零；\n  * **JS散度**（Jensen–Shannon）：$D_{JS}(p \\| q) = \\frac{1}{2} D_{KL}(p \\| \\frac{p + q}{2}) + \\frac{1}{2} D_{KL}(q \\| \\frac{p + q}{2})$。JS散度的取值范围在0到1之间，与KL散度不同，它是对称且更为平滑的。在GAN训练中，将损失函数从KL散度切换为JS散度曾取得了显著的成功。\n- WGAN则采用Wasserstein距离作为损失函数，公式为：$W(p_r, p_g) = \\frac{1}{K} \\sup_{\\| f \\|_L \\leq K} \\mathbb{E}_{x \\sim p_r}[f(x)] - \\mathbb{E}_{x \\sim p_g}[f(x)]$（其中$\\sup$表示上确界）。该距离也被称为“地球移动距离”，因为它可以被直观地理解为将一堆沙子（分别服从不同的概率分布）移动到另一堆，同时使整个过程所需的能量最小化。相较于KL和JS散度，Wasserstein距离能够提供更加平滑的度量，避免了散度值的突然跳变，从而使得梯度下降过程更加稳定。\n- 此外，与KL和JS散度相比，Wasserstein距离几乎处处可导。在反向传播过程中，我们需要对损失函数求导以计算梯度，进而更新网络权重。因此，拥有一个可导的损失函数显得尤为重要。\n\n#### _毫不夸张地说，这部分内容是本笔记本中最难的部分。将WGAN和MHGAN结合起来，我花了整整三天时间。_\n\n## 4.4. 生成器——单层RNN \u003Ca class=\"anchor\" id=\"thegenerator\">\u003C\u002Fa>\n\n### 4.4.1. 
LSTM或GRU \u003Ca class=\"anchor\" id=\"lstmorgru\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;如前所述，生成器是一个LSTM网络，属于循环神经网络（RNN）的一种。RNN常用于处理时间序列数据，因为它们能够记住所有历史数据点，并捕捉随时间变化的模式。然而，由于其自身特性，RNN经常面临“梯度消失”问题——即在训练过程中，权重更新的幅度变得极小，以至于无法继续优化，导致网络无法达到最小化损失的目标。有时也会出现相反的情况，即梯度变得过大，这被称为“梯度爆炸”。解决这一问题的方法相对简单：当梯度超过某个阈值时，对其进行裁剪，即梯度裁剪。为了解决梯度消失问题，人们提出了两种改进方案：门控循环单元（**GRU**）和长短期记忆网络（**LSTM**）。两者的主要区别在于：1）GRU只有两个门（更新门和重置门），而LSTM则有四个门（更新门、输入门、遗忘门和输出门）；2）LSTM会维护内部记忆状态，而GRU则不会；3）LSTM会在输出门之前应用非线性激活函数（sigmoid），而GRU则没有。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;在大多数情况下，LSTM和GRU在准确率方面表现相近，但GRU的计算开销要小得多，因为其可训练参数较少。尽管如此，LSTM的应用更为广泛。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;严格来说，LSTM单元（即各个门）背后的数学公式如下：\n\n\n$$g_t = \\text{tanh}(X_t W_{xg} + h_{t-1} W_{hg} + b_g),$$\n\n\n$$i_t = \\sigma(X_t W_{xi} + h_{t-1} W_{hi} + b_i),$$\n\n$$f_t = \\sigma(X_t W_{xf} + h_{t-1} W_{hf} + b_f),$$\n\n$$o_t = \\sigma(X_t W_{xo} + h_{t-1} W_{ho} + b_o),$$\n\n$$c_t = f_t \\odot c_{t-1} + i_t \\odot g_t,$$\n\n$$h_t = o_t \\odot \\text{tanh}(c_t),$$\n\n其中$\\odot$表示逐元素相乘运算符，对于任意$x = [x_1, x_2, \\ldots, x_k]^\\top \\in R^k$，两种激活函数分别为：\n\n$$\\sigma(x) = \\left[\\frac{1}{1+\\exp(-x_1)}, \\ldots, \\frac{1}{1+\\exp(-x_k)}]\\right]^\\top,$$\n\n$$\\text{tanh}(x) = \\left[\\frac{1-\\exp(-2x_1)}{1+\\exp(-2x_1)},  \\ldots, \\frac{1-\\exp(-2x_k)}{1+\\exp(-2x_k)}\\right]^\\top$$\n\n### 4.4.2. LSTM 架构 \u003Ca class=\"anchor\" id=\"lstmarchitecture\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;LSTM 的架构非常简单——一层 ```LSTM```，包含 112 个输入单元（因为数据集中有 112 个特征）和 500 个隐藏单元，以及一层 ```Dense``` 层，输出为单个值——即每天的价格。初始化方法采用 Xavier 初始化，并使用 L1 损失函数（即带有 L1 正则化的平均绝对误差损失——有关正则化的更多信息，请参阅第 4.4.5 节）。\n\n**注意**——在代码中可以看到，我们使用了 ```Adam``` 优化器（学习率设为 0.01）。目前不必过于关注这一点——后面有一个专门的章节会解释我们使用的超参数（学习率除外，因为我们使用了学习率调度器——\u003Ca href=\"#lrscheduler\">第 4.4.3 节\u003C\u002Fa>），以及如何对这些超参数进行优化——\u003Ca href=\"#hyperparams\">第 4.6 节\u003C\u002Fa>）。\n\n```python\ngan_num_features = dataset_total_df.shape[1]\nsequence_length = 17\n\nclass RNNModel(gluon.Block):\n    def __init__(self, num_embed, num_hidden, num_layers, bidirectional=False, \\\n                 sequence_length=sequence_length, **kwargs):\n        super(RNNModel, self).__init__(**kwargs)\n        self.num_hidden = num_hidden\n        with self.name_scope():\n            self.rnn = rnn.LSTM(num_hidden, num_layers, input_size=num_embed, \\\n                                bidirectional=bidirectional, layout='TNC')\n            \n            self.decoder = nn.Dense(1, in_units=num_hidden)\n    \n    def forward(self, inputs, hidden):\n        output, hidden = self.rnn(inputs, hidden)\n        decoded = self.decoder(output.reshape((-1, self.num_hidden)))\n        return decoded, hidden\n    \n    def begin_state(self, *args, **kwargs):\n        return self.rnn.begin_state(*args, **kwargs)\n    \nlstm_model = RNNModel(num_embed=gan_num_features, num_hidden=500, num_layers=1)\nlstm_model.collect_params().initialize(mx.init.Xavier(), ctx=mx.cpu())\ntrainer = gluon.Trainer(lstm_model.collect_params(), 'adam', {'learning_rate': .01})\nloss = gluon.loss.L1Loss()\n```\n\n我们将 LSTM 层设置为 500 个神经元，并使用 Xavier 初始化。对于正则化，我们采用 L1 正则化。让我们看看 MXNet 打印出的 ```LSTM``` 内部结构。\n\n\n```python\nprint(lstm_model)\n```\n\n    RNNModel(\n      (rnn): LSTM(112 -> 500, TNC)\n      (decoder): Dense(500 -> 1, linear)\n    )\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;正如我们所见，LSTM 的输入是 112 个特征（```dataset_total_df.shape[1]```），这些特征随后进入 LSTM 层的 500 个神经元，再被转换为一个单一的输出——即股票价格值。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;LSTM 的逻辑是：我们取 17 
天（```sequence_length```）的数据（再次强调，这些数据包括每日 GS 股票的价格以及其他相关特征——如相关资产、情绪等——来预测第 18 天的价格。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;在另一篇文章中，我将探讨对标准 LSTM 进行改进是否更有益，例如：\n- 使用 **双向** LSTM 层——理论上，从数据集末尾向开头反向传递信息，可能有助于 LSTM 更好地捕捉股票走势的模式。\n- 使用 **堆叠式** RNN 架构——不只是一层 LSTM，而是两层或更多层。然而，这样做可能存在过拟合的风险，因为我们的数据量并不大（只有 1,585 天的数据）。\n- 探索 **GRU**——正如之前所解释的，GRU 的单元结构更为简单。\n- 在 RNN 中加入 **注意力** 向量。\n\n### 4.4.3. 学习率调度器 \u003Ca class=\"anchor\" id=\"lrscheduler\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;学习率是最重要的超参数之一。对于几乎所有的优化器（如 SGD、Adam 或 RMSProp）而言，在训练神经网络时设定合适的学习率至关重要，因为它既决定了收敛的速度，也影响着模型的最终性能。最简单的学习率策略之一就是在整个训练过程中保持固定的学习率。选择较小的学习率有助于优化器找到较好的解，但这也意味着初始收敛速度较慢。通过随时间调整学习率，可以克服这一权衡。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;最近的一些论文，比如 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1806.01593.pdf\">这篇\u003C\u002Fa>, 表明在训练过程中动态调整全局学习率，无论是在收敛速度还是训练时间方面，都有显著的优势。\n\n\n```python\nclass TriangularSchedule():\n    def __init__(self, min_lr, max_lr, cycle_length, inc_fraction=0.5):     \n        self.min_lr = min_lr\n        self.max_lr = max_lr\n        self.cycle_length = cycle_length\n        self.inc_fraction = inc_fraction\n        \n    def __call__(self, iteration):\n        if iteration \u003C= self.cycle_length*self.inc_fraction:\n            unit_cycle = iteration * 1 \u002F (self.cycle_length * self.inc_fraction)\n        elif iteration \u003C= self.cycle_length:\n            unit_cycle = (self.cycle_length - iteration) * 1 \u002F (self.cycle_length * (1 - self.inc_fraction))\n        else:\n            unit_cycle = 0\n        adjusted_cycle = (unit_cycle * (self.max_lr - self.min_lr)) + self.min_lr\n        return adjusted_cycle\n\nclass CyclicalSchedule():\n    def __init__(self, schedule_class, cycle_length, cycle_length_decay=1, cycle_magnitude_decay=1, **kwargs):\n        self.schedule_class = schedule_class\n        self.length = cycle_length\n        self.length_decay = cycle_length_decay\n        self.magnitude_decay = cycle_magnitude_decay\n        self.kwargs = kwargs\n    \n    def __call__(self, iteration):\n        cycle_idx = 0\n        cycle_length = self.length\n        idx = self.length\n        while idx \u003C= iteration:\n            cycle_length = math.ceil(cycle_length * self.length_decay)\n            cycle_idx += 1\n            idx += cycle_length\n        cycle_offset = iteration - idx + cycle_length\n        \n        schedule = self.schedule_class(cycle_length=cycle_length, **self.kwargs)\n        return schedule(cycle_offset) * self.magnitude_decay**cycle_idx\n```\n\n\n```python\nschedule = CyclicalSchedule(TriangularSchedule, min_lr=0.5, max_lr=2, cycle_length=500)\niterations=1500\n\nplt.plot([i+1 for i in range(iterations)],[schedule(i) for i in range(iterations)])\nplt.title('每轮迭代的学习率')\nplt.xlabel(\"轮次\")\nplt.ylabel(\"学习率\")\nplt.show()\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_918e9b50eb45.png)\n\n### 4.4.4. 
如何防止过拟合及偏差-方差权衡 \u003Ca class=\"anchor\" id=\"preventoverfitting\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;在拥有大量特征和神经网络的情况下，我们需要确保防止过拟合，并关注总损失。\n\n我们使用多种技术来防止过拟合（不仅在LSTM中，也在CNN和自编码器中）：\n- **确保数据质量**。我们已经进行了统计检验，确保数据不存在多重共线性或序列自相关问题。此外，我们还对每个特征进行了重要性检查。最后，初始的特征选择（例如选择相关资产、技术指标等）是基于对股票市场运作机制的领域知识进行的。\n- **正则化**（或权重惩罚）。最常用的两种正则化技术是LASSO（**L1**）和Ridge（**L2**）。L1会在损失函数中加入平均绝对误差，而L2则加入均方误差。不深入数学细节的话，基本区别在于：Lasso回归（L1）同时进行变量选择和参数收缩，而Ridge回归只进行参数收缩，最终会将所有系数纳入模型。当存在相关变量时，Ridge回归可能是更优的选择。此外，Ridge回归在最小二乘估计具有较高方差的情况下表现最佳。因此，具体使用哪种正则化方法取决于我们的模型目标。这两种正则化的效果截然不同。虽然它们都会惩罚较大的权重，但L1正则化在零点处会产生不可导的函数。L2正则化倾向于较小的权重，而L1正则化则倾向于使权重趋近于零。因此，使用L1正则化可能会得到一个稀疏模型——即参数较少的模型。在两种情况下，L1和L2正则化的模型参数都会“收缩”，但在L1正则化的情况下，这种收缩会直接影响模型的复杂度（即参数数量）。确切地说，Ridge回归在最小二乘估计具有较高方差时表现最佳。L1对异常值更具鲁棒性，在数据稀疏时使用，并能生成特征重要性。我们将使用L1。\n- **Dropout**。Dropout层会随机移除隐藏层中的节点。\n- **密集-稀疏-密集训练**。 - \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1607.04381v1.pdf\">链接\u003C\u002Fa>\n- **早停法**。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;在构建复杂的神经网络时，另一个重要的考虑因素是偏差-方差权衡。基本上，我们在训练网络时得到的误差是由偏差、方差以及不可约误差——σ（由噪声和随机性引起的误差）——共同决定的。该权衡关系的最简单公式为：\n\n$$Error = bias^{2} + variance + \\sigma$$\n\n- **偏差**。偏差衡量的是在训练数据集上训练好的算法在未见数据上的泛化能力。高偏差（欠拟合）意味着模型无法很好地处理未见数据。\n- **方差**。方差衡量的是模型对数据集变化的敏感程度。高方差即为过拟合。\n\n### 4.4.5. 自定义权重初始化器和自定义损失函数 \u003Ca class=\"anchor\" id=\"customfns\">\u003C\u002Fa>\n\n#### 即将推出\n\n## 4.5. 判别器——一维卷积神经网络 \u003Ca class=\"anchor\" id=\"thediscriminator\">\u003C\u002Fa>\n\n### 4.5.1. 为什么使用CNN作为判别器？ \u003Ca class=\"anchor\" id=\"why_cnn_architecture\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;我们通常将CNN用于与图像相关的任务（分类、上下文提取等）。它们在逐层提取特征方面非常强大。例如，在一张狗的图片中，第一层卷积会检测边缘，第二层开始检测圆形，第三层则会检测鼻子。而在我们的案例中，数据点形成小趋势，小趋势又形成更大的趋势，这些趋势进一步组合成模式。CNN的特征提取能力可以用来捕捉GS公司股价变动中的模式信息。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;使用CNN的另一个原因是，CNN擅长处理空间数据——也就是说，彼此靠近的数据点比分散的数据点之间关联性更强。这一点对于时间序列数据同样适用。在我们的案例中，每个数据点对应连续的一天。很自然地，我们可以假设相邻的两天之间关联性更强。需要注意的一点是（尽管本研究并未涉及），季节性因素可能会如何影响CNN的工作（如果有的话）。\n\n**注**：如同本笔记本中的许多部分一样，将CNN应用于时间序列数据仍处于实验阶段。我们将观察结果，但不会提供数学或其他形式的证明。而且，使用不同的数据、激活函数等，结果可能会有所不同。\n\n### 4.5.1. CNN架构 \u003Ca class=\"anchor\" id=\"the_cnn_architecture\">\u003C\u002Fa>\n\n图11：CNN架构的高层次概览。\n\n\u003Ccenter>\u003Cimg src='https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_332d61898309.jpg' width=960>\u003C\u002Fimg>\u003C\u002Fcenter>\n\nGAN内部的CNN代码如下所示：\n\n\n```python\nnum_fc = 512\n\n# ... GAN的其他部分\n\ncnn_net = gluon.nn.Sequential()\nwith net.name_scope():\n    \n    # 添加一维卷积层\n    cnn_net.add(gluon.nn.Conv1D(32, kernel_size=5, strides=2))\n    cnn_net.add(nn.LeakyReLU(0.01))\n    cnn_net.add(gluon.nn.Conv1D(64, kernel_size=5, strides=2))\n    cnn_net.add(nn.LeakyReLU(0.01))\n    cnn_net.add(nn.BatchNorm())\n    cnn_net.add(gluon.nn.Conv1D(128, kernel_size=5, strides=2))\n    cnn_net.add(nn.LeakyReLU(0.01))\n    cnn_net.add(nn.BatchNorm())\n    \n    # 添加两个全连接层\n    cnn_net.add(nn.Dense(220, use_bias=False), nn.BatchNorm(), nn.LeakyReLU(0.01))\n    cnn_net.add(nn.Dense(220, use_bias=False), nn.Activation(activation='relu'))\n    cnn_net.add(nn.Dense(1))\n    \n# ... 
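# 补充注释（编辑添加）：以上即判别器（critic）的主体结构，三层步幅为 2 的 Conv1D 对输入的股价序列逐层下采样，\n# 通道数由 32 增至 128，并配合 LeakyReLU(0.01) 与 BatchNorm 提升训练稳定性；\n# 随后两个 Dense(220) 全连接层汇总特征，最后的 Dense(1) 为每个输入序列输出一个标量分数，\n# 在普通 GAN 中可视为真假判别的 logit，而在上文采用的 WGAN 设定下则作为不经 sigmoid 的 critic 评分。\n# ... 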
GAN的其他部分\n```\n\n让我们打印一下CNN。\n\n\n```python\nprint(cnn_net)\n```\n\n    Sequential(\n      (0): Conv1D(None -> 32, kernel_size=(5,), stride=(2,))\n      (1): LeakyReLU(0.01)\n      (2): Conv1D(None -> 64, kernel_size=(5,), stride=(2,))\n      (3): LeakyReLU(0.01)\n      (4): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=None)\n      (5): Conv1D(None -> 128, kernel_size=(5,), stride=(2,))\n      (6): LeakyReLU(0.01)\n      (7): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=None)\n      (8): Dense(None -> 220, linear)\n      (9): BatchNorm(axis=1, eps=1e-05, momentum=0.9, fix_gamma=False, use_global_stats=False, in_channels=None)\n      (10): LeakyReLU(0.01)\n      (11): Dense(None -> 220, linear)\n      (12): Activation(relu)\n      (13): Dense(None -> 1, linear)\n    )\n\n## 4.6. 超参数 \u003Ca class=\"anchor\" id=\"hyperparams\">\u003C\u002Fa>\n\n我们将跟踪和优化的超参数包括：\n- ```batch_size```：LSTM 和 CNN 的批量大小\n- ```cnn_lr```：CNN 的学习率\n- ```strides```：CNN 中的步幅数量\n- ```lrelu_alpha```：GAN 中 LeakyReLU 的 alpha 参数\n- ```batchnorm_momentum```：CNN 中批归一化层的动量\n- ```padding```：CNN 中的填充方式\n- ```kernel_size':1```：CNN 中的卷积核大小\n- ```dropout```：LSTM 中的丢弃率\n- ```filters```：初始滤波器的数量\n\n我们将训练 200 个 ```epochs```。\n\n# 5. 超参数优化 \u003Ca class=\"anchor\" id=\"hyperparams_optim\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;在 GAN 训练完 200 个 epoch 后，它会记录 MAE（即 LSTM 中的误差函数 $G$），并将其作为奖励值传递给强化学习模块。强化学习模块将决定是更改超参数，还是继续使用当前的超参数集进行训练。正如后文所述，这种方法主要用于实验性地探索强化学习的应用。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;如果强化学习决定更新超参数，它将调用下面讨论的贝叶斯优化库，以获得下一组预期最优的超参数组合。\n\n## 5.1. 基于强化学习的超参数优化 \u003Ca class=\"anchor\" id=\"reinforcementlearning\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;为什么要在超参数优化中使用强化学习呢？股票市场瞬息万变。即使我们成功训练出能够生成极其准确结果的 GAN 和 LSTM 模型，这些结果也可能只在特定时期内有效。这意味着我们需要不断地对整个流程进行优化。为了优化这一过程，我们可以：\n- 添加或移除特征（例如加入可能具有相关性的新股票或货币）\n- 改进我们的深度学习模型。而改进模型最重要的方法之一就是调整超参数（详见第 5 节）。一旦找到一组合适的超参数，我们就需要决定何时更换它们，以及何时继续使用现有的超参数集（即探索与利用之间的权衡）。此外，股票市场是一个连续的状态空间，受到数百万种因素的影响。\n\n**注**：本笔记本中关于强化学习的部分更多地是为了研究目的。我们将以 GAN 作为环境，尝试不同的强化学习方法。实际上，在不使用强化学习的情况下，也有许多成功的方法可以对深度学习模型的超参数进行优化。不过……为什么不试试呢？\n\n**注**：接下来的几节内容假设您已经具备一定的强化学习知识——尤其是策略类方法和 Q 学习。\n\n### 5.1.1. 强化学习理论 \u003Ca class=\"anchor\" id=\"reinforcementlearning_theory\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;在不解释强化学习基本原理的情况下，我们将直接进入此处实现的具体方法细节。由于我们并不了解完整的环境状态，因此无法建立明确的环境模型——如果存在这样的模型，我们就无需预测股票价格走势，因为价格会直接遵循该模型运行。基于这一原因，我们将采用无模型强化学习算法，具体分为策略优化和 Q 学习两类。\n- **Q 学习**：在 Q 学习中，我们学习的是从某一状态下采取行动所获得的**价值**。“Q 值”是指采取该行动后预期得到的回报。我们将使用 Rainbow 算法，它是七种 Q 学习算法的组合。\n- **策略优化**：在策略优化中，我们学习的是从某一状态下应采取的动作。（如果我们使用类似 Actor-Critic 的方法）我们还会学习处于某一状态下的价值。我们将使用近端策略优化算法。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;构建强化学习算法的一个关键环节是准确设定奖励机制。奖励必须全面反映环境及其与智能体之间的交互关系。我们定义奖励 **_R_** 如下：\n\n$$Reward = 2*loss_G + loss_D + accuracy_G,$$\n\n其中 $loss_G$、$accuracy_G$ 和 $loss_D$ 分别表示生成器的损失、生成器的准确率以及判别器的损失。这里的环境是 GAN 及其驱动的 LSTM 训练结果。智能体可采取的动作则是如何调整 GAN 的 $D$ 和 $G$ 网络的超参数。\n\n#### 5.1.1.1. 
Rainbow \u003Ca class=\"anchor\" id=\"rl_rainbow\">\u003C\u002Fa>\n\n**什么是 Rainbow？**\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;Rainbow（\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1710.02298.pdf\">链接\u003C\u002Fa>) 是一种基于 Q 学习的离策略深度强化学习算法，它结合了七种不同的技术：\n* **DQN**。DQN 是 Q 学习算法的扩展，使用神经网络来表示 Q 值。与监督学习（深度学习）类似，在 DQN 中我们训练一个神经网络，并试图最小化损失函数。我们通过随机采样转换（状态、动作、奖励）来训练网络。网络的层不仅可以是全连接层，也可以是卷积层等。\n* **双 Q 学习**。双 QL 解决了 Q 学习中的一个重要问题，即估值偏高问题。\n* **优先回放**。在普通的 DQN 中，所有的转换都会被存储在一个回放缓冲区中，并且从该缓冲区中均匀采样。然而，在学习过程中，并非所有转换都同样有益（这也使得学习效率低下，需要更多的回合）。而优先经验回放并不进行均匀采样，而是采用一种分布，对之前迭代中具有较高 Q 损失的样本给予更高的采样概率。\n* **对决网络**。对决网络通过使用两条独立的流（即两个不同的小型神经网络），稍微改变了 Q 学习的架构。一条流用于计算价值，另一条流用于计算 _优势_。这两条流共享一个卷积编码器。难点在于如何将两条流合并——这里使用了一种特殊的聚合器（_Wang 等，2016_）。\n  - _优势_ 的公式为 $A(s, a) = Q(s, a) - V(s)$，通常来说，它是比较某个特定状态下某一动作相对于平均动作的好坏程度。当“错误”动作无法通过负奖励惩罚时，有时会用到优势值。因此，优势值会进一步奖励那些优于平均动作的良好行为。\n* **多步学习**。多步学习的最大区别在于，它使用 N 步回报来计算 Q 值（而不仅仅是下一步的回报），这自然应该更加准确。\n* **分布式强化学习**。Q 学习以平均估计的 Q 值作为目标值。然而，在许多情况下，不同情境下的 Q 值可能并不相同。分布式强化学习可以直接学习（或近似）Q 值的分布，而不是对其进行平均。当然，其中的数学原理要复杂得多，但对我们而言，其好处在于可以更准确地采样 Q 值。\n* **噪声网络**。基础的 DQN 使用简单的 ε-贪心机制来进行探索。但这种探索方式有时效率较低。噪声网络则通过添加一个带有噪声的线性层来解决这一问题。随着时间的推移，网络会学会忽略这些噪声（作为一条额外的噪声流）。不过，空间中不同区域的学习速度各不相同，从而实现对状态的探索。\n\u003Cbr>\u003C\u002Fbr>\n#### **注**：敬请关注——我将于 2019 年 2 月初在 GitHub 上上传 Rainbow 的 MXNet\u002FGluon 实现。\n\u003Cbr>\u003C\u002Fbr>\n\n\n\n#### 5.1.1.2. PPO \u003Ca class=\"anchor\" id=\"rl_ppo\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**近端策略优化**（\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1707.06347.pdf\">PPO\u003C\u002Fa>) 是一种无模型的策略优化型强化学习方法。它比其他算法更容易实现，同时能取得非常好的效果。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;我们为什么要使用 PPO？PPO 的一大优势在于，它直接学习策略，而不是像 Q 学习那样通过价值间接学习策略。它可以在连续动作空间中很好地工作，这非常适合我们的应用场景，并且能够通过均值和标准差学习分布概率（如果在输出端加入 softmax 函数）。\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;策略梯度方法的一个问题是，它们对步长的选择极其敏感：步长太小会导致进展过于缓慢（很可能是因为需要计算二阶导数矩阵）；而步长太大则会产生大量噪声，显著降低性能。由于策略的变化，输入数据是非平稳的（奖励和观测的分布也会随之改变）。与监督学习相比，步长选择不当可能会带来更为严重的后果，因为它会影响后续访问的整体分布。PPO 可以解决这些问题。此外，与其他一些方法相比，PPO：\n* 更加简单，例如与 **ACER** 相比，ACER 需要额外的代码来维护离策略相关性以及回放缓冲区；或者与 **TRPO** 相比，TRPO 对代理目标函数施加了一个约束（旧策略与新策略之间的 KL 散度）。这个约束用于控制策略变化幅度，避免因变化过大而导致不稳定。而 PPO 则通过使用 _夹紧在 [1-𝜖, 1+𝜖] 之间的代理目标函数_，并引入对更新过大的惩罚项，来减少由该约束带来的计算开销。\n* 具有更强的兼容性，能够与那些在价值函数和策略函数之间共享参数，或包含辅助损失的算法配合使用，这一点优于 TRPO（尽管 PPO 同样具备信任域策略优化的优势）。\n\n**注**：为了本次实验的目的，我们不会深入研究和优化 RL 方法，包括 PPO 等算法。相反，我们将直接采用现有的工具，并尝试将其融入到我们针对 GAN、LSTM 和 CNN 模型的超参数优化流程中。我们将复用并定制的代码由 OpenAI 开发，可在 \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fopenai\u002Fbaselines\">此处\u003C\u002Fa>获取。\n\n\n\n### 5.1.2. 强化学习的进一步工作 \u003Ca class=\"anchor\" id=\"reinforcementlearning_further\">\u003C\u002Fa>\n\n关于进一步探索强化学习的一些想法：\n- 我接下来计划介绍的内容之一是使用 **增强随机搜索**（\u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.07055.pdf\">链接\u003C\u002Fa>) 作为替代算法。该算法的作者来自加州大学伯克利分校，他们成功实现了与 PPO 等当前最先进方法相似的奖励效果，但平均速度快了约 15 倍。\n- 奖励函数的选择非常重要。我在上面已经说明了目前使用的奖励函数，但我还会尝试使用不同的奖励函数作为替代方案。\n- 使用 **好奇心驱动的探索策略**。\n- 构建伯克利人工智能研究团队（BAIR）提出的 **多智能体** 架构——\u003Ca href=\"https:\u002F\u002Fbair.berkeley.edu\u002Fblog\u002F2018\u002F12\u002F12\u002Frllib\u002F\">链接\u003C\u002Fa>。\n\n## 5.2. 
贝叶斯优化 \u003Ca class=\"anchor\" id=\"bayesian_opt\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;与耗时较长的网格搜索相比，我们将使用**贝叶斯优化**来寻找最佳超参数组合。我们使用的库已经实现好——\u003Ca href=\"https:\u002F\u002Fgithub.com\u002Ffmfn\u002FBayesianOptimization\">链接\u003C\u002Fa>。\n\n接下来的代码部分仅展示初始化过程。\n\n\n```python\n# 初始化优化器\nfrom bayes_opt import BayesianOptimization\nfrom bayes_opt import UtilityFunction\n\nutility = UtilityFunction(kind=\"ucb\", kappa=2.5, xi=0.0)\n```\n\n### 5.2.1. 高斯过程 \u003Ca class=\"anchor\" id=\"gaussprocess\">\u003C\u002Fa>\n\n\u003Ccenter>\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_2423b9fa7a08.png\" width=600>\u003C\u002Fimg>\u003C\u002Fcenter>\n\n# 6. 结果 \u003Ca class=\"anchor\" id=\"theresult\">\u003C\u002Fa>\n\n\n```python\nfrom utils import plot_prediction\n```\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;最后，我们将比较在流程的不同阶段之后，使用未见过的（测试）数据作为输入时，LSTM 的输出结果。\n\n1. 第一个 epoch 后的绘图。\n\n\n```python\nplot_prediction('预测价格与真实价格 - 第一个 epoch 后。')\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_b8a904c684b2.png)\n\n\n2. 50 个 epoch 后的绘图。\n\n\n```python\nplot_prediction('预测价格与真实价格 - 前 50 个 epoch 后。')\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_72a7c09ed02d.png)\n\n\n\n```python\nplot_prediction('预测价格与真实价格 - 前 200 个 epoch 后。')\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_f95b879b7b75.png)\n\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;强化学习运行了十个回合（我们定义一个回合为在 200 个 epoch 上进行一次完整的 GAN 训练）。\n\n\n```python\nplot_prediction('最终结果。')\n```\n\n\n![png](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_readme_15e585ff6e27.png)\n\n\n#### 下一步，我将尝试把各个部分单独拆解出来，分析哪些方法有效以及原因。我们为何会得到这些结果？这是否只是巧合？敬请期待。\n\n# 下一步是什么？ \u003Ca class=\"anchor\" id=\"whatisnext\">\u003C\u002Fa>\n\n- 接下来，我将尝试构建一个用于测试交易算法的强化学习环境，该算法可以决定何时以及如何进行交易。GAN 的输出将成为该环境中的一项参数。\n\n# 关于我 \u003Ca class=\"anchor\" id=\"me\">\u003C\u002Fa>\n\nwww.linkedin.com\u002Fin\u002Fborisbanushev\n\n# 免责声明 \u003Ca class=\"anchor\" id=\"disclaimer\">\u003C\u002Fa>\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**本笔记本内容仅供参考。其中所包含的任何信息均不构成对任何特定证券、证券组合、交易或投资策略适合特定人士的推荐。期货、股票和期权交易涉及重大亏损风险，并不适合所有投资者。期货、股票和期权的价格可能会波动，因此客户有可能损失超过其初始投资金额。**\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;**所有交易策略均由您自行承担风险。**\n\n&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;在选择数据特征、算法以及调优算法等方面，还有许多细节值得深入探讨。这个版本的笔记本我自己就花了两周时间才完成。我相信整个过程中仍有许多未解之处。因此，如果您有任何意见或建议，请随时分享。我很乐意在现有流程中加入并测试您的想法。\n\n感谢您的阅读。\n\n此致，\n鲍里斯","# stockpredictionai 快速上手指南\n\n## 环境准备\n\n本项目基于深度学习框架 **MXNet (Gluon)** 构建，旨在利用 GAN、LSTM、CNN 及多种金融特征工程预测股价走势。\n\n*   **操作系统**: Linux (推荐 Ubuntu 18.04+), macOS, 或 Windows (需配置 CUDA 环境)\n*   **硬件要求**: \n    *   强烈推荐使用 **NVIDIA GPU** (支持多卡训练)\n    *   显存建议 8GB 以上（因模型包含堆叠自编码器、BERT 及复杂的 GAN 结构）\n*   **软件依赖**:\n    *   Python 3.6+\n    *   MXNet (带 GPU 支持版本)\n    *   GluonTS \u002F GluonCV (相关扩展库)\n    *   Jupyter Notebook \u002F JupyterLab (项目以 Notebook 形式交付)\n    *   其他科学计算库：`pandas`, `numpy`, `scikit-learn`, `xgboost`, `statsmodels`, `transformers` (用于 BERT)\n\n> **国内加速建议**：\n> 建议使用清华源或阿里源安装 Python 依赖，以提升下载速度。\n> ```bash\n> pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple \u003Cpackage_name>\n> ```\n\n## 安装步骤\n\n### 1. 克隆项目\n首先从仓库获取源代码：\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fyour-repo\u002Fstockpredictionai.git\ncd stockpredictionai\n```\n\n### 2. 
创建虚拟环境 (推荐)\n```bash\npython3 -m venv venv\nsource venv\u002Fbin\u002Factivate  # Windows 用户请使用: venv\\Scripts\\activate\n```\n\n### 3. 安装核心依赖\n根据项目需求安装 MXNet 及其他库。若使用 NVIDIA GPU，请安装 `mxnet-cu\u003Cversion>` (例如 `mxnet-cu112`)。\n\n**使用国内镜像源安装：**\n```bash\n# 安装 MXNet (GPU 版本示例，请根据实际 CUDA 版本调整)\npip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple mxnet-cu112\n\n# 安装其他必要依赖\npip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple pandas numpy scikit-learn xgboost statsmodels torch transformers jupyterlab matplotlib seaborn\n```\n\n> **注意**：原文提到使用了 `BERT` 进行情感分析，这通常依赖 Hugging Face `transformers` 库和 `torch`。虽然主框架是 MXNet，但 NLP 部分可能需要 PyTorch 后端支持，请确保同时安装。\n\n### 4. 验证安装\n启动 Jupyter Lab 检查环境：\n```bash\njupyter lab\n```\n在 Notebook 中运行以下代码确认 MXNet 是否识别到 GPU：\n```python\nimport mxnet as mx\nprint(mx.context.num_gpus())\n```\n\n## 基本使用\n\n本项目核心逻辑封装在 Jupyter Notebook 中。请按照以下步骤运行预测流程：\n\n### 1. 启动 Notebook\n在项目根目录下打开主文件（通常为 `stock_prediction.ipynb` 或类似名称）：\n```bash\njupyter notebook stock_prediction.ipynb\n```\n\n### 2. 数据准备\nNotebook 会自动下载高盛 (Goldman Sachs, NYSE: GS) 的历史数据（2010-2018）。\n*   **自定义股票**：若需预测其他股票，请在 Notebook 的 **\"The Data\"** 章节修改股票代码和数据读取路径。\n*   **新闻数据**：情感分析部分需要每日新闻文本。如果是演示运行，代码可能包含模拟数据或需要手动放入新闻数据集到指定文件夹（详见 Notebook 内 `Fundamental analysis` 部分说明）。\n\n### 3. 执行特征工程\n按顺序运行 Notebook 单元格，系统将自动执行以下操作：\n*   计算技术指标 (MA, MACD, Bollinger Bands 等)\n*   利用 **BERT** 提取新闻情感得分\n*   执行傅里叶变换提取趋势\n*   运行 **XGBoost** 进行特征重要性筛选\n*   通过堆叠自编码器 (Stacked Autoencoders) 提取高阶特征\n\n### 4. 训练模型 (GAN + RL)\n这是最耗时的步骤。继续运行至 **\"Generative Adversarial Network - GAN\"** 章节：\n*   模型将使用 **LSTM** 作为生成器，**1D-CNN** 作为判别器。\n*   系统会调用 **贝叶斯优化** 和 **强化学习 (Rainbow\u002FPPO)** 自动调整超参数。\n*   *提示：此过程在多 GPU 环境下显著加速。*\n\n### 5. 查看结果\n运行至 **\"The result\"** 章节，将输出：\n*   预测股价走势与实际走势的对比图\n*   模型评估指标\n*   特征重要性分析图表\n\n### 最小化运行示例 (代码片段)\n若仅需测试数据加载和基础特征生成，可在新的 Python 脚本中运行：\n\n```python\nimport pandas as pd\nimport mxnet as mx\nfrom sklearn.preprocessing import StandardScaler\n\n# 1. 加载示例数据 (假设已下载 CSV)\ndf = pd.read_csv('data\u002FGS_historical.csv')\nprices = df['Close'].values.reshape(-1, 1)\n\n# 2. 数据标准化\nscaler = StandardScaler()\nscaled_prices = scaler.fit_transform(prices)\n\n# 3. 转换为 MXNet NDArray (GPU 模式)\nctx = mx.gpu(0) if mx.context.num_gpus() > 0 else mx.cpu()\ndata_nd = mx.nd.array(scaled_prices, ctx=ctx)\n\nprint(f\"Data loaded on {ctx}. Shape: {data_nd.shape}\")\n# 后续步骤请在 Jupyter Notebook 中完整运行以体验 GAN 训练流程\n```\n\n> **免责声明**：本工具仅供研究和教育目的使用。股市预测具有极高风险，模型结果不构成任何投资建议。请务必阅读原项目中的 Disclaimer 部分。","某量化对冲基金的分析团队正试图构建一个能融合历史行情、新闻舆情及宏观趋势的高精度股价预测模型，以辅助短线交易决策。\n\n### 没有 stockpredictionai 时\n- 模型架构单一，仅依赖 LSTM 处理时间序列，难以捕捉数据中复杂的非线性特征和潜在分布模式。\n- 超参数调整依靠人工经验试错，耗时数周仍难以找到最优解，导致模型收敛慢且容易陷入局部最优。\n- 数据来源局限在传统技术指标，无法有效整合 BERT 情感分析、傅里叶变换趋势提取等多模态信息，预测视角片面。\n- 缺乏对抗训练机制，生成的预测路径过于平滑，无法模拟真实市场中剧烈的波动和异常情况。\n\n### 使用 stockpredictionai 后\n- 构建了基于 LSTM 生成器与 CNN 判别器的 GAN 架构，不仅能预测点位，还能生成逼真的股价波动分布，显著提升泛化能力。\n- 引入贝叶斯优化结合强化学习（Rainbow\u002FPPO 算法）自动动态调整超参数，将调优周期从数周缩短至数天，并确保持续探索最优策略。\n- 一站式集成多源数据管道，自动融合 BERT 舆情情绪、堆叠自编码器高阶特征及 ARIMA 近似函数，全方位捕捉市场信号。\n- 利用特征重要性分析和统计检验自动清洗数据，有效识别相关性资产与异常值，大幅降低了过拟合风险。\n\nstockpredictionai 通过融合前沿的生成式对抗网络与自动化超参数调优技术，将多维异构数据转化为高鲁棒性的交易信号，彻底改变了传统量化建模的效率与精度瓶颈。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fborisbanushev_stockpredictionai_7877aa06.png","borisbanushev","Boris Banushev","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fborisbanushev_53405506.jpg","AI Quant and Data Scientist in Finance. Data Science MSc @ UC Berkeley. Economics PhD candidate @ Oxford. MSc Finance @ MBS. 
Yes, that's me playing ice hockey",null,"Singapore","https:\u002F\u002Fgithub.com\u002Fborisbanushev",[84,88,92],{"name":85,"color":86,"percentage":87},"JavaScript","#f1e05a",58,{"name":89,"color":90,"percentage":91},"CSS","#663399",39.2,{"name":93,"color":94,"percentage":95},"HTML","#e34c26",2.8,5517,1879,"2026-04-08T18:15:26",4,"未说明","需要多块 GPU 进行训练 (MXNet\u002FGluon)，具体型号、显存大小及 CUDA 版本未说明",{"notes":103,"python":100,"dependencies":104},"该项目基于 2019 年的 Notebook，核心框架为 MXNet 及其高层 API Gluon（非目前主流的 PyTorch 或 TensorFlow）。项目涉及复杂的混合架构，包括生成对抗网络 (GAN)、LSTM、CNN、强化学习 (Rainbow, PPO)、贝叶斯优化以及 BERT 情感分析等。由于原文未提供具体的安装脚本或版本锁定文件，且依赖的某些强化学习算法实现可能较为陈旧，实际部署可能需要根据当时的代码库自行配置环境。",[105,106,107,108,109,110],"mxnet","gluon","BERT (用于 NLP)","XGBoost","Rainbow (RL 算法)","PPO (RL 算法)",[18],"2026-03-27T02:49:30.150509","2026-04-09T10:07:26.679885",[],[]]