[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-robertmartin8--MachineLearningStocks":3,"similar-robertmartin8--MachineLearningStocks":103},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":9,"readme_en":10,"readme_zh":11,"quickstart_zh":12,"use_case_zh":13,"hero_image_url":14,"owner_login":15,"owner_name":16,"owner_avatar_url":17,"owner_bio":18,"owner_company":19,"owner_location":20,"owner_email":19,"owner_twitter":19,"owner_website":21,"owner_url":22,"languages":23,"stars":28,"forks":29,"last_commit_at":30,"license":31,"difficulty_score":32,"env_os":33,"env_gpu":34,"env_ram":34,"env_deps":35,"category_tags":43,"github_topics":47,"view_count":62,"oss_zip_url":19,"oss_zip_packed_at":19,"status":63,"created_at":64,"updated_at":65,"faqs":66,"releases":102},934,"robertmartin8\u002FMachineLearningStocks","MachineLearningStocks","Using python and scikit-learn to make stock predictions","MachineLearningStocks 是一个基于 Python 和 scikit-learn 的股票预测入门项目，旨在帮助用户理解如何用机器学习分析股票市场。它通过整合历史股价数据与基本面指标（如市盈率、负债率、流通股等），训练分类模型来预测哪些股票未来可能跑赢大盘，并包含简单的回测验证流程。\n\n这个项目主要解决了量化投资新手\"不知从何入手\"的问题——它将数据清洗、特征工程、模型训练到回测预测的完整链路打包成一个可运行的模板，让学习者能快速建立对机器学习金融应用的直观认知。虽然作者明确提醒代码本身不适合直接实盘交易，但提供了清晰的扩展方向，比如改进数据获取方式、尝试更复杂的模型或优化特征组合。\n\n适合人群包括：有一定 Python 基础、想入门量化金融的开发者；需要教学案例的机器学习课程师生；以及对算法交易感兴趣、希望从规范代码中学习最佳实践的研究人员。项目的技术亮点在于其\"高度可扩展\"的架构设计——采用 pandas 进行数据处理，scikit-learn 统一接口支持快速切换算法，模块化结构便于替换数据源或加入新的特征工程步骤。\n\n需要注意的是，项目已于 2021 年停止维护","MachineLearningStocks 是一个基于 Python 和 scikit-learn 的股票预测入门项目，旨在帮助用户理解如何用机器学习分析股票市场。它通过整合历史股价数据与基本面指标（如市盈率、负债率、流通股等），训练分类模型来预测哪些股票未来可能跑赢大盘，并包含简单的回测验证流程。\n\n这个项目主要解决了量化投资新手\"不知从何入手\"的问题——它将数据清洗、特征工程、模型训练到回测预测的完整链路打包成一个可运行的模板，让学习者能快速建立对机器学习金融应用的直观认知。虽然作者明确提醒代码本身不适合直接实盘交易，但提供了清晰的扩展方向，比如改进数据获取方式、尝试更复杂的模型或优化特征组合。\n\n适合人群包括：有一定 Python 基础、想入门量化金融的开发者；需要教学案例的机器学习课程师生；以及对算法交易感兴趣、希望从规范代码中学习最佳实践的研究人员。项目的技术亮点在于其\"高度可扩展\"的架构设计——采用 pandas 进行数据处理，scikit-learn 统一接口支持快速切换算法，模块化结构便于替换数据源或加入新的特征工程步骤。\n\n需要注意的是，项目已于 2021 年停止维护，部分依赖可能需要手动更新。作者还推荐了其姊妹项目 
PyPortfolioOpt，用于解决\"选出好股票后如何优化组合配置\"的问题，两者结合可能获得更稳健的风险调整后收益。","# MachineLearningStocks in python: a starter project and guide\n\n[![forthebadge made-with-python](https:\u002F\u002FForTheBadge.com\u002Fimages\u002Fbadges\u002Fmade-with-python.svg)](https:\u002F\u002Fwww.python.org\u002F)\n\n[![GitHub license](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-brightgreen.svg?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fsurelyourejoking\u002FMachineLearningStocks\u002Fblob\u002Fmaster\u002FLICENSE.txt)\n\n*EDIT as of Feb 2021: MachineLearningStocks is no longer actively maintained*\n\nMachineLearningStocks is designed to be an **intuitive** and **highly extensible** template project applying machine learning to making stock predictions. My hope is that this project will help you understand the overall workflow of using machine learning to predict stock movements and also appreciate some of its subtleties. And of course, after following this guide and playing around with the project, you should definitely **make your own improvements** – if you're struggling to think of what to do, at the end of this readme I've included a long list of possibilities: take your pick.\n\nConcretely, we will be cleaning and preparing a dataset of historical stock prices and fundamentals using `pandas`, after which we will apply a `scikit-learn` classifier to discover the relationship between stock fundamentals (e.g PE ratio, debt\u002Fequity, float, etc) and the subsequent annual price change (compared with an index). 
We then conduct a simple backtest, before generating predictions on current data.\n\nWhile I would not live trade based on the predictions from this exact code, I do believe that you can use this project as a starting point for a profitable trading system – I have actually used code based on this project to live trade, with pretty decent results (around 20% returns on backtest and 10-15% on live trading).\n\nThis project has quite a lot of personal significance for me. It was my first proper python project, one of my first real encounters with ML, and the first time I used git. At the start, my code was rife with bad practice and inefficiency: I have since tried to amend most of this, but please be warned that some minor issues may remain (feel free to raise an issue, or fork and submit a PR). Both the project and myself as a programmer have evolved a lot since the first iteration, but there is always room to improve.\n\n*As a disclaimer, this is a purely educational project. Be aware that backtested performance may often be deceptive – trade at your own risk!*\n\n*MachineLearningStocks predicts which stocks will outperform. But it does not suggest how best to combine them into a portfolio. I have just released [PyPortfolioOpt](https:\u002F\u002Fgithub.com\u002Frobertmartin8\u002FPyPortfolioOpt), a portfolio optimisation library which uses\nclassical efficient frontier techniques (with modern improvements) in order to generate risk-efficient portfolios. 
Generating optimal allocations from the predicted outperformers might be a great way to improve risk-adjusted returns.*\n\n*This guide has been cross-posted at my academic blog, [reasonabledeviations.com](https:\u002F\u002Freasonabledeviations.com\u002F)*\n\n## Contents\n\n- [Contents](#contents)\n- [Overview](#overview)\n  - [EDIT as of 24\u002F5\u002F18](#edit-as-of-24518)\n  - [EDIT as of October 2019](#edit-as-of-october-2019)\n- [Quickstart](#quickstart)\n- [Preliminaries](#preliminaries)\n- [Historical data](#historical-data)\n  - [Historical stock fundamentals](#historical-stock-fundamentals)\n  - [Historical price data](#historical-price-data)\n- [Creating the training dataset](#creating-the-training-dataset)\n  - [Preprocessing historical price data](#preprocessing-historical-price-data)\n  - [Features](#features)\n    - [Valuation measures](#valuation-measures)\n    - [Financials](#financials)\n    - [Trading information](#trading-information)\n  - [Parsing](#parsing)\n- [Backtesting](#backtesting)\n- [Current fundamental data](#current-fundamental-data)\n- [Stock prediction](#stock-prediction)\n- [Unit testing](#unit-testing)\n- [Where to go from here](#where-to-go-from-here)\n  - [Data acquisition](#data-acquisition)\n  - [Data preprocessing](#data-preprocessing)\n  - [Machine learning](#machine-learning)\n- [Contributing](#contributing)\n\n## Overview\n\nThe overall workflow to use machine learning to make stock predictions is as follows:\n\n1. Acquire historical fundamental data – these are the *features* or *predictors*\n2. Acquire historical stock price data – this will make up the dependent variable, or label (what we are trying to predict).\n3. Preprocess data\n4. Use a machine learning model to learn from the data\n5. Backtest the performance of the machine learning model\n6. Acquire current fundamental data\n7. 
Generate predictions from current fundamental data\n\nThis is a very generalised overview, but in principle this is all you need to build a fundamentals-based ML stock predictor.\n\n### EDIT as of 24\u002F5\u002F18\n\nThis project uses pandas-datareader to download historical price data from Yahoo Finance. However, in the past few weeks this has become extremely inconsistent – it seems like Yahoo have added some measures to prevent the bulk download of their data. I will try to add a fix, but for now, take note that `download_historical_prices.py` may be deprecated.\n\nAs a temporary solution, I've uploaded `stock_prices.csv` and `sp500_index.csv`, so the rest of the project can still function.\n\n### EDIT as of October 2019\n\nI expect that after so much time there will be many data issues. To that end, I have decided to upload the other CSV files: `keystats.csv` (the output of `parsing_keystats.py`) and `forward_sample.csv` (the output of `current_data.py`).\n\n## Quickstart\n\nIf you want to throw away the instruction manual and play immediately, clone this project, then download and unzip the [data file](https:\u002F\u002Fpythonprogramming.net\u002Fdata-acquisition-machine-learning\u002F) into the same directory. Then, open an instance of terminal and cd to the project's file path, e.g\n\n```bash\ncd Users\u002FUser\u002FDesktop\u002FMachineLearningStocks\n```\n\nThen, run the following in terminal:\n\n```bash\npip install -r requirements.txt\npython download_historical_prices.py\npython parsing_keystats.py\npython backtesting.py\npython current_data.py\npytest -v\npython stock_prediction.py\n```\n\nOtherwise, follow the step-by-step guide below.\n\n## Preliminaries\n\nThis project uses python 3.6, and the common data science libraries `pandas` and `scikit-learn`. If you are on python 3.x less than 3.6, you will find some syntax errors wherever f-strings have been used for string formatting. 
These are fortunately very easy to fix (just rebuild the string using your preferred method), but I do encourage you to upgrade to 3.6 to enjoy the elegance of f-strings. A full list of requirements is included in the `requirements.txt` file. To install all of the requirements at once, run the following code in terminal:\n\n```bash\npip install -r requirements.txt\n```\n\nTo get started, clone this project and unzip it. This folder will become our working directory, so make sure you `cd` your terminal instance into this directory.\n\n## Historical data\n\nData acquisition and preprocessing is probably the hardest part of most machine learning projects. But it is a necessary evil, so it's best to not fret and just carry on.\n\nFor this project, we need three datasets:\n\n1. Historical stock fundamentals\n2. Historical stock prices\n3. Historical S&P500 prices\n\nWe need the S&P500 index prices as a benchmark: a 5% stock growth does not mean much if the S&P500 grew 10% in that time period, so all stock returns must be compared to those of the index.\n\n### Historical stock fundamentals\n\nHistorical fundamental data is actually very difficult to find (for free, at least). Although sites like [Quandl](https:\u002F\u002Fwww.quandl.com\u002F) do have datasets available, you often have to pay a pretty steep fee.\n\nIt turns out that there is a way to parse this data, for free, from [Yahoo Finance](https:\u002F\u002Ffinance.yahoo.com\u002F). I will not go into details, because [Sentdex has done it for us](https:\u002F\u002Fpythonprogramming.net\u002Fdata-acquisition-machine-learning\u002F). On his page you will be able to find a file called `intraQuarter.zip`, which you should download, unzip, and place in your working directory. Relevant to this project is the subfolder called `_KeyStats`, which contains html files that hold stock fundamentals for all stocks in the S&P500 between 2003 and 2013, sorted by stock. 
However, at this stage, the data is unusable – we will have to parse it into a nice csv file before we can do any ML.\n\n### Historical price data\n\nIn the first iteration of the project, I used `pandas-datareader`, an extremely convenient library which can load stock data straight into `pandas`. However, after Yahoo Finance changed their UI, `datareader` no longer worked, so I switched to [Quandl](https:\u002F\u002Fwww.quandl.com\u002F), which has free stock price data for a few tickers, and a python API. However, as `pandas-datareader` has been [fixed](https:\u002F\u002Fgithub.com\u002Franaroussi\u002Ffix-yahoo-finance), we will use that instead.\n\nLikewise, we can easily use `pandas-datareader` to access data for the SPY ticker. Failing that, one could manually download it from [yahoo finance](https:\u002F\u002Ffinance.yahoo.com\u002Fquote\u002F%5EGSPC\u002Fhistory?p=%5EGSPC), place it into the project directory and rename it `sp500_index.csv`.\n\nThe code for downloading historical price data can be run by entering the following into terminal:\n\n```bash\npython download_historical_prices.py\n```\n\n## Creating the training dataset\n\nOur ultimate goal for the training data is to have a 'snapshot' of a particular stock's fundamentals at a particular time, and the corresponding subsequent annual performance of the stock.\n\nFor example, if our 'snapshot' consists of all of the fundamental data for AAPL on the date 28\u002F1\u002F2005, then we also need to know the percentage price change of AAPL between 28\u002F1\u002F05 and 28\u002F1\u002F06. Thus our algorithm can learn how the fundamentals impact the annual change in the stock price.\n\nIn fact, this is a slight oversimplification: what the algorithm will eventually learn is how fundamentals impact the *outperformance of a stock relative to the S&P500 index*. 
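The label construction just described can be sketched in a few lines of pandas (a toy illustration with made-up numbers – the variable names and the 10% threshold echo the project's `OUTPERFORMANCE` parameter, but this is an assumption-laden sketch, not the project's actual code):

```python
import pandas as pd

# Toy one-year closes for a stock and for the index (illustrative values only)
dates = pd.to_datetime(["2005-01-28", "2006-01-28"])
stock = pd.Series([10.0, 13.0], index=dates)         # e.g. AAPL closes
index_px = pd.Series([1000.0, 1100.0], index=dates)  # e.g. S&P500 closes

stock_return = stock.iloc[-1] / stock.iloc[0] - 1        # +30% over the year
index_return = index_px.iloc[-1] / index_px.iloc[0] - 1  # +10% over the year
outperformance = stock_return - index_return             # ~20 percentage points

OUTPERFORMANCE = 0.10  # stock must beat the index by 10% to count as a 'buy'
label = int(outperformance > OUTPERFORMANCE)
print(label)  # prints 1 – this snapshot would be labelled an outperformer
```
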
This is why we also need index data.\n\n### Preprocessing historical price data\n\nWhen `pandas-datareader` downloads stock price data, it does not include rows for weekends and public holidays (when the market is closed).\n\nHowever, referring to the example of AAPL above, if our snapshot includes fundamental data for 28\u002F1\u002F05 and we want to see the change in price a year later, we will get the nasty surprise that 28\u002F1\u002F2006 is a Saturday. Does this mean that we have to discard this snapshot?\n\nBy no means – data is too valuable to callously toss away. As a workaround, I instead decided to 'fill forward' the missing data, i.e we will assume that the stock price on Saturday 28\u002F1\u002F2006 is equal to the stock price on Friday 27\u002F1\u002F2006.\n\n### Features\n\nBelow is a list of some of the interesting variables that are available on Yahoo Finance.\n\n#### Valuation measures\n\n- 'Market Cap'\n- Enterprise Value\n- Trailing P\u002FE\n- Forward P\u002FE\n- PEG Ratio\n- Price\u002FSales\n- Price\u002FBook\n- Enterprise Value\u002FRevenue\n- Enterprise Value\u002FEBITDA\n\n#### Financials\n\n- Profit Margin\n- Operating Margin\n- Return on Assets\n- Return on Equity\n- Revenue\n- Revenue Per Share\n- Quarterly Revenue Growth\n- Gross Profit\n- EBITDA\n- Net Income Avi to Common\n- Diluted EPS\n- Quarterly Earnings Growth\n- Total Cash\n- Total Cash Per Share\n- Total Debt\n- Total Debt\u002FEquity\n- Current Ratio\n- Book Value Per Share\n- Operating Cash Flow\n- Levered Free Cash Flow\n\n#### Trading information\n\n- Beta\n- 50-Day Moving Average\n- 200-Day Moving Average\n- Avg Vol (3 month)\n- Shares Outstanding\n- Float\n- % Held by Insiders\n- % Held by Institutions\n- Shares Short\n- Short Ratio\n- Short % of Float\n- Shares Short (prior month)\n\n### Parsing\n\nHowever, all of this data is locked up in HTML files. Thus, we need to build a parser. 
In this project, I did the parsing with regex, but please note that generally it is [really not recommended](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F1732348\u002Fregex-match-open-tags-except-xhtml-self-contained-tags) to use regex to parse HTML. However, I think regex probably wins out for ease of understanding (this project being educational in nature), and from experience regex works fine in this case.\n\nThis is the exact regex used:\n\n```python\nr'>' + re.escape(variable) + r'.*?(\\-?\\d+\\.*\\d*K?M?B?|N\u002FA[\\\\n|\\s]*|>0|NaN)%?(\u003C\u002Ftd>|\u003C\u002Fspan>)'\n```\n\nWhile it looks pretty arcane, all it is doing is searching for the first occurrence of the feature (e.g \"Market Cap\"), then it looks forward until it finds a number immediately followed by a `\u003C\u002Ftd>` or `\u003C\u002Fspan>` (signifying the end of a table entry). The complexity of the expression above accounts for some subtleties in the parsing:\n\n- the numbers could be preceded by a minus sign\n- Yahoo Finance sometimes uses K, M, and B as abbreviations for thousand, million and billion respectively.\n- some data are given as percentages\n- some datapoints are missing, so instead of a number we have to look for \"N\u002FA\" or \"NaN\".\n\nBoth the preprocessing of price data and the parsing of keystats are included in `parsing_keystats.py`. Run the following in your terminal:\n\n```bash\npython parsing_keystats.py\n```\n\nYou should see the file `keystats.csv` appear in your working directory. Now that we have the training data ready, we are ready to actually do some machine learning.\n\n## Backtesting\n\nBacktesting is arguably the most important part of any quantitative strategy: you must have some way of testing the performance of your algorithm before you live trade it.\n\nDespite its importance, I originally did not want to include backtesting code in this repository. The reasons were as follows:\n\n- Backtesting is messy and empirical. 
The code is not very pleasant to use, and in practice requires a lot of manual interaction.\n- Backtesting is very difficult to get right, and if you do it wrong, you will be deceiving yourself with high returns.\n- Developing and working with your backtest is probably the best way to learn about machine learning and stocks – you'll see what works, what doesn't, and what you don't understand.\n\nNevertheless, because of the importance of backtesting, I decided that I can't really call this a 'template machine learning stocks project' without backtesting. Thus, I have included a simplistic backtesting script. Please note that there is a fatal flaw with this backtesting implementation that will result in *much* higher backtesting returns. It is quite a subtle point, but I will let you figure that out :)\n\nRun the following in terminal:\n\n```bash\npython backtesting.py\n```\n\nYou should get something like this:\n\n```txt\nClassifier performance\n======================\nAccuracy score:  0.81\nPrecision score:  0.75\n\nStock prediction performance report\n===================================\nTotal Trades: 177\nAverage return for stock predictions:  37.8 %\nAverage market return in the same period:  9.2%\nCompared to the index, our strategy earns  28.6 percentage points more\n```\n\nAgain, the performance looks too good to be true and almost certainly is.\n\n## Current fundamental data\n\nNow that we have trained and backtested a model on our data, we would like to generate actual predictions on current data.\n\nAs always, we can scrape the data from good old Yahoo Finance. My method is to literally just download the statistics page for each stock (here is the [page](https:\u002F\u002Ffinance.yahoo.com\u002Fquote\u002FAAPL\u002Fkey-statistics?p=AAPL) for Apple), then to parse it using regex as before.\n\nIn fact, the regex should be almost identical, but because Yahoo has changed their UI a couple of times, there are some minor differences. 
This part of the project has to be fixed whenever Yahoo Finance changes their UI, so if you can't get the project to work, the problem is most likely here.\n\nRun the following in terminal:\n\n```bash\npython current_data.py\n```\n\nThe script will then begin downloading the HTML into the `forward\u002F` folder within your working directory, before parsing this data and outputting the file `forward_sample.csv`. You might see a few miscellaneous errors for certain tickers (e.g 'Exceeded 30 redirects.'), but this is to be expected.\n\n## Stock prediction\n\nNow that we have the training data and the current data, we can finally generate actual predictions. This part of the project is very simple: the only thing you have to decide is the value of the `OUTPERFORMANCE` parameter (the percentage by which a stock has to beat the S&P500 to be considered a 'buy'). I have set it to 10 by default, but it can easily be modified by changing the variable at the top of the file. Go ahead and run the script:\n\n```bash\npython stock_prediction.py\n```\n\nYou should get something like this:\n\n```txt\n21 stocks predicted to outperform the S&P500 by more than 10%:\nNOC FL SWK NFX LH NSC SCHL KSU DDS GWW AIZ ORLY R SFLY SHW GME DLX DIS AMP BBBY APD\n```\n\n## Unit testing\n\nI have included a number of unit tests (in the `tests\u002F` folder) which serve to check that things are working properly. However, due to the nature of some of this project's functionality (downloading big datasets), you will have to run all the code once before running the tests. 
Otherwise, the tests themselves would have to download huge datasets (which I don't think is optimal).\n\nI thus recommend that you run the tests after you have run all the other scripts (except, perhaps, `stock_prediction.py`).\n\nTo run the tests, simply enter the following into a terminal instance in the project directory:\n\n```bash\npytest -v\n```\n\nPlease note that it is not considered best practice to include an `__init__.py` file in the `tests\u002F` directory (see [here](https:\u002F\u002Fdocs.pytest.org\u002Fen\u002Flatest\u002Fgoodpractices.html) for more), but I have done it anyway because it is uncomplicated and functional.\n\n## Where to go from here\n\nI have stated that this project is extensible, so here are some ideas to get you started and possibly increase returns (no promises).\n\n### Data acquisition\n\nMy personal belief is that better quality data is THE factor that will ultimately determine your performance. Here are some ideas:\n\n- Explore the other subfolders in Sentdex's `intraQuarter.zip`.\n- Parse the annual reports that all companies submit to the SEC (have a look at the [Edgar Database](https:\u002F\u002Fwww.sec.gov\u002Fedgar\u002Fsearchedgar\u002Fcompanysearch.html))\n- Try to find websites from which you can scrape fundamental data (this has been my solution).\n- Ditch US stocks and go global – perhaps better results may be found in markets that are less liquid. It'd be interesting to see whether the predictive power of features varies based on geography.\n- Buy Quandl data, or experiment with alternative data.\n\n### Data preprocessing\n\n- Build a more robust parser using BeautifulSoup\n- In this project, I have just ignored any rows with missing data, but this reduces the size of the dataset considerably. 
Are there any ways you can fill in some of this data?\n  - hint: if the PE ratio is missing but you know the stock price and the earnings\u002Fshare...\n  - hint 2: how different is Apple's book value in March to its book value in June?\n- Some form of feature engineering\n  - e.g, calculate [Graham's number](https:\u002F\u002Fwww.investopedia.com\u002Fterms\u002Fg\u002Fgraham-number.asp) and use it as a feature\n  - some of the features are probably redundant. Why not remove them to speed up training?\n- Speed up the construction of `keystats.csv`.\n  - hint: don't keep appending to one growing dataframe! Split it into chunks\n\n### Machine learning\n\nAltering the machine learning stuff is probably the easiest and most fun to do.\n\n- The most important thing if you're serious about results is to find the problem with the current backtesting setup and fix it. This will likely be quite a sobering experience, but if your backtest is done right, it should mean that any observed outperformance on your test set can be traded on (again, do so at your own discretion).\n- Try a different classifier – there is plenty of research that advocates the use of SVMs, for example. Don't forget that other classifiers may require feature scaling etc.\n- Hyperparameter tuning: use gridsearch to find the optimal hyperparameters for your classifier. But make sure you don't overfit!\n- Make it *deep* – experiment with neural networks (an easy way to start is with `sklearn.neural_network`).\n- Change the classification problem into a regression one: will we achieve better results if we try to predict the stock *return* rather than whether it outperformed?\n- Run the prediction multiple times (perhaps using different hyperparameters?) and select the *k* most common stocks to invest in. 
This is especially important if the algorithm is not deterministic (as is the case for Random Forest)\n- Experiment with different values of the `outperformance` parameter.\n- Should we really be trying to predict raw returns? What happens if a stock achieves a 20% return but does so by being highly volatile?\n- Try to plot the importance of different features to 'see what the machine sees'.\n\n## Contributing\n\nFeel free to fork, play around, and submit PRs. I would be very grateful for any bug fixes or more unit tests.\n\nThis project was originally based on Sentdex's excellent [machine learning tutorial](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLQVvvaa0QuDd0flgGphKCej-9jp-QdzZ3), but it has since evolved far beyond that and the code is almost completely different. The complete series is also on [his website](https:\u002F\u002Fpythonprogramming.net\u002Fmachine-learning-python-sklearn-intro\u002F).\n\n---\n\nFor more content like this, check out my academic blog at [reasonabledeviations.com\u002F](https:\u002F\u002Freasonabledeviations.com).\n","# MachineLearningStocks Python 版：入门项目与指南\n\n[![forthebadge made-with-python](https:\u002F\u002FForTheBadge.com\u002Fimages\u002Fbadges\u002Fmade-with-python.svg)](https:\u002F\u002Fwww.python.org\u002F)\n\n[![GitHub license](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-brightgreen.svg?style=flat-square)](https:\u002F\u002Fgithub.com\u002Fsurelyourejoking\u002FMachineLearningStocks\u002Fblob\u002Fmaster\u002FLICENSE.txt)\n\n*2021年2月更新：MachineLearningStocks 已不再积极维护*\n\nMachineLearningStocks 旨在成为一个**直观**且**高度可扩展**的模板项目，用于将机器学习应用于股票预测。我希望这个项目能帮助你理解使用机器学习预测股价走势的整体工作流程，并体会其中的一些微妙之处。当然，在按照本指南操作并尝试该项目后，你绝对应该**做出自己的改进**——如果你苦于不知道该做什么，在本 README 末尾我附上了一份长长的可能性清单：任君挑选。\n\n具体而言，我们将使用 `pandas` 清理和准备历史股价与基本面数据集，然后应用 `scikit-learn` 分类器来发现股票基本面（如市盈率 PE ratio、负债权益比 debt\u002Fequity、流通股 float 
等）与后续年度价格变化（相对于指数）之间的关系。接着我们进行简单的回测，最后对当前数据生成预测。\n\n虽然我不会基于这段代码的预测进行实盘交易，但我确实相信你可以将这个项目作为盈利交易系统的起点——我实际上已经使用基于该项目的代码进行实盘交易，取得了相当不错的结果（回测回报约 20%，实盘交易 10-15%）。\n\n这个项目对我个人意义重大。这是我的第一个正式 Python 项目，是我第一次真正接触机器学习（ML, Machine Learning），也是我第一次使用 git。一开始，我的代码充斥着不良实践和低效之处：此后我已尝试修正大部分问题，但请注意可能仍存在一些小问题（欢迎提出 issue，或 fork 后提交 PR）。自从第一版以来，这个项目和我作为程序员都有了很大的成长，但总有改进的空间。\n\n*免责声明：这纯粹是一个教育性项目。请注意，回测表现往往具有欺骗性——交易风险自负！*\n\n*MachineLearningStocks 预测哪些股票将跑赢大盘，但它并不建议如何最好地将它们组合成投资组合。我刚刚发布了 [PyPortfolioOpt](https:\u002F\u002Fgithub.com\u002Frobertmartin8\u002FPyPortfolioOpt)，这是一个投资组合优化库，使用经典的有效前沿技术（efficient frontier，并带有现代改进）来生成风险高效的投资组合。从预测跑赢的股票中生成最优配置可能是提高风险调整后回报的好方法。*\n\n*本指南已交叉发布于我的学术博客 [reasonabledeviations.com](https:\u002F\u002Freasonabledeviations.com\u002F)*\n\n## 目录\n\n- [目录](#目录)\n- [概述](#概述)\n  - [2018年5月24日更新](#2018年5月24日更新)\n  - [2019年10月更新](#2019年10月更新)\n- [快速开始](#快速开始)\n- [准备工作](#准备工作)\n- [历史数据](#历史数据)\n  - [历史股票基本面数据](#历史股票基本面数据)\n  - [历史价格数据](#历史价格数据)\n- [创建训练数据集](#创建训练数据集)\n  - [预处理历史价格数据](#预处理历史价格数据)\n  - [特征](#特征)\n    - [估值指标](#估值指标)\n    - [财务数据](#财务数据)\n    - [交易信息](#交易信息)\n  - [解析](#解析)\n- [回测](#回测)\n- [当前基本面数据](#当前基本面数据)\n- [股票预测](#股票预测)\n- [单元测试](#单元测试)\n- [后续方向](#后续方向)\n  - [数据获取](#数据获取)\n  - [数据预处理](#数据预处理)\n  - [机器学习](#机器学习)\n- [贡献](#贡献)\n\n## 概述\n\n使用机器学习进行股票预测的整体工作流程如下：\n\n1. 获取历史基本面数据——这些是*特征*（features）或*预测变量*（predictors）\n2. 获取历史股价数据——这将构成因变量或因变量标签（label，即我们试图预测的目标）\n3. 预处理数据\n4. 使用机器学习模型从数据中学习\n5. 回测机器学习模型的表现\n6. 获取当前基本面数据\n7. 
基于当前基本面数据生成预测\n\n这是一个非常概括的概述，但原则上这就是构建基于基本面的机器学习股票预测器所需的全部内容。\n\n### 2018年5月24日更新\n\n本项目使用 pandas-datareader 从 Yahoo Finance 下载历史价格数据。然而，在过去几周里，这变得极不稳定——似乎 Yahoo 添加了一些措施来阻止批量下载其数据。我会尝试添加修复方案，但目前请注意 `download_historical_prices.py` 可能已弃用。\n\n作为临时解决方案，我已上传 `stock_prices.csv` 和 `sp500_index.csv`，以便项目的其余部分仍能正常运行。\n\n### 2019年10月更新\n\n我预计经过这么长时间后会出现许多数据问题。为此，我决定上传其他 CSV 文件：`keystats.csv`（`parsing_keystats.py` 的输出）和 `forward_sample.csv`（`current_data.py` 的输出）。\n\n## 快速开始\n\n如果你想扔掉说明书立即上手，请克隆本项目，然后下载并解压 [数据文件](https:\u002F\u002Fpythonprogramming.net\u002Fdata-acquisition-machine-learning\u002F) 到同一目录。接着，打开终端实例并 cd 到项目的文件路径，例如：\n\n```bash\ncd Users\u002FUser\u002FDesktop\u002FMachineLearningStocks\n```\n\n然后，在终端中运行以下命令：\n\n```bash\npip install -r requirements.txt\npython download_historical_prices.py\npython parsing_keystats.py\npython backtesting.py\npython current_data.py\npytest -v\npython stock_prediction.py\n```\n\n否则，请按照下面的分步指南操作。\n\n## 准备工作\n\n本项目使用 Python 3.6，以及常见的数据科学库 `pandas` 和 `scikit-learn`。如果你使用的是低于 3.6 的 Python 3.x 版本，在使用 f-strings（格式化字符串字面量）进行字符串格式化的任何地方都会遇到语法错误。幸运的是，这些很容易修复（只需使用你喜欢的方法重建字符串），但我确实鼓励你升级到 3.6，以享受 f-strings 的优雅。完整的依赖列表包含在 `requirements.txt` 文件中。要一次性安装所有依赖，请在终端中运行以下代码：\n\n```bash\npip install -r requirements.txt\n```\n\n首先，克隆本项目并解压。该文件夹将成为我们的工作目录，因此请确保将终端实例 `cd` 到此目录中。\n\n## 历史数据\n\n数据获取和预处理可能是大多数机器学习项目中最困难的部分。但这是必要的，所以最好不要烦恼，继续前进。\n\n对于这个项目，我们需要三个数据集：\n\n1. 历史股票基本面数据（Historical stock fundamentals）\n2. 历史股票价格数据（Historical stock prices）\n3. 
历史标普500指数价格（Historical S&P500 prices）\n\n我们需要标普500指数价格作为基准：如果标普500在同一时期增长了10%，那么5%的股票增长就没有太大意义，因此所有股票收益都必须与该指数进行比较。\n\n### 历史股票基本面数据\n\n历史基本面数据实际上很难找到（至少免费的很难找到）。虽然像 [Quandl](https:\u002F\u002Fwww.quandl.com\u002F) 这样的网站确实有数据集可用，但你通常需要支付相当高的费用。\n\n事实证明，有一种方法可以从 [Yahoo Finance](https:\u002F\u002Ffinance.yahoo.com\u002F) 免费解析这些数据。我不会详细介绍，因为 [Sentdex 已经为我们完成了这项工作](https:\u002F\u002Fpythonprogramming.net\u002Fdata-acquisition-machine-learning\u002F)。在他的页面上，你可以找到一个名为 `intraQuarter.zip` 的文件，你应该下载、解压，并将其放在你的工作目录中。与该项目相关的是名为 `_KeyStats` 的子文件夹，其中包含 HTML 文件，保存了2003年至2013年间标普500所有股票的基本面数据，按股票分类。然而，在这个阶段，数据还无法使用——我们必须将其解析为格式良好的 CSV 文件，然后才能进行任何机器学习（ML, Machine Learning）。\n\n### 历史价格数据\n\n在项目的第一次迭代中，我使用了 `pandas-datareader`，这是一个极其方便的库，可以将股票数据直接加载到 `pandas` 中。然而，在雅虎财经更改其用户界面后，`datareader` 不再工作，所以我转而使用 [Quandl](https:\u002F\u002Fwww.quandl.com\u002F)，它有一些股票代码的免费股票价格数据，以及 Python API。但是，由于 `pandas-datareader` 已经[修复](https:\u002F\u002Fgithub.com\u002Franaroussi\u002Ffix-yahoo-finance)，我们将改用这个。\n\n同样，我们可以轻松使用 `pandas-datareader` 访问 SPY 股票代码的数据。如果不行，可以手动从 [yahoo finance](https:\u002F\u002Ffinance.yahoo.com\u002Fquote\u002F%5EGSPC\u002Fhistory?p=%5EGSPC) 下载，放入项目目录并重命名为 `sp500_index.csv`。\n\n可以通过在终端中输入以下命令来运行下载历史价格数据的代码：\n\n```bash\npython download_historical_prices.py\n```\n\n## 创建训练数据集\n\n我们对训练数据的最终目标是拥有特定股票在特定时间的基本面\"快照\"，以及该股票相应的后续年度表现。\n\n例如，如果我们的\"快照\"包含2005年1月28日 AAPL 的所有基本面数据，那么我们还需要知道 AAPL 在2005年1月28日至2006年1月28日之间的价格变化百分比。这样我们的算法就可以学习基本面如何影响股票价格的年度变化。\n\n事实上，这稍微简化了。实际上，算法最终要学习的是基本面如何影响*股票相对于标普500指数的超额收益（outperformance）*。这就是为什么我们还需要指数数据。\n\n### 预处理历史价格数据\n\n当 `pandas-datareader` 下载股票价格数据时，它不包含周末和公共假期（市场关闭时）的行。\n\n然而，参考上面 AAPL 的例子，如果我们的快照包含2005年1月28日的基本面数据，我们想看看一年后的价格变化，我们会得到一个令人不快的惊喜：2006年1月28日是星期六。这是否意味着我们必须丢弃这个快照？\n\n绝不是——数据太宝贵了，不能随意丢弃。作为变通方案，我决定\"向前填充（fill forward）\"缺失的数据，即我们假设2006年1月28日星期六的股票价格等于2006年1月27日星期五的股票价格。\n\n### 特征（Features）\n\n以下是雅虎财经上可用的一些有趣变量列表。\n\n#### 估值指标（Valuation measures）\n\n- 'Market Cap'（市值）\n- Enterprise Value（企业价值）\n- Trailing P\u002FE（市盈率，过去12个月）\n- 
Forward P\u002FE（远期市盈率）\n- PEG Ratio（市盈率相对盈利增长比率）\n- Price\u002FSales（市销率）\n- Price\u002FBook（市净率）\n- Enterprise Value\u002FRevenue（企业价值\u002F收入）\n- Enterprise Value\u002FEBITDA（企业价值\u002F息税折旧摊销前利润）\n\n#### 财务数据（Financials）\n\n- Profit Margin（利润率）\n- Operating Margin（营业利润率）\n- Return on Assets（资产收益率，ROA）\n- Return on Equity（净资产收益率，ROE）\n- Revenue（收入）\n- Revenue Per Share（每股收入）\n- Quarterly Revenue Growth（季度收入增长）\n- Gross Profit（毛利润）\n- EBITDA（息税折旧摊销前利润）\n- Net Income Avi to Common（归属于普通股股东的净利润）\n- Diluted EPS（稀释每股收益）\n- Quarterly Earnings Growth（季度盈利增长）\n- Total Cash（现金总额）\n- Total Cash Per Share（每股现金总额）\n- Total Debt（总债务）\n- Total Debt\u002FEquity（债务权益比）\n- Current Ratio（流动比率）\n- Book Value Per Share（每股账面价值）\n- Operating Cash Flow（经营现金流）\n- Levered Free Cash Flow（杠杆自由现金流）\n\n#### 交易信息（Trading information）\n\n- Beta（贝塔系数）\n- 50-Day Moving Average（50日移动平均线）\n- 200-Day Moving Average（200日移动平均线）\n- Avg Vol (3 month)（3个月平均成交量）\n- Shares Outstanding（流通股数）\n- Float（流通股本）\n- % Held by Insiders（内部人士持股比例）\n- % Held by Institutions（机构持股比例）\n- Shares Short（空头股数）\n- Short Ratio（空头比率）\n- Short % of Float（空头占流通股本比例）\n- Shares Short (prior month)（上月空头股数）\n\n### 解析（Parsing）\n\n然而，所有这些数据都被锁定在 HTML 文件中。因此，我们需要构建一个解析器。在这个项目中，我使用正则表达式（regex, Regular Expression）进行解析，但请注意，通常[非常不建议](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F1732348\u002Fregex-match-open-tags-except-xhtml-self-contained-tags)使用正则表达式解析 HTML。然而，我认为正则表达式在易于理解方面可能更胜一筹（本项目具有教育性质），而且根据经验，正则表达式在这种情况下工作正常。\n\n这是使用的确切正则表达式：\n\n```python\nr'>' + re.escape(variable) + r'.*?(\\-?\\d+\\.*\\d*K?M?B?|N\u002FA[\\\\n|\\s]*|>0|NaN)%?(\u003C\u002Ftd>|\u003C\u002Fspan>)'\n```\n\n虽然看起来很晦涩，但它所做的只是搜索特征的第一次出现（例如\"Market Cap\"），然后向前查找，直到找到一个数字紧跟着 `\u003C\u002Ftd>` 或 `\u003C\u002Fspan>`（表示表格条目的结束）。上述表达式的复杂性考虑了解析中的一些细微之处：\n\n- 数字前面可能有负号\n- 雅虎财经有时使用 K、M 和 B 分别作为千、百万和十亿的缩写\n- 一些数据以百分比形式给出\n- 某些数据点缺失，因此我们必须寻找\"N\u002FA\"或\"NaN\"而不是数字\n\n价格数据的预处理和基本面数据（keystats）的解析都包含在 `parsing_keystats.py` 中。在终端中运行以下命令：\n\n```bash\npython 
parsing_keystats.py\n```\n\n你应该会看到 `keystats.csv` 文件出现在你的工作目录中。现在我们已经准备好了训练数据，可以真正进行机器学习了。\n\n## 回测（Backtesting）\n\n回测可以说是任何量化策略中最重要的部分：在实盘交易之前，你必须有某种方式来测试算法的性能。\n\n尽管回测很重要，但我最初并不想在这个仓库中包含回测代码。原因如下：\n\n- 回测是混乱且经验性的。代码使用起来不太友好，实际上需要大量手动交互。\n- 回测很难做对，如果做错了，你会被高回报所欺骗。\n- 开发和处理回测可能是学习机器学习和股票的最佳方式——你会看到什么有效、什么无效，以及你不懂什么。\n\n尽管如此，由于回测的重要性，我认为如果没有回测，就不能真正称之为\"机器学习股票项目模板\"。因此，我包含了一个简化的回测脚本。请注意，这个回测实现存在一个致命缺陷，会导致回测回报*高得多*。这是一个相当微妙的点，但我让你自己去发现 :)\n\n在终端中运行以下命令：\n\n```bash\npython backtesting.py\n```\n\n你应该会得到类似这样的结果：\n\n```txt\nClassifier performance\n======================\nAccuracy score:  0.81\nPrecision score:  0.75\n\nStock prediction performance report\n===================================\nTotal Trades: 177\nAverage return for stock predictions:  37.8 %\nAverage market return in the same period:  9.2%\nCompared to the index, our strategy earns  28.6 percentage points more\n```\n\n再次强调，这个表现看起来好得令人难以置信，几乎肯定是这样。\n\n## 当前基本面数据（Current fundamental data）\n\n现在我们已经用数据训练并回测了模型，我们希望对当前数据生成实际预测。\n\n像往常一样，我们可以从老牌的雅虎财经（Yahoo Finance）抓取数据。我的方法实际上是直接下载每只股票的数据统计页面（这是苹果公司的[页面](https:\u002F\u002Ffinance.yahoo.com\u002Fquote\u002FAAPL\u002Fkey-statistics?p=AAPL)），然后像以前一样使用正则表达式（regex）进行解析。\n\n事实上，正则表达式应该几乎相同，但由于雅虎多次更改了他们的用户界面（UI），存在一些细微差异。每当雅虎财经更改其UI时，项目的这部分就需要修复，所以如果你无法让项目运行，问题很可能出在这里。\n\n在终端中运行以下命令：\n\n```bash\npython current_data.py\n```\n\n然后脚本会开始将HTML下载到工作目录下的`forward\u002F`文件夹中，然后解析这些数据并输出文件`forward_sample.csv`。你可能会看到某些股票代码的一些杂项错误（例如\"Exceeded 30 redirects.\"），但这是正常的。\n\n## 股票预测（Stock prediction）\n\n现在我们有了训练数据和当前数据，终于可以生成实际预测了。项目的这部分非常简单：你只需要决定`OUTPERFORMANCE`参数的值（一只股票必须跑赢标普500指数（S&P500）的百分比才被视为\"买入\"）。我默认将其设置为10，但可以通过修改文件顶部的变量轻松更改。继续运行脚本：\n\n```bash\npython stock_prediction.py\n```\n\n你应该会得到类似这样的结果：\n\n```txt\n21 stocks predicted to outperform the S&P500 by more than 10%:\nNOC FL SWK NFX LH NSC SCHL KSU DDS GWW AIZ ORLY R SFLY SHW GME DLX DIS AMP BBBY APD\n```\n\n## 单元测试（Unit 
testing）\n\n我包含了一些单元测试（在`tests\u002F`文件夹中），用于检查一切是否正常运行。然而，由于项目某些功能的性质（下载大型数据集），在运行测试之前，你必须先运行所有代码一次。否则，测试本身将不得不下载大量数据（我认为这不是最优的）。\n\n因此，我建议你在运行了所有其他脚本之后（可能除了`stock_prediction.py`）再运行测试。\n\n要运行测试，只需在项目目录的终端实例中输入以下内容：\n\n```bash\npytest -v\n```\n\n请注意，在`tests\u002F`目录中包含`__init__.py`文件被认为不是最佳实践（详见[此处](https:\u002F\u002Fdocs.pytest.org\u002Fen\u002Flatest\u002Fgoodpractices.html)），但我还是这样做了，因为它简单且实用。\n\n## 下一步方向（Where to go from here）\n\n我说过这个项目是可扩展的，所以这里有一些想法可以帮助你开始，并可能提高回报（不保证）。\n\n### 数据获取（Data acquisition）\n\n我个人相信，更高质量的数据是决定你表现的最终因素。以下是一些想法：\n\n- 探索Sentdex的`intraQuarter.zip`中的其他子文件夹。\n- 解析所有公司向SEC（美国证券交易委员会）提交的年度报告（查看[Edgar数据库](https:\u002F\u002Fwww.sec.gov\u002Fedgar\u002Fsearchedgar\u002Fcompanysearch.html)）\n- 尝试找到可以抓取基本面数据的网站（这是我的解决方案）。\n- 放弃美股，走向全球——也许在流动性较低的市场中能找到更好的结果。看看特征的预测能力是否因地理位置而异，这将很有趣。\n- 购买Quandl数据，或尝试另类数据（alternative data）。\n\n### 数据预处理（Data preprocessing）\n\n- 使用BeautifulSoup构建更强大的解析器\n- 在这个项目中，我只是忽略了任何缺失数据的行，但这大大减少了数据集的大小。有什么方法可以填补其中一些数据吗？\n  - 提示：如果市盈率（PE ratio）缺失，但你知道股价和每股收益……\n  - 提示2：苹果公司3月份的账面价值（book value）与6月份的账面价值有多大不同？\n- 某种形式的特征工程（feature engineering）\n  - 例如，计算[格雷厄姆数（Graham's number）](https:\u002F\u002Fwww.investopedia.com\u002Fterms\u002Fg\u002Fgraham-number.asp)并将其用作特征\n  - 有些特征可能是冗余的。为什么不删除它们以加快训练速度？\n- 加快`keystats.csv`的构建速度。\n  - 提示：不要不断追加到一个不断增长的数据框（dataframe）中！将其分块处理\n\n### 机器学习\n\n修改机器学习部分可能是最简单且最有趣的环节。\n\n- 如果你认真对待结果，最重要的事情是找出当前回测（backtesting）设置中的问题并加以修复。这可能会是一次相当清醒的体验，但如果你的回测做得正确，这意味着在测试集上观察到的任何超额收益（outperformance）都可以用于实际交易（再次提醒，请自行判断风险）。\n- 尝试不同的分类器（classifier）——有大量研究支持使用支持向量机（SVM, Support Vector Machine）等方法。不要忘记其他分类器可能需要特征缩放（feature scaling）等预处理。\n- 超参数调优（Hyperparameter tuning）：使用网格搜索（gridsearch）为你的分类器找到最优超参数。但要确保不要过拟合（overfit）！\n- 让它更*深度*——尝试神经网络（neural network）（一个简单的入门方式是使用 `sklearn.neural_network`）。\n- 将分类问题转化为回归（regression）问题：如果我们尝试预测股票的*收益率*（return）而不是它是否跑赢大盘，会取得更好的结果吗？\n- 多次运行预测（也许使用不同的超参数？），并选择投资中最常见的*k*只股票。如果算法不是确定性的（如随机森林 Random Forest 的情况），这一点尤为重要。\n- 尝试不同的 `outperformance` 参数值。\n- 
我们真的应该尝试预测原始收益吗？如果一只股票实现了20%的收益，但这是通过极高的波动性（volatility）实现的，会发生什么？\n- 尝试绘制不同特征的重要性，以\"看看机器看到了什么\"。\n\n## 贡献\n\n欢迎随意 fork、尝试修改并提交 PR（Pull Request）。我非常感谢任何错误修复或更多的单元测试。\n\n这个项目最初基于 Sentdex 优秀的[机器学习教程](https:\u002F\u002Fwww.youtube.com\u002Fplaylist?list=PLQVvvaa0QuDd0flgGphKCej-9jp-QdzZ3)，但此后已大幅演进，代码几乎完全不同。完整系列也可在他的[网站](https:\u002F\u002Fpythonprogramming.net\u002Fmachine-learning-python-sklearn-intro\u002F)上找到。\n\n---\n\n想了解更多类似内容，请访问我的学术博客 [reasonabledeviations.com\u002F](https:\u002F\u002Freasonabledeviations.com)。","# MachineLearningStocks 快速上手指南\n\n## 环境准备\n\n| 项目 | 要求 |\n|:---|:---|\n| Python 版本 | 3.6+（必须支持 f-string） |\n| 操作系统 | Windows \u002F macOS \u002F Linux |\n| 核心依赖 | pandas, scikit-learn, pandas-datareader |\n\n## 安装步骤\n\n### 1. 克隆项目\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fsurelyourejoking\u002FMachineLearningStocks.git\ncd MachineLearningStocks\n```\n\n### 2. 安装依赖\n\n```bash\n# 国内用户建议使用清华镜像加速\npip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 3. 
下载数据文件\n\n从 [PythonProgramming.net](https:\u002F\u002Fpythonprogramming.net\u002Fdata-acquisition-machine-learning\u002F) 下载 `intraQuarter.zip`，解压到项目根目录。\n\n> 若 Yahoo Finance 接口不稳定，项目已内置 `stock_prices.csv` 和 `sp500_index.csv` 备用数据。\n\n## 基本使用\n\n### 快速运行（推荐）\n\n按顺序执行以下命令：\n\n```bash\n# 下载历史股价数据\npython download_historical_prices.py\n\n# 解析基本面数据，生成训练集\npython parsing_keystats.py\n\n# 回测模型性能\npython backtesting.py\n\n# 获取当前基本面数据\npython current_data.py\n\n# 运行单元测试\npytest -v\n\n# 生成股票预测\npython stock_prediction.py\n```\n\n### 核心流程说明\n\n| 步骤 | 脚本 | 输出 |\n|:---|:---|:---|\n| 数据获取 | `download_historical_prices.py` | 历史股价 CSV |\n| 特征工程 | `parsing_keystats.py` | `keystats.csv`（训练集） |\n| 回测验证 | `backtesting.py` | 回测性能报告 |\n| 当前数据 | `current_data.py` | `forward_sample.csv` |\n| 预测生成 | `stock_prediction.py` | 股票预测结果 |\n\n### 关键特征变量\n\n项目使用以下三类基本面指标：\n\n- **估值指标**：市盈率（P\u002FE）、市净率（P\u002FB）、PEG 比率、企业价值\u002FEBITDA 等\n- **财务指标**：利润率、ROE、ROA、负债权益比、现金流等\n- **交易指标**：Beta、均线、换手率、做空比例、机构持仓等\n\n### 输出解读\n\n`stock_prediction.py` 最终输出为预测跑赢标普 500 的股票代码列表，格式示例：\n\n```\n21 stocks predicted to outperform the S&P500 by more than 10%:\nNOC FL SWK NFX LH ...\n```\n\n输出为被分类器判定为\"跑赢\"的股票代码列表，脚本本身不打印预测概率。\n\n---\n\n> ⚠️ **风险提示**：本项目仅供教育用途，回测表现不代表实际收益，实盘交易风险自负。","一位量化投资分析师需要每月从 3000+ 只美股中筛选出潜在跑赢标普 500 指数的股票，构建主动管理型投资组合。\n\n### 没有 MachineLearningStocks 时\n\n- **数据清洗耗时巨大**：手动从多个数据源（Yahoo Finance、SEC EDGAR、Quandl）下载财务数据，用 Excel 处理 PE、负债权益比等指标，每周花费 15+ 小时在数据对齐和缺失值处理上\n- **特征工程缺乏系统性**：凭经验挑选 5-10 个基本面指标，无法验证哪些因子真正与股价涨跌相关，经常混入噪音特征导致模型过拟合\n- **回测流程不规范**：用简单的 Excel 公式计算历史收益，忽略幸存者偏差和前瞻偏差，回测结果与实际交易表现差距悬殊\n- **预测流程无法自动化**：每次更新持仓都需要重复手动操作，从数据获取到最终选股需要 2-3 天，错过最佳调仓时机\n\n### 使用 MachineLearningStocks 后\n\n- **端到端数据流水线**：利用内置的 pandas 数据清洗模块，自动整合历史股价与基本面数据，30 分钟即可完成原本需要数天的数据准备工作\n- **可扩展的特征工程框架**：通过 scikit-learn 统一接口快速测试 20+ 个财务指标的组合效果，用特征重要性分析剔除无效因子，提升模型泛化能力\n- **规范的回测起点**：内置的简化回测模块演示了训练集与测试集的划分流程，为建立严谨的回测机制提供了可迭代的模板（README 中提到的已知缺陷仍需自行修正）\n- **自动化预测工作流**：一键运行从数据更新到最终选股的完整流程，将月度调仓周期从 3 天缩短至 4 小时，可及时响应市场变化\n\nMachineLearningStocks 
将零散的手工分析转化为可复现、可迭代的机器学习工作流，让个人投资者也能搭建机构级的量化选股系统。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Frobertmartin8_MachineLearningStocks_b0d12610.png","robertmartin8","Robert Martin","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Frobertmartin8_f01f348d.jpg","trader at DE Shaw, python enjoyer, ex-astrophysics at Cambridge",null,"Cambridge, UK","https:\u002F\u002Freasonabledeviations.com","https:\u002F\u002Fgithub.com\u002Frobertmartin8",[24],{"name":25,"color":26,"percentage":27},"Python","#3572A5",100,1936,536,"2026-04-01T22:58:17","MIT",2,"Linux, macOS, Windows","未说明",{"notes":36,"python":37,"dependencies":38},"项目已于2021年2月停止维护；依赖pandas-datareader从Yahoo Finance获取数据，但该接口可能不稳定；需要下载intraQuarter.zip数据文件（包含2003-2013年S&P500股票基本面HTML数据）才能运行；建议使用conda或venv管理Python环境","3.6+",[39,40,41,42],"pandas","scikit-learn","pandas-datareader","pytest",[44,45,46],"开发框架","数据工具","其他",[48,49,40,50,51,52,53,54,55,56,57,58,59,60,61],"stock-prediction","machine-learning","python","yahoo-finance","stock","historical-stock-fundamentals","quantitative-finance","algorithmic-trading","trading","sklearn","tutorial","guide","data-science","stock-prices",3,"ready","2026-03-27T02:49:30.150509","2026-04-06T05:15:18.222124",[67,72,77,82,87,92,97],{"id":68,"question_zh":69,"answer_zh":70,"source_url":71},4100,"运行 stock_prediction.py 时出现 ValueError: Found array with 0 sample(s) 错误","此错误通常是由于数据集为空导致的。请检查以下几点：\n1. 确认是否正确下载了 intraQuarter 数据\n2. 确认 keystats.csv 文件是否生成且有数据内容\n3. 检查 Python 版本是否为 3.6+（Python 3.5 可能存在字符串处理问题）\n4. 确认 parsing_keystats.py 是否成功运行并生成了有效的数据文件\n\n如果问题持续，可以尝试更新到 Python 3.6 或更高版本，并重新运行数据解析脚本。","https:\u002F\u002Fgithub.com\u002Frobertmartin8\u002FMachineLearningStocks\u002Fissues\u002F17",{"id":73,"question_zh":74,"answer_zh":75,"source_url":76},4101,"运行 parsing_keystats.py 后 keystats.csv 文件为空","keystats.csv 为空通常是因为未正确下载 intraQuarter 数据或 Python 版本问题。解决方法：\n\n1. **确认下载 intraQuarter**：必须先下载并放置 intraQuarter 数据文件，这是生成 keystats.csv 的数据来源\n2. 
**升级 Python 版本**：Python 3.5 存在字符串处理问题，建议升级到 Python 3.6+\n3. 检查字符串编码问题并修复后重新运行脚本\n\n下载链接请参考项目 README 中的 \"Historical Stock Fundamentals\" 部分。","https:\u002F\u002Fgithub.com\u002Frobertmartin8\u002FMachineLearningStocks\u002Fissues\u002F21",{"id":78,"question_zh":79,"answer_zh":80,"source_url":81},4102,"from utils import data_string_to_float 导入失败","导入错误通常由以下原因导致：\n\n**Jupyter Notebook 用户**：\n- 确认 utils.py 文件与 .ipynb 文件在同一目录下\n- 尝试使用 `import utils` 替代 `from utils import data_string_to_float`\n\n**一般情况**：\n- 检查 utils.py 文件是否存在于项目目录中\n- 确认当前工作目录正确，或添加项目路径到 sys.path\n\n注意：该脚本在 IDLE\u002FPython shell 中运行正常，但在 Jupyter Notebook 中可能需要调整导入方式。","https:\u002F\u002Fgithub.com\u002Frobertmartin8\u002FMachineLearningStocks\u002Fissues\u002F24",{"id":83,"question_zh":84,"answer_zh":85,"source_url":86},4103,"回测中的致命缺陷是什么？","项目 README 中提到的回测致命缺陷是**数据泄露（Data Leakage）**问题：\n\n**问题本质**：模型在训练时使用了未来的数据，却对过去的股票进行预测。具体来说，在 backtesting.py 第 40 行使用了 `shuffle=True`（train_test_split 的默认值），导致时间序列数据被打乱，模型在训练时已经\"看到\"了未来的样本。\n\n**正确做法**：\n- 随机选择要预测的年份\n- 确保训练集只使用该年份之前的数据\n- 设置 `shuffle=False` 保持时间顺序\n\n修复后，交易次数会减少 45-50%，准确率会下降到约 0.6-0.65。\n\n**额外注意**：数据本身存在偏差——如果买入测试集中所有股票（20% 分割），由于幸存者偏差，会跑赢市场约 4%。","https:\u002F\u002Fgithub.com\u002Frobertmartin8\u002FMachineLearningStocks\u002Fissues\u002F38",{"id":88,"question_zh":89,"answer_zh":90,"source_url":91},4104,"如何将此项目连接到交易所进行实盘交易？","虽然技术上可行，但作为教育项目强烈不建议直接用于实盘交易。\n\n**如需连接交易所**：\n推荐使用 [Interactive Brokers API](https:\u002F\u002Finteractivebrokers.github.io\u002Ftws-api\u002Fintroduction.html)\n\n**重要警告**：\n- 必须进行大量修改\n- 需要进行严格的回测验证\n- 实盘交易风险自负\n\n项目本身仅用于教育目的，直接用于实盘交易可能导致严重资金损失。","https:\u002F\u002Fgithub.com\u002Frobertmartin8\u002FMachineLearningStocks\u002Fissues\u002F22",{"id":93,"question_zh":94,"answer_zh":95,"source_url":96},4105,"如何获取 Yahoo Finance 的财务数据（如现金流、债务等）用于自定义股票列表？","本项目**不包含**数据抓取脚本，主要专注于数据处理和机器学习分类器训练。\n\n**项目范围说明**：\n- 本项目处理的是已抓取好的数据\n- 不包含从 Yahoo Finance 抓取原始数据的代码\n\n**如需抓取数据**：\n- 可以尝试使用 YahooFinancials Python API\n- 但注意某些关键指标（如 
Levered Free Cash Flow、Cash、Debt 等）可能缺失\n- 需要自行开发或寻找专门的数据抓取工具\n\n项目维护者无法提供数据抓取方面的帮助。","https:\u002F\u002Fgithub.com\u002Frobertmartin8\u002FMachineLearningStocks\u002Fissues\u002F27",{"id":98,"question_zh":99,"answer_zh":100,"source_url":101},4106,"pytest 测试失败怎么办？","测试失败时，请检查以下常见原因：\n\n1. **确认数据文件已下载**：某些测试需要 intraQuarter 数据文件\n2. **检查 Python 版本**：建议使用 Python 3.6+\n3. **确认依赖安装完整**：运行 `pip install -r requirements.txt`\n4. **检查文件路径**：确保测试从项目根目录运行\n\n如果特定测试失败，可以查看具体错误信息。常见的 `test_features_same()` 测试问题可能需要后续修复，不影响主要功能使用。","https:\u002F\u002Fgithub.com\u002Frobertmartin8\u002FMachineLearningStocks\u002Fissues\u002F20",[],[104,114,123,131,139,150],{"id":105,"name":106,"github_repo":107,"description_zh":108,"stars":109,"difficulty_score":62,"last_commit_at":110,"category_tags":111,"status":63},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[44,112,113],"图像","Agent",{"id":115,"name":116,"github_repo":117,"description_zh":118,"stars":119,"difficulty_score":32,"last_commit_at":120,"category_tags":121,"status":63},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 
协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,"2026-04-05T11:33:21",[44,113,122],"语言模型",{"id":124,"name":125,"github_repo":126,"description_zh":127,"stars":128,"difficulty_score":32,"last_commit_at":129,"category_tags":130,"status":63},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[44,112,113],{"id":132,"name":133,"github_repo":134,"description_zh":135,"stars":136,"difficulty_score":32,"last_commit_at":137,"category_tags":138,"status":63},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[44,122],{"id":140,"name":141,"github_repo":142,"description_zh":143,"stars":144,"difficulty_score":32,"last_commit_at":145,"category_tags":146,"status":63},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[112,45,147,148,113,46,122,44,149],"视频","插件","音频",{"id":151,"name":152,"github_repo":153,"description_zh":154,"stars":155,"difficulty_score":62,"last_commit_at":156,"category_tags":157,"status":63},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[113,112,44,122,46]]