[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-szilard--benchm-ml":3,"tool-szilard--benchm-ml":62},[4,18,28,36,45,54],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":24,"last_commit_at":25,"category_tags":26,"status":17},9989,"n8n","n8n-io\u002Fn8n","n8n 是一款面向技术团队的公平代码（fair-code）工作流自动化平台，旨在让用户在享受低代码快速构建便利的同时，保留编写自定义代码的灵活性。它主要解决了传统自动化工具要么过于封闭难以扩展、要么完全依赖手写代码效率低下的痛点，帮助用户轻松连接 400 多种应用与服务，实现复杂业务流程的自动化。\n\nn8n 特别适合开发者、工程师以及具备一定技术背景的业务人员使用。其核心亮点在于“按需编码”：既可以通过直观的可视化界面拖拽节点搭建流程，也能随时插入 JavaScript 或 Python 代码、调用 npm 包来处理复杂逻辑。此外，n8n 原生集成了基于 LangChain 的 AI 能力，支持用户利用自有数据和模型构建智能体工作流。在部署方面，n8n 提供极高的自由度，支持完全自托管以保障数据隐私和控制权，也提供云端服务选项。凭借活跃的社区生态和数百个现成模板，n8n 让构建强大且可控的自动化系统变得简单高效。",184740,2,"2026-04-19T23:22:26",[16,14,13,15,27],"插件",{"id":29,"name":30,"github_repo":31,"description_zh":32,"stars":33,"difficulty_score":10,"last_commit_at":34,"category_tags":35,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":24,"last_commit_at":42,"category_tags":43,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",161147,"2026-04-19T23:31:47",[14,13,44],"语言模型",{"id":46,"name":47,"github_repo":48,"description_zh":49,"stars":50,"difficulty_score":51,"last_commit_at":52,"category_tags":53,"status":17},8272,"opencode","anomalyco\u002Fopencode","OpenCode 是一款开源的 AI 编程助手（Coding Agent），旨在像一位智能搭档一样融入您的开发流程。它不仅仅是一个代码补全插件，而是一个能够理解项目上下文、自主规划任务并执行复杂编码操作的智能体。无论是生成全新功能、重构现有代码，还是排查难以定位的 Bug，OpenCode 都能通过自然语言交互高效完成，显著减少开发者在重复性劳动和上下文切换上的时间消耗。\n\n这款工具专为软件开发者、工程师及技术研究人员设计，特别适合希望利用大模型能力来提升编码效率、加速原型开发或处理遗留代码维护的专业人群。其核心亮点在于完全开源的架构，这意味着用户可以审查代码逻辑、自定义行为策略，甚至私有化部署以保障数据安全，彻底打破了传统闭源 AI 助手的“黑盒”限制。\n\n在技术体验上，OpenCode 提供了灵活的终端界面（Terminal UI）和正在测试中的桌面应用程序，支持 macOS、Windows 及 Linux 全平台。它兼容多种包管理工具，安装便捷，并能无缝集成到现有的开发环境中。无论您是追求极致控制权的资深极客，还是渴望提升产出的独立开发者，OpenCode 都提供了一个透明、可信",144296,1,"2026-04-16T14:50:03",[13,27],{"id":55,"name":56,"github_repo":57,"description_zh":58,"stars":59,"difficulty_score":24,"last_commit_at":60,"category_tags":61,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",109154,"2026-04-18T11:18:24",[14,15,13],{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":68,"readme_en":69,"readme_zh":70,"quickstart_zh":71,"use_case_zh":72,"hero_image_url":73,"owner_login":74,"owner_name":75,"owner_avatar_url":76,"owner_bio":77,"owner_company":78,"owner_location":79,"owner_email":78,"owner_twitter":78,"owner_website":80,"owner_url":81,"languages":82,"stars":91,"forks":92,"last_commit_at":93,"license":94,"difficulty_score":24,"env_os":95,"env_gpu":96,"env_ram":97,"env_deps":98,"category_tags":108,"github_topics":112,"view_count":24,"oss_zip_url":78,"oss_zip_packed_at":78,"status":17,"created_at":122,"updated_at":123,"faqs":124,"releases":125},9933,"szilard\u002Fbenchm-ml","benchm-ml","A minimal benchmark for scalability, speed and accuracy of commonly used open source implementations (R packages, Python scikit-learn, H2O, xgboost, Spark MLlib etc.) of the top machine learning algorithms for binary classification (random forests, gradient boosted trees, deep neural networks etc.).","benchm-ml 是一个专注于评估主流开源机器学习库在二分类任务中可扩展性、运行速度与预测准确度的最小化基准测试项目。它旨在解决开发者在面对海量数据时，难以直观判断哪种算法实现（如随机森林、梯度提升树或深度神经网络）能在有限硬件资源下高效运行的痛点。\n\n该项目特别适合数据科学家、算法工程师及研究人员使用，帮助他们在信用评分、欺诈检测等典型商业场景中，为百万级样本和千维特征的数据集选择最合适的工具链。benchm-ml 涵盖了 R、Python scikit-learn、XGBoost、LightGBM、H2O 及 Spark MLlib 等多种流行实现，通过控制变量法，在单台具备多核处理器和充足内存的普通服务器上，系统性地对比了不同方案的表现。\n\n其独特之处在于不追求复杂的分布式架构评测，而是聚焦于“单机极限”，真实反映算法在内存约束下的崩溃临界点与计算效率。测试以 AUC 作为准确度指标，生成从 1 万到 1000 万不等的训练数据集，让用户能清晰看到随着数据规模扩大，各工具的性能变化趋势。如果你正在为大规模数据建模选型，benchm-ml 提供的实测数据将是一份极具参考价值的避坑","benchm-ml 是一个专注于评估主流开源机器学习库在二分类任务中可扩展性、运行速度与预测准确度的最小化基准测试项目。它旨在解决开发者在面对海量数据时，难以直观判断哪种算法实现（如随机森林、梯度提升树或深度神经网络）能在有限硬件资源下高效运行的痛点。\n\n该项目特别适合数据科学家、算法工程师及研究人员使用，帮助他们在信用评分、欺诈检测等典型商业场景中，为百万级样本和千维特征的数据集选择最合适的工具链。benchm-ml 涵盖了 R、Python scikit-learn、XGBoost、LightGBM、H2O 及 Spark MLlib 等多种流行实现，通过控制变量法，在单台具备多核处理器和充足内存的普通服务器上，系统性地对比了不同方案的表现。\n\n其独特之处在于不追求复杂的分布式架构评测，而是聚焦于“单机极限”，真实反映算法在内存约束下的崩溃临界点与计算效率。测试以 AUC 作为准确度指标，生成从 1 万到 1000 万不等的训练数据集，让用户能清晰看到随着数据规模扩大，各工具的性能变化趋势。如果你正在为大规模数据建模选型，benchm-ml 提供的实测数据将是一份极具参考价值的避坑指南。","\n## Simple\u002Flimited\u002Fincomplete benchmark for scalability, speed and accuracy of machine learning libraries for classification\n\n_**All benchmarks are wrong, but some are useful**_\n\nThis project aims at a *minimal* benchmark for scalability, speed and accuracy of commonly used implementations\nof a few machine learning algorithms. The target of this study is binary classification with numeric and categorical inputs (of \nlimited cardinality i.e. not very sparse) and no missing data, perhaps the most common problem in business\napplications (e.g. credit scoring, fraud detection or churn prediction). If the input matrix is of *n* x *p*, *n* is \nvaried as 10K, 100K, 1M, 10M, while *p* is ~1K (after expanding the categoricals into dummy \nvariables\u002Fone-hot encoding). This particular type of data structure\u002Fsize (the largest) stems from this author's interest in \nsome particular business applications.\n\n**A large part of this benchmark was done in 2015, with a number of updates later on as things have changed. Make sure you read \nat the [end](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml#summary) of this repo a summary of how the focus has changed over time,\nand why instead of updating this benchmark I started a new one (and where to find it).**\n\nThe algorithms studied are \n- linear (logistic regression, linear SVM)\n- random forest\n- boosting \n- deep neural network\n\nin various commonly used open source implementations like \n- R packages\n- Python scikit-learn\n- Vowpal Wabbit\n- H2O \n- xgboost\n- lightgbm (added in 2017)\n- Spark MLlib.\n\n**Update (June 2015):** It turns out these are the [most popular tools](https:\u002F\u002Fgithub.com\u002Fszilard\u002Flist-ml-tools)\nused for machine learning indeed. If your software tool of choice is not here, you can do a minimal benchmark\nwith little work with the [following instructions](z-other-tools).\n\nRandom forest, boosting and more recently deep neural networks are the algos expected to perform the best on the structure\u002Fsizes\ndescribed above (e.g. vs alternatives such as *k*-nearest neighbors, naive-Bayes, decision trees, linear models etc). \nNon-linear SVMs are also among the best in accuracy in general, but become slow\u002Fcannot scale for the larger *n*\nsizes we want to deal with. The linear models are less accurate in general and are used here only \nas a baseline (but they can scale better and some of them can deal with very sparse features, so they are great in other use cases). \n\nBy scalability we mean here that the algos are able to complete (in decent time) for the given data sizes with \nthe main constraint being RAM (a given algo\u002Fimplementation will crash if running out of memory). Some \nof the algos\u002Fimplementations can work in a distributed setting, although the largest dataset in this\nstudy *n* = 10M is less than 1GB, so scaling out to multiple machines should not be necessary and\nis not the focus of this current study. (Also, some of the algos perform relatively poorly speedwise in the multi-node setting, where \ncommunication is over the network rather than via updating shared memory.)\nSpeed (in the single node setting) is determined by computational\ncomplexity but also if the algo\u002Fimplementation can use multiple processor cores.\nAccuracy is measured by AUC. The interpretability of models is not of concern in this project.\n\nIn summary, we are focusing on which algos\u002Fimplementations can be used to train relatively accurate binary classifiers for data\nwith millions of observations and thousands of features processed on commodity hardware (mainly one machine with decent RAM and several cores).\n\n## Data\n\nTraining datasets of sizes 10K, 100K, 1M, 10M are [generated](0-init\u002F2-gendata.txt) from the well-known airline dataset (using years 2005 and 2006). \nA test set of size 100K is generated from the same (using year 2007). The task is to predict whether a flight will\nbe delayed by more than 15 minutes. While we study primarily the scalability of algos\u002Fimplementations, it is also interesting\nto see how much more information and consequently accuracy the same model can obtain with more data (more observations).\n\n## Setup \n\nThe tests have been carried out on a Amazon EC2 c3.8xlarge instance (32 cores, 60GB RAM). The tools are freely available and \ntheir [installation](0-init\u002F1-install.md) is trivial ([version information here](0-init\u002F1a-versions.txt)). For some\nof the models that ran out of memory for the larger data sizes a r3.8xlarge instance (32 cores, 250GB RAM) has been used\noccasionally. For deep learning on GPUs, p2.xlarge (1 GPU with 12GB video memory, 4 CPU cores, 60GB RAM) instance has been used.\n\n**Update (January 2018):** A more modern approach would use docker for fully automated installing of all ML software and automated\ntiming\u002Frunning of tests (which would make it more easy to rerun the tests on new versions of the tools, would make them more reproducible etc).\nThis approach has been actually used in a successor of this benchmark focusing on the top performing GBM implementations only, see \n[here](https:\u002F\u002Fgithub.com\u002Fszilard\u002FGBM-perf).\n\n## Results\n\nFor each algo\u002Ftool and each size *n* we observe the following: training time, maximum memory usage during training, CPU usage on the cores, \nand AUC as a measure for predictive accuracy. \nTimes to read the data, pre-process the data, score the test data are also observed but not\nreported (not the bottleneck).\n\n### Linear Models\n\nThe linear models are not the primary focus of this study because of their not so great accuracy vs\nthe more complex models (on this type of data). \nThey are analyzed here only to get some sort of baseline.\n\nThe R glm function (the basic R tool for logistic regression) is very slow, 500 seconds on *n* = 0.1M (AUC 70.6).\nTherefore, for R the glmnet package is used. For Python\u002Fscikit-learn LogisticRegression\n(based on the LIBLINEAR C++ library) has been used.\n\nTool    | *n*  |   Time (sec)  | RAM (GB) | AUC\n--------|------|---------------|----------|--------\nR       | 10K  |      0.1      |   1      | 66.7\n.       | 100K |      0.5      |   1      | 70.3\n.       | 1M   |      5        |   1      | 71.1\n.       | 10M  |      90       |   5      | 71.1\nPython  | 10K  |      0.2      |   2      | 67.6\n.       | 100K |       2       |   3      | 70.6\n.       | 1M   |       25      |   12     | 71.1\n.       | 10M  |  crash\u002F360    |          | 71.1\nVW      | 10K  |     0.3 (\u002F10) |          | 66.6\n.       | 100K |      3 (\u002F10)  |          | 70.3\n.       | 1M   |      10 (\u002F10) |          | 71.0\n.       | 10M  |     15        |          | 71.0\nH2O     | 10K  |      1        |   1      | 69.6\n.       | 100K |      1        |   1      | 70.3\n.       | 1M   |      2        |   2      | 70.8\n.       | 10M  |      5        |   3      | 71.0\nSpark   | 10K  |      1        |   1      | 66.6\n.       | 100K |      2        |   1      | 70.2\n.       | 1M   |      5        |   2      | 70.9\n.       | 10M  |      35       |   10     | 70.9\n\nPython crashes on the 60GB machine, but completes\nwhen RAM is increased to 250GB (using a [sparse format](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fissues\u002F27) \nwould help with memory footprint\nand likely runtime as well).\nThe Vowpal Wabbit (VW) running times are reported in the table for 10 passes (online learning) \nover the data for \nthe smaller sizes. While VW can be run on multiple cores (as multiple processes communicating with each\nother), it has been run here in \nthe simplest possible way (1 core). Also keep in mind that VW reads the data on the fly while for the other tools\nthe times reported exclude reading the data into memory.\n\nOne can play with various parameters (such as regularization) and even do some search in the parameter space with\ncross-validation to get better accuracy. However, very quick experimentation shows that at least for the larger\nsizes regularization does not increase accuracy significantly (which is expected since *n* >> *p*).\n\n![plot-time](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_73713b2c3b97.png)\n![plot-auc](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_48bcca78e020.png)\n\nThe main conclusion here is that **it is trivial to train linear models even for *n* = 10M rows virtually in\nany of these tools** on a single machine in a matter of seconds. \nH2O and VW are the most memory efficient (VW needs only 1 observation in memory\nat a time therefore is the ultimately scalable solution). H2O and VW are also the fastest (for VW the time reported\nincludes the time to read the data as it is read on the fly).\nAgain, the differences in memory efficiency and speed will start to really matter only for\nlarger sizes and beyond the scope of this study.\n\n\n#### Learning Curve of Linear vs Non-Linear Models\n\n\u003Ca name=\"rf-vs-linear\">\u003C\u002Fa>\nFor *this dataset* the accuracy of the linear\nmodel tops-off at moderate sizes while the accuracy of non-linear models (e.g. random forest) \ncontinues to increase with increasing data size.\nThis is because a simple linear structure can be extracted already from \na smaller dataset and having more data points will not change the classification boundary significantly.\nOn the other hand, more complex models such as random forests can improve further with increasing \ndata size by adjusting further the classification boundary.\n\nThis means that having more data (\"big data\") does not improve further the accuracy of the *linear* model\n(at least for this dataset).\n\nNote also that the random forest model is more accurate than the linear one for any size, and \ncontrary to the conventional wisdom of \"more data beats better algorithms\", \nthe random forest model \non 1% of the data (100K records) beats the linear model on all the data (10M records). \n\n![plot-auc](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_6cc0dd2b2c90.png)\n\nSimilar behavior can be observed in other *non-sparse* datasets, e.g. the \n[Higgs dataset](x1-data-higgs). Contact me (e.g. submit a [github issue](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fissues)) \nif you have learning curves for linear vs non-linear models on other datasets (dense or sparse).\n\nOn the other hand, there is certainly a price for higher accuracy in terms of larger required training (CPU) time.\n\nUltimately, there is a data size - algo (complexity) - cost (CPU time) - accuracy tradeoff \n(to be studied in more details later). Some quick results for H2O:\n\nn     |  Model  |  Time (sec) |   AUC \n------|---------|-------------|--------\n10M   |  Linear |    5        |   71.0  \n0.1M  |  RF     |    150      |   72.5\n10M   |  RF     |    4000     |   77.8\n\n\n### Random Forest\n\n**Note:** The random forests results have been published in a more organized and self-contained form\nin [this blog post](http:\u002F\u002Fdatascience.la\u002Fbenchmarking-random-forest-implementations\u002F).\n\nRandom forests with 500 trees have been trained in each tool choosing the default of square root of *p* as the number of\nvariables to split on.\n\nTool    | *n*  |   Time (sec)  | RAM (GB) | AUC\n-------------------------|------|---------------|----------|--------\nR       | 10K  |      50       |   10     | 68.2\n.       | 100K |     1200      |   35     | 71.2\n.       | 1M   |     crash     |          |\nPython  | 10K  |      2        |   2      | 68.4\n.       | 100K |     50        |   5      | 71.4\n.       | 1M   |     900       |   20     | 73.2\n.       | 10M  |     crash     |          |\nH2O     | 10K  |      15       |   2      | 69.8\n.       | 100K |      150      |   4      | 72.5\n.       | 1M   |      600      |    5     | 75.5\n.       | 10M  |     4000      |   25     | 77.8\nSpark   | 10K  |      50       |   10     | 69.1\n.       | 100K |      270      |   30     | 71.3\n.       | 1M   |  crash\u002F2000   |          | 71.4\nxgboost | 10K  |     4         |    1     | 69.9\n.       | 100K |    20         |    1     | 73.2\n.       | 1M   |    170        |    2     | 75.3\n.       | 10M  |    3000       |    9     | 76.3\n\n![plot-time](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_c0d809fd0db4.png)\n![plot-auc](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_93846bfa3598.png)\n\nThe [R](2-rf\u002F1.R) implementation (randomForest package) is slow and inefficient in memory use. \nIt cannot cope by default with a large number of categories, therefore the data had\nto be one-hot encoded. The implementation uses 1 processor core, but with 2 lines of extra code\nit is easy to build\nthe trees in parallel using all the cores and combine them at the end. However, it runs out\nof memory already for *n* = 1M. I have to emphasize this has nothing to do with R per se (and I still stand by\narguing R is the best data science platform esp. when it comes to data munging of structured data or\nvisualization), it is just this\nparticular (C and Fortran) RF implementation used by the randomForest package that is inefficient.\n\nThe [Python](2-rf\u002F2.py) (scikit-learn) implementation is faster, more memory efficient and uses all the cores.\nVariables needed to be one-hot encoded (which is more involved than for R) \nand for *n* = 10M doing this exhausted all the memory. Even if using a larger machine\nwith 250GB of memory (and 140GB free for RF after transforming all the data) the Python implementation\nruns out of memory and crashes for this larger size. The algo \n[finished successfully](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fissues\u002F1) \nthough when run on the larger box with simple integer encoding (which\nfor some datasets\u002Fcases might be actually a good approximation\u002Fchoice).\n\nThe [H2O](2-rf\u002F4-h2o.R) implementation is fast, memory efficient and uses all cores. It deals\nwith categorical variables automatically. It is also more accurate than the studied R\u002FPython packages, \nwhich may be because\nof dealing properly with the categorical variables, i.e. internally in the algo\nrather than working from a previously 1-hot encoded dataset (where the link between the dummies \nbelonging to the same original variable is lost).\n\n[Spark](2-rf\u002F5b-spark.txt) (MLlib) implementation is slower and has a larger memory footprint.\nIt runs out of memory already at *n* = 1M (with 250G of RAM it finishes for *n* = 1M, \nbut it crashes for *n* = 10M). However, as Spark\ncan run on a cluster one can throw in even more RAM by using more nodes.\nI also tried to provide the categorical\nvariables encoded simply as integers and passing the `categoricalFeaturesInfo` parameter, but that made\ntraining much slower.\nA convenience issue, reading the data is more than one line of code and at the start of this benchmark project\nSpark did not provide a one-hot encoder\nfor the categorical data (therefore I used R for that). This has been ammnded since, thanks @jkbradley\nfor native 1-hot encoding [code](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fblob\u002Fa04f7136438598ce700c3adbb0fee2efa29488f3\u002Fz-other-tools\u002F5xa-spark-1hot.txt).\nIn earlier versions of this benchmark there was an issue of Spark random forests having\nlow prediction accuracy vs the other methods. This was due to aggregating votes rather than probabilities\nand it has been addressed by @jkbradley in this \n[code](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fblob\u002Fmaster\u002F2-rf\u002F5b-spark.txt#L64) (will be included in next Spark release).\nThere is still an open issue on the accuracy for *n* = 1M (see the breaking trend in the AUC graph).\nTo get more insights on the issues above see\n[more comments](http:\u002F\u002Fdatascience.la\u002Fbenchmarking-random-forest-implementations\u002F#comment-53599) \nby Joseph Bradley @jkbradley of Databricks\u002FSpark project (thanks, Joseph).\n\n**Update (September 2016):** Spark 2.0 introduces a new API (Pipelines\u002F\"Spark ML\" vs \"Spark MLlib\") and the \n[code](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fblob\u002F406a00e9e501405589d234607e56f64a35ab1ddf\u002Fz-other-tools\u002F5xb-spark-trainpred--sp20.txt) becomes significantly simpler.\nFurthermore, Spark 1.5, 1.6 and 2.0 introduced several optimizations (\"Tungsten\") that have improved significantly for example the speed on queries (SparkSQL).\nHowever, there is no speed improvement for random forests, they actually got a bit \n[slower](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Ftree\u002Fmaster\u002Fz-other-tools#how-to-benchmark-your-tool-of-choice-with-minimal-work).\n\nI also tried [xgboost](2-rf\u002F6-xgboost.R), a popular library for boosting which is capable to build \nrandom forests as well. It is fast, memory efficient and of high accuracy. Note the different shapes of the\nAUC and runtime vs dataset size curves for H2O and xgboost, some discussions \n[here](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fissues\u002F14).\n\nBoth H2O and xgboost have interfaces from R and Python.\n\nA few other RF implementations (open source and commercial as well) \nhave been benchmarked quickly on 1M records and runtime and AUC are \n[reported here](z-other-tools).\n\nIt would be nice to study the dependence of running time and accuracy as a function of\nthe (hyper)parameter values of the algorithm, but a quick idea can be obtained easily for the\nH2O implementation from this table (*n* = 10M on 250GB RAM):\n\nntree    | depth  |   nbins  | mtries  | Time (hrs)   |  AUC\n---------|--------|----------|---------|--------------|--------\n500      |  20    |    20    | -1 (2)  |      1.2     |  77.8 \n500      |  50    |    200   | -1 (2)  |      4.5     |  78.9\n500      |  50    |    200   |   3     |      5.5     |  78.9\n5000     |  50    |    200   | -1 (2)  |      45      |  79.0\n500      |  100   |   1000   | -1 (2)  |      8.3     |  80.1\n\nother hyperparameters being sample rate (at each tree), min number of observations in nodes, impurity\nfunction.\n\nOne can see that the AUC could be improved further and the best AUC from this dataset with random forests\nseems to be around 80 (the best AUC from linear models seems to be around 71, and we will compare\nwith boosting and deep learning later).\n\n\n\n### Boosting (Gradient Boosted Trees\u002FGradient Boosting Machines)\n\nCompared to random forests, GBMs have a more complex relationship between hyperparameters\nand accuracy (and also runtime). The main hyperparameters are learning (shrinkage) rate, number of trees, \nmax depth of trees, while some others are number of bins, sample rate (at each tree), min number of \nobservations in nodes. To add to complexity, GBMs can overfit in the sense that adding more trees at some point will\nresult in decreasing accuracy on a test set (while on the training set \"accuracy\" keeps increasing).\n\nFor example using xgboost for `n = 100K` `learn_rate = 0.01` `max_depth = 16` (and the\n`printEveryN = 100` and `eval_metric = \"auc\"` options) the AUC on the train and test sets,\nrespectively after `n_trees` number of iterations are:\n\n![plot-overfit](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_8c71f43a7ea0.png)\n\nOne can see the AUC on the test set decreases after 1000 iterations (overfitting). \nxgboost has a handy early stopping option (`early_stop_round = k` - training\nwill stop if performance e.g. on a holdout set keeps getting worse consecutively \nfor `k` rounds). If one does not know where to stop, one might underfit (too few iterations)\nor overfit (too many iterations) and the resulting model will be suboptimal in accuracy\n(see Fig. above).\n\nDoing an extensive search for the best model is not the main goal of this project.\nNevertheless, a quick \n[exploratory search](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fblob\u002Fmaster\u002F3-boosting\u002F0-xgboost-init-grid.R) \nin the hyperparameter space has been\nconducted using xgboost (with the early stopping option). For this a separate validation\nset of size 100K from 2007 data not used in the test set has been generated. The goal is\nto find parameter values that provide decent accuracy and then run all GBM implementations\n(R, Python scikit-learn, etc) with those parameter values to compare speed\u002Fscalability (and \naccuracy).\n\nThe smaller the `learn_rate` the better the AUC, but for very small values training time increases dramatically, \ntherefore we use `learn_rate = 0.01` as a compromise. \nUnlike recommended in much of the literature, shallow trees don't produce best (or close to best) results, \nthe grid search showed better accuracy e.g. with `max_depth = 16`.\nThe number of trees to produce optimal results for the above hyperparameter values depend though on the training set size. \nFor `n_trees = 1000` we don't reach the overfitting regime\nfor either size and we use this value for studying the speed\u002Fscalability of the different implementations. \n(Values for the other hyper-parameters that seem to work well are: \n`sample_rate = 0.5` `min_obs_node = 1`.) We call this experiment A (in the table below).\n\nUnfortunately some implementations take too much time to run for the above parameter values\n(and Spark runs out of memory). Therefore, another set of parameter values (that provide lower accuracy but faster training times)\nhas been also used to study speed\u002Fscalability: `learn_rate = 0.1` `max_depth = 6` `n_trees = 300`. \nWe call this experiment B.\n\nI have to emphasize that while I make the effort to match parameter values for all algos\u002Fimplementations,\nevery implementation is different, some don't have all the above parameters, while some might\nuse the existing ones in a slightly different way (you can also see the resulting model\u002FAUC is somewhat different).\nNevertheless, the results below give us a pretty good idea of how the implementations compare to each other.\n\n\nTool    | *n*  | Time (s) A    | Time (s) B | AUC A  | AUC B  | RAM(GB) A | RAM(GB) B\n--------|------|---------------|------------|--------|--------|-----------|-----------\nR       | 10K  |   20          |   3        |   64.9 |  63.1  |    1      |     1\n.       | 100K |   200         |   30       |   72.3 |  71.6  |    1      |     1\n.       | 1M   |   3000        |   400      |   74.1 |  73.9  |    1      |     1\n.       | 10M  |               |   5000     |        |  74.3  |           |     4\nPython  | 10K  |    1100       |    120     |   69.9 |  69.1  |    2      |     2\n.       | 100K |               |   1500     |        |  72.9  |           |     3\n.       | 1M   |               |            |        |        |           |\n.       | 10M  |               |            |        |        |           |\nH2O     | 10K  |    90         |    7       |  68.2  |  67.7  |    3      |   2\n.       | 100K |   500         |    40      |  71.8  |  72.3  |    3      |   2\n.       | 1M   |   900         |    60      |  75.9  |  74.3  |    9      |   2\n.       | 10M  |   3500        |    300     |  78.3  |  74.6  |    11     |   20\nSpark   | 10K  |  180000       |   700      |  66.4  |  67.8  |    30     |   10\n.       | 100K |               |   1200     |        |  72.3  |           |   30\n.       | 1M   |               |   6000     |        |  73.8  |           |   30 \n.       | 10M  |               |   (60000)  |        | (74.1) |           | crash (110) \nxgboost | 10K  |   6           |     1      |  70.3  |  69.8  |   1       |  1\n.       | 100K |   40          |     4      |  74.1  |  73.5  |   1       |  1\n.       | 1M   |   400         |     45     |  76.9  |  74.5  |   1       |  1\n.       | 10M  |   9000        |    1000    |  78.7  |  74.7  |   6       |  5\n\n![plot-time](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_e3f2d6a51268.png)\n![plot-auc](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_434cd40f69b2.png)\n\nThe memory footprint of GBMs is in general smaller than for random forests, therefore the\nbottleneck is mainly training time (although besides being slow Spark is inefficient in memory use as well\nespecially for deeper trees, therefore it crashes).\n\nSimilar to random forests, H2O and xgboost are the fastest (both use\nmultithreading). R does relatively well considering that it's a single-threaded implementation.\nPython is very slow with one-hot encoding of categoricals, but almost as fast as R (just 1.5x slower) with\nsimple\u002Finteger encoding. Spark is slow and memory inefficient,\nbut at least for shallow trees it achieves similar accuracy to the other methods (unlike in\nthe case of random forests, where Spark provides lower accuracy than\nits peers).\n\nCompared to random forests, boosting requires more tuning to get a good choice of hyperparameters.\nQuick results for H2O and xgboost with `n = 10M` (largest data)\n`learn_rate = 0.01` (the smaller the better\nAUC, but also longer and longer training times) `max_depth = 20` (after rough search with \n`max_depth = 2,5,10,20,50`) `n_trees = 5000` (close to xgboost early stop)\n`min_obs_node = 1` (and `sample_rate = 0.5` for xgboost, `n_bins = 1000` for H2O):\n\nTool    |  Time (hr) |   AUC\n--------|------------|---------\nH2O     |   7.5      |   79.8\nH2O-3   |   9.5      |   81.2\nxgboost |   14       |   81.1\n\nCompare with H2O random forest from previous section (Time 8.3\thr, AUC 80.1).\nH2O-3 is the new generation\u002Fversion of H2O. \n\n**Update (May 2017):** A new tool for GBMs, LightGBM came out recently. While it's not (yet) as widely used as the tools above,\nit is now the fastest one. There is also recent work in running xgboost and LightGBM on GPUs. Therefore I started a new \n(leaner) github repo to keep track of the best GBM tools \n[here](https:\u002F\u002Fgithub.com\u002Fszilard\u002FGBM-perf) (and ignore mediocre tools such as Spark).\n\n**Update (January 2018)**: I dockerized the GBM measurements for h2o, xgboost and lightgbm (both CPU and GPU versions). The repo linked in \nthe paragraph above will contain all further development w.r.t. GBM implementations. GBMs are typically the most accurate algos\nfor supervised learning on structured\u002Ftabular data and therefore of my main interest \n(e.g. compared with the other 3 algos discussed in this current benchmark - linear models, random forests and neural networks), \nand the dockerization makes it easier to keep that other repo up to date with tests on the newest versions of the tools and\npotentially adding new ML tools. **Therefore this new [GBM-perf](https:\u002F\u002Fgithub.com\u002Fszilard\u002FGBM-perf) repo can be considered as\na \"successor\" of the current one.**\n\n### Deep neural networks\n\nDeep learning has been extremely successful on a few classes of data\u002Fmachine learning problems such as involving images, \nspeech and text (supervised learning) and games (reinforcement learning).\nHowever, it seems that in \"traditional\" machine learning problems such as fraud detection, credit scoring or churn,\ndeep learning is not as successful and it provides lower accuracy than random forests or gradient boosting machines. \nMy experiments (November 2015) on the airline dataset used in this repo and also on another \ncommercial dataset have [conjectured](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fissues\u002F28) this, \nbut unfortunately most of the hype surrounding deep learning and \"artificial intelligence\" overwhelms this reality,\nand there are only a few references in this direction e.g. \n[here](https:\u002F\u002Fwww.quora.com\u002FWhy-is-xgboost-given-so-much-less-attention-than-deep-learning-despite-its-ubiquity-in-winning-Kaggle-solutions\u002Fanswer\u002FTianqi-Chen-1),\n[here](https:\u002F\u002Fspeakerdeck.com\u002Fdatasciencela\u002Ftianqi-chen-xgboost-implementation-details-la-workshop-talk?slide=28)\nor [here](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=8KzjARKIgTo#t=28m15s).\n\nHere are the results of a few fully connected network architectures \n[trained](4-DL\u002F1-h2o.R)\nwith various optimization schemes (adaptive, rate annealing, momentum etc.) \nand various regularizers (dropout, L1, L2) \nusing H2O with early stopping on the 10M dataset:\n\nParams                                                               |  AUC  |  Time (s) | Epochs \n---------------------------------------------------------------------|-------|-----------|----------\ndefault: `activation = \"Rectifier\", hidden = c(200,200)`             | 73.1  |    270    |  1.8\n`hidden = c(50,50,50,50), input_dropout_ratio = 0.2`                 | 73.2  |    140    |  2.7\n`hidden = c(50,50,50,50)`                                            | 73.2  |    110    |  1.9\n`hidden = c(20,20)`                                                  | 73.1  |    100    |  4.6\n`hidden = c(20)`                                                     | 73.1  |    120    |  6.7\n`hidden = c(10)`                                                     | 73.2  |    150    |  12\n`hidden = c(5)`                                                      | 72.9  |    110    |  9.3\n`hidden = c(1)` (~logistic regression)                               | 71.2  |    120    |  13\n`hidden = c(200,200), l1 = 1e-5, l2 = 1e-5`                          | 73.1  |    260    |  1.8\n`RectifierWithDropout, c(200,200,200,200), dropout=c(0.2,0.1,0.1,0)` | 73.3  |    440    |  2.0\n`ADADELTA rho = 0.95, epsilon = 1e-06`                               | 71.1  |    240    |  1.7\n` rho = 0.999, epsilon = 1e-08`                                      | 73.3  |    270    |  1.9\n`adaptive = FALSE` default: `rate = 0.005, decay = 1, momentum = 0`  | 73.0  |    340    |  1.1\n`rate = 0.001, momentum = 0.5 \u002F 1e5 \u002F 0.99`                          | 73.2  |    410    |  0.7\n`rate = 0.01, momentum = 0.5 \u002F 1e5 \u002F 0.99`                           | 73.3  |    280    |  0.9\n`rate = 0.01, rate_annealing = 1e-05, momentum = 0.5 \u002F 1e5 \u002F 0.99`   | 73.5  |    360    |  1\n`rate = 0.01, rate_annealing = 1e-04, momentum = 0.5 \u002F 1e5 \u002F 0.99`   | 72.7  |    3700   |  8.7\n`rate = 0.01, rate_annealing = 1e-05, momentum = 0.5 \u002F 1e5 \u002F 0.9`    | 73.4  |    350    |  0.9\n\n\nIt looks like the neural nets are underfitting and are not able to capture the same structure in the\ndata as the random forests\u002FGBMs can (AUC 80-81). Therefore adding various forms of regularization\ndoes not improve accuracy (see above). Note also that by using early stopping (based on the decrease of\naccuracy on a validation dataset during training iterations) the training takes relatively short time\n(compared to RF\u002FGBM), also a sign of effectively low model complexity.\nRemarkably, the nets with more layers (deep) are not performing better than a simple net with\n1 hidden layer and a small number of neurons in that layer (10). \n\nTiming on the 1M dataset of various tools (fully connected networks, 2 hidden layers, 200 neurons each, ReLU,  \nSGD, learning rate 0.01, momentum 0.9, 1 epoch), code \n[here](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Ftree\u002Fmaster\u002F4-DL):\n\nTool         | Time GPU  | Time CPU\n-------------|-----------|-----------\nh2o          |    -      |   50\nmxnet        |    35     |   65\nkeras+TF     |    35     |   60\nkeras+theano |    25     |   70\n\n(GPU = p2.xlarge, CPU = r3.8xlarge 32c for h2o\u002Fmxnet, p2.xlarge 4c for TF\u002Ftheano, theano uses 1 core only)\n\nDespite not being great (in accuracy) on tabular data of the type above, \ndeep learning has been a blast in domains such as image, speech and somewhat text,\nand I'm planing to do a [benchmark of tools](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-dl) \nin that area as well (mostly conv-nets and RNNs\u002FLSTMs).\n\n\n\n### Big(ger) Data\n\nWhile my primary interest is in machine learning on datasets of 10M records, you might be interested in \nlarger datasets. Some problems might need a cluster, though there has been a tendency recently \nto solve every problem with distributed computing, needed or not. As a reminder, sending data\nover a network vs using shared memory is a big speed difference. Also several popular distributed systems\nhave significant computation and memory overhead, and more fundamentally, their communication patterns\n(e.g. map-reduce style) are not the best fit for many of the machine learning algos.\n\n#### Larger Data Sizes (on a Single Server)\n\nFor linear models, most tools, including single-core R work well on 100M records still\non a single server (r3.8xlarge instance with 32 cores, 250GB RAM used here).\n(A 10x copy of the 10M dataset has been used, therefore information on AUC vs size is invalid\nand is not considered here.)\n\nLinear models, 100M rows:\n\nTool    |   Time[s]   |   RAM[GB]\n--------|-------------|-----------\nR       |   1000      |    60\nSpark   |    160      |    120\nH2O     |    40       |    20\nVW      |    150      |\n\nSome tools can handle 1B records on a single machine\n(in fact VW never runs out of memory, so if larger runtimes are acceptable,\nyou can go further still on one machine).\n\nLinear models, 1B rows:\n\nTool    |   Time[s]   |   RAM[GB]\n--------|-------------|-----------\nH2O     |    500      |    100\nVW      |    1400     |\n\nFor tree-based ensembles (RF, GBM) H2O and xgboost can train on 100M records\non a single server, though the training times become several hours:\n\nRF\u002FGBM, 100M rows:\n\nAlgo    |Tool     |   Time[s]   |   Time[hr]  | RAM[GB]\n--------|---------|-------------|-------------|----------\nRF      | H2O     |   40000     |     11      | 80\n.       | xgboost |   36000     |     10      | 60\nGBM     | H2O     |   35000     |     10      | 100   \n.       | xgboost |   110000    |     30      | 50\n\nOne usually hopes here (and most often gets) much better accuracy for the 1000x in training time vs linear models.\n\n\n#### Distributed Systems\n\nSome quick results:\n\nH2O logistic runtime (sec):\n\nsize    |  1 node |  5 nodes\n--------|---------|----------\n100M    |   42    |   9.9 \n1B      |  480    |   101 \n\nH2O RF runtime (sec) (5 trees):\n\nsize    |  1 node |  5 nodes\n--------|---------|----------\n10M     |   42    |   41     \n100M    |  405    |   122\n\n\n\n## Summary\n\nAs of January 2018:\n\nWhen I started this benchmark in March 2015, the \"big data\" hype was all the rage, and the fanboys wanted to do\nmachine learning on \"big data\" with distributed computing (Hadoop, Spark etc.), while for the datasets most people had\nsingle-machine tools were not only good enough, but also faster, with more features and less bugs. I gave quite a few\ntalks at conferences and meetups about these benchmarks starting 2015 \nand while at the beginning I had several people asking angrily about my results on Spark, by 2017 most people realized single machine\ntools are much better for solving most of their ML problems. While Spark is a decent tool for ETL on raw data (which \noften is indeed \"big\"), its ML libraries are totally garbage and outperformed (in training time, memory footpring and\neven accuracy) by much better tools by orders of magnitude. \nFurthermore, the increase in available RAM over the last years in servers and also in the cloud,\nand the fact that for machine learning one typically refines the raw data \ninto a much smaller sized data matrix is making the mostly single-machine highly-performing tools \n(such as xgboost, lightgbm, VW but also h2o) the best choice for most \npractical applications now. The big data hype is finally over.\n\nWhat's happening now is a new wave of hype, namely deep learning. The fanboys now think deep learning (or as they miscall it:\nAI) is the best solution to all machine learning problems. While deep learning has been extremely\nsuccessful indeed on a few classes of data\u002Fmachine learning problems such as involving images, \nspeech and somewhat text (supervised learning) and games\u002Fvirtual environments (reinforcement learning),\nin more \"traditional\" machine learning problems encountered in business such as fraud detection, credit scoring or churn\n(with structured\u002Ftabular data) deep learning is not as successful and it provides lower accuracy \nthan random forests or gradient boosting machines (GBM). Therefore, lately I'm concentrating my benchmarking efforts \nmostly on GBM implementations and \nI have started a new github repo [GBM-perf](https:\u002F\u002Fgithub.com\u002Fszilard\u002FGBM-perf) that's more \"focused\" and lean \nand also uses more modern tools (such as docker) to make the benchmarks more maintainable and reproducible. Also, it has become\napparent recently that GPUs can be a powerful computing platform for GBMs too, and the new repo includes benchmarks \nof the available GPU implementations as well.\n\nI started these benchmarks mostly out of curiousity and the desire to learn (and also in order to be able to choose \ngood tools for my projects). It's been quite some experience and I'd like to thank all the folks (especially the developers of\nthe tools) for helping me in tuning and getting the most out of their ML tools. \nAs a side effect of this work I had the pleasure to be invited to talk at several conferences\n(KDD, R-finance, useR!, eRum, H2O World, Crunch, Predictive Analytics World, EARL, Domino Data Science Popup, Big Data Day LA,\nBudapest Data Forum) and to over 10 meetups, e.g.:\n\n- KDD **Invited Talk** - Machine Learning Software in Practice: Quo Vadis? - Halifax, Canada, August 2017\n- R in Finance **Keynote** - No-Bullshit Data Science - Chicago, May 2017\n- LA Data Science Meetup - Machine Learning in Production - Los Angeles, May 2017\n- useR! 2016 - Size of Datasets for Analytics and Implications for R - Stanford, June 2016\n- H2O World - Benchmarking open source ML platforms - Mountain View, November 2015\n- LA Machine Learning Meetup - Benchmarking ML Tools for Scalability, Speed and Accuracy - LA, June 2015\n\n(see code\u002Fslides and for some video recordings [here](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml-talks)). These talks\u002Fmaterials are also\nprobably the best place to get a grasp on the findings of this benchmark (and if you want to pick the one that is most\nup to date and summarizes the most watch the \n[video of my KDD talk](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=8wyOwUNw7D8&list=PLliTSxmRFGVO6Vag6FX5Jfq-RG-kUtKFZ&index=11)).\nThe work goes on, expect more results...\n\n## Citation\n\nIf `benchm-ml` was useful for your research, please consider citing it, for instance using the latest commit:\n\n```\n@misc{,\n\tauthor = {Pafka, Szilard},\n\ttitle = {benchm-ml},\n\tpublisher = {GitHub},\n\tyear = {2019},\n\tjournal = {GitHub repository},\n\turl = {https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml},\n\thowpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml}},\n\tcommit = {13325ce3edd7c902390197f43bcc7938c306bbe3}\n}\n```\n","## 用于分类的机器学习库在可扩展性、速度和准确度方面的简单\u002F有限\u002F不完整基准测试\n\n_**所有基准测试都有其局限性，但有些仍然有用**_\n\n本项目旨在为几种常用机器学习算法的实现提供一个*最小化*的基准测试，重点评估其在可扩展性、速度和准确度方面的表现。研究对象是二分类问题，输入数据包含数值型和类别型特征（类别数量有限，即非非常稀疏），且无缺失值——这可能是商业应用中最常见的场景，例如信用评分、欺诈检测或客户流失预测。假设输入矩阵为 *n* × *p*，其中 *n* 分别取 1万、10万、100万、1000万，而 *p* 约为 1000（通过将类别型变量转换为哑变量\u002F独热编码后得到）。这种特定的数据结构和规模（最大规模）源于作者对某些特定商业应用的兴趣。\n\n**该基准测试的大部分工作完成于2015年，此后随着技术发展进行了一些更新。请务必阅读本仓库的[结尾部分](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml#summary)，了解研究重点随时间的变化，以及为何我没有继续维护此基准测试，而是启动了一个新的项目（新项目地址见文末）。**\n\n所研究的算法包括：\n- 线性模型（逻辑回归、线性支持向量机）\n- 随机森林\n- 梯度提升树\n- 深度神经网络\n\n这些算法分别由多种常用的开源实现来执行，例如：\n- R 包\n- Python 的 scikit-learn\n- Vowpal Wabbit\n- H2O\n- XGBoost\n- LightGBM（2017 年加入）\n- Spark MLlib。\n\n**更新（2015年6月）：** 事实证明，这些正是目前机器学习领域中[最受欢迎的工具](https:\u002F\u002Fgithub.com\u002Fszilard\u002Flist-ml-tools)。如果你使用的工具未在此列出，只需按照[以下说明](z-other-tools)即可轻松完成一个最小化的基准测试。\n\n随机森林、梯度提升树以及近年来兴起的深度神经网络，预计将在上述数据结构和规模下表现出最佳性能（相较于 *k* 近邻、朴素贝叶斯、决策树、线性模型等其他方法而言）。非线性支持向量机通常也具有较高的准确度，但在处理较大规模的 *n* 时会变得缓慢或无法扩展。相比之下，线性模型的准确度较低，因此仅作为基准参考；不过，它们在可扩展性方面表现更好，且部分模型能够处理非常稀疏的特征，因此在其他应用场景中仍具优势。\n\n此处所说的“可扩展性”是指算法能够在给定数据规模下，在内存资源受限的情况下（即运行过程中不会因内存不足而崩溃）以合理的时间完成训练。部分算法或实现支持分布式计算，但本研究中最大的数据集规模仅为 1000 万条记录，数据量不足 1GB，因此无需扩展到多台机器上进行计算，这也并非本研究的重点。此外，某些算法在多节点环境下由于通信依赖网络而非共享内存，性能反而会有所下降。\n\n单节点环境下的速度主要取决于算法的计算复杂度，同时也受是否能利用多核处理器的影响。准确度则通过 AUC 指标进行衡量。本项目并不关注模型的可解释性。\n\n综上所述，本研究聚焦于哪些算法或实现能够在普通硬件设备上（主要是具备足够内存和多核处理器的一台机器）训练出相对准确的二分类模型，用于处理拥有数百万条样本、数千个特征的数据集。\n\n## 数据\n\n训练数据集的规模分别为 1万、10万、100万和 1000 万条记录，均基于著名的航空数据集生成（使用 2005 年和 2006 年的数据）。测试集规模为 10 万条记录，同样来自该数据集（使用 2007 年的数据）。任务是预测航班是否会晚点超过 15 分钟。尽管我们主要关注算法或实现的可扩展性，但同时也可以观察到，随着数据量的增加（即观测样本的增多），同一模型所能获取的信息量及准确度会有何变化，这也是一个有趣的问题。\n\n## 环境搭建\n\n测试在 Amazon EC2 c3.8xlarge 实例上进行（32 核，60GB 内存）。所用工具均为免费软件，安装过程十分简单（[安装说明](0-init\u002F1-install.md)见附录）。对于在较大数据规模下出现内存不足的模型，偶尔也会使用 r3.8xlarge 实例（32 核，250GB 内存）。而在 GPU 上进行深度学习时，则采用了 p2.xlarge 实例（1 块 GPU，显存 12GB，4 核 CPU，60GB 内存）。\n\n**更新（2018年1月）：** 更现代的做法是使用 Docker 完全自动地安装所有机器学习软件，并自动进行计时与测试运行，这样可以更方便地针对工具的新版本重新运行测试，提高结果的可重复性等。这种方法已被应用于本基准测试的一个后续项目，该项目专注于性能最佳的 GBM 实现，详情请参阅[此处](https:\u002F\u002Fgithub.com\u002Fszilard\u002FGBM-perf)。\n\n## 结果\n\n对于每个算法\u002F工具以及每种 *n* 规模，我们记录了以下指标：训练时间、训练过程中占用的最大内存、各核心的 CPU 使用率，以及作为预测准确度指标的 AUC 值。此外，还记录了读取数据、数据预处理和测试集评分所需的时间，但由于这些步骤并非瓶颈，因此未予报告。\n\n### 线性模型\n\n线性模型并非本研究的主要关注点，因为在该类型的数据上，它们的准确率不如更复杂的模型。此处分析线性模型仅是为了获得一个基准参考。\n\nR语言中的glm函数（用于逻辑回归的基本工具）速度非常慢，在样本量n=10万的情况下耗时500秒，AUC为70.6。因此，在R中使用了glmnet包。而在Python\u002Fscikit-learn中，则使用了基于LIBLINEAR C++库的LogisticRegression。\n\n工具    | *n*  |   时间（秒）  | 内存（GB） | AUC\n--------|------|---------------|----------|--------\nR       | 1万  |      0.1      |   1      | 66.7\n.       | 10万 |      0.5      |   1      | 70.3\n.       | 100万 |      5        |   1      | 71.1\n.       | 1000万 |      90       |   5      | 71.1\nPython  | 1万  |      0.2      |   2      | 67.6\n.       | 10万 |       2       |   3      | 70.6\n.       | 100万 |       25      |   12     | 71.1\n.       | 1000万 |  崩溃\u002F360    |          | 71.1\nVW      | 1万  |     0.3 (\u002F10) |          | 66.6\n.       | 10万 |      3 (\u002F10)  |          | 70.3\n.       | 100万 |      10 (\u002F10) |          | 71.0\n.       | 1000万 |     15        |          | 71.0\nH2O     | 1万  |      1        |   1      | 69.6\n.       | 10万 |      1        |   1      | 70.3\n.       | 100万 |      2        |   2      | 70.8\n.       | 1000万 |      5        |   3      | 71.0\nSpark   | 1万  |      1        |   1      | 66.6\n.       | 10万 |      2        |   1      | 70.2\n.       | 100万 |      5        |   2      | 70.9\n.       | 1000万 |      35       |   10     | 70.9\n\nPython在60GB内存的机器上会崩溃，但当内存增加到250GB时则可以完成训练（使用[稀疏格式](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fissues\u002F27)将有助于减少内存占用，并可能同时缩短运行时间）。表中列出的Vowpal Wabbit（VW）运行时间是针对较小数据规模进行10次遍历（在线学习）的结果。尽管VW可以在多个核心上运行（通过多个进程相互通信），但此处仍以最简单的方式（单核）执行。此外，还需注意的是，VW是在数据流式读取的过程中进行处理的，而其他工具的时间统计则不包括将数据加载到内存中的时间。\n\n用户可以调整各种参数（如正则化），甚至通过交叉验证在参数空间中进行搜索以提高准确率。然而，快速实验表明，至少对于较大的数据规模而言，正则化并不能显著提升准确率（这也是意料之中的，因为样本量n远大于特征数p）。\n\n![plot-time](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_73713b2c3b97.png)\n![plot-auc](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_48bcca78e020.png)\n\n此处的主要结论是：**即使在样本量达到1000万行的情况下，使用这些工具中的任意一种，也几乎可以在任何一台机器上在几秒钟内完成线性模型的训练。** H2O和VW在内存效率方面表现最佳（VW每次只需在内存中保存一条记录，因此是真正可扩展的解决方案）。同时，H2O和VW也是最快的（VW的运行时间包含了数据流式读取的时间）。再次强调，内存效率和速度上的差异只有在更大的数据规模下才会变得显著，而这已超出了本研究的范围。\n\n\n#### 线性模型与非线性模型的学习曲线\n\n\u003Ca name=\"rf-vs-linear\">\u003C\u002Fa>\n对于*该数据集*而言，线性模型的准确率在数据规模达到中等水平时便趋于稳定，而非线性模型（例如随机森林）的准确率则会随着数据量的增加而持续提升。这是因为简单的线性结构已经可以从较小的数据集中提取出来，再增加数据点并不会显著改变分类边界。另一方面，像随机森林这样的复杂模型则可以通过进一步调整分类边界来随着数据量的增加而不断提升性能。\n\n这意味着，拥有更多数据（“大数据”）并不会进一步提升*线性*模型的准确率（至少对于该数据集而言）。\n\n此外值得注意的是，无论数据规模如何，随机森林模型的准确率都高于线性模型；并且与“数据越多算法越优”的传统观点相反，随机森林模型在仅使用1%数据（10万条记录）的情况下，其表现就已经优于线性模型在整个数据集（1000万条记录）上的表现。\n\n![plot-auc](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_6cc0dd2b2c90.png)\n\n类似的现象也可以在其他*非稀疏*数据集中观察到，例如[Higgs数据集](x1-data-higgs)。如果您有其他数据集（密集或稀疏）上线性模型与非线性模型的学习曲线，请与我联系（例如提交一个[github问题](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fissues)）。\n\n另一方面，更高的准确率无疑需要付出更大的代价，即更长的训练（CPU）时间。\n\n最终，数据规模、算法复杂度、计算成本与准确率之间存在一个权衡关系（将在后续研究中进一步探讨）。以下是H2O的一些初步结果：\n\nn     |  模型  |  时间（秒） |   AUC \n------|---------|-------------|--------\n1000万 |  线性  |    5        |   71.0  \n10万  |  雷达森林 |    150      |   72.5\n1000万 |  雷达森林 |    4000     |   77.8\n\n\n### 随机森林\n\n**注：** 随机森林的相关结果已在一篇更加系统化和自包含的博客文章中发布，详见[这篇博文](http:\u002F\u002Fdatascience.la\u002Fbenchmarking-random-forest-implementations\u002F)。\n\n在每个工具中，均使用500棵树的随机森林模型进行训练，并将默认的分裂变量数量设置为特征数p的平方根。\n\n工具    | *n*  |   时间（秒）  | 内存（GB） | AUC\n-------------------------|------|---------------|----------|--------\nR       | 1万  |      50       |   10     | 68.2\n.       | 10万 |     1200      |   35     | 71.2\n.       | 100万 |     崩溃     |          |\nPython  | 1万  |      2        |   2      | 68.4\n.       | 10万 |     50        |   5      | 71.4\n.       | 100万 |     900       |   20     | 73.2\n.       | 1000万 |     崩溃     |          |\nH2O     | 1万  |      15       |   2      | 69.8\n.       | 10万 |      150      |   4      | 72.5\n.       | 100万 |      600      |    5     | 75.5\n.       | 1000万 |     4000      |   25     | 77.8\nSpark   | 1万  |      50       |   10     | 69.1\n.       | 10万 |      270      |   30     | 71.3\n.       | 100万 |  崩溃\u002F2000   |          | 71.4\nxgboost | 1万  |     4         |    1     | 69.9\n.       | 10万 |    20         |    1     | 73.2\n.       | 100万 |    170        |    2     | 75.3\n.       | 1000万 |    3000       |    9     | 76.3\n\n![plot-time](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_c0d809fd0db4.png)\n![plot-auc](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_93846bfa3598.png)\n\n[R](2-rf\u002F1.R) 实现（randomForest 包）速度较慢，内存使用效率低下。默认情况下，它无法处理大量类别，因此数据必须进行独热编码。该实现仅使用一个处理器核心，但只需添加两行代码，即可利用所有核心并行构建决策树，并在最后将它们合并。然而，当样本量 *n* = 100 万时，它就已经耗尽了内存。需要强调的是，这与 R 语言本身无关（我仍然认为 R 是最佳的数据科学平台，尤其是在结构化数据的清洗和可视化方面），问题在于 randomForest 包所使用的这一特定 C 和 Fortran 实现效率较低。\n\nPython (scikit-learn) 实现速度更快、内存利用率更高，并且可以充分利用所有核心。变量同样需要进行独热编码（相比 R 更为复杂），但在 *n* = 1000 万的情况下，这一过程会耗尽所有内存。即使使用配备 250GB 内存的机器（在完成所有数据转换后，RF 还有 140GB 可用），Python 实现仍会在更大规模的数据上因内存不足而崩溃。不过，如果采用简单的整数编码方式，在更大的机器上运行该算法则能成功完成（对于某些数据集或场景，这可能实际上是一个不错的近似或选择）。\n\nH2O 实现速度快、内存效率高，并且能够利用所有核心。它能够自动处理分类变量，其准确性也高于所研究的 R 和 Python 包，这可能是因为它在算法内部直接正确地处理了分类变量，而不是基于预先进行独热编码的数据集（后者会导致属于同一原始变量的虚拟变量之间的关联丢失）。\n\nSpark (MLlib) 实现速度较慢，内存占用较大。在 *n* = 100 万时就会出现内存不足的情况（虽然在 250GB RAM 的机器上可以完成 *n* = 100 万的任务，但在 *n* = 1000 万时会崩溃）。不过，由于 Spark 可以在集群上运行，可以通过增加节点来进一步扩展内存。我也尝试过将分类变量简单地编码为整数，并传递 `categoricalFeaturesInfo` 参数，但这使得训练速度大幅下降。此外，数据读取操作需要多行代码，而在本基准测试项目开始时，Spark 尚未提供用于分类数据的独热编码器（因此我使用 R 来完成这一步骤）。后来这个问题得到了解决，感谢 @jkbradley 提供的原生独热编码代码 [链接](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fblob\u002Fa04f7136438598ce700c3adbb0fee2efa29488f3\u002Fz-other-tools\u002F5xa-spark-1hot.txt)。在此之前的版本中，Spark 随机森林的预测准确率低于其他方法，这是由于其采用投票机制而非概率加权所致。@jkbradley 已经通过这段代码解决了该问题 [链接](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fblob\u002Fmaster\u002F2-rf\u002F5b-spark.txt#L64)，预计将在下一次 Spark 发布中正式纳入。尽管如此，对于 *n* = 100 万的数据，准确率仍然存在一些问题（见 AUC 曲线中的下降趋势）。有关上述问题的更多见解，请参阅 Databricks\u002FSpark 项目负责人 Joseph Bradley (@jkbradley) 的评论 [链接](http:\u002F\u002Fdatascience.la\u002Fbenchmarking-random-forest-implementations\u002F#comment-53599)（感谢 Joseph）。\n\n**更新（2016年9月）：** Spark 2.0 引入了新的 API（Pipeline\u002F\"Spark ML\" vs \"Spark MLlib\"），相关代码 [链接](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fblob\u002F406a00e9e501405589d234607e56f64a35ab1ddf\u002Fz-other-tools\u002F5xb-spark-trainpred--sp20.txt) 显得更加简洁。此外，Spark 1.5、1.6 和 2.0 版本引入了多项优化技术（如 Tungsten），显著提升了查询速度（例如 SparkSQL）。然而，随机森林的性能并未得到改善，反而略有下降 [链接](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Ftree\u002Fmaster\u002Fz-other-tools#how-to-benchmark-your-tool-of-choice-with-minimal-work)。\n\n我还尝试了 xgboost（2-rf\u002F6-xgboost.R），这是一个流行的梯度提升库，同样可以构建随机森林。它速度快、内存效率高，且具有较高的准确率。请注意，H2O 和 xgboost 在 AUC 和运行时间随数据集规模变化的曲线形状上存在差异，相关讨论请参见 [此处](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fissues\u002F14)。\n\nH2O 和 xgboost 均提供了 R 和 Python 接口。\n\n此外，还有一些其他开源和商业的随机森林实现，已在 100 万条记录上进行了快速基准测试，其运行时间和 AUC 结果已在此处公布 [链接](z-other-tools)。\n\n理想情况下，应深入研究算法运行时间和准确率与超参数值之间的关系，但以 H2O 实现为例，我们可以通过下表（*n* = 1000 万，250GB RAM）快速获得初步了解：\n\n| ntree | depth | nbins | mtries | 时间（小时） | AUC |\n|-------|-------|-------|--------|-------------|-----|\n| 500   | 20    | 20    | -1 (2) | 1.2         | 77.8|\n| 500   | 50    | 200   | -1 (2) | 4.5         | 78.9|\n| 500   | 50    | 200   | 3      | 5.5         | 78.9|\n| 5000  | 50    | 200   | -1 (2) | 45          | 79.0|\n| 500   | 100   | 1000  | -1 (2) | 8.3         | 80.1|\n\n其他超参数包括每棵树的采样比例、节点中最小观测数以及不纯度函数等。可以看出，AUC 还有进一步提升的空间，而该数据集上随机森林的最佳 AUC 约为 80（线性模型的最佳 AUC 约为 71，后续我们将比较梯度提升和深度学习的结果）。\n\n\n\n### 梯度提升（梯度提升树\u002F梯度提升机）\n\n与随机森林相比，GBM 的超参数与准确率（以及运行时间）之间的关系更为复杂。主要超参数包括学习率（收缩率）、树的数量、树的最大深度；其他参数还有分箱数、每棵树的采样比例以及节点中的最小观测数。更复杂的是，GBM 容易发生过拟合——在某个点之后，继续增加树的数量反而会导致测试集上的准确率下降（而训练集上的“准确率”仍在上升）。\n\n例如，使用 xgboost 对 *n* = 10 万的数据进行训练，设置 `learn_rate = 0.01`、`max_depth = 16`，并启用 `printEveryN = 100` 和 `eval_metric = \"auc\"` 选项后，训练集和测试集在不同迭代次数后的 AUC 如下图所示：\n\n![plot-overfit](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_8c71f43a7ea0.png)\n\n可以看到，测试集的 AUC 在经过 1000 次迭代后开始下降，即发生了过拟合。xgboost 提供了一个便捷的早停选项（`early_stop_round = k`——如果在验证集等指标上连续 `k` 轮表现持续恶化，则停止训练）。如果不掌握合适的停止时机，就可能造成欠拟合（迭代次数不足）或过拟合（迭代次数过多），从而导致最终模型的准确率不理想（见上图）。\n\n对最佳模型进行广泛搜索并非本项目的首要目标。\n尽管如此，我们还是利用XGBoost（开启早停机制）在超参数空间中进行了一次快速的\n[探索性搜索](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fblob\u002Fmaster\u002F3-boosting\u002F0-xgboost-init-grid.R)。\n为此，我们从2007年的数据中单独划分出一个10万条记录的验证集，该集合未被用于测试集。\n我们的目标是找到能够提供合理准确率的参数值，然后使用这些参数值运行所有GBM实现（R、Python scikit-learn等），以比较其速度和可扩展性（以及准确率）。\n\n`learn_rate`越小，AUC越高，但当取值非常小时，训练时间会急剧增加。因此，我们选择`learn_rate = 0.01`作为折衷方案。\n与许多文献中的建议不同，浅层树并未带来最佳（或接近最佳）的结果；网格搜索显示，例如当`max_depth = 16`时，准确率更高。\n对于上述超参数值而言，达到最优结果所需的树的数量则取决于训练集的大小。\n对于`n_trees = 1000`，无论数据规模如何，我们都未进入过拟合状态，因此采用此值来研究不同实现的速度与可扩展性。\n（其他表现较好的超参数取值为：`sample_rate = 0.5`、`min_obs_node = 1`。）我们将这一实验称为实验A（见下表）。\n\n遗憾的是，某些实现方式在上述参数设置下耗时过长（且Spark会因内存不足而崩溃）。因此，我们还采用了另一组参数值（虽然准确率较低，但训练速度更快）来研究速度和可扩展性：`learn_rate = 0.1`、`max_depth = 6`、`n_trees = 300`。我们将这一实验称为实验B。\n\n我必须强调，尽管我努力使所有算法和实现的参数保持一致，但每种实现都有所不同——有些并不具备全部上述参数，而有些可能以略有不同的方式使用现有参数（您也可以看到，最终生成的模型及其AUC确实存在差异）。尽管如此，以下结果仍能让我们大致了解各实现之间的相对表现。\n\n工具    | *n*  | 时间（秒）A    | 时间（秒）B | AUC A  | AUC B  | 内存（GB）A | 内存（GB）B\n--------|------|---------------|------------|--------|--------|-----------|-----------\nR       | 1万  |   20          |   3        |   64.9 |  63.1  |    1      |     1\n.       | 10万 |   200         |   30       |   72.3 |  71.6  |    1      |     1\n.       | 100万 |   3000        |   400      |   74.1 |  73.9  |    1      |     1\n.       | 1000万 |               |   5000     |        |  74.3  |           |     4\nPython  | 1万  |    1100       |    120     |   69.9 |  69.1  |    2      |     2\n.       | 10万 |               |   1500     |        |  72.9  |           |     3\n.       | 100万 |               |            |        |        |           |\n.       | 1000万 |               |            |        |        |           |\nH2O     | 1万  |    90         |    7       |  68.2  |  67.7  |    3      |   2\n.       | 10万 |   500         |    40      |  71.8  |  72.3  |    3      |   2\n.       | 100万 |   900         |    60      |  75.9  |  74.3  |    9      |   2\n.       | 1000万 |   3500        |    300     |  78.3  |  74.6  |    11     |   20\nSpark   | 1万  |  180000       |   700      |  66.4  |  67.8  |    30     |   10\n.       | 10万 |               |   1200     |        |  72.3  |           |   30\n.       | 100万 |               |   6000     |        |  73.8  |           |   30 \n.       | 1000万 |               |   （60000）  |        | （74.1） |           | 崩溃（110）\nxgboost | 1万  |   6           |     1      |  70.3  |  69.8  |   1       |  1\n.       | 10万 |   40          |     4      |  74.1  |  73.5  |   1       |  1\n.       | 100万 |   400         |     45     |  76.9  |  74.5  |   1       |  1\n.       | 1000万 |   9000        |    1000    |  78.7  |  74.7  |   6       |  5\n\n![plot-time](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_e3f2d6a51268.png)\n![plot-auc](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_readme_434cd40f69b2.png)\n\nGBM的内存占用通常比随机森林要小，因此瓶颈主要在于训练时间（尽管除了速度慢之外，Spark在内存使用方面也效率低下，尤其是在处理较深的树时，这导致它容易崩溃）。\n\n与随机森林类似，H2O和XGBoost的速度最快（两者都支持多线程）。R的表现相对较好，考虑到它是一个单线程实现。Python在对分类变量进行独热编码时速度很慢，但在采用简单整数编码的情况下，其速度几乎与R相当（仅慢1.5倍）。Spark既慢又浪费内存，不过至少对于浅层树而言，它的准确率与其他方法相近（不同于随机森林的情况，在随机森林中，Spark的准确率低于同类方法）。\n\n相较于随机森林，提升法需要更多的调优才能选出合适的超参数。以下是针对H2O和XGBoost在`n = 1000万`（最大数据量）、`learn_rate = 0.01`（越小AUC越高，但训练时间也会越来越长）、`max_depth = 20`（经过初步搜索，分别尝试了`max_depth = 2, 5, 10, 20, 50`）、`n_trees = 5000`（接近XGBoost的早停条件）、`min_obs_node = 1`（以及XGBoost的`sample_rate = 0.5`、H2O的`n_bins = 1000`）时的快速结果：\n\n工具    |  时间（小时） |   AUC\n--------|------------|---------\nH2O     |   7.5      |   79.8\nH2O-3   |   9.5      |   81.2\nxgboost |   14       |   81.1\n\n相比之下，前一节中H2O随机森林的结果为：时间8.3小时，AUC80.1。H2O-3是H2O的新一代版本。\n\n**更新（2017年5月）：** 最近推出了一款新的GBM工具LightGBM。虽然它目前尚未像上述工具那样广泛应用，但已经是目前最快的工具之一。此外，最近还有将XGBoost和LightGBM运行在GPU上的研究进展。因此，我创建了一个更精简的GitHub仓库来跟踪最佳的GBM工具：\n[这里](https:\u002F\u002Fgithub.com\u002Fszilard\u002FGBM-perf)（并忽略诸如Spark之类的平庸工具）。\n\n**更新（2018年1月）：** 我已将H2O、XGBoost和LightGBM（包括CPU和GPU版本）的GBM性能测试容器化。上一段中链接的仓库将包含所有关于GBM实现的后续开发工作。GBM通常是结构化\u002F表格型数据监督学习中最准确的算法，因此也是我的主要研究兴趣所在（例如，与本次基准测试中讨论的其他三种算法——线性模型、随机森林和神经网络相比）。通过容器化，可以更方便地保持该仓库的最新状态，定期测试最新版本的工具，并有可能添加新的机器学习工具。**因此，这个新的[GBM-perf](https:\u002F\u002Fgithub.com\u002Fszilard\u002FGBM-perf)仓库可以被视为当前仓库的“继任者”。**\n\n### 深度神经网络\n\n深度学习在几类数据和机器学习问题上取得了巨大成功，例如图像、语音和文本处理（监督学习）以及游戏（强化学习）。然而，在“传统”机器学习问题中，比如欺诈检测、信用评分或客户流失预测，深度学习的表现似乎并不理想，其准确率往往低于随机森林或梯度提升机。我在2015年11月对该仓库中使用的航空公司数据集以及其他商业数据集进行的实验也得出了类似的结论，并在[这里](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml\u002Fissues\u002F28)提出了这一猜想。但遗憾的是，围绕深度学习和“人工智能”的炒作掩盖了这一现实，关于这一话题的讨论和文献相对较少，例如[这里](https:\u002F\u002Fwww.quora.com\u002FWhy-is-xgboost-given-so-much-less-attention-than-deep-learning-despite-its-ubiquity-in-winning-Kaggle-solutions\u002Fanswer\u002FTianqi-Chen-1)、[这里](https:\u002F\u002Fspeakerdeck.com\u002Fdatasciencela\u002Ftianqi-chen-xgboost-implementation-details-la-workshop-talk?slide=28)以及[这里](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=8KzjARKIgTo#t=28m15s)。\n\n以下是使用H2O框架在1000万条样本的数据集上训练的几种全连接网络架构的结果，采用了不同的优化方法（自适应、学习率退火、动量等）以及正则化手段（Dropout、L1、L2），并结合了早停策略：\n\n参数                                                               |  AUC  |  时间 (秒) | 轮数 \n---------------------------------------------------------------------|-------|-----------|----------\n默认：`activation = \"Rectifier\", hidden = c(200,200)`             | 73.1  |    270    |  1.8\n`hidden = c(50,50,50,50), input_dropout_ratio = 0.2`                 | 73.2  |    140    |  2.7\n`hidden = c(50,50,50,50)`                                            | 73.2  |    110    |  1.9\n`hidden = c(20,20)`                                                  | 73.1  |    100    |  4.6\n`hidden = c(20)`                                                     | 73.1  |    120    |  6.7\n`hidden = c(10)`                                                     | 73.2  |    150    |  12\n`hidden = c(5)`                                                      | 72.9  |    110    |  9.3\n`hidden = c(1)` (~逻辑回归)                               | 71.2  |    120    |  13\n`hidden = c(200,200), l1 = 1e-5, l2 = 1e-5`                          | 73.1  |    260    |  1.8\n`RectifierWithDropout, c(200,200,200,200), dropout=c(0.2,0.1,0.1,0)` | 73.3  |    440    |  2.0\n`ADADELTA rho = 0.95, epsilon = 1e-06`                               | 71.1  |    240    |  1.7\n`rho = 0.999, epsilon = 1e-08`                                      | 73.3  |    270    |  1.9\n`adaptive = FALSE` 默认：`rate = 0.005, decay = 1, momentum = 0`  | 73.0  |    340    |  1.1\n`rate = 0.001, momentum = 0.5 \u002F 1e5 \u002F 0.99`                          | 73.2  |    410    |  0.7\n`rate = 0.01, momentum = 0.5 \u002F 1e5 \u002F 0.99`                           | 73.3  |    280    |  0.9\n`rate = 0.01, rate_annealing = 1e-05, momentum = 0.5 \u002F 1e5 \u002F 0.99`   | 73.5  |    360    |  1\n`rate = 0.01, rate_annealing = 1e-04, momentum = 0.5 \u002F 1e5 \u002F 0.99`   | 72.7  |    3700   |  8.7\n`rate = 0.01, rate_annealing = 1e-05, momentum = 0.5 \u002F 1e5 \u002F 0.9`    | 73.4  |    350    |  0.9\n\n\n从结果来看，这些神经网络似乎存在欠拟合现象，无法像随机森林或梯度提升机那样捕捉到数据中的复杂结构（AUC达到80–81）。因此，添加各种形式的正则化措施并未显著提升模型的准确率。此外，值得注意的是，通过基于验证集准确率下降的早停机制，训练时间相对较短（相比随机森林和梯度提升机而言），这也表明模型的复杂度确实较低。令人惊讶的是，更深的网络结构（多层）并没有比仅有一层隐藏层且该层神经元数量较少（10个）的简单网络表现更好。\n\n以下是在100万条样本的数据集上对不同工具进行的计时结果（全连接网络，两层隐藏层，每层200个神经元，ReLU激活函数，SGD优化器，学习率0.01，动量0.9，训练1轮）：\n\n工具         | GPU时间  | CPU时间\n-------------|-----------|-----------\nh2o          |    -      |   50\nmxnet        |    35     |   65\nkeras+TF     |    35     |   60\nkeras+theano |    25     |   70\n\n（GPU为p2.xlarge，CPU为r3.8xlarge 32核用于h2o\u002Fmxnet；p2.xlarge 4核用于TF\u002Ftheano；theano仅使用1个核心）\n\n尽管在上述类型的表格数据上，深度学习的性能并不突出，但它在图像、语音以及一定程度上在文本处理等领域却表现出色。因此，我计划在这些领域也开展一次[工具基准测试](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-dl)，主要聚焦于卷积神经网络和RNN\u002FLSTM等模型。\n\n### 大（更）数据\n\n虽然我的主要兴趣在于处理1000万条记录的数据集上的机器学习，但你可能对更大的数据集更感兴趣。有些问题确实需要使用集群，不过最近有一种倾向，即无论是否必要，都倾向于用分布式计算来解决所有问题。需要注意的是，通过网络传输数据与使用共享内存相比，速度差异非常大。此外，一些流行的分布式系统存在显著的计算和内存开销，更根本的是，它们的通信模式（例如Map-Reduce风格）并不适合许多机器学习算法。\n\n#### 更大数据规模（单机）\n\n对于线性模型，大多数工具，包括单核R，在单台服务器上仍然可以很好地处理1亿条记录（此处使用32核、250GB内存的r3.8xlarge实例）。\n\n（这里使用了1000万条记录数据集的10倍副本，因此关于AUC与数据规模关系的信息无效，未在此考虑。）\n\n线性模型，1亿行：\n\n工具    |   时间[s]   |   内存[GB]\n--------|-------------|-----------\nR       |   1000      |    60\nSpark   |    160      |    120\nH2O     |    40       |    20\nVW      |    150      |\n\n某些工具甚至可以在单台机器上处理10亿条记录。\n（实际上，VW永远不会耗尽内存，因此如果可以接受更长的运行时间，你还可以在单机上进一步扩展。）\n\n线性模型，10亿行：\n\n工具    |   时间[s]   |   内存[GB]\n--------|-------------|-----------\nH2O     |    500      |    100\nVW      |    1400     |\n\n对于基于树的集成模型（随机森林、梯度提升机），H2O和XGBoost可以在单台服务器上训练1亿条记录，不过训练时间会达到数小时：\n\n随机森林\u002F梯度提升机，1亿行：\n\n算法    |工具     |   时间[s]   |   时间[小时]  | 内存[GB]\n--------|---------|-------------|-------------|----------\n随机森林      | H2O     |   40000     |     11      | 80\n.       | XGBoost |   36000     |     10      | 60\n梯度提升机     | H2O     |   35000     |     10      | 100   \n.       | XGBoost |   110000    |     30      | 50\n\n通常情况下，人们期望在这种情况下获得比线性模型高得多的准确率，尽管训练时间增加了1000倍。\n\n#### 分布式系统\n\n一些快速结果如下：\n\nH2O逻辑回归运行时间（秒）：\n\n规模    |  1节点 |  5节点\n--------|---------|----------\n1亿    |   42    |   9.9 \n10亿      |  480    |   101 \n\nH2O随机森林运行时间（秒）（5棵树）：\n\n规模    |  1节点 |  5节点\n--------|---------|----------\n1000万    |   42    |   41     \n1亿    |  405    |   122\n\n\n\n## 总结\n\n截至2018年1月：\n\n当我于2015年3月开始这项基准测试时，“大数据”炒作正盛行，狂热者们热衷于使用分布式计算（Hadoop、Spark等）来进行“大数据”机器学习。然而，对于大多数人拥有的数据集而言，单机工具不仅足够好，而且速度更快、功能更丰富、Bug也更少。自2015年起，我在各种会议和聚会中多次分享这些基准测试的结果。起初，不少人对我关于Spark的结果感到愤怒，但到了2017年，大多数人已经意识到，对于大多数机器学习问题，单机工具要好得多。尽管Spark在原始数据的ETL任务中表现尚可（而这类数据往往确实是“大”的），但其机器学习库却非常糟糕，在训练时间、内存占用，甚至准确率方面，都被其他更好的工具远远超越。此外，近年来服务器及云环境中可用内存的不断增加，以及机器学习中通常会将原始数据提炼成规模小得多的数据矩阵这一事实，使得那些性能卓越的单机工具（如XGBoost、LightGBM、Vowpal Wabbit，以及H2O）成为当前大多数实际应用中的最佳选择。大数据的炒作终于结束了。\n\n如今，新的炒作浪潮正在兴起，那就是深度学习。狂热者们认为深度学习（或者他们错误地称之为“AI”）是解决所有机器学习问题的最佳方案。诚然，深度学习在图像、语音以及部分文本（监督学习）和游戏\u002F虚拟环境（强化学习）等领域取得了巨大成功，但在商业中常见的“传统”机器学习问题，比如欺诈检测、信用评分或客户流失预测（涉及结构化\u002F表格型数据），深度学习的表现并不理想，其准确率往往低于随机森林或梯度提升机（GBM）。因此，近来我将基准测试的重点放在GBM实现上，并创建了一个新的GitHub仓库[GBM-perf](https:\u002F\u002Fgithub.com\u002Fszilard\u002FGBM-perf)，它更加“聚焦”且精简，同时采用了更现代的工具（如Docker），以提高基准测试的可维护性和可重复性。此外，最近还发现GPU也可以成为GBM的强大计算平台，因此新仓库中也包含了现有GPU实现的基准测试结果。\n\n我最初进行这些基准测试主要是出于好奇心和学习的愿望（同时也是为了能够为自己的项目选择合适的工具）。这是一段非常宝贵的经历，我要感谢所有参与者（尤其是各工具的开发者），感谢他们帮助我优化并充分发挥各自机器学习工具的潜力。作为这项工作的副产品，我有幸受邀在多个会议上发表演讲（KDD、R-Finance、useR!、eRum、H2O World、Crunch、Predictive Analytics World、EARL、Domino Data Science Popup、Big Data Day LA、Budapest Data Forum），并在十余次聚会上发言，例如：\n\n- KDD **特邀演讲** - 实践中的机器学习软件：何去何从？ - 加拿大哈利法克斯，2017年8月\n- R in Finance **主题演讲** - 真实无虚的数据科学 - 美国芝加哥，2017年5月\n- LA数据科学聚会 - 生产环境中的机器学习 - 美国洛杉矶，2017年5月\n- useR! 2016 - 分析用数据集规模及其对R的影响 - 美国斯坦福，2016年6月\n- H2O World - 开源机器学习平台基准测试 - 美国山景城，2015年11月\n- LA机器学习聚会 - 机器学习工具的可扩展性、速度和准确率基准测试 - 美国洛杉矶，2015年6月\n\n（相关代码、幻灯片以及部分视频录像请参见[这里](https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml-talks)。）这些演讲和材料也是了解本次基准测试成果的最佳途径（如果你想选择最新且最全面的总结，可以观看我的KDD演讲视频[链接](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=8wyOwUNw7D8&list=PLliTSxmRFGVO6Vag6FX5Jfq-RG-kUtKFZ&index=11)）。这项工作仍在继续，敬请期待更多成果……\n\n## 引用\n\n如果`benchm-ml`对你的研究有所帮助，请考虑引用它，例如使用最新的提交版本：\n\n```\n@misc{,\n\tauthor = {Pafka, Szilard},\n\ttitle = {benchm-ml},\n\tpublisher = {GitHub},\n\tyear = {2019},\n\tjournal = {GitHub repository},\n\turl = {https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml},\n\thowpublished = {\\url{https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml}},\n\tcommit = {13325ce3edd7c902390197f43bcc7938c306bbe3}\n}\n```","# benchm-ml 快速上手指南\n\n`benchm-ml` 是一个用于评估主流机器学习库在**二分类任务**中可扩展性、速度和准确性的基准测试项目。它主要针对数值型和低基数类别型输入数据（无缺失值），模拟信用评分、欺诈检测等常见商业场景。\n\n> **注意**：本项目主要作为基准测试参考，部分核心测试完成于 2015 年并后续更新。作者已启动新的基准项目 [GBM-perf](https:\u002F\u002Fgithub.com\u002Fszilard\u002FGBM-perf) 专注于高性能 GBM 实现，建议结合阅读。\n\n## 环境准备\n\n### 系统要求\n基准测试原始环境基于 Amazon EC2 实例，建议在具备多核 CPU 和大内存的单机环境中运行：\n- **推荐配置**：32 核 CPU，60GB+ RAM（处理 10M 行数据时可能需要 250GB RAM 或启用稀疏格式）。\n- **操作系统**：Linux (Ubuntu\u002FCentOS) 或 macOS。\n- **GPU 支持**（可选）：若测试深度学习模型，需配备 NVIDIA GPU (12GB+ 显存)。\n\n### 前置依赖\n本项目对比了多种工具，需根据要测试的算法安装相应环境：\n- **R 环境**：R >= 3.0，需安装 `glmnet`, `randomForest`, `h2o` 等包。\n- **Python 环境**：Python >= 2.7\u002F3.x，需安装 `scikit-learn`, `xgboost`, `lightgbm`, `pandas`, `numpy`。\n- **其他工具**：Vowpal Wabbit, H2O Java 运行时，Spark (可选)。\n\n## 安装步骤\n\n### 1. 克隆项目\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fszilard\u002Fbenchm-ml.git\ncd benchm-ml\n```\n\n### 2. 安装 Python 依赖\n建议使用虚拟环境，并通过国内镜像源加速安装：\n```bash\npython -m venv venv\nsource venv\u002Fbin\u002Factivate  # Windows: venv\\Scripts\\activate\n\n# 使用清华源安装核心依赖\npip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple scikit-learn xgboost lightgbm pandas numpy\n```\n\n### 3. 安装 R 依赖\n在 R 交互环境中执行：\n```R\n# 设置 CRAN 镜像 (可选，如中科大源)\noptions(repos = c(CRAN = \"https:\u002F\u002Fmirrors.ustc.edu.cn\u002FCRAN\"))\n\ninstall.packages(c(\"glmnet\", \"randomForest\", \"data.table\"))\n# 安装 h2o (需先安装 dependencies)\ninstall.packages(\"h2o\", type=\"source\")\n```\n\n### 4. 安装其他工具\n- **Vowpal Wabbit**: \n  ```bash\n  # Ubuntu\u002FDebian\n  sudo apt-get install vowpal-wabbit\n  # 或从源码编译\n  git clone https:\u002F\u002Fgithub.com\u002FVowpalWabbit\u002Fvowpal_wabbit.git\n  cd vowpal_wabbit && make && sudo make install\n  ```\n- **H2O**: 下载最新 jar 包或通过 R\u002FPython API 自动下载。\n\n## 基本使用\n\n本项目主要通过脚本生成数据并运行不同模型的训练脚本。以下是核心流程示例：\n\n### 1. 生成数据集\n项目使用航空延误数据（预测航班是否延误超过 15 分钟）。运行初始化脚本生成不同规模（10K, 100K, 1M, 10M）的训练集和测试集：\n```bash\n# 进入初始化目录\ncd 0-init\n\n# 执行数据生成脚本 (需确保已安装 R 或对应转换工具)\n# 具体命令参考项目内的 2-gendata.txt 说明\nRscript 2-gendata.R \n```\n*注：生成的数据将包含数值特征及经过 One-Hot 编码的类别特征。*\n\n### 2. 运行线性模型基准 (以 Python sklearn 为例)\n```python\nfrom sklearn.linear_model import LogisticRegression\nfrom sklearn.metrics import roc_auc_score\nimport pandas as pd\n\n# 加载生成的数据 (示例路径)\ntrain = pd.read_csv('data\u002Ftrain_100k.csv')\ntest = pd.read_csv('data\u002Ftest_100k.csv')\n\nX_train, y_train = train.drop('target', axis=1), train['target']\nX_test, y_test = test.drop('target', axis=1), test['target']\n\n# 训练模型\nmodel = LogisticRegression()\nmodel.fit(X_train, y_train)\n\n# 评估 AUC\npreds = model.predict_proba(X_test)[:, 1]\nauc = roc_auc_score(y_test, preds)\nprint(f\"AUC: {auc}\")\n```\n\n### 3. 运行随机森林基准 (以 XGBoost 为例)\n```python\nimport xgboost as xgb\nfrom sklearn.metrics import roc_auc_score\n\n# 准备 DMatrix\ndtrain = xgb.DMatrix(X_train, label=y_train)\ndtest = xgb.DMatrix(X_test, label=y_test)\n\n# 设置参数 (500 棵树，默认 sqrt(p) 分裂特征)\nparams = {\n    'objective': 'binary:logistic',\n    'eval_metric': 'auc',\n    'max_depth': 6,\n    'eta': 0.3\n}\n\n# 训练\nbst = xgb.train(params, dtrain, num_boost_round=500)\n\n# 预测与评估\npreds = bst.predict(dtest)\nauc = roc_auc_score(y_test, preds)\nprint(f\"XGBoost AUC: {auc}\")\n```\n\n### 4. 查看结果\n各算法的详细运行时间、内存占用和 AUC 结果通常记录在对应子目录（如 `1-linear`, `2-rf`）的脚本输出或生成的图表中。您可以修改脚本中的数据规模参数（`n`）来测试不同数据量下的性能表现。","某金融科技公司数据团队正面临千万级用户交易数据的二分类建模任务（如欺诈检测），需在有限内存的单台服务器上快速筛选出兼顾速度与精度的最佳算法组合。\n\n### 没有 benchm-ml 时\n- **选型盲目**：团队需在 R、Python scikit-learn、XGBoost、Spark MLlib 等众多库中手动逐一测试，缺乏统一标准，耗时数周仍难定夺。\n- **资源浪费**：直接在全量数据上试错，常因算法内存优化不足导致程序崩溃（OOM），反复调试严重拖慢项目进度。\n- **性能误判**：仅凭小样本数据评估模型，上线后发现随着数据量从 10 万增至 1000 万，部分算法训练时间呈指数级增长，无法满足业务时效。\n- **基准缺失**：缺乏针对“百万行数据、千维特征”这一典型业务场景的权威速度与精度（AUC）对照，难以向管理层证明技术选型的合理性。\n\n### 使用 benchm-ml 后\n- **精准导航**：直接参考 benchm-ml 对随机森林、GBDT 及深度学习在不同规模数据下的评测报告，迅速锁定 LightGBM 和 H2O 为候选方案，将选型周期缩短至 2 天。\n- **规避风险**：依据工具提供的内存消耗数据，提前排除无法在单机处理 1000 万条记录的实现方式，确保生产环境运行稳定。\n- **预期可控**：通过查看不同数据量级（10K 到 10M）下的速度曲线，准确预估全量训练耗时，合理制定计算资源预算与交付时间表。\n- **决策有据**：利用工具生成的 AUC 与耗时对比图表，清晰展示为何放弃线性模型而选择梯度提升树，以数据实证赢得业务方信任。\n\nbenchm-ml 通过提供标准化的最小基准测试，帮助团队在海量开源实现中快速定位最适合大规模二分类任务的算法，显著降低了试错成本并提升了决策效率。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fszilard_benchm-ml_73713b2c.png","szilard","Szilard Pafka","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fszilard_3f28ec16.jpg","physics PhD, chief (data) scientist, meetup organizer, (visiting) professor, machine learning benchmarks",null,"The Woodlands, Texas","https:\u002F\u002Fszilard.github.io\u002Faboutme\u002F","https:\u002F\u002Fgithub.com\u002Fszilard",[83,87],{"name":84,"color":85,"percentage":86},"R","#198CE7",84,{"name":88,"color":89,"percentage":90},"Python","#3572A5",16,1895,330,"2026-04-15T09:00:23","MIT","未说明","非必需。若进行深度学习测试，需使用配备 12GB 显存的 GPU（测试环境为 AWS p2.xlarge 实例，含 1 个 GPU）。","最低 60GB（针对大规模数据集测试）；部分内存密集型模型在大数据量下可能需要 250GB。",{"notes":99,"python":95,"dependencies":100},"该项目是一个基准测试工具集而非单一软件包，需在宿主机上分别安装上述多种独立的机器学习库。测试主要基于单节点高性能服务器（32 核），部分分布式算法支持多节点但非本次测试重点。2018 年后的更新建议使用 Docker 进行自动化部署和复现。数据预处理涉及将分类变量转换为哑变量（One-hot encoding），会导致特征维度增加至约 1000 维。",[101,102,103,104,105,106,107],"R (glmnet, randomForest)","Python scikit-learn","Vowpal Wabbit","H2O","xgboost","lightgbm","Spark MLlib",[16,27,109,15,110,44,14,111,13],"视频","其他","音频",[113,114,115,116,117,118,119,105,120,121],"machine-learning","data-science","r","python","gradient-boosting-machine","random-forest","deep-learning","h2o","spark","2026-03-27T02:49:30.150509","2026-04-20T10:24:15.553975",[],[]]