[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-iamtodor--data-science-interview-questions-and-answers":3,"tool-iamtodor--data-science-interview-questions-and-answers":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",143909,2,"2026-04-07T11:33:18",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 
链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":10,"last_commit_at":59,"category_tags":60,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[35,15,13,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":73,"owner_location":73,"owner_email":76,"owner_twitter":73,"owner_website":73,"owner_url":77,"languages":73,"stars":78,"forks":79,"last_commit_at":80,"license":81,"difficulty_score":82,"env_os":83,"env_gpu":84,"env_ram":84,"env_deps":85,"category_tags":88,"github_topics":90,"view_count":32,"oss_zip_url":73,"oss_zip_packed_at":73,"status":17,"created_at":95,"updated_at":96,"faqs":97,"releases":98},5155,"iamtodor\u002Fdata-science-interview-questions-and-answers","data-science-interview-questions-and-answers","Data science interview questions with answers. Not ideally (yet)","data-science-interview-questions-and-answers 是一个专注于数据科学领域的开源面试题库，汇集了从基础统计概念到复杂机器学习算法的常见问题与参考答案。它旨在解决求职者在准备数据科学家、分析师或相关技术岗位面试时，难以系统梳理核心知识点和应对高频技术提问的痛点。\n\n这份资源特别适合正在寻求职业突破的数据科学从业者、计算机相关专业学生，以及希望评估候选人技术实力的招聘面试官。内容覆盖广泛，不仅包含特征选择、正则化（L1\u002FL2）、偏差与方差等理论基础，还深入探讨了逻辑回归、梯度提升树（Gradient Boosting）、高斯混合模型等进阶算法原理，甚至涉及如何处理非平衡分类、异常值检测及稀疏数据等实际工程挑战。\n\n其独特亮点在于将抽象的数学概念与具体的面试场景紧密结合，通过结构化的问答形式，帮助使用者快速构建完整的知识体系。无论是需要温故知新的资深工程师，还是初次踏入该领域的新人，都能从中获得清晰的解题思路和技术洞察，是备战技术面试不可或缺的实用指南。","[Source](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1ajyJhXyt4q9ZsufXV1kZxDH_3Isg3MYAKsFqNytXrCw\u002F)\n\n- [1. Why do you use feature selection?](#1-why-do-you-use-feature-selection)\n    - [Filter Methods](#filter-methods)\n    - [Embedded Methods](#embedded-methods)\n    - [Misleading](#misleading)\n    - [Overfitting](#overfitting)\n- [2. Explain what regularization is and why it is useful.](#2-explain-what-regularization-is-and-why-it-is-useful)\n- [3. What’s the difference between L1 and L2 regularization?](#3-whats-the-difference-between-l1-and-l2-regularization)\n- [4. How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?](#4-how-would-you-validate-a-model-you-created-to-generate-a-predictive-model-of-a-quantitative-outcome-variable-using-multiple-regression)\n- [5. Explain what precision and recall are. How do they relate to the ROC curve?](#5-explain-what-precision-and-recall-are-how-do-they-relate-to-the-roc-curve)\n- [6. 
Is it better to have too many false positives, or too many false negatives?](#6-is-it-better-to-have-too-many-false-positives--or-too-many-false-negatives)\n- [7. How do you deal with unbalanced binary classification?](#7-how-do-you-deal-with-unbalanced-binary-classification)\n- [8. What is statistical power?](#8-what-is-statistical-power)\n- [9. What are bias and variance, and what are their relation to modeling data?](#9-what-are-bias-and-variance--and-what-are-their-relation-to-modeling-data)\n    - [Approaches](#approaches)\n- [10. What if the classes are imbalanced? What if there are more than 2 groups?](#10-what-if-the-classes-are-imbalanced-what-if-there-are-more-than-2-groups)\n- [11. What are some ways I can make my model more robust to outliers?](#11-what-are-some-ways-i-can-make-my-model-more-robust-to-outliers)\n- [12. In unsupervised learning, if a ground truth about a dataset is unknown, how can we determine the most useful number of clusters to be?](#12-in-unsupervised-learning--if-a-ground-truth-about-a-dataset-is-unknown--how-can-we-determine-the-most-useful-number-of-clusters-to-be)\n- [13. Define variance](#13-define-variance)\n- [14. Expected value](#14-expected-value)\n- [15. Describe the differences between and use cases for box plots and histograms](#15-describe-the-differences-between-and-use-cases-for-box-plots-and-histograms)\n- [16. How would you find an anomaly in a distribution?](#16-how-would-you-find-an-anomaly-in-a-distribution)\n    - [Statistical methods](#statistical-methods)\n    - [Metric methods](#metric-methods)\n- [17. How do you deal with outliers in your data?](#17-how-do-you-deal-with-outliers-in-your-data)\n- [18. How do you deal with sparse data?](#18-how-do-you-deal-with-sparse-data)\n- [19. Big Data Engineer Can you explain what REST is?](#19-big-data-engineer-can-you-explain-what-rest-is)\n- [20. Logistic regression](#20-logistic-regression)\n- [21. What is the effect on the coefficients of logistic regression if two predictors are highly correlated? What are the confidence intervals of the coefficients?](#21-what-is-the-effect-on-the-coefficients-of-logistic-regression-if-two-predictors-are-highly-correlated-what-are-the-confidence-intervals-of-the-coefficients)\n- [22. What’s the difference between Gaussian Mixture Model and K-Means?](#22-whats-the-difference-between-gaussian-mixture-model-and-k-means)\n- [23. Describe how Gradient Boosting works.](#23-describe-how-gradient-boosting-works)\n  - [AdaBoost the First Boosting Algorithm](#adaboost-the-first-boosting-algorithm)\n    - [Loss Function](#loss-function)\n    - [Weak Learner](#weak-learner)\n    - [Additive Model](#additive-model)\n  - [Improvements to Basic Gradient Boosting](#improvements-to-basic-gradient-boosting)\n    - [Tree Constraints](#tree-constraints)\n    - [Weighted Updates](#weighted-updates)\n    - [Stochastic Gradient Boosting](#stochastic-gradient-boosting)\n    - [Penalized Gradient Boosting](#penalized-gradient-boosting)\n- [24. Difference between AdaBoost and XGBoost](#24-difference-between-AdaBoost-and-XGBoost)\n- [25. Data Mining Describe the decision tree model.](#25-data-mining-describe-the-decision-tree-model)\n- [26. Notes from Coursera Deep Learning courses by Andrew Ng](#26-notes-from-coursera-deep-learning-courses-by-andrew-ng)\n- [27. What is a neural network?](#27-what-is-a-neural-network)\n- [28. How do you deal with sparse data?](#28-how-do-you-deal-with-sparse-data)\n- [29. RNN and LSTM](#29-rnn-and-lstm)\n- [30. 
Pseudo Labeling](#30-pseudo-labeling)
- [31. Knowledge Distillation](#31-knowledge-distillation)
- [32. What is an inductive bias?](#32-what-is-an-inductive-bias)
- [33. What is a confidence interval in layman's terms?](#33-confidence-interval-in-layman's-terms)


## 1. Why do you use feature selection?
Feature selection is the process of selecting a subset of relevant features for use in model construction. Feature selection is itself useful, but it mostly acts as a filter, muting out features that aren’t useful in addition to your existing features.
Feature selection methods aid you in your mission to create an accurate predictive model. They help you by choosing features that will give you as good or better accuracy whilst requiring less data.
Feature selection methods can be used to identify and remove unneeded, irrelevant and redundant attributes from data that do not contribute to the accuracy of a predictive model or may in fact decrease the accuracy of the model.
Fewer attributes are desirable because they reduce the complexity of the model, and a simpler model is simpler to understand and explain.
#### Filter Methods
Filter feature selection methods apply a statistical measure to assign a score to each feature. The features are ranked by the score and either selected to be kept or removed from the dataset. The methods are often univariate and consider each feature independently, or with regard to the dependent variable.
Some examples of filter methods include the Chi-squared test, information gain and correlation coefficient scores.
#### Embedded Methods
Embedded methods learn which features best contribute to the accuracy of the model while the model is being created. The most common type of embedded feature selection methods are regularization methods.
Regularization methods are also called penalization methods: they introduce additional constraints into the optimization of a predictive algorithm (such as a regression algorithm) that bias the model toward lower complexity (fewer coefficients).
Examples of regularization algorithms are the LASSO, Elastic Net and Ridge Regression.
#### Misleading
Including redundant attributes can be misleading to modeling algorithms. Instance-based methods such as k-nearest neighbor use small neighborhoods in the attribute space to determine classification and regression predictions. These predictions can be greatly skewed by redundant attributes.
#### Overfitting
Keeping irrelevant attributes in your dataset can result in overfitting. Decision tree algorithms like C4.5 seek to make optimal splits in attribute values. Those attributes that are more correlated with the prediction are split on first. Deeper in the tree, less relevant and irrelevant attributes are used to make prediction decisions that may only be beneficial by chance in the training dataset. This overfitting of the training data can negatively affect the modeling power of the method and cripple the predictive accuracy.
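To make the filter approach above concrete, here is a minimal scikit-learn sketch that scores each feature with a chi-squared test and keeps the top two (the dataset and the value of `k` are arbitrary choices for the example):

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.feature_selection import SelectKBest, chi2

# Small example dataset with 4 features.
X, y = load_iris(return_X_y=True)

# Score each feature independently against the target and keep the 2 best.
selector = SelectKBest(score_func=chi2, k=2)
X_reduced = selector.fit_transform(X, y)

print("chi2 scores per feature:", np.round(selector.scores_, 1))
print("kept feature indices:", selector.get_support(indices=True))
print("reduced shape:", X_reduced.shape)
```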
## 2. Explain what regularization is and why it is useful.
Regularization is the process of adding a tuning parameter to a model to induce smoothness in order to prevent [overfitting](https://en.wikipedia.org/wiki/Overfitting).

This is most often done by adding a constant multiple to an existing weight vector. This constant is often either the [L1 (Lasso)](https://en.wikipedia.org/wiki/Lasso_(statistics)) or [L2 (ridge)](https://en.wikipedia.org/wiki/Tikhonov_regularization) norm, but it can in fact be any norm. The model predictions should then minimize the mean of the loss function calculated on the regularized training set.

It is well known, as explained by others, that L1 regularization helps perform feature selection in sparse feature spaces, and that is a good practical reason to use L1 in some situations. However, beyond that particular reason I have never seen L1 perform better than L2 in practice. If you take a look at the [LIBLINEAR FAQ](https://www.csie.ntu.edu.tw/~cjlin/liblinear/FAQ.html#l1_regularized_classification) on this issue you will see how they have not seen a practical example where L1 beats L2 and encourage users of the library to contact them if they find one. Even in a situation where you might benefit from L1's sparsity in order to do feature selection, using L2 on the remaining variables is likely to give better results than L1 by itself.

## 3. What’s the difference between L1 and L2 regularization?
Regularization is a very important technique in machine learning to prevent overfitting. Mathematically speaking, it adds a regularization term in order to prevent the coefficients from fitting the training data so perfectly that they overfit. The difference between L1 (Lasso) and L2 (Ridge) is simply that the L2 (Ridge) penalty is the sum of the squared weights, while the L1 (Lasso) penalty is the sum of the absolute weights, added to the MSE or another loss function. As follows:
![alt text](https://oss.gittoolsai.com/images/iamtodor_data-science-interview-questions-and-answers_readme_ba61756c14f0.png)
The difference between their properties can be promptly summarized as follows:
![alt text](https://oss.gittoolsai.com/images/iamtodor_data-science-interview-questions-and-answers_readme_7342efb82e8a.png)

**Solution uniqueness** is a simpler case but requires a bit of imagination. First, consider the picture below:
![alt text](https://oss.gittoolsai.com/images/iamtodor_data-science-interview-questions-and-answers_readme_c524c7afe38c.png)

## 4. 
How would you validate a model you created to generate a predictive model of a quantitative outcome variable using multiple regression?\n[Proposed methods](http:\u002F\u002Fsupport.sas.com\u002Fresources\u002Fpapers\u002Fproceedings12\u002F333-2012.pdf) for model validation:\n* If the values predicted by the model are far outside of the response variable range, this would immediately indicate poor estimation or model inaccuracy.\n* If the values seem to be reasonable, examine the parameters; any of the following would indicate poor estimation or multi-collinearity: opposite signs of expectations, unusually large or small values, or observed inconsistency when the model is fed new data.\n* Use the model for prediction by feeding it new data, and use the [coefficient of determination](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCoefficient_of_determination) (R squared) as a model validity measure.\n* Use data splitting to form a separate dataset for estimating model parameters, and another for validating predictions.\n* Use [jackknife resampling](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJackknife_resampling) if the dataset contains a small number of instances, and measure validity with R squared and [mean squared error](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMean_squared_error) (MSE).\n\n## 5. Explain what precision and recall are. How do they relate to the ROC curve?\nCalculating precision and recall is actually quite easy. Imagine there are 100 positive cases among 10,000 cases. You want to predict which ones are positive, and you pick 200 to have a better chance of catching many of the 100 positive cases. You record the IDs of your predictions, and when you get the actual results you sum up how many times you were right or wrong. There are four ways of being right or wrong:\n1. TN \u002F True Negative: case was negative and predicted negative\n2. TP \u002F True Positive: case was positive and predicted positive\n3. FN \u002F False Negative: case was positive but predicted negative\n4. FP \u002F False Positive: case was negative but predicted positive\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_42c8db9502db.png)\n\nNow, your boss asks you three questions:\n* What percent of your predictions were correct?\nYou answer: the \"accuracy\" was (9,760+60) out of 10,000 = 98.2%\n* What percent of the positive cases did you catch?\nYou answer: the \"recall\" was 60 out of 100 = 60%\n* What percent of positive predictions were correct?\nYou answer: the \"precision\" was 60 out of 200 = 30%\nSee also a very good explanation of [Precision and recall](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPrecision_and_recall) in Wikipedia.\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_e8c5324e2e5f.jpg)\n\nROC curve represents a relation between sensitivity (RECALL) and specificity(NOT PRECISION) and is commonly used to measure the performance of binary classifiers. However, when dealing with highly skewed datasets, [Precision-Recall (PR)](http:\u002F\u002Fpages.cs.wisc.edu\u002F~jdavis\u002Fdavisgoadrichcamera2.pdf) curves give a more representative picture of performance. Remember, a ROC curve represents a relation between sensitivity (RECALL) and specificity(NOT PRECISION). 
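To make the worked example above concrete, here is a minimal sketch that recomputes those metrics from the four counts (60 TP, 140 FP, 40 FN, 9,760 TN):

```python
# Counts from the worked example above: 100 positives among 10,000 cases,
# 200 cases predicted positive, 60 of them correctly.
tp, fp, fn, tn = 60, 140, 40, 9760

accuracy = (tp + tn) / (tp + fp + fn + tn)   # 0.982
recall = tp / (tp + fn)                      # 0.60 -- share of positives caught
precision = tp / (tp + fp)                   # 0.30 -- share of positive predictions that were right
specificity = tn / (tn + fp)                 # used (with recall) to draw a ROC curve

print(f"accuracy={accuracy:.3f} recall={recall:.2f} "
      f"precision={precision:.2f} specificity={specificity:.3f}")
```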
Sensitivity is the other name for recall, but specificity is not PRECISION.

Recall/Sensitivity is the measure of the probability that your estimate is 1 given all the samples whose true class label is 1. It is a measure of how many of the positive samples have been identified as being positive. Specificity is the measure of the probability that your estimate is 0 given all the samples whose true class label is 0. It is a measure of how many of the negative samples have been identified as being negative.

PRECISION on the other hand is different. It is a measure of the probability that a sample is a true positive class given that your classifier said it is positive. It is a measure of how many of the samples predicted by the classifier as positive are indeed positive. Note here that this changes when the base probability or prior probability of the positive class changes, which means PRECISION depends on how rare the positive class is. In other words, it is used when the positive class is more interesting than the negative class.

* Sensitivity, also known as the True Positive Rate or Recall, is calculated as `Sensitivity = TP / (TP + FN)`. Since the formula doesn’t contain FP and TN, Sensitivity may give you a biased result, especially for imbalanced classes.
In the example of fraud detection, it gives you the percentage of Correctly Predicted Frauds from the pool of Actual Frauds.
* Specificity, also known as the True Negative Rate, is calculated as `Specificity = TN / (TN + FP)`. Since the formula does not contain FN and TP, Specificity may give you a biased result, especially for imbalanced classes.
In the example of fraud detection, it gives you the percentage of Correctly Predicted Non-Frauds from the pool of Actual Non-Frauds.

[Assessing and Comparing Classifier Performance with ROC Curves](https://machinelearningmastery.com/assessing-comparing-classifier-performance-roc-curves-2/)

## 6. Is it better to have too many false positives, or too many false negatives?
It depends on the question as well as on the domain for which we are trying to solve it.

In medical testing, false negatives may provide a falsely reassuring message to patients and physicians that disease is absent, when it is actually present. This sometimes leads to inappropriate or inadequate treatment of both the patient and their disease. So it is preferable to have too many false positives.

For spam filtering, a false positive occurs when spam filtering or spam blocking techniques wrongly classify a legitimate email message as spam and, as a result, interfere with its delivery. While most anti-spam tactics can block or filter a high percentage of unwanted emails, doing so without creating significant false-positive results is a much more demanding task. So we prefer too many false negatives over too many false positives.

## 7. How do you deal with unbalanced binary classification?
Imbalanced data typically refers to classification problems where the classes are not represented equally.
For example, you may have a 2-class (binary) classification problem with 100 instances (rows). 
A total of 80 instances are labeled with Class-1 and the remaining 20 instances are labeled with Class-2.\n\nThis is an imbalanced dataset and the ratio of Class-1 to Class-2 instances is 80:20 or more concisely 4:1.\nYou can have a class imbalance problem on two-class classification problems as well as multi-class classification problems. Most techniques can be used on either.\nThe remaining discussions will assume a two-class classification problem because it is easier to think about and describe.\n1. Can You Collect More Data?\u003C\u002Fbr>\nA larger dataset might expose a different and perhaps more balanced perspective on the classes.\nMore examples of minor classes may be useful later when we look at resampling your dataset.\n2. Try Changing Your Performance Metric\u003C\u002Fbr>\nAccuracy is not the metric to use when working with an imbalanced dataset. We have seen that it is misleading.\nFrom that post, I recommend looking at the following performance measures that can give more insight into the accuracy of the model than traditional classification accuracy:\n  - [Confusion Matrix](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FConfusion_matrix): A breakdown of predictions into a table showing correct predictions (the diagonal) and the types of incorrect predictions made (what classes incorrect predictions were assigned).\n  - [Precision](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FInformation_retrieval#Precision): A measure of a classifiers exactness. Precision is the number of True Positives divided by the number of True Positives and False Positives. Put another way, it is the number of positive predictions divided by the total number of positive class values predicted. It is also called the [Positive Predictive Value (PPV)](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPositive_and_negative_predictive_values). Precision can be thought of as a measure of a classifiers exactness. A low precision can also indicate a large number of False Positives.\n  - [Recall](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FInformation_retrieval#Recall): A measure of a classifiers completeness. Recall is the number of True Positives divided by the number of True Positives and the number of False Negatives. Put another way it is the number of positive predictions divided by the number of positive class values in the test data. It is also called Sensitivity or the True Positive Rate. Recall can be thought of as a measure of a classifiers completeness. A low recall indicates many False Negatives.\n  - [F1 Score (or F-score)](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FF1_score): A weighted average of precision and recall.\nI would also advise you to take a look at the following:\n  - Kappa (or [Cohen’s kappa](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCohen%27s_kappa)): Classification accuracy normalized by the imbalance of the classes in the data.\nROC Curves: Like precision and recall, accuracy is divided into sensitivity and specificity and models can be chosen based on the balance thresholds of these values.\n3. Try Resampling Your Dataset\n  * You can add copies of instances from the under-represented class called over-sampling (or more formally sampling with replacement)\n  * You can delete instances from the over-represented class, called under-sampling.\n5. Try Different Algorithms\n6. 
Try Penalized Models\u003C\u002Fbr>\nYou can use the same algorithms but give them a different perspective on the problem.\nPenalized classification imposes an additional cost on the model for making classification mistakes on the minority class during training. These penalties can bias the model to pay more attention to the minority class.\nOften the handling of class penalties or weights are specialized to the learning algorithm. There are penalized versions of algorithms such as penalized-SVM and penalized-LDA.\nUsing penalization is desirable if you are locked into a specific algorithm and are unable to resample or you’re getting poor results. It provides yet another way to “balance” the classes. Setting up the penalty matrix can be complex. You will very likely have to try a variety of penalty schemes and see what works best for your problem.\n7. Try a Different Perspective\u003C\u002Fbr>\nTaking a look and thinking about your problem from these perspectives can sometimes shame loose some ideas.\nTwo you might like to consider are anomaly detection and change detection.\n\n## 8. What is statistical power?\n[Statistical power or sensitivity](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FStatistical_power) of a binary hypothesis test is the probability that the test correctly rejects the null hypothesis (H0) when the alternative hypothesis (H1) is true.\n\nIt can be equivalently thought of as the probability of accepting the alternative hypothesis (H1) when it is true—that is, the ability of a test to detect an effect, if the effect actually exists.\n\nTo put in another way, [Statistical power](https:\u002F\u002Feffectsizefaq.com\u002F2010\u002F05\u002F31\u002Fwhat-is-statistical-power\u002F) is the likelihood that a study will detect an effect when the effect is present. The higher the statistical power, the less likely you are to make a Type II error (concluding there is no effect when, in fact, there is).\n\nA type I error (or error of the first kind) is the incorrect rejection of a true null hypothesis. Usually a type I error leads one to conclude that a supposed effect or relationship exists when in fact it doesn't. Examples of type I errors include a test that shows a patient to have a disease when in fact the patient does not have the disease, a fire alarm going on indicating a fire when in fact there is no fire, or an experiment indicating that a medical treatment should cure a disease when in fact it does not.\n\nA type II error (or error of the second kind) is the failure to reject a false null hypothesis. Examples of type II errors would be a blood test failing to detect the disease it was designed to detect, in a patient who really has the disease; a fire breaking out and the fire alarm does not ring; or a clinical trial of a medical treatment failing to show that the treatment works when really it does.\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_ed96aa6ebc65.png)\n\n## 9. 
What are bias and variance, and what are their relation to modeling data?\n**Bias** is how far removed a model's predictions are from correctness, while variance is the degree to which these predictions vary between model iterations.\n\nBias is generally the distance between the model that you build on the training data (the best model that your model space can provide) and the “real model” (which generates data).\n\n**Error due to Bias**: Due to randomness in the underlying data sets, the resulting models will have a range of predictions. [Bias](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBias_of_an_estimator) measures how far off in general these models' predictions are from the correct value. The bias is error from erroneous assumptions in the learning algorithm. High bias can cause an algorithm to miss the relevant relations between features and target outputs (underfitting).\n\n**Error due to Variance**: The error due to variance is taken as the variability of a model prediction for a given data point. Again, imagine you can repeat the entire model building process multiple times. The variance is how much the predictions for a given point vary between different realizations of the model. The variance is error from sensitivity to small fluctuations in the training set.\n\nHigh variance can cause an algorithm to model the random [noise](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNoise_(signal_processing)) in the training data, rather than the intended outputs (overfitting).\n\nBig dataset -> low variance \u003Cbr\u002F>\nLow dataset -> high variance \u003Cbr\u002F>\nFew features -> high bias, low variance \u003Cbr\u002F>\nMany features -> low bias, high variance \u003Cbr\u002F>\nComplicated model -> low bias \u003Cbr\u002F>\nSimplified model -> high bias \u003Cbr\u002F>\nDecreasing λ -> low bias \u003Cbr\u002F>\nIncreasing λ -> low variance \u003Cbr\u002F>\n\nWe can create a graphical visualization of bias and variance using a bulls-eye diagram. Imagine that the center of the target is a model that perfectly predicts the correct values. As we move away from the bulls-eye, our predictions get worse and worse. Imagine we can repeat our entire model building process to get a number of separate hits on the target. Each hit represents an individual realization of our model, given the chance variability in the training data we gather. Sometimes we will get a good distribution of training data so we predict very well and we are close to the bulls-eye, while sometimes our training data might be full of outliers or non-standard values resulting in poorer predictions. These different realizations result in a scatter of hits on the target.\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_fa9c304a970f.jpg)\n\n[As an example](https:\u002F\u002Fwww.kdnuggets.com\u002F2016\u002F08\u002Fbias-variance-tradeoff-overview.html), using a simple flawed Presidential election survey as an example, errors in the survey are then explained through the twin lenses of bias and variance: selecting survey participants from a phonebook is a source of bias; a small sample size is a source of variance.\n\nMinimizing total model error relies on the balancing of bias and variance errors. Ideally, models are the result of a collection of unbiased data of low variance. 
Unfortunately, however, the more complex a model becomes, the more its tendency is toward less bias but greater variance; an optimal model therefore needs to balance these two properties.

The statistical evaluation method of cross-validation is useful both in demonstrating the importance of this balance and in actually searching it out. The number of data folds to use -- the value of k in k-fold cross-validation -- is an important decision; the lower the value, the higher the bias in the error estimates and the less variance.
![alt text](https://oss.gittoolsai.com/images/iamtodor_data-science-interview-questions-and-answers_readme_723b1e0f49c2.jpg)

The most important takeaways are that bias and variance are two sides of an important trade-off when building models, and that even the most routine of statistical evaluation methods are directly reliant upon such a trade-off.

We may estimate a model f̂(X) of f(X) using linear regression or another modeling technique. In this case, the expected squared prediction error at a point x is:
`Err(x) = E[(Y − f̂(x))^2]`

This error may then be decomposed into bias and variance components:
`Err(x) = (E[f̂(x)] − f(x))^2 + E[(f̂(x) − E[f̂(x)])^2] + σ^2_e`
`Err(x) = Bias^2 + Variance + Irreducible error`

That third term, irreducible error, is the noise term in the true relationship that cannot fundamentally be reduced by any model. Given the true model and infinite data to calibrate it, we should be able to reduce both the bias and variance terms to 0. However, in a world with imperfect models and finite data, there is a tradeoff between minimizing the bias and minimizing the variance.

If a model is suffering from high bias, it means the model is too simple; to make it more expressive we can add more features to the feature space. Adding data points, by contrast, will reduce the variance.

The bias–variance tradeoff is a central problem in supervised learning. Ideally, one wants to [choose a model](https://en.wikipedia.org/wiki/Model_selection) that both accurately captures the regularities in its training data and generalizes well to unseen data. Unfortunately, it is typically impossible to do both simultaneously. High-variance learning methods may be able to represent their training set well, but are at risk of overfitting to noisy or unrepresentative training data. In contrast, algorithms with high bias typically produce simpler models that don't tend to overfit, but may underfit their training data, failing to capture important regularities.

Models with low bias are usually more complex (e.g. higher-order regression polynomials), enabling them to represent the training set more accurately. In the process, however, they may also represent a large noise component in the training set, making their predictions less accurate, despite their added complexity. 
In contrast, models with higher bias tend to be relatively simple (low-order or even linear regression polynomials), but may produce lower variance predictions when applied beyond the training set.\n\n#### Approaches\n\n[Dimensionality reduction](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDimensionality_reduction) and [feature selection](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FFeature_selection) can decrease variance by simplifying models. Similarly, a larger training set tends to decrease variance. Adding features (predictors) tends to decrease bias, at the expense of introducing additional variance. Learning algorithms typically have some tunable parameters that control bias and variance, e.g.:\n* (Generalized) linear models can be [regularized](#2-explain-what-regularization-is-and-why-it-is-useful) to decrease their variance at the cost of increasing their bias.\n* In artificial neural networks, the variance increases and the bias decreases with the number of hidden units. Like in GLMs, regularization is typically applied.\n* In k-nearest neighbor models, a high value of k leads to high bias and low variance (see below).\n* In Instance-based learning, regularization can be achieved varying the mixture of prototypes and exemplars.[\n* In decision trees, the depth of the tree determines the variance. Decision trees are commonly pruned to control variance.\n\nOne way of resolving the trade-off is to use [mixture models](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMixture_model) and [ensemble learning](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEnsemble_learning). For example, [boosting](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBoosting_(machine_learning)) combines many \"weak\" (high bias) models in an ensemble that has lower bias than the individual models, while [bagging](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBootstrap_aggregating) combines \"strong\" learners in a way that reduces their variance.\n\n[Understanding the Bias-Variance Tradeoff](http:\u002F\u002Fscott.fortmann-roe.com\u002Fdocs\u002FBiasVariance.html)\n\n## 10. What if the classes are imbalanced? What if there are more than 2 groups?\nBinary classification involves classifying the data into two groups, e.g. whether or not a customer buys a particular product or not (Yes\u002FNo), based on independent variables such as gender, age, location etc.\n\nAs the target variable is not continuous, binary classification model predicts the probability of a target variable to be Yes\u002FNo. To evaluate such a model, a metric called the confusion matrix is used, also called the classification or co-incidence matrix. With the help of a confusion matrix, we can calculate important performance measures:\n* True Positive Rate (TPR) or Recall or Sensitivity = TP \u002F (TP + FN)\n* [Precision](https:\u002F\u002Fgithub.com\u002Fiamtodor\u002Fdata-science-interview-questions-and-answers#5-explain-what-precision-and-recall-are-how-do-they-relate-to-the-roc-curve) = TP \u002F (TP + FP)\n* False Positive Rate(FPR) or False Alarm Rate = 1 - Specificity = 1 - (TN \u002F (TN + FP))\n* Accuracy = (TP + TN) \u002F (TP + TN + FP + FN)\n* Error Rate = 1 – Accuracy\n* F-measure = 2 \u002F ((1 \u002F Precision) + (1 \u002F Recall)) = 2 * (precision * recall) \u002F (precision + recall)\n* ROC (Receiver Operating Characteristics) = plot of FPR vs TPR\n* AUC (Area Under the [ROC] Curve)  \nPerformance measure across all classification thresholds. 
Treated as the probability that a model ranks a randomly chosen positive sample higher than a randomly chosen negative one.



## 11. What are some ways I can make my model more robust to outliers?
There are several ways to make a model more robust to outliers, from different points of view (data preparation or model building). An outlier in this question and answer is assumed to be an unwanted, unexpected, or must-be-wrong value given current human knowledge (e.g. no one is 200 years old), rather than a rare event which is possible but rare.

Outliers are usually defined in relation to the distribution. Thus, outliers could be removed in the pre-processing step (before any learning step), for example by using a standard-deviation rule `(Mean +/- 2*SD)` when the data are roughly normal, or interquartile ranges `Q1 - Q3` as threshold levels for non-normal or unknown distributions, where `Q1` is the "middle" value in the first half of the rank-ordered data set and `Q3` is the "middle" value in the second half of the rank-ordered data set.

Moreover, data transformation (e.g. log transformation) may help if the data have a noticeable tail. When outliers are related to the sensitivity of the collecting instrument, which may not precisely record small values, Winsorization may be useful. This type of transformation (named after Charles P. Winsor (1895–1951)) has the same effect as clipping signals (i.e. it replaces extreme data values with less extreme values). Another option to reduce the influence of outliers is to use the mean absolute error rather than the mean squared error.

For model building, some models are resistant to outliers (e.g. tree-based approaches), as are non-parametric tests. Similar to the median effect, tree models divide each node into two at each split. Thus, at each split, all data points in a bucket can be treated equally regardless of the extreme values they may have.

## 12. In unsupervised learning, if a ground truth about a dataset is unknown, how can we determine the most useful number of clusters to be?
The elbow method is often the best place to start, and is especially useful due to its ease of explanation and verification via visualization. The elbow method is interested in explaining variance as a function of cluster numbers (the k in k-means). By plotting the percentage of variance explained against k, the first N clusters should add significant information, explaining variance; yet, some eventual value of k will result in a much less significant gain in information, and it is at this point that the graph will show a noticeable angle. This angle will be the optimal number of clusters, from the perspective of the elbow method.
It should be self-evident that, in order to plot this variance against varying numbers of clusters, varying numbers of clusters must be tested. Successive complete iterations of the clustering method must be undertaken, after which the results can be plotted and compared (see the sketch below).
DBSCAN - Density-Based Spatial Clustering of Applications with Noise. Finds core samples of high density and expands clusters from them. Good for data which contains clusters of similar density.
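A minimal sketch of the elbow heuristic described above, using scikit-learn's k-means (the synthetic data and the range of k are arbitrary choices for the example):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data with an unknown-to-us number of clusters.
X, _ = make_blobs(n_samples=500, centers=4, random_state=0)

# Fit k-means for a range of k and record the within-cluster sum of squares (inertia).
inertias = []
for k in range(1, 10):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    inertias.append(km.inertia_)

# Plotting inertia against k shows a sharp drop up to the true number of clusters
# and only marginal gains afterwards -- the "elbow".
for k, inertia in zip(range(1, 10), inertias):
    print(k, round(inertia, 1))
```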
## 13. Define variance
Variance is the expectation of the squared deviation of a random variable from its mean. Informally, it measures how far a set of (random) numbers are spread out from their average value. The variance is the square of the standard deviation, the second central moment of a distribution, and the covariance of the random variable with itself.

`Var(X) = E[(X - m)^2], where m = E[X]`

Variance is, thus, a measure of the scatter of the values of a random variable relative to its mathematical expectation.

## 14. Expected value
[Expected value](https://en.wikipedia.org/wiki/Expected_value): in a [probability distribution](https://en.wikipedia.org/wiki/Probability_distribution), the expected value is the probability-weighted average of the values a random variable can take.

Based on the law of distribution of a random variable x, we know that a random variable x can take values x1, x2, ..., xk with probabilities p1, p2, ..., pk.
The mathematical expectation M(x) of the random variable x is then the sum of these values weighted by their probabilities.
The mathematical expectation of a random variable X (denoted by M(X) or, less often, E(X)) characterizes the average value of a random variable (discrete or continuous). Mathematical expectation is the first raw moment of a given random variable.

Mathematical expectation belongs to the so-called characteristics of distribution location (to which the mode and median also belong). This characteristic describes a certain average position of a random variable on the numerical axis. Say, if the expectation of a random variable (the lifetime of a lamp) is 100 hours, then the values of the service life are considered to be concentrated around this value (on both sides), with the spread on each side indicated by the variance.

The mathematical expectation of a discrete random variable X is calculated as the sum of the products of the values xi that the random variable X takes and the corresponding probabilities pi:
```python
import numpy as np

X = [3, 4, 5, 6, 7]            # values the random variable can take
P = [0.1, 0.2, 0.3, 0.2, 0.2]  # their probabilities; a valid distribution must sum to 1
print(np.dot(X, P))            # expected value: sum of x_i * p_i
```

## 15. Describe the differences between and use cases for box plots and histograms
A [histogram](http://www.brighthubpm.com/six-sigma/13307-what-is-a-histogram/) is a type of bar chart that graphically displays the frequencies of a data set. Similar to a bar chart, a histogram plots the frequency, or raw count, on the Y-axis (vertical) and the variable being measured on the X-axis (horizontal).

The only difference between a histogram and a bar chart is that a histogram displays frequencies for a group of data, rather than an individual data point; therefore, no spaces are present between the bars. Typically, a histogram groups data into small chunks (four to eight values per bar on the horizontal axis), unless the range of data is so great that it is easier to identify general distribution trends with larger groupings.

A box plot, also called a [box-and-whisker](http://www.brighthubpm.com/six-sigma/43824-using-box-and-whiskers-plots/) plot, is a chart that graphically represents the five most important descriptive values for a data set. These values include the minimum value, the first quartile, the median, the third quartile, and the maximum value. When graphing this five-number summary, only the horizontal axis displays values. Within the quadrant, a vertical line is placed above each of the summary numbers. A box is drawn around the middle three lines (first quartile, median, and third quartile) and two lines are drawn from the box’s edges to the two endpoints (minimum and maximum).
Boxplots are better for comparing distributions than histograms!
![alt text](https://oss.gittoolsai.com/images/iamtodor_data-science-interview-questions-and-answers_readme_0b67822dc37b.png)
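A short matplotlib sketch contrasting the two plot types described above (the sample data is arbitrary):

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(42)
data = rng.normal(loc=100, scale=15, size=500)  # arbitrary sample

fig, (ax1, ax2) = plt.subplots(1, 2, figsize=(8, 3))

# Histogram: raw counts of values grouped into bins.
ax1.hist(data, bins=20)
ax1.set_title("Histogram")

# Box plot: the five-number summary (min, Q1, median, Q3, max) plus outliers.
ax2.boxplot(data, vert=False)
ax2.set_title("Box plot")

plt.tight_layout()
plt.show()
```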
## 16. How would you find an anomaly in a distribution?
Before getting started, it is important to establish some boundaries on the definition of an anomaly. Anomalies can be broadly categorized as:
1. Point anomalies: A single instance of data is anomalous if it's too far off from the rest. Business use case: Detecting credit card fraud based on "amount spent."
2. Contextual anomalies: The abnormality is context specific. This type of anomaly is common in time-series data. Business use case: Spending $100 on food every day during the holiday season is normal, but may be odd otherwise.
3. Collective anomalies: A set of data instances collectively helps in detecting anomalies. Business use case: Someone is trying to copy data from a remote machine to a local host unexpectedly, an anomaly that would be flagged as a potential cyber attack.

The best way to prevent anomalies is to implement policies or checks that can catch them during the data collection stage. Unfortunately, you do not often get to collect your own data, and often the data you're mining was collected for another purpose. About 68% of all the data points are within one standard deviation from the mean. About 95% of the data points are within two standard deviations from the mean. Finally, over 99% of the data is within three standard deviations from the mean. When a value deviates too much from the mean, say by ± 4σ, we can consider this almost impossible value an anomaly. (This limit can also be calculated using percentiles.)

#### Statistical methods
Statistically based anomaly detection uses this knowledge to discover outliers. A dataset can be standardized by taking the z-score of each point. A z-score is a measure of how many standard deviations a data point is away from the mean of the data: `z = (x − μ) / σ`. Any data point that has a z-score higher than 3 is an outlier, and likely to be an anomaly. As the z-score increases above 3, points become more obviously anomalous. A box plot is perfect for this application.

#### Metric methods
Judging by the number of publications, metric methods are the most popular methods among researchers. They postulate the existence of a certain metric in the space of objects, which helps to find anomalies. Intuitively, an anomaly has few neighbors in the instance space, while a typical point has many. Therefore, a good measure of anomalies can be, for example, the «distance to the k-th neighbor» (see the [Local Outlier Factor](https://en.wikipedia.org/wiki/Local_outlier_factor) method). Specific metrics are used here, for example the [Mahalanobis distance](https://en.wikipedia.org/wiki/Mahalanobis_distance). Mahalanobis distance is a measure of distance between vectors of random variables, generalizing the concept of Euclidean distance. Using Mahalanobis distance, it is possible to determine the similarity of unknown and known samples. It differs from Euclidean distance in that it takes into account correlations between variables and is scale invariant.
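A minimal numpy sketch of the metric approach just described, scoring points by their Mahalanobis distance from the sample mean (the data and the cut-off of 4 are arbitrary choices for the example):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D dataset: 200 inliers plus one obvious outlier.
X = rng.normal(size=(200, 2))
X = np.vstack([X, [[8.0, 8.0]]])

mu = X.mean(axis=0)
cov_inv = np.linalg.inv(np.cov(X, rowvar=False))

# Mahalanobis distance of every point from the sample mean.
diff = X - mu
d = np.sqrt(np.einsum("ij,jk,ik->i", diff, cov_inv, diff))

# Flag points whose distance is unusually large (the threshold is a judgment call).
print("suspected anomalies:", np.where(d > 4)[0])
```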
![alt text](https://oss.gittoolsai.com/images/iamtodor_data-science-interview-questions-and-answers_readme_f840a7001180.png)

The most common form of clustering-based anomaly detection is done with prototype-based clustering.

Using this approach to anomaly detection, a point is classified as an anomaly if its omission from the group significantly improves the prototype. This logically makes sense. K-means is a clustering algorithm that clusters similar points. The points in any cluster are similar to the centroid of that cluster, hence why they are members of that cluster. If one point in the cluster is so far from the centroid that it pulls the centroid away from its natural center, then that point is literally an outlier, since it lies outside the natural bounds for the cluster. Hence, its omission is a logical step to improve the accuracy of the rest of the cluster. Using this approach, the outlier score is defined as the degree to which a point doesn't belong to any cluster, or its distance from the centroid of the cluster. In K-means, the degree to which the removal of a point would increase the accuracy of the centroid is the difference in the SSE (sum of squared errors) of the cluster with and without the point. If there is a substantial improvement in SSE after the removal of the point, that correlates to a high outlier score for that point.
More specifically, when using a k-means clustering approach towards anomaly detection, the outlier score is calculated in one of two ways. The simplest is the point's distance from its closest centroid. However, this approach is not as useful when there are clusters of differing densities. To tackle that problem, the point's relative distance to its closest centroid is used, where relative distance is defined as the ratio of the point's distance from the centroid to the median distance of all points in the cluster from the centroid. This approach to anomaly detection is sensitive to the value of k. Also, if the data is highly noisy, then that will throw off the accuracy of the initial clusters, which will decrease the accuracy of this type of anomaly detection. The time complexity of this approach is obviously dependent on the choice of clustering algorithm, but since most clustering algorithms have linear or close to linear time and space complexity, this type of anomaly detection can be highly efficient.

## 17. How do you deal with outliers in your data?

For the most part, if your data is affected by these extreme cases, you can bound the input to a historical representative of your data that excludes outliers. So that could be a number of items (>3) or a lower or upper bound on your order value.

If the outliers are from a data set that is relatively unique, then analyze them for your specific situation. Analyze both with and without them, and perhaps with a replacement alternative if you have a reason for one, and report the results of this assessment.
One option is to try a transformation. Square root and log transformations both pull in high numbers. This can make assumptions work better if the outlier is a dependent variable.

## 18. How do you deal with sparse data?

We could take a look at L1 regularization, since it fits sparse data well and performs feature selection. If the relationship is linear, use linear regression or an SVM.
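A minimal sketch of that first suggestion, fitting an L1-penalised logistic regression on a sparse matrix so that many coefficients are driven exactly to zero (the synthetic data is an arbitrary stand-in for real sparse features):

```python
import numpy as np
from scipy.sparse import random as sparse_random
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Sparse feature matrix (e.g. one-hot or bag-of-words style): 1000 x 200, ~2% non-zero.
X = sparse_random(1000, 200, density=0.02, random_state=0, format="csr")
y = rng.integers(0, 2, size=1000)  # arbitrary binary target

# The L1 penalty performs embedded feature selection: many weights end up exactly 0.
clf = LogisticRegression(penalty="l1", solver="liblinear", C=0.5)
clf.fit(X, y)

print("non-zero coefficients:", np.count_nonzero(clf.coef_), "of", clf.coef_.size)
```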
\n\nAlso it would be nice to use one-hot-encoding or bag-of-words. A one hot encoding is a representation of categorical variables as binary vectors. This first requires that the categorical values be mapped to integer values. Then, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.\n\n## 19. Big Data Engineer Can you explain what REST is?\n\nREST stands for Representational State Transfer. (It is sometimes spelled \"ReST\".) It relies on a stateless, client-server, cacheable communications protocol -- and in virtually all cases, the HTTP protocol is used.\nREST is an architecture style for designing networked applications. The idea is simple HTTP is used to make calls between machines.\n* In many ways, the World Wide Web itself, based on HTTP, can be viewed as a REST-based architecture.\nRESTful applications use HTTP requests to post data (create and\u002For update), read data (e.g., make queries), and delete data. Thus, REST uses HTTP for all four CRUD (Create\u002FRead\u002FUpdate\u002FDelete) operations.\nREST is a lightweight alternative to mechanisms like RPC (Remote Procedure Calls) and Web Services (SOAP, WSDL, et al.). Later, we will see how much more simple REST is.\n* Despite being simple, REST is fully-featured; there's basically nothing you can do in Web Services that can't be done with a RESTful architecture.\nREST is not a \"standard\". There will never be a W3C recommendation for REST, for example. And while there are REST programming frameworks, working with REST is so simple that you can often \"roll your own\" with standard library features in languages like Perl, Java, or C#.\n\n## 20. Logistic regression\n\nLog odds - raw output from the model; odds - exponent from the output of the model. Probability of the output - odds \u002F (1+odds).\n\n## 21. What is the effect on the coefficients of logistic regression if two predictors are highly correlated? What are the confidence intervals of the coefficients?\nWhen predictor variables are correlated, the estimated regression coefficient of any one variable depends on which other predictor variables are included in the model. When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.\n\nIn statistics, multicollinearity (also collinearity) is a phenomenon in which two or more predictor variables in a multiple regression model are highly correlated, meaning that one can be linearly predicted from the others with a substantial degree of accuracy. In this situation the coefficient estimates of the multiple regression may change erratically in response to small changes in the model or the data. Multicollinearity does not reduce the predictive power or reliability of the model as a whole, at least within the sample data set; it only affects calculations regarding individual predictors. 
That is, a multiple regression model with correlated predictors can indicate how well the entire bundle of predictors predicts the outcome variable, but it may not give valid results about any individual predictor, or about which predictors are redundant with respect to others.\n\nThe consequences of multicollinearity:\n* Ratings estimates remain unbiased.\n* Standard coefficient errors increase.\n* The calculated t-statistics are underestimated.\n* Estimates become very sensitive to changes in specifications and changes in individual observations.\n* The overall quality of the equation, as well as estimates of variables not related to multicollinearity, remain unaffected.\n* The closer multicollinearity to perfect (strict), the more serious its consequences.\n\nIndicators of multicollinearity: \n1. High R2 and negligible odds.\n2. Strong pair correlation of predictors.\n3. Strong partial correlations of predictors.\n4. High VIF - variance inflation factor.\n\nConfidence interval (CI) is a type of interval estimate (of a population parameter) that is computed from the observed data. The confidence level is the frequency (i.e., the proportion) of possible confidence intervals that contain the true value of their corresponding parameter. In other words, if confidence intervals are constructed using a given confidence level in an infinite number of independent experiments, the proportion of those intervals that contain the true value of the parameter will match the confidence level.\n\nConfidence intervals consist of a range of values (interval) that act as good estimates of the unknown population parameter. However, the interval computed from a particular sample does not necessarily include the true value of the parameter. Since the observed data are random samples from the true population, the confidence interval obtained from the data is also random. If a corresponding hypothesis test is performed, the confidence level is the complement of the level of significance, i.e. a 95% confidence interval reflects a significance level of 0.05. If it is hypothesized that a true parameter value is 0 but the 95% confidence interval does not contain 0, then the estimate is significantly different from zero at the 5% significance level.\n\nThe desired level of confidence is set by the researcher (not determined by data). Most commonly, the 95% confidence level is used. However, other confidence levels can be used, for example, 90% and 99%.\n\nFactors affecting the width of the confidence interval include the size of the sample, the confidence level, and the variability in the sample. A larger sample size normally will lead to a better estimate of the population parameter.\nA Confidence Interval is a range of values we are fairly sure our true value lies in.\n\n`X  ±  Z*s\u002F√(n)`, X is the mean, Z is the chosen Z-value from the table, s is the standard deviation, n is the number of samples. The value after the ± is called the margin of error.\n\n## 22. What’s the difference between Gaussian Mixture Model and K-Means?\nLet's says we are aiming to break them into three clusters. K-means will start with the assumption that a given data point belongs to one cluster.\n\nChoose a data point. At a given point in the algorithm, we are certain that a point belongs to a red cluster. In the next iteration, we might revise that belief, and be certain that it belongs to the green cluster. However, remember, in each iteration, we are absolutely certain as to which cluster the point belongs to. 
This is the \"hard assignment\".\n\nWhat if we are uncertain? What if we think, well, I can't be sure, but there is 70% chance it belongs to the red cluster, but also 10% chance its in green, 20% chance it might be blue. That's a soft assignment. The Mixture of Gaussian model helps us to express this uncertainty. It starts with some prior belief about how certain we are about each point's cluster assignments. As it goes on, it revises those beliefs. But it incorporates the degree of uncertainty we have about our assignment.\n\nKmeans: find kk to minimize `(x−μk)^2`\n\nGaussian Mixture (EM clustering) : find kk to minimize `(x−μk)^2\u002Fσ^2`\n\nThe difference (mathematically) is the denominator “σ^2”, which means GM takes variance into consideration when it calculates the measurement.\nKmeans only calculates conventional Euclidean distance.\nIn other words, Kmeans calculate distance, while GM calculates “weighted” distance.\n\n**K means**:\n* Hard assign a data point to one particular cluster on convergence.\n* It makes use of the L2 norm when optimizing (Min {Theta} L2 norm point and its centroid coordinates).\n\n**EM**:\n* Soft assigns a point to clusters (so it give a probability of any point belonging to any centroid).\n* It doesn't depend on the L2 norm, but is based on the Expectation, i.e., the probability of the point belonging to a particular cluster. This makes K-means biased towards spherical clusters.\n\n## 23. Describe how Gradient Boosting works.\nThe idea of boosting came out of the idea of whether a weak learner can be modified to become better.\n\nGradient boosting relies on regression trees (even when solving a classification problem) which minimize **MSE**. Selecting a prediction for a leaf region is simple: to minimize MSE we should select an average target value over samples in the leaf. The tree is built greedily starting from the root: for each leaf a split is selected to minimize MSE for this step.\n\nTo begin with, gradient boosting is an ensembling technique, which means that prediction is done by an ensemble of simpler estimators. While this theoretical framework makes it possible to create an ensemble of various estimators, in practice we almost always use GBDT — gradient boosting over decision trees. \n\nThe aim of gradient boosting is to create (or \"train\") an ensemble of trees, given that we know how to train a single decision tree. This technique is called **boosting** because we expect an ensemble to work much better than a single estimator.\n\nHere comes the most interesting part. Gradient boosting builds an ensemble of trees **one-by-one**, then the predictions of the individual trees **are summed**: D(x)=d​tree 1​​(x)+d​tree 2​​(x)+...\n\nThe next decision tree tries to cover the discrepancy between the target function f(x) and the current ensemble prediction **by reconstructing the residual**.\n\nFor example, if an ensemble has 3 trees the prediction of that ensemble is:\nD(x)=d​tree 1​​(x)+d​tree 2​​(x)+d​tree 3​​(x). The next tree (tree 4) in the ensemble should complement well the existing trees and minimize the training error of the ensemble.\n\nIn the ideal case we'd be happy to have: D(x)+d​tree 4​​(x)=f(x).\n\nTo get a bit closer to the destination, we train a tree to reconstruct the difference between the target function and the current predictions of an ensemble, which is called the **residual**: R(x)=f(x)−D(x). Did you notice? 
If decision tree completely reconstructs R(x), the whole ensemble gives predictions without errors (after adding the newly-trained tree to the ensemble)! That said, in practice this never happens, so we instead continue the iterative process of ensemble building.\n\n### AdaBoost the First Boosting Algorithm\nThe weak learners in AdaBoost are decision trees with a single split, called decision stumps for their shortness.\n\nAdaBoost works by weighting the observations, putting more weight on difficult to classify instances and less on those already handled well. New weak learners are added sequentially that focus their training on the more difficult patterns.\n**Gradient boosting involves three elements:**\n1. A loss function to be optimized.\n2. A weak learner to make predictions.\n3. An additive model to add weak learners to minimize the loss function.\n\n#### Loss Function\nThe loss function used depends on the type of problem being solved.\nIt must be differentiable, but many standard loss functions are supported and you can define your own.\nFor example, regression may use a squared error and classification may use logarithmic loss.\nA benefit of the gradient boosting framework is that a new boosting algorithm does not have to be derived for each loss function that may want to be used, instead, it is a generic enough framework that any differentiable loss function can be used.\n\n#### Weak Learner\nDecision trees are used as the weak learner in gradient boosting.\n\nSpecifically regression trees are used that output real values for splits and whose output can be added together, allowing subsequent models outputs to be added and “correct” the residuals in the predictions.\n\nTrees are constructed in a greedy manner, choosing the best split points based on purity scores like Gini or to minimize the loss.\nInitially, such as in the case of AdaBoost, very short decision trees were used that only had a single split, called a decision stump. Larger trees can be used generally with 4-to-8 levels.\n\nIt is common to constrain the weak learners in specific ways, such as a maximum number of layers, nodes, splits or leaf nodes.\nThis is to ensure that the learners remain weak, but can still be constructed in a greedy manner.\n\n#### Additive Model\nTrees are added one at a time, and existing trees in the model are not changed.\n\nA gradient descent procedure is used to minimize the loss when adding trees.\nTraditionally, gradient descent is used to minimize a set of parameters, such as the coefficients in a regression equation or weights in a neural network. After calculating error or loss, the weights are updated to minimize that error.\n\nInstead of parameters, we have weak learner sub-models or more specifically decision trees. After calculating the loss, to perform the gradient descent procedure, we must add a tree to the model that reduces the loss (i.e. follow the gradient). 
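\n\nTo make this concrete, here is a minimal illustrative sketch of the residual-fitting loop (not taken from any particular library; the function and parameter names such as `gradient_boosting_fit` are invented for this example), assuming a squared-error loss so that the negative gradient is simply the residual:\n\n```python\nimport numpy as np\nfrom sklearn.tree import DecisionTreeRegressor\n\ndef gradient_boosting_fit(X, y, n_trees=100, learning_rate=0.1, max_depth=3):\n    # Start from a constant prediction (the mean minimizes squared error).\n    baseline = np.mean(y)\n    prediction = np.full(len(y), baseline)\n    trees = []\n    for _ in range(n_trees):\n        # For squared-error loss the negative gradient is just the residual R(x).\n        residual = y - prediction\n        tree = DecisionTreeRegressor(max_depth=max_depth)\n        tree.fit(X, residual)\n        # Add the new tree's output, scaled by the learning rate (shrinkage).\n        prediction += learning_rate * tree.predict(X)\n        trees.append(tree)\n    return baseline, trees\n\ndef gradient_boosting_predict(X, baseline, trees, learning_rate=0.1):\n    # The ensemble prediction D(x) is the baseline plus the sum of the trees' outputs.\n    prediction = np.full(X.shape[0], baseline)\n    for tree in trees:\n        prediction += learning_rate * tree.predict(X)\n    return prediction\n```\n\nEach iteration fits a new tree to the current residual and adds its scaled output to the running prediction, which is exactly the summed form D(x) = d_tree1(x) + d_tree2(x) + ... described above.\n\n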
We do this by parameterizing the tree, then modify the parameters of the tree and move in the right direction by reducing the residual loss.\n\nGenerally this approach is called functional gradient descent or gradient descent with functions.\nThe output for the new tree is then added to the output of the existing sequence of trees in an effort to correct or improve the final output of the model.\n\nA fixed number of trees are added or training stops once loss reaches an acceptable level or no longer improves on an external validation dataset.\n\n### Improvements to Basic Gradient Boosting\nGradient boosting is a greedy algorithm and can overfit a training dataset quickly.\nIt can benefit from regularization methods that penalize various parts of the algorithm and generally improve the performance of the algorithm by reducing overfitting.\nIn this section we will look at 4 enhancements to basic gradient boosting:\n* Tree Constraints\n* Shrinkage\n* Random sampling\n* Penalized Learning\n\n#### Tree Constraints\nIt is important that the weak learners have skill but remain weak.\nThere are a number of ways that the trees can be constrained.\n\nA good general heuristic is that the more constrained tree creation is, the more trees you will need in the model, and the reverse, where less constrained individual trees, the fewer trees that will be required.\n\nBelow are some constraints that can be imposed on the construction of decision trees:\n* Number of trees, generally adding more trees to the model can be very slow to overfit. The advice is to keep adding trees until no further improvement is observed.\n* Tree depth, deeper trees are more complex trees and shorter trees are preferred. Generally, better results are seen with 4-8 levels.\n* Number of nodes or number of leaves, like depth, this can constrain the size of the tree, but is not constrained to a symmetrical structure if other constraints are used.\n* Number of observations per split imposes a minimum constraint on the amount of training data at a training node before a split can be considered\n* Minimum improvement to loss is a constraint on the improvement of any split added to a tree.\n\n#### Weighted Updates\nThe predictions of each tree are added together sequentially.\nThe contribution of each tree to this sum can be weighted to slow down the learning by the algorithm. 
This weighting is called a shrinkage or a learning rate.\n\nEach update is simply scaled by the value of the “learning rate parameter” *v*.\n\nThe effect is that learning is slowed down, which in turn requires more trees to be added to the model and makes training take longer, providing a configuration trade-off between the number of trees and the learning rate.\n\nDecreasing the value of v [the learning rate] increases the best value for M [the number of trees].\n\nIt is common to have small values in the range of 0.1 to 0.3, as well as values less than 0.1.\n\nSimilar to a learning rate in stochastic optimization, shrinkage reduces the influence of each individual tree and leaves space for future trees to improve the model.\n#### Stochastic Gradient Boosting\nA big insight into bagging ensembles and random forest was allowing trees to be greedily created from subsamples of the training dataset.\n\nThis same benefit can be used to reduce the correlation between the trees in the sequence in gradient boosting models.\n\nThis variation of boosting is called stochastic gradient boosting.\n\nAt each iteration a subsample of the training data is drawn at random (without replacement) from the full training dataset. The randomly selected subsample is then used, instead of the full sample, to fit the base learner.\n\nA few variants of stochastic boosting that can be used:\n* Subsample rows before creating each tree.\n* Subsample columns before creating each tree.\n* Subsample columns before considering each split.\n\nGenerally, aggressive sub-sampling such as selecting only 50% of the data has been shown to be beneficial. According to user feedback, using column sub-sampling prevents over-fitting even more so than the traditional row sub-sampling.\n#### Penalized Gradient Boosting\nAdditional constraints can be imposed on the parameterized trees in addition to their structure.\nClassical decision trees like CART are not used as weak learners; instead a modified form called a regression tree is used that has numeric values in the leaf nodes (also called terminal nodes). The values in the leaves of the trees can be called weights in some literature.\n\nAs such, the leaf weight values of the trees can be regularized using popular regularization functions, such as:\n* L1 regularization of weights.\n* L2 regularization of weights.\n\nThe additional regularization term helps to smooth the final learnt weights to avoid over-fitting. Intuitively, the regularized objective will tend to select a model employing simple and predictive functions.\n\nMore details in two posts (Russian):\n* https:\u002F\u002Fhabr.com\u002Fcompany\u002Fods\u002Fblog\u002F327250\u002F\n* https:\u002F\u002Falexanderdyakonov.files.wordpress.com\u002F2017\u002F06\u002Fbook_boosting_pdf.pdf\n\n## 24. Difference between AdaBoost and XGBoost.\nBoth methods combine weak learners into one strong learner. For example, one decision tree is a weak learner, and an ensemble of them would be a random forest model, which is a strong learner.\n\nDuring training, both methods grow the ensemble of weak learners, adding a new weak learner to the ensemble at each training iteration, i.e. in the case of a forest, the forest grows with new trees. The only difference between AdaBoost and XGBoost is how the ensemble is replenished.\n\nAdaBoost works by weighting the observations, putting more weight on difficult to classify instances and less on those already handled well. 
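\n\nAs an illustration only (a sketch of the classic discrete AdaBoost update with labels in {-1, +1}; the helper name `adaboost_reweight` is invented for this example), the reweighting step looks roughly like this:\n\n```python\nimport numpy as np\n\ndef adaboost_reweight(weights, y_true, y_pred):\n    # Weighted error rate of the current weak learner (ignoring the err == 0 edge case).\n    miss = (y_pred != y_true).astype(float)\n    err = np.sum(weights * miss) \u002F np.sum(weights)\n    # Learner weight (alpha): the more accurate the weak learner, the larger its say.\n    alpha = 0.5 * np.log((1.0 - err) \u002F err)\n    # Raise the weight of misclassified samples, lower it for correctly classified ones.\n    weights = weights * np.exp(alpha * np.where(miss == 1.0, 1.0, -1.0))\n    return weights \u002F np.sum(weights), alpha\n```\n\n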
New weak learners are added sequentially that focus their training on the more difficult patterns. At each iteration AdaBoost changes the sample weights. It raises the weights of the samples on which more mistakes were made. The sample weights vary in proportion to the ensemble error. We thereby change the probability distribution of the samples - those that have more weight will be selected more often in the future. It is as if we had accumulated the samples on which more mistakes were made and would use them instead of the original sample. In addition, in AdaBoost each weak learner has its own weight in the ensemble (the alpha weight) - the “smarter” the weak learner is, i.e. the less likely it is to make mistakes, the higher this weight.\n\nXGBoost does not change the selection or the distribution of observations at all. XGBoost builds the first tree (weak learner), which will fit the observations with some prediction error. A second tree (weak learner) is then added to correct the errors made by the existing model. Errors are minimized using a gradient descent algorithm. Regularization can also be used to penalize more complex models through both Lasso and Ridge regularization.\n\nIn short: AdaBoost reweights the training examples; gradient boosting has each new tree predict the gradient of the loss (the residuals); XGBoost additionally adds a regularization term to the loss function (tree depth + the values in the leaves).\n\n## 25. Data Mining: Describe the decision tree model\nA decision tree is a structure that includes a root node, branches, and leaf nodes. Each internal node denotes a test on an attribute, each branch denotes the outcome of a test, and each leaf node holds a class label. The topmost node in the tree is the root node.\n\nEach internal node represents a test on an attribute. Each leaf node represents a class.\nThe benefits of having a decision tree are as follows:\n* It does not require any domain knowledge.\n* It is easy to comprehend.\n* The learning and classification steps of a decision tree are simple and fast.\n\n**Tree Pruning**\n\nTree pruning is performed in order to remove anomalies in the training data due to noise or outliers. The pruned trees are smaller and less complex.\n\n**Tree Pruning Approaches**\n\nHere are the tree pruning approaches:\n* Pre-pruning - The tree is pruned by halting its construction early.\n* Post-pruning - This approach removes a sub-tree from a fully grown tree.\n\n**Cost Complexity**\n\nThe cost complexity is measured by two parameters: the number of leaves in the tree and the error rate of the tree.\n\n## 26. Notes from Coursera Deep Learning courses by Andrew Ng\n[Notes from Coursera Deep Learning courses by Andrew Ng](https:\u002F\u002Fpt.slideshare.net\u002FTessFerrandez\u002Fnotes-from-coursera-deep-learning-courses-by-andrew-ng\u002F)\n\n## 27. What is a neural network?\nNeural networks are typically organized in layers. Layers are made up of a number of interconnected 'nodes' which contain an 'activation function'. Patterns are presented to the network via the 'input layer', which communicates to one or more 'hidden layers' where the actual processing is done via a system of weighted 'connections'. The hidden layers then link to an 'output layer' where the answer is output as shown in the graphic below.\n\nAlthough there are many different kinds of learning rules used by neural networks, this demonstration is concerned only with one: the delta rule. 
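\n\nStated compactly (an illustrative sketch added here, assuming a single linear output unit): the delta rule adjusts each weight in proportion to the prediction error times the corresponding input, Δw_i = η (t - y) x_i, where η is the learning rate, t the target and y the unit's output. For example:\n\n```python\nimport numpy as np\n\ndef delta_rule_update(weights, x, target, learning_rate=0.1):\n    # Forward pass: the unit's output is a weighted sum of its inputs.\n    output = np.dot(weights, x)\n    # Error between the desired (target) and actual output.\n    error = target - output\n    # Adjust each weight in proportion to the error and its input.\n    return weights + learning_rate * error * x\n```\n\n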
The delta rule is often utilized by the most common class of ANNs called 'backpropagation neural networks' (BPNNs). Backpropagation is an abbreviation for the backwards propagation of error. With the delta rule, as with other types of back propagation, 'learning' is a supervised process that occurs with each cycle or 'epoch' (i.e. each time the network is presented with a new input pattern) through a forward activation flow of outputs, and the backwards error propagation of weight adjustments. More simply, when a neural network is initially presented with a pattern it makes a random 'guess' as to what it might be. It then sees how far its answer was from the actual one and makes an appropriate adjustment to its connection weights. More graphically, the process looks something like this: \n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_3f39a5cb4d68.png)\n\nBackpropagation performs a gradient descent within the solution's vector space towards a 'global minimum' along the steepest vector of the error surface. The global minimum is that theoretical solution with the lowest possible error. The error surface itself is a hyperparaboloid but is seldom 'smooth'. Indeed, in most problems, the solution space is quite irregular with numerous 'pits' and 'hills' which may cause the network to settle down in a 'local minimum' which is not the best overall solution.\n\nSince the nature of the error space can not be known a priori, neural network analysis often requires a large number of individual runs to determine the best solution. Most learning rules have built-in mathematical terms to assist in this process which control the 'speed' (Beta-coefficient) and the 'momentum' of the learning. The speed of learning is actually the rate of convergence between the current solution and the global minimum. Momentum helps the network to overcome obstacles (local minima) in the error surface and settle down at or near the global minimum.\n\nOnce a neural network is 'trained' to a satisfactory level it may be used as an analytical tool on other data. To do this, the user no longer specifies any training runs and instead allows the network to work in forward propagation mode only. New inputs are presented to the input pattern where they filter into and are processed by the middle layers as though training were taking place, however, at this point the output is retained and no backpropagation occurs. The output of a forward propagation run is the predicted model for the data which can then be used for further analysis and interpretation.\n\n## 28. How do you deal with sparse data?\nWe could take a look at L1 regularization since it best fits the sparse data and does feature selection. If linear relationship - linear regression either - svm. Also it would be nice to use one-hot-encoding or bag-of-words. \nA one hot encoding is a representation of categorical variables as binary vectors.\nThis first requires that the categorical values be mapped to integer values.\nThen, each integer value is represented as a binary vector that is all zero values except the index of the integer, which is marked with a 1.\n\n## 29. 
RNN and LSTM\nHere are a few of my favorites:\n* [Understanding LSTM Networks, Chris Olah's LSTM post](http:\u002F\u002Fcolah.github.io\u002Fposts\u002F2015-08-Understanding-LSTMs\u002F)\n* [Exploring LSTMs, Edwin Chen's LSTM post](http:\u002F\u002Fblog.echen.me\u002F2017\u002F05\u002F30\u002Fexploring-lstms\u002F)\n* [The Unreasonable Effectiveness of Recurrent Neural Networks, Andrej Karpathy's blog post](http:\u002F\u002Fkarpathy.github.io\u002F2015\u002F05\u002F21\u002Frnn-effectiveness\u002F)\n* [CS231n Lecture 10 - Recurrent Neural Networks, Image Captioning, LSTM, Andrej Karpathy's lecture](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=iX5V1WpxxkY)\n* [Jay Alammar's The Illustrated Transformer](http:\u002F\u002Fjalammar.github.io\u002Fillustrated-transformer\u002F) (the author generally focuses on visualizing different ML concepts)\n\n## 30. Pseudo Labeling\nPseudo-labeling is a technique that allows you to use test data that was predicted with **confidence** in your training process. It effectively works by allowing your model to look at more samples, possibly from varying distributions. I have found [this](https:\u002F\u002Fwww.kaggle.com\u002Fcdeotte\u002Fpseudo-labeling-qda-0-969) Kaggle kernel to be useful in understanding how one can use pseudo-labeling when there are too few training data points.\n\n## 31. Knowledge Distillation\nIt is the process by which a considerably larger model is able to transfer its knowledge to a smaller one. Applications include NLP and object detection, allowing less powerful hardware to make good inferences without significant loss of accuracy.\n\nExample: model compression, which is used to compress the knowledge of multiple models into a single neural network.\n\n[Explanation](https:\u002F\u002Fnervanasystems.github.io\u002Fdistiller\u002Fknowledge_distillation.html)\n\n## 32. What is an inductive bias?\nA model's inductive bias refers to the assumptions made within that model in order to learn the target function from the independent variables, your features. Without these assumptions, there is a whole space of possible solutions to the problem, and finding the one that works best becomes a problem in itself. I found [this](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F35655267\u002Fwhat-is-inductive-bias-in-machine-learning) StackOverflow question useful to look at and explore.\n\nConsider an example of an inductive bias when choosing a learning algorithm with the minimum cross-validation (CV) error. Here, we **rely** on the hypothesis of the minimum CV error and **hope** it is able to generalize well on the data yet to be seen. Effectively, this choice is what helps us (in this case) make a choice in favor of the learning algorithm (or model) being tried.\n\n## 33. What is a confidence interval in layman's terms?\nA confidence interval, as the name suggests, is an interval of values together with the amount of confidence that the desired quantity lies inside it. For example: if the range 100 - 200 is a 95% confidence interval, it implies that someone can have 95% assurance that the value being estimated lies in that range.\n","[来源](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1ajyJhXyt4q9ZsufXV1kZxDH_3Isg3MYAKsFqNytXrCw\u002F)\n\n- [1. 为什么使用特征选择？](#1-why-do-you-use-feature-selection)\n    - [过滤法](#filter-methods)\n    - [嵌入法](#embedded-methods)\n    - [误导性](#misleading)\n    - [过拟合](#overfitting)\n- [2. 解释什么是正则化，以及它为什么有用。](#2-explain-what-regularization-is-and-why-it-is-useful)\n- [3. 
L1 正则化和 L2 正则化有什么区别？](#3-whats-the-difference-between-l1-and-l2-regularization)\n- [4. 如果你创建了一个使用多元回归来预测定量结果变量的模型，你会如何验证这个模型？](#4-how-would-you-validate-a-model-you-created-to-generate-a-predictive-model-of-a-quantitative-outcome-variable-using-multiple-regression)\n- [5. 解释精确率和召回率是什么。它们与 ROC 曲线有何关系？](#5-explain-what-precision-and-recall-are-how-do-they-relate-to-the-roc-curve)\n- [6. 是假阳性过多更好，还是假阴性过多更好？](#6-is-it-better-to-have-too-many-false-positives--or-too-many-false-negatives)\n- [7. 如何处理不平衡的二分类问题？](#7-how-do-you-deal-with-unbalanced-binary-classification)\n- [8. 什么是统计功效？](#8-what-is-statistical-power)\n- [9. 偏差和方差分别是什么？它们与数据建模有什么关系？](#9-what-are-bias-and-variance--and-what-are-their-relation-to-modeling-data)\n    - [方法](#approaches)\n- [10. 如果类别不平衡怎么办？如果有超过两个类别呢？](#10-what-if-the-classes-are-imbalanced-what-if-there-are-more-than-2-groups)\n- [11. 有哪些方法可以使我的模型对异常值更鲁棒？](#11-what-are-some-ways-i-can-make-my-model-more-robust-to-outliers)\n- [12. 在无监督学习中，如果数据集的真实标签未知，我们如何确定最合适的聚类数量？](#12-in-unsupervised-learning--if-a-ground-truth-about-a-dataset-is-unknown--how-can-we-determine-the-most-useful-number-of-clusters-to-be)\n- [13. 定义方差](#13-define-variance)\n- [14. 期望值](#14-expected-value)\n- [15. 描述箱线图和直方图的区别及其应用场景](#15-describe-the-differences-between-and-use-cases-for-box-plots-and-histograms)\n- [16. 你将如何在一个分布中发现异常值？](#16-how-would-you-find-an-anomaly-in-a-distribution)\n    - [统计方法](#statistical-methods)\n    - [度量方法](#metric-methods)\n- [17. 你如何处理数据中的异常值？](#17-how-do-you-deal-with-outliers-in-your-data)\n- [18. 你如何处理稀疏数据？](#18-how-do-you-deal-with-sparse-data)\n- [19. 大数据工程师：你能解释一下什么是 REST 吗？](#19-big-data-engineer-can-you-explain-what-rest-is)\n- [20. 逻辑回归](#20-logistic-regression)\n- [21. 如果两个预测变量高度相关，这对逻辑回归的系数会有什么影响？这些系数的置信区间是多少？](#21-what-is-the-effect-on-the-coefficients-of-logistic-regression-if-two-predictors-are-highly-correlated-what-are-the-confidence-intervals-of-the-coefficients)\n- [22. 高斯混合模型和 K-Means 有什么区别？](#22-whats-the-difference-between-gaussian-mixture-model-and-k-means)\n- [23. 描述梯度提升的工作原理。](#23-describe-how-gradient-boosting-works)\n  - [AdaBoost：第一个提升算法](#adaboost-the-first-boosting-algorithm)\n    - [损失函数](#loss-function)\n    - [弱学习器](#weak-learner)\n    - [加法模型](#additive-model)\n  - [基础梯度提升的改进](#improvements-to-basic-gradient-boosting)\n    - [树约束](#tree-constraints)\n    - [加权更新](#weighted-updates)\n    - [随机梯度提升](#stochastic-gradient-boosting)\n    - [惩罚式梯度提升](#penalized-gradient-boosting)\n- [24. AdaBoost 和 XGBoost 的区别](#24-difference-between-AdaBoost-and-XGBoost)\n- [25. 数据挖掘：描述决策树模型。](#25-data-mining-describe-the-decision-tree-model)\n- [26. 来自吴恩达 Coursera 深度学习课程的笔记](#26-notes-from-coursera-deep-learning-courses-by-andrew-ng)\n- [27. 什么是神经网络？](#27-what-is-a-neural-network)\n- [28. 你如何处理稀疏数据？](#28-how-do-you-deal-with-sparse-data)\n- [29. RNN 和 LSTM](#29-rnn-and-lstm)\n- [30. 伪标签](#30-pseudo-labeling)\n- [31. 知识蒸馏](#31-knowledge-distillation)\n- [32. 什么是归纳偏置？](#32-what-is-an-inductive-bias)\n- [33. 用通俗易懂的语言解释置信区间是什么？](#33-confidence-interval-in-layman's-terms)\n\n## 1. 
为什么使用特征选择？\n特征选择是为模型构建选择相关特征子集的过程。特征选择本身很有用，但它主要起到过滤作用，除了你现有的特征外，还能屏蔽掉那些不相关的特征。\n特征选择方法有助于你创建一个准确的预测模型。它们通过挑选能够提供同样或更高准确度、同时所需数据更少的特征来帮助你实现这一目标。\n特征选择方法可用于识别并移除数据中那些对预测模型准确性没有贡献，甚至可能降低其准确性的多余、无关和冗余属性。\n减少属性数量是可取的，因为它可以降低模型的复杂性，而更简单的模型更容易理解和解释。\n#### 过滤法\n过滤法特征选择会应用统计指标为每个特征赋分。特征会根据得分进行排序，然后决定哪些保留、哪些从数据集中移除。这些方法通常是单变量的，只考虑单个特征，或者将特征与因变量联系起来进行评估。\n一些常见的过滤法包括卡方检验、信息增益和相关系数评分等。\n#### 嵌入法\n嵌入法会在模型构建过程中学习哪些特征最能提升模型的准确性。最常见的嵌入式特征选择方法是正则化方法。\n正则化方法也称为惩罚方法，它在预测算法（如回归算法）的优化过程中引入额外的约束条件，使模型倾向于更低的复杂度（即更少的系数）。\n正则化算法的例子有LASSO、弹性网络和岭回归等。\n#### 误导性\n包含冗余属性可能会误导建模算法。基于实例的方法，例如k近邻算法，会利用属性空间中的小邻域来确定分类和回归预测结果。然而，这些预测结果可能会被冗余属性严重扭曲。\n#### 过拟合\n在数据集中保留不相关的属性可能导致过拟合。决策树算法（如C4.5）会尝试在属性值上做出最优划分，优先选择与预测结果相关性较高的属性进行分割。而在树的更深层次，相关性较低或完全无关的属性则会被用来做出预测决策，这些决策往往只是在训练数据集中偶然有效。这种对训练数据的过度拟合会负面影响模型的泛化能力，并降低预测准确性。\n\n## 2. 解释什么是正则化以及它为何有用。\n正则化是在模型中添加一个调节参数以引入平滑性，从而防止[过拟合](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FOverfitting)的过程。\n这通常通过向现有权重向量添加一个常数倍数来实现。这个常数通常是[L1（Lasso）](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLasso_(statistics))或[L2（岭）](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FTikhonov_regularization)，但实际上也可以是任何范数。模型的预测结果应使正则化后的训练集上计算出的损失函数均值最小化。\n正如其他人所解释的那样，众所周知，L1正则化有助于在稀疏特征空间中进行特征选择，这也是在某些情况下使用L1的一个很好的实际理由。然而，除了这个特定原因之外，我从未见过L1在实践中比L2表现更好的情况。如果你查看[LIBLINEAR常见问题解答](https:\u002F\u002Fwww.csie.ntu.edu.tw\u002F~cjlin\u002Fliblinear\u002FFAQ.html#l1_regularized_classification)中关于此问题的内容，就会发现他们并没有遇到L1优于L2的实际案例，并鼓励用户如果发现这样的例子就与他们联系。即使在你可能需要利用L1的稀疏性来进行特征选择的情况下，对剩余变量使用L2通常也会比单独使用L1获得更好的效果。\n\n## 3. L1和L2正则化的区别是什么？\n正则化是机器学习中防止过拟合的一项非常重要的技术。从数学角度来看，它通过添加正则化项来限制系数过度拟合数据，从而避免过拟合现象的发生。L1（Lasso）和L2（Ridge）的区别在于：L2（Ridge）是权重平方的和，而L1（Lasso）则是均方误差或其他损失函数中权重绝对值的和。如下所示：\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_ba61756c14f0.png)\n两者的特性差异可以简要总结如下：\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_7342efb82e8a.png)\n\n**解的唯一性**是一个较为简单但需要一定想象力的问题。首先，请看下图：\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_c524c7afe38c.png)\n\n## 4. 如果你建立了一个使用多元回归来预测定量结果变量的模型，你会如何验证该模型？\n[建议的模型验证方法](http:\u002F\u002Fsupport.sas.com\u002Fresources\u002Fpapers\u002Fproceedings12\u002F333-2012.pdf)如下：\n* 如果模型预测的数值远远超出响应变量的范围，则立即表明估计不佳或模型不准确。\n* 如果预测值看起来合理，则应检查模型参数；以下任何一种情况都可能表明估计不佳或存在多重共线性：参数预期符号相反、数值异常大或异常小，或者在输入新数据时出现不一致现象。\n* 使用模型进行预测，输入新数据，并以[决定系数](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCoefficient_of_determination)（R²）作为模型有效性的衡量标准。\n* 将数据集拆分为两部分：一部分用于估计模型参数，另一部分用于验证预测结果。\n* 如果数据集中的样本数量较少，可以使用[刀切重采样法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FJackknife_resampling)，并以R²和[均方误差](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMean_squared_error)（MSE）来衡量模型的有效性。\n\n## 5. 解释什么是精确率和召回率。它们与ROC曲线有何关系？\n计算精确率和召回率其实非常简单。假设在10,000个样本中，有100个正例。你想预测哪些是正例，并选择了200个样本，以便更有机会捕捉到这100个正例中的大部分。你记录下这些预测的样本编号，待实际结果出来后，统计自己预测对了多少、错了多少。总共有四种可能的结果：\n1. TN \u002F 真负：样本为负例且被正确预测为负例\n2. TP \u002F 真正例：样本为正例且被正确预测为正例\n3. FN \u002F 假负：样本为正例但被错误地预测为负例\n4. 
FP \u002F 假正：样本为负例但被错误地预测为正例\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_42c8db9502db.png)\n\n这时，你的老板问了你三个问题：\n* 你的预测中有百分之多少是正确的？\n你回答：准确率是(9,760+60)除以10,000，即98.2%。\n* 你捕获了多少百分比的正例？\n你回答：召回率是60除以100，即60%。\n* 在所有被预测为正例的样本中，有多少百分比是真正的正例？\n你回答：精确率是60除以200，即30%。\n你也可以参考维基百科上关于[精确率与召回率](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPrecision_and_recall)的详细解释。\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_e8c5324e2e5f.jpg)\n\nROC曲线展示了敏感性（即召回率）与特异性（而非精确率）之间的关系，常用于评估二分类模型的性能。然而，在处理严重不平衡的数据集时，[精确率-召回率（PR）]曲线更能真实反映模型的表现。需要注意的是，ROC曲线展示的是敏感性（召回率）与特异性之间的关系；敏感性就是召回率的另一种说法，而特异性则不同于精确率。\n\n召回率\u002F敏感性衡量的是，在所有真实标签为正例的样本中，你的模型预测为正例的概率。它反映了正例样本中有多少被正确识别了出来。特异性则是指，在所有真实标签为负例的样本中，你的模型预测为负例的概率。它衡量的是负例样本中有多少被正确识别为负例。\n\n相比之下，精确率则有所不同。它表示在模型预测为正例的样本中，该样本确实是正例的概率。换句话说，它是衡量模型预测为正例的样本中有多少确实属于正例。需要注意的是，当正类的基础概率或先验概率发生变化时，精确率也会随之改变。也就是说，精确率取决于正类的稀有程度。换言之，当正类比负类更值得关注时，我们才会使用精确率作为评价指标。\n\n* 敏感性，也称为真正例率或召回率，其计算公式为：\n`敏感性 = TP \u002F (TP + FN)`。由于该公式中不包含FP和TN，因此对于类别极度不平衡的情况，敏感性可能会产生偏差。以欺诈检测为例，敏感性表示在所有实际欺诈案例中，被正确预测为欺诈的比例，而不考虑非欺诈案例。\n* 特异性，又称真负例率，其计算公式为：\n`特异性 = TN \u002F (TN + FP)`。同样地，由于公式中未涉及FN和TP，特异性在处理类别不平衡数据时也可能出现偏差。以欺诈检测为例，特异性表示在所有实际非欺诈案例中，被正确预测为非欺诈的比例。\n\n[使用ROC曲线评估和比较分类器性能](https:\u002F\u002Fmachinelearningmastery.com\u002Fassessing-comparing-classifier-performance-roc-curves-2\u002F)\n\n## 6. 是假阳性过多更好，还是假阴性过多更好？\n这取决于具体的问题以及我们所解决的应用领域。\n\n在医学检测中，假阴性可能会给患者和医生带来一种虚假的安心感，让他们误以为疾病不存在，而实际上疾病却已经存在。这种情况有时会导致对患者及其疾病的治疗不当或不足。因此，在这种情况下，我们更希望假阳性多一些。\n\n而在垃圾邮件过滤中，假阳性是指垃圾邮件过滤或拦截技术错误地将一封合法的电子邮件判定为垃圾邮件，从而影响其正常送达。尽管大多数反垃圾邮件策略能够拦截或过滤掉大量不需要的邮件，但在尽量减少假阳性的前提下做到这一点却更加困难。因此，在垃圾邮件过滤场景下，我们更倾向于接受较多的假阴性，而不是假阳性。\n\n## 7. 如何处理不平衡的二分类问题？\n数据不平衡通常指分类问题中各类别样本数量不均衡的情况。  \n例如，你可能有一个包含100个样本（行）的二分类问题。其中80个样本被标记为类别1，其余20个样本被标记为类别2。\n\n这是一个不平衡的数据集，类别1与类别2的样本比例为80:20，也可以简写为4:1。  \n类别不平衡问题既可能出现在二分类问题中，也可能出现在多分类问题中，大多数处理方法在这两类问题中都可以适用。  \n在接下来的讨论中，我们将以二分类问题为例，因为这样更容易理解和描述。\n\n1. 是否可以收集更多数据？\u003C\u002Fbr>  \n更大的数据集可能会揭示不同且更平衡的类别分布情况。  \n稍后在讨论数据重采样时，我们还会提到增加少数类样本的重要性。\n\n2. 尝试更换评估指标\u003C\u002Fbr>  \n在处理不平衡数据集时，准确率并不是合适的评估指标，因为它容易产生误导。  \n基于之前的讨论，我建议使用以下几种更能反映模型性能的指标，而不是传统的分类准确率：\n  - [混淆矩阵](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FConfusion_matrix)：将预测结果以表格形式展示，显示正确预测（对角线部分）以及错误预测的类型（即错误预测被归为哪些类别）。\n  - [精确率](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FInformation_retrieval#Precision)：衡量分类器的准确性。精确率等于真正例数除以真正例数加假正例数。换句话说，它是正类预测总数除以实际为正类的样本总数。精确率也被称为[阳性预测值（PPV）](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FPositive_and_negative_predictive_values)。精确率可以被视为分类器准确性的度量；较低的精确率可能意味着存在较多的假正例。\n  - [召回率](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FInformation_retrieval#Recall)：衡量分类器的完整程度。召回率等于真正例数除以真正例数加假负例数。换言之，它是正类预测总数除以测试集中实际为正类的样本总数。召回率也称为灵敏度或真正例率。召回率可以被视为分类器完整性的度量；较低的召回率表明存在较多的假负例。\n  - [F1分数（或F分数）](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FF1_score)：精确率和召回率的加权平均值。\n此外，我还建议关注以下内容：\n  - Kappa系数（或[Cohen’s kappa](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FCohen%27s_kappa)）：根据数据中各类别的不平衡程度对分类准确率进行标准化调整。\n  - ROC曲线：类似于精确率和召回率，准确率可以分解为敏感性和特异性，从而可以根据这些指标的平衡点来选择模型。\n\n3. 尝试对数据集进行重采样\n  * 可以通过过采样（即有放回地复制少数类样本）来增加少数类样本的数量。\n  * 也可以通过欠采样（即删除多数类样本）来减少多数类样本的数量。\n\n5. 尝试不同的算法\n6. 
尝试使用惩罚机制的模型\u003C\u002Fbr>  \n你可以使用相同的算法，但为其提供一种不同的视角来解决不平衡问题。  \n惩罚性分类会在训练过程中为模型在少数类上的分类错误施加额外的成本，从而使模型更加关注少数类。  \n通常，惩罚权重或惩罚项的设置是针对特定学习算法定制的。例如，存在惩罚版的支持向量机（SVM）和惩罚版的线性判别分析（LDA）等算法。  \n如果你已经确定了某种算法，无法进行数据重采样，或者当前的结果不够理想，那么使用惩罚机制是一个值得尝试的方法。它提供了另一种“平衡”各类别的方式。不过，设置惩罚矩阵可能会比较复杂，你很可能需要尝试多种惩罚方案，才能找到最适合你问题的设置。\n\n7. 尝试从不同角度思考\u003C\u002Fbr>  \n从以下角度重新审视和思考你的问题，有时可能会激发新的思路。  \n两个值得考虑的方向是异常检测和变化检测。\n\n## 8. 什么是统计功效？\n[统计功效或检验效能](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FStatistical_power)是指在二元假设检验中，当备择假设（H1）为真时，检验正确拒绝原假设（H0）的概率。\n\n它也可以理解为：当备择假设为真时，检验接受备择假设的概率——即检验在效应确实存在时检测出该效应的能力。\n\n换句话说，[统计功效](https:\u002F\u002Feffectsizefaq.com\u002F2010\u002F05\u002F31\u002Fwhat-is-statistical-power\u002F)就是研究在效应存在时检测出该效应的可能性。统计功效越高，犯第二类错误（即实际上存在效应却得出不存在效应的结论）的概率就越低。\n\n第一类错误（或I型错误）是指错误地拒绝一个真实的原假设。通常，第一类错误会导致人们误以为某种效应或关系存在，而实际上并不存在。例如，一项检测结果显示患者患有某种疾病，但实际上患者并未患病；火灾报警器误报火情，而实际上并无火灾发生；或者一项实验声称某种治疗方法能够治愈疾病，但事实并非如此。\n\n第二类错误（或II型错误）则是未能拒绝一个错误的原假设。例如，血液检测本应发现某位确实患病的患者的疾病，却未能检出；火灾已经发生，但火灾报警器却没有响起；或者一项临床试验未能证明某种治疗方法有效，而实际上该方法确实有效。\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_ed96aa6ebc65.png)\n\n## 9. 什么是偏差和方差？它们与数据建模有何关系？\n**偏差**是指模型预测结果与真实值之间的偏离程度；而**方差**则表示模型在不同迭代过程中预测结果的变化幅度。\n\n偏差通常是指你在训练数据上构建的最佳模型与“真实模型”（即生成数据的真实过程）之间的差距。\n\n**由偏差引起的误差**：由于底层数据集的随机性，生成的模型会产生一系列预测值。[偏差](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBias_of_an_estimator)用于衡量这些模型的预测值与真实值之间的平均偏离程度。偏差是由学习算法中的错误假设导致的误差。高偏差可能导致算法无法捕捉特征与目标输出之间的相关关系（欠拟合）。\n\n**由方差引起的误差**：方差是指对于给定的数据点，模型预测结果的变异性。再次设想可以多次重复整个模型构建过程。方差就是针对某个特定数据点，不同模型实现之间的预测结果差异程度。方差是由模型对训练集中小幅波动过于敏感而产生的误差。\n\n高方差可能导致算法过度拟合训练数据中的随机[噪声](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FNoise_(signal_processing))，而不是真正想要学习的目标输出（过拟合）。\n\n大数据集 -> 低方差\u003Cbr\u002F>\n小数据集 -> 高方差\u003Cbr\u002F>\n特征较少 -> 高偏差、低方差\u003Cbr\u002F>\n特征较多 -> 低偏差、高方差\u003Cbr\u002F>\n复杂模型 -> 低偏差\u003Cbr\u002F>\n简单模型 -> 高偏差\u003Cbr\u002F>\n降低λ -> 低偏差\u003Cbr\u002F>\n增加λ -> 低方差\u003Cbr\u002F>\n\n我们可以通过靶心图来直观地展示偏差和方差的关系。假设靶心代表一个能够完美预测正确值的模型。越远离靶心，预测效果就越差。如果我们能重复进行多次模型构建，就会在靶上得到若干个落点。每个落点都代表一次基于随机变化的训练数据所得到的模型实现。有时训练数据分布良好，我们的预测会非常接近靶心；而有时训练数据中可能包含大量异常值或非标准值，导致预测效果较差。这些不同的实现会在靶上形成散点分布。\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_fa9c304a970f.jpg)\n\n[以一个简单的例子](https:\u002F\u002Fwww.kdnuggets.com\u002F2016\u002F08\u002Fbias-variance-tradeoff-overview.html)说明，比如一项存在缺陷的总统选举民调，其误差就可以通过偏差和方差这两个角度来解释：从电话簿中选取调查对象是偏差的来源；样本量过小则是方差的来源。\n\n要最小化模型的总误差，关键在于平衡偏差和方差带来的误差。理想情况下，模型应基于无偏且方差较低的数据集构建。然而，实际情况是，随着模型复杂度的增加，偏差往往会降低，但方差却会增大。因此，最优的模型需要在这两种特性之间找到平衡。\n\n交叉验证这一统计评估方法，不仅有助于展示这种平衡的重要性，还能实际帮助我们寻找这种平衡。在k折交叉验证中，选择多少折是一个重要的决策；折数越少，误差估计中的偏差就越大，而方差则越小。\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_723b1e0f49c2.jpg)\n\n最重要的启示是，在构建模型时，偏差和方差是一对相互制衡的重要因素，即便是最常规的统计评估方法，也直接依赖于这种权衡。\n\n我们可以使用线性回归或其他建模技术来估计模型f̂(X)对f(X)的近似值。在这种情况下，某一点x处的期望平方预测误差为：\n`Err(x)=E[(Y−f̂ (x))^2]`\n\n该误差可以分解为偏差和方差两部分：\n`Err(x)=(E[f̂ (x)]−f(x))^2+E[(f̂ (x)−E[f̂ 
(x)])^2]+σ^2e`\n`Err(x)=Bias^2+Variance+Irreducible`\n\n其中第三项“不可约误差”指的是真实关系中存在的噪声成分，无论使用何种模型都无法从根本上消除。如果拥有真实的模型以及无限量的校准数据，理论上我们可以将偏差和方差两项都降至零。然而，在现实世界中，由于模型并不完美且数据量有限，我们在最小化偏差和最小化方差之间必须做出权衡。\n\n第三项“不可约误差”是真实关系中无法被任何模型消除的噪声成分。尽管有了真实模型和无限的数据来进行校准，理论上我们仍可将偏差和方差降至零。但在现实中，由于模型的不完善和数据量的有限性，我们在减少偏差和减少方差之间始终存在着权衡。\n\n如果模型存在高偏差问题，说明该模型较为简单。为了提高模型的鲁棒性，我们可以在特征空间中加入更多特征。而增加数据点则可以降低方差。\n\n偏差-方差权衡问题是监督学习中的核心问题。理想情况下，我们希望选择一种既能准确捕捉训练数据中规律性，又能很好地泛化到未见数据上的模型。然而，通常很难同时做到这两点。高方差的学习方法虽然能够很好地拟合训练集，但却容易过拟合到含有噪声或不具代表性的训练数据上。相反，高偏差的算法往往产生较为简单的模型，这类模型不容易过拟合，但可能会出现欠拟合现象，从而无法捕捉到数据中的重要规律。\n\n低偏差的模型通常更为复杂（例如高阶回归多项式），这使得它们能够更准确地拟合训练集。然而，在此过程中，它们也可能将训练集中大量的噪声成分纳入考虑，从而导致预测精度下降——尽管模型本身更加复杂。相比之下，高偏差的模型往往较为简单（如低阶甚至线性回归多项式），但在应用于训练集之外的数据时，其预测结果的方差通常更低。\n\n#### 方法\n\n[降维](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDimensionality_reduction)和[特征选择](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FFeature_selection)可以通过简化模型来降低方差。同样，更大的训练集通常也会降低方差。增加特征（预测变量）往往会降低偏差，但会引入额外的方差。学习算法通常具有一些可调参数，用于控制偏差和方差，例如：\n* （广义）线性模型可以通过[正则化](#2-explain-what-regularization-is-and-why-it-is-useful)来降低其方差，但会增加偏差。\n* 在人工神经网络中，隐藏单元的数量越多，方差越大，偏差越小。与广义线性模型类似，通常也会应用正则化。\n* 在k近邻模型中，较大的k值会导致高偏差和低方差（见下文）。\n* 在基于实例的学习中，可以通过调整原型和示例的混合比例来实现正则化。\n* 在决策树中，树的深度决定了方差的大小。为了控制方差，决策树通常会进行剪枝。\n\n解决偏差-方差权衡的一种方法是使用[混合模型](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMixture_model)和[集成学习](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEnsemble_learning)。例如，[提升法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBoosting_(machine_learning))将许多“弱”（高偏差）模型组合成一个集成模型，其偏差低于单个模型；而[自助聚合法](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FBootstrap_aggregating)则以降低方差的方式组合“强”学习器。\n\n[理解偏差-方差权衡](http:\u002F\u002Fscott.fortmann-roe.com\u002Fdocs\u002FBiasVariance.html)\n\n\n\n## 10. 如果类别不平衡怎么办？如果超过两个类别呢？\n二分类问题是指根据诸如性别、年龄、地点等自变量，将数据分为两类，例如客户是否会购买某特定产品（是\u002F否）。\n\n由于目标变量不是连续的，二分类模型会预测目标变量为“是”或“否”的概率。为了评估这类模型，通常会使用混淆矩阵，也称为分类矩阵或一致性矩阵。借助混淆矩阵，我们可以计算一些重要的性能指标：\n* 真阳性率（TPR）或召回率或灵敏度 = TP \u002F (TP + FN)\n* [精确率](https:\u002F\u002Fgithub.com\u002Fiamtodor\u002Fdata-science-interview-questions-and-answers#5-explain-what-precision-and-recall-are-how-do-they-relate-to-the-roc-curve) = TP \u002F (TP + FP)\n* 假阳性率（FPR）或误报率 = 1 - 特异度 = 1 - (TN \u002F (TN + FP))\n* 准确率 = (TP + TN) \u002F (TP + TN + FP + FN)\n* 错误率 = 1 – 准确率\n* F-measure = 2 \u002F ((1 \u002F 精确率) + (1 \u002F 召回率)) = 2 * (精确率 * 召回率) \u002F (精确率 + 召回率)\n* ROC（接收者操作特性）= FPR与TPR的曲线图\n* AUC（ROC曲线下面积）\n  不同分类阈值下的综合性能指标。可视为模型将随机选择的正样本排在负样本之上的概率\n\n\n\n## 11. 我有哪些方法可以使我的模型对异常值更鲁棒？\n从不同的角度（数据预处理或模型构建）来看，有多种方法可以使模型对异常值更具鲁棒性。在本问题中，异常值被假定为人类目前已知情况下不希望出现、意料之外或明显错误的值（例如，没有人会200岁），而不是一种可能但罕见的事件。\n\n异常值通常是相对于数据分布来定义的。因此，可以在预处理阶段（在任何学习步骤之前）通过使用标准差 `(Mean +\u002F- 2*SD)` 来移除异常值，这适用于正态分布的数据。或者使用四分位距 `Q1 - Q3`，其中 `Q1` 是排序后数据集前半部分的中间值，`Q3` 是后半部分的中间值，这可以作为非正态或未知分布的阈值。\n\n此外，如果数据存在明显的长尾分布，数据变换（如对数变换）可能会有所帮助。当异常值与采集仪器的敏感性有关，导致无法准确记录较小的数值时，Winsorization变换可能会有用。这种变换（以查尔斯·P·温索尔（1895–1951）命名）的效果类似于信号截断，即用不太极端的值替换极端数据。另一种减少异常值影响的方法是使用平均绝对误差代替均方误差。\n\n在模型构建方面，某些模型本身对异常值具有较强的抵抗力（如基于树的模型）或属于非参数检验。类似于中位数的作用，树模型在每次分裂时都会将节点分成两部分。因此，在每次分裂中，桶内的所有数据点都会被同等对待，无论它们是否包含极端值。\n\n## 12. 在无监督学习中，如果数据集的真实标签未知，我们如何确定最合适的聚类数量？\n肘部法则通常是最佳起点，尤其因其易于解释和通过可视化验证而备受青睐。肘部法则关注的是方差随聚类数量（k-means中的k）的变化情况。通过绘制方差解释百分比与k的关系图，最初的N个聚类应该能显著增加信息量，解释方差；然而，当k达到某个值时，信息增益会变得非常微小，此时图表会出现一个明显的拐角。从肘部法则的角度来看，这个拐角所对应的k值就是最优的聚类数量。\n显然，要绘制方差随聚类数量变化的曲线，必须测试不同的聚类数量。需要依次完整地运行聚类算法，然后将结果绘制成图并进行比较。\nDBSCAN——基于密度的空间聚类算法，能够识别高密度的核心样本，并由此扩展聚类。适用于包含相似密度簇的数据集。\n\n## 13. 
定义方差\n方差是随机变量与其均值之差的平方的期望值。通俗地说，它衡量一组（随机）数值与其平均值之间的离散程度。方差是标准差的平方，是分布的二阶中心矩，也是随机变量与其自身的协方差。\nVar(X) = E[(X - m)^2], m=E[X]\n\n因此，方差是衡量随机变量取值相对于其数学期望的离散程度的指标。\n\n## 14. 数学期望\n数学期望——[数学期望](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FExpected_value)（[概率分布](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FProbability_distribution)）在概率分布中，数学期望是指随机变量最有可能取到的值。\n\n根据随机变量 \\( x \\) 的分布律，我们知道随机变量 \\( x \\) 可以取值 \\( x_1, x_2, \\ldots, x_k \\)，对应的概率分别为 \\( p_1, p_2, \\ldots, p_k \\)。\n随机变量 \\( x \\) 的数学期望 \\( M(x) \\) 等于：\n随机变量 \\( X \\) 的数学期望（记作 \\( M(X) \\)，较少使用 \\( E(X) \\)）刻画了随机变量（离散或连续）的平均值。数学期望是该随机变量的首个原点矩。\n\n数学期望属于所谓的分布位置特征之一（其他还包括众数和中位数）。这一特征描述了随机变量在数值轴上的某种平均位置。例如，如果一个随机变量——灯泡寿命——的期望值为100小时，则认为其寿命值会围绕这个期望值集中分布（两侧存在离散程度，由方差表示）。\n\n对于离散型随机变量 \\( X \\)，其数学期望计算公式为：随机变量取值 \\( x_i \\) 与相应概率 \\( p_i \\) 的乘积之和：\n```python\nimport numpy as np\nX = [3,4,5,6,7]\nP = [0.1,0.2,0.3,0.4,0.5]\nnp.sum(np.dot(X, P))\n```\n\n## 15. 描述箱线图和直方图的区别及其应用场景\n[直方图](http:\u002F\u002Fwww.brighthubpm.com\u002Fsix-sigma\u002F13307-what-is-a-histogram\u002F)是一种条形图，用于以图形方式展示数据集的频数分布。与普通条形图类似，直方图将频数（即原始计数）绘制在 Y 轴（垂直方向），而所测量的变量则绘制在 X 轴（水平方向）。\n\n直方图与普通条形图的唯一区别在于：直方图显示的是数据组的频数，而不是单个数据点；因此，直方图中的柱子之间没有空隙。通常，直方图会将数据分成若干小区间（每根柱子代表 X 轴上的 4 到 8 个数值），除非数据范围非常大，此时采用较大的分组反而更容易看出整体分布趋势。\n\n箱线图，也称为[箱须图](http:\u002F\u002Fwww.brighthubpm.com\u002Fsix-sigma\u002F43824-using-box-and-whiskers-plots\u002F)，是一种用图形方式表示数据集五个重要描述性统计量的图表。这五个统计量包括最小值、第一四分位数、中位数、第三四分位数和最大值。在绘制这一五数概括时，只有 X 轴上显示数值。在坐标系内，每个统计量上方都会画一条竖线。中间三根线（第一四分位数、中位数和第三四分位数）被围成一个矩形框，再从矩形框的两端引出两条线，分别指向最小值和最大值。\n相比直方图，箱线图更适合用于比较不同数据的分布！\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_0b67822dc37b.png)\n\n## 16. 如何在数据分布中发现异常值？\n在开始之前，有必要先明确异常值的定义范围。异常值大致可以分为以下三类：\n1. 点异常：当某个数据点与其他数据点相比偏离过大时，即为点异常。业务场景：基于“消费金额”检测信用卡欺诈。\n2. 上下文异常：异常情况取决于具体上下文。这类异常在时间序列数据中较为常见。业务场景：在节假日期间每天花费100美元购买食品是正常现象，但在其他时候则可能显得异常。\n3. 
集体异常：通过一组数据点的相互关系来识别异常。业务场景：有人正在从远程机器向本地主机意外复制数据，这种行为可能被标记为潜在的网络攻击。\n\n预防异常的最佳方法是在数据采集阶段就实施相应的策略或检查机制，以便及时发现并拦截异常。然而，在实际工作中，我们往往无法自行采集数据，而是需要处理他人为了其他目的收集的数据。通常情况下，约68%的数据点位于均值的一个标准差范围内，约95%的数据点位于两个标准差范围内，而超过99%的数据则落在三个标准差之内。当某个数值与均值的偏差过大，例如超出±4σ时，就可以将其视为异常值。（这一界限也可以通过百分位数来计算）。\n\n#### 统计方法\n基于统计的异常检测方法利用上述知识来识别离群点。可以通过计算每个数据点的z-score将其标准化。z-score表示该数据点距离数据均值的标准差数量。任何z-score大于3的数据点都被视为离群点，并很可能属于异常。随着z-score的增加，这些点的异常性会更加明显。z-score的计算公式如下。箱线图非常适合用于此类分析。\n\n#### 度量方法\n从相关文献的数量来看，度量方法是研究人员中最常用的方法。该方法假设在数据对象的空间中存在某种度量指标，能够帮助识别异常。直观上讲，异常点在其所在空间中的邻近点较少，而正常点则具有较多的邻近点。因此，一种常用的异常度量方式是“到第k个邻近点的距离”。（参见局部离群因子方法：[Local Outlier Factor](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FLocal_outlier_factor)）。此外，还可以使用特定的度量方法，如马氏距离。（参见马哈拉诺比斯距离：[Mahalanobis distance](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMahalanobis_distance)）。马氏距离是一种衡量随机变量向量之间距离的指标，它推广了欧几里得距离的概念。通过马氏距离，可以判断未知样本与已知样本之间的相似程度。与欧几里得距离不同的是，马氏距离考虑了变量之间的相关性，并且具有尺度不变性。\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_f840a7001180.png)\n\n基于聚类的异常检测最常见的方式是采用原型聚类法。\n\n在这种方法中，如果将某个数据点从聚类中移除后能显著改善聚类原型，则可判定该点为异常。这种逻辑是合理的。K-means是一种将相似数据点归为同一簇的聚类算法。每个簇内的数据点都与其质心相近，因此才会被分配到该簇中。如果某个数据点与质心的距离过远，以至于会将质心从其自然中心位置拉开，那么这个点实际上就是离群点，因为它超出了该簇的自然边界范围。因此，将其移除有助于提高其余数据点的聚类准确性。在这种方法中，离群点得分被定义为该点与任何簇的不匹配程度，或者其与所属簇质心的距离。在K-means中，移除某点后对质心准确性的提升程度，可以用有该点和无该点时的SSE（误差平方和）之差来衡量。如果移除该点后SSE显著降低，则表明该点的离群程度较高。\n\n更具体地说，在使用K-means聚类方法进行异常检测时，离群点得分通常有两种计算方式。最简单的方式是计算该点与其最近质心之间的距离。然而，当簇的密度差异较大时，这种方法的效果会大打折扣。为了解决这个问题，可以采用相对距离的概念，即该点到最近质心的距离与该簇所有点到质心距离中位数的比值。需要注意的是，这种异常检测方法对k值的选择较为敏感。此外，如果数据噪声较大，可能会导致初始聚类结果不准确，从而影响异常检测的精度。该方法的时间复杂度主要取决于所选用的聚类算法，但由于大多数聚类算法的时间和空间复杂度均为线性或接近线性，因此这种方法仍具有较高的效率。\n\n## 17. 如何处理数据中的离群点？\n\n通常情况下，如果您的数据受到极端值的影响，可以通过设定一个排除离群点的历史代表性区间来限制输入范围。例如，可以设置一个最小或最大数量的项目，或者对订单金额设定上下限。\n\n如果离群点来自相对独特的数据集，则应根据具体情况对其进行分析。分别在包含和排除这些离群点的情况下进行分析，并考虑是否有合适的替代方案；最后，报告您的评估结果。\n\n另一种选择是尝试数据变换。平方根变换和对数变换都可以压缩较大的数值，从而使模型在处理依赖于这些极端值的变量时表现更好。\n\n## 18. 如何处理稀疏数据？\n\n我们可以考虑使用L1正则化，因为它特别适合稀疏数据，并结合特征选择来处理问题。如果数据之间存在线性关系，则可以使用线性回归或支持向量机。\n\n此外，还可以采用独热编码或词袋模型。独热编码是一种将分类变量表示为二进制向量的方法。首先需要将分类值映射为整数值，然后将每个整数值表示为一个二进制向量，其中只有一个位置为1，其余均为0。\n\n## 19. 大数据工程师 你能解释一下什么是 REST 吗？\n\nREST 是“表述性状态转移”的缩写。（有时也拼作“ReST”。）它依赖于一种无状态、客户端-服务器架构且可缓存的通信协议——在绝大多数情况下，使用的是 HTTP 协议。\nREST 是一种用于设计网络应用的架构风格。其核心思想很简单：利用 HTTP 在不同机器之间进行调用。\n* 从很多方面来看，基于 HTTP 的万维网本身就可以被视为一种基于 REST 的架构。\nREST 风格的应用程序通过 HTTP 请求来发布数据（创建和\u002F或更新）、读取数据（例如执行查询）以及删除数据。因此，REST 使用 HTTP 来完成所有的 CRUD（创建\u002F读取\u002F更新\u002F删除）操作。\nREST 是 RPC（远程过程调用）和 Web 服务（SOAP、WSDL 等）等机制的一种轻量级替代方案。稍后我们会看到 REST 究竟有多简单。\n* 尽管 REST 很简单，但它功能齐全；基本上，Web 服务能实现的功能，REST 架构也能做到。\nREST 并不是一个“标准”。例如，不可能会有 W3C 关于 REST 的推荐标准。虽然有一些 REST 编程框架，但 REST 的使用非常简单，以至于你通常可以仅借助 Perl、Java 或 C# 等语言的标准库功能就“自己动手”实现。\n\n## 20. 逻辑回归\n\n对数几率——模型的原始输出；几率——模型输出的指数值。预测概率——几率 \u002F (1 + 几率)。\n\n## 21. 如果两个自变量高度相关，会对逻辑回归的系数产生什么影响？这些系数的置信区间又是怎样的？\n当自变量之间存在相关性时，任何一个自变量的估计回归系数都会受到模型中其他自变量的影响。随着更多自变量被加入模型，估计回归系数的精确度会降低。\n\n在统计学中，多重共线性（也称共线性）是指多元回归模型中两个或多个自变量高度相关的一种现象，即其中一个自变量可以被其他自变量以相当高的精度线性预测出来。在这种情况下，多元回归模型的系数估计可能会因模型或数据的微小变化而发生剧烈波动。不过，多重共线性并不会降低整个模型的预测能力或可靠性，至少在样本数据范围内是如此；它只会影响关于单个自变量的推断结果。也就是说，一个包含相关自变量的多元回归模型可以很好地说明所有自变量作为一个整体对因变量的预测效果，但却无法准确反映单个自变量的作用，也无法判断哪些自变量与其他自变量存在冗余关系。\n\n多重共线性的后果：\n* 系数估计仍然无偏。\n* 标准误差增大。\n* 计算出的 t 统计量被低估。\n* 估计结果对模型设定的变化以及个别观测值的变化极为敏感。\n* 方程的整体质量以及与多重共线性无关的变量的估计结果不受影响。\n* 多重共线性越接近完全共线性（严格共线性），其带来的后果就越严重。\n\n多重共线性的指标：\n1. R² 较高，而方差膨胀因子（VIF）却很低。\n2. 自变量之间的成对相关性很强。\n3. 自变量之间的偏相关性也很强。\n4. 
VIF 值较高——即方差膨胀因子较大。\n\n置信区间（CI）是一种基于观测数据计算得出的区间估计（用于估计总体参数）。置信水平是指在无数次独立实验中，采用相同置信水平构建的置信区间中包含对应参数真实值的比例。换句话说，如果在无限次独立实验中都使用相同的置信水平构建置信区间，那么其中包含参数真实值的比例将等于该置信水平。\n\n置信区间由一组数值范围（区间）组成，这些范围可以作为未知总体参数的良好估计。然而，从特定样本计算出的区间并不一定包含参数的真实值。由于观测数据只是来自总体的随机样本，因此由这些数据得到的置信区间也是随机的。如果同时进行相应的假设检验，置信水平就是显著性水平的补集，即 95% 的置信区间对应 0.05 的显著性水平。如果假设某个参数的真实值为 0，但 95% 的置信区间并未包含 0，那么在 5% 的显著性水平下，该估计与零有显著差异。\n\n期望的置信水平由研究者自行设定（而非由数据决定）。最常用的置信水平是 95%。当然，也可以使用其他置信水平，比如 90% 和 99%。\n\n影响置信区间宽度的因素包括样本量、置信水平以及样本中的变异程度。一般来说，样本量越大，对总体参数的估计就越准确。\n置信区间是我们比较确信真实值所在的一个数值范围。\n`X ± Z*s\u002F√(n)`，其中 X 是样本均值，Z 是查表得到的 Z 值，s 是标准差，n 是样本量。± 号后面的值称为边际误差。\n\n## 22. 高斯混合模型与K均值聚类有何区别？\n假设我们要将数据划分为三个簇。K均值聚类会从一个假定出发：每个数据点都属于某个特定的簇。\n\n例如，我们选择一个数据点，在算法的某一时刻，我们确信它属于红色簇；但在下一次迭代中，我们可能会改变这一判断，认为它更可能属于绿色簇。然而，请记住，在每一次迭代中，我们都对数据点所属的簇有着明确的归属——这就是“硬分配”。\n\n那么，如果我们无法确定呢？比如，我们觉得这个点有70%的可能性属于红色簇，10%的可能性属于绿色簇，还有20%的可能性属于蓝色簇。这就是“软分配”。高斯混合模型正是用来表达这种不确定性的一种方法。它首先基于某种先验信念，为每个数据点赋予其属于各个簇的概率；随着算法的推进，这些概率会被不断更新，从而反映出我们对数据点归属的不确定程度。\n\nK均值聚类的目标是找到k个聚类中心，以最小化 `(x−μk)^2`；\n而高斯混合模型（EM聚类）则是找到k个聚类中心，以最小化 `(x−μk)^2\u002Fσ^2`。\n\n从数学上看，两者的区别就在于分母中的 `σ^2`——这意味着高斯混合模型在计算距离时考虑了数据的方差，而K均值仅计算传统的欧几里得距离。换句话说，K均值计算的是普通距离，而高斯混合模型计算的是“加权”距离。\n\n**K均值**：\n* 在收敛时，将每个数据点硬性分配到某一个特定的簇。\n* 优化过程中使用L2范数，即最小化数据点与其所属簇中心之间的L2距离。\n\n**EM算法**：\n* 对数据点进行软分配，即为每个数据点分配其属于各个簇的概率。\n* 不依赖于L2范数，而是基于期望值，也就是数据点属于某个特定簇的概率。这也使得K均值倾向于生成球形的簇。\n\n## 23. 请描述梯度提升的工作原理。\n提升方法的核心思想在于：能否通过改进弱学习器使其变得更强大？\n\n梯度提升算法通常基于回归树（即使是在解决分类问题时也是如此），并以最小化均方误差（MSE）为目标。对于叶节点区域的预测值，选择非常简单：为了使MSE最小，只需取该叶节点内所有样本目标值的平均值即可。树的构建采用贪心策略，从根节点开始，每次为每个叶节点选择一个分割点，以使当前步骤的MSE达到最小。\n\n首先，梯度提升是一种集成方法，也就是说，最终的预测是由一组较简单的基学习器共同完成的。尽管从理论上讲，可以使用多种不同的基学习器来构建集成模型，但在实际应用中，我们几乎总是使用GBDT——基于决策树的梯度提升算法。\n\n梯度提升的目标是利用已知的单棵决策树训练方法，逐步构建一棵棵新的树，形成一个集成模型。之所以称为“提升”，是因为我们期望这个集成模型的表现能够远远超过单个基学习器。\n\n接下来是最有趣的部分：梯度提升会逐棵树地构建集成模型，并将每棵树的预测结果相加：D(x)=d​tree 1​​(x)+d​tree 2​​(x)+...。随后，新加入的下一棵树会尝试通过拟合残差来弥补目标函数f(x)与当前集成模型预测值之间的差距。\n\n举个例子，假设当前的集成模型由3棵树组成，其预测值为：D(x)=d​tree 1​​(x)+d​tree 2​​(x)+d​tree 3​​(x)。那么，集成中的第四棵树就需要很好地补充前几棵树，以进一步降低整个集成的训练误差。\n\n理想情况下，我们希望 D(x)+d​tree 4​​(x)=f(x)。为了更接近这一目标，我们会训练新的一棵树来拟合目标函数与当前集成预测值之间的残差：R(x)=f(x)−D(x)。请注意，如果决策树能够完全拟合残差R(x)，那么整个集成模型的预测就将不再存在误差（只要将这棵树加入集成即可）。然而，在实践中这种情况几乎不可能发生，因此我们只能继续重复上述的迭代过程，逐步完善集成模型。\n\n### AdaBoost：首个提升算法\nAdaBoost中的弱学习器是仅有一个分裂节点的决策树，因其结构短小而被称为“决策桩”。\n\nAdaBoost通过为每个样本赋予不同的权重来工作：难以分类的样本权重较高，而已经分类得较好的样本权重较低。随后依次添加新的弱学习器，这些学习器会重点训练那些较为困难的模式。\n\n**梯度提升方法包含三个核心要素：**\n1. 用于优化的损失函数。\n2. 用于进行预测的弱学习器。\n3. 
用于逐步添加弱学习器以最小化损失函数的加法模型。\n\n#### 损失函数\n所使用的损失函数取决于具体问题的类型。\n该损失函数必须是可导的，不过许多标准的损失函数都已被支持，用户也可以自定义损失函数。例如，在回归任务中常用平方误差，在分类任务中则常用对数损失。梯度提升框架的一大优势在于，无需为每一种可能使用的损失函数单独设计新的提升算法；相反，它是一个足够通用的框架，可以适用于任何可导的损失函数。\n\n#### 额定学习器\n梯度提升中通常使用决策树作为弱学习器。\n\n具体来说，采用的是回归树，这类树在分裂时输出连续值，并且其输出可以累加，从而允许后续模型的输出叠加起来，以“修正”预测中的残差。\n\n树的构建采用贪心策略，根据基尼指数等纯度指标或损失最小化原则选择最优的分裂点。早期，如在AdaBoost中，使用的往往是仅有一个分裂节点的极短决策树，称为“决策桩”。而在实际应用中，通常会使用更深一些的树，层数一般在4到8层之间。\n\n为了确保弱学习器保持较弱的性能，同时仍能以贪心方式高效构建，常会对弱学习器施加一些约束，例如限制最大层数、节点数、分裂次数或叶节点数。\n\n#### 加法模型\n梯度提升中，树是逐棵加入的，且已有的树不会被修改。\n\n在添加新树时，会使用梯度下降法来最小化损失。传统上，梯度下降法用于优化一组参数，比如回归方程中的系数或神经网络中的权重。计算出误差或损失后，通过调整权重来最小化该误差。\n\n但在梯度提升中，我们优化的对象不是参数，而是弱学习器子模型——更具体地说，是决策树。计算出损失后，要执行梯度下降步骤，就必须向模型中添加一棵能够降低损失的树（即沿着梯度方向前进）。为此，我们需要对这棵树的参数进行调优，然后调整这些参数，使模型朝着减少残差损失的方向移动。\n\n这种方法通常被称为函数梯度下降，或基于函数的梯度下降。随后，新树的输出会被加到现有树序列的输出之上，以修正或改进模型的最终输出。\n\n通常会固定添加一定数量的树，或者当损失达到可接受水平，或在外部验证数据集上不再进一步改善时，训练便会停止。\n\n### 基本梯度提升的改进\n梯度提升是一种贪心算法，很容易快速过拟合训练数据集。\n通过引入正则化方法来惩罚算法的不同部分，可以有效减少过拟合，从而提升算法的整体性能。\n\n在这一节中，我们将探讨对基本梯度提升的4种增强方法：\n* 树的约束条件\n* 学习率缩减\n* 随机采样\n* 正则化学习\n\n#### 树的约束条件\n弱学习器需要具备一定的预测能力，但同时也要保持“弱”的特性。可以通过多种方式对树进行约束。\n\n一个通用的经验法则：树的生成约束越严格，模型中所需的树的数量就越多；反之，如果单棵树的约束较少，则整体所需的树数量也会相应减少。\n\n以下是一些可以应用于决策树构建的约束条件：\n* 树的数量：通常情况下，增加更多的树不会很快导致过拟合。建议持续添加树，直到不再观察到性能提升为止。\n* 树的深度：更深的树结构更复杂，因此倾向于选择较浅的树。一般而言，4至8层的树效果较好。\n* 节点数或叶节点数：与树的深度类似，这也可以限制树的大小，但如果同时使用其他约束条件，则不必强制要求对称的树结构。\n* 每个分裂节点所需的最小样本量：在考虑进行分裂之前，要求该节点必须包含足够数量的训练数据。\n* 损失函数的最小改善值：用于限制每次分裂对损失函数的改善程度。\n\n#### 加权更新\n每棵树的预测结果会依次累加。可以通过为每棵树分配不同的权重来减缓算法的学习速度，这种权重调整被称为学习率缩减或学习率。\n\n每次更新都会被“学习率参数”v乘以一个缩放因子，从而降低其贡献程度。\n\n这样做的效果是减缓了学习速度，进而需要加入更多的树来完成训练，最终导致训练时间延长。这实际上是在树的数量和学习率之间进行的一种权衡。\n\n降低v（即学习率）的值会提高M（树的数量）的最佳取值。\n\n常见的学习率范围是0.1到0.3，有时甚至低于0.1。\n\n类似于随机优化中的学习率，学习率缩减能够降低单棵树的影响，为后续树的加入留出空间，以便进一步优化模型。\n\n#### 随机梯度提升\n关于装袋集成和随机森林的一个重要发现是：允许树基于训练数据集的子样本贪婪地构建。\n\n同样的思路也可以用来降低梯度提升模型中各棵树之间的相关性。\n\n这种变体被称为随机梯度提升。\n\n在每一轮迭代中，都会从完整的训练数据集中随机抽取一个子样本（不放回）。然后使用这个随机子样本而不是整个数据集来拟合基学习器。\n\n随机提升的几种变体包括：\n* 在构建每棵树之前对行进行子采样。\n* 在构建每棵树之前对列进行子采样。\n* 在考虑每个分裂之前对列进行子采样。\n\n通常来说，激进的子采样策略，例如只选取50%的数据，已被证明是有益的。根据用户反馈，采用列子采样比传统的行子采样更能有效防止过拟合。\n\n#### 正则化梯度提升\n除了对树的结构施加约束外，还可以对树的参数化部分引入额外的约束。\n\n传统的CART决策树通常不作为弱学习器使用，取而代之的是回归树这一变体，其叶节点（也称为终端节点）包含数值。在某些文献中，这些叶节点的数值被称为权重。\n\n因此，可以利用常见的正则化函数对树的叶节点权重进行正则化，例如：\n* 权重的L1正则化\n* 权重的L2正则化\n\n额外的正则化项有助于平滑最终学到的权重，从而避免过拟合。直观地说，经过正则化的目标函数会倾向于选择简单且具有预测能力的模型。\n\n更多详细信息请参阅以下两篇俄文文章：\n* https:\u002F\u002Fhabr.com\u002Fcompany\u002Fods\u002Fblog\u002F327250\u002F\n* https:\u002F\u002Falexanderdyakonov.files.wordpress.com\u002F2017\u002F06\u002Fbook_boosting_pdf.pdf\n\n## 24. AdaBoost与XGBoost的区别\n这两种方法都是将多个弱学习器组合成一个强学习器。例如，一棵决策树就是一个弱学习器，而由多棵决策树组成的随机森林则是一个强学习器。\n\n在训练过程中，这两种方法都会逐步增加弱学习器的集合，在每一轮迭代中向集合中添加新的弱学习器——以随机森林为例，森林就会不断“长出”新树。AdaBoost和XGBoost之间的唯一区别在于它们补充集合的方式不同。\n\nAdaBoost通过调整样本权重来工作：它会给那些难以分类的样本赋予更高的权重，而对已经处理得较好的样本则赋予较低权重。随后，按顺序添加新的弱学习器，这些弱学习器会专注于学习更加困难的模式。AdaBoost会在每一轮迭代中改变样本的权重，提高错误较多的样本的权重，使样本权重与整体误差成正比。这样一来，我们就改变了样本的概率分布——权重较高的样本在未来会被更频繁地选中。这相当于我们把那些容易出错的样本单独积累起来，以后就用这些样本代替原始数据集。此外，在AdaBoost中，每个弱学习器在集成中都有自己的权重（alpha权重），这个权重越高，说明该弱学习器越“聪明”，即越不容易犯错。\n\n相比之下，XGBoost完全不会改变样本的选择或分布。XGBoost首先构建第一棵树（弱学习器），用它对样本进行初步预测，但可能会存在一定的误差。然后，再添加第二棵树（弱学习器），以修正现有模型的误差。误差的最小化是通过梯度下降算法实现的。此外，还可以使用正则化技术，通过Lasso和Ridge正则化来惩罚过于复杂的模型。\n\n简而言之：AdaBoost侧重于重新加权样本；梯度提升则是通过预测树的损失函数来进行优化；而XGBoost则在损失函数中加入了正则化项（包括树的深度和叶节点的数值）。\n\n## 25. 
数据挖掘 描述决策树模型\n决策树是一种包含根节点、分支和叶节点的结构。每个内部节点表示对某个属性的测试，每条分支表示测试的结果，而每个叶节点则持有类别标签。树中最顶端的节点就是根节点。\n\n每个内部节点代表对一个属性的测试，而每个叶节点则代表一个类别。\n拥有决策树的好处如下：\n* 不需要任何领域知识。\n* 易于理解。\n* 决策树的学习和分类步骤简单且快速。\n\n**树剪枝**\n\n进行树剪枝是为了去除训练数据中由于噪声或异常值引起的异常情况。经过剪枝后的树更小、更简单。\n\n**树剪枝方法**\n\n以下是几种树剪枝方法：\n* 预剪枝——通过提前停止树的构建来实现剪枝。\n* 后剪枝——这种方法会从一棵完全长成的树上移除子树。\n\n**代价复杂度**\n\n代价复杂度由以下两个参数衡量：树中的叶节点数量以及树的错误率。\n\n## 26. 吴恩达Coursera深度学习课程笔记\n[吴恩达Coursera深度学习课程笔记](https:\u002F\u002Fpt.slideshare.net\u002FTessFerrandez\u002Fnotes-from-coursera-deep-learning-courses-by-andrew-ng\u002F)\n\n## 27. 什么是神经网络？\n神经网络通常被组织成若干层。每一层由许多相互连接的“节点”组成，这些节点包含一个“激活函数”。模式通过“输入层”被输入到网络中，随后传递给一个或多个“隐藏层”，在这些隐藏层中，实际的处理是通过一系列加权“连接”完成的。隐藏层再与“输出层”相连，最终在输出层得到答案，如下面的图示所示。\n\n尽管神经网络使用了许多不同的学习规则，但本演示仅关注其中一种：δ规则。δ规则常用于最常见的神经网络类型——反向传播神经网络（BPNN）。反向传播是指误差的反向传播。使用δ规则时，与其他类型的反向传播一样，“学习”是一个有监督的过程，它在每一个循环或“ epoch”（即每次网络接收到一个新的输入模式时）都会发生，包括输出的正向激活流以及权重调整的反向误差传播。更简单地说，当神经网络首次接收到一个模式时，它会随机猜测这个模式可能是什么。然后，它会检查自己的答案与实际答案之间的差距，并相应地调整其连接权重。更直观地看，这个过程大致如下：\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_readme_3f39a5cb4d68.png)\n\n反向传播会在解空间的向量范围内沿着误差曲面最陡峭的向量方向进行梯度下降，以达到“全局最小值”。全局最小值是指具有最低可能误差的理论解。误差曲面本身通常是一个双曲抛物面，但很少是“平滑”的。事实上，在大多数问题中，解空间非常不规则，有许多“坑洼”和“山丘”，这可能导致网络停留在“局部最小值”处，而这不是最佳的整体解决方案。\n\n由于误差空间的性质事先无法确定，因此神经网络分析通常需要进行大量的独立运行，才能找到最佳解决方案。大多数学习规则都内置了数学公式来辅助这一过程，这些公式控制着学习的“速度”（β系数）和“动量”。学习速度实际上是指当前解与全局最小值之间的收敛速率。动量则帮助网络克服误差曲面上的障碍（局部最小值），从而使其能够稳定在全局最小值附近。\n\n一旦神经网络被“训练”到令人满意的程度，就可以将其用作分析其他数据的工具。为此，用户不再指定任何训练运行，而是让网络仅以正向传播模式工作。新的输入会被送入输入层，然后像训练时一样经过中间层的过滤和处理，不过此时不会进行反向传播，而是保留输出结果。正向传播运行的输出就是该数据的预测模型，可用于进一步的分析和解释。\n\n## 28. 如何处理稀疏数据？\n我们可以考虑使用L1正则化，因为它最适合稀疏数据，并且还能进行特征选择。如果是线性关系，则可以使用线性回归或支持向量机。此外，使用独热编码或词袋模型也是不错的选择。\n独热编码是一种将分类变量表示为二进制向量的方法。\n首先需要将分类值映射为整数值。\n然后，每个整数值都被表示为一个二进制向量，该向量除了对应整数值的索引位置为1外，其余位置均为0。\n\n## 29. RNN和LSTM\n以下是我喜欢的一些资源：\n* [理解LSTM网络，Chris Olah的LSTM文章](http:\u002F\u002Fcolah.github.io\u002Fposts\u002F2015-08-Understanding-LSTMs\u002F)\n* [探索LSTM，Edwin Chen的LSTM文章](http:\u002F\u002Fblog.echen.me\u002F2017\u002F05\u002F30\u002Fexploring-lstms\u002F)\n* [循环神经网络的不可思议效果，Andrej Karpathy的博客文章](http:\u002F\u002Fkarpathy.github.io\u002F2015\u002F05\u002F21\u002Frnn-effectiveness\u002F)\n* [CS231n第10讲——循环神经网络、图像描述、LSTM，Andrej Karpathy的讲座](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=iX5V1WpxxkY)\n* [Jay Alammar的《图解Transformer》](http:\u002F\u002Fjalammar.github.io\u002Fillustrated-transformer\u002F) 这位作者通常专注于可视化各种机器学习概念。\n\n## 30. 伪标签\n伪标签是一种技术，它允许你在训练过程中使用带有**置信度**的预测测试数据。这种方法有效地增加了可用于训练的样本数量，而且这些样本的分布可能会有所不同。我发现Kaggle上的这个内核[这里](https:\u002F\u002Fwww.kaggle.com\u002Fcdeotte\u002Fpseudo-labeling-qda-0-969)很有帮助，它展示了在训练数据点过少的情况下如何使用伪标签。\n\n## 31. 知识蒸馏\n知识蒸馏是指一个规模较大的模型将其知识传递给一个较小的模型的过程。其应用包括自然语言处理和目标检测，使得性能较弱的硬件也能在几乎不损失准确性的前提下做出良好的推断。\n\n例如：模型压缩，即将多个模型的知识压缩到一个单一的神经网络中。\n[解释](https:\u002F\u002Fnervanasystems.github.io\u002Fdistiller\u002Fknowledge_distillation.html)\n\n## 32. 什么是归纳偏置？\n模型的归纳偏置是指在该模型中做出的一些假设，这些假设用于从自变量（即特征）中学习目标函数。如果没有这些假设，我们的问题将存在无数种可能的解，而找到最优解本身就会成为一个难题。我发现 [这个](https:\u002F\u002Fstackoverflow.com\u002Fquestions\u002F35655267\u002Fwhat-is-inductive-bias-in-machine-learning) StackOverflow 上的问题很有帮助，值得进一步阅读和探讨。\n\n举个例子，在选择交叉验证误差最小的学习算法时，就体现了一种归纳偏置。此时，我们**依赖**于“交叉验证误差最小”的假设，并**希望**它能够在尚未见过的数据上具有良好的泛化能力。实际上，正是这种选择帮助我们在尝试的不同学习算法（或模型）中做出决策。\n\n## 33. 
用通俗的话来说，置信区间是什么？\n顾名思义，置信区间就是指对某个取值范围能够得到预期结果的信心程度。例如：如果100到200的范围是一个95%的置信区间，那就意味着人们可以有95%的把握断定某个数据点或任何期望的值都落在这个范围内。","# data-science-interview-questions-and-answers 快速上手指南\n\n`data-science-interview-questions-and-answers` 并非一个需要编译或安装依赖的软件库，而是一个汇总了数据科学面试核心知识点（如特征选择、正则化、模型评估、深度学习等）的开源文档集合。本指南将指导你如何获取并阅读这些内容。\n\n## 环境准备\n\n本项目无特殊的系统要求或前置依赖，只需具备以下基础环境即可：\n\n*   **操作系统**：Windows, macOS 或 Linux 均可。\n*   **必要工具**：\n    *   **Git**：用于克隆代码仓库（推荐）。\n    *   **浏览器**：用于直接查看在线文档或本地 Markdown 文件。\n    *   **Markdown 编辑器**（可选）：如 VS Code, Typora 等，用于获得更好的本地阅读体验。\n\n## 安装步骤\n\n你可以通过以下两种方式获取内容：\n\n### 方式一：使用 Git 克隆（推荐）\n\n打开终端（Terminal 或 CMD），执行以下命令将仓库克隆到本地：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fiamtodor\u002Fdata-science-interview-questions-and-answers.git\n```\n\n> **国内加速提示**：如果访问 GitHub 速度较慢，可以使用国内镜像源（如 Gitee 上的同步仓库，若有）或配置 Git 代理。若无可用的官方同步镜像，建议配置本地 hosts 或使用加速工具。\n\n### 方式二：直接下载 ZIP\n\n1.  访问项目 GitHub 主页。\n2.  点击绿色的 **Code** 按钮。\n3.  选择 **Download ZIP**。\n4.  解压下载的压缩包到任意目录。\n\n## 基本使用\n\n获取项目后，你可以通过以下方式开始学习：\n\n### 1. 在线直接阅读\n项目核心内容托管在 Google Docs 上，你可以直接访问原始文档链接进行浏览：\n[https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1ajyJhXyt4q9ZsufXV1kZxDH_3Isg3MYAKsFqNytXrCw\u002F](https:\u002F\u002Fdocs.google.com\u002Fdocument\u002Fd\u002F1ajyJhXyt4q9ZsufXV1kZxDH_3Isg3MYAKsFqNytXrCw\u002F)\n\n### 2. 本地阅读 Markdown 文件\n如果你已克隆或解压了项目，进入项目根目录，直接使用文本编辑器或 IDE 打开 `README.md` 文件。该文件包含了完整的目录索引和详细的问答内容。\n\n**示例：在终端中预览内容**\n\n```bash\ncd data-science-interview-questions-and-answers\ncat README.md\n```\n\n**示例：在 VS Code 中打开**\n\n```bash\ncode README.md\n```\n\n打开后，你可以利用编辑器的 Markdown 预览功能（通常在 VS Code 中按 `Ctrl+Shift+V` 或 `Cmd+Shift+V`）以获得结构化的阅读体验。内容涵盖从基础的统计学概念（如偏差与方差）到高级的深度学习主题（如 RNN, LSTM, 知识蒸馏），非常适合面试准备和技术复习。","某科技公司数据科学团队正紧急筹备季度招聘，面试官需要在短时间内评估多位候选人在特征选择、正则化及不平衡分类等核心领域的实战能力。\n\n### 没有 data-science-interview-questions-and-answers 时\n- 面试官需花费数小时在零散的技术博客和论坛中搜索高质量问题，难以覆盖从基础统计（如方差定义）到高级算法（如梯度提升细节）的完整知识体系。\n- 面对“如何处理稀疏数据”或\"L1 与 L2 正则化区别”等经典问题时，缺乏统一的标准参考答案，导致不同面试官的评分标准主观且不一致。\n- 在考察候选人对异常值处理或聚类数量确定等开放性问题时，因缺少系统的思路引导（如统计方法与度量方法的分类），难以深入追问以辨别候选人真实水平。\n- 准备过程中容易遗漏关键考点，例如混淆矩阵与 ROC 曲线的关系或多重共线性对逻辑回归系数的具体影响，造成面试评估存在盲区。\n\n### 使用 data-science-interview-questions-and-answers 后\n- 团队直接利用该仓库中结构化的 20+ 个核心议题，迅速构建出涵盖特征工程、模型验证及无监督学习的全面面试题库，大幅缩短备课时间。\n- 参考其中关于精确率、召回率及统计功效的标准解析，面试官统一了评价标尺，确保对所有候选人的技术深度判断客观公正。\n- 借助其提供的细分思路（如将异常值检测分为统计法和度量法），面试官能层层递进地追问，有效识别出仅背八股文与具备真正解决问题能力的候选人。\n- 依托其对梯度提升、高斯混合模型等复杂算法的原理拆解，面试环节能精准考察候选人对底层机制的理解，避免了评估流于表面。\n\ndata-science-interview-questions-and-answers 将分散的知识点的转化为标准化的评估武器，显著提升了数据科学人才选拔的效率与准确性。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fiamtodor_data-science-interview-questions-and-answers_e8c5324e.jpg","iamtodor",null,"https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fiamtodor_0f01faf4.jpg","Get Shit Done","todor.ilya@gmail.com","https:\u002F\u002Fgithub.com\u002Fiamtodor",1654,360,"2026-04-07T02:12:16","MIT",1,"","未说明",{"notes":86,"python":84,"dependencies":87},"该仓库并非可执行的 AI 工具或代码库，而是一份包含数据科学面试问题与答案的文档集合（内容源自 Google Docs）。因此，它不需要特定的操作系统、GPU、内存、Python 版本或任何依赖库即可运行。用户只需使用浏览器查看或使用文本编辑器阅读 Markdown 文件即可。",[],[16,89,14],"其他",[91,92,93,94],"data-science","machine-learning","interview-questions","interview-preparation","2026-03-27T02:49:30.150509","2026-04-08T02:00:01.486611",[],[]]