[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-facebookresearch--balance":3,"tool-facebookresearch--balance":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 
解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[19,13,20,18],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[20,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65628,"2026-04-05T10:10:46",[20,18,14],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":10,"last_commit_at":63,"category_tags":64,"status":22},3364,"keras","keras-team\u002Fkeras","Keras 
是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[20,14,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":71,"readme_en":72,"readme_zh":73,"quickstart_zh":74,"use_case_zh":75,"hero_image_url":76,"owner_login":77,"owner_name":78,"owner_avatar_url":79,"owner_bio":80,"owner_company":81,"owner_location":81,"owner_email":81,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":118,"forks":119,"last_commit_at":120,"license":121,"difficulty_score":46,"env_os":122,"env_gpu":123,"env_ram":123,"env_deps":124,"category_tags":138,"github_topics":81,"view_count":10,"oss_zip_url":81,"oss_zip_packed_at":81,"status":22,"created_at":139,"updated_at":140,"faqs":141,"releases":171},3352,"facebookresearch\u002Fbalance","balance","The balance python package offers a simple workflow and methods for dealing with biased data samples when looking to infer from them to some target population of interest.","balance 是一款专为处理数据偏差而设计的 Python 开源工具包。在调查研究、观察性实验或任何存在“选择偏差”的场景中，我们收集到的样本往往无法完美代表目标总体（例如问卷回收率低导致特定人群缺失）。balance 的核心价值在于提供了一套简洁的工作流，帮助研究者利用辅助信息（如年龄、性别、教育程度等已知特征）对偏差样本进行加权调整，从而更准确地推断总体情况。\n\n该工具特别适用于需要严谨数据分析的专业人士，包括调查方法学家、人口统计学家、用户体验研究员、市场分析师，以及广大数据科学家和机器学习工程师。无论是处理非响应偏差还是采样偏差，只要数据满足“随机缺失”假设，balance 都能通过科学的加权算法有效缓解偏差带来的影响。\n\n作为由 Facebook Research 推出的项目，balance 不仅拥有扎实的统计学理论支撑（相关论文已发表于 arXiv），还具备友好的工程化实现。它让复杂的样本平衡方法变得易于调用，让用户无需从头编写算法，即可在 Python 环境中快速完成从数据诊断到偏差校正的全过程，是提升推断结果可靠性的得力助手。目前该项目处于 Beta 阶段，正在积极维","balance 是一款专为处理数据偏差而设计的 Python 开源工具包。在调查研究、观察性实验或任何存在“选择偏差”的场景中，我们收集到的样本往往无法完美代表目标总体（例如问卷回收率低导致特定人群缺失）。balance 
的核心价值在于提供了一套简洁的工作流，帮助研究者利用辅助信息（如年龄、性别、教育程度等已知特征）对偏差样本进行加权调整，从而更准确地推断总体情况。\n\n该工具特别适用于需要严谨数据分析的专业人士，包括调查方法学家、人口统计学家、用户体验研究员、市场分析师，以及广大数据科学家和机器学习工程师。无论是处理非响应偏差还是采样偏差，只要数据满足“随机缺失”假设，balance 都能通过科学的加权算法有效缓解偏差带来的影响。\n\n作为由 Facebook Research 推出的项目，balance 不仅拥有扎实的统计学理论支撑（相关论文已发表于 arXiv），还具备友好的工程化实现。它让复杂的样本平衡方法变得易于调用，让用户无需从头编写算法，即可在 Python 环境中快速完成从数据诊断到偏差校正的全过程，是提升推断结果可靠性的得力助手。目前该项目处于 Beta 阶段，正在积极维护中。","[![balance_logo_horizontal](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_readme_4a15c6b9316a.png)](https:\u002F\u002Fimport-balance.org\u002F)\n\n# _balance_: a python package for balancing biased data samples\n\n\u003Cdiv align=\"center\">\n\n[![Current Release](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Ffacebookresearch\u002Fbalance.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Freleases)\n[![Python 3.9+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.9+-fcbc2c.svg?logo=python&logoColor=white)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![Build & Test](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fbuild-and-test.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fbuild-and-test.yml?query=branch%3Amain)\n[![Coverage](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fgist.githubusercontent.com\u002Ftalgalili\u002F89d05034d314ebda47c1e16607e1ee22\u002Fraw\u002Fcoverage-balance.json)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fcoverage.yml?query=branch%3Amain)\n[![CodeQL](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fcodeql.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fcodeql.yml?query=branch%3Amain)\n[![Deploy 
Website](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fdeploy-website.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fdeploy-website.yml?query=branch%3Amain)\n[![Release](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Frelease.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Frelease.yml?query=branch%3Amain)\n[![DOI](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDOI-10.48550\u002FarXiv.2307.06024-blue.svg)](https:\u002F\u002Fdoi.org\u002F10.48550\u002FarXiv.2307.06024)\n[![Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_readme_ce311fe03a80.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fbalance)\n\n\u003C\u002Fdiv>\n\n> [!NOTE]\n> _balance_ is currently **in beta** and is actively supported. Follow us [on github](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance).\n\n## What is _balance_?\n\n**[_balance_](https:\u002F\u002Fimport-balance.org\u002F) is a Python package** offering a\nsimple workflow and methods for **dealing with biased data samples** when\nlooking to infer from them to some population of interest.\n\nBiased samples often occur in\n[survey statistics](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSurvey_methodology) when\nrespondents present\n[non-response bias](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FParticipation_bias) or survey\nsuffers from [sampling bias](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSampling_bias) (that\nare not\n[missing completely at random](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMissing_data#Missing_completely_at_random)).\nA similar issue arises in\n[observational studies](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FObservational_study) when\ncomparing the treated vs untreated groups, and in 
any data that suffers from\nselection bias.\n\nUnder the missing at random assumption\n([MAR](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMissing_data#Missing_at_random)), bias in\nsamples could sometimes be (at least partially) mitigated by relying on\nauxiliary information (a.k.a.: \"covariates\" or \"features\") that is present for\nall items in the sample, as well as present in a sample of items from the\npopulation. For example, if we want to infer from a sample of respondents to\nsome survey, we may wish to adjust for non-response using demographic\ninformation such as age, gender, education, etc. This can be done by weighing\nthe sample to the population using auxiliary information.\n\nThe package is intended for researchers who are interested in balancing biased\nsamples, such as the ones coming from surveys, using a Python package. This need\nmay arise by survey methodologists, demographers, UX researchers, market\nresearchers, and generally data scientists, statisticians, and machine learners.\n\nMore about the methodological background can be found in\n[Sarig, T., Galili, T., & Eilat, R. (2023). balance – a Python package for balancing biased data samples](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06024).\n\n# Installation\n\n## Requirements\n\nYou need Python 3.9, 3.10, 3.11, 3.12, 3.13, or 3.14 to run _balance_. 
_balance_\ncan be built and run from Linux, OSX, and Windows.\n\nThe required Python dependencies are:\n\n```python\nREQUIRES = [\n    # Numpy and pandas: carefully versioned for binary compatibility\n    \"numpy>=1.21.0,\u003C2.0; python_version\u003C'3.12'\",\n    \"numpy>=1.24.0; python_version>='3.12'\",\n    \"pandas>=1.5.0,\u003C4.0.0; python_version\u003C'3.12'\",\n    \"pandas>=2.0.0,\u003C4.0.0; python_version>='3.12'\",\n    # Scientific stack\n    \"scipy>=1.7.0,\u003C1.14.0; python_version\u003C'3.12'\",\n    \"scipy>=1.11.0; python_version>='3.12'\",\n    \"scikit-learn>=1.0.0,\u003C1.4.0; python_version\u003C'3.12'\",\n    \"scikit-learn>=1.3.0; python_version>='3.12'\",\n    \"ipython\",\n    \"patsy\",\n    \"seaborn\",\n    \"plotly\",\n    \"matplotlib\",\n    \"statsmodels\",\n    \"session-info\",\n]\n```\n\nSee\n[pyproject.toml](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002Fpyproject.toml)\nfor more details.\n\n## Installing _balance_\n\n### Installing via PyPi\n\nWe recommend installing _balance_ from PyPi via pip for the latest stable\nversion:\n\n```bash\npython -m pip install balance\n```\n\nInstallation will use Python wheels from PyPI, available for\n[OSX, Linux, and Windows](https:\u002F\u002Fpypi.org\u002Fproject\u002Fbalance\u002F#files).\n\n### Installing from Source\u002FGit\n\nYou can install the latest (bleeding edge) version from Git:\n\n```bash\npython -m pip install git+https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance.git\n```\n\nAlternatively, if you have a local clone of the repo:\n\n```bash\ncd balance\npython -m pip install .\n```\n\nOr using dev-dependencies:\n\n```bash\ncd balance\npython -m pip install .[dev]\n```\n\n# Getting started\n\n## balance's workflow in high-level\n\nThe core workflow in [_balance_](https:\u002F\u002Fimport-balance.org\u002F) deals with fitting\nand evaluating weights to a sample. 
For each unit in the sample (such as a\nrespondent to a survey), balance fits a weight that can be (loosely) interpreted\nas the number of people from the target population that this respondent\nrepresents. This aims to help mitigate the coverage and non-response biases, as\nillustrated in the following figure.\n\n![total_survey_error_img](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_readme_fb37679407f5.png)\n\nThe weighting of survey data through _balance_ is done in the following main\nsteps:\n\n1. Loading data of the respondents of the survey.\n2. Loading data about the target population we would like to correct for.\n3. Diagnostics of the sample covariates so to evaluate whether weighting is\n   needed.\n4. Adjusting the sample to the target.\n5. Evaluation of the results.\n6. Use the weights for producing population level estimations.\n7. Saving the output weights.\n\nYou can see a step-by-step description (with code) of the above steps in the\n[General Framework](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002F)\npage.\n\n## Code example of using _balance_\n\nYou may run the following code to play with _balance_'s basic workflow (these\nare snippets taken from the\n[quickstart tutorial](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002Fquickstart\u002F)):\n\nWe start by loading data, and adjusting it:\n\n```python\nfrom balance import load_data, Sample\n\n# load simulated example data\ntarget_df, sample_df = load_data()\n\n# Import sample and target data into a Sample object\nsample = Sample.from_frame(sample_df, outcome_columns=[\"happiness\"])\ntarget = Sample.from_frame(target_df)\n\n# Set the target to be the target of sample\nsample_with_target = sample.set_target(target)\n\n# Check basic diagnostics of sample vs target before adjusting:\n# sample_with_target.covars().plot()\n\n```\n\n_You can read more on evaluation of the pre-adjusted data in the\n[Pre-Adjustment 
Diagnostics](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002Fpre_adjustment_diagnostics\u002F)\npage._\n\nNext, we adjust the sample to the population by fitting balancing survey\nweights:\n\n```python\n# Using ipw to fit survey weights\nadjusted = sample_with_target.adjust()\n```\n\n_You can read more on adjustment process in the\n[Adjusting Sample to Population](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002Fadjusting_sample_to_population\u002F)\npage._\n\nThe above code gets us an `adjusted` object with weights. We can evaluate the\nbenefit of the weights to the covariate balance, for example by running:\n\n```python\nprint(adjusted.summary())\n    # Covar ASMD reduction: 62.3%, design effect: 2.249\n    # Covar ASMD (7 variables):0.335 -> 0.126\n    # Model performance: Model proportion deviance explained: 0.174\n\nadjusted.covars().plot(library = \"seaborn\", dist_type = \"kde\")\n```\n\nAnd get:\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_readme_63fd8b6ff40e.png)\n\nWe can also check the impact of the weights on the outcome using:\n\n```python\n# For the outcome:\nprint(adjusted.outcomes().summary())\n    # 1 outcomes: ['happiness']\n    # Mean outcomes:\n    #             happiness\n    # source\n    # self        54.221388\n    # unadjusted  48.392784\n    #\n    # Response rates (relative to number of respondents in sample):\n    #    happiness\n    # n     1000.0\n    # %      100.0\nadjusted.outcomes().plot()\n```\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_readme_ab8e0c70a26d.png)\n\n_You can read more on evaluation of the post-adjusted data in the\n[Evaluating and using the adjustment weights](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002Fevaluation_of_results\u002F)\npage._\n\nFinally, the adjusted data can be downloaded 
using:\n\n```python\nadjusted.to_download()  # Or:\n# adjusted.to_csv()\n```\n\nTo see a more detailed step-by-step code example with code output prints and\nplots (both static and interactive), please go over to the\n[tutorials section](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002F).\n\n## Implemented methods for adjustments\n\n_balance_ currently implements various adjustment methods. Click the links to\nlearn more about each:\n\n1. [Logistic regression using L1 (LASSO) penalization.](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fstatistical_methods\u002Fipw\u002F)\n2. [Covariate Balancing Propensity Score (CBPS).](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fstatistical_methods\u002Fcbps\u002F)\n3. [Post-stratification.](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fstatistical_methods\u002Fpoststratify\u002F)\n4. [Raking.](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fstatistical_methods\u002Frake\u002F)\n\n## Implemented methods for diagnostics\u002Fevaluation\n\nFor diagnostics the main tools (comparing before, after applying weights, and\nthe target population) are:\n\n1. Plots\n   1. barplots\n   2. density plots (for weights and covariates)\n   3. qq-plots\n2. Statistical summaries\n   1. Weights distributions\n      1. [Kish's design effect](\u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDesign_effect#Haphazard_weights_with_estimated_ratio-mean_(%7F'%22%60UNIQ--postMath-0000003A-QINU%60%22'%7F)_-_Kish's_design_effect>)\n      2. Main summaries (mean, median, variances, quantiles)\n   2. Covariate distributions\n      1. Absolute Standardized Mean Difference (ASMD). 
For continuous variables,\n         it is [Cohen's d](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEffect_size#Cohen's_d).\n         Categorical variables are one-hot encoded, Cohen's d is calculated for\n         each category and ASMD for a categorical variable is defined as Cohen's\n         d, averaged across all categories.\n\n_You can read more on evaluation of the post-adjusted data in the\n[Evaluating and using the adjustment weights](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002Fevaluation_of_results\u002F)\npage._\n\n## Other resources\n\n- Presentation:\n  [\"Balancing biased data samples with the 'balance' Python package\"](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002Fwebsite\u002Fstatic\u002Fdocs\u002FBalancing_biased_data_samples_with_the_balance_Python_package_-_ISA_conference_2023-06-01.pdf) -\n  presented in the Israeli Statistical Association (ISA) conference on June\n  1st 2023.\n\n# More details\n\n## Getting help, submitting bug reports and contributing code\n\nYou are welcome to:\n\n- Learn more in the [_balance_](https:\u002F\u002Fimport-balance.org\u002F) website.\n- Ask for help on:\n  https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\u002Fnew?template=support_question.md\n- Submit bug reports and feature suggestions at:\n  https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\n- Send a pull request on: https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance. See the\n  [CONTRIBUTING](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FCONTRIBUTING.md)\n  file for how to help out. And our\n  [CODE OF CONDUCT](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FLICENSE-DOCUMENTATION)\n  for our expectations from contributors.\n\n## Citing _balance_\n\nSarig, T., Galili, T., & Eilat, R. (2023). 
balance – a Python package for\nbalancing biased data samples.\n[https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06024](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06024)\n\n```bibtex\n@misc{sarig2023balance,\n      title={balance - a Python package for balancing biased data samples},\n      author={Tal Sarig and Tal Galili and Roee Eilat},\n      year={2023},\n      eprint={2307.06024},\n      archivePrefix={arXiv},\n      primaryClass={stat.CO}\n}\n```\n\n## License\n\nThe _balance_ package is licensed under the\n[MIT license](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FLICENSE),\nand all the documentation on the site (including text and images) is under\n[CC-BY](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FLICENSE-DOCUMENTATION).\n\n# News\n\nYou can follow updates on our:\n\n- [Blog](https:\u002F\u002Fimport-balance.org\u002Fblog\u002F)\n- [Changelog](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FCHANGELOG.md)\n\n## Acknowledgements \u002F People\n\nThe _balance_ package is actively maintained by people from the\n[Central Applied Science](https:\u002F\u002Fresearch.facebook.com\u002Fteams\u002Fcentral-applied-science\u002F)\nteam (in Menlo Park and Tel Aviv), by\n[Wesley Lee](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fwesley-lee),\n[Tal Sarig](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fsarig-tal\u002F), and\n[Tal Galili](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fgalili-tal\u002F).\n\nThe _balance_ package was (and is) developed by many people, including:\n[Roee Eilat](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Feilat-roee\u002F),\n[Tal Galili](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fgalili-tal\u002F),\n[Daniel Haimovich](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fhaimovich-daniel\u002F),\n[Kevin 
Liou](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fkevinycliou),\n[Steve Mandala](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fmandala-steve\u002F),\n[Adam Obeng](https:\u002F\u002Fadamobeng.com\u002F) (author of the initial internal Meta\nversion), [Tal Sarig](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fsarig-tal\u002F),\n[Luke Sonnet](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fluke-sonnet),\n[Sean Taylor](https:\u002F\u002Fseanjtaylor.com),\n[Barak Yair Reif](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fbarak-yair-reif-2154365\u002F),\n[Soumyadip Sarkar](https:\u002F\u002Fgithub.com\u002Fneuralsorcerer), and others. If you worked\non balance in the past, please email us to be added to this list.\n\nThe _balance_ package was open-sourced by\n[Tal Sarig](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fsarig-tal\u002F),\n[Tal Galili](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fgalili-tal\u002F) and\n[Steve Mandala](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fmandala-steve\u002F) in\nlate 2022.\n\nBranding created by [Dana Beaty](https:\u002F\u002Fwww.danabeaty.com\u002F), from the Meta AI\nDesign and Marketing Team. 
For logo files, see\n[here](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Ftree\u002Fmain\u002Fwebsite\u002Fstatic\u002Fimg\u002F).\n","[![balance_logo_horizontal](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_readme_4a15c6b9316a.png)](https:\u002F\u002Fimport-balance.org\u002F)\n\n# _balance_: 一个用于平衡有偏数据样本的 Python 包\n\n\u003Cdiv align=\"center\">\n\n[![当前版本](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Frelease\u002Ffacebookresearch\u002Fbalance.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Freleases)\n[![Python 3.9+](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FPython-3.9+-fcbc2c.svg?logo=python&logoColor=white)](https:\u002F\u002Fwww.python.org\u002Fdownloads\u002F)\n[![构建与测试](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fbuild-and-test.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fbuild-and-test.yml?query=branch%3Amain)\n[![覆盖率](https:\u002F\u002Fimg.shields.io\u002Fendpoint?url=https:\u002F\u002Fgist.githubusercontent.com\u002Ftalgalili\u002F89d05034d314ebda47c1e16607e1ee22\u002Fraw\u002Fcoverage-balance.json)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fcoverage.yml?query=branch%3Amain)\n[![CodeQL](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fcodeql.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fcodeql.yml?query=branch%3Amain)\n[![部署网站](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fdeploy-website.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Fdeploy-website.yml?query=branch%3Amain)\n[![发布](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002F
actions\u002Fworkflows\u002Frelease.yml\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fworkflows\u002Frelease.yml?query=branch%3Amain)\n[![DOI](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDOI-10.48550\u002FarXiv.2307.06024-blue.svg)](https:\u002F\u002Fdoi.org\u002F10.48550\u002FarXiv.2307.06024)\n[![下载量](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_readme_ce311fe03a80.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fbalance)\n\n\u003C\u002Fdiv>\n\n> [!注意]\n> _balance_ 目前处于 **测试阶段**，并得到积极维护。请在 [GitHub](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance) 上关注我们。\n\n## 什么是 _balance_？\n\n**[_balance_](https:\u002F\u002Fimport-balance.org\u002F) 是一个 Python 包**，提供了一套\n简单的工作流程和方法，用于在从有偏样本推断目标总体时，\n**处理有偏数据样本**的问题。\n\n有偏样本通常出现在\n[调查统计学](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSurvey_methodology) 中，当受访者表现出\n[无响应偏差](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FParticipation_bias) 或调查本身存在\n[抽样偏差](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSampling_bias)（而非\n[完全随机缺失](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMissing_data#Missing_completely_at_random)）时。类似的问题也会出现在\n[观察性研究](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FObservational_study) 中，在比较接受治疗与未接受治疗的组别时，以及任何存在选择偏差的数据中。\n\n在随机缺失假设\n([MAR](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FMissing_data#Missing_at_random)) 下，样本中的偏差有时可以通过利用辅助信息（即“协变量”或“特征”）来部分缓解——这些信息既存在于样本中的所有个体中，也存在于来自总体的样本中。例如，如果我们希望从某次调查的样本中进行推断，可以使用年龄、性别、教育程度等人口统计信息来调整无响应偏差。这可以通过利用辅助信息对样本进行加权，使其更接近总体分布来实现。\n\n该包面向希望使用 Python 包来平衡有偏样本的研究人员，例如来自调查的数据。这种需求可能出现在调查方法学家、人口统计学家、用户体验研究人员、市场研究人员，以及广义上的数据科学家、统计学家和机器学习从业者中。\n\n更多方法论背景可参见\n[Sarig, T., Galili, T., & Eilat, R. (2023). 
balance – a Python package for balancing biased data samples](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06024)。\n\n# 安装\n\n## 要求\n\n运行 _balance_ 需要 Python 3.9、3.10、3.11、3.12、3.13 或 3.14。_balance_ 可以在 Linux、OSX 和 Windows 上构建和运行。\n\n所需的 Python 依赖项如下：\n\n```python\nREQUIRES = [\n    # Numpy 和 pandas：严格控制版本以确保二进制兼容性\n    \"numpy>=1.21.0,\u003C2.0; python_version\u003C'3.12'\",\n    \"numpy>=1.24.0; python_version>='3.12'\",\n    \"pandas>=1.5.0,\u003C4.0.0; python_version\u003C'3.12'\",\n    \"pandas>=2.0.0,\u003C4.0.0; python_version>='3.12'\",\n    # 科学计算栈\n    \"scipy>=1.7.0,\u003C1.14.0; python_version\u003C'3.12'\",\n    \"scipy>=1.11.0; python_version>='3.12'\",\n    \"scikit-learn>=1.0.0,\u003C1.4.0; python_version\u003C'3.12'\",\n    \"scikit-learn>=1.3.0; python_version>='3.12'\",\n    \"ipython\",\n    \"patsy\",\n    \"seaborn\",\n    \"plotly\",\n    \"matplotlib\",\n    \"statsmodels\",\n    \"session-info\",\n]\n```\n\n更多详细信息请参阅\n[pyproject.toml](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002Fpyproject.toml)。\n\n## 安装 _balance_\n\n### 通过 PyPI 安装\n\n我们建议通过 pip 从 PyPI 安装 _balance_ 的最新稳定版本：\n\n```bash\npython -m pip install balance\n```\n\n安装将使用 PyPI 提供的 Python 轮包，适用于\n[OSX、Linux 和 Windows](https:\u002F\u002Fpypi.org\u002Fproject\u002Fbalance\u002F#files)。\n\n### 从源代码\u002FGit 安装\n\n您也可以从 Git 安装最新的（前沿）版本：\n\n```bash\npython -m pip install git+https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance.git\n```\n\n或者，如果您已经克隆了仓库：\n\n```bash\ncd balance\npython -m pip install .\n```\n\n也可以使用开发依赖项：\n\n```bash\ncd balance\npython -m pip install .[dev]\n```\n\n# 入门\n\n## balance 的高级工作流程\n\n[_balance_](https:\u002F\u002Fimport-balance.org\u002F) 的核心工作流程涉及为样本拟合和评估权重。对于样本中的每个单元（例如调查的受访者），balance 会拟合一个权重，这个权重可以粗略地理解为该受访者所代表的目标总体中的人数。其目的是帮助缓解覆盖偏差和无响应偏差，如图所示。\n\n![total_survey_error_img](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_readme_fb37679407f5.png)\n\n通过 _balance_ 
对调查数据进行加权的主要步骤如下：\n\n1. 加载调查受访者的数据。\n2. 加载我们希望校正的目标总体的相关数据。\n3. 对样本协变量进行诊断，以评估是否需要进行加权。\n4. 将样本调整至目标总体。\n5. 评估结果。\n6. 使用权重生成总体层面的估计值。\n7. 保存输出的权重。\n\n您可以在 [通用框架](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002F) 页面中找到上述步骤的逐步描述（附带代码）。\n\n## 使用 _balance_ 的代码示例\n\n你可以运行以下代码来体验 _balance_ 的基本工作流程（这些片段摘自\n[快速入门教程](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002Fquickstart\u002F)）：\n\n首先，我们加载数据并进行调整：\n\n```python\nfrom balance import load_data, Sample\n\n# 加载模拟示例数据\ntarget_df, sample_df = load_data()\n\n# 将样本和目标数据导入 Sample 对象\nsample = Sample.from_frame(sample_df, outcome_columns=[\"happiness\"])\ntarget = Sample.from_frame(target_df)\n\n# 设置目标为样本的目标\nsample_with_target = sample.set_target(target)\n\n# 在调整之前检查样本与目标的基本诊断：\n# sample_with_target.covars().plot()\n```\n\n_你可以在 [预调整诊断](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002Fpre_adjustment_diagnostics\u002F) 页面中了解更多关于预调整数据评估的信息。_\n\n接下来，我们通过拟合平衡抽样权重将样本调整到总体：\n\n```python\n# 使用 ipw 拟合抽样权重\nadjusted = sample_with_target.adjust()\n```\n\n_你可以在 [将样本调整到总体](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002Fadjusting_sample_to_population\u002F) 页面中了解更多关于调整过程的信息。_\n\n上述代码会生成一个带有权重的 `adjusted` 对象。我们可以评估权重对协变量平衡的好处，例如通过运行：\n\n```python\nprint(adjusted.summary())\n    # 协变量 ASMD 降低：62.3%，设计效应：2.249\n    # 协变量 ASMD（7 个变量）：0.335 -> 0.126\n    # 模型性能：模型解释的偏差比例：0.174\n\nadjusted.covars().plot(library = \"seaborn\", dist_type = \"kde\")\n```\n\n得到如下结果：\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_readme_63fd8b6ff40e.png)\n\n我们还可以通过以下方式检查权重对结果的影响：\n\n```python\n# 对于结果：\nprint(adjusted.outcomes().summary())\n    # 1 个结果：['happiness']\n    # 平均结果：\n    #             happiness\n    # source\n    # self        54.221388\n    # unadjusted  48.392784\n    #\n    # 回应率（相对于样本中的受访者数量）：\n    #    happiness\n    # n     1000.0\n    # %      
100.0\nadjusted.outcomes().plot()\n```\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_readme_ab8e0c70a26d.png)\n\n_你可以在 [评估和使用调整权重](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002Fevaluation_of_results\u002F) 页面中了解更多关于调整后数据评估的信息。_\n\n最后，可以使用以下代码下载调整后的数据：\n\n```python\nadjusted.to_download()  # 或者：\n# adjusted.to_csv()\n```\n\n若想查看更详细的分步代码示例，包括代码输出打印和图表（静态及交互式），请前往\n[教程部分](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002F)。\n\n## 已实现的调整方法\n\n_balance_ 目前实现了多种调整方法。点击链接以了解每种方法的详细信息：\n\n1. [使用 L1（LASSO）惩罚的逻辑回归。](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fstatistical_methods\u002Fipw\u002F)\n2. [协变量平衡倾向得分（CBPS）。](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fstatistical_methods\u002Fcbps\u002F)\n3. [事后分层。](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fstatistical_methods\u002Fpoststratify\u002F)\n4. [拉格法。](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fstatistical_methods\u002Frake\u002F)\n\n## 已实现的诊断\u002F评估方法\n\n在诊断方面，主要工具（比较应用权重前后以及与目标总体的情况）包括：\n\n1. 图表\n   1. 条形图\n   2. 密度图（用于权重和协变量）\n   3. QQ 图\n2. 统计摘要\n   1. 偏差分布\n      1. [基什设计效应](\u003Chttps:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FDesign_effect#Haphazard_weights_with_estimated_ratio-mean_(%7F'%22%60UNIQ--postMath-0000003A-QINU%60%22'%7F)_-_Kish's_design_effect>)\n      2. 主要汇总统计（均值、中位数、方差、分位数）\n   2. 协变量分布\n      1. 
绝对标准化均值差异（ASMD）。对于连续变量，它等同于 [Cohen's d](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FEffect_size#Cohen's_d)。分类变量会进行独热编码，针对每个类别计算 Cohen's d，而分类变量的 ASMD 定义为所有类别 Cohen's d 的平均值。\n\n_你可以在 [评估和使用调整权重](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002Fevaluation_of_results\u002F) 页面中了解更多关于调整后数据评估的信息。_\n\n## 其他资源\n\n- 演示文稿：\n  [\"使用 'balance' Python 包平衡有偏数据样本\"](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002Fwebsite\u002Fstatic\u002Fdocs\u002FBalancing_biased_data_samples_with_the_balance_Python_package_-_ISA_conference_2023-06-01.pdf) -\n  于 2023 年 6 月 1 日在以色列统计协会 (ISA) 大会上发表。\n\n# 更多详情\n\n## 获取帮助、提交错误报告和贡献代码\n\n欢迎你：\n\n- 访问 [_balance_](https:\u002F\u002Fimport-balance.org\u002F) 官网了解更多。\n- 在以下链接寻求帮助：\n  https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\u002Fnew?template=support_question.md\n- 在以下链接提交错误报告和功能建议：\n  https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\n- 向 https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance 提交拉取请求。请参阅 [CONTRIBUTING](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FCONTRIBUTING.md) 文件以了解如何参与贡献。同时，请参考我们的 [行为准则](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FLICENSE-DOCUMENTATION)，了解我们对贡献者的期望。\n\n## 引用 _balance_\n\nSarig, T., Galili, T., & Eilat, R. (2023). 
balance – 用于平衡有偏数据样本的 Python 包。\n[https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06024](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06024)\n\n```bibtex\n@misc{sarig2023balance,\n      title={balance - a Python package for balancing biased data samples},\n      author={Tal Sarig and Tal Galili and Roee Eilat},\n      year={2023},\n      eprint={2307.06024},\n      archivePrefix={arXiv},\n      primaryClass={stat.CO}\n}\n```\n\n## 许可证\n\n_balance_ 软件包采用 [MIT 许可证](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FLICENSE) 许可，网站上的所有文档（包括文字和图片）则采用 [CC-BY](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FLICENSE-DOCUMENTATION) 许可。\n\n# 新闻\n\n你可以关注我们的：\n\n- [博客](https:\u002F\u002Fimport-balance.org\u002Fblog\u002F)\n- [变更日志](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FCHANGELOG.md)\n\n## 致谢 \u002F 参与人员\n\n_balance_ 包由来自 [中央应用科学](https:\u002F\u002Fresearch.facebook.com\u002Fteams\u002Fcentral-applied-science\u002F) 团队（位于门洛帕克和特拉维夫）的以下人员积极维护：[Wesley Lee](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fwesley-lee)、[Tal Sarig](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fsarig-tal\u002F) 和 [Tal Galili](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fgalili-tal\u002F)。\n\n_balance_ 包的开发曾由并仍在由多位同事共同完成，其中包括：[Roee Eilat](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Feilat-roee\u002F)、[Tal Galili](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fgalili-tal\u002F)、[Daniel Haimovich](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fhaimovich-daniel\u002F)、[Kevin Liou](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fkevinycliou)、[Steve Mandala](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fmandala-steve\u002F)、[Adam Obeng](https:\u002F\u002Fadamobeng.com\u002F)（Meta 内部初始版本的作者）、[Tal Sarig](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fsarig-tal\u002F)、[Luke 
Sonnet](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fluke-sonnet)、[Sean Taylor](https:\u002F\u002Fseanjtaylor.com)、[Barak Yair Reif](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fbarak-yair-reif-2154365\u002F)、[Soumyadip Sarkar](https:\u002F\u002Fgithub.com\u002Fneuralsorcerer) 等。如果您过去曾参与 _balance_ 的相关工作，请发送邮件至我们，以便将您的名字加入此列表。\n\n_balance_ 包于 2022 年底由 [Tal Sarig](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fsarig-tal\u002F)、[Tal Galili](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fgalili-tal\u002F) 和 [Steve Mandala](https:\u002F\u002Fresearch.facebook.com\u002Fpeople\u002Fmandala-steve\u002F) 共同开源。\n\n品牌设计由 Meta AI 设计与市场团队的 [Dana Beaty](https:\u002F\u002Fwww.danabeaty.com\u002F) 完成。有关徽标文件，请参阅[此处](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Ftree\u002Fmain\u002Fwebsite\u002Fstatic\u002Fimg\u002F)。","# balance 快速上手指南\n\n`balance` 是一个 Python 包，旨在帮助研究人员和数据科学家处理**有偏数据样本**（biased data samples）。它通过利用辅助信息（协变量）为样本拟合权重，从而将样本调整至目标总体分布，有效缓解调查统计、观察性研究中的无响应偏差和选择偏差。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**：Linux, macOS, 或 Windows。\n*   **Python 版本**：需要 Python 3.9 至 3.14 之间的版本。\n*   **核心依赖**：安装时会自动解决以下主要依赖库的版本兼容性问题：\n    *   `numpy`, `pandas`\n    *   `scipy`, `scikit-learn`\n    *   `statsmodels`, `seaborn`, `matplotlib`, `plotly`\n    *   `patsy`, `ipython`\n\n> **注意**：该工具目前处于 **Beta** 阶段，但正在积极维护中。\n\n## 安装步骤\n\n推荐使用 `pip` 从 PyPI 安装最新稳定版本。国内用户若遇到下载速度慢的问题，可指定清华或阿里镜像源。\n\n### 方式一：通过 PyPI 安装（推荐）\n\n```bash\n# 使用默认源\npython -m pip install balance\n\n# 或使用国内镜像源加速（推荐中国开发者使用）\npython -m pip install balance -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n### 方式二：从源码安装（获取最新开发版）\n\n如果您需要体验最新的功能（bleeding edge），可以直接从 GitHub 安装：\n\n```bash\npython -m pip install git+https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance.git\n```\n\n或者克隆仓库后本地安装：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance.git\ncd balance\npython -m pip install .\n```\n\n## 基本使用\n\n`balance` 
的核心工作流包括：加载数据 -> 定义样本与目标总体 -> 诊断偏差 -> 拟合权重 -> 评估结果。\n\n以下是最简化的代码示例，演示如何加载模拟数据并执行加权调整：\n\n### 1. 加载数据并初始化\n\n首先导入必要的模块，加载示例数据，并将数据封装为 `Sample` 对象。\n\n```python\nfrom balance import load_data, Sample\n\n# 加载模拟示例数据 (target_df 为目标总体，sample_df 为有偏样本)\ntarget_df, sample_df = load_data()\n\n# 将 DataFrame 转换为 Sample 对象\n# outcome_columns 指定需要分析的结果变量（如幸福感）\nsample = Sample.from_frame(sample_df, outcome_columns=[\"happiness\"])\ntarget = Sample.from_frame(target_df)\n\n# 将目标总体设定为样本的调整目标\nsample_with_target = sample.set_target(target)\n\n# (可选) 查看调整前的协变量诊断图\n# sample_with_target.covars().plot()\n```\n\n### 2. 执行加权调整\n\n使用默认的逆概率加权（IPW）方法拟合权重，将样本调整至与目标总体一致。\n\n```python\n# 拟合平衡权重\nadjusted = sample_with_target.adjust()\n```\n\n### 3. 评估结果\n\n调整完成后，可以查看摘要统计信息以确认偏差是否降低（如 ASMD 减少比例），并可视化协变量分布。\n\n```python\n# 打印调整摘要\nprint(adjusted.summary())\n# 输出示例:\n# Covar ASMD reduction: 62.3%, design effect: 2.249\n# Covar ASMD (7 variables):0.335 -> 0.126\n\n# 可视化调整后的协变量分布 (使用 seaborn)\nadjusted.covars().plot(library=\"seaborn\", dist_type=\"kde\")\n```\n\n### 4. 
分析结果变量与导出\n\n检查权重对结果变量（如 `happiness`）均值的影响，并导出带有权重的数据。\n\n```python\n# 查看结果变量的汇总统计\nprint(adjusted.outcomes().summary())\n\n# 可视化结果变量分布\nadjusted.outcomes().plot()\n\n# 导出数据（包含计算出的权重）\nadjusted.to_csv(\"balanced_data.csv\") \n# 或者获取下载链接\n# adjusted.to_download()\n```\n\n通过以上步骤，您即可完成从有偏样本到平衡样本的完整处理流程。更多高级用法（如使用 CBPS、事后分层等方法）请参考官方文档教程部分。","某市场研究团队正试图通过线上问卷数据，推断全国消费者对新款智能手表的购买意愿，但回收的样本中年轻男性比例严重过高。\n\n### 没有 balance 时\n- 直接基于原始数据建模会导致预测结果严重偏向年轻群体，完全无法代表全年龄段的真实市场需求。\n- 分析师需手动编写复杂的加权算法代码来匹配人口普查数据，不仅耗时且极易因公式错误导致统计偏差。\n- 缺乏标准化的评估指标，难以量化样本偏差的具体程度，无法向管理层证明最终结论的可信度。\n- 面对多维度的协变量（如年龄、性别、收入、地区），传统手动调整方法难以同时平衡所有特征分布。\n\n### 使用 balance 后\n- 利用 balance 内置的加权方法，自动将样本权重调整至与目标人口结构一致，显著消除了非响应偏差。\n- 仅需几行代码即可调用成熟的统计流程，无需重复造轮子，将原本数天的数据清洗工作缩短至小时级。\n- 工具提供可视化的平衡性诊断图表，直观展示调整前后协变量分布的差异，让分析结论有据可依。\n- 能够同时处理多个辅助变量，确保在年龄、性别和地域等多个维度上样本与总体高度对齐，提升推断精度。\n\nbalance 将复杂的统计加权理论转化为简单的 Python 工作流，让研究人员能从有偏样本中可靠地推断出总体真相。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ffacebookresearch_balance_4a15c6b9.png","facebookresearch","Meta Research","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ffacebookresearch_449342bd.png","",null,"https:\u002F\u002Fopensource.fb.com","https:\u002F\u002Fgithub.com\u002Ffacebookresearch",[85,89,93,97,101,105,108,111,115],{"name":86,"color":87,"percentage":88},"Python","#3572A5",71.6,{"name":90,"color":91,"percentage":92},"Jupyter Notebook","#DA5B0B",27.3,{"name":94,"color":95,"percentage":96},"JavaScript","#f1e05a",0.4,{"name":98,"color":99,"percentage":100},"MDX","#fcb32c",0.2,{"name":102,"color":103,"percentage":104},"Shell","#89e051",0.1,{"name":106,"color":107,"percentage":104},"R","#198CE7",{"name":109,"color":110,"percentage":104},"CSS","#663399",{"name":112,"color":113,"percentage":114},"Batchfile","#C1F12E",0,{"name":116,"color":117,"percentage":114},"Makefile","#427819",741,51,"2026-03-31T08:07:01","MIT","Linux, macOS, Windows","未说明",{"notes":125,"python":126,"dependencies":127},"该工具主要用于处理调查统计中的偏差数据样本（如加权调整），不涉及深度学习模型，因此无 GPU 需求。依赖库版本会根据 Python 
版本自动调整（例如 Python 3.12+ 需要更高版本的 numpy 和 pandas）。可通过 pip 直接安装或使用源码安装。","3.9, 3.10, 3.11, 3.12, 3.13, 3.14",[128,129,130,131,132,133,134,135,136,137],"numpy>=1.21.0","pandas>=1.5.0","scipy>=1.7.0","scikit-learn>=1.0.0","statsmodels","seaborn","plotly","matplotlib","patsy","ipython",[14,18],"2026-03-27T02:49:30.150509","2026-04-06T05:36:31.390490",[142,147,152,157,162,167],{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},15398,"使用 'rake' 方法时出现 'AttributeError: numpy.int64 object has no attribute loc' 错误怎么办？","该错误通常是因为样本（sample）中存在的某些分箱（bin）在目标数据（target）中从未出现过。解决方法是：将样本数据添加到目标数据中，确保所有在样本中出现的分箱也存在于目标数据中，这样错误就会消失。此外，维护者表示已对 raking 算法进行了改进以增强其鲁棒性，建议升级到最新版本的 balance 库。如果问题仍然存在，请提交新的 Issue。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\u002F73",{"id":148,"question_zh":149,"answer_zh":150,"source_url":151},15399,"如何在 Python 3.11 环境中安装 balance 包？遇到 scipy 版本不兼容问题如何解决？","旧版本的 balance 依赖 `scipy\u003C=1.8.1`，这与 Python 3.11 不兼容（Scipy 1.9.0+ 才支持 Python 3.11）。如果遇到安装失败或元数据生成错误，请尝试以下方案：\n1. 升级 balance 包到最新版本，维护者已更新依赖以支持更新的 scipy 版本。\n2. 检查 pandas 版本，旧版可能限制为 `pandas\u003C=1.4.3`，如需在 Python 3.11 使用，可能需要等待或手动调整依赖以支持 pandas 1.5.0+。\n3. 如果必须使用旧环境，请暂时使用 Python 3.8 - 3.10。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\u002F19",{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},15400,"使用 rake 算法校准边际分布时遇到错误或无法复现的问题该如何处理？","如果在进行边际分布校准时遇到错误，首先请确保使用的是最新版本的 balance 包。近期版本已移除了外部依赖（如 ipfn），并修复了许多已知问题。如果问题依旧，由于该问题难以复现，建议：\n1. 清理环境并重新安装最新版。\n2. 准备详细的复现步骤、日志和环境信息（包括 pandas, numpy 版本等）。\n3. 提交一个新的 Issue 并附上上述信息，以便维护者排查。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\u002F80",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},15401,"poststratify 函数的作用是什么？它如何处理单列和多列特征？文档在哪里？","`poststratify` 函数主要用于事后分层加权。关于其具体行为（特别是处理两个特征时是否回退到 raking 算法），官方文档和教程已进行了更新和澄清。\n1. 
查看最新的 API 参考文档：https:\u002F\u002Fimport-balance.org\u002Fapi_reference\u002Fhtml\u002Fbalance.weighting_methods.poststratify.html\n2. 参考快速入门教程中的 raking 部分以对比理解：http:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002Fquickstart_rake\u002F\n维护者已补充了包含单特征和双特征的示例代码，建议直接查阅上述链接获取最新用法说明。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\u002F111",{"id":163,"question_zh":164,"answer_zh":165,"source_url":166},15402,"使用 prepare_marginal_dist_for_raking 时内存爆炸（Memory Blowup）怎么办？","当协变量超过 2-3 个，或者传入的比例字典精度过高（超过 2 位小数）时，`prepare_marginal_dist_for_raking` 可能会因为最小公倍数（LCM）扩展导致目标 DataFrame 行数激增，从而引发内存溢出。\n解决方案：\n1. 降低传入比例数据的精度（例如保留 2 位小数）。\n2. 减少参与计算的协变量数量。\n3. 确保使用的是最新版本（0.17.0+），维护者正在关注此类性能问题。如果问题严重，建议检查输入数据的基数（cardinality）是否过大。","https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\u002F369",{"id":168,"question_zh":169,"answer_zh":170,"source_url":146},15403,"balance 库中不同加权方法（ipw, cbps, rake）的主要区别及适用场景是什么？","根据用户反馈和 Issue 记录：\n1. `ipw` (逆概率加权) 和 `cbps` (协变量平衡倾向得分) 通常在常规情况下运行稳定，不易出现分箱缺失导致的错误。\n2. `rake` (raking\u002Frim weighting) 适用于多维边际分布校准，但对数据一致性要求较高。如果样本中的分箱在目标总体中不存在，会抛出 AttributeError。使用前需确保目标数据覆盖样本的所有分箱类别，或使用新版库的鲁棒性修复。\n3. 
如果需要进行复杂的多维边际校准，首选 `rake`，但需注意数据预处理；若只需基于倾向得分平衡，`cbps` 或 `ipw` 可能更简单稳健。",[172,177,182,187,192,197,202,207,212,217,222,227,232,237,242,247,252,257,262,267],{"id":173,"version":174,"summary_zh":175,"released_at":176},90049,"0.18.0","## 新特性\n\n- **实现了使用经过验证的样本方差公式计算的 `r_indicator()`**\n  - 在 `weighted_comparisons_stats` 模块中添加了公开的 `r_indicator(sample_p, target_p)` 实现，该实现基于文档中的公式 2.2.2，对拼接后的倾向得分向量进行计算，并进行了显式的输入大小验证。\n  - 增加了对非有限值和超出范围的倾向得分值的验证，并扩展了单元测试覆盖率，以确保公式的正确性及边界情况的处理。\n  - 添加了 `BalanceDFWeights.r_indicator()` 作为便捷封装方法，使得可以直接通过 `sample.weights().r_indicator()` 计算 r 指标。\n\n## 已弃用功能\n\n- **`Sample.design_effect()` 已弃用** —— 请改用 `sample.weights().design_effect()`。该方法已存在于 `BalanceDFWeights` 中；`Sample` 类中的方法现在会发出 `DeprecationWarning` 并进行委托调用。计划在 balance 0.19.0 版本中移除。\n- **`Sample.design_effect_prop()` 已弃用** —— 请改用 `sample.weights().design_effect_prop()`。已在 `BalanceDFWeights` 中新增该方法。计划在 balance 0.19.0 版本中移除。\n- **`Sample.plot_weight_density()` 已弃用** —— 请改用 `sample.weights().plot()`。计划在 balance 0.19.0 版本中移除。\n- **`Sample.covar_means()` 已弃用** —— 请改用 `sample.covars().mean()`（配合 `.rename(index={'self': 'adjusted'}).reindex([...]).T` 可获得相同格式）。计划在 balance 0.19.0 版本中移除。\n- **`Sample.outcome_sd_prop()` 已弃用** —— 请改用 `sample.outcomes().outcome_sd_prop()`。已在 `BalanceDFOutcomes` 中新增该方法。计划在 balance 0.19.0 版本中移除。\n- **`Sample.outcome_variance_ratio()` 已弃用** —— 请改用 `sample.outcomes().outcome_variance_ratio()`。已在 `BalanceDFOutcomes` 中新增该方法。计划在 balance 0.19.0 版本中移除。\n\n## LLM\u002FGenAI\n\n- **为 Claude Code 用户添加了 `CLAUDE.md` 项目上下文文件**，内容涵盖架构、构建与测试说明（包括 Meta 和开源版本）、代码规范以及提交前检查清单。\n- **更新了 `.github\u002Fcopilot-instructions.md` 审查检查清单**，以减少与 `CLAUDE.md` 的重复内容，并补充了遗漏的规范（MIT 许可证头部注释、`from __future__ import annotations`、工厂模式、随机种子固定、弃用风格等）。\n\n## 错误修复\n\n- **`prepare_marginal_dist_for_raking` \u002F `_realize_dicts_of_proportions`：修复了因最小公倍数展开导致的内存爆炸问题**\n  - 当比例具有较高小数精度或传递的协变量数量较多时，各变量数组长度的最小公倍数可能达到数千万甚至更高，从而引发内存不足崩溃。\n  - 两个函数现均接受 `max_length` 参数（默认值为 10000）。当自然计算出的最小公倍数超过 `max_length` 
时，输出将被限制为最多 `max_length` 行，并采用 **哈雷-尼迈耶（最大余数法）** 分配计数，该方法可确保总数严格等于 `max_length`，且每个类别的四舍五入误差最小。\n  - 每次应用上限限制时，都会记录一条警告信息。\n  - 一个新的整数","2026-03-24T19:58:19",{"id":178,"version":179,"summary_zh":180,"released_at":181},90050,"0.17.0","\r\n## 重大变更\r\n\r\n- **CLI：未提及的列现在会进入 `ignore_columns`，而不是 `outcome_columns`**\r\n  - 以前，当未显式设置 `--outcome_columns` 时，所有既不是 ID 列、权重列，也不是协变量的列都会自动被归类为结果列。现在，这些列会被放入 `ignore_columns` 中。\r\n  - 显式提及的列——ID 列、权重列、协变量列和结果列——**不会**被忽略。\r\n\r\n## 新特性\r\n\r\n- **ASCII 对比直方图及绘图改进**\r\n  - 新增了 `ascii_comparative_hist`，用于通过内联视觉符号（`█`、`▒`、`▐`、`░`）将多个分布与基准分布进行比较。\r\n  - 对比 ASCII 图现在会按总体 → 调整后 → 样本的顺序排列数据集。\r\n  - `ascii_plot_dist` 接受一个新的 `comparative` 关键字参数（默认为 `True`），用于在数值变量的对比直方图和分组条形图之间切换。\r\n\r\n## 代码质量与重构\r\n\r\n- **将数据集加载实现移出 `balance.datasets.__init__`**\r\n  - 将 `load_sim_data`、`load_cbps_data` 和 `load_data` 重构到 `balance.datasets.loading_data` 模块中，并从 `balance.datasets` 重新导出，以保持公共 API 的同时，使各模块职责更加清晰。\r\n\r\n## 文档更新\r\n\r\n- **ASCII 绘图文档及教程示例**\r\n  - 在 ASCII 绘图的文档字符串中添加了渲染后的文本图示例，并记录了对 `library=\"balance\"` 的支持。同时更新了 `balance_quickstart.ipynb`，加入了调整前后的 ASCII 绘图示例。\r\n  - **改进了 `keep_columns` 的文档**\r\n    - 更新了 `has_keep_columns()`、`keep_columns()` 以及 `--keep_columns` 参数的文档字符串，明确指出保留列控制最终输出 CSV 中出现哪些列。那些既不是 ID 列、权重列、协变量列，也不是结果列的保留列，会在处理过程中被放入 `ignore_columns`，但仍然会被保留并在输出中可用。\r\n  - **澄清了 `_prepare_input_model_matrix` 参数的文档**\r\n    - 更新了 `balance.utils.model_matrix` 模块中的文档字符串，针对准备模型矩阵输入时的 `sample`、`target`、`variables` 和 `add_na` 行为进行了明确说明。\r\n\r\n## 错误修复\r\n\r\n- **权重诊断现在一致地接受 DataFrame 输入**\r\n  - `design_effect`、`nonparametric_skew`、`prop_above_and_below` 和 `weighted_median_breakdown_point` 现在会在计算之前，将 DataFrame 输入显式标准化为其第一列，从而匹配验证行为，并始终返回标量或 Series 类型的输出。\r\n  - **模型矩阵的鲁棒性提升**\r\n    - `_make_df_column_names_unique()` 现在避免了诸如 `a`、`a_1` 以及重复的 `a` 名称同时出现时的后缀冲突，以确定性的方式重命名重复列，防止下游冲突。\r\n  - `_prepare_input_model_matrix()` 现在会在输入…","2026-03-17T14:01:59",{"id":183,"version":184,"summary_zh":185,"released_at":186},90051,"0.16.0","## 新特性\n\n- 
**结果权重影响诊断**\n  - 添加了成对的结果-权重影响检验（`y*w0` 对 `y*w1`），并附带置信区间。\n  - 在 `BalanceDFOutcomes`、`Sample.diagnostics()` 中以及 CLI 的 `--weights_impact_on_outcome_method` 参数中提供支持。\n- **Pandas 3 支持**\n  - 更新了与 Pandas 3.x 的兼容性及测试。\n- **无需独热编码的分类分布度量**\n  - `BalanceDF.covars()` 上的 KLD\u002FEMD\u002FCVMD\u002FKS 现在直接作用于原始分类变量（含 NA 标志），而非独热编码后的列。\n- **其他**\n  - **自定义模型的原始协变量调整**\n    - `Sample.adjust()` 现在支持在原始协变量上拟合模型（无需构建设计矩阵），通过设置 `use_model_matrix=False` 实现逆概率加权。字符串、对象和布尔类型的列会被转换为 Pandas 的 `Categorical` 类型，从而允许原生支持分类数据的 sklearn 估计器（例如，使用 `categorical_features=\"from_dtype\"` 的 `HistGradientBoostingClassifier`）正确处理这些数据。当存在分类列时，需要 scikit-learn ≥ 1.4。\n  - **验证权重包含正值**\n    - 在权重诊断中增加了保护机制，当所有权重均为零时会抛出错误。\n  - **支持可配置的 ID 列候选**\n    - `Sample.from_frame()` 和 `guess_id_column()` 现在可以在自动检测 ID 列时接受多个候选 ID 列名。\n  - **BalanceDF 设计矩阵的公式支持**\n    - `BalanceDF.model_matrix()` 现在接受 `formula` 参数，用于构建自定义设计矩阵，而无需手动预先计算。\n\n## Bug 修复\n\n- **移除已弃用的 setup 构建**\n  - 在 CI 中将已弃用的 `setup.py` 替换为 `pyproject.toml` 构建，以避免构建失败。\n- **强化 ID 列候选验证**\n  - `guess_id_column()` 现在会忽略重复的候选名称，并确保候选名称是非空字符串。\n- **强化 Pandas 3 兼容路径**\n  - 更新了 Pandas 3 数据类型下的字符串\u002FNA 处理及离散性检查，并刷新了测试以支持基于字符串的数据类型。\n\n## 打包与测试\n\n- **Pandas 3.x 兼容性**\n  - 扩展了 Pandas 的依赖范围，以支持 Pandas 3.x 版本。\n- **测试中直接导入工具函数**\n  - 重构了工具测试模块，使其直接从各自模块导入辅助函数，而非通过 `balance_util` 导入。\n\n## 破坏性变更\n\n- **要求权重诊断中归一化或聚合操作的权重为正**\n  - `design_effect`、`nonparametric_skew`、`prop_above_and_below` 和 `weighted_median_breakdown_point` 现在会在所有权重均为零时抛出 `ValueError`。\n  - **迁移建议：** 在调用这些诊断之前，请确保您的权重至少包含一个正值；或者，在工作流中可能出现全零权重的情况下，捕获并处理 `ValueError` 异常。\n\n## 贡献者\n\n\n@neuralsorcerer、@talgalili（并由 @talsarig 进行代码和方法学评审）\n\n完整变更日志：https:\u002F\u002Fgithub.com\u002Ffacebookrese","2026-02-09T15:23:42",{"id":188,"version":189,"summary_zh":190,"released_at":191},90052,"0.15.0","## 新特性\n\n- **新增 EMD\u002FCVMD\u002FKS 分布诊断**\n  - `BalanceDF` 现在提供了地球移动距离（EMD）、克拉默-冯·米塞斯距离（CVMD）以及柯尔莫哥洛夫-斯米尔诺夫统计量（KS），用于比较调整后的样本与目标分布。\n  - 这些诊断支持加权或非加权比较，可选择离散或连续形式，并且会根据 `aggregate_by_main_covar` 
参数对独热编码的分类变量进行聚合处理。\n- **在 CLI 中公开结果列的选择功能**\n  - 新增了 `--outcome_columns` 参数，允许用户指定哪些列被视为结果列，而不是默认使用所有非 ID、权重和协变量的列。其余列将被移至 `ignored_columns`。\n- **改进了 `poststratify()` 中缺失值的处理**\n  - `poststratify()` 现在接受 `na_action` 参数，可以选择直接丢弃包含缺失值的行，或者在加权过程中将缺失值视为单独的类别。\n  - **破坏性变更：** 默认行为现在会用 `\"__NaN__\"` 填充后分层变量中的缺失值，并将其作为加权时的一个独立类别。此前，缺失值并未被显式处理，其处理方式依赖于 pandas 的 `groupby` 和 `merge` 的默认行为。若希望恢复旧版行为，即让缺失值不形成独立类别，需显式传递 `na_action=\"drop\"`。\n- **为 `descriptive_stats` 模型矩阵添加公式支持**\n  - `descriptive_stats()` 现在接受一个 `formula` 参数，该参数始终应用于数据（包括仅包含数值的框架），使调用者能够控制哪些项和虚拟变量会被纳入汇总统计中。\n\n## 文档更新\n\n- **记录了平衡 CLI**\n  - 为 `balance.cli` 添加了完整的 API 文档字符串，并新增了一个 CLI 教程笔记本。\n- **创建了平衡 CLI 教程**\n  - 增加了 CLI 命令回显功能、`load_data()` 示例，以及更丰富的诊断探索功能，包括指标\u002F变量列表和可浏览的诊断表格。https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002Fbalance_cli_tutorial\u002F\n- **同步文档示例与测试用例**\n  - 更新了面向用户的文档字符串，使文档中的示例与经过测试的输入和输出保持一致。\n\n## 代码质量与重构\n\n- **当“目标”样本量远大于“样本”样本量时发出警告**\n  - `Sample.adjust()` 现在会在目标样本超过 10 万行且至少是样本规模的 10 倍时发出警告，强调此时不确定性主要由样本本身决定（类似于单样本比较的情况）。\n- **将工具辅助函数拆分为专注的模块**\n  - 将 `balance.util` 拆分为 `balance.utils` 子模块，以方便导航。\n\n## 错误修复\n\n- **更新了 `Sample.__str__()`，使其按与 `Sample.summary()` 相同的方式格式化权重诊断信息**\n  - 权重诊断信息（设计效应、有效样本量比例、有效样本量）现在会分多行显示，而非全部挤在同一行并用逗号分隔。\n  - 将“eff.”缩写替换为完整的“effective”。","2026-01-20T10:51:11",{"id":193,"version":194,"summary_zh":195,"released_at":196},90053,"0.14.0","\r\n## 新特性\r\n\r\n- **增强的调整后样本摘要输出**\r\n  - `Sample.__str__()` 现在会在打印调整后的样本时显示调整详情（方法、截尾参数、设计效应、有效样本量）（[#194](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fpull\u002F194)、[#57](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\u002F57)）。\r\n- **更丰富的 `Sample.summary()` 诊断信息**\r\n  - 调整后的样本摘要现在会将协变量诊断信息分组显示，同时报告设计效应与 ESSP\u002FESS，并在可用时展示加权结果均值。\r\n- **`.adjust()` 中高基数分类特征的警告**\r\n  - 如果某个分类特征中 ≥80% 的取值都是唯一的，则会在权重拟合之前进行标记，以帮助识别类似用户 ID 
等问题列（[#195](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fpull\u002F195)、[#65](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\u002F65)）。\r\n- **对 Sample 输入中忽略列的处理**\r\n  - `Sample.from_frame` 接受 `ignore_columns` 参数，用于指定那些应保留在数据框中但需从协变量和结果统计中排除的列。被忽略的列会出现在 `Sample.df` 中，并可通过 `Sample.ignored_columns()` 方法获取。\r\n\r\n## 代码质量与重构\r\n\r\n- **整合诊断辅助函数**\r\n  - 添加了 `_concat_metric_val_var()` 辅助函数和 `balance.util._coerce_scalar`，用于稳健地构建诊断行以及将标量转换为浮点数。\r\n  - **破坏性变更：** `Sample.diagnostics()` 对于 IPW 现在始终会输出迭代\u002F截距摘要以及超参数设置。\r\n\r\n## Bug 修复\r\n\r\n- **对空权重输入的提前验证**\r\n  - `Sample.from_frame` 现在会在权重包含 `None`、`NaN` 或 `pd.NA` 值时抛出 `ValueError`，并提供受影响行的数量及预览。\r\n- **跨平台的百分位权重截尾**\r\n  - `trim_weights()` 现在通过百分位分位数计算阈值，并设置了明确的截断边界，以确保在不同 Python\u002FNumPy 版本之间行为一致。\r\n  - **破坏性变更：** 基于百分位的截断在典型范围内可能会移动大约一条观测值。\r\n- **IPW 诊断改进**\r\n  - 修复了 `multi_class` 报告问题，将标量型超参数归一化为浮点数，移除了已弃用的 `penalty` 参数警告，并去除了重复的指标条目，以保证在不同版本的 sklearn 中计数稳定。\r\n\r\n## 测试\r\n\r\n- **新增 Windows 和 macOS CI 测试支持**\r\n  - 扩展了 GitHub Actions，使其能够在 `ubuntu-latest`、`macos-latest` 和 `windows-latest` 上运行，支持 Python 3.9 至 3.14。\r\n  - 添加了 `tempfile_path()` 上下文管理器，用于跨平台临时文件处理，并通过 `conftest.py` 配置了 matplotlib 的 Agg 后端。\r\n\r\n## 贡献者\r\n\r\n@neuralsorcerer、@talgalili、@wesleytlee\r\n\r\n## **完整变更日志**\r\n\r\nhttps:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fcompare\u002F0.13.0...0.14.0","2025-12-14T09:31:56",{"id":198,"version":199,"summary_zh":200,"released_at":201},90054,"0.13.0","\n## 新特性\n\n- **倾向得分建模超越静态逻辑回归**\n  - `ipw()` 现在可以通过 `model` 参数接受任何 scikit-learn 分类器，\n    从而能够在保留所有现有裁剪和诊断功能的同时使用随机森林、梯度提升等模型。仅支持密集矩阵的估计器以及没有线性系数的模型也得到了完全支持。倾向得分概率已被稳定化，以避免数值问题。\n  - 允许通过传递一个已配置的 :class:`~sklearn.linear_model.LogisticRegression` 实例到 `model` 参数来自定义逻辑回归。此外，命令行界面现在接受 `--ipw_logistic_regression_kwargs` JSON 参数，以便直接为命令行工作流构建该估计器。\n- **协变量诊断**\n  - 添加了用于协变量比较（数值型和独热编码的分类变量）的 KL 散度计算，并通过 `BalanceDF.kld()` 暴露出来，同时支持链接样本聚合。\n- **加权方法**\n  - `rake()` 和 
`poststratify()` 现在会尊重 `weight_trimming_mean_ratio` 和 `weight_trimming_percentile` 参数，通过增强的 `trim_weights(..., target_sum_weights=...)` API 对权重进行裁剪和重新归一化，从而使文档中描述的参数按预期工作\n    ([#147](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fpull\u002F147))。\n\n## 文档\n\n- 添加了全面的后分层教程笔记本 (`balance_quickstart_poststratify.ipynb`)\n  ([#141](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fpull\u002F141),\n  [#142](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fpull\u002F142),\n  [#143](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fpull\u002F143))。\n- 扩展了 poststratify 的文档字符串，加入了清晰的示例和改进的统计方法说明\n  ([#141](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fpull\u002F141))。\n- 在 README 中添加了项目徽章，用于显示构建状态、Python 版本支持情况以及版本跟踪信息\n  ([#145](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fpull\u002F145))。\n- 添加了 IPW 快速入门教程，展示了默认逻辑回归以及自定义 scikit-learn 分类器的用法（`balance_quickstart.ipynb`）。\n- 缩短了导入包时的欢迎消息。\n\n## 代码质量与重构\n\n- **Raking 算法重构**\n  - 移除了对 `ipfn` 的依赖，替换为基于 NumPy 的向量化实现（`_run_ipf_numpy`）来进行迭代比例拟合，从而显著提升了性能并消除了外部依赖关系（[#135](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fpull\u002F135)）。\n  \n- **IPW 方法重构**\n  - 通过将重复的代码模式提取到可重用的帮助函数中，降低了循环复杂度（CCN）：`_compute_deviance()`、`_compute_proportion_deviance()`、`_convert_to_dense_array()`。\n  - 移除了手动的 ASMD 改善率计算，现在直接使用 `weighted_comparisons_stats.py` 中现有的 `compute_asmd_improvement()` 函数。\n\n- **类型安全性改进**","2025-12-02T16:49:39",{"id":203,"version":204,"summary_zh":205,"released_at":206},90055,"0.12.1","## 新特性\n- 导入包时添加了欢迎消息。\n\n> 欢迎来到 balance（版本 0.12.1）！\n> 一个用于平衡有偏数据样本的开源 Python 包。\n> \n> 📖 文档：https:\u002F\u002Fimport-balance.org\u002F\n> 🛠️ 获取帮助 \u002F 报告问题：https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fissues\u002F\n> 📄 引用：\n>     Sarig, T., Galili, T., & Eilat, R. 
(2023).\n>     balance — 一个用于平衡有偏数据样本的 Python 包。\n>     https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.06024\n> \n> 小贴士：您随时可以通过 balance.help() 访问这些信息。\n\n## 文档\n- 在文档网站上新增了“CHANGELOG”。https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002FCHANGELOG\u002F\n\n## 错误修复\n- 修复了所有教程中的 Plotly 图表。https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002F\n\n## 贡献者\n@talgalili，@wesleytlee\n\n**完整变更日志**：https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fcompare\u002F0.12.0...0.12.1","2025-11-03T09:30:07",{"id":208,"version":209,"summary_zh":210,"released_at":211},90056,"0.12.0","## 新特性\n- **支持 Python 3.13 和 3.14**\n    - 更新 `setup.py` 和 CI\u002FCD 集成，以包含 Python 3.13 和 3.14。\n    - 移除对 NumPy、Pandas、SciPy 和 scikit-learn 等依赖库在 Python 3.12 及以上版本中的最高版本限制。\n\n## 贡献者\n@talgalili, @wesleytlee\n\n**完整变更日志**: https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fcompare\u002F0.11.0...0.12.0","2025-10-15T08:02:57",{"id":213,"version":214,"summary_zh":215,"released_at":216},90057,"0.11.0","## 新特性\n- **Python 3.12 支持** - 在现有 Python 3.9、3.10 和 3.11 支持的基础上，全面支持 Python 3.12（并集成 CI\u002FCD）。\n    - **实现了 Python 版本特定的依赖约束** - 为 numpy、pandas、scipy 和 scikit-learn 添加了基于 Python 版本变化的条件性版本范围（例如：Python \u003C3.12 时 numpy>=1.21.0,\u003C2.0；Python >=3.12 时 numpy>=1.24.0,\u003C2.1）。\n    - **Pandas 兼容性改进** - 在频率表生成中，将 `value_counts(dropna=False)` 替换为 `groupby().size()`，以避免 FutureWarning。\n    - 修复了多种 pandas 废弃警告，并改进了 DataFrame 的处理逻辑。\n- **Raking 算法优化** - 将基于 DataFrame 的 Rake 权重计算完全重构为基于数组的 ipfn 算法，使用多维数组和 itertools 提升性能，并增强与最新 Python 版本的兼容性。**变量现在会自动按字母顺序排序**，以确保无论输入顺序如何，结果都保持一致。\n- **poststratify 方法增强** - 新增 `strict_matching` 参数（默认为 True），用于处理样本单元在目标数据中不存在的情况。当该参数设置为 False 时，会发出警告，并将未覆盖样本的权重设为 0。\n\n## Bug 修复\n- **类型注解** - 在整个代码库中增强了 Pyre 类型提示，特别是在工具函数中。\n- **Sample 类改进** - 修复了权重类型赋值问题（确保为 float64 类型），通过 `.infer_objects(copy=False)` 改进了 DataFrame 操作以提高与 pandas 的兼容性，并优化了权重设置逻辑。\n- **网站依赖项更新** - 更新了包括 Docusaurus 在内的多个网站相关依赖包。\n\n## 测试\n**全面测试重构**，包括：\n- 
**增强测试验证** - 在 docstring 中添加了对测试方法和预期行为的详细说明。\n- **提升测试覆盖率** - 测试现在涵盖了边缘情况，如 NaN 处理、不同数据类型以及错误条件。\n- **优化测试组织**（更加细粒度）——覆盖所有测试模块（test_stats_and_plots.py、test_balancedf.py、test_ipw.py、test_rake.py、test_cli.py、test_weighted_comparisons_plots.py、test_cbps.py、test_testutil.py、test_adjustment.py、test_util.py、test_sample.py）。\n- **更新 GitHub 工作流**，将 Python 3.12 纳入构建和测试矩阵。\n- **修复 261 个 “pandas 废弃” 警告！**\n- **添加类型注解** - 将 test_balancedf.py 转换为 pyre-strict 模式。\n\n## 文档\n- **GitHub 支持问题模板** - 添加了结构化模板，帮助用户更清晰地提出关于 balance 包使用的问题。\n\n## 贡献者\n@talgalili、@wesleytlee\n\n## **完整变更日志**\n\nhttps:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fcompare\u002F0.10.0...0.11.0","2025-09-24T08:18:15",{"id":218,"version":219,"summary_zh":220,"released_at":221},90058,"0.10.0","## 新闻\n- 在此版本中，我们将 `ipw` 迁移到使用 sklearn。这不仅支持更新的 Python 版本，还支持 Windows 操作系统！\n- 更新了 Python 和依赖包的兼容性。Balance 现在兼容 Python 3.11，但由于类型错误，不再兼容 Python 3.8。目前，由于 distutils 被移除，Balance 尚不兼容 Python 3.12。\n- 许可证从 GPL-v2 更改为 [MIT 许可证](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002FLICENSE)。\n\n## 新特性\n- 移除了对 glmnet 的依赖，`ipw` 方法现在使用 sklearn 实现。\n- `ipw` 方法出于计算效率考虑，改用带有 L2 正则化的逻辑回归，而非 L1 正则化。从 glmnet 切换到 sklearn，并采用 L2 正则化，会导致生成的权重与 Balance 的旧版本略有不同。\n- 不幸的是，基于 sklearn 的 `ipw` 方法通常比旧版本慢 2 到 5 倍。建议使用新的参数 `lambda_min`、`lambda_max` 和 `num_lambdas`，以更高效地搜索 `ipw` 正则化空间。\n\n## 错误修复\n- 修复了 flake8 的 E721 问题（详见：https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Factions\u002Fruns\u002F5704381365\u002Fjob\u002F15457952704）\n\n## 文档更新\n- 添加了在 ISA 2023 上所做的演示文稿链接。\n- 修正了一些错别字。\n\n## 完整变更日志\nhttps:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fcompare\u002F0.9.0...0.10.0\n\n## 贡献者\n@wesleytlee, @talgalili, @SarigT","2025-01-06T15:48:54",{"id":223,"version":224,"summary_zh":225,"released_at":226},90059,"0.9.0","## News\r\n- Remove support for python 3.11 due to new test failures. This will be the case until glmnet will be replaced by sklearn. 
Hopefully before the end of the year.\r\n\r\n## New Features\r\n- All plotly functions: add kwargs to pass arguments to update_layout in all plotly figures. This is useful to control the width and height of the plot, for example when saving a high-resolution image.\r\n- Add a `summary` method to `BalanceWeightsDF` (i.e.: `Sample.weights().summary()`) to easily get access to summary statistics of the survey weights. Also, it means that `Sample.diagnostics()` now uses this new summary method in its internal implementation.\r\n- `BalanceWeightsDF.plot` method now relies on the default `BalanceDF.plot` method. This means that instead of a static seaborn kde plot we'll get an interactive plotly version.\r\n\r\n## Bug Fixes\r\n- datasets\r\n    - Remove a no-op in `load_data` and accommodate deprecation of pandas syntax by using a list rather than a set when selecting df columns (thanks @ahakso for the PR).\r\n    - Make the outcome variable (`happiness`) display properly in the tutorials (so we can see the benefit of the weighting process). This included fixing the simulation code in the target.\r\n- Fix `Sample.outcomes().summary()` so it will output the ci columns without truncating them.\r\n\r\n## Documentation\r\n- Fix text based on updates from versions 0.7.0 and 0.8.0.\r\n    - https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fgeneral_framework\u002Fadjusting_sample_to_population\u002F\r\n- Fix tutorials to include the outcome in the target.\r\n\r\n## Contributors\r\n@talgalili, @SarigT, @ahakso","2023-05-22T08:26:24",{"id":228,"version":229,"summary_zh":230,"released_at":231},90060,"0.8.0","## New Features\r\n- Add `rake` method to .adjust (currently in beta, given that it doesn't handle marginal target as input).\r\n- Add a new function `prepare_marginal_dist_for_raking` - to take in a dict of marginal proportions and turn them into a pandas DataFrame. 
This can serve as an input target population for raking.\r\n\r\n## Misc\r\n- The `ipw` function now gets max_de=None as default (instead of 1.5). This version is faster, and the user can still choose a threshold as desired.\r\n- Adding hex stickers graphics files\r\n\r\n## Documentation\r\n- New section on [raking.](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fstatistical_methods\u002Frake\u002F)\r\n- New notebook (in the tutorial section):\r\n    - [**quickstart_rake**](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002Fquickstart_rake\u002F) - like the [**quickstart**](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002Fquickstart\u002F) tutorial, but shows how to use the rake (raking) algorithm and compares the results to IPW (logistic regression with LASSO).\r\n\r\n\r\n## Contributors\r\n@talgalili, @SarigT","2023-04-26T13:31:11",{"id":233,"version":234,"summary_zh":235,"released_at":236},90061,"0.7.0","## New Features\r\n- Add `plotly_plot_density` function: Plots interactive density plots of the given variables using kernel density estimation.\r\n- Modified `plotly_plot_dist` and `plot_dist` to also support 'kde' plots. Also, these are now the default options. This automatically percolates to `BalanceDF.plot()` methods.\r\n- `Sample.from_frame` can now guess that a column called \"weights\" is a weight column (instead of only guessing so if the column is called \"weight\").\r\n\r\n## Bug Fixes\r\n- Fix `rm_mutual_nas`: it now remembers the index of pandas.Series that were used as input. 
This fixed erroneous plots produced by seaborn functions which uses rm_mutual_nas.\r\n- Fix `plot_hist_kde` to work when dist_type = \"ecdf\"\r\n- Fix `plot_hist_kde` and `plot_bar` when having an input only with \"self\" and \"target\", by fixing `_return_sample_palette`.\r\n\r\n## Misc\r\n- All plotting functions moved internally to expect weight column to be called `weight`, instead of `weights`.\r\n- All adjust (ipw, cbps, poststratify, null) functions now export a dict with a key called `weight` instead of `weights`.\r\n\r\n\r\n## Contributors\r\n@talgalili, @SarigT","2023-04-10T07:45:00",{"id":238,"version":239,"summary_zh":240,"released_at":241},90062,"0.6.0","## New Features\r\n- Variance of the weighted mean\r\n    - Add the `var_of_weighted_mean` function (from balance.stats_and_plots.weighted_stats import var_of_weighted_mean):\r\n        Computes the variance of the weighted average (pi estimator for ratio-mean) of a list of values and their corresponding weights.\r\n        - Added the `var_of_mean` option to stat in the `descriptive_stats` function (based on `var_of_weighted_mean`)\r\n        - Added the `.var_of_mean()` method to BalanceDF.\r\n    - Add the `ci_of_weighted_mean` function (from balance.stats_and_plots.weighted_stats import ci_of_weighted_mean):\r\n        Computes the confidence intervals of the weighted mean using the (just added) variance of the weighted mean.\r\n        - Added the `ci_of_mean` option to stat in the `descriptive_stats` function (based on `ci_of_weighted_mean`). 
Also added kwargs support.\r\n        - Added the `.ci_of_mean()` method to BalanceDF.\r\n        - Added the `.mean_with_ci()` method to BalanceDF.\r\n        - Updated `.summary()` methods to include the output of `ci_of_mean`.\r\n- All bar plots now have an added ylim argument to control the limits of the y axis.\r\n    For example use: `plot_dist(dfs1, names=[\"self\", \"unadjusted\", \"target\"], ylim = (0,1))`\r\n    Or this: `s3_null.covars().plot(ylim = (0,1))`\r\n- Improve 'choose_variables' function to control the order of the returned variables\r\n    - The return type is now a list (and not a Tuple)\r\n    - The order of the returned list is based on the variables argument. If it is not supplied, it is based on the order of the column names in the DataFrames. The df_for_var_order arg controls which df to use.\r\n- Misc\r\n    - The `_prepare_input_model_matrix` and downstream functions (e.g.: `model_matrix`, `sample.outcomes().mean()`, etc) can now handle DataFrame with special characters in the column names, by replacing special characters with '_' (or '_i', if we end up with columns with duplicate names). It also handles cases in which the column names have duplicates (using the new `_make_df_column_names_unique` function).\r\n    - Improve choose_variables to control the order of the returned variables\r\n        - The return type is now a list (and not a Tuple)\r\n        - The order of the returned list is based on the variables argument. If it is not supplied, it is based on column names in the DataFrames. The df_for_var_order arg controls which df to use.\r\n\r\n## Contributors\r\n@talgalili, @SarigT","2023-04-05T12:28:29",{"id":243,"version":244,"summary_zh":245,"released_at":246},90063,"0.5.0","## New Features\r\n- The `datasets.load_data` function now also supports the input \"sim_data_cbps\", which loads the simulated data used in the CBPS R vs Python tutorial. 
It is also used in unit-testing to compare the CBPS weights produced from Python (i.e.: balance) with R (i.e.: the CBPS package). The tests show that the weights from the two implementations have a correlation (both Pearson and Spearman) of >0.98.\r\n- cli improvements:\r\n    - Add an option to set formula (as string) in the cli.\r\n\r\n## Documentation\r\n- New notebook (in the tutorial section):\r\n    - Comparing results of fitting CBPS between R's `CBPS` package and Python's `balance` package (using simulated data). [link](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002Fcomparing_cbps_in_r_vs_python_using_sim_data\u002F)\r\n\r\n## Contributors\r\n@stevemandala , @SarigT , @talgalili","2023-03-08T18:20:14",{"id":248,"version":249,"summary_zh":250,"released_at":251},90064,"0.4.0","## New Features\r\n- Added two new flags to the cli:\r\n    - `--standardize_types`: This gives cli users the ability to set the `standardize_types` parameter in Sample.from_frame\r\n        to True or False. To learn more about this parameter, see:\r\n        https:\u002F\u002Fimport-balance.org\u002Fapi_reference\u002Fhtml\u002Fbalance.sample_class.html#balance.sample_class.Sample.from_frame\r\n    - `--return_df_with_original_dtypes`: the Sample object now stores the dtypes of the original df that was read using Sample.from_frame. This can be used to restore the original dtypes of the file output from the cli. This is relevant in cases in which we want to convert back the dtypes of columns from how they are stored in Sample, to their original types (e.g.: if something was Int32 it would be turned into float32 in balance.Sample, and using the new flag will return that column, when using the cli, to be back in the Int32 type). This feature may not be robust to various edge cases, so use with caution.\r\n- In the logging:\r\n    - Added warnings about dtypes changes. 
E.g.: if using Sample.from_frame with a column that has Int32, it will be turned into float32 in the internal storage of sample. Now there will be a warning message indicating this change.\r\n    - Increase the default length of logger printing (from 500 to 2000)\r\n\r\n\r\n## Bug Fixes\r\n- Fix pandas warning: SettingWithCopyWarning in from_frame (and other places in sample_class.py)\r\n- sample.from_frame has a new argument `use_deepcopy` to decide if changes made to the df inside the sample object would also change the original df that was provided to the sample object. The default is now set to `True` since it's more likely that we'd like to keep the changes inside the sample object to the df contained in it, and not have them spill into the original df.\r\n\r\n\r\n## Contributors\r\n@SarigT , @talgalili","2023-02-08T19:09:27",{"id":253,"version":254,"summary_zh":255,"released_at":256},90065,"0.3.1","### Bug Fixes\r\n- Sample.from_frame now also converts int16 and int8 to float16 and float16. 
Thus helping to avoid `TypeError: Cannot interpret 'Int16Dtype()' as a data type` style errors.\r\n\r\n### Documentation\r\n- Added ISSUE_TEMPLATE\r\n\r\n### Contributors\r\n@stevemandala , @SarigT , @talgalili","2023-02-01T19:59:43",{"id":258,"version":259,"summary_zh":260,"released_at":261},90066,"0.3.0","### New Features\r\n- Added compatibility for Python 3.11 (by supporting SciPy 1.9.2) (props to @tomwagstaff-opml for flagging this issue).\r\n- Added the `session-info` package as a dependency.\r\n\r\n### Bug Fixes\r\n- Fixed pip install from source on Windows machines (props to @tomwagstaff-opml for the bug report).\r\n\r\n### Documentation\r\n- Added `session_info.show()` outputs to the end of the three tutorials (at: https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002F)\r\n- Misc updates to the README.\r\n\r\n### Contributors\r\n@stevemandala , @SarigT , @talgalili ","2023-01-31T07:58:38",{"id":263,"version":264,"summary_zh":265,"released_at":266},90067,"0.2.0","### New Features\r\n- cli improvements:\r\n    - Add an option to set weight_trimming_mean_ratio = None for no trimming.\r\n    - Add an option to set transformations to be None (i.e. no transformations).\r\n- Add an option to adapt the title in:\r\n    - stats_and_plots.weighted_comparison_plots.plot_bar\r\n    - stats_and_plots.weighted_comparison_plots.plot_hist_kde\r\n\r\n### Bug Fixes\r\n- Fix (and simplify) balanceDF.plot to organize the order of groups (now unadjusted\u002Fself is left, adjusted\u002Fself center, and target is on the right)\r\n- Fix plotly functions to use the red color for self when only compared to target (since in that case it is likely unadjusted): balance.stats_and_plots.weighted_comparisons_plots.plotly_plot_qq and balance.stats_and_plots.weighted_comparisons_plots.plotly_plot_bar\r\n- Fix seaborn_plot_dist: output None by default (instead of axis object). 
Added a return_Axes argument to control this behavior.\r\n- Fix some test_cbps tests that were failing due to non-exact matches (we made the test less sensitive)\r\n\r\n### Documentation\r\n- New blog section, with the post: [Bringing \"balance\" to your data\r\n](https:\u002F\u002Fimport-balance.org\u002Fblog\u002F2023\u002F01\u002F09\u002Fbringing-balance-to-your-data\u002F)\r\n- New tutorial:\r\n    - [**quickstart_cbps**](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002Fquickstart_cbps\u002F) - like the [**quickstart**](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002Fquickstart\u002F) tutorial, but shows how to use the CBPS algorithm and compares the results to IPW (logistic regression with LASSO).\r\n    - [**balance_transformations_and_formulas**](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Ftutorials\u002Fbalance_transformations_and_formulas\u002F) - This tutorial showcases ways in which transformations, formulas and penalty can be included in your pre-processing of the covariates\r\n    before adjusting for them.\r\n- API docs:\r\n    - New: highlighting on codeblocks\r\n    - a bunch of text fixes.\r\n- Update README.md\r\n    - logo\r\n    - with contributors\r\n    - typo fixes (props to @zbraiterman and @luca-martial).\r\n- Added section about \"Releasing a new version\" to CONTRIBUTING.md\r\n    - Available under [\"Docs\u002FContributing\"](https:\u002F\u002Fimport-balance.org\u002Fdocs\u002Fdocs\u002Fcontributing\u002F#releasing-a-new-version) section of website\r\n\r\n## Misc\r\n- Added automated Github Action package builds & deployment to PyPi on release.\r\n  - See [release.yml](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Fbalance\u002Fblob\u002Fmain\u002F.github\u002Fworkflows\u002Frelease.yml)\r\n\r\n### Contributors\r\n@stevemandala , @SarigT , @talgalili ","2023-01-19T20:22:33",{"id":268,"version":269,"summary_zh":270,"released_at":271},90068,"0.1.0","## News\r\n\r\nInitial beta release 
for the balance package. Visit [https:\u002F\u002Fimport-balance.org\u002F](https:\u002F\u002Fimport-balance.org\u002F) for more information","2022-11-21T13:35:01"]