[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-moj-analytical-services--splink":3,"tool-moj-analytical-services--splink":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 
解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74939,"2026-04-05T23:16:38",[19,13,20,18],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[20,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65628,"2026-04-05T10:10:46",[20,18,14],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":10,"last_commit_at":63,"category_tags":64,"status":22},3364,"keras","keras-team\u002Fkeras","Keras 
是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[20,14,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":71,"readme_en":72,"readme_zh":73,"quickstart_zh":74,"use_case_zh":75,"hero_image_url":76,"owner_login":77,"owner_name":78,"owner_avatar_url":79,"owner_bio":80,"owner_company":81,"owner_location":81,"owner_email":81,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":113,"forks":114,"last_commit_at":115,"license":116,"difficulty_score":10,"env_os":117,"env_gpu":117,"env_ram":117,"env_deps":118,"category_tags":128,"github_topics":129,"view_count":140,"oss_zip_url":81,"oss_zip_packed_at":81,"status":22,"created_at":141,"updated_at":142,"faqs":143,"releases":164},948,"moj-analytical-services\u002Fsplink","splink","Fast, accurate and scalable probabilistic data linkage with support for multiple SQL backends","Splink 是一个用于概率记录链接和实体解析的 Python 工具，能够快速、准确地对没有唯一标识符的数据集进行去重和记录匹配。它解决了在处理大规模数据时，如何高效识别和合并重复记录的难题，特别适合需要清理和整合多源数据的场景。\n\n这款工具非常适合数据科学家、研究人员以及数据工程师使用，尤其是那些处理人口普查、客户数据整合或学术研究中实体匹配问题的专业人士。Splink 的设计兼顾了易用性和性能，即使是没有深厚编程背景的用户也能通过其交互式可视化功能轻松理解和优化匹配结果。\n\nSplink 的技术亮点包括支持多种 SQL 后端（如 DuckDB、AWS Athena 和 Spark），能够在笔记本电脑上一分钟内完成百万级记录的匹配，同时也可扩展到处理亿级数据。它基于 Fellegi-Sunter 模型，并引入了词频调整和自定义模糊匹配逻辑，从而提升了匹配精度。此外，Splink 不需要训练数据即可完成无监督学习，降低了使用门槛。\n\n如果你的数据包含多个不高度相关的列（如姓名、出生日期、城市等），Splink 将表现尤为出色。不过，它并不适合仅基于单列“词袋”数据（如公司名称）进行匹配的场景。无论是政府机构、学术界还是私营企业，Splink 都能为复杂的数据链接任务提供强大支持。","Splink 是一个用于概率记录链接和实体解析的 Python 
工具，能够快速、准确地对没有唯一标识符的数据集进行去重和记录匹配。它解决了在处理大规模数据时，如何高效识别和合并重复记录的难题，特别适合需要清理和整合多源数据的场景。\n\n这款工具非常适合数据科学家、研究人员以及数据工程师使用，尤其是那些处理人口普查、客户数据整合或学术研究中实体匹配问题的专业人士。Splink 的设计兼顾了易用性和性能，即使是没有深厚编程背景的用户也能通过其交互式可视化功能轻松理解和优化匹配结果。\n\nSplink 的技术亮点包括支持多种 SQL 后端（如 DuckDB、AWS Athena 和 Spark），能够在笔记本电脑上一分钟内完成百万级记录的匹配，同时也可扩展到处理亿级数据。它基于 Fellegi-Sunter 模型，并引入了词频调整和自定义模糊匹配逻辑，从而提升了匹配精度。此外，Splink 不需要训练数据即可完成无监督学习，降低了使用门槛。\n\n如果你的数据包含多个不高度相关的列（如姓名、出生日期、城市等），Splink 将表现尤为出色。不过，它并不适合仅基于单列“词袋”数据（如公司名称）进行匹配的场景。无论是政府机构、学术界还是私营企业，Splink 都能为复杂的数据链接任务提供强大支持。","\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_readme_5cfd7565b9e1.png\" alt=\"Splink Logo\" height=\"150px\">\n\u003C\u002Fp>\n\n[![pypi](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fmoj-analytical-services\u002Fsplink?include_prereleases)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsplink\u002F#history)\n[![Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_readme_365ca464eebd.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fsplink)\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAPI-documentation-blue)](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002F)\n\n\n> [!IMPORTANT]\n> 🎉 Splink 4 has been released! 
Examples of new syntax are [here](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdemos\u002Fexamples\u002Fexamples_index.html) and a release announcement is [here](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fblog\u002F2024\u002F07\u002F24\u002Fsplink-400-released.html).\n\n\n# Fast, accurate and scalable data linkage and deduplication\n\nSplink is a Python package for probabilistic record linkage (entity resolution) that allows you to deduplicate and link records from datasets that lack unique identifiers.\n\nIt is used widely within government, academia and the private sector - see [use cases](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002F#use-cases).\n\n## Key Features\n\n⚡ **Speed:** Capable of linking a million records on a laptop in around a minute.\u003Cbr>\n🎯 **Accuracy:** Support for term frequency adjustments and user-defined fuzzy matching logic.\u003Cbr>\n🌐 **Scalability:** Execute linkage in Python (using DuckDB) or big-data backends like AWS Athena or Spark for 100+ million records.\u003Cbr>\n🎓 **Unsupervised Learning:** No training data is required for model training.\u003Cbr>\n📊 **Interactive Outputs:** A suite of interactive visualisations helps users understand their model and diagnose problems.\u003Cbr>\n\nSplink's linkage algorithm is based on Fellegi-Sunter's model of record linkage, with various customisations to improve accuracy.\n\n## What does Splink do?\n\nConsider the following records that lack a unique person identifier:\n\n![tables showing what splink does](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_readme_c4b6614973d1.png)\n\nSplink predicts which rows link together:\n\n![tables showing what splink does](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_readme_b32d318c98ab.png)\n\nand clusters these links to produce an estimated person ID:\n\n![tables showing what splink 
does](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_readme_d582cb6055eb.png)\n\n## What data does Splink work best with?\n\nSplink performs best with input data containing **multiple** columns that are **not highly correlated**. For instance, if the entity type is persons, you may have columns for full name, date of birth, and city. If the entity type is companies, you could have columns for name, turnover, sector, and telephone number.\n\nHigh correlation occurs when one column is highly predictable from another - for instance, city can be predicted from postcode. Correlation is particularly problematic if **all** of your input columns are highly correlated.\n\nSplink is not designed for linking a single column containing a 'bag of words' - for example, a table with a single 'company name' column and no other details.\n\n## Documentation\n\nThe homepage for the Splink documentation can be found [here](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002F), including a [tutorial](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdemos\u002Ftutorials\u002F00_Tutorial_Introduction.html) and [examples](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdemos\u002Fexamples\u002Fexamples_index.html) that can be run in the browser.\n\n\nThe specification of the Fellegi-Sunter statistical model behind `splink` is similar to that used in the R [fastLink package](https:\u002F\u002Fgithub.com\u002Fkosukeimai\u002FfastLink). Accompanying the fastLink package is an [academic paper](http:\u002F\u002Fimai.fas.harvard.edu\u002Fresearch\u002Ffiles\u002Flinkage.pdf) that describes this model. 
The [Splink documentation site](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Ftopic_guides\u002Ffellegi_sunter.html) and a [series of interactive articles](https:\u002F\u002Fwww.robinlinacre.com\u002Fprobabilistic_linkage\u002F) also explore the theory behind Splink.\n\nThe Office for National Statistics have written a [case study about using Splink](https:\u002F\u002Fgithub.com\u002FData-Linkage\u002FSplink-census-linkage\u002Fblob\u002Fmain\u002FSplinkCaseStudy.pdf) to link 2021 Census data to itself.\n\n## Installation\n\nSplink supports Python 3.9+. To obtain the latest released version of Splink, you can install from PyPI using pip:\n\n```sh\npip install splink\n```\n\nor, if you prefer, you can instead install Splink using conda:\n\n```sh\nconda install -c conda-forge splink\n```\n\n### Installing Splink for Specific Backends\n\n\nFor projects requiring specific backends, Splink offers optional installations for **Spark**, **Athena**, and **PostgreSQL**. These can be installed by appending the backend name in brackets to the pip install command:\n```sh\npip install 'splink[{backend}]'\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Ci>Click here for backend-specific installation commands\u003C\u002Fi>\u003C\u002Fsummary>\n\n#### Spark\n```sh\npip install 'splink[spark]'\n```\n\n#### Athena\n```sh\npip install 'splink[athena]'\n```\n\n#### PostgreSQL\n```sh\npip install 'splink[postgres]'\n```\n\u003C\u002Fdetails>\n\n## Quickstart\n\nThe following code demonstrates how to estimate the parameters of a deduplication model, use it to identify duplicate records, and then use clustering to generate an estimated unique person ID.\n\nFor a more detailed tutorial, see [here](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdemos\u002Ftutorials\u002F00_Tutorial_Introduction.html).\n\n```py\nimport splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndb_api = 
DuckDBAPI()\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ]\n)\n\nlinker = Linker(df, settings, db_api)\n\nlinker.training.estimate_probability_two_random_records_match(\n    [block_on(\"first_name\", \"surname\")],\n    recall=0.7,\n)\n\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"first_name\", \"surname\")\n)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(block_on(\"dob\"))\n\npairwise_predictions = linker.inference.predict(threshold_match_weight=-10)\n\nclusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    pairwise_predictions, 0.95\n)\n\ndf_clusters = clusters.as_pandas_dataframe(limit=5)\n```\n\n## Videos\n\n- [PyData Global 2024 talk](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=eQtFkI8f02U)\n- [An introductory presentation on Splink](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=msz3T741KQI)\n- [An introduction to the Splink Comparison Viewer dashboard](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=DNvCMqjipis)\n\n\n## Support\n\nTo find the best place to ask a question, report a bug or get general advice, please refer to our [Guide](.\u002FCONTRIBUTING.md).\n\n## Awards\n\n🥇 Civil Service Awards 2025: Innovation category - 
[Winner](https:\u002F\u002Fx.com\u002FCSWnews\u002Fstatus\u002F1998488787433979981)\n\n🥇 Civil Service Awards 2025: The Excellence In Delivery Award was [won](https:\u002F\u002Fwww.civilserviceawards.com\u002Fwinners-2025\u002F) by a dashboard powered by Splink.\n\n🥇 OpenUK Awards 2025: Open data category - [Winner](https:\u002F\u002Fopenuk.uk\u002Fawards\u002F)\n\n🥈 Civil Service Awards 2023: Best Use of Data, Science, and Technology - [Runner up](https:\u002F\u002Fwww.civilserviceawards.com\u002Fbest-use-of-data-science-and-technology-award-2\u002F)\n\n🥇 Analysis in Government Awards 2022: People's Choice Award - [Winner](https:\u002F\u002Fanalysisfunction.civilservice.gov.uk\u002Fnews\u002Fannouncing-the-winner-of-the-first-analysis-in-government-peoples-choice-award\u002F)\n\n🥈 Analysis in Government Awards 2022: Innovative Methods - [Runner up](https:\u002F\u002Ftwitter.com\u002Fgov_analysis\u002Fstatus\u002F1616073633692274689?s=20&t=6TQyNLJRjnhsfJy28Zd6UQ)\n\n🥇 Analysis in Government Awards 2020: Innovative Methods - [Winner](https:\u002F\u002Fwww.gov.uk\u002Fgovernment\u002Fnews\u002Flaunch-of-the-analysis-in-government-awards)\n\n🥇 MoJ Data and Analytical Services Directorate (DASD) Awards 2020: Innovation and Impact - Winner\n\n\n## Citation\n\nIf you use Splink in your research, please cite as follows:\n\n```BibTeX\n@article{Linacre_Lindsay_Manassis_Slade_Hepworth_2022,\n\ttitle        = {Splink: Free software for probabilistic record linkage at scale.},\n\tauthor       = {Linacre, Robin and Lindsay, Sam and Manassis, Theodore and Slade, Zoe and Hepworth, Tom and Kennedy, Ross and Bond, Andrew},\n\tyear         = 2022,\n\tmonth        = {Aug.},\n\tjournal      = {International Journal of Population Data Science},\n\tvolume       = 7,\n\tnumber       = 3,\n\tdoi          = {10.23889\u002Fijpds.v7i3.1794},\n\turl          = {https:\u002F\u002Fijpds.org\u002Farticle\u002Fview\u002F1794},\n}\n```\n\n## Acknowledgements\n\nWe are very grateful to [ADR 
UK](https:\u002F\u002Fwww.adruk.org\u002F) (Administrative Data Research UK) for providing the initial funding for this work as part of the [Data First](https:\u002F\u002Fwww.adruk.org\u002Four-work\u002Fbrowse-all-projects\u002Fdata-first-harnessing-the-potential-of-linked-administrative-data-for-the-justice-system-169\u002F) project.\n\nWe are extremely grateful to Professors Katie Harron, James Doidge and Peter Christen for their expert advice and guidance in the development of Splink. We are also very grateful to colleagues at the UK's Office for National Statistics for their expert advice and peer review of this work. Any errors remain our own.\n\n## Related Repositories\n\nWhile Splink is a standalone package, there are a number of repositories in the Splink ecosystem:\n\n\n- [splink_scalaudfs](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink_scalaudfs) contains the code to generate [User Defined Functions](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdev_guides\u002Fudfs.html#spark) in Scala which are then callable in Spark.\n- [splink_datasets](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink_datasets) contains datasets that can be installed automatically as part of Splink through the [In-built datasets](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdatasets.html) functionality.\n- [splink_synthetic_data](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink_synthetic_data) contains code to generate synthetic data.\n","\u003Cp align=\"center\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_readme_5cfd7565b9e1.png\" alt=\"Splink Logo\" 
height=\"150px\">\n\u003C\u002Fp>\n\n[![pypi](https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002Fmoj-analytical-services\u002Fsplink?include_prereleases)](https:\u002F\u002Fpypi.org\u002Fproject\u002Fsplink\u002F#history)\n[![Downloads](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_readme_365ca464eebd.png)](https:\u002F\u002Fpepy.tech\u002Fproject\u002Fsplink)\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FAPI-documentation-blue)](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002F)\n\n\n> [!IMPORTANT]\n> 🎉 Splink 4 已发布！新语法的示例可以在这里找到：[示例链接](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdemos\u002Fexamples\u002Fexamples_index.html)，发布公告请见：[公告链接](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fblog\u002F2024\u002F07\u002F24\u002Fsplink-400-released.html)。\n\n\n# 快速、准确且可扩展的数据链接与去重\n\nSplink 是一个用于概率记录链接（实体解析，Entity Resolution）的 Python 包，它允许您对缺乏唯一标识符的数据集进行去重和记录链接。\n\n它被广泛应用于政府、学术界和私营部门——更多用例请参阅[用例](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002F#use-cases)。\n\n## 主要特性\n\n⚡ **速度:** 能够在笔记本电脑上在一分钟左右完成百万条记录的链接。\u003Cbr>\n🎯 **准确性:** 支持词频调整和用户定义的模糊匹配逻辑。\u003Cbr>\n🌐 **可扩展性:** 可以在 Python 中执行链接（使用 DuckDB），也可以在 AWS Athena 或 Spark 等大数据后端处理超过一亿条记录。\u003Cbr>\n🎓 **无监督学习:** 模型训练不需要训练数据。\u003Cbr>\n📊 **交互式输出:** 提供一系列交互式可视化工具，帮助用户理解模型并诊断问题。\u003Cbr>\n\nSplink 的链接算法基于 Fellegi-Sunter 的记录链接模型，并进行了各种定制以提高准确性。\n\n## Splink 能做什么？\n\n考虑以下缺乏唯一人员标识符的记录：\n\n![展示 Splink 功能的表格](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_readme_c4b6614973d1.png)\n\nSplink 预测哪些行会链接在一起：\n\n![展示 Splink 功能的表格](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_readme_b32d318c98ab.png)\n\n并将这些链接聚类生成估计的人员 ID：\n\n![展示 Splink 功能的表格](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_readme_d582cb6055eb.png)\n\n## Splink 最适合处理什么样的数据？\n\nSplink 
在输入数据包含**多个****非高度相关**的列时表现最佳。例如，如果实体类型是人，您可以有全名、出生日期和城市等列。如果实体类型是公司，则可以有名称、营业额、行业和电话号码等列。\n\n当某一列可以从另一列高度预测时，就会发生高相关性——例如，城市可以从邮政编码预测出来。如果**所有**输入列都高度相关，那么相关性将特别成问题。\n\nSplink 不适用于链接包含“词袋”的单列数据。例如，只有一个“公司名称”列而没有其他详细信息的表。\n\n## 文档\n\nSplink 文档的主页可以在这里找到：[文档链接](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002F)，其中包括可以在浏览器中运行的[教程](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdemos\u002Ftutorials\u002F00_Tutorial_Introduction.html)和[示例](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdemos\u002Fexamples\u002Fexamples_index.html)。\n\n\n`Splink` 背后的 Fellegi Sunter 统计模型规范类似于 R 语言中的 [fastLink 包](https:\u002F\u002Fgithub.com\u002Fkosukeimai\u002FfastLink)所使用的模型。伴随 fastLink 包的是一篇[学术论文](http:\u002F\u002Fimai.fas.harvard.edu\u002Fresearch\u002Ffiles\u002Flinkage.pdf)，描述了该模型。[Splink 文档站点](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Ftopic_guides\u002Ffellegi_sunter.html) 和一系列[交互式文章](https:\u002F\u002Fwww.robinlinacre.com\u002Fprobabilistic_linkage\u002F)也探讨了 Splink 背后的理论。\n\n英国国家统计局撰写了一篇[关于使用 Splink 的案例研究](https:\u002F\u002Fgithub.com\u002FData-Linkage\u002FSplink-census-linkage\u002Fblob\u002Fmain\u002FSplinkCaseStudy.pdf)，用于将 2021 年人口普查数据与其自身进行链接。\n\n## 安装\n\nSplink 支持 Python 3.9+。要获取最新发布的 Splink 版本，您可以使用 pip 从 PyPI 安装：\n\n```sh\npip install splink\n```\n\n或者，如果您愿意，也可以使用 conda 安装 Splink：\n\n```sh\nconda install -c conda-forge splink\n```\n\n### 为特定后端安装 Splink\n\n\n对于需要特定后端的项目，Splink 提供了针对 **Spark**、**Athena** 和 **PostgreSQL** 的可选安装。这些可以通过在 pip 安装命令中添加后端名称来安装：\n```sh\npip install 'splink[{backend}]'\n```\n\n\u003Cdetails>\n\u003Csummary>\u003Ci>点击此处查看后端特定的安装命令\u003C\u002Fi>\u003C\u002Fsummary>\n\n#### Spark\n```sh\npip install 'splink[spark]'\n```\n\n#### Athena\n```sh\npip install 'splink[athena]'\n```\n\n#### PostgreSQL\n```sh\npip install 'splink[postgres]'\n```\n\u003C\u002Fdetails>\n\n## 快速入门\n\n以下代码演示了如何估算去重模型的参数，使用该模型识别重复记录，并通过聚类生成估计的唯一人员 
ID。\n\n如需更详细的教程，请参见[此处](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdemos\u002Ftutorials\u002F00_Tutorial_Introduction.html)。\n\n```py\nimport splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\ndb_api = DuckDBAPI()\n\ndf = splink_datasets.fake_1000\n\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ]\n)\n\nlinker = Linker(df, settings, db_api)\n\nlinker.training.estimate_probability_two_random_records_match(\n    [block_on(\"first_name\", \"surname\")],\n    recall=0.7,\n)\n\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"first_name\", \"surname\")\n)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(block_on(\"dob\"))\n\npairwise_predictions = linker.inference.predict(threshold_match_weight=-10)\n\nclusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    pairwise_predictions, 0.95\n)\n\ndf_clusters = clusters.as_pandas_dataframe(limit=5)\n```\n\n## 视频\n\n- [Pydata Global 2024 演讲](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=eQtFkI8f02U)\n- [Splink 简介演讲](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=msz3T741KQI)\n- [Splink 对比查看器仪表板简介](https:\u002F\u002Fwww.youtube.com\u002Fwatch?v=DNvCMqjipis)\n\n\n## 
支持\n\n要找到提问、报告错误或获取一般建议的最佳位置，请参阅我们的[指南](.\u002FCONTRIBUTING.md)。\n\n## 奖项\n\n🥇 2025 年公务员奖：创新类别 - [获奖者](https:\u002F\u002Fx.com\u002FCSWnews\u002Fstatus\u002F1998488787433979981)\n\n🥇 2025 年公务员奖：卓越交付奖 - [由 Splink 驱动的仪表板获奖](https:\u002F\u002Fwww.civilserviceawards.com\u002Fwinners-2025\u002F)\n\n🥇 2025 年 OpenUK 奖：开放数据类别 - [获奖者](https:\u002F\u002Fopenuk.uk\u002Fawards\u002F)\n\n🥈 2023 年公务员奖：最佳数据、科学和技术应用 - [亚军](https:\u002F\u002Fwww.civilserviceawards.com\u002Fbest-use-of-data-science-and-technology-award-2\u002F)\n\n🥇 2022 年政府分析奖：人民选择奖 - [获奖者](https:\u002F\u002Fanalysisfunction.civilservice.gov.uk\u002Fnews\u002Fannouncing-the-winner-of-the-first-analysis-in-government-peoples-choice-award\u002F)\n\n🥈 2022 年政府分析奖：创新方法 - [亚军](https:\u002F\u002Ftwitter.com\u002Fgov_analysis\u002Fstatus\u002F1616073633692274689?s=20&t=6TQyNLJRjnhsfJy28Zd6UQ)\n\n🥇 2020 年政府分析奖：创新方法 - [获奖者](https:\u002F\u002Fwww.gov.uk\u002Fgovernment\u002Fnews\u002Flaunch-of-the-analysis-in-government-awards)\n\n🥇 2020 年司法部数据与分析服务局 (DASD) 奖：创新与影响 - 获奖者\n\n\n## 引用\n\n如果您在研究中使用了 Splink，请按如下方式引用：\n\n```BibTeX\n@article{Linacre_Lindsay_Manassis_Slade_Hepworth_2022,\n\ttitle        = {Splink: Free software for probabilistic record linkage at scale.},\n\tauthor       = {Linacre, Robin and Lindsay, Sam and Manassis, Theodore and Slade, Zoe and Hepworth, Tom and Kennedy, Ross and Bond, Andrew},\n\tyear         = 2022,\n\tmonth        = {Aug.},\n\tjournal      = {International Journal of Population Data Science},\n\tvolume       = 7,\n\tnumber       = 3,\n\tdoi          = {10.23889\u002Fijpds.v7i3.1794},\n\turl          = {https:\u002F\u002Fijpds.org\u002Farticle\u002Fview\u002F1794},\n}\n```\n\n## 致谢\n\n我们非常感谢 [ADR UK](https:\u002F\u002Fwww.adruk.org\u002F)（英国行政数据分析研究中心）作为 [Data First](https:\u002F\u002Fwww.adruk.org\u002Four-work\u002Fbrowse-all-projects\u002Fdata-first-harnessing-the-potential-of-linked-administrative-data-for-the-justice-system-169\u002F) 项目的一部分为这项工作提供了初始资金。\n\n我们对 Katie Harron 教授、James Doidge 
教授和 Peter Christen 教授在 Splink 开发过程中提供的专家建议和指导表示衷心感谢。我们也非常感谢英国国家统计局的同事们对这项工作的专家建议和同行评审。任何错误均由我们自行负责。\n\n## 相关仓库\n\n尽管 Splink 是一个独立的软件包，但在 Splink 生态系统中还有一些其他仓库：\n\n\n- [splink_scalaudfs](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink_scalaudfs) 包含用于生成 [用户定义函数](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdev_guides\u002Fudfs.html#spark) 的 Scala 代码，这些函数随后可以在 Spark 中调用。\n- [splink_datasets](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink_datasets) 包含可以通过 [内置数据集](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdatasets.html) 功能自动安装为 Splink 一部分的数据集。\n- [splink_synthetic_data](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink_synthetic_data) 包含生成合成数据的代码。","# Splink 快速上手指南\n\nSplink 是一个用于概率记录链接（实体解析）的 Python 包，能够帮助您对缺乏唯一标识符的数据集进行去重和记录链接。\n\n---\n\n## 环境准备\n\n### 系统要求\n- Python 3.9 或更高版本\n- 支持的操作系统：Windows、macOS 和 Linux\n\n### 前置依赖\n- **DuckDB**（默认后端）\n- 可选后端支持：Spark、AWS Athena、PostgreSQL\n\n如果您在中国大陆，建议使用国内镜像源加速安装：\n```sh\npip install splink -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n---\n\n## 安装步骤\n\n### 基础安装\n运行以下命令安装最新版本的 Splink：\n```sh\npip install splink\n```\n\n或者使用 `conda` 安装：\n```sh\nconda install -c conda-forge splink\n```\n\n### 安装特定后端支持\n如果需要支持特定后端（如 Spark、Athena 或 PostgreSQL），可以使用以下命令安装扩展功能：\n```sh\npip install 'splink[{backend}]'\n```\n\n例如：\n- 安装 Spark 支持：\n  ```sh\n  pip install 'splink[spark]'\n  ```\n- 安装 Athena 支持：\n  ```sh\n  pip install 'splink[athena]'\n  ```\n- 安装 PostgreSQL 支持：\n  ```sh\n  pip install 'splink[postgres]'\n  ```\n\n---\n\n## 基本使用\n\n以下代码展示了如何使用 Splink 进行数据去重，并生成估计的唯一 ID：\n\n```py\nimport splink.comparison_library as cl\nfrom splink import DuckDBAPI, Linker, SettingsCreator, block_on, splink_datasets\n\n# 初始化数据库 API\ndb_api = DuckDBAPI()\n\n# 加载示例数据集\ndf = splink_datasets.fake_1000\n\n# 配置链接规则\nsettings = SettingsCreator(\n    link_type=\"dedupe_only\",\n    comparisons=[\n        
cl.JaroWinklerAtThresholds(\"first_name\", [0.9, 0.7]),\n        cl.JaroAtThresholds(\"surname\", [0.9, 0.7]),\n        cl.DateOfBirthComparison(\n            \"dob\",\n            input_is_string=True,\n            datetime_metrics=[\"year\", \"month\"],\n            datetime_thresholds=[1, 1],\n        ),\n        cl.ExactMatch(\"city\").configure(term_frequency_adjustments=True),\n        cl.EmailComparison(\"email\"),\n    ],\n    blocking_rules_to_generate_predictions=[\n        block_on(\"first_name\"),\n        block_on(\"surname\"),\n    ]\n)\n\n# 初始化 Linker\nlinker = Linker(df, settings, db_api)\n\n# 估计模型参数\nlinker.training.estimate_probability_two_random_records_match(\n    [block_on(\"first_name\", \"surname\")],\n    recall=0.7,\n)\n\nlinker.training.estimate_u_using_random_sampling(max_pairs=1e6)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(\n    block_on(\"first_name\", \"surname\")\n)\n\nlinker.training.estimate_parameters_using_expectation_maximisation(block_on(\"dob\"))\n\n# 生成预测结果\npairwise_predictions = linker.inference.predict(threshold_match_weight=-10)\n\n# 聚类生成唯一 ID\nclusters = linker.clustering.cluster_pairwise_predictions_at_threshold(\n    pairwise_predictions, 0.95\n)\n\n# 输出聚类结果\ndf_clusters = clusters.as_pandas_dataframe(limit=5)\nprint(df_clusters)\n```\n\n---\n\n## 更多资源\n\n- **官方文档**: [Splink 文档](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002F)\n- **教程**: [Splink 教程](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdemos\u002Ftutorials\u002F00_Tutorial_Introduction.html)\n- **示例**: [Splink 示例](https:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fdemos\u002Fexamples\u002Fexamples_index.html)\n\n通过以上步骤，您可以快速上手 Splink 并开始处理数据链接任务！","一家电商公司正在整合来自不同渠道的客户数据，希望通过合并重复记录来构建统一的客户画像。\n\n### 没有 splink 时\n- 数据团队需要手动编写复杂的 SQL 查询来比对不同表中的客户信息，效率极低且容易出错  \n- 面对数百万条客户记录，传统方法在笔记本电脑上运行可能需要数小时甚至数天才能完成  \n- 无法灵活处理模糊匹配问题，例如姓名的拼写错误或地址的部分缺失，导致匹配准确率较低  \n- 
缺乏直观的分析工具，难以评估匹配结果的质量和发现潜在问题  \n- 当数据规模扩大到千万级别时，现有方案完全无法应对，必须依赖昂贵的大数据平台  \n\n### 使用 splink 后\n- 数据团队只需几行 Python 代码即可快速配置和执行记录链接任务，大幅降低开发成本  \n- 借助 DuckDB 或 Spark 等后端支持，splink 能在一分钟内处理百万级数据，轻松应对大规模场景  \n- 内置的模糊匹配逻辑和词频调整功能显著提升了匹配准确性，即使是不完美的数据也能高效处理  \n- 提供交互式可视化界面，帮助团队直观理解模型表现并快速定位问题  \n- 支持多种 SQL 后端，既能在本地运行小规模任务，也能无缝扩展到云端处理超大规模数据  \n\n通过 splink，电商公司不仅节省了大量时间和资源，还显著提升了客户数据整合的效率和质量。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmoj-analytical-services_splink_c4b66149.png","moj-analytical-services","MoJ Analytical Services","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmoj-analytical-services_fe1dfa04.png","",null,"www.justice.gov.uk","https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services",[85,89,93,97,101,105,109],{"name":86,"color":87,"percentage":88},"Python","#3572A5",53.2,{"name":90,"color":91,"percentage":92},"JavaScript","#f1e05a",45.8,{"name":94,"color":95,"percentage":96},"Jinja","#a52a22",0.5,{"name":98,"color":99,"percentage":100},"Shell","#89e051",0.3,{"name":102,"color":103,"percentage":104},"CSS","#663399",0.2,{"name":106,"color":107,"percentage":108},"Dockerfile","#384d54",0.1,{"name":110,"color":111,"percentage":112},"HTML","#e34c26",0,2056,225,"2026-04-05T22:45:35","MIT","未说明",{"notes":119,"python":120,"dependencies":121},"支持多种后端（如 Spark、Athena、PostgreSQL），需根据具体后端安装额外依赖。建议使用 pip 或 conda 安装。","3.9+",[122,123,124,125,126,127],"duckdb","pandas","numpy","scipy","sqlalchemy","tqdm",[14,18],[130,131,132,133,134,135,136,137,138,122,139],"record-linkage","spark","em-algorithm","deduplication","deduplicate-data","entity-resolution","data-matching","fuzzy-matching","data-science","uk-gov-data-science",4,"2026-03-27T02:49:30.150509","2026-04-06T08:44:24.400957",[144,149,154,159],{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},4182,"为什么在 Databricks 环境中使用 Jaro-Winkler 相似度时返回的值与预期相反？","这是一个已修复的问题。请确保使用最新版本的 Splink，并测试是否问题仍然存在。根据维护者的回复，更新到 Spark 3.1.3 或更高版本可以解决此问题。此外，也可以尝试以下命令重新测试：`pip install 
splink==2.1.5`.","https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fissues\u002F951",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},4183,"Why can't the GraphFrames JAR file be found when running Splink code in Databricks?","Make sure the GraphFrames JAR is installed correctly and add a `setCheckpointDir()` call to your code. If the problem persists, upgrade Splink to version 2.1.5 (via `pip install splink==2.1.5`) and retest.","https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fissues\u002F266",{"id":155,"question_zh":156,"answer_zh":157,"source_url":158},4184,"Why does a Splink model appear to ignore the blocking rules?","This may be caused by the non-deterministic behaviour of the `F.monotonically_increasing_id()` function. It is recommended to save the data as a Parquet file immediately after generating unique IDs, then reload it and verify that the IDs are unique. Also make sure to cache or persist the data at appropriate points to avoid Spark long-lineage problems.","https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fissues\u002F227",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},4185,"How can blocking be done efficiently on list or array intersections in Splink?","This can currently be achieved with an expression along the lines of `list_unique(l.list_column) + list_unique(r.list_column) > list_unique(list_concat(l.list_column, r.list_column))`, but performance is slow. The development team is considering optimising this feature.","https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fissues\u002F1448",[165,170,175,180,185,190,195,200,205,210,215,220,225,230,235,240,245,250,255,260],{"id":166,"version":167,"summary_zh":168,"released_at":169},113287,"v5.0.0.dev3","## What's Changed\r\n* :dependabot: github-actions(deps): Bump github\u002Fcodeql-action from 4.32.4 to 4.32.6 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2980\r\n* :dependabot: uv(deps-dev): Bump mkdocs-material from 9.7.4 to 9.7.5 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2984\r\n* Depandas - blocking analysis & tests batch 2 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2985\r\n* :dependabot: github-actions(deps): Bump astral-sh\u002Fsetup-uv from 7.3.1 to 7.6.0 by @dependabot[bot] in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2986\r\n* Remove `TestHelper.convert frame()` by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2987\r\n* :dependabot: uv(deps-dev): Bump sqlglot from 29.0.1 to 30.0.2 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2989\r\n* Publish via Trusted Publishing by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2990\r\n* Merge 4 -> 5 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2996\r\n* Emit untrained warnings once only by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F3003\r\n* Join on term frequencies at predict() stage rather than precomputing by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F3010\r\n* json schema no longer exists by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F3012\r\n* Better formatting of emitted sql v2 by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F3014\r\n* Allow profiling of the SQL executed in Duckdb and Spark  pipelines by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F3021\r\n* Improve blocking counts sql by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F3016\r\n* Alternative version of improving u training. 
by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F3025\r\n* bump version for 5.0.0.dev3 release by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F3026\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv5.0.0.dev2...v5.0.0.dev3","2026-04-05T18:44:00",{"id":171,"version":172,"summary_zh":173,"released_at":174},113288,"v5.0.0.dev2","## What's Changed\r\n* Splink 5 - Remove implicit cache by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2847\r\n* Splink 5 - Explicit cache table mgt fns by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2848\r\n* Splink 5 - Remove salting by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2849\r\n* Splink 5 - Add chunking  by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2850\r\n* Splink5 - Bayes factors to match weights final by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2851\r\n* Drop athena by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2858\r\n* Minor fixes by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2859\r\n* Update changelog by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2860\r\n* remove code coverage by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2861\r\n* Merge - splink 4 to 5 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2862\r\n* Merge\u002Fmaster into splink5 2025 12 17 by @RobinL in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2865\r\n* Update tests for splinkdataframes_everywhere by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2866\r\n* Splink 5: Register SplinkDataFrames using db_api before passing to Splink functions  by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2863\r\n* Merge `master` to Splink 5 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2880\r\n* Drop pandas as a required dependancy - core working version by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2883\r\n* Optimise train u by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2870\r\n* Notebooks to jupytext by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2899\r\n* Makefile & simplifcation of dependency groups by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2908\r\n* Add register_blocked_pairs_for_predict function to table management by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2915\r\n* Merge splink 4 to 5 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2923\r\n* added filtered neighbours table to list of tables for Spark to persist by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2928\r\n* Merge\u002Fsplink 4 to 5 by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2932\r\n* To arrow and duckdb tables by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2916\r\n* Depandas estimate u by @ADBond in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2937\r\n* Merge\u002Fsplink 4 to 5 by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2939\r\n* SplinkChart by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2941\r\n* (Some) structured chart records by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2940\r\n* Fix dashboards to work with mw_ not bf by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2952\r\n* Simplify blocking by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2935\r\n* Match key and unique id quoting fixes by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2955\r\n* TF Adjustment chart without pandas + SplinkChart by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2956\r\n* Finalise API of computation of blocked pairs workflow by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2957\r\n* Revert simplify blocking and match key quoting by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2958\r\n* A bit less pandas in tests by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2969\r\n* Change `linker.misc.query_sql` to output `SplinkDataFrame` by default by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2970\r\n* :dependabot: uv(deps-dev): Bump sqlalchemy from 2.0.47 to 2.0.48 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2953\r\n* :dependabot: github-actions(deps): Bump astral-sh\u002Fsetup-uv from 7.3.0 to 7.3.1 by @dependabot[bot] in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2954\r\n* :dependabot: github-actions(deps): Bump actions\u002Fdependency-review-action from 4.8.3 to 4.9.0 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2964\r\n* :dependabot: github-actions(deps): Bump actions\u002Fupload-artifact from 6.0.0 to 7.0.0 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2951\r\n* :dependabot: uv(deps): Bump tornado from 6.5.4 to 6.5.5 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2968\r\n* :dependabot: uv(deps-dev): Bump mkdocs-material from 9.7.3 to 9.7.4 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2965\r\n* `ty` visibility by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2971\r","2026-03-13T12:41:41",{"id":176,"version":177,"summary_zh":178,"released_at":179},113289,"v4.0.16","## What's Changed\r\n* :dependabot: uv(deps-dev): bump pymdown-extensions from 10.20.1 to 10.21 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2942\r\n* :dependabot: github-actions(deps): bump actions\u002Fdependency-review-action from 4.8.2 to 4.8.3 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2947\r\n* Bump locked dependencies by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2948\r\n* :dependabot: github-actions(deps): bump github\u002Fcodeql-action from 4.31.10 to 4.32.4 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2946\r\n* :dependabot: github-actions(deps): bump lycheeverse\u002Flychee-action from 2.7.0 to 2.8.0 by @dependabot[bot] in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2943\r\n* Remove select star at start of pipeline for aliasing splink4 by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2962\r\n* Blocking select only blocking cols splink 4 by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2961\r\n* Optimise link only exploding blocking rules by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2963\r\n* ensure two dataset link only behaviour does not swap l and r by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2966\r\n* Splink 4 0 16 release by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2967\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.15...v4.0.16","2026-03-11T18:48:16",{"id":181,"version":182,"summary_zh":183,"released_at":184},113290,"v4.0.15","## What's Changed\r\n* Faster `two_dataset_link_only` joins when joining small table to large in duckdb by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2936\r\n* 4_0_15 release by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2938\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.14...v4.0.15","2026-02-17T21:59:00",{"id":186,"version":187,"summary_zh":188,"released_at":189},113291,"v4.0.13","## What's Changed\r\n* Bump lockfile versions by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2871\r\n* :octocat: GitHub Actions Updates by @jacobwoffenden in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2868\r\n* add data city use 
case by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2891\r\n* Make tests compatible with pandas 3 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2895\r\n* :dependabot: github-actions(deps): Bump github\u002Fcodeql-action from 4.31.9 to 4.31.10 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2889\r\n* :dependabot: uv(deps-dev): Bump sqlglot from 28.5.0 to 28.6.0 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2890\r\n* :dependabot: github-actions(deps): Bump astral-sh\u002Fsetup-uv from 7.1.6 to 7.2.0 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2884\r\n* :dependabot: github-actions(deps): Bump actions\u002Fcheckout from 6.0.1 to 6.0.2 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2896\r\n* :dependabot: uv(deps-dev): Bump sqlalchemy from 2.0.45 to 2.0.46 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2901\r\n* docs(blog): add post on productionising linkage pipelines by @ThomasHepworth in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2900\r\n* :dependabot: uv(deps): bump duckdb from 1.4.3 to 1.4.4 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2904\r\n* :dependabot: uv(deps-dev): bump pymdown-extensions from 10.20 to 10.20.1 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2905\r\n* Sqlglot bracket parsing by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2907\r\n* Bump lockfile versions by @ADBond in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2909\r\n* :dependabot: github-actions(deps): bump astral-sh\u002Fsetup-uv from 7.2.0 to 7.3.0 by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2914\r\n* Restict python version for docs group by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2919\r\n* Remove `awswrangler` from lockfile resolution by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2920\r\n* Fix Postgres clustering: add alias to UNION ALL subquery by @Mostafa-Armandi in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2894\r\n* Update changelog by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2921\r\n* Release - 4.0.13 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2922\r\n\r\n## New Contributors\r\n* @jacobwoffenden made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2868\r\n* @Mostafa-Armandi made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2894\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.12...v4.0.13","2026-02-12T12:48:42",{"id":191,"version":192,"summary_zh":193,"released_at":194},113292,"v4.0.12","## What's Changed\r\n* Drop python 3 8 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2819\r\n* Test python 3.14 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2820\r\n* Fix docs code blocks in admonitions by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2824\r\n* Fix 2821 
by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2826\r\n* fix extract major version, make it robust to dev variants by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2829\r\n* Always include match key by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2831\r\n* Remove unused methods by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2832\r\n* [MAINT] Lockfile dependencies upgrade by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2838\r\n* Fix docs build by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2840\r\n* Docs CI - runs when lockfile \u002F dependencies change by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2841\r\n* [MAINT] Isolate tests by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2843\r\n* openuk by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2846\r\n* Implement equality operator on InputColumn by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2833\r\n* Correct files we ship in build by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2844\r\n* Remove executable permission from module file by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2852\r\n* [MAINT] Unpin python 3.14 version from CI by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2853\r\n* [MAINT] Remove obsolete Spark custom jars by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2854\r\n* cs award by @RobinL 
in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2855\r\n* add cam by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2856\r\n* add bold award by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2864\r\n* Release v4.0.12 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2867\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.11...v4.0.12","2025-12-22T17:08:48",{"id":196,"version":197,"summary_zh":198,"released_at":199},113293,"v5.0.0.dev1","### Added\r\n\r\n- Support for chunking to allow processing of very large datasets in blocking and prediction [#2850](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2850)\r\n- New `table_management` functions to explicitly manage table caching [#2848](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2848)\r\n\r\n### Changed\r\n\r\n- Internal probabilistic calculations now use Match Weights (log-odds) instead of Bayes Factors to improve numerical stability [#2851](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2851)\r\n\r\n### Deprecated\r\n\r\n- `bayes_factor_column_prefix` setting is deprecated in favour of `match_weight_column_prefix` [#2851](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2851)\r\n\r\n### Removed\r\n\r\n- Dropped support for Amazon Athena [#2858](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2858)\r\n- Removed implicit caching mechanism and the `use_cache` parameter from database execution methods [#2847](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2847)\r\n- Removed `materialise_blocked_pairs` argument from `predict` (blocked pairs 
are now always materialised) [#2848](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2848)\r\n- Removed salting mechanism as it is no longer required for parallelisation in DuckDB [#2849](https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2849)","2025-12-17T16:28:36",{"id":201,"version":202,"summary_zh":203,"released_at":204},113294,"v4.0.12.dev1","Pre-release to ensure #2844 works correctly","2025-12-17T09:34:57",{"id":206,"version":207,"summary_zh":208,"released_at":209},113295,"v4.0.11","## What's Changed\r\n* Upgrade lockfile by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2811\r\n* Improve clustering performance by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2800\r\n* Simplify docs build by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2813\r\n* Add py.typed marker by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2815\r\n* Waterfall cache by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2816\r\n* Remove obsolete Spark 4 warning by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2818\r\n* Release 4 0 11 by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2817\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.10...v4.0.11","2025-11-12T09:44:47",{"id":211,"version":212,"summary_zh":213,"released_at":214},113296,"v4.0.10","## What's Changed\r\n* change NOT IN statement to NOT EXISTS statement by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2766\r\n* Poetry -> uv by @ADBond in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2775\r\n* Uncap sqlglot by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2785\r\n* Bump minimum sqlglot version by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2786\r\n* add laos by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2787\r\n* Remove duplicated mypy check by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2788\r\n* Fix more docs links by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2765\r\n* Improve debug mode by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2789\r\n* Fix issue 1889 by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2758\r\n* Fix count comparisons for exploding blocking rules by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2778\r\n* Faster test suite by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2792\r\n* Deprecate python 3.9 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2797\r\n* Remove jsonschema dependency by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2798\r\n* Update changelog by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2799\r\n* Simplify exclude preceding for exploding blocking rules by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2760\r\n* Spark 4 - allow + test current functionality by @ADBond in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2801\r\n* Spark 4 compatible UDFs by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2802\r\n* Bare-bones Splink install by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2804\r\n* Spark 4 - try parse date fix by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2805\r\n* Demos requirements to uv by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2806\r\n* Early-warning breaking dependency CI cron job by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2807\r\n* Release - v4.0.10 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2810\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.9...v4.0.10","2025-11-03T09:21:13",{"id":216,"version":217,"summary_zh":218,"released_at":219},113297,"v4.0.9","## What's Changed\r\n* 2710 comparison viewer performance by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2716\r\n* Apply pseudopeople tutorial feedback by @tylerdy in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2688\r\n* Tie breaking in cluster_using_single_best_links by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2691\r\n* update cookbook by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2722\r\n* add use case by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2719\r\n* typo and grammar fix by @daidoji in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2725\r\n* Miscellaneous type-hinting by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2729\r\n* Fixed another typo by @daidoji in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2727\r\n* Small typo and grammar fix by @daidoji in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2728\r\n* Update README.md to use comparison_libary rather than comparison_template_library by @leppekja in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2731\r\n* richmond by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2734\r\n* Gambia by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2735\r\n* Updating instances of linker.predict to linker.inference.predict in the docs by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2736\r\n* changing match_weight field to sort_avg_match_weight for the distribu… by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2743\r\n* Fix spelling by @probjects in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2755\r\n* westmorland by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2759\r\n* Fix broken docs links by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2764\r\n* Bump lycheeverse\u002Flychee-action from 1.8.0 to 2.0.2 in \u002F.github\u002Fworkflows by @dependabot[bot] in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2763\r\n* docs(fix): Remove broken link from Tutorial by @calebhadley1 in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2767\r\n* Sqlglot compatibility fix by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2780\r\n* Release - v4.0.9 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2782\r\n\r\n## New Contributors\r\n* @daidoji made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2725\r\n* @leppekja made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2731\r\n* @calebhadley1 made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2767\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.8...v4.0.9","2025-09-24T19:05:49",{"id":221,"version":222,"summary_zh":223,"released_at":224},113298,"v4.0.8","## What's Changed\r\n* Lockfile update by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2649\r\n* add dbt by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2658\r\n* Upgrade Vega to 5.31.0 by @hedsnz in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2627\r\n* use unique_id from settings by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2659\r\n* Bump lockfile versions by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2664\r\n* add princeton paper by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2669\r\n* Add link to pydata by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2675\r\n* Add PyData Global talk to md by @RobinL in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2676\r\n* Pydata typo by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2677\r\n* Jar udf package - updated dependencies by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2679\r\n* Reducing warnings by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2680\r\n* Pseudopeople Splink example linking Census and ACS datasets by @tylerdy in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2665\r\n* Move pseudopeople to no test by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2681\r\n* Add dashboards from pseudopeople example by @tylerdy in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2682\r\n* Realtime custom join by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2683\r\n* Consistent mypy version by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2695\r\n* Update 04_Estimating_model_parameters.ipynb by @w2o-hbrashear in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2692\r\n* Specify SQL cache key for realtime linking by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2693\r\n* add Welsh Revenue Authority use case by @rhyswilliams2 in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2696\r\n* [DOCS] Missing page + navbar alignment by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2699\r\n* Blocking rules dialected by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2702\r\n* bump lockfile versions by @ADBond in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2703\r\n* Release - v4.0.8 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2709\r\n\r\n## New Contributors\r\n* @hedsnz made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2627\r\n* @tylerdy made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2665\r\n* @rhyswilliams2 made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2696\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.7...v4.0.8","2025-06-04T13:49:10",{"id":226,"version":227,"summary_zh":228,"released_at":229},113299,"v4.0.7","## What's Changed\r\n* Add speed tests to docs by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2538\r\n* Llm prompt to docs by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2541\r\n* Fix typos in docs by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2542\r\n* improve llm prompt by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2543\r\n* Link to custom GPT by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2544\r\n* Test python 3.13 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2521\r\n* Fix reference to similarity_jar_location by @julijonas in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2547\r\n* Deprecation warning for python 3.8 by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2520\r\n* Add Spark support 
for PairwiseStringDistanceFunction by @zmbc in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2546\r\n* add knowledgebase by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2549\r\n* Add gn group by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2552\r\n* Modelling guide by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2553\r\n* Fix formatting of docs by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2554\r\n* Make igraph explicitly non-optional by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2551\r\n* Add rationale for training by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2555\r\n* added dod by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2556\r\n* Improve llm prompts by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2557\r\n* add dfe by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2558\r\n* add SAIL SERP usage by @medwar99 in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2559\r\n* [DOCS] Use block on rather than sql strings in 50k example by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2561\r\n* add ukhsa by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2567\r\n* fix typo by @RossKen in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2565\r\n* Remove unused binder files by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2572\r\n* `ColumnExpression` 
first\u002Flast index by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2585\r\n* ColumnExpression - `NULLIF` by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2586\r\n* Update index.md by @gidelpanta in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2590\r\n* Bug - Realtime cache collision by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2589\r\n* add modify settings exampel to cookbook by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2591\r\n* Fix spark database double-quoting by @julijonas in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2577\r\n* Add ArrayIntersect default by @RossKen in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2587\r\n* Add poetry configuration to conda script, bump versions by @zmbc in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2516\r\n* ICS use case of splink by @BenNBEIS in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2593\r\n* update use cases by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2596\r\n* add ontario by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2597\r\n* country flags by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2598\r\n* Fix duckdb 1.2.0 issue on cumulative_comparisons chart  by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2609\r\n* update spark performance for splink 4 by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2608\r\n* Update lockfile by @ADBond in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2602\r\n* Add udf example to cookbook by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2612\r\n* Remove insecure polyfill thats no longer needed by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2613\r\n* Duckdb as record dict no longer uses pandas by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2610\r\n* Add ons link by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2614\r\n* One to one clustering by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2578\r\n* add node centrality to graph metrics by @RossKen in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2618\r\n* add NHSE to use cases by @amaiaita in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2620\r\n* Fix typo - 'compelex' by @b-d-e in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2623\r\n* Fix missing word in docs - \"driver OF ...\" by @b-d-e in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2622\r\n* Docs splink_fundamentals\u002Fsettings : Make sure simple model SettingsCreator i","2025-03-05T08:56:37",{"id":231,"version":232,"summary_zh":233,"released_at":234},113300,"v4.0.6","## What's Changed\r\n* Explicit selection by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2484\r\n* Fix clustering in debug mode by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2485\r\n* Less caching in debug mode by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2488\r\n* Update changelog by @RobinL 
in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2497\r\n* remove unnecessary import by @lubrst in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2500\r\n* Spark test session handling by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2504\r\n* Fix count_comparisons_from_blocking_rule by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2503\r\n* Streamline docs by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2505\r\n* Test and fix debug mode by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2481\r\n* Improve compare two records by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2498\r\n* Bug - get columns of DuckDB frame even when table is empty by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2510\r\n* Update CONTRIBUTING.md with correct link by @zmbc in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2513\r\n* Constrain dev pandas version by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2518\r\n* Update lockfile + fixes for latest package versions by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2514\r\n* Avoid bug with checkpointing by switching to parquet by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2525\r\n* Clustering allows match weight args not just match probability by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2454\r\n* Explicit tf columns select by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2527\r\n* 
Make `Settings._columns_used_by_comparisons` unquoted by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2532\r\n* Pairwise string distance comparison by @zmbc in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2517\r\n* Bias blog 2 by @RossKen in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2408\r\n* 4.0.6 release by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2537\r\n\r\n## New Contributors\r\n* @lubrst made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2500\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.5...v4.0.6","2024-12-05T09:45:47",{"id":236,"version":237,"summary_zh":238,"released_at":239},113301,"v4.0.5","## What's Changed\r\n* add EMA use case by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2468\r\n* Change name of second __splink__cluster_count_row_numbered query, prevent table name conflict by @browo097302 in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2447\r\n* Add iteration number to `neighbours_filtered` table by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2470\r\n* Fix docs examples by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2471\r\n* Docs - correct heading and link text by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2472\r\n* Simplify Altair import by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2479\r\n* Specify version range for `pytest-cov` in CI by @ADBond in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2489\r\n* Compare two records - allow dataframes to be registered  by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2493\r\n* 4.0.5 release by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2495\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.4...v4.0.5","2024-11-06T14:37:42",{"id":241,"version":242,"summary_zh":243,"released_at":244},113302,"v4.0.4","## What's Changed\r\n* Handle threshold_match_probablity 0 in predict() #2420 by @browo097302 in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2425\r\n* Take converged clusters out of play by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2436\r\n* Fix clustering in linky jobs with source dataset column on Postgres by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2444\r\n* Cluster multiple thresholds v2 by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2437\r\n* Used .blocking_rule_sql property match_weights_interactive_history_chart() by @browo097302 in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2446\r\n* restore pretty print of SplinkDataFrame by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2450\r\n* 2440 add docstring to customrule by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2452\r\n* Cluster multiple add stats by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2453\r\n* Score missing intra-cluster edges by @ADBond in 
https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2442\r\n* Fix cluster studio docstring by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2455\r\n* Docs cleanup by @Thomas-Hirsch in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2460\r\n* Fix profile charts issue by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2466\r\n* 4.0.4 release by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2467\r\n\r\n## New Contributors\r\n* @browo097302 made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2425\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.3...v4.0.4","2024-10-13T19:32:13",{"id":246,"version":247,"summary_zh":248,"released_at":249},113303,"v4.0.3","## What's Changed\r\n* fix dead links by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2430\r\n* Cluster without linker by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2412\r\n* Better autocomplete for dataframes by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2434\r\n* v4.0.3 release by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2435\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.2...v4.0.3","2024-09-30T10:59:58",{"id":251,"version":252,"summary_zh":253,"released_at":254},113304,"v4.0.2","## What's Changed\r\n* Fix performance issue with exploding blocking rules by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2385\r\n* Add cookbook to examples 
by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2388\r\n* fix docs by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2389\r\n* Create llm prompt by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2366\r\n* 2351 fix spark sampling by @aymonwuolanne in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2390\r\n* Improve number formatting and descriptions on match weight charts by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2392\r\n* add labelling tool by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2393\r\n* Fix ColumnsReversedLevel by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2395\r\n* Add `is_in_level` and `compute_comparison_vector_value` testing functions to internals by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2396\r\n* Migrate tests of comparisons and comparison levels to new testing framework by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2397\r\n* Add AbsoluteDifferenceLevel by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2398\r\n* TimeDifference docstring by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2400\r\n* More levels docstrings by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2401\r\n* add dates docs by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2402\r\n* Better docstrings by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2404\r\n* Add cosine similiarity comparison 
level and comparison by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2405\r\n* add gov transformation mag link by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2406\r\n* Add cosine similarity tests and allow schemad data by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2407\r\n* Consistency in usage of sql_dialect, sql_dialect_str, sqlglot_dialect by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2391\r\n* ArraySubset comparison level by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2416\r\n* Interactive comparison notebook by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2417\r\n* 4.0.2 release by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2418\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.1...v4.0.2","2024-09-22T14:09:19",{"id":256,"version":257,"summary_zh":258,"released_at":259},113305,"v4.0.1","## What's Changed\r\n\r\n* Bias blog by @ericakane-moj in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2279\r\n* Fix bug in Postgres example by @fhightower in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2352\r\n* Added new use case to index.md by @AnthonyTacquet in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2363\r\n* Fixing issue with reaonly filesystems by @RossHammer in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2357\r\n* Update changelog by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2370\r\n* avoid attempting to cast `Infinity` to 
double for spark backend by @bkitej-rw in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2372\r\n* Fix Spark 'InfinityD' bug  by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2374\r\n* Support duckdbpyrelation as input type by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2375\r\n* Bump actions\u002Fdownload-artifact from 3 to 4.1.7 in \u002F.github\u002Fworkflows by @dependabot in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2377\r\n* Splink datasets - simplify + restructure by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2378\r\n* Fix docs reference for renamed class by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2380\r\n* Update upload-artifact version in docs CI by @ADBond in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2381\r\n* Allow a specific m and u probabilities to be fixed during training by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2379\r\n* Allow all charts to be generated as a dict by @RossHammer in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2361\r\n* Splink 401 release by @RobinL in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2386\r\n\r\n## New Contributors\r\n* @probjects made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2172\r\n* @DavidFrenchSG made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2204\r\n* @astimoore made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2229\r\n* @dkaufman-rc made their 
first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2240\r\n* @ericakane-moj made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2277\r\n* @bnm3k made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2342\r\n* @fhightower made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2352\r\n* @AnthonyTacquet made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2363\r\n* @RossHammer made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2357\r\n* @bkitej-rw made their first contribution in https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fpull\u002F2372\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002Fmoj-analytical-services\u002Fsplink\u002Fcompare\u002Fv4.0.0...v4.0.1","2024-09-05T17:24:48",{"id":261,"version":262,"summary_zh":263,"released_at":264},113306,"v4.0.0","See\r\nhttps:\u002F\u002Fmoj-analytical-services.github.io\u002Fsplink\u002Fblog\u002F2024\u002F07\u002F24\u002Fsplink-400-released.html\r\nfor release announcement","2024-07-24T09:20:18"]