[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-dedupeio--dedupe":3,"tool-dedupeio--dedupe":64},[4,23,32,40,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65628,1,"2026-04-05T10:10:46",[20,18,14],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":10,"last_commit_at":38,"category_tags":39,"status":22},3364,"keras","keras-team\u002Fkeras","Keras 是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 
的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[20,14,18],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":10,"last_commit_at":46,"category_tags":47,"status":22},2403,"crawl4ai","unclecode\u002Fcrawl4ai","Crawl4AI 是一款专为大语言模型（LLM）设计的开源网络爬虫与数据提取工具。它的核心使命是将纷繁复杂的网页内容转化为干净、结构化的 Markdown 格式，直接服务于检索增强生成（RAG）、智能体构建及各类数据管道，让 AI 能更轻松地“读懂”互联网。\n\n传统爬虫往往面临反爬机制拦截、动态内容加载困难以及输出格式杂乱等痛点，导致后续数据处理成本高昂。Crawl4AI 通过内置自动化的三级反机器人检测、代理升级策略以及对 Shadow DOM 的深度支持，有效突破了这些障碍。它能智能移除同意弹窗，处理深层链接，并具备长任务崩溃恢复能力，确保数据采集的稳定与高效。\n\n这款工具特别适合开发者、AI 研究人员及数据工程师使用。无论是需要为本地模型构建知识库，还是搭建大规模自动化信息采集流程，Crawl4AI 都提供了极高的可控性与灵活性。作为 GitHub 上备受瞩目的开源项目，它完全免费开放，无需繁琐的注册或昂贵的 API 费用，让用户能够专注于数据价值本身而非采集难题。",63242,"2026-04-02T22:29:19",[14,17],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":10,"last_commit_at":54,"category_tags":55,"status":22},193,"meilisearch","meilisearch\u002Fmeilisearch","Meilisearch 是一个开源的极速搜索服务，专为现代应用和网站打造，开箱即用。它能帮助开发者快速集成高质量的搜索功能，无需复杂的配置或额外的数据预处理。传统搜索方案往往需要大量调优才能实现准确结果，而 Meilisearch 内置了拼写容错、同义词识别、即时响应等实用特性，并支持 AI 驱动的混合搜索（结合关键词与语义理解），显著提升用户查找信息的体验。\n\nMeilisearch 特别适合 Web 开发者、产品团队或初创公司使用，尤其适用于需要快速上线搜索功能的场景，如电商网站、内容平台或 SaaS 应用。它提供简洁的 RESTful API 和多种语言 SDK，部署简单，资源占用低，本地开发或生产环境均可轻松运行。对于希望在不依赖大型云服务的前提下，为用户提供流畅、智能搜索体验的团队来说，Meilisearch 是一个高效且友好的选择。",56964,"2026-04-05T08:19:14",[13,17,14,20,16,18],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":22},223,"Made-With-ML","GokuMohandas\u002FMade-With-ML","Made-With-ML 是一个面向实战的开源项目，旨在帮助开发者系统掌握从设计、开发到部署和迭代生产级机器学习应用的完整流程。它解决了许多人在学习机器学习时“会训练模型但不会上线”的痛点，强调将软件工程最佳实践与 ML 
技术结合，构建可靠、可维护的端到端系统。\n\n该项目特别适合三类人群：一是希望将模型真正落地的开发者（包括软件工程师、数据科学家）；二是刚毕业、想补齐工业界所需技能的学生；三是需要理解技术边界以更好推动产品的技术管理者或产品经理。\n\nMade-With-ML 的亮点在于注重第一性原理讲解，避免盲目调包；同时覆盖 MLOps 关键环节（如实验跟踪、模型测试、服务部署、CI\u002FCD 等），并支持在 Python 生态内平滑扩展训练与推理任务，无需切换语言或复杂基础设施。课程内容结构清晰，配有详细代码示例和视频导览，兼顾理论深度与工程实用性。",47108,"2026-04-05T10:42:55",[19,18,14,16,20],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":79,"owner_website":81,"owner_url":82,"languages":83,"stars":96,"forks":97,"last_commit_at":98,"license":99,"difficulty_score":29,"env_os":100,"env_gpu":100,"env_ram":100,"env_deps":101,"category_tags":104,"github_topics":105,"view_count":114,"oss_zip_url":79,"oss_zip_packed_at":79,"status":22,"created_at":115,"updated_at":116,"faqs":117,"releases":148},1041,"dedupeio\u002Fdedupe","dedupe",":id: A python library for accurate and scalable fuzzy matching, record deduplication and entity-resolution.","dedupe 是一款基于 Python 的开源库，专注于结构化数据的模糊匹配、记录去重与实体解析。面对数据中常见的拼写错误、格式不统一或缺乏唯一标识符等难题，dedupe 能够高效识别并合并重复记录。例如，它可以清理姓名地址表格中的冗余项，在没有客户 ID 的情况下关联订单历史，或是从捐款数据库中识别出同一人的不同记录。\n\n与其他简单匹配方法不同，dedupe 的核心亮点在于引入机器学习技术。用户只需提供少量人工训练数据，dedupe 就能自动学习并生成最适合当前数据集的规则，即使在大规模数据库中也能快速找到相似记录。这种智能化的方式大幅提升了数据清洗的准确率与扩展性。\n\ndedupe 非常适合开发者、数据分析师及研究人员使用，尤其是那些需要处理脏数据或进行跨源数据整合的 Python 用户。除了核心库，社区还提供了命令行工具 csvdedupe 及云端服务 Dedupe.io，方便不同需求的用户灵活选择。通过简单的 pip 安装即可上手，结合丰富的文档与示例，能帮助用户轻松实现高质量的数据治理。","# Dedupe Python Library\n\n[![Tests 
Passing](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fworkflows\u002Ftests\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Factions?query=workflow%3Atests)[![codecov](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fdedupeio\u002Fdedupe\u002Fbranch\u002Fmain\u002Fgraph\u002Fbadge.svg?token=aauKUrTEgh)](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fdedupeio\u002Fdedupe)\n\n_dedupe is a python library that uses machine learning to perform fuzzy matching, deduplication and entity resolution quickly on structured data._\n\n__dedupe__ will help you: \n\n* __remove duplicate entries__ from a spreadsheet of names and addresses\n* __link a list__ with customer information to another with order history, even without unique customer IDs\n* take a database of campaign contributions and __figure out which ones were made by the same person__, even if the names were entered slightly differently for each record\n\ndedupe takes in human training data and comes up with the best rules for your dataset to quickly and automatically find similar records, even with very large databases.\n\n## Important links\n* Documentation: https:\u002F\u002Fdocs.dedupe.io\u002F\n* Repository: https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\n* Issues: https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fissues\n* Mailing list: https:\u002F\u002Fgroups.google.com\u002Fforum\u002F#!forum\u002Fopen-source-deduplication\n* Examples: https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe-examples\n\n## dedupe library consulting\n\nIf you or your organization would like professional assistance in working with the dedupe library, Dedupe.io LLC offers consulting services. [Read more about pricing and available services here](https:\u002F\u002Fdedupe.io\u002Fpricing\u002F#consulting).\n\n## Tools built with dedupe\n\n### [Dedupe.io](https:\u002F\u002Fdedupe.io\u002F)\nA cloud service powered by the dedupe library for de-duplicating and finding matches in your data. 
It provides a step-by-step wizard for uploading your data, setting up a model, training, clustering and reviewing the results.\n\n[Dedupe.io](https:\u002F\u002Fdedupe.io\u002F) also supports record linkage across data sources and continuous matching and training through an [API](https:\u002F\u002Fapidocs.dedupe.io\u002Fen\u002Flatest\u002F).\n\nFor more, see the [Dedupe.io product site](https:\u002F\u002Fdedupe.io\u002F), [tutorials on how to use it](https:\u002F\u002Fdedupe.io\u002Ftutorial\u002Fintro-to-dedupe-io.html), and [differences between it and the dedupe library](https:\u002F\u002Fdedupe.io\u002Fdocumentation\u002Fshould-i-use-dedupeio-or-the-dedupe-python-library.html).\n\nDedupe is well adopted by the Python community. Check out this [blog post](https:\u002F\u002Fmedium.com\u002Fdistrict-data-labs\u002Fbasics-of-entity-resolution-with-python-and-dedupe-bc87440b64d4),\na YouTube video on how to use [Dedupe with Python](https:\u002F\u002Fyoutu.be\u002FMcsTWXeURhA) and a YouTube video on how to apply [Dedupe at scale using Spark](https:\u002F\u002Fyoutu.be\u002Fq9HPUYmiwjE?t=2704).\n\n\n### [csvdedupe](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fcsvdedupe)\nCommand-line tool for de-duplicating and [linking](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fcsvdedupe#csvlink-usage) CSV files. Read about it on [Source Knight-Mozilla OpenNews](https:\u002F\u002Fsource.opennews.org\u002Fen-US\u002Farticles\u002Fintroducing-cvsdedupe\u002F).\n\n## Installation\n\n### Using dedupe\n\nIf you only want to use dedupe, install it this way:\n\n```bash\npip install dedupe\n```\n\nFamiliarize yourself with [dedupe's API](https:\u002F\u002Fdocs.dedupe.io\u002Fen\u002Flatest\u002FAPI-documentation.html), and get started on your project. Need inspiration? 
Have a look at [some examples](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe-examples).\n\n### Developing dedupe\n\nWe recommend using [virtualenv](http:\u002F\u002Fvirtualenv.readthedocs.org\u002Fen\u002Flatest\u002Fvirtualenv.html) and [virtualenvwrapper](http:\u002F\u002Fvirtualenvwrapper.readthedocs.org\u002Fen\u002Flatest\u002Finstall.html) for working in a virtualized development environment. [Read how to set up virtualenv](http:\u002F\u002Fdocs.python-guide.org\u002Fen\u002Flatest\u002Fdev\u002Fvirtualenvs\u002F).\n\nOnce you have virtualenvwrapper set up,\n\n```bash\nmkvirtualenv dedupe\ngit clone https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe.git\ncd dedupe\npip install -e . --config-settings editable_mode=compat\npip install -r requirements.txt\n```\n\nIf these tests pass, then everything should have been installed correctly!\n\n```bash\npytest\n```\n\nAfterwards, whenever you want to work on dedupe,\n\n```bash\nworkon dedupe\n```\n\n## Testing\nUnit tests of core dedupe functions\n```bash\npytest\n```\n\n#### Test using canonical dataset from Bilenko's research\n  \nUsing Deduplication\n```bash\npython -m pip install -e .\u002Fbenchmarks\npython benchmarks\u002Fbenchmarks\u002Fcanonical.py\n```\n\nUsing Record Linkage\n```bash\npython -m pip install -e .\u002Fbenchmarks\npython benchmarks\u002Fbenchmarks\u002Fcanonical_matching.py\n```\n\n\n## Team\n\n* Forest Gregg, DataMade\n* Derek Eder, DataMade\n\n## Credits\n\nDedupe is based on Mikhail Yuryevich Bilenko's Ph.D. 
dissertation: [*Learnable Similarity Functions and their Application to Record Linkage and Clustering*](http:\u002F\u002Fwww.cs.utexas.edu\u002F~ml\u002Fpapers\u002Fmarlin-dissertation-06.pdf).\n\n## Errors \u002F Bugs\n\nIf something is not behaving intuitively, it is a bug, and should be reported.\n[Report it here](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fissues)\n\n\n## Note on Patches\u002FPull Requests\n \n* Fork the project.\n* Make your feature addition or bug fix.\n* Send us a pull request. Bonus points for topic branches.\n\n## Copyright\n\nCopyright (c) 2022 Forest Gregg and Derek Eder. Released under the [MIT License](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fblob\u002Fmain\u002FLICENSE).\n\nThird-party copyright in this distribution is noted where applicable.\n\n## Citing Dedupe\nIf you use Dedupe in an academic work, please give this citation:\n\nForest Gregg and Derek Eder. 2022. Dedupe. https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe.\n","# Dedupe Python 库\n\n[![Tests Passing](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fworkflows\u002Ftests\u002Fbadge.svg)](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Factions?query=workflow%3Atests)[![codecov](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fdedupeio\u002Fdedupe\u002Fbranch\u002Fmain\u002Fgraph\u002Fbadge.svg?token=aauKUrTEgh)](https:\u002F\u002Fcodecov.io\u002Fgh\u002Fdedupeio\u002Fdedupe)\n\n_dedupe 是一个 Python 库（library），利用机器学习（machine learning）在结构化数据（structured data）上快速执行模糊匹配（fuzzy matching）、去重（deduplication）和实体解析（entity resolution）。_\n\n__dedupe__ 将帮助您：\n\n* 从包含姓名和地址的电子表格中__移除重复条目__\n* 将包含客户信息的列表与包含订单历史的另一个列表__链接__起来，即使没有唯一的客户 ID\n* 获取竞选捐款数据库，并__找出哪些是由同一个人捐款的__，即使每条记录的姓名输入略有不同\n\ndedupe 接受人工训练数据，并为您的数据集制定最佳规则，以便快速自动地查找相似记录，即使数据库非常大。\n\n## 重要链接\n* 文档：https:\u002F\u002Fdocs.dedupe.io\u002F\n* 仓库：https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\n* 问题：https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fissues\n* 
邮件列表：https:\u002F\u002Fgroups.google.com\u002Fforum\u002F#!forum\u002Fopen-source-deduplication\n* 示例：https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe-examples\n\n## dedupe 库咨询\n\n如果您或您的组织希望在使用 dedupe 库方面获得专业协助，Dedupe.io LLC 提供咨询服务。[在此处阅读更多关于定价和可用服务的信息](https:\u002F\u002Fdedupe.io\u002Fpricing\u002F#consulting)。\n\n## 基于 dedupe 构建的工具\n\n### [Dedupe.io](https:\u002F\u002Fdedupe.io\u002F)\n一项由 dedupe 库提供支持云服务，用于数据去重和查找匹配项。它提供了一步一步的向导，用于上传数据、设置模型、训练、聚类（clustering）和审查结果。\n\n[Dedupe.io](https:\u002F\u002Fdedupe.io\u002F) 还支持跨数据源的记录链接（record linkage），以及通过 [API](https:\u002F\u002Fapidocs.dedupe.io\u002Fen\u002Flatest\u002F) 进行持续匹配和训练。\n\n更多信息，请参阅 [Dedupe.io 产品网站](https:\u002F\u002Fdedupe.io\u002F)、[如何使用它的教程](https:\u002F\u002Fdedupe.io\u002Ftutorial\u002Fintro-to-dedupe-io.html)，以及 [它与 dedupe 库之间的区别](https:\u002F\u002Fdedupe.io\u002Fdocumentation\u002Fshould-i-use-dedupeio-or-the-dedupe-python-library.html)。\n\nDedupe 在 Python 社区中被广泛采用。查看这篇 [博客文章](https:\u002F\u002Fmedium.com\u002Fdistrict-data-labs\u002Fbasics-of-entity-resolution-with-python-and-dedupe-bc87440b64d4)，一个关于如何使用 [Python 版 Dedupe](https:\u002F\u002Fyoutu.be\u002FMcsTWXeURhA) 的 YouTube 视频，以及一个关于如何使用 [Spark 大规模应用 Dedupe](https:\u002F\u002Fyoutu.be\u002Fq9HPUYmiwjE?t=2704) 的 Youtube 视频。\n\n\n### [csvdedupe](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fcsvdedupe)\n用于去重和 [链接](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fcsvdedupe#csvlink-usage) CSV 文件的命令行工具（command line tool）。在 [Source Knight-Mozilla OpenNews](https:\u002F\u002Fsource.opennews.org\u002Fen-US\u002Farticles\u002Fintroducing-cvsdedupe\u002F) 上阅读相关内容。\n\n## 安装\n\n### 使用 dedupe\n\n如果您只想使用 dedupe，请通过以下方式安装：\n\n```bash\npip install dedupe\n```\n\n熟悉 [dedupe 的 API](https:\u002F\u002Fdocs.dedupe.io\u002Fen\u002Flatest\u002FAPI-documentation.html)，并开始您的项目。需要灵感？看看 [一些示例](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe-examples)。\n\n### 开发 dedupe\n\n我们建议使用 
[virtualenv](http:\u002F\u002Fvirtualenv.readthedocs.org\u002Fen\u002Flatest\u002Fvirtualenv.html)（虚拟环境）和 [virtualenvwrapper](http:\u002F\u002Fvirtualenvwrapper.readthedocs.org\u002Fen\u002Flatest\u002Finstall.html) 在虚拟化开发环境中工作。[阅读如何设置 virtualenv](http:\u002F\u002Fdocs.python-guide.org\u002Fen\u002Flatest\u002Fdev\u002Fvirtualenvs\u002F)。\n\n设置好 virtualenvwrapper 后，\n\n```bash\nmkvirtualenv dedupe\ngit clone https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe.git\ncd dedupe\npip install -e . --config-settings editable_mode=compat\npip install -r requirements.txt\n```\n\n如果这些测试通过，那么所有内容都应该已正确安装！\n\n```bash\npytest\n```\n\n之后，每当您想开发 dedupe 时，\n\n```bash\nworkon dedupe\n```\n\n## 测试\n核心 dedupe 函数的单元测试（unit tests）\n```bash\npytest\n```\n\n#### 使用 Bilenko 研究中的规范数据集进行测试\n  \n使用去重（Deduplication）\n```bash\npython -m pip install -e .\u002Fbenchmarks\npython benchmarks\u002Fbenchmarks\u002Fcanonical.py\n```\n\n使用记录链接（Record Linkage）\n```bash\npython -m pip install -e .\u002Fbenchmarks\npython benchmarks\u002Fbenchmarks\u002Fcanonical_matching.py\n```\n\n\n## 团队\n\n* Forest Gregg, DataMade\n* Derek Eder, DataMade\n\n## 致谢\n\nDedupe 基于 Mikhail Yuryevich Bilenko 的博士论文：[*Learnable Similarity Functions and their Application to Record Linkage and Clustering*](http:\u002F\u002Fwww.cs.utexas.edu\u002F~ml\u002Fpapers\u002Fmarlin-dissertation-06.pdf)。\n\n## 错误 \u002F Bugs\n\n如果某些行为不符合直觉，那就是一个 bug（缺陷），应该被报告。\n[在此处报告](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fissues)\n\n\n## 关于补丁\u002F拉取请求的说明\n \n* Fork（项目分支）项目。\n* 进行功能添加或 bug 修复。\n* 向我们发送 pull request（拉取请求）。如果是主题分支（topic branches）会有加分。\n\n## 版权\n\nCopyright (c) 2022 Forest Gregg 和 Derek Eder。根据 [MIT License](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fblob\u002Fmain\u002FLICENSE) 发布。\n\n此分发中的第三方版权在适用处注明。\n\n## 引用 Dedupe\n如果您在学术作品中使用 Dedupe，请引用以下内容：\n\nForest Gregg and Derek Eder. 2022. Dedupe. 
https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe.","# Dedupe 快速上手指南\n\n## 简介\nDedupe 是一个 Python 库，利用机器学习在结构化数据上快速执行**模糊匹配**、**去重**和**实体解析**。它适用于处理姓名地址去重、跨表关联客户信息、识别相同贡献者等场景。\n\n## 环境准备\n- **系统要求**：支持 Python 的环境（Windows\u002FLinux\u002FmacOS）。\n- **前置依赖**：\n  - Python 3.x\n  - pip 包管理工具\n  - 推荐使用虚拟环境工具（如 `virtualenv` 或 `virtualenvwrapper`）以隔离依赖。\n\n## 安装步骤\n如果您仅需使用 Dedupe 库，请通过 pip 直接安装：\n\n```bash\npip install dedupe\n```\n\n> **提示**：国内开发者若遇到网络安装问题，可配置 pip 使用国内镜像源（如清华源、阿里源）进行加速。\n\n若需参与开发或修改源码，建议使用虚拟环境安装：\n\n```bash\nmkvirtualenv dedupe\ngit clone https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe.git\ncd dedupe\npip install -e . --config-settings editable_mode=compat\npip install -r requirements.txt\n```\n\n## 基本使用\nDedupe 的核心工作流程包括：定义字段、准备数据、人工训练模型、自动匹配。以下是一个最小化的初始化示例：\n\n1. **导入库并定义字段**\n   告诉 Dedupe 哪些数据字段需要参与匹配。\n\n```python\nimport dedupe\n\n# 定义需要匹配的字段及其类型\nfields = [\n    {'field': 'Site name', 'type': 'String'},\n    {'field': 'Address', 'type': 'String'},\n    {'field': 'City', 'type': 'String'},\n]\n\n# 初始化 Dedupe 对象\ndeduper = dedupe.Dedupe(fields)\n```\n\n2. 
**训练与匹配**\n   库需要少量人工标记的训练数据来学习规则，之后即可自动处理大规模数据。具体的训练流程和 API 用法较为丰富，建议参考官方示例项目。\n\n- **官方 API 文档**：https:\u002F\u002Fdocs.dedupe.io\u002Fen\u002Flatest\u002FAPI-documentation.html\n- **代码示例仓库**：https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe-examples\n- **CSV 命令行工具**：如需直接处理 CSV 文件，可参考 [csvdedupe](https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fcsvdedupe) 工具。","某零售企业的数据分析师急需合并线上商城与线下门店的会员数据，以构建统一用户视图，但双方系统缺乏统一 ID，且姓名和电话录入存在大量拼写错误与格式差异。\n\n### 没有 dedupe 时\n- 人工核对十万级 Excel 表格耗时数周，且疲劳导致漏判率极高。\n- 传统规则无法识别“张三”与“张 三”、\"13800138000\"与\"138-0013-8000\"为同一人。\n- 重复记录导致促销短信重复发送，不仅浪费预算还引发用户投诉。\n- 无法准确计算真实活跃用户数，导致管理层对市场转化率判断失误。\n\n### 使用 dedupe 后\n- dedupe 利用机器学习主动学习，仅需少量人工标记即可自动处理十万级数据。\n- 模糊匹配算法精准识别姓名空格、电话格式差异及常见拼写错误，召回率显著提升。\n- 自动去除重复项并生成唯一实体 ID，确保营销触达唯一性，大幅节省成本。\n- 成功打通多渠道数据孤岛，输出干净的标准数据集，支撑精准用户画像分析。\n\ndedupe 通过智能实体解析技术，将混乱的多源数据转化为干净统一的资产，极大降低了数据清洗门槛并提升了治理效率。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fdedupeio_dedupe_be666cc4.png","dedupeio","Dedupe.io","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fdedupeio_a768293d.png","De-duplicate and find matches in your Excel spreadsheet or database",null,"dedupe@datamade.us","https:\u002F\u002Fdedupe.io\u002F","https:\u002F\u002Fgithub.com\u002Fdedupeio",[84,88,92],{"name":85,"color":86,"percentage":87},"Python","#3572A5",99.1,{"name":89,"color":90,"percentage":91},"Cython","#fedf5b",0.8,{"name":93,"color":94,"percentage":95},"Shell","#89e051",0.1,4451,571,"2026-04-05T02:08:00","MIT","未说明",{"notes":102,"python":100,"dependencies":103},"开发环境建议使用 virtualenv 和 virtualenvwrapper 管理虚拟环境。测试使用 pytest 框架。另有配套命令行工具 csvdedupe 可用于 CSV 文件的去重和链接。核心算法基于 Mikhail Yuryevich Bilenko 的博士论文研究。安装开发版需使用 pip install -e . 
--config-settings editable_mode=compat。",[100],[14],[67,106,107,108,109,110,111,112,113],"record-linkage","python","python-library","entity-resolution","dedupe-library","de-duplicating","datamade","clustering",4,"2026-03-27T02:49:30.150509","2026-04-06T05:17:22.737752",[118,123,128,133,138,143],{"id":119,"question_zh":120,"answer_zh":121,"source_url":122},4648,"遇到 \"AttributeError: module 'dedupe' has no attribute 'Dedupe'\" 错误如何解决？","这是 Python 3 下的常见错误，主要原因及解决方法：1. 检查本地脚本文件名是否命名为 `dedupe.py`，这会引发模块冲突，请重命名文件；2. 可能存在包冲突，尝试卸载 `csvdedupe` 后重新安装 `dedupe`（命令：`pip install dedupe`）；3. Python 版本兼容性问题，建议在 Python 3.6 环境下运行，用户在 3.5 或 3.7 版本中曾遇到安装报错。","https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fissues\u002F623",{"id":124,"question_zh":125,"answer_zh":126,"source_url":127},4649,"处理大规模数据（如 2000 万条记录）时出现 MemoryError 怎么办？","错误通常发生在 `scoreDuplicates` 方法的映射和归约步骤。建议方案：1. 修改 `fillQueue` 中的 chunk size 增长逻辑，设置上限或保持常数，防止指数增长导致内存溢出；2. 避免将所有记录加载到 numpy 数组中，尝试分块写入 memmap；3. 考虑使用 Python 的 `shelve` 模块进行 blocking 以减少内存占用（需权衡速度 tradeoff）。","https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fissues\u002F529",{"id":129,"question_zh":130,"answer_zh":131,"source_url":132},4650,"Dedupe 能处理多大的数据表（例如 100GB 或数百万行）？","对于超大数据集（如 100GB），直接运行可能不可行。若脏数据本身约 300 万行且内存不足，建议通过 SQL 懒加载（lazy load）数据，而不是将所有数据加载到字典中。可以在 `_blockData` 代码中传递 messy_data 的懒加载接口，并通过 SQL 获取对应行。一般性扩展性问题建议参考 Stack Overflow 相关讨论。","https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fissues\u002F390",{"id":134,"question_zh":135,"answer_zh":136,"source_url":137},4651,"如何实现流式匹配（即新记录与现有数据库的 N+1 检测）？","维护者建议的实现逻辑：1. 对新记录进行 blocking；2. 基于 blocking 结果找到该记录可能加入的所有现有集群；3. 
尝试将新记录与每个集群聚类，查看“集群分数”决定是否合并。如果分数未达阈值，可创建新的单例集群。此方法比重新聚类所有记录更高效。","https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fissues\u002F195",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},4652,"Dedupe 是否支持分布式环境或 Apache Spark？","目前官方未正式支持分布式或 Spark 集成。曾有用户提议利用 blocking 分配分区在 Spark 上运行，但该议题已关闭。如有需求，建议开启新议题讨论或自行探索基于 blocking 的分区方案。","https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fissues\u002F427",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},4653,"安装 Levenshtein_search 或在 OS X 上编译遇到问题怎么办？","1. pip 搜索 \"Levenshtein_search\" 可能无结果，但直接运行 `pip install Levenshtein_search` 通常可行；2. 在 OS X 上可能会遇到编译问题，但其他功能可能正常。如果 pip 搜索不到包，请尝试直接使用安装命令而非搜索。","https:\u002F\u002Fgithub.com\u002Fdedupeio\u002Fdedupe\u002Fissues\u002F436",[149,152,155,158,161,164,167,170,173,176,179,182,185,188,191,194,197,200,203,206],{"id":150,"version":151,"summary_zh":79,"released_at":79},104129,"wide_windows_build",{"id":153,"version":154,"summary_zh":79,"released_at":79},104130,"vpython34x64",{"id":156,"version":157,"summary_zh":79,"released_at":79},104131,"v3.0.3",{"id":159,"version":160,"summary_zh":79,"released_at":79},104132,"v3.0.2",{"id":162,"version":163,"summary_zh":79,"released_at":79},104133,"v3.0.1",{"id":165,"version":166,"summary_zh":79,"released_at":79},104134,"v3.0.0a1",{"id":168,"version":169,"summary_zh":79,"released_at":79},104135,"v2.0.24",{"id":171,"version":172,"summary_zh":79,"released_at":79},104136,"v2.0.24.redux",{"id":174,"version":175,"summary_zh":79,"released_at":79},104137,"v2.0.23",{"id":177,"version":178,"summary_zh":79,"released_at":79},104138,"v2.0.22",{"id":180,"version":181,"summary_zh":79,"released_at":79},104139,"v2.0.21",{"id":183,"version":184,"summary_zh":79,"released_at":79},104140,"v2.0.21.py.typed",{"id":186,"version":187,"summary_zh":79,"released_at":79},104141,"v2.0.20",{"id":189,"version":190,"summary_zh":79,"released_at":79},104142,"v2.0.19",{"id":192,"version":193,"summary_zh":79,"released_at":79},104143,"v2.0.18",{
"id":195,"version":196,"summary_zh":79,"released_at":79},104144,"v2.0.17",{"id":198,"version":199,"summary_zh":79,"released_at":79},104145,"v2.0.16",{"id":201,"version":202,"summary_zh":79,"released_at":79},104146,"v2.0.15",{"id":204,"version":205,"summary_zh":79,"released_at":79},104147,"v2.0.14",{"id":207,"version":208,"summary_zh":79,"released_at":79},104148,"v2.0.13"]