[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-jamesturk--scrapeghost":3,"tool-jamesturk--scrapeghost":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":82,"owner_twitter":79,"owner_website":83,"owner_url":84,"languages":85,"stars":94,"forks":95,"last_commit_at":96,"license":97,"difficulty_score":23,"env_os":98,"env_gpu":99,"env_ram":99,"env_deps":100,"category_tags":107,"github_topics":108,"view_count":112,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":113,"updated_at":114,"faqs":115,"releases":116},434,"jamesturk\u002Fscrapeghost","scrapeghost","👻 Experimental library for scraping websites using OpenAI's GPT API.","scrapeghost 是一个曾用于实验性网页数据提取的 Python 库，它借助 OpenAI 的 GPT 模型（如 GPT-4）理解网页内容并按用户定义的结构返回所需信息。传统网页爬虫依赖固定的规则或解析逻辑，面对结构多变的网站往往难以维护；而 scrapeghost 利用大语言模型的理解能力，让用户只需描述想要的数据格式，即可自动从页面中“读懂”并抽取内容。\n\n它适合熟悉 Python 的开发者或研究人员探索基于大模型的智能爬取方案，尤其适用于目标网站结构复杂、频繁变动或难以用常规选择器精准定位的场景。scrapeghost 提供多项实用功能：支持用 Pydantic 定义数据结构、自动清理冗余 HTML、通过 CSS\u002FXPath 预筛选内容、对模型输出进行 JSON 与 schema 校验，并包含防幻觉检查以确保提取结果真实存在于页面中。此外，它还内置成本控制机制，可追踪 token 消耗、设置预算上限，并支持在 GPT-3.5 和 GPT-4 之间自动降级以节省费用。\n\n需要注意的是，该项目已于 2023 年停止维护，作者不再推荐使用，且调用 GPT-4 成本较高（单次中等页面约 0.36 美元），使用时需谨慎评估需求与开销。","scrapeghost 是一个曾用于实验性网页数据提取的 Python 库，它借助 OpenAI 的 GPT 模型（如 GPT-4）理解网页内容并按用户定义的结构返回所需信息。传统网页爬虫依赖固定的规则或解析逻辑，面对结构多变的网站往往难以维护；而 scrapeghost 
利用大语言模型的理解能力，让用户只需描述想要的数据格式，即可自动从页面中“读懂”并抽取内容。\n\n它适合熟悉 Python 的开发者或研究人员探索基于大模型的智能爬取方案，尤其适用于目标网站结构复杂、频繁变动或难以用常规选择器精准定位的场景。scrapeghost 提供多项实用功能：支持用 Pydantic 定义数据结构、自动清理冗余 HTML、通过 CSS\u002FXPath 预筛选内容、对模型输出进行 JSON 与 schema 校验，并包含防幻觉检查以确保提取结果真实存在于页面中。此外，它还内置成本控制机制，可追踪 token 消耗、设置预算上限，并支持在 GPT-3.5 和 GPT-4 之间自动降级以节省费用。\n\n需要注意的是，该项目已于 2023 年停止维护，作者不再推荐使用，且调用 GPT-4 成本较高（单次中等页面约 0.36 美元），使用时需谨慎评估需求与开销。","# scrapeghost\n\n**This project from 2023 is no longer maintained or recommended, the author has no interest in working with commercial LLMs.**\n\n\n`scrapeghost` **was** an experimental library for scraping websites using OpenAI's GPT.\n\nSource: [https:\u002F\u002Fcodeberg.org\u002Fjpt\u002Fscrapeghost](https:\u002F\u002Fcodeberg.org\u002Fjpt\u002Fscrapeghost)\n\nDocumentation: [https:\u002F\u002Fjamesturk.github.io\u002Fscrapeghost\u002F](https:\u002F\u002Fjamesturk.github.io\u002Fscrapeghost\u002F)\n\nIssues: [https:\u002F\u002Fcodeberg.org\u002Fjpt\u002Fscrapeghost\u002Fissues](https:\u002F\u002Fcodeberg.org\u002Fjpt\u002Fscrapeghost\u002Fissues)\n\n[![PyPI badge](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fscrapeghost.svg)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fscrapeghost)\n\n**Use at your own risk. This library makes considerably expensive calls ($0.36 for a GPT-4 call on a moderately sized page.) 
Cost estimates are based on the [OpenAI pricing page](https:\u002F\u002Fbeta.openai.com\u002Fpricing) and not guaranteed to be accurate.**\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fjamesturk_scrapeghost_readme_03643a1a273a.png)\n\n## Features\n\nThe purpose of this library is to provide a convenient interface for exploring web scraping with GPT.\n\nWhile the bulk of the work is done by the GPT model, `scrapeghost` provides a number of features to make it easier to use.\n\n**Python-based schema definition** - Define the shape of the data you want to extract as any Python object, with as much or little detail as you want.\n\n**Preprocessing**\n\n* **HTML cleaning** - Remove unnecessary HTML to reduce the size and cost of API requests.\n* **CSS and XPath selectors** - Pre-filter HTML by writing a single CSS or XPath selector.\n* **Auto-splitting** - Optionally split the HTML into multiple calls to the model, allowing for larger pages to be scraped.\n\n**Postprocessing**\n\n* **JSON validation** - Ensure that the response is valid JSON.  (With the option to kick it back to GPT for fixes if it's not.)\n* **Schema validation** - Go a step further, use a [`pydantic`](https:\u002F\u002Fpydantic-docs.helpmanual.io\u002F) schema to validate the response.\n* **Hallucination check** - Does the data in the response truly exist on the page?\n\n**Cost Controls**\n\n* Scrapers keep running totals of how many tokens have been sent and received, so costs can be tracked.\n* Support for automatic fallbacks (e.g. 
use cost-saving GPT-3.5-Turbo by default, fall back to GPT-4 if needed.)\n* Allows setting a budget and stops the scraper if the budget is exceeded.\n","# scrapeghost\n\n**该项目始于 2023 年，目前已不再维护或推荐使用，作者无意继续与商业大语言模型（LLM）合作。**\n\n`scrapeghost` **曾是**一个实验性库，用于通过 OpenAI 的 GPT 模型抓取网站内容。\n\n源码地址：[https:\u002F\u002Fcodeberg.org\u002Fjpt\u002Fscrapeghost](https:\u002F\u002Fcodeberg.org\u002Fjpt\u002Fscrapeghost)\n\n文档地址：[https:\u002F\u002Fjamesturk.github.io\u002Fscrapeghost\u002F](https:\u002F\u002Fjamesturk.github.io\u002Fscrapeghost\u002F)\n\n问题反馈：[https:\u002F\u002Fcodeberg.org\u002Fjpt\u002Fscrapeghost\u002Fissues](https:\u002F\u002Fcodeberg.org\u002Fjpt\u002Fscrapeghost\u002Fissues)\n\n[![PyPI badge](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fscrapeghost.svg)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Fscrapeghost)\n\n**使用风险自负。该库会发起成本较高的 API 调用（例如，在中等大小的网页上调用一次 GPT-4 约需 $0.36）。费用估算基于 [OpenAI 定价页面](https:\u002F\u002Fbeta.openai.com\u002Fpricing)，但不保证完全准确。**\n\n![](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fjamesturk_scrapeghost_readme_03643a1a273a.png)\n\n## 功能特性\n\n本库旨在为探索使用 GPT 进行网页抓取（web scraping）提供便捷的接口。\n\n虽然主要工作由 GPT 模型完成，但 `scrapeghost` 提供了多项功能以简化使用流程。\n\n**基于 Python 的数据结构定义** - 使用任意 Python 对象定义你希望提取的数据结构，可详可略。\n\n**预处理（Preprocessing）**\n\n* **HTML 清洗** - 移除不必要的 HTML 内容，以减少 API 请求的大小和成本。\n* **CSS 和 XPath 选择器** - 通过编写单个 CSS 或 XPath 选择器对 HTML 进行预过滤。\n* **自动分块（Auto-splitting）** - 可选地将 HTML 分割成多个请求发送给模型，从而支持抓取更大的页面。\n\n**后处理（Postprocessing）**\n\n* **JSON 验证** - 确保返回结果为有效的 JSON。（若无效，可选择将其重新提交给 GPT 进行修正。）\n* **结构验证（Schema validation）** - 更进一步，使用 [`pydantic`](https:\u002F\u002Fpydantic-docs.helpmanual.io\u002F) 结构对响应进行验证。\n* **幻觉检查（Hallucination check）** - 检查响应中的数据是否确实存在于原始页面上。\n\n**成本控制（Cost Controls）**\n\n* 抓取器会持续记录已发送和接收的 token 数量，便于追踪成本。\n* 支持自动降级策略（例如，默认使用节省成本的 GPT-3.5-Turbo，必要时再回退到 GPT-4）。\n* 允许设置预算，一旦超出预算即停止抓取。","# scrapeghost 快速上手指南\n\n> ⚠️ 注意：该项目自 2023 年起已停止维护，作者不再推荐使用，且不适用于商业 LLM 场景。使用前请评估风险与成本（例如 GPT-4 单次调用可能花费约 $0.36）。\n\n## 环境准备\n\n- 
**操作系统**：支持 Python 的任意系统（Linux \u002F macOS \u002F Windows）\n- **Python 版本**：建议 Python 3.8 或更高版本\n- **依赖服务**：\n  - 有效的 [OpenAI API Key](https:\u002F\u002Fplatform.openai.com\u002Fapi-keys)\n  - 网络可访问 OpenAI API（国内用户需配置代理或使用合规渠道）\n\n## 安装步骤\n\n推荐使用 `pip` 安装（可搭配国内镜像源加速）：\n\n```bash\npip install scrapeghost -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n如需使用 Pydantic 进行结构校验，请确保已安装：\n\n```bash\npip install pydantic\n```\n\n## 基本使用\n\n以下是最简单的使用示例：通过 `SchemaScraper` 从网页中提取标题和链接列表（该库的核心接口是 `SchemaScraper` 类，而非顶层 `scrape` 函数）。\n\n```python\nfrom scrapeghost import SchemaScraper\n\n# 定义期望的数据结构并构建抓取器（需设置 OPENAI_API_KEY 环境变量）\nscraper = SchemaScraper(\n    schema={\n        \"title\": \"string\",\n        \"links\": [\"url\"]\n    },\n    models=[\"gpt-3.5-turbo\"]  # 推荐优先使用低成本模型\n)\n\n# 抓取器实例可直接作为函数调用，返回结果对象\nresult = scraper(\"https:\u002F\u002Fexample.com\")\n\nprint(result.data)\n```\n\n> 💡 提示：实际使用时请通过环境变量 `OPENAI_API_KEY` 设置密钥，避免硬编码。  \n> 示例：`export OPENAI_API_KEY='your-api-key'`（Linux\u002FmacOS）或 `set OPENAI_API_KEY=your-api-key`（Windows）","一家本地旅游创业公司需要从多个景区官网抓取最新门票价格、开放时间和游客须知，用于构建实时比价平台。\n\n### 没有 scrapeghost 时\n- 每个景区页面结构不同，需为每个站点单独编写 XPath 或 CSS 选择器，开发耗时且维护成本高。\n- 页面包含大量广告和动态脚本，原始 HTML 体积大，若直接交给大模型处理，API 调用费用不可控。\n- 提取结果常因格式不一致（如日期写成“周一至周日”或“7:00–18:00”）导致后续解析失败。\n- 无法自动验证提取内容是否真实存在于页面中，容易引入模型幻觉数据。\n- 缺乏统一的成本监控机制，团队难以预估或限制单次爬取的开销。\n\n### 使用 scrapeghost 后\n- 通过定义 Pydantic 数据模型（如 TicketInfo 包含 price、hours、notes 字段），只需声明期望的数据结构，无需手动解析 HTML。\n- 内置 HTML 清洗和 CSS 预过滤功能，自动剔除无关内容，显著减少输入 token 数量，降低调用成本。\n- 自动进行 JSON 和 schema 校验，无效响应可触发重试或修正，确保输出格式统一可靠。\n- 启用“幻觉检查”功能，比对模型输出与原始文本，过滤虚构信息，提升数据可信度。\n- 实时追踪 token 消耗并设置预算上限（如每次爬取不超过 $1），避免意外超额支出。\n\nscrapeghost 将复杂的网页结构理解任务转化为声明式数据提取，大幅降低非结构化网页采集的开发门槛与试错成本。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fjamesturk_scrapeghost_f203891e.png","jamesturk","jpt","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fjamesturk_f268b6b8.png",null,"University of Chicago","Chicago, 
IL","dev@jpt.sh","https:\u002F\u002Fjpt.sh","https:\u002F\u002Fgithub.com\u002Fjamesturk",[86,90],{"name":87,"color":88,"percentage":89},"Python","#3572A5",97.8,{"name":91,"color":92,"percentage":93},"Just","#384d54",2.2,1442,88,"2026-03-21T21:18:39","NOASSERTION","Linux, macOS, Windows","未说明",{"notes":101,"python":99,"dependencies":102},"该项目已不再维护，使用需自行承担风险；依赖 OpenAI API（如 GPT-3.5-Turbo 或 GPT-4），会产生调用费用；支持通过 CSS\u002FXPath 预过滤 HTML、自动分块处理大页面、JSON 和 Pydantic 模式验证及幻觉检查。",[103,104,105,106],"openai","pydantic","beautifulsoup4","lxml",[26,51,53,15],[109,110,111],"gpt","webscraping","openai-api",4,"2026-03-27T02:49:30.150509","2026-04-06T06:51:53.687949",[],[117,122,127,132,137,142,147,152,156,160],{"id":118,"version":119,"summary_zh":120,"released_at":121},101161,"0.6.0","# Changelog\n\n## 0.6.0\n\n* move to supporting Python 3.11 and 3.12\n* move to `openai` 1.0\n* move to `pydantic` 2.0\n* add support for November 2023 model upgrades\n\n## 0.5.1 - 2023-06-13\n\n* Improve type annotations and remove some ignored errors.\n* Support for new OpenAI models announced June 13th 2023.\n* Improved support for model fallbacks.  
Now if a request has 6k tokens and the model list looks like `['gpt-3.5-turbo', 'gpt-3.5-turbo-16k']`, the 16k model will be used automatically since the default 4k model will not be able to handle the request.\n\n## 0.5.0 - 2023-06-06\n\n* Restore `PaginatedSchemaScraper` and add documentation for pagination.\n* Documentation improvements.\n* Small quality-of-life improvements such as better `pydantic` schema support and\n  more useful error messages.\n\n## 0.4.4 - 2023-03-31\n\n* Deactivate `HallucinationCheck` by default, it is overly aggressive and needs more work to be useful without raising false positives.\n* Bugfix for postprocessors parameter behavior not overriding defaults.\n\n## 0.4.2 - 2023-03-26\n\n* Fix type bug with JSON nudging.\n* Improve `HallucinationCheck` to handle more cases.\n* More tests!\n\n## 0.4.1 - 2023-03-24\n\n* Fix bug with HallucinationCheck.\n\n## 0.4.0 - 2023-03-24\n\n* New configurable pre- and post-processing pipelines for customizing behavior.\n* Addition of `ScrapeResult` object to hold results of scraping along with metadata.\n* Support for `pydantic` models as schemas and for validation.\n* \"Hallucination\" check to ensure that the data in the response truly exists on the page.\n* Use post-processing pipeline to \"nudge\" JSON errors to a better result.\n* Now fully type-annotated.\n* Another big refactor, separation of API calls and scraping logic.\n* Finally, a ghost logo reminiscent of library's [namesake](https:\u002F\u002Fstatic.wikia.nocookie.net\u002Fsuperheroes\u002Fimages\u002F4\u002F49\u002FSpace_Ghost.jpg\u002Frevision\u002Flatest\u002Fscale-to-width-down\u002F1000?cb=20140111031255).\n\n## 0.3.0 - 2023-03-20\n\n* Add tests, docs, and complete examples!\n* Add preprocessors to `SchemaScraper` to allow for uniform interface for cleaning & selecting HTML.\n* Use `tiktoken` for accurate token counts.\n* New `cost_estimate` utility function.\n* Cost is now tracked on a per-scraper basis (see the `total_cost` 
attribute on `SchemaScraper` objects).\n* `SchemaScraper` now takes a `max_cost` parameter to limit the total cost of a scraper.\n* Prompt improvements, list mode simplification.\n\n## 0.2.0 - 2023-03-18\n\n* Add list mode, auto-splitting, and pagination support.\n* Improve `xpath` and `css` handling.\n* Improve prompt for GPT 3.5.\n* Make it possible to alter parameters when calling scrape.\n* Logging & error handling.\n* Command line interface.\n* See blog post for details: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-part-2\u002F>\n\n## 0.1.0 - 2023-03-17\n\n* Initial experiment, see blog post for more: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-4\u002F>\n","2023-11-25T01:10:57",{"id":123,"version":124,"summary_zh":125,"released_at":126},101162,"0.5.1","# Changelog\n\n## 0.5.1 - 2023-06-13\n\n* Improve type annotations and remove some ignored errors.\n* Support for new OpenAI models announced June 13th 2023.\n* Improved support for model fallbacks.  
Now if a request has 6k tokens and the model list looks like `['gpt-3.5-turbo', 'gpt-3.5-turbo-16k']`, the 16k model will be used automatically since the default 4k model will not be able to handle the request.\n\n## 0.5.0 - 2023-06-06\n\n* Restore `PaginatedSchemaScraper` and add documentation for pagination.\n* Documentation improvements.\n* Small quality-of-life improvements such as better `pydantic` schema support and\n  more useful error messages.\n\n## 0.4.4 - 2023-03-31\n\n* Deactivate `HallucinationCheck` by default, it is overly aggressive and needs more work to be useful without raising false positives.\n* Bugfix for postprocessors parameter behavior not overriding defaults.\n\n## 0.4.2 - 2023-03-26\n\n* Fix type bug with JSON nudging.\n* Improve `HallucinationCheck` to handle more cases.\n* More tests!\n\n## 0.4.1 - 2023-03-24\n\n* Fix bug with HallucinationCheck.\n\n## 0.4.0 - 2023-03-24\n\n* New configurable pre- and post-processing pipelines for customizing behavior.\n* Addition of `ScrapeResult` object to hold results of scraping along with metadata.\n* Support for `pydantic` models as schemas and for validation.\n* \"Hallucination\" check to ensure that the data in the response truly exists on the page.\n* Use post-processing pipeline to \"nudge\" JSON errors to a better result.\n* Now fully type-annotated.\n* Another big refactor, separation of API calls and scraping logic.\n* Finally, a ghost logo reminiscent of library's [namesake](https:\u002F\u002Fstatic.wikia.nocookie.net\u002Fsuperheroes\u002Fimages\u002F4\u002F49\u002FSpace_Ghost.jpg\u002Frevision\u002Flatest\u002Fscale-to-width-down\u002F1000?cb=20140111031255).\n\n## 0.3.0 - 2023-03-20\n\n* Add tests, docs, and complete examples!\n* Add preprocessors to `SchemaScraper` to allow for uniform interface for cleaning & selecting HTML.\n* Use `tiktoken` for accurate token counts.\n* New `cost_estimate` utility function.\n* Cost is now tracked on a per-scraper basis (see the `total_cost` 
attribute on `SchemaScraper` objects).\n* `SchemaScraper` now takes a `max_cost` parameter to limit the total cost of a scraper.\n* Prompt improvements, list mode simplification.\n\n## 0.2.0 - 2023-03-18\n\n* Add list mode, auto-splitting, and pagination support.\n* Improve `xpath` and `css` handling.\n* Improve prompt for GPT 3.5.\n* Make it possible to alter parameters when calling scrape.\n* Logging & error handling.\n* Command line interface.\n* See blog post for details: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-part-2\u002F>\n\n## 0.1.0 - 2023-03-17\n\n* Initial experiment, see blog post for more: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-4\u002F>\n","2023-06-13T22:50:59",{"id":128,"version":129,"summary_zh":130,"released_at":131},101163,"0.5.0","# Changelog\n\n## 0.5.0 - WIP\n\n* Restore `PaginatedSchemaScraper` and add documentation for pagination.\n* Documentation improvements.\n\n## 0.4.4 - 2023-03-31\n\n* Deactivate `HallucinationCheck` by default, it is overly aggressive and needs more work to be useful without raising false positives.\n* Bugfix for postprocessors parameter behavior not overriding defaults.\n\n## 0.4.2 - 2023-03-26\n\n* Fix type bug with JSON nudging.\n* Improve `HallucinationCheck` to handle more cases.\n* More tests!\n\n## 0.4.1 - 2023-03-24\n\n* Fix bug with HallucinationCheck.\n\n## 0.4.0 - 2023-03-24\n\n* New configurable pre- and post-processing pipelines for customizing behavior.\n* Addition of `ScrapeResult` object to hold results of scraping along with metadata.\n* Support for `pydantic` models as schemas and for validation.\n* \"Hallucination\" check to ensure that the data in the response truly exists on the page.\n* Use post-processing pipeline to \"nudge\" JSON errors to a better result.\n* Now fully type-annotated.\n* Another big refactor, separation of API calls and scraping logic.\n* Finally, a ghost logo reminiscent of library's 
[namesake](https:\u002F\u002Fstatic.wikia.nocookie.net\u002Fsuperheroes\u002Fimages\u002F4\u002F49\u002FSpace_Ghost.jpg\u002Frevision\u002Flatest\u002Fscale-to-width-down\u002F1000?cb=20140111031255).\n\n## 0.3.0 - 2023-03-20\n\n* Add tests, docs, and complete examples!\n* Add preprocessors to `SchemaScraper` to allow for uniform interface for cleaning & selecting HTML.\n* Use `tiktoken` for accurate token counts.\n* New `cost_estimate` utility function.\n* Cost is now tracked on a per-scraper basis (see the `total_cost` attribute on `SchemaScraper` objects).\n* `SchemaScraper` now takes a `max_cost` parameter to limit the total cost of a scraper.\n* Prompt improvements, list mode simplification.\n\n## 0.2.0 - 2023-03-18\n\n* Add list mode, auto-splitting, and pagination support.\n* Improve `xpath` and `css` handling.\n* Improve prompt for GPT 3.5.\n* Make it possible to alter parameters when calling scrape.\n* Logging & error handling.\n* Command line interface.\n* See blog post for details: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-part-2\u002F>\n\n## 0.1.0 - 2023-03-17\n\n* Initial experiment, see blog post for more: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-4\u002F>\n","2023-06-06T22:39:40",{"id":133,"version":134,"summary_zh":135,"released_at":136},101164,"0.4.4","# Changelog\n\n## 0.4.4 - 2023-03-31\n\n* Deactivate `HallucinationCheck` by default, it is overly aggressive and needs more work to be useful without raising false positives.\n* Bugfix for postprocessors parameter behavior not overriding defaults.\n\n## 0.4.2 - 2023-03-26\n\n* Fix type bug with JSON nudging.\n* Improve `HallucinationCheck` to handle more cases.\n* More tests!\n\n## 0.4.1 - 2023-03-24\n\n* Fix bug with HallucinationCheck.\n\n## 0.4.0 - 2023-03-24\n\n* New configurable pre- and post-processing pipelines for customizing behavior.\n* Addition of `ScrapeResult` object to hold results of scraping along with metadata.\n* Support 
for `pydantic` models as schemas and for validation.\n* \"Hallucination\" check to ensure that the data in the response truly exists on the page.\n* Use post-processing pipeline to \"nudge\" JSON errors to a better result.\n* Now fully type-annotated.\n* Another big refactor, separation of API calls and scraping logic.\n* Finally, a ghost logo reminiscent of library's [namesake](https:\u002F\u002Fstatic.wikia.nocookie.net\u002Fsuperheroes\u002Fimages\u002F4\u002F49\u002FSpace_Ghost.jpg\u002Frevision\u002Flatest\u002Fscale-to-width-down\u002F1000?cb=20140111031255).\n\n## 0.3.0 - 2023-03-20\n\n* Add tests, docs, and complete examples!\n* Add preprocessors to `SchemaScraper` to allow for uniform interface for cleaning & selecting HTML.\n* Use `tiktoken` for accurate token counts.\n* New `cost_estimate` utility function.\n* Cost is now tracked on a per-scraper basis (see the `total_cost` attribute on `SchemaScraper` objects).\n* `SchemaScraper` now takes a `max_cost` parameter to limit the total cost of a scraper.\n* Prompt improvements, list mode simplification.\n\n## 0.2.0 - 2023-03-18\n\n* Add list mode, auto-splitting, and pagination support.\n* Improve `xpath` and `css` handling.\n* Improve prompt for GPT 3.5.\n* Make it possible to alter parameters when calling scrape.\n* Logging & error handling.\n* Command line interface.\n* See blog post for details: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-part-2\u002F>\n\n## 0.1.0 - 2023-03-17\n\n* Initial experiment, see blog post for more: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-4\u002F>\n","2023-03-31T16:55:07",{"id":138,"version":139,"summary_zh":140,"released_at":141},101165,"0.4.2","# Changelog\n\n## Next\n\n* Fix bug with JSON nudging.\n* Improve `HallucinationCheck` to handle more cases.\n\n## 0.4.1 - 2023-03-24\n\n* Fix bug with HallucinationCheck.\n\n## 0.4.0 - 2023-03-24\n\n* New configurable pre- and post-processing pipelines for customizing 
behavior.\n* Addition of `ScrapeResult` object to hold results of scraping along with metadata.\n* Support for `pydantic` models as schemas and for validation.\n* \"Hallucination\" check to ensure that the data in the response truly exists on the page.\n* Use post-processing pipeline to \"nudge\" JSON errors to a better result.\n* Now fully type-annotated.\n* Another big refactor, separation of API calls and scraping logic.\n* Finally, a ghost logo reminiscent of library's [namesake](https:\u002F\u002Fstatic.wikia.nocookie.net\u002Fsuperheroes\u002Fimages\u002F4\u002F49\u002FSpace_Ghost.jpg\u002Frevision\u002Flatest\u002Fscale-to-width-down\u002F1000?cb=20140111031255).\n\n## 0.3.0 - 2023-03-20\n\n* Add tests, docs, and complete examples!\n* Add preprocessors to `SchemaScraper` to allow for uniform interface for cleaning & selecting HTML.\n* Use `tiktoken` for accurate token counts.\n* New `cost_estimate` utility function.\n* Cost is now tracked on a per-scraper basis (see the `total_cost` attribute on `SchemaScraper` objects).\n* `SchemaScraper` now takes a `max_cost` parameter to limit the total cost of a scraper.\n* Prompt improvements, list mode simplification.\n\n## 0.2.0 - 2023-03-18\n\n* Add list mode, auto-splitting, and pagination support.\n* Improve `xpath` and `css` handling.\n* Improve prompt for GPT 3.5.\n* Make it possible to alter parameters when calling scrape.\n* Logging & error handling.\n* Command line interface.\n* See blog post for details: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-part-2\u002F>\n\n## 0.1.0 - 2023-03-17\n\n* Initial experiment, see blog post for more: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-4\u002F>\n","2023-03-26T18:42:08",{"id":143,"version":144,"summary_zh":145,"released_at":146},101166,"0.4.1","# Changelog\n\n## 0.4.1 - 2023-03-24\n\n* Fix bug with HallucinationCheck.\n\n## 0.4.0 - 2023-03-24\n\n* New configurable pre- and post-processing pipelines for 
customizing behavior.\n* Addition of `ScrapeResult` object to hold results of scraping along with metadata.\n* Support for `pydantic` models as schemas and for validation.\n* \"Hallucination\" check to ensure that the data in the response truly exists on the page.\n* Use post-processing pipeline to \"nudge\" JSON errors to a better result.\n* Now fully type-annotated.\n* Another big refactor, separation of API calls and scraping logic.\n* Finally, a ghost logo reminiscent of library's [namesake](https:\u002F\u002Fstatic.wikia.nocookie.net\u002Fsuperheroes\u002Fimages\u002F4\u002F49\u002FSpace_Ghost.jpg\u002Frevision\u002Flatest\u002Fscale-to-width-down\u002F1000?cb=20140111031255).\n\n## 0.3.0 - 2023-03-20\n\n* Add tests, docs, and complete examples!\n* Add preprocessors to `SchemaScraper` to allow for uniform interface for cleaning & selecting HTML.\n* Use `tiktoken` for accurate token counts.\n* New `cost_estimate` utility function.\n* Cost is now tracked on a per-scraper basis (see the `total_cost` attribute on `SchemaScraper` objects).\n* `SchemaScraper` now takes a `max_cost` parameter to limit the total cost of a scraper.\n* Prompt improvements, list mode simplification.\n\n## 0.2.0 - 2023-03-18\n\n* Add list mode, auto-splitting, and pagination support.\n* Improve `xpath` and `css` handling.\n* Improve prompt for GPT 3.5.\n* Make it possible to alter parameters when calling scrape.\n* Logging & error handling.\n* Command line interface.\n* See blog post for details: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-part-2\u002F>\n\n## 0.1.0 - 2023-03-17\n\n* Initial experiment, see blog post for more: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-4\u002F>\n","2023-03-25T01:05:02",{"id":148,"version":149,"summary_zh":150,"released_at":151},101167,"0.4.0","# Changelog\n\n## 0.4.0 - 2023-03-24\n\n* New configurable pre- and post-processing pipelines for customizing behavior.\n* Addition of `ScrapeResult` object to 
hold results of scraping along with metadata.\n* Support for `pydantic` models as schemas and for validation.\n* \"Hallucination\" check to ensure that the data in the response truly exists on the page.\n* Use post-processing pipeline to \"nudge\" JSON errors to a better result.\n* Now fully type-annotated.\n* Another big refactor, separation of API calls and scraping logic.\n* Finally, a ghost logo reminiscent of library's [namesake](https:\u002F\u002Fstatic.wikia.nocookie.net\u002Fsuperheroes\u002Fimages\u002F4\u002F49\u002FSpace_Ghost.jpg\u002Frevision\u002Flatest\u002Fscale-to-width-down\u002F1000?cb=20140111031255).\n\n## 0.3.0 - 2023-03-20\n\n* Add tests, docs, and complete examples!\n* Add preprocessors to `SchemaScraper` to allow for uniform interface for cleaning & selecting HTML.\n* Use `tiktoken` for accurate token counts.\n* New `cost_estimate` utility function.\n* Cost is now tracked on a per-scraper basis (see the `total_cost` attribute on `SchemaScraper` objects).\n* `SchemaScraper` now takes a `max_cost` parameter to limit the total cost of a scraper.\n* Prompt improvements, list mode simplification.\n\n## 0.2.0 - 2023-03-18\n\n* Add list mode, auto-splitting, and pagination support.\n* Improve `xpath` and `css` handling.\n* Improve prompt for GPT 3.5.\n* Make it possible to alter parameters when calling scrape.\n* Logging & error handling.\n* Command line interface.\n* See blog post for details: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-part-2\u002F>\n\n## 0.1.0 - 2023-03-17\n\n* Initial experiment, see blog post for more: \u003Chttps:\u002F\u002Fjamesturk.net\u002Fposts\u002Fscraping-with-gpt-4\u002F>\n","2023-03-25T00:48:50",{"id":153,"version":154,"summary_zh":79,"released_at":155},101168,"0.3.0","2023-03-20T10:09:07",{"id":157,"version":158,"summary_zh":79,"released_at":159},101169,"0.2.0","2023-03-19T02:23:09",{"id":161,"version":162,"summary_zh":79,"released_at":163},101170,"0.1.0","2023-03-18T03:01:04"]