zerox

12.2k 836 简单 3 次阅读昨天MIT图像

AI 解读由 AI 自动生成，仅供参考

zerox 是一款基于视觉大模型的开源 OCR 与文档提取工具，旨在将各类文档（如 PDF、图片、DOCX）高效转换为机器可读的 Markdown 格式。传统 OCR 技术往往难以准确还原复杂的页面布局、表格或图表，导致数据质量参差不齐。zerox 通过“文件转图片 + 视觉模型解析”的逻辑，利用 AI 对视觉信息的强大理解力，完美解决了这一痛点，让文档内容真正能被 AI 轻松消化。

zerox 主要面向开发者、数据研究人员以及需要自动化处理文档的团队。它提供了 Node.js 和 Python 两种开发包，支持灵活集成到现有工作流中。技术上，zerox 的一大亮点是广泛的模型兼容性，用户可以选择 OpenAI、Azure OpenAI、AWS Bedrock 或 Google Gemini 等多种主流视觉模型接口。此外，它还具备智能排版保持、多页并发处理及自定义系统提示词等功能，能根据需求调整输出风格。对于希望提升文档数字化效率、构建 RAG 系统或进行数据分析的用户来说，zerox 是一个可靠且强大的选择。

使用场景

某金融科技公司的研发工程师正在搭建内部合规文档知识库，急需将数百份扫描版 PDF 合同转化为可被 AI 精准检索的结构化文本。

没有 zerox 时

传统 OCR 引擎对复杂表格和跨页内容识别率低，导致数据严重错位
人工校对修复格式耗时巨大，单份百页合同需耗费数小时才能可用
原始文档中的图表和特殊符号在转换过程中直接丢失，关键信息缺失
需要针对不同供应商的 API 编写定制化解析脚本，系统耦合度高且维护困难

使用 zerox 后

直接输出高质量 Markdown，自动还原表格结构与段落层级关系
基于视觉模型理解页面布局，无需人工干预即可处理复杂排版与图表
支持 OpenAI、Azure 等多种大模型接口，灵活适配现有云架构环境
通过 Node.js 或 Python 调用 zerox 快速集成，仅需几行代码即可实现批量自动化处理

zerox 利用视觉模型替代传统 OCR 技术，让非结构化文档瞬间变为机器友好的 Markdown 格式。

运行环境要求

操作系统

未说明

GPU

未说明

内存

未说明

依赖

notesNode 版本需系统安装 graphicsmagick 和 ghostscript；Python 版本需系统安装 poppler 并参考 pdf2image 文档。核心 OCR 及视觉处理依赖外部大模型 API（如 OpenAI、Azure、AWS Bedrock、Google Gemini），需配置对应 API Key 凭证。支持异步处理和并发控制。

python未说明

graphicsmagick

ghostscript

poppler

pdf2image

py-zerox

zerox

快速开始

Zerox OCR

一种将文档进行 OCR（光学字符识别）处理以供 AI 摄入的极其简单的方法。毕竟，文档本质上是一种视觉呈现。包含奇怪的布局、表格、图表等。视觉模型在这里非常适用！

基本逻辑如下：

传入文件（PDF、DOCX、图片等）
将该文件转换为一系列图片
将每张图片发送给 GPT 并礼貌地请求 Markdown 格式
聚合响应结果并返回 Markdown

在此处试用托管版本：https://getomni.ai/ocr-demo 或者访问我们的完整文档：https://docs.getomni.ai/zerox

开始使用

Zerox 同时提供 Node 和 Python 包。

功能	Node.js	Python
PDF 处理	✓（需要 graphicsmagick）	✓（需要 poppler）
图像处理	✓	✓
OpenAI 支持	✓	✓
Azure OpenAI 支持	✓	✓
AWS Bedrock 支持	✓	✓
Google Gemini 支持	✓	✓
Vertex AI 支持	✗	✓
数据提取	✓（`schema`）	✗
逐页提取	✓（`extractPerPage`）	✗
自定义系统提示词	✗	✓（`custom_system_prompt`）
保持格式选项	✓（`maintainFormat`）	✓（`maintain_format`）
异步 API	✓	✓
错误处理模式	✓（`errorMode`）	✗
并发处理	✓（`concurrency`）	✓（`concurrency`）
临时目录管理	✓（`tempDir`）	✓（`temp_dir`）
页面选择	✓（`pagesToConvertAsImages`）	✓（`select_pages`）
方向校正	✓（`correctOrientation`）	✗
边缘裁剪	✓（`trimEdges`）	✗

Node Zerox

(Node.js SDK - 支持来自不同提供商的视觉模型，如 OpenAI、Azure OpenAI、Anthropic、AWS Bedrock、Google Gemini 等。)

安装

npm install zerox

Zerox 使用 graphicsmagick 和 ghostscript 进行 PDF 转图片的处理步骤。这些通常会自动拉取，但您可能需要手动安装。

在 Linux 上使用：

sudo apt-get update
sudo apt-get install -y graphicsmagick

使用

使用文件 URL

import { zerox } from "zerox";

const result = await zerox({
  filePath: "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf",
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});

从本地路径

import { zerox } from "zerox";
import path from "path";

const result = await zerox({
  filePath: path.resolve(__dirname, "./cs101.pdf"),
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});

参数

const result = await zerox({
  // Required
  filePath: "path/to/file",
  credentials: {
    apiKey: "your-api-key",
    // Additional provider-specific credentials as needed
  },

  // Optional
  cleanup: true, // Clear images from tmp after run
  concurrency: 10, // Number of pages to run at a time
  correctOrientation: true, // True by default, attempts to identify and correct page orientation
  directImageExtraction: false, // Extract data directly from document images instead of the markdown
  errorMode: ErrorMode.IGNORE, // ErrorMode.THROW or ErrorMode.IGNORE, defaults to ErrorMode.IGNORE
  extractionPrompt: "", // LLM instructions for extracting data from document
  extractOnly: false, // Set to true to only extract structured data using a schema
  extractPerPage, // Extract data per page instead of the entire document
  imageDensity: 300, // DPI for image conversion
  imageHeight: 2048, // Maximum height for converted images
  llmParams: {}, // Additional parameters to pass to the LLM
  maintainFormat: false, // Slower but helps maintain consistent formatting
  maxImageSize: 15, // Maximum size of images to compress, defaults to 15MB
  maxRetries: 1, // Number of retries to attempt on a failed page, defaults to 1
  maxTesseractWorkers: -1, // Maximum number of Tesseract workers. Zerox will start with a lower number and only reach maxTesseractWorkers if needed
  model: ModelOptions.OPENAI_GPT_4O, // Model to use (supports various models from different providers)
  modelProvider: ModelProvider.OPENAI, // Choose from OPENAI, BEDROCK, GOOGLE, or AZURE
  outputDir: undefined, // Save combined result.md to a file
  pagesToConvertAsImages: -1, // Page numbers to convert to image as array (e.g. `[1, 2, 3]`) or a number (e.g. `1`). Set to -1 to convert all pages
  prompt: "", // LLM instructions for processing the document
  schema: undefined, // Schema for structured data extraction
  tempDir: "/os/tmp", // Directory to use for temporary files (default: system temp directory)
  trimEdges: true, // True by default, trims pixels from all edges that contain values similar to the given background color, which defaults to that of the top-left pixel
});

maintainFormat 选项尝试通过传递前一页面的输出作为下一页面的额外上下文，以一致的格式返回 Markdown。这需要请求同步运行，因此速度要慢得多。但如果您的文档包含大量表格数据，或经常有跨页表格，则非常有价值。

请求 #1 => page_1_image
请求 #2 => page_1_markdown + page_2_image
请求 #3 => page_2_markdown + page_3_image

示例输出

{
  completionTime: 10038,
  fileName: 'invoice_36258',
  inputTokens: 25543,
  outputTokens: 210,
  pages: [
    {
      page: 1,
      content: '# INVOICE # 36258\n' +
        '**Date:** Mar 06 2012  \n' +
        '**Ship Mode:** First Class  \n' +
        '**Balance Due:** $50.10  \n' +
        '## Bill To:\n' +
        'Aaron Bergman  \n' +
        '98103, Seattle,  \n' +
        'Washington, United States  \n' +
        '## Ship To:\n' +
        'Aaron Bergman  \n' +
        '98103, Seattle,  \n' +
        'Washington, United States  \n' +
        '\n' +
        '| Item                                       | Quantity | Rate   | Amount  |\n' +
        '|--------------------------------------------|----------|--------|---------|\n' +
        "| Global Push Button Manager's Chair, Indigo | 1        | $48.71 | $48.71  |\n" +
        '| Chairs, Furniture, FUR-CH-4421             |          |        |         |\n' +
        '\n' +
        '**Subtotal:** $48.71  \n' +
        '**Discount (20%):** $9.74  \n' +
        '**Shipping:** $11.13  \n' +
        '**Total:** $50.10  \n' +
        '---\n' +
        '**Notes:**  \n' +
        'Thanks for your business!  \n' +
        '**Terms:**  \n' +
        'Order ID : CA-2012-AB10015140-40974  ',
      contentLength: 747,
    }
  ],
  extracted: null,
  summary: {
    totalPages: 1,
    ocr: {
      failed: 0,
      successful: 1,
    },
    extracted: null,
  },
}

数据提取

Zerox 支持使用模式（schema）从文档中提取结构化数据。这允许您以结构化格式从文档中拉取特定信息，而不是获取完整的 markdown 转换。

设置 extractOnly: true 并提供 schema 以提取结构化数据。该模式遵循 JSON Schema 标准。

使用 extractPerPage 按页提取数据，而不是一次性从整个文档中提取。

您还可以设置 extractionModel、extractionModelProvider 和 extractionCredentials，以便在提取时使用与 OCR 不同的模型。默认情况下，使用相同的模型。

支持的模型

Zerox 支持来自不同提供商的多种模型：

Azure OpenAI
- GPT-4 Vision (gpt-4o)
- GPT-4 Vision Mini (gpt-4o-mini)
- GPT-4.1 (gpt-4.1)
- GPT-4.1 Mini (gpt-4.1-mini)
OpenAI
- GPT-4 Vision (gpt-4o)
- GPT-4 Vision Mini (gpt-4o-mini)
- GPT-4.1 (gpt-4.1)
- GPT-4.1 Mini (gpt-4.1-mini)
AWS Bedrock
- Claude 3 Haiku (2024.03, 2024.10)
- Claude 3 Sonnet (2024.02, 2024.06, 2024.10)
- Claude 3 Opus (2024.02)
Google Gemini
- Gemini 1.5 (Flash, Flash-8B, Pro)
- Gemini 2.0 (Flash, Flash-Lite)

import { zerox } from "zerox";
import { ModelOptions, ModelProvider } from "zerox/node-zerox/dist/types";

// OpenAI
const openaiResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.OPENAI,
  model: ModelOptions.OPENAI_GPT_4O,
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});

// Azure OpenAI
const azureResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.AZURE,
  model: ModelOptions.OPENAI_GPT_4O,
  credentials: {
    apiKey: process.env.AZURE_API_KEY,
    endpoint: process.env.AZURE_ENDPOINT,
  },
});

// AWS Bedrock
const bedrockResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.BEDROCK,
  model: ModelOptions.BEDROCK_CLAUDE_3_SONNET_2024_10,
  credentials: {
    accessKeyId: process.env.AWS_ACCESS_KEY_ID,
    secretAccessKey: process.env.AWS_SECRET_ACCESS_KEY,
    region: process.env.AWS_REGION,
  },
});

// Google Gemini
const geminiResult = await zerox({
  filePath: "path/to/file.pdf",
  modelProvider: ModelProvider.GOOGLE,
  model: ModelOptions.GOOGLE_GEMINI_1_5_PRO,
  credentials: {
    apiKey: process.env.GEMINI_API_KEY,
  },
});

Python Zerox

(Python SDK - 支持来自不同提供商的视觉模型，如 OpenAI、Azure OpenAI、Anthropic、AWS Bedrock 等。)

安装

在系统上安装 poppler，确保其可在路径变量中找到。请参阅 pdf2image 文档获取各平台的安装说明。
安装 py-zerox:

pip install py-zerox

pyzerox.zerox 函数是一个异步 API，使用视觉模型将 OCR（光学字符识别）转换为 markdown。它处理 PDF 文件并将其转换为 markdown 格式。在使用此 API 之前，请确保设置好模型和模型提供者的环境变量。

有关设置环境和传递正确模型名称的说明，请参阅 LiteLLM 文档。

使用

from pyzerox import zerox
import os
import json
import asyncio

### Model Setup (Use only Vision Models) Refer: https://docs.litellm.ai/docs/providers ###

## placeholder for additional model kwargs which might be required for some models
kwargs = {}

## system prompt to use for the vision model
custom_system_prompt = None

# to override
# custom_system_prompt = "For the below PDF page, do something..something..." ## example

###################### Example for OpenAI ######################
model = "gpt-4o-mini" ## openai model
os.environ["OPENAI_API_KEY"] = "" ## your-api-key


###################### Example for Azure OpenAI ######################
model = "azure/gpt-4o-mini" ## "azure/<your_deployment_name>" -> format <provider>/<model>
os.environ["AZURE_API_KEY"] = "" # "your-azure-api-key"
os.environ["AZURE_API_BASE"] = "" # "https://example-endpoint.openai.azure.com"
os.environ["AZURE_API_VERSION"] = "" # "2023-05-15"


###################### Example for Gemini ######################
model = "gemini/gpt-4o-mini" ## "gemini/<gemini_model>" -> format <provider>/<model>
os.environ['GEMINI_API_KEY'] = "" # your-gemini-api-key


###################### Example for Anthropic ######################
model="claude-3-opus-20240229"
os.environ["ANTHROPIC_API_KEY"] = "" # your-anthropic-api-key

###################### Vertex ai ######################
model = "vertex_ai/gemini-1.5-flash-001" ## "vertex_ai/<model_name>" -> format <provider>/<model>
## GET CREDENTIALS
## RUN ##
# !gcloud auth application-default login - run this to add vertex credentials to your env
## OR ##
file_path = 'path/to/vertex_ai_service_account.json'

# Load the JSON file
with open(file_path, 'r') as file:
    vertex_credentials = json.load(file)

# Convert to JSON string
vertex_credentials_json = json.dumps(vertex_credentials)

vertex_credentials=vertex_credentials_json

## extra args
kwargs = {"vertex_credentials": vertex_credentials}

###################### For other providers refer: https://docs.litellm.ai/docs/providers ######################

# Define main async entrypoint
async def main():
    file_path = "https://omni-demo-data.s3.amazonaws.com/test/cs101.pdf" ## local filepath and file URL supported

    ## process only some pages or all
    select_pages = None ## None for all, but could be int or list(int) page numbers (1 indexed)

    output_dir = "./output_test" ## directory to save the consolidated markdown file
    result = await zerox(file_path=file_path, model=model, output_dir=output_dir,
                        custom_system_prompt=custom_system_prompt,select_pages=select_pages, **kwargs)
    return result


# run the main function:
result = asyncio.run(main())

# print markdown result
print(result)

参数

async def zerox(
    cleanup: bool = True,
    concurrency: int = 10,
    file_path: Optional[str] = "",
    maintain_format: bool = False,
    model: str = "gpt-4o-mini",
    output_dir: Optional[str] = None,
    temp_dir: Optional[str] = None,
    custom_system_prompt: Optional[str] = None,
    select_pages: Optional[Union[int, Iterable[int]]] = None,
    **kwargs
) -> ZeroxOutput:
  ...

参数

cleanup (bool, 可选): 是否在处理完成后清理临时文件。默认为 True。
concurrency (int, 可选): 要运行的并发进程数量。默认为 10。
file_path (Optional[str], 可选): 要处理的 PDF 文件的路径。默认为空字符串。
maintain_format (bool, 可选): 是否保留上一页的格式。默认为 False。
model (str, 可选): 用于生成补全内容的模型。默认为 "gpt-4o-mini"。请参考 LiteLLM 提供商 (LiteLLM Providers) 以获取正确的模型名称，因为它可能因提供商而异。
output_dir (Optional[str], 可选): 保存 Markdown 输出的目录。默认为 None。
temp_dir (str, 可选): 存储临时文件的目录，默认为系统临时目录中的某个命名文件夹。如果已存在，Zerox 在使用前将删除其中的内容。
custom_system_prompt (str, 可选): 为模型使用的系统提示词，这将覆盖 Zerox 的默认系统提示词。通常不需要，除非您想要某些特定行为。默认为 None。
select_pages (Optional[Union[int, Iterable[int]]], 可选): 要处理的页面，可以是单个页码或页码的可迭代对象。默认为 None。
kwargs (dict, 可选): 传递给 litellm.completion 方法的额外关键字参数。有关详细信息，请参阅 LiteLLM 文档和补全输入 (Completion Input)。

ZeroxOutput: 包含模型生成的 Markdown 内容以及一些元数据（见下文）。

示例输出 (来自 "azure/gpt-4o-mini")

注意：为了便于阅读，此文档中的输出已手动包装。

ZeroxOutput(
    completion_time=9432.975,
    file_name='cs101',
    input_tokens=36877,
    output_tokens=515,
    pages=[
        Page(
            content='| Type    | Description                          | Wrapper Class |\n' +
                    '|---------|--------------------------------------|---------------|\n' +
                    '| byte    | 8-bit signed 2s complement integer   | Byte          |\n' +
                    '| short   | 16-bit signed 2s complement integer  | Short         |\n' +
                    '| int     | 32-bit signed 2s complement integer  | Integer       |\n' +
                    '| long    | 64-bit signed 2s complement integer  | Long          |\n' +
                    '| float   | 32-bit IEEE 754 floating point number| Float         |\n' +
                    '| double  | 64-bit floating point number         | Double        |\n' +
                    '| boolean | may be set to true or false          | Boolean       |\n' +
                    '| char    | 16-bit Unicode (UTF-16) character    | Character     |\n\n' +
                    'Table 26.2.: Primitive types in Java\n\n' +
                    '### 26.3.1. Declaration & Assignment\n\n' +
                    'Java is a statically typed language meaning that all variables must be declared before you can use ' +
                    'them or refer to them. In addition, when declaring a variable, you must specify both its type and ' +
                    'its identifier. For example:\n\n' +
                    '```java\n' +
                    'int numUnits;\n' +
                    'double costPerUnit;\n' +
                    'char firstInitial;\n' +
                    'boolean isStudent;\n' +
                    '```\n\n' +
                    'Each declaration specifies the variable’s type followed by the identifier and ending with a ' +
                    'semicolon. The identifier rules are fairly standard: a name can consist of lowercase and ' +
                    'uppercase alphabetic characters, numbers, and underscores but may not begin with a numeric ' +
                    'character. We adopt the modern camelCasing naming convention for variables in our code. In ' +
                    'general, variables must be assigned a value before you can use them in an expression. You do not ' +
                    'have to immediately assign a value when you declare them (though it is good practice), but some ' +
                    'value must be assigned before they can be used or the compiler will issue an error.\n\n' +
                    'The assignment operator is a single equal sign, `=` and is a right-to-left assignment. That is, ' +
                    'the variable that we wish to assign the value to appears on the left-hand-side while the value ' +
                    '(literal, variable or expression) is on the right-hand-side. Using our variables from before, ' +
                    'we can assign them values:\n\n' +
                    '> 2 Instance variables, that is variables declared as part of an object do have default values. ' +
                    'For objects, the default is `null`, for all numeric types, zero is the default value. For the ' +
                    'boolean type, `false` is the default, and the default char value is `\\0`, the null-terminating ' +
                    'character (zero in the ASCII table).',
            content_length=2333,
            page=1
        )
    ]
)

支持的文件类型

我们结合使用 libreoffice（LibreOffice）和 graphicsmagick（GraphicsMagick）来实现文档到图像的转换。对于非图像/非 PDF（Portable Document Format）文件，我们使用 libreoffice 将该文件转换为 PDF，然后再转换为图像。

[
  "pdf", // Portable Document Format
  "doc", // Microsoft Word 97-2003
  "docx", // Microsoft Word 2007-2019
  "odt", // OpenDocument Text
  "ott", // OpenDocument Text Template
  "rtf", // Rich Text Format
  "txt", // Plain Text
  "html", // HTML Document
  "htm", // HTML Document (alternative extension)
  "xml", // XML Document
  "wps", // Microsoft Works Word Processor
  "wpd", // WordPerfect Document
  "xls", // Microsoft Excel 97-2003
  "xlsx", // Microsoft Excel 2007-2019
  "ods", // OpenDocument Spreadsheet
  "ots", // OpenDocument Spreadsheet Template
  "csv", // Comma-Separated Values
  "tsv", // Tab-Separated Values
  "ppt", // Microsoft PowerPoint 97-2003
  "pptx", // Microsoft PowerPoint 2007-2019
  "odp", // OpenDocument Presentation
  "otp", // OpenDocument Presentation Template
];

致谢

Litellm: https://github.com/BerriAI/litellm | 此项目为我们的 Python SDK（软件开发工具包）提供支持，以兼容来自不同提供商的所有流行视觉模型。

许可证

本项目根据 MIT License（MIT 许可证）获得许可。

Zerox 快速上手指南

简介

Zerox 是一款简单的 OCR 工具，旨在将文档（PDF、DOCX、图片等）转换为 Markdown 格式，以便 AI 模型直接摄入。它擅长处理包含复杂布局、表格和图表的文档。

环境准备

Node.js 环境

Zerox 依赖 graphicsmagick 和 ghostscript 进行 PDF 转图片处理。

Linux 用户请手动安装：

sudo apt-get update
sudo apt-get install -y graphicsmagick

Windows/macOS 通常会自动处理，若遇到问题请查阅相关包管理器文档。

Python 环境

系统需安装 poppler，并确保其位于环境变量 PATH 中。
参考 pdf2image 安装文档获取各平台安装说明。

API 密钥

准备支持视觉模型的大厂 API 密钥（如 OpenAI, Azure, AWS Bedrock 等）。
设置对应的环境变量（例如 OPENAI_API_KEY）。

安装步骤

Node.js

npm install zerox

Python

pip install py-zerox

基本使用

Node.js 示例

支持传入本地文件路径或远程 URL，以下为使用 OpenAI 模型的极简示例：

import { zerox } from "zerox";

const result = await zerox({
  filePath: "path/to/file.pdf",
  credentials: {
    apiKey: process.env.OPENAI_API_KEY,
  },
});

console.log(result.pages[0].content);

Python 示例

Python SDK 为异步 API，需配置环境变量后调用：

from pyzerox import zerox
import os
import asyncio

# 设置环境变量
os.environ["OPENAI_API_KEY"] = "your-api-key"

async def main():
    result = await zerox(
        filepath="path/to/file.pdf",
        model="gpt-4o-mini"
    )
    print(result)

asyncio.run(main())

支持的模型

Zerox 支持多种模型提供商，包括 OpenAI、Azure OpenAI、AWS Bedrock 和 Google Gemini。开发者可通过 modelProvider 和 model 参数灵活切换。

版本历史

v0.1.042024/10/21

v0.1.032024/10/20

v0.1.022024/10/18

v0.1.012024/10/15

v0.1.02024/09/12

v0.0.22024/09/10

v0.0.12024/09/10

v0.1.062024/12/18

v0.1.052024/11/23

常见问题

使用 Gemini 模型处理图片时报错 "Failed to process image" 如何处理？

运行 Azure OpenAI 示例代码时报错 "Azure client is not an instance of AsyncAzureOpenAI" 怎么办？

如何确认自定义系统提示词（custom_system_prompt）是否正常工作？

PDF 转 Markdown 时内容不完整或缺失是什么原因？

如何将 Zerox 生成的 Markdown 内容转换为 HTML 以支持 EPUB 导出？

安装后页面顺序混乱或功能异常，可能是版本问题吗？

项目是否支持 Azure OpenAI 或其他第三方模型提供商？

相似工具推荐

stable-diffusion-webui

stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。

★ 162.1k|★★★☆☆|今天

开发框架图像Agent

ComfyUI

ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。

★ 107.7k|★★☆☆☆|2天前

开发框架图像Agent

ML-For-Beginners

ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。

★ 85k|★★☆☆☆|今天

图像数据工具视频

ragflow

RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。

★ 77.1k|★★★☆☆|昨天

Agent图像开发框架

PaddleOCR

PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。 PaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。

★ 74.9k|★★★☆☆|今天

语言模型图像开发框架

tesseract

Tesseract 是一款历史悠久且备受推崇的开源光学字符识别（OCR）引擎，最初由惠普实验室开发，后由 Google 维护，目前由全球社区共同贡献。它的核心功能是将图片中的文字转化为可编辑、可搜索的文本数据，有效解决了从扫描件、照片或 PDF 文档中提取文字信息的难题，是数字化归档和信息自动化的重要基础工具。在技术层面，Tesseract 展现了强大的适应能力。从版本 4 开始，它引入了基于长短期记忆网络（LSTM）的神经网络 OCR 引擎，显著提升了行识别的准确率；同时，为了兼顾旧有需求，它依然支持传统的字符模式识别引擎。Tesseract 原生支持 UTF-8 编码，开箱即用即可识别超过 100 种语言，并兼容 PNG、JPEG、TIFF 等多种常见图像格式。输出方面，它灵活支持纯文本、hOCR、PDF、TSV 等多种格式，方便后续数据处理。 Tesseract 主要面向开发者、研究人员以及需要构建文档处理流程的企业用户。由于它本身是一个命令行工具和库（libtesseract），不包含图形用户界面（GUI），因此最适合具备一定编程能力的技术人员集成到自动化脚本或应用程序中

★ 73.3k|★★☆☆☆|3天前

开发框架图像