[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-niuzaisheng--ScreenAgent":3,"tool-niuzaisheng--ScreenAgent":62},[4,18,28,37,45,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":24,"last_commit_at":25,"category_tags":26,"status":17},9989,"n8n","n8n-io\u002Fn8n","n8n 是一款面向技术团队的公平代码（fair-code）工作流自动化平台，旨在让用户在享受低代码快速构建便利的同时，保留编写自定义代码的灵活性。它主要解决了传统自动化工具要么过于封闭难以扩展、要么完全依赖手写代码效率低下的痛点，帮助用户轻松连接 400 多种应用与服务，实现复杂业务流程的自动化。\n\nn8n 特别适合开发者、工程师以及具备一定技术背景的业务人员使用。其核心亮点在于“按需编码”：既可以通过直观的可视化界面拖拽节点搭建流程，也能随时插入 JavaScript 或 Python 代码、调用 npm 包来处理复杂逻辑。此外，n8n 原生集成了基于 LangChain 的 AI 能力，支持用户利用自有数据和模型构建智能体工作流。在部署方面，n8n 提供极高的自由度，支持完全自托管以保障数据隐私和控制权，也提供云端服务选项。凭借活跃的社区生态和数百个现成模板，n8n 让构建强大且可控的自动化系统变得简单高效。",184740,2,"2026-04-19T23:22:26",[16,14,13,15,27],"插件",{"id":29,"name":30,"github_repo":31,"description_zh":32,"stars":33,"difficulty_score":10,"last_commit_at":34,"category_tags":35,"status":17},10095,"AutoGPT","Significant-Gravitas\u002FAutoGPT","AutoGPT 是一个旨在让每个人都能轻松使用和构建 AI 的强大平台，核心功能是帮助用户创建、部署和管理能够自动执行复杂任务的连续型 AI 智能体。它解决了传统 AI 应用中需要频繁人工干预、难以自动化长流程工作的痛点，让用户只需设定目标，AI 即可自主规划步骤、调用工具并持续运行直至完成任务。\n\n无论是开发者、研究人员，还是希望提升工作效率的普通用户，都能从 AutoGPT 中受益。开发者可利用其低代码界面快速定制专属智能体；研究人员能基于开源架构探索多智能体协作机制；而非技术背景用户也可直接选用预置的智能体模板，立即投入实际工作场景。\n\nAutoGPT 的技术亮点在于其模块化“积木式”工作流设计——用户通过连接功能块即可构建复杂逻辑，每个块负责单一动作，灵活且易于调试。同时，平台支持本地自托管与云端部署两种模式，兼顾数据隐私与使用便捷性。配合完善的文档和一键安装脚本，即使是初次接触的用户也能在几分钟内启动自己的第一个 AI 智能体。AutoGPT 正致力于降低 AI 应用门槛，让人人都能成为 AI 的创造者与受益者。",183572,"2026-04-20T04:47:55",[13,36,27,14,15],"语言模型",{"id":38,"name":39,"github_repo":40,"description_zh":41,"stars":42,"difficulty_score":10,"last_commit_at":43,"category_tags":44,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":46,"name":47,"github_repo":48,"description_zh":49,"stars":50,"difficulty_score":24,"last_commit_at":51,"category_tags":52,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 
工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",161147,"2026-04-19T23:31:47",[14,13,36],{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":59,"last_commit_at":60,"category_tags":61,"status":17},8272,"opencode","anomalyco\u002Fopencode","OpenCode 是一款开源的 AI 编程助手（Coding Agent），旨在像一位智能搭档一样融入您的开发流程。它不仅仅是一个代码补全插件，而是一个能够理解项目上下文、自主规划任务并执行复杂编码操作的智能体。无论是生成全新功能、重构现有代码，还是排查难以定位的 Bug，OpenCode 都能通过自然语言交互高效完成，显著减少开发者在重复性劳动和上下文切换上的时间消耗。\n\n这款工具专为软件开发者、工程师及技术研究人员设计，特别适合希望利用大模型能力来提升编码效率、加速原型开发或处理遗留代码维护的专业人群。其核心亮点在于完全开源的架构，这意味着用户可以审查代码逻辑、自定义行为策略，甚至私有化部署以保障数据安全，彻底打破了传统闭源 AI 助手的“黑盒”限制。\n\n在技术体验上，OpenCode 提供了灵活的终端界面（Terminal UI）和正在测试中的桌面应用程序，支持 macOS、Windows 及 Linux 全平台。它兼容多种包管理工具，安装便捷，并能无缝集成到现有的开发环境中。无论您是追求极致控制权的资深极客，还是渴望提升产出的独立开发者，OpenCode 都提供了一个透明、可信",144296,1,"2026-04-16T14:50:03",[13,27],{"id":63,"github_repo":64,"name":65,"description_en":66,"description_zh":67,"ai_summary_zh":67,"readme_en":68,"readme_zh":69,"quickstart_zh":70,"use_case_zh":71,"hero_image_url":72,"owner_login":73,"owner_name":74,"owner_avatar_url":75,"owner_bio":76,"owner_company":77,"owner_location":78,"owner_email":79,"owner_twitter":80,"owner_website":80,"owner_url":81,"languages":82,"stars":95,"forks":96,"last_commit_at":97,"license":98,"difficulty_score":99,"env_os":100,"env_gpu":101,"env_ram":102,"env_deps":103,"category_tags":115,"github_topics":116,"view_count":24,"oss_zip_url":80,"oss_zip_packed_at":80,"status":17,"created_at":121,"updated_at":122,"faqs":123,"releases":164},10128,"niuzaisheng\u002FScreenAgent","ScreenAgent","ScreenAgent: A Computer Control Agent Driven by Visual Language Large Model (IJCAI-24)","ScreenAgent 是一款由视觉语言大模型驱动的智能体，能够像人类一样直接观察电脑屏幕截图，并通过模拟鼠标点击和键盘输入来操控图形界面。它主要解决了传统自动化脚本依赖特定 API、缺乏通用性且难以处理复杂多步任务的问题，让 AI 能够跨越不同操作系统和应用软件，自主完成文件管理、网页浏览甚至游戏操作等日常任务。\n\n该项目特别适合人工智能研究人员、开发者以及对智能体技术感兴趣的技术爱好者使用。通过内置的“规划 - 执行 - 反思”闭环机制，ScreenAgent 不仅能将用户指令拆解为子任务逐步执行，还能根据操作结果自我评估并动态调整策略，显著提升了任务完成的准确率。其独特的技术亮点在于借鉴 VNC 远程桌面协议设计了通用的动作空间，无需针对特定软件进行适配即可实现跨平台控制；同时，项目开源了涵盖多种场景的高质量数据集，为训练和评估具备视觉定位与工具使用能力的智能体提供了宝贵资源。无论是用于探索多模态大模型的实际应用潜力，还是开发下一代自动化办公助手，ScreenAgent 都提供了一个坚实的研究框架与实践环境。","\u003Cp align=\"center\">\n\u003Ch1 align=\"center\"> ScreenAgent \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_69820c987aa4.png\" alt=\"ScreenAgent Logo\" width=\"30\"> : A Computer Control Agent Driven by Visual Language Large Model\u003C\u002Fh1>\n\u003C\u002Fp>\n\n[View ScreenAgent Paper arxiv:2402.07945](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07945)\n\n[中文版 Readme](README-zh.md)\n\n## News\n- (2024-4-17) The ScreenAgent paper has been accepted for presentation at IJCAI 2024! \n- (2024-5-19) [ScreenAgent Web Client](https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgentWebClient) released, a simpler way to experience controlling a desktop with a large model.\n\n## \nWe have built the ScreenAgent project, creating an environment for Visual Language Model agents (VLM Agent) to interact with real computer screens. In this environment, the agent can observe screenshots and manipulate the GUI by outputting mouse and keyboard operations. We have also designed an automatic control process, which includes planning, action, and reflection stages, guiding the agent to continuously interact with the environment and complete multi-step tasks. 
In addition, we have built the ScreenAgent dataset, which collects screenshots and action sequences when completing various daily computer tasks.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_4d88c7ff554a.png\" alt=\"Motivation\" width=\"50%\">\n  \u003Cp>\u003Ci>ScreenAgent Design Motivation\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\nTo guide the VLM Agent to interact continuously with the computer screen, we have built a process that includes \"planning-execution-reflection\". In the planning phase, the Agent is asked to break down the user task into subtasks. In the execution phase, the Agent will observe the screenshot and give specific mouse and keyboard actions to execute the subtasks. The controller will execute these actions and feed back the execution results to the Agent. In the reflection phase, the Agent will observe the execution results, judge the current state, and choose to continue execution, retry, or adjust the plan. This process will continue until the task is completed.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_ce225118995b.png\" alt=\"Running process\" width=\"100%\">\n  \u003Cp>\u003Ci>Running Process\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\nWe referred to the VNC remote desktop connection protocol to design the action space of the Agent, which are all the most basic mouse and keyboard operations. Most of the mouse click operations require the Agent to give the exact screen coordinate position. Compared with calling specific APIs to complete tasks, this method is more universal and can be applied to various desktop operating systems and applications.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_0c5bf3ebe7ac.png\" alt=\"Action Space\" width=\"50%\">\n  \u003Cp>\u003Ci>Supported Action Types and Action Attributes\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\nTeaching the Agent to use a computer is not a simple matter. It requires the Agent to have multiple comprehensive abilities such as task planning, image understanding, visual positioning, and tool use. For this reason, we manually annotated the ScreenAgent dataset. This dataset covers a variety of daily computer tasks, including file operations, web browsing, gaming entertainment and other scenarios. We build a session according to the above \"planning-execution-reflection\" process.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_6604f907d56a.png\" alt=\"Dataset Task Type Distribution\" width=\"50%\">\n  \u003Cp>\u003Ci>Dataset Task Type Distribution\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\nThe project mainly includes the following parts:\n```\nScreenAgent\n├── client # Controller client code\n│   ├── prompt # Prompt template\n│   ├── config.yml # Controller client configuration file template\n│   └── tasks.txt # Task list\n├── data # Contains the ScreenAgent dataset and other vision positioning related datasets\n├── model_workers # VLM inferencer\n└── train # Model training code\n```\n\n# Preparation\n\n## Step 1, Prepare the desktop to be controlled\nFirst, you need to prepare the desktop operating system to be controlled, where the VNC Server is installed, such as [TightVNC](https:\u002F\u002Fwww.tightvnc.com\u002Fdownload.php). 
Or you can use a Docker container with a GUI. We have prepared a container `niuniushan\u002Fscreenagent-env`. You can use the following command to pull and start this container:\n\n```bash\ndocker run -d --name ScreenAgent -e RESOLUTION=1024x768 -p 5900:5900 -p 8001:8001 -e VNC_PASSWORD=\u003CVNC_PASSWORD> -e CLIPBOARD_SERVER_SECRET_TOKEN=\u003CCLIPBOARD_SERVER_SECRET_TOKEN> -v \u002Fdev\u002Fshm:\u002Fdev\u002Fshm niuniushan\u002Fscreenagent-env:latest\n```\n\nPlease replace `\u003CVNC_PASSWORD>` with your new VNC password, and `\u003CCLIPBOARD_SERVER_SECRET_TOKEN>` with your clipboard service password. Since keyboard input of long strings of text or unicode characters relies on the clipboard, if the clipboard service is not enabled, you can only input ASCII strings by pressing the keyboard in sequence, and you cannot input Chinese and other unicode characters. This image already contains a clipboard service, which listens to port 8001 by default. You need to set a password to protect your clipboard service. `niuniushan\u002Fscreenagent-env` is built based on `fcwu\u002Fdocker-ubuntu-vnc-desktop`. You can find more information about this image [here](https:\u002F\u002Fgithub.com\u002Ffcwu\u002Fdocker-ubuntu-vnc-desktop).\n\nIf you want to use an existing desktop environment, such as Windows, Linux Desktop, or any other desktop environment, you need to run any VNC Server and note its IP address and port number. If you want to enable the clipboard service, please perform the following steps in your desktop environment:\n\n```bash\n# Install dependencies\npip install fastapi pydantic uvicorn pyperclip \n# Set password in environment variable\nexport CLIPBOARD_SERVER_SECRET_TOKEN=\u003CCLIPBOARD_SERVER_SECRET_TOKEN>\n# Start clipboard service\npython client\u002Fclipboard_server.py\n```\n\n`clipboard_server.py` will listen to port 8001 and receive the (text) instruction for keyboard input of long strings of text from the controller.\n\nAfter keeping it running, you can test whether the clipboard service is working properly, for example:\n\n```bash\ncurl --location 'http:\u002F\u002Flocalhost:8001\u002Fclipboard' \\\n--header 'Content-Type: application\u002Fjson' \\\n--data '{\n    \"text\":\"Hello world\",\n    \"token\":\"\u003CCLIPBOARD_SERVER_SECRET_TOKEN>\"\n}'\n```\n\nIf it works correctly, you will receive a response of `{\"success\": True, \"message\": \"Text copied to clipboard\"}`.\nIf you encounter an error \"Pyperclip could not find a copy\u002Fpaste mechanism for your system.\", please add an environment variable specifying the X server location before running `python client\u002Fclipboard_server.py`:\n\n```bash\nexport DISPLAY=:0.0\n```\n\nPlease adjust according to your system environment. 
If you still encounter errors, please refer to [here](https:\u002F\u002Fpyperclip.readthedocs.io\u002Fen\u002Flatest\u002Fintroduction.html#not-implemented-error).\n\nPlease fill in the above information in the `remote_vnc_server` item of the configuration file `client\u002Fconfig.yml`.\n\n## Step 2, Prepare the controller code running environment\n\nYou need to run the controller code, which has three missions: First, the controller will connect to the VNC Server, collect screenshots, and send commands such as mouse and keyboard; Second, the controller maintains a state machine internally, implementing an automatic control process of planning, action, and reflection, guiding the agent to continuously interact with the environment; Finally, the controller will construct complete prompts based on the prompt word template, send them to the large model inference API, and parse the control commands in the large model generated reply. The controller is a program based on PyQt5, you need to install some dependencies:\n\n```bash\npip install -r client\u002Frequirements.txt\n```\n\n## Step 3, Prepare the large model inferencer or API\n\nPlease choose a VLM as the Agent, we provide inferencers for 4 models in `model_workers`, they are: GPT-4V, LLaVA-1.5, CogAgent, and ScreenAgent. You can also implement an inferencer yourself or use a third-party API, you can refer to the code in `client\u002Finterface_api` to implement a new API call interface.\n\nPlease refer to the `llm_api` part in `client\u002Fconfig.yml` to prepare the configuration file, only keep one model under `llm_api`.\n\n```yaml\nllm_api:\n\n  # Select ONE of the following models to use:\n\n  GPT4V:\n    model_name: \"gpt-4-vision-preview\"\n    openai_api_key: \"\u003CYOUR-OPENAI-API-KEY>\"\n    target_url: \"https:\u002F\u002Fapi.openai.com\u002Fv1\u002Fchat\u002Fcompletions\"\n\n  LLaVA:\n    model_name: \"LLaVA-1.5\"\n    target_url: \"http:\u002F\u002Flocalhost:40000\u002Fworker_generate\"\n\n  CogAgent:\n    target_url: \"http:\u002F\u002Flocalhost:40000\u002Fworker_generate\"\n\n  ScreenAgent:\n    target_url: \"http:\u002F\u002Flocalhost:40000\u002Fworker_generate\"\n\n  # Common settings for all models\n  temperature: 1.0\n  top_p: 0.9\n  max_tokens: 500\n  \n```\n\n### If you use GPT-4V as the Agent\n\nPlease set `llm_api` to `GPT4V` in `client\u002Fconfig.yml`, and fill in your OpenAI API Key, please always pay attention to your account balance.\n\n### If you use LLaVA-1.5 as the Agent\n\nPlease refer to the [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA) project to download and prepare the LLaVA-1.5 model, for example:\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA.git\ncd LLaVA\nconda create -n llava python=3.10 -y\nconda activate llava\npip install --upgrade pip  # enable PEP 660 support\npip install -e .\n```\n\n`model_workers\u002Fllava_model_worker.py` provides a non-streaming output inferencer for LLaVA-1.5. You can copy it to `llava\u002Fserve\u002Fmodel_worker` and start it with the following command:\n\n```bash\ncd llava\npython -m llava.serve.llava_model_worker --host 0.0.0.0 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path liuhaotian\u002Fllava-v1.5-13b --no-register\n```\n\n### If using CogAgent as the Agent\n\nPlease refer to the [CogVLM](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVLM) project to download and prepare the CogAgent model. 
Download the sat version of the CogAgent weights `cogagent-chat.zip` from [here](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002FCogAgent\u002Ftree\u002Fmain), unzip it and place it in the `train\u002Fsaved_models\u002Fcogagent-chat` directory.\n`train\u002Fcogagent_model_worker.py` provides a non-streaming output inferencer for CogAgent. You can start it with the following command:\n\n```bash\ncd train\nRANK=0 WORLD_SIZE=1 LOCAL_RANK=0 python .\u002Fcogagent_model_worker.py --host 0.0.0.0  --port 40000 --from_pretrained \"saved_models\u002Fcogagent-chat\" --bf16 --max_length 2048\n```\n\n### If using ScreenAgent as the Agent\n\nScreenAgent is trained based on CogAgent. Download the sat format weight file `ScreenAgent-2312.zip` from [here](https:\u002F\u002Fhuggingface.co\u002Fniurl\u002FScreenAgent), unzip it and place it in `train\u002Fsaved_models\u002FScreenAgent-2312`. You can start it with the following command:\n\n```bash\ncd train\nRANK=0 WORLD_SIZE=1 LOCAL_RANK=0 python .\u002Fcogagent_model_worker.py --host 0.0.0.0  --port 40000 --from_pretrained \".\u002Fsaved_models\u002FScreenAgent-2312\" --bf16 --max_length 2048\n```\n\n# Run\nAfter the preparation is complete, you can run the controller:\n\n```bash\ncd client\npython run_controller.py -c config.yml\n```\n\nThe controller interface is as follows. You need to double-click to select a task from the left side first, then press the \"Start Automation\" button. The controller will automatically run according to the plan-action-reflection process. The controller will collect the current screen image, fill in the prompt word template, send the image and complete prompt words to the large model inferencer, parse the reply from the large model inferencer, send mouse and keyboard control commands to the VNC Server, and repeat the process.\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_f87f0f4f2c49.png\" alt=\"Controller\" width=\"70%\">\n  \u003Cp>\u003Ci>The controller interface\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n\nIf the screen is stuck, try pressing the \"Re-connect\" button. The controller will try to reconnect to the VNC Server.\n\n# Dataset\n\nAll datasets and dataset processing code are in the `data` directory. We used three existing datasets: COCO2014, Rico & widget-caption, and Mind2Web.\n\n## COCO Dataset\nWe used COCO 2014 validation images as the training dataset for visual positioning capabilities. You can download COCO 2014 train images from [here](https:\u002F\u002Fcocodataset.org\u002F#download). The annotation information we used here is refcoco, split by unc.\n\n```\n├── COCO\n   ├── prompts # Prompt word templates for training the Agent's visual positioning capabilities\n   ├── train2014 # COCO 2014 train\n   └── annotations # COCO 2014 annotations\n```\n\n## Rico & widget-caption Dataset\n\nRico is a dataset that contains a large number of screenshots and widget information of Android applications. You can download the \"1. UI Screenshots and View Hierarchies (6 GB)\" part of the Rico dataset from [here](http:\u002F\u002Fwww.interactionmining.org\u002Frico.html). The file name is `unique_uis.tar.gz`. Please put the unzipped `combined` folder in the `data\u002FRico` directory.\nwidget-caption is an annotation of widget information based on Rico. 
Please clone the `https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002Fwidget-caption` project under `data\u002FRico`.\nThe final directory structure is as follows:\n```\n├── Rico\n   ├── prompts # Prompt word templates for training the Agent's visual positioning capabilities\n   ├── combined # Rico dataset screenshots\n   └── widget-caption\n       ├── split\n       │   ├── dev.txt\n       │   ├── test.txt\n       │   └── train.txt\n       └── widget_captions.csv\n```\n\n## Mind2Web Dataset \n\n[Mind2Web](https:\u002F\u002Fosu-nlp-group.github.io\u002FMind2Web\u002F) is a real simulated web browsing dataset. You need to download the original dataset and process it. First, use the globus tool to download the original web screenshots from [here](https:\u002F\u002Fapp.globus.org\u002Ffile-manager?origin_id=32e6b738-a0b0-47f8-b475-26bf1c5ebf19). The folder name is `raw_dump`, placed in the `data\u002FMind2Web\u002Fraw_dump` directory, and then use the following command to process the dataset:\n\n```bash\ncd data\u002FMind2Web\npython convert_dataset.py\n```\n\nThis code will download the processed form of the `osunlp\u002FMind2Web` dataset from huggingface datasets. Please ensure that the network is smooth. This step will involve translating English instructions into Chinese instructions. You need to call your own translation API in `data\u002FMind2Web\u002Ftranslate.py`.\nThe directory structure is as follows:\n```\n├── Mind2Web\n   ├── convert_dataset.py\n   ├── translate.py\n   ├── prompts # Prompt word templates for training the Agent's web browsing capabilities\n   ├── raw_dump # Mind2Web raw_dump downloaded from globus\n   └── processed_dataset # Created by convert_dataset.py\n```\n\n## ScreenAgent Dataset\n\nScreenAgent is the dataset annotated in this paper, divided into training and testing sets, the directory structure is as follows:\n\n```\n├── data\n    ├── ScreenAgent\n        ├── train\n        │   ├── \u003Csession id>\n        │   │   ├── images\n        │   │   │   ├── \u003Ctimestamp-1>.jpg\n        │   │   │   └── ...\n        │   │   ├── \u003Ctimestamp-1>.json\n        │   │   └── ...\n        │   ├── ...\n        └── test\n```\n\nThe meaning of each field in the json file:\n- session_id: Session ID\n- task_prompt: Overall goal of the task\n- task_prompt_en: Overall goal of the task (En)\n- task_prompt_zh: Overall goal of the task (Zh)\n- send_prompt: Complete prompt sent to the model\n- send_prompt_en: Complete prompt sent to the model (En)\n- send_prompt_zh: Complete prompt sent to the model (Zh)\n- LLM_response: Original reply text given by the model, i.e., reject response in RLHF\n- LLM_response_editer: Reply text after manual correction, i.e., choice response in RLHF\n- LLM_response_editer_en: Reply text after manual correction (En)\n- LLM_response_editer_zh: Reply text after manual correction (Zh)\n- video_height, video_width: Height and width of the image\n- saved_image_name: Screenshot filename, under each session's images folder\n- actions: Action sequence parsed from LLM_response_editer\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_38a9ea9fbc06.png\" 
alt=\"Dataset Example\" width=\"100%\">\n  \u003Cp>\u003Ci>An Example in ScreenAgent Dataset\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n# Train ScreenAgent\nIf you want to train your own model, or reproduce the ScreenAgent model, please prepare the above datasets first, and check all dataset paths in the `train\u002Fdataset\u002Fmixture_dataset.py` file. If you only want to use part of the datasets or add new datasets, please modify the `make_supervised_data_module` function in `train\u002Fdataset\u002Fmixture_dataset.py`. Please download the sat version of the CogAgent weights `cogagent-chat.zip` from [here](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002FCogAgent\u002Ftree\u002Fmain), unzip it and place it in the `train\u002Fsaved_models\u002F` directory.\n\nYou need to pay attention to and check the following files:\n```\ntrain\n├── data -> ..\u002Fdata\n├── dataset\n│   └── mixture_dataset.py\n├── finetune_ScreenAgent.sh\n└── saved_models\n    └── cogagent-chat # unzip cogagent-chat.zip\n        ├── 1\n        │   └── mp_rank_00_model_states.pt\n        ├── latest\n        └── model_config.json\n```\n\nPlease modify the parameters in `train\u002Ffinetune_ScreenAgent.sh` according to your device situation, then run:\n```bash\ncd train\nbash finetune_ScreenAgent.sh\n```\n\nFinally, if you want to merge the weights of sat distributed training into a single weight file, please refer to the `train\u002Fmerge_model.sh` code. Make sure that the number of model parallelism `MP_SIZE` in this file is consistent with `WORLD_SIZE` in `train\u002Ffinetune_ScreenAgent.sh`. Modify the parameter after `--from-pretrained` to the checkpoint location stored during training. The merged weight file will be saved as the `train\u002Fsaved_models\u002Fmerged_model` folder.\n\n# TODO\n- [ ] Provide huggingface transformers weights.\n- [ ] Simplify the design of the controller, provide a no render mode.\n- [ ] Integrate Gym.\n- [ ] Add skill libraries to support more complex function calls.\n\n# Related Projects\n- Mobile-Agent: Autonomous Multi-Modal Mobile Device Agent with Visual Perception http:\u002F\u002Farxiv.org\u002Fabs\u002F2401.16158\n- UFO: A UI-Focused Agent for Windows OS Interaction http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07939\n- ScreenAI: A Vision-Language Model for UI and Infographics Understanding http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.04615\n- AppAgent: Multimodal Agents as Smartphone Users http:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13771\n- CogAgent: A Visual Language Model for GUI Agents http:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08914\n- Screen2Words: Automatic Mobile UI Summarization with Multimodal Learning http:\u002F\u002Farxiv.org\u002Fabs\u002F2108.03353\n- A Real-World WebAgent with Planning, Long Context Understanding, and Program Synthesis http:\u002F\u002Farxiv.org\u002Fabs\u002F2307.12856\n- Comprehensive Cognitive LLM Agent for Smartphone GUI Automation http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11941\n- Towards General Computer Control: A Multimodal Agent for Red Dead Redemption II as a Case Study https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.03186\n- Scaling Instructable Agents Across Many Simulated Worlds [link to technical report](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FDeepMind.com\u002FBlog\u002Fsima-generalist-ai-agent-for-3d-virtual-environments\u002FScaling%20Instructable%20Agents%20Across%20Many%20Simulated%20Worlds.pdf)\n- Android in the Zoo: Chain-of-Action-Thought for GUI Agents 
http:\u002F\u002Farxiv.org\u002Fabs\u002F2403.02713\n- OmniACT: A Dataset and Benchmark for Enabling Multimodal Generalist Autonomous Agents for Desktop and Web http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17553\n- Comprehensive Cognitive LLM Agent for Smartphone GUI Automation http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11941\n- Improving Language Understanding from Screenshots http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.14073\n- AndroidEnv: A Reinforcement Learning Platform for Android http:\u002F\u002Farxiv.org\u002Fabs\u002F2105.13231\n- SeeClick: Harnessing GUI Grounding for Advanced Visual GUI Agents https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10935\n- AgentStudio: A Toolkit for Building General Virtual Agents https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.17918\n- ReALM: Reference Resolution As Language Modeling https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.20329\n- AutoWebGLM: Bootstrap And Reinforce A Large Language Model-based Web Navigating Agent https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03648\n- Octopus v2: On-device language model for super agent https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.01744.pdf\n- Mobile-Env: An Evaluation Platform and Benchmark for LLM-GUI Interaction https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.08144\n- OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.07972\n- Ferret-UI: Grounded Mobile UI Understanding with Multimodal LLMs https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05719\n- VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05955\n- OS-Copilot: Towards Generalist Computer Agents with Self-Improvement https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07456\n- VisualWebBench: How Far Have Multimodal LLMs Evolved in Web Page Understanding and Grounding? 
https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05955\n- AutoDroid: LLM-powered Task Automation in Android https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.15272\n- Fine-Tuning Large Vision-Language Models as Decision-Making Agents via Reinforcement Learning https:\u002F\u002Frl4vlm.github.io\n- WebArena: A Realistic Web Environment for Building Autonomous Agents https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.13854\n- Synapse: Trajectory-as-Exemplar Prompting with Memory for Computer Control https:\u002F\u002Fopenreview.net\u002Fpdf?id=Pc8AU1aF5e\n- gpt-computer-assistant https:\u002F\u002Fgithub.com\u002Fonuratakan\u002Fgpt-computer-assistant\n- Mobile-Agent: The Powerful Mobile Device Operation Assistant Family https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FMobileAgent\n- GUI Odyssey: A Comprehensive Dataset for Cross-App GUI Navigation on Mobile Devices https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FGUI-Odyssey\n- ASSISTGUI: Task-Oriented PC Graphical User Interface Automation https:\u002F\u002Fshowlab.github.io\u002Fassistgui\u002F\n- You Only Look at Screens: Multimodal Chain-of-Action Agents https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.11436 https:\u002F\u002Fgithub.com\u002Fcooelf\u002FAuto-GUI\n- E-ANT: A Large-Scale Dataset for Efficient Automatic GUI NavigaTion https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.14250\n- AndroidWorld https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fandroid_world\n- AppWorld: A Controllable World of Apps and People for Benchmarking Interactive Coding Agents https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.18901 https:\u002F\u002Fappworld.dev\u002F\n- Read Anywhere Pointed: Layout-aware GUI Screen Reading with Tree-of-Lens Grounding https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.19263\n- Cradle: Empowering Foundation Agents Towards General Computer Control https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.03186\n- Tree Search for Language Model Agents https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.01476\n- LaVague: Web Agent framework for builders https:\u002F\u002Fgithub.com\u002Flavague-ai\u002FLaVague\n- WebLlama: https:\u002F\u002Fgithub.com\u002FMcGill-NLP\u002Fwebllama\n- AutoCrawler: A Progressive Understanding Web Agent for Web Crawler Generation https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12753\n\n# Citation\n\n```bib\n@article{niu2024screenagent,\n      title={ScreenAgent: A Vision Language Model-driven Computer Control Agent}, \n      author={Runliang Niu and Jindong Li and Shiqi Wang and Yali Fu and Xiyu Hu and Xueyuan Leng and He Kong and Yi Chang and Qi Wang},\n      year={2024},\n      eprint={2402.07945},\n      archivePrefix={arXiv},\n      primaryClass={cs.HC}\n}\n```\n\n# License\n```\nCode License: MIT\nDataset License: Apache-2.0\nModel License: The CogVLM License\n```\n","\u003Cp align=\"center\">\n\u003Ch1 align=\"center\"> ScreenAgent \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_69820c987aa4.png\" alt=\"ScreenAgent Logo\" width=\"30\"> ：由视觉语言大模型驱动的计算机控制代理\u003C\u002Fh1>\n\u003C\u002Fp>\n\n[查看 ScreenAgent 论文 arxiv:2402.07945](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07945)\n\n[中文版 Readme](README-zh.md)\n\n## 新闻\n- (2024-4-17) ScreenAgent 论文已被 IJCAI 2024 接收并将在会议上展示！\n- (2024-5-19) [ScreenAgent Web 客户端](https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgentWebClient) 发布，提供了一种更简便的方式，让用户通过大模型体验桌面控制。\n\n## \n我们构建了 ScreenAgent 项目，为视觉语言模型代理（VLM 
Agent）创建了一个与真实计算机屏幕交互的环境。在这个环境中，代理可以通过观察屏幕截图，并输出鼠标和键盘操作来操控图形用户界面。我们还设计了一套自动控制流程，包括规划、执行和反思三个阶段，引导代理持续与环境互动，完成多步骤任务。此外，我们还构建了 ScreenAgent 数据集，收集了完成各种日常计算机任务时的屏幕截图及操作序列。\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_4d88c7ff554a.png\" alt=\"Motivation\" width=\"50%\">\n  \u003Cp>\u003Ci>ScreenAgent 设计动机\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n为了引导 VLM 代理持续与计算机屏幕交互，我们构建了一个包含“规划-执行-反思”的流程。在规划阶段，代理需要将用户任务分解为子任务。在执行阶段，代理会观察屏幕截图，并给出具体的鼠标和键盘操作来执行这些子任务。控制器会执行这些操作，并将执行结果反馈给代理。在反思阶段，代理会观察执行结果，判断当前状态，并决定是继续执行、重试，还是调整计划。这一过程将持续进行，直到任务完成。\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_ce225118995b.png\" alt=\"Running process\" width=\"100%\">\n  \u003Cp>\u003Ci>运行流程\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n我们参考了 VNC 远程桌面连接协议来设计代理的动作空间，这些动作都是最基本的鼠标和键盘操作。大多数鼠标点击操作都需要代理给出精确的屏幕坐标位置。与调用特定 API 来完成任务相比，这种方法更具通用性，可以应用于各种桌面操作系统和应用程序。\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_0c5bf3ebe7ac.png\" alt=\"Action Space\" width=\"50%\">\n  \u003Cp>\u003Ci>支持的动作类型及动作属性\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n教会代理使用计算机并非易事。这要求代理具备任务规划、图像理解、视觉定位和工具使用等多种综合能力。为此，我们手动标注了 ScreenAgent 数据集。该数据集涵盖了多种日常计算机任务，包括文件操作、网页浏览、游戏娱乐等场景。我们按照上述“规划-执行-反思”流程构建了一个个会话。\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_6604f907d56a.png\" alt=\"Dataset Task Type Distribution\" width=\"50%\">\n  \u003Cp>\u003Ci>数据集任务类型分布\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n该项目主要包括以下几个部分：\n```\nScreenAgent\n├── client # 控制器客户端代码\n│   ├── prompt # 提示模板\n│   ├── config.yml # 控制器客户端配置文件模板\n│   └── tasks.txt # 任务列表\n├── data # 包含 ScreenAgent 数据集及其他视觉定位相关数据集\n├── model_workers # VLM 推理服务\n└── train # 模型训练代码\n```\n\n# 准备工作\n\n## 第一步，准备待控制的桌面\n首先，您需要准备一台安装了 VNC 服务器的待控制桌面操作系统，例如 [TightVNC](https:\u002F\u002Fwww.tightvnc.com\u002Fdownload.php)。或者您也可以使用带有 GUI 的 Docker 容器。我们已经准备了一个名为 `niuniushan\u002Fscreenagent-env` 的容器。您可以使用以下命令拉取并启动该容器：\n\n```bash\ndocker run -d --name ScreenAgent -e RESOLUTION=1024x768 -p 5900:5900 -p 8001:8001 -e VNC_PASSWORD=\u003CVNC_PASSWORD> -e CLIPBOARD_SERVER_SECRET_TOKEN=\u003CCLIPBOARD_SERVER_SECRET_TOKEN> -v \u002Fdev\u002Fshm:\u002Fdev\u002Fshm niuniushan\u002Fscreenagent-env:latest\n```\n\n请将 `\u003CVNC_PASSWORD>` 替换为您新的 VNC 密码，将 `\u003CCLIPBOARD_SERVER_SECRET_TOKEN>` 替换为您剪贴板服务的密码。由于输入较长文本或 Unicode 字符依赖于剪贴板，如果未启用剪贴板服务，则只能通过依次按键输入 ASCII 字符串，而无法输入中文及其他 Unicode 字符。本镜像已内置剪贴板服务，默认监听 8001 端口。您需要设置密码以保护您的剪贴板服务。`niuniushan\u002Fscreenagent-env` 是基于 `fcwu\u002Fdocker-ubuntu-vnc-desktop` 构建的。有关该镜像的更多信息，请参阅 [这里](https:\u002F\u002Fgithub.com\u002Ffcwu\u002Fdocker-ubuntu-vnc-desktop)。\n\n如果您希望使用现有的桌面环境，如 Windows、Linux 桌面或其他任何桌面环境，您需要运行任意 VNC 服务器，并记下其 IP 地址和端口号。若要启用剪贴板服务，请在您的桌面环境中执行以下步骤：\n\n```bash\n# 安装依赖\npip install fastapi pydantic uvicorn pyperclip \n# 设置环境变量中的密码\nexport CLIPBOARD_SERVER_SECRET_TOKEN=\u003CCLIPBOARD_SERVER_SECRET_TOKEN>\n# 启动剪贴板服务\npython client\u002Fclipboard_server.py\n```\n\n`clipboard_server.py` 将监听 8001 端口，接收来自控制器的长字符串键盘输入指令（文本形式）。\n\n服务启动后，您可以测试剪贴板是否正常工作，例如：\n\n```bash\ncurl --location 'http:\u002F\u002Flocalhost:8001\u002Fclipboard' \\\n--header 'Content-Type: application\u002Fjson' \\\n--data '{\n    \"text\":\"Hello world\",\n    \"token\":\"\u003CCLIPBOARD_SERVER_SECRET_TOKEN>\"\n}'\n```\n\n如果一切正常，您将收到响应 `{\"success\": True, \"message\": 
\"Text copied to clipboard\"}`。如果遇到错误“Pyperclip could not find a copy\u002Fpaste mechanism for your system.”，请在运行 `python client\u002Fclipboard_server.py` 前添加一个指定 X 服务器位置的环境变量：\n\n```bash\nexport DISPLAY=:0.0\n```\n\n请根据您的系统环境进行相应调整。如果仍然遇到问题，请参考 [这里](https:\u002F\u002Fpyperclip.readthedocs.io\u002Fen\u002Flatest\u002Fintroduction.html#not-implemented-error)。\n\n请将上述信息填写到客户端配置文件 `client\u002Fconfig.yml` 中的 `remote_vnc_server` 项中。\n\n## 第2步，准备控制器代码的运行环境\n\n您需要运行控制器代码，它具有三项任务：首先，控制器将连接到VNC服务器，采集屏幕截图，并发送鼠标和键盘等命令；其次，控制器内部维护一个状态机，实现规划、行动和反思的自动化控制流程，引导智能体与环境持续交互；最后，控制器会根据提示词模板构建完整的提示信息，将其发送至大模型推理API，并解析大模型生成回复中的控制命令。控制器是一个基于PyQt5的程序，您需要安装一些依赖项：\n\n```bash\npip install -r client\u002Frequirements.txt\n```\n\n## 第3步，准备大模型推理器或API\n\n请为智能体选择一个视觉语言模型（VLM）。我们在`model_workers`中提供了4种模型的推理器，分别是：GPT-4V、LLaVA-1.5、CogAgent和ScreenAgent。您也可以自行实现推理器，或使用第三方API。您可以参考`client\u002Finterface_api`中的代码来实现新的API调用接口。\n\n请参照`client\u002Fconfig.yml`中的`llm_api`部分准备配置文件，仅保留`llm_api`下的一个模型。\n\n```yaml\nllm_api:\n\n  # 从以下模型中选择一个使用：\n\n  GPT4V:\n    model_name: \"gpt-4-vision-preview\"\n    openai_api_key: \"\u003CYOUR-OPENAI-API-KEY>\"\n    target_url: \"https:\u002F\u002Fapi.openai.com\u002Fv1\u002Fchat\u002Fcompletions\"\n\n  LLaVA:\n    model_name: \"LLaVA-1.5\"\n    target_url: \"http:\u002F\u002Flocalhost:40000\u002Fworker_generate\"\n\n  CogAgent:\n    target_url: \"http:\u002F\u002Flocalhost:40000\u002Fworker_generate\"\n\n  ScreenAgent:\n    target_url: \"http:\u002F\u002Flocalhost:40000\u002Fworker_generate\"\n\n  # 所有模型的通用设置\n  temperature: 1.0\n  top_p: 0.9\n  max_tokens: 500\n  \n```\n\n### 如果您使用GPT-4V作为智能体\n\n请在`client\u002Fconfig.yml`中将`llm_api`设置为`GPT4V`，并填写您的OpenAI API Key，请务必注意您的账户余额。\n\n### 如果您使用LLaVA-1.5作为智能体\n\n请参考[LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA)项目下载并准备LLaVA-1.5模型，例如：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA.git\ncd LLaVA\nconda create -n llava python=3.10 -y\nconda activate llava\npip install --upgrade pip  # 启用PEP 660支持\npip install -e .\n```\n\n`model_workers\u002Fllava_model_worker.py`提供了一个用于LLaVA-1.5的非流式输出推理器。您可以将其复制到`llava\u002Fserve\u002Fmodel_worker`目录下，并使用以下命令启动：\n\n```bash\ncd llava\npython -m llava.serve.llava_model_worker --host 0.0.0.0 --port 40000 --worker http:\u002F\u002Flocalhost:40000 --model-path liuhaotian\u002Fllava-v1.5-13b --no-register\n```\n\n### 如果您使用CogAgent作为智能体\n\n请参考[CogVLM](https:\u002F\u002Fgithub.com\u002FTHUDM\u002FCogVLM)项目下载并准备CogAgent模型。从[这里](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002FCogAgent\u002Ftree\u002Fmain)下载CogAgent权重的压缩版`cogagent-chat.zip`，解压后放置在`train\u002Fsaved_models\u002Fcogagent-chat`目录下。\n`train\u002Fcogagent_model_worker.py`提供了一个用于CogAgent的非流式输出推理器。您可以使用以下命令启动它：\n\n```bash\ncd train\nRANK=0 WORLD_SIZE=1 LOCAL_RANK=0 python .\u002Fcogagent_model_worker.py --host 0.0.0.0  --port 40000 --from_pretrained \"saved_models\u002Fcogagent-chat\" --bf16 --max_length 2048\n```\n\n### 如果您使用ScreenAgent作为智能体\n\nScreenAgent是在CogAgent基础上训练的。从[这里](https:\u002F\u002Fhuggingface.co\u002Fniurl\u002FScreenAgent)下载`ScreenAgent-2312.zip`格式的权重文件，解压后放置在`train\u002Fsaved_models\u002FScreenAgent-2312`目录下。您可以使用以下命令启动它：\n\n```bash\ncd train\nRANK=0 WORLD_SIZE=1 LOCAL_RANK=0 python .\u002Fcogagent_model_worker.py --host 0.0.0.0  --port 40000 --from_pretrained \".\u002Fsaved_models\u002FScreenAgent-2312\" --bf16 --max_length 2048\n```\n\n# 运行\n\n准备工作完成后，您可以运行控制器：\n\n```bash\ncd client\npython run_controller.py -c 
config.yml\n```\n\n控制器界面如下所示。您需要先双击左侧选择一个任务，然后按下“开始自动化”按钮。控制器将按照规划-行动-反思的流程自动运行。它会采集当前屏幕图像，填充提示词模板，将图像和完整提示词发送到大模型推理器，解析大模型的回复，向VNC服务器发送鼠标和键盘控制命令，并重复这一过程。\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_f87f0f4f2c49.png\" alt=\"控制器\" width=\"70%\">\n  \u003Cp>\u003Ci>控制器界面\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n如果屏幕卡住，可以尝试点击“重新连接”按钮。控制器将尝试重新连接到VNC服务器。\n\n# 数据集\n\n所有数据集及数据处理代码均位于`data`目录下。我们使用了三个现有的数据集：COCO2014、Rico & widget-caption以及Mind2Web。\n\n## COCO数据集\n\n我们使用COCO 2014验证集图片作为视觉定位能力的训练数据。您可以从[这里](https:\u002F\u002Fcocodataset.org\u002F#download)下载COCO 2014训练集图片。我们在此使用的标注信息是refcoco，由unc划分。\n\n```\n├── COCO\n   ├── prompts # 用于训练智能体视觉定位能力的提示词模板\n   ├── train2014 # COCO 2014训练集\n   └── annotations # COCO 2014标注\n```\n\n## Rico & widget-caption数据集\n\nRico数据集包含大量Android应用的屏幕截图和控件信息。您可以从[这里](http:\u002F\u002Fwww.interactionmining.org\u002Frico.html)下载Rico数据集的“1. UI Screenshots and View Hierarchies (6 GB)”部分，文件名为`unique_uis.tar.gz`。请将解压后的`combined`文件夹放入`data\u002FRico`目录下。\nwidget-caption是基于Rico的控件信息标注。请将`https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002Fwidget-caption`项目克隆到`data\u002FRico`目录下。\n最终的目录结构如下：\n```\n├── Rico\n   ├── prompts # 用于训练智能体视觉定位能力的提示词模板\n   ├── combined # Rico数据集的屏幕截图\n   └── widget-caption\n       ├── split\n       │   ├── dev.txt\n       │   ├── test.txt\n       │   └── train.txt\n       └── widget_captions.csv\n```\n\n## Mind2Web 数据集\n\n[Mind2Web](https:\u002F\u002Fosu-nlp-group.github.io\u002FMind2Web\u002F) 是一个真实的模拟网页浏览数据集。您需要下载原始数据集并进行处理。首先，使用 Globus 工具从[这里](https:\u002F\u002Fapp.globus.org\u002Ffile-manager?origin_id=32e6b738-a0b0-47f8-b475-26bf1c5ebf19)下载原始网页截图。文件夹名为 `raw_dump`，放置在 `data\u002FMind2Web\u002Fraw_dump` 目录下，然后使用以下命令处理数据集：\n\n```bash\ncd data\u002FMind2Web\npython convert_dataset.py\n```\n\n这段代码会从 Hugging Face 数据集中下载 `osunlp\u002FMind2Web` 数据集的处理后版本。请确保网络畅通。这一步骤将涉及将英文指令翻译成中文指令。您需要在 `data\u002FMind2Web\u002Ftranslate.py` 中调用您自己的翻译 API。\n目录结构如下：\n```\n├── Mind2Web\n   ├── convert_dataset.py\n   ├── translate.py\n   ├── prompts # 用于训练 Agent 网页浏览能力的提示词模板\n   ├── raw_dump # 从 Globus 下载的 Mind2Web 原始数据\n   └── processed_dataset # 由 convert_dataset.py 创建\n```\n\n## ScreenAgent 数据集\n\nScreenAgent 是本文中标注的数据集，分为训练集和测试集，目录结构如下：\n\n```\n├── data\n    ├── ScreenAgent\n        ├── train\n        │   ├── \u003Csession id>\n        │   │   ├── images\n        │   │   │   ├── \u003Ctimestamp-1>.jpg\n        │   │   │   └── ...\n        │   │   ├── \u003Ctimestamp-1>.json\n        │   │   └── ...\n        │   ├── ...\n        └── test\n```\n\nJSON 文件中各字段的含义：\n- session_id：会话 ID\n- task_prompt：任务总体目标\n- task_prompt_en：任务总体目标（英文）\n- task_prompt_zh：任务总体目标（中文）\n- send_prompt：发送给模型的完整提示\n- send_prompt_en：发送给模型的完整提示（英文）\n- send_prompt_zh：发送给模型的完整提示（中文）\n- LLM_response：模型给出的原始回复文本，即 RLHF 中的拒绝响应\n- LLM_response_editer：人工修正后的回复文本，即 RLHF 中的选择响应\n- LLM_response_editer_en：人工修正后的回复文本（英文）\n- LLM_response_editer_zh：人工修正后的回复文本（中文）\n- video_height, video_width：图像的高度和宽度\n- saved_image_name：截图文件名，位于每个会话的 images 文件夹下\n- actions：从 LLM_response_editer 中解析出的动作序列\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_readme_38a9ea9fbc06.png\" alt=\"Dataset Example\" width=\"100%\">\n  \u003Cp>\u003Ci>ScreenAgent 
数据集中的一例\u003C\u002Fi>\u003C\u002Fp>\n\u003C\u002Fdiv>\n\n# 训练 ScreenAgent\n如果您想训练自己的模型，或复现 ScreenAgent 模型，请先准备好上述数据集，并检查 `train\u002Fdataset\u002Fmixture_dataset.py` 文件中的所有数据集路径。如果您只想使用部分数据集或添加新的数据集，请修改 `train\u002Fdataset\u002Fmixture_dataset.py` 中的 `make_supervised_data_module` 函数。请从[这里](https:\u002F\u002Fhuggingface.co\u002FTHUDM\u002FCogAgent\u002Ftree\u002Fmain)下载 CogAgent 权重的单机版 `cogagent-chat.zip`，解压后放置在 `train\u002Fsaved_models\u002F` 目录下。\n\n您需要关注并检查以下文件：\n```\ntrain\n├── data -> ..\u002Fdata\n├── dataset\n│   └── mixture_dataset.py\n├── finetune_ScreenAgent.sh\n└── saved_models\n    └── cogagent-chat # 解压 cogagent-chat.zip\n        ├── 1\n        │   └── mp_rank_00_model_states.pt\n        ├── latest\n        └── model_config.json\n```\n\n请根据您的设备情况修改 `train\u002Ffinetune_ScreenAgent.sh` 中的参数，然后运行：\n```bash\ncd train\nbash finetune_ScreenAgent.sh\n```\n\n最后，如果您想将单机分布式训练的权重合并为一个单独的权重文件，请参考 `train\u002Fmerge_model.sh` 代码。请确保该文件中的模型并行度 `MP_SIZE` 与 `train\u002Ffinetune_ScreenAgent.sh` 中的 `WORLD_SIZE` 一致。将 `--from-pretrained` 后的参数修改为训练过程中保存的检查点位置。合并后的权重文件将保存为 `train\u002Fsaved_models\u002Fmerged_model` 文件夹。\n\n# 待办事项\n- [ ] 提供 Hugging Face Transformers 权重。\n- [ ] 简化控制器设计，提供无渲染模式。\n- [ ] 集成 Gym。\n- [ ] 添加技能库，以支持更复杂的函数调用。\n\n# 相关项目\n- Mobile-Agent：具有视觉感知能力的自主多模态移动设备代理 http:\u002F\u002Farxiv.org\u002Fabs\u002F2401.16158\n- UFO：面向Windows操作系统交互的UI导向型代理 http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07939\n- ScreenAI：用于理解UI和信息图表的视觉语言模型 http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.04615\n- AppAgent：作为智能手机用户的多模态代理 http:\u002F\u002Farxiv.org\u002Fabs\u002F2312.13771\n- CogAgent：用于GUI代理的视觉语言模型 http:\u002F\u002Farxiv.org\u002Fabs\u002F2312.08914\n- Screen2Words：基于多模态学习的自动移动UI摘要 http:\u002F\u002Farxiv.org\u002Fabs\u002F2108.03353\n- 具有规划、长上下文理解和程序合成能力的真实世界Web代理 http:\u002F\u002Farxiv.org\u002Fabs\u002F2307.12856\n- 用于智能手机GUI自动化的综合认知LLM代理 http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11941\n- 通往通用计算机控制之路：以《荒野大镖客2》为例的多模态代理 https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.03186\n- 在多个模拟世界中扩展可指令代理 [技术报告链接](https:\u002F\u002Fstorage.googleapis.com\u002Fdeepmind-media\u002FDeepMind.com\u002FBlog\u002Fsima-generalist-ai-agent-for-3d-virtual-environments\u002FScaling%20Instructable%20Agents%20Across%20Many%20Simulated%20Worlds.pdf)\n- Android in the Zoo：用于GUI代理的动作链思维 http:\u002F\u002Farxiv.org\u002Fabs\u002F2403.02713\n- OmniACT：一个数据集和基准，旨在支持桌面和网页上的多模态通用自主代理 http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.17553\n- 用于智能手机GUI自动化的综合认知LLM代理 http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11941\n- 从截图中提升语言理解能力 http:\u002F\u002Farxiv.org\u002Fabs\u002F2402.14073\n- AndroidEnv：一个针对Android的强化学习平台 http:\u002F\u002Farxiv.org\u002Fabs\u002F2105.13231\n- SeeClick：利用GUI接地机制构建先进的视觉GUI代理 https:\u002F\u002Farxiv.org\u002Fabs\u002F2401.10935\n- AgentStudio：用于构建通用虚拟代理的工具包 https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.17918\n- ReALM：将指代消解视为语言建模 https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.20329\n- AutoWebGLM：自举并强化基于大型语言模型的网页导航代理 https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03648\n- Octopus v2：用于超级代理的设备端语言模型 https:\u002F\u002Farxiv.org\u002Fpdf\u002F2404.01744.pdf\n- Mobile-Env：一个用于评估LLM与GUI交互的平台和基准 https:\u002F\u002Farxiv.org\u002Fabs\u002F2305.08144\n- OSWorld：在真实计算机环境中对多模态代理进行开放式任务的基准测试 https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.07972\n- Ferret-UI：利用多模态LLM实现 grounded移动UI理解 https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05719\n- VisualWebBench：多模态LLM在网页理解和接地方面已经发展到何种程度？ https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05955\n- OS-Copilot：迈向具备自我改进能力的通用计算机代理 https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.07456\n- 
VisualWebBench：多模态LLM在网页理解和接地方面已经发展到何种程度？ https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.05955\n- AutoDroid：基于LLM的Android任务自动化 https:\u002F\u002Farxiv.org\u002Fabs\u002F2308.15272\n- 通过强化学习微调大型视觉语言模型，使其成为决策代理 https:\u002F\u002Frl4vlm.github.io\n- WebArena：一个用于构建自主代理的真实网络环境 https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.13854\n- Synapse：基于轨迹示例提示的记忆增强型计算机控制方法 https:\u002F\u002Fopenreview.net\u002Fpdf?id=Pc8AU1aF5e\n- gpt-computer-assistant https:\u002F\u002Fgithub.com\u002Fonuratakan\u002Fgpt-computer-assistant\n- Mobile-Agent：强大的移动设备操作助手系列 https:\u002F\u002Fgithub.com\u002FX-PLUG\u002FMobileAgent\n- GUI Odyssey：一个用于移动设备跨应用GUI导航的综合数据集 https:\u002F\u002Fgithub.com\u002FOpenGVLab\u002FGUI-Odyssey\n- ASSISTGUI：面向任务的PC图形用户界面自动化 https:\u002F\u002Fshowlab.github.io\u002Fassistgui\u002F\n- You Only Look at Screens：多模态动作链代理 https:\u002F\u002Farxiv.org\u002Fabs\u002F2309.11436 https:\u002F\u002Fgithub.com\u002Fcooelf\u002FAuto-GUI\n- E-ANT：一个用于高效自动GUI导航的大规模数据集 https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.14250\n- AndroidWorld https:\u002F\u002Fgithub.com\u002Fgoogle-research\u002Fandroid_world\n- AppWorld：一个可控的应用和人物世界，用于基准测试交互式编码代理 https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.18901 https:\u002F\u002Fappworld.dev\u002F\n- Read Anywhere Pointed：基于树状透镜接地的布局感知GUI屏幕阅读 https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.19263\n- Cradle：赋能基础代理迈向通用计算机控制 https:\u002F\u002Farxiv.org\u002Fabs\u002F2403.03186\n- 针对语言模型代理的树搜索 https:\u002F\u002Farxiv.org\u002Fabs\u002F2407.01476\n- LaVague：面向开发者的Web代理框架 https:\u002F\u002Fgithub.com\u002Flavague-ai\u002FLaVague\n- WebLlama：https:\u002F\u002Fgithub.com\u002FMcGill-NLP\u002Fwebllama\n- AutoCrawler：一种渐进式理解型Web代理，用于生成网络爬虫 https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.12753\n\n# 引用\n\n```bib\n@article{niu2024screenagent,\n      title={ScreenAgent：一种由视觉语言模型驱动的计算机控制代理}, \n      author={牛润良、李金东、王世奇、付亚丽、胡西宇、冷雪源、孔赫、常毅、王琪},\n      year={2024},\n      eprint={2402.07945},\n      archivePrefix={arXiv},\n      primaryClass={cs.HC}\n}\n```\n\n# 许可证\n```\n代码许可证：MIT\n数据集许可证：Apache-2.0\n模型许可证：CogVLM许可证\n```","# ScreenAgent 快速上手指南\n\nScreenAgent 是一个由视觉语言大模型（VLM）驱动的计算机控制智能体。它能够通过观察屏幕截图，自动规划并执行鼠标和键盘操作，从而完成多步骤的桌面任务。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**：Linux, Windows, 或 macOS（作为控制器端）\n- **被控端环境**：\n  - 安装了 VNC Server 的桌面系统（如 TightVNC），或\n  - 支持 GUI 的 Docker 容器\n- **Python 版本**：推荐 Python 3.8+\n\n### 前置依赖\n- **Docker**（可选，用于快速部署被控端环境）\n- **Git**\n- **PyQt5** 及相关 Python 库\n\n---\n\n## 安装步骤\n\n### 第一步：准备被控桌面环境\n\n你可以选择使用现有的桌面系统，或使用官方提供的 Docker 镜像快速启动一个包含 VNC 和剪贴板服务的环境。\n\n#### 方案 A：使用 Docker（推荐）\n拉取并启动预置环境的容器：\n\n```bash\ndocker run -d --name ScreenAgent -e RESOLUTION=1024x768 -p 5900:5900 -p 8001:8001 -e VNC_PASSWORD=\u003CVNC_PASSWORD> -e CLIPBOARD_SERVER_SECRET_TOKEN=\u003CCLIPBOARD_SERVER_SECRET_TOKEN> -v \u002Fdev\u002Fshm:\u002Fdev\u002Fshm niuniushan\u002Fscreenagent-env:latest\n```\n\n> **注意**：请将 `\u003CVNC_PASSWORD>` 替换为你的 VNC 密码，`\u003CCLIPBOARD_SERVER_SECRET_TOKEN>` 替换为剪贴板服务密码。启用剪贴板服务对于输入中文或非 ASCII 字符至关重要。\n\n#### 方案 B：使用现有桌面\n1. 在目标机器上安装并运行任意 VNC Server，记录其 IP 和端口。\n2. 
若需支持长文本或中文输入，需在目标机器上安装依赖并启动剪贴板服务：\n\n```bash\npip install fastapi pydantic uvicorn pyperclip \nexport CLIPBOARD_SERVER_SECRET_TOKEN=\u003CCLIPBOARD_SERVER_SECRET_TOKEN>\n# 若遇到 Pyperclip 错误，可能需要设置 DISPLAY 变量\nexport DISPLAY=:0.0\npython client\u002Fclipboard_server.py\n```\n\n### 第二步：安装控制器依赖\n\n克隆项目代码并安装控制器所需的 Python 依赖：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgent.git\ncd ScreenAgent\npip install -r client\u002Frequirements.txt\n```\n\n### 第三步：配置大模型接口\n\n编辑 `client\u002Fconfig.yml` 文件，配置你要使用的视觉语言模型（VLM）。目前支持 GPT-4V, LLaVA-1.5, CogAgent 和 ScreenAgent。\n\n#### 选项 1：使用 GPT-4V\n在 `config.yml` 中保留 `GPT4V` 部分，填入你的 API Key：\n\n```yaml\nllm_api:\n  GPT4V:\n    model_name: \"gpt-4-vision-preview\"\n    openai_api_key: \"\u003CYOUR-OPENAI-API-KEY>\"\n    target_url: \"https:\u002F\u002Fapi.openai.com\u002Fv1\u002Fchat\u002Fcompletions\"\n  # 其他模型请注释掉\n```\n\n#### 选项 2：使用本地模型 (LLaVA \u002F CogAgent \u002F ScreenAgent)\n你需要先启动本地模型推理服务（监听端口 40000），然后在 `config.yml` 中将 `target_url` 指向 `http:\u002F\u002Flocalhost:40000\u002Fworker_generate` 并仅保留对应模型配置。\n\n*以启动 ScreenAgent 模型为例：*\n1. 下载权重文件 `ScreenAgent-2312.zip` 并解压至 `train\u002Fsaved_models\u002FScreenAgent-2312`。\n2. 启动推理服务：\n\n```bash\ncd train\nRANK=0 WORLD_SIZE=1 LOCAL_RANK=0 python .\u002Fcogagent_model_worker.py --host 0.0.0.0 --port 40000 --from_pretrained \".\u002Fsaved_models\u002FScreenAgent-2312\" --bf16 --max_length 2048\n```\n\n3. 修改 `client\u002Fconfig.yml`，仅保留 `ScreenAgent` 配置项。\n\n---\n\n## 基本使用\n\n完成上述配置后，即可启动控制器进行自动化任务演示。\n\n1. **启动控制器**：\n   进入 client 目录并运行：\n\n   ```bash\n   cd client\n   python run_controller.py -c config.yml\n   ```\n\n2. **执行任务**：\n   - 程序启动后会弹出图形界面。\n   - 在界面左侧列表中**双击**选择一个预设任务（如文件操作、网页浏览等）。\n   - 点击 **\"Start Automation\"** 按钮。\n\n3. **运行过程**：\n   控制器将自动执行“规划 - 执行 - 反思”循环：\n   - 截取当前屏幕画面。\n   - 结合提示词模板发送给大模型。\n   - 解析大模型返回的鼠标\u002F键盘指令并发送至 VNC Server。\n   - 根据执行结果判断是否继续、重试或调整计划，直到任务完成。\n\n> **提示**：如果界面卡住或连接断开，可尝试点击界面上的 **\"Re-connect\"** 按钮重新连接 VNC 服务。","一位数据分析师需要在每天早晨从多个内部网页系统中抓取最新销售数据，整理成 Excel 报表并发送给团队，这一过程涉及频繁的跨应用操作。\n\n### 没有 ScreenAgent 时\n- **重复劳动耗时**：分析师每天需手动打开浏览器、登录三个不同系统、复制粘贴数据，耗时约 45 分钟，极易因疲劳产生复制错误。\n- **流程僵化难维护**：若使用传统 RPA 脚本，一旦网页按钮位置微调或弹出广告遮挡，脚本即刻失效，需要开发人员重新编写定位代码。\n- **异常处理缺失**：遇到系统加载缓慢或验证码弹窗时，自动化脚本无法判断状态，只能报错停止，仍需人工介入排查。\n- **跨应用协同困难**：将网页数据转入 Excel 并进行格式调整涉及鼠标拖拽和快捷键组合，传统 API 方案难以模拟这些图形界面操作。\n\n### 使用 ScreenAgent 后\n- **自然语言驱动自动化**：分析师只需输入“抓取昨日销售数据并生成报表”，ScreenAgent 即可基于视觉大模型自主规划步骤，将准备时间缩短至 2 分钟。\n- **视觉自适应能力强**：依托屏幕截图识别，即使网页布局微调或有临时弹窗，ScreenAgent 也能像人类一样“看见”并调整点击坐标，无需重写代码。\n- **智能反思与纠错**：在执行中若发现数据未加载完成，ScreenAgent 会通过“反思”机制自动等待或刷新页面，确保持续运行直至任务完成。\n- **通用键鼠模拟**：直接模拟真实的鼠标拖拽、右键菜单和键盘快捷键，无缝衔接浏览器与 Excel 软件，完美复现人工操作流程。\n\nScreenAgent 通过将视觉理解与自主决策结合，把繁琐的跨应用手工操作转化为可自我修正的智能工作流，真正实现了“所见即所得”的电脑自动化控制。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fniuzaisheng_ScreenAgent_e6994a4f.png","niuzaisheng","R.L. Niu","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fniuzaisheng_74c7c575.png","Ph.D. 
student at the School of Artificial Intelligence, Jilin University.","Jilin University","Changchun Jilin China","niu_rl@outlook.com",null,"https:\u002F\u002Fgithub.com\u002Fniuzaisheng",[83,87,91],{"name":84,"color":85,"percentage":86},"Python","#3572A5",99.3,{"name":88,"color":89,"percentage":90},"Shell","#89e051",0.6,{"name":92,"color":93,"percentage":94},"Jinja","#a52a22",0,592,64,"2026-04-19T01:37:42","NOASSERTION",4,"Linux, Windows, macOS","运行本地模型（LLaVA, CogAgent, ScreenAgent）时需要 NVIDIA GPU，需支持 bf16 精度；使用 GPT-4V API 则无需本地 GPU。具体显存未说明，建议 16GB+ 以运行 13B 参数模型。","未说明",{"notes":104,"python":105,"dependencies":106},"1. 核心架构基于 VNC 协议，被控端需安装 VNC Server（如 TightVNC）或使用提供的 Docker 容器。\n2. 若需输入长文本或非 ASCII 字符（如中文），被控端必须运行额外的剪贴板服务（clipboard_server.py）。\n3. 支持多种大模型后端：GPT-4V（云端 API）、LLaVA-1.5、CogAgent 及 ScreenAgent（本地部署）。\n4. 本地部署 CogAgent 或 ScreenAgent 时需安装 SwissArmyTransformer (sat) 库。\n5. 控制器界面基于 PyQt5 开发。","3.10 (LLaVA 环境明确要求，其他未明确但建议一致)",[107,108,109,110,111,112,113,114],"PyQt5","fastapi","pydantic","uvicorn","pyperclip","torch","transformers","sat",[13,14,15,36],[117,118,119,120],"agent","ai","llm","vlm","2026-03-27T02:49:30.150509","2026-04-20T19:32:33.606634",[124,129,134,139,144,149,154,159],{"id":125,"question_zh":126,"answer_zh":127,"source_url":128},45463,"ScreenAgent 模型权重在哪里下载？","模型权重已开源，可以直接从 Hugging Face 下载：https:\u002F\u002Fhuggingface.co\u002Fniurl\u002FScreenAgent","https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgent\u002Fissues\u002F10",{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},45464,"为什么微调后的模型鼠标定位（mouse_position）不准确？对输入图片尺寸有什么要求？","视觉定位是难点，因为基础模型多使用自然图像训练，缺乏 GUI 样本。关于尺寸问题：上游模型复用了 Blip 的 image_processor，会将图像缩放到 256x256（EVA-CLIP Encoder）和 1120x1120（Cross Vision Encoder），导致与标注的 grounding 坐标不对应。解决方案是先将图片 padding 到最大尺寸（如 1120x1120）再输入。如果位置仍有较大差距，建议针对特定场景收集数据进行更针对性的训练。","https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgent\u002Fissues\u002F3",{"id":135,"question_zh":136,"answer_zh":137,"source_url":138},45465,"数据集或代码中出现的 \"up\" 和 \"down\" 动作类型是什么意思？","根据 VNC 协议定义，ScreenAgent 中的操作由更细粒度的动作组成：\n1. 鼠标点击（Click）：由 Move（移动）-> Down（按下）-> Up（抬起）三个步骤组成。\n2. 鼠标拖拽（Drag）：由 Down -> Move -> Up 组成。\n3. 
键盘按键：同样由 Down（按下）和 Up（抬起）先后执行完成。\n具体实现可参考 `client\u002Faction.py` 文件。","https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgent\u002Fissues\u002F7",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},45466,"如何在 Docker 或 Linux 环境中解决剪贴板报错 \"No such file or directory: 'clip.exe'\"？","这是 pyperclip 库在非 Windows 环境下尝试调用 Windows 命令导致的问题。解决方法是检查 `clipboard_server` 代码（通常在第 22 行左右），注释掉或修改相关的检查语句。该问题与 pyperclip 在某些环境下的兼容性有关，参考 pyperclip 官方 issue #194 可获得更多细节。如果是在 Docker 中，确保安装了必要的剪贴板工具（如 xclip），但通常直接修改代码绕过对该执行的依赖更为有效。","https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgent\u002Fissues\u002F29",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},45467,"是否支持无头模式（Headless Mode）运行？","目前官方计划支持无头模式以用于强化学习训练，但由于当前 Agent 在无干预下完成任务的成功率较低，收集到的多为负样本，因此优先级不高。临时解决方案是参考 ScreenAgentWebClient 项目，使用 JavaScript 自行实现无头模式，这比在 Python 中实现更容易。","https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgent\u002Fissues\u002F27",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},45468,"可以通过 VSCode SSH 远程连接服务器运行代码吗？","目前开源版本基于 QT 开发，远程渲染图形界面需要配置 X11 转发等手段，较为复杂。开发团队正在开发基于 Web 的版本，届时将更方便地在服务器端部署和访问。","https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgent\u002Fissues\u002F15",{"id":155,"question_zh":156,"answer_zh":157,"source_url":158},45469,"在 Windows 上使用 curl 测试剪贴板接口时，正确的命令格式是什么？","在 Windows 命令行中，必须使用双引号包裹参数，且 JSON 数据内部的双引号需要加反斜杠转义。正确格式如下：\ncurl --location \"http:\u002F\u002Flocalhost:8001\u002Fclipboard\" --header \"Content-Type: application\u002Fjson\" --data \"{\\\"text\\\":\\\"Hello\\\", \\\"token\\\":\\\"secret\\\"}\"\n或者简化版：\ncurl http:\u002F\u002Flocalhost:8001\u002Fclipboard -H \"Content-Type: application\u002Fjson\" -d \"{\\\"text\\\":\\\"Hello\\\", \\\"token\\\":\\\"secret\\\"}\"","https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgent\u002Fissues\u002F32",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},45470,"启动时遇到 \"Not a VNC server\" 错误怎么办？","这通常是端口映射问题。例如在 macOS 上，尝试将默认的 5900 端口更改为 5901（或其他未被占用的端口）即可解决。请检查您的 VNC 服务器配置和端口映射设置。","https:\u002F\u002Fgithub.com\u002Fniuzaisheng\u002FScreenAgent\u002Fissues\u002F24",[]]