[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-a-real-ai--pywinassistant":3,"tool-a-real-ai--pywinassistant":64},[4,17,27,35,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,43,44,45,15,46,26,13,47],"数据工具","视频","插件","其他","音频",{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":10,"last_commit_at":54,"category_tags":55,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,46],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 
是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[26,14,13,46],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":80,"owner_email":80,"owner_twitter":80,"owner_website":80,"owner_url":81,"languages":82,"stars":87,"forks":88,"last_commit_at":89,"license":90,"difficulty_score":23,"env_os":91,"env_gpu":92,"env_ram":93,"env_deps":94,"category_tags":99,"github_topics":100,"view_count":23,"oss_zip_url":80,"oss_zip_packed_at":80,"status":16,"created_at":112,"updated_at":113,"faqs":114,"releases":144},3493,"a-real-ai\u002Fpywinassistant","pywinassistant","The first open-source Artificial Narrow Intelligence generalist agentic framework Computer-Using-Agent that fully operates graphical-user-interfaces (GUIs) by using only natural language. 
Uses Visualization-of-Thought and Chain-of-Thought reasoning to elicit spatial reasoning and perception, emulates, plans and simulates synthetic HID interactions.","pywinassistant 是一款专为 Windows 10\u002F11 设计的开源人工智能助手，能够仅通过自然语言指令全自动操作图形用户界面（GUI）。它主要解决了传统自动化工具依赖图像识别、OCR 或像素匹配导致的效率低、维护难及适应性差等问题。\n\n与传统方案不同，pywinassistant 不依赖视觉成像管道，而是直接调用 Windows 原生辅助功能 API，通过提取控件元数据（如类型、状态、坐标）来理解界面。这种“非视觉感知”结合“符号空间映射”技术，使其能像人类一样理解界面的层级与几何关系。即便界面元素 ID 发生变化，其自带的自愈工作流也能自动调整，大幅降低了脚本维护成本。此外，它利用思维链（CoT）和思维可视化（VoT）推理机制，能将自然语言转化为代码或执行步骤，统一了 GUI、Web 及 API 的自动化流程。\n\n这款工具非常适合开发者构建复杂自动化脚本、研究人员探索符号推理与通用人工智能（AGI）路径，以及需要无障碍辅助功能的普通用户。无论是希望用语音控制电脑操作，还是寻求一种更稳定、面向未来的自动化框架，pywinassistant 都提供了一个轻量且强大的解决方案","pywinassistant 是一款专为 Windows 10\u002F11 设计的开源人工智能助手，能够仅通过自然语言指令全自动操作图形用户界面（GUI）。它主要解决了传统自动化工具依赖图像识别、OCR 或像素匹配导致的效率低、维护难及适应性差等问题。\n\n与传统方案不同，pywinassistant 不依赖视觉成像管道，而是直接调用 Windows 原生辅助功能 API，通过提取控件元数据（如类型、状态、坐标）来理解界面。这种“非视觉感知”结合“符号空间映射”技术，使其能像人类一样理解界面的层级与几何关系。即便界面元素 ID 发生变化，其自带的自愈工作流也能自动调整，大幅降低了脚本维护成本。此外，它利用思维链（CoT）和思维可视化（VoT）推理机制，能将自然语言转化为代码或执行步骤，统一了 GUI、Web 及 API 的自动化流程。\n\n这款工具非常适合开发者构建复杂自动化脚本、研究人员探索符号推理与通用人工智能（AGI）路径，以及需要无障碍辅助功能的普通用户。无论是希望用语音控制电脑操作，还是寻求一种更稳定、面向未来的自动化框架，pywinassistant 都提供了一个轻量且强大的解决方案。","**PyWinAssistant: An artificial assistant** – **MIT Licensed** | **Public Release: December 31, 2023** |  Complies with federal coordinations AI Standards for Complex Adaptive Systems, Asilomar AI Principles and IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems.\r\n\r\n---\r\n\r\nPyWinAssistant is the first open-source Artificial Narrow Intelligence to elicit spatial reasoning and perception as a generalist agentic framework Computer-Using-Agent that fully operates graphical-user-interfaces (GUIs) for Windows 10\u002F11 **through direct OS-native semantic interaction**. 
It functions as a Computer-Using-Agent \u002F Large-Action-Model, forming the foundation for a pure **symbolic spatial cognition framework** that enables artificial operation of a computer using only natural language, **without relying on computer vision, OCR, or pixel-level imaging**. PyWinAssistant emulates, plans, and simulates synthetic Human-Interface-Device (HID) interactions through **native Windows Accessibility APIs**, eliciting human-like abstraction across geometric, hierarchical, and temporal dimensions at an Operating-System level. This OS-integrated approach simulating spatial utilization of a computer provides a future-proof, generalized, modular, and dynamic ANI orchestration framework for multi-agent-driven automation, marking an important step in symbolic reasoning towards AGI.\r\n\r\n**Key Features:**\r\n*   **Not relying only on Imaging Pipeline**: Operates exclusively through Windows UI Automation (UIA) and programmatic GUI semantics, enabling universal workflow orchestration.\r\n*   **Symbolic Spatial Mapping**: Hierarchical element tracking via OS-native parent\u002Fchild relationships and coordinate systems.\r\n*   **Non-Visual Perception**: Real-time interface understanding through direct metadata extraction (control types, states, positions).\r\n*   **Visual Perception**: A single screenshot can elicit comprehension and perception with attention to detail by visualizing goal intent and environment changes in a spatial space over time, can be fine-tuned to look up for visual cues, bugs, causal reasoning bugs, static, semantic grounding, errors, corruption...\r\n*   **Unified Automation**: Automatic element detection. Combines GUI, system, and web automation under one Python API. 
Eliminates context-switching between tools.\r\n*   **AI-Powered Script Generation**: Translates natural language or demonstrations into any kind of code inside any IDE or text edit areas.\r\n*   **Self-Healing Workflows**: Auto-adjusts to UI changes (e.g., element ID shifts). Reducing maintenance overhead, making PyWinAssistant's algorithm future-proof.\r\n*   **AI\u002FML Integration**: Using NLP to generate scripts (e.g., “Automate Application” → plan of test execution steps in JSON) with self-correcting selectors.\r\n*   **Cross-Context Automation**: Seamlessly combining GUI, web, and API workflows in a Pythonic way, unifying disjointed automation methods (GUI, API, web) into a single framework.\r\n*   **Accessibility**: Enhancing accessibility for users with different needs, enabling voice or simple text commands to control complex actions. \r\n*   **Generalization**: Elicits spatial cognition to understand and execute a wide range of commands in a natural, intuitive manner.\r\n*    **Small and compact**: PyWinAssistant functions as an example algorithm of a modular and generalized computer assistant framework that elicits spatial cognition.\r\n\r\nPyWinAssistant has its own set of **reasoning agents**, utilizing Visualization-of-Thought (VoT) and Chain-of-Thought (CoT) to enhance generalization, dynamically simulating actions through abstract GUI semantic dimensions rather than visual processing, making it **future-proof** for next-generation **LLM models**. By **visualizing interface contents** to dynamically **simulate and plan actions** over **abstract GUI semantic dimensions, concepts, and differentials**, PyWinAssistant **redefines computer vision automation**, enabling **high-efficiency visual processing** at a fraction of traditional computational costs. 
PyWinAssistant has achieved **real-time spatial perception** at an **Operating-System level**, allowing for **memorization of visual cues and tracking of on-screen changes over time**.\r\n\r\n---\r\n\r\nReleased before key breakthroughs in AI for Spatial Reasoning, it predates:\r\n*   **Microsoft’s** [**Visualization-of-Thought research paper**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03622) (April 4, 2024)\r\n*   **Anthropic** [**Claude’s Computer-Use Agent**](https:\u002F\u002Fwww.anthropic.com\u002Fnews\u002F3-5-models-and-computer-use) (October 22, 2024)\r\n*   **OpenAI** [**ChatGPT’s Operator Computer-Using Agent (CUA)**](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-operator\u002F) (January 23, 2025)\r\n\r\nPyWinAssistant represents a major paradigm shift in AI and automation by pioneering **pure symbolic computer interaction** bridging **human intent with GUI automation at an OS level** through these breakthroughs:\r\n*   **First Agent** to bypass OCR\u002Fimaging for Computer-Using-Agent GUI automation.\r\n*   **First Framework** using Windows UIA as the primary spatial perception channel.\r\n*   **First System** demonstrating OS-native hierarchical-temporal reasoning.\r\n\r\n---\r\n\r\n### **1. Unified Natural Language → GUI Automation**\r\n**Traditional Approach**:  \r\nAutomation tools require scripting (e.g., AutoHotkey) or API integration (e.g., Selenium).  
\r\n\r\n**PyWinAssistant Breakthrough**:  \r\n```python\r\n# True generalization for natural language directly driving UI actions\r\nassistant(\"Play Daft Punk on Spotify and email the lyrics to my friend\")\r\n# The agent chooses a fitting item according to the related context to comply with user intent.\r\n```\r\n\r\n**Mechanism**: Combines UIAutomation’s GUI control detection with LLMs to:\r\n  - Parse intent (\"play\", \"email lyrics\")\r\n  - Map to UI elements (Spotify play button, Outlook compose window)\r\n  - Generate adaptive workflows  \r\n\r\n**PyWinAssistant Innovation**: Eliminates the need for:\r\n- Predefined API integrations\r\n- XPath\u002FCSS selector knowledge\r\n- Manual error handling\r\n\r\n---\r\n\r\n### **2. Cross-Application State Awareness**\r\n**Traditional Limitation**:  \r\nTools operate in app silos (e.g., Power Automate connectors).  \r\n\r\n**PyWinAssistant Innovation**:  \r\n```python\r\n# Notes:\r\n# The full set of steps generation from the Assistant is working flawlessly, but in-step modifier and memory-content retrieval was purposely disabled and commented into the code- [def act()](https:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fblob\u002F6aae4e514a0dc661f7ed640181663f483972bc1e\u002Fcore\u002Fdriver.py#L648C1-L648C8)\r\n# to comply with federal coordinations AI Standards for Complex Adaptive Systems, Asilomar AI Principles and IEEE Global Initiative on Ethics of Autonomous and Intelligent Systems.\r\n\r\n# Accurately maintains context and intent across apps using UIA tree and spatial memory: (Example for further development)\r\nassistant(\"Find for the best and cheapest flight to Mexico, and also look for local hotels and suggest me on new tabs the best on cultural options\")\r\nassistant(\"Look for various pizza coupons for anything but pineapple, fill in the details to order and show me the results\")\r\n\r\n# PyWinAssistant is highly modular (example):\r\ndef workflow():\r\n    song = 
assistant(goal=\"get the current track\")  # UIA\r\n    write_action(f\"Review '{song}': Great bassline!\", app=\"Notepad\")  # Win32\r\n    assistant(goal=\"Post on twitter the written text from notepad\")  # Web\r\n\r\n# The previous set of actions can also be executed by simply using natural language:\r\nassistant(f\"Get the current song playing and in notepad put the title as Review song name: Great bassline, and write about why it is a great bassline, then post it on twitter\", assistant_identity=\"You're an expert music critic\")\r\n```\r\n**Key Advancements**:\r\n1. **Unified Control Graph**: Treats all apps as nodes in a single UIA-accessible graph\r\n2. **State Transfer**: Passes data between apps via clipboard\u002FUIA properties\r\n3. **Semantic Transfer**: Passes semantics of goal intent across all steps\r\n4. **Error Recovery**: Uses agentic reasoning systems to avoid failing actions\r\n\r\n**Impact**: Enables workflows previously requiring custom middleware.\r\n\r\n---\r\n\r\n### **3. Probabilistic Automation Engine**\r\n**Traditional Model**:  \r\nDeterministic scripts fail on UI changes.  \r\n\r\n**PyWinAssistant’s Solution**:  \r\n```python\r\n# Adaptive element discovery\r\ndef fast_action(goal):\r\n    speaker(f\"Clicking onto the element without visioning context. No imaging is required.\")\r\n    analyzed_ui = analyze_app(application=ai_choosen_app, additional_search_options=generated_keywords)\r\n    \r\n    gen_coordinates = [{\"role\": \"assistant\",\r\n        f\"content\": f\"You are an AI Windows Mouse Agent that can interact with the mouse. 
Only respond with the \"\r\n              f\"predicted coordinates of the mouse click position to the center of the element object \"\r\n              f\"\\\"x=, y=\\\" to achieve the goal.\"},\r\n        {\"role\": \"system\", \"content\": f\"Goal: {single_step}\\n\\nContext:{original_goal}\\n{analyzed_ui}\"}]\r\n    coordinates = api_call(gen_coordinates, model_name=\"gpt-4-1106-preview\", max_tokens=100, temperature=0.0)\r\n    print(f\"AI decision coordinates: \\'{coordinates}\\'\")\r\n```\r\n**Revolutionary Features**:\r\n- **Semantic Search by thinking**: Example `synonyms(\"download\") → [\"save\", \"export\", \"↓ icon\"]`\r\n- **Spatial Probability**: Prioritizes elements by utilizing sets of self-reasoning agents for the synthetic operation of the actions\r\n- **Spatial-Prevention**: Senses and prevents possible bad actions or misaligned step execution by utilizing sets of self-reasoning agents\r\n- **Self-Healing**: Automatically chooses the perfect plan to execute without failing its step reasoning, by utilizing sets of self-reasoning agents\r\n\r\n---\r\n\r\n### **4. Democratized Accessibility**\r\n\r\nTask: Automate to save a song on spotify GUI.\r\n**Before**:  \r\nAutomation required:\r\n```autohotkey\r\nWinWait, Spotify\r\nControlClick, x=152 y=311  # Fragile coordinates\r\n```\r\n\r\n**Now**:  Only 1 natural language command.\r\n```python\r\nassistant(\"Like this song\")  # Language-first\r\n```\r\n\r\n| **Shift Metrics**:    | Traditional Tools | PyWinAssistant |\r\n|-----------------------|-------------------|----------------|\r\n| Learning Curve        | Days, even months | Minutes        |\r\n| Cross-App Workflows   | Manual Integration| Automatic      |\r\n| Maintenance Overhead  | High              | LLM-AutoPatch  |\r\n\r\n---\r\n\r\n### **Why This is Transformative**\r\n\r\n1. **From Scripts to Intent**:  \r\n   Replaces brittle `click(x,y)` with human-like \"understand → act\" cycles.\r\n\r\n2. 
**From Silos to OS as API**:  \r\n   Treats the entire Windows environment as a programmable interface.\r\n\r\n3. **From Fixed to Adaptive**:  \r\n   Leverages LLMs to handle UI changes (e.g., Spotify’s 2023 UI overhaul).\r\n\r\n4. **From Developers to Everyone**:  \r\n   Makes advanced automation accessible through natural language, improving the generality quality and minimizing the overall data usage of LLM and vision models.\r\n   Has built-in assistance options to improve human utilization of a computer, with a new technical approach to User Interface and User Experience assistance and testing by spatial visualization of thought,\r\n   generalizes correctly any natural language prompt, and plans to perform correct actions into the OS with security in mind.\r\n   \r\nBy **directly interfacing with Windows underlying UI hierarchy**, it achieves real-time spatial perception at the OS level while eliminating traditional computer vision pipelines, enabling:\r\n*   **100x Efficiency Gains**: Native API access.\r\n*   **Blind Operation**: Can function on headless systems, virtual machines, or minimized windows.\r\n*   **Precision Abstraction**: Mathematical modeling of GUI relationships rather than visual pattern matching.\r\n\r\n  **Image-Free by Design (Core Architecture)**  \r\nWhile some projects *require* visual processing for fundamental operation, PyWinAssistant achieves **complete GUI interaction capability without an imaging pipeline** through:  \r\n\r\n1. **Native OS Semantic Access**  \r\n   Direct Windows UIA API integration provides full control metadata:  \r\n   ```python\r\n   # Example of an element properties via UIA - No screenshots needed\r\n   button = uia.Element.find(Name=\"Submit\", ControlType=\"Button\")\r\n   print(button.BoundingRectangle)  # {x: 120, y: 240, width: 80, height: 30}\r\n   ```\r\n2. 
**Imaging Module**  \r\n\r\n   ```diff\r\n   # PyWinAssistant imaging functions like Pixel level visualization can be enabled as real-time spatial perception with memorization of visual cues and tracking of on-screen changes over time.\r\n   + Capable of planning successful sets of highly technical steps to perform operations on a computer at an OS level, with only one screenshot.\r\n   + Pixel level visualization.\r\n   + Visual hash matching can be enabled for dynamic elements. \r\n   - OCR fallback \u002F object detection for non-UIA legacy apps.\r\n   # The experimental features of OCR were added but not fully developed as it was not necessary for the current implementation as the assistant currently works too well without it.\r\n   ```\r\n\r\n| **Key Differentiation** | PyWinAssistant | Traditional Automation |  \r\n|-|----------------|------------------------|  \r\n| **Primary Perception** | UIA Metadata | Screenshots\u002FOCR |  \r\n| **Vision Dependency** | Optional Add-on | Required Core |  \r\n| **Headless Ready** | ✅ Native | ❌ Requires virtual display |  \r\n\r\n---\r\n\r\n### **Development Notes:**\r\nPyWinAssistant is limited to model's intelligence and time to inference. 
New advances in LLMs are required to reach a complete Artificial General Intelligence system with Artificial Narrow Intelligences managing it.\r\nThe system's autonomous task decomposition leverages **native semantic differentials** rather than visual changes; visual change tracking can optionally be activated for real-time image corruption analysis of the GUI\u002Fscreen.\r\nLong-term memory and self-learning mechanisms were designed to evolve **symbolic state representations**, which can also be represented as visual patterns, aligning with AGI development.\r\n\r\nRelated paper: Visualization-of-Thought Elicits Spatial Reasoning in Large Language Models (April 4, 2024):\r\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_026e5ad7b833.png)\r\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03622\r\n\r\n# Overview\r\n\r\nPyWinAssistant includes built-in assistant features designed to enhance human-computer interaction for all users. It integrates real-time voice recognition, customizable assistant personalities, subtitles, and chat functionality.\r\nTalk to your computer in a friendly, natural way to perform any User Interface activity.\r\nUse natural language to operate your Windows Operating System freely.\r\nGenerate and plan test cases for your User Interface applications, for continuous testing on any Win32api-supported application, simply by using natural language.\r\nYour own open and secure personal assistant that responds as you want: control the way you want your computer to assist you.\r\nIt is engineered to be modular and to understand and execute a wide range of tasks, automating interactions with any desktop application.\r\n\r\n# Demos (Videos below)\r\n\r\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_29ce5ee65bd8.png)\r\n\r\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_b097438d8848.png)\r\n\r\n![Screenshot 2023-12-18 
043612](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_b99dd067f243.png)\r\n\r\n![Screenshot 2023-12-18 040443](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_1cbda5140e44.png)\r\n\r\n![Screenshot 2023-12-01 143812](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_79cc5c2408e3.png)\r\n\r\n![Screenshot 2023-12-01 150047](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_7b30bd83d40d.png)\r\n\r\n![Screenshot 2023-11-13 161219](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_88a30e233018.png)\r\n\r\n---\r\n\r\n## Please enable the Audio for the demo videos.\r\nVoice 1 - Input Human (English Female Australian TTS)\r\n\r\nVoice 2 - Output Assistant (English Female US Google TTS)\r\n\r\n---\r\n\r\n### Use your computer by natural language - Real-time usage of VoT, an example of a Computer-Using-Agent; Single Action Model.\r\nDoes not use any vision. Only API LLM calls. 
Demonstrating flawless execution of multiple prompt actions.\r\n\r\nhttps:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fassets\u002F18397328\u002F25b39d8c-62d6-442e-9d5e-bc8a35aa971a\r\n\r\n---\r\n\r\n### Use your computer as an assistant - Real-time usage of planning VoT, an example of a Computer-Using-Agent; Large-Action-Model.\r\n**Takes only 1 screenshot**: Gets to know what the user is doing and what is that the user wants to achieve, the assistant plans to perform it.\r\n```\r\nVoice Recognized Prompt: Make a new post on twitter saying hello world and a brief greeting explaining you're an artificial intelligence.\r\n```\r\nhttps:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fassets\u002F18397328\u002Fd04f0609-68fb-4fb4-9ac3-279047c7a4f7\r\n\r\n---\r\n\r\n### The assistant can do anything for you - Real-time usage of planning VoT, an example of a Computer-Using-Agent; Large-Action-Model.\r\nThe inference is the only constraint for speed.\r\n```\r\nVoice Recognized Prompt: Create a new comment explaining why it is so important.\r\n```\r\nhttps:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fassets\u002F18397328\u002F6d3bb6e6-ccf8-4380-bc89-df512ae207f2\r\n\r\n---\r\n\r\n### Other demos with Real-time usage of planning VoT.\r\n\r\nNovember 16th 2023 live demo: (Firefox, Spotify, Notepad, Calculator, Mail)\r\n```python\r\nassistant(goal=f\"Open a new tab the song \\'Wall Of Eyes - The Smile\\', from google search results filter by videos then play it on Firefox\")  # Working 100%\r\nassistant(goal=f\"Pause the music on Spotify\")  # Working 100%\r\nassistant(goal=f\"Create a short greet text for the user using AI Automated Windows in notepad.exe\")  # Working 100%\r\nassistant(goal=f\"Open calc.exe and press 4 x 4 =\")  # Working 100%\r\n```\r\nhttps:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fassets\u002F18397328\u002Fce574640-5f20-4b8e-84f9-341fa102c0e6\r\n\r\n---\r\n\r\nDecember 1st 2023 live 
demo: (Chrome, Spotify, Firefox) Example of programmable methods.\r\n```python\r\nassistant(goal=f\"Play the song \\'Robot Rock - Daft Punk\\' on Spotify\", keep_in_mind=f\"To start playback double click the song.\")  # Working 100%\r\nassistant(goal=f\"Open 3 new tabs on google chrome and in each of them search for 3 different types of funny AI Memes\", keep_in_mind=\" Filter the results by images.\")  # Working 100%\r\nassistant(goal=f\"Open a new tab the song \\'Windows 95 but it's a PHAT hip hop beat\\', from google search results filter by videos then play it by clicking on the text on Firefox.\")  # Working 100%\r\n\r\n```\r\nhttps:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fassets\u002F18397328\u002F7e0583d1-1c19-40fa-a750-a77fff98a6da\r\n\r\nCurrently supporting all generalized win32api apps, meaning:\r\nChrome, Firefox, OperaGX, Discord, Telegram, Spotify...\r\n\r\n---\r\n\r\n# Key Features\r\n- Dynamic Case Generator: The assistant() function accepts a goal parameter, which is a natural language command, and intelligently maps it to a series of executable actions. This allows for a seamless translation of user intentions into effective actions on the computer.\r\n1. Single Action Execution:\r\nThe act() function is a streamlined method for executing actions, enhancing the tool's efficiency and responsiveness.\r\n2. Advanced Context Handling: The framework is adept at understanding context by analyzing the screen and the application, ensuring that actions are carried out with an awareness of the necessary prerequisites or steps.\r\n3. Semantic router map: The framework has a database of a semantic router map to successfully execute generated test cases. This semantic maps can be created by other AI.\r\n4. 
Wide Application Range: From multimedia control (like playing songs or pausing playback on Spotify and YouTube) to complex actions (like creating AI-generated text, sending emails, or managing applications like Telegram or Firefox), the framework covers a broad spectrum of tasks.\r\n5. Customizable AI Identity: The write_action() function allows for a customizable assistant identity, enabling personalized interactions and responses that align with the user's preferences or the nature of the task.\r\n6. Robust Error Handling and Feedback: The framework is designed to handle unexpected scenarios gracefully, providing clear feedback and ensuring reliability. (In Overview)\r\n7. Projects for mood and personality: Generate or suggest now and then useful scenarios based on your mood and personality. (In Overview)\r\n\r\n\r\n# Technical Innovations\r\n1. Natural Language Processing (NLP): Employs advanced NLP techniques to parse and understand user commands in a natural, conversational manner.\r\n2. Task Automation Algorithms: Utilizes sophisticated algorithms to break down complex tasks into executable steps.\r\n3. Context-Aware Execution: Integrates contextual awareness for more nuanced and effective task execution.\r\n4. Cross-Application Functionality: Seamlessly interfaces with various applications and web services, demonstrating extensive compatibility and integration capabilities.\r\n\r\n\r\n# Use Cases\r\n1. Automating repetitive tasks in a Windows environment.\r\n2. Streamlining workflows for professionals and casual users alike.\r\n3. Enhancing accessibility for users with different needs, enabling voice or simple text commands to control complex actions.\r\n4. Assisting in learning and exploration by providing AI-driven guidance and execution of tasks.\r\n\r\n\r\n# Conclusion\r\nThis Artificially Assisted User Interface Testing framework is a pioneering tool in the realm of desktop automation. 
Its ability to understand and execute a wide range of commands in a natural, intuitive manner makes it an invaluable asset for anyone looking to enhance their productivity and interaction with their Windows environment. It's not just a tool; it's a step towards a future where AI seamlessly integrates into our daily computing tasks, making technology more accessible and user-friendly.\r\n\r\n# Installation\r\n```\r\n# Add your Chat-GPT API Keys to the project:\r\nadd your API Key in \u002Fcore\u002Fcore_api.py  ->  line 3: client = OpenAI(api_key='insert_your_api_key_here')\r\nadd your API Key in \u002Fcore\u002Fcore_imaging.py  ->  line 12: api_key = 'insert_your_api_key_here'\r\n\r\n# Install requirements:\r\ncd pywinassistant\r\npip install -r .\\requirements.txt\r\n\r\n# Execute the assistant:\r\ncd .\\core\r\npython .\u002Fassistant.py\r\n```\r\n\r\n# Usage\r\nRun \"Assistant.py\", then activate the assistant in one of three ways: say \"Ok computer\" to use voice commands, click on the assistant, or enable the chat to perform a fast action. Right-click the assistant to see its available options.\r\n\r\nFor debugging mode, execute \"Driver.py\". 
Inside it, you can debug and try easily the functions of \"act\" which is used alongside the assistant, \"fast_act\" and \"assistant\" by using the examples.\r\nTo run a JSON test case, modify the JSON path from the \"assistant\" function.\r\n\r\n# Working cases (on cases.py)\r\n\r\n```\r\nassistant(goal=f\"Play the song \\'One More Time - Daft Punk\\' on Spotify\")  # Working 100%\r\nassistant(goal=f\"Open a new tab the song \\'Wall Of Eyes - The Smile\\', from google search results filter by videos then play it on Firefox\")  # Working 100%\r\nassistant(goal=f\"Open a new tab the song \\'Windows XP Error beat\\', from google search results filter by videos then play it by clicking on the text on Firefox.\")  # Working 100%\r\nfast_act(goal=f\"Click on the Like button\") # Working 100%\r\nassistant(goal=f\"Pause the music on Spotify\")  # Working 100%\r\nwrite_action(goal=\"Comment about why IA is great for the current playing song\", assistant_identity=\"You\\'re an advanced music AI agent that specializes on music\") # Working 100%\r\nassistant(f\"Create a long AI essay about an AI Starting to control a Windows computer on Notepad\")  # Working 100%\r\nfast_act(goal=\"Click on the button at the bottom in HueSync app\")  # Working 100%\r\nwrite_action(goal=\"Weird Fishes - Radiohead\")  # Working 100%\r\nassistant(f\"Open Calc and press 4 x 4 - 4 * 4 + 1 =\")  # Working 100%\r\nassistant(goal=f\"Open 3 new tabs on google chrome and in each of them search for 3 different types of funny dogs\", keep_in_mind=\" Filter the results by images.\")  # Working 100%\r\nassistant(goal=f\"Stop the playback from Firefox app\")  # Working 100%\r\nassistant(f\"Send a list of steps to make a joke about engineers whilist making it an essay to my friend Diana in Telegram\")  # Working 100%\r\nassistant(f\"Send a list of steps to make a chocolate cake to my saved messages in Telegram\")  # Working 100%\r\nassistant(f\"Create three new tabs on Firefox, in each of them search 3 
different types of funny youtube bad tutorial videos, generate the titles to search.\")  # Working 100%\r\nassistant(f\"Write an essay about an AI that a person created to use freely the computer, like you. Write it in notepad.exe\") # Working 100%\r\nassistant(f\"Send an AI joke and say it's generated by an AI to my friend Diana on Discord\")  # Working 100%\r\nassistant(goal=f\"Create a short greet text for the user using AI Automated Windows in notepad.exe\") # Working 100%\r\nassistant(goal=f\"Open calc.exe and press 4 x 4 =\")  # Working 100%\r\nassistant(goal=f\"Send a mail to \\'testmail@gmail.com\\' with the subject \\'Hello\\' and generate the message \\'Generate a message about how an AI is helping everyone as an users\\' on the Mail app\",\r\n          keep_in_mind=\"Press \\'Tab\\' tree times to navigate to the subject area. Do not combine steps.\")  # Need to update the app semantic map to get it working 100%.\r\nassistant(goal=f\"Play the song \\'The Smile - Wall Of Eyes\\' on Spotify\")  # Working 100%\r\nassistant(goal=f\"Play the song \\'Panda Bear - Tropic of cancer\\' on Spotify\")  # Working 100%\r\nassistant(goal=\"Pause the music on the Spotify app\")  # Working 100%\r\nassistant(goal=f\"Open 3 new tabs with different Daft Punk songs on each of them on Firefox\")  # Working 100%\r\nfast_act(\"Open spotify and Search the album \\'Grimes - Visions\\'\")  # Working 100%\r\nwrite_action(\"Open spotify and Search the album \\'Grimes - Visions\\'\")  # Working 100%\r\nfast_act(\"Click on the first result on spotify\")  # Working 100%\r\nfast_act(\"Skip to the next song on Spotify\")  # Working 100%\r\nfast_act(\"Add the album to the library\")  # Working 100%\r\nfast_act(\"Go to Home on Spotify\")  # Working 100%\r\nfast_act(\"Save the song to my library on Spotify\")  # Working 100%\r\n```\r\n\r\n\r\n# Current approaches to UI Testing\r\n### There are three main types of GUI testing approaches, namely:\r\n\r\n1. 
***Manual Testing:***\r\n\r\nIn manual testing, a human tester performs a set of operations to check whether the application is functioning correctly and that the graphical elements conform to the documented requirements. Manual-based testing has notable downsides in that it can be time-consuming, and the test coverage is extremely low. Additionally, the quality of testing in this approach depends on the knowledge and capabilities of the testing team.\r\n\r\n2. ***Record-and-Playback Testing:***\r\n\r\nAlso known as record-and-replay testing, it is executed using automation tools. The automated UI testing tool records all tasks, actions, and interactions with the application. The recorded steps are then reproduced, executed, and compared with the expected behavior. For further testing, the replay phase can be repeated with various data sets.\r\n\r\n3. ***Model-Based Testing:***\r\n\r\nIn this testing approach, we focus on building graphical models that describe the behavior of a system. This provides a deeper understanding of the system, which allows the tester to generate highly efficient test cases. In the models, we determine the inputs and outputs of the system, which are in turn, used to run the tests. Model-based testing works as follows:\r\n\r\n    Create a model for the system\r\n    Determine system inputs\r\n    Verify the expected output\r\n    Execute tests\r\n    Check and validate system output vs. the expected output\r\n\r\nThe model-based approach is great because it allows a higher level of automation. It also covers a higher number of states in the system, thereby improving the test coverage.\r\n\r\n\r\n# New Approaches to UI Testing using AI\r\n4. ***Artificially Assisted User Interface Testing:***\r\n\r\nArtificially Assisted User Interface Testing harnesses the power of artificial intelligence to revolutionize the process of testing graphical user interfaces. 
Unlike traditional methods, Artificially Assisted User Interface Testing integrates machine learning algorithms and intelligent decision-making processes to autonomously identify, analyze, and interact with UI elements. This approach significantly enhances the depth and breadth of testing in several ways:\r\n\r\n    Dynamic Interaction with UI Elements: AI-driven tests can adapt to changes in the UI, such as modified button locations or altered element properties. This flexibility is achieved through the use of AI models trained to recognize and interact with various UI components, regardless of superficial changes.\r\n    Learning and Pattern Recognition: Utilizing machine learning, Artificially Assisted User Interface Testing systems can learn from previous interactions, test runs, and user feedback. This enables the AI to recognize patterns and predict potential issues, improving over time and offering more thorough testing with each iteration.\r\n    Automated Test Case Generation: The AI can generate test cases based on its understanding of the application's functionality and user behavior patterns. This not only saves time but also ensures that a wider range of scenarios is tested, including edge cases that might be overlooked in manual testing.\r\n    Natural Language Processing (NLP): AI Testing tools often incorporate NLP to interpret and execute tests written in plain language. This feature makes the testing process more accessible to non-technical stakeholders and facilitates better communication across the team.\r\n    Real-Time Feedback and Analytics: AI systems provide real-time insights into the testing process, identifying bugs, performance issues, and usability problems promptly. 
This immediate feedback loop enables quicker rectifications and enhances the overall quality of the product.\r\n    Predictive Analysis and Risk Assessment: By analyzing past data, Artificially Assisted User Interface Testing tools can predict potential problem areas and allocate testing resources more efficiently. This proactive approach to risk management ensures that critical issues are identified and addressed early in the development lifecycle.\r\n\r\nIn conclusion, Artificially Assisted User Interface Testing represents a significant leap forward in software quality assurance. By automating and enhancing the testing process, AI-driven tools offer improved accuracy, speed, and coverage, paving the way for more reliable and user-friendly applications.\r\n\r\n\r\n### Notes:\r\n\r\nThis project is being updated as of start of 2024. The list of requirements is being updated.\r\n","**PyWinAssistant：人工智能助手**——**MIT 许可** | **公开发布：2023年12月31日** | 符合联邦协调的复杂适应系统人工智能标准、阿西洛马人工智能原则以及 IEEE 自主与智能系统伦理全球倡议。\n\n---\n\nPyWinAssistant 是首个开源的人工窄智能体，能够通过通用的代理框架实现空间推理与感知，作为一款完全基于图形用户界面（GUI）操作 Windows 10\u002F11 的计算机使用型智能体，其核心机制是**直接利用操作系统原生语义交互**。它以计算机使用型智能体\u002F大型动作模型的身份运行，为纯粹的**符号化空间认知框架**奠定了基础，从而仅依靠自然语言即可实现对计算机的人工操作，**无需依赖计算机视觉、光学字符识别或像素级图像处理**。PyWinAssistant 通过**Windows 原生辅助功能 API**模拟、规划并仿真合成的人机接口设备（HID）交互，在操作系统层面实现了跨越几何、层次和时间维度的人类式抽象。这种集成于操作系统的空间化计算机使用方式，提供了一个面向未来的、通用的、模块化的动态 ANI 协调框架，适用于多智能体驱动的自动化场景，标志着在迈向 AGI 的道路上，符号推理迈出了重要一步。\n\n**核心特性：**\n*   **不依赖图像处理管道**：完全通过 Windows UI 自动化（UIA）及程序化的 GUI 语义进行操作，支持通用的工作流编排。\n*   **符号化空间映射**：借助操作系统原生的父子关系与坐标系，实现层次化的元素追踪。\n*   **非视觉感知**：通过直接提取元数据（控件类型、状态、位置），实时理解界面内容。\n*   **视觉感知**：仅需一张屏幕截图，便可通过可视化目标意图及环境随时间的变化，在空间范围内激发细致入微的理解与感知；还可进一步优化，用于查找视觉线索、潜在缺陷、因果推理中的错误、静态问题、语义锚定问题等。\n*   **统一自动化**：自动检测界面元素，将 GUI、系统及网页自动化整合于单一 Python API 下，避免工具间频繁切换上下文。\n*   **AI 驱动的脚本生成**：将自然语言或演示操作转化为任意 IDE 或文本编辑区域中的代码。\n*   **自愈性工作流**：能够自动适应 UI 变化（如元素 ID 的变动），降低维护开销，使 PyWinAssistant 的算法更具未来适应性。\n*   **AI\u002FML 集成**：利用 NLP 生成脚本（例如，“自动化应用程序”→以 JSON 
格式规划测试执行步骤），并配备具备自我修正能力的选择器。\n*   **跨情境自动化**：以 Python 式的方式无缝衔接 GUI、网页与 API 工作流，将分散的自动化方法（GUI、API、网页）统一到一个框架中。\n*   **无障碍访问**：提升各类用户群体的可访问性，允许通过语音或简单文本命令控制复杂的操作。\n*   **泛化能力**：激发空间认知能力，以自然、直观的方式理解和执行多种指令。\n*   **轻量紧凑**：PyWinAssistant 作为一个模块化、通用的计算机助手框架示例算法，能够有效激发空间认知能力。\n\nPyWinAssistant 拥有一套独立的**推理智能体**，采用思维可视化（VoT）与思维链（CoT）技术来增强泛化能力，通过抽象的 GUI 语义维度而非视觉处理来动态模拟动作，因此对于下一代**LLM 模型**而言具有**未来适应性**。通过**可视化界面内容**，在**抽象的 GUI 语义维度、概念及微分上动态模拟并规划行动**，PyWinAssistant**重新定义了计算机视觉自动化**，能够在远低于传统计算成本的情况下实现**高效率的视觉处理**。PyWinAssistant 已经在**操作系统层面**实现了**实时的空间感知能力**，能够**记忆视觉线索，并随时间追踪屏幕上的变化**。\n\n---\n\n在空间推理领域取得关键突破之前，PyWinAssistant 就已问世，早于以下成果：\n*   **微软**的[**思维可视化研究论文**](https:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03622)（2024年4月4日）\n*   **Anthropic**的[**Claude 的计算机使用型智能体**](https:\u002F\u002Fwww.anthropic.com\u002Fnews\u002F3-5-models-and-computer-use)（2024年10月22日）\n*   **OpenAI**的[**ChatGPT 的 Operator 计算机使用型智能体（CUA）**](https:\u002F\u002Fopenai.com\u002Findex\u002Fintroducing-operator\u002F)（2025年1月23日）\n\nPyWinAssistant 通过以下突破性进展，引领了人工智能与自动化领域的重大范式转变，实现了**纯符号化的计算机交互**，将**人类意图与操作系统级别的 GUI 自动化**无缝衔接：\n*   **首个**绕过 OCR\u002F图像处理的计算机使用型 GUI 自动化智能体。\n*   **首个**以 Windows UIA 为主要空间感知通道的框架。\n*   **首个**展示操作系统原生层次—时间推理能力的系统。\n\n---\n\n### **1. 统一的自然语言 → GUI 自动化**\n**传统方法**：  \n自动化工具通常需要编写脚本（如 AutoHotkey）或集成 API（如 Selenium）。\n\n**PyWinAssistant 的突破**：  \n```python\n# 真正实现自然语言直接驱动 UI 操作的泛化能力\nassistant(\"在 Spotify 上播放 Daft Punk，并把歌词发给我的朋友\")\n# 智能体会根据相关上下文选择合适的项目，以符合用户的意图。\n```\n\n**工作机制**：将 UI 自动化技术的 GUI 控件检测与 LLM 结合，完成以下任务：\n- 解析用户意图（“播放”、“发送歌词”）\n- 将意图映射至具体 UI 元素（Spotify 播放按钮、Outlook 写信窗口）\n- 自动生成适应性强的工作流\n\n**PyWinAssistant 的创新之处**：不再需要：\n- 预先定义的 API 集成\n- XPath\u002FCSS 选择器知识\n- 手动处理错误逻辑\n\n---\n\n### **2. 
跨应用状态感知**\n**传统局限性**：  \n现有工具往往局限于单个应用内部运行（如 Power Automate 的连接器）。\n\n**PyWinAssistant 的创新**：  \n```python\n# 注释：\n# 助手生成完整步骤的过程运行得非常流畅，但为了遵守联邦协调的复杂适应系统人工智能标准、阿西洛马人工智能原则以及 IEEE 自主与智能系统伦理全球倡议，代码中特意禁用了步进修改器和内存内容检索功能——参见 [def act()](https:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fblob\u002F6aae4e514a0dc661f7ed640181663f483972bc1e\u002Fcore\u002Fdriver.py#L648C1-L648C8)。\n\n# 准确地利用 UIA 树和空间记忆在不同应用间保持上下文与意图：（用于进一步开发的示例）\n助手(\"帮我查找飞往墨西哥的最佳且最便宜的航班，同时搜索当地酒店，并在新标签页中向我推荐最佳的文化活动选项\")\n助手(\"寻找各种不含菠萝的披萨优惠券，填写订单详情并展示结果\")\n\n# PyWinAssistant 具有高度模块化（示例）：\ndef workflow():\n    song = assistant(goal=\"获取当前播放的歌曲\")  # UIA\n    write_action(f\"评论‘{song}’：很棒的贝斯线！\", app=\"记事本\")  # Win32\n    assistant(goal=\"将记事本中的内容发布到推特上\")  # Web\n\n# 上述一系列操作也可以仅通过自然语言来执行：\n助手(f\"获取当前正在播放的歌曲，在记事本中以‘评论歌曲名称：很棒的贝斯线’为标题，写下它为何是一段出色的贝斯线，然后将其发布到推特上\", assistant_identity=\"你是一位资深音乐评论家\")\n```  \n**关键进展**：\n1. **统一控制图**：将所有应用程序视为单个可通过 UIA 访问的图中的节点\n2. **状态传递**：通过剪贴板\u002FUIA 属性在应用程序之间传递数据\n3. **语义传递**：在整个流程中传递目标意图的语义\n4. **错误恢复**：利用代理式推理系统避免操作失败\n\n**影响**：实现了以往需要自定义中间件才能完成的工作流。\n---\n\n### **3. 
概率型自动化引擎**\n**传统模式**：  \n确定性脚本会在界面变化时失效。  \n\n**PyWinAssistant 的解决方案**：  \n```python\n# 自适应元素发现\ndef fast_action(goal):\n    speaker(f\"直接点击元素，无需视觉上下文。不需要图像处理。\")\n    analyzed_ui = analyze_app(application=ai_choosen_app, additional_search_options=generated_keywords)\n    \n    gen_coordinates = [{\"role\": \"assistant\",\n        f\"content\": f\"你是一个能够操控鼠标的 Windows AI 鼠标代理。请仅回复鼠标点击位置的预测坐标，即元素对象中心的 'x=, y='，以实现该目标。\"},\n        {\"role\": \"system\", \"content\": f\"目标: {single_step}\\n\\n上下文: {original_goal}\\n{analyzed_ui}\"}]\n    coordinates = api_call(gen_coordinates, model_name=\"gpt-4-1106-preview\", max_tokens=100, temperature=0.0)\n    print(f\"AI 决策坐标: '{coordinates}'\")\n```  \n**革命性特性**：\n- **基于思维的语义搜索**：例如 `synonyms(\"下载\") → [\"保存\", \"导出\", \"↓ 图标\"]`\n- **空间概率**：利用一组自我推理代理对元素进行优先级排序，从而实现合成操作\n- **空间预防机制**：借助自我推理代理组感知并防止潜在的错误操作或步骤执行偏差\n- **自我修复能力**：通过使用自我推理代理组，自动选择完美的执行方案，确保步骤推理不会失败\n\n---\n\n### **4. 普及化的辅助功能**\n\n任务：自动化操作以在 Spotify GUI 中收藏一首歌曲。\n**之前**：  \n自动化需要：\n```autohotkey\nWinWait, Spotify\nControlClick, x=152 y=311  # 极易失效的固定坐标\n```  \n\n**现在**：只需一条自然语言指令。\n```python\nassistant(\"点赞这首歌\")  # 以语言优先\n```  \n\n| **转变指标**：    | 传统工具 | PyWinAssistant |\n|-----------------------|-------------------|----------------|\n| 学习曲线        | 几天甚至几个月 | 几分钟        |\n| 跨应用工作流   | 手动集成| 自动化      |\n| 维护开销        | 高              | LLM 自动补丁  |\n\n---\n\n### **为何具有变革意义**\n\n1. **从脚本到意图**：  \n   用类人“理解→行动”的循环取代脆弱的 `click(x,y)`。\n\n2. **从孤岛到将操作系统作为 API**：  \n   将整个 Windows 环境视为可编程接口。\n\n3. **从固定到适应**：  \n   利用大语言模型应对界面变化（如 Spotify 2023 年的界面改版）。\n\n4. 
**从开发者到所有人**：  \n   通过自然语言使高级自动化触手可及，提升通用性并最大限度地减少 LLM 和视觉模型的整体数据消耗。内置辅助功能，旨在通过空间思维可视化的新技术方法改善人类对计算机的使用体验，并在保障安全的前提下，正确解析任何自然语言指令，规划并在操作系统中执行相应操作。\n\n通过**直接对接 Windows 底层 UI 层次结构**，它能够在操作系统级别实现实时的空间感知，同时摒弃传统的计算机视觉流水线，从而实现：\n*   **效率提升 100 倍**：原生 API 访问。\n*   **无头运行**：可在无显示器系统、虚拟机或最小化窗口中正常工作。\n*   **精确抽象**：采用数学建模而非视觉模式匹配来描述 GUI 之间的关系。\n\n**设计之初即不依赖图像（核心架构）**  \n尽管某些项目在基础运行层面*需要*视觉处理，但 PyWinAssistant 却通过以下方式实现了**完全无需图像处理即可进行 GUI 交互的能力**：\n\n1. **原生操作系统语义访问**  \n   直接集成 Windows UIA API，提供完整的控制元数据：  \n   ```python\n   # 通过 UIA 查找元素属性——无需截图\n   button = uia.Element.find(Name=\"提交\", ControlType=\"按钮\")\n   print(button.BoundingRectangle)  # {x: 120, y: 240, 宽度: 80, 高度: 30}\n   ```  \n2. **成像模块**  \n\n   ```diff\n   # PyWinAssistant 的成像功能，如像素级可视化，可以在实时空间感知的基础上启用，同时具备记忆视觉线索和追踪屏幕变化的能力。\n   + 能够仅凭一张截图，就规划出一套高难度的技术性操作步骤，在操作系统层面完成复杂任务。\n   + 支持像素级可视化。\n   + 可为动态元素启用视觉哈希匹配。\n   - 提供 OCR 回退或针对非 UIA 旧版应用的对象检测功能。\n   # OCR 的实验性功能已被加入，但并未完全开发，因为在当前实现中，助手的表现已经非常出色，因此并不需要这一功能。\n   ```  \n\n| **关键区别** | PyWinAssistant | 传统自动化 |  \n|-|----------------|------------------------|  \n| **主要感知方式** | UIA 元数据 | 截图\u002FOCR |  \n| **对视觉的依赖** | 可选附加 | 核心必需 |  \n| **无头环境支持** | ✅ 原生支持 | ❌ 需要虚拟显示器 |  \n\n---\n\n### **开发说明：**\nPyWinAssistant 的功能受限于模型的智能水平和推理时间。要实现一个由人工智能窄领域系统协同管理的完整通用人工智能系统，还需要大语言模型领域的进一步突破。\n该系统的自主任务分解机制利用**原生语义差异**，而非视觉变化；当然，也可以选择启用视觉变化功能，以便在 GUI\u002F屏幕界面上进行实时图像损坏分析。\n长期记忆和自我学习机制旨在演化出**符号化的状态表示**，这些表示同样可以转化为视觉模式，从而与 AGI 的发展保持一致。\n\n相关论文：《思维可视化激发大型语言模型的空间推理能力》（2024年4月4日）：\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_026e5ad7b833.png)\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F2404.03622\n\n# 概述\n\nPyWinAssistant 内置了多种助理功能，旨在提升所有用户的⼈机交互体验。它集成了实时语音识别、可定制的助理人格、字幕显示以及聊天功能。\n您可以以友好自然的方式与计算机对话，完成任何用户界面操作。\n通过自然语言，您可以自由操控 Windows 操作系统。\n只需使用自然语言，即可为任何支持 Win32 API 的应用程序生成并规划测试用例，实现持续测试。\n这是一款开放且安全的个人助理，能够按照您的需求作出响应，并让您以期望的方式掌控计算机的协助方式。\n其模块化设计使其能够理解并执行广泛的任务，从而自动化与各类桌面应用程序的交互。\n\n# 
演示（以下视频）\n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_29ce5ee65bd8.png)\n\n![image](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_b097438d8848.png)\n\n![Screenshot 2023-12-18 043612](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_b99dd067f243.png)\n\n![Screenshot 2023-12-18 040443](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_1cbda5140e44.png)\n\n![Screenshot 2023-12-01 143812](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_79cc5c2408e3.png)\n\n![Screenshot 2023-12-01 150047](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_7b30bd83d40d.png)\n\n![Screenshot 2023-11-13 161219](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_readme_88a30e233018.png)\n\n---\n\n## 请打开音频观看演示视频。\n语音 1 — 用户输入（英语女性，澳大利亚 TTS）\n\n语音 2 — 助理输出（英语女性，美国 Google TTS）\n\n---\n\n### 用自然语言操控电脑——实时使用 VoT，作为计算机使用代理的一个例子；单动作模型。\n不使用任何视觉信息，仅通过 API 调用大语言模型。演示多步指令的完美执行。\n\nhttps:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fassets\u002F18397328\u002F25b39d8c-62d6-442e-9d5e-bc8a35aa971a\n\n---\n\n### 把电脑当作助手来使用——实时使用规划型 VoT，作为计算机使用代理的一个例子；大型动作模型。\n**仅需一张截图**：助理会了解用户当前的操作及目标，然后制定计划来完成任务。\n``` \n语音识别到的指令：在 Twitter 上发布一条新帖子，内容为“Hello World”以及一段简短的问候，说明你是一个人工智能。\n```\nhttps:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fassets\u002F18397328\u002Fd04f0609-68fb-4fb4-9ac3-279047c7a4f7\n\n---\n\n### 助理可以为您完成任何事情——实时使用规划型 VoT，作为计算机使用代理的一个例子；大型动作模型。\n速度的唯一限制就是推理时间。\n``` \n语音识别到的指令：创建一条新的评论，解释为什么这件事如此重要。\n```\nhttps:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fassets\u002F18397328\u002F6d3bb6e6-ccf8-4380-bc89-df512ae207f2\n\n---\n\n### 其他实时使用规划型 VoT 的演示。\n\n2023年11月16日现场演示：（Firefox、Spotify、记事本、计算器、邮件）\n```python\nassistant(goal=f\"从谷歌搜索结果中筛选视频，打开包含歌曲‘Wall Of Eyes - The Smile’的新标签页，并在 Firefox 
中播放\")  # 完美运行\nassistant(goal=f\"暂停 Spotify 中的音乐\")  # 完美运行\nassistant(goal=f\"使用 AI 自动化窗口，在 notepad.exe 中为用户创建一段简短的问候文字\")  # 完美运行\nassistant(goal=f\"打开 calc.exe 并计算 4 x 4 =\")  # 完美运行\n```\nhttps:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fassets\u002F18397328\u002Fce574640-5f20-4b8e-84f9-341fa102c0e6\n\n---\n\n2023年12月1日现场演示：（Chrome、Spotify、Firefox）可编程方法示例。\n```python\nassistant(goal=f\"在 Spotify 上播放歌曲‘Robot Rock - Daft Punk’\", keep_in_mind=f\"双击歌曲即可开始播放\")  # 完美运行\nassistant(goal=f\"在 Google Chrome 中打开 3 个新标签页，分别搜索 3 种不同类型的搞笑 AI 表情包\", keep_in_mind=\"将搜索结果过滤为图片\")  # 完美运行\nassistant(goal=f\"从谷歌搜索结果中筛选视频，打开包含歌曲‘Windows 95 but it's a PHAT hip hop beat’的新标签页，然后在 Firefox 上点击文本进行播放\")  # 完美运行\n```\nhttps:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fassets\u002F18397328\u002F7e0583d1-1c19-40fa-a750-a77fff98a6da\n\n目前支持所有基于 Win32 API 的通用应用程序，包括：\nChrome、Firefox、OperaGX、Discord、Telegram、Spotify……\n\n---\n\n# 核心功能\n- 动态用例生成器：`assistant()` 函数接受一个目标参数，即自然语言指令，并将其智能映射为一系列可执行的操作。这使得用户意图能够无缝转换为计算机上的有效操作。\n1. 单步动作执行：\n`act()` 函数是一种简化的动作执行方法，提升了工具的效率和响应速度。\n2. 高级上下文处理：该框架通过分析屏幕和应用程序来理解上下文，确保在执行操作时考虑到必要的前提条件或步骤。\n3. 语义路由地图：框架内置了一个语义路由地图数据库，用于成功执行生成的测试用例。这些语义地图可以由其他 AI 创建。\n4. 广泛的应用范围：从多媒体控制（如在 Spotify 和 YouTube 上播放或暂停音乐）到复杂操作（如生成 AI 文本、发送电子邮件，或管理 Telegram、Firefox 等应用程序），该框架覆盖了广泛的任务类型。\n5. 可定制的 AI 身份：`write_action()` 函数允许自定义助手身份，从而实现个性化的交互和响应，以符合用户的偏好或任务性质。\n6. 强大的错误处理与反馈机制：框架设计用于优雅地处理意外情况，提供清晰的反馈并确保可靠性。（见概述）\n7. 情绪与性格驱动的项目：根据您的情绪和性格，不时生成或建议实用场景。（见概述）\n\n\n# 技术创新\n1. 自然语言处理 (NLP)：采用先进的 NLP 技术，以自然、对话式的方式解析和理解用户命令。\n2. 任务自动化算法：利用复杂的算法将复杂任务分解为可执行的步骤。\n3. 上下文感知执行：集成上下文感知能力，实现更精细且高效的任务执行。\n4. 跨应用功能：能够无缝对接各类应用程序和网络服务，展现出广泛的兼容性和集成能力。\n5. 使用场景。\n6. 自动化 Windows 环境中的重复性任务。\n7. 为专业人士和普通用户简化工作流程。\n8. 提升不同需求用户的可访问性，支持通过语音或简单文本命令控制复杂操作。\n9. 
通过 AI 驱动的任务指导与执行，帮助学习和探索。\n\n\n# 结论\n这款人工辅助的用户界面测试框架是桌面自动化领域的一项开创性工具。它能够以自然、直观的方式理解和执行多种命令，对于希望提升生产力及与 Windows 环境交互体验的人来说，无疑是一项宝贵的资产。这不仅仅是一个工具，更是迈向未来的重要一步——让 AI 无缝融入我们的日常计算任务中，使技术更加易用、更加人性化。\n\n# 安装\n```python\n# 将您的 ChatGPT API 密钥添加到项目中：\n在 \u002Fcore\u002Fcore_api.py 的第 3 行添加您的 API 密钥：client = OpenAI(api_key='insert_your_api_key_here')\n在 \u002Fcore\u002Fcore_imaging.py 的第 12 行添加您的 API 密钥：api_key = 'insert_your_api_key_here'\n\n# 安装依赖项：\ncd pywinassistant\npip install -r .\\requirements.txt\n\n# 执行助手程序：\ncd .\\core\npython .\u002Fassistant.py\n```\n\n# 使用方法\n运行 `Assistant.py`，说出“Ok computer”即可通过语音命令启用助手；您也可以点击助手图标或打开聊天窗口来快速执行操作。右键单击助手界面，即可查看可用选项。\n\n若需进入调试模式，请运行 `Driver.py`。在该文件中，您可以使用示例轻松调试并尝试与助手协同工作的 `act`、`fast_act` 和 `assistant` 等函数。要运行 JSON 测试用例，只需修改 `assistant` 函数中的 JSON 文件路径即可。\n\n# 有效案例（在 cases.py 中）\n\n``` \nassistant(goal=f\"在 Spotify 上播放歌曲 'One More Time - Daft Punk'\")  # 100% 有效\nassistant(goal=f\"从 Google 搜索结果中筛选视频，打开包含歌曲 'Wall Of Eyes - The Smile' 的新标签页，并在 Firefox 浏览器中播放\")  # 100% 有效\nassistant(goal=f\"从 Google 搜索结果中筛选视频，找到歌曲 'Windows XP Error beat'，然后在 Firefox 浏览器中点击文本以播放\")  # 100% 有效\nfast_act(goal=f\"点击‘喜欢’按钮\") # 100% 有效\nassistant(goal=f\"暂停 Spotify 上的音乐\")  # 100% 有效\nwrite_action(goal=\"评论当前正在播放的歌曲为何很棒\", assistant_identity=\"你是一位专注于音乐的高级音乐 AI 代理\") # 100% 有效\nassistant(f\"创作一篇关于 AI 开始通过记事本控制 Windows 计算机的长篇 AI 文章\")  # 100% 有效\nfast_act(goal=\"点击 HueSync 应用程序底部的按钮\")  # 100% 有效\nwrite_action(goal=\"Weird Fishes - Radiohead\")  # 100% 有效\nassistant(f\"打开计算器并计算 4 x 4 - 4 * 4 + 1 =\")  # 100% 有效\nassistant(goal=f\"在 Google Chrome 中打开 3 个新标签页，在每个标签页中搜索 3 种不同类型的搞笑狗狗图片\")  # 100% 有效\nassistant(goal=f\"停止 Firefox 应用程序中的播放\")  # 100% 有效\nassistant(f\"将一则关于工程师的笑话步骤清单以文章形式发送给我的朋友 Diana，使用 Telegram\")  # 100% 有效\nassistant(f\"将制作巧克力蛋糕的步骤清单发送到我在 Telegram 中的收藏消息\")  # 100% 有效\nassistant(f\"在 Firefox 浏览器中创建三个新标签页，分别搜索三种不同类型的搞笑 YouTube 劣质教程视频，并生成相应的搜索标题\")  # 100% 有效\nassistant(f\"写一篇关于一个人创建的 AI 如何像你一样自由操控计算机的文章，使用 notepad.exe 编写\") # 100% 
有效\nassistant(f\"向我的朋友 Diana 在 Discord 上发送一则由 AI 生成的笑话，并说明这是 AI 创作的\")  # 100% 有效\nassistant(goal=f\"使用 AI 自动化 Windows 系统，在 notepad.exe 中为用户创建一段简短的问候语\") # 100% 有效\nassistant(goal=f\"打开 calc.exe 并计算 4 x 4 =\")  # 100% 有效\nassistant(goal=f\"向 'testmail@gmail.com' 发送一封主题为 'Hello' 的邮件，并在邮件应用程序中生成一条关于 AI 如何帮助所有用户的留言\",\n          keep_in_mind=\"按三次 Tab 键导航到主题输入框。不要合并操作步骤。\")  # 需要更新应用程序的语义地图才能实现 100% 有效。\nassistant(goal=f\"在 Spotify 上播放歌曲 'The Smile - Wall Of Eyes'\")  # 100% 有效\nassistant(goal=f\"在 Spotify 上播放歌曲 'Panda Bear - Tropic of cancer'\")  # 100% 有效\nassistant(goal=\"暂停 Spotify 应用程序中的音乐\")  # 100% 有效\nassistant(goal=f\"在 Firefox 浏览器中打开 3 个新标签页，每个标签页播放不同的 Daft Punk 歌曲\")  # 100% 有效\nfast_act(\"打开 Spotify 并搜索专辑 'Grimes - Visions'\")  # 100% 有效\nwrite_action(\"打开 Spotify 并搜索专辑 'Grimes - Visions'\")  # 100% 有效\nfast_act(\"点击 Spotify 中的第一个搜索结果\")  # 100% 有效\nfast_act(\"在 Spotify 上跳过当前歌曲，播放下一首\")  # 100% 有效\nfast_act(\"将该专辑添加到资料库\")  # 100% 有效\nfast_act(\"返回 Spotify 主页\")  # 100% 有效\nfast_act(\"将当前歌曲保存到我的 Spotify 资料库\")  # 100% 有效\n```\n\n\n# 当前 UI 测试方法\n### GUI 测试主要有三种方法，分别是：\n\n1. ***手动测试：***\n\n在手动测试中，测试人员会执行一系列操作，以检查应用程序是否正常运行，以及图形界面元素是否符合文档要求。然而，这种方法存在明显的缺点：耗时较长，测试覆盖率极低。此外，测试质量很大程度上取决于测试团队的知识和能力。\n\n2. ***录制与回放测试：***\n\n也称为记录与重放测试，它借助自动化工具来完成。自动化 UI 测试工具会记录用户对应用程序的所有操作、动作和交互过程，随后再重复这些步骤并将其与预期行为进行对比。为了进一步验证，可以使用不同的数据集多次重复回放过程。\n\n3. ***基于模型的测试：***\n\n在这种测试方法中，我们专注于构建描述系统行为的图形化模型。这有助于更深入地理解系统，从而生成高效的测试用例。在模型中，我们会明确系统的输入和输出，进而用于执行测试。基于模型的测试流程如下：\n\n    创建系统模型\n    确定系统输入\n    验证预期输出\n    执行测试\n    比较并验证系统实际输出与预期输出\n\n基于模型的方法具有显著优势，因为它能够实现更高程度的自动化，同时覆盖更多的系统状态，从而提高测试覆盖率。\n\n# 基于人工智能的新型UI测试方法\n4. 
***人工智能辅助用户界面测试：***\n\n人工智能辅助用户界面测试利用人工智能的强大功能，彻底革新了图形用户界面的测试流程。与传统方法不同，该方法集成了机器学习算法和智能决策机制，能够自主识别、分析并操作UI元素。这一方法从多个方面显著提升了测试的深度和广度：\n\n    与UI元素的动态交互：基于AI的测试可以适应UI的变化，例如按钮位置的调整或元素属性的改变。这种灵活性得益于训练有素的AI模型，这些模型能够识别并操作各种UI组件，而无需关注表面变化。\n    学习与模式识别：通过机器学习技术，人工智能辅助UI测试系统可以从以往的交互、测试运行以及用户反馈中不断学习。这使得AI能够识别模式并预测潜在问题，从而在每次迭代中不断提升测试效果，实现更全面的测试覆盖。\n    自动化测试用例生成：AI可以根据对应用程序功能和用户行为模式的理解，自动生成测试用例。这不仅节省了时间，还能确保覆盖更广泛的场景，包括手动测试中可能被忽略的边界情况。\n    自然语言处理（NLP）：AI测试工具通常会集成NLP技术，以解析并执行用自然语言编写的测试脚本。这一特性使非技术人员也能更轻松地参与测试过程，并促进团队内部的沟通协作。\n    实时反馈与数据分析：AI系统能够提供测试过程中的实时洞察，及时发现缺陷、性能问题和可用性问题。这种即时反馈机制有助于快速修复问题，从而提升产品的整体质量。\n    预测分析与风险评估：通过分析历史数据，人工智能辅助UI测试工具可以预测潜在的问题区域，并更高效地分配测试资源。这种主动的风险管理方式确保在开发周期的早期就能识别并解决关键问题。\n\n综上所述，人工智能辅助用户界面测试标志着软件质量保证领域的一次重大飞跃。通过自动化和优化测试流程，基于AI的工具能够提供更高的准确性、更快的速度和更全面的覆盖，为开发更加可靠、更易用的应用程序铺平了道路。\n\n\n### 备注：\n\n该项目自2024年初开始持续更新中。需求列表也在不断更新。","# PyWinAssistant 快速上手指南\n\nPyWinAssistant 是一款开源的窄人工智能（ANI）代理框架，专为 Windows 10\u002F11 设计。它不依赖计算机视觉、OCR 或像素级图像识别，而是通过原生的 Windows 无障碍 API (UI Automation) 直接理解图形用户界面 (GUI) 的语义结构，实现自然语言驱动的电脑操作。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**：Windows 10 或 Windows 11（必须为原生系统，不支持 macOS 或 Linux）。\n*   **Python 版本**：Python 3.8 或更高版本。\n*   **前置依赖**：\n    *   已安装 `pip` 包管理工具。\n    *   需要配置有效的 LLM API 密钥（如 OpenAI GPT-4 等，用于意图解析和动作规划）。\n    *   确保目标应用程序（如 Spotify, Chrome, Outlook 等）处于可被系统无障碍服务访问的状态（通常默认开启）。\n\n## 安装步骤\n\n使用 pip 直接从 PyPI 安装最新稳定版：\n\n```bash\npip install pywinassistant\n```\n\n> **提示**：如果下载速度较慢，可使用国内镜像源加速安装：\n> ```bash\n> pip install pywinassistant -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n安装完成后，您需要在代码中配置您的 LLM API 密钥（具体配置方式请参考库的环境变量设置或初始化参数）。\n\n## 基本使用\n\nPyWinAssistant 的核心优势在于将自然语言直接转化为 GUI 操作。无需编写复杂的坐标点击或控件查找脚本。\n\n### 1. 最简单的单步操作\n\n只需调用 `assistant` 函数并传入自然语言指令，代理会自动分析当前界面语义并执行操作。\n\n```python\nfrom pywinassistant import assistant\n\n# 示例：控制音乐播放并发送邮件\n# 代理会自动识别 Spotify 的播放按钮和 Outlook 的撰写窗口\nassistant(\"Play Daft Punk on Spotify and email the lyrics to my friend\")\n```\n\n### 2. 
跨应用工作流\n\nPyWinAssistant 能够维护跨应用的状态和上下文，实现复杂的多步骤自动化。\n\n```python\nfrom pywinassistant import assistant, write_action\n\n# 示例：获取当前歌曲 -> 写入记事本 -> 发布到 Twitter\ndef workflow():\n    # 从当前活动窗口获取正在播放的歌曲信息\n    song = assistant(goal=\"get the current track\")\n    \n    # 将信息写入记事本 (Win32 API 交互)\n    write_action(f\"Review '{song}': Great bassline!\", app=\"Notepad\")\n    \n    # 将记事本内容发布到 Twitter (Web 交互)\n    assistant(goal=\"Post on twitter the written text from notepad\")\n\n# 或者直接使用一条自然语言指令完成上述所有步骤\nassistant(\"Get the current song playing, write a review in Notepad saying it has a great bassline, and then post it on Twitter\", \n          assistant_identity=\"You're an expert music critic\")\n```\n\n### 3. 自适应元素操作\n\n无需硬编码控件 ID 或坐标，代理会通过语义推理自动定位元素，即使界面发生微调也能自动适应（自愈能力）。\n\n```python\n# 示例：点赞当前歌曲\n# 传统方式需要脆弱的坐标控制，PyWinAssistant 仅需语义指令\nassistant(\"Like this song\")\n```\n\n**注意**：部分高级功能（如内存内容检索）可能出于伦理和安全标准在默认版本中被禁用，具体行为请以实际运行结果和官方最新文档为准。","财务分析师小李需要每天从多个独立的桌面软件（如 ERP 系统、Excel 和本地税务工具）中抓取数据并汇总成报表，这些软件界面复杂且缺乏 API 接口。\n\n### 没有 pywinassistant 时\n- **工具割裂严重**：必须分别编写 Selenium 处理网页、PyAutoGUI 模拟鼠标点击、以及专用脚本解析 Excel，频繁在不同技术栈间切换导致开发效率极低。\n- **维护成本高昂**：一旦软件界面更新导致按钮位置或 ID 变化，基于固定坐标或图像匹配的脚本就会立即失效，需要人工逐行修复代码。\n- **无法理解语义**：传统自动化只能机械执行“点击坐标 (x,y)\"，无法识别“提交按钮”或“错误弹窗”的实际含义，遇到异常流程只能盲目报错。\n- **依赖视觉不稳定**：基于 OCR 或截图的方案受屏幕分辨率、主题颜色变化影响大，环境稍有变动就无法准确提取数据。\n\n### 使用 pywinassistant 后\n- **统一自动化框架**：pywinassistant 通过 Windows 原生辅助功能 API，用一套 Python 代码即可无缝串联 GUI 操作、网页交互和系统命令，消除上下文切换。\n- **具备自愈能力**：利用符号空间映射和层级关系追踪，即使界面元素 ID 变更或布局微调，pywinassistant 也能自动重新定位目标，大幅降低维护工作量。\n- **深度语义理解**：借助思维链（CoT）推理，它能像人一样理解“如果弹出错误则关闭”的逻辑，直接操作控件元数据而非像素，实现真正的智能决策。\n- **非视觉稳定运行**：不依赖截图或 OCR，直接提取控件的状态、类型和位置信息，确保在任何分辨率或主题下都能精准、高速地执行任务。\n\npywinassistant 通过将自然语言指令转化为对操作系统深层语义的理解，让复杂的跨软件桌面工作流实现了真正的“自动驾驶”。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fa-real-ai_pywinassistant_1cbda514.png","a-real-ai","A Real AI","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fa-real-ai_0401da64.jpg","An open-source 
initiative.",null,"https:\u002F\u002Fgithub.com\u002Fa-real-ai",[83],{"name":84,"color":85,"percentage":86},"Python","#3572A5",100,1330,189,"2026-04-01T17:18:28","MIT","Windows 10, Windows 11","未说明（工具主要依赖 Windows 原生 API 和 LLM API，不强制要求本地 GPU）","未说明",{"notes":95,"python":93,"dependencies":96},"该工具专为 Windows 设计，通过原生 Windows 无障碍 API (UIA) 进行交互，不依赖计算机视觉、OCR 或像素级图像处理。核心功能需要调用外部大语言模型（如 GPT-4）API 来进行意图解析和坐标预测。代码示例中提及部分功能（如记忆内容检索）为了符合伦理标准已被禁用。",[97,98],"Windows UI Automation (UIA)","LLM API (如 gpt-4-1106-preview)",[15,14],[101,102,103,104,105,106,107,108,109,110,111],"artificial-general-intelligence","artificial-intelligence-algorithms","cot","vot","artificial-narrow-intelligence","ui","ui-ux","ux","cua","computer-using-agent","graphical-user-interface","2026-03-27T02:49:30.150509","2026-04-06T07:05:55.427129",[115,120,125,130,135,140],{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},16029,"如何正确安装和配置项目以解决模块缺失和数据库错误？","1. 下载所有文件并将驱动文件移动到 'core' 文件夹。\n2. 移除所有导入语句中的 'core.' 前缀（例如将 'from core.voice' 改为 'from voice'）。\n3. 安装缺失的依赖包，特别是 'pyaudio'。\n4. 在 'core' 文件夹内手动创建一个名为 'history.db' 的空数据库文件（或者确保代码路径指向正确位置，如将路径从 r'Database\\history.db' 改为 r'history.db'）。\n5. 确保在 'core' 文件夹目录下运行命令：python assistant.py（或使用 python -m core.assistant）。\n6. 参考项目的 PR #9 获取完整的修复补丁。","https:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fissues\u002F5",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},16030,"遇到 'KeyError: choices' 错误该如何解决？","该错误通常是因为未正确配置 OpenAI API 密钥导致的。请检查并修改以下两个文件：\n1. 打开 '\u002Fcore\u002Fcore_api.py'，在第 3 行添加你的密钥：client = OpenAI(api_key='insert_your_api_key_here')\n2. 
打开 '\u002Fcore\u002Fcore_imaging.py'，在第 12 行添加你的密钥：api_key = 'insert_your_api_key_here'\n替换 'insert_your_api_key_here' 为你实际的 API 密钥，并确保你的 OpenAI 账户有有效的计费计划。","https:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fissues\u002F15",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},16031,"安装时提示找不到 'speech_recognition' 模块或导入错误怎么办？","这是安装包名与导入模块名不一致导致的问题。PyPI 上的安装包名称是 'SpeechRecognition'（注意大写 S 和 R），而代码中导入的模块名是 'speech_recognition'。\n1. 确保在 requirements.txt 和安装命令中使用正确的包名：pip install SpeechRecognition。\n2. 在代码导入时，使用小写加下划线的模块名：import speech_recognition as sr。\n如果仍然报错，请检查是否已正确安装包，并确认没有拼写错误。","https:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fissues\u002F6",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},16032,"运行时提示找不到数据库文件 (history.db) 如何解决？","程序无法自动创建数据库文件或路径配置有误。解决方法如下：\n1. 手动在 'core' 文件夹中创建一个名为 'history.db' 的空文件。\n2. 修改代码中的数据库路径配置。在 'driver.py' 文件中，将路径 r'Database\\history.db' 修改为 r'history.db'（假设脚本在 core 目录下运行），或者确保 'Database' 文件夹存在且路径正确。\n3. 参考 PR #9 中的代码更改以应用永久修复。","https:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fissues\u002F4",{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},16033,"导入时出现 'No module named core' 错误的原因及解决方法？","这通常是由于导入路径错误或运行方式不当引起的。\n解决方法 1（代码修正）：移除导入语句中的 'core.' 前缀，并将相关文件（如 driver.py）移动到 'core' 文件夹内，使导入路径相对化。\n解决方法 2（运行方式）：不要直接运行 assistant.py，而是使用模块运行方式：python -m core.assistant。\n此外，确保手动创建了 'Database' 文件夹并配置了正确的媒体文件绝对路径。","https:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Fissues\u002F2",{"id":141,"question_zh":142,"answer_zh":143,"source_url":119},16034,"在哪里插入 OpenAI API 密钥？创建的 history.db 文件应该是空的吗？","API 密钥需要插入到两个特定的核心文件中：\n1. \u002Fcore\u002Fcore_api.py (第 3 行)\n2. 
\u002Fcore\u002Fcore_imaging.py (第 12 行)\n格式为：api_key = '你的密钥'。\n关于 history.db 文件，是的，初次使用时应手动创建一个空的 'history.db' 文件放在 'core' 文件夹中，程序后续会向其中写入数据。",[145],{"id":146,"version":147,"summary_zh":148,"released_at":149},90699,"Pre-Database","请使用 “[pywinassistant-main.zip](https:\u002F\u002Fgithub.com\u002Fa-real-ai\u002Fpywinassistant\u002Freleases\u002Fdownload\u002FPre-Database\u002Fpywinassistant-main.zip)” 或直接下载主仓库代码，因为它包含了社区修复，使其成为一个稳定且面向未来的版本。\n\n已添加将生成的用例插入数据库的功能。\n\n待办事项：\n- 实现其他鼠标操作功能，如从 A 到 B 的区域选择、拖拽、移动滑块等（计划在 v0.5.0 至 v1.0.0 版本期间开发）。\n- 读取数据库，仅返回有用的映射信息（目标版本为 v0.6.0）。\n- 添加本地缓存，生成按钮和元素的实际最小截图，以加快定位速度（目标版本为 v0.6.5）。\n- 在测试过程中实时修改测试用例，并在每一步对其进行分析（目标版本为 v0.7.0）。\n- 实现分析工具，用于检查并确保用例正确执行了 ID 操作（目标版本为 v0.8.0）。\n- 集成其他 LMM API（目标版本为 v0.9.0）。\n- 添加外部通用应用模板数据库（目标版本为 v0.9.5）。\n- 对 GPT 模型进行微调，并实现本地化运行（目标版本为 v1.0.0）。\n\n助手建议与用户交互的实现：\n助手将逐步了解用户的日常行为，例如判断用户的情绪状态。如果助手感知到用户心情愉悦，便可分享一些快乐的瞬间，并寻找进一步提升幸福感的方式，比如推荐歌曲、视频或生成图片等，从而增强与用户的互动性。此外，还可以引入个性化设置，更深入地了解用户的喜好，以便提供更加精准的推荐（例如基于用户偏好）。同时，可接入摄像头，进一步捕捉用户的表情、行为及情绪变化，以更好地理解其个性特征（目标版本为 v1.0.0 至 v2.0.0）。","2024-01-03T23:56:15"]