[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-microsoft--AIOpsLab":3,"similar-microsoft--AIOpsLab":123},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":17,"owner_company":18,"owner_location":18,"owner_email":19,"owner_twitter":20,"owner_website":21,"owner_url":22,"languages":23,"stars":59,"forks":60,"last_commit_at":61,"license":62,"difficulty_score":63,"env_os":64,"env_gpu":65,"env_ram":66,"env_deps":67,"category_tags":79,"github_topics":18,"view_count":83,"oss_zip_url":18,"oss_zip_packed_at":18,"status":84,"created_at":85,"updated_at":86,"faqs":87,"releases":122},5110,"microsoft\u002FAIOpsLab","AIOpsLab","A holistic framework to enable the design, development, and evaluation of autonomous AIOps agents.","AIOpsLab 是一个专为设计、开发和评估自主 AIOps（智能运维）智能体而打造的全方位框架。它致力于解决当前智能运维领域缺乏标准化、可复现且可扩展基准测试的痛点，让研究人员和开发者能够在一个可控的环境中验证算法效果。\n\n通过 AIOpsLab，用户可以轻松部署微服务云环境、模拟各类故障注入、生成复杂工作负载并导出遥测数据。框架不仅协调这些组件的运行，还提供了统一的交互接口，内置了一套丰富的基准测试套件，支持对智能体进行交互式评估，且可根据特定需求灵活扩展。\n\n这款工具特别适合从事系统可靠性工程的研究人员、开发自主运维代理的工程师，以及需要构建标准化评测体系的团队。其独特的技术亮点在于将环境仿真、故障模拟与智能体评估无缝集成，支持本地模拟集群（基于 Kind）等多种部署方式，确保了实验的高度可复现性与互操作性。无论是学术探索还是工业界落地，AIOpsLab 都能为构建更聪明的运维助手提供坚实基石。","\u003Cdiv align=\"center\">\n\n\u003Ch1>AIOpsLab\u003C\u002Fh1>\n\n[🤖Overview](#🤖overview) | \n[🚀Quick Start](#🚀quickstart) | \n[📦Installation](#📦installation) | \n[⚙️Usage](#⚙️usage) | \n[📂Project Structure](#📂project-structure) |\n[📄How to Cite](#📄how-to-cite)\n\n[![ArXiv Link](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2501.06706-red?logo=arxiv)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.06706)\n[![ArXiv Link](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2407.12165-red?logo=arxiv)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.12165)\n\u003C\u002Fdiv>\n\n\n\n\u003Ch2 id=\"🤖overview\">🤖 Overview\u003C\u002Fh2>\n\n![alt 
text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_AIOpsLab_readme_e7a348e37be6.png)\n\n\nAIOpsLab is a holistic framework to enable the design, development, and evaluation of autonomous AIOps agents that, additionally, serve the purpose of building reproducible, standardized, interoperable and scalable benchmarks. AIOpsLab can deploy microservice cloud environments, inject faults, generate workloads, and export telemetry data, while orchestrating these components and providing interfaces for interacting with and evaluating agents. \n\nMoreover, AIOpsLab provides a built-in benchmark suite with a set of problems to evaluate AIOps agents in an interactive environment. This suite can be easily extended to meet user-specific needs. See the problem list [here](\u002Faiopslab\u002Forchestrator\u002Fproblems\u002Fregistry.py#L15).\n\n\u003Ch2 id=\"📦installation\">📦 Installation\u003C\u002Fh2>\n\n### Requirements\n- Python >= 3.11\n- [Helm](https:\u002F\u002Fhelm.sh\u002F)\n- [Poetry](https:\u002F\u002Fpython-poetry.org\u002Fdocs\u002F) (recommended) or pip\n- Additional requirements depend on the deployment option selected, which is explained in the next section\n\n### Step 1: Install Python 3.11\n```bash\nsudo apt update\nsudo apt install python3.11 python3.11-venv python3.11-dev -y\n```\n\n### Step 2: Install Poetry (Official Installer)\n```bash\n# Use the official installer (NOT apt - the apt version is outdated)\ncurl -sSL https:\u002F\u002Finstall.python-poetry.org | python3.11 -\nexport PATH=\"$HOME\u002F.local\u002Fbin:$PATH\"\n\n# Add to your shell profile for persistence\necho 'export PATH=\"$HOME\u002F.local\u002Fbin:$PATH\"' >> ~\u002F.bashrc\n```\n\n> **Warning**: Do NOT use `sudo apt install python3-poetry` - it installs an outdated version that may not work with the lock file.\n\n### Step 3: Clone and Install\n```bash\ngit clone --recurse-submodules \u003CCLONE_PATH_TO_THE_REPO>\ncd AIOpsLab\npoetry env use python3.11\npoetry 
install\neval $(poetry env activate)\n```\n\n> **Troubleshooting**: If you get a \"lock file not compatible\" error, run `poetry lock` first, then `poetry install`.\n\nAlternative installation with pip:\n```bash\npip install -e .\n```\n\n\u003Ch2 id=\"🚀quickstart\">🚀 Quick Start \u003C\u002Fh2>\n\n\u003C!-- TODO: Add instructions for both local cluster and remote cluster -->\nChoose either a) or b) to set up your cluster and then proceed to the next steps.\n\n### a) Local simulated cluster\nAIOpsLab can be run on a local simulated cluster using [kind](https:\u002F\u002Fkind.sigs.k8s.io\u002F) on your local machine. Please look at this [README](kind\u002FREADME.md#prerequisites) for a list of prerequisites.\n\n```bash\n# For x86 machines\nkind create cluster --config kind\u002Fkind-config-x86.yaml\n\n# For ARM machines\nkind create cluster --config kind\u002Fkind-config-arm.yaml\n```\n\nIf you're running into issues, consider building a Docker image for your machine by following this [README](kind\u002FREADME.md#deployment-steps). Please also open an issue.\n\n### [Tips]\nIf you are running AIOpsLab using a proxy, beware of exporting the HTTP proxy as `172.17.0.1`. When creating the kind cluster, all the nodes in the cluster will inherit the proxy setting from the host environment and the Docker container. \n\nThe `172.17.0.1` address is used to communicate with the host machine. For more details, refer to the official guide: [Configure Kind to Use a Proxy](https:\u002F\u002Fkind.sigs.k8s.io\u002Fdocs\u002Fuser\u002Fquick-start\u002F#configure-kind-to-use-a-proxy).\n\nAdditionally, Docker doesn't support SOCKS5 proxy directly. If you're using a SOCKS5 protocol to proxy, you may need to use [Privoxy](https:\u002F\u002Fwww.privoxy.org) to forward SOCKS5 to HTTP.\n\nIf you're running VLLM and the LLM agent locally, Privoxy will by default proxy `localhost`, which will cause errors. 
To avoid this issue, you should set the following environment variable:\n\n```bash\nexport no_proxy=localhost\n``` \n\nAfter finishing cluster creation, proceed to the next \"Update `config.yml`\" step.\n\n### b) Remote cluster (Manual setup with Ansible)\nAIOpsLab supports any remote kubernetes cluster that your `kubectl` context is set to, whether it's a cluster from a cloud provider or one you build yourself. We have some Ansible playbooks to setup clusters on providers like [CloudLab](https:\u002F\u002Fwww.cloudlab.us\u002F) and our own machines. Follow this [README](.\u002Fscripts\u002Fansible\u002FREADME.md) to set up your own cluster, and then proceed to the next \"Update `config.yml`\" step.\n\n### c) Azure VMs with Terraform + Ansible (Recommended for cloud)\nSingle command provisions VMs, sets up K8s, and configures AIOpsLab:\n\n```bash\n# Mode B (AIOpsLab on laptop, remote kubectl):\npython3 scripts\u002Fterraform\u002Fdeploy.py --apply --resource-group \u003Cyour-rg> --workers 2 --mode B\n\n# Mode A (AIOpsLab on controller VM, full fault injection support):\npython3 scripts\u002Fterraform\u002Fdeploy.py --apply --resource-group \u003Cyour-rg> --workers 2 --mode A\n```\n\nSee [Terraform README](.\u002Fscripts\u002Fterraform\u002FREADME.md) for all options (`--allowed-ips`, `--dev`, `--setup-only`, etc.).\n\n> **Note**: Mode B is convenient for development but some fault injectors (e.g., VirtualizationFaultInjector) require Docker on the local machine. Use Mode A for full functionality.\n\n### Update `config.yml`\n```bash\ncd aiopslab\ncp config.yml.example config.yml\n```\nUpdate your `config.yml` so that `k8s_host` is the host name of the control plane node of your cluster. Update `k8s_user` to be your username on the control plane node. If you are using a kind cluster, your `k8s_host` should be `kind`. 
If you're running AIOpsLab on the cluster itself, your `k8s_host` should be `localhost`.\n\n### Running agents locally\nHuman as the agent:\n\n```bash\npython3 cli.py\n(aiopslab) $ start misconfig_app_hotel_res-detection-1 # or choose any problem you want to solve\n# ... wait for the setup ...\n(aiopslab) $ submit(\"Yes\") # submit solution\n```\n\nRun the GPT-4 baseline agent:\n\n```bash\n# Create a .env file in the project root (note: `>` overwrites an existing .env; use `>>` to append instead)\necho \"OPENAI_API_KEY=\u003CYOUR_OPENAI_API_KEY>\" > .env\n# Add more API keys as needed:\n# echo \"QWEN_API_KEY=\u003CYOUR_QWEN_API_KEY>\" >> .env\n# echo \"DEEPSEEK_API_KEY=\u003CYOUR_DEEPSEEK_API_KEY>\" >> .env\n\npython3 clients\u002Fgpt.py # you can also change the problem to solve in the main() function\n```\n\nOur repository comes with a variety of pre-integrated agents, including agents that enable **secure authentication with Azure OpenAI endpoints using identity-based access**. Please check out [Clients](\u002Fclients) for a comprehensive list of all implemented clients.\n\nThe clients will automatically load API keys from your .env file.\n\nYou can conveniently check the running status of the cluster using [k9s](https:\u002F\u002Fk9scli.io\u002F) or other cluster monitoring tools.\n\nTo browse your logged `session_id` values in the W&B app as a table:\n\n1. Make sure you have W&B installed and configured.\n2. Set the `USE_WANDB` environment variable:\n    ```bash\n    # Add to your .env file\n    echo \"USE_WANDB=true\" >> .env\n    ```\n3. In the W&B web UI, open any run and click Tables → Add Query Panel.\n4. 
In the key field, type `runs.summary` and click `Run`, then you will see the results displayed in a table format.\n\n\u003Ch2 id=\"⚙️usage\">⚙️ Usage\u003C\u002Fh2>\n\nAIOpsLab can be used in the following ways:\n- [Onboard your agent to AIOpsLab](#how-to-onboard-your-agent-to-aiopslab)\n- [Add new applications to AIOpsLab](#how-to-add-new-applications-to-aiopslab)\n- [Add new problems to AIOpsLab](#how-to-add-new-problems-to-aiopslab)\n\n### Running agents remotely\nYou can run AIOpsLab on a remote machine with larger computational resources. This section guides you through setting up and using AIOpsLab remotely.\n\n1. **On the remote machine, start the AIOpsLab service**:\n\n    ```bash\n    SERVICE_HOST=\u003CYOUR_HOST> SERVICE_PORT=\u003CYOUR_PORT> SERVICE_WORKERS=\u003CYOUR_WORKERS> python service.py\n    ```\n2. **Test the connection from your local machine**:\n    In your local machine, you can test the connection to the remote AIOpsLab service using `curl`:\n\n    ```bash\n    # Check if the service is running\n    curl http:\u002F\u002F\u003CYOUR_HOST>:\u003CYOUR_PORT>\u002Fhealth\n    \n    # List available problems\n    curl http:\u002F\u002F\u003CYOUR_HOST>:\u003CYOUR_PORT>\u002Fproblems\n    \n    # List available agents\n    curl http:\u002F\u002F\u003CYOUR_HOST>:\u003CYOUR_PORT>\u002Fagents\n    ```\n\n3. **Run vLLM on the remote machine (if using vLLM agent):**\n    If you're using the vLLM agent, make sure to launch the vLLM server on the remote machine:\n\n    ```bash\n    # On the remote machine\n    chmod +x .\u002Fclients\u002Flaunch_vllm.sh\n    .\u002Fclients\u002Flaunch_vllm.sh\n    ```\n    You can customize the model by editing `launch_vllm.sh` before running it.\n\n4. 
**Run the agent**:\n    In your local machine, you can run the agent using the following command:\n\n    ```bash\n    curl -X POST http:\u002F\u002F\u003CYOUR_HOST>:\u003CYOUR_PORT>\u002Fsimulate \\\n      -H \"Content-Type: application\u002Fjson\" \\\n      -d '{\n        \"problem_id\": \"misconfig_app_hotel_res-mitigation-1\",\n        \"agent_name\": \"vllm\",\n        \"max_steps\": 10,\n        \"temperature\": 0.7,\n        \"top_p\": 0.9\n      }'\n    ```\n\n### How to onboard your agent to AIOpsLab?\n\nAIOpsLab makes it extremely easy to develop and evaluate your agents. You can onboard your agent to AIOpsLab in 3 simple steps:\n\n1. **Create your agent**: You are free to develop agents using any framework of your choice. The only requirements are:\n    - Wrap your agent in a Python class, say `Agent`\n    - Add an async method `get_action` to the class:\n\n        ```python\n        # given current state and returns the agent's action\n        async def get_action(self, state: str) -> str:\n            # \u003Cyour agent's logic here>\n        ```\n\n2. **Register your agent with AIOpsLab**: You can now register the agent with AIOpsLab's orchestrator. The orchestrator will manage the interaction between your agent and the environment:\n\n    ```python\n    from aiopslab.orchestrator import Orchestrator\n\n    agent = Agent()             # create an instance of your agent\n    orch = Orchestrator()       # get AIOpsLab's orchestrator\n    orch.register_agent(agent)  # register your agent with AIOpsLab\n    ```\n\n3. **Evaluate your agent on a problem**:\n\n    1. **Initialize a problem**: AIOpsLab provides a list of problems that you can evaluate your agent on. Find the list of available problems [here](\u002Faiopslab\u002Forchestrator\u002Fproblems\u002Fregistry.py) or using `orch.probs.get_problem_ids()`. 
Now initialize a problem by its ID: \n\n        ```python\n        problem_desc, instructs, apis = orch.init_problem(\"k8s_target_port-misconfig-mitigation-1\")\n        ```\n    \n    2. **Set agent context**: Use the problem description, instructions, and APIs available to set context for your agent. (*This step depends on your agent's design and is left to the user*)\n\n\n    3. **Start the problem**: Start the problem by calling the `start_problem` method. You can specify the maximum number of steps too:\n\n        ```python\n        import asyncio\n        asyncio.run(orch.start_problem(max_steps=30))\n        ```\n\nThis process will create a [`Session`](\u002Faiopslab\u002Fsession.py) with the orchestrator, where the agent will solve the problem. The orchestrator will evaluate your agent's solution and provide results (stored under `data\u002Fresults\u002F`). You can use these to improve your agent.\n\n\n### How to add new applications to AIOpsLab?\n\nAIOpsLab provides a default [list of applications](\u002Faiopslab\u002Fservice\u002Fapps\u002F) to evaluate agents for operations tasks. However, as a developer you can add new applications to AIOpsLab and design problems around them.\n\n> *Note*: for auto-deployment of some apps with K8S, we integrate Helm charts (you can also use `kubectl` to install as [HotelRes application](\u002Faiopslab\u002Fservice\u002Fapps\u002Fhotelres.py)). More on Helm [here](https:\u002F\u002Fhelm.sh).\n\nTo add a new application to AIOpsLab with Helm, you need to:\n\n1. 
**Add application metadata**\n    - Application metadata is a JSON object that describes the application.\n    - Include *any* field such as the app's name, desc, namespace, etc.\n    - We recommend also including a special `Helm Config` field, as follows:\n\n        ```json\n        \"Helm Config\": {\n            \"release_name\": \"\u003Cname for the Helm release to deploy>\",\n            \"chart_path\": \"\u003Cpath to the Helm chart of the app>\",\n            \"namespace\": \"\u003CK8S namespace where app should be deployed>\"\n        }\n        ```\n        > *Note*: The `Helm Config` is used by the orchestrator to auto-deploy your app when a problem associated with it is started.\n\n        > *Note*: The orchestrator will auto-provide *all other* fields as context to the agent for any problem associated with this app.\n\n    Create a JSON file with this metadata and save it in the [`metadata`](\u002Faiopslab\u002Fservice\u002Fmetadata) directory. For example the `social-network` app: [social-network.json](\u002Faiopslab\u002Fservice\u002Fmetadata\u002Fsocial-network.json)\n\n2. **Add application class**\n\n    Extend the base class in a new Python file in the [`apps`](\u002Faiopslab\u002Fservice\u002Fapps) directory:\n\n    ```python\n    from aiopslab.service.apps.base import Application\n\n    class MyApp(Application):\n        def __init__(self):\n            super().__init__(\"\u003Cpath to app metadata JSON>\")\n    ```\n\n    The `Application` class provides a base implementation for the application. You can override methods as needed and add new ones to suit your application's requirements, but the base class should suffice for most applications.\n\n\n\n### How to add new problems to AIOpsLab?\n\nSimilar to applications, AIOpsLab provides a default [list of problems](\u002Faiopslab\u002Forchestrator\u002Fproblems\u002Fregistry.py) to evaluate agents. 
However, as a developer you can add new problems to AIOpsLab and design them around your applications.\n\nEach problem in AIOpsLab has 5 components:\n1. *Application*: The application on which the problem is based.\n2. *Task*: The AIOps task that the agent needs to perform.\n Currently we support: [Detection](\u002Faiopslab\u002Forchestrator\u002Ftasks\u002Fdetection.py), [Localization](\u002Faiopslab\u002Forchestrator\u002Ftasks\u002Flocalization.py), [Analysis](\u002Faiopslab\u002Forchestrator\u002Ftasks\u002Fanalysis.py), and [Mitigation](\u002Faiopslab\u002Forchestrator\u002Ftasks\u002Fmitigation.py).\n3. *Fault*: The fault being introduced in the application.\n4. *Workload*: The workload that is generated for the application.\n5. *Evaluator*: The evaluator that checks the agent's performance.\n\nTo add a new problem to AIOpsLab, create a new Python file \nin the [`problems`](\u002Faiopslab\u002Forchestrator\u002Fproblems) directory, as follows:\n\n1. **Setup**. Import your chosen application (say `MyApp`) and task (say `LocalizationTask`):\n\n    ```python\n    from aiopslab.service.apps.myapp import MyApp\n    from aiopslab.orchestrator.tasks.localization import LocalizationTask\n    ```\n\n2. **Define**. To define a problem, create a class that inherits from your chosen `Task` and define 3 methods: `start_workload`, `inject_fault`, and `eval`:\n\n    ```python\n    class MyProblem(LocalizationTask):\n        def __init__(self):\n            self.app = MyApp()\n        \n        def start_workload(self):\n            # \u003Cyour workload logic here>\n            pass\n        \n        def inject_fault(self):\n            # \u003Cyour fault injection logic here>\n            pass\n        \n        def eval(self, soln, trace, duration):\n            # \u003Cyour evaluation logic here>\n            pass\n    ```\n\n3. **Register**. 
Finally, add your problem to the orchestrator's registry [here](\u002Faiopslab\u002Forchestrator\u002Fproblems\u002Fregistry.py).\n\n\nSee a full example of a problem [here](\u002Faiopslab\u002Forchestrator\u002Fproblems\u002Fk8s_target_port_misconfig\u002Ftarget_port.py). \n\u003Cdetails>\n  \u003Csummary>Click to show the description of the problem in detail\u003C\u002Fsummary>\n\n- **`start_workload`**: Initiates the application's workload. Use your own generator or AIOpsLab's default, which is based on [wrk2](https:\u002F\u002Fgithub.com\u002Fgiltene\u002Fwrk2):\n\n    ```python\n    from aiopslab.generators.workload.wrk import Wrk\n\n    wrk = Wrk(rate=100, duration=10)\n    wrk.start_workload(payload=\"\u003Cwrk payload script>\", url=\"\u003Capp URL>\")\n    ```\n    > Relevant Code: [aiopslab\u002Fgenerators\u002Fworkload\u002Fwrk.py](\u002Faiopslab\u002Fgenerators\u002Fworkload\u002Fwrk.py)\n\n- **`inject_fault`**: Introduces a fault into the application. Use your own injector or AIOpsLab's built-in one, which you can also extend. E.g., a misconfig in the K8S layer:\n\n    ```python\n    from aiopslab.generators.fault.inject_virtual import VirtualizationFaultInjector\n\n    inj = VirtualizationFaultInjector(testbed=\"\u003Cnamespace>\")\n    inj.inject_fault(microservices=[\"\u003Cservice-name>\"], fault_type=\"misconfig\")\n    ```\n\n    > Relevant Code: [aiopslab\u002Fgenerators\u002Ffault](\u002Faiopslab\u002Fgenerators\u002Ffault)\n\n\n- **`eval`**: Evaluates the agent's solution using 3 params: (1) *soln*: agent's submitted solution if any, (2) *trace*: agent's action trace, and (3) *duration*: time taken by the agent.\n\n    Here, you can use built-in default evaluators for each task and\u002For add custom evaluations. 
The results are stored in `self.results`:\n    ```python\n    def eval(self, soln, trace, duration) -> dict:\n        super().eval(soln, trace, duration)     # default evaluation\n        self.add_result(\"myMetric\", my_metric(...))     # add custom metric\n        return self.results\n    ```\n\n    > *Note*: When an agent starts a problem, the orchestrator creates a [`Session`](\u002Faiopslab\u002Fsession.py) object that stores the agent's interaction. The `trace` parameter is this session's recorded trace.\n\n    > Relevant Code: [aiopslab\u002Forchestrator\u002Fevaluators\u002F](\u002Faiopslab\u002Forchestrator\u002Fevaluators\u002F)\n\n\u003C\u002Fdetails>\n\n\n\n\n\u003Ch2 id=\"📂project-structure\">📂 Project Structure\u003C\u002Fh2>\n\n\u003Csummary>\u003Ccode>aiopslab\u003C\u002Fcode>\u003C\u002Fsummary>\n\u003Cdetails>\n  \u003Csummary>Generators\u003C\u002Fsummary>\n  \u003Cpre>\n  generators - the problem generators for aiopslab\n  ├── fault - the fault generator organized by fault injection level\n  │   ├── base.py\n  │   ├── inject_app.py\n  │  ...\n  │   └── inject_virtual.py\n  └── workload - the workload generator organized by workload type\n      └── wrk.py - wrk tool interface\n  \u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Orchestrator\u003C\u002Fsummary>\n  \u003Cpre>\n  orchestrator\n  ├── orchestrator.py - the main orchestration engine\n  ├── parser.py - parser for agent responses\n  ├── evaluators - eval metrics in the system\n  │   ├── prompts.py - prompts for LLM-as-a-Judge\n  │   ├── qualitative.py - qualitative metrics\n  │   └── quantitative.py - quantitative metrics\n  ├── problems - problem definitions in aiopslab\n  │   ├── k8s_target_port_misconfig - e.g., A K8S TargetPort misconfig problem\n  │  ...\n  │   └── registry.py\n  ├── actions - actions that agents can perform organized by AIOps task type\n  │   ├── base.py\n  │   ├── detection.py\n  │   ├── localization.py\n  │   ├── analysis.py\n  │   └── 
mitigation.py\n  └── tasks - individual AIOps task definition that agents need to solve\n      ├── base.py\n      ├── detection.py\n      ├── localization.py\n      ├── analysis.py\n      └── mitigation.py\n  \u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Service\u003C\u002Fsummary>\n  \u003Cpre>\n  service\n  ├── apps - interfaces\u002Fimpl. of each app\n  ├── helm.py - helm interface to interact with the cluster\n  ├── kubectl.py - kubectl interface to interact with the cluster\n  ├── shell.py - shell interface to interact with the cluster\n  ├── metadata - metadata and configs for each apps\n  └── telemetry - observability tools besides observer, e.g., in-memory log telemetry for the agent\n  \u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Observer\u003C\u002Fsummary>\n  \u003Cpre>\n  observer\n  ├── filebeat - Filebeat installation\n  ├── logstash - Logstash installation\n  ├── prometheus - Prometheus installation\n  ├── log_api.py - API to store the log data on disk\n  ├── metric_api.py - API to store the metrics data on disk\n  └── trace_api.py - API to store the traces data on disk\n  \u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>Utils\u003C\u002Fsummary>\n  \u003Cpre>\n  ├── config.yml - aiopslab configs\n  ├── config.py - config parser\n  ├── paths.py - paths and constants\n  ├── session.py - aiopslab session manager\n  └── utils\n      ├── actions.py - helpers for actions that agents can perform\n      ├── cache.py - cache manager\n      └── status.py - aiopslab status, error, and warnings\n  \u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Csummary>\u003Ccode>cli.py\u003C\u002Fcode>: A command line interface to interact with AIOpsLab, e.g., used by human operators.\u003C\u002Fsummary>\n\n\n\u003Ch2 id=\"📄how-to-cite\">📄 How to Cite\u003C\u002Fh2>\n\n```bibtex\n@inproceedings{\nchen2025aiopslab,\ntitle={{AIO}psLab: A Holistic Framework to Evaluate {AI} Agents for Enabling 
Autonomous Clouds},\nauthor={Yinfang Chen and Manish Shetty and Gagan Somashekar and Minghua Ma and Yogesh Simmhan and Jonathan Mace and Chetan Bansal and Rujia Wang and Saravan Rajmohan},\nbooktitle={Eighth Conference on Machine Learning and Systems},\nyear={2025},\nurl={https:\u002F\u002Fopenreview.net\u002Fforum?id=3EXBLwGxtq}\n}\n@inproceedings{shetty2024building,\n  title = {Building AI Agents for Autonomous Clouds: Challenges and Design Principles},\n  author = {Shetty, Manish and Chen, Yinfang and Somashekar, Gagan and Ma, Minghua and Simmhan, Yogesh and Zhang, Xuchao and Mace, Jonathan and Vandevoorde, Dax and Las-Casas, Pedro and Gupta, Shachee Mishra and Nath, Suman and Bansal, Chetan and Rajmohan, Saravan},\n  year = {2024},\n  booktitle = {Proceedings of 15th ACM Symposium on Cloud Computing},\n}\n```\n\n## Code of Conduct\n\nThis project has adopted the [Microsoft Open Source Code of Conduct](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002F). For more information see the [Code of Conduct FAQ](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002Ffaq\u002F) or contact [opencode@microsoft.com](mailto:opencode@microsoft.com) with any additional questions or comments.\n\n\n## License\n\nCopyright (c) Microsoft Corporation. All rights reserved.\n\nLicensed under the [MIT](LICENSE.txt) license.\n\n\n### Trademarks\n\nThis project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow [Microsoft’s Trademark & Brand Guidelines](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Flegal\u002Fintellectualproperty\u002Ftrademarks). Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. 
Any use of third-party trademarks or logos is subject to those third-party’s policies.\n","\u003Cdiv align=\"center\">\n\n\u003Ch1>AIOpsLab\u003C\u002Fh1>\n\n[🤖概览](#🤖overview) | \n[🚀快速入门](#🚀quickstart) | \n[📦安装](#📦installation) | \n[⚙️使用](#⚙️usage) | \n[📂项目结构](#📂project-structure) |\n[📄如何引用](#📄how-to-cite)\n\n[![ArXiv链接](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2501.06706-red?logo=arxiv)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2501.06706)\n[![ArXiv链接](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2407.12165-red?logo=arxiv)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2407.12165)\n\u003C\u002Fdiv>\n\n\n\n\u003Ch2 id=\"🤖overview\">🤖 概述\u003C\u002Fh2>\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_AIOpsLab_readme_e7a348e37be6.png)\n\n\nAIOpsLab 是一个整体框架，旨在支持自主 AIOps 代理的设计、开发和评估，同时用于构建可复现、标准化、可互操作且可扩展的基准测试。AIOpsLab 可以部署微服务云环境、注入故障、生成工作负载并导出遥测数据，同时编排这些组件，并提供与代理交互及评估的接口。\n\n此外，AIOpsLab 还内置了一套基准测试套件，包含一系列问题，可在交互式环境中评估 AIOps 代理。该套件可以轻松扩展以满足用户特定需求。有关问题列表，请参阅 [此处](\u002Faiopslab\u002Forchestrator\u002Fproblems\u002Fregistry.py#L15)。\n\n\u003Ch2 id=\"📦installation\">📦 安装\u003C\u002Fh2>\n\n### 要求\n- Python >= 3.11\n- [Helm](https:\u002F\u002Fhelm.sh\u002F)\n- [Poetry](https:\u002F\u002Fpython-poetry.org\u002Fdocs\u002F)（推荐）或 pip\n- 其他要求取决于所选的部署选项，将在下一节中说明\n\n### 第一步：安装 Python 3.11\n```bash\nsudo apt update\nsudo apt install python3.11 python3.11-venv python3.11-dev -y\n```\n\n### 第二步：安装 Poetry（官方安装程序）\n```bash\n# 使用官方安装程序（不要使用 apt - apt 版本已过时）\ncurl -sSL https:\u002F\u002Finstall.python-poetry.org | python3.11 -\nexport PATH=\"$HOME\u002F.local\u002Fbin:$PATH\"\n\n# 添加到您的 shell 配置文件以保持持久性\necho 'export PATH=\"$HOME\u002F.local\u002Fbin:$PATH\"' >> ~\u002F.bashrc\n```\n\n> **警告**：请勿使用 `sudo apt install python3-poetry` - 它会安装一个过时的版本，可能无法与锁定文件兼容。\n\n### 第三步：克隆并安装\n```bash\ngit clone --recurse-submodules \u003CCLONE_PATH_TO_THE_REPO>\ncd AIOpsLab\npoetry env use python3.11\npoetry install\neval $(poetry env activate)\n```\n\n> 
**故障排除**：如果出现“锁定文件不兼容”的错误，请先运行 `poetry lock`，再运行 `poetry install`。\n\n使用 pip 的替代安装方法：\n```bash\npip install -e .\n```\n\n\u003Ch2 id=\"🚀quickstart\">🚀 快速入门 \u003C\u002Fh2>\n\n\u003C!-- TODO: 添加本地集群和远程集群的说明 -->\n选择 a) 或 b) 来设置您的集群，然后继续执行后续步骤。\n\n### a) 本地模拟集群\nAIOpsLab 可以在您本地机器上使用 [kind](https:\u002F\u002Fkind.sigs.k8s.io\u002F) 在本地模拟集群上运行。请参阅此 [README](kind\u002FREADME.md#prerequisites) 以获取先决条件列表。\n\n```bash\n# 对于 x86 机器\nkind create cluster --config kind\u002Fkind-config-x86.yaml\n\n# 对于 ARM 机器\nkind create cluster --config kind\u002Fkind-config-arm.yaml\n```\n\n如果您遇到问题，可以考虑按照此 [README](kind\u002FREADME.md#deployment-steps) 为您的机器构建 Docker 镜像。如有需要，请提交问题。\n\n### [提示]\n如果您使用代理运行 AIOpsLab，请注意不要将 HTTP 代理地址设置为 `172.17.0.1`。创建 kind 集群时，集群中的所有节点都会继承主机环境和 Docker 容器的代理设置。\n\n`172.17.0.1` 地址用于与宿主机通信。更多详情请参阅官方指南：[配置 Kind 使用代理](https:\u002F\u002Fkind.sigs.k8s.io\u002Fdocs\u002Fuser\u002Fquick-start\u002F#configure-kind-to-use-a-proxy)。\n\n此外，Docker 不直接支持 SOCKS5 代理。如果您使用 SOCKS5 协议进行代理，可能需要使用 [Privoxy](https:\u002F\u002Fwww.privoxy.org) 将 SOCKS5 转换为 HTTP。\n\n如果您在本地运行 VLLM 和 LLM 代理，Privoxy 默认会代理 `localhost`，这会导致错误。为避免此问题，您应设置以下环境变量：\n\n```bash\nexport no_proxy=localhost\n``` \n\n完成集群创建后，继续执行下一步“更新 `config.yml`”。\n\n### b) 远程集群（使用 Ansible 手动设置）\nAIOpsLab 支持任何您已配置 `kubectl` 上下文的远程 Kubernetes 集群，无论是来自云提供商的集群，还是您自己搭建的集群。我们提供了一些 Ansible 剧本，可用于在 [CloudLab](https:\u002F\u002Fwww.cloudlab.us\u002F) 等云平台以及我们自己的机器上设置集群。请按照此 [README](.\u002Fscripts\u002Fansible\u002FREADME.md) 设置您自己的集群，然后继续执行下一步“更新 `config.yml`”。\n\n### c) 使用 Terraform + Ansible 的 Azure VM（推荐用于云端）\n只需一条命令即可 provision VM、设置 K8s 并配置 AIOpsLab：\n\n```bash\n# 模式 B（AIOpsLab 在笔记本电脑上，远程 kubectl）：\npython3 scripts\u002Fterraform\u002Fdeploy.py --apply --resource-group \u003Cyour-rg> --workers 2 --mode B\n\n# 模式 A（AIOpsLab 在控制器 VM 上，完全支持故障注入）：\npython3 scripts\u002Fterraform\u002Fdeploy.py --apply --resource-group \u003Cyour-rg> --workers 2 --mode A\n```\n\n有关所有选项（`--allowed-ips`、`--dev`、`--setup-only` 等），请参阅 [Terraform 
README](.\u002Fscripts\u002Fterraform\u002FREADME.md)。\n\n> **注意**：模式 B 便于开发，但某些故障注入器（例如 VirtualizationFaultInjector）需要在本地机器上运行 Docker。如需完整功能，请使用模式 A。\n\n### 更新 `config.yml`\n```bash\ncd aiopslab\ncp config.yml.example config.yml\n```\n请更新您的 `config.yml`，使 `k8s_host` 成为您集群控制平面节点的主机名。将 `k8s_user` 更新为您在控制平面节点上的用户名。如果您使用的是 kind 集群，您的 `k8s_host` 应为 `kind`。如果您在集群上运行 AIOpsLab，您的 `k8s_host` 应为 `localhost`。\n\n### 在本地运行代理\n由人类担任代理：\n\n```bash\npython3 cli.py\n(aiopslab) $ start misconfig_app_hotel_res-detection-1 # 或选择您想解决的任何问题\n# ... 等待设置 ...\n(aiopslab) $ submit(\"Yes\") # 提交解决方案\n```\n\n运行 GPT-4 基线代理：\n\n```bash\n# 如果项目根目录下没有 .env 文件，则创建一个\necho \"OPENAI_API_KEY=\u003CYOUR_OPENAI_API_KEY>\" > .env\n# 如有需要，可添加更多 API 密钥：\n# echo \"QWEN_API_KEY=\u003CYOUR_QWEN_API_KEY>\" >> .env\n\n# echo \"DEEPSEEK_API_KEY=\u003CYOUR_DEEPSEEK_API_KEY>\" >> .env\n\npython3 clients\u002Fgpt.py # 你也可以在 main() 函数中更改要解决的问题\n```\n\n我们的仓库预集成多种代理，其中包括支持**使用基于身份的访问权限对 Azure OpenAI 终端节点进行安全认证**的代理。请查看 [Clients](\u002Fclients)，以获取所有已实现客户端的完整列表。\n\n客户端会自动从你的 .env 文件中加载 API 密钥。\n\n你可以使用 [k9s](https:\u002F\u002Fk9scli.io\u002F) 或其他集群监控工具方便地检查集群的运行状态。\n\n要在 W&B 应用程序中以表格形式浏览你记录的 `session_id` 值：\n\n1. 确保你已安装并配置好 W&B。\n2. 设置 USE_WANDB 环境变量：\n    ```bash\n    # 添加到你的 .env 文件\n    echo \"USE_WANDB=true\" >> .env\n    ```\n3. 在 W&B Web UI 中，打开任意运行，点击“Tables”→“Add Query Panel”。\n4. 在 key 字段中输入 `runs.summary` 并点击“Run”，你将看到结果以表格形式显示。\n\n\u003Ch2 id=\"⚙️usage\">⚙️ 使用方法\u003C\u002Fh2>\n\nAIOpsLab 可以通过以下方式使用：\n- [将你的代理接入 AIOpsLab](#how-to-onboard-your-agent-to-aiopslab)\n- [向 AIOpsLab 添加新应用](#how-to-add-new-applications-to-aiopslab)\n- [向 AIOpsLab 添加新问题](#how-to-add-new-problems-to-aiopslab)\n\n### 远程运行代理\n你可以在具有更大计算资源的远程机器上运行 AIOpsLab。本节将指导你如何在远程设置和使用 AIOpsLab。\n\n1. **在远程机器上启动 AIOpsLab 服务**：\n\n    ```bash\n    SERVICE_HOST=\u003CYOUR_HOST> SERVICE_PORT=\u003CYOUR_PORT> SERVICE_WORKERS=\u003CYOUR_WORKERS> python service.py\n    ```\n2. 
**从本地机器测试连接**：\n    在你的本地机器上，可以使用 `curl` 测试与远程 AIOpsLab 服务的连接：\n\n    ```bash\n    # 检查服务是否运行\n    curl http:\u002F\u002F\u003CYOUR_HOST>:\u003CYOUR_PORT>\u002Fhealth\n    \n    # 列出可用问题\n    curl http:\u002F\u002F\u003CYOUR_HOST>:\u003CYOUR_PORT>\u002Fproblems\n    \n    # 列出可用代理\n    curl http:\u002F\u002F\u003CYOUR_HOST>:\u003CYOUR_PORT>\u002Fagents\n    ```\n\n3. **在远程机器上运行 vLLM（如果使用 vLLM 代理）：**\n    如果你使用的是 vLLM 代理，务必在远程机器上启动 vLLM 服务器：\n\n    ```bash\n    # 在远程机器上\n    chmod +x .\u002Fclients\u002Flaunch_vllm.sh\n    .\u002Fclients\u002Flaunch_vllm.sh\n    ```\n    你可以在运行前编辑 `launch_vllm.sh` 来自定义模型。\n\n4. **运行代理**：\n    在你的本地机器上，可以使用以下命令运行代理：\n\n    ```bash\n    curl -X POST http:\u002F\u002F\u003CYOUR_HOST>:\u003CYOUR_PORT>\u002Fsimulate \\\n      -H \"Content-Type: application\u002Fjson\" \\\n      -d '{\n        \"problem_id\": \"misconfig_app_hotel_res-mitigation-1\",\n        \"agent_name\": \"vllm\",\n        \"max_steps\": 10,\n        \"temperature\": 0.7,\n        \"top_p\": 0.9\n      }'\n    ```\n\n### 如何将你的代理接入 AIOpsLab？\n\nAIOpsLab 使得开发和评估你的代理变得极其简单。你可以通过以下三个简单步骤将你的代理接入 AIOpsLab：\n\n1. **创建你的代理**：你可以自由选择任何框架来开发代理。唯一的要求是：\n    - 将你的代理封装在一个 Python 类中，例如 `Agent`。\n    - 为该类添加一个异步方法 `get_action`：\n\n        ```python\n        # 根据当前状态返回代理的动作\n        async def get_action(self, state: str) -> str:\n            # \u003C你的代理逻辑在这里>\n        ```\n\n2. **将你的代理注册到 AIOpsLab**：现在你可以将代理注册到 AIOpsLab 的编排器中。编排器将管理你的代理与环境之间的交互：\n\n    ```python\n    from aiopslab.orchestrator import Orchestrator\n\n    agent = Agent()             # 创建你的代理实例\n    orch = Orchestrator()       # 获取 AIOpsLab 的编排器\n    orch.register_agent(agent)  # 将你的代理注册到 AIOpsLab\n    ```\n\n3. **在某个问题上评估你的代理**：\n\n    1. 
**初始化一个问题**：AIOpsLab 提供了一系列你可以用来评估代理的问题。你可以在 [这里](\u002Faiopslab\u002Forchestrator\u002Fproblems\u002Fregistry.py) 或使用 `orch.probs.get_problem_ids()` 查看可用问题列表。现在根据问题 ID 初始化一个问题：\n\n        ```python\n        problem_desc, instructs, apis = orch.init_problem(\"k8s_target_port-misconfig-mitigation-1\")\n        ```\n    \n    2. **设置代理上下文**：使用问题描述、指令和可用的 API 为你的代理设置上下文。（这一步取决于你的代理设计，由用户自行决定）\n\n\n    3. **开始解决问题**：调用 `start_problem` 方法开始解决问题。你还可以指定最大步骤数：\n\n        ```python\n        import asyncio\n        asyncio.run(orch.start_problem(max_steps=30))\n        ```\n\n此过程将在编排器中创建一个 [`Session`](\u002Faiopslab\u002Fsession.py)，代理将在其中解决问题。编排器会评估你的代理解决方案，并提供结果（存储在 `data\u002Fresults\u002F` 下）。你可以利用这些结果来改进你的代理。\n\n### 如何向 AIOpsLab 添加新应用？\n\nAIOpsLab 提供了一个默认的[应用列表](\u002Faiopslab\u002Fservice\u002Fapps\u002F)，用于评估代理在运维任务中的表现。然而，作为开发者，你也可以向 AIOpsLab 添加新应用，并围绕这些应用设计问题。\n\n> *注意*：对于某些支持 K8S 自动部署的应用，我们集成了 Helm 图表（你也可以使用 `kubectl` 来安装，例如 [HotelRes 应用](\u002Faiopslab\u002Fservice\u002Fapps\u002Fhotelres.py)）。有关 Helm 的更多信息请参见[这里](https:\u002F\u002Fhelm.sh)。\n\n要通过 Helm 向 AIOpsLab 添加新应用，你需要：\n\n1. **添加应用元数据**\n    - 应用元数据是一个描述该应用的 JSON 对象。\n    - 可以包含任何字段，如应用名称、描述、命名空间等。\n    - 我们建议同时包含一个特殊的 `Helm Config` 字段，如下所示：\n\n        ```json\n        \"Helm Config\": {\n            \"release_name\": \"\u003C用于部署的 Helm 发布名称>\",\n            \"chart_path\": \"\u003C应用的 Helm 图表路径>\",\n            \"namespace\": \"\u003C应用应部署的 K8S 命名空间>\"\n        }\n        ```\n        > *注意*：`Helm Config` 由编排器使用，以便在与该应用相关的问题启动时自动部署你的应用。\n\n        > *注意*：编排器会自动为与该应用相关的所有问题提供上下文信息给代理。\n\n    创建一个包含这些元数据的 JSON 文件，并将其保存在 [`metadata`](\u002Faiopslab\u002Fservice\u002Fmetadata) 目录中。例如，`social-network` 应用：[social-network.json](\u002Faiopslab\u002Fservice\u002Fmetadata\u002Fsocial-network.json)\n\n2. 
**添加应用类**\n\n    在 [`apps`](\u002Faiopslab\u002Fservice\u002Fapps) 目录下的新 Python 文件中扩展基类：\n\n    ```python\n    from aiopslab.service.apps.base import Application\n\n    class MyApp(Application):\n        def __init__(self):\n            super().__init__(\"\u003C应用元数据 JSON 的路径>\")\n    ```\n\n    `Application` 类提供了应用的基础实现。你可以根据需要覆盖方法或添加新方法以满足你的应用需求，但对于大多数应用来说，基类已经足够。\n\n### 如何向 AIOpsLab 添加新问题？\n\n与应用类似，AIOpsLab 提供了一个默认的[问题列表](\u002Faiopslab\u002Forchestrator\u002Fproblems\u002Fregistry.py)，用于评估代理的表现。然而，作为开发者，你也可以向 AIOpsLab 添加新问题，并围绕你的应用来设计这些问题。\n\nAIOpsLab 中的每个问题包含 5 个组成部分：\n1. *应用*：问题所基于的应用。\n2. *任务*：代理需要执行的 AIOps 任务。目前我们支持：[检测](\u002Faiopslab\u002Forchestrator\u002Ftasks\u002Fdetection.py)、[定位](\u002Faiopslab\u002Forchestrator\u002Ftasks\u002Flocalization.py)、[分析](\u002Faiopslab\u002Forchestrator\u002Ftasks\u002Fanalysis.py) 和 [缓解](\u002Faiopslab\u002Forchestrator\u002Ftasks\u002Fmitigation.py)。\n3. *故障*：在应用中引入的故障。\n4. *工作负载*：为应用生成的工作负载。\n5. *评估者*：用于检查代理表现的评估者。\n\n要向 AIOpsLab 添加新问题，在 [`problems`](\u002Faiopslab\u002Forchestrator\u002Fproblems) 目录下创建一个新的 Python 文件，步骤如下：\n\n1. **设置**。导入你选择的应用（例如 `MyApp`）和任务（例如 `LocalizationTask`）：\n\n    ```python\n    from aiopslab.service.apps.myapp import MyApp\n    from aiopslab.orchestrator.tasks.localization import LocalizationTask\n    ```\n\n2. **定义**。要定义一个问题，创建一个继承自你所选 `Task` 的类，并定义 3 个方法：`start_workload`、`inject_fault` 和 `eval`：\n\n    ```python\n    class MyProblem(LocalizationTask):\n        def __init__(self):\n            self.app = MyApp()\n        \n        def start_workload(self):\n            # \u003C你的工作负载逻辑在这里>\n        \n        def inject_fault(self):\n            # \u003C你的故障注入逻辑在这里>\n        \n        def eval(self, soln, trace, duration):\n            # \u003C你的评估逻辑在这里>\n    ```\n\n3. 
**注册**。最后，将你的问题添加到编排器的注册表中[这里](\u002Faiopslab\u002Forchestrator\u002Fproblems\u002Fregistry.py)。\n\n\n完整的问题示例请参见[这里](\u002Faiopslab\u002Forchestrator\u002Fproblems\u002Fk8s_target_port_misconfig\u002Ftarget_port.py)。 \n\u003Cdetails>\n  \u003Csummary>点击以查看问题的详细描述\u003C\u002Fsummary>\n\n- **`start_workload`**：启动应用的工作负载。可以使用你自己的生成器，也可以使用 AIOpsLab 的默认生成器，该生成器基于 [wrk2](https:\u002F\u002Fgithub.com\u002Fgiltene\u002Fwrk2)：\n\n    ```python\n    from aiopslab.generators.workload.wrk import Wrk\n\n    wrk = Wrk(rate=100, duration=10)\n    wrk.start_workload(payload=\"\u003Cwrk 负载脚本>\", url=\"\u003C应用 URL>\")\n    ```\n    > 相关代码：[aiopslab\u002Fgenerators\u002Fworkload\u002Fwrk.py](\u002Faiopslab\u002Fgenerators\u002Fworkload\u002Fwrk.py)\n\n- **`inject_fault`**：在应用中引入故障。可以使用你自己的注入器，也可以使用 AIOpsLab 内置的注入器，并且还可以对其进行扩展。例如，K8S 层的配置错误：\n\n    ```python\n    from aiopslab.generators.fault.inject_virtual import *\n\n    inj = VirtualizationFaultInjector(testbed=\"\u003C命名空间>\")\n    inj.inject_fault(microservices=[\"\u003C服务名称>\"], fault_type=\"misconfig\")\n    ```\n\n    > 相关代码：[aiopslab\u002Fgenerators\u002Ffault](\u002Faiopslab\u002Fgenerators\u002Ffault)\n\n\n- **`eval`**：使用 3 个参数评估代理的解决方案：(1) *soln*：代理提交的解决方案（如果有），(2) *trace*：代理的操作轨迹，以及 (3) *duration*：代理所花费的时间。\n\n    在这里，你可以使用每个任务的内置默认评估器，也可以添加自定义评估。结果会存储在 `self.results` 中：\n    ```python\n    def eval(self, soln, trace, duration) -> dict:\n        super().eval(soln, trace, duration)     # 默认评估\n        self.add_result(\"myMetric\", my_metric(...))     # 添加自定义指标\n        return self.results\n    ```\n\n    > *注意*：当代理开始一个问题时，编排器会创建一个 [`Session`](\u002Faiopslab\u002Fsession.py) 对象来存储代理的交互记录。`trace` 参数就是这个会话记录的轨迹。\n\n    > 相关代码：[aiopslab\u002Forchestrator\u002Fevaluators\u002F](\u002Faiopslab\u002Forchestrator\u002Fevaluators\u002F)\n\n\u003C\u002Fdetails>\n\n\n\n\n\u003Ch2 id=\"📂project-structure\">📂 项目结构\u003C\u002Fh2>\n\n\u003Csummary>\u003Ccode>aiopslab\u003C\u002Fcode>\u003C\u002Fsummary>\n\u003Cdetails>\n  
\u003Csummary>生成器\u003C\u002Fsummary>\n  \u003Cpre>\n  generators - AIOpsLab 的问题生成器\n  ├── fault - 按故障注入级别组织的故障生成器\n  │   ├── base.py\n  │   ├── inject_app.py\n  │  ...\n  │   └── inject_virtual.py\n  └── workload - 按工作负载类型组织的工作负载生成器\n      └── wrk.py - wrk 工具接口\n  \u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>编排器\u003C\u002Fsummary>\n  \u003Cpre>\n  orchestrator\n  ├── orchestrator.py - 主要的编排引擎\n  ├── parser.py - 用于解析智能体响应的解析器\n  ├── evaluators - 系统中的评估指标\n  │   ├── prompts.py - 用于LLM作为评判者的提示模板\n  │   ├── qualitative.py - 定性指标\n  │   └── quantitative.py - 定量指标\n  ├── problems - aiopslab中的问题定义\n  │   ├── k8s_target_port_misconfig - 例如，K8S TargetPort配置错误问题\n  │  ...\n  │   └── registry.py\n  ├── actions - 按照AIOps任务类型组织的智能体可执行动作\n  │   ├── base.py\n  │   ├── detection.py\n  │   ├── localization.py\n  │   ├── analysis.py\n  │   └── mitigation.py\n  └── tasks - 智能体需要解决的单个AIOps任务定义\n      ├── base.py\n      ├── detection.py\n      ├── localization.py\n      ├── analysis.py\n      └── mitigation.py\n  \u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>服务\u003C\u002Fsummary>\n  \u003Cpre>\n  service\n  ├── apps - 各应用的接口\u002F实现\n  ├── helm.py - 与集群交互的Helm接口\n  ├── kubectl.py - 与集群交互的kubectl接口\n  ├── shell.py - 与集群交互的Shell接口\n  ├── metadata - 各应用的元数据和配置\n  └── telemetry - 观测性工具（除observer外），例如用于智能体的内存日志观测\n  \u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>观察者\u003C\u002Fsummary>\n  \u003Cpre>\n  observer\n  ├── filebeat - Filebeat安装\n  ├── logstash - Logstash安装\n  ├── prometheus - Prometheus安装\n  ├── log_api.py - 用于将日志数据存储到磁盘的API\n  ├── metric_api.py - 用于将指标数据存储到磁盘的API\n  └── trace_api.py - 用于将追踪数据存储到磁盘的API\n  \u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n  \u003Csummary>工具\u003C\u002Fsummary>\n  \u003Cpre>\n  ├── config.yml - aiopslab配置文件\n  ├── config.py - 配置解析器\n  ├── paths.py - 路径和常量\n  ├── session.py - aiopslab会话管理器\n  └── utils\n      ├── actions.py - 智能体可执行动作的帮助函数\n      ├── 
cache.py - 缓存管理器\n      └── status.py - aiopslab的状态、错误和警告信息\n  \u003C\u002Fpre>\n\u003C\u002Fdetails>\n\n\u003Csummary>\u003Ccode>cli.py\u003C\u002Fcode>: 用于与AIOpsLab交互的命令行界面，例如供人工操作员使用。\u003C\u002Fsummary>\n\n\n\u003Ch2 id=\"📄how-to-cite\">📄 如何引用\u003C\u002Fh2>\n\n```bibtex\n@inproceedings{\nchen2025aiopslab,\ntitle={{AIO}psLab: A Holistic Framework to Evaluate {AI} Agents for Enabling Autonomous Clouds},\nauthor={Yinfang Chen and Manish Shetty and Gagan Somashekar and Minghua Ma and Yogesh Simmhan and Jonathan Mace and Chetan Bansal and Rujia Wang and Saravan Rajmohan},\nbooktitle={Eighth Conference on Machine Learning and Systems},\nyear={2025},\nurl={https:\u002F\u002Fopenreview.net\u002Fforum?id=3EXBLwGxtq}\n}\n@inproceedings{shetty2024building,\n  title = {Building AI Agents for Autonomous Clouds: Challenges and Design Principles},\n  author = {Shetty, Manish and Chen, Yinfang and Somashekar, Gagan and Ma, Minghua and Simmhan, Yogesh and Zhang, Xuchao and Mace, Jonathan and Vandevoorde, Dax and Las-Casas, Pedro and Gupta, Shachee Mishra and Nath, Suman and Bansal, Chetan and Rajmohan, Saravan},\n  year = {2024},\n  booktitle = {Proceedings of 15th ACM Symposium on Cloud Computing},\n}\n```\n\n\n\n## 行为准则\n\n本项目已采用[微软开源行为准则](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002F)。更多信息请参阅[行为准则常见问题解答](https:\u002F\u002Fopensource.microsoft.com\u002Fcodeofconduct\u002Ffaq\u002F)，或如有任何其他疑问或意见，请联系[opencode@microsoft.com](mailto:opencode@microsoft.com)。\n\n\n## 许可证\n\n版权所有 © 微软公司。保留所有权利。\n\n根据[MIT许可证](LICENSE.txt)授权。\n\n\n### 商标\n\n本项目可能包含项目、产品或服务的商标或标识。未经授权使用微软商标或标识须遵守并遵循[微软商标与品牌指南](https:\u002F\u002Fwww.microsoft.com\u002Fen-us\u002Flegal\u002Fintellectualproperty\u002Ftrademarks)。在本项目的修改版本中使用微软商标或标识不得造成混淆或暗示微软的赞助。任何第三方商标或标识的使用均应遵守该第三方的相关政策。","# AIOpsLab 快速上手指南\n\nAIOpsLab 是一个用于设计、开发和评估自主 AIOps 智能体的综合框架。它支持部署微服务云环境、注入故障、生成工作负载并导出遥测数据，同时提供标准化的基准测试套件。\n\n## 环境准备\n\n### 系统要求\n- **操作系统**: Linux (推荐 Ubuntu) 或 macOS\n- **Python**: >= 3.11 (必须)\n- **包管理**: 
[Poetry](https:\u002F\u002Fpython-poetry.org\u002F) (推荐) 或 pip\n- **容器编排**: [Helm](https:\u002F\u002Fhelm.sh\u002F)\n- **集群工具**: [Kind](https:\u002F\u002Fkind.sigs.k8s.io\u002F) (用于本地模拟) 或任意远程 Kubernetes 集群\n\n### 前置依赖安装\n请确保已安装 Docker 和 kubectl。\n\n**1. 安装 Python 3.11**\n```bash\nsudo apt update\nsudo apt install python3.11 python3.11-venv python3.11-dev -y\n```\n\n**2. 安装 Poetry (官方安装方式)**\n> ⚠️ **注意**: 请勿使用 `apt` 安装 Poetry，版本过旧可能导致兼容性问题。\n```bash\ncurl -sSL https:\u002F\u002Finstall.python-poetry.org | python3.11 -\nexport PATH=\"$HOME\u002F.local\u002Fbin:$PATH\"\necho 'export PATH=\"$HOME\u002F.local\u002Fbin:$PATH\"' >> ~\u002F.bashrc\nsource ~\u002F.bashrc\n```\n\n**3. 安装 Helm**\n```bash\ncurl https:\u002F\u002Fraw.githubusercontent.com\u002Fhelm\u002Fhelm\u002Fmain\u002Fscripts\u002Fget-helm-3 | bash\n```\n\n## 安装步骤\n\n**1. 克隆项目**\n```bash\ngit clone --recurse-submodules \u003CCLONE_PATH_TO_THE_REPO>\ncd AIOpsLab\n```\n\n**2. 配置虚拟环境并安装依赖**\n```bash\npoetry env use python3.11\npoetry install\neval $(poetry env activate)\n```\n> **故障排查**: 若遇到 \"lock file not compatible\" 错误，请先运行 `poetry lock`，再执行 `poetry install`。\n\n*备选方案 (使用 pip)*:\n```bash\npip install -e .\n```\n\n## 基本使用\n\n### 第一步：搭建集群\n你可以选择在本机运行模拟集群，或使用远程集群。\n\n**选项 A：本地模拟集群 (推荐新手)**\n适用于 x86 架构机器：\n```bash\nkind create cluster --config kind\u002Fkind-config-x86.yaml\n```\n*(ARM 架构机器请使用 `kind-config-arm.yaml`)*\n\n> **代理提示**: 若使用代理，请注意 Kind 节点会继承主机代理设置。若使用 SOCKS5 代理，建议通过 Privoxy 转为 HTTP，并设置 `export no_proxy=localhost` 以避免本地 LLM 服务出错。\n\n**选项 B：远程集群 (Azure\u002FCloudLab 等)**\n若使用 Azure VM，可通过脚本一键部署：\n```bash\n# Mode B: 本地运行 AIOpsLab，远程运行 K8s\npython3 scripts\u002Fterraform\u002Fdeploy.py --apply --resource-group \u003Cyour-rg> --workers 2 --mode B\n```\n\n### 第二步：配置文件\n复制并编辑配置文件，确保 `k8s_host` 指向控制平面节点。\n- 本地 Kind 集群：设置为 `kind`\n- 在集群内部运行：设置为 `localhost`\n\n```bash\ncd aiopslab\ncp config.yml.example config.yml\n# 使用编辑器修改 config.yml 中的 k8s_host 和 k8s_user\n```\n\n### 第三步：运行智能体\n\n**1. 
准备 API Key**\n在项目根目录创建 `.env` 文件并填入密钥：\n```bash\necho \"OPENAI_API_KEY=\u003CYOUR_OPENAI_API_KEY>\" > .env\n# 如需其他模型可追加:\n# echo \"QWEN_API_KEY=\u003CYOUR_QWEN_API_KEY>\" >> .env\n```\n\n**2. 启动交互式会话 (人类作为智能体)**\n```bash\npython3 cli.py\n```\n在交互界面中：\n```text\n(aiopslab) $ start misconfig_app_hotel_res-detection-1\n# ... 等待环境设置完成 ...\n(aiopslab) $ submit(\"Yes\") \n```\n\n**3. 运行 GPT-4 基准智能体**\n```bash\npython3 clients\u002Fgpt.py\n```\n\n**4. 监控状态**\n推荐使用 [k9s](https:\u002F\u002Fk9scli.io\u002F) 查看集群实时状态：\n```bash\nk9s\n```\n\n**5. (可选) 启用 W&B 实验追踪**\n在 `.env` 中添加 `USE_WANDB=true`，即可在 W&B 面板中以表格形式查看 `session_id` 和运行结果。","某大型电商平台的 SRE 团队需要在微服务架构上线前，验证其新研发的“故障自愈 Agent”能否准确识别并修复复杂的级联故障。\n\n### 没有 AIOpsLab 时\n- **环境搭建耗时极长**：团队需手动配置 Kubernetes 集群、部署微服务应用及监控探针，每次测试准备耗时数天。\n- **故障复现困难且不安全**：难以在生产环境中安全地注入特定网络延迟或数据库宕机故障，导致测试场景单一，无法覆盖极端情况。\n- **评估标准不统一**：缺乏标准化的基准测试集，不同版本的 Agent 性能对比依赖人工观察日志，结果主观且不可复现。\n- **数据孤岛严重**：故障注入、工作负载生成与遥测数据分散在不同工具中，难以关联分析 Agent 的决策链路。\n\n### 使用 AIOpsLab 后\n- **一键部署仿真环境**：利用 AIOpsLab 内置的 Kind 集群编排能力，几分钟内即可在本地拉起包含完整微服务链路的仿真云环境。\n- **标准化故障注入**：直接调用内置基准套件，精准注入如“支付服务超时”或“缓存雪崩”等复杂故障，安全且可重复执行。\n- **自动化量化评估**：AIOpsLab 自动记录 Agent 从发现故障到恢复服务的全流程指标，提供客观的评分报告，实现版本间的公平对比。\n- **全链路数据闭环**：框架自动协调故障生成与遥测数据导出，为 Agent 提供连贯的训练与评估数据流，大幅缩短调试周期。\n\nAIOpsLab 将原本需要数周才能完成的闭环验证压缩至小时级，为自主运维智能体的研发提供了可复现、标准化的核心基础设施。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fmicrosoft_AIOpsLab_71e0b609.png","microsoft","Microsoft","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fmicrosoft_4900709c.png","Open source projects and samples from 
Microsoft",null,"opensource@microsoft.com","OpenAtMicrosoft","https:\u002F\u002Fopensource.microsoft.com","https:\u002F\u002Fgithub.com\u002Fmicrosoft",[24,28,32,36,40,44,48,52,56],{"name":25,"color":26,"percentage":27},"Python","#3572A5",89.3,{"name":29,"color":30,"percentage":31},"Mustache","#724b3b",4.4,{"name":33,"color":34,"percentage":35},"Shell","#89e051",2.9,{"name":37,"color":38,"percentage":39},"HCL","#844FBA",1.4,{"name":41,"color":42,"percentage":43},"Go Template","#00ADD8",1.1,{"name":45,"color":46,"percentage":47},"Makefile","#427819",0.5,{"name":49,"color":50,"percentage":51},"C","#555555",0.3,{"name":53,"color":54,"percentage":55},"Jinja","#a52a22",0,{"name":57,"color":58,"percentage":55},"Dockerfile","#384d54",850,152,"2026-04-05T18:03:41","MIT",4,"Linux, macOS","未说明（本地运行 LLM 代理如 vLLM 时通常需要 GPU，但文档未指定具体型号或显存要求）","未说明（运行 Kubernetes 集群和微服务环境通常建议 16GB+）",{"notes":68,"python":69,"dependencies":70},"该工具主要用于部署和评估 AIOps 代理，核心依赖是 Kubernetes 环境。用户可选择在本地使用 kind 模拟集群，或在远程\u002F云端（如 Azure）部署真实集群。若需在本地运行大语言模型（LLM）代理，需额外配置 vLLM 并注意代理设置（SOCKS5 需转为 HTTP）。推荐使用 Poetry 管理 Python 依赖，严禁使用 apt 安装过时的 Poetry 版本。",">=3.11",[71,72,73,74,75,76,77,78],"Helm","Poetry","kind","kubectl","Docker","Ansible (可选)","Terraform (可选)","vLLM (可选)",[80,81,82],"开发框架","Agent","其他",2,"ready","2026-03-27T02:49:30.150509","2026-04-08T00:58:15.359574",[88,93,98,102,107,112,117],{"id":89,"question_zh":90,"answer_zh":91,"source_url":92},23214,"为什么在运行场景时会出现 'Timeout: Not all pods in namespace openebs reached the Ready state' 错误？","这通常是由于在 Ubuntu 笔记本电脑上安装 Kind 集群时，'AIopslabImage' 导致系统崩溃或 OpenEBS 组件无法正常启动（Pod 处于 CrashLoopBackOff 状态）。维护者建议检查系统配置，并尝试重新安装 Kind。如果问题持续，可能是特定于 Ubuntu 环境的兼容性问题，需等待官方修复该镜像或调整本地资源限制。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FAIOpsLab\u002Fissues\u002F117",{"id":94,"question_zh":95,"answer_zh":96,"source_url":97},23215,"AIOpsLab 中功能性故障（functional faults）和症状性故障（symptomatic faults）在任务级别上有什么区别？","在 AIOpsLab 
基准测试中，功能性故障用于所有四个任务级别（检测、定位、根因分析、缓解），而症状性故障仅用于检测和定位级别。这是因为基准测试并未为症状性故障定义或评估根因分析（RCA）和缓解的地面真实值（ground truth）。这是一个基准范围的决定，而非代理（Agent）能力的限制。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FAIOpsLab\u002Fissues\u002F136",{"id":99,"question_zh":100,"answer_zh":101,"source_url":97},23216,"在缓解（Mitigation）任务中，如何判定操作是否成功？","缓解任务的成功与否不是仅仅通过调用 `Submit()` 来判断的，而是评估集群是否返回到健康状态。只有当整个集群状态恢复健康时，才会被评估为“成功”。",{"id":103,"question_zh":104,"answer_zh":105,"source_url":106},23217,"使用 Chaos Mesh 进行 Pod 杀死实验时，为什么 Pod 被杀死后立即重启了？","这是因为当前的配置使用了 `pod-kill`，它杀死 Pod 但不会保持其死亡状态，Kubernetes 会立即重建 Pod。建议改用 `pod-failure` 实验类型，它可以模拟 Pod 失败并保持该状态一段时间，从而更有效地测试系统的容错能力。参考文档：https:\u002F\u002Fchaos-mesh.org\u002Fdocs\u002Fsimulate-pod-chaos-on-kubernetes\u002F#pod-failure-example","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FAIOpsLab\u002Fissues\u002F111",{"id":108,"question_zh":109,"answer_zh":110,"source_url":111},23218,"如何在本地机器（Mac\u002FLinux\u002FWindows）或 AKS 上部署和运行 AIOpsLab？","如果您使用的是 fork 版本，只要您的集群已在 kubectl 上下文中配置好，可以直接运行提供的 problems 脚本。对于使用 AKS 的用户，设置可能略有不同，但基本原理相同：确保集群可访问且上下文正确。如有具体问题，可在 Issue 中寻求维护者帮助。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FAIOpsLab\u002Fissues\u002F11",{"id":113,"question_zh":114,"answer_zh":115,"source_url":116},23219,"项目是否有 CI 流水线？它是如何配置的？","项目已添加了一个基于 GitHub Actions 的简易 CI 流水线，使用 `noop` 场景进行健全性检查，以确保核心功能未被破坏。该流水线在公共仓库中免费运行，使用的是 `ubuntu-latest` 标准 runner。日志显示流水线工作正常。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FAIOpsLab\u002Fissues\u002F19",{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},23220,"遇到 MongoDB 启动时的竞态条件错误（No suitable servers found）怎么办？","该错误表明服务在 MongoDB 完全就绪前尝试连接。虽然 Issue 详情被截断，但此类问题通常需要在服务启动脚本中加入重试机制或等待 MongoDB 端口就绪的逻辑（例如使用 `wait-for-it` 
脚本或在代码中增加重试循环），确保数据库索引创建前连接已建立。","https:\u002F\u002Fgithub.com\u002Fmicrosoft\u002FAIOpsLab\u002Fissues\u002F102",[],[124,135,143,152,160,169],{"id":125,"name":126,"github_repo":127,"description_zh":128,"stars":129,"difficulty_score":130,"last_commit_at":131,"category_tags":132,"status":84},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[81,80,133,134],"图像","数据工具",{"id":136,"name":137,"github_repo":138,"description_zh":139,"stars":140,"difficulty_score":130,"last_commit_at":141,"category_tags":142,"status":84},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[80,133,81],{"id":144,"name":145,"github_repo":146,"description_zh":147,"stars":148,"difficulty_score":83,"last_commit_at":149,"category_tags":150,"status":84},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",143909,"2026-04-07T11:33:18",[80,81,151],"语言模型",{"id":153,"name":154,"github_repo":155,"description_zh":156,"stars":157,"difficulty_score":83,"last_commit_at":158,"category_tags":159,"status":84},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,"2026-04-06T11:32:50",[80,133,81],{"id":161,"name":162,"github_repo":163,"description_zh":164,"stars":165,"difficulty_score":83,"last_commit_at":166,"category_tags":167,"status":84},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 
助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[168,80],"插件",{"id":170,"name":171,"github_repo":172,"description_zh":173,"stars":174,"difficulty_score":130,"last_commit_at":175,"category_tags":176,"status":84},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[151,133,81,80]]