[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-RightNow-AI--picolm":3,"tool-RightNow-AI--picolm":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":81,"owner_website":82,"owner_url":83,"languages":84,"stars":101,"forks":102,"last_commit_at":103,"license":104,"difficulty_score":10,"env_os":105,"env_gpu":106,"env_ram":107,"env_deps":108,"category_tags":115,"github_topics":116,"view_count":10,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":126,"updated_at":127,"faqs":128,"releases":148},785,"RightNow-AI\u002Fpicolm","picolm","Run a 1-billion parameter LLM on a $10 board with 256MB RAM","picolm 是一款专为极低配置硬件设计的本地大语言模型推理引擎。它的核心能力在于让一个 10 亿参数的模型在仅 256MB 内存、售价约 10 美元的微型开发板上流畅运行。\n\n许多用户受困于云端 API 的费用、隐私风险以及对网络的依赖，而本地部署又往往需要昂贵的显卡。picolm 完美解决了这些问题，实现了真正的离线 AI 体验。无需互联网连接，没有月度账单，所有数据都保留在设备本地。\n\n技术上，picolm 采用纯 C 语言编写，没有任何外部依赖，生成的二进制文件仅有 80KB 左右，运行时内存占用低至 45MB。它支持标准输入输出，能轻松集成到各种脚本或应用中，比如与 PicoClaw 配合构建全功能的离线智能体。\n\n嵌入式系统开发者、物联网爱好者，以及那些对数据隐私有严格要求的研究人员都能从中受益。如果你想在树莓派或更廉价的芯片上体验大模型的魅力，picolm 是绝佳的选择。","\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLanguage-C11-blue?style=flat-square\" alt=\"C11\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBinary_Size-~80KB-brightgreen?style=flat-square\" alt=\"Binary Size\">\n  
\u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRuntime_RAM-45MB-orange?style=flat-square\" alt=\"RAM\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDependencies-Zero-success?style=flat-square\" alt=\"Zero Dependencies\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow?style=flat-square\" alt=\"MIT License\">\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">PicoLM\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>Run a 1-billion parameter LLM on a $10 board with 256MB RAM.\u003C\u002Fstrong>\u003Cbr>\n  Pure C. Zero dependencies. One binary. No Python. No cloud.\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ccode>echo \"Explain gravity\" | .\u002Fpicolm model.gguf -n 100 -j 4\u003C\u002Fcode>\n\u003C\u002Fp>\n\n---\n\n## The Perfect Match: PicoLM + PicoClaw\n\n\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRightNow-AI_picolm_readme_3b61618c0739.jpg\" alt=\"PicoLM — Run a 1-billion parameter LLM on a $10 board\" width=\"640\">\n  \u003Cbr>\u003Cbr>\n\u003C\u002Fdiv>\n\nPicoLM was built as the **local brain** for [PicoClaw](https:\u002F\u002Fgithub.com\u002Fsipeed\u002Fpicoclaw) — an ultra-lightweight AI assistant in Go that runs on $10 hardware. Together, they form a **fully offline AI agent** — no cloud, no API keys, no internet, no monthly bills.\n\n> **Every other LLM provider needs the internet. 
PicoLM doesn't.**\n\n\u003Ctable align=\"center\">\n  \u003Ctr align=\"center\">\n    \u003Ctd>\u003Cb>The Hardware\u003C\u002Fb>\u003C\u002Ftd>\n    \u003Ctd>\u003Cb>The Architecture\u003C\u002Fb>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRightNow-AI_picolm_readme_3193375fa8f7.png\" alt=\"$9.90 LicheeRV Nano\" width=\"360\">\u003C\u002Ftd>\n    \u003Ctd align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRightNow-AI_picolm_readme_6693e07d72d9.jpg\" alt=\"PicoClaw architecture — PicoLM sits in the LLM box\" width=\"420\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"center\">\u003Cem>$9.90 — that's the entire server\u003C\u002Fem>\u003C\u002Ftd>\n    \u003Ctd align=\"center\">\u003Cem>PicoLM powers the LLM box in PicoClaw's agent loop\u003C\u002Fem>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### Why they're a perfect fit\n\n| | Cloud Provider (OpenAI, etc.) | PicoLM (Local) |\n|---|---|---|\n| **Cost** | Pay per token, forever | Free forever |\n| **Privacy** | Your data sent to servers | Everything stays on-device |\n| **Internet** | Required for every request | Not needed at all |\n| **Latency** | Network round-trip + inference | Inference only |\n| **Hardware** | Needs a $599 Mac Mini | Runs on a $10 board |\n| **Binary** | N\u002FA | ~80KB single file |\n| **RAM** | N\u002FA | 45 MB total |\n\n### How it works\n\nPicoClaw's agent loop spawns PicoLM as a subprocess. Messages come in from Telegram, Discord, or CLI — PicoClaw formats them into a chat template, pipes the prompt to `picolm` via stdin, and reads the response from stdout. 
When tools are needed, `--json` grammar mode guarantees valid JSON even from a 1B model.\n\n```\nTelegram \u002F Discord \u002F CLI\n        │\n        ▼\n   ┌──────────┐    stdin: prompt     ┌───────────┐\n   │ PicoClaw │ ──────────────────►  │  picolm   │\n   │   (Go)   │ ◄──────────────────  │   (C)     │\n   └──────────┘    stdout: response  │ + model   │\n        │                            └───────────┘\n        ▼                            45 MB RAM\n   User gets reply                   No internet\n```\n\n### Quick setup\n\n```bash\n# 1. Build PicoLM\ncd picolm && make native    # or: make pi (Raspberry Pi)\n\n# 2. Download model (one-time, 638 MB)\nmake model\n\n# 3. Build PicoClaw\ncd ..\u002Fpicoclaw && make deps && make build\n\n# 4. Configure (~\u002F.picoclaw\u002Fconfig.json)\n```\n\n```json\n{\n  \"agents\": {\n    \"defaults\": {\n      \"provider\": \"picolm\",\n      \"model\": \"picolm-local\"\n    }\n  },\n  \"providers\": {\n    \"picolm\": {\n      \"binary\": \"~\u002F.picolm\u002Fbin\u002Fpicolm\",\n      \"model\": \"~\u002F.picolm\u002Fmodels\u002Ftinyllama-1.1b-chat-v1.0.Q4_K_M.gguf\",\n      \"max_tokens\": 256,\n      \"threads\": 4,\n      \"template\": \"chatml\"\n    }\n  }\n}\n```\n\n```bash\n# 5. 
Chat — fully offline!\npicoclaw agent -m \"What is photosynthesis?\"\n```\n\n### Or install everything in one line\n\n```bash\ncurl -sSL https:\u002F\u002Fraw.githubusercontent.com\u002FRightNow-AI\u002Fpicolm\u002Fmain\u002Finstall.sh | bash\n```\n\n### Performance on real hardware\n\n| Device | Price | Generation Speed | RAM Used |\n|--------|-------|-----------------|----------|\n| **Pi 5** (4-core) | $60 | ~10 tok\u002Fs | 45 MB |\n| **Pi 4** (4-core) | $35 | ~8 tok\u002Fs | 45 MB |\n| **Pi 3B+** | $25 | ~4 tok\u002Fs | 45 MB |\n| **Pi Zero 2W** | $15 | ~2 tok\u002Fs | 45 MB |\n| **LicheeRV Nano** | $10 | ~1 tok\u002Fs | 45 MB |\n\n### JSON tool calling\n\nPicoClaw automatically activates `--json` grammar mode when it needs structured output. This **guarantees syntactically valid JSON** even from a 1B parameter model — essential for reliable tool calling on tiny hardware:\n\n```bash\npicoclaw agent -m \"Search for weather in Tokyo\"\n# → PicoLM generates: {\"tool_calls\": [{\"function\": {\"name\": \"web_search\", \"arguments\": \"{\\\"query\\\": \\\"weather Tokyo\\\"}\"}}]}\n```\n\n> For the full PicoClaw documentation, see the [PicoClaw README](https:\u002F\u002Fgithub.com\u002Fsipeed\u002Fpicoclaw).\n\n---\n\n## What is PicoLM?\n\nPicoLM is a **minimal, from-scratch LLM inference engine** written in ~2,500 lines of C11. It runs [TinyLlama 1.1B](https:\u002F\u002Fhuggingface.co\u002FTinyLlama\u002FTinyLlama-1.1B-Chat-v1.0) (and other LLaMA-architecture models in GGUF format) on hardware that most inference frameworks won't even consider:\n\n- **Raspberry Pi Zero 2W** ($15, 512MB RAM, ARM Cortex-A53)\n- **Sipeed LicheeRV** ($12, 512MB RAM, RISC-V)\n- **Raspberry Pi 3\u002F4\u002F5** (1-8GB RAM, ARM NEON SIMD)\n- Any Linux\u002FWindows\u002FmacOS x86-64 machine\n\nThe model file (638MB) stays on disk. PicoLM **memory-maps** it and streams one layer at a time through RAM. 
Total runtime memory: **~45MB** including the FP16 KV cache.\n\n```\n                    ┌──────────────────────────────────────────┐\n   What goes        │         45 MB Runtime RAM                │\n   in RAM           │  ┌─────────┐ ┌──────────┐ ┌───────────┐  │\n                    │  │ Buffers │ │ FP16 KV  │ │ Tokenizer │  │\n                    │  │  1.2 MB │ │ Cache    │ │   4.5 MB  │  │\n                    │  │         │ │  ~40 MB  │ │           │  │\n                    │  └─────────┘ └──────────┘ └───────────┘  │\n                    └──────────────────────────────────────────┘\n\n                    ┌──────────────────────────────────────────┐\n   What stays       │        638 MB Model on Disk              │\n   on disk          │       (mmap — OS pages in layers         │\n   (via mmap)       │        as needed, ~1 at a time)          │\n                    └──────────────────────────────────────────┘\n```\n\n---\n\n## Features\n\n| Feature | Description |\n|---------|-------------|\n| **GGUF Native** | Reads GGUF v2\u002Fv3 files directly — no conversion needed |\n| **K-Quant Support** | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, F16, F32 |\n| **mmap Layer Streaming** | Model weights stay on disk; OS pages in one layer at a time |\n| **FP16 KV Cache** | Halves KV cache memory (44MB vs 88MB for 2048 context) |\n| **Flash Attention** | Online softmax — no O(seq_len) attention buffer needed |\n| **Pre-computed RoPE** | cos\u002Fsin lookup tables eliminate transcendentals from hot loop |\n| **SIMD Acceleration** | ARM NEON (Pi 3\u002F4\u002F5) and x86 SSE2 (Intel\u002FAMD) auto-detected |\n| **Fused Dot Products** | Dequantize + dot-product in one pass — no intermediate buffer |\n| **Multi-threaded matmul** | Parallel matrix-vector multiply across CPU cores |\n| **Grammar-Constrained JSON** | `--json` flag forces valid JSON output (for tool calling) |\n| **KV Cache Persistence** | `--cache` saves\u002Floads prompt state — skip prefill on re-runs |\n| **BPE 
Tokenizer** | Score-based byte-pair encoding, loaded from GGUF metadata |\n| **Top-p Sampling** | Temperature + nucleus sampling with configurable seed |\n| **Pipe-friendly** | Reads prompts from stdin: `echo \"Hello\" \\| .\u002Fpicolm model.gguf` |\n| **Zero Dependencies** | Only libc, libm, libpthread. No external libraries. |\n| **Cross-platform** | Linux, Windows (MSVC), macOS. ARM, x86-64, RISC-V. |\n\n---\n\n## Quick Start\n\n### One-liner install (Raspberry Pi \u002F Linux)\n\n```bash\ncurl -sSL https:\u002F\u002Fraw.githubusercontent.com\u002FRightNow-AI\u002Fpicolm\u002Fmain\u002Finstall.sh | bash\n```\n\nThis will:\n1. Detect your platform (ARM64, ARMv7, x86-64)\n2. Install build dependencies (`gcc`, `make`, `curl`)\n3. Build PicoLM with optimal SIMD flags for your CPU\n4. Download TinyLlama 1.1B Q4_K_M (638 MB)\n5. Run a quick test\n6. Generate PicoClaw config\n7. Add `picolm` to your PATH\n\n### Build from source\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Frightnow-ai\u002Fpicolm.git\ncd picolm\u002Fpicolm\n\n# Auto-detect CPU (enables SSE2\u002FAVX on x86, NEON on ARM)\nmake native\n\n# Download a model\nmake model\n\n# Run it\n.\u002Fpicolm \u002Fopt\u002Fpicolm\u002Fmodels\u002Ftinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \\\n    -p \"The meaning of life is\" -n 100\n```\n\n### Build on Windows (MSVC)\n\n```cmd\ncd picolm\nbuild.bat\npicolm.exe model.gguf -p \"Hello world\" -n 50\n```\n\n### Platform-specific builds\n\n```bash\nmake native      # x86\u002FARM auto-detect (recommended for local machine)\nmake pi          # Raspberry Pi 3\u002F4\u002F5 (64-bit ARM + NEON SIMD)\nmake pi-arm32    # Pi Zero \u002F Pi 1 (32-bit ARM)\nmake cross-pi    # Cross-compile for Pi from x86 (static binary)\nmake riscv       # RISC-V (Sipeed LicheeRV, etc.)\nmake static      # Static binary for single-file deployment\nmake debug       # Debug build with symbols, no optimization\n```\n\n---\n\n## Usage\n\n```\nPicoLM — ultra-lightweight LLM inference 
engine\n\nUsage: picolm \u003Cmodel.gguf> [options]\n\nGeneration options:\n  -p \u003Cprompt>    Input prompt (or pipe via stdin)\n  -n \u003Cint>       Max tokens to generate (default: 256)\n  -t \u003Cfloat>     Temperature (default: 0.8, 0=greedy)\n  -k \u003Cfloat>     Top-p \u002F nucleus sampling (default: 0.9)\n  -s \u003Cint>       RNG seed (default: 42)\n  -c \u003Cint>       Context length override\n  -j \u003Cint>       Number of threads (default: 4)\n\nAdvanced options:\n  --json         Grammar-constrained JSON output mode\n  --cache \u003Cfile> KV cache file (saves\u002Floads prompt state)\n```\n\n### Examples\n\n**Basic generation:**\n```bash\n.\u002Fpicolm model.gguf -p \"Once upon a time\" -n 200\n```\n\n**Greedy decoding (deterministic, temperature=0):**\n```bash\n.\u002Fpicolm model.gguf -p \"The capital of France is\" -n 20 -t 0\n# Output: Paris. It is the largest city in France and...\n```\n\n**Chat with TinyLlama (ChatML format):**\n```bash\n.\u002Fpicolm model.gguf -n 200 -t 0.7 -p \"\u003C|user|>\nWhat is photosynthesis?\u003C\u002Fs>\n\u003C|assistant|>\n\"\n```\n\n**Force JSON output (for tool calling \u002F structured data):**\n```bash\n.\u002Fpicolm model.gguf --json -t 0.3 -n 100 -p \"\u003C|user|>\nReturn the current time as JSON.\u003C\u002Fs>\n\u003C|assistant|>\n\"\n# Output: {\"time\": \"12:00 PM\"}\n```\n\n**Pipe from stdin:**\n```bash\necho \"Explain quantum computing in one sentence\" | .\u002Fpicolm model.gguf -n 50\n```\n\n**KV cache — skip repeated prefill:**\n```bash\n# First run: processes prompt + saves cache\n.\u002Fpicolm model.gguf --cache prompt.kvc -p \"Long system prompt here...\" -n 50\n\n# Second run: loads cache, skips prompt prefill (74% faster)\n.\u002Fpicolm model.gguf --cache prompt.kvc -p \"Long system prompt here...\" -n 50\n# Output: \"Skipping 25 cached prompt tokens\"\n```\n\n**Multi-threaded on a Pi 4 (4 cores):**\n```bash\n.\u002Fpicolm model.gguf -p \"Hello\" -n 100 -j 4\n```\n\n---\n\n## 
Performance\n\nMeasured on TinyLlama 1.1B Q4_K_M (638 MB model):\n\n| Metric | x86-64 (8 threads) | Pi 4 (4 cores, NEON) | Pi Zero 2W |\n|--------|--------------------|-----------------------|------------|\n| **Prefill** | ~11 tok\u002Fs | ~6 tok\u002Fs | ~1.5 tok\u002Fs |\n| **Generation** | ~13 tok\u002Fs | ~8 tok\u002Fs* | ~2 tok\u002Fs* |\n| **Runtime RAM** | 45 MB | 45 MB | 45 MB |\n| **First token** | ~2.3s | ~4s | ~16s |\n| **Binary size** | ~80 KB | ~70 KB | ~65 KB |\n\n*\\*Estimated with NEON SIMD enabled. Actual numbers depend on SD card speed and thermal throttling.*\n\n### What makes it fast\n\n```\n Raw C inference          ████████████░░░░░░░░  13.5 tok\u002Fs  (baseline: 1.6)\n + Fused dot products     ████████████████░░░░  (eliminate dequant buffer)\n + Multi-threaded matmul  █████████████████░░░  (4-8 cores in parallel)\n + FP16 KV cache          █████████████████░░░  (halve memory bandwidth)\n + Pre-computed RoPE      ██████████████████░░  (no sin\u002Fcos in hot loop)\n + Flash attention        ██████████████████░░  (no O(n) attention alloc)\n + NEON\u002FSSE2 SIMD         ███████████████████░  (4-wide vector ops)\n + KV cache persistence   ████████████████████  (skip prefill entirely)\n```\n\n---\n\n## Architecture\n\n```\n                          ┌─────────────────────────────────┐\n                          │           picolm.c              │\n                          │     CLI + Generation Loop       │\n                          └──────┬──────────────┬───────────┘\n                                 │              │\n                    ┌────────────┘              └────────────┐\n                    │                                        │\n           ┌────────┴────────┐                    ┌──────────┴──────────┐\n           │    model.h\u002Fc    │                    │    sampler.h\u002Fc      │\n           │  GGUF Parser    │                    │  Temperature +      │\n           │  mmap Layer     │                    │  Top-p Sampling    
 │\n           │  Streaming      │                    └──────────┬──────────┘\n           │  Forward Pass   │                               │\n           │  KV Cache I\u002FO   │                    ┌──────────┴──────────┐\n           └───┬────────┬────┘                    │    grammar.h\u002Fc      │\n               │        │                         │  JSON Constraint    │\n      ┌────────┘        └───────┐                 │  Logit Masking      │\n      │                         │                 └─────────────────────┘\n┌─────┴──────┐          ┌───────┴────────┐\n│ tensor.h\u002Fc │          │ tokenizer.h\u002Fc  │\n│ matmul     │          │ BPE Encode     │\n│ rmsnorm    │          │ Decode         │\n│ softmax    │          │ Vocab Lookup   │\n│ rope       │          └────────────────┘\n│ silu       │\n│ threading  │\n└─────┬──────┘\n      │\n┌─────┴──────┐\n│  quant.h\u002Fc │\n│ Q4_K, Q6_K │\n│ Q3_K, Q2_K │\n│ FP16, F32  │\n│ NEON + SSE │\n│ Fused Dots │\n└────────────┘\n```\n\n### The LLaMA Forward Pass (what happens for each token)\n\n```\nInput Token\n    │\n    ▼\n┌───────────────┐\n│ Embedding     │  Dequantize row from token_embd → x[2048]\n│ Lookup        │\n└───────┬───────┘\n        │\n        ▼\n┌───────────────┐  ×22 layers\n│ RMSNorm       │─────────────────────────────────────────┐\n│               │                                         │\n│ Q = xb @ Wq   │  Matrix-vector multiply (quantized)     │\n│ K = xb @ Wk   │  Store K,V in FP16 KV cache             │\n│ V = xb @ Wv   │                                         │\n│               │                                         │\n│ RoPE(Q, K)    │  Rotary position encoding (table lookup)│\n│               │                                         │\n│ Attention     │  Flash attention with online softmax    │\n│ (GQA 32→4)    │  Grouped-query: 32 Q heads, 4 KV heads  │\n│               │                                         │\n│ x += Out@Wo   │  Output projection + residual           │\n│     
          │                                         │\n│ RMSNorm       │                                         │\n│               │                                         │\n│ SwiGLU FFN    │  gate=SiLU(xb@Wg), up=xb@Wu             │\n│               │  x += (gate*up) @ Wd                    │\n└───────┬───────┘─────────────────────────────────────────┘\n        │\n        ▼\n┌───────────────┐\n│ Final RMSNorm │\n│ x @ W_output  │─→ logits[32000]\n└───────┬───────┘\n        │\n        ▼\n┌───────────────┐\n│ Grammar Mask  │  (if --json: force valid JSON structure)\n│ Sample Token  │  temperature → softmax → top-p → pick\n└───────────────┘\n```\n\n---\n\n## Memory Budget\n\nFor TinyLlama 1.1B Q4_K_M with 2048 context length:\n\n| Component | Size | Notes |\n|-----------|------|-------|\n| FP16 KV cache | ~40 MB | 22 layers x 2 x 2048 x 256 x 2 bytes |\n| Tokenizer | ~4.5 MB | 32K vocab strings + scores + sorted index |\n| Activation buffers | ~0.14 MB | x, xb, xb2, q, hb, hb2 |\n| Logits buffer | ~0.12 MB | 32000 x 4 bytes |\n| Dequant scratch | ~0.02 MB | Max(n_embd, n_ffn) floats |\n| Norm weights (pre-dequant) | ~0.35 MB | 45 norm vectors x 2048 x 4 bytes |\n| RoPE tables | ~0.03 MB | cos + sin x 2048 x 32 entries |\n| **Total runtime** | **~45 MB** | |\n| | | |\n| Model file (on disk) | 638 MB | Memory-mapped, ~1 layer in RAM at a time |\n\nWith 512 context (for constrained devices):\n\n| Component | Size |\n|-----------|------|\n| FP16 KV cache | ~10 MB |\n| Everything else | ~5 MB |\n| **Total** | **~15 MB** |\n\n---\n\n## Optimizations Deep-Dive\n\nPicoLM implements 9 optimizations that brought generation speed from **1.6 tok\u002Fs to 13.5 tok\u002Fs** on x86, with even larger gains expected on ARM with NEON:\n\n### 1. ARM NEON SIMD\n\n4-wide float vector operations for all hot paths. Example: dequantizing Q4_K nibbles with `vmovl_u8` → `vmovl_u16` → `vcvtq_f32_u32`, and RoPE with interleaved `vld2q_f32` \u002F `vst2q_f32`.\n\n### 2. 
x86 SSE2 SIMD\n\nAuto-detected on Intel\u002FAMD. 4-wide `__m128` operations for dot products, RMSNorm, and vector operations.\n\n### 3. FP16 KV Cache\n\nKey and value vectors stored as 16-bit floats instead of 32-bit. Halves KV cache memory from ~88MB to ~44MB. Conversion uses software `fp32_to_fp16()` \u002F `fp16_to_fp32()` — no hardware FP16 support required.\n\n### 4. Pre-computed RoPE Tables\n\nSine and cosine values for all positions computed once at model load. The forward pass does a table lookup instead of calling `sinf()` \u002F `cosf()` \u002F `powf()` 64 times per token.\n\n### 5. Flash Attention (Online Softmax)\n\nSingle-pass attention with running maximum rescaling. Eliminates the `O(seq_len)` attention score buffer — critical for long contexts on memory-constrained devices.\n\n### 6. Fused Dequantize + Dot Product\n\n`vec_dot_q4_K_f32()` dequantizes and accumulates in one pass. No intermediate float buffer for the weight row. Reduces memory traffic by ~50% for matmul.\n\n### 7. Multi-threaded Matrix Multiply\n\n`matmul()` distributes output rows across threads using pthreads. Each thread processes its chunk independently with fused dot products. Scales linearly up to ~8 cores.\n\n### 8. Grammar-Constrained JSON\n\nThe `--json` mode pre-analyzes every token in the vocabulary at load time (brace delta, bracket delta, quote parity). During generation, it masks logits to guarantee syntactically valid JSON — essential for tool-calling with small models.\n\n### 9. KV Cache Persistence\n\n`--cache file.kvc` saves the FP16 KV cache state after prompt processing. On the next run with the same prompt, it loads the cache and skips prefill entirely. 
**74% latency reduction** for repeated system prompts.\n\n---\n\n## Supported Models\n\nPicoLM supports any LLaMA-architecture model in GGUF format:\n\n| Model | Parameters | GGUF Size (Q4_K_M) | RAM Needed |\n|-------|-----------|---------------------|------------|\n| **TinyLlama 1.1B** | 1.1B | 638 MB | ~45 MB |\n| **Llama 2 7B** | 7B | 4.1 GB | ~200 MB |\n| **Phi-2** | 2.7B | 1.6 GB | ~90 MB |\n\n> **Recommended for embedded:** TinyLlama 1.1B Q4_K_M — fits comfortably on devices with 256MB+ RAM.\n\n### Supported quantization formats\n\n`Q2_K` `Q3_K` `Q4_K` `Q4_0` `Q5_K` `Q6_K` `Q8_0` `F16` `F32`\n\n---\n\n## File Structure\n\n```\nPicoLM\u002F\n├── README.md              ← you are here\n├── BLOG.md                ← technical deep-dive blog post\n├── install.sh             ← one-liner Pi installer\n│\n├── picolm\u002F                ← the inference engine (pure C)\n│   ├── picolm.c           ← CLI entry point, generation loop (273 lines)\n│   ├── model.h\u002Fc          ← GGUF parser, mmap, forward pass (146 + 833 lines)\n│   ├── tensor.h\u002Fc         ← matmul, rmsnorm, softmax, rope (44 + 298 lines)\n│   ├── quant.h\u002Fc          ← dequantization, SIMD kernels (140 + 534 lines)\n│   ├── tokenizer.h\u002Fc      ← BPE tokenizer (32 + ~200 lines)\n│   ├── sampler.h\u002Fc        ← temperature + top-p sampling (19 + ~100 lines)\n│   ├── grammar.h\u002Fc        ← JSON grammar constraints (64 + 175 lines)\n│   ├── Makefile           ← build targets for all platforms\n│   └── build.bat          ← Windows MSVC build script\n│\n└── tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf  ← model file (638 MB, not in git)\n```\n\n**Total C source: ~2,500 lines.** That's the entire inference engine — GGUF parsing, mmap, dequantization, matrix math, attention, tokenization, sampling, and grammar constraints.\n\n---\n\n## How It Works\n\n### The mmap trick\n\nTraditional inference engines load the entire model into RAM. PicoLM doesn't. Instead:\n\n1. 
The model file is **memory-mapped** (`mmap` on Linux\u002FmacOS, `MapViewOfFile` on Windows)\n2. Weight pointers point directly into the mapped file — no copying\n3. During the forward pass, each layer's weights are accessed sequentially\n4. The OS automatically pages in the needed weights and evicts old ones\n5. `madvise(MADV_SEQUENTIAL)` hints the access pattern to the kernel\n\n**Result:** A 638MB model runs on a device with 256MB RAM. Only ~30MB of the model is in physical memory at any time.\n\n### Quantization\n\nWeights are stored in 4-bit quantized format (Q4_K_M). For TinyLlama:\n- **Original:** 1.1B parameters x 4 bytes = 4.4 GB\n- **Q4_K:** 1.1B parameters x ~0.56 bytes = 638 MB\n- **Quality loss:** Minimal — Q4_K preserves 6-bit scales per 32-weight sub-block\n\n### Grouped-Query Attention (GQA)\n\nTinyLlama uses 32 query heads but only 4 key\u002Fvalue heads. Each KV head is shared by 8 query heads. This reduces KV cache size by 8x compared to full multi-head attention.\n\n---\n\n## Building & Testing\n\n### Prerequisites\n\n| Platform | Requirements |\n|----------|-------------|\n| **Linux\u002FPi** | `gcc`, `make` (install via `apt install build-essential`) |\n| **macOS** | Xcode Command Line Tools (`xcode-select --install`) |\n| **Windows** | Visual Studio Build Tools (cl.exe) |\n\n### Verify your build\n\n```bash\n# Build\nmake native\n\n# Test with greedy decoding (deterministic output)\n.\u002Fpicolm model.gguf -p \"The capital of France is\" -n 20 -t 0\n# Expected: \"Paris. 
It is the largest city in France...\"\n\n# Test JSON mode\n.\u002Fpicolm model.gguf --json -p \"Return JSON with name and age\" -n 50 -t 0.3\n# Expected: valid JSON like {\"name\": \"...\", \"age\": ...}\n\n# Test KV cache\n.\u002Fpicolm model.gguf --cache test.kvc -p \"Hello\" -n 10 -t 0\n.\u002Fpicolm model.gguf --cache test.kvc -p \"Hello\" -n 10 -t 0\n# Second run should say \"Skipping N cached prompt tokens\"\n```\n\n### Memory verification\n\nPicoLM prints memory stats to stderr:\n\n```\nMemory: 1.17 MB runtime state (FP16 KV cache separate)\n```\n\nTotal = runtime state + FP16 KV cache. For TinyLlama with 2048 context: ~45 MB.\n\n---\n\n## FAQ\n\n**Q: Can this run Llama 2 7B?**\nA: Yes, if you have enough RAM for the KV cache (~1.4 GB for 7B with 4096 context). The model file stays on disk via mmap. On a Pi 4 with 4GB RAM, it works but is slow (~1-2 tok\u002Fs).\n\n**Q: Why not use llama.cpp?**\nA: llama.cpp is excellent but requires ~200MB+ for the runtime on small models, has complex build dependencies, and targets desktop\u002Fserver use cases. PicoLM is purpose-built for embedded: 45MB RAM, 80KB binary, zero dependencies.\n\n**Q: Is the output quality good?**\nA: TinyLlama 1.1B is a small model — it handles simple tasks (Q&A, summarization, basic reasoning, JSON generation) well. It won't match GPT-4, but it runs on a $10 board with no internet. For structured output, the `--json` grammar mode guarantees valid JSON regardless of model quality.\n\n**Q: What about GPU acceleration?**\nA: PicoLM is CPU-only by design. The target hardware ($10-15 boards) doesn't have GPUs. On x86\u002FARM CPUs, SIMD (NEON\u002FSSE2) provides meaningful speedup.\n\n**Q: Can I use a different model?**\nA: Any LLaMA-architecture GGUF model works. Download from [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fmodels?search=gguf) and point PicoLM at it. 
Recommended quantizations: Q4_K_M (best quality\u002Fsize balance) or Q2_K (smallest, lower quality).\n\n---\n\n## Roadmap\n\n- [ ] AVX2\u002FAVX-512 kernels for x86 (2-4x generation speed on modern CPUs)\n- [ ] Speculative decoding with a draft model\n- [ ] Context sliding window (infinite generation beyond max_seq_len)\n- [ ] Weight pruning for further memory reduction\n- [ ] Continuous batching for server mode\n- [ ] Mistral \u002F Phi architecture support\n\n---\n\n## Technical Blog\n\nFor a detailed writeup of the optimization journey (with code snippets and war stories), see [**BLOG.md**](BLOG.md).\n\n---\n\n## License\n\nMIT License. See [LICENSE](LICENSE) for details.\n\n---\n\n\u003Cp align=\"center\">\n  \u003Cstrong>PicoLM\u003C\u002Fstrong> — because intelligence shouldn't require a data center.\n\u003C\u002Fp>\n","\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLanguage-C11-blue?style=flat-square\" alt=\"C11\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FBinary_Size-~80KB-brightgreen?style=flat-square\" alt=\"Binary Size\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FRuntime_RAM-45MB-orange?style=flat-square\" alt=\"RAM\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FDependencies-Zero-success?style=flat-square\" alt=\"Zero Dependencies\">\n  \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-MIT-yellow?style=flat-square\" alt=\"MIT License\">\n\u003C\u002Fp>\n\n\u003Ch1 align=\"center\">PicoLM\u003C\u002Fh1>\n\n\u003Cp align=\"center\">\n  \u003Cstrong>在拥有 256MB RAM 的 10 美元开发板上运行 10 亿参数的大语言模型 (LLM)。\u003C\u002Fstrong>\u003Cbr>\n  纯 C 语言。零依赖。单个二进制文件。无需 Python。无需云端。\n\u003C\u002Fp>\n\n\u003Cp align=\"center\">\n  \u003Ccode>echo \"Explain gravity\" | .\u002Fpicolm model.gguf -n 100 -j 4\u003C\u002Fcode>\n\u003C\u002Fp>\n\n---\n\n## 完美搭档：PicoLM + PicoClaw\n\n\u003Cdiv align=\"center\">\n  
\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRightNow-AI_picolm_readme_3b61618c0739.jpg\" alt=\"PicoLM — Run a 1-billion parameter LLM on a $10 board\" width=\"640\">\n  \u003Cbr>\u003Cbr>\n\u003C\u002Fdiv>\n\nPicoLM 是 [PicoClaw](https:\u002F\u002Fgithub.com\u002Fsipeed\u002Fpicoclaw) 的**本地大脑**——一个运行在 10 美元硬件上的超轻量级 Go 语言 AI 助手。两者结合形成一个**完全离线的 AI 智能体**——无需云端，无需 API 密钥，无需互联网，无需月费。\n\n> **其他所有 LLM 提供商都需要互联网。PicoLM 不需要。**\n\n\u003Ctable align=\"center\">\n  \u003Ctr align=\"center\">\n    \u003Ctd>\u003Cb>硬件\u003C\u002Fb>\u003C\u002Ftd>\n    \u003Ctd>\u003Cb>架构\u003C\u002Fb>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRightNow-AI_picolm_readme_3193375fa8f7.png\" alt=\"$9.90 LicheeRV Nano\" width=\"360\">\u003C\u002Ftd>\n    \u003Ctd align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRightNow-AI_picolm_readme_6693e07d72d9.jpg\" alt=\"PicoClaw architecture — PicoLM sits in the LLM box\" width=\"420\">\u003C\u002Ftd>\n  \u003C\u002Ftr>\n  \u003Ctr>\n    \u003Ctd align=\"center\">\u003Cem>9.90 美元——这就是整个服务器\u003C\u002Fem>\u003C\u002Ftd>\n    \u003Ctd align=\"center\">\u003Cem>PicoLM 为 PicoClaw 智能体循环中的 LLM 模块提供动力\u003C\u002Fem>\u003C\u002Ftd>\n  \u003C\u002Ftr>\n\u003C\u002Ftable>\n\n### 为什么它们是完美契合\n\n| | 云提供商 (OpenAI 等) | PicoLM (本地) |\n|---|---|---|\n| **成本** | 按令牌付费，永久付费 | 永久免费 |\n| **隐私** | 数据发送至服务器 | 所有数据保留在设备端 |\n| **互联网** | 每次请求都需要 | 完全不需要 |\n| **延迟** | 网络往返 + 推理 | 仅推理 |\n| **硬件** | 需要价值 599 美元的 Mac Mini | 运行在 10 美元的开发板上 |\n| **二进制文件** | N\u002FA | ~80KB 单文件 |\n| **内存** | N\u002FA | 总计 45 MB |\n\n### 工作原理\n\nPicoClaw 的智能体循环将 PicoLM 作为子进程启动。消息来自 Telegram、Discord 或命令行界面（CLI）——PicoClaw 将其格式化为聊天模板，通过标准输入（stdin）将提示词管道传输给 `picolm`，并从标准输出（stdout）读取响应。当需要工具时，`--json` 语法模式可确保即使是从 1B 模型也能生成有效的 JSON。\n\n```\nTelegram \u002F Discord \u002F CLI\n        │\n        ▼\n   ┌──────────┐    stdin: prompt     ┌───────────┐\n 
  │ PicoClaw │ ──────────────────►  │  picolm   │\n   │   (Go)   │ ◄──────────────────  │   (C)     │\n   └──────────┘    stdout: response  │ + model   │\n        │                            └───────────┘\n        ▼                            45 MB RAM\n   User gets reply                   No internet\n```\n\n### 快速设置\n\n```bash\n# 1. Build PicoLM\ncd picolm && make native    # or: make pi (Raspberry Pi)\n\n# 2. Download model (one-time, 638 MB)\nmake model\n\n# 3. Build PicoClaw\ncd ..\u002Fpicoclaw && make deps && make build\n\n# 4. Configure (~\u002F.picoclaw\u002Fconfig.json)\n```\n\n```json\n{\n  \"agents\": {\n    \"defaults\": {\n      \"provider\": \"picolm\",\n      \"model\": \"picolm-local\"\n    }\n  },\n  \"providers\": {\n    \"picolm\": {\n      \"binary\": \"~\u002F.picolm\u002Fbin\u002Fpicolm\",\n      \"model\": \"~\u002F.picolm\u002Fmodels\u002Ftinyllama-1.1b-chat-v1.0.Q4_K_M.gguf\",\n      \"max_tokens\": 256,\n      \"threads\": 4,\n      \"template\": \"chatml\"\n    }\n  }\n}\n```\n\n```bash\n# 5. 
Chat — fully offline!\npicoclaw agent -m \"What is photosynthesis?\"\n```\n\n### 或者一键安装所有内容\n\n```bash\ncurl -sSL https:\u002F\u002Fraw.githubusercontent.com\u002FRightNow-AI\u002Fpicolm\u002Fmain\u002Finstall.sh | bash\n```\n\n### 真实硬件性能表现\n\n| 设备 | 价格 | 生成速度 | 占用内存 |\n|--------|-------|-----------------|----------|\n| **Pi 5** (4 核) | 60 美元 | ~10 词元\u002F秒 | 45 MB |\n| **Pi 4** (4 核) | 35 美元 | ~8 词元\u002F秒 | 45 MB |\n| **Pi 3B+** | 25 美元 | ~4 词元\u002F秒 | 45 MB |\n| **Pi Zero 2W** | 15 美元 | ~2 词元\u002F秒 | 45 MB |\n| **LicheeRV Nano** | 10 美元 | ~1 词元\u002F秒 | 45 MB |\n\n### JSON 工具调用\n\n当需要结构化输出时，PicoClaw 会自动激活 `--json` 语法模式。这**保证了即使是从 10 亿参数模型生成的 JSON 也是语法有效的**——对于在微型硬件上可靠地进行工具调用至关重要：\n\n```bash\npicoclaw agent -m \"Search for weather in Tokyo\"\n# → PicoLM generates: {\"tool_calls\": [{\"function\": {\"name\": \"web_search\", \"arguments\": \"{\\\"query\\\": \\\"weather Tokyo\\\"}\"}}]}\n```\n\n> 如需完整的 PicoClaw 文档，请参阅 [PicoClaw README](https:\u002F\u002Fgithub.com\u002Fsipeed\u002Fpicoclaw)。\n\n---\n\n## PicoLM 是什么？\n\nPicoLM 是一个用约 2,500 行 C11 代码编写的**极简从头构建的 LLM 推理引擎**。它能在大多数推理框架甚至不会考虑运行的硬件上运行 [TinyLlama 1.1B](https:\u002F\u002Fhuggingface.co\u002FTinyLlama\u002FTinyLlama-1.1B-Chat-v1.0)（以及以 GGUF 格式的其他 LLaMA 架构模型）：\n\n- **Raspberry Pi Zero 2W**（15 美元，512MB RAM，ARM Cortex-A53）\n- **Sipeed LicheeRV**（12 美元，512MB RAM，RISC-V）\n- **Raspberry Pi 3\u002F4\u002F5**（1-8GB RAM，ARM NEON SIMD）\n- 任何 Linux\u002FWindows\u002FmacOS x86-64 机器\n\n模型文件（638MB）保留在磁盘上。PicoLM 对其使用**内存映射（memory-mapping）**，并通过 RAM 逐层流式传输。总运行时内存：**约 45MB**，包括 FP16 KV 缓存。\n\n```\n                    ┌──────────────────────────────────────────┐\n   What goes        │         45 MB Runtime RAM                │\n   in RAM           │  ┌─────────┐ ┌──────────┐ ┌───────────┐  │\n                    │  │ Buffers │ │ FP16 KV  │ │ Tokenizer │  │\n                    │  │  1.2 MB │ │ Cache    │ │   4.5 MB  │  │\n                    │  │         │ │  ~40 MB  │ │           │  │\n                    │  └─────────┘ 
└──────────┘ └───────────┘  │\n                    └──────────────────────────────────────────┘\n\n                    ┌──────────────────────────────────────────┐\n   What stays       │        638 MB Model on Disk              │\n   on disk          │       (mmap — OS pages in layers         │\n   (via mmap)       │        as needed, ~1 at a time)          │\n                    └──────────────────────────────────────────┘\n```\n\n## 功能特性\n\n| 功能 | 描述 |\n|---------|-------------|\n| **GGUF Native** | 直接读取 GGUF v2\u002Fv3 文件 — 无需转换 |\n| **K-Quant 支持** | Q2_K, Q3_K, Q4_K, Q5_K, Q6_K, Q8_0, Q4_0, F16, F32 |\n| **mmap 层流式加载** | 模型权重保留在磁盘上；OS 逐页加载一层 (mmap 内存映射) |\n| **FP16 KV 缓存** | 减少一半 KV 缓存 (Key-Value Cache) 内存 (2048 上下文下 44MB vs 88MB) |\n| **Flash Attention** | 在线 Softmax — 无需 O(seq_len) Attention (注意力机制) 缓冲区 |\n| **预计算 RoPE** | cos\u002Fsin 查找表消除热点循环中的超越函数 (RoPE 旋转位置编码) |\n| **SIMD 加速** | ARM NEON (Pi 3\u002F4\u002F5) 和 x86 SSE2 (Intel\u002FAMD) 自动检测 (SIMD 单指令多数据流) |\n| **融合点积** | 反量化 + 点积一次完成 — 无中间缓冲区 |\n| **多线程矩阵乘法** | 跨 CPU 核心并行矩阵向量乘法 |\n| **语法约束 JSON** | `--json` 标志强制输出有效 JSON（用于工具调用） |\n| **KV 缓存持久化** | `--cache` 保存\u002F加载提示状态 — 重新运行时跳过预填充 (Prefill) |\n| **BPE 分词器** | 基于分数的字节对编码 (BPE)，从 GGUF 元数据加载 |\n| **Top-p 采样** | 温度 + 核采样，可配置种子 |\n| **管道友好** | 从 stdin 读取提示：`echo \"Hello\" \\| .\u002Fpicolm model.gguf` |\n| **零依赖** | 仅 libc, libm, libpthread。无外部库。 |\n| **跨平台** | Linux, Windows (MSVC), macOS。ARM, x86-64, RISC-V。 |\n\n---\n\n## 快速开始\n\n### 一键安装 (树莓派 \u002F Linux)\n\n```bash\ncurl -sSL https:\u002F\u002Fraw.githubusercontent.com\u002FRightNow-AI\u002Fpicolm\u002Fmain\u002Finstall.sh | bash\n```\n\n这将执行以下操作：\n1. 检测你的平台 (ARM64, ARMv7, x86-64)\n2. 安装构建依赖项 (`gcc`, `make`, `curl`)\n3. 使用针对你 CPU 优化的 SIMD 标志构建 PicoLM\n4. 下载 TinyLlama 1.1B Q4_K_M (638 MB)\n5. 运行快速测试\n6. 生成 PicoClaw 配置\n7. 
将 `picolm` 添加到你的 PATH\n\n### 从源码构建\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Frightnow-ai\u002Fpicolm.git\ncd picolm\u002Fpicolm\n\n# 自动检测 CPU (在 x86 上启用 SSE2\u002FAVX，在 ARM 上启用 NEON)\nmake native\n\n# 下载模型\nmake model\n\n# 运行它\n.\u002Fpicolm \u002Fopt\u002Fpicolm\u002Fmodels\u002Ftinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \\\n    -p \"The meaning of life is\" -n 100\n```\n\n### Windows 构建 (MSVC)\n\n```cmd\ncd picolm\nbuild.bat\npicolm.exe model.gguf -p \"Hello world\" -n 50\n```\n\n### 特定平台构建\n\n```bash\nmake native      # x86\u002FARM 自动检测 (推荐用于本地机器)\nmake pi          # Raspberry Pi 3\u002F4\u002F5 (64 位 ARM + NEON SIMD)\nmake pi-arm32    # Pi Zero \u002F Pi 1 (32 位 ARM)\nmake cross-pi    # 从 x86 交叉编译给 Pi (静态二进制)\nmake riscv       # RISC-V (Sipeed LicheeRV 等)\nmake static      # 用于单文件部署的静态二进制\nmake debug       # 带符号、无优化的调试构建\n```\n\n---\n\n## 使用方法\n\n```\nPicoLM — 超轻量级大语言模型推理引擎\n\n用法：picolm \u003Cmodel.gguf> [选项]\n\n生成选项：\n  -p \u003Cprompt>    输入提示 (或通过 stdin 管道输入)\n  -n \u003Cint>       最大生成词元数 (默认：256)\n  -t \u003Cfloat>     温度 (默认：0.8, 0=贪婪)\n  -k \u003Cfloat>     Top-p \u002F 核采样 (默认：0.9)\n  -s \u003Cint>       RNG 种子 (默认：42)\n  -c \u003Cint>       上下文长度覆盖\n  -j \u003Cint>       线程数 (默认：4)\n\n高级选项：\n  --json         语法约束 JSON 输出模式\n  --cache \u003Cfile> KV 缓存文件 (保存\u002F加载提示状态)\n```\n\n### 示例\n\n**基本生成：**\n```bash\n.\u002Fpicolm model.gguf -p \"Once upon a time\" -n 200\n```\n\n**贪婪解码（确定性，temperature=0）：**\n```bash\n.\u002Fpicolm model.gguf -p \"The capital of France is\" -n 20 -t 0\n# 输出：Paris. 
It is the largest city in France and...\n```\n\n**与 TinyLlama 聊天 (ChatML 格式)：**\n```bash\n.\u002Fpicolm model.gguf -n 200 -t 0.7 -p \"\u003C|user|>\nWhat is photosynthesis?\u003C\u002Fs>\n\u003C|assistant|>\n\"\n```\n\n**强制 JSON 输出（用于工具调用\u002F结构化数据）：**\n```bash\n.\u002Fpicolm model.gguf --json -t 0.3 -n 100 -p \"\u003C|user|>\nReturn the current time as JSON.\u003C\u002Fs>\n\u003C|assistant|>\n\"\n# 输出：{\"time\": \"12:00 PM\"}\n```\n\n**从 stdin 管道输入：**\n```bash\necho \"Explain quantum computing in one sentence\" | .\u002Fpicolm model.gguf -n 50\n```\n\n**KV 缓存 — 跳过重复预填充：**\n```bash\n# 第一次运行：处理提示 + 保存缓存\n.\u002Fpicolm model.gguf --cache prompt.kvc -p \"Long system prompt here...\" -n 50\n\n# 第二次运行：加载缓存，跳过提示预填充 (快 74%)\n.\u002Fpicolm model.gguf --cache prompt.kvc -p \"Long system prompt here...\" -n 50\n# 输出：\"Skipping 25 cached prompt tokens\"\n```\n\n**Pi 4 多线程（4 核）：**\n```bash\n.\u002Fpicolm model.gguf -p \"Hello\" -n 100 -j 4\n```\n\n---\n\n## 性能\n\n基于 TinyLlama 1.1B Q4_K_M（638 MB 模型）测量：\n\n| 指标 | x86-64 (8 线程) | Pi 4 (4 核，NEON) | Pi Zero 2W |\n|--------|--------------------|-----------------------|------------|\n| **预填充** | ~11 tok\u002Fs | ~6 tok\u002Fs | ~1.5 tok\u002Fs |\n| **生成** | ~13 tok\u002Fs | ~8 tok\u002Fs* | ~2 tok\u002Fs* |\n| **运行时内存** | 45 MB | 45 MB | 45 MB |\n| **首词时间** | ~2.3s | ~4s | ~16s |\n| **二进制大小** | ~80 KB | ~70 KB | ~65 KB |\n\n*\\*估计启用 NEON SIMD。实际数字取决于 SD 卡速度和热节流。*\n\n### 为何如此快速\n\n```\n 原始 C 推理          ████████████░░░░░░░░  13.5 tok\u002Fs  (基准：1.6)\n + 融合点积           ████████████████░░░░  (消除反量化缓冲区)\n + 多线程矩阵乘法     █████████████████░░░  (4-8 核并行)\n + FP16 KV 缓存       █████████████████░░░  (减半内存带宽)\n + 预计算 RoPE        ██████████████████░░  (热点循环中无 sin\u002Fcos)\n + Flash Attention    ██████████████████░░  (无 O(n) 注意力分配)\n + NEON\u002FSSE2 SIMD     ███████████████████░  (4 宽向量运算)\n + KV 缓存持久化      ████████████████████  (完全跳过预填充)\n```\n\n## 架构\n\n```\n                          ┌─────────────────────────────────┐\n                      
    │           picolm.c              │\n                          │     CLI + Generation Loop       │\n                          └──────┬──────────────┬───────────┘\n                                 │              │\n                    ┌────────────┘              └────────────┐\n                    │                                        │\n           ┌────────┴────────┐                    ┌──────────┴──────────┐\n           │    model.h\u002Fc    │                    │    sampler.h\u002Fc      │\n           │  GGUF Parser    │                    │  Temperature +      │\n           │  mmap Layer     │                    │  Top-p Sampling     │\n           │  Streaming      │                    └──────────┬──────────┘\n           │  Forward Pass   │                               │\n           │  KV Cache I\u002FO   │                    ┌──────────┴──────────┐\n           └───┬────────┬────┘                    │    grammar.h\u002Fc      │\n               │        │                         │  JSON Constraint    │\n      ┌────────┘        └───────┐                 │  Logit Masking      │\n      │                         │                 └─────────────────────┘\n┌─────┴──────┐          ┌───────┴────────┐\n│ tensor.h\u002Fc │          │ tokenizer.h\u002Fc  │\n│ matmul     │          │ BPE Encode     │\n│ rmsnorm    │          │ Decode         │\n│ softmax    │          │ Vocab Lookup   │\n│ rope       │          └────────────────┘\n│ silu       │\n│ threading  │\n└─────┬──────┘\n      │\n┌─────┴──────┐\n│  quant.h\u002Fc │\n│ Q4_K, Q6_K │\n│ Q3_K, Q2_K │\n│ FP16, F32  │\n│ NEON + SSE │\n│ Fused Dots │\n└────────────┘\n```\n\n### LLaMA 前向传播（每个 Token 发生了什么）\n\n```\nInput Token\n    │\n    ▼\n┌───────────────┐\n│ Embedding     │  Dequantize row from token_embd → x[2048]\n│ Lookup        │\n└───────┬───────┘\n        │\n        ▼\n┌───────────────┐  ×22 layers\n│ RMSNorm       │─────────────────────────────────────────┐\n│               │                                  
       │\n│ Q = xb @ Wq   │  Matrix-vector multiply (quantized)     │\n│ K = xb @ Wk   │  Store K,V in FP16 KV cache             │\n│ V = xb @ Wv   │                                         │\n│               │                                         │\n│ RoPE(Q, K)    │  Rotary position encoding (table lookup)│\n│               │                                         │\n│ Attention     │  Flash attention with online softmax    │\n│ (GQA 32→4)    │  Grouped-query: 32 Q heads, 4 KV heads  │\n│               │                                         │\n│ x += Out@Wo   │  Output projection + residual           │\n│               │                                         │\n│ RMSNorm       │                                         │\n│               │                                         │\n│ SwiGLU FFN    │  gate=SiLU(xb@Wg), up=xb@Wu             │\n│               │  x += (gate*up) @ Wd                    │\n└───────┬───────┘─────────────────────────────────────────┘\n        │\n        ▼\n┌───────────────┐\n│ Final RMSNorm │\n│ x @ W_output  │─→ logits[32000]\n└───────┬───────┘\n        │\n        ▼\n┌───────────────┐\n│ Grammar Mask  │  (if --json: force valid JSON structure)\n│ Sample Token  │  temperature → softmax → top-p → pick\n└───────────────┘\n```\n\n---\n\n## 内存预算\n\n对于上下文长度为 2048 的 TinyLlama 1.1B Q4_K_M：\n\n| 组件 | 大小 | 备注 |\n|-----------|------|-------|\n| FP16 KV Cache (键值缓存) | ~40 MB | 22 层 x 2 x 2048 x 256 x 2 字节 |\n| 分词器 | ~4.5 MB | 32K 词表字符串 + 分数 + 排序索引 |\n| 激活缓冲 | ~0.14 MB | x, xb, xb2, q, hb, hb2 |\n| Logits 缓冲 | ~0.12 MB | 32000 x 4 字节 |\n| 反量化临时区 | ~0.02 MB | Max(n_embd, n_ffn) 浮点数 |\n| 归一化权重（反量化前） | ~0.35 MB | 45 个归一化向量 x 2048 x 4 字节 |\n| RoPE 表 | ~0.03 MB | cos + sin x 2048 x 32 项 |\n| **总运行时** | **~45 MB** | |\n| | | |\n| 模型文件（磁盘上） | 638 MB | 内存映射，每次约 1 层在 RAM 中 |\n\n使用 512 上下文（针对受限设备）：\n\n| 组件 | 大小 |\n|-----------|------|\n| FP16 KV Cache (键值缓存) | ~10 MB |\n| 其他所有内容 | ~5 MB |\n| **总计** | **~15 MB** |\n\n---\n\n## 优化深度解析\n\nPicoLM 实现了 9 
项优化，将 x86 上的生成速度从 **1.6 tok\u002Fs 提升至 13.5 tok\u002Fs**，预计基于 ARM 和 NEON 的增益更大：\n\n### 1. ARM NEON SIMD (单指令多数据流)\n\n所有热点路径均采用 4 宽浮点向量操作。示例：使用 `vmovl_u8` → `vmovl_u16` → `vcvtq_f32_u32` 对 Q4_K 半字节进行反量化，以及使用交错的 `vld2q_f32` \u002F `vst2q_f32` 处理 RoPE。\n\n### 2. x86 SSE2 SIMD\n\n在 Intel\u002FAMD 上自动检测。用于点积、RMSNorm 和向量操作的 4 宽 `__m128` 操作。\n\n### 3. FP16 KV Cache (键值缓存)\n\n键值向量以 16 位浮点数而非 32 位存储。KV 缓存内存减少一半，从 ~88MB 降至 ~44MB。转换使用软件 `fp32_to_fp16()` \u002F `fp16_to_fp32()` —— 无需硬件 FP16 支持。\n\n### 4. 预计算 RoPE 表\n\n所有位置的正弦和余弦值在模型加载时计算一次。前向传播执行表查找，而不是每个 token 调用 64 次 `sinf()` \u002F `cosf()` \u002F `powf()`。\n\n### 5. Flash Attention (在线 Softmax)\n\n单次通过注意力机制，带有运行最大值重缩放。消除了 O(seq_len) 注意力分数缓冲区 —— 对于内存受限设备上的长上下文至关重要。\n\n### 6. 融合反量化 + 点积\n\n`vec_dot_q4_K_f32()` 在一个传递中完成反量化和累加。权重行没有中间浮点缓冲区。将矩阵乘法的内存流量减少约 50%。\n\n### 7. 多线程矩阵乘法\n\n`matmul()` 使用 pthreads 将输出行分发到各个线程。每个线程独立处理其块并使用融合点积。可扩展至约 8 个核心。\n\n### 8. 语法约束 JSON\n\n`--json` 模式在加载时预先分析词表中的每个 token（大括号差值、方括号差值、引号配对）。生成期间，它屏蔽 logits 以确保语法有效的 JSON —— 对于小模型的函数调用至关重要。\n\n### 9. 
KV 缓存持久化\n\n`--cache file.kvc` 在提示词处理后保存 FP16 KV 缓存状态。下次使用相同提示词运行时，它加载缓存并完全跳过预填充。**重复系统提示可降低 74% 延迟**。\n\n---\n\n## 支持的模型\n\nPicoLM 支持任何 GGUF 格式的 LLaMA 架构模型：\n\n| 模型 | 参数量 | GGUF 大小 (Q4_K_M) | 所需 RAM |\n|-------|-----------|---------------------|------------|\n| **TinyLlama 1.1B** | 1.1B | 638 MB | ~45 MB |\n| **Llama 2 7B** | 7B | 4.1 GB | ~200 MB |\n| **Phi-2** | 2.7B | 1.6 GB | ~90 MB |\n\n> **嵌入式推荐：** TinyLlama 1.1B Q4_K_M —— 可舒适地运行在 256MB+ RAM 的设备上。\n\n### 支持的量化格式\n\n`Q2_K` `Q3_K` `Q4_K` `Q4_0` `Q5_K` `Q6_K` `Q8_0` `F16` `F32`\n\n## 文件结构\n\n```\nPicoLM\u002F\n├── README.md              ← you are here\n├── BLOG.md                ← technical deep-dive blog post\n├── install.sh             ← one-liner Pi installer\n│\n├── picolm\u002F                ← the inference engine (pure C)\n│   ├── picolm.c           ← CLI entry point, generation loop (273 lines)\n│   ├── model.h\u002Fc          ← GGUF parser, mmap, forward pass (146 + 833 lines)\n│   ├── tensor.h\u002Fc         ← matmul, rmsnorm, softmax, rope (44 + 298 lines)\n│   ├── quant.h\u002Fc          ← dequantization, SIMD kernels (140 + 534 lines)\n│   ├── tokenizer.h\u002Fc      ← BPE tokenizer (32 + ~200 lines)\n│   ├── sampler.h\u002Fc        ← temperature + top-p sampling (19 + ~100 lines)\n│   ├── grammar.h\u002Fc        ← JSON grammar constraints (64 + 175 lines)\n│   ├── Makefile           ← build targets for all platforms\n│   └── build.bat          ← Windows MSVC build script\n│\n└── tinyllama-1.1b-chat-v1.0.Q4_K_M.gguf  ← model file (638 MB, not in git)\n```\n\n**总计 C 源代码：约 2,500 行。**这就是整个推理引擎——包括 GGUF 解析、内存映射（mmap）、反量化、矩阵运算、注意力机制、分词、采样和语法约束。\n\n---\n\n## 工作原理\n\n### mmap 技巧\n\n传统的推理引擎将整个模型加载到 RAM（随机存取存储器）中。PicoLM 则不然。相反：\n\n1. 模型文件被**内存映射**（Linux\u002FmacOS 上使用 `mmap`，Windows 上使用 `MapViewOfFile`）\n2. 权重指针直接指向映射的文件——无需复制\n3. 在**前向传播（forward pass）**期间，每一层的权重按顺序访问\n4. 操作系统会自动将需要的权重调入页面并淘汰旧的\n5. 
`madvise(MADV_SEQUENTIAL)` 向内核提示访问模式\n\n**结果：** 638MB 的模型可以在拥有 256MB RAM 的设备上运行。任何时刻只有约 30MB 的模型数据在物理内存中。\n\n### 量化\n\n权重以 4 位**量化**格式存储（Q4_K_M）。对于 TinyLlama：\n- **原始大小：** 11 亿参数 x 4 字节 = 4.4 GB\n- **Q4_K：** 11 亿参数 x ~0.56 字节 = 638 MB\n- **质量损失：** 极小——Q4_K 每 32 个权重的子块保留 6 位缩放比例\n\n### 分组查询注意力 (GQA)\n\nTinyLlama 使用 32 个查询头，但只有 4 个键\u002F值头。每个 KV 头由 8 个查询头共享。与完整的**多头注意力（multi-head attention）**机制相比，这使**KV 缓存（KV cache）**大小减少了 8 倍。\n\n---\n\n## 构建与测试\n\n### 前置条件\n\n| 平台 | 要求 |\n|----------|-------------|\n| **Linux\u002FPi** | `gcc`, `make`（通过 `apt install build-essential` 安装） |\n| **macOS** | Xcode 命令行工具（`xcode-select --install`） |\n| **Windows** | Visual Studio 构建工具（cl.exe） |\n\n### 验证构建\n\n```bash\n# Build\nmake native\n\n# Test with greedy decoding (deterministic output)\n.\u002Fpicolm model.gguf -p \"The capital of France is\" -n 20 -t 0\n# Expected: \"Paris. It is the largest city in France...\"\n\n# Test JSON mode\n.\u002Fpicolm model.gguf --json -p \"Return JSON with name and age\" -n 50 -t 0.3\n# Expected: valid JSON like {\"name\": \"...\", \"age\": ...}\n\n# Test KV cache\n.\u002Fpicolm model.gguf --cache test.kvc -p \"Hello\" -n 10 -t 0\n.\u002Fpicolm model.gguf --cache test.kvc -p \"Hello\" -n 10 -t 0\n# Second run should say \"Skipping N cached prompt tokens\"\n```\n\n### 内存验证\n\nPicoLM 将内存统计信息打印到标准错误输出（stderr）：\n\n```\nMemory: 1.17 MB runtime state (FP16 KV cache separate)\n```\n\n总计 = 运行时状态 + FP16 KV 缓存。对于上下文长度为 2048 的 TinyLlama：约 45 MB。\n\n---\n\n## 常见问题\n\n**问：能运行 Llama 2 7B 吗？**\n答：可以，前提是你有足够的 RAM 用于 KV 缓存（4096 上下文下 7B 模型约需 1.4 GB）。模型文件通过 mmap 保留在磁盘上。在配备 4GB RAM 的 Pi 4 上可以运行，但速度较慢（约 1-2 tok\u002Fs）。\n\n**问：为什么不使用 llama.cpp？**\n答：llama.cpp 很优秀，但在小模型上运行时仍需约 200MB+ 内存，构建依赖复杂，且主要针对桌面\u002F服务器用例。PicoLM 专为嵌入式设计：仅需 45MB RAM，80KB 二进制文件，零依赖。\n\n**问：输出质量好吗？**\n答：TinyLlama 1.1B 是一个小模型——它能很好地处理简单任务（问答、摘要、基本推理、JSON 生成）。它无法与 GPT-4 相媲美，但能在没有网络的 10 美元开发板上运行。对于结构化输出，`--json` 语法模式可确保生成有效的 JSON，无论模型质量如何。\n\n**问：关于 GPU 加速呢？**\n答：PicoLM 设计上仅支持 CPU。目标硬件（10-15 美元的开发板）没有 
GPU。在 x86\u002FARM CPU 上，**SIMD（单指令多数据流）**（NEON\u002FSSE2）可提供显著的速度提升。\n\n**问：我可以使用不同的模型吗？**\n答：任何 LLaMA 架构的 GGUF 模型均可。从 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fmodels?search=gguf) 下载并将 PicoLM 指向该模型。推荐的量化方式：Q4_K_M（最佳质量\u002F大小平衡）或 Q2_K（最小，质量较低）。\n\n---\n\n## 路线图\n\n- [ ] AVX2\u002FAVX-512 内核（现代 CPU 上生成速度提升 2-4 倍）\n- [ ] 使用草稿模型的投机解码\n- [ ] 上下文滑动窗口（超出 max_seq_len 的无限生成）\n- [ ] 权重剪枝以进一步减少内存\n- [ ] 服务器模式的连续批处理\n- [ ] 支持 Mistral \u002F Phi 架构\n\n---\n\n## 技术博客\n\n有关优化历程的详细文章（包含代码片段和实战经验），请查看 [**BLOG.md**](BLOG.md)。\n\n---\n\n## 许可证\n\nMIT 许可证。详情见 [LICENSE](LICENSE)。\n\n---\n\n\u003Cp align=\"center\">\n  \u003Cstrong>PicoLM\u003C\u002Fstrong> —— 因为智能不应需要数据中心。\n\u003C\u002Fp>","# PicoLM 快速上手指南\n\nPicoLM 是一个超轻量级的本地大语言模型（LLM）推理引擎，采用纯 C 语言编写，零外部依赖。它运行 10 亿参数模型时仅占用约 45MB 运行时内存，支持离线运行，无需联网或 API 密钥。\n\n## 环境准备\n\n*   **操作系统**: Linux, macOS, Windows (支持 MSVC)\n*   **硬件架构**: x86-64, ARM (Raspberry Pi 3\u002F4\u002F5\u002FZero), RISC-V\n*   **内存要求**: 运行时约 45MB (模型文件通过 mmap 驻留磁盘)\n*   **构建依赖**: GCC, Make (若从源码编译)\n*   **网络**: 首次下载模型需要网络连接，后续推理完全离线\n\n## 安装步骤\n\n### 方法一：一键安装脚本（推荐）\n\n此脚本会自动检测平台、安装依赖、编译二进制文件并下载默认模型。\n\n```bash\ncurl -sSL https:\u002F\u002Fraw.githubusercontent.com\u002FRightNow-AI\u002Fpicolm\u002Fmain\u002Finstall.sh | bash\n```\n\n### 方法二：从源码编译\n\n适用于需要自定义编译选项的场景。\n\n```bash\n# 1. 克隆仓库\ngit clone https:\u002F\u002Fgithub.com\u002Frightnow-ai\u002Fpicolm.git\ncd picolm\u002Fpicolm\n\n# 2. 自动检测 CPU 并编译 (启用 SSE2\u002FAVX 或 NEON 加速)\nmake native\n\n# 3. 下载模型文件 (TinyLlama 1.1B Q4_K_M, 约 638 MB)\nmake model\n\n# 4. 运行测试\n.\u002Fpicolm \u002Fopt\u002Fpicolm\u002Fmodels\u002Ftinyllama-1.1b-chat-v1.0.Q4_K_M.gguf \\\n    -p \"The meaning of life is\" -n 100\n```\n\n## 基本使用\n\nPicoLM 支持通过命令行参数或管道输入进行交互。\n\n### 1. 基础生成\n直接指定提示词和最大生成长度。\n\n```bash\n.\u002Fpicolm model.gguf -p \"Once upon a time\" -n 200\n```\n\n### 2. 管道输入\n适合与其他命令配合使用。\n\n```bash\necho \"Explain gravity\" | .\u002Fpicolm model.gguf -n 100 -j 4\n```\n\n### 3. 
聊天模式 (ChatML)\n使用特定的模板格式进行对话。\n\n```bash\n.\u002Fpicolm model.gguf -n 200 -t 0.7 -p \"\u003C|user|>\nWhat is photosynthesis?\u003C\u002Fs>\n\u003C|assistant|>\n\"\n```\n\n### 4. 强制 JSON 输出\n用于工具调用或结构化数据提取，确保输出符合语法。\n\n```bash\n.\u002Fpicolm model.gguf --json -t 0.3 -n 100 -p \"\u003C|user|>\nReturn the current time as JSON.\u003C\u002Fs>\n\u003C|assistant|>\n\"\n```\n\n### 常用参数说明\n\n| 参数 | 说明 | 默认值 |\n| :--- | :--- | :--- |\n| `-p \u003Cprompt>` | 输入提示词 (也可通过 stdin 传入) | - |\n| `-n \u003Cint>` | 最大生成 Token 数 | 256 |\n| `-t \u003Cfloat>` | 温度系数 (0 为贪婪解码) | 0.8 |\n| `-k \u003Cfloat>` | Top-p \u002F Nucleus 采样率 | 0.9 |\n| `-j \u003Cint>` | 使用的线程数 | 4 |\n| `--json` | 开启语法约束的 JSON 输出模式 | - |","一位物联网工程师在信号覆盖极差的矿区部署智能监测终端，需要设备能独立处理异常报警数据而不依赖外部网络。\n\n### 没有 picolm 时\n- 必须依赖云端 API 接口，一旦断网整个系统就瘫痪失效，无法响应现场指令。\n- 长期运行需持续支付 Token 费用，运维成本随数据量线性增长，预算难以控制。\n- 原始传感器数据需上传至第三方服务器，存在隐私泄露隐患，不符合合规要求。\n- 传统大模型需要高性能 GPU 支持，普通嵌入式板卡无法承载，导致硬件选型受限。\n\n### 使用 picolm 后\n- 直接编译为单一二进制文件，在 $10 开发板上即可离线运行，彻底摆脱网络依赖。\n- 硬件一次性投入，后续推理零成本，彻底消除月度账单，大幅降低总拥有成本。\n- 所有数据处理均在本地完成，确保敏感信息绝不流出设备，满足严格的数据安全标准。\n- 仅需 45MB 内存占用，完美适配 256MB RAM 的资源受限环境，让低端硬件也能跑大模型。\n\npicolm 通过极致轻量化架构，让低成本硬件也能拥有安全可靠的本地智能决策能力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FRightNow-AI_picolm_3b61618c.jpg","RightNow-AI","RightNow","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FRightNow-AI_841b6441.png","GPU AI Code Editor",null,"jaber@rightnowai.co","rightnowai_co","https:\u002F\u002Fwww.rightnowai.co\u002F","https:\u002F\u002Fgithub.com\u002FRightNow-AI",[85,89,93,97],{"name":86,"color":87,"percentage":88},"C","#555555",89.7,{"name":90,"color":91,"percentage":92},"Shell","#89e051",8,{"name":94,"color":95,"percentage":96},"Makefile","#427819",2.1,{"name":98,"color":99,"percentage":100},"Batchfile","#C1F12E",0.3,1469,178,"2026-04-05T18:05:59","MIT","Linux, Windows, macOS","无需 GPU (仅 CPU)","运行时约 45MB，建议硬件至少 256MB",{"notes":109,"python":110,"dependencies":111},"纯 C11 编写，零外部依赖。支持 GGUF 格式模型（如 TinyLlama）。完全离线运行，无需网络。支持 
ARM、x86-64、RISC-V 架构。首次运行需下载约 638MB 模型文件。","无需 Python",[112,113,114],"libc","libm","libpthread",[15,13,26],[117,118,119,120,121,122,123,124,125],"arm","embedded","inference","llm","openclaw","picoclaw","quantization","raspberry-pi","risc-v","2026-03-27T02:49:30.150509","2026-04-06T08:16:04.570767",[129,134,138,143],{"id":130,"question_zh":131,"answer_zh":132,"source_url":133},3373,"运行 picolm 时遇到 Segmentation Fault（段错误）如何解决？","尝试使用 install.sh 脚本重新安装，而不是仅使用 make 命令。有用户反馈使用 install.sh 后问题已解决。","https:\u002F\u002Fgithub.com\u002FRightNow-AI\u002Fpicolm\u002Fissues\u002F3",{"id":135,"question_zh":136,"answer_zh":137,"source_url":133},3374,"如果安装后仍然出现段错误，如何进行调试？","维护者建议运行 make debug 命令，并提供生成的日志以定位具体的失败点。",{"id":139,"question_zh":140,"answer_zh":141,"source_url":142},3375,"README 中的安装脚本链接无法访问（404），正确的地址是什么？","正确的安装脚本 URL 为 https:\u002F\u002Fraw.githubusercontent.com\u002FRightNow-AI\u002Fpicolm\u002Fmain\u002Finstall.sh。","https:\u002F\u002Fgithub.com\u002FRightNow-AI\u002Fpicolm\u002Fissues\u002F1",{"id":144,"question_zh":145,"answer_zh":146,"source_url":147},3376,"使用 picoclaw agent 命令没有输出响应怎么办？","目前该问题暂无明确修复方案。根据测试，直接使用 picolm 命令行工具运行模型（如 .\u002Fbuild\u002Fpicolm \u003Cmodel> -p \"prompt\"）可以正常生成内容。","https:\u002F\u002Fgithub.com\u002FRightNow-AI\u002Fpicolm\u002Fissues\u002F15",[]]