[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-AlmondGod--tinyworlds":3,"tool-AlmondGod--tinyworlds":65},[4,23,32,40,49,57],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":22},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,2,"2026-04-05T10:45:23",[13,14,15,16,17,18,19,20,21],"图像","数据工具","视频","插件","Agent","其他","语言模型","开发框架","音频","ready",{"id":24,"name":25,"github_repo":26,"description_zh":27,"stars":28,"difficulty_score":29,"last_commit_at":30,"category_tags":31,"status":22},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[17,13,20,19,18],{"id":33,"name":34,"github_repo":35,"description_zh":36,"stars":37,"difficulty_score":29,"last_commit_at":38,"category_tags":39,"status":22},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 
适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[19,13,20,18],{"id":41,"name":42,"github_repo":43,"description_zh":44,"stars":45,"difficulty_score":46,"last_commit_at":47,"category_tags":48,"status":22},3215,"awesome-machine-learning","josephmisiti\u002Fawesome-machine-learning","awesome-machine-learning 是一份精心整理的机器学习资源清单，汇集了全球优秀的机器学习框架、库和软件工具。面对机器学习领域技术迭代快、资源分散且难以甄选的痛点，这份清单按编程语言（如 Python、C++、Go 等）和应用场景（如计算机视觉、自然语言处理、深度学习等）进行了系统化分类，帮助使用者快速定位高质量项目。\n\n它特别适合开发者、数据科学家及研究人员使用。无论是初学者寻找入门库，还是资深工程师对比不同语言的技术选型，都能从中获得极具价值的参考。此外，清单还延伸提供了免费书籍、在线课程、行业会议、技术博客及线下聚会等丰富资源，构建了从学习到实践的全链路支持体系。\n\n其独特亮点在于严格的维护标准：明确标记已停止维护或长期未更新的项目，确保推荐内容的时效性与可靠性。作为机器学习领域的“导航图”，awesome-machine-learning 以开源协作的方式持续更新，旨在降低技术探索门槛，让每一位从业者都能高效地站在巨人的肩膀上创新。",72149,1,"2026-04-03T21:50:24",[20,18],{"id":50,"name":51,"github_repo":52,"description_zh":53,"stars":54,"difficulty_score":46,"last_commit_at":55,"category_tags":56,"status":22},2234,"scikit-learn","scikit-learn\u002Fscikit-learn","scikit-learn 是一个基于 Python 构建的开源机器学习库，依托于 SciPy、NumPy 等科学计算生态，旨在让机器学习变得简单高效。它提供了一套统一且简洁的接口，涵盖了从数据预处理、特征工程到模型训练、评估及选择的全流程工具，内置了包括线性回归、支持向量机、随机森林、聚类等在内的丰富经典算法。\n\n对于希望快速验证想法或构建原型的数据科学家、研究人员以及 Python 开发者而言，scikit-learn 是不可或缺的基础设施。它有效解决了机器学习入门门槛高、算法实现复杂以及不同模型间调用方式不统一的痛点，让用户无需重复造轮子，只需几行代码即可调用成熟的算法解决分类、回归、聚类等实际问题。\n\n其核心技术亮点在于高度一致的 API 设计风格，所有估算器（Estimator）均遵循相同的调用逻辑，极大地降低了学习成本并提升了代码的可读性与可维护性。此外，它还提供了强大的模型选择与评估工具，如交叉验证和网格搜索，帮助用户系统地优化模型性能。作为一个由全球志愿者共同维护的成熟项目，scikit-learn 以其稳定性、详尽的文档和活跃的社区支持，成为连接理论学习与工业级应用的最",65628,"2026-04-05T10:10:46",[20,18,14],{"id":58,"name":59,"github_repo":60,"description_zh":61,"stars":62,"difficulty_score":10,"last_commit_at":63,"category_tags":64,"status":22},3364,"keras","keras-team\u002Fkeras","Keras 是一个专为人类设计的深度学习框架，旨在让构建和训练神经网络变得简单直观。它解决了开发者在不同深度学习后端之间切换困难、模型开发效率低以及难以兼顾调试便捷性与运行性能的痛点。\n\n无论是刚入门的学生、专注算法的研究人员，还是需要快速落地产品的工程师，都能通过 Keras 
轻松上手。它支持计算机视觉、自然语言处理、音频分析及时间序列预测等多种任务。\n\nKeras 3 的核心亮点在于其独特的“多后端”架构。用户只需编写一套代码，即可灵活选择 TensorFlow、JAX、PyTorch 或 OpenVINO 作为底层运行引擎。这一特性不仅保留了 Keras 一贯的高层易用性，还允许开发者根据需求自由选择：利用 JAX 或 PyTorch 的即时执行模式进行高效调试，或切换至速度最快的后端以获得最高 350% 的性能提升。此外，Keras 具备强大的扩展能力，能无缝从本地笔记本电脑扩展至大规模 GPU 或 TPU 集群，是连接原型开发与生产部署的理想桥梁。",63927,"2026-04-04T15:24:37",[20,14,18],{"id":66,"github_repo":67,"name":68,"description_en":69,"description_zh":70,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":77,"owner_avatar_url":78,"owner_bio":79,"owner_company":80,"owner_location":81,"owner_email":80,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":90,"forks":91,"last_commit_at":92,"license":93,"difficulty_score":29,"env_os":94,"env_gpu":94,"env_ram":94,"env_deps":95,"category_tags":103,"github_topics":80,"view_count":104,"oss_zip_url":80,"oss_zip_packed_at":80,"status":22,"created_at":105,"updated_at":106,"faqs":107,"releases":137},663,"AlmondGod\u002Ftinyworlds","tinyworlds","A minimal implementation of DeepMind's Genie world model","TinyWorlds 是 Google DeepMind Genie 架构的极简开源实现，旨在帮助开发者理解如何构建可扩展的自回归世界模型。简单来说，它能让 AI 像人类一样，通过压缩环境规律来预测视频的下一帧画面。\n\n传统模型往往需要大量带有动作标注的数据才能训练，而 TinyWorlds 解决了这一瓶颈。它能在没有先验动作信息的情况下，自动推断帧与帧之间的潜在动作，从而实现更高效的视频生成与物理世界模拟。\n\n这个项目特别适合对 AI 视频生成、强化学习及底层架构感兴趣的研究人员和开发者。TinyWorlds 的独特之处在于它将连续的图像预测转化为离散令牌的选择问题，结合了时空 Transformer、变分自编码器等组件，使得利用大语言模型（LLM）的技术来优化动态预测成为可能。如果你想探索智能体如何“理解”并“生成”世界，TinyWorlds 提供了一个清晰且可实践的代码入口。","\u003Cdiv align=\"center\">\n\u003Cpicture>\n  \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"\u002Fassets\u002Ftinyworldslight.png\">\n  \u003Cimg alt=\"tiny corp logo\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_086a5aa10aef.png\" width=\"80%\" height=\"80%\">\n\u003C\u002Fpicture>\n\u003C\u002Fdiv>\n\nTinyWorlds is a minimal autoregressive world model built on Google Deepmind's [Genie 
Architecture](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.15391).\n\nWorld models can't use action-less internet video to scale like [VEO3](https:\u002F\u002Fdeepmind.google\u002Fmodels\u002Fveo\u002F). Deepmind's [Genie](https:\u002F\u002Fdeepmind.google\u002Fdiscover\u002Fblog\u002Fgenie-3-a-new-frontier-for-world-models\u002F) solves this by inferring the actions between frames using **no prior action data**.\n\nTinyWorlds is meant to help people understand the clever autoregressive, unsupervised method Deepmind likely used to achieve **scalable world models**.\n\n## Table of Contents\n\n- [Getting Started](#getting-started)\n- [Overview](#architecture-overview)\n- [Building Blocks](#architecture-building-blocks)\n   - [Space-Time Transformer](#space-time-transformer-stt)\n   - [Variational Autoencoder](#vaes)\n   - [Finite Scalar Quantization](#finite-scalar-quantization)\n- [Architecture](#architecture)\n   - [Video Tokenizer](#video-tokenizer)\n   - [Action Tokenizer](#action-tokenizer)\n   - [Dynamics Model](#dynamics-model)\n   - [TinyWorlds Inference](#full-tinyworlds-inference)\n   - [Data](#data)\n   - [Training\u002FInference Acceleration](#traininginference-acceleration)\n   - [Shape Annotation Key](#shape-annotation-key)\n- [Contributing](#contributing)\n\n# Getting Started\n\n```bash\n# Installation\ngit clone https:\u002F\u002Fgithub.com\u002FAlmondGod\u002Ftinyworlds.git\ncd tinyworlds\npip install -r requirements.txt\nexport WANDB_API_KEY=\u003CYOUR_WANDB_API_KEY>\nexport PYTHONPATH=\"\u002Fworkspace\u002Ftinyworlds:$PYTHONPATH\"\n\n# Training\n# 1. download data from huggingface\npython scripts\u002Fdownload_assets.py datasets --pattern \"zelda_frames.h5\"\n# 2. run training\npython scripts\u002Ffull_train.py --config configs\u002Ftraining.yaml -- --dataset=ZELDA\n\n# Inference\n# 1. pull pretrained sonic checkpoints from huggingface\npython scripts\u002Fdownload_assets.py models --suite-name sonic\n# 2. 
run inference\npython scripts\u002Frun_inference.py --config configs\u002Finference.yaml -- use_latest_checkpoints=true dataset=SONIC\n```\n\n# Overview\n\n### Why World Models?\n\n*To shape the world, generate it*\n\nA [world model](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.10122) is a function mapping the current state of an environment to the next state of an environment.\n\nTo predict the next environment state accurately, this function must compress all information in the world into a set of laws. \n\nSo the world model **captures all the inherent structure and emergent phenomena of the world.** \n\nAll of deep learning, and all of intelligence, is [trying to compress the universe into a model](https:\u002F\u002Farxiv.org\u002Fpdf\u002F0812.4360). A model that can predict important aspects of the next state of the universe, by learning heuristics about how it operates.\n\nOur universe can also be thought of as a world model. It is a map from state to state executing every moment by following a set of laws. Humans experience the many layers of emergent behavior over these foundational laws.\n\nAs of 2025, video-based world models have been practically applied as:\n\n1. cortexes to give physical world understanding to robots\n2. simulators for models to interact with physics fully-online\n3. experiences with new structures of reality for humans to interact with\n\nBut humans are only at the very beginning of modeling our own worlds.\n\nTinyWorlds is built to help you to understand world modeling better, and to [learn by contributing](#contributing). \n\n### Architecture Overview\n![tinyworldsarch](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_fa434a9b2e77.png)\n\nTinyWorlds is an autoregressive transformer over discrete tokens, so we can also use SOTA LLM techniques to improve our world model. \n\nWhy discrete tokens? 
Discretization makes our dynamics prediction problem much easier: instead of predicting an image in a near-infinite continuous space, the model need only select one of the ~1000 tokens in our vocabulary (aka codebook).\n\nTinyWorlds consists of three modules:\n\n**Video Tokenizer:** This tokenizer reconstructs a video sequence through a small discrete bottleneck (our video tokens) in the middle. **This layer compresses the important information from video to tokens.**\n\n**Action Tokenizer:** This tokenizer **infers the discrete action token between two frames**. It trains by reconstructing the next frame from the previous frame and a discrete action token that was inferred with access to the next frame.\n\n**Dynamics Model:** Given past action and frame tokens, this predicts our next frame tokens. **This should capture the physics of our tiny video game worlds.**\n\n# Building Blocks\n\n### Space-Time Transformer\n![stt](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_49be20799b77.png)\n\n[Space-Time Transformer](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2001.02908) (STT) is a transformer for video. Each STT block contains a spatial attention layer, a temporal attention layer, and a FeedForward Network (FFN). For a refresher on self-attention, see Karpathy's [GPT From Scratch Video](https:\u002F\u002Fyoutu.be\u002FkCc8FmEb1nY?si=tvfcBnGHBbEiS70v&t=3748).\n\nIn the spatial layer, each token attends to all other tokens in the same frame. In the temporal layer, each token attends to tokens in the same position at previous timesteps.\n\nThe FFN is a multi-layer perceptron applied to each embedding vector. Inspired by divine benevolence, I used [SwiGLU](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2002.05202) for the FFN. 
SwiGLU adds a Gated Linear Unit (GLU) to [Swish](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSwish_function), and is computed as \n\n$y = W_3[\\mathrm{Swish}(W_1x + b_1) \\odot (W_2x + b_2)] + b_3$, where $\\mathrm{Swish}(u) = u \\cdot \\sigma(u)$ (see SwiGLU diagram for clarity)\n\n\nFor regular STT, I used [Root Mean Squared Normalization (RMSNorm)](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.modules.normalization.RMSNorm.html) as the normalizer, which is less sensitive to extreme outliers than zero-mean, unit-variance normalization (LayerNorm). In RMSNorm, we divide the input by \n\n$\\sqrt{\\epsilon + \\frac{1}{n}\\sum_{i} x_i^2}$, where $n$ is the embedding dimension. \n\nFor STT conditioned on actions, I used [Feature-wise Linear Modulation (FiLM)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1709.07871). FiLM passes the action for each timestep through an FFN to transform each action latent into Gamma ($\\gamma$) and Beta ($\\beta$) vectors. Our norm is then \n\n$(x - \\mu) \u002F \\sigma * (1 + \\gamma) + \\beta$\n\n### Variational Autoencoder\n\n*\\*VAEs are complex, but below is an overview with many details omitted*\n\n[Variational Autoencoders](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1906.02691) (VAEs) are defined by:\n1. An encoder network to parameterize the approximate posterior distribution $q(z | x)$ of latent variables $z$ given data $x$\n2. A decoder network to parameterize the likelihood $p(x | z)$ over input data $x$ given latent $z$\n\nVAEs maximize $\\log p(x | z)$, the likelihood that the decoder reconstructs the input $x$ given the latent $z$ from the encoder. 
\n\nThe important takeaway is that $z$ is low dimensional, so to allow reconstruction it must compress all the important information from $x$.\n\n### Finite Scalar Quantization \n\n![fsq](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_4c2ca92bc3a3.png)\n\nSince we want a set of discrete tokens, we quantize continuous $z$ to one of a finite set of possible $z$.\n\nIf vectors are points in high dimensional space, [Finite Scalar Quantization](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.15505) (FSQ) is a quantization method that divides space into hypercubes, and the hypercube a vector falls into becomes its quantized representation.\n\nConcretely, we quantize a vector in FSQ by:\n\n1. tanh(x) which bounds to [-1,1]\n2. scale\u002Fshift to [0, L-1]\n3. round to the nearest integer (quantization step)\n4. scale\u002Fshift back to [-1,1]\n\nThe token vocabulary has size ${L^{D}}$ where $L$ is the number of bins per dimension and $D$ is the dimensionality of the hypercube. With 3 dimensions and 2 levels per dimension, we'd have 8 regions in the cube and a token vocabulary of size 8. \n\nFSQ VAEs let us learn structured hypercubes to use as our token vocabularies that encode information about the input. In our context, maybe one of these hypercubes represents moving left, another jumping, another crouching, et cetera.\n\nTo allow gradients to flow to the encoder (since quantization is non-differentiable), we pass the post-quantization gradients directly to the pre-quantization layer. \n\nPrecisely, the decoder takes as input $z + stopgrad(z_q - z)$ where stopgrad is, in PyTorch, `.detach()`. In the forward pass the decoder sees only $z_q$ (since $z + z_q - z = z_q$), but in the backward pass the gradient flows only through $z$.\n\n# Architecture\n\n### Video Tokenizer\n\n![videotokenizer](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_210c9c5e5468.png)\n\nThe video tokenizer is an FSQ VAE that compresses videos into discrete tokens. 
It reduces the dimensionality of dynamics while enabling high quality video generation.\n\nIt converts patches to embeddings with pixel-mixing 2D Convolutions.\n\nIt then uses an STTransformer over the embeddings to produce quantized tokens. \n\nEach video token contains information about both its own patch and other patches in the same timestep (via spatial attention) or at the same location in previous timesteps (via temporal attention). \n\nFinally, it decodes the video tokens into a reconstructed image.\n\n\n### Action Tokenizer\n\n![actiontokenizer](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_78fb59d54df6.png)\n\nThe Action Tokenizer is also an FSQ VAE, and is the key to scalability. It allows us to train without action labels by learning to infer actions between two frames. We then condition the dynamics on these actions.\n\nThe encoder takes in a sequence of frames and outputs action tokens between the frames.\n\nThe decoder takes in all previous frames $(x_1 \\dots x_{t-1})$ and quantized action latent vectors $(a_1 \\dots a_{t-1})$ as input and predicts the next frames $(x_2 \\dots x_t)$.\n\nAction tokens should learn to encode the most meaningful change between the past and current frame, which should correspond to some high-level action.\n\nIn practice, the action decoder tries to ignore actions and infer purely from images. To counteract this, \n1. we mask most frames except the first, so the decoder must learn to use the string of actions as signal for reconstruction\n2. 
we encourage batch-wise variance in the encoder through an auxiliary loss\n\nAt inference time, we map each key to one of the action tokens, which then conditions the dynamics so the user can influence video generation.\n\n### Dynamics Model\n![dynamicsmodel](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_831b90b6bee8.png)\n\nAt timestep $t$, the dynamics model should take in video tokens $z_{1..t - 1}$ and action tokens $a_{1..t - 1}$ and predict next frame tokens $z_{t}$.\n\nIn practice, we train dynamics like [MaskGIT](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04200) and [BERT](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.04805).\n\nWe mask a subset of tokens and train our model to predict the masked tokens, conditioned on all current and previous frame and action tokens.\n\nTo infer dynamics at a given step, we first append a fully masked frame to our context sequence. Then, for T steps we:\n1. Predict logits at each masked position\n2. Compute token probabilities with softmax\n3. Sample the k most likely tokens out of the still-masked positions\n4. Place them into the context tensor, removing corresponding mask tokens\n5. Repeat\n\nI chose an exponential schedule for k (first step samples ~1 token, then ~2, then ~5, then ~20, then ~50, etc)\n\n### TinyWorlds Inference\n\nGiven initial context frames from the training distribution, we first tokenize them.\n\nWe then run the following loop:\n1. The player specifies one of the n_actions action tokens to use by choosing an integer in $[0, |A| - 1]$\n2. Condition the dynamics model with context window c on the video tokens t-c...t and action tokens t-c...t and run dynamics inference \n3. 
Detokenize the predicted video tokens into a new video frame for the user\n\nWe repeat this process autoregressively over the time dimension: actions are passed to the model, tokens are predicted by the dynamics model, and we detokenize them into frames to display to the user.\n\nThis process also lets us predict multiple future frames at once (bounded by memory and the training distribution), which can improve inference quality.\n\n### Data\n\n![datasets](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_425c3e1226bc.png)\n\nThe data is processed and downsampled from gameplay `.mp4` videos into `.h5` files. \nYou can download existing datasets from [Huggingface TinyWorlds Datasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAlmondGod\u002Ftinyworlds\u002Ftree\u002Fmain) with the datasets command in [getting started](#getting-started). \n\nAvailable are:\n1. **PicoDoom** (`picodoom_frames.h5`), a minimal version of Doom\n2. **Pong** (`pong_frames.h5`), the classic\n3. **Zelda Ocarina of Time** (`zelda_frames.h5`), one of the original 2D Zelda games\n4. **Pole Position** (`pole_position_frames.h5`), a pixel racing game\n5. **Sonic** (`sonic_frames.h5`), the original game\n\nTo create a new dataset, create a new dataclass in [datasets.py](datasets\u002Fdatasets.py) and specify the mp4 path. PR or dm me to upload your dataset to the HF repo so others can use it :)\n\n### Training\u002FInference Acceleration\n\nTinyWorlds supports the following torch features to accelerate training and\u002For inference:\n1. **Torch compile**, which allows us to use faster CUDA kernels for certain pre-optimized operations like attention and matmuls\n2. **Distributed data parallel (DDP)**, which allows us to train on multiple GPUs by replicating the same model and feeding different data to each GPU\n3. **Automatic mixed precision (AMP)**, which casts certain ops from FP32 to BF16 based on the floating-point range each op needs\n4. 
**TF32 training**, which lets us use NVIDIA TensorFloat32 for tensor-core-optimized matmuls and convolutions\n\n### Shape Annotation Key\n\nAll tensors are shape-annotated and use einops tensor manipulation operations with the following abbreviations:\n\n**B:** batch size \\\n**T:** time\u002Fsequence dimension (number of frames) \\\n**P:** number of patch tokens per frame \\\n**E:** embedding dim (d_model) \\\n**L:** Video Tokenizer latent dim \\\n**A:** Action Tokenizer latent dim (action dim) \\\n**D:** number of bins for each video tokenizer dim \\\n**L^D:** Size of the video tokenizer vocabulary \\\n**C:** image channels \\\n**H:** pixel-grid height \\\n**W:** pixel-grid width \\\n**Hp:** patch-grid height \\\n**Wp:** patch-grid width \\\n**S:** patch size\n\n# Contributing\n\nThere are still many TODOs which may offer significant performance gains...\n\n- [ ] Implement Mixture of Experts in the Feedforward Network\n- [ ] Try `RoPE`\u002F`AliBi` Position Embeddings\n- [ ] Try different optimizers (`Muon`, `SOAP`)\n- [ ] Add more datasets (Terraria, Street Fighter, \\\u003Cyour favorite retro videogame\\>) \n- [ ] Try [AdaLN-Zero](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09748) instead of `FiLM` (adds a pre-scale parameter)\n- [ ] Add new schedulers for MaskGIT like cosine and [Halton](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002FHalton-MaskGIT)\n- [ ] Replace `mean pool + concat` in the action tokenizer with `length-2 windowed attention + mean`\n- [ ] Accelerate dynamics training by producing, saving, and loading pre-processed image patch embeddings instead of full frames \n- [x] Scale: train on more GPUs and scale to multibillions of params by adding `FSDP` Support — made by [alekseymalakhov11](https:\u002F\u002Fgithub.com\u002Falekseymalakhov11)\n\n\n\n### *Miscellanea*\n\nTinyWorlds (excluding datasets and external assets) is licensed under the MIT [LICENSE](LICENSE). 
TinyWorlds is an independent research project and is not affiliated with, endorsed by, or sponsored by DeepMind or Google.\n\n*aesthetic inspired by [Tinygrad](https:\u002F\u002Ftinygrad.org\u002F) and [Tinygpu](https:\u002F\u002Fgithub.com\u002Fadam-maj\u002Ftiny-gpu)*","\u003Cdiv align=\"center\">\n\u003Cpicture>\n  \u003Csource media=\"(prefers-color-scheme: light)\" srcset=\"\u002Fassets\u002Ftinyworldslight.png\">\n  \u003Cimg alt=\"tiny corp logo\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_086a5aa10aef.png\" width=\"80%\" height=\"80%\">\n\u003C\u002Fpicture>\n\u003C\u002Fdiv>\n\nTinyWorlds 是一个基于 Google Deepmind 的 [Genie 架构](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.15391) 构建的最小化自回归 (autoregressive) 世界模型 (World Models)。\n\n世界模型无法像 [VEO3](https:\u002F\u002Fdeepmind.google\u002Fmodels\u002Fveo\u002F) 那样利用无动作的互联网视频进行扩展。Deepmind 的 [Genie](https:\u002F\u002Fdeepmind.google\u002Fdiscover\u002Fblog\u002Fgenie-3-a-new-frontier-for-world-models\u002F) 通过**不使用任何先验动作数据**来推断帧之间的动作，从而解决了这个问题。\n\nTinyWorlds 旨在帮助人们理解 Deepmind 可能用于实现**可扩展世界模型**的那种巧妙的自回归、无监督方法。\n\n## 目录\n\n- [入门](#入门)\n- [概述](#概述)\n- [构建模块](#构建模块)\n   - [时空 Transformer](#时空-transformer-stt)\n   - [变分自编码器](#vaes)\n   - [有限标量量化](#finite-scalar-quantization)\n- [架构](#架构)\n   - [视频 Tokenizer](#video-tokenizer)\n   - [动作 Tokenizer](#action-tokenizer)\n   - [动力学模型](#dynamics-model)\n   - [TinyWorlds 推理](#full-tinyworlds-inference)\n   - [数据](#data)\n   - [训练\u002F推理加速](#traininginference-acceleration)\n   - [形状标注键](#shape-annotation-key)\n- [贡献](#贡献)\n\n# 入门\n\n```bash\n# Installation\ngit clone https:\u002F\u002Fgithub.com\u002FAlmondGod\u002Ftinyworlds.git\ncd tinyworlds\npip install -r requirements.txt\nexport WANDB_API_KEY=\u003CYOUR_WANDB_API_KEY>\nexport PYTHONPATH=\"\u002Fworkspace\u002Ftinyworlds:$PYTHONPATH\"\n\n# Training\n# 1. download data from huggingface\npython scripts\u002Fdownload_assets.py datasets --pattern \"zelda_frames.h5\"\n# 2. 
run training\npython scripts\u002Ffull_train.py --config configs\u002Ftraining.yaml -- --dataset=ZELDA\n\n# Inference\n# 1. pull pretrained sonic checkpoints from huggingface\npython scripts\u002Fdownload_assets.py models --suite-name sonic\n# 2. run inference\npython scripts\u002Frun_inference.py --config configs\u002Finference.yaml -- use_latest_checkpoints=true dataset=SONIC\n```\n\n# 概述\n\n### 为什么需要世界模型？\n\n*塑造世界，生成它*\n\n[世界模型 (World Models)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1803.10122) 是一个将环境当前状态映射到环境下一状态的函数。\n\n为了准确预测环境的下一个状态，该函数必须将世界中所有信息压缩为一组定律。 \n\n因此，世界模型**捕捉了世界的所有固有结构和涌现现象。** \n\n所有深度学习，以及所有智能，都在 [试图将宇宙压缩为一个模型](https:\u002F\u002Farxiv.org\u002Fpdf\u002F0812.4360)。一个通过学习关于宇宙如何运行的启发式规则来预测宇宙下一状态重要方面的模型。\n\n我们的宇宙也可以被视为一个世界模型。它是一个遵循一组定律在每一刻执行的状态到状态的映射。人类在这些基础定律之上体验着许多层的涌现行为。\n\n截至 2025 年，基于视频的世界模型已实际应用为：\n\n1. 赋予机器人物理世界理解的皮层\n2. 供模型完全在线与物理交互的模拟器\n3. 供人类交互的新现实结构的体验\n\n但人类才刚刚处于为我们自己的世界建模的起步阶段。\n\nTinyWorlds 旨在帮助你更好地理解世界建模，并[通过贡献学习](#贡献)。 \n\n### 架构概述\n![tinyworldsarch](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_fa434a9b2e77.png)\n\nTinyWorlds 是一个基于离散 token 的自回归 Transformer，因此我们也可以使用 SOTA（最先进）LLM（大语言模型）技术来改进我们的世界模型。 \n\n为什么是离散 token？离散化使我们的动力学预测问题变得容易得多，因为不需要预测图像所在的近无限连续空间，它只需要从我们词汇表（即码本）中的约 1000 个 token 中选择一个。\n\nTinyWorlds 由三个模块组成：\n\n**视频 Tokenizer：** 这个 Tokenizer（标记器）重建视频序列，中间有一个小的离散瓶颈（我们的视频 token）。**这一层将重要信息从视频压缩到 token。**\n\n**动作 Tokenizer：** 这个 Tokenizer **推断两个帧之间的离散动作 token**。它通过使用前一帧和一个能看到下一帧的离散动作 token 来重建下一帧的方式进行训练。\n\n**动力学模型：** 给定过去的动作和帧 token，此模型预测我们的下一帧 token。**这应该捕捉我们微型电子游戏世界的物理规律。**\n\n# 构建模块\n\n### 时空 Transformer\n![stt](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_49be20799b77.png)\n\n[时空 Transformer (Space-Time Transformer)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2001.02908) (STT) 是一种用于视频的 Transformer。每个 STT 块包含一个空间注意力层、一个时间注意力层和一个前馈网络 (FFN)。关于自注意力机制的复习，请参阅 Karpathy 的 [从零开始构建 GPT 
视频](https:\u002F\u002Fyoutu.be\u002FkCc8FmEb1nY?si=tvfcBnGHBbEiS70v&t=3748)。\n\n在空间层中，每个 token 关注同一帧中的所有其他 token。在时间层中，每个 token 关注相同位置但在之前时间步的 token。\n\nFFN 是每个嵌入向量上的多层感知机。受神圣仁慈的启发，我使用了 [SwiGLU](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2002.05202) 作为 FFN。SwiGLU 在 [Swish](https:\u002F\u002Fen.wikipedia.org\u002Fwiki\u002FSwish_function) 基础上增加了一个门控线性单元 (GLU)，其计算方式为 \n\n$x_t = W_3[\\sigma(W_1x + b1) * W_2x + b2] + b3$ (参见 SwiGLU 图表以获取清晰度)\n\n\n对于常规 STT，我使用了 [均方根归一化 (RMSNorm)](https:\u002F\u002Fdocs.pytorch.org\u002Fdocs\u002Fstable\u002Fgenerated\u002Ftorch.nn.modules.normalization.RMSNorm.html) 作为归一化器，它比零方差归一化对极端异常值更不敏感。在 RMS 中，我们将输入除以 \n\n$\\sqrt(\\epsilon + x \u002F \\sum x^2)$. \n\n对于基于动作条件化的 STT，我使用了 [特征线性调制 (FiLM)](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1709.07871)。FiLM 将每个时间步的动作通过 FFN 传递，以将每个动作潜在变量转换为 Gamma ($\\gamma$) 和 Beta ($\\beta$) 向量。我们的归一化公式则为 \n\n$(x - \\mu) \u002F \\sigma * (1 + \\gamma) + \\beta$\n\n### 变分自编码器\n\n*\\*VAE 很复杂，但以下是概述，省略了许多细节*\n\n[变分自编码器 (Variational Autoencoders)]((https:\u002F\u002Farxiv.org\u002Fpdf\u002F1906.02691)) (VAEs) 定义如下：\n1. 一个编码器网络 (Encoder)，用于参数化给定数据 $x$ 下潜在变量 $z$ 的近似后验分布 (posterior distribution) $q(z | x)$\n3. 一个解码器网络 (Decoder)，用于参数化给定潜在变量 $z$ 下输入数据 $x$ 的似然 (likelihood) $p(x | z)$\n\nVAEs 最大化 $log(p(x | z))$，即在给定编码器提供的潜在变量 $z$ 的情况下，解码器精确重构输入 $x$ 的可能性。 \n\n重要的结论是 $z$ 是低维的，因此对于重构，它将压缩来自 $x$ 的所有重要信息。\n\n### 有限标量量化 \n\n![fsq](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_4c2ca92bc3a3.png)\n\n由于我们需要一组离散的 token（标记），我们将连续的 $z$ 量化为有限个可能的 $z$ 中的一个。\n\n如果向量是高维空间中的点，[有限标量量化](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2309.15505)（FSQ）是一种将空间划分为超立方体的量化方法，向量落入的超立方体即成为其量化表示。\n\n具体来说，我们在 FSQ 中通过以下步骤对向量进行量化：\n\n1. tanh(x) 将其限制在 [-1,1] 范围内\n2. 缩放\u002F平移至 [0, L]\n3. 四舍五入到最近的整数（量化步长）\n4. 
缩放\u002F平移回 [-1,1]\n\nToken 词汇表的大小为 ${L^{D}}$，其中 $L$ 是每个维度的分箱（bins）数量，$D$ 是超立方体的维度。如果有 3 个维度且每个维度有 2 个层级，那么立方体内将有 8 个区域，token 词汇表大小为 8。 \n\nFSQ VAE（变分自编码器）允许我们学习结构化的超立方体，用作编码输入信息的 token 词汇表。在我们的上下文中，这些超立方体中的一个可能代表向左移动，另一个代表跳跃，另一个代表蹲下，等等。\n\n为了让梯度流向编码器（因为量化是不可微的），我们将量化后的梯度直接传递给量化前的层。 \n\n具体来说，解码器以 $z + stopgrad(z_q - z)$ 作为输入，其中 stopgrad 在 PyTorch 中对应 `.detach()`。解码器仅使用 $z_q$（因为 $z - z = 0$），但梯度仅针对 $z$ 计算。\n\n# 架构\n\n### 视频 Tokenizer\n\n![videotokenizer](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_210c9c5e5468.png)\n\n视频 Tokenizer 是一个 FSQ VAE（变分自编码器），它将视频压缩为离散 token。它在实现高质量视频生成的同时降低了动态的维度。\n\n它使用像素混合 2D 卷积将 patch 转换为嵌入。\n\n然后它使用 STTransformer 处理这些嵌入以生成量化 token。 \n\n每个视频 token 包含关于其自身 patch 以及同一位置或时间步其他 patch 的信息。 \n\n最后，它将视频 token 解码为重建图像。\n\n\n### 动作 Tokenizer\n\n![actiontokenizer](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_78fb59d54df6.png)\n\n动作 Tokenizer 也是一个 FSQ VAE（变分自编码器），它是可扩展性的关键。它通过学习推断两帧之间的动作，使我们能够在没有动作标签的情况下进行训练。然后我们基于这些动作来条件化动态模型。\n\n编码器接收一系列帧并输出帧之间的动作 token。\n\n解码器接收所有之前的帧 $(x_1...x_t-1)$ 和量化动作潜在向量 $(a_1...a_t-1)$ 作为输入，并预测下一帧 $(x_2...x_t)$。\n\n动作 token 应学习编码过去帧与当前帧之间最有意义的变化，这应对应于某种高级动作。\n\n在实践中，动作解码器试图忽略动作，仅从图像中推断。为了抵消这种情况， \n1. 我们掩码除第一帧外的所有帧，因此解码器必须学习使用动作序列作为重建信号\n2. 我们通过辅助损失鼓励编码器中的批次方差\n\n在推理时，我们将每个键映射到一个动作 token，该 token 用于条件化动态模型，供用户影响视频生成。\n\n### 动态模型\n![dynamicsmodel](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_831b90b6bee8.png)\n\n在时间步 $t$，动态模型应接收 tokenized 视频 token $z_{1..t - 1}$ 和动作 token $a_{_1..t - 1}$，并预测下一帧 token $z_{t}$。\n\n在实践中，我们像 [MaskGIT](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2202.04200) 和 [BERT](https:\u002F\u002Farxiv.org\u002Fpdf\u002F1810.04805) 一样训练动态模型。\n\n我们掩码一部分 token，并训练我们的模型在给定所有当前和之前帧及动作 token 的条件下预测被掩码的 token。\n\n为了推断给定步骤的动态，我们首先将一个完全掩码的帧附加到我们的上下文序列中。然后，对于 T 个步骤，我们执行以下操作：\n1. 预测每个掩码位置的 logits\n2. 使用 Softmax 计算 token 概率\n3. 从未掩码的位置中采样 k 个最可能的 token\n4. 将它们放入上下文张量中，移除对应的掩码 token\n5. 
重复此过程\n\n我选择了 k 的指数调度（第一步采样约 1 个 token，然后约 2 个，然后约 5 个，然后约 20 个，然后约 50 个，以此类推）\n\n### TinyWorlds 推理\n\n给定来自训练分布的初始上下文帧，我们首先将它们 tokenized。\n\n然后我们运行以下循环：\n1. 玩家通过在 $[0, |A|]$ 中选择整数指定要使用的 n_actions 个动作 token 之一\n2. 使用上下文窗口 c 上的视频 token t-c...t 和动作 token t-c..t 条件化动态模型，并运行动态推理 \n3. 将预测的视频 token 反 tokenize 为用户的新视频帧\n\n随着动作被传递给模型，tokens 由动态模型预测，我们将它们反 tokenize 为帧展示给用户，我们沿时间维度自回归地重复此过程。\n\n此过程还允许我们一次性预测多个未来帧（受内存和训练分布限制），这可以提高推理质量。\n\n### 数据\n\n![datasets](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_readme_425c3e1226bc.png)\n\n数据从游戏 `.mp4s` 处理和下采样为 `.h5` 文件。 \n您可以在 [开始指南](#getting-started) 中使用 datasets 命令从 [Huggingface TinyWorlds Datasets](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAlmondGod\u002Ftinyworlds\u002Ftree\u002Fmain) 下载现有数据集。 \n\n可用数据集包括：\n1. **PicoDoom** (`picodoom_frames.h5`)，Doom 的极简版本\n2. **Pong** (`pong_frames.h5`)，经典游戏\n3. **Zelda Ocarina of Time** (`zelda_frames.h5`)，原始 2D 塞尔达游戏之一\n4. **Pole Position** (`pole_position_frames.h5`)，像素赛车游戏\n5. **Sonic** (`sonic_frames.h5`)，原始游戏\n\n要创建新数据集，请在 [datasets.py](datasets\u002Fdatasets.py) 中创建新的 dataclass 并指定 mp4 路径。提交 PR 或直接私信我，将你的数据集上传到 HF 仓库，以便其他人使用 :)\n\n### 训练\u002F推理加速\n\nTinyWorlds 支持以下 torch 特性以加速训练和\u002F或推理：\n1. **Torch compile**，允许我们对某些预优化的操作（如 attention 和 matmuls）使用更快的 CUDA kernels\n2. **分布式数据并行 (DDP)**，允许我们通过在每台 GPU 上使用相同模型不同数据的方式使用多 GPU 进行训练\n3. **自动混合精度 (AMP)**，根据当前节点使用的浮点范围将某些操作从 FP32 扩展到 BF16\n4. 
**TF32 训练**，允许我们对张量核心优化的 matmuls 和 convolutions 使用 NVIDIA TensorFloat32\n\n### 形状标注键\n\n所有张量均进行了形状标注，并使用 einops（张量操作库）进行张量操作，缩写如下：\n\n**B：** 批次大小  \n**T：** 时间\u002F序列维度（帧数）  \n**P：** 每帧的 patch token 数量  \n**E：** 嵌入维度 (d_model)  \n**L：** 视频分词器潜在维度  \n**A：** 动作分词器潜在维度（动作维度）  \n**D：** 每个视频分词器维度的箱数  \n**L^D：** 视频分词器的词汇表大小  \n**C：** 图像通道  \n**H：** 像素网格高度  \n**W：** 像素网格宽度  \n**Hp：** Patch 网格高度  \n**Wp：** Patch 网格宽度  \n**S：** Patch 大小\n\n# 贡献指南\n\n仍有许多待办事项可能带来显著的性能提升……\n\n- [ ] 在前馈网络 (Feedforward Network) 中实现混合专家 (Mixture of Experts)\n- [ ] 尝试使用 `RoPE`（旋转位置编码）\u002F`AliBi` 位置编码 (Position Embeddings)\n- [ ] 尝试不同的优化器 (optimizers) (`Muon`, `SOAP`)\n- [ ] 添加更多数据集（Terraria, Street Fighter, \\\u003Cyour favorite retro videogame\\>）\n- [ ] 尝试使用 [AdaLN-Zero](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2212.09748) 替代 `FiLM`（增加预缩放参数）\n- [ ] 为 MaskGIT 添加新的调度器，如余弦和 [Halton](https:\u002F\u002Fgithub.com\u002Fvaleoai\u002FHalton-MaskGIT)\n- [ ] 将动作分词器中的 `mean pool + concat` 替换为 `length-2 windowed attention + mean`\n- [ ] 通过生成、保存和加载预处理后的图像 patch 嵌入而不是完整帧来加速动力学训练\n- [x] 扩展规模：通过在更多 GPU 上训练并添加 `FSDP`（全分片数据并行）支持扩展到数十亿参数 —— 由 [alekseymalakhov11](https:\u002F\u002Fgithub.com\u002Falekseymalakhov11) 完成\n\n\n\n### *杂项*\n\nTinyWorlds（不包括数据集和外部资源）采用 MIT [许可证](LICENSE)。TinyWorlds 是一个独立的研究项目，与 DeepMind 或 Google 无关联、未获其认可或赞助。\n\n*aesthetic 设计灵感来源于 [Tinygrad](https:\u002F\u002Ftinygrad.org\u002F) 和 [Tinygpu](https:\u002F\u002Fgithub.com\u002Fadam-maj\u002Ftiny-gpu)*","# TinyWorlds 快速上手指南\n\n**TinyWorlds** 是一个基于 Google DeepMind [Genie 架构](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2402.15391) 构建的最小化自回归世界模型。它通过离散 token 预测视频帧，无需先验动作数据即可推断动作，旨在帮助开发者理解可扩展的世界模型技术。\n\n## 环境准备\n\n*   **系统要求**: Linux \u002F macOS \u002F Windows (WSL)\n*   **硬件建议**: NVIDIA GPU (CUDA 支持)，显存建议 8GB+ (推理可更低，训练需更高)\n*   **前置依赖**:\n    *   Python 3.8+\n    *   Git\n    *   PyTorch (与 CUDA 版本匹配)\n    *   Weights & Biases (WandB) 账号 (用于实验追踪)\n\n> **网络提示**: 由于涉及从 GitHub 克隆仓库及从 HuggingFace 下载数据集\u002F模型，国内用户建议使用代理或配置镜像源（如 HF 
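Mirror) to keep the connection stable.\n\n## Installation\n\n1.  **Clone the project**\n    ```bash\n    git clone https:\u002F\u002Fgithub.com\u002FAlmondGod\u002Ftinyworlds.git\n    cd tinyworlds\n    ```\n\n2.  **Install dependencies**\n    ```bash\n    pip install -r requirements.txt\n    ```\n\n3.  **Set environment variables**\n    Replace `\u003CYOUR_WANDB_API_KEY>` with your actual API key:\n    ```bash\n    export WANDB_API_KEY=\u003CYOUR_WANDB_API_KEY>\n    export PYTHONPATH=\"\u002Fworkspace\u002Ftinyworlds:$PYTHONPATH\"\n    ```\n\n## Basic Usage\n\n### 1. Inference\nThis is the fastest way to see the model in action.\n\n1.  **Download pretrained checkpoints** (Sonic as an example):\n    ```bash\n    python scripts\u002Fdownload_assets.py models --suite-name sonic\n    ```\n\n2.  **Run inference**:\n    ```bash\n    python scripts\u002Frun_inference.py --config configs\u002Finference.yaml -- use_latest_checkpoints=true dataset=SONIC\n    ```\n\n### 2. Training\nFor custom training, make sure the data is ready first.\n\n1.  **Download training data** (Zelda as an example):\n    ```bash\n    python scripts\u002Fdownload_assets.py datasets --pattern \"zelda_frames.h5\"\n    ```\n\n2.  **Start training**:\n    ```bash\n    python scripts\u002Ffull_train.py --config configs\u002Ftraining.yaml -- --dataset=ZELDA\n    ```\n\n---\n*Note: for architecture details and further configuration options, see the README in the project root.*","A robotics R&D team plans to train an agent's navigation ability on *The Legend of Zelda* gameplay recordings, trying to understand a complex physical environment without labeled action data.\n\n### Without tinyworlds\n- The team has to rely on expensive motion-capture equipment or manual frame-by-frame labeling of control inputs, so data preparation takes months and costs a great deal.\n- Conventional video generation models cannot accurately infer the latent action between frames, so predicted trajectories break or become incoherent.\n- General-purpose pretrained models struggle to capture the physics of a specific game world, leaving a significant distribution gap between the simulated environment and real interaction.\n\n### With tinyworlds\n- tinyworlds infers the latent actions between frames without supervision, removing the labor cost of hand-labeling actions.\n- The autoregressive architecture over discrete tokens compresses the world state efficiently and predicts how the environment and its objects change at the next step.\n- By loading the officially provided checkpoints for datasets such as Sonic, a dynamics model with basic physical understanding can be built and put to the test within hours.\n\nBy decoupling actions from observations, tinyworlds lets researchers explore scalable world-model construction at very low compute cost.","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FAlmondGod_tinyworlds_4badcebc.png","AlmondGod","Anand","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FAlmondGod_b2cb4737.png","upenn | evolution + 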
neuro",null,"sf","Almondgodd","anandmaj.com","https:\u002F\u002Fgithub.com\u002FAlmondGod",[86],{"name":87,"color":88,"percentage":89},"Python","#3572A5",100,1195,98,"2026-04-03T10:13:52","MIT","Not specified",{"notes":96,"python":94,"dependencies":97},"Requires the WANDB_API_KEY environment variable; on first run, data and pretrained models must be downloaded from HuggingFace via the scripts; the export commands assume a bash environment; depends on the specific versions in requirements.txt",[98,99,100,101,102],"torch","wandb","datasets","huggingface-hub","pyyaml",[15,18],5,"2026-03-27T02:49:30.150509","2026-04-06T05:17:38.047311",[108,113,117,122,127,132],{"id":109,"question_zh":110,"answer_zh":111,"source_url":112},2743,"How do I handle the error \"Error: num_frames got 0\" when running training or inference?","Download the models directly from Huggingface and reset `preload_ratio` in the config file. This usually resolves the initialization failure caused by the data loader detecting 0 frames.","https:\u002F\u002Fgithub.com\u002FAlmondGod\u002Ftinyworlds\u002Fissues\u002F5",{"id":114,"question_zh":115,"answer_zh":116,"source_url":112},2744,"What should I do about a CUDA Out of Memory error on a personal GPU such as an RTX 4080?","In the current implementation, the parameters that affect GPU usage are setup-specific, and batch size is one of the key ones. Try adjusting the batch size in the config file to fit your available VRAM.",{"id":118,"question_zh":119,"answer_zh":120,"source_url":121},2745,"How do I deal with Torch Compile errors, or overly complex logs, during training or inference?","First check the dataset's `frame_size` setting: it should be 128 for the Zelda dataset and 64 for the others. To simplify debugging, set the `compile` parameter to `false` in the config.","https:\u002F\u002Fgithub.com\u002FAlmondGod\u002Ftinyworlds\u002Fissues\u002F7",{"id":123,"question_zh":124,"answer_zh":125,"source_url":126},2746,"What if video files such as `sonic_frames.mp4` are missing and I cannot run inference or training?","You can prepare your own MP4 files and place them in the data folder. Alternatively, train on the author's H5 data, but change `preload_ratio` to 1.0 in `training.yaml`.","https:\u002F\u002Fgithub.com\u002FAlmondGod\u002Ftinyworlds\u002Fissues\u002F3",{"id":128,"question_zh":129,"answer_zh":130,"source_url":131},2747,"The clone instructions in the README are outdated. What are the correct repository URL and path?","Use `tinyworlds` as the repository name. The clone command is `git clone https:\u002F\u002Fgithub.com\u002FAlmondGod\u002Ftinyworlds.git`; after entering the directory, set `PYTHONPATH` to point to 
`\u002Fworkspace\u002Ftinyworlds`.","https:\u002F\u002Fgithub.com\u002FAlmondGod\u002Ftinyworlds\u002Fissues\u002F1",{"id":133,"question_zh":134,"answer_zh":135,"source_url":136},2748,"What GPU hardware is recommended for training this project?","Development used 1 NVIDIA A100, and the final training runs used 2 to 6 H200 GPUs.","https:\u002F\u002Fgithub.com\u002FAlmondGod\u002Ftinyworlds\u002Fissues\u002F10",[]]