[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-ruotianluo--ImageCaptioning.pytorch":3,"tool-ruotianluo--ImageCaptioning.pytorch":64},[4,18,26,35,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107888,2,"2026-04-06T11:32:50",[14,15,13],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":10,"last_commit_at":41,"category_tags":42,"status":17},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[43,15,13,14],"语言模型",{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":10,"last_commit_at":50,"category_tags":51,"status":17},4292,"Deep-Live-Cam","hacksider\u002FDeep-Live-Cam","Deep-Live-Cam 是一款专注于实时换脸与视频生成的开源工具，用户仅需一张静态照片，即可通过“一键操作”实现摄像头画面的即时变脸或制作深度伪造视频。它有效解决了传统换脸技术流程繁琐、对硬件配置要求极高以及难以实时预览的痛点，让高质量的数字内容创作变得触手可及。\n\n这款工具不仅适合开发者和技术研究人员探索算法边界，更因其极简的操作逻辑（仅需三步：选脸、选摄像头、启动），广泛适用于普通用户、内容创作者、设计师及直播主播。无论是为了动画角色定制、服装展示模特替换，还是制作趣味短视频和直播互动，Deep-Live-Cam 都能提供流畅的支持。\n\n其核心技术亮点在于强大的实时处理能力，支持口型遮罩（Mouth 
Mask）以保留使用者原始的嘴部动作，确保表情自然精准；同时具备“人脸映射”功能，可同时对画面中的多个主体应用不同面孔。此外，项目内置了严格的内容安全过滤机制，自动拦截涉及裸露、暴力等不当素材，并倡导用户在获得授权及明确标注的前提下合规使用，体现了技术发展与伦理责任的平衡。",88924,"2026-04-06T03:28:53",[14,15,13,52],"视频",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85013,"2026-04-06T11:09:19",[15,16,52,61,13,62,43,14,63],"插件","其他","音频",{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":80,"owner_email":81,"owner_twitter":82,"owner_website":83,"owner_url":84,"languages":85,"stars":98,"forks":99,"last_commit_at":100,"license":101,"difficulty_score":102,"env_os":103,"env_gpu":104,"env_ram":103,"env_deps":105,"category_tags":116,"github_topics":82,"view_count":32,"oss_zip_url":82,"oss_zip_packed_at":82,"status":17,"created_at":117,"updated_at":118,"faqs":119,"releases":158},4770,"ruotianluo\u002FImageCaptioning.pytorch","ImageCaptioning.pytorch","I decide to sync up this repo and self-critical.pytorch. 
(The old master is in old master branch for archive)","ImageCaptioning.pytorch 是一个专为图像描述（Image Captioning）研究打造的开源代码库，旨在帮助开发者训练能够自动为图片生成自然语言描述的深度学习模型。它有效解决了从视觉特征提取到序列生成训练的全流程需求，特别针对提升描述准确性和流畅性提供了成熟方案。\n\n该项目非常适合人工智能研究人员、算法工程师以及希望深入探索多模态领域的开发者使用。其核心亮点在于完整复现并支持了多项业界经典技术：包括基于“自临界序列训练”（Self-critical Sequence Training）的强化学习方法，能直接优化 CIDEr 等评估指标；集成“自下而上”（Bottom-up）的注意力机制特征提取；以及原生支持 Transformer 架构和多 GPU 分布式训练。此外，项目还贴心地提供了测试时集成策略和预训练模型库，并兼容 COCO 与 Flickr30k 等主流数据集。无论是想要复现前沿论文成果，还是从零开始构建自己的图像描述系统，ImageCaptioning.pytorch 都提供了一个灵活且功能强大的基础平台。","# An Image Captioning codebase\n\nThis is a codebase for image captioning research.\n\nIt supports:\n- Self critical training from [Self-critical Sequence Training for Image Captioning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1612.00563)\n- Bottom up feature from [ref](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.07998).\n- Test time ensemble\n- Multi-GPU training. (DistributedDataParallel is now supported with the help of pytorch-lightning, see [ADVANCED.md](ADVANCED.md) for details)\n- Transformer captioning model.\n\nA simple demo colab notebook is available [here](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fruotianluo\u002FImageCaptioning.pytorch\u002Fblob\u002Fcolab\u002Fnotebooks\u002Fcaptioning_demo.ipynb)\n\n## Requirements\n- Python 3\n- PyTorch 1.3+ (along with torchvision) (Test with 1.13)\n- cider (already been added as a submodule)\n- coco-caption (already been added as a submodule) (**Remember to follow initialization steps in coco-caption\u002FREADME.md**)\n- yacs\n- lmdbdict\n- Optional: pytorch-lightning (Tested with 2.0)\n\n## Install\n\nIf you have difficulty running the training scripts in `tools`. 
You can try installing this repo as a python package:\n```\npython -m pip install -e .\n```\n\n## Pretrained models\n\nCheck out [MODEL_ZOO.md](MODEL_ZOO.md).\n\nIf you want to do evaluation only, you can then follow [this section](#generate-image-captions) after downloading the pretrained models (and also the pretrained resnet101 or precomputed bottomup features, see [data\u002FREADME.md](data\u002FREADME.md)).\n\n## Train your own network on COCO\u002FFlickr30k\n\n### Prepare data.\n\nWe now support both flickr30k and COCO. See details in [data\u002FREADME.md](data\u002FREADME.md). (Note: the later sections assume the COCO dataset; it should be trivial to use flickr30k.)\n\n### Start training\n\n```bash\n$ python tools\u002Ftrain.py --id fc --caption_model newfc --input_json data\u002Fcocotalk.json --input_fc_dir data\u002Fcocotalk_fc --input_att_dir data\u002Fcocotalk_att --input_label_h5 data\u002Fcocotalk_label.h5 --batch_size 10 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log_fc --save_checkpoint_every 6000 --val_images_use 5000 --max_epochs 30\n```\n\nor \n\n```bash\n$ python tools\u002Ftrain.py --cfg configs\u002Ffc.yml --id fc\n```\n\nThe train script will dump checkpoints into the folder specified by `--checkpoint_path` (default = `log_$id\u002F`). By default, only the best-performing checkpoint on validation and the latest checkpoint are kept, to save disk space. You can also set `--save_history_ckpt` to 1 to save every checkpoint.\n\nTo resume training, set the `--start_from` option to the folder containing `infos.pkl` and `model.pth` (usually you could just set `--start_from` and `--checkpoint_path` to be the same).\n\nTo check out the training curve or validation curve, you can use tensorboard. 
The loss histories are automatically dumped into `--checkpoint_path`.\n\nThe current command uses scheduled sampling; you can set `--scheduled_sampling_start` to -1 to turn it off.\n\nIf you'd like to evaluate BLEU\u002FMETEOR\u002FCIDEr scores during training in addition to validation cross entropy loss, use the `--language_eval 1` option, but don't forget to pull the submodule `coco-caption`.\n\nYou can specify any of the arguments in a yaml file and load that file with `--cfg`. Options given on the command line override the cfg file if there are conflicts.  \n\nFor more options, see `opts.py`. \n\n\u003C!-- **A few notes on training.** To give you an idea, with the default settings one epoch of MS COCO images is about 11000 iterations. After 1 epoch of training results in validation loss ~2.5 and CIDEr score of ~0.68. By iteration 60,000 CIDEr climbs up to about ~0.84 (validation loss at about 2.4 (under scheduled sampling)). -->\n\n### Train using self critical\n\nFirst you should preprocess the dataset and get the cache for calculating the CIDEr score:\n```\n$ python scripts\u002Fprepro_ngrams.py --input_json data\u002Fdataset_coco.json --dict_json data\u002Fcocotalk.json --output_pkl data\u002Fcoco-train --split train\n```\n\nThen, copy the model pretrained with cross entropy. 
(Copying the model is not mandatory; it is just a backup.)\n```\n$ bash scripts\u002Fcopy_model.sh fc fc_rl\n```\n\nThen\n```bash\n$ python tools\u002Ftrain.py --id fc_rl --caption_model newfc --input_json data\u002Fcocotalk.json --input_fc_dir data\u002Fcocotalk_fc --input_att_dir data\u002Fcocotalk_att --input_label_h5 data\u002Fcocotalk_label.h5 --batch_size 10 --learning_rate 5e-5 --start_from log_fc_rl --checkpoint_path log_fc_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30 --cached_tokens coco-train-idxs --max_epoch 50 --train_sample_n 5\n```\n\nor \n```bash\n$ python tools\u002Ftrain.py --cfg configs\u002Ffc_rl.yml --id fc_rl\n```\n\n\nYou will see a huge boost in CIDEr score : ).\n\n**A few notes on training.** Starting self-critical training after 30 epochs, the CIDEr score goes up to 1.05 after 600k iterations (including the 30 epochs of pretraining).\n\n## Generate image captions\n\n### Evaluate on raw images\n\n**Note**: this doesn't work for models trained with bottom-up features.\nNow place all your images of interest into a folder, e.g. `blah`, and run\nthe eval script:\n\n```bash\n$ python tools\u002Feval.py --model model.pth --infos_path infos.pkl --image_folder blah --num_images 10\n```\n\nThis tells the `eval` script to process up to 10 images from the given folder. If you have a big GPU you can speed up the evaluation by increasing `batch_size`. Use `--num_images -1` to process all images. The eval script will create a `vis.json` file inside the `vis` folder, which can then be visualized with the provided HTML interface:\n\n```bash\n$ cd vis\n$ python -m http.server\n```\n\nNow visit `localhost:8000` in your browser and you should see your predicted captions.\n\n### Evaluate on Karpathy's test split\n\n```bash\n$ python tools\u002Feval.py --dump_images 0 --num_images 5000 --model model.pth --infos_path infos.pkl --language_eval 1 \n```\n\nThe default split to evaluate is test. 
The default inference method is greedy decoding (`--sample_method greedy`); to sample from the posterior, set `--sample_method sample`.\n\n**Beam Search**. Beam search can improve performance over greedy decoding by ~5%. However, it is a little more expensive. To turn on beam search, use `--beam_size N`, where N is greater than 1.\n\n### Evaluate on COCO test set\n\n```bash\n$ python tools\u002Feval.py --input_json cocotest.json --input_fc_dir data\u002Fcocotest_bu_fc --input_att_dir data\u002Fcocotest_bu_att --input_label_h5 none --num_images -1 --model model.pth --infos_path infos.pkl --language_eval 0\n```\n\nYou can download the preprocessed files `cocotest.json`, `cocotest_bu_att` and `cocotest_bu_fc` from [link](https:\u002F\u002Fdrive.google.com\u002Fopen?id=1eCdz62FAVCGogOuNhy87Nmlo5_I0sH2J).\n\n## Miscellanea\n**Using cpu**. The code currently uses the gpu by default; there is not even an option for switching. If someone really needs a cpu model, please open an issue; I can potentially create a cpu checkpoint and modify the eval.py to run the model on cpu. However, there's no point using cpus to train the model.\n\n**Train on other datasets**. It should be trivial to port if you can create a file like `dataset_coco.json` for your own dataset.\n\n**Live demo**. Not supported now. 
Welcome pull request.\n\n## For more advanced features:\n\nCheckout [ADVANCED.md](ADVANCED.md).\n\n## Reference\n\nIf you find this repo useful, please consider citing (no obligation at all):\n\n```\n@article{luo2018discriminability,\n  title={Discriminability objective for training descriptive captions},\n  author={Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory},\n  journal={arXiv preprint arXiv:1803.04376},\n  year={2018}\n}\n```\n\nOf course, please cite the original paper of models you are using (You can find references in the model files).\n\n## Acknowledgements\n\nThanks the original [neuraltalk2](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fneuraltalk2) and awesome PyTorch team.","# 一个图像字幕生成代码库\n\n这是一个用于图像字幕生成研究的代码库。\n\n它支持：\n- 自我批判训练，来自 [Self-critical Sequence Training for Image Captioning](https:\u002F\u002Farxiv.org\u002Fabs\u002F1612.00563)\n- 自下而上的特征提取，来自 [ref](https:\u002F\u002Farxiv.org\u002Fabs\u002F1707.07998)。\n- 测试时集成\n- 多GPU训练。（在PyTorch Lightning的帮助下，现在支持DistributedDataParallel，详情请参阅[ADVANCED.md](ADVANCED.md)）\n- Transformer字幕生成模型。\n\n一个简单的演示Colab笔记本可以在[这里](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fruotianluo\u002FImageCaptioning.pytorch\u002Fblob\u002Fcolab\u002Fnotebooks\u002Fcaptioning_demo.ipynb)找到。\n\n## 需求\n- Python 3\n- PyTorch 1.3+（以及torchvision）（测试版本为1.13）\n- cider（已作为子模块添加）\n- coco-caption（已作为子模块添加）（**请务必按照coco-caption\u002FREADME.md中的初始化步骤操作**）\n- yacs\n- lmdbdict\n- 可选：pytorch-lightning（测试版本为2.0）\n\n## 安装\n\n如果您在运行`tools`中的训练脚本时遇到困难，可以尝试将此仓库安装为Python包：\n```\npython -m pip install -e .\n```\n\n## 预训练模型\n\n请查看[MODEL_ZOO.md](MODEL_ZOO.md)。\n\n如果您只想进行评估，可以在下载预训练模型（以及预训练的ResNet101或预计算的自下而上特征，详见[data\u002FREADME.md](data\u002FREADME.md)）后，按照[此处](#generate-image-captions)的说明进行操作。\n\n## 在COCO\u002FFlickr30k数据集上训练您自己的网络\n\n### 准备数据。\n\n我们现在同时支持Flickr30k和COCO数据集。详细信息请参阅[data\u002FREADME.md](data\u002FREADME.md)。（注意：后续部分假设使用COCO数据集；使用Flickr30k应该也很简单。）\n\n### 开始训练\n\n```bash\n$ python 
tools\u002Ftrain.py --id fc --caption_model newfc --input_json data\u002Fcocotalk.json --input_fc_dir data\u002Fcocotalk_fc --input_att_dir data\u002Fcocotalk_att --input_label_h5 data\u002Fcocotalk_label.h5 --batch_size 10 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log_fc --save_checkpoint_every 6000 --val_images_use 5000 --max_epochs 30\n```\n\n或者\n\n```bash\n$ python tools\u002Ftrain.py --cfg configs\u002Ffc.yml --id fc\n```\n\n训练脚本会将检查点保存到`--checkpoint_path`指定的文件夹中（默认为`log_$id\u002F`）。默认情况下，为了节省磁盘空间，只保存验证集上表现最好的检查点和最新的检查点。您也可以设置`--save_history_ckpt`为1，以保存每一个检查点。\n\n要恢复训练，您可以指定`--start_from`选项为保存`infos.pkl`和`model.pth`的路径（通常您可以将`--start_from`和`--checkpoint_path`设置为同一路径）。\n\n要查看训练曲线或验证曲线，可以使用TensorBoard。损失历史会自动保存到`--checkpoint_path`中。\n\n当前命令使用了计划采样，您也可以将`--scheduled_sampling_start`设置为-1来关闭计划采样。\n\n如果您希望在训练过程中除了验证交叉熵损失外，还评估BLEU\u002FMETEOR\u002FCIDEr分数，请使用`--language_eval 1`选项，但别忘了拉取子模块`coco-caption`。\n\n对于所有参数，您可以将其指定在一个yaml文件中，并使用`--cfg`来加载该yaml文件中的配置。如果命令行中的配置与yaml文件中的配置冲突，命令行中的配置将优先。\n\n更多选项请参阅`opts.py`。\n\n\u003C!-- **关于训练的一些注意事项。** 供您参考，使用默认设置时，MS COCO数据集的一个epoch大约是11000次迭代。经过一个epoch的训练后，验证损失约为2.5，CIDEr分数约为0.68。到第60,000次迭代时，CIDEr分数会上升到约0.84左右（验证损失约为2.4，在计划采样条件下）。 -->\n\n### 使用自我批判训练\n\n首先，您需要对数据集进行预处理，并生成用于计算CIDEr分数的缓存：\n```\n$ python scripts\u002Fprepro_ngrams.py --input_json data\u002Fdataset_coco.json --dict_json data\u002Fcocotalk.json --output_pkl data\u002Fcoco-train --split train\n```\n\n然后，从使用交叉熵训练的预训练模型中复制模型。（这不是必须的，只是为了备份）\n```\n$ bash scripts\u002Fcopy_model.sh fc fc_rl\n```\n\n接着：\n```bash\n$ python tools\u002Ftrain.py --id fc_rl --caption_model newfc --input_json data\u002Fcocotalk.json --input_fc_dir data\u002Fcocotalk_fc --input_att_dir data\u002Fcocotalk_att --input_label_h5 data\u002Fcocotalk_label.h5 --batch_size 10 --learning_rate 5e-5 --start_from log_fc_rl --checkpoint_path log_fc_rl --save_checkpoint_every 6000 --language_eval 1 --val_images_use 5000 --self_critical_after 30 
--cached_tokens coco-train-idxs --max_epoch 50 --train_sample_n 5\n```\n\n或者\n```bash\n$ python tools\u002Ftrain.py --cfg configs\u002Ffc_rl.yml --id fc_rl\n```\n\n您会看到CIDEr分数会有显著提升，:)。\n\n**关于训练的一些注意事项。** 从第30个epoch开始进行自我批判训练，经过60万次迭代后（包括之前的30个epoch），CIDEr分数可以达到1.05。\n\n## 生成图像字幕\n\n### 在原始图像上评估\n\n**注意**：这不适用于使用自下而上特征训练的模型。现在将您感兴趣的所有图像放入一个文件夹中，例如`blah`，然后运行评估脚本：\n```bash\n$ python tools\u002Feval.py --model model.pth --infos_path infos.pkl --image_folder blah --num_images 10\n```\n\n这会告诉评估脚本从给定文件夹中最多处理10张图像。如果您有大显存的GPU，可以通过增加`batch_size`来加快评估速度。使用`--num_images -1`来处理所有图像。评估脚本会在`vis`文件夹中生成一个`vis.json`文件，您可以使用提供的HTML界面来可视化结果：\n```bash\n$ cd vis\n$ python -m SimpleHTTPServer\n```\n\n现在在浏览器中访问`localhost:8000`，您应该可以看到预测的字幕。\n\n### 在Karpathy的测试集上评估\n\n```bash\n$ python tools\u002Feval.py --dump_images 0 --num_images 5000 --model model.pth --infos_path infos.pkl --language_eval 1 \n```\n\n默认的评估分割是测试集。默认的推理方法是贪婪解码（`--sample_method greedy`），如果想从后验分布中采样，可以设置`--sample_method sample`。\n\n**束搜索**。束搜索可以将贪婪解码序列的性能提高约5%。不过，这种方法的成本稍高。要启用束搜索，可以使用`--beam_size N`，N应大于1。\n\n### 在COCO测试集上评估\n\n```bash\n$ python tools\u002Feval.py --input_json cocotest.json --input_fc_dir data\u002Fcocotest_bu_fc --input_att_dir data\u002Fcocotest_bu_att --input_label_h5 none --num_images -1 --model model.pth --infos_path infos.pkl --language_eval 0\n```\n\n您可以从[链接](https:\u002F\u002Fdrive.google.com\u002Fopen?id=1eCdz62FAVCGogOuNhy87Nmlo5_I0sH2J)下载预处理好的文件`cocotest.json`、`cocotest_bu_att`和`cocotest_bu_fc`。\n\n## 杂项\n**使用 CPU**。当前代码默认使用 GPU；甚至没有切换选项。如果有人非常需要 CPU 版本的模型，请提交一个 issue；我可以尝试创建一个 CPU 检查点，并修改 `eval.py` 以在 CPU 上运行模型。不过，使用 CPU 训练该模型并无意义。\n\n**在其他数据集上训练**。如果你能为自己的数据集创建一个类似 `dataset_coco.json` 的文件，那么移植应该非常简单。\n\n**在线演示**。目前暂不支持。欢迎提交 Pull Request。\n\n## 更高级的功能：\n\n请查看 [ADVANCED.md](ADVANCED.md)。\n\n## 参考文献\n\n如果你觉得这个仓库对你有帮助，请考虑引用（完全自愿）：\n\n```\n@article{luo2018discriminability,\n  title={用于训练描述性标题的可区分性目标},\n  author={Luo, Ruotian and Price, Brian and Cohen, Scott and Shakhnarovich, Gregory},\n  
journal={arXiv 预印本 arXiv:1803.04376},\n  year={2018}\n}\n```\n\n当然，也请引用你所使用的模型的原始论文（参考文献可在模型文件中找到）。\n\n## 致谢\n\n感谢原始的 [neuraltalk2](https:\u002F\u002Fgithub.com\u002Fkarpathy\u002Fneuraltalk2) 以及优秀的 PyTorch 团队。","# ImageCaptioning.pytorch 快速上手指南\n\n本指南旨在帮助开发者快速部署并使用 ImageCaptioning.pytorch 进行图像描述（Image Captioning）的研究与训练。该项目基于 PyTorch，支持自临界序列训练（Self-critical）、Bottom-up 特征、Transformer 模型及多 GPU 分布式训练。\n\n## 环境准备\n\n在开始之前，请确保您的系统满足以下要求：\n\n*   **操作系统**: Linux (推荐) 或 macOS\n*   **Python**: 3.x\n*   **PyTorch**: 1.3+ (推荐使用 1.13 及以上版本，需同时安装 `torchvision`)\n*   **可选依赖**: `pytorch-lightning` (用于多 GPU 分布式训练，测试版本 2.0)\n\n**前置依赖说明：**\n项目已包含 `cider` 和 `coco-caption` 作为子模块。克隆仓库后，**务必**初始化子模块并遵循 `coco-caption\u002FREADME.md` 中的初始化步骤（通常需要下载 Java 和相关脚本），否则无法计算 CIDEr 等评估指标。\n\n```bash\ngit clone --recursive https:\u002F\u002Fgithub.com\u002Fruotianluo\u002FImageCaptioning.pytorch.git\ncd ImageCaptioning.pytorch\n# 按照 coco-caption\u002FREADME.md 完成后续初始化\n```\n\n其他 Python 依赖可通过以下方式安装（建议使用国内镜像源加速）：\n\n```bash\npip install -r requirements.txt -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n# 或者手动安装核心依赖\npip install yacs lmdbdict -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n```\n\n## 安装步骤\n\n如果您在运行 `tools` 目录下的训练脚本时遇到困难，建议将本项目作为 Python 包进行安装：\n\n```bash\npython -m pip install -e .\n```\n\n## 基本使用\n\n### 1. 数据准备\n本项目支持 COCO 和 Flickr30k 数据集。以下示例默认使用 **COCO** 数据集。\n请参考 `data\u002FREADME.md` 下载预处理好的数据文件（如 `cocotalk.json`, `cocotalk_fc`, `cocotalk_att`, `cocotalk_label.h5` 等）。\n\n### 2. 
开始训练\n您可以直接通过命令行参数或配置文件启动训练。以下是最基础的训练命令示例，使用交叉熵损失函数训练一个 `newfc` 模型：\n\n```bash\npython tools\u002Ftrain.py --id fc --caption_model newfc --input_json data\u002Fcocotalk.json --input_fc_dir data\u002Fcocotalk_fc --input_att_dir data\u002Fcocotalk_att --input_label_h5 data\u002Fcocotalk_label.h5 --batch_size 10 --learning_rate 5e-4 --learning_rate_decay_start 0 --scheduled_sampling_start 0 --checkpoint_path log_fc --save_checkpoint_every 6000 --val_images_use 5000 --max_epochs 30\n```\n\n或者使用配置文件（更简洁）：\n\n```bash\npython tools\u002Ftrain.py --cfg configs\u002Ffc.yml --id fc\n```\n\n*   **断点续训**: 添加 `--start_from log_fc` 参数（路径需包含 `infos.pkl` 和 `model.pth`）即可从上次保存的状态继续训练。\n*   **监控**: 训练日志和 Loss 曲线会自动保存在 `--checkpoint_path` 指定的文件夹中，可使用 TensorBoard 查看。\n\n### 3. 生成图像描述 (评估)\n训练完成后，您可以使用预训练模型或自己的模型对图像生成描述。\n\n**对本地文件夹中的图片进行评估：**\n将待测试图片放入一个文件夹（例如 `blah`），然后运行：\n\n```bash\npython tools\u002Feval.py --model model.pth --infos_path infos.pkl --image_folder blah --num_images 10\n```\n\n*   `--num_images`: 设置要处理的图片数量，设为 `-1` 处理所有图片。\n*   结果将生成在 `vis\u002Fvis.json`，可通过启动简易 HTTP 服务器在浏览器查看可视化结果：\n    ```bash\n    cd vis\n    python -m SimpleHTTPServer\n    # 访问 http:\u002F\u002Flocalhost:8000\n    ```\n\n**在标准测试集上评估 (含 BLEU\u002FCIDEr 分数)：**\n\n```bash\npython tools\u002Feval.py --dump_images 0 --num_images 5000 --model model.pth --infos_path infos.pkl --language_eval 1\n```\n\n*   **束搜索 (Beam Search)**: 添加 `--beam_size N` (N>1) 可提升生成质量，但会增加计算开销。\n*   **采样方法**: 默认为贪婪解码 (`greedy`)，如需从后验分布采样，设置 `--sample_method sample`。\n\n> **提示**: 若需体验快速演示，可直接访问项目提供的 [Colab Notebook](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002Fruotianluo\u002FImageCaptioning.pytorch\u002Fblob\u002Fcolab\u002Fnotebooks\u002Fcaptioning_demo.ipynb)。","某电商平台的算法团队正致力于为海量商品图自动生成多语言描述，以提升视障用户的购物体验和搜索转化率。\n\n### 没有 ImageCaptioning.pytorch 时\n- **模型复现困难**：团队需从零搭建复杂的“自临界序列训练”（Self-critical）架构，耗费数周调试代码且难以对齐论文效果。\n- **特征提取繁琐**：缺乏对“自下而上”（Bottom-up）注意力机制的原生支持，手动集成 Faster R-CNN 特征提取流程极易出错。\n- 
**训练效率低下**：单卡训练速度缓慢，不支持分布式多 GPU 加速，导致在大规模 COCO 数据集上的迭代周期长达数天。\n- **评估指标缺失**：缺少集成的 CIDEr、BLEU 等权威评分工具，无法在训练过程中实时监控生成文本的质量。\n\n### 使用 ImageCaptioning.pytorch 后\n- **快速落地 SOTA**：直接调用内置的 Self-critical 训练脚本和 Transformer 模型，几天内即可复现业界领先的基线效果。\n- **开箱即用特征**：原生支持预计算的 Bottom-up 特征，无需重复开发视觉前端，大幅降低工程门槛。\n- **高效分布式训练**：借助 PyTorch Lightning 实现多 GPU 并行训练，将原本数天的训练时间压缩至数小时，加速实验迭代。\n- **自动化质量监控**：训练过程中自动计算并记录 CIDEr 等指标，结合 TensorBoard 可视化曲线，让模型优化方向清晰可见。\n\nImageCaptioning.pytorch 通过提供一站式的科研级代码库，将图像描述算法的研发周期从“月级”缩短至“天级”，让团队能专注于业务逻辑而非底层架构。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fruotianluo_ImageCaptioning.pytorch_b9b2fcb0.png","ruotianluo","Ruotian(RT) Luo","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fruotianluo_ee04f309.png","Waymo Perception SWE. Phd graduate from TTIC.","Waymo","Austin","rluo@ttic.edu",null,"http:\u002F\u002Fttic.uchicago.edu\u002F~rluo","https:\u002F\u002Fgithub.com\u002Fruotianluo",[86,90,94],{"name":87,"color":88,"percentage":89},"Python","#3572A5",94.8,{"name":91,"color":92,"percentage":93},"Shell","#89e051",4.7,{"name":95,"color":96,"percentage":97},"HTML","#e34c26",0.6,1480,423,"2026-03-30T10:39:43","MIT",4,"未说明","训练默认使用 GPU（无 CPU 选项），支持多 GPU 训练；具体型号和显存大小未说明，需自行根据模型规模配置",{"notes":106,"python":107,"dependencies":108},"1. 代码默认强制使用 GPU 运行，暂无原生 CPU 推理选项（如需 CPU 支持需提交 Issue 请求作者修改）。2. 使用前必须初始化 'coco-caption' 子模块（遵循其 README 中的步骤）。3. 支持分布式多 GPU 训练（需安装 pytorch-lightning）。4. 
若仅进行评估，需下载预训练模型及对应的 ResNet101 或 Bottom-up 特征文件。","3.x",[109,110,111,112,113,114,115],"PyTorch>=1.3 (测试版本 1.13)","torchvision","cider (子模块)","coco-caption (子模块)","yacs","lmdbdict","pytorch-lightning>=2.0 (可选)",[15,62],"2026-03-27T02:49:30.150509","2026-04-07T09:50:12.922495",[120,125,130,135,139,144,148,153],{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},21664,"运行 eval.py 评估时出现 'ZeroDivisionError: division by zero' 错误怎么办？","该错误通常发生在第二次运行评估或特定图像文件夹配置不正确时。请确保 `--image_folder` 参数指向正确的图像目录（例如 `test2014`）。如果问题依旧，尝试添加 `--force 1` 参数强制重新生成预测。此外，检查生成的 JSON 文件是否包含有效的预测结果，避免空列表导致除以零。","https:\u002F\u002Fgithub.com\u002Fruotianluo\u002FImageCaptioning.pytorch\u002Fissues\u002F98",{"id":126,"question_zh":127,"answer_zh":128,"source_url":129},21665,"加载预训练模型时遇到状态字典键不匹配（如 'q_func.weight'）的错误如何解决？","这是因为预训练模型中包含了一些当前模型架构不需要的层（如强化学习相关的 Q 函数）。解决方法是在加载状态字典前手动删除这些键。代码示例如下：\npretrained_states = torch.load(opt.model)\npretrained_states.pop(\"q_func.weight\", 0)\npretrained_states.pop(\"q_func.bias\", 0)\nmodel.load_state_dict(pretrained_states)","https:\u002F\u002Fgithub.com\u002Fruotianluo\u002FImageCaptioning.pytorch\u002Fissues\u002F9",{"id":131,"question_zh":132,"answer_zh":133,"source_url":134},21666,"在 GPU 上训练时遇到 '_MultiProcessingDataLoaderIter' object has no attribute '_send_idx' 错误？","这是由于 PyTorch 版本更新导致 DataLoader 内部属性名称变化引起的兼容性问题。通常出现在较新的 PyTorch 版本中。建议尝试降低 PyTorch 版本至与项目兼容的版本（如 0.4.1 或 1.x 早期版本），或者修改 `dataloader.py` 中的 `state_dict` 方法，避免访问 `_send_idx` 和 `_rcvd_idx` 等私有属性，改为直接保存迭代器状态或跳过该部分的序列化。","https:\u002F\u002Fgithub.com\u002Fruotianluo\u002FImageCaptioning.pytorch\u002Fissues\u002F102",{"id":136,"question_zh":137,"answer_zh":138,"source_url":134},21667,"使用 Bottom-up 特征处理数据时，make_bu_data.py 报错 'TypeError: expected bytes-like object, not str'？","这是 Python 3 中 base64 解码的类型问题。需要修改 `make_bu_data.py`：\n1. 将打开文件的模式从 `\"r+b\"` 改为 `\"r\"`。\n2. 
在调用 `base64.decodestring` (或 `decodebytes`) 之前，将输入字符串编码为 ASCII 字节。 \n修改后的代码逻辑：\nopen(os.path.join(args.downloaded_feats, infile), \"r\")\n...\nbase64.decodebytes(item[field].encode('ascii'))",{"id":140,"question_zh":141,"answer_zh":142,"source_url":143},21668,"代码中出现 'list indices must be integers, not str' 或 'imgs' 相关类型错误？","这通常是因为输入的 JSON 数据结构与代码预期不符。代码期望 `imgs` 是一个字典且包含 'images' 键，但实际读取到的可能直接是一个列表。请检查输入的 `dataset_coco.json` 文件格式，确保其顶层结构包含 `{'images': [...]}`。如果数据格式已变更，需修改 `scripts\u002Fprepro_labels.py` 第 138 行左右的代码，适配新的数据结构（例如直接遍历列表而不是通过键访问）。","https:\u002F\u002Fgithub.com\u002Fruotianluo\u002FImageCaptioning.pytorch\u002Fissues\u002F3",{"id":145,"question_zh":146,"answer_zh":147,"source_url":129},21669,"eval.py 运行时提示 'AttributeError: 'Namespace' object has no attribute 'caption_model''？","这是因为 `eval.py` 的参数解析器中缺少 `caption_model` 定义，而模型加载代码依赖此参数。解决方法是手动编辑 `eval.py`，在参数解析部分添加 `--caption_model` 参数，或者直接从保存的 `infos_fc-best.pkl` 文件中读取该配置并应用到 `opt` 对象中，确保在调用 `models.setup(opt)` 之前该属性已存在。",{"id":149,"question_zh":150,"answer_zh":151,"source_url":152},21670,"dataloaderraw.py 中出现 'img.concatenate' 属性错误？","这是一个代码笔误。`img` 变量是 numpy 数组，没有 `concatenate` 方法。应将 `img = img.concatenate((img, img, img), axis=2)` 修改为 `img = np.concatenate((img, img, img), axis=2)`，使用 numpy 库的 concatenate 函数。","https:\u002F\u002Fgithub.com\u002Fruotianluo\u002FImageCaptioning.pytorch\u002Fissues\u002F4",{"id":154,"question_zh":155,"answer_zh":156,"source_url":157},21671,"LSTM 的隐藏状态初始化以及 Start\u002FEnd Token (0 和 1) 的具体含义是什么？","在该项目中，Token ID 的使用遵循 Neuraltalk2 的逻辑：\n1. 当模型**预测**下一个词时，输出 0 代表结束符（End Token）。\n2. 当向 LSTM **输入**词向量时，输入 0 代表开始符（Start Token）。\n这是因为在预测阶段永远不会预测出 Start Token，而在输入阶段不需要输入 End Token。初始化隐藏状态时，通常直接使用图像特征（fc_feats）通过线性层变换得到。","https:\u002F\u002Fgithub.com\u002Fruotianluo\u002FImageCaptioning.pytorch\u002Fissues\u002F25",[]]