[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-BAAI-DCAI--Bunny":3,"similar-BAAI-DCAI--Bunny":97},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":17,"owner_company":15,"owner_location":15,"owner_email":15,"owner_twitter":15,"owner_website":15,"owner_url":18,"languages":19,"stars":28,"forks":29,"last_commit_at":30,"license":31,"difficulty_score":32,"env_os":33,"env_gpu":34,"env_ram":35,"env_deps":36,"category_tags":45,"github_topics":48,"view_count":56,"oss_zip_url":15,"oss_zip_packed_at":15,"status":57,"created_at":58,"updated_at":59,"faqs":60,"releases":96},5014,"BAAI-DCAI\u002FBunny","Bunny","A family of lightweight multimodal models. ","Bunny 是一个轻量级但功能强大的多模态模型家族，旨在让 AI 同时具备“看”和“说”的能力。它主要解决了在有限计算资源下，如何平衡模型体积与性能的行业难题，让用户能在普通硬件上部署高效的多模态应用。\n\n无论是希望快速集成视觉能力的开发者、追求极致效率的研究人员，还是需要在本地运行大模型的极客用户，都能从 Bunny 中获益。其核心亮点在于极高的灵活性：支持即插即用的多种视觉编码器（如 EVA-CLIP、SigLIP）以及丰富的语言基座（包括 Llama-3、Phi-3、Qwen1.5 等）。为了弥补小参数量的潜在劣势，Bunny 通过精心筛选的高质量数据进行训练，显著提升了理解力。\n\n特别是最新的 Bunny-Llama-3-8B-V 和 Bunny-4B 版本，不仅支持高达 1152x1152 的高分辨率图像输入，更在多项基准测试中展现出超越同尺寸甚至更大规模模型的卓越性能。作为首个基于 Llama-3 的视觉语言模型系列，Bunny 以开源友好的姿态，为构建下一代多模态应用提供了坚实且高效的基础。","# Bunny: A family of lightweight multimodal models\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBAAI-DCAI_Bunny_readme_3290ae7aa2a1.png\" alt=\"Logo\" width=\"350\">\n\u003C\u002Fp>\n\n📖 [Technical report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11530) | 🤗 [Data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_1-data) | 🤖 [Data](https:\u002F\u002Fwww.modelscope.cn\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1.1-data) | 🤗 [HFSpace](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FBoZhaoHuggingFace\u002FBunny) 🐰 [Demo](http:\u002F\u002Fbunny.baai.ac.cn)\n\n**Bunny-Llama-3-8B-V**: 🤗 
[v1.1](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V) | 🤗 [v1.0](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-Llama-3-8B-V) | 🤗 [v1.0-GGUF](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-Llama-3-8B-V-gguf)\n\n**Bunny-4B**: 🤗 [v1.1](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-4B) | 🤗 [v1.0](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-4B) | 🤗 [v1.0-GGUF](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-4B-gguf)\n\nBunny is a family of lightweight but powerful multimodal models. It offers multiple plug-and-play vision encoders, such as **EVA-CLIP** and **SigLIP**, and language backbones, including **Llama-3-8B, Phi-3-mini, Phi-1.5, StableLM-2, Qwen1.5, MiniCPM and Phi-2**. To compensate for the decrease in model size, we construct more informative training data through curated selection from a broader data source. \n\nWe are thrilled to introduce **Bunny-Llama-3-8B-V**, the pioneering vision-language model based on Llama-3, showcasing exceptional performance. The v1.1 version accepts high-resolution images up to **1152x1152**.\n\n![comparison_8B](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBAAI-DCAI_Bunny_readme_68b43cf29934.png)\n\nMoreover, our **Bunny-4B** model, built upon SigLIP and Phi-3-mini, outperforms the state-of-the-art MLLMs, not only in comparison with models of similar size but also against larger MLLMs (7B and 13B). 
Also, the v1.1 version accepts high-resolution images up to **1152x1152**.\n\n\u003Cdetails>\n\u003Csummary>Expand to see the performance of Bunny-4B\u003C\u002Fsummary>\n\u003CIMG src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBAAI-DCAI_Bunny_readme_8d8807af4fad.png\"\u002F>\n\u003C\u002Fdetails>\n\n\n\n## News and Updates\n\n* 2024.07.23 🔥 **All of the training strategy and data of latest Bunny is released!** Check more details about Bunny in [Technical Report](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11530), [Data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_1-data) and [Training Tutorial](#training-tutorial)!\n* 2024.07.21 🔥 **SpatialBot, SpatialQA and SpatialBench are released!** SpatialBot is an embodiment model based on Bunny, which comprehends spatial relationships by understanding and using depth information. Try model, dataset and benchmark at [GitHub](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FSpatialBot)!\n* 2024.06.20 🔥 **MMR benchmark is released!** It is a benchmark for measuring MLLMs' understanding ability and their robustness against misleading questions. Check the performance of Bunny and more details in [GitHub](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FMultimodal-Robustness-Benchmark)!\n* 2024.06.01 🔥 **Bunny-v1.1-Llama-3-8B-V, supporting 1152x1152 resolution, is released!** It is built upon SigLIP and Llama-3-8B-Instruct with S$`^2`$-Wrapper. Check more details in [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V) and [wisemodel](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FBunny-v1.1-Llama-3-8B-V)! 🐰 [Demo](http:\u002F\u002Fbunny.baai.ac.cn)\n\n* 2024.05.08 **Bunny-v1.1-4B, supporting 1152x1152 resolution, is released!** It is built upon SigLIP and Phi-3-Mini-4K 3.8B with S$`^2`$-Wrapper. Check more details in [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-4B)! 
🐰 [Demo](http:\u002F\u002Fbunny.baai.ac.cn)\n\n* 2024.05.01 **Bunny-v1.0-4B, a vision-language model based on Phi-3, is released!** It is built upon SigLIP and Phi-3-Mini-4K 3.8B. Check more details in [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-4B)! 🤗 [GGUF](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-4B-gguf)\n\n* 2024.04.21 **Bunny-Llama-3-8B-V, the first vision-language model based on Llama-3, is released!** It is built upon SigLIP and Llama-3-8B-Instruct. Check more details in [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-Llama-3-8B-V), [ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FBAAI\u002FBunny-Llama-3-8B-V), and [wisemodel](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FBunny-Llama-3-8B-V)! The **GGUF** format is in [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-Llama-3-8B-V-gguf) and [wisemodel](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FBunny-Llama-3-8B-V-gguf).\n\n* 2024.04.18 **Bunny-v1.0-3B-zh, powerful on English and Chinese, is released!** It is built upon SigLIP and MiniCPM-2B. Check more details in [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B-zh), [ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FBAAI\u002FBunny-v1.0-3B-zh\u002Fsummary), and [wisemodel](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FBunny-v1.0-3B-zh)! The evaluation results are in the [Evaluation](#evaluation). We sincerely thank Zhenwei Shao for his kind help.\n\n* 2024.03.15 **Bunny-v1.0-2B-zh, focusing on Chinese, is released!** It is built upon SigLIP and Qwen1.5-1.8B. Check more details in [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-2B-zh), [ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FBAAI\u002FBunny-v1.0-2B-zh\u002Fsummary), and [wisemodel](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FBunny-v1.0-2B-zh)! 
The evaluation results are in the [Evaluation](#evaluation).\n\n* 2024.03.06 **Bunny training data is released!** Check more details about Bunny-v1.0-data in [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_0-data) or [ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1.0-data)!\n* 2024.02.20 **Bunny technical report is ready!** Check more details about Bunny [here](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11530)!\n* 2024.02.07 **Bunny is released!** Bunny-v1.0-3B, built upon SigLIP and Phi-2, outperforms the state-of-the-art MLLMs, not only in comparison with models of similar size but also against larger MLLMs (7B), and even achieves performance on par with LLaVA-13B! 🤗 [Bunny-v1.0-3B](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B)\n\n## Quickstart\n\n### HuggingFace transformers\n\nHere is a code snippet showing how to use [Bunny-v1.1-Llama-3-8B-V](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V), [Bunny-v1.1-4B](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-4B), [Bunny-v1.0-3B](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B) and so on with HuggingFace transformers.\n\nThis snippet only works for the models above, because we **manually** combine some configuration code into a single file for users' convenience. For example, you can check [`modeling_bunny_llama.py`](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V\u002Fblob\u002Fmain\u002Fmodeling_bunny_llama.py) and [`configuration_bunny_llama.py`](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V\u002Fblob\u002Fmain\u002Fconfiguration_bunny_llama.py) and their related parts in the source code of Bunny to see the difference. For other models, including models trained by yourself, we **recommend** loading them by installing the source code of Bunny. 
Or you can copy files like [`modeling_bunny_llama.py`](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V\u002Fblob\u002Fmain\u002Fmodeling_bunny_llama.py) and [`configuration_bunny_llama.py`](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V\u002Fblob\u002Fmain\u002Fconfiguration_bunny_llama.py) into your model and modify `auto_map` in `config.json`, but we can't guarantee its correctness and you may need to modify some code to fit your model.\n\nBefore running the snippet, you need to install the following dependencies:\n\n```shell\npip install torch transformers accelerate pillow\n```\n\nIf the CUDA memory is enough, it would be faster to execute this snippet by setting `CUDA_VISIBLE_DEVICES=0`.\n\nUsers especially those in Chinese mainland may want to refer to a HuggingFace [mirror site](https:\u002F\u002Fhf-mirror.com). \n\n```python\nimport torch\nimport transformers\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom PIL import Image\nimport warnings\n\n# disable some warnings\ntransformers.logging.set_verbosity_error()\ntransformers.logging.disable_progress_bar()\nwarnings.filterwarnings('ignore')\n\n# set device\ndevice = 'cuda'  # or cpu\ntorch.set_default_device(device)\n\nmodel_name = 'BAAI\u002FBunny-v1_1-Llama-3-8B-V' # or 'BAAI\u002FBunny-Llama-3-8B-V' or 'BAAI\u002FBunny-v1_1-4B' or 'BAAI\u002FBunny-v1_0-4B' or 'BAAI\u002FBunny-v1_0-3B' or 'BAAI\u002FBunny-v1_0-3B-zh' or 'BAAI\u002FBunny-v1_0-2B-zh'\noffset_bos = 1 # for Bunny-v1_1-Llama-3-8B-V, Bunny-Llama-3-8B-V, Bunny-v1_1-4B, Bunny-v1_0-4B and Bunny-v1_0-3B-zh\n# offset_bos = 0 for Bunny-v1_0-3B and Bunny-v1_0-2B-zh\n\n# create model\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=torch.float16, # float32 for cpu\n    device_map='auto',\n    trust_remote_code=True)\ntokenizer = AutoTokenizer.from_pretrained(\n    model_name,\n    trust_remote_code=True)\n\n# text prompt\nprompt = 'Why is the image 
funny?'\ntext = f\"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \u003Cimage>\\n{prompt} ASSISTANT:\"\ntext_chunks = [tokenizer(chunk).input_ids for chunk in text.split('\u003Cimage>')]\ninput_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)\n\n# image, sample images can be found in https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V\u002Ftree\u002Fmain\u002Fimages\nimage = Image.open('example_2.png')\nimage_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)\n\n# generate\noutput_ids = model.generate(\n    input_ids,\n    images=image_tensor,\n    max_new_tokens=100,\n    use_cache=True,\n    repetition_penalty=1.0 # increase this to avoid chattering\n)[0]\n\nprint(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())\n```\n\n### ModelScope\n\nWe advise users especially those in Chinese mainland to use ModelScope.\n`snapshot_download` can help you solve issues concerning downloading checkpoints.\n\n\u003Cdetails>\n\u003Csummary>Expand to see the snippet\u003C\u002Fsummary>\n\nBefore running the snippet, you need to install the following dependencies:\n\n\n```shell\npip install torch modelscope transformers accelerate pillow\n```\nIf the CUDA memory is enough, it would be faster to execute this snippet by setting `CUDA_VISIBLE_DEVICES=0`.\n\n```python\nimport torch\nimport transformers\nfrom modelscope import AutoTokenizer, AutoModelForCausalLM\nfrom modelscope.hub.snapshot_download import snapshot_download\nfrom PIL import Image\nimport warnings\n\n# disable some warnings\ntransformers.logging.set_verbosity_error()\ntransformers.logging.disable_progress_bar()\nwarnings.filterwarnings('ignore')\n\n# set device\ndevice = 'cuda'  # or cpu\ntorch.set_default_device(device)\n\nmodel_name = 
'BAAI\u002FBunny-Llama-3-8B-V' # or 'BAAI\u002FBunny-v1.0-3B' or 'BAAI\u002FBunny-v1.0-3B-zh' or 'BAAI\u002FBunny-v1.0-2B-zh'\noffset_bos = 1 # for Bunny-Llama-3-8B-V and Bunny-v1.0-3B-zh\n# offset_bos = 0 for Bunny-v1.0-3B and Bunny-v1.0-2B-zh\n\n# create model\nsnapshot_download(model_id='thomas\u002Fsiglip-so400m-patch14-384')\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=torch.float16, # float32 for cpu\n    device_map='auto',\n    trust_remote_code=True)\ntokenizer = AutoTokenizer.from_pretrained(\n    model_name,\n    trust_remote_code=True)\n\n# text prompt\nprompt = 'Why is the image funny?'\ntext = f\"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \u003Cimage>\\n{prompt} ASSISTANT:\"\ntext_chunks = [tokenizer(chunk).input_ids for chunk in text.split('\u003Cimage>')]\ninput_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)\n\n# image, sample images can be found in images folder on https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FBAAI\u002FBunny-Llama-3-8B-V\u002Ffiles\nimage = Image.open('example_2.png')\nimage_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)\n\n# generate\noutput_ids = model.generate(\n    input_ids,\n    images=image_tensor,\n    max_new_tokens=100,\n    use_cache=True,\n    repetition_penalty=1.0 # increase this to avoid chattering\n)[0]\n\nprint(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())\n```\n\n\u003C\u002Fdetails>\n\n## Model Zoo\n\n### Evaluation\n\n| Checkpoint                                                   | MME$`^\\text{P}`$ | MME$`^\\text{C}`$ | MMB$`^{\\text{T}\u002F\\text{D}}`$ | MMB-CN$`^{\\text{T}\u002F \\text{D}}`$ | SEED(-IMG) | MMMU$`^{\\text{V}\u002F\\text{T}}`$ | VQA$`^\\text{v2}`$ | GQA  | SQA$`^\\text{I}`$ | POPE |\n| 
:----------------------------------------------------------- | :--------------: | :--------------: | :--------------: | :--------------: | :--------------: | :--: | :---------------: | :---------------: | :---------------: | :--: |\n| [bunny-phi-1.5-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-1.5-eva-lora) |      1213.7      |      278.9      |       60.9\u002F56.8       |       -       | 56.4\u002F64.1 | 30.0\u002F28.4 | 76.5 |       60.4       |       58.2       | 86.1 |\n| [bunny-stablelm-2-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-stablelm-2-eva-lora) |      1301.0      |      235.0       |       58.4\u002F56.4       |       -       | 55.3\u002F62.8 | 29.8\u002F29.4 | 74.6 |       56.7       |       60.0    | 84.8 |\n| [bunny-phi-2-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-2-eva-lora) |      1421.0      |      285.4      |       68.6\u002F67.4       |       -       | 62.2\u002F70.2 | 35.9\u002F32.6 | 78.9 |       62.3       |       69.1       | 87.1 |\n| [bunny-phi-1.5-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-1.5-siglip-lora) |      1230.0      |      237.5      |       61.2\u002F59.7       |       -       | 57.7\u002F65.3 | 30.0\u002F29.1 | 78.0 |       61.1       |       61.3       | 85.8 |\n| [bunny-stablelm-2-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-stablelm-2-siglip-lora) |      1366.8      |      236.1       |       65.1\u002F62.8       |       -       | 58.8\u002F67.5 | 29.9\u002F29.8 | 78.9 |       60.9       |       61.1    | 85.9 |\n| [Bunny-v1.0-2B-zh\u002Fbunny-qwen1.5-1.8b-siglip](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-2B-zh) |      1300.8      |      254.3      |       59.8\u002F59.1       |       59.5\u002F58.5       | 55.4\u002F62.3 | 34.4\u002F30.4 | 76.6 |       59.6       |       64.6       | 85.8 |\n| 
[Bunny-v1.0-3B-zh\u002Fbunny-minicpm-siglip](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B-zh) |      1410.4      |      281.4      |       66.1\u002F65.5       |       64.9\u002F63.6       | 59.6\u002F67.3 | 35.4\u002F32.4 | 78.6 |       60.8       |       68.7       | 86.5 |\n| [Bunny-v1.0-3B\u002Fbunny-phi-2-siglip](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B) |      1488.8      |      289.3      |       69.2\u002F68.6       |       -       | 62.5\u002F70.7 | 38.2\u002F33.0 | 79.8 |       62.5       |       70.9       | 86.8 |\n| [Bunny-v1.0-4B](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-4B) |      1495.2      |      338.9      |       74.0\u002F73.5       |       -       | 64.5\u002F72.1 | 40.1\u002F39.1 | 81.5 |       63.5       |       75.2       | 86.7 |\n| **[Bunny-v1.1-4B](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-4B)** |      1581.5      |      361.1      |       75.7\u002F74.2       |       66.5\u002F64.5       | 64.9\u002F72.5 | 41.4\u002F38.4 | 82.1 |       63.2       |       78.3       | 87.2 |\n| [Bunny-Llama-3-8B-V](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-Llama-3-8B-V) |      1588.9      |      321.1      |       77.2\u002F76.7       |       73.8\u002F72.3       | 65.9\u002F73.3 | 42.8\u002F39.0 | 82.6 |       64.8       |       80.4       | 86.9 |\n| **[Bunny-1.1-Llama-3-8B-V](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V)** |      1644.1      |      367.5      |       78.1\u002F77.2       |       74.3\u002F74.8       | 66.2\u002F73.5 | 43.3\u002F39.0 | 82.9 |       64.0       |       79.9       | 87.2 |\n\nThe small model with the best performance is denoted as Bunny-v1.0-3B or bunny-phi-2-siglip, whose merged weights can be found [here](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B) and the LoRA weights can be found [here](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002Fbunny-phi-2-siglip-lora).\n\nWe also provide two models 
that focus on Chinese QA ability, namely Bunny-v1.0-3B-zh (bunny-minicpm-siglip) and Bunny-v1.0-2B-zh (bunny-qwen1.5-1.8b-siglip). The merged weights can be found [here](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B-zh) and [here](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-2B-zh). The LoRA weights can be found [here](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-minicpm-siglip-lora) and [here](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-qwen1.5-1.8b-siglip-lora).\n\n### Training Tutorial\n\n| Checkpoint                                                   | Vision Encoder                                               | LLM                                                          | Pretrain weights                                             |Training Tutorial|\n| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------  | ------------------------------------------------------------ |---|\n| [bunny-phi-1.5-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-1.5-eva-lora) | [EVA02_CLIP_L_336_psz14_s6B](https:\u002F\u002Fhuggingface.co\u002FQuanSun\u002FEVA-CLIP\u002Fblob\u002Fmain\u002FEVA02_CLIP_L_336_psz14_s6B.pt) | [microsoft\u002Fphi-1_5](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-1_5)     | [bunny-pretrain-phi-1.5-eva](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-phi-1.5-eva) |[link](script\u002Ftrain\u002Ftutorials\u002Fbunny-phi-1.5-eva-lora.md)     |\n| [bunny-stablelm-2-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-stablelm-2-eva-lora) | [EVA02_CLIP_L_336_psz14_s6B](https:\u002F\u002Fhuggingface.co\u002FQuanSun\u002FEVA-CLIP\u002Fblob\u002Fmain\u002FEVA02_CLIP_L_336_psz14_s6B.pt) | [stabilityai\u002Fstablelm-2-1_6b](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstablelm-2-1_6b)     | 
[bunny-pretrain-stablelm-2-eva](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-stablelm-2-eva) |[link](script\u002Ftrain\u002Ftutorials\u002Fbunny-stablelm-2-eva-lora.md)     |\n| [bunny-phi-2-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-2-eva-lora) | [EVA02_CLIP_L_336_psz14_s6B](https:\u002F\u002Fhuggingface.co\u002FQuanSun\u002FEVA-CLIP\u002Fblob\u002Fmain\u002FEVA02_CLIP_L_336_psz14_s6B.pt) | [microsoft\u002Fphi-2](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-2)        | [bunny-pretrain-phi-2-eva](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-phi-2-eva) |[link](script\u002Ftrain\u002Ftutorials\u002Fbunny-phi-2-eva-lora.md)     |\n| [bunny-phi-1.5-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-1.5-siglip-lora) | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [microsoft\u002Fphi-1_5](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-1_5)     | [bunny-pretrain-phi-1.5-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-phi-1.5-siglip) |[link](script\u002Ftrain\u002Ftutorials\u002Fbunny-phi-1.5-siglip-lora.md)     |\n| [bunny-stablelm-2-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-stablelm-2-siglip-lora) | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [stabilityai\u002Fstablelm-2-1_6b](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstablelm-2-1_6b)      | [bunny-pretrain-stablelm-2-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-stablelm-2-siglip) |[link](script\u002Ftrain\u002Ftutorials\u002Fbunny-stablelm-2-siglip-lora.md)     |\n| [bunny-qwen1.5-1.8b-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-qwen1.5-1.8b-siglip-lora) | 
[siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [Qwen\u002FQwen1.5-1.8B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen1.5-1.8B)     | [bunny-pretrain-qwen1.5-1.8b-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-qwen1.5-1.8b-siglip) |[link](script\u002Ftrain\u002Ftutorials\u002Fbunny-qwen1.5-1.8b-siglip-lora.md)     |\n| [bunny-minicpm-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-minicpm-siglip-lora) | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [openbmb\u002FMiniCPM-2B-history](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-history) (step 280000)    | [bunny-pretrain-minicpm-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-minicpm-siglip) |[link](script\u002Ftrain\u002Ftutorials\u002Fbunny-minicpm-siglip-lora.md)     |\n| [bunny-phi-2-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002Fbunny-phi-2-siglip-lora) | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [microsoft\u002Fphi-2](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-2)        | [bunny-pretrain-phi-2-siglip](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002Fbunny-pretrain-phi-2-siglip) |[link](script\u002Ftrain\u002Ftutorials\u002Fbunny-phi-2-siglip-lora.md)     |\n| Bunny-v1.0-4B                                            | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [microsoft\u002FPhi-3-mini-4k-instruct](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-3-mini-4k-instruct)     | [bunny-pretrain-phi-3-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-phi-3-siglip) |[link](script\u002Ftrain\u002Ftutorials\u002FBunny-v1.0-4B.md)     |\n| **Bunny-v1.1-4B**                                            | 
[siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [microsoft\u002FPhi-3-mini-4k-instruct](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-3-mini-4k-instruct)     | [bunny-pretrain-phi-3-siglip-s2](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-phi-3-siglip-s2) |[link](script\u002Ftrain\u002Ftutorials\u002FBunny-v1.1-4B.md)     |\n| Bunny-Llama-3-8B-V                                       | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [meta-llama\u002FMeta-Llama-3-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct)     | [bunny-pretrain-llama3-8b-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-llama3-8b-siglip) |[link](script\u002Ftrain\u002Ftutorials\u002FBunny-Llama-3-8B-V.md)     |\n| **Bunny-v1.1-Llama-3-8B-V**                                       | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [meta-llama\u002FMeta-Llama-3-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct)     | [bunny-pretrain-llama3-8b-siglip-s2](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-llama3-8b-siglip-s2) |[link](script\u002Ftrain\u002Ftutorials\u002FBunny-v1.1-Llama-3-8B-V.md)     |\n\n## Install\nEither start from our docker or install locally on your own. \n\n### Start from Our Docker\nDirectly start from our configured docker image by `docker pull russellrobin\u002Fbunny:latest`. \n\n\u003Cdetails>\n\u003Csummary>Expand to see how to keep codes up to date.\u003C\u002Fsummary>\nAlthough this docker is under regular maintenance by us, local Bunny codes aren't guaranteed to be kept up to date with our GitHub repo. \nYou may want to:\n\n1. Run `pip install --upgrade transformers && cd Bunny` in a running container,\n\n2. 
Set default GitHub identity by `git config user.email \"you@example.com\" && git config user.name \"Your Name\"`,\n\n3. Update Bunny local codes using `git pull`. \n\n4. `pip install -e .`\n\nYou are all set!\n\u003C\u002Fdetails>\n\n### Local Installation\n* CUDA and cuDNN\n\n  We use CUDA 11.8 and cuDNN 8.7.0. We actually use the CUDA docker by NVIDIA: `docker pull nvcr.io\u002Fnvidia\u002Fcuda:11.8.0-cudnn8-devel-ubuntu20.04`. CUDA 12 is fine, too.\n\n* Create a conda virtual environment and activate it:\n\n  ```shell\n  conda create -n bunny python=3.10\n  conda activate bunny\n  ```\n\n* Basic requirements\n\n  ```shell\n  pip install --upgrade pip  # enable PEP 660 support\n  pip install transformers\n  pip install torch torchvision xformers --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n  ```\n\n* Install apex\n\n  ```shell\n  # https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fapex#from-source\n  pip install ninja\n  git clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fapex\n  cd apex\n  # if pip >= 23.1 (ref: https:\u002F\u002Fpip.pypa.io\u002Fen\u002Fstable\u002Fnews\u002F#v23-1) which supports multiple `--config-settings` with the same key...\n  pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings \"--build-option=--cpp_ext\" --config-settings \"--build-option=--cuda_ext\" .\u002F\n  # otherwise\n  pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" .\u002F\n  ```\n\n* Install flash-attention\n\n  ```shell\n  # https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention?tab=readme-ov-file#installation-and-features\n  pip install packaging\n  pip install flash-attn --no-build-isolation\n  ```\n\n* Install bunny and other requirements\n\n  ```shell\n  git clone https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny.git\n  cd Bunny\n  pip install -e .\n  ```\n\n## Training\n\nBunny training consists 
of two stages: (1) pretrain stage: use data to connect a *frozen pretrained* vision encoder to a *frozen* LLM, where only the connector is trained; (2) visual instruction tuning stage: use data to teach the model to follow multimodal instructions, where the connector, the learnable LLM parameters and (optionally) the vision encoder are updated.\n\nBunny is trained on 8 A100 GPUs. Under other circumstances, you can reduce `per_device_train_batch_size` and increase `gradient_accumulation_steps` accordingly. Always keep the global batch size the same: `global_batch_size` = `per_device_train_batch_size` $`\\times`$ `gradient_accumulation_steps` $`\\times`$ `num_gpus`.\n\n### Supported Models\n\nCurrently, we support several vision encoders and LLMs.\n\nFor vision encoders, we support CLIP, EVA-CLIP and SigLIP.\n\n| Vision Encoders            | Download Link                                                |\n| -------------------------- | ------------------------------------------------------------ |\n| clip-vit-large-patch14-336 | [openai\u002Fclip-vit-large-patch14-336](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fclip-vit-large-patch14-336) |\n| EVA02_CLIP_L_336_psz14_s6B | [QuanSun\u002FEVA-CLIP](https:\u002F\u002Fhuggingface.co\u002FQuanSun\u002FEVA-CLIP\u002Fblob\u002Fmain\u002FEVA02_CLIP_L_336_psz14_s6B.pt) |\n| siglip-so400m-patch14-384  | [google\u002Fsiglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) |\n\nFor LLMs, we support phi-1.5, stablelm-2, qwen1.5-1.8b, minicpm, phi-2, phi-3 and llama3-8b.\n\n| MODEL_TYPE | LLM             | Download Link                                                |\n| ---------- | --------------- | ------------------------------------------------------------ |\n| phi-1.5    | phi-1_5     | [microsoft\u002Fphi-1_5](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-1_5) |\n| stablelm-2 | stablelm-2-1_6b | 
[stabilityai\u002Fstablelm-2-1_6b](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstablelm-2-1_6b) |\n| qwen1.5-1.8b | Qwen1.5-1.8B | [Qwen\u002FQwen1.5-1.8B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen1.5-1.8B) |\n| minicpm | MiniCPM-2B | [openbmb\u002FMiniCPM-2B-history](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-history) (step 280000) |\n| phi-2 | phi-2 | [microsoft\u002Fphi-2](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-2) |\n| phi-3 | Phi-3-mini-4k-instruct | [microsoft\u002FPhi-3-mini-4k-instruct](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-3-mini-4k-instruct) |\n| llama3-8b | Meta-Llama-3-8B-Instruct | [meta-llama\u002FMeta-Llama-3-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct) |\n\nNote that there are many variants of the above models.\nWe build and test our code based on the exact versions mentioned above.\nMore models will be supported in the future!\n\n### Pretrain\n\n* Data preparation\n\n  We use a high-quality coreset of LAION-2B, with fewer duplicates and more informative samples, built by [this work](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FDataset-Pruning\u002Ftree\u002Fmain\u002FLAION). We randomly sample 2 million image-text pairs from the coreset and convert them to our training format.\n\n  The dataset is available [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_1-data).\n\n* Run\n\n  Update `--model_name_or_path` and `--vision_tower` to the paths of the LLM and vision encoder, respectively. Update `MODEL_TYPE` and `OUTPUT_DIR` accordingly. The global batch size is 256. 
S$`^2`$-Wrapper is enabled if `--use_s2 True` is added.\n  \n  You may refer to the settings of our experiments in the [Training Tutorial](#training-tutorial).\n  \n  ```shell\n  sh script\u002Ftrain\u002Fpretrain.sh\n  ```\n\n### Visual Instruction Tuning\n\n* Data preparation\n\n  We build Bunny-695K by modifying [SVIT-mix-665K](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.04087) for finetuning, and then combine it with LLaVA-665K and ALLaVA-Instruct-4V to obtain Bunny-LLaVA-1.4M, Bunny-ALLaVA-1.3M, and Bunny-LLaVA-ALLaVA-2M.\n\n  The dataset is available [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_1-data). If you only want to use Bunny-695K and the related images, you can download just them [here](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_0-data).\n\n* Run\n\n  Update `--model_name_or_path` and `--vision_tower` to the paths of the LLM and vision encoder, respectively. Update `MODEL_TYPE`, `PRETRAIN_DIR` and `OUTPUT_DIR` accordingly. The global batch size is 128. For `MODEL_TYPE = minicpm\u002Fphi-3\u002Fllama3-8b`, change `--version` to `minicpm\u002Fphi3\u002Fllama`, too. S$`^2`$-Wrapper is enabled if `--use_s2 True` is added. The vision encoder is tuned if `--unfreeze_vision_tower True` is added.\n  \n  We explore a better strategy that includes more visual instruction tuning data, S$`^2`$-Wrapper, a trainable vision encoder, weight merging, etc. You may refer to the settings of our experiments in the [Training Tutorial](#training-tutorial).\n  \n  ```shell\n  # full-parameter tuning\n  sh script\u002Ftrain\u002Ffinetune_full.sh\n  \n  # LoRA tuning\n  sh script\u002Ftrain\u002Ffinetune_lora.sh\n  ```\n### Continuous Fine-tuning\n\nIf you want to continue fine-tuning our released Bunny models on your own data or adapt them to a certain task, \n\n\u003Cdetails>\n\u003Csummary>expand to see the instructions.\u003C\u002Fsummary>\n\n\n1. 
Prepare data: convert your data to a `JSON` file of a list of all samples with the format of [Bunny-695K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_0-data\u002Fblob\u002Fmain\u002Ffinetune\u002Fbunny_695k.json).\n\n2. Prepare model:\n\n   * download Bunny [models](#model-zoo); if only LoRA weights are provided, merge them with the base LLM\n\n     ```shell\n     python script\u002Fmerge_lora_weights.py \\\n       --model-path \u002Fpath\u002Fto\u002Fbunny_lora_weights \\\n       --model-base \u002Fpath\u002Fto\u002Fbase_llm_model \\\n       --model-type phi-2 (or stablelm-2 or phi-1.5 or qwen1.5-1.8b or minicpm or phi-3 or llama3-8b) \\\n       --save-model-path \u002Fpath\u002Fto\u002Fmerged_model\n     ```\n   * add `\"continuous_training\": true` to `\u002Fpath\u002Fto\u002Fmerged_model\u002Fconfig.json` to ensure the vision tower is loaded from the merged weights\n   \n\n\n3. Edit the script: both `finetune_full.sh` and `finetune_lora.sh` can be used; before running:\n\n   * change `--model_name_or_path` to `\u002Fpath\u002Fto\u002Fmerged_model`\n\n   * delete `--pretrain_mm_mlp_adapter` because we load the cross-modality projector from the merged weights\n\n   * customize the hyperparameters, e.g. the learning rate, to fit your dataset\n   \n   * for `MODEL_TYPE = minicpm\u002Fphi-3\u002Fllama3-8b`, change `--version` to `minicpm\u002Fphi3\u002Fllama`, too. S$`^2`$-Wrapper is enabled if `--use_s2 True` is added. The vision encoder is tuned if `--unfreeze_vision_tower True` is added.\n   \n\n**Note** that if you continue fine-tuning Bunny models using LoRA, `--model-base` should be the Bunny models rather than the original LLMs when loading.\n\n\u003C\u002Fdetails>\n\n## Demo\n\n### Gradio Web UI\n\n* Starting the Controller\n\n  First, start the controller. 
This service orchestrates communication between the web server and model workers.\n  \n  ```shell\n  python -m bunny.serve.controller \\\n  \t--host 0.0.0.0 \\\n  \t--port 10000\n  ```\n\n* Launching the Gradio Web Server\n\n  To interact with the models through a web interface, start the Gradio web server.\n\n  Basic start:\n\n  ```shell\n  python -m bunny.serve.gradio_web_server \\\n  \t--controller http:\u002F\u002Flocalhost:10000 \\\n  \t--model-list-mode reload\n  ```\n\n  If you want to share your web server with others, use the `--share` option. Note that `frpc_linux_amd64_v0.2` may be missing; you can fix this by following the instructions printed on the screen and making the file executable.\n\n  ```shell\n  python -m bunny.serve.gradio_web_server \\\n  \t--controller http:\u002F\u002Flocalhost:10000 \\\n  \t--model-list-mode reload \\\n  \t--share\n  ```\n\n  Now, you can open the web interface with **the URL printed on the screen**. You may notice that there is no model in the model list. Do not worry, as we have not launched any model worker yet. The list will be updated automatically when you launch a model worker.\n\n* Launching Model Workers\n\n  Model workers handle the processing of model inferences. Configure each worker with the appropriate model and start it. 
Be sure to check that `conv_mode` is set correctly [here](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny\u002Fblob\u002Fmain\u002Fbunny\u002Fserve\u002Fgradio_web_server.py#L194); it is determined by the name (path) of the model.\n\n  * For full-parameter tuning models\n\n      ```shell\n      python -m bunny.serve.model_worker \\\n        --host 0.0.0.0 \\\n        --controller http:\u002F\u002Flocalhost:10000 \\\n        --port 40000 \\\n        --worker http:\u002F\u002Flocalhost:40000 \\\n        --model-path \u002Fpath\u002Fto\u002Fbunny\u002Fmodel \\\n        --model-type phi-2 (or stablelm-2 or phi-1.5 or qwen1.5-1.8b or minicpm or phi-3 or llama3-8b)\n      ```\n\n  * For LoRA tuning models\n\n      You can use `script\u002Fmerge_lora_weights.py` to merge the LoRA weights and the base LLM, then use the merged model as above.\n      \n      ```Shell\n      python script\u002Fmerge_lora_weights.py \\\n        --model-path \u002Fpath\u002Fto\u002Fbunny_lora_weights \\\n        --model-base \u002Fpath\u002Fto\u002Fbase_llm_model \\\n        --model-type phi-2 (or stablelm-2 or phi-1.5 or qwen1.5-1.8b or minicpm or phi-3 or llama3-8b) \\\n        --save-model-path \u002Fpath\u002Fto\u002Fmerged_model\n      ```\n      Or you can use it without merging, as below.\n      \n      ```shell\n      python -m bunny.serve.model_worker \\\n        --host 0.0.0.0 \\\n        --controller http:\u002F\u002Flocalhost:10000 \\\n        --port 40000 \\\n        --worker http:\u002F\u002Flocalhost:40000 \\\n        --model-path \u002Fpath\u002Fto\u002Fbunny_lora_weights \\\n        --model-base \u002Fpath\u002Fto\u002Fbase_llm_model \\\n        --model-type phi-2 (or stablelm-2 or phi-1.5 or qwen1.5-1.8b or minicpm or phi-3 or llama3-8b)\n      ```\n\n\n### CLI Inference (Without Gradio Interface)\n\nFor CLI-based inference without using the Gradio interface, use the following command:\n\n* For full-parameter tuning models\n\n  ```shell\n  python -m bunny.serve.cli \\\n  \t--model-path 
\u002Fpath\u002Fto\u002Fbunny\u002Fmodel \\\n  \t--model-type phi-2 (or stablelm-2 or phi-1.5 or qwen1.5-1.8b or minicpm or phi-3 or llama3-8b) \\\n  \t--image-file \u002Fpath\u002Fto\u002Fthe\u002Ftest\u002Fimage \\\n  \t--conv-mode bunny (change to minicpm\u002Fphi3\u002Fllama for model-type = minicpm\u002Fphi-3\u002Fllama3-8b)\n  ```\n\n* For LoRA tuning models\n\n  You can use `script\u002Fmerge_lora_weights.py` to merge the LoRA weights and the base LLM, then use the merged model as above.\n\n  ```Shell\n  python script\u002Fmerge_lora_weights.py \\\n  \t--model-path \u002Fpath\u002Fto\u002Fbunny_lora_weights \\\n  \t--model-base \u002Fpath\u002Fto\u002Fbase_llm_model \\\n  \t--model-type phi-2 (or stablelm-2 or phi-1.5 or qwen1.5-1.8b or minicpm or phi-3 or llama3-8b) \\\n  \t--save-model-path \u002Fpath\u002Fto\u002Fmerged_model\n  ```\n\n  Or you can use it without merging, as below.\n\n  ```shell\n  python -m bunny.serve.cli \\\n  \t--model-path \u002Fpath\u002Fto\u002Fbunny_lora_weights \\\n  \t--model-base \u002Fpath\u002Fto\u002Fbase_llm_model \\\n  \t--model-type phi-2 (or stablelm-2 or phi-1.5 or qwen1.5-1.8b or minicpm or phi-3 or llama3-8b) \\\n  \t--image-file \u002Fpath\u002Fto\u002Fthe\u002Ftest\u002Fimage \\\n  \t--conv-mode bunny (change to minicpm\u002Fphi3\u002Fllama for model-type = minicpm\u002Fphi-3\u002Fllama3-8b)\n  ```\n\nYou can also control `temperature`, `repetition-penalty` and `max-new-tokens`.\n\n## Evaluation\n\nFor full-parameter tuning models, see [evaluation_full.md](script\u002Feval\u002Ffull\u002Fevaluation_full.md).\n\nFor LoRA tuning models, see [evaluation_lora.md](script\u002Feval\u002Flora\u002Fevaluation_lora.md).\n\n## Citation\nIf you find this repository helpful, please cite the paper below.\n\n```bibtex\n@article{he2024bunny,\n      title={Efficient Multimodal Learning from Data-centric Perspective}, \n      author={He, Muyang and Liu, Yexin and Wu, Boya and Yuan, Jianhao and Wang, Yueze and Huang, Tiejun and Zhao, Bo},\n      journal={arXiv preprint 
arXiv:2402.11530},\n      year={2024}\n}\n```\n\n## License\nThis project utilizes certain datasets and checkpoints that are subject to their respective original licenses. Users must comply with all terms and conditions of these original licenses.\nThe content of this project itself is licensed under the [Apache license 2.0](.\u002FLICENSE).\n\n## Acknowledgement\n\nWe build our project based on [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA): Large Language and Vision Assistant.\n","# Bunny：一个轻量级多模态模型家族\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBAAI-DCAI_Bunny_readme_3290ae7aa2a1.png\" alt=\"Logo\" width=\"350\">\n\u003C\u002Fp>\n\n📖 [技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11530) | 🤗 [数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_1-data) | 🤖 [数据集](https:\u002F\u002Fwww.modelscope.cn\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1.1-data) | 🤗 [HFSpace](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FBoZhaoHuggingFace\u002FBunny) 🐰 [演示](http:\u002F\u002Fbunny.baai.ac.cn)\n\n**Bunny-Llama-3-8B-V**: 🤗 [v1.1](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V) | 🤗 [v1.0](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-Llama-3-8B-V) | 🤗 [v1.0-GGUF](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-Llama-3-8B-V-gguf)\n\n**Bunny-4B**: 🤗 [v1.1](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-4B) | 🤗 [v1.0](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-4B) | 🤗 [v1.0-GGUF](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-4B-gguf)\n\nBunny 是一个轻量级但功能强大的多模态模型家族。它提供了多种即插即用的视觉编码器，如 **EVA-CLIP、SigLIP**，以及语言主干网络，包括 **Llama-3-8B、Phi-3-mini、Phi-1.5、StableLM-2、Qwen1.5、MiniCPM 和 Phi-2**。为了弥补模型规模减小带来的性能损失，我们从更广泛的数据源中精心筛选出更具信息量的训练数据。\n\n我们非常高兴地推出 **Bunny-Llama-3-8B-V**，这是首个基于 Llama-3 的视觉-语言模型，展现了卓越的性能。其 v1.1 版本可接受高达 **1152x1152** 
分辨率的高分辨率图像。\n\n![comparison_8B](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBAAI-DCAI_Bunny_readme_68b43cf29934.png)\n\n此外，我们基于 SigLIP 和 Phi-3-mini 构建的 **Bunny-4B** 模型，在与同类规模模型以及更大规模（7B 和 13B）MLLMs 的对比中，均表现出超越当前最先进水平的性能。同时，v1.1 版本同样支持高达 **1152x1152** 分辨率的高分辨率图像。\n\n\u003Cdetails>\n\u003Csummary>展开查看 Bunny-4B 的性能\u003C\u002Fsummary>\n\u003CIMG src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBAAI-DCAI_Bunny_readme_8d8807af4fad.png\"\u002F>\n\u003C\u002Fdetails>\n\n\n\n## 新闻与更新\n\n* 2024年7月23日 🔥 **最新版 Bunny 的所有训练策略和数据已公开！** 更多关于 Bunny 的详细信息，请参阅 [技术报告](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11530)、[数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_1-data) 和 [训练教程](#training-tutorial)！\n* 2024年7月21日 🔥 **SpatialBot、SpatialQA 和 SpatialBench 已发布！** SpatialBot 是一个基于 Bunny 的具身化模型，通过理解和利用深度信息来理解空间关系。您可以在 [GitHub](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FSpatialBot) 上体验该模型、数据集和基准测试！\n* 2024年6月20日 🔥 **MMR 基准测试已发布！** 这是一个用于评估 MLLMs 理解能力及其对误导性问题鲁棒性的基准测试。请在 [GitHub](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FMultimodal-Robustness-Benchmark) 上查看 Bunny 的表现及更多详情！\n* 2024年6月1日 🔥 **Bunny-v1.1-Llama-3-8B-V 发布，支持 1152x1152 分辨率！** 该模型基于 SigLIP 和 Llama-3-8B-Instruct，并采用了 S$`^2`$-Wrapper 技术。更多详情请见 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V) 和 [wisemodel](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FBunny-v1.1-Llama-3-8B-V)! 🐰 [演示](http:\u002F\u002Fbunny.baai.ac.cn)\n\n* 2024年5月8日 **Bunny-v1.1-4B 发布，支持 1152x1152 分辨率！** 该模型基于 SigLIP 和 Phi-3-Mini-4K 3.8B，并采用了 S$`^2`$-Wrapper 技术。更多详情请见 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-4B)! 🐰 [演示](http:\u002F\u002Fbunny.baai.ac.cn)\n\n* 2024年5月1日 **Bunny-v1.0-4B 发布，这是一款基于 Phi-3 的视觉-语言模型！** 它基于 SigLIP 和 Phi-3-Mini-4K 3.8B 构建。更多详情请见 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-4B)! 
🤗 [GGUF 格式](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-4B-gguf)\n\n* 2024年4月21日 **Bunny-Llama-3-8B-V 发布，这是首个基于 Llama-3 的视觉-语言模型！** 它基于 SigLIP 和 Llama-3-8B-Instruct 构建。更多详情请见 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-Llama-3-8B-V)、[ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FBAAI\u002FBunny-Llama-3-8B-V) 和 [wisemodel](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FBunny-Llama-3-8B-V)! **GGUF** 格式已在 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-Llama-3-8B-V-gguf) 和 [wisemodel](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FBunny-Llama-3-8B-V-gguf) 上提供。\n\n* 2024年4月18日 **Bunny-v1.0-3B-zh 发布，该模型在英语和中文上都表现出色！** 它基于 SigLIP 和 MiniCPM-2B 构建。更多详情请见 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B-zh)、[ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FBAAI\u002FBunny-v1.0-3B-zh\u002Fsummary) 和 [wisemodel](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FBunny-v1.0-3B-zh)! 评估结果详见 [评估](#evaluation)。我们衷心感谢邵振伟先生的鼎力相助。\n\n* 2024年3月15日 **Bunny-v1.0-2B-zh 发布，该模型专注于中文！** 它基于 SigLIP 和 Qwen1.5-1.8B 构建。更多详情请见 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-2B-zh)、[ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FBAAI\u002FBunny-v1.0-2B-zh\u002Fsummary) 和 [wisemodel](https:\u002F\u002Fwisemodel.cn\u002Fmodels\u002FBAAI\u002FBunny-v1.0-2B-zh)! 
评估结果详见 [评估](#evaluation)。\n\n* 2024年3月6日 **Bunny 训练数据已发布！** 关于 Bunny-v1.0 数据的更多信息，请参阅 [HuggingFace](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_0-data) 或 [ModelScope](https:\u002F\u002Fwww.modelscope.cn\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1.0-data)！\n\n* 2024年2月20日 **Bunny 技术报告已完成！** 更多关于 Bunny 的详细信息，请参阅 [这里](https:\u002F\u002Farxiv.org\u002Fabs\u002F2402.11530)！\n\n* 2024年2月7日 **Bunny 正式发布！** Bunny-v1.0-3B 基于 SigLIP 和 Phi-2 构建，其性能不仅超越了同类规模的 MLLMs，甚至在与更大规模（7B）MLLMs 的比较中也表现出色，其性能甚至可以与 LLaVA-13B 相媲美！ 🤗 [Bunny-v1.0-3B](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B)\n\n## 快速入门\n\n### HuggingFace Transformers\n\n以下代码片段展示了如何使用 [Bunny-v1.1-Llama-3-8B-V](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V)、[Bunny-v1.1-4B](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-4B)、[Bunny-v1.0-3B](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B) 等模型与 HuggingFace Transformers 库进行交互。\n\n此代码片段仅适用于上述模型，因为我们**手动**将部分配置代码整合到一个文件中，以方便用户使用。例如，您可以查看 [Bunny 模型的源码中的 `modeling_bunny_llama.py`](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V\u002Fblob\u002Fmain\u002Fmodeling_bunny_llama.py) 和 [`configuration_bunny_llama.py`](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V\u002Fblob\u002Fmain\u002Fconfiguration_bunny_llama.py)，以及它们的相关部分，以了解其中的差异。对于其他模型（包括您自己训练的模型），我们**建议**通过安装 Bunny 的源代码来加载这些模型。或者，您可以将类似 `modeling_bunny_llama.py` 和 `configuration_bunny_llama.py` 的文件复制到您的模型目录，并修改 `config.json` 中的 `auto_map` 字段，但这样做无法保证完全正确，您可能还需要对部分代码进行调整以适配您的模型。\n\n在运行该代码片段之前，您需要安装以下依赖项：\n\n```shell\npip install torch transformers accelerate pillow\n```\n\n如果您的 CUDA 显存足够，可以通过设置 `CUDA_VISIBLE_DEVICES=0` 来加速执行。\n\n对于中国大陆地区的用户，建议使用 HuggingFace 的[镜像站点](https:\u002F\u002Fhf-mirror.com)。\n\n```python\nimport torch\nimport transformers\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom PIL import Image\nimport warnings\n\n# 
禁用部分警告信息\ntransformers.logging.set_verbosity_error()\ntransformers.logging.disable_progress_bar()\nwarnings.filterwarnings('ignore')\n\n# 设置设备\ndevice = 'cuda'  # 或 cpu\ntorch.set_default_device(device)\n\nmodel_name = 'BAAI\u002FBunny-v1_1-Llama-3-8B-V' # 或 'BAAI\u002FBunny-Llama-3-8B-V' 或 'BAAI\u002FBunny-v1_1-4B' 或 'BAAI\u002FBunny-v1_0-4B' 或 'BAAI\u002FBunny-v1_0-3B' 或 'BAAI\u002FBunny-v1_0-3B-zh' 或 'BAAI\u002FBunny-v1_0-2B-zh'\noffset_bos = 1 # 对于 Bunny-v1_1-Llama-3-8B-V、Bunny-Llama-3-8B-V、Bunny-v1_1-4B、Bunny-v1_0-4B 和 Bunny-v1_0-3B-zh 使用\n# 对于 Bunny-v1_0-3B 和 Bunny-v1_0-2B-zh，offset_bos = 0\n\n# 加载模型\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=torch.float16, # cpu 上使用 float32\n    device_map='auto',\n    trust_remote_code=True)\ntokenizer = AutoTokenizer.from_pretrained(\n    model_name,\n    trust_remote_code=True)\n\n# 文本提示\nprompt = '为什么这张图片很有趣？'\ntext = f\"一位好奇的用户与人工智能助手之间的对话。助手会针对用户的问题给出有帮助、详细且礼貌的回答。用户： \u003Cimage>\\n{prompt} 助手：\"\ntext_chunks = [tokenizer(chunk).input_ids for chunk in text.split('\u003Cimage>')]\ninput_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)\n\n# 图片，示例图片可在 https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V\u002Ftree\u002Fmain\u002Fimages 找到\nimage = Image.open('example_2.png')\nimage_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)\n\n# 生成回答\noutput_ids = model.generate(\n    input_ids,\n    images=image_tensor,\n    max_new_tokens=100,\n    use_cache=True,\n    repetition_penalty=1.0 # 可适当提高以避免重复输出\n)[0]\n\nprint(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())\n```\n\n### ModelScope\n\n我们建议尤其是中国大陆地区的用户使用 ModelScope。\n`snapshot_download` 工具可以帮助您解决下载检查点时遇到的问题。\n\n\u003Cdetails>\n\u003Csummary>展开查看代码片段\u003C\u002Fsummary>\n\n在运行该代码片段之前，您需要安装以下依赖项：\n\n```shell\npip install torch modelscope transformers accelerate 
pillow\n```\n如果您的 CUDA 显存足够，可以通过设置 `CUDA_VISIBLE_DEVICES=0` 来加速执行。\n\n```python\nimport torch\nimport transformers\nfrom modelscope import AutoTokenizer, AutoModelForCausalLM\nfrom modelscope.hub.snapshot_download import snapshot_download\nfrom PIL import Image\nimport warnings\n\n# 禁用部分警告信息\ntransformers.logging.set_verbosity_error()\ntransformers.logging.disable_progress_bar()\nwarnings.filterwarnings('ignore')\n\n# 设置设备\ndevice = 'cuda'  # 或 cpu\ntorch.set_default_device(device)\n\nmodel_name = 'BAAI\u002FBunny-Llama-3-8B-V' # 或 'BAAI\u002FBunny-v1.0-3B' 或 'BAAI\u002FBunny-v1.0-3B-zh' 或 'BAAI\u002FBunny-v1.0-2B-zh'\noffset_bos = 1 # 对于 Bunny-Llama-3-8B-V 和 Bunny-v1.0-3B-zh 使用\n# 对于 Bunny-v1.0-3B 和 Bunny-v1.0-2B-zh，offset_bos = 0\n\n# 加载模型\nsnapshot_download(model_id='thomas\u002Fsiglip-so400m-patch14-384')\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=torch.float16, # cpu 上使用 float32\n    device_map='auto',\n    trust_remote_code=True)\ntokenizer = AutoTokenizer.from_pretrained(\n    model_name,\n    trust_remote_code=True)\n\n# 文本提示\nprompt = '为什么这张图片很有趣？'\ntext = f\"一位好奇的用户与人工智能助手之间的对话。助手会针对用户的问题给出有帮助、详细且礼貌的回答。用户： \u003Cimage>\\n{prompt} 助手：\"\ntext_chunks = [tokenizer(chunk).input_ids for chunk in text.split('\u003Cimage>')]\ninput_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)\n\n# 图片，示例图片可在 https:\u002F\u002Fwww.modelscope.cn\u002Fmodels\u002FBAAI\u002FBunny-Llama-3-8B-V\u002Ffiles 找到\nimage = Image.open('example_2.png')\nimage_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)\n\n# 生成回答\noutput_ids = model.generate(\n    input_ids,\n    images=image_tensor,\n    max_new_tokens=100,\n    use_cache=True,\n    repetition_penalty=1.0 # 可适当提高以避免重复输出\n)[0]\n\nprint(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())\n```\n\n\u003C\u002Fdetails>\n\n## 模型库\n\n### 评估\n\n| 检查点                    
                               | MME$`^\\text{P}`$ | MME$`^\\text{C}`$ | MMB$`^{\\text{T}\u002F\\text{D}}`$ | MMB-CN$`^{\\text{T}\u002F \\text{D}}`$ | SEED(-IMG) | MMMU$`^{\\text{V}\u002F\\text{T}}`$ | VQA$`^\\text{v2}`$ | GQA  | SQA$`^\\text{I}`$ | POPE |\n| :----------------------------------------------------------- | :--------------: | :--------------: | :--------------: | :--------------: | :--------------: | :--: | :---------------: | :---------------: | :---------------: | :--: |\n| [bunny-phi-1.5-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-1.5-eva-lora) |      1213.7      |      278.9      |       60.9\u002F56.8       |       -       | 56.4\u002F64.1 | 30.0\u002F28.4 | 76.5 |       60.4       |       58.2       | 86.1 |\n| [bunny-stablelm-2-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-stablelm-2-eva-lora) |      1301.0      |      235.0       |       58.4\u002F56.4       |       -       | 55.3\u002F62.8 | 29.8\u002F29.4 | 74.6 |       56.7       |       60.0    | 84.8 |\n| [bunny-phi-2-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-2-eva-lora) |      1421.0      |      285.4      |       68.6\u002F67.4       |       -       | 62.2\u002F70.2 | 35.9\u002F32.6 | 78.9 |       62.3       |       69.1       | 87.1 |\n| [bunny-phi-1.5-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-1.5-siglip-lora) |      1230.0      |      237.5      |       61.2\u002F59.7       |       -       | 57.7\u002F65.3 | 30.0\u002F29.1 | 78.0 |       61.1       |       61.3       | 85.8 |\n| [bunny-stablelm-2-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-stablelm-2-siglip-lora) |      1366.8      |      236.1       |       65.1\u002F62.8       |       -       | 58.8\u002F67.5 | 29.9\u002F29.8 | 78.9 |       60.9       |       61.1    | 85.9 |\n| [Bunny-v1.0-2B-zh\u002Fbunny-qwen1.5-1.8b-siglip](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-2B-zh) |  
    1300.8      |      254.3      |       59.8\u002F59.1       |       59.5\u002F58.5       | 55.4\u002F62.3 | 34.4\u002F30.4 | 76.6 |       59.6       |       64.6       | 85.8 |\n| [Bunny-v1.0-3B-zh\u002Fbunny-minicpm-siglip](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B-zh) |      1410.4      |      281.4      |       66.1\u002F65.5       |       64.9\u002F63.6       | 59.6\u002F67.3 | 35.4\u002F32.4 | 78.6 |       60.8       |       68.7       | 86.5 |\n| [Bunny-v1.0-3B\u002Fbunny-phi-2-siglip](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B) |      1488.8      |      289.3      |       69.2\u002F68.6       |       -       | 62.5\u002F70.7 | 38.2\u002F33.0 | 79.8 |       62.5       |       70.9       | 86.8 |\n| [Bunny-v1.0-4B](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-4B) |      1495.2      |      338.9      |       74.0\u002F73.5       |       -       | 64.5\u002F72.1 | 40.1\u002F39.1 | 81.5 |       63.5       |       75.2       | 86.7 |\n| **[Bunny-v1.1-4B](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-4B)** |      1581.5      |      361.1      |       75.7\u002F74.2       |       66.5\u002F64.5       | 64.9\u002F72.5 | 41.4\u002F38.4 | 82.1 |       63.2       |       78.3       | 87.2 |\n| [Bunny-Llama-3-8B-V](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-Llama-3-8B-V) |      1588.9      |      321.1      |       77.2\u002F76.7       |       73.8\u002F72.3       | 65.9\u002F73.3 | 42.8\u002F39.0 | 82.6 |       64.8       |       80.4       | 86.9 |\n| **[Bunny-1.1-Llama-3-8B-V](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V)** |      1644.1      |      367.5      |       78.1\u002F77.2       |       74.3\u002F74.8       | 66.2\u002F73.5 | 43.3\u002F39.0 | 82.9 |       64.0       |       79.9       | 87.2 |\n\n性能最佳的小模型为 Bunny-v1.0-3B 或 bunny-phi-2-siglip，其合并权重可在此处找到 [链接](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B)，LoRA 权重则可在此处找到 
[链接](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002Fbunny-phi-2-siglip-lora)。\n\n我们还提供了两款专注于中文问答能力的模型，分别是 Bunny-v1.0-3B-zh（bunny-minicpm-siglip）和 Bunny-v1.0-2B-zh（bunny-qwen1.5-1.8b-siglip）。它们的合并权重分别可在以下链接找到：[Bunny-v1.0-3B-zh](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-3B-zh) 和 [Bunny-v1.0-2B-zh](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_0-2B-zh)。LoRA 权重则可在以下链接找到：[bunny-minicpm-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-minicpm-siglip-lora) 和 [bunny-qwen1.5-1.8b-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-qwen1.5-1.8b-siglip-lora)。\n\n### 训练教程\n\n| 检查点                                                   | 视觉编码器                                               | 大语言模型                                                          | 预训练权重                                             | 训练教程|\n| ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------  | ------------------------------------------------------------ |---|\n| [bunny-phi-1.5-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-1.5-eva-lora) | [EVA02_CLIP_L_336_psz14_s6B](https:\u002F\u002Fhuggingface.co\u002FQuanSun\u002FEVA-CLIP\u002Fblob\u002Fmain\u002FEVA02_CLIP_L_336_psz14_s6B.pt) | [microsoft\u002Fphi-1_5](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-1_5)     | [bunny-pretrain-phi-1.5-eva](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-phi-1.5-eva) |[链接](script\u002Ftrain\u002Ftutorials\u002Fbunny-phi-1.5-eva-lora.md)     |\n| [bunny-stablelm-2-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-stablelm-2-eva-lora) | [EVA02_CLIP_L_336_psz14_s6B](https:\u002F\u002Fhuggingface.co\u002FQuanSun\u002FEVA-CLIP\u002Fblob\u002Fmain\u002FEVA02_CLIP_L_336_psz14_s6B.pt) | 
[stabilityai\u002Fstablelm-2-1_6b](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstablelm-2-1_6b)     | [bunny-pretrain-stablelm-2-eva](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-stablelm-2-eva) |[链接](script\u002Ftrain\u002Ftutorials\u002Fbunny-stablelm-2-eva-lora.md)     |\n| [bunny-phi-2-eva-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-2-eva-lora) | [EVA02_CLIP_L_336_psz14_s6B](https:\u002F\u002Fhuggingface.co\u002FQuanSun\u002FEVA-CLIP\u002Fblob\u002Fmain\u002FEVA02_CLIP_L_336_psz14_s6B.pt) | [microsoft\u002Fphi-2](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-2)        | [bunny-pretrain-phi-2-eva](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-phi-2-eva) |[链接](script\u002Ftrain\u002Ftutorials\u002Fbunny-phi-2-eva-lora.md)     |\n| [bunny-phi-1.5-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-phi-1.5-siglip-lora) | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [microsoft\u002Fphi-1_5](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-1_5)     | [bunny-pretrain-phi-1.5-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-phi-1.5-siglip) |[链接](script\u002Ftrain\u002Ftutorials\u002Fbunny-phi-1.5-siglip-lora.md)     |\n| [bunny-stablelm-2-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-stablelm-2-siglip-lora) | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [stabilityai\u002Fstablelm-2-1_6b](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstablelm-2-1_6b)      | [bunny-pretrain-stablelm-2-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-stablelm-2-siglip) |[链接](script\u002Ftrain\u002Ftutorials\u002Fbunny-stablelm-2-siglip-lora.md)     |\n| [bunny-qwen1.5-1.8b-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-qwen1.5-1.8b-siglip-lora) 
| [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [Qwen\u002FQwen1.5-1.8B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen1.5-1.8B)     | [bunny-pretrain-qwen1.5-1.8b-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-qwen1.5-1.8b-siglip) |[链接](script\u002Ftrain\u002Ftutorials\u002Fbunny-qwen1.5-1.8b-siglip-lora.md)     |\n| [bunny-minicpm-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-minicpm-siglip-lora) | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [openbmb\u002FMiniCPM-2B-history](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-history) (step 280000)    | [bunny-pretrain-minicpm-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-minicpm-siglip) |[链接](script\u002Ftrain\u002Ftutorials\u002Fbunny-minicpm-siglip-lora.md)     |\n| [bunny-phi-2-siglip-lora](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002Fbunny-phi-2-siglip-lora) | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [microsoft\u002Fphi-2](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-2)        | [bunny-pretrain-phi-2-siglip](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002Fbunny-pretrain-phi-2-siglip) |[链接](script\u002Ftrain\u002Ftutorials\u002Fbunny-phi-2-siglip-lora.md)     |\n| Bunny-v1.0-4B                                            | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [microsoft\u002FPhi-3-mini-4k-instruct](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-3-mini-4k-instruct)     | [bunny-pretrain-phi-3-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-phi-3-siglip) |[链接](script\u002Ftrain\u002Ftutorials\u002FBunny-v1.0-4B.md)     |\n| **Bunny-v1.1-4B**                                            | 
[siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [microsoft\u002FPhi-3-mini-4k-instruct](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-3-mini-4k-instruct)     | [bunny-pretrain-phi-3-siglip-s2](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-phi-3-siglip-s2) |[链接](script\u002Ftrain\u002Ftutorials\u002FBunny-v1.1-4B.md)     |\n| Bunny-Llama-3-8B-V                                       | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [meta-llama\u002FMeta-Llama-3-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct)     | [bunny-pretrain-llama3-8b-siglip](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-llama3-8b-siglip) |[链接](script\u002Ftrain\u002Ftutorials\u002FBunny-Llama-3-8B-V.md)     |\n| **Bunny-v1.1-Llama-3-8B-V**                                       | [siglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) | [meta-llama\u002FMeta-Llama-3-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct)     | [bunny-pretrain-llama3-8b-siglip-s2](https:\u002F\u002Fhuggingface.co\u002FBoyaWu10\u002Fbunny-pretrain-llama3-8b-siglip-s2) |[链接](script\u002Ftrain\u002Ftutorials\u002FBunny-v1.1-Llama-3-8B-V.md)     |\n\n## 安装\n您可以从我们的 Docker 镜像开始，也可以在本地自行安装。\n\n### 从我们的 Docker 镜像开始\n直接使用我们配置好的 Docker 镜像启动：`docker pull russellrobin\u002Fbunny:latest`。\n\n\u003Cdetails>\n\u003Csummary>展开查看如何保持代码最新。\u003C\u002Fsummary>\n尽管我们的 Docker 镜像会定期维护，但本地的 Bunny 代码并不一定能与我们的 GitHub 仓库保持同步。\n您可能需要：\n\n1. 在运行中的容器中执行 `pip install --upgrade transformers && cd Bunny`，\n\n2. 设置默认的 GitHub 用户信息：`git config user.email \"you@example.com\" && git config user.name \"Your Name\"`，\n\n3. 使用 `git pull` 更新 Bunny 的本地代码。\n\n4. 
运行 `pip install -e .`\n\n这样就准备好了！\n\u003C\u002Fdetails>\n\n### 本地安装\n* CUDA 和 cuDNN\n\n  我们使用 CUDA 11.8 和 cuDNN 8.7.0。我们实际上使用 NVIDIA 提供的 CUDA Docker 镜像：`docker pull nvcr.io\u002Fnvidia\u002Fcuda:11.8.0-cudnn8-devel-ubuntu20.04`。CUDA 12 也可以。\n\n* 创建并激活一个 conda 虚拟环境：\n\n  ```shell\n  conda create -n bunny python=3.10\n  conda activate bunny\n  ```\n\n* 基本依赖\n\n  ```shell\n  pip install --upgrade pip  # 启用 PEP 660 支持\n  pip install transformers\n  pip install torch torchvision xformers --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\n  ```\n\n* 安装 apex\n\n  ```shell\n  # https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fapex#from-source\n  pip install ninja\n  git clone https:\u002F\u002Fgithub.com\u002FNVIDIA\u002Fapex\n  cd apex\n  # 如果 pip >= 23.1（参考：https:\u002F\u002Fpip.pypa.io\u002Fen\u002Fstable\u002Fnews\u002F#v23-1），支持多个具有相同键的 `--config-settings`...\n  pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --config-settings \"--build-option=--cpp_ext\" --config-settings \"--build-option=--cuda_ext\" .\u002F\n  # 否则\n  pip install -v --disable-pip-version-check --no-cache-dir --no-build-isolation --global-option=\"--cpp_ext\" --global-option=\"--cuda_ext\" .\u002F\n  ```\n\n* 安装 flash-attention\n\n  ```shell\n  # https:\u002F\u002Fgithub.com\u002FDao-AILab\u002Fflash-attention?tab=readme-ov-file#installation-and-features\n  pip install packaging\n  pip install flash-attn --no-build-isolation\n  ```\n\n* 安装 bunny 及其他依赖\n\n  ```shell\n  git clone https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny.git\n  cd Bunny\n  pip install -e .\n  ```\n\n## 训练\nBunny 的训练分为两个阶段：(1) 预训练阶段：使用数据将一个 *冻结的预训练* 视觉编码器与一个 *冻结的* LLM 连接起来，仅训练连接部分；(2) 视觉指令微调阶段：使用数据教会模型遵循多模态指令，此时连接部分、可学习的 LLM 参数以及视觉编码器（可选）都会被更新。\n\nBunny 在 8 张 A100 GPU 上进行训练。在其他情况下，您可以相应地减少 `per_device_train_batch_size` 并增加 `gradient_accumulation_steps`。始终保持全局批次大小不变：`global_batch_size` = `per_device_train_batch_size` $\\times$ `gradient_accumulation_steps` $\\times$ 
`num_gpus`。\n\n### 支持的模型\n目前，我们支持几种视觉编码器和 LLM。\n\n对于视觉编码器，我们支持 CLIP、EVA-CLIP 和 SigLIP。\n\n| 视觉编码器            | 下载链接                                                |\n| -------------------------- | ------------------------------------------------------------ |\n| clip-vit-large-patch14-336 | [openai\u002Fclip-vit-large-patch14-336](https:\u002F\u002Fhuggingface.co\u002Fopenai\u002Fclip-vit-large-patch14-336) |\n| EVA02_CLIP_L_336_psz14_s6B | [QuanSun\u002FEVA-CLIP](https:\u002F\u002Fhuggingface.co\u002FQuanSun\u002FEVA-CLIP\u002Fblob\u002Fmain\u002FEVA02_CLIP_L_336_psz14_s6B.pt) |\n| siglip-so400m-patch14-384  | [google\u002Fsiglip-so400m-patch14-384](https:\u002F\u002Fhuggingface.co\u002Fgoogle\u002Fsiglip-so400m-patch14-384) |\n\n对于 LLM，我们支持 phi-1.5、stablelm-2、qwen1.5-1.8b、minicpm、phi-2、phi-3 和 llama3-8b。\n\n| MODEL_TYPE | LLM             | 下载链接                                                |\n| ---------- | --------------- | ------------------------------------------------------------ |\n| phi-1.5    | phi-1_5     | [microsoft\u002Fphi-1_5](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-1_5) |\n| stablelm-2 | stablelm-2-1_6b | [stabilityai\u002Fstablelm-2-1_6b](https:\u002F\u002Fhuggingface.co\u002Fstabilityai\u002Fstablelm-2-1_6b) |\n| qwen1.5-1.8b | Qwen1.5-1.8B | [Qwen\u002FQwen1.5-1.8B](https:\u002F\u002Fhuggingface.co\u002FQwen\u002FQwen1.5-1.8B) |\n| minicpm | MiniCPM-2B | [openbmb\u002FMiniCPM-2B-history](https:\u002F\u002Fhuggingface.co\u002Fopenbmb\u002FMiniCPM-2B-history)（第 280000 步） |\n| phi-2 | phi-2 | [microsoft\u002Fphi-2](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002Fphi-2) |\n| phi-3 | Phi-3-mini-4k-instruct | [microsoft\u002FPhi-3-mini-4k-instruct](https:\u002F\u002Fhuggingface.co\u002Fmicrosoft\u002FPhi-3-mini-4k-instruct) |\n| llama3-8b | Meta-Llama-3-8B-Instruct | [meta-llama\u002FMeta-Llama-3-8B-Instruct](https:\u002F\u002Fhuggingface.co\u002Fmeta-llama\u002FMeta-Llama-3-8B-Instruct) 
|\n\n请注意，上述模型有许多变体。\n我们基于上述确切版本构建并测试代码。\n未来还将支持更多模型！\n\n### 预训练\n* 数据准备\n\n  我们使用由 [这项工作](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FDataset-Pruning\u002Ftree\u002Fmain\u002FLAION) 构建的高质量 coreset，其中重复较少且信息量更丰富，来自 LAION-2B 数据集。我们从该 coreset 中随机抽取 200 万对图像-文本，并将其转换为训练格式。\n\n  数据集可在 [这里](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_1-data) 获取。\n\n* 运行\n\n  将 `--model_name_or_path` 和 `--vision_tower` 分别更新为 LLM 和视觉编码器的路径。相应地更新 `MODEL_TYPE` 和 `OUTPUT_DIR`。全局批次大小为 256。如果添加了 `--use_s2 True`，则会启用 S$^2$-Wrapper。\n  \n  您可以参考我们在 [训练教程](#training-tutorial) 中的实验设置。\n  \n  ```shell\n  sh script\u002Ftrain\u002Fpretrain.sh\n  ```\n\n### 视觉指令微调\n* 数据准备\n\n  我们通过修改 [SVIT-mix-665K](https:\u002F\u002Farxiv.org\u002Fabs\u002F2307.04087) 来构建 Bunny-695K，用于微调。然后我们将它与 LLaVA-665K 和 ALLaVA-Instruct-4V 结合，得到 Bunny-LLaVA-1.4M、Bunny-ALLaVA-1.3M 和 Bunny-LLaVA-ALLaVA-2M。\n\n  数据集可在 [这里](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_1-data) 获取。如果您只想使用 Bunny-695K 及其相关图像，可以直接从 [这里](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_0-data) 下载。\n\n* 运行\n\n  将 `--model_name_or_path` 和 `--vision_tower` 分别更新为 LLM 和视觉编码器的路径。相应地更新 `MODEL_TYPE`、`PRETRAIN_DIR` 和 `OUTPUT_DIR`。全局批次大小为 128。对于 `MODEL_TYPE = minicpm\u002Fphi-3\u002Fllama3-8b`，还需将 `--version` 更改为 `minicpm\u002Fphi3\u002Fllama`。如果添加了 `--use_s2 True`，则会启用 S$^2$-Wrapper。如果添加了 `--unfreeze_vision_tower True`，视觉编码器也会被微调。\n  \n  我们正在探索更好的策略，包括更多的视觉指令微调数据、S$^2$-Wrapper、可训练的视觉编码器、权重合并等。您可参考我们在 [训练教程](#training-tutorial) 中的实验设置。\n  \n  ```shell\n  # 全参数微调\n  sh script\u002Ftrain\u002Ffinetune_full.sh\n  \n  # LoRA 微调\n  sh script\u002Ftrain\u002Ffinetune_lora.sh\n  ```\n\n### 持续微调\n\n如果您希望在您的数据上持续微调我们发布的Bunny模型，或针对特定任务进行适配，\n\n\u003Cdetails>\n\u003Csummary>展开查看说明。\u003C\u002Fsummary>\n\n\n1. 
准备数据：将您的数据转换为一个包含所有样本的`JSON`文件，格式与[Bunny-695K](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FBoyaWu10\u002FBunny-v1_0-data\u002Fblob\u002Fmain\u002Ffinetune\u002Fbunny_695k.json)类似。\n\n2. 准备模型：\n\n   * 下载Bunny [模型](#model-zoo)，如果仅提供了LoRA权重，则需将LoRA权重与基础LLM合并：\n\n     ```shell\n     python script\u002Fmerge_lora_weights.py \\\n       --model-path \u002Fpath\u002Fto\u002Fbunny_lora_weights \\\n       --model-base \u002Fpath\u002Fto\u002Fbase_llm_model \\\n       --model-type phi-2 (或 stablelm-2 或 phi-1.5 或 qwen1.5-1.8b 或 minicpm 或 phi-3 或 llama3-8b) \\\n       --save-model-path \u002Fpath\u002Fto\u002Fmerged_model\n     ```\n   * 在`\u002Fpath\u002Fto\u002Fmerged_model\u002Fconfig.json`中添加`\"continuous_training\": true`，以确保从合并后的权重中加载视觉塔。\n   \n\n\n3. 编辑脚本：可以使用`finetune_full.sh`或`finetune_lora.sh`，但在使用前：\n\n   * 将`--model_name_or_path`改为`\u002Fpath\u002Fto\u002Fmerged_model`\n\n   * 删除`--pretrain_mm_mlp_adapter`，因为我们已从合并后的权重中加载跨模态投影器。\n\n   * 自定义超参数，例如学习率，以适应您的数据集。\n\n   * 对于`MODEL_TYPE = minicpm\u002Fphi-3\u002Fllama3-8b`，还需将`--version`改为`minicpm\u002Fphi3\u002Fllama`。如果添加`--use_s2 True`，则会启用S$^2$-Wrapper。如果添加`--unfreeze_vision_tower True`，则会微调视觉编码器。\n   \n\n**注意**：如果您使用LoRA持续微调Bunny模型，在加载时`--model-base`应为Bunny模型，而非原始LLM。\n\u003C\u002Fdetails>\n\n## 演示\n\n### Gradio Web UI\n\n* 启动控制器\n\n  首先启动控制器。该服务负责协调Web服务器和模型工作节点之间的通信。\n\n  ```shell\n  python -m bunny.serve.controller \\\n  \t--host 0.0.0.0 \\\n  \t--port 10000\n  ```\n\n* 启动Gradio Web服务器\n\n  为了通过Web界面与模型交互，需要启动Gradio Web服务器。\n\n  基本启动命令如下：\n\n  ```shell\n  python -m bunny.serve.gradio_web_server \\\n  \t--controller http:\u002F\u002Flocalhost:10000 \\\n  \t--model-list-mode reload\n  ```\n\n  如果您想与他人共享您的Web服务器，可以使用`--share`选项。请注意，`frpc_linux_amd64_v0.2`可能缺失，您可以按照屏幕上打印的说明修复，并使该文件可执行。\n\n  ```shell\n  python -m bunny.serve.gradio_web_server \\\n  \t--controller http:\u002F\u002Flocalhost:10000 \\\n  \t--model-list-mode reload \\\n  \t--share\n  ```\n\n  
现在，您可以使用**屏幕上打印的URL**打开Web界面。您可能会注意到模型列表中还没有任何模型。不用担心，因为我们尚未启动任何模型工作节点。当您启动一个模型工作节点时，模型列表会自动更新。\n\n* 启动模型工作节点\n\n  模型工作节点负责处理模型推理。请为每个工作节点配置合适的模型并启动它。请注意检查`conv_mode`是否正确[在此处](https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny\u002Fblob\u002Fmain\u002Fbunny\u002Fserve\u002Fgradio_web_server.py#L194)，这取决于模型的名称（路径）。\n\n  * 对于全参数微调模型\n\n      ```shell\n      python -m bunny.serve.model_worker \\\n        --host 0.0.0.0 \\\n        --controller http:\u002F\u002Flocalhost:10000 \\\n        --port 40000 \\\n        --worker http:\u002F\u002Flocalhost:40000 \\\n        --model-path \u002Fpath\u002Fto\u002Fbunny\u002Fmodel \\\n        --model-type phi-2 (或 stablelm-2 或 phi-1.5 或 qwen1.5-1.8b 或 minicpm 或 phi-3 或 llama3-8b)\n      ```\n\n  * 对于LoRA微调模型\n\n      您可以使用`script\u002Fmerge_lora_weights.py`将LoRA权重与基础LLM合并，然后按上述方法使用。\n\n      ```shell\n      python script\u002Fmerge_lora_weights.py \\\n        --model-path \u002Fpath\u002Fto\u002Fbunny_lora_weights \\\n        --model-base \u002Fpath\u002Fto\u002Fbase_llm_model \\\n        --model-type phi-2 (或 stablelm-2 或 phi-1.5 或 qwen1.5-1.8b 或 minicpm 或 phi-3 或 llama3-8b) \\\n        --save-model-path \u002Fpath\u002Fto\u002Fmerged_model\n      ```\n      或者您也可以不进行合并，直接使用以下命令：\n\n      ```shell\n      python -m bunny.serve.model_worker \\\n        --host 0.0.0.0 \\\n        --controller http:\u002F\u002Flocalhost:10000 \\\n        --port 40000 \\\n        --worker http:\u002F\u002Flocalhost:40000 \\\n        --model-path \u002Fpath\u002Fto\u002Fbunny_lora_weights \\\n        --model-base \u002Fpath\u002Fto\u002Fbase_llm_model \\\n        --model-type phi-2 (或 stablelm-2 或 phi-1.5 或 qwen1.5-1.8b 或 minicpm 或 phi-3 或 llama3-8b)\n      ```\n\n\n### CLI推理（无需Gradio界面）\n\n对于不使用Gradio界面的CLI推理，可以使用以下命令：\n\n* 对于全参数微调模型\n\n  ```shell\n  python -m bunny.serve.cli \\\n  \t--model-path \u002Fpath\u002Fto\u002Fbunny\u002Fmodel \\\n  \t--model-type phi-2 (或 stablelm-2 或 phi-1.5 或 qwen1.5-1.8b 或 minicpm 或 phi-3 或 llama3-8b) \\\n  \t--image-file 
\u002Fpath\u002Fto\u002Fthe\u002Ftest\u002Fimage \\\n  \t--conv-mode bunny (对于 model-type = minicpm\u002Fphi-3\u002Fllama3-8b, 改为 minicpm\u002Fphi3\u002Fllama)\n  ```\n\n* 对于LoRA微调模型\n\n  您可以使用`script\u002Fmerge_lora_weights.py`将LoRA权重与基础LLM合并，然后按上述方法使用。\n\n  ```shell\n  python script\u002Fmerge_lora_weights.py \\\n  \t--model-path \u002Fpath\u002Fto\u002Fbunny_lora_weights \\\n  \t--model-base \u002Fpath\u002Fto\u002Fbase_llm_model \\\n  \t--model-type phi-2 (或 stablelm-2 或 phi-1.5 或 qwen1.5-1.8b 或 minicpm 或 phi-3 或 llama3-8b) \\\n  \t--save-model-path \u002Fpath\u002Fto\u002Fmerged_model\n  ```\n\n  或者您也可以不进行合并，直接使用以下命令：\n\n  ```shell\n  python -m bunny.serve.cli \\\n  \t--model-path \u002Fpath\u002Fto\u002Fbunny_lora_weights \\\n  \t--model-base \u002Fpath\u002Fto\u002Fbase_llm_model \\\n  \t--model-type phi-2 (或 stablelm-2 或 phi-1.5 或 qwen1.5-1.8b 或 minicpm 或 phi-3 或 llama3-8b) \\\n  \t--image-file \u002Fpath\u002Fto\u002Fthe\u002Ftest\u002Fimage \\\n  \t--conv-mode bunny (对于 model-type = minicpm\u002Fphi-3\u002Fllama3-8b, 改为 minicpm\u002Fphi3\u002Fllama)\n  ```\n\n您还可以控制`temperature`、`repetition-penalty`和`max-new-tokens`。\n\n## 评估\n\n对于全参数微调模型，请参阅[evaluation_full.md](script\u002Feval\u002Ffull\u002Fevaluation_full.md)。\n\n对于LoRA微调模型，请参阅[evaluation_lora.md](script\u002Feval\u002Flora\u002Fevaluation_lora.md)。\n\n## 引用\n如果您觉得本仓库对您有帮助，请引用以下论文：\n\n```bibtex\n@article{he2024bunny,\n      title={Efficient Multimodal Learning from Data-centric Perspective}, \n      author={He, Muyang and Liu, Yexin and Wu, Boya and Yuan, Jianhao and Wang, Yueze and Huang, Tiejun and Zhao, Bo},\n      journal={arXiv preprint arXiv:2402.11530},\n      year={2024}\n}\n```\n\n## 许可证\n本项目使用了某些数据集和检查点，这些资源受其各自原始许可证的约束。用户必须遵守所有这些原始许可证的条款和条件。\n本项目的具体内容则采用[Apache许可证2.0](.\u002FLICENSE)授权。\n\n## 致谢\n我们的项目基于 [LLaVA](https:\u002F\u002Fgithub.com\u002Fhaotian-liu\u002FLLaVA)（Large Language and Vision Assistant，大型语言和视觉助手）构建。","# Bunny 快速上手指南\n\nBunny 是一个轻量级但功能强大的多模态模型家族，支持多种视觉编码器（如 EVA-CLIP, SigLIP）和语言骨干网络（如 Llama-3, Phi-3, Qwen1.5 等）。最新 
v1.1 版本支持高达 1152x1152 的高分辨率图像输入。\n\n## 环境准备\n\n*   **系统要求**：Linux 或 macOS，推荐配备 NVIDIA GPU（CUDA 支持）以获得最佳推理速度。\n*   **Python 版本**：建议 Python 3.8+。\n*   **前置依赖**：`torch`, `transformers`, `accelerate`, `pillow`。\n*   **国内加速**：中国大陆用户推荐使用 [HuggingFace 镜像站](https:\u002F\u002Fhf-mirror.com) 或 [ModelScope (魔搭)](https:\u002F\u002Fmodelscope.cn) 下载模型和依赖。\n\n## 安装步骤\n\n### 方案一：使用 HuggingFace Transformers（通用）\n\n安装必要的 Python 库：\n\n```shell\npip install torch transformers accelerate pillow\n```\n\n> **提示**：若下载速度慢，可设置环境变量使用镜像：\n> `export HF_ENDPOINT=https:\u002F\u002Fhf-mirror.com`\n\n### 方案二：使用 ModelScope（推荐国内用户）\n\nModelScope 提供了更稳定的国内下载链路和 `snapshot_download` 工具来解决大模型文件下载问题。\n\n```shell\npip install torch modelscope transformers accelerate pillow\n```\n\n## 基本使用\n\n以下示例展示如何加载 **Bunny-v1.1-Llama-3-8B-V** 模型并进行图文对话。请确保当前目录下有一张名为 `example_2.png` 的图片，或修改代码中的图片路径。\n\n### 代码示例 (基于 HuggingFace\u002FModelScope 通用逻辑)\n\n```python\nimport torch\nimport transformers\nfrom transformers import AutoModelForCausalLM, AutoTokenizer\nfrom PIL import Image\nimport warnings\n\n# 禁用部分警告以保持输出整洁\ntransformers.logging.set_verbosity_error()\ntransformers.logging.disable_progress_bar()\nwarnings.filterwarnings('ignore')\n\n# 设置设备 (cuda 或 cpu)\ndevice = 'cuda'\ntorch.set_default_device(device)\n\n# 选择模型名称\n# 可选模型: \n# 'BAAI\u002FBunny-v1_1-Llama-3-8B-V', 'BAAI\u002FBunny-v1_1-4B', \n# 'BAAI\u002FBunny-v1_0-4B', 'BAAI\u002FBunny-v1_0-3B-zh', 'BAAI\u002FBunny-v1_0-2B-zh'\nmodel_name = 'BAAI\u002FBunny-v1_1-Llama-3-8B-V'\n\n# 注意：不同版本的 offset_bos 设置不同\n# v1.1-Llama-3-8B-V, v1.1-4B, v1.0-4B, v1.0-3B-zh 设置为 1\n# v1.0-3B, v1.0-2B-zh 设置为 0\noffset_bos = 1 \n\n# 加载模型和分词器\n# 国内用户若使用 ModelScope，可将 AutoModelForCausalLM 和 AutoTokenizer 替换为 modelscope 对应的导入\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=torch.float16, # CPU 运行时请改为 float32\n    device_map='auto',\n    trust_remote_code=True)\n\ntokenizer = AutoTokenizer.from_pretrained(\n    model_name,\n    
trust_remote_code=True)\n\n# 构建提示词\nprompt = 'Why is the image funny?'\ntext = f\"A chat between a curious user and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions. USER: \u003Cimage>\\n{prompt} ASSISTANT:\"\n\n# 处理文本输入\ntext_chunks = [tokenizer(chunk).input_ids for chunk in text.split('\u003Cimage>')]\ninput_ids = torch.tensor(text_chunks[0] + [-200] + text_chunks[1][offset_bos:], dtype=torch.long).unsqueeze(0).to(device)\n\n# 加载并处理图像\n# 示例图片可从模型主页下载: https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V\u002Ftree\u002Fmain\u002Fimages\nimage = Image.open('example_2.png')\nimage_tensor = model.process_images([image], model.config).to(dtype=model.dtype, device=device)\n\n# 生成回答\noutput_ids = model.generate(\n    input_ids,\n    images=image_tensor,\n    max_new_tokens=100,\n    use_cache=True,\n    repetition_penalty=1.0 # 增大此值可减少重复啰嗦\n)[0]\n\n# 输出结果\nprint(tokenizer.decode(output_ids[input_ids.shape[1]:], skip_special_tokens=True).strip())\n```\n\n### 针对国内用户的 ModelScope 特别说明\n\n如果您选择使用 ModelScope 加载模型，建议在代码开头添加以下下载逻辑以确保模型文件完整：\n\n```python\nfrom modelscope.hub.snapshot_download import snapshot_download\n# 预下载依赖的视觉编码器（如需）和主模型\n# snapshot_download(model_id='thomas\u002Fsiglip-so400m-patch14-384') \n# snapshot_download(model_id='BAAI\u002FBunny-v1_1-Llama-3-8B-V')\n```\n\n然后使用 `modelscope` 包中的 `AutoTokenizer` 和 `AutoModelForCausalLM` 替换上述代码中的 `transformers` 对应类即可，其余推理逻辑保持一致。","一家初创电商团队希望在其移动端 App 中集成“拍照搜同款”功能，让用户能通过拍摄商品照片快速获取详细参数和购买链接。\n\n### 没有 Bunny 时\n- **部署成本高昂**：主流多模态模型参数量巨大（如 13B+），需要昂贵的 GPU 服务器支撑，远超初创团队的预算。\n- **响应延迟严重**：大模型推理速度慢，用户拍照后需等待数秒才能看到结果，严重影响购物体验。\n- **细节识别不足**：普通轻量模型无法处理高分辨率图片，难以识别商品标签上的微小文字或复杂纹理，导致搜索不准。\n- **端侧运行困难**：由于模型体积过大，无法量化压缩至适合手机或边缘设备运行的格式，必须依赖云端传输，增加带宽压力。\n\n### 使用 Bunny 后\n- **轻量化低成本部署**：利用 Bunny-4B 或 Bunny-Llama-3-8B-V 等轻量级模型，团队仅需消费级显卡甚至边缘设备即可流畅运行，大幅降低算力成本。\n- **实时交互体验**：Bunny 凭借高效的架构设计实现毫秒级推理，用户拍照瞬间即可获得反馈，流程丝滑无卡顿。\n- **高清细节洞察**：借助 Bunny v1.1 版本支持的 
1152x1152 高分辨率输入能力，模型能精准读取商品包装上的细小文字和材质细节，显著提升搜索准确率。\n- **灵活的端云协同**：通过 GGUF 量化版本，Bunny 可直接部署在用户手机端离线运行，既保护隐私又节省服务器带宽。\n\nBunny 以轻量级的身体承载了强大的多模态理解力，让资源有限的团队也能轻松落地高精度的视觉 AI 应用。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FBAAI-DCAI_Bunny_a36fcb4b.png","BAAI-DCAI",null,"https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FBAAI-DCAI_919738ee.png","Data-centric AI Group @ Beijing Academy of Artificial Intelligence.","https:\u002F\u002Fgithub.com\u002FBAAI-DCAI",[20,24],{"name":21,"color":22,"percentage":23},"Python","#3572A5",97.7,{"name":25,"color":26,"percentage":27},"Shell","#89e051",2.3,1053,77,"2026-04-03T09:27:39","Apache-2.0",3,"Linux, macOS, Windows","非必需（支持 CPU），但推荐使用 NVIDIA GPU 以加速。显存需求取决于模型大小：Bunny-4B 约需 8GB+，Bunny-Llama-3-8B-V 约需 16GB+（FP16）。未明确指定 CUDA 版本，通常需匹配 PyTorch 版本（建议 CUDA 11.8 或 12.1+）。","未说明（建议至少 16GB，运行 8B 模型推荐 32GB+）",{"notes":37,"python":38,"dependencies":39},"1. 代码示例中明确指定设备可为 'cuda' 或 'cpu'，CPU 运行时需使用 float32 精度。\n2. 部分模型（如 Llama-3 系列）加载时需设置 trust_remote_code=True。\n3. 中国大陆用户建议使用 ModelScope 或 HuggingFace 镜像站下载模型。\n4. Bunny-v1.1 系列支持高达 1152x1152 分辨率的图像输入。\n5. 不同版本的模型在处理文本前缀时可能需要不同的 offset_bos 参数（0 或 1）。","未说明",[40,41,42,43,44],"torch","transformers","accelerate","pillow","modelscope",[46,47],"其他","语言模型",[49,50,51,52,53,54,55],"mllm","chatgpt","gpt-4","multimodal-large-language-models","vlm","chinese","english",2,"ready","2026-03-27T02:49:30.150509","2026-04-07T22:59:45.600235",[61,66,71,76,81,86,91],{"id":62,"question_zh":63,"answer_zh":64,"source_url":65},22795,"运行 model_worker 时遇到路径错误或模型加载失败怎么办？","请确保执行以下两步操作：\n1. `--model-path` 参数必须指向模型的**绝对路径**。\n2. 
打开模型目录下的 `config.json` 文件，将 `mm_vision_tower` 字段的值修改为 SigLIP 模型的**绝对路径**。\n如果仍然报错，请检查配置文件中的路径是否正确无误。","https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny\u002Fissues\u002F55",{"id":67,"question_zh":68,"answer_zh":69,"source_url":70},22796,"使用 H100 GPU 进行推理时出现乱码或特殊符号（如 !!!!!），但在 A100 上正常，如何解决？","这通常与数据类型精度有关。如果在 H100 显卡上推理出现问题，尝试在代码中显式指定数据类型为 `torch.bfloat16`。例如：\n`dtype=torch.bfloat16`\nA100 和 H100 对不同精度的支持表现可能不同，切换精度通常能解决此类兼容性问题。","https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny\u002Fissues\u002F60",{"id":72,"question_zh":73,"answer_zh":74,"source_url":75},22797,"微调后模型出现灾难性遗忘（只回答训练数据内容，丢失通用常识）或生成重复，有什么建议？","针对微调后的性能下降或异常行为，建议检查以下几点：\n1. **停止符设置**：检查 `eos_token` 是否配置正确，防止模型无法正确停止生成。\n2. **学习率与损失曲线**：检查学习率设置是否过大，并观察 `log.txt` 中的损失曲线是否收敛正常。\n3. **数据集质量**：仔细检查微调数据集的格式是否符合 MLLM 训练要求，低质量或不恰当的数据引入会导致模型性能大幅下降。\n4. **命令行测试**：尝试在 CLI 环境中加载模型进行测试，排除代码调用问题。\n5. **提示词敏感性**：部分模型（如基于 Qwen 的版本）对提示词较敏感，可尝试调整 System Prompt 或指令措辞。","https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny\u002Fissues\u002F42",{"id":77,"question_zh":78,"answer_zh":79,"source_url":80},22798,"模型生成内容出现重复循环，应该如何调整参数？","生成重复可能与提示词（Prompt）敏感性或采样参数有关。\n1. **调整提示词**：某些模型（特别是未针对中文指令微调的版本）对特定提示词敏感，尝试修改 Prompt 措辞（例如从“详细描述”改为其他表达）。\n2. **检查系统提示**：确认训练时的 System Prompt 是否在推理时被正确应用。\n3. **生成参数**：虽然默认参数通常有效，但如果问题持续，可尝试调整 `temperature` 或 `top_p` 等采样参数以增加多样性。","https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny\u002Fissues\u002F14",{"id":82,"question_zh":83,"answer_zh":84,"source_url":85},22799,"预训练使用的是 LLaMA 3 还是 LLaMA 3-Instruct 版本？","项目已发布支持更高分辨率（1152x1152）的 [Bunny-v1.1-Llama-3-8B-V](https:\u002F\u002Fhuggingface.co\u002FBAAI\u002FBunny-v1_1-Llama-3-8B-V) 模型。关于具体的基座模型细节，建议参考最新发布的模型卡片或直接使用最新版本以获得最佳效果。","https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny\u002Fissues\u002F47",{"id":87,"question_zh":88,"answer_zh":89,"source_url":90},22800,"连续微调（Continuous Fine-tuning）与一次性多指令微调相比，哪种效果更好？","这取决于数据的种类和比例，没有简单的通用原则。实验表明：\n1. **连续微调风险**：先微调任务 A 再微调任务 B，可能导致任务 B 的效果不如直接微调，且任务 A 的能力会基本丧失（灾难性遗忘）。\n2. 
**混合微调优势**：直接同时微调任务 A 和 B，通常能更好地保留任务 A 的能力，尽管任务 B 的效果可能略低于单独微调。\n建议根据具体数据分布谨慎选择策略，若需连续微调，需注意数据配比和回放机制。","https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny\u002Fissues\u002F82",{"id":92,"question_zh":93,"answer_zh":94,"source_url":95},22801,"使用 SigLIP 和 LLaMA3 进行预训练和微调后，测试准确率较低，默认的学习率是多少？","如果没有特殊设置，默认的学习率通常为 `2e-4`（预训练阶段）和 `2e-5`（微调阶段）。如果测试结果不佳，请检查是否使用了正确的学习率，并确认数据预处理和超参数配置是否与官方推荐一致。","https:\u002F\u002Fgithub.com\u002FBAAI-DCAI\u002FBunny\u002Fissues\u002F78",[],[98,108,117,125,137,145],{"id":99,"name":100,"github_repo":101,"description_zh":102,"stars":103,"difficulty_score":56,"last_commit_at":104,"category_tags":105,"status":57},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",143909,"2026-04-07T11:33:18",[106,107,47],"开发框架","Agent",{"id":109,"name":110,"github_repo":111,"description_zh":112,"stars":113,"difficulty_score":32,"last_commit_at":114,"category_tags":115,"status":57},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 
API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[47,116,107,106],"图像",{"id":118,"name":119,"github_repo":120,"description_zh":121,"stars":122,"difficulty_score":56,"last_commit_at":123,"category_tags":124,"status":57},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[106,47],{"id":126,"name":127,"github_repo":128,"description_zh":129,"stars":130,"difficulty_score":56,"last_commit_at":131,"category_tags":132,"status":57},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85013,"2026-04-06T11:09:19",[116,133,134,135,107,46,47,106,136],"数据工具","视频","插件","音频",{"id":138,"name":139,"github_repo":140,"description_zh":141,"stars":142,"difficulty_score":32,"last_commit_at":143,"category_tags":144,"status":57},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 
通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[107,116,106,47,46],{"id":146,"name":147,"github_repo":148,"description_zh":149,"stars":150,"difficulty_score":32,"last_commit_at":151,"category_tags":152,"status":57},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",75054,"2026-04-07T10:38:03",[47,116,106,46]]