[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-hkust-nlp--deita":3,"similar-hkust-nlp--deita":82},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":8,"readme_en":9,"readme_zh":10,"quickstart_zh":11,"use_case_zh":12,"hero_image_url":13,"owner_login":14,"owner_name":15,"owner_avatar_url":16,"owner_bio":17,"owner_company":18,"owner_location":18,"owner_email":18,"owner_twitter":18,"owner_website":19,"owner_url":20,"languages":21,"stars":26,"forks":27,"last_commit_at":28,"license":29,"difficulty_score":30,"env_os":31,"env_gpu":32,"env_ram":31,"env_deps":33,"category_tags":39,"github_topics":42,"view_count":47,"oss_zip_url":18,"oss_zip_packed_at":18,"status":48,"created_at":49,"updated_at":50,"faqs":51,"releases":81},7872,"hkust-nlp\u002Fdeita","deita","Deita: Data-Efficient Instruction Tuning for Alignment [ICLR2024]","Deita 是一个专注于大语言模型（LLM）指令微调的开源项目，旨在通过“数据高效”的方式实现模型对齐。它核心解决了当前大模型训练中过度依赖海量标注数据的痛点，证明并实现了仅需极少量高质量数据即可训练出媲美主流顶尖模型的效果。\n\n该项目主要面向 AI 研究人员和开发者，特别是那些受限于计算资源或标注成本，却希望快速构建高性能对话模型的团队。Deita 提供了一套自动化的数据选择工具包，能够智能地从庞杂数据集中筛选出最具价值的样本。其独特的技术亮点在于发布了仅含 6000 至 10000 条数据的轻量级高质量数据集。实验数据显示，基于 Deita 数据训练的模型，在仅使用其他最先进模型十分之一数据量的情况下，依然在 MT-Bench 和 AlpacaEval 等权威评测中取得了极具竞争力的分数。此外，其数据处理流程已被 Hugging Face 等机构采纳用于构建知名模型。对于希望以更低成本、更高效率优化大语言模型表现的用户而言，Deita 提供了一条经过验证的可行路径。","# Deita\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhkust-nlp_deita_readme_ba00f75f25ca.png\" width=\"600\">\n\u003C\u002Fp>\n\n\n\u003Cp align=\"center\">\n  🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fhkust-nlp\u002Fdeita-6569c198c174808d94cf5bd4\">HF Repo\u003C\u002Fa>&nbsp;&nbsp;&nbsp;\n  📄 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.15685\">Paper\u003C\u002Fa>&nbsp;&nbsp;&nbsp;\n  📚 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-6k-v0\">6K Data\u003C\u002Fa>&nbsp;&nbsp;&nbsp;\n  📚 \u003Ca 
href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-10k-v0\">10K Data\u003C\u002Fa>\n\u003C\u002Fp>\n\n\nWelcome to the Deita (**D**ata-**E**fficient **I**nstruction **T**uning for **A**lignment) project! \n\nWe will keep updating this project, so please stay tuned!\n\n\n## What is Deita?\nDeita is an open-source project designed to facilitate **Automatic Data Selection** for instruction tuning in Large Language Models (LLMs).\n\nIt includes:\n- **Open-source toolkits** for automatic data selection in instruction tuning\n- **Deita Datasets**: A series of extremely *lightweight*, high-quality alignment SFT data. We release 6k-sized and 10k-sized datasets in the first release\n- **Deita Models**: A series of powerful models on par with SOTA chat LLMs, trained with an extremely efficient instruction tuning process. Deita models can be obtained by training with 10x less instruction tuning data compared with other SOTA LLMs\n\n## News\n- :fire: [03\u002F2024] Our datasets have been used by Hugging Face to create the [Zephyr Gemma Model](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FHuggingFaceH4\u002Fzephyr-7b-gemma-65e1fd82d26b426e3e63d956).\n- 📄 [01\u002F2024] The Deita paper, [What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.15685), has been accepted by ICLR 2024!\n- :fire: [01\u002F2024] [Deita pipelines](#deita-pipelines) have been released! 
With one line of code and a few configuration options, you can select a high-quality data subset for alignment.\n- 📚 [01\u002F2024] Our scorer datasets [deita-complexity-scorer-data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-complexity-scorer-data) and [deita-quality-scorer-data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-quality-scorer-data) have been released.\n- :fire: [12\u002F2023] We released the first collection of Deita resources [here](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fhkust-nlp\u002Fdeita-6569c198c174808d94cf5bd4), which includes a series of extremely lightweight, effective SFT datasets, the data complexity\u002Fquality scorer models, and the resulting Deita chat models. \n\n## Performance\n:bell: Still curious about how far a small amount of high-quality data can take LLMs? \n\nDeita may provide an answer for you:\n\n**🔦 Highlights**\n| Model                                          | Align        | Data Size  | MT-Bench | AlpacaEval(%) |\n|------------------------------------------------|--------------|------------|----------|---------------|\n| Zephyr-7B-sft                                  | SFT          | 200K       | 5.32     | 75.12         |\n| $\\text{Zephyr-7B-}\\beta$                      | SFT + DPO    | 200K SFT + 60K DPO | 7.34     | 90.60         |\n| OpenChat-3.5                                   | C-RLFT | >> 70K C-RLFT | 7.81     | 88.51         |\n| Starling-7B                                    | C-RLFT + APA | >> 70K C-RLFT + 183K APA | 8.09     | 91.99         |\n| Tulu-2-13B                                     | SFT          | 326K       | 6.70     | 78.90         |\n| Tulu-2-13B+DPO                                 | SFT + DPO    | 326K SFT + 60K DPO | 7.00     | 89.50         |\n| LLaMA2-13B-Chat                                | SFT + PPO    | --         | 6.65     | 81.09         |\n| WizardLM-13B-v1.2                              | SFT          | >70K      
 | 7.09     | 89.17         |\n| Vicuna-13B-v1.5                                | SFT          | >125K      | 6.57    | 78.80         |\n| DEITA-7B-v1.0 (6K)          | SFT          | 6K       |   7.22   |    80.78      |\n| DEITA-7B-v1.0-sft            | SFT          | 10K        | 7.32     | 81.67         |\n| DEITA-7B-v1.0 | SFT + DPO    | 6K SFT + 10K DPO | 7.55     | 90.06         |\n\nDEITA models are based on Mistral-7B-v0.1. :fire: \n\nPlease refer to [this table](#chart\\_with\\_upwards\\_trend-full-evaluations) for full evaluations, which also cover the Open LLM Leaderboard, DEITA models built on LLaMA base models, and comparisons with other data selection approaches.\n\n\n\n## :chart_with_upwards_trend: Full Evaluations\n\n\u003Cdetails>\n  \u003Csummary>See full evaluations\u003C\u002Fsummary>\n\n| Model                                          | Align     | Data Size  | MT-Bench | AlpacaEval(%) | OpenLLM (Avg.) |\n|------------------------------------------------|-----------|------------|----------|---------------|----------------|\n| **Proprietary Models**                         |           |            |          |               |                |\n| GPT-4-Turbo                                    | ?         
| --         | 9.32     | 97.70         | --             |\n| GPT-4                                          | SFT + PPO | --         | 8.99     | 95.03         | --             |\n| Claude-2                                       | SFT + PPO | --         | 8.06     | 91.36         | --             |\n| GPT-3.5-turbo                                  | SFT + PPO | --         | 7.94     | 89.37         | --             |\n| **Open-sourced Models based on LLaMA-1-13B**   |           |            |          |               |                |\n| LIMA                                           | SFT       | 1K SFT        | 4.29     | 41.98         | 59.82          |\n| WizardLM-13B                                   | SFT       | 70K SFT       | 6.35     | 75.31         | 58.96          |\n| Vicuna-13B-v1.3                                | SFT       | 125K SFT      | 6.39     | 82.11         | 60.01          |\n| Random                                         | SFT       | 10K SFT       | 6.03     | 71.52         | 60.14          |\n| DEITA-LLaMA1-13B-v1.0-sft                           | SFT       | 10K SFT       | 6.60     | 78.01         | 64.27          |\n| **Open-sourced Models based on LLaMA-2-13B**   |           |            |          |               |                |\n| Tulu-2-13B                                     | SFT       | 326K SFT      | 6.70     | 78.90         | --             |\n| Tulu-2-13B+DPO                                 | SFT + DPO | 326K SFT + 60K DPO | 7.00     | 89.50         | --             |\n| LLaMA2-13B-Chat                                | SFT + PPO | --         | 6.65     | 81.09         | --             |\n| WizardLM-13B-v1.2                              | SFT          | >70K SFT      | 7.09     | 89.17         | --             |\n| Vicuna-13B-v1.5                                | SFT       | 125K SFT      | 6.57     | 78.80         | 61.63          |\n| Random                                         | SFT       | 10K SFT       | 5.78  
   | 65.19         | 61.32          |\n| DEITA-LLaMA2-13B-v1.0-sft                           | SFT       | 10K SFT       | 6.79     | 81.09         | 62.71          |\n| **Open-sourced Models based on Mistral-7B**    |           |            |          |               |                |\n| Mistral-7B-Instruct-v0.1                       | --        | --         | 6.84     | 69.65         | 60.45          |\n| Zephyr-7B-sft                                  | SFT       | 200K SFT      | 5.32     | 75.12         | 60.93          |\n| $\\text{Zephyr-7B-}\\beta$                       | SFT + DPO | 200K SFT + 60K DPO | 7.34     | 90.60         | 66.36          |\n| OpenChat-3.5                                   | C-RLFT | >> 70K C-RLFT | 7.81     | 88.51         | --           |\n| Starling-7B                                    | C-RLFT + APA | >>70K C-RLFT + 183K APA | 8.09     | 91.99         | --            |\n| Random                                         | SFT       | 10K SFT       | 5.89     | 56.90         | 61.72          |\n| DEITA-7B-v1.0-sft (6K)                           | SFT       | 6K SFT       | 7.22     | 80.78         | 64.94          |\n| DEITA-7B-v1.0-sft (10K)                  | SFT       | 10K SFT       | 7.32     | 81.67         | 64.00          |\n| DEITA-7B-v1.0             | SFT + DPO | 6K SFT + 10K DPO   | 7.55     | 90.06         | 69.86          |\n\n\n\u003C\u002Fdetails>\n\n## :rocket: Deita Resources\n\n| Resource                                       | Link     | License  |\n|------------------------------------------------|-----------|------------|\n| **Deita Datasets**                                   |           |            |\n| deita-6k-v0                                    | [:hugs: HF Repo](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-6k-v0)          | [MIT License](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| deita-10k-v0                                    | [:hugs: HF 
Repo](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-10k-v0)          | [MIT License](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| deita-complexity-scorer-data                                    | [:hugs: HF Repo](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-complexity-scorer-data)          | [MIT License](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| deita-quality-scorer-data                                    | [:hugs: HF Repo](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-quality-scorer-data)          | [MIT License](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| deita-redundant-pool (100K)                                    | [:hugs: HF Repo](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-redundant-pool-data)          | [MIT License](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| deita-sota-pool (300K)                                    | [:hugs: HF Repo](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAndrewZeng\u002Fdeita_sota_pool)          | [MIT License](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| **Scorers**                                   |           |             |\n|  deita-complexity-scorer                      | [:hugs: HF Repo](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-complexity-scorer)          | [LLaMA License](https:\u002F\u002Fai.meta.com\u002Fresources\u002Fmodels-and-libraries\u002Fllama-downloads\u002F)|\n|  deita-quality-scorer               | [:hugs: HF Repo](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-quality-scorer)          | [LLaMA License](https:\u002F\u002Fai.meta.com\u002Fresources\u002Fmodels-and-libraries\u002Fllama-downloads\u002F)|\n| **Deita Models**                                   |           |             |\n| DEITA-7B-v1.0-sft                | [:hugs: HF 
Repo](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-7b-v1.0-sft)           | [Apache-2.0](https:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0)             |\n| DEITA-7B-v1.0                | [:hugs: HF Repo](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-7B-v1.0)           | [Apache-2.0](https:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0)             |\n| DEITA-LLaMA2-13B-v1.0-sft         | [:hugs: HF Repo](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-llama2-13b-v1.0-sft)           |  [LLaMA 2 License](https:\u002F\u002Fai.meta.com\u002Fresources\u002Fmodels-and-libraries\u002Fllama-downloads\u002F)           |\n| DEITA-LLaMA1-13B-v1.0-sft          | [:hugs: HF Repo](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-llama1-13b-v1.0-sft)          |  [LLaMA License](https:\u002F\u002Fai.meta.com\u002Fresources\u002Fmodels-and-libraries\u002Fllama-downloads\u002F)           |\n\n## :running_man: How to start?\n\n\n### Installation\n```bash\n  git clone https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002Fdeita.git\n  cd deita\n  pip install -e .\n```\n\n### Data Sample Scoring\n\nIf you wish to assess the **quality** of a response for a single sample, you can follow these steps:\n```python\nfrom deita.selection.scorer import Llama_Scorer\n\nmodel_name_or_path = \"hkust-nlp\u002Fdeita-quality-scorer\"\n\nscorer = Llama_Scorer(model_name_or_path)\n\n# example input\ninput_text = \"word to describe UI with helpful tooltips\" # Example Input\noutput_text = \"User-friendly or intuitive UI\" # Example Output\nquality_score = scorer.infer_quality(input_text, output_text)\n\nprint(quality_score)\n# 2.0230105920381902\n```\n\nDeita also supports VLLM for faster inference. 
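The continuous value printed above (rather than a discrete 1-6 rating) reflects the scorer's output distribution: a Deita-style scorer rates each sample on a small integer scale, and the final score can be obtained as a probability-weighted average over the rating tokens. A minimal sketch of that reduction, assuming a 1-6 scale (the `weighted_score` helper below is illustrative, not part of the Deita API):

```python
import math

def weighted_score(rank_logits):
    """Collapse logits for the rating tokens "1".."6" into one scalar.

    Softmax the logits into probabilities, then take the
    probability-weighted average of the ratings themselves.
    """
    m = max(rank_logits)                          # stabilize the softmax
    exps = [math.exp(x - m) for x in rank_logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    return sum(rating * p for rating, p in zip(range(1, 7), probs))

# Uniform logits give the midpoint of the 1-6 scale.
print(weighted_score([0.0] * 6))  # 3.5
```

A score such as 2.02 therefore means the scorer places most of its probability mass on the low ratings, not that it emitted "2.02" as a token.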
To use VLLM for inference, first install it:\n\n```bash\npip install vllm\n```\n\nThen set ```is_vllm = True``` when initializing the scorer:\n\n```python\nscorer = Llama_Scorer(model_name_or_path, is_vllm = True)\n```\n\nTo assess other dimensions of data samples, please refer to ```examples\u002Fscoring```.\n\n### Deita Pipelines\n\nYou can use Deita pipelines to perform a variety of operations on a dataset with only one line of code plus configuration.\n\n- **Dataset Scoring**\n\n```python\nfrom deita.pipeline import Pipeline\n\npipeline = Pipeline(\"score_pipeline\", \n                    data_path = args.data_path,   # json file in sharegpt format\n                    scorer = args.scorer,   # [mistral, llama]\n                    scorer_name_or_path = args.scorer_name_or_path,  # scorer name or path e.g. hkust-nlp\u002Fdeita-complexity-scorer\n                    is_vllm = args.is_vllm,  # launch with vllm [True, False]\n                    score_type = args.score_type, # [complexity, quality]\n                    output_path = args.output_path)  # output path (json format)\n\npipeline.run()\n```\n\n- **Get Embeddings**\n\nWe use Hugging Face Accelerate to enhance efficiency:\n\n```python\nfrom deita.pipeline import Pipeline\n\nembed_pipeline = Pipeline(\"embed_pipeline\", \n                          data_path = args.data_path,   # json file in sharegpt format\n                          output_path = args.output_path,  # output path (pickle format)\n                          model_name_or_path = args.model_name_or_path,  # model name or path e.g. 
mistralai\u002FMistral-7B-v0.1\n                          max_length = args.max_length,\n                          use_flash_attention = args.use_flash_attention,  \n                          batch_size_per_device = args.batch_size_per_device,\n                          conv_template = args.conv_template,\n                          only_answer = args.only_answer,\n                          random_shuffle = args.random_shuffle,\n                          bfloat16 = True\n                          )\n\nembed_pipeline.run()\n```\n\n```bash\nCUDA_VISIBLE_DEVICES=$GPUIDX accelerate launch \\\n    --mixed_precision bf16 \\\n    --num_processes $NUMPROCESS \\\n    --num_machines 1 \\\n    examples\u002Fpipelines\u002Fembed_datasets.py \\\n    --use_flash_attention true \\\n    --data_path $DATAPATH \\\n    --output_path $OUTPUTPATH \\\n    --batch_size_per_device $BSZ\n```\n\n- **Score-first, Diversity-aware Selection**\n\n```python\nfrom deita.pipeline import Pipeline\n\nfilter_pipeline = Pipeline(\"filter_pipeline\", \n                          data_path = args.data_path,  # json file with sharegpt format\n                          other_data_path = args.other_data_path,  # embedding file path (pickle format)\n                          threshold = args.threshold,  # filter threshold default: 0.9 \n                          data_size = args.data_size,  # size of selected data\n                          chunk_size = args.chunk_size,  # used for more efficient GPU computing  default: 100000\n                          sort_key = args.sort_key,  # default: \"complexity_scores,quality_scores\"\n                          output_path = args.output_path,  # json format output path\n                          distance_metric = args.distance_metric,  # default: cosine\n                          embedding_field = args.embedding_field,  # default: embedding\n                          is_compression = args.is_compression,  # default: False\n                          device = 
args.device  # GPU IDX, default: 0\n                          )\n\nfilter_pipeline.run()\n```\n\nYou can refer to ```examples\u002Fpipelines``` for more details. Documentation is coming soon.\n\n### SFT Training\nPlease refer to ```examples\u002Ftrain\u002Fsft.sh```:\n```bash\ndeepspeed --include localhost:${DEVICES} --master_port 29501 src\u002Fdeita\u002Falignment\u002Ftrain.py \\\n    --model_name_or_path ${MODELPATH} \\\n    --data_path ${DATAPATH} \\\n    --output_dir ${OUTPUTPATH}\u002F${RUNNAME} \\\n    --num_train_epochs 6 \\\n    --per_device_train_batch_size ${BSZPERDEV} \\\n    --per_device_eval_batch_size 1 \\\n    --gradient_accumulation_steps ${GRADACC} \\\n    --eval_steps 50 \\\n    --save_strategy \"no\" \\\n    --save_steps 100 \\\n    --save_total_limit 10 \\\n    --learning_rate 2e-5 \\\n    --warmup_ratio 0.1 \\\n    --lr_scheduler_type \"cosine\" \\\n    --logging_steps 1 \\\n    --do_eval False \\\n    --evaluation_strategy \"no\" \\\n    --model_max_length 2048 \\\n    --lazy_preprocess True \\\n    --conv_template \"vicuna_v1.1\" \\\n    --mask_user True \\\n    --report_to \"wandb\" \\\n    --run_name ${RUNNAME} \\\n    --bf16 True \\\n    --deepspeed src\u002Fdeita\u002Fds_configs\u002Fdeepspeed_config_zero2_no_offload.json\n```\n\n### DPO Training\nPlease refer to ```examples\u002Ftrain\u002Fdpo.sh```:\n```bash\ndeepspeed --include localhost:${DEVICES} --master_port 29502 src\u002Fdeita\u002Falignment\u002Fdpo_train.py \\\n    --model_name_or_path ${MODELPATH} \\\n    --json_path ${JSONPATH} \\\n    --data_split ${DATASPLIT} \\\n    --output_dir ${OUTPUTPATH}\u002F${RUNNAME} \\\n    --num_train_epochs ${DPOEPOCH} \\\n    --beta 0.1 \\\n    --per_device_train_batch_size ${BSZPERDEV} \\\n    --per_device_eval_batch_size 1 \\\n    --gradient_accumulation_steps ${GRADACC} \\\n    --save_global_steps False \\\n    --eval_steps 50 \\\n    --save_strategy \"no\" \\\n    --save_steps 500 \\\n    --save_total_limit 1 \\\n    --learning_rate 5e-7 
\\\n    --warmup_ratio 0.1 \\\n    --lr_scheduler_type \"linear\" \\\n    --logging_steps 1 \\\n    --do_eval False \\\n    --evaluation_strategy \"no\" \\\n    --model_max_length 2048 \\\n    --conv_template \"vicuna_v1.1\" \\\n    --report_to \"wandb\" \\\n    --run_name ${RUNNAME} \\\n    --bf16 True \\\n    --gradient_checkpointing True \\\n    --deepspeed src\u002Fdeita\u002Fds_configs\u002Fstage3_no_offloading_accelerate.json\n```\n\n### Evaluation\n- For MT-Bench, please refer to [MT-Bench](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat\u002Ftree\u002Fmain\u002Ffastchat\u002Fllm_judge)\n- For AlpacaEval, please refer to [alpaca_eval](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval)\n- For the Open LLM Benchmark, please refer to [lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fmaster) and follow the settings of [HuggingFaceH4\u002Fopen_llm_leaderboard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FHuggingFaceH4\u002Fopen_llm_leaderboard)\n\n## :muscle: What's more?\n\nThis is a preview version of the Deita project. We will continue to update it, including:\n\n- [ ] Releasing the data selection pipeline with an efficient implementation\n- [ ] More automatic data selection strategies\n- [ ] CLI interface support\n- [ ] Online demo\n\n## Citation\nIf you find the content of this project helpful, please cite our paper as follows:\n\n```\n@inproceedings{liu2024what,\ntitle={What Makes Good Data for Alignment? 
A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\nauthor={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\nbooktitle={The Twelfth International Conference on Learning Representations},\nyear={2024},\nurl={https:\u002F\u002Fopenreview.net\u002Fforum?id=BTKAeLqLMw}\n}\n```\n\n## Acknowledgement\nFor training code, we use the code template of [fastchat](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat).\n","# Deita\n\n\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhkust-nlp_deita_readme_ba00f75f25ca.png\" width=\"600\">\n\u003C\u002Fp>\n\n\n\u003Cp align=\"center\">\n  🤗 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fhkust-nlp\u002Fdeita-6569c198c174808d94cf5bd4\">HF 仓库\u003C\u002Fa>&nbsp;&nbsp;&nbsp;\n  📄 \u003Ca href=\"https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.15685\">论文\u003C\u002Fa>&nbsp;&nbsp;&nbsp;\n  📚 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-6k-v0\">6K 数据集\u003C\u002Fa>&nbsp;&nbsp;&nbsp;\n  📚 \u003Ca href=\"https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-10k-v0\">10K 数据集\u003C\u002Fa>\n\u003C\u002Fp>\n\n\n欢迎来到 Deita（**D**ata-**E**fficient **I**nstruction **T**uning for **A**lignment）项目！\n\n我们将持续更新，敬请关注！\n\n\n## 什么是 Deita？\nDeita 是一个开源项目，旨在为大型语言模型（LLMs）的指令微调提供**自动化数据选择**支持。\n\n它包括：\n- **开源工具包**：用于指令微调中的自动化数据选择\n- **Deita 数据集**：一系列极其轻量级、高质量的对齐 SFT 数据。首次发布时，我们提供了 6k 和 10k 规模的数据集。\n- **Deita 模型**：一系列性能强大、与当前最先进聊天模型相当的模型，同时采用极其高效的指令微调流程。使用 Deita 模型时，所需的指令微调数据仅为其他 SOTA LLM 的十分之一。\n\n## 最新动态\n- :fire: [2024年3月] 我们的数据集已被 Hugging Face 用于创建 [Zephyr Gemma 模型](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002FHuggingFaceH4\u002Fzephyr-7b-gemma-65e1fd82d26b426e3e63d956)。\n- 📄 [2024年1月] Deita 论文《什么才是对齐任务中优质的数据？——指令微调中自动化数据选择的全面研究》（[arXiv:2312.15685](https:\u002F\u002Farxiv.org\u002Fabs\u002F2312.15685)）已被 ICLR 2024 接受！\n- :fire: [2024年1月] [Deita 流水线](#deita-pipelines) 
已发布！只需一行代码和配置，即可筛选出高质量的对齐数据子集。\n- 📚 [2024年1月] 我们发布了评分数据集 [deita-complexity-scorer-data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-complexity-scorer-data) 和 [deita-quality-scorer-data](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-quality-scorer-data)。\n- :fire: [2023年12月] 我们在此发布了 Deita 资源的第一批集合 [这里](https:\u002F\u002Fhuggingface.co\u002Fcollections\u002Fhkust-nlp\u002Fdeita-6569c198c174808d94cf5bd4)，其中包括一系列极其轻量级且高效的 SFT 数据集、数据复杂度\u002F质量评分模型，以及最终生成的 Deita 聊天模型。\n\n## 性能表现\n:bell: 仍然好奇少量高质量数据究竟能将 LLM 带向多远吗？\n\nDeita 或许能为你解答：\n\n**🔦 亮点**\n| 模型                                          | 对齐方式        | 数据规模  | MT-Bench | AlpacaEval(%) |\n|------------------------------------------------|--------------|------------|----------|---------------|\n| Zephyr-7B-sft                                  | SFT          | 20万       | 5.32     | 75.12         |\n| $\\text{Zephyr-7B-}\\beta$                      | SFT + DPO    | 20万 SFT + 6万 DPO | 7.34     | 90.60         |\n| OpenChat-3.5                                   | C-RLFT | >> 7万 C-RLFT | 7.81     | 88.51         |\n| Starling-7B                                    | C-RLFT + APA | >> 7万 C-RLFT + 18.3万 APA | 8.09     | 91.99         |\n| Tulu-2-13B                                     | SFT          | 32.6万       | 6.70     | 78.90         |\n| Tulu-2-13B+DPO                                 | SFT + DPO    | 32.6万 SFT + 6万 DPO | 7.00     | 89.50         |\n| LLaMA2-13B-Chat                                | SFT + PPO    | --         | 6.65     | 81.09         |\n| WizardLM-13B-v1.2                              | SFT          | >7万       | 7.09     | 89.17         |\n| Vicuna-13B-v1.5                                | SFT          | >12.5万      | 6.57    | 78.80         |\n| DEITA-7B-v1.0 (6K)          | SFT          | 6千       |   7.22   |    80.78      |\n| DEITA-7B-v1.0-sft            | SFT          | 1万        | 7.32     | 81.67         |\n| DEITA-7B-v1.0 | SFT + DPO    | 
6千 SFT + 1万 DPO | 7.55     | 90.06         |\n\nDEITA 模型基于 Mistral-7B-v0.1。:fire: \n\n请参阅 [这张表格](#chart\\_with\\_upwards\\_trend-full-evaluations) 获取完整评估结果，其中也包含 Open LLM Leaderboard 的排名信息，涵盖了使用 LLaMA 基础模型的 DEITA 模型，以及与其他数据选择方法的对比。\n\n## :chart_with_upwards_trend: 完整评估\n\n\u003Cdetails>\n  \u003Csummary>查看完整评估\u003C\u002Fsummary>\n\n| 模型                                          | 对齐方式     | 数据规模  | MT-Bench | AlpacaEval(%) | OpenLLM (平均) |\n|------------------------------------------------|-----------|------------|----------|---------------|----------------|\n| **专有模型**                         |           |            |          |               |                |\n| GPT-4-Turbo                                    | ?         | --         | 9.32     | 97.70         | --             |\n| GPT-4                                          | SFT + PPO | --         | 8.99     | 95.03         | --             |\n| Claude-2                                       | SFT + PPO | --         | 8.06     | 91.36         | --             |\n| GPT-3.5-turbo                                  | SFT + PPO | --         | 7.94     | 89.37         | --             |\n| **基于 LLaMA-1-13B 的开源模型**   |           |            |          |               |                |\n| LIMA                                           | SFT       | 1K SFT        | 4.29     | 41.98         | 59.82          |\n| WizardLM-13B                                   | SFT       | 70K SFT       | 6.35     | 75.31         | 58.96          |\n| Vicuna-13B-v1.3                                | SFT       | 125K SFT      | 6.39     | 82.11         | 60.01          |\n| Random                                         | SFT       | 10K SFT       | 6.03     | 71.52         | 60.14          |\n| DEITA-LLaMA1-13B-v1.0-sft                           | SFT       | 10K SFT       | 6.60     | 78.01         | 64.27          |\n| **基于 LLaMA-2-13B 的开源模型**   |           |            |          |               |                |\n| Tulu-2-13B    
                                 | SFT       | 326K SFT      | 6.70     | 78.90         | --             |\n| Tulu-2-13B+DPO                                 | SFT + DPO | 326K SFT + 60K DPO | 7.00     | 89.50         | --             |\n| LLaMA2-13B-Chat                                | SFT + PPO | --         | 6.65     | 81.09         | --             |\n| WizardLM-13B-v1.2                              | SFT          | >70K SFT      | 7.09     | 89.17         | --             |\n| Vicuna-13B-v1.5                                | SFT       | 125K SFT      | 6.57     | 78.80         | 61.63          |\n| Random                                         | SFT       | 10K SFT       | 5.78     | 65.19         | 61.32          |\n| DEITA-LLaMA2-13B-v1.0-sft                           | SFT       | 10K SFT       | 6.79     | 81.09         | 62.71          |\n| **基于 Mistral-7B 的开源模型**    |           |            |          |               |                |\n| Mistral-7B-Instruct-v0.1                       | --        | --         | 6.84     | 69.65         | 60.45          |\n| Zephyr-7B-sft                                  | SFT       | 200K SFT      | 5.32     | 75.12         | 60.93          |\n| $\\text{Zephyr-7B-}\\beta$                       | SFT + DPO | 200K SFT + 60K DPO | 7.34     | 90.60         | 66.36          |\n| OpenChat-3.5                                   | C-RLFT | >> 70K C-RLFT | 7.81     | 88.51         | --           |\n| Starling-7B                                    | C-RLFT + APA | >>70K C-RLFT + 183K APA | 8.09     | 91.99         | --            |\n| Random                                         | SFT       | 10K SFT       | 5.89     | 56.90         | 61.72          |\n| DEITA-7B-v1.0-sft (6K)                           | SFT       | 6K SFT       | 7.22     | 80.78         | 64.94          |\n| DEITA-7B-v1.0-sft (10K)                  | SFT       | 10K SFT       | 7.32     | 81.67         | 64.00          |\n| DEITA-7B-v1.0             | SFT + 
DPO | 6K SFT + 10K DPO   | 7.55     | 90.06         | 69.86          |\n\n\n\u003C\u002Fdetails>\n\n## :rocket: Deita 资源\n\n| 资源                                       | 链接     | 许可证  |\n|------------------------------------------------|-----------|------------|\n| **Deita 数据集**                                   |           |            |\n| deita-6k-v0                                    | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-6k-v0)          | [MIT 许可证](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| deita-10k-v0                                    | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-10k-v0)          | [MIT 许可证](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| deita-complexity-scorer-data                                    | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-complexity-scorer-data)          | [MIT 许可证](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| deita-quality-scorer-data                                    | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-quality-scorer-data)          | [MIT 许可证](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| deita-redundant-pool (100K)                                    | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-redundant-pool-data)          | [MIT 许可证](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| deita-sota-pool (300K)                                    | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002FAndrewZeng\u002Fdeita_sota_pool)          | [MIT 许可证](https:\u002F\u002Fopensource.org\u002Flicense\u002Fmit\u002F) |\n| **评分器**                                   |           |             |\n|  deita-complexity-scorer                      | [:hugs: HF 
仓库](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-complexity-scorer)          | [LLaMA 许可证](https:\u002F\u002Fai.meta.com\u002Fresources\u002Fmodels-and-libraries\u002Fllama-downloads\u002F)|\n|  deita-quality-scorer               | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-quality-scorer)          | [LLaMA 许可证](https:\u002F\u002Fai.meta.com\u002Fresources\u002Fmodels-and-libraries\u002Fllama-downloads\u002F)|\n| **Deita 模型**                                   |           |             |\n| DEITA-7B-v1.0-sft                | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-7b-v1.0-sft)           | [Apache-2.0](https:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0)             |\n| DEITA-7B-v1.0                | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-7B-v1.0)           | [Apache-2.0](https:\u002F\u002Fwww.apache.org\u002Flicenses\u002FLICENSE-2.0)             |\n| DEITA-LLaMA2-13B-v1.0-sft         | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-llama2-13b-v1.0-sft)           |  [LLaMA 2 许可证](https:\u002F\u002Fai.meta.com\u002Fresources\u002Fmodels-and-libraries\u002Fllama-downloads\u002F)           |\n| DEITA-LLaMA1-13B-v1.0-sft          | [:hugs: HF 仓库](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-llama1-13b-v1.0-sft)          |  [LLaMA 许可证](https:\u002F\u002Fai.meta.com\u002Fresources\u002Fmodels-and-libraries\u002Fllama-downloads\u002F)           |\n\n## :running_man: 如何开始？\n\n\n### 安装\n```bash\n  git clone https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002Fdeita.git\n  cd deita\n  pip install -e .\n```\n\n### 数据样本评分\n\n如果您希望评估单个样本的响应**质量**，可以按照以下步骤操作：\n```python\nfrom deita.selection.scorer import Llama_Scorer\n\nmodel_name_or_path = \"hkust-nlp\u002Fdeita-quality-scorer\"\n\nscorer = Llama_Scorer(model_name_or_path)\n\n# 示例输入\ninput_text = \"描述带有实用提示框的用户界面的词语\" # 示例输入\noutput_text = \"用户友好的或直观的用户界面\" # 
示例输出\nquality_score = scorer.infer_quality(input_text, output_text)\n\nprint(quality_score)\n# 2.0230105920381902\n```\n\nDeita 还支持 VLLM 以实现更快的推理。如果想使用 VLLM 进行推理，请先安装：\n\n```bash\npip install vllm\n```\n\n并在初始化评分器时设置 ```is_vllm = True```：\n\n```python\nscorer = Llama_Scorer(model_name_or_path, is_vllm = True)\n```\n\n如需评估数据样本的其他维度，请参阅 ```examples\u002Fscoring```。\n\n### Deita 流水线\n\n您可以通过 Deita 流水线，仅用一行代码和配置即可对数据集执行多种操作。\n\n- **数据集评分**\n\n```python\nfrom deita.pipeline import Pipeline\n\npipeline = Pipeline(\"score_pipeline\", \n                    data_path = args.data_path,   # 具有 ShareGPT 格式的 JSON 文件\n                    scorer = args.scorer,   # [mistral, llama]\n                    scorer_name_or_path = args.scorer_name_or_path,  # 评分器名称或路径，例如 hkust-nlp\u002Fdeita-complexity-scorer\n                    is_vllm = args.is_vllm,  # 是否使用 VLLM 启动 [True, False]\n                    score_type = args.score_type, # 取值为 [complexity, quality]\n                    output_path = args.output_path)  # 输出路径（JSON 格式）\n\npipeline.run()\n```\n\n- **获取嵌入**\n\n我们使用 Hugging Face Accelerate 来提升效率：\n\n```python\nfrom deita.pipeline import Pipeline\n\nembed_pipeline = Pipeline(\"embed_pipeline\", \n                          data_path = args.data_path,   # 具有 ShareGPT 格式的 JSON 文件\n                          output_path = args.output_path,  # 输出路径（Pickle 格式）\n                          model_name_or_path = args.model_name_or_path,  # 模型名称或路径，例如 mistralai\u002FMistral-7B-v0.1\n                          max_length = args.max_length,\n                          use_flash_attention = args.use_flash_attention,  \n                          batch_size_per_device = args.batch_size_per_device,\n                          conv_template = args.conv_template,\n                          only_answer = args.only_answer,\n                          random_shuffle = args.random_shuffle,\n                          bfloat16 = True\n                          )\n\nembed_pipeline.run()\n```\n\n```bash\nCUDA_VISIBLE_DEVICES=$GPUIDX 
accelerate launch \\\n    --mixed_precision bf16 \\\n    --num_processes $NUMPROCESS \\\n    --num_machines 1 \\\n    examples\u002Fpipelines\u002Fembed_datasets.py \\\n    --use_flash_attention true \\\n    --data_path $DATAPATH \\\n    --output_path $OUTPUTPATH \\\n    --batch_size_per_device $BSZ\n```\n\n- **先评分、再多样性感知筛选**\n\n```python\nfrom deita.pipeline import Pipeline\n\nfilter_pipeline = Pipeline(\"filter_pipeline\", \n                          data_path = args.data_path,  # 具有 ShareGPT 格式的 JSON 文件\n                          other_data_path = args.other_data_path,  # 嵌入文件路径（Pickle 格式）\n                          threshold = args.threshold,  # 筛选阈值，默认：0.9 \n                          data_size = args.data_size,  # 选定数据的数量\n                          chunk_size = args.chunk_size,  # 用于更高效的 GPU 计算，默认：100000\n                          sort_key = args.sort_key,  # 默认：\"complexity_scores,quality_scores\"\n                          output_path = args.output_path,  # JSON 格式输出路径\n                          distance_metric = args.distance_metric,  # 默认：余弦距离\n                          embedding_field = args.embedding_field,  # 默认：embedding\n                          is_compression = args.is_compression,  # 默认：False\n                          device = args.device  # GPU 编号，默认：0\n                          )\n\nfilter_pipeline.run()\n```\n\n更多详情请参阅 ```examples\u002Fpipelines```。文档也将很快发布。\n\n### SFT 训练\n请参考 ```examples\u002Ftrain\u002Fsft.sh```：\n```bash\ndeepspeed --include localhost:${DEVICES} --master_port 29501 src\u002Fdeita\u002Falignment\u002Ftrain.py \\\n    --model_name_or_path ${MODELPATH} \\\n    --data_path ${DATAPATH} \\\n    --output_dir ${OUTPUTPATH}\u002F${RUNNAME} \\\n    --num_train_epochs 6 \\\n    --per_device_train_batch_size ${BSZPERDEV} \\\n    --per_device_eval_batch_size 1 \\\n    --gradient_accumulation_steps ${GRADACC} \\\n    --eval_steps 50 \\\n    --save_strategy \"no\" \\\n    --save_steps 100 \\\n    --save_total_limit 10 \\\n    
--learning_rate 2e-5 \\\n    --warmup_ratio 0.1 \\\n    --lr_scheduler_type \"cosine\" \\\n    --logging_steps 1 \\\n    --do_eval False \\\n    --evaluation_strategy \"no\" \\\n    --model_max_length 2048 \\\n    --lazy_preprocess True \\\n    --conv_template \"vicuna_v1.1\" \\\n    --mask_user True \\\n    --report_to \"wandb\" \\\n    --run_name ${RUNNAME} \\\n    --bf16 True \\\n    --deepspeed src\u002Fdeita\u002Fds_configs\u002Fdeepspeed_config_zero2_no_offload.json\n```\n\n### DPO 训练\n请参考 ```examples\u002Ftrain\u002Fdpo.sh```：\n```bash\ndeepspeed --include localhost:${DEVICES} --master_port 29502 src\u002Fdeita\u002Falignment\u002Fdpo_train.py \\\n    --model_name_or_path ${MODELPATH} \\\n    --json_path ${JSONPATH} \\\n    --data_split ${DATASPLIT} \\\n    --output_dir ${OUTPUTPATH}\u002F${RUNNAME} \\\n    --num_train_epochs ${DPOEPOCH} \\\n    --beta 0.1 \\\n    --per_device_train_batch_size ${BSZPERDEV} \\\n    --per_device_eval_batch_size 1 \\\n    --gradient_accumulation_steps ${GRADACC} \\\n    --save_global_steps False \\\n    --eval_steps 50 \\\n    --save_strategy \"no\" \\\n    --save_steps 500 \\\n    --save_total_limit 1 \\\n    --learning_rate 5e-7 \\\n    --warmup_ratio 0.1 \\\n    --lr_scheduler_type \"linear\" \\\n    --logging_steps 1 \\\n    --do_eval False \\\n    --evaluation_strategy \"no\" \\\n    --model_max_length 2048 \\\n    --conv_template \"vicuna_v1.1\" \\\n    --report_to \"wandb\" \\\n    --run_name ${RUNNAME} \\\n    --bf16 True \\\n    --gradient_checkpointing True \\\n    --deepspeed src\u002Fdeita\u002Fds_configs\u002Fstage3_no_offloading_accelerate.json\n```\n\n### 评估\n- 对于 MT-Bench，请参考 [MT-Bench](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat\u002Ftree\u002Fmain\u002Ffastchat\u002Fllm_judge)\n- 对于 AlpacaEval，请参考 [alpaca_eval](https:\u002F\u002Fgithub.com\u002Ftatsu-lab\u002Falpaca_eval)\n- 对于 Open LLM Benchmark，请参考 
[lm-evaluation-harness](https:\u002F\u002Fgithub.com\u002FEleutherAI\u002Flm-evaluation-harness\u002Ftree\u002Fmaster) 并按照 [HuggingFaceH4\u002Fopen_llm_leaderboard](https:\u002F\u002Fhuggingface.co\u002Fspaces\u002FHuggingFaceH4\u002Fopen_llm_leaderboard) 上的设置进行操作。\n\n## :muscle: 还有哪些亮点？\n\n这是 Deita 项目的预览版本。我们将持续更新，包括：\n\n- [ ] 发布高效实现的数据选择流水线\n- [ ] 更多自动化的数据选择策略\n- [ ] 支持 CLI 界面\n- [ ] 在线演示\n\n## 引用\n如果您觉得本项目的内容有所帮助，请按以下格式引用我们的论文：\n\n```\n@inproceedings{\nliu2024what,\ntitle={What Makes Good Data for Alignment? A Comprehensive Study of Automatic Data Selection in Instruction Tuning},\nauthor={Wei Liu and Weihao Zeng and Keqing He and Yong Jiang and Junxian He},\nbooktitle={The Twelfth International Conference on Learning Representations},\nyear={2024},\nurl={https:\u002F\u002Fopenreview.net\u002Fforum?id=BTKAeLqLMw}\n}\n```\n\n## 致谢\n在训练代码方面，我们使用了 [fastchat](https:\u002F\u002Fgithub.com\u002Flm-sys\u002FFastChat) 的代码模板。","# Deita 快速上手指南\n\nDeita (**D**ata-**E**fficient **I**nstruction **T**uning for **A**lignment) 是一个开源项目，旨在为大语言模型（LLM）的指令微调提供**自动数据选择**方案。通过使用 Deita，你可以利用极少的高质量数据（如 6k 或 10k 条）训练出媲美 SOTA 的模型，大幅降低训练成本。\n\n## 环境准备\n\n在开始之前，请确保你的开发环境满足以下要求：\n\n*   **操作系统**: Linux 或 macOS (推荐 Linux)\n*   **Python 版本**: Python 3.8 或更高版本\n*   **硬件要求**: \n    *   运行评分模型（Scorer）建议配备 NVIDIA GPU。\n    *   若使用 VLLM 加速推理，需确保显存充足。\n*   **前置依赖**: \n    *   PyTorch\n    *   Transformers\n    *   Git\n\n> **提示**: 国内用户建议在 `pip` 命令中添加清华或阿里镜像源以加速下载，例如：`pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple ...`\n\n## 安装步骤\n\n通过以下命令克隆仓库并安装 Deita 工具包：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002Fdeita.git\ncd deita\npip install -e .\n```\n\n若需使用 VLLM 进行加速推理（推荐用于大规模数据评分），请额外安装：\n\n```bash\npip install vllm\n```\n\n## 基本使用\n\nDeita 的核心功能是评估指令-回答对的质量与复杂度。以下是使用预训练的 **质量评分模型 (Quality Scorer)** 对单个样本进行评分的最简示例。\n\n### 1. 
基础评分示例 (基于 Transformers)\n\n```python\nfrom deita.selection.scorer import Llama_Scorer\n\n# 指定模型路径 (自动从 HuggingFace 下载，国内网络不畅时可配置镜像)\nmodel_name_or_path = \"hkust-nlp\u002Fdeita-quality-scorer\"\n\n# 初始化评分器\nscorer = Llama_Scorer(model_name_or_path)\n\n# 准备测试数据\ninput_text = \"word to describe UI with helpful tooltips\"  # 指令输入\noutput_text = \"User-friendly or intuitive UI\"             # 模型回答\n\n# 执行质量评分\nquality_score = scorer.infer_quality(input_text, output_text)\n\nprint(quality_score)\n# 输出示例: 2.0230105920381902\n```\n\n### 2. 使用 VLLM 加速评分\n\n如果你需要处理大量数据，可以启用 VLLM 后端以提升推理速度：\n\n```python\nfrom deita.selection.scorer import Llama_Scorer\n\nmodel_name_or_path = \"hkust-nlp\u002Fdeita-quality-scorer\"\n\n# 初始化时设置 is_vllm=True\nscorer = Llama_Scorer(model_name_or_path, is_vllm=True)\n\ninput_text = \"word to describe UI with helpful tooltips\"\noutput_text = \"User-friendly or intuitive UI\"\n\nquality_score = scorer.infer_quality(input_text, output_text)\nprint(quality_score)\n```\n\n### 3. 
获取数据集与模型\n\n你可以直接通过 HuggingFace 下载 Deita 提供的精选数据集和预训练模型：\n\n*   **精选数据集**: [deita-6k-v0](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-6k-v0) | [deita-10k-v0](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fhkust-nlp\u002Fdeita-10k-v0)\n*   **评分模型**: [Complexity Scorer](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-complexity-scorer) | [Quality Scorer](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-quality-scorer)\n*   **微调后模型**: [DEITA-7B-v1.0](https:\u002F\u002Fhuggingface.co\u002Fhkust-nlp\u002Fdeita-7B-v1.0)\n\n> **注意**: 部分模型（如基于 LLaMA 的评分器）可能需要遵守相应的原始许可证（LLaMA License），使用前请确认授权范围。","某初创 AI 团队需要在有限的 GPU 预算下，快速为垂直领域的客服机器人构建一个高质量的对话模型。\n\n### 没有 deita 时\n- **数据筛选成本高昂**：团队不得不人工从数十万条通用指令数据中抽样清洗，耗时数周仍难以保证数据的多样性和复杂度。\n- **训练资源浪费严重**：由于无法识别低质量样本，模型被迫在大量简单或重复数据上进行全量微调，导致显存占用高且训练周期漫长。\n- **模型效果遭遇瓶颈**：即便使用了超过 20 万条数据进行 SFT（监督微调），模型在复杂逻辑推理和多轮对话中的表现依然平平，MT-Bench 评分难以突破 6.5 分。\n- **迭代试错困难**：每次调整数据策略都需要重新进行大规模训练，反馈循环过长，严重拖慢了产品上线节奏。\n\n### 使用 deita 后\n- **自动化精准选材**：利用 deita 的复杂度与质量评分模型，团队仅用一行代码就从海量数据中自动筛选出 6,000 条最具价值的高质指令。\n- **训练效率显著提升**：基于筛选后的轻量级数据集，模型训练时间缩短了 90% 以上，大幅降低了算力成本和碳排放。\n- **小数据达成高性能**：仅用 6k 数据训练的 DEITA-7B 模型，其 MT-Bench 得分高达 7.22，性能直接媲美甚至超越使用 20 万数据训练的 Zephyr-7B 等主流模型。\n- **敏捷迭代成为常态**：极低的数据门槛让团队能在一天内完成多次“筛选 - 训练 - 评估”闭环，迅速针对特定业务场景优化模型表现。\n\ndeita 通过“以质换量”的核心机制，证明了极少量的高质量数据足以驱动大模型实现卓越的对齐效果，让资源受限的团队也能轻松打造 SOTA 级应用。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fhkust-nlp_deita_face7bd3.png","hkust-nlp","NLP Group @ HKUST","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fhkust-nlp_83897fa3.jpg","We are a group of NLP researchers in the Hong Kong University of Science and Technology",null,"https:\u002F\u002Fjxhe.github.io\u002Fgroup\u002F","https:\u002F\u002Fgithub.com\u002Fhkust-nlp",[22],{"name":23,"color":24,"percentage":25},"Python","#3572A5",100,593,35,"2026-04-15T12:20:51","Apache-2.0",3,"未说明","需要 NVIDIA GPU（用于运行 Llama Scorer 模型及可选的 vLLM 加速），具体显存需求取决于所选基座模型（如 Mistral-7B 或 LLaMA-13B），建议 8GB 
以上以支持推理",{"notes":34,"python":31,"dependencies":35},"README 中未明确列出具体的操作系统、Python 版本及完整的 requirements.txt 依赖列表。工具核心功能是通过 'Llama_Scorer' 对数据进行质量和复杂度打分，需下载 hkust-nlp 发布的评分模型（基于 LLaMA 架构）。支持使用 vLLM 库进行加速推理。安装方式为克隆仓库后执行 'pip install -e .'。",[36,37,38],"vllm (可选，用于加速推理)","transformers (隐含依赖，用于加载 Llama_Scorer)","torch (隐含依赖)",[40,41],"数据工具","语言模型",[43,44,45,46],"alignment","data-centric","instruction-tuning","large-language-models",2,"ready","2026-03-27T02:49:30.150509","2026-04-16T08:13:22.997606",[52,57,62,67,72,76],{"id":53,"question_zh":54,"answer_zh":55,"source_url":56},35249,"复杂度和质量评分器（Scorer）是如何训练的？使用了什么提示词和数据量？","评分器是通过简单的下一令牌预测（next-token prediction）任务在收集的数据样本上训练的。具体使用的提示词如下：\n1. 质量评分提示词：\"You are a helpful assistant. Please identify the quality score of the Response corresponding to the Question. \\n #Question#:\\n{instruction}\\n#Response#:\\n{output} \\n##Quality: {score}\"\n2. 复杂度评分提示词：\"You are a helpful assistant. Please identify the complexity score of the following user query. 
\\n##Query: {instruction}  \\n##Complexity: {score}\"\n\n在推理阶段，代码会提取 6 个分数（范围 1 到 6）的概率来确定最终得分。每个评分器总共使用了 6,000 个数据样本进行训练，其中包括 1,000 个初始种子数据样本和 5,000 个经过五次迭代演化生成的样本。","https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002Fdeita\u002Fissues\u002F3",{"id":58,"question_zh":59,"answer_zh":60,"source_url":61},35250,"如何实现“分数优先（score-first）”的功能？代码中似乎缺少合并复杂度和质量分数的步骤。","确实遗漏了直接合并两个评分文件的步骤。如果您有两个单独的 JSON 文件分别存储复杂度分数和质量分数，需要先将它们合并到一个文件中。例如，向已包含 \"complexity_scores\" 的 JSON 文件中添加一个新的键 \"quality_scores\"。\n\n合并完成后，将该合并后的数据文件路径作为 `args.data_path` 传递给过滤管道即可进行选择。后续步骤参考代码如下：\nfrom deita.pipeline import Pipeline\nfilter_pipeline = Pipeline(\"filter_pipeline\", \n                          data_path = args.data_path,  # 包含 sharegpt 格式的 json 文件（需已合并分数）\n                          other_data_path = args.other_data_path,  # embedding 文件路径 (pickle 格式)\n                          threshold = args.threshold,  # 过滤阈值，默认 0.9\n                          data_size = args.data_size,  # 选择的数据大小\n                          chunk_size = args.chunk_size,  # 用于更高效的 GPU 计算，默认 100000\n                          sort_key = args.sort_key)  # 默认：\"complexity_scores,quality_scores\"","https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002Fdeita\u002Fissues\u002F34",{"id":63,"question_zh":64,"answer_zh":65,"source_url":66},35251,"代码中计算的是余弦相似度还是余弦距离？这是否会影响数据的多样性选择？","代码实现实际上使用的是余弦相似度（cosine similarity），但通过逻辑取反来实现基于距离的过滤效果，从而保证多样性。\n\n虽然变量名或注释可能引起混淆，但在核心过滤逻辑中（参见 `src\u002Fdeita\u002Fselection\u002Ffilter\u002Fbase.py`），代码使用了 `~distance_bool` 而不是 `distance_bool`。这意味着当余弦相似度高于阈值（如 0.9，表示非常相似）时，该样本会被标记为排除（即不选择相似的样本），最终保留下来的索引是经过过滤后的“选中索引”。因此，尽管底层计算是相似度，但通过取反操作，实际效果是筛选出低相似度（高距离）的多样化数据。","https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002Fdeita\u002Fissues\u002F20",{"id":68,"question_zh":69,"answer_zh":70,"source_url":71},35252,"为什么使用 SFT (6k) + DPO (10k) 的训练方法能提升多项选择题（如 ARC-challenge, MMLU）的性能？","性能提升主要有以下两个推测原因：\n1. 
**中间检查点策略**：表现最佳的模型是基于一个中间的 SFT 检查点进行 DPO 训练的，而不是完全优化后的 SFT 模型。假设过度优化的 SFT 可能会损害大语言模型的固有通用能力。因此，使用次优的 SFT 检查点作为基础，再进行专门针对对齐优化的 DPO 训练，能够同时提升学术基准测试（如 OpenLLM leaderboard）的表现和对齐能力。这一发现与 Zephyr 模型的观察结果一致。\n2. **数据集构成**：UltraFeedback 数据集中可能包含了一些 STEM 相关的样本，这些样本有助于模型学习解决特定类型的推理问题。此外，DPO 通过偏好对齐，可能修正了模型在某些问题上不一致的回答，使其能更稳定地输出正确答案。","https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002Fdeita\u002Fissues\u002F21",{"id":73,"question_zh":74,"answer_zh":75,"source_url":56},35253,"在推理时如何根据评分器输出的概率确定最终的分数值？","在推理阶段，不应直接使用 softmax 后的加权和作为最终分数，因为该值变化范围有限。正确的做法是参考项目代码，提取模型对 6 个可能分数（范围为 1 到 6）的预测概率。通常选择概率最高的那个分数作为该样本的最终质量或复杂度评分。具体的提取逻辑请查看仓库中的评分代码实现部分。",{"id":77,"question_zh":78,"answer_zh":79,"source_url":80},35254,"Repr Filter 部分的代码是否会开源？","维护者已确认计划开源该部分代码。当时正在进行代码的清理和重构工作，并计划在当周发布。建议关注仓库的最新更新以获取完整代码。","https:\u002F\u002Fgithub.com\u002Fhkust-nlp\u002Fdeita\u002Fissues\u002F2",[],[83,94,102,110,118,130],{"id":84,"name":85,"github_repo":86,"description_zh":87,"stars":88,"difficulty_score":30,"last_commit_at":89,"category_tags":90,"status":48},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,"2026-04-06T06:32:30",[91,92,93,40],"Agent","开发框架","图像",{"id":95,"name":96,"github_repo":97,"description_zh":98,"stars":99,"difficulty_score":47,"last_commit_at":100,"category_tags":101,"status":48},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 
代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",157379,"2026-04-15T23:32:42",[92,91,41],{"id":103,"name":104,"github_repo":105,"description_zh":106,"stars":107,"difficulty_score":30,"last_commit_at":108,"category_tags":109,"status":48},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch 是一个基于 PyTorch 的开源教育项目，旨在引导用户从零开始一步步构建一个类似 ChatGPT 的大型语言模型（LLM）。它不仅是同名技术著作的官方代码库，更提供了一套完整的实践方案，涵盖模型开发、预训练及微调的全过程。\n\n该项目主要解决了大模型领域“黑盒化”的学习痛点。许多开发者虽能调用现成模型，却难以深入理解其内部架构与训练机制。通过亲手编写每一行核心代码，用户能够透彻掌握 Transformer 架构、注意力机制等关键原理，从而真正理解大模型是如何“思考”的。此外，项目还包含了加载大型预训练权重进行微调的代码，帮助用户将理论知识延伸至实际应用。\n\nLLMs-from-scratch 特别适合希望深入底层原理的 AI 开发者、研究人员以及计算机专业的学生。对于不满足于仅使用 API，而是渴望探究模型构建细节的技术人员而言，这是极佳的学习资源。其独特的技术亮点在于“循序渐进”的教学设计：将复杂的系统工程拆解为清晰的步骤，配合详细的图表与示例，让构建一个虽小但功能完备的大模型变得触手可及。无论你是想夯实理论基础，还是为未来研发更大规模的模型做准备",90106,"2026-04-06T11:19:32",[41,93,91,92],{"id":111,"name":112,"github_repo":113,"description_zh":114,"stars":115,"difficulty_score":47,"last_commit_at":116,"category_tags":117,"status":48},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 
提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[92,41],{"id":119,"name":120,"github_repo":121,"description_zh":122,"stars":123,"difficulty_score":47,"last_commit_at":124,"category_tags":125,"status":48},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",85092,"2026-04-10T11:13:16",[93,40,126,127,91,128,41,92,129],"视频","插件","其他","音频",{"id":131,"name":132,"github_repo":133,"description_zh":134,"stars":135,"difficulty_score":136,"last_commit_at":137,"category_tags":138,"status":48},5784,"funNLP","fighting41love\u002FfunNLP","funNLP 是一个专为中文自然语言处理（NLP）打造的超级资源库，被誉为\"NLP 民工的乐园”。它并非单一的软件工具，而是一个汇集了海量开源项目、数据集、预训练模型和实用代码的综合性平台。\n\n面对中文 NLP 领域资源分散、入门门槛高以及特定场景数据匮乏的痛点，funNLP 提供了“一站式”解决方案。这里不仅涵盖了分词、命名实体识别、情感分析、文本摘要等基础任务的标准工具，还独特地收录了丰富的垂直领域资源，如法律、医疗、金融行业的专用词库与数据集，甚至包含古诗词生成、歌词创作等趣味应用。其核心亮点在于极高的全面性与实用性，从基础的字典词典到前沿的 BERT、GPT-2 模型代码，再到高质量的标注数据和竞赛方案，应有尽有。\n\n无论是刚刚踏入 NLP 领域的学生、需要快速验证想法的算法工程师，还是从事人工智能研究的学者，都能在这里找到急需的“武器弹药”。对于开发者而言，它能大幅减少寻找数据和复现模型的时间；对于研究者，它提供了丰富的基准测试资源和前沿技术参考。funNLP 以开放共享的精神，极大地降低了中文自然语言处理的开发与研究成本，是中文 AI 社区不可或缺的宝藏仓库。",79857,1,"2026-04-08T20:11:31",[41,40,128]]