[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"tool-ictnlp--StreamSpeech":3,"similar-ictnlp--StreamSpeech":126},{"id":4,"github_repo":5,"name":6,"description_en":7,"description_zh":8,"ai_summary_zh":9,"readme_en":10,"readme_zh":11,"quickstart_zh":12,"use_case_zh":13,"hero_image_url":14,"owner_login":15,"owner_name":16,"owner_avatar_url":17,"owner_bio":18,"owner_company":19,"owner_location":19,"owner_email":20,"owner_twitter":19,"owner_website":21,"owner_url":22,"languages":23,"stars":61,"forks":62,"last_commit_at":63,"license":64,"difficulty_score":65,"env_os":66,"env_gpu":67,"env_ram":68,"env_deps":69,"category_tags":76,"github_topics":79,"view_count":100,"oss_zip_url":19,"oss_zip_packed_at":19,"status":101,"created_at":102,"updated_at":103,"faqs":104,"releases":125},6891,"ictnlp\u002FStreamSpeech","StreamSpeech","StreamSpeech is an “All in One” seamless model for offline and simultaneous speech recognition, speech translation and speech synthesis.","StreamSpeech 是一款集语音识别、翻译与合成于一体的全能型开源模型，旨在打破传统语音处理任务间的壁垒。它不仅能高质量完成离线的语音转文字、跨语言翻译及语音合成，更核心地解决了“同声传译”场景下的低延迟难题。通过独特的流式架构，StreamSpeech 能在用户说话的同时实时输出中间结果（如逐句识别文本或翻译内容），无需等待整段语音结束，从而提供流畅自然的即时沟通体验。\n\n该工具的最大亮点在于\"All in One\"的设计理念：仅需一个统一模型，即可无缝切换并支持包括流式识别、同步翻译和实时语音生成在内的八种不同任务，且在离线与同步场景下均达到了业界领先的性能水平。这种多任务学习能力大幅降低了部署复杂度，避免了维护多个独立模型的繁琐。\n\nStreamSpeech 非常适合人工智能研究人员探索多任务学习机制，也适合开发者构建需要低延迟交互的语音应用（如实时会议助手、跨国直播字幕系统等）。对于希望体验前沿语音技术的普通用户，其提供的本地 Web 演示也能直观展示同声传译的魅力。作为 ACL 2024 的研究成果，StreamSpeech 以开源形式共享代码与模型，为推动高效、实时的","StreamSpeech 是一款集语音识别、翻译与合成于一体的全能型开源模型，旨在打破传统语音处理任务间的壁垒。它不仅能高质量完成离线的语音转文字、跨语言翻译及语音合成，更核心地解决了“同声传译”场景下的低延迟难题。通过独特的流式架构，StreamSpeech 能在用户说话的同时实时输出中间结果（如逐句识别文本或翻译内容），无需等待整段语音结束，从而提供流畅自然的即时沟通体验。\n\n该工具的最大亮点在于\"All in One\"的设计理念：仅需一个统一模型，即可无缝切换并支持包括流式识别、同步翻译和实时语音生成在内的八种不同任务，且在离线与同步场景下均达到了业界领先的性能水平。这种多任务学习能力大幅降低了部署复杂度，避免了维护多个独立模型的繁琐。\n\nStreamSpeech 非常适合人工智能研究人员探索多任务学习机制，也适合开发者构建需要低延迟交互的语音应用（如实时会议助手、跨国直播字幕系统等）。对于希望体验前沿语音技术的普通用户，其提供的本地 Web 演示也能直观展示同声传译的魅力。作为 ACL 2024 的研究成果，StreamSpeech 以开源形式共享代码与模型，为推动高效、实时的语音交互技术发展提供了有力支持。","# StreamSpeech\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2406.03049-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03049)\n[![project](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%8E%A7%20Demo-Listen%20to%20StreamSpeech-orange.svg)](https:\u002F\u002Fictnlp.github.io\u002FStreamSpeech-site\u002F)\n[![model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20-StreamSpeech_Models-blue.svg)](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Ftree\u002Fmain)\n[![Hits](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_8313630d44bc.png)](https:\u002F\u002Fhits.seeyoufarm.com)\n\n[![twitter](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter-@Gorden%20Sun-black?logo=X&logoColor=black)](https:\u002F\u002Fx.com\u002FGorden_Sun\u002Fstatus\u002F1798742796524007845) [![twitter](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter-@imxiaohu-black?logo=X&logoColor=black)](https:\u002F\u002Fx.com\u002Fimxiaohu\u002Fstatus\u002F1798999363987124355)\n\n> **Authors**: **[Shaolei Zhang](https:\u002F\u002Fzhangshaolei1998.github.io\u002F), [Qingkai Fang](https:\u002F\u002Ffangqingkai.github.io\u002F), [Shoutao Guo](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=XwHtPyAAAAAJ&hl), [Zhengrui Ma](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=dUgq6tEAAAAJ), [Min Zhang](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=CncXH-YAAAAJ), [Yang 
Feng*](https:\u002F\u002Fpeople.ucas.edu.cn\u002F~yangfeng?language=en)**\n\n\nCode for ACL 2024 paper \"[StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.03049)\".\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_a00bd2c5d61a.png\" alt=\"StreamSpeech\" style=\"width: 70%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  🎧 Listen to \u003Ca href=\"https:\u002F\u002Fictnlp.github.io\u002FStreamSpeech-site\u002F\">StreamSpeech's translated speech\u003C\u002Fa> 🎧 \n\u003C\u002Fp>\n\n💡**Highlights**:\n1. StreamSpeech achieves **SOTA performance** on both offline and simultaneous speech-to-speech translation.\n2. StreamSpeech performs **streaming ASR**, **simultaneous speech-to-text translation** and **simultaneous speech-to-speech translation** via an \"All in One\" seamless model.\n3. StreamSpeech can present intermediate results (i.e., ASR or translation results) during simultaneous translation, offering a more comprehensive low-latency communication experience.\n\n## 🔥News\n- [2025.06.17] We are excited to extend the \"All-in-One\" feature of StreamSpeech to more general multimodal interactions by developing **Stream-Omni**. 👉Refer to the [paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13642), [code & demo](https:\u002F\u002Fgithub.com\u002Fictnlp\u002FStream-Omni) and [model](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002Fstream-omni-8b) for more details.\n  - Stream-Omni is a GPT-4o-like language-vision-speech chatbot that simultaneously supports interactions across any combination of text, vision, and speech modalities.\n  - Stream-Omni can simultaneously produce intermediate textual results (e.g., ASR transcriptions and model responses) during speech interactions, like the advanced voice service of GPT-4o.\n\n- [2024.06.17] Add [Web GUI demo](.\u002Fdemo); now you can experience StreamSpeech in your local browser.\n- [2024.06.05] [Paper](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.03049), [code](https:\u002F\u002Fgithub.com\u002Fictnlp\u002FStreamSpeech), [models](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Ftree\u002Fmain) and [demo](https:\u002F\u002Fictnlp.github.io\u002FStreamSpeech-site\u002F) of StreamSpeech are available!\n\n## ⭐Features\n\n### Support 8 Tasks\n- **Offline**: Speech Recognition (ASR)✅, Speech-to-Text Translation (S2TT)✅, Speech-to-Speech Translation (S2ST)✅, Speech Synthesis (TTS)✅\n- **Simultaneous**: Streaming ASR✅, Simultaneous S2TT✅, Simultaneous S2ST✅, Real-time TTS✅ under any latency (with one model)\n\n### GUI Demo\n\nhttps:\u002F\u002Fgithub.com\u002Fictnlp\u002FStreamSpeech\u002Fassets\u002F34680227\u002F4d9bdabf-af66-4320-ae7d-0f23e721cd71\n\u003Cp align=\"center\">\n  Simultaneously provide ASR, translation, and synthesis results via a seamless model\n\u003C\u002Fp>\n\n### Case\n\n> **Speech Input**: [example\u002Fwavs\u002Fcommon_voice_fr_17301936.mp3](.\u002Fexample\u002Fwavs\u002Fcommon_voice_fr_17301936.mp3)\n>\n> **Transcription** (ground truth): jai donc lexpérience des années passées jen dirai un mot tout à lheure\n>\n> **Translation** (ground truth): i therefore have the experience of the passed years i'll say a few words about that later\n\n| StreamSpeech                                    | Simultaneous                                                 | Offline                               
                       |\n| ----------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |\n| **Speech Recognition**                          | jai donc expérience des années passé jen dirairai un mot tout à lheure | jai donc lexpérience des années passé jen dirairai un mot tout à lheure |\n| **Speech-to-Text Translation**                  | i therefore have an experience of last years i will tell a word later | so i have the experience in the past years i'll say a word later |\n| **Speech-to-Speech Translation**                | \u003Cvideo src='https:\u002F\u002Fgithub.com\u002Fzhangshaolei1998\u002FStreamSpeech_dev\u002Fassets\u002F34680227\u002Fed41ba13-353b-489b-acfa-85563d0cc2cb' width=\"30%\"\u002F>                          | \u003Cvideo src='https:\u002F\u002Fgithub.com\u002Fzhangshaolei1998\u002FStreamSpeech_dev\u002Fassets\u002F34680227\u002Fca482ba6-76da-4619-9dfd-24aa2eb3339a' width=\"30%\"\u002F>                          |\n| **Text-to-Speech Synthesis** (*incrementally synthesize speech word by word*) | \u003Cvideo src='https:\u002F\u002Fgithub.com\u002Fzhangshaolei1998\u002FStreamSpeech_dev\u002Fassets\u002F34680227\u002F294f1310-eace-4914-be30-5cd798e8592e' width=\"30%\"\u002F>                          | \u003Cvideo src='https:\u002F\u002Fgithub.com\u002Fzhangshaolei1998\u002FStreamSpeech_dev\u002Fassets\u002F34680227\u002F52854163-7fc5-4622-a5a6-c133cbd99e58' width=\"30%\"\u002F>                          |\n\n\n\n## ⚙Requirements\n\n- Python == 3.10, PyTorch == 2.0.1, Install fairseq & SimulEval\n\n  ```bash\n  cd fairseq\n  pip install --editable .\u002F --no-build-isolation\n  cd SimulEval\n  pip install --editable .\u002F\n  ```\n\n## 🚀Quick Start\n\n### 1. 
Model Download\n\n#### (1) StreamSpeech Models\n\n| Language | UnitY                                                        | StreamSpeech (offline)                                       | StreamSpeech (simultaneous)                                  |\n| -------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |\n| Fr-En    | unity.fr-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Funity.fr-en.pt)] [[Baidu](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F10uGYgl0xTej9FP43iKx7Cg?pwd=nkvu)] | streamspeech.offline.fr-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.offline.fr-en.pt)] [[Baidu](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1GFckHGP5SNLuOEj6mbIWhQ?pwd=pwgq)] | streamspeech.simultaneous.fr-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.simultaneous.fr-en.pt)] [[Baidu](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1edCPFljogyDHgGXkUV8_3w?pwd=8gg3)] |\n| Es-En    | unity.es-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Funity.es-en.pt)] [[Baidu](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1RwIEHye8jjw3kiIgrCHA3A?pwd=hde4)] | streamspeech.offline.es-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.offline.es-en.pt)] [[Baidu](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1T89G4NC4J0Ofzcsc8Rt2Ww?pwd=yuhd)] | streamspeech.simultaneous.es-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.simultaneous.es-en.pt)] [[Baidu](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1NbLEVcYWHIdqqLD17P1s9g?pwd=p1pc)] |\n| De-En    | unity.de-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Funity.de-en.pt)] [[Baidu](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1Mg_PBeZ5acEDhl5wRJ_-7w?pwd=egvv)] | streamspeech.offline.de-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.offline.de-en.pt)] [[Baidu](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1mTE4eHuVLJPB7Yg9AackEg?pwd=6ga8)] | streamspeech.simultaneous.de-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.simultaneous.de-en.pt)] [[Baidu](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1DYPMg3mdDopLY70BYQTduQ?pwd=r7kw)] |\n\n#### (2) Unit-based HiFi-GAN Vocoder\n\n| Unit config       | Unit size | Vocoder language | Dataset                                             | Model                                                        |\n| ----------------- | --------- | ---------------- | --------------------------------------------------- | ------------------------------------------------------------ |\n| mHuBERT, layer 11 | 1000      | En               | [LJSpeech](https:\u002F\u002Fkeithito.com\u002FLJ-Speech-Dataset\u002F) | [ckpt](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Ffairseq\u002Fspeech_to_speech\u002Fvocoder\u002Fcode_hifigan\u002Fmhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj\u002Fg_00500000), 
[config](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Ffairseq\u002Fspeech_to_speech\u002Fvocoder\u002Fcode_hifigan\u002Fmhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj\u002Fconfig.json) |\n\n### 2. Prepare Data and Config (only for test\u002Finference)\n\n#### (1) Config Files\n\nReplace `\u002Fdata\u002Fzhangshaolei\u002FStreamSpeech` in files [configs\u002Ffr-en\u002Fconfig_gcmvn.yaml](.\u002Fconfigs\u002Ffr-en\u002Fconfig_gcmvn.yaml) and [configs\u002Ffr-en\u002Fconfig_mtl_asr_st_ctcst.yaml](.\u002Fconfigs\u002Ffr-en\u002Fconfig_mtl_asr_st_ctcst.yaml) with the local path of your StreamSpeech repo.\n\n#### (2) Test Data\n\nPrepare test data following the [SimulEval](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSimulEval) format. [example\u002F](.\u002Fexample) provides an example:\n\n- [wav_list.txt](.\u002Fexample\u002Fwav_list.txt): Each line records the path of a source speech file.\n- [target.txt](.\u002Fexample\u002Ftarget.txt): Each line records the reference text, e.g., the target translation or source transcription (used to calculate the metrics).\n\n### 3. Inference with SimulEval\n\nRun these scripts to perform inference with StreamSpeech on streaming ASR, simultaneous S2TT, and simultaneous S2ST.\n\n> `--source-segment-size`: set the chunk size (in milliseconds) to any value to control the latency\n\n\u003Cdetails>\n\u003Csummary>Simultaneous Speech-to-Speech Translation\u003C\u002Fsummary>\n\n`--output-asr-translation`: whether to output the intermediate ASR and translated text results during simultaneous speech-to-speech translation.\n\n```shell\nexport CUDA_VISIBLE_DEVICES=0\n\nROOT=\u002Fdata\u002Fzhangshaolei\u002FStreamSpeech # path to StreamSpeech repo\nPRETRAIN_ROOT=\u002Fdata\u002Fzhangshaolei\u002Fpretrain_models \nVOCODER_CKPT=$PRETRAIN_ROOT\u002Funit-based_HiFi-GAN_vocoder\u002FmHuBERT.layer11.km1000.en\u002Fg_00500000 # path to downloaded Unit-based HiFi-GAN Vocoder checkpoint\nVOCODER_CFG=$PRETRAIN_ROOT\u002Funit-based_HiFi-GAN_vocoder\u002FmHuBERT.layer11.km1000.en\u002Fconfig.json # path to downloaded Unit-based HiFi-GAN Vocoder config\n\nLANG=fr\nfile=streamspeech.simultaneous.${LANG}-en.pt # path to downloaded StreamSpeech model\noutput_dir=$ROOT\u002Fres\u002Fstreamspeech.simultaneous.${LANG}-en\u002Fsimul-s2st\n\nchunk_size=320 #ms\nPYTHONPATH=$ROOT\u002Ffairseq simuleval --data-bin ${ROOT}\u002Fconfigs\u002F${LANG}-en \\\n    --user-dir ${ROOT}\u002Fresearches\u002Fctc_unity --agent-dir ${ROOT}\u002Fagent \\\n    --source example\u002Fwav_list.txt --target example\u002Ftarget.txt \\\n    --model-path $file \\\n    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \\\n    --agent $ROOT\u002Fagent\u002Fspeech_to_speech.streamspeech.agent.py \\\n    --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG --dur-prediction \\\n    --output $output_dir\u002Fchunk_size=$chunk_size \\\n    --source-segment-size $chunk_size \\\n    --quality-metrics ASR_BLEU  --target-speech-lang en --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks DiscontinuitySum DiscontinuityAve DiscontinuityNum RTF \\\n    --device gpu --computation-aware \\\n    --output-asr-translation True\n```\n\nYou should get the following outputs:\n\n```\nfairseq plugins loaded...\nfairseq plugins loaded...\nfairseq plugins loaded...\nfairseq plugins loaded...\n2024-06-06 09:45:46 | INFO     | fairseq.tasks.speech_to_speech | dictionary size: 1,004\nimport agents...\nRemoving weight norm...\n2024-06-06 09:45:50 | INFO     | agent.tts.vocoder | loaded CodeHiFiGAN checkpoint from 
\u002Fdata\u002Fzhangshaolei\u002Fpretrain_models\u002Funit-based_HiFi-GAN_vocoder\u002FmHuBERT.layer11.km1000.en\u002Fg_00500000\n2024-06-06 09:45:50 | INFO     | simuleval.utils.agent | System will run on device: gpu.\n2024-06-06 09:45:50 | INFO     | simuleval.dataloader | Evaluating from speech to speech.\n  0%|                                                                                                                                                                              | 0\u002F2 [00:00\u003C?, ?it\u002Fs]\nStreaming ASR: \nStreaming ASR: \nStreaming ASR: je\nSimultaneous translation: i would\nStreaming ASR: je voudrais\nSimultaneous translation: i would like to\nStreaming ASR: je voudrais soumettre\nSimultaneous translation: i would like to sub\nStreaming ASR: je voudrais soumettre cette\nSimultaneous translation: i would like to submit\nStreaming ASR: je voudrais soumettre cette idée\nSimultaneous translation: i would like to submit this\nStreaming ASR: je voudrais soumettre cette idée à la\nSimultaneous translation: i would like to submit this idea to\nStreaming ASR: je voudrais soumettre cette idée à la réflexion\nSimultaneous translation: i would like to submit this idea to the\nStreaming ASR: je voudrais soumettre cette idée à la réflexion de\nSimultaneous translation: i would like to submit this idea to the reflection\nStreaming ASR: je voudrais soumettre cette idée à la réflexion de lassemblée\nSimultaneous translation: i would like to submit this idea to the reflection of\nStreaming ASR: je voudrais soumettre cette idée à la réflexion de lassemblée nationale\nSimultaneous translation: i would like to submit this idea to the reflection of the\nStreaming ASR: je voudrais soumettre cette idée à la réflexion de lassemblée nationale\nSimultaneous translation: i would like to submit this idea to the reflection of the national assembly\n 50%|███████████████████████████████████████████████████████████████████████████████████                                                                                   | 1\u002F2 [00:04\u003C00:04,  4.08s\u002Fit]\nStreaming ASR: \nStreaming ASR: \nStreaming ASR: \nStreaming ASR: \nStreaming ASR: jai donc\nSimultaneous translation: i therefore\nStreaming ASR: jai donc\nStreaming ASR: jai donc expérience des\nSimultaneous translation: i therefore have an experience\nStreaming ASR: jai donc expérience des années\nStreaming ASR: jai donc expérience des années passé\nSimultaneous translation: i therefore have an experience of last\nStreaming ASR: jai donc expérience des années passé jen\nSimultaneous translation: i therefore have an experience of last years\nStreaming ASR: jai donc expérience des années passé jen dirairai\nSimultaneous translation: i therefore have an experience of last years i will\nStreaming ASR: jai donc expérience des années passé jen dirairai un mot\nSimultaneous translation: i therefore have an experience of last years i will tell a\nStreaming ASR: jai donc expérience des années passé jen dirairai un mot tout à lheure\nSimultaneous translation: i therefore have an experience of last years i will tell a word\nStreaming ASR: jai donc expérience des années passé jen dirairai un mot tout à lheure\nSimultaneous translation: i therefore have an experience of last years i will tell a word later\n100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2\u002F2 [00:06\u003C00:00,  3.02s\u002Fit]\n2024-06-06 
09:45:56 | WARNING  | simuleval.scorer.asr_bleu | Beta feature: Evaluating speech output. Faieseq is required.\n2024-06-06 09:46:12 | INFO | fairseq.tasks.audio_finetuning | Using dict_path : \u002Fdata\u002Fzhangshaolei\u002F.cache\u002Fust_asr\u002Fen\u002Fdict.ltr.txt\nTranscribing predictions: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2\u002F2 [00:01\u003C00:00,  1.63it\u002Fs]\n2024-06-06 09:46:21 | INFO     | simuleval.sentence_level_evaluator | Results:\n ASR_BLEU       AL    AL_CA    AP  AP_CA      DAL  DAL_CA  StartOffset  StartOffset_CA  EndOffset  EndOffset_CA     LAAL  LAAL_CA      ATD   ATD_CA  NumChunks  NumChunks_CA  DiscontinuitySum  DiscontinuitySum_CA  DiscontinuityAve  DiscontinuityAve_CA  DiscontinuityNum  DiscontinuityNum_CA   RTF  RTF_CA\n   15.448 1724.895 2913.508 0.425  0.776 1358.812 3137.55       1280.0        2213.906     1366.0        1366.0 1724.895 2913.508 1440.146 3389.374        9.5           9.5             110.0                110.0              55.0                 55.0                 1                    1 1.326   1.326\n\n```\n\nLogs and evaluation results are stored in `$output_dir\u002Fchunk_size=$chunk_size`:\n\n```\n$output_dir\u002Fchunk_size=$chunk_size\n├── wavs\u002F\n│   ├── 0_pred.wav # generated speech\n│   ├── 1_pred.wav \n│   ├── 0_pred.txt # asr transcription for ASR-BLEU toolkit\n│   ├── 1_pred.txt \n├── config.yaml\n├── asr_transcripts.txt # ASR-BLEU transcription results\n├── metrics.tsv\n├── scores.tsv\n├── asr_cmd.bash\n└── instances.log # logs of Simul-S2ST\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>Simultaneous Speech-to-Text Translation\u003C\u002Fsummary>\n\n```shell\nexport CUDA_VISIBLE_DEVICES=0\n\nROOT=\u002Fdata\u002Fzhangshaolei\u002FStreamSpeech # path to StreamSpeech repo\n\nLANG=fr\nfile=streamspeech.simultaneous.${LANG}-en.pt # path to downloaded StreamSpeech model\noutput_dir=$ROOT\u002Fres\u002Fstreamspeech.simultaneous.${LANG}-en\u002Fsimul-s2tt\n\nchunk_size=320 #ms\nPYTHONPATH=$ROOT\u002Ffairseq simuleval --data-bin ${ROOT}\u002Fconfigs\u002F${LANG}-en \\\n    --user-dir ${ROOT}\u002Fresearches\u002Fctc_unity --agent-dir ${ROOT}\u002Fagent \\\n    --source example\u002Fwav_list.txt --target example\u002Ftarget.txt \\\n    --model-path $file \\\n    --config-yaml config_gcmvn.yaml 
--multitask-config-yaml config_mtl_asr_st_ctcst.yaml \\\n    --agent $ROOT\u002Fagent\u002Fspeech_to_text.asr.streamspeech.agent.py\\\n    --output $output_dir\u002Fchunk_size=$chunk_size \\\n    --source-segment-size $chunk_size \\\n    --quality-metrics BLEU  --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks RTF \\\n    --device gpu --computation-aware \n```\n\u003C\u002Fdetails>\n\n## 🎈Develop Your Own StreamSpeech\n\n### 1. Data Preprocess\n\n- Follow [`.\u002Fpreprocess_scripts`](.\u002Fpreprocess_scripts) to process CVSS-C data. \n\n### 2. Training\n\n> [!Note]\n> You can directly use the [downloaded StreamSpeech model](#1-model-download) for evaluation and skip training.\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_b4818b9064ac.png\" alt=\"model\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\n- Follow [`researches\u002Fctc_unity\u002Ftrain_scripts\u002Ftrain.simul-s2st.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftrain_scripts\u002Ftrain.simul-s2st.sh) to train StreamSpeech for simultaneous speech-to-speech translation.\n- Follow [`researches\u002Fctc_unity\u002Ftrain_scripts\u002Ftrain.offline-s2st.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftrain_scripts\u002Ftrain.offline-s2st.sh) to train StreamSpeech for offline speech-to-speech translation.\n- We also provide some other StreamSpeech variants and baseline implementations.\n\n| Model             | --user-dir                 | --arch                            | Description                                                  |\n| ----------------- | -------------------------- | --------------------------------- | ------------------------------------------------------------ |\n| **Translatotron 2** | `researches\u002Ftranslatotron` | `s2spect2_conformer_modified`     | [Translatotron 2](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fjia22b.html) |\n| **UnitY**         | `researches\u002Ftranslatotron` | `unity_conformer_modified`        | [UnitY](https:\u002F\u002Faclanthology.org\u002F2023.acl-long.872\u002F)         |\n| **Uni-UnitY**     | `researches\u002Funi_unity`     | `uni_unity_conformer`             | Change all encoders in UnitY into unidirectional             |\n| **Chunk-UnitY**   | `researches\u002Fchunk_unity`   | `chunk_unity_conformer`           | Change the Conformer in UnitY into Chunk-based Conformer     |\n| **StreamSpeech**  | `researches\u002Fctc_unity`     | `streamspeech`                    | StreamSpeech                                                 |\n| **StreamSpeech (cascade)** | `researches\u002Fctc_unity` | `streamspeech_cascade` | Cascaded StreamSpeech of S2TT and TTS. TTS module can be used independently for real-time TTS given incremental text. 
|\n| **HMT**           | `researches\u002Fhmt`           | `hmt_transformer_iwslt_de_en`     | [HMT](https:\u002F\u002Fopenreview.net\u002Fforum?id=9y0HFvaAYD6): strong simultaneous text-to-text translation method |\n| **DiSeg**         | `researches\u002Fdiseg`         | `convtransformer_espnet_base_seg` | [DiSeg](https:\u002F\u002Faclanthology.org\u002F2023.findings-acl.485\u002F): strong simultaneous speech-to-text translation method |\n\n> [!Tip]\n> The `train_scripts\u002F` and `test_scripts\u002F` in the `--user-dir` directory give the training and testing scripts for each model.\n> Refer to the official repos of [UnitY](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Fblob\u002Fmain\u002Ffairseq\u002Fmodels\u002Fspeech_to_speech\u002Fs2s_conformer_unity.py), [Translatotron 2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Fblob\u002Fmain\u002Ffairseq\u002Fmodels\u002Fspeech_to_speech\u002Fs2s_conformer_translatotron2.py), [HMT](https:\u002F\u002Fgithub.com\u002Fictnlp\u002FHMT) and [DiSeg](https:\u002F\u002Fgithub.com\u002Fictnlp\u002FDiSeg) for more details.\n\n### 3. Evaluation\n\n#### (1) Offline Evaluation\n\nFollow [`pred.offline-s2st.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftest_scripts\u002Fpred.offline-s2st.sh) to evaluate the offline performance of StreamSpeech on ASR, S2TT and S2ST.\n\n#### (2) Simultaneous Evaluation\n\nA trained StreamSpeech model can be used for streaming ASR, simultaneous speech-to-text translation and simultaneous speech-to-speech translation. We provide [agent\u002F](.\u002Fagent) for these three tasks:\n\n- `agent\u002Fspeech_to_speech.streamspeech.agent.py`: simultaneous speech-to-speech translation\n- `agent\u002Fspeech_to_text.s2tt.streamspeech.agent.py`: simultaneous speech-to-text translation\n- `agent\u002Fspeech_to_text.asr.streamspeech.agent.py`: streaming ASR\n\nFollow [`simuleval.simul-s2st.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftest_scripts\u002Fsimuleval.simul-s2st.sh), [`simuleval.simul-s2tt.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftest_scripts\u002Fsimuleval.simul-s2tt.sh) and [`simuleval.streaming-asr.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftest_scripts\u002Fsimuleval.streaming-asr.sh) to evaluate StreamSpeech.\n\n### 4. 
Our Results\n\nOur project page ([https:\u002F\u002Fictnlp.github.io\u002FStreamSpeech-site\u002F](https:\u002F\u002Fictnlp.github.io\u002FStreamSpeech-site\u002F)) provides some translated speech generated by StreamSpeech; listen to it 🎧.\n\n#### (1) Offline Speech-to-Speech Translation  ( ASR-BLEU: quality )\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_26a9c3a154b0.png\" alt=\"offline\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\n#### (2) Simultaneous Speech-to-Speech Translation  ( AL: latency  |  ASR-BLEU: quality )\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_3d54672dcd23.png\" alt=\"simul\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\n#### (3) Simultaneous Speech-to-Text Translation  ( AL: latency  |  BLEU: quality )\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_0f57e3d3799e.png\" alt=\"simul\" style=\"width: 38%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\n#### (4) Streaming ASR  ( AL: latency  |  WER: quality )\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_bf4403841e05.png\" alt=\"simul\" style=\"width: 50%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\n## 🖋Citation\n\nIf you have any questions, please feel free to submit an issue or contact `zhangshaolei20z@ict.ac.cn`.\n\nIf our work is useful for you, please cite as:\n\n```\n@inproceedings{streamspeech,\n      title={StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning}, \n      author={Shaolei Zhang and Qingkai Fang and Shoutao Guo and Zhengrui Ma and Min Zhang and Yang Feng},\n      year={2024},\n      booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Long Papers)},\n      publisher = {Association for Computational Linguistics}\n}\n```\n","# StreamSpeech\n\n[![arXiv](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FarXiv-2406.03049-b31b1b.svg?logo=arXiv)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2406.03049)\n[![project](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%8E%A7%20Demo-Listen%20to%20StreamSpeech-orange.svg)](https:\u002F\u002Fictnlp.github.io\u002FStreamSpeech-site\u002F)\n[![model](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F%F0%9F%A4%97%20-StreamSpeech_Models-blue.svg)](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Ftree\u002Fmain)\n[![Hits](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_8313630d44bc.png)](https:\u002F\u002Fhits.seeyoufarm.com)\n\n[![twitter](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter-@Gorden%20Sun-black?logo=X&logoColor=black)](https:\u002F\u002Fx.com\u002FGorden_Sun\u002Fstatus\u002F1798742796524007845) [![twitter](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FTwitter-@imxiaohu-black?logo=X&logoColor=black)](https:\u002F\u002Fx.com\u002Fimxiaohu\u002Fstatus\u002F1798999363987124355)\n\n> **作者**: 
**[张绍雷](https:\u002F\u002Fzhangshaolei1998.github.io\u002F)、[方庆凯](https:\u002F\u002Ffangqingkai.github.io\u002F)、[郭守涛](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=XwHtPyAAAAAJ&hl)、[马正睿](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=dUgq6tEAAAAJ)、[张敏](https:\u002F\u002Fscholar.google.com.hk\u002Fcitations?user=CncXH-YAAAAJ)、[杨峰*](https:\u002F\u002Fpeople.ucas.edu.cn\u002F~yangfeng?language=en)**\n\n\n用于ACL 2024论文“StreamSpeech: 基于多任务学习的同步语音到语音翻译”的代码。\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_a00bd2c5d61a.png\" alt=\"StreamSpeech\" style=\"width: 70%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\u003Cp align=\"center\">\n  🎧 请收听 \u003Ca href=\"https:\u002F\u002Fictnlp.github.io\u002FStreamSpeech-site\u002F\">StreamSpeech 的译文语音\u003C\u002Fa> 🎧 \n\u003C\u002Fp>\n\n💡**亮点**:\n1. StreamSpeech 在离线和同步语音到语音翻译任务上均达到 **SOTA 性能**。\n2. StreamSpeech 通过一个“一体化”无缝模型，同时实现 **流式 ASR**、**同步语音到文本翻译** 和 **同步语音到语音翻译**。\n3. 在同步翻译过程中，StreamSpeech 能够实时呈现中间结果（如 ASR 或翻译结果），从而提供更全面的低延迟沟通体验。\n\n## 🔥新闻\n- [2025.06.17] 我们很高兴将 StreamSpeech 的“一体化”功能扩展到更广泛的多模态交互中，开发出 **Stream-Omni**。👉详情请参阅 [论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2506.13642)、[代码与演示](https:\u002F\u002Fgithub.com\u002Fictnlp\u002FStream-Omni)、[模型](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002Fstream-omni-8b)。\n  - Stream-Omni 是一款类似 GPT-4o 的语言-视觉-语音聊天机器人，能够同时支持文本、视觉和语音任意组合的交互。\n  - 在语音交互过程中，Stream-Omni 可以像 GPT-4o 的高级语音服务一样，同时生成中间文本结果（如 ASR 转录和模型响应）。\n\n- [2024.06.17] 添加了 [Web GUI 演示](.\u002Fdemo)，现在您可以在本地浏览器中体验 StreamSpeech。\n- [2024.06.05] StreamSpeech 的 [论文](https:\u002F\u002Farxiv.org\u002Fpdf\u002F2406.03049)、[代码](https:\u002F\u002Fgithub.com\u002Fictnlp\u002FStreamSpeech)、[模型](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Ftree\u002Fmain) 和 [演示](https:\u002F\u002Fictnlp.github.io\u002FStreamSpeech-site\u002F) 已经发布！\n\n## ⭐功能\n\n### 支持 8 项任务\n- **离线**: 语音识别 (ASR)✅、语音到文本翻译 (S2TT)✅、语音到语音翻译 (S2ST)✅、语音合成 (TTS)✅\n- **同步**: 流式 ASR✅、同步 S2TT✅、同步 S2ST✅、实时 TTS✅，所有这些都由同一个模型完成，且可在任意延迟下运行。\n\n### GUI 演示\n\nhttps:\u002F\u002Fgithub.com\u002Fictnlp\u002FStreamSpeech\u002Fassets\u002F34680227\u002F4d9bdabf-af66-4320-ae7d-0f23e721cd71\n\u003Cp align=\"center\">\n  通过一个无缝模型同时提供 ASR、翻译和合成结果\n\u003C\u002Fp>\n\n### 示例\n\n> **语音输入**: [example\u002Fwavs\u002Fcommon_voice_fr_17301936.mp3](.\u002Fexample\u002Fwavs\u002Fcommon_voice_fr_17301936.mp3)\n>\n> **转录**（真实值）：jai donc lexpérience des années passées jen dirai un mot tout à lheure\n>\n> **翻译**（真实值）：i therefore have the experience of the passed years i'll say a few words about that later\n\n| StreamSpeech                                    | 同步                                                 | 离线                                                      |\n| ----------------------------------------------- | ------------------------------------------------------------ | ------------------------------------------------------------ |\n| **语音识别**                          | jai donc expérience des années passé jen dirairai un mot tout à lheure | jai donc lexpérience des années passé jen dirairai un mot tout à lheure |\n| **语音到文本翻译**                  | i therefore have an experience of last years i will tell a word later | so i have the experience in the past years i'll say a word later |\n| **语音到语音翻译**                | \u003Cvideo 
src='https:\u002F\u002Fgithub.com\u002Fzhangshaolei1998\u002FStreamSpeech_dev\u002Fassets\u002F34680227\u002Fed41ba13-353b-489b-acfa-85563d0cc2cb' width=\"30%\"\u002F>                          | \u003Cvideo src='https:\u002F\u002Fgithub.com\u002Fzhangshaolei1998\u002FStreamSpeech_dev\u002Fassets\u002F34680227\u002Fca482ba6-76da-4619-9dfd-24aa2eb3339a' width=\"30%\"\u002F>                          |\n| **文本到语音合成** (*逐字增量合成语音*) | \u003Cvideo src='https:\u002F\u002Fgithub.com\u002Fzhangshaolei1998\u002FStreamSpeech_dev\u002Fassets\u002F34680227\u002F294f1310-eace-4914-be30-5cd798e8592e' width=\"30%\"\u002F>                          | \u003Cvideo src='https:\u002F\u002Fgithub.com\u002Fzhangshaolei1998\u002FStreamSpeech_dev\u002Fassets\u002F34680227\u002F52854163-7fc5-4622-a5a6-c133cbd99e58' width=\"30%\"\u002F>                          |\n\n\n\n## ⚙要求\n\n- Python == 3.10, PyTorch == 2.0.1, 安装 fairseq 和 SimulEval\n\n  ```bash\n  cd fairseq\n  pip install --editable .\u002F --no-build-isolation\n  cd SimulEval\n  pip install --editable .\u002F\n  ```\n\n## 🚀快速入门\n\n### 1. 模型下载\n\n#### (1) StreamSpeech 模型\n\n| 语言 | UnitY                                                        | StreamSpeech（离线）                                       | StreamSpeech（同传）                                  |\n| -------- | ------------------------------------------------------------ | ------------------------------------------------------------ | ------------------------------------------------------------ |\n| 法-英    | unity.fr-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Funity.fr-en.pt)] [[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F10uGYgl0xTej9FP43iKx7Cg?pwd=nkvu)] | streamspeech.offline.fr-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.offline.fr-en.pt)] [[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1GFckHGP5SNLuOEj6mbIWhQ?pwd=pwgq)] | streamspeech.simultaneous.fr-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.simultaneous.fr-en.pt)] [[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1edCPFljogyDHgGXkUV8_3w?pwd=8gg3)] |\n| 西-英    | unity.es-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Funity.es-en.pt)] [[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1RwIEHye8jjw3kiIgrCHA3A?pwd=hde4)] | streamspeech.offline.es-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.offline.es-en.pt)] [[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1T89G4NC4J0Ofzcsc8Rt2Ww?pwd=yuhd)] | streamspeech.simultaneous.es-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.simultaneous.es-en.pt)] [[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1NbLEVcYWHIdqqLD17P1s9g?pwd=p1pc)] |\n| 德-英    | unity.de-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Funity.de-en.pt)] [[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1Mg_PBeZ5acEDhl5wRJ_-7w?pwd=egvv)] | streamspeech.offline.de-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.offline.de-en.pt)] [[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1mTE4eHuVLJPB7Yg9AackEg?pwd=6ga8)] | 
streamspeech.simultaneous.de-en.pt [[Huggingface](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.simultaneous.de-en.pt)] [[百度网盘](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1DYPMg3mdDopLY70BYQTduQ?pwd=r7kw)] |\n\n#### (2) 基于Unit的HiFi-GAN声码器\n\n| Unit配置       | Unit大小 | 声码器语言 | 数据集                                             | 模型                                                        |\n| ----------------- | --------- | ---------------- | --------------------------------------------------- | ------------------------------------------------------------ |\n| mHuBERT, 第11层 | 1000      | 英语               | [LJSpeech](https:\u002F\u002Fkeithito.com\u002FLJ-Speech-Dataset\u002F) | [检查点](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Ffairseq\u002Fspeech_to_speech\u002Fvocoder\u002Fcode_hifigan\u002Fmhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj\u002Fg_00500000), [配置文件](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Ffairseq\u002Fspeech_to_speech\u002Fvocoder\u002Fcode_hifigan\u002Fmhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj\u002Fconfig.json) |\n\n### 2. 准备数据与配置（仅用于测试\u002F推理）\n\n#### (1) 配置文件\n\n将文件 [configs\u002Ffr-en\u002Fconfig_gcmvn.yaml](.\u002Fconfigs\u002Ffr-en\u002Fconfig_gcmvn.yaml) 和 [configs\u002Ffr-en\u002Fconfig_mtl_asr_st_ctcst.yaml](.\u002Fconfigs\u002Ffr-en\u002Fconfig_mtl_asr_st_ctcst.yaml) 中的 `\u002Fdata\u002Fzhangshaolei\u002FStreamSpeech` 替换为本地 StreamSpeech 仓库的路径。\n\n#### (2) 测试数据\n\n按照 [SimulEval](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002FSimulEval) 格式准备测试数据。[example\u002F](.\u002Fexample) 提供了一个示例：\n\n- [wav_list.txt](.\u002Fexample\u002Fwav_list.txt)：每行记录一个源语音文件的路径。\n- [target.txt](.\u002Fexample\u002Ftarget.txt)：每行记录参考文本，例如目标翻译或源语音转录（用于计算指标）。\n\n### 3. 使用 SimulEval 进行推理\n\n运行这些脚本，即可对 StreamSpeech 进行流式 ASR、同传 S2TT 和同传 S2ST 推理。\n\n> `--source-segment-size`：设置分块大小（毫秒），以控制延迟。\n\n\u003Cdetails>\n\u003Csummary>同传语音到语音翻译\u003C\u002Fsummary>\n\n`--output-asr-translation`：是否在同传语音到语音翻译过程中输出中间的 ASR 结果和翻译文本。\n\n```shell\nexport CUDA_VISIBLE_DEVICES=0\n\nROOT=\u002Fdata\u002Fzhangshaolei\u002FStreamSpeech # StreamSpeech 仓库路径\nPRETRAIN_ROOT=\u002Fdata\u002Fzhangshaolei\u002Fpretrain_models \nVOCODER_CKPT=$PRETRAIN_ROOT\u002Funit-based_HiFi-GAN_vocoder\u002FmHuBERT.layer11.km1000.en\u002Fg_00500000 # 下载的基于Unit的HiFi-GAN声码器路径\nVOCODER_CFG=$PRETRAIN_ROOT\u002Funit-based_HiFi-GAN_vocoder\u002FmHuBERT.layer11.km1000.en\u002Fconfig.json # 下载的基于Unit的HiFi-GAN声码器配置文件路径\n\nLANG=fr\nfile=streamspeech.simultaneous.${LANG}-en.pt # 下载的StreamSpeech模型路径\noutput_dir=$ROOT\u002Fres\u002Fstreamspeech.simultaneous.${LANG}-en\u002Fsimul-s2st\n\nchunk_size=320 #ms\nPYTHONPATH=$ROOT\u002Ffairseq simuleval --data-bin ${ROOT}\u002Fconfigs\u002F${LANG}-en \\\n    --user-dir ${ROOT}\u002Fresearches\u002Fctc_unity --agent-dir ${ROOT}\u002Fagent \\\n    --source example\u002Fwav_list.txt --target example\u002Ftarget.txt \\\n    --model-path $file \\\n    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \\\n    --agent $ROOT\u002Fagent\u002Fspeech_to_speech.streamspeech.agent.py \\\n    --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG --dur-prediction \\\n    --output $output_dir\u002Fchunk_size=$chunk_size \\\n    --source-segment-size $chunk_size \\\n    --quality-metrics ASR_BLEU  --target-speech-lang en --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks DiscontinuitySum DiscontinuityAve DiscontinuityNum RTF \\\n    --device gpu --computation-aware \\\n    --output-asr-translation 
True\n```\n\n您应该会得到以下输出：\n\n```\nfairseq 插件已加载...\nfairseq 插件已加载...\nfairseq 插件已加载...\nfairseq 插件已加载...\n2024-06-06 09:45:46 | INFO     | fairseq.tasks.speech_to_speech | 词典大小：1,004\n导入代理...\n移除权重归一化...\n2024-06-06 09:45:50 | INFO     | agent.tts.vocoder | 从 \u002Fdata\u002Fzhangshaolei\u002Fpretrain_models\u002Funit-based_HiFi-GAN_vocoder\u002FmHuBERT.layer11.km1000.en\u002Fg_00500000 加载了 CodeHiFiGAN 检查点\n2024-06-06 09:45:50 | INFO     | simuleval.utils.agent | 系统将在 GPU 设备上运行。\n2024-06-06 09:45:50 | INFO     | simuleval.dataloader | 正在进行语音到语音的评估。\n  0%|                                                                                                                                                                              | 0\u002F2 [00:00\u003C?, ?it\u002Fs]\n流式 ASR： \n流式 ASR： \n流式 ASR： je\n同时翻译： i would\n流式 ASR： je voudrais\n同时翻译： i would like to\n流式 ASR： je voudrais soumettre\n同时翻译： i would like to sub\n流式 ASR： je voudrais soumettre cette\n同时翻译： i would like to submit\n流式 ASR： je voudrais soumettre cette idée\n同时翻译： i would like to submit this\n流式 ASR： je voudrais soumettre cette idée à la\n同时翻译： i would like to submit this idea to\n流式 ASR： je voudrais soumettre cette idée à la réflexion\n同时翻译： i would like to submit this idea to the\n流式 ASR： je voudrais soumettre cette idée à la réflexion de\n同时翻译： i would like to submit this idea to the reflection\n流式 ASR： je voudrais soumettre cette idée à la réflexion de lassemblée\n同时翻译： i would like to submit this idea to the reflection of\n流式 ASR： je voudrais soumettre cette idée à la réflexion de lassemblée nationale\n同时翻译： i would like to submit this idea to the reflection of the\n流式 ASR： je voudrais soumettre cette idée à la réflexion de lassemblée nationale\n同时翻译： i would like to submit this idea to the reflection of the national assembly\n 50%|███████████████████████████████████████████████████████████████████████████████████                                                                                   | 1\u002F2 [00:04\u003C00:04,  4.08s\u002Fit]\n流式 ASR： \n流式 ASR： \n流式 ASR： \n流式 ASR： \n流式 ASR： jai donc\n同时翻译： i therefore\n流式 ASR： jai donc\n流式 ASR： jai donc expérience des\n同时翻译： i therefore have an experience\n流式 ASR： jai donc expérience des années\n流式 ASR： jai donc expérience des années passé\n同时翻译： i therefore have an experience of last\n流式 ASR： jai donc expérience des années passé jen\n同时翻译： i therefore have an experience of last years\n流式 ASR： jai donc expérience des années passé jen dirairai\n同时翻译： i therefore have an experience of last years i will\n流式 ASR： jai donc expérience des années passé jen dirairai un mot\n同时翻译： i therefore have an experience of last years i will tell a\n流式 ASR： jai donc expérience des années passé jen dirairai un mot tout à lheure\n同时翻译： i therefore have an experience of last years i will tell a word\n流式 ASR： jai donc expérience des années passé jen dirairai un mot tout à lheure\n同时翻译： i therefore have an experience of last years i will tell a word later\n100%|██████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2\u002F2 [00:06\u003C00:00,  3.02s\u002Fit]\n2024-06-06 09:45:56 | WARNING  | simuleval.scorer.asr_bleu | Beta 功能：正在评估语音输出。需要 Faieseq。\n2024-06-06 09:46:12 | INFO | fairseq.tasks.audio_finetuning | 
使用字典路径：\u002Fdata\u002Fzhangshaolei\u002F.cache\u002Fust_asr\u002Fen\u002Fdict.ltr.txt\n转录预测结果：100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 2\u002F2 [00:01\u003C00:00,  1.63it\u002Fs]\n2024-06-06 09:46:21 | INFO     | simuleval.sentence_level_evaluator | 结果：\n ASR_BLEU       AL    AL_CA    AP  AP_CA      DAL  DAL_CA  StartOffset  StartOffset_CA  EndOffset  EndOffset_CA     LAAL  LAAL_CA      ATD   ATD_CA  NumChunks  NumChunks_CA  DiscontinuitySum  DiscontinuitySum_CA  DiscontinuityAve  DiscontinuityAve_CA  DiscontinuityNum  DiscontinuityNum_CA   RTF  RTF_CA\n   15.448 1724.895 2913.508 0.425  0.776 1358.812 3137.55       1280.0        2213.906     1366.0        1366.0 1724.895 2913.508 1440.146 3389.374        9.5           9.5             110.0                110.0              55.0                 55.0                 1                    1 1.326   1.326\n\n```\n\n日志和评估结果存储在 `$output_dir\u002Fchunk_size=$chunk_size` 中：\n\n```\n$output_dir\u002Fchunk_size=$chunk_size\n├── wavs\u002F\n│   ├── 0_pred.wav # 生成的语音\n│   ├── 1_pred.wav \n│   ├── 0_pred.txt # ASR-BLEU 工具包的 ASR 转录文本\n│   ├── 1_pred.txt \n├── config.yaml\n├── asr_transcripts.txt # ASR-BLEU 转录结果\n├── metrics.tsv\n├── scores.tsv\n├── asr_cmd.bash\n└── instances.log # Simul-S2ST 的日志\n```\n\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>同时语音到文本翻译\u003C\u002Fsummary>\n\n```shell\nexport CUDA_VISIBLE_DEVICES=0\n\nROOT=\u002Fdata\u002Fzhangshaolei\u002FStreamSpeech # StreamSpeech 仓库路径\n\nLANG=fr\nfile=streamspeech.simultaneous.${LANG}-en.pt # 下载的 StreamSpeech 模型路径\noutput_dir=$ROOT\u002Fres\u002Fstreamspeech.simultaneous.${LANG}-en\u002Fsimul-s2tt\n\nchunk_size=320 #ms\nPYTHONPATH=$ROOT\u002Ffairseq simuleval --data-bin ${ROOT}\u002Fconfigs\u002F${LANG}-en \\\n    --user-dir ${ROOT}\u002Fresearches\u002Fctc_unity --agent-dir ${ROOT}\u002Fagent \\\n    --source example\u002Fwav_list.txt --target example\u002Ftarget.txt \\\n    --model-path $file \\\n    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \\\n    --agent $ROOT\u002Fagent\u002Fspeech_to_text.s2tt.streamspeech.agent.py\\\n    --output $output_dir\u002Fchunk_size=$chunk_size \\\n    --source-segment-size $chunk_size \\\n    --quality-metrics BLEU  --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks RTF \\\n    --device gpu --computation-aware \n```\n\u003C\u002Fdetails>\n\n\u003Cdetails>\n\u003Csummary>流式 ASR\u003C\u002Fsummary>\n\n```shell\nexport CUDA_VISIBLE_DEVICES=0\n\nROOT=\u002Fdata\u002Fzhangshaolei\u002FStreamSpeech # StreamSpeech 仓库路径\n\nLANG=fr\nfile=streamspeech.simultaneous.${LANG}-en.pt # 下载的 StreamSpeech 模型路径\noutput_dir=$ROOT\u002Fres\u002Fstreamspeech.simultaneous.${LANG}-en\u002Fstreaming-asr\n\nchunk_size=320 #毫秒\nPYTHONPATH=$ROOT\u002Ffairseq simuleval --data-bin ${ROOT}\u002Fconfigs\u002F${LANG}-en \\\n    --user-dir ${ROOT}\u002Fresearches\u002Fctc_unity --agent-dir ${ROOT}\u002Fagent \\\n    --source example\u002Fwav_list.txt --target example\u002Fsource.txt \\\n    --model-path $file \\\n    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \\\n    --agent $ROOT\u002Fagent\u002Fspeech_to_text.asr.streamspeech.agent.py\\\n    --output $output_dir\u002Fchunk_size=$chunk_size \\\n    --source-segment-size $chunk_size \\\n    --quality-metrics BLEU  --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks RTF \\\n    --device gpu --computation-aware 
\n```\n\u003C\u002Fdetails>\n\n\n\n## 🎈开发你自己的StreamSpeech\n\n### 1. 数据预处理\n\n- 按照 [`.\u002Fpreprocess_scripts`](.\u002Fpreprocess_scripts) 中的说明处理 CVSS-C 数据。\n\n### 2. 训练\n\n> [!Note]\n> 你可以直接使用[下载的StreamSpeech模型](#1-model-download)进行评估，而无需重新训练。\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_b4818b9064ac.png\" alt=\"model\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\n- 按照 [`researches\u002Fctc_unity\u002Ftrain_scripts\u002Ftrain.simul-s2st.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftrain_scripts\u002Ftrain.simul-s2st.sh) 中的步骤，训练用于同时语音到语音翻译的StreamSpeech。\n- 按照 [`researches\u002Fctc_unity\u002Ftrain_scripts\u002Ftrain.offline-s2st.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftrain_scripts\u002Ftrain.offline-s2st.sh) 中的步骤，训练用于离线语音到语音翻译的StreamSpeech。\n- 我们还提供了一些其他StreamSpeech变体和基线实现。\n\n| 模型             | --user-dir                 | --arch                            | 描述                                                  |\n| ----------------- | -------------------------- | --------------------------------- | ------------------------------------------------------------ |\n| **Translatotron 2** | `researches\u002Ftranslatotron` | `s2spect2_conformer_modified`     | [Translatotron 2](https:\u002F\u002Fproceedings.mlr.press\u002Fv162\u002Fjia22b.html) |\n| **UnitY**         | `researches\u002Ftranslatotron` | `unity_conformer_modified`        | [UnitY](https:\u002F\u002Faclanthology.org\u002F2023.acl-long.872\u002F)         |\n| **Uni-UnitY**     | `researches\u002Funi_unity`     | `uni_unity_conformer`             | 将UnitY中的所有编码器改为单向                             |\n| **Chunk-UnitY**   | `researches\u002Fchunk_unity`   | `chunk_unity_conformer`           | 将UnitY中的Conformer改为基于分块的Conformer               |\n| **StreamSpeech**  | `researches\u002Fctc_unity`     | `streamspeech`                    | StreamSpeech                                                 |\n| **StreamSpeech (级联)** | `researches\u002Fctc_unity` | `streamspeech_cascade` | S2TT和TTS的级联StreamSpeech。TTS模块可以独立用于给定增量文本的实时TTS。 |\n| **HMT**           | `researches\u002Fhmt`           | `hmt_transformer_iwslt_de_en`     | [HMT](https:\u002F\u002Fopenreview.net\u002Fforum?id=9y0HFvaAYD6)：强大的同时文本到文本翻译方法 |\n| **DiSeg**         | `researches\u002Fdiseg`         | `convtransformer_espnet_base_seg` | [DiSeg](https:\u002F\u002Faclanthology.org\u002F2023.findings-acl.485\u002F)：强大的同时语音到文本翻译方法 |\n\n> [!Tip]\n> 在`--user-dir`目录下的`train_scripts\u002F`和`test_scripts\u002F`中，提供了每个模型的训练和测试脚本。\n> 更多细节请参考[UnitY](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Fblob\u002Fmain\u002Ffairseq\u002Fmodels\u002Fspeech_to_speech\u002Fs2s_conformer_unity.py)、[Translatotron 2](https:\u002F\u002Fgithub.com\u002Ffacebookresearch\u002Ffairseq\u002Fblob\u002Fmain\u002Ffairseq\u002Fmodels\u002Fspeech_to_speech\u002Fs2s_conformer_translatotron2.py)、[HMT](https:\u002F\u002Fgithub.com\u002Fictnlp\u002FHMT)和[DiSeg](https:\u002F\u002Fgithub.com\u002Fictnlp\u002FDiSeg)的官方仓库。\n\n### 3. 
评估\n\n#### (1) 离线评估\n\n按照 [`pred.offline-s2st.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftest_scripts\u002Fpred.offline-s2st.sh) 中的步骤，评估StreamSpeech在ASR、S2TT和S2ST方面的离线性能。\n\n#### (2) 同时性评估\n\n训练好的StreamSpeech模型可用于流式ASR、同时语音到文本翻译以及同时语音到语音翻译。我们为这三项任务提供了[agent\u002F](.\u002Fagent)：\n\n- `agent\u002Fspeech_to_speech.streamspeech.agent.py`：同时语音到语音翻译\n- `agent\u002Fspeech_to_text.s2tt.streamspeech.agent.py`：同时语音到文本翻译\n- `agent\u002Fspeech_to_text.asr.streamspeech.agent.py`：流式ASR\n\n按照 [`simuleval.simul-s2st.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftest_scripts\u002Fsimuleval.simul-s2st.sh)、[`simuleval.simul-s2tt.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftest_scripts\u002Fsimuleval.simul-s2tt.sh) 和 [`simuleval.streaming-asr.sh`](.\u002Fresearches\u002Fctc_unity\u002Ftest_scripts\u002Fsimuleval.streaming-asr.sh) 中的步骤，评估StreamSpeech。\n\n### 4. 我们的成果\n\n我们的项目页面([https:\u002F\u002Fictnlp.github.io\u002FStreamSpeech-site\u002F](https:\u002F\u002Fictnlp.github.io\u002FStreamSpeech-site\u002F))提供了一些由StreamSpeech生成的翻译语音，请收听🎧。\n\n#### (1) 离线语音到语音翻译  ( ASR-BLEU: 质量 )\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_26a9c3a154b0.png\" alt=\"offline\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\n#### (2) 同时语音到语音翻译  ( AL: 延迟  |  ASR-BLEU: 质量 )\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_3d54672dcd23.png\" alt=\"simul\" style=\"width: 100%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\n#### (3) 同时语音到文本翻译  ( AL: 延迟  |  BLEU: 质量 )\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_0f57e3d3799e.png\" alt=\"simul\" style=\"width: 38%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\n#### (4) 流式ASR  ( AL: 延迟  |  WER: 质量 )\n\n\u003Cp align=\"center\" width=\"100%\">\n\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fictnlp_StreamSpeech_readme_bf4403841e05.png\" alt=\"simul\" style=\"width: 50%; min-width: 300px; display: block; margin: auto;\">\n\u003C\u002Fp>\n\n## 🖋引用\n\n如果您有任何问题，请随时提交issue或联系`zhangshaolei20z@ict.ac.cn`。\n\n如果我们的工作对您有所帮助，请按以下方式引用：\n\n```\n@inproceedings{streamspeech,\n      title={StreamSpeech: Simultaneous Speech-to-Speech Translation with Multi-task Learning}, \n      author={Shaolei Zhang and Qingkai Fang and Shoutao Guo and Zhengrui Ma and Min Zhang and Yang Feng},\n      year={2024},\n      booktitle = {Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Long Papers)},\n      publisher = {Association for Computational Linguistics}\n}\n```","# StreamSpeech 快速上手指南\n\nStreamSpeech 是一个支持离线与流式语音到语音翻译（S2ST）、语音识别（ASR）及语音合成（TTS）的“多合一”模型。它能在低延迟下同时输出中间结果（如实时转录和翻译文本），适用于实时通信场景。\n\n## 1. 环境准备\n\n### 系统要求\n- **Python**: 3.10\n- **PyTorch**: 2.0.1\n- **GPU**: 推荐 NVIDIA GPU 用于加速推理\n\n### 前置依赖\n本项目依赖 `fairseq` 和 `SimulEval` 框架，需从源码安装 StreamSpeech 仓库内置的适配版本（README 中的安装命令直接在仓库根目录下执行 `cd fairseq`，推理脚本也使用 `$ROOT\u002Ffairseq`）。\n\n```bash\n# 克隆 StreamSpeech 仓库（内置适配版 fairseq 与 SimulEval）\ngit clone https:\u002F\u002Fgithub.com\u002Fictnlp\u002FStreamSpeech.git\ncd StreamSpeech\n\n# 安装仓库内置的 fairseq (禁用构建隔离以确保兼容性)\ncd fairseq\npip install --editable .\u002F --no-build-isolation\ncd ..\n\n# 安装仓库内置的 SimulEval\ncd SimulEval\npip install --editable .\u002F\ncd ..\n```\n\n## 2. 
模型与资源下载\n\n### (1) 下载 StreamSpeech 模型\n根据语言对选择对应的模型（支持 Fr-En, Es-En, De-En）。国内用户推荐使用百度网盘链接加速下载。\n\n| 语言对 | 流式模型 (Simultaneous) | 离线模型 (Offline) | 下载链接 (Huggingface \u002F 百度网盘) |\n| :--- | :--- | :--- | :--- |\n| **法译英 (Fr-En)** | `streamspeech.simultaneous.fr-en.pt` | `streamspeech.offline.fr-en.pt` | [HF](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.simultaneous.fr-en.pt) \u002F [百度 (pwd:8gg3)](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1edCPFljogyDHgGXkUV8_3w?pwd=8gg3) |\n| **西译英 (Es-En)** | `streamspeech.simultaneous.es-en.pt` | `streamspeech.offline.es-en.pt` | [HF](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.simultaneous.es-en.pt) \u002F [百度 (pwd:p1pc)](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1NbLEVcYWHIdqqLD17P1s9g?pwd=p1pc) |\n| **德译英 (De-En)** | `streamspeech.simultaneous.de-en.pt` | `streamspeech.offline.de-en.pt` | [HF](https:\u002F\u002Fhuggingface.co\u002FICTNLP\u002FStreamSpeech_Models\u002Fblob\u002Fmain\u002Fstreamspeech.simultaneous.de-en.pt) \u002F [百度 (pwd:r7kw)](https:\u002F\u002Fpan.baidu.com\u002Fs\u002F1DYPMg3mdDopLY70BYQTduQ?pwd=r7kw) |\n\n### (2) 下载声码器 (Vocoder)\n需要下载基于 Unit 的 HiFi-GAN 声码器以生成语音波形。\n\n- **模型文件**: [g_00500000](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Ffairseq\u002Fspeech_to_speech\u002Fvocoder\u002Fcode_hifigan\u002Fmhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj\u002Fg_00500000)\n- **配置文件**: [config.json](https:\u002F\u002Fdl.fbaipublicfiles.com\u002Ffairseq\u002Fspeech_to_speech\u002Fvocoder\u002Fcode_hifigan\u002Fmhubert_vp_en_es_fr_it3_400k_layer11_km1000_lj\u002Fconfig.json)\n\n## 3. 数据与配置准备\n\n### (1) 修改配置文件\n打开项目中的配置文件 `configs\u002Ffr-en\u002Fconfig_gcmvn.yaml` 和 `configs\u002Ffr-en\u002Fconfig_mtl_asr_st_ctcst.yaml`，将其中的路径 `\u002Fdata\u002Fzhangshaolei\u002FStreamSpeech` 替换为你本地的 StreamSpeech 仓库绝对路径。\n\n### (2) 准备测试数据\n按照 SimulEval 格式准备输入列表和目标参考文本：\n- `wav_list.txt`: 每行记录一个源语音文件的路径。\n- `target.txt`: 每行记录参考文本（用于计算指标，如目标翻译或源转录）。\n*(示例数据可在 `example\u002F` 目录下找到)*\n\n## 4. 
## 4. Basic Usage (Streaming Inference)\n\nThe command below runs **simultaneous speech-to-speech translation**. The script outputs translated speech together with real-time ASR transcriptions and translated text.\n\nAdjust the following variables to your setup:\n- `ROOT`: path to the StreamSpeech repository\n- `PRETRAIN_ROOT`: directory holding the models and the vocoder\n- `LANG`: source language code (`fr`, `es`, or `de`)\n- `chunk_size`: latency-control chunk size in milliseconds; smaller values give lower latency\n\n```bash\nexport CUDA_VISIBLE_DEVICES=0\n\n# === Paths you need to edit ===\nROOT=\u002Fpath\u002Fto\u002Fyour\u002FStreamSpeech\nPRETRAIN_ROOT=\u002Fpath\u002Fto\u002Fyour\u002Fpretrain_models\nVOCODER_CKPT=$PRETRAIN_ROOT\u002Funit-based_HiFi-GAN_vocoder\u002FmHuBERT.layer11.km1000.en\u002Fg_00500000\nVOCODER_CFG=$PRETRAIN_ROOT\u002Funit-based_HiFi-GAN_vocoder\u002FmHuBERT.layer11.km1000.en\u002Fconfig.json\n\n# Source language (fr, es, de)\nLANG=fr\nfile=streamspeech.simultaneous.${LANG}-en.pt\noutput_dir=$ROOT\u002Fres\u002Fstreamspeech.simultaneous.${LANG}-en\u002Fsimul-s2st\n\n# Chunk size in milliseconds; controls latency\nchunk_size=320\n\n# Run inference\nPYTHONPATH=$ROOT\u002Ffairseq simuleval --data-bin ${ROOT}\u002Fconfigs\u002F${LANG}-en \\\n    --user-dir ${ROOT}\u002Fresearches\u002Fctc_unity --agent-dir ${ROOT}\u002Fagent \\\n    --source example\u002Fwav_list.txt --target example\u002Ftarget.txt \\\n    --model-path $file \\\n    --config-yaml config_gcmvn.yaml --multitask-config-yaml config_mtl_asr_st_ctcst.yaml \\\n    --agent $ROOT\u002Fagent\u002Fspeech_to_speech.streamspeech.agent.py \\\n    --vocoder $VOCODER_CKPT --vocoder-cfg $VOCODER_CFG --dur-prediction \\\n    --output $output_dir\u002Fchunk_size=$chunk_size \\\n    --source-segment-size $chunk_size \\\n    --quality-metrics ASR_BLEU --target-speech-lang en --latency-metrics AL AP DAL StartOffset EndOffset LAAL ATD NumChunks DiscontinuitySum DiscontinuityAve DiscontinuityNum RTF \\\n    --device gpu --computation-aware \\\n    --output-asr-translation True\n```\n\n**What the run produces:**\n- Translated audio files are written under `output_dir`.\n- Because `--output-asr-translation True` is set, intermediate ASR transcriptions and translated text are also emitted during inference.\n- Adjusting `--source-segment-size` trades latency against translation quality; a small sweep sketch follows.\n
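To chart that trade-off, you can rerun the same command at several chunk sizes and compare the AL (latency) and ASR_BLEU (quality) numbers SimulEval reports. The sketch below assumes, hypothetically, that you saved the script above as `simul-s2st.sh` and replaced the hard-coded `chunk_size=320` line with `chunk_size=$1`; neither the file name nor that change is part of the repository.\n\n```python\n# Sketch: sweep the chunk size to map the latency\u002Fquality trade-off.\n# Assumes a hypothetical wrapper simul-s2st.sh that takes the chunk\n# size in milliseconds as its first argument, as described above.\nimport subprocess\n\nfor chunk_ms in (320, 640, 960, 1280, 2560):\n    # Each run writes results under output_dir\u002Fchunk_size=... ;\n    # compare the reported AL and ASR_BLEU across runs.\n    subprocess.run([\"bash\", \"simul-s2st.sh\", str(chunk_ms)], check=True)\n```\n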
Notebook","#DA5B0B",10.3,{"name":33,"color":34,"percentage":35},"Shell","#89e051",2.4,{"name":37,"color":38,"percentage":39},"Cuda","#3A4E3A",0.5,{"name":41,"color":42,"percentage":43},"C++","#f34b7d",0.3,{"name":45,"color":46,"percentage":47},"Cython","#fedf5b",0.1,{"name":49,"color":50,"percentage":51},"Lua","#000080",0,{"name":53,"color":54,"percentage":51},"Perl","#0298c3",{"name":56,"color":57,"percentage":51},"Batchfile","#C1F12E",{"name":59,"color":60,"percentage":51},"Makefile","#427819",1256,101,"2026-04-11T00:45:16","MIT",4,"Linux","需要 NVIDIA GPU (脚本中指定了 CUDA_VISIBLE_DEVICES)，显存需求未说明，需支持 PyTorch 2.0.1 的 CUDA 版本","未说明",{"notes":70,"python":71,"dependencies":72},"1. 必须从源码安装 fairseq 和 SimulEval（使用 --editable 模式且 fairseq 需 --no-build-isolation）。2. 运行前需手动下载 StreamSpeech 模型文件及基于 Unit 的 HiFi-GAN 声码器（检查点和配置文件）。3. 推理时需配置 SimulEval 格式的数据（wav_list.txt 和 target.txt）。4. 示例脚本中使用了特定的环境变量路径，实际使用时需替换为本地路径。","3.10",[73,74,75],"torch==2.0.1","fairseq (editable install)","SimulEval (editable install)",[77,78],"语言模型","音频",[80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99],"seamless","simultaneous-translation","speech","speech-recognition","speech-synthesis","speech-to-text","speech-translation","translation","all-in-one","machine-translation","streaming-audio","text-to-speech","asr","tts","voice","text-to-audio","non-autoregressive","speech-enhancement","audio-processing","speech-processing",2,"ready","2026-03-27T02:49:30.150509","2026-04-12T21:59:36.272341",[105,110,115,120],{"id":106,"question_zh":107,"answer_zh":108,"source_url":109},31060,"运行 simuleval 时遇到 'KeyError: speech_to_speech_ctc' 或 'FileNotFoundError' 错误怎么办？","这通常是因为仓库版本过旧或命令参数不匹配导致的。早期版本在 `--user-dir` 参数上存在 Bug，且预训练模型中可能硬编码了作者本地的绝对路径。\n解决方案：\n1. 将代码更新到最新提交版本（维护者已修复相关 Bug）。\n2. 确保使用正确的启动脚本和参数。参考以下修正后的命令示例：\n```shell\nexport CUDA_VISIBLE_DEVICES=0\nROOT=\u002Fpath\u002Fto\u002FStreamSpeech # StreamSpeech 仓库路径\nPRETRAIN_ROOT=\u002Fpath\u002Fto\u002Fpretrain_models \nVOCODER_CKPT=$PRETRAIN_ROOT\u002Funit-based_HiFi-GAN_vocoder\u002FmHuBERT.layer11.km1000.en\u002Fg_00500000\nVOCODER_CFG=$PRETRAIN_ROOT\u002Funit-based_HiFi-GAN_vocoder\u002FmHuBERT.layer11.km1000.en\u002Fconfig.json\nLANG=fr\nfile=streamspeech.simultaneous.${LANG}-en.pt\noutput_dir=$ROOT\u002Fres\u002Fstreamspeech.simultaneous.${LANG}-en\u002Fsimul-s2st\n# 请根据实际 README 补充完整的 simuleval 命令\n```\n3. 如果问题依旧，检查是否需要在加载 checkpoint 后手动重置 `user_dir` 配置。","https:\u002F\u002Fgithub.com\u002Fictnlp\u002FStreamSpeech\u002Fissues\u002F2",{"id":111,"question_zh":112,"answer_zh":113,"source_url":114},31061,"使用预处理脚本处理 CVSS-C 数据集时报错 'gcmvn_feature_list is empty' 或找不到 'fbank2unit' 文件怎么办？","这通常是由于依赖库版本冲突导致的。建议专门为数据预处理创建一个新的 Conda 环境，并使用经过验证的库版本组合。\n推荐环境配置：\n- 系统：Ubuntu 20.04.6 LTS\n- CUDA: 11.8\n- Python: 3.9\n- 关键依赖：确保安装兼容版本的 `fairseq`, `torchaudio` 等。有用户反馈在清理环境冲突并重新安装特定版本的包后，`gcmvn_feature_list` 能正常生成，后续步骤也能自动创建 `fbank2unit` 目录，无需手动配置。","https:\u002F\u002Fgithub.com\u002Fictnlp\u002FStreamSpeech\u002Fissues\u002F19",{"id":116,"question_zh":117,"answer_zh":118,"source_url":119},31062,"使用 simul-s2st.sh 脚本评估时，ASR-Bleu 分数异常低（如 9.1）是什么原因？","分数过低通常是因为使用了未标准化（unnormalized）的测试集，或者在计算 ASR-Bleu 时包含了静音片段（silence）。\n解决方法：\n1. 确认使用的是经过标准化处理的测试集版本。\n2. 检查评估脚本配置，确保在计算 BLEU 分数前已正确去除静音部分。\n3. 
Avoid running only part of the test set, which can cause path or index errors; make sure the official test pipeline is run in full.","https:\u002F\u002Fgithub.com\u002Fictnlp\u002FStreamSpeech\u002Fissues\u002F12",{"id":121,"question_zh":122,"answer_zh":123,"source_url":124},31063,"Loading a pretrained model on my own machine raises 'FileNotFoundError' pointing to the author's local path (e.g., \u002Fdata\u002Fzhangshaolei\u002F...). How do I fix this?","The pretrained checkpoint stores absolute path settings from the author's development machine.\nTemporary workaround:\nIn the model-loading code (e.g., `speech_to_speech.streamspeech.agent.py`), overwrite the `user_dir` path in the config before `import_user_module` is called.\nExample:\n```python\nfrom fairseq import checkpoint_utils, utils\n\nstate = checkpoint_utils.load_checkpoint_to_cpu(filename)\n# Overwrite the hard-coded path with the correct path in your checkout\nstate[\"cfg\"].common.user_dir = 'researches\u002Fctc_unity\u002F'\nutils.import_user_module(state[\"cfg\"].common)\n```\nNote: the maintainers have said a future version will remove the hard-coded path.","https:\u002F\u002Fgithub.com\u002Fictnlp\u002FStreamSpeech\u002Fissues\u002F1",[],[127,137,147,155,167,176],{"id":128,"name":129,"github_repo":130,"description_zh":131,"stars":132,"difficulty_score":100,"last_commit_at":133,"category_tags":134,"status":101},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code is a performance-oriented optimization system built for AI coding assistants such as Claude Code, Codex, and Cursor. More than a set of configuration files, it is a complete framework refined through long practical use, targeting the core pain points AI agents face in real development: low efficiency, lost memory, security risks, and the lack of continuous learning.\n\nBy introducing modular skills, intuition enhancement, persistent memory, and built-in security scanning, everything-claude-code markedly improves AI performance on complex tasks and helps developers build more stable, smarter production-grade AI agents. Its distinctive research-first development philosophy and token-consumption optimizations make responses faster and cheaper while defending against potential attack vectors.\n\nThe toolkit is aimed at software developers, AI researchers, and teams that want deeply customized AI workflows. Whether you are building a large codebase or need AI help with security audits and automated testing, everything-claude-code provides strong underlying support. An open-source project that won an Anthropic hackathon award, it combines multilingual support with a rich set of battle-tested hooks, letting the AI grow into an assistant that truly understands",151918,"2026-04-12T11:33:05",[135,136,77],"Development Frameworks","Agent",{"id":138,"name":139,"github_repo":140,"description_zh":141,"stars":142,"difficulty_score":143,"last_commit_at":144,"category_tags":145,"status":101},4487,"LLMs-from-scratch","rasbt\u002FLLMs-from-scratch","LLMs-from-scratch is a PyTorch-based educational open-source project that guides you through building a ChatGPT-like large language model (LLM) from scratch, step by step. It is the official code repository for the book of the same name and provides a complete hands-on path covering model development, pretraining, and finetuning.\n\nThe project tackles the \"black box\" problem in learning about large models: many developers can call ready-made models yet struggle to understand the internal architecture and training mechanics. By writing every line of core code by hand, users gain a thorough grasp of the Transformer architecture, attention mechanisms, and other key principles, and come to understand how large models actually \"think\". The repository also includes code for loading large pretrained weights for finetuning, carrying the theory into practical use.\n\nLLMs-from-scratch is ideal for AI developers, researchers, and computer science students who want to dig beneath the API surface. Its signature strength is step-by-step teaching: complex systems engineering is broken into clear stages with detailed diagrams and examples, making it feasible to build a small but fully functional large model. Whether you want to solidify your theoretical foundations or prepare for developing larger models in the future",90106,3,"2026-04-06T11:19:32",[77,146,136,135],"Images",{"id":148,"name":149,"github_repo":150,"description_zh":151,"stars":152,"difficulty_score":100,"last_commit_at":153,"category_tags":154,"status":101},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat is a lightweight, lightning-fast AI assistant built for a smooth, cross-platform large-model experience. It solves the pain of losing conversation continuity when switching devices and of managing many AI models in one place. For daily work, study, or creative inspiration, NextChat connects users to AI seamlessly from the web, iOS, Android, Windows, macOS, or Linux.\n\nIt suits ordinary users, students, and professionals, as well as enterprise teams that need private deployment. For developers it also offers a convenient self-hosting path, with one-click deployment to platforms such as Vercel and Zeabur.\n\nNextChat's core strength is broad model compatibility: it natively supports mainstream models including Claude, DeepSeek, GPT-4, and Gemini Pro, so users can switch AI capabilities freely within one interface. It is also an early adopter of MCP (Model Context Protocol), strengthening context handling. For enterprise users, NextChat offers a professional edition with brand customization, fine-grained permission control, internal knowledge-base integration, and security auditing, meeting strict requirements for data privacy and tailored management.",87618,"2026-04-05T07:20:52",[135,77],{"id":156,"name":157,"github_repo":158,"description_zh":159,"stars":160,"difficulty_score":100,"last_commit_at":161,"category_tags":162,"status":101},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners is a systematic introductory machine learning curriculum from Microsoft, designed to help complete beginners master classic machine learning. The course lays out a 12-week learning path with 26 concise lessons and
52 accompanying quizzes, covering the complete journey from basic concepts to practical application, and it directly addresses the beginner's pain of facing a vast body of knowledge with no structure and no idea where to start.\n\nDevelopers hoping to change tracks, researchers who need to fill in algorithmic background, and curious general enthusiasts can all benefit. The course pairs clear theory with hands-on practice, building solid skills step by step. A standout feature is its strong multilingual support: automated tooling provides versions in more than 50 languages, including Simplified Chinese, greatly lowering the barrier for learners worldwide. The project runs as an open-source collaboration with an active community and continuously updated content, so learners get current and accurate material. If you are looking for a clear, friendly, and professional path into machine learning, ML-For-Beginners is an ideal starting point.",85092,"2026-04-10T11:13:16",[146,163,164,165,136,166,77,135,78],"Data Tools","Video","Plugins","Other",{"id":168,"name":169,"github_repo":170,"description_zh":171,"stars":172,"difficulty_score":173,"last_commit_at":174,"category_tags":175,"status":101},5784,"funNLP","fighting41love\u002FfunNLP","funNLP is a mega resource library dedicated to Chinese natural language processing (NLP), affectionately known as the \"NLP laborer's playground\". Rather than a single software tool, it is a comprehensive platform gathering a huge number of open-source projects, datasets, pretrained models, and practical code.\n\nAgainst the pain points of scattered resources, high entry barriers, and scarce data for specific scenarios in Chinese NLP, funNLP offers a one-stop solution. It covers standard tools for fundamental tasks such as word segmentation, named entity recognition, sentiment analysis, and text summarization, and it also collects unusually rich vertical-domain resources, such as dedicated lexicons and datasets for the legal, medical, and financial industries, and even playful applications such as classical poetry generation and lyric writing. Its core strengths are breadth and practicality: from basic dictionaries and lexicons, to cutting-edge BERT and GPT-2 model code, to high-quality annotated data and competition solutions, it has it all.\n\nStudents just entering NLP, algorithm engineers who need to validate ideas quickly, and AI researchers can all find the \"ammunition\" they urgently need here. For developers it greatly cuts the time spent hunting for data and reproducing models; for researchers it provides rich benchmark resources and pointers to the state of the art. In a spirit of open sharing, funNLP dramatically lowers the cost of Chinese NLP development and research, making it an indispensable treasure trove for the Chinese AI community.",79857,1,"2026-04-08T20:11:31",[77,163,166],{"id":177,"name":178,"github_repo":179,"description_zh":180,"stars":181,"difficulty_score":173,"last_commit_at":182,"category_tags":183,"status":101},6590,"gpt4all","nomic-ai\u002Fgpt4all","GPT4All is an open-source tool that lets ordinary computers run large language models (LLMs) with ease. Its core goal is to break the compute barrier: users can deploy and use large models privately and offline on everyday laptops and desktops, with no expensive GPU and no cloud API.\n\nFor enterprise users, researchers, and enthusiasts who worry about data privacy and want full control of their local data, GPT4All is an ideal solution. It removes the usual requirements of network access or high-end hardware, turning everyday devices into capable AI assistants. Developers building local knowledge bases and ordinary users who simply want private AI chat both benefit.\n\nTechnically, GPT4All builds on the efficient `llama.cpp` backend, supports a range of mainstream model architectures (including the recent DeepSeek R1 distilled models), and uses the GGUF format to optimize inference speed. It ships a friendly desktop client with one-click installers for Windows, macOS, and Linux, plus a convenient Python library for developers that integrates easily with ecosystems such as LangChain. After a simple download and setup, users can immediately start exploring what local LLMs can do.",77307,"2026-04-11T06:52:37",[77,135]]