[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-nikitakit--self-attentive-parser":3,"tool-nikitakit--self-attentive-parser":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",140436,2,"2026-04-05T23:32:43",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":78,"owner_email":80,"owner_twitter":78,"owner_website":81,"owner_url":82,"languages":83,"stars":108,"forks":109,"last_commit_at":110,"license":111,"difficulty_score":23,"env_os":112,"env_gpu":113,"env_ram":112,"env_deps":114,"category_tags":124,"github_topics":125,"view_count":23,"oss_zip_url":78,"oss_zip_packed_at":78,"status":16,"created_at":132,"updated_at":133,"faqs":134,"releases":173},3193,"nikitakit\u002Fself-attentive-parser","self-attentive-parser","High-accuracy NLP parser with models for 11 languages.","self-attentive-parser（又称 Berkeley Neural Parser）是一款基于 Python 开发的高精度自然语言处理工具，专注于对 11 种语言进行成分句法分析。它能将句子拆解为具有层级结构的语法树，清晰展示词语如何组合成短语及句子，从而帮助机器深入理解语言的内在逻辑与结构关系。\n\n该工具有效解决了传统解析器在多语言支持、长距离依赖捕捉及分析精度上的不足。其核心技术亮点在于采用了“自注意力编码器”架构，并结合了先进的预训练模型，显著提升了分析的准确性。2021 年更新的 0.2.0 版本更是全面转向 PyTorch 推理引擎，提供了质量更高的预训练模型，并支持与 spaCy 无缝集成，让用户能轻松在现有工作流中调用强大的句法分析能力。\n\nself-attentive-parser 非常适合 NLP 研究人员、AI 开发者以及需要深度文本分析的数据科学家使用。无论是用于构建问答系统、机器翻译优化，还是进行语言学理论研究，它都能提供可靠的句法结构数据。虽然普通用户难以直接操作代码，但基于此工具开发的上层应用将能提供更精准的语言理解服务。","# Berkeley Neural Parser\n\nA high-accuracy parser with models for 11 languages, implemented in Python. 
Based on [Constituency Parsing with a Self-Attentive Encoder](https:\u002F\u002Farxiv.org\u002Fabs\u002F1805.01052) from ACL 2018, with additional changes described in [Multilingual Constituency Parsing with Self-Attention and Pre-Training](https:\u002F\u002Farxiv.org\u002Fabs\u002F1812.11760).\n\n**New February 2021:** Version 0.2.0 of the Berkeley Neural Parser is now out, with higher-quality pre-trained models for all languages. Inference now uses PyTorch instead of TensorFlow (training has always been PyTorch-only). Drops support for Python 2.7 and 3.5. Includes updated support for training and using your own parsers, based on your choice of [pre-trained model](https:\u002F\u002Fhuggingface.co\u002Fmodels).\n\n## Contents\n1. [Installation](#installation)\n2. [Usage](#usage)\n3. [Available Models](#available-models)\n4. [Training](#training)\n5. [Reproducing Experiments](#reproducing-experiments)\n6. [Citation](#citation)\n7. [Credits](#credits)\n\nIf you are primarily interested in training your own parsing models, skip to the [Training](#training) section of this README.\n\n## Installation\n\nTo install the parser, run the command:\n```bash\n$ pip install benepar\n```\n*Note: benepar 0.2.0 is a major upgrade over the previous version, and comes with entirely new and higher-quality parser models. If you are not ready to upgrade, you can pin your benepar version to [the previous release (0.1.3)](https:\u002F\u002Fgithub.com\u002Fnikitakit\u002Fself-attentive-parser\u002Ftree\u002Facl2019).*\n\nPython 3.6 (or newer) and [PyTorch](https:\u002F\u002Fpytorch.org\u002F) 1.6 (or newer) are required. See the PyTorch website for instructions on how to select between GPU-enabled and CPU-only versions of PyTorch; benepar will automatically use the GPU if it is available to PyTorch.\n\nThe recommended way of using benepar is through integration with [spaCy](https:\u002F\u002Fspacy.io\u002F). If using spaCy, you should install a spaCy model for your language. 
For English, the installation command is:\n```sh\n$ python -m spacy download en_core_web_md\n```\n\nThe spaCy model is only used for tokenization and sentence segmentation. If language-specific analysis beyond parsing is not required, you may also forego a language-specific model and instead use a multi-language model that only performs tokenization and segmentation. [One such model](https:\u002F\u002Fspacy.io\u002Fmodels\u002Fxx#xx_sent_ud_sm), newly added in spaCy 3.0, should work for English, German, Korean, Polish, and Swedish (but not Chinese, since it doesn't seem to support Chinese word segmentation).\n\nParsing models need to be downloaded separately, using the commands:\n```python\n>>> import benepar\n>>> benepar.download('benepar_en3')\n```\n\nSee the [Available Models](#available-models) section below for a full list of models.\n\n## Usage\n\n### Usage with spaCy (recommended)\n\nThe recommended way of using benepar is through its integration with spaCy:\n```python\n>>> import benepar, spacy\n>>> nlp = spacy.load('en_core_web_md')\n>>> if spacy.__version__.startswith('2'):\n        nlp.add_pipe(benepar.BeneparComponent(\"benepar_en3\"))\n    else:\n        nlp.add_pipe(\"benepar\", config={\"model\": \"benepar_en3\"})\n>>> doc = nlp(\"The time for action is now. It's never too late to do something.\")\n>>> sent = list(doc.sents)[0]\n>>> print(sent._.parse_string)\n(S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) (VP (VBZ is) (ADVP (RB now))) (. .))\n>>> sent._.labels\n('S',)\n>>> list(sent._.children)[0]\nThe time for action\n```\n\nSince spaCy does not provide an official constituency parsing API, all methods are accessible through the extension namespaces `Span._` and `Token._`.\n\nThe following extension properties are available:\n- `Span._.labels`: a tuple of labels for the given span. 
A span may have multiple labels when there are unary chains in the parse tree.\n- `Span._.parse_string`: a string representation of the parse tree for a given span.\n- `Span._.constituents`: an iterator over `Span` objects for sub-constituents in a pre-order traversal of the parse tree.\n- `Span._.parent`: the parent `Span` in the parse tree.\n- `Span._.children`: an iterator over child `Span`s in the parse tree.\n- `Token._.labels`, `Token._.parse_string`, `Token._.parent`: these behave the same as calling the corresponding method on the length-one Span containing the token.\n\nThese methods will raise an exception when called on a span that is not a constituent in the parse tree. Such errors can be avoided by traversing the parse tree starting at either sentence level (by iterating over `doc.sents`) or with an individual `Token` object.\n\n### Usage with NLTK\n\nThere is also an NLTK interface, which is designed for use with pre-tokenized datasets and treebanks, or when integrating the parser into an NLP pipeline that already performs (at minimum) tokenization and sentence splitting. For parsing starting with raw text, it is **strongly encouraged** that you use spaCy and `benepar.BeneparComponent` instead.\n\nSample usage with NLTK:\n```python\n>>> import benepar\n>>> parser = benepar.Parser(\"benepar_en3\")\n>>> input_sentence = benepar.InputSentence(\n    words=['\"', 'Fly', 'safely', '.', '\"'],\n    space_after=[False, True, False, False, False],\n    tags=['``', 'VB', 'RB', '.', \"''\"],\n    escaped_words=['``', 'Fly', 'safely', '.', \"''\"],\n)\n>>> tree = parser.parse(input_sentence)\n>>> print(tree)\n(TOP (S (`` ``) (VP (VB Fly) (ADVP (RB safely))) (. .) ('' '')))\n```\n\nNot all fields of `benepar.InputSentence` are required, but at least one of `words` and `escaped_words` must be specified. 
The parser will attempt to guess the value for missing fields, for example:\n```python\n>>> input_sentence = benepar.InputSentence(\n    words=['\"', 'Fly', 'safely', '.', '\"'],\n)\n>>> parser.parse(input_sentence)\n```\n\nUse `parse_sents` to parse multiple sentences.\n```python\n>>> input_sentence1 = benepar.InputSentence(\n    words=['The', 'time', 'for', 'action', 'is', 'now', '.'],\n)\n>>> input_sentence2 = benepar.InputSentence(\n    words=['It', \"'s\", 'never', 'too', 'late', 'to', 'do', 'something', '.'],\n)\n>>> parser.parse_sents([input_sentence1, input_sentence2])\n```\n\nSome parser models also allow Unicode text input for debugging\u002Finteractive use, but passing in raw text strings is *strongly discouraged* for any application where parsing accuracy matters.\n```python\n>>> parser.parse('\"Fly safely.\"')  # For debugging\u002Finteractive use only.\n```\nWhen parsing from raw text, we recommend using spaCy and `benepar.BeneparComponent` instead. The reason is that parser models do not ship with a tokenizer or sentence splitter, and some models may not include a part-of-speech tagger either. A toolkit must be used to fill in these pipeline components, and spaCy outperforms NLTK in all of these areas (sometimes by a large margin). \n\n\n\n## Available Models\n\nThe following trained parser models are available. To use spaCy integration, you will also need to install a [spaCy model for the appropriate language](https:\u002F\u002Fspacy.io\u002Fmodels).\n\nModel       | Language | Info\n----------- | -------- | ----\n`benepar_en3` | English | 95.40 F1 on [revised](https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC2015T13) WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. 
Based on T5-small.\n`benepar_en3_large` | English | 96.29 F1 on [revised](https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC2015T13) WSJ test set. The training data uses revised tokenization and syntactic annotation based on the same guidelines as the English Web Treebank and OntoNotes, which better matches modern tokenization practices in libraries like spaCy. Based on T5-large.\n`benepar_zh2` | Chinese | 92.56 F1 on CTB 5.1 test set. Usage with spaCy supports parsing from raw text, but the NLTK API only supports parsing previously tokenized sentences. Based on Chinese ELECTRA-180G-large.\n`benepar_ar2` | Arabic | 90.52 F1 on SPMRL2013\u002F2014 test set. Only supports using the NLTK API for parsing previously tokenized sentences. Parsing from raw text and spaCy integration are not supported. Based on XLM-R.\n`benepar_de2` | German | 92.10 F1 on SPMRL2013\u002F2014 test set. Based on XLM-R.\n`benepar_eu2` | Basque | 93.36 F1 on SPMRL2013\u002F2014 test set. Usage with spaCy first requires implementing Basque support in spaCy. Based on XLM-R.\n`benepar_fr2` | French | 88.43 F1 on SPMRL2013\u002F2014 test set. Based on XLM-R.\n`benepar_he2` | Hebrew | 93.98 F1 on SPMRL2013\u002F2014 test set. Only supports using the NLTK API for parsing previously tokenized sentences. Parsing from raw text and spaCy integration are not supported. Based on XLM-R.\n`benepar_hu2` | Hungarian | 96.19 F1 on SPMRL2013\u002F2014 test set. Usage with spaCy requires a [Hungarian model for spaCy](https:\u002F\u002Fgithub.com\u002Foroszgy\u002Fspacy-hungarian-models). The NLTK API only supports parsing previously tokenized sentences. Based on XLM-R.\n`benepar_ko2` | Korean | 91.72 F1 on SPMRL2013\u002F2014 test set. Can be used with spaCy's [multi-language sentence segmentation model](https:\u002F\u002Fspacy.io\u002Fmodels\u002Fxx#xx_sent_ud_sm) (requires spaCy v3.0). The NLTK API only supports parsing previously tokenized sentences. 
Based on XLM-R.\n`benepar_pl2` | Polish | 97.15 F1 on SPMRL2013\u002F2014 test set. Based on XLM-R.\n`benepar_sv2` | Swedish | 92.21 F1 on SPMRL2013\u002F2014 test set. Can be used with spaCy's [multi-language sentence segmentation model](https:\u002F\u002Fspacy.io\u002Fmodels\u002Fxx#xx_sent_ud_sm) (requires spaCy v3.0). Based on XLM-R.\n`benepar_en3_wsj` | English | **Consider using `benepar_en3` or `benepar_en3_large` instead**. 95.55 F1 on [canonical](https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC99T42) WSJ test set used for decades of English constituency parsing publications. Based on BERT-large-uncased. We believe that the revised annotation guidelines used for training `benepar_en3`\u002F`benepar_en3_large` are more suitable for downstream use because they better handle language usage in web text, and are more consistent with modern practices in dependency parsing and libraries like spaCy. Nevertheless, we provide the `benepar_en3_wsj` model for cases where using the revised treebanking conventions is not appropriate, such as benchmarking different models on the same dataset.\n\n## Training\n\nTraining requires cloning this repository from GitHub. While the model code in `src\u002Fbenepar` is distributed in the `benepar` package on PyPI, the training and evaluation scripts directly under `src\u002F` are not.\n\n#### Software Requirements for Training\n* Python 3.7 or higher.\n* [PyTorch](http:\u002F\u002Fpytorch.org\u002F) 1.6.0, or any compatible version.\n* All dependencies required by the `benepar` package, including: [NLTK](https:\u002F\u002Fwww.nltk.org\u002F) 3.2, [torch-struct](https:\u002F\u002Fgithub.com\u002Fharvardnlp\u002Fpytorch-struct) 0.4, [transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) 4.3.0, or compatible.\n* [pytokenizations](https:\u002F\u002Fgithub.com\u002Ftamuhey\u002Ftokenizations\u002F) 0.7.2 or compatible.\n* [EVALB](http:\u002F\u002Fnlp.cs.nyu.edu\u002Fevalb\u002F). 
Before starting, run `make` inside the `EVALB\u002F` directory to compile an `evalb` executable. This will be called from Python for evaluation. If training on the SPMRL datasets, you will need to run `make` inside the `EVALB_SPMRL\u002F` directory instead.\n\n### Training Instructions\n\nA new model can be trained using the command `python src\u002Fmain.py train ...`. Some of the available arguments are:\n\nArgument | Description | Default\n--- | --- | ---\n`--model-path-base` | Path base to use for saving models | N\u002FA\n`--evalb-dir` |  Path to EVALB directory | `EVALB\u002F`\n`--train-path` | Path to training trees | `data\u002Fwsj\u002Ftrain_02-21.LDC99T42`\n`--train-path-text` | Optional non-destructive tokenization of the training data | Guess raw text; see `--text-processing`\n`--dev-path` | Path to development trees | `data\u002Fwsj\u002Fdev_22.LDC99T42`\n`--dev-path-text` | Optional non-destructive tokenization of the development data | Guess raw text; see `--text-processing`\n`--text-processing` | Heuristics for guessing raw text from destructively tokenized tree files. See `load_trees()` in `src\u002Ftreebanks.py` | Default rules for languages other than Arabic, Chinese, and Hebrew\n`--subbatch-max-tokens` | Maximum number of tokens to process in parallel while training (a full batch may not fit in GPU memory) | 2000\n`--parallelize` | Distribute pre-trained model (e.g. T5) layers across multiple GPUs. | Use at most one GPU\n`--batch-size` | Number of examples per training update | 32\n`--checks-per-epoch` | Number of development evaluations per epoch | 4\n`--numpy-seed` | NumPy random seed | Random\n`--use-pretrained` | Use pre-trained encoder | Do not use pre-trained encoder\n`--pretrained-model` | Model to use if `--use-pretrained` is passed. 
May be a path or a model id from the [HuggingFace Model Hub](https:\u002F\u002Fhuggingface.co\u002Fmodels)| `bert-base-uncased`\n`--predict-tags` | Adds a part-of-speech tagging component and auxiliary loss to the parser | Do not predict tags\n`--use-chars-lstm` | Use learned CharLSTM word representations | Do not use CharLSTM\n`--use-encoder` | Use learned transformer layers on top of pre-trained model or CharLSTM | Do not use extra transformer layers\n`--num-layers` | Number of transformer layers to use if `--use-encoder` is passed | 8\n`--encoder-max-len` | Maximum sentence length (in words) allowed for extra transformer layers | 512\n\nAdditional arguments are available for other hyperparameters; see `make_hparams()` in `src\u002Fmain.py`. These can be specified on the command line, such as `--num-layers 2` (for numerical parameters), `--predict-tags` (for boolean parameters that default to False), or `--no-XXX` (for boolean parameters that default to True).\n\nFor each development evaluation, the F-score on the development set is computed and compared to the previous best. If the current model is better, the previous model will be deleted and the current model will be saved. The new filename will be derived from the provided model path base and the development F-score.\n\nPrior to training the parser, you will first need to obtain appropriate training data. We provide [instructions on how to process standard datasets like PTB, CTB, and the SPMRL 2013\u002F2014 Shared Task data](data\u002FREADME.md). 
After following the instructions for the English WSJ data, you can use the following command to train an English parser using the default hyperparameters:\n\n```\npython src\u002Fmain.py train --use-pretrained --model-path-base models\u002Fen_bert_base\n```\n\nSee [`EXPERIMENTS.md`](EXPERIMENTS.md) for more examples of good hyperparameter choices.\n\n### Evaluation Instructions\n\nA saved model can be evaluated on a test corpus using the command `python src\u002Fmain.py test ...` with the following arguments:\n\nArgument | Description | Default\n--- | --- | ---\n`--model-path` | Path of saved model | N\u002FA\n`--evalb-dir` |  Path to EVALB directory | `EVALB\u002F`\n`--test-path` | Path to test trees | `data\u002F23.auto.clean`\n`--test-path-text` | Optional non-destructive tokenization of the test data | Guess raw text; see `--text-processing`\n`--text-processing` | Heuristics for guessing raw text from destructively tokenized tree files. See `load_trees()` in `src\u002Ftreebanks.py` | Default rules for languages other than Arabic, Chinese, and Hebrew\n`--test-path-raw` | Alternative path to test trees that is used for evalb only (used to double-check that evaluation against pre-processed trees does not contain any bugs) | Compare to trees from `--test-path`\n`--subbatch-max-tokens` | Maximum number of tokens to process in parallel (a GPU does not have enough memory to process the full dataset in one batch) | 500\n`--parallelize` | Distribute pre-trained model (e.g. T5) layers across multiple GPUs. | Use at most one GPU\n`--output-path` | Path to write predicted trees to (use `\"-\"` for stdout). | Do not save predicted trees\n`--no-predict-tags` | Use gold part-of-speech tags when running EVALB. This is the standard for publications, and omitting this flag may give erroneously high F1 scores. 
| Use predicted part-of-speech tags for EVALB, if available\n\nAs an example, you can evaluate a trained model using the following command:\n```\npython src\u002Fmain.py test --model-path models\u002Fen_bert_base_dev=*.pt\n```\n\n### Exporting Models for Inference\n\nThe `benepar` package can directly use saved checkpoints by replacing a model name like `benepar_en3` with a path such as `models\u002Fen_bert_base_dev_dev=95.67.pt`. However, releasing the single-file checkpoints has a few shortcomings:\n* Single-file checkpoints do not include the tokenizer or pre-trained model config. These can generally be downloaded automatically from the HuggingFace model hub, but this requires an Internet connection and may also (incidentally and unnecessarily) download pre-trained weights from the HuggingFace Model Hub\n* Single-file checkpoints are 3x larger than necessary, because they save optimizer state\n\nUse `src\u002Fexport.py` to convert a checkpoint file into a directory that encapsulates everything about a trained model. For example,\n```\npython src\u002Fexport.py export \\\n  --model-path models\u002Fen_bert_base_dev=*.pt \\\n  --output-dir=models\u002Fen_bert_base\n```\n\nWhen exporting, there is also a `--compress` option that slightly adjusts model weights, so that the output directory can be compressed into a ZIP archive of much smaller size. We use this for our official model releases, because it's a hassle to distribute model weights that are 2GB+ in size. When using the `--compress` option, it is recommended to specify a test set in order to verify that compression indeed has minimal impact on parsing accuracy. 
Using the development data for verification is not recommended, since the development data was already used for the model selection criterion during training.\n```\npython src\u002Fexport.py export \\\n  --model-path models\u002Fen_bert_base_dev=*.pt \\\n  --output-dir=models\u002Fen_bert_base \\\n  --test-path=data\u002Fwsj\u002Ftest_23.LDC99T42\n```\n\nThe `src\u002Fexport.py` script also has a `test` subcommand that's roughly similar to `python src\u002Fmain.py test`, except that it supports exported models and has slightly different flags. We can run the following command to verify that our English parser using BERT-large-uncased indeed achieves 95.55 F1 on the canonical WSJ test set:\n```\npython src\u002Fexport.py test --model-path benepar_en3_wsj --test-path data\u002Fwsj\u002Ftest_23.LDC99T42\n```\n\n## Reproducing Experiments\n\nSee [`EXPERIMENTS.md`](EXPERIMENTS.md) for instructions on how to reproduce experiments reported in our ACL 2018 and 2019 papers.\n\n## Citation\n\nIf you use this software for research, please cite our papers as follows:\n\n```\n@inproceedings{kitaev-etal-2019-multilingual,\n    title = \"Multilingual Constituency Parsing with Self-Attention and Pre-Training\",\n    author = \"Kitaev, Nikita  and\n      Cao, Steven  and\n      Klein, Dan\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP19-1340\",\n    doi = \"10.18653\u002Fv1\u002FP19-1340\",\n    pages = \"3499--3505\",\n}\n\n@inproceedings{kitaev-klein-2018-constituency,\n    title = \"Constituency Parsing with a Self-Attentive Encoder\",\n    author = \"Kitaev, Nikita  and\n      Klein, Dan\",\n    booktitle = \"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: 
Long Papers)\",\n    month = jul,\n    year = \"2018\",\n    address = \"Melbourne, Australia\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP18-1249\",\n    doi = \"10.18653\u002Fv1\u002FP18-1249\",\n    pages = \"2676--2686\",\n}\n```\n\n## Credits\n\nThe code in this repository and portions of this README are based on https:\u002F\u002Fgithub.com\u002Fmitchellstern\u002Fminimal-span-parser\n","# 伯克利神经依存句法分析器\n\n一款高精度的句法分析器，支持11种语言，使用Python实现。该工具基于2018年ACL会议论文《基于自注意力编码器的成分句法分析》（[arXiv:1805.01052](https:\u002F\u002Farxiv.org\u002Fabs\u002F1805.01052)），并结合了[多语言成分句法分析：自注意力与预训练](https:\u002F\u002Farxiv.org\u002Fabs\u002F1812.11760)中提出的改进方法。\n\n**2021年2月更新：** 伯克利神经依存句法分析器现已发布0.2.0版本，所有语言的预训练模型质量均有所提升。推理阶段现采用PyTorch框架，而训练始终仅支持PyTorch。本版本不再支持Python 2.7和3.5。同时新增对自定义训练和使用解析器的支持，并可根据用户选择的[预训练模型](https:\u002F\u002Fhuggingface.co\u002Fmodels)进行配置。\n\n## 目录\n1. [安装](#installation)\n2. [使用](#usage)\n3. [可用模型](#available-models)\n4. [训练](#training)\n5. [实验复现](#reproducing-experiments)\n6. [引用](#citation)\n7. 
[致谢](#credits)\n\n如果您主要关注如何训练自己的句法分析模型，请直接跳转至本README中的[训练](#training)部分。\n\n## 安装\n\n要安装该句法分析器，运行以下命令：\n```bash\n$ pip install benepar\n```\n*注意：benepar 0.2.0是上一版本的重大升级，配备了全新且质量更高的解析模型。如果您暂不准备升级，可以将benepar版本固定为[之前的0.1.3版本](https:\u002F\u002Fgithub.com\u002Fnikitakit\u002Fself-attentive-parser\u002Ftree\u002Facl2019)。*\n\n需要Python 3.6及以上版本以及[PyTorch](https:\u002F\u002Fpytorch.org\u002F) 1.6及以上版本。请参阅PyTorch官网，了解如何选择GPU版或CPU版；benepar会在PyTorch可用时自动使用GPU。\n\n推荐的使用方式是将其与[spaCy](https:\u002F\u002Fspacy.io\u002F)集成。若使用spaCy，您应为您的语言安装相应的spaCy模型。例如，对于英语，安装命令如下：\n```sh\n$ python -m spacy download en_core_web_md\n```\n\nspaCy模型仅用于分词和句子切分。如果不需要超出句法分析的语言特定分析，也可以不使用特定语言的模型，而改用仅执行分词和句子切分的多语言模型。[此类模型](https:\u002F\u002Fspacy.io\u002Fmodels\u002Fxx#xx_sent_ud_sm)于spaCy 3.0中新增，适用于英语、德语、韩语、波兰语和瑞典语（但不适用于中文，因为它似乎不支持中文分词）。\n\n句法分析模型需单独下载，使用以下命令：\n```python\n>>> import benepar\n>>> benepar.download('benepar_en3')\n```\n\n完整模型列表请参见下方的[可用模型](#available-models)部分。\n\n## 使用\n\n### 与spaCy结合使用（推荐）\n\n推荐的使用方式是通过与spaCy的集成：\n```python\n>>> import benepar, spacy\n>>> nlp = spacy.load('en_core_web_md')\n>>> if spacy.__version__.startswith('2'):\n        nlp.add_pipe(benepar.BeneparComponent(\"benepar_en3\"))\n    else:\n        nlp.add_pipe(\"benepar\", config={\"model\": \"benepar_en3\"})\n>>> doc = nlp(\"The time for action is now. It's never too late to do something.\")\n>>> sent = list(doc.sents)[0]\n>>> print(sent._.parse_string)\n(S (NP (NP (DT The) (NN time)) (PP (IN for) (NP (NN action)))) (VP (VBZ is) (ADVP (RB now))) (. 
.))\n>>> sent._.labels\n('S',)\n>>> list(sent._.children)[0]\nThe time for action\n```\n\n由于spaCy并未提供官方的成分句法分析API，所有方法均通过扩展命名空间`Span._`和`Token._`访问。\n\n以下是可用的扩展属性：\n- `Span._.labels`：给定span的标签元组。当解析树中存在一元链时，一个span可能具有多个标签。\n- `Span._.parse_string`：给定span的解析树字符串表示。\n- `Span._.constituents`：按解析树的前序遍历迭代子构成的`Span`对象。\n- `Span._.parent`：解析树中的父span。\n- `Span._.children`：解析树中子span的迭代器。\n- `Token._.labels`、`Token._.parse_string`、`Token._.parent`：这些行为与调用包含该token的长度为1的Span上的相应方法相同。\n\n在非解析树中的span上调用这些方法会引发异常。可通过从句子级别（遍历`doc.sents`）或从单个`Token`对象开始遍历解析树来避免此类错误。\n\n### 与NLTK结合使用\n\n此外，还提供了NLTK接口，专为处理预分词的数据集和语料库设计，或者在已具备分词和句子切分功能的NLP流水线中集成解析器时使用。对于从原始文本开始的解析任务，**强烈建议**使用spaCy和`benepar.BeneparComponent`。\n\nNLTK的使用示例：\n```python\n>>> import benepar\n>>> parser = benepar.Parser(\"benepar_en3\")\n>>> input_sentence = benepar.InputSentence(\n    words=['\"', 'Fly', 'safely', '.', '\"'],\n    space_after=[False, True, False, False, False],\n    tags=['``', 'VB', 'RB', '.', \"''\"],\n    escaped_words=['``', 'Fly', 'safely', '.', \"''\"],\n)\n>>> tree = parser.parse(input_sentence)\n>>> print(tree)\n(TOP (S (`` ``) (VP (VB Fly) (ADVP (RB safely))) (. .) 
('' '')))\n```\n\n并非`benepar.InputSentence`的所有字段都是必需的，但至少需要指定`words`或`escaped_words`之一。解析器会尝试猜测缺失字段的值，例如：\n```python\n>>> input_sentence = benepar.InputSentence(\n    words=['\"', 'Fly', 'safely', '.', '\"'],\n)\n>>> parser.parse(input_sentence)\n```\n\n使用`parse_sents`可解析多条句子。\n```python\n>>> input_sentence1 = benepar.InputSentence(\n    words=['The', 'time', 'for', 'action', 'is', 'now', '.'],\n)\n>>> input_sentence2 = benepar.InputSentence(\n    words=['It', \"'s\", 'never', 'too', 'late', 'to', 'do', 'something', '.'],\n)\n>>> parser.parse_sents([input_sentence1, input_sentence2])\n```\n\n部分解析模型也支持Unicode文本输入，用于调试或交互式使用，但在任何对解析准确性有要求的应用场景中，*强烈不建议*直接传入原始文本字符串。\n```python\n>>> parser.parse('\"Fly safely.\"')  # 仅用于调试\u002F交互式使用。\n```\n\n从原始文本进行解析时，我们仍推荐使用spaCy和`benepar.BeneparComponent`。原因是解析模型本身并不自带分词器或句子切分器，某些模型甚至可能没有词性标注器。因此必须借助其他工具来补充这些环节，而spaCy在这些方面的表现普遍优于NLTK，有时甚至差距显著。\n\n## 可用模型\n\n以下是可用的已训练解析器模型。要使用 spaCy 集成，您还需要安装适用于相应语言的 [spaCy 模型](https:\u002F\u002Fspacy.io\u002Fmodels)。\n\n模型       | 语言 | 信息\n----------- | -------- | ----\n`benepar_en3` | 英语 | 在[修订版](https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC2015T13) WSJ 测试集上取得 95.40 的 F1 分数。训练数据采用了基于与 English Web Treebank 和 OntoNotes 相同指南的修订版分词和句法标注，这更符合 spaCy 等库中的现代分词实践。基于 T5-small。\n`benepar_en3_large` | 英语 | 在[修订版](https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC2015T13) WSJ 测试集上取得 96.29 的 F1 分数。训练数据采用了基于与 English Web Treebank 和 OntoNotes 相同指南的修订版分词和句法标注，这更符合 spaCy 等库中的现代分词实践。基于 T5-large。\n`benepar_zh2` | 中文 | 在 CTB 5.1 测试集上取得 92.56 的 F1 分数。与 spaCy 一起使用时，支持从原始文本进行解析，但 NLTK API 仅支持对已分词句子进行解析。基于 Chinese ELECTRA-180G-large。\n`benepar_ar2` | 阿拉伯语 | 在 SPMRL2013\u002F2014 测试集上取得 90.52 的 F1 分数。仅支持使用 NLTK API 对已分词句子进行解析。不支持从原始文本解析或与 spaCy 集成。基于 XLM-R。\n`benepar_de2` | 德语 | 在 SPMRL2013\u002F2014 测试集上取得 92.10 的 F1 分数。基于 XLM-R。\n`benepar_eu2` | 巴斯克语 | 在 SPMRL2013\u002F2014 测试集上取得 93.36 的 F1 分数。与 spaCy 一起使用时，首先需要在 spaCy 中实现巴斯克语支持。基于 XLM-R。\n`benepar_fr2` | 法语 | 在 SPMRL2013\u002F2014 测试集上取得 88.43 的 F1 分数。基于 
XLM-R。\n`benepar_he2` | 希伯来语 | 在 SPMRL2013\u002F2014 测试集上取得 93.98 的 F1 分数。仅支持使用 NLTK API 对已分词句子进行解析。不支持从原始文本解析或与 spaCy 集成。基于 XLM-R。\n`benepar_hu2` | 匈牙利语 | 在 SPMRL2013\u002F2014 测试集上取得 96.19 的 F1 分数。与 spaCy 一起使用时，需要一个[匈牙利语 spaCy 模型](https:\u002F\u002Fgithub.com\u002Foroszgy\u002Fspacy-hungarian-models)。NLTK API 仅支持对已分词句子进行解析。基于 XLM-R。\n`benepar_ko2` | 韩语 | 在 SPMRL2013\u002F2014 测试集上取得 91.72 的 F1 分数。可与 spaCy 的[多语言句子分割模型](https:\u002F\u002Fspacy.io\u002Fmodels\u002Fxx#xx_sent_ud_sm)一起使用（需 spaCy v3.0）。NLTK API 仅支持对已分词句子进行解析。基于 XLM-R。\n`benepar_pl2` | 波兰语 | 在 SPMRL2013\u002F2014 测试集上取得 97.15 的 F1 分数。基于 XLM-R。\n`benepar_sv2` | 瑞典语 | 在 SPMRL2013\u002F2014 测试集上取得 92.21 的 F1 分数。可与 spaCy 的[多语言句子分割模型](https:\u002F\u002Fspacy.io\u002Fmodels\u002Fxx#xx_sent_ud_sm)一起使用（需 spaCy v3.0）。基于 XLM-R。\n`benepar_en3_wsj` | 英语 | **建议改用 `benepar_en3` 或 `benepar_en3_large`**。在用于数十年英语短语结构解析研究的[标准版](https:\u002F\u002Fcatalog.ldc.upenn.edu\u002FLDC99T42) WSJ 测试集上取得 95.55 的 F1 分数。基于 BERT-large-uncased。我们认为，用于训练 `benepar_en3`\u002F`benepar_en3_large` 的修订版标注指南更适合下游应用，因为它们更能处理网络文本中的语言使用，并且与现代依存句法分析及 spaCy 等库的做法更加一致。尽管如此，我们仍提供 `benepar_en3_wsj` 模型，以备在不适合采用修订版树库规范的情况下使用，例如在同一数据集上比较不同模型的表现。\n\n## 训练\n\n训练需要从 GitHub 克隆本仓库。虽然 `src\u002Fbenepar` 中的模型代码已发布在 PyPI 上的 `benepar` 包中，但直接位于 `src\u002F` 下的训练和评估脚本并未包含在内。\n\n#### 训练所需的软件\n* Python 3.7 或更高版本。\n* [PyTorch](http:\u002F\u002Fpytorch.org\u002F) 1.6.0 或任何兼容版本。\n* `benepar` 包所需的所有依赖项，包括：[NLTK](https:\u002F\u002Fwww.nltk.org\u002F) 3.2、[torch-struct](https:\u002F\u002Fgithub.com\u002Fharvardnlp\u002Fpytorch-struct) 0.4、[transformers](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Ftransformers) 4.3.0 或兼容版本。\n* [pytokenizations](https:\u002F\u002Fgithub.com\u002Ftamuhey\u002Ftokenizations\u002F) 0.7.2 或兼容版本。\n* [EVALB](http:\u002F\u002Fnlp.cs.nyu.edu\u002Fevalb\u002F)。开始前，请在 `EVALB\u002F` 目录下运行 `make` 编译出 `evalb` 可执行文件。该文件将在 Python 中调用以进行评估。若在 SPMRL 数据集上训练，则需在 `EVALB_SPMRL\u002F` 目录下运行 `make`。\n\n### 训练说明\n\n可以使用命令 `python src\u002Fmain.py train ...` 
来训练新模型。可用的参数包括：\n\n参数 | 描述 | 默认值\n--- | --- | ---\n`--model-path-base` | 用于保存模型的基础路径 | 无\n`--evalb-dir` | EVALB 目录的路径 | `EVALB\u002F`\n`--train-path` | 训练树文件的路径 | `data\u002Fwsj\u002Ftrain_02-21.LDC99T42`\n`--train-path-text` | 训练数据的可选非破坏性分词 | 自动猜测原始文本；参见 `--text-processing`\n`--dev-path` | 开发集树文件的路径 | `data\u002Fwsj\u002Fdev_22.LDC99T42`\n`--dev-path-text` | 开发集数据的可选非破坏性分词 | 自动猜测原始文本；参见 `--text-processing`\n`--text-processing` | 从非破坏性分词的树文件中推断原始文本的启发式方法。参见 `src\u002Ftreebanks.py` 中的 `load_trees()` | 适用于阿拉伯语、汉语和希伯来语以外语言的默认规则\n`--subbatch-max-tokens` | 训练时并行处理的最大词数（完整批次可能无法放入 GPU 内存） | 2000\n`--parallelize` | 将预训练模型（如 T5）的层分布到多个 GPU 上。 | 最多使用一个 GPU\n`--batch-size` | 每次训练更新的样本数量 | 32\n`--checks-per-epoch` | 每个 epoch 的开发集评估次数 | 4\n`--numpy-seed` | NumPy 随机种子 | 随机\n`--use-pretrained` | 使用预训练编码器 | 不使用预训练编码器\n`--pretrained-model` | 如果指定了 `--use-pretrained`，则使用的模型。可以是路径或 [HuggingFace Model Hub](https:\u002F\u002Fhuggingface.co\u002Fmodels) 中的模型 ID | `bert-base-uncased`\n`--predict-tags` | 向解析器添加词性标注组件及辅助损失 | 不预测词性标签\n`--use-chars-lstm` | 使用学习到的 CharLSTM 词表示 | 不使用 CharLSTM\n`--use-encoder` | 在预训练模型或 CharLSTM 之上使用学习到的 Transformer 层 | 不使用额外的 Transformer 层\n`--num-layers` | 如果指定了 `--use-encoder`，则使用的 Transformer 层数量 | 8\n`--encoder-max-len` | 允许额外 Transformer 层处理的最大句子长度（以词为单位） | 512\n\n其他超参数也有对应的命令行参数，详见 `src\u002Fmain.py` 中的 `make_hparams()`。这些参数可以在命令行中指定，例如 `--num-layers 2`（数值型参数）、`--predict-tags`（默认为 False 的布尔型参数），或 `--no-XXX`（默认为 True 的布尔型参数）。\n\n每次开发集评估时，都会计算开发集上的 F 分数，并与之前的最佳结果进行比较。如果当前模型更好，则会删除之前的模型并保存当前模型。新文件名将基于提供的模型基础路径和开发集 F 分数生成。\n\n在训练解析器之前，您需要先获取合适的训练数据。我们提供了[关于如何处理 PTB、CTB 以及 SPMRL 2013\u002F2014 共享任务数据等标准数据集的说明](data\u002FREADME.md)。按照英语 WSJ 数据的说明操作后，您可以使用以下命令，以默认超参数训练一个英语解析器：\n\n```\npython src\u002Fmain.py train --use-pretrained --model-path-base models\u002Fen_bert_base\n```\n\n更多合适的超参数选择示例，请参阅 [`EXPERIMENTS.md`](EXPERIMENTS.md)。\n\n### 评估说明\n\n可以使用命令 `python src\u002Fmain.py test ...` 对已保存的模型进行测试集评估，参数如下：\n\n参数 | 描述 | 默认值\n--- | --- | ---\n`--model-path` | 
已保存模型的路径 | 无\n`--evalb-dir` | EVALB 目录的路径 | `EVALB\u002F`\n`--test-path` | 测试树文件的路径 | `data\u002F23.auto.clean`\n`--test-path-text` | 测试数据的可选非破坏性分词 | 自动猜测原始文本；参见 `--text-processing`\n`--text-processing` | 从非破坏性分词的树文件中推断原始文本的启发式方法。参见 `src\u002Ftreebanks.py` 中的 `load_trees()` | 适用于阿拉伯语、汉语和希伯来语以外语言的默认规则\n`--test-path-raw` | 仅用于 EVALB 的测试树文件替代路径（用于双重检查对预处理树的评估是否包含任何错误） | 与 `--test-path` 中的树进行对比\n`--subbatch-max-tokens` | 并行处理的最大词数（GPU 内存不足以一次性处理整个数据集） | 500\n`--parallelize` | 将预训练模型（如 T5）的层分布到多个 GPU 上。 | 最多使用一个 GPU\n`--output-path` | 用于写入预测树的路径（使用 `\"-\"` 表示输出到标准输出）。 | 不保存预测树\n`--no-predict-tags` | 运行 EVALB 时使用金标准词性标签。这是发表论文的标准做法，省略此标志可能会导致 F1 分数虚高。 | 如果有可用的预测词性标签，则在 EVALB 中使用预测标签\n\n例如，您可以使用以下命令评估已训练好的模型：\n```\npython src\u002Fmain.py test --model-path models\u002Fen_bert_base_dev=*.pt\n```\n\n### 导出模型用于推理\n\n`benepar` 包可以直接使用保存的检查点文件，只需将模型名称（如 `benepar_en3`）替换为路径（如 `models\u002Fen_bert_base_dev=95.67.pt`）。然而，发布单文件检查点存在一些不足：\n* 单文件检查点不包含分词器或预训练模型配置。这些通常可以从 HuggingFace 模型库自动下载，但需要互联网连接，并且可能会（无意且不必要地）从 HuggingFace 模型库下载预训练权重。\n* 单文件检查点的体积是实际所需的 3 倍，因为它们还保存了优化器状态。\n\n可以使用 `src\u002Fexport.py` 脚本将检查点文件转换为一个包含训练好的模型所有信息的目录。例如：\n```\npython src\u002Fexport.py export \\\n  --model-path models\u002Fen_bert_base_dev=*.pt \\\n  --output-dir=models\u002Fen_bert_base\n```\n\n在导出时，还有一个 `--compress` 选项，它会轻微调整模型权重，以便将输出目录压缩成一个体积小得多的 ZIP 压缩包。我们在官方模型发布中会使用此选项，因为分发大小超过 2GB 的模型权重非常麻烦。使用 `--compress` 选项时，建议指定一个测试集，以验证压缩对解析准确率的影响确实很小。不建议使用开发数据进行验证，因为在训练过程中，开发数据已经被用作模型选择的标准。\n\n```\npython src\u002Fexport.py export \\\n  --model-path models\u002Fen_bert_base_dev=*.pt \\\n  --output-dir=models\u002Fen_bert_base \\\n  --test-path=data\u002Fwsj\u002Ftest_23.LDC99T42\n```\n\n`src\u002Fexport.py` 脚本还有一个 `test` 子命令，与 `python src\u002Fmain.py test` 大致相同，只是它支持已导出的模型，并且标志略有不同。我们可以运行以下命令来验证我们使用的 BERT-large-uncased 英语句法分析器确实在标准 WSJ 测试集上达到了 95.55 的 F1 分数：\n```\npython src\u002Fexport.py test --model-path benepar_en3_wsj --test-path data\u002Fwsj\u002Ftest_23.LDC99T42\n```\n\n## 重现实验\n\n请参阅 
[`EXPERIMENTS.md`](EXPERIMENTS.md)，了解如何重现我们发表在 ACL 2018 和 2019 年论文中的实验。\n\n## 引用\n\n如果您将本软件用于研究，请按如下方式引用我们的论文：\n\n```\n@inproceedings{kitaev-etal-2019-multilingual,\n    title = \"Multilingual Constituency Parsing with Self-Attention and Pre-Training\",\n    author = \"Kitaev, Nikita  and\n      Cao, Steven  and\n      Klein, Dan\",\n    booktitle = \"Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics\",\n    month = jul,\n    year = \"2019\",\n    address = \"Florence, Italy\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP19-1340\",\n    doi = \"10.18653\u002Fv1\u002FP19-1340\",\n    pages = \"3499--3505\",\n}\n\n@inproceedings{kitaev-klein-2018-constituency,\n    title = \"Constituency Parsing with a Self-Attentive Encoder\",\n    author = \"Kitaev, Nikita  and\n      Klein, Dan\",\n    booktitle = \"Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)\",\n    month = jul,\n    year = \"2018\",\n    address = \"Melbourne, Australia\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002FP18-1249\",\n    doi = \"10.18653\u002Fv1\u002FP18-1249\",\n    pages = \"2676--2686\",\n}\n```\n\n## 致谢\n\n本仓库中的代码以及本 README 的部分内容基于 https:\u002F\u002Fgithub.com\u002Fmitchellstern\u002Fminimal-span-parser。","# self-attentive-parser (Berkeley Neural Parser) 快速上手指南\n\n`self-attentive-parser` (包名：`benepar`) 是一个基于自注意力机制的高精度成分句法分析器，支持包括中文在内的 11 种语言。本指南将帮助你快速在 Python 环境中部署并使用该工具。\n\n## 1. 环境准备\n\n在开始之前，请确保你的开发环境满足以下要求：\n\n*   **操作系统**：Linux, macOS 或 Windows\n*   **Python 版本**：3.6 或更高（训练建议 3.7+）\n*   **核心依赖**：\n    *   [PyTorch](https:\u002F\u002Fpytorch.org\u002F) 1.6 或更高版本\n    *   [spaCy](https:\u002F\u002Fspacy.io\u002F) (推荐用于分词和句子分割)\n*   **硬件加速**：如果系统可用 GPU，`benepar` 会自动利用 PyTorch 进行加速。\n\n> **提示**：安装 PyTorch 时，建议访问 [PyTorch 官网](https:\u002F\u002Fpytorch.org\u002Fget-started\u002Flocally\u002F) 获取与操作系统和 CUDA 版本相匹配的安装命令；国内用户也可使用清华\u002F中科大镜像源加速下载。\n\n## 2. 
安装步骤\n\n### 2.1 安装主程序包\n使用 pip 安装 `benepar`：\n\n```bash\npip install benepar\n```\n\n### 2.2 安装 spaCy 及语言模型\n推荐使用 spaCy 进行预处理（分词和断句）。以英语为例，安装命令如下：\n\n```bash\npython -m spacy download en_core_web_md\n```\n\n**中文用户注意**：\n若需处理中文文本，请安装对应的 spaCy 中文模型（例如 `zh_core_web_sm` 或 `zh_core_web_md`）：\n```bash\npython -m spacy download zh_core_web_md\n```\n*注：部分语言（如阿拉伯语、希伯来语）的模型暂不支持直接从原始文本解析，仅支持通过 NLTK 接口处理已分词数据。*\n\n### 2.3 下载预训练解析模型\n解析模型需要单独下载。根据目标语言选择对应的模型名称（见下文“可用模型”），并在 Python 中执行下载：\n\n```python\nimport benepar\n# 示例：下载英语模型 benepar_en3\nbenepar.download('benepar_en3')\n\n# 示例：下载中文模型 benepar_zh2\n# benepar.download('benepar_zh2')\n```\n\n## 3. 基本使用\n\n最推荐的方式是通过 **spaCy 集成** 使用，它可以自动处理分词和句子边界检测。\n\n### 3.1 基于 spaCy 的使用示例\n\n```python\nimport benepar\nimport spacy\n\n# 1. 加载 spaCy 语言模型 (此处以英语为例，中文请替换为 'zh_core_web_md')\nnlp = spacy.load('en_core_web_md')\n\n# 2. 添加 benepar 组件\n# spaCy v3.x+ 写法\nnlp.add_pipe(\"benepar\", config={\"model\": \"benepar_en3\"})\n# 如果是 spaCy v2.x，请使用: nlp.add_pipe(benepar.BeneparComponent(\"benepar_en3\"))\n\n# 3. 处理文本\ndoc = nlp(\"The time for action is now. It's never too late to do something.\")\n\n# 4. 获取第一句话的分析结果\nsent = list(doc.sents)[0]\n\n# 打印成分句法树字符串\nprint(sent._.parse_string)\n# 输出示例: (S (NP (NP (DT The) (NN time)) ...) ...)\n\n# 获取当前 Span 的标签\nprint(sent._.labels)\n# 输出: ('S',)\n\n# 遍历子成分\nfor child in sent._.children:\n    print(child.text, \"->\", child._.labels)\n```\n\n### 3.2 基于 NLTK 的使用示例（仅限已分词数据）\n\n如果你已有分好词的数据集，或需要更底层的控制，可以使用 NLTK 接口：\n\n```python\nimport benepar\n\n# 初始化解析器\nparser = benepar.Parser(\"benepar_en3\")\n\n# 构建输入句子对象 (至少提供 words 列表)\ninput_sentence = benepar.InputSentence(\n    words=['\"', 'Fly', 'safely', '.', '\"'],\n    tags=['``', 'VB', 'RB', '.', \"''\"], # 可选：提供词性标注以提高准确率\n)\n\n# 执行解析\ntree = parser.parse(input_sentence)\nprint(tree)\n# 输出: (TOP (S (`` ``) (VP (VB Fly) (ADVP (RB safely))) (. .) 
('' '')))\n```\n\n## 附：常用预训练模型列表\n\n| 模型名称 | 语言 | 说明 |\n| :--- | :--- | :--- |\n| `benepar_en3` | 英语 | 推荐默认模型，基于 T5-small，F1 95.40 |\n| `benepar_en3_large` | 英语 | 高精度模型，基于 T5-large，F1 96.29 |\n| `benepar_zh2` | **中文** | 基于 Chinese ELECTRA，F1 92.56，支持 spaCy 原始文本解析 |\n| `benepar_de2` | 德语 | 基于 XLM-R |\n| `benepar_fr2` | 法语 | 基于 XLM-R |\n| `benepar_ko2` | 韩语 | 基于 XLM-R |\n| `benepar_pl2` | 波兰语 | 基于 XLM-R，F1 97.15 |\n\n*完整模型列表请参考官方文档。对于中文开发者优先推荐使用 `benepar_zh2`。*","某跨国教育科技公司的 NLP 团队正在构建一个多语言语法纠错系统，需要精准识别句子中的主谓宾结构以定位语法错误。\n\n### 没有 self-attentive-parser 时\n- **句法分析精度不足**：依赖传统统计模型或基础规则，在处理长难句或复杂嵌套结构时，经常错误划分短语边界，导致纠错建议误导用户。\n- **多语言支持割裂**：为英语、德语、韩语等不同语言维护多套独立的解析管线，代码冗余严重且难以统一迭代升级。\n- **深层语义缺失**：只能获取浅层的词性标注，无法生成完整的成分树（Constituency Tree），难以判断“动作发出者”与“承受者”的逻辑关系。\n- **集成开发成本高**：缺乏与主流工业级框架（如 spaCy）的原生对接，每次调用需编写大量胶水代码进行数据格式转换。\n\n### 使用 self-attentive-parser 后\n- **解析准确率显著提升**：基于自注意力机制的神经模型能精准拆解复杂句式，即使是多层嵌套的从句也能生成正确的树状结构，大幅降低误报率。\n- **一套代码覆盖全球**：直接加载预训练的 11 种语言模型，统一了后端处理逻辑，新语言上线仅需切换模型配置，无需重构核心代码。\n- **结构化信息触手可及**：通过 spaCy 扩展属性直接获取 `parse_string` 和成分标签，轻松提取主语、谓语等关键片段，为纠错逻辑提供坚实依据。\n- **工程落地丝滑顺畅**：原生支持 spaCy 管道集成，几行代码即可将高精度解析能力嵌入现有流程，且自动利用 GPU 加速推理，延迟极低。\n\nself-attentive-parser 通过引入高精度的神经成分句法分析，将多语言深层语义理解从复杂的科研实验变成了可一键集成的工业级能力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fnikitakit_self-attentive-parser_a3a7b66e.png","nikitakit","Nikita Kitaev","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fnikitakit_2298caf1.png",null,"UC Berkeley","nikitakit@gmail.com","kitaev.com","https:\u002F\u002Fgithub.com\u002Fnikitakit",[84,88,92,96,100,104],{"name":85,"color":86,"percentage":87},"Python","#3572A5",65.4,{"name":89,"color":90,"percentage":91},"C","#555555",30.4,{"name":93,"color":94,"percentage":95},"Shell","#89e051",2.3,{"name":97,"color":98,"percentage":99},"Scilab","#ca0f21",1.5,{"name":101,"color":102,"percentage":103},"Makefile","#427819",0.5,{"name":105,"color":106,"percentage":107},"Perl","#0298c3",0,906,158,"2026-04-02T18:43:49","MIT","未说明","非必需。PyTorch 
会自动检测并使用可用的 GPU，支持 CPU 模式运行。具体显卡型号、显存大小及 CUDA 版本未在文档中明确指定，取决于所选预训练模型（如 T5-large, BERT-large, XLM-R 等）的大小。",{"notes":115,"python":116,"dependencies":117},"1. 推理默认使用 PyTorch（不再支持 TensorFlow）。2. 强烈建议通过 spaCy 集成使用以进行分词和句子分割；若仅使用 NLTK 接口，输入必须是已分词的文本。3. 解析模型需单独下载（例如 'benepar_en3'）。4. 训练时需要手动编译 EVALB 工具。5. 不同语言模型对 spaCy 的支持程度不同（如阿拉伯语和希伯来语不支持从原始文本解析，仅支持 NLTK 接口处理已分词数据）。6. 中文模型 ('benepar_zh2') 基于 Chinese ELECTRA，使用 spaCy 时需注意其分词限制。","3.6+ (推理), 3.7+ (训练)",[118,119,120,121,122,123],"torch>=1.6","spacy","nltk>=3.2","transformers>=4.3.0","torch-struct>=0.4","pytokenizations>=0.7.2",[15,26,14,13],[126,127,128,129,130,131],"nlp","natural-language-processing","parsing","parser","machine-learning","ai","2026-03-27T02:49:30.150509","2026-04-06T11:31:10.913145",[135,140,144,149,154,159,164,169],{"id":136,"question_zh":137,"answer_zh":138,"source_url":139},14717,"运行解析器时遇到 \"ValueError: No op named GatherV2 in defined operations\" 错误怎么办？","该错误通常由 TensorFlow 版本不兼容引起。维护者已更新 README 以包含具体的 TensorFlow 版本依赖要求。请检查并安装与 benepar 兼容的 TensorFlow 版本（通常建议参考项目文档中指定的版本，如 TensorFlow 1.x 系列），确保 Python、TensorFlow-GPU 和 cuDNN 版本匹配。","https:\u002F\u002Fgithub.com\u002Fnikitakit\u002Fself-attentive-parser\u002Fissues\u002F1",{"id":141,"question_zh":142,"answer_zh":143,"source_url":139},14718,"训练英文模型需要多少个 epoch 才算完成？","对于英文数据，训练通常需要 80-100 个 epoch，具体数量取决于语言特性。判断训练结束的实用方法是观察日志：当验证集（dev set）没有进展导致学习率衰减时，会打印 \"reducing learning rate\" 消息。当该消息打印 2-3 次后，训练基本完成。",{"id":145,"question_zh":146,"answer_zh":147,"source_url":148},14719,"如何导出模型？是否还需要使用 export_bert.py 脚本？","从 benepar v0.2.0a0 版本开始，不再支持导出为 TensorFlow 格式，也不再需要运行 export 脚本。用户可以直接使用 PyTorch 的检查点（checkpoints）。原有的导出代码是为旧版模型编写的临时方案，现已废弃。","https:\u002F\u002Fgithub.com\u002Fnikitakit\u002Fself-attentive-parser\u002Fissues\u002F59",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},14720,"在 PyTorch 1.2 及以上版本运行训练脚本时崩溃或报错怎么办？","这是由 PyTorch 版本变更引起的兼容性问题。解决方案是将代码中的 `torch.uint8` 切换为 `torch.bool`。此外，benepar v0.2.0a 及更高版本已更新以支持 PyTorch 
1.6，建议升级库版本或手动修改数据类型以修复此问题。","https:\u002F\u002Fgithub.com\u002Fnikitakit\u002Fself-attentive-parser\u002Fissues\u002F42",{"id":155,"question_zh":156,"answer_zh":157,"source_url":158},14721,"导入 benepar 时出现 \"ImportError: cannot import name 'chart_decoder'\" 错误如何解决？","该错误通常与 PyTorch 版本不兼容有关。维护者确认 PyTorch 1.1.0 可以正常工作，而 1.2.0 及以上版本可能会导致此问题。建议将 PyTorch 降级至 1.1.0 版本，或者升级到已修复该问题的最新 benepar 版本。","https:\u002F\u002Fgithub.com\u002Fnikitakit\u002Fself-attentive-parser\u002Fissues\u002F31",{"id":160,"question_zh":161,"answer_zh":162,"source_url":163},14722,"在 macOS 上安装 benepar 失败怎么办？","macOS 用户安装失败通常是因为缺少 SDK 头文件。解决方法是按照相关指南（如 Frida 项目的 issue #338）安装 macOS SDK headers 包。安装完成后重新运行 pip install 命令即可解决。","https:\u002F\u002Fgithub.com\u002Fnikitakit\u002Fself-attentive-parser\u002Fissues\u002F9",{"id":165,"question_zh":166,"answer_zh":167,"source_url":168},14723,"安装时出现 \"RuntimeError: module compiled against API version... but this version of numpy is...\" 错误？","这是由于 NumPy 版本与编译模块所需的 API 版本不匹配导致的。通常发生在 Anaconda 环境中。尝试更新或重新安装 NumPy (`pip install --upgrade numpy` 或 `conda install numpy`)，确保其版本与安装的 TensorFlow 或其他依赖库兼容。如果问题依旧，可能需要重建虚拟环境。","https:\u002F\u002Fgithub.com\u002Fnikitakit\u002Fself-attentive-parser\u002Fissues\u002F7",{"id":170,"question_zh":171,"answer_zh":172,"source_url":168},14724,"英文解析器的 constituency parse tree tags（成分句法树标签）遵循什么标准？","英文解析器是基于 Penn Treebank (PTB) 数据集训练的，因此其输出的成分句法树标签遵循 Penn Treebank Guidelines。用户可以参考 Penn Treebank 的官方文档来理解这些标签的具体含义。",[174],{"id":175,"version":176,"summary_zh":177,"released_at":178},81630,"models","随附的是部分预训练模型：\n- `en_charlstm_dev.93.61.pt`：我们2018年ACL论文中表现最佳的单系统英语句法分析器，不依赖外部词表示。\n- `en_elmo_dev.95.21.pt`：我们2018年ACL论文中表现最佳的单系统英语句法分析器。使用该分析器需要ELMo权重文件，需单独下载。\n- `benepar_*.zip`：这些文件可通过`benepar.download`自动下载，因此无需手动下载。\n- `en_elmo_ensemble_tfgraph.pb.gz`：此文件可通过`benepar.download('benepar_en_ensemble')`自动下载，因此无需手动下载。\n- `en_elmo_small_tfgraph.pb.gz`：此文件可通过`benepar.download('benepar_en_small')`自动下载，因此无需手动下载。\n- 
`en_elmo_tfgraph.pb.gz`：此文件可通过`benepar.download('benepar_en')`自动下载，因此无需手动下载。\n\n请勿使用此处的“源代码”链接；有关如何运行代码的说明，请参阅[README](https:\u002F\u002Fgithub.com\u002Fnikitakit\u002Fself-attentive-parser#readme)。","2018-04-27T07:57:29"]