[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-stanfordnlp--stanza":3,"tool-stanfordnlp--stanza":64},[4,17,27,35,43,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,3,"2026-04-05T11:01:52",[13,14,15],"开发框架","图像","Agent","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":23,"last_commit_at":24,"category_tags":25,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,15,26],"语言模型",{"id":28,"name":29,"github_repo":30,"description_zh":31,"stars":32,"difficulty_score":23,"last_commit_at":33,"category_tags":34,"status":16},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 
绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",107662,"2026-04-03T11:11:01",[13,14,15],{"id":36,"name":37,"github_repo":38,"description_zh":39,"stars":40,"difficulty_score":23,"last_commit_at":41,"category_tags":42,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,26],{"id":44,"name":45,"github_repo":46,"description_zh":47,"stars":48,"difficulty_score":23,"last_commit_at":49,"category_tags":50,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[14,51,52,53,15,54,26,13,55],"数据工具","视频","插件","其他","音频",{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":10,"last_commit_at":62,"category_tags":63,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 
是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,"2026-04-04T04:44:48",[15,14,13,26,54],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":69,"readme_en":70,"readme_zh":71,"quickstart_zh":72,"use_case_zh":73,"hero_image_url":74,"owner_login":75,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":79,"owner_twitter":75,"owner_website":80,"owner_url":81,"languages":82,"stars":103,"forks":104,"last_commit_at":105,"license":106,"difficulty_score":107,"env_os":108,"env_gpu":108,"env_ram":108,"env_deps":109,"category_tags":115,"github_topics":116,"view_count":127,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":128,"updated_at":129,"faqs":130,"releases":158},394,"stanfordnlp\u002Fstanza","stanza","Stanford NLP Python library for tokenization, sentence segmentation, NER, and parsing of many human languages","Stanza 是斯坦福大学自然语言处理小组推出的官方 Python 库，旨在为多语言自然语言处理提供一站式解决方案。它涵盖了分词、句子分割、命名实体识别及句法分析等核心功能，支持全球 60 多种人类语言。\n\n过去，开发者在不同语言间切换 NLP 模型往往面临接口不一致和部署繁琐的难题，Stanza 通过统一的 Python 接口简化了这一流程。它不仅集成了基于 PyTorch 的高效神经流水线，还能无缝调用经典的 Java Stanford CoreNLP 软件，兼顾了性能与兼容性。\n\n特别值得一提的是，Stanza 近期推出了针对生物医学和临床领域的专用模型包，极大地便利了医疗文本的分析工作。无论是希望快速搭建多语言文本处理系统的开发者，还是需要进行深度语言结构研究的科研人员，都能从中受益。其开源且持续维护的特性，使其成为当前多语言 NLP 任务中值得信赖的选择。","\u003Cdiv align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fstanfordnlp_stanza_readme_fc2465468d89.png\" height=\"100px\"\u002F>\u003C\u002Fdiv>\n\n\u003Ch2 align=\"center\">Stanza: A Python NLP Library for Many 
Human Languages\u003C\u002Fh2>\n\n\u003Cdiv align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Factions\">\n       \u003Cimg alt=\"Run Tests\" src=\"https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Factions\u002Fworkflows\u002Fstanza-tests.yaml\u002Fbadge.svg\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fstanza\u002F\">\n        \u003Cimg alt=\"PyPI Version\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fstanza?color=blue\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fanaconda.org\u002Fstanfordnlp\u002Fstanza\">\n        \u003Cimg alt=\"Conda Versions\" src=\"https:\u002F\u002Fimg.shields.io\u002Fconda\u002Fvn\u002Fstanfordnlp\u002Fstanza?color=blue&label=conda\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fstanza\u002F\">\n        \u003Cimg alt=\"Python Versions\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fstanza?colorB=blue\">\n    \u003C\u002Fa>\n\u003C\u002Fdiv>\n\nThe Stanford NLP Group's official Python NLP library. It contains support for running various accurate natural language processing tools on 60+ languages and for accessing the Java Stanford CoreNLP software from Python. For detailed information please visit our [official website](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002F).\n\n🔥 &nbsp;A new collection of **biomedical** and **clinical** English model packages are now available, offering seamless experience for syntactic analysis and named entity recognition (NER) from biomedical literature text and clinical notes. 
For more information, check out our [Biomedical models documentation page](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Fbiomed.html).\n\n### References\n\nIf you use this library in your research, please kindly cite our [ACL2020 Stanza system demo paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2003.07082):\n\n```bibtex\n@inproceedings{qi2020stanza,\n    title={Stanza: A {Python} Natural Language Processing Toolkit for Many Human Languages},\n    author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.},\n    booktitle = \"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations\",\n    year={2020}\n}\n```\n\nIf you use our biomedical and clinical models, please also cite our [Stanza Biomedical Models description paper](https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.14640):\n\n```bibtex\n@article{zhang2021biomedical,\n    author = {Zhang, Yuhao and Zhang, Yuhui and Qi, Peng and Manning, Christopher D and Langlotz, Curtis P},\n    title = {Biomedical and clinical {E}nglish model packages for the {S}tanza {P}ython {NLP} library},\n    journal = {Journal of the American Medical Informatics Association},\n    year = {2021},\n    month = {06},\n    issn = {1527-974X}\n}\n```\n\nThe PyTorch implementation of the neural pipeline in this repository is due to [Peng Qi](http:\u002F\u002Fqipeng.me) (@qipeng), [Yuhao Zhang](http:\u002F\u002Fyuhao.im) (@yuhaozhang), and [Yuhui Zhang](https:\u002F\u002Fcs.stanford.edu\u002F~yuhuiz\u002F) (@yuhui-zh15), with help from [Jason Bolton](mailto:jebolton@stanford.edu) (@j38), [Tim Dozat](https:\u002F\u002Fweb.stanford.edu\u002F~tdozat\u002F) (@tdozat) and [John Bauer](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fjohn-bauer-b3883b60\u002F) (@AngledLuffa). 
Maintenance of this repo is currently led by [John Bauer](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fjohn-bauer-b3883b60\u002F).\n\nIf you use the CoreNLP software through Stanza, please cite the CoreNLP software package and the respective modules as described [here](https:\u002F\u002Fstanfordnlp.github.io\u002FCoreNLP\u002F#citing-stanford-corenlp-in-papers) (\"Citing Stanford CoreNLP in papers\"). The CoreNLP client is mostly written by [Arun Chaganty](http:\u002F\u002Farun.chagantys.org\u002F), and [Jason Bolton](mailto:jebolton@stanford.edu) spearheaded merging the two projects together.\n\nIf you use the Semgrex or Ssurgeon part of CoreNLP, please cite [our GURT paper on Semgrex and Ssurgeon](https:\u002F\u002Faclanthology.org\u002F2023.tlt-1.7\u002F):\n\n```bibtex\n@inproceedings{bauer-etal-2023-semgrex,\n    title = \"Semgrex and Ssurgeon, Searching and Manipulating Dependency Graphs\",\n    author = \"Bauer, John  and\n      Kiddon, Chlo{\\'e}  and\n      Yeh, Eric  and\n      Shan, Alex  and\n      D. Manning, Christopher\",\n    booktitle = \"Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT\u002FSyntaxFest 2023)\",\n    month = mar,\n    year = \"2023\",\n    address = \"Washington, D.C.\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2023.tlt-1.7\",\n    pages = \"67--73\",\n    abstract = \"Searching dependency graphs and manipulating them can be a time consuming and challenging task to get right. We document Semgrex, a system for searching dependency graphs, and introduce Ssurgeon, a system for manipulating the output of Semgrex. The compact language used by these systems allows for easy command line or API processing of dependencies. 
Additionally, integration with publicly released toolkits in Java and Python allows for searching text relations and attributes over natural text.\",\n}\n```\n\n## Issues and Usage Q&A\n\nTo ask questions, report issues or request features 🤔, please use the [GitHub Issue Tracker](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues). Before creating a new issue, please make sure to search for existing issues that may solve your problem, or visit the [Frequently Asked Questions (FAQ) page](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Ffaq.html) on our website.\n\n## Contributing to Stanza\n\nWe welcome community contributions to Stanza in the form of bugfixes 🛠️ and enhancements 💡! If you want to contribute, please first read [our contribution guideline](CONTRIBUTING.md).\n\n## Installation\n\n### pip\n\nStanza supports Python 3.6 or later. We recommend that you install Stanza via [pip](https:\u002F\u002Fpip.pypa.io\u002Fen\u002Fstable\u002Finstalling\u002F), the Python package manager. To install, simply run:\n```bash\npip install stanza\n```\nThis should also help resolve all of the dependencies of Stanza, for instance [PyTorch](https:\u002F\u002Fpytorch.org\u002F) 1.3.0 or above.\n\nIf you currently have a previous version of `stanza` installed, use:\n```bash\npip install stanza -U\n```\n\n### Anaconda\n\nTo install Stanza via Anaconda, use the following conda command:\n\n```bash\nconda install -c stanfordnlp stanza\n```\n\nNote that for now installing Stanza via Anaconda does not work for Python 3.10. For Python 3.10 please use pip installation.\n\n### From Source\n\nAlternatively, you can also install from source of this git repository, which will give you more flexibility in developing on top of Stanza. 
For this option, run\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza.git\ncd stanza\npip install -e .\n```\n\n## Running Stanza\n\n### Getting Started with the neural pipeline\n\nTo run your first Stanza pipeline, simply follow these steps in your Python interactive interpreter:\n\n```python\n>>> import stanza\n>>> stanza.download('en')       # Optional: pre-download English models (Pipeline can auto-download if needed)\n>>> nlp = stanza.Pipeline('en') # This sets up a default neural pipeline in English\n>>> doc = nlp(\"Barack Obama was born in Hawaii. He was elected president in 2008.\")\n>>> doc.sentences[0].print_dependencies()\n```\n\nIf you encounter `requests.exceptions.ConnectionError`, please try to use a proxy:\n\n```python\n>>> import stanza\n>>> proxies = {'http': 'http:\u002F\u002Fip:port', 'https': 'http:\u002F\u002Fip:port'}\n>>> stanza.download('en', proxies=proxies)  # Optional: pre-download English models (Pipeline can auto-download if needed)\n>>> nlp = stanza.Pipeline('en')             # This sets up a default neural pipeline in English\n>>> doc = nlp(\"Barack Obama was born in Hawaii. He was elected president in 2008.\")\n>>> doc.sentences[0].print_dependencies()\n```\n\nThe last command will print out the words in the first sentence in the input string (or [`Document`](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Fdata_objects.html#document), as it is represented in Stanza), as well as the indices for the word that governs it in the Universal Dependencies parse of that sentence (its \"head\"), along with the dependency relation between the words. 
The output should look like:\n\n```\n('Barack', '4', 'nsubj:pass')\n('Obama', '1', 'flat')\n('was', '4', 'aux:pass')\n('born', '0', 'root')\n('in', '6', 'case')\n('Hawaii', '4', 'obl')\n('.', '4', 'punct')\n```\n\nSee [our getting started guide](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Finstallation_usage.html#getting-started) for more details.\n\n### Accessing Java Stanford CoreNLP software\n\nAside from the neural pipeline, this package also includes an official wrapper for accessing the Java Stanford CoreNLP software with Python code.\n\nThere are a few initial setup steps.\n\n* Download [Stanford CoreNLP](https:\u002F\u002Fstanfordnlp.github.io\u002FCoreNLP\u002F) and models for the language you wish to use\n* Put the model jars in the distribution folder\n* Tell the Python code where Stanford CoreNLP is located by setting the `CORENLP_HOME` environment variable (e.g., in *nix): `export CORENLP_HOME=\u002Fpath\u002Fto\u002Fstanford-corenlp-4.5.3`\n\nWe provide [comprehensive examples](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Fcorenlp_client.html) in our documentation that show how one can use CoreNLP through Stanza and extract various annotations from it.\n\n### Online Colab Notebooks\n\nTo get your started, we also provide interactive Jupyter notebooks in the `demo` folder. You can also open these notebooks and run them interactively on [Google Colab](https:\u002F\u002Fcolab.research.google.com). 
To view all available notebooks, follow these steps:\n\n* Go to the [Google Colab website](https:\u002F\u002Fcolab.research.google.com)\n* Navigate to `File` -> `Open notebook`, and choose `GitHub` in the pop-up menu\n* Note that you do **not** need to give Colab access permission to your GitHub account\n* Type `stanfordnlp\u002Fstanza` in the search bar, and click enter\n\n### Trained Models for the Neural Pipeline\n\nWe currently provide models for all of the [Universal Dependencies](https:\u002F\u002Funiversaldependencies.org\u002F) treebanks v2.8, as well as NER models for a few widely-spoken languages. You can find instructions for downloading and using these models [here](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Fmodels.html).\n\n### Batching To Maximize Pipeline Speed\n\nTo maximize speed performance, it is essential to run the pipeline on batches of documents. Running a for loop on one sentence at a time will be very slow. The best approach at this time is to concatenate documents together, with each document separated by a blank line (i.e., two line breaks `\\n\\n`).  The tokenizer will recognize blank lines as sentence breaks. We are actively working on improving multi-document processing.\n\n## Training your own neural pipelines\n\nAll neural modules in this library can be trained with your own data. The tokenizer, the multi-word token (MWT) expander, the POS\u002Fmorphological features tagger, the lemmatizer and the dependency parser require [CoNLL-U](https:\u002F\u002Funiversaldependencies.org\u002Fformat.html) formatted data, while the NER model requires the BIOES format. Currently, we do not support model training via the `Pipeline` interface. 
Therefore, to train your own models, you need to clone this git repository and run training from the source.\n\nFor detailed step-by-step guidance on how to train and evaluate your own models, please visit our [training documentation](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Ftraining.html).\n\n## LICENSE\n\nStanza is released under the Apache License, Version 2.0. See the [LICENSE](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fblob\u002Fmaster\u002FLICENSE) file for more details.\n","\u003Cdiv align=\"center\">\u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fstanfordnlp_stanza_readme_fc2465468d89.png\" height=\"100px\"\u002F>\u003C\u002Fdiv>\n\n\u003Ch2 align=\"center\">Stanza: 一种用于多种人类语言的 Python NLP 库\u003C\u002Fh2>\n\n\u003Cdiv align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Factions\">\n       \u003Cimg alt=\"Run Tests\" src=\"https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Factions\u002Fworkflows\u002Fstanza-tests.yaml\u002Fbadge.svg\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fstanza\u002F\">\n        \u003Cimg alt=\"PyPI Version\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Fstanza?color=blue\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fanaconda.org\u002Fstanfordnlp\u002Fstanza\">\n        \u003Cimg alt=\"Conda Versions\" src=\"https:\u002F\u002Fimg.shields.io\u002Fconda\u002Fvn\u002Fstanfordnlp\u002Fstanza?color=blue&label=conda\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpypi.org\u002Fproject\u002Fstanza\u002F\">\n        \u003Cimg alt=\"Python Versions\" src=\"https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Fstanza?colorB=blue\">\n    \u003C\u002Fa>\n\u003C\u002Fdiv>\n\n斯坦福 NLP 小组的官方 Python 自然语言处理（NLP）库。它支持在 60 多种语言上运行各种准确的自然语言处理工具，并允许从 Python 访问 Java Stanford CoreNLP 软件。详细信息请访问我们的 
[官方网站](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002F)。\n\n🔥 &nbsp;现在提供了一组新的**生物医学**和**临床**英语模型包，为从生物医学文献文本和临床笔记中进行句法分析和命名实体识别（NER）提供了无缝体验。更多信息，请查看我们的 [生物医学模型文档页面](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Fbiomed.html)。\n\n### 引用\n\n如果您在研究中使用本库，请引用我们的 [ACL2020 Stanza 系统演示论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2003.07082)：\n\n```bibtex\n@inproceedings{qi2020stanza,\n    title={Stanza: A {Python} Natural Language Processing Toolkit for Many Human Languages},\n    author={Qi, Peng and Zhang, Yuhao and Zhang, Yuhui and Bolton, Jason and Manning, Christopher D.},\n    booktitle = \"Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics: System Demonstrations\",\n    year={2020}\n}\n```\n\n如果您使用了我们的生物医学和临床模型，也请引用我们的 [Stanza 生物医学模型描述论文](https:\u002F\u002Farxiv.org\u002Fabs\u002F2007.14640)：\n\n```bibtex\n@article{zhang2021biomedical,\n    author = {Zhang, Yuhao and Zhang, Yuhui and Qi, Peng and Manning, Christopher D and Langlotz, Curtis P},\n    title = {Biomedical and clinical {E}nglish model packages for the {S}tanza {P}ython {NLP} library},\n    journal = {Journal of the American Medical Informatics Association},\n    year = {2021},\n    month = {06},\n    issn = {1527-974X}\n}\n```\n\n本仓库中神经流水线的 PyTorch 实现归功于 [Peng Qi](http:\u002F\u002Fqipeng.me) (@qipeng)、[Yuhao Zhang](http:\u002F\u002Fyuhao.im) (@yuhaozhang) 和 [Yuhui Zhang](https:\u002F\u002Fcs.stanford.edu\u002F~yuhuiz\u002F) (@yuhui-zh15)，并得到了 [Jason Bolton](mailto:jebolton@stanford.edu) (@j38)、[Tim Dozat](https:\u002F\u002Fweb.stanford.edu\u002F~tdozat\u002F) (@tdozat) 和 [John Bauer](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fjohn-bauer-b3883b60\u002F) (@AngledLuffa) 的帮助。本仓库的维护目前由 [John Bauer](https:\u002F\u002Fwww.linkedin.com\u002Fin\u002Fjohn-bauer-b3883b60\u002F) 领导。\n\n如果您通过 Stanza 使用 CoreNLP 软件，请按照 [此处](https:\u002F\u002Fstanfordnlp.github.io\u002FCoreNLP\u002F#citing-stanford-corenlp-in-papers) 所述引用 CoreNLP 软件包及相关模块（“在论文中引用 Stanford 
CoreNLP\"）。CoreNLP 客户端主要由 [Arun Chaganty](http:\u002F\u002Farun.chagantys.org\u002F) 编写，[Jason Bolton](mailto:jebolton@stanford.edu) 牵头将这两个项目合并在一起。\n\n如果您使用了 CoreNLP 中的 Semgrex 或 Ssurgeon 部分，请引用 [我们关于 Semgrex 和 Ssurgeon 的 GURT 论文](https:\u002F\u002Faclanthology.org\u002F2023.tlt-1.7\u002F)：\n\n```bibtex\n@inproceedings{bauer-etal-2023-semgrex,\n    title = \"Semgrex and Ssurgeon, Searching and Manipulating Dependency Graphs\",\n    author = \"Bauer, John  and\n      Kiddon, Chlo{\\'e}  and\n      Yeh, Eric  and\n      Shan, Alex  and\n      D. Manning, Christopher\",\n    booktitle = \"Proceedings of the 21st International Workshop on Treebanks and Linguistic Theories (TLT, GURT\u002FSyntaxFest 2023)\",\n    month = mar,\n    year = \"2023\",\n    address = \"Washington, D.C.\",\n    publisher = \"Association for Computational Linguistics\",\n    url = \"https:\u002F\u002Faclanthology.org\u002F2023.tlt-1.7\",\n    pages = \"67--73\",\n    abstract = \"Searching dependency graphs and manipulating them can be a time consuming and challenging task to get right. We document Semgrex, a system for searching dependency graphs, and introduce Ssurgeon, a system for manipulating the output of Semgrex. The compact language used by these systems allows for easy command line or API processing of dependencies. 
Additionally, integration with publicly released toolkits in Java and Python allows for searching text relations and attributes over natural text.\",\n}\n```\n\n## 问题与使用问答\n\n要提问、报告问题或请求功能 🤔，请使用 [GitHub Issue Tracker](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues)。在创建新问题之前，请务必搜索可能解决您问题的现有问题，或访问我们网站上的 [常见问题解答 (FAQ) 页面](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Ffaq.html)。\n\n## 贡献给 Stanza\n\n我们欢迎社区以 Bug 修复 🛠️ 和功能增强 💡 的形式向 Stanza 做出贡献！如果您想贡献，请先阅读 [我们的贡献指南](CONTRIBUTING.md)。\n\n## 安装\n\n### pip\n\nStanza 支持 Python 3.6 或更高版本。我们建议您通过 [pip](https:\u002F\u002Fpip.pypa.io\u002Fen\u002Fstable\u002Finstalling\u002F)（Python 包管理器）安装 Stanza。要安装，只需运行：\n```bash\npip install stanza\n```\n这应该也能帮助解析 Stanza 的所有依赖项，例如 [PyTorch](https:\u002F\u002Fpytorch.org\u002F) 1.3.0 或更高版本。\n\n如果您当前已安装了旧版本的 `stanza`，请使用：\n```bash\npip install stanza -U\n```\n\n### Anaconda\n\n要通过 Anaconda 安装 Stanza，请使用以下 conda 命令：\n\n```bash\nconda install -c stanfordnlp stanza\n```\n\n请注意，目前通过 Anaconda 安装 Stanza 不适用于 Python 3.10。对于 Python 3.10，请使用 pip 安装。\n\n### 从源码\n\n或者，您也可以从该 git 仓库的源码进行安装，这将使您在 Stanza 基础上进行开发具有更高的灵活性。对于此选项，请运行\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza.git\ncd stanza\npip install -e .\n```\n\n## 运行 Stanza\n\n### 开始使用神经流水线 (neural pipeline)\n\n要运行您的第一个 Stanza 流水线，只需在 Python 交互式解释器中按照以下步骤操作：\n\n```python\n>>> import stanza\n>>> stanza.download('en')       # Optional: pre-download English models (Pipeline can auto-download if needed)\n>>> nlp = stanza.Pipeline('en') # This sets up a default neural pipeline in English\n>>> doc = nlp(\"Barack Obama was born in Hawaii. 
He was elected president in 2008.\")\n>>> doc.sentences[0].print_dependencies()\n```\n\n如果您遇到 `requests.exceptions.ConnectionError` 错误，请尝试使用代理：\n\n```python\n>>> import stanza\n>>> proxies = {'http': 'http:\u002F\u002Fip:port', 'https': 'http:\u002F\u002Fip:port'}\n>>> stanza.download('en', proxies=proxies)  # Optional: pre-download English models (Pipeline can auto-download if needed)\n>>> nlp = stanza.Pipeline('en')             # This sets up a default neural pipeline in English\n>>> doc = nlp(\"Barack Obama was born in Hawaii. He was elected president in 2008.\")\n>>> doc.sentences[0].print_dependencies()\n```\n\n最后一条命令将打印输入字符串（或 [`Document`](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Fdata_objects.html#document)（即 Stanza 中的表示方式））中第一句话的单词，以及该句子在 Universal Dependencies (通用依存关系) 解析中支配该单词的索引（即其“头”），同时还包括单词之间的依赖关系。输出结果应如下所示：\n\n```\n('Barack', '4', 'nsubj:pass')\n('Obama', '1', 'flat')\n('was', '4', 'aux:pass')\n('born', '0', 'root')\n('in', '6', 'case')\n('Hawaii', '4', 'obl')\n('.', '4', 'punct')\n```\n\n有关更多详细信息，请参阅 [我们的入门指南](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Finstallation_usage.html#getting-started)。\n\n### 访问 Java Stanford CoreNLP 软件\n\n除了神经流水线外，该软件包还包含一个官方封装器，用于通过 Python 代码访问 Java Stanford CoreNLP 软件。\n\n需要进行一些初始设置步骤。\n\n* 下载 [Stanford CoreNLP](https:\u002F\u002Fstanfordnlp.github.io\u002FCoreNLP\u002F) 和您希望使用的语言模型\n* 将模型 jar 文件放入分发文件夹\n* 通过设置 `CORENLP_HOME` 环境变量来告知 Python 代码 Stanford CoreNLP 的位置（例如，在 *nix 系统中）：`export CORENLP_HOME=\u002Fpath\u002Fto\u002Fstanford-corenlp-4.5.3`\n\n我们在文档中提供了 [全面的示例](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Fcorenlp_client.html)，展示了如何通过 Stanza 使用 CoreNLP 并从中提取各种标注信息。\n\n### 在线 Colab 笔记本\n\n为了帮助您快速上手，我们还在 `demo` 文件夹中提供了交互式的 Jupyter 笔记本。您也可以打开这些笔记本并在 [Google Colab](https:\u002F\u002Fcolab.research.google.com) 上交互式地运行它们。要查看所有可用的笔记本，请遵循以下步骤：\n\n* 前往 [Google Colab 网站](https:\u002F\u002Fcolab.research.google.com)\n* 导航至 `File` -> `Open notebook`，并在弹出菜单中选择 `GitHub`\n* 请注意，您**不需要**授予 
Colab 对您 GitHub 账户的访问权限\n* 在搜索栏中输入 `stanfordnlp\u002Fstanza`，然后按回车键\n\n### 神经流水线的预训练模型\n\n我们目前提供所有 [Universal Dependencies](https:\u002F\u002Funiversaldependencies.org\u002F) 树库 v2.8 的模型，以及一些广泛使用语言的命名实体识别 (NER) 模型。您可以在此处找到下载和使用这些模型的说明 [here](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Fmodels.html)。\n\n### 批处理以最大化流水线速度\n\n为了最大化速度性能，必须对文档批次运行流水线。逐句运行 for 循环将会非常慢。目前最好的方法是将文档连接在一起，每个文档之间用空行分隔（即两个换行符 `\\n\\n`）。分词器会将空行识别为句子断点。我们正在积极改进多文档处理功能。\n\n## 训练您自己的神经流水线\n\n本库中的所有神经模块都可以使用您自己的数据进行训练。分词器、多词单元 (MWT) 扩展器、词性 (POS)\u002F形态特征标注器、词形还原器 (lemmatizer) 和依存解析器 (dependency parser) 需要 [CoNLL-U](https:\u002F\u002Funiversaldependencies.org\u002Fformat.html) 格式的数据，而 NER 模型则需要 BIOES 格式。目前，我们不支持通过 `Pipeline` 接口进行模型训练。因此，要训练您自己的模型，您需要克隆此 git 仓库并从源代码运行训练。\n\n有关如何训练和评估您自己模型的详细逐步指导，请访问我们的 [训练文档](https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Ftraining.html)。\n\n## 许可证\n\nStanza 根据 Apache License, Version 2.0 发布。有关更多详细信息，请参阅 [LICENSE](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fblob\u002Fmaster\u002FLICENSE) 文件。","# Stanza 快速上手指南\n\n**Stanza** 是斯坦福 NLP 小组官方推出的 Python 自然语言处理（NLP）库，支持 60 多种人类语言，提供神经管道（Neural Pipeline）及 Java Stanford CoreNLP 的 Python 接口。\n\n## 环境准备\n\n- **Python 版本**：建议 Python 3.6 或更高版本。\n- **依赖管理**：安装时会自动解析并安装 PyTorch 等核心依赖（PyTorch 1.3.0+）。\n- **网络环境**：首次运行需下载模型文件，国内用户建议使用镜像源加速。\n\n## 安装步骤\n\n### 方式一：通过 pip 安装（推荐）\n\n```bash\npip install stanza\n```\n\n> **💡 国内加速提示**：如遇下载慢问题，可指定清华或阿里云镜像：\n> ```bash\n> pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple stanza\n> ```\n\n### 方式二：通过 Anaconda 安装\n\n```bash\nconda install -c stanfordnlp stanza\n```\n\n> **⚠️ 注意**：目前 Anaconda 渠道暂不支持 Python 3.10，若使用 3.10 版本请改用 pip 安装。\n\n### 方式三：从源码安装\n\n如需二次开发，可从 GitHub 克隆源码：\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza.git\ncd stanza\npip install -e .\n```\n\n## 基本使用\n\n### 1. 
初始化与下载模型\n\n在 Python 交互环境或脚本中导入库。首次运行时，Pipeline 会自动尝试下载所需语言模型，也可手动预下载。\n\n```python\n>>> import stanza\n>>> stanza.download('en')       # 可选：预下载英文模型\n>>> nlp = stanza.Pipeline('en') # 创建默认神经管道\n```\n\n### 2. 处理文本\n\n将文本传入 Pipeline 进行处理，结果存储在 `Document` 对象中。\n\n```python\n>>> doc = nlp(\"Barack Obama was born in Hawaii. He was elected president in 2008.\")\n>>> doc.sentences[0].print_dependencies()\n```\n\n输出示例：\n```\n('Barack', '4', 'nsubj:pass')\n('Obama', '1', 'flat')\n('was', '4', 'aux:pass')\n...\n```\n\n### 3. 网络代理配置\n\n如果在下载模型时遇到 `ConnectionError`，可设置代理：\n\n```python\n>>> proxies = {'http': 'http:\u002F\u002Fip:port', 'https': 'http:\u002F\u002Fip:port'}\n>>> stanza.download('en', proxies=proxies)\n>>> nlp = stanza.Pipeline('en')\n```\n\n### 性能提示\n\n为了最大化处理速度，建议对文档进行**批处理**（Batching），避免逐句循环。多个文档间用空行（`\\n\\n`）分隔即可。","某跨国医疗科技公司正在构建自动化病历分析系统，需从多国患者的电子病历中提取疾病名称、药物及症状等关键实体。\n\n### 没有 stanza 时\n- 需要为英语、中文、西班牙语等不同语言单独集成各自的开源库，代码维护极其繁琐\n- 通用命名实体识别模型无法准确理解专业医学术语，导致大量关键诊断信息被误判或遗漏\n- 若使用 Java CoreNLP 则需配置复杂的本地环境，在 Python 项目中调用存在兼容性风险\n- 各语言的分词与句法分析标准不一，难以统一后续的数据清洗与结构化存储逻辑\n\n### 使用 stanza 后\n- 通过单一接口即可支持 60 多种语言，无需切换底层库即可实现统一的文本预处理流程\n- 直接调用内置的生物医药模型包，显著提升了对专业术语和临床实体的识别准确率与召回率\n- 纯 Python 实现简化了安装与部署，直接加载预训练模型即可快速运行，避免了环境依赖问题\n- 输出格式标准化且包含丰富的句法树信息，便于将多语言解析结果无缝接入下游数据分析管道\n\nstanza 通过提供统一的多语言神经 NLP 流水线，大幅降低了跨语言医疗文本处理的开发门槛与实施成本。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fstanfordnlp_stanza_0179ecac.png","stanfordnlp","Stanford 
NLP","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fstanfordnlp_d4449e42.png","",null,"nlp.stanford.edu","https:\u002F\u002Fgithub.com\u002Fstanfordnlp",[83,87,91,95,99],{"name":84,"color":85,"percentage":86},"Python","#3572A5",98.3,{"name":88,"color":89,"percentage":90},"JavaScript","#f1e05a",1.4,{"name":92,"color":93,"percentage":94},"HTML","#e34c26",0.2,{"name":96,"color":97,"percentage":98},"Shell","#89e051",0.1,{"name":100,"color":101,"percentage":102},"CSS","#663399",0,7765,941,"2026-04-04T19:59:41","NOASSERTION",1,"未说明",{"notes":110,"python":111,"dependencies":112},"首次运行需自动下载语言模型文件；支持 Java CoreNLP 集成（需配置环境变量）；建议批量处理文档以提升性能；训练模型需克隆源码并准备特定格式数据；Conda 安装暂不支持 Python 3.10","3.6+",[113,114],"torch>=1.3.0","requests",[13,26],[117,118,119,120,121,122,123,124,125,126],"python","nlp","natural-language-processing","machine-learning","deep-learning","artificial-intelligence","pytorch","universal-dependencies","named-entity-recognition","corenlp",7,"2026-03-27T02:49:30.150509","2026-04-06T07:12:40.372919",[131,136,140,145,149,154],{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},1455,"如何在 Stanza 中使用流式模式进行批量处理？","可以使用 `pipe` 方法传入一个生成器（iterator），而不是直接传递文档列表。例如：`for doc in stanzapipeline.pipe(docsource, otherargs=couldbeset):`。Pipeline 内部会根据配置自动累积或分割输入文本以优化网络运行，然后拆分结果生成单个文档。这比手动拼接文档更高效且能避免分隔符问题。","https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F550",{"id":137,"question_zh":138,"answer_zh":139,"source_url":135},1456,"如何确定批量处理的输入大小和配置参数？","建议由 Pipeline 自动管理文档的合并与拆分，而不是手动决定。关于 batch size、max sequence length 等参数如何影响性能及文档大小的关系，官方建议在文档中查找详细说明，但最佳实践是让流水线根据配置自行调整，避免手动设置不当导致的性能问题。",{"id":141,"question_zh":142,"answer_zh":143,"source_url":144},1457,"为什么 CoreNLP Client 会报错 `google.protobuf.message.DecodeError`？","这是由于 Java 端在处理超过 80 层深度的树结构时，尝试将其扁平化超出了当前 wire format 
的限制。这类深层树结构通常是无用的，但在特定文本下可能触发此限制导致解析失败。","https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F154",{"id":146,"question_zh":147,"answer_zh":148,"source_url":144},1458,"遇到 `DecodeError` 导致程序崩溃该如何解决？","开发者已在 `dev` 分支修复了此问题。更新后的客户端代码会在发生解码错误时捕获异常，抛出用户警告并输出空文档对象，从而防止整个程序崩溃。请确保使用包含修复的最新版本。",{"id":150,"question_zh":151,"answer_zh":152,"source_url":153},1459,"如何向 Stanza 添加新的语言模型？","目前项目的重点是确保数据集符合 UniversalDependencies (UD) 标准。如果您想贡献新语言，建议先关注 UD 计划，并联系相关团队成员讨论发布可能性。","https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1365",{"id":155,"question_zh":156,"answer_zh":157,"source_url":153},1460,"如果数据集不在 UniversalDependencies 中，有哪些发表渠道？","除了 UD，还可以考虑其他发表平台，例如 SyntaxFest 2025 或 NAACL 的 LM4UC workshop。这些是潜在的数据集发布和展示场所。",[159,164,169,174,179,184,189,194,199,204,209,214,219,224,229,234,239,244,249,254],{"id":160,"version":161,"summary_zh":162,"released_at":163},100967,"v1.11.1","## New annotator!\r\n\r\n- Add a connection to the Morpheme Segmentation processor: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F450ca74a0641850d47437fbf6ec8567a11f7f2f0  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1527   https:\u002F\u002Fgithub.com\u002FTheWelcomer\u002FMorphSeg  Thank you @TheWelcomer !\r\n\r\n\r\n## Interface improvements\r\n\r\n- Use `platformdirs` to put the downloaded models in the system cache directory by default.  Thank you @McSinyx !  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1541  Note that this means if you have not set a default path, your existing `~\u002Fstanza_resources` will be obsolete and you will now have models (with a version number) in `.cache\u002Fstanza`\r\n\r\n## New \u002F Updated Models\r\n\r\n- Add Abkhaz models from the fasttextwiki word vectors and the abnc UD dataset.  
This involved making the tagger & depparse train finetuned word vectors with a lower cutoff for small pretrains, as the fasttextwiki vectors were quite small for Abkhaz.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F485  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F49f97a4a39451ca75b58ecb581ff676e8160fc76 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F76f333522f597677217eba626c36f4cce0f4572d   Can add more test-only UD datasets on request, but the results seem low enough that we aren't doing it by default\r\n\r\n- ANG NER model downloaded from here: https:\u002F\u002Fgithub.com\u002Fdmetola\u002FOld_English-OEDT\u002Ftree\u002Fmain   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F714072d745a3ffec097e936ccaba4d9d70985ade \r\n\r\n## Bugfixes\r\n\r\n- Patch the depparse to not produce `\u003CPAD>` as a relation type.   A more principled fix would be to rebuild all the models, but this will work for now https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F284e9b4397eb50f12d16c2b055f297fbb4a77f29\r\n\r\n## Model Improvements\r\n\r\n- When training smaller POS datasets, finetune more words if the embedding is small.  Makes it more likely that a small embedding is useful, since we can cover everything in the training set.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F76f333522f597677217eba626c36f4cce0f4572d\r\n\r\n- Process `!!!` and `???` the same as `!` and `?` in the pos and depparse, addressing the downstream errors caused by unknown strings of punct.  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F5fd1d50414939b3dd3dc8bf315fc44ca0b6a4fe6  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1532\r\n\r\n- Train the tokenizer to recognize non-ascii variants of `!` and `?` with augmentation, addressing the tokenizer errors found for punct that doesn't exist in the training data  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1532  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fd69c33f749e3ffcde6515da88d63450fb622b8bc\r\n\r\n- Modify the depparse model to scale scores so that only one root is ever chosen.  See https:\u002F\u002Faclanthology.org\u002F2020.emnlp-main.390.pdf  https:\u002F\u002Faclanthology.org\u002F2021.emnlp-main.823\u002F   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F88c0cf653bf92e06feae85273d2c853389ea6e4d   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc50fa5c3610e23f9e364d69e065e1af12ae6fd47  \r\n\r\n## Other improvements\r\n\r\n- Fix random typos https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F4af05f68c6674e619fe03b0651101c368d510014  Thanks @thecaptain789 \r\n\r\n- Add an early termination for coref option, as requested in https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1531  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F1d30e90993a5e443e4cc6c9537573e5324653a66\r\n\r\n- Update the semgrex client to allow for results to come back in non-sentence order, allowing for future addition of `sort` operators  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F46eb340ac0434178cfd287c5f1ca5528d8a28f9e\r\n\r\n- Fix a minor memory waste https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Ff8d62fe7fdf63a65640c14ec04cc173ded7d1d0b\r\n\r\n- Use the UD udtools package instead of having our own copy of the scoring script  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fb20cd3a0069e79f8f31757c8fbb2507050707402\r\n\r\n- As requested in https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1523, allow for speaker information passing to the coref annotator: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc4201b9d1a6a1ba70dbead77524ee6d16b99d02c  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fdc50998ef165a3d7800b004157c2cabedd8e63ef   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F1df3f8bc441b4b338c1b81c0a501ce8ad29cd2fd\r\n\r\n- Add a convenience method to retag a conllu file in a Pipeline.  call `pipe.process_conllu(text)` where `text` is a conllu file.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F74fbdc4ec9f81592fbf0fedeb4ad6b07520164cd\r\n","2026-02-26T06:57:11",{"id":165,"version":166,"summary_zh":167,"released_at":168},100968,"v1.11.0","### Training upgrades\r\n\r\n- Should now be possible to train all annotators on Windows: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza-train\u002Fissues\u002F20 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1439  The issue was twofold: a perl script shell call (which could actually be installed, but was annoying for non-perl users) and an overreliance on temp files, which can be opened twice in Unix but not in Windows. Fixed in https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F2677e7789394225c7da09d857a6de15bcb62180b https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fd5c7b7ffee4089f43bc712c3910ae573ed8e686e  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1514\r\n\r\n### Model upgrades\r\n\r\n- Tokenizer can support the pretrained charlm now.  This significantly improves the MWT performance on Hebrew, for example. 
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1511\r\n\r\n- Building tokenizers with pretrained charlm exposed a possible issue with the tokenizer including spaces when tokenizing when an MWT is split across two words.  The effect occurred in Hebrew, but an English example would be `wo n't` tokenized as a single token with embedded space.  Augmenting the training to enforce word splits across those spaces fixed the issue.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1511\u002Fcommits\u002F52cea783431c85af68227c0f00dc4022a36ea7f4\r\n\r\n- use PackedSequence for the tokenizer - is slower, but results are stable when using inputs of different lengths: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F4433e83542a34e9ef121d17db84695d9d359d5f1  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1472\r\n\r\n- If a Tokenizer training set consistently has spaces between the ends of words and punctuations, the resulting trained model may not properly recognize the same text with periods at the end of the word.  For example, `this is a test .` vs `this is a test.`  Reported in https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1504  Fixed for VI by https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F6878d8e6405441ee1d14de3d96f8a786ccc599ed\r\n\r\n- Coref now includes a zeros predictor - this predicts when a mention for certain datasets (such as Spanish) is a pro-drop mention.  This behavior occurs by adding an empty node to the sentence.  It can be disabled with the `coref_use_zeros=False` flag to the Pipeline.  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1502\r\n\r\n### Model improvements\r\n\r\n- Sindhi pipeline based on the ISRA UD dataset, published at SyntaxFest 2025, with annotation support from MLtwist: https:\u002F\u002Faclanthology.org\u002F2025.udw-1.11\u002F\r\n\r\n- Tamil coreference model from KBC\r\n\r\n- update English lemmatizer with more verbs and ADJ from Prof. Lapalme\r\n\r\n- also, French lemmatizer changes with corrections from Prof. Lapalme\r\n\r\n- create a German lemmatizer using GSD data and a set of ADJ from Wiktionary\r\n\r\n- add GRC models mixed with a copy of the data with the diacritics stripped.  because those work worse on GRC with diacritics, the originals are still the default: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F5beca58c054c404b8ab6552fbcf61dee5b33a7e9\r\n\r\n- add a Thai TUD dataset from https:\u002F\u002Fgithub.com\u002Fnlp-chula\u002FTUD (not yet included in UD): https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fbca078cda2b10e74e04abde2667ddf6b896d7efb\r\n\r\n- NER model for ANG: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F68a56aa51e013631c2a5cfbc044b41d23fe63780  https:\u002F\u002Fgithub.com\u002Fdmetola\u002FOld_English-OEDT\u002Ftree\u002Fmain\r\n\r\n- NER models for Hindi, Telugu, and Urdu: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1469, model built from https:\u002F\u002Fgithub.com\u002Fltrc\u002FIL-NER, added in https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fa4902dfbf14164cda6ae0d82ff393264cc3a347d\r\n\r\n### Other interface improvements\r\n\r\n- Fix conparser SyntaxWarning: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1513 thanks to @orenl\r\n\r\n- improve efficiency of reading conllu documents: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Ff15f0bc56ccea285cd5278ff75207e63ca9178b7\r\n\r\n- 
sort CoNLLU features when outputting a doc, as is standard: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Faa20fbb27f8e402595723e3609f2d9ae0dd452b1\r\n\r\n- semgrex interface improvements: [search all files](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F24356634e0579af5eca826d7a9bc2930b14d64c1), [only output failed matches](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fdee0efe80058d5c4447119ede9b16c8235d740ba), [process all documents at once](https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F3653f6ae81759a0a519f98ced38fb77f6fc2001e)\r\n\r\n- turn coref `max_train_len` into a parameter: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F1f98d8f55f1b537a688141f181c055506d3eeb1b https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1465\r\n\r\n- allow for combined depparse models with multiple training files in a zip file (easier to mix training data): https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fbe94ac6f1af6c210cd82e841abeaad6ff31b0fb1\r\n\r\n- lemmatizer can skip blank lemmas (useful when training using partially complete lemma data): https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F7c34714d8bfa9c4cbb92b50ea4fa8fc6257f5451\r\n\r\n- if using pretokenized text in the","2025-10-05T06:45:12",{"id":170,"version":171,"summary_zh":172,"released_at":173},100969,"v1.10.1","In this release, we rebuild all of the models with UD 2.15, allowing for new languages such as Georgian, Komi Zyrian, Low Saxon, and Ottoman Turkish.  We also add an Albanian model composed of the two available UD treebanks and an Old English model based on a prototype dataset not yet published in UD.\r\n\r\nOther notable changes:\r\n\r\n- Include a contextual lemmatizer in English for `'s` -> `be` or `have` in the `default_accurate` package.  Also built is a HI model.  
Others potentially to follow.  Now with fewer bugs at startup.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1422\r\n- Upgrade the FR NER model to a gold edited version of WikiNER: https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdanrun\u002FWikiNER-fr-gold https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fad1f938276ef81ac9a602d7f1f21f50fd67e5d24\r\n- Pytorch compatibility: set `weights_only=True` when loading models.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1430 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1429\r\n- augment MWT tokenization to accommodate unexpected `'` characters, including `\"` used in `\"s` - https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1437 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1436\r\n- when training the lemmatizer, take advantage of `CorrectForm` annotations in the UD treebanks  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fdbdf429aff4175fec33856501e6899e96b390e86\r\n- add hand-lemmatized French verbs and English words to the \"combined\" lemmatizers, thanks to Prof. Lapalme: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F99f7038634101ea7b92140696c8383a333af1cbc\r\n- add VLSP 2023 constituency dataset: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F1159d0db8ea1d20c6cf9fb37f8fa8676e0f60f49\r\n\r\nBugfixes:\r\n\r\n- `raise_for_status` earlier when failing to download something, so that the proper error gets displayed. 
\r\nThank you @pattersam  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1432\r\n- Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F53081c28ba3128fc89ad36919762a54f6cb88f77\r\n- reset the start\u002Fend character annotations on tokens which are predicted to be MWT by the tokenizer, but not processed as such by the MWT processor: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F1a36efb53135e53dd40ad550bc3a659c81b15980 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1436\r\n- similar to the start\u002Fend char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F215c69e53bf9f11e174b82bb064767749f7dd403\r\n- missing text for a Document does not cause the NER model to crash: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F07326289ce0efef1ba17a0632c011652f884363c https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1428\r\n- tokenize URLs with unexpected TLDs into single tokens rather than splitting them up: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Ff59ccd86b9d146737dd5c0325ac31e4da814ddfa https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1423\r\n","2024-12-29T06:54:23",{"id":175,"version":176,"summary_zh":177,"released_at":178},100970,"v1.10.0","In this release, we rebuild all of the models with UD 2.15, allowing for new languages such as Georgian, Komi Zyrian, Low Saxon, and Ottoman Turkish.  
We also add an Albanian model composed of the two available UD treebanks and an Old English model based on a prototype dataset not yet published in UD.\r\n\r\nOther notable changes:\r\n\r\n- Include a contextual lemmatizer in English for `'s` -> `be` or `have` in the `default_accurate` package.  Also built is a HI model.  Others potentially to follow.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1422\r\n- Upgrade the FR NER model to a gold edited version of WikiNER: https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fdanrun\u002FWikiNER-fr-gold https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fad1f938276ef81ac9a602d7f1f21f50fd67e5d24\r\n- Pytorch compatibility: set `weights_only=True` when loading models.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1430 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1429\r\n- augment MWT tokenization to accommodate unexpected `'` characters, including `\"` used in `\"s` - https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1437 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1436\r\n- when training the lemmatizer, take advantage of `CorrectForm` annotations in the UD treebanks  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fdbdf429aff4175fec33856501e6899e96b390e86\r\n- add hand-lemmatized French verbs and English words to the \"combined\" lemmatizers, thanks to Prof. Lapalme: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F99f7038634101ea7b92140696c8383a333af1cbc\r\n- add VLSP 2023 constituency dataset: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F1159d0db8ea1d20c6cf9fb37f8fa8676e0f60f49\r\n\r\nBugfixes:\r\n\r\n- `raise_for_status` earlier when failing to download something, so that the proper error gets displayed. 
\r\nThank you @pattersam  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1432\r\n- Fix the usage of transformers where an unexpected character at the end of a sentence was not properly handled: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F53081c28ba3128fc89ad36919762a54f6cb88f77\r\n- reset the start\u002Fend character annotations on tokens which are predicted to be MWT by the tokenizer, but not processed as such by the MWT processor: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F1a36efb53135e53dd40ad550bc3a659c81b15980 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1436\r\n- similar to the start\u002Fend char issue, fix a situation where a token's text could disappear if the MWT processor didn't split a word: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F215c69e53bf9f11e174b82bb064767749f7dd403\r\n- missing text for a Document does not cause the NER model to crash: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F07326289ce0efef1ba17a0632c011652f884363c https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1428\r\n- tokenize URLs with unexpected TLDs into single tokens rather than splitting them up: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Ff59ccd86b9d146737dd5c0325ac31e4da814ddfa https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1423\r\n","2024-12-23T04:27:50",{"id":180,"version":181,"summary_zh":182,"released_at":183},100971,"v1.9.2","## multilingual coref!\r\n\r\n- Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1406\r\n\r\n## new features\r\n\r\n- streamlit visualizer for semgrex\u002Fssurgeon  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1396\r\n- updates to the constituency parser ensemble https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1387\r\n- accuracy improvements to the IN_ORDER oracle https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1391\r\n- Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words.  Currently for EN and HE  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1417  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1419\r\n- `download_method=None` now turns off HF downloads as well, for use in instances with no access to internet  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1408  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1399\r\n\r\n## new models\r\n\r\n- Spanish combined models  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1395\r\n- Add IACLT knesset to the HE combined models\r\n- NER based on IACLT\r\n- XCL (Classical Armenian) models with word vectors from Caval\r\n\r\n## bugfixes\r\n\r\n- update tqdm usage to remove some duplicate code: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1413 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F3de69cac904cf023eba4463380b63bc3039be7fd\r\n- long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1410\r\n- Occasionally train the tokenizer with the sentence final punctuation of a batch removed.  This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation.  
This was also related to the Spanish tokenization issue  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F56350a0eebf4e2a7b3c54151f83b34db881553fc\r\n- actually include the visualization: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1421  thank you @bollwyvl \r\n","2024-09-12T23:17:21",{"id":185,"version":186,"summary_zh":187,"released_at":188},100972,"v1.9.1","## multilingual coref!\r\n\r\n- Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1406\r\n\r\n## new features\r\n\r\n- streamlit visualizer for semgrex\u002Fssurgeon  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1396\r\n- updates to the constituency parser ensemble https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1387\r\n- accuracy improvements to the IN_ORDER oracle https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1391\r\n- Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words.  
Currently for EN and HE  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1417  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1419\r\n- `download_method=None` now turns off HF downloads as well, for use in instances with no access to internet  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1408  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1399\r\n\r\n## new models\r\n\r\n- Spanish combined models  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1395\r\n- Add IACLT knesset to the HE combined models\r\n- NER based on IACLT\r\n- XCL (Classical Armenian) models with word vectors from Caval\r\n\r\n## bugfixes\r\n\r\n- update tqdm usage to remove some duplicate code: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1413 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F3de69cac904cf023eba4463380b63bc3039be7fd\r\n- long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1410\r\n- Occasionally train the tokenizer with the sentence final punctuation of a batch removed.  This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation.  
This was also related to the Spanish tokenization issue  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F56350a0eebf4e2a7b3c54151f83b34db881553fc\r\n- actually include the visualization: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1421  thank you @bollwyvl \r\n","2024-09-12T19:40:52",{"id":190,"version":191,"summary_zh":192,"released_at":193},100973,"v1.9.0","## multilingual coref!\r\n\r\n- Added models which cover several different languages: one for combined Germanic and Romance languages, one for the Slavic languages available in UDCoref  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1406\r\n\r\n## new features\r\n\r\n- streamlit visualizer for semgrex\u002Fssurgeon  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1396\r\n- updates to the constituency parser ensemble https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1387\r\n- accuracy improvements to the IN_ORDER oracle https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1391\r\n- Split-only MWT model - cannot possibly hallucinate, as sometimes happens for OOV words.  
Currently for EN and HE  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1417  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1419\r\n- `download_method=None` now turns off HF downloads as well, for use in instances with no access to internet  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1408  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1399\r\n\r\n## new models\r\n\r\n- Spanish combined models  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1395\r\n- Add IACLT knesset to the HE combined models\r\n- NER based on IACLT\r\n- XCL (Classical Armenian) models with word vectors from Caval\r\n\r\n## bugfixes\r\n\r\n- update tqdm usage to remove some duplicate code: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1413 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F3de69cac904cf023eba4463380b63bc3039be7fd\r\n- long list of incorrectly tokenized Spanish words added directly to the combined Spanish training data to improve their tokenization: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1410\r\n- Occasionally train the tokenizer with the sentence final punctuation of a batch removed.  This helps the tokenizer avoid learning to tokenize the last character regardless of whether or not it is punctuation.  This was also related to the Spanish tokenization issue  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F56350a0eebf4e2a7b3c54151f83b34db881553fc\r\n","2024-09-12T07:23:29",{"id":195,"version":196,"summary_zh":197,"released_at":198},100974,"v1.8.2","Add an Old English pipeline, improve the handling of MWT for cases that should be easy, and improve the memory management of our usage of transformers with adapters.\r\n\r\n### Old English\r\n\r\n- Add Old English (ANG) annotation!  
Thank you to @dmetola    https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1365\r\n\r\n## MWT improvements\r\n\r\n- Fix words ending with `-nna` split into MWT https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fhandparsed-treebank\u002Fcommit\u002F2c48d4093daddc790bf89d7b35c47ee4d7d272d1  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1366\r\n\r\n- Fix MWT for English splitting into weird words by enforcing that the pieces add up to the whole (which is always the case in the English treebanks)  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1371  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1378\r\n\r\n- Mark `start_char` and `end_char` on an MWT if it is composed of exactly its subwords  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F23840891c37d54a5cf491ea58b0702987dd4a6d7   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1361\r\n\r\n## Peft memory management\r\n\r\n- Previous versions were loading multiple copies of the transformer in order to use adapters.  To save memory, we can use Peft's capacity to attach multiple adapters to the same transformer instead as long as they have different names.  This allows for loading just one copy of the entire transformer when using a Pipeline with several finetuned models.  
https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft\u002Fissues\u002F1523   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1381  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1384\r\n\r\n## Other bugfixes and minor upgrades\r\n\r\n- Fix crash when trying to load previously unknown language  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1360 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F381736f8fb9b60a929002cc750bd0df3d7dad03a\r\n\r\n- Check that sys.stderr has isatty before manipulating it with tqdm, in case sys.stderr was monkeypatched:  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fd180ae02b278dd09dff53bc910e7aa43656e944d  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1367\r\n\r\n- Try to avoid OOM in the POS in the Pipeline by reducing its max batch length  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F42718135e2ab4b145bbb5861d55bb9424ca3549f\r\n\r\n- Fix usage of gradient checkpointing & a weird interaction with Peft (thanks to @Jemoka)  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F597d48f1ead89fa9a0cca86cf9f0b530ed249792\r\n\r\n## Other upgrades\r\n\r\n- Add \\* to the list of functional tags to drop in the constituency parser, helping Icelandic annotation  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F57bfa8bbd8d3d42d4ee29d4a406640b126ce0f46  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1356#issuecomment-1981216912\r\n\r\n- Can train depparse without using any of the POS columns, especially useful if training a cross-lingual parser: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F4048caed1b89030082d23b8f71d23bae6c9c54f1  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F15b136bb30dda272d318a61a5f602e7fc81e7a31\r\n\r\n- Add 
a constituency model for German  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F7a4f48c738f0db8923aa5da88d0a9743eaee4c6a  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F86ddaab31c73a7d0a389d0557f3696c29d441657  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1368\r\n\r\n","2024-04-20T18:58:25",{"id":200,"version":201,"summary_zh":202,"released_at":203},100975,"v1.8.1","## Integrating PEFT into several different annotators\r\n\r\nWe integrate [PEFT](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) into our training pipeline for several different models.  This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the `default_accurate` model.  \r\n\r\nThe biggest gains observed are with the constituency parser and the sentiment classifier.\r\n\r\nPreviously, the `default_accurate` package used transformers where the head was trained but the transformer itself was not finetuned.\r\n\r\n### Model improvements\r\n\r\n- POS trained with split optimizer for transformer & non-transformer - unfortunately, did not find settings which consistently improved results https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1320\r\n- Sentiment trained with peft on the transformer: noticeably improves results for each model.  SST scores go from 68 F1 w\u002F charlm, to 70 F1 w\u002F transformer, to 74-75 F1 with finetuned or Peft finetuned transformer.  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1335\r\n- NER also trained with peft: unfortunately, no consistent improvements to scores  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1336\r\n- depparse includes peft: no consistent improvements yet https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1337  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1344\r\n- Dynamic oracle for top-down constituent parser scheme.  Noticeable improvement in the scores for the topdown parser  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1341\r\n- Constituency parser uses peft: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituencies.  Example improvement, 87.01 to 88.11 on ID_ICON dataset.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1347\r\n- Scripts to build a silver dataset for the constituency parser with filtering of sentences based on model agreement among the sub-models for the ensembles used.  Preliminary work indicates an improvement in the benefits of the silver trees, with more work needed to find the optimal parameters used to build the silver dataset.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1348\r\n- Lemmatizer ignores goeswith words when training: eliminates words which are a single word, labeled with a single lemma, but split into two words in the UD training data.  Typical example would be split email addresses in the EWT training set.  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1346  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1345\r\n\r\n### Features\r\n\r\n- Include SpacesAfter annotations on words in the CoNLL output of documents: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1315 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1322\r\n- Lemmatizer operates in caseless mode if all of its training data was caseless.  Most relevant to the UD Latin treebanks.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1331  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1330\r\n- wandb support for coref   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1338\r\n- Coref annotator breaks length ties using POS if available https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1326  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc4c3de5803f27843a5050e10ccae71b3fd9c45e9\r\n\r\n### Bugfixes\r\n\r\n- Using a proxy with `download_resources_json` was broken: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1318  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1317  Thank you @ider-zh\r\n- Fix deprecation warnings for escape sequences: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1321  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1293  Thank you @sterliakov\r\n- Coref training rounding error  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1342\r\n- Top-down constituency models were broken for datasets which did not use ROOT as the top level bracket... 
this was only DA_Arboretum in practice  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1354\r\n- V1 of chopping up some longer texts into shorter texts for the transformers to get around length limits.  No idea if this actually produces reasonable results for words after the token limit.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1350  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1294\r\n- Coref prediction off-by-one error for short sentences, was falsely throwing an exception at sentence breaks: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1333 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1339 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Ff1fbaaad983e58dc3fcf318200d685663fb90737\r\n- Clarify error when a language is only partially handled:  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fda01644b4ba5ba477c36e5d2736012b81bcd00d4  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1310\r\n\r\n### Additional 1.8.1 Bugfixes\r\n\r\n- Older POS models not loaded correctly... need to use `.get()`  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F13ee3d5cbc2c9174c3e0c67ca75b580e4fe683b1  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1357\r\n- Debug logging for the Constituency retag pipeline to better support someone working on Icelandic  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fsta","2024-03-01T06:47:02",{"id":205,"version":206,"summary_zh":207,"released_at":208},100976,"v1.8.0","## Integrating PEFT into several different annotators\r\n\r\nWe integrate [PEFT](https:\u002F\u002Fgithub.com\u002Fhuggingface\u002Fpeft) into our training pipeline for several different models.  
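As a rough sketch of why parameter-efficient finetuning shrinks what has to be shipped: a LoRA-style update freezes the pretrained weight W and trains only a low-rank delta B @ A. This is a toy illustration in plain Python, not Stanza's actual training code; the dimensions and scaling below are illustrative only.

```python
def matmul(a, b):
    # minimal matrix multiply so the sketch needs no dependencies
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_effective_weight(w, lora_a, lora_b, alpha, r):
    # LoRA: effective weight is W + (alpha / r) * (B @ A); the frozen W
    # stays in the base package, only the small A and B are finetuned
    scale = alpha / r
    delta = matmul(lora_b, lora_a)
    return [[wij + scale * dij for wij, dij in zip(wrow, drow)]
            for wrow, drow in zip(w, delta)]

# rank-1 update of a 4x4 weight: 8 trainable numbers instead of 16
d, r, alpha = 4, 1, 1
w = [[0.0] * d for _ in range(d)]
lora_a = [[1.0] * d]                  # r x d
lora_b = [[1.0] for _ in range(d)]    # d x r
w_eff = lora_effective_weight(w, lora_a, lora_b, alpha, r)
```

With r much smaller than the hidden size, the saved adapter is a tiny fraction of a fully finetuned transformer, which is what makes distributing these models as `default_accurate` practical.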
This greatly reduces the size of models with finetuned transformers, letting us make the finetuned versions of those models the `default_accurate` model.  \r\n\r\nThe biggest gains observed are with the constituency parser and the sentiment classifier.\r\n\r\nPreviously, the `default_accurate` package used transformers where the head was trained but the transformer itself was not finetuned.\r\n\r\n### Model improvements\r\n\r\n- POS trained with split optimizer for transformer & non-transformer - unfortunately, did not find settings which consistently improved results https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1320\r\n- Sentiment trained with peft on the transformer: noticeably improves results for each model.  SST scores go from 68 F1 w\u002F charlm, to 70 F1 w\u002F transformer, to 74-75 F1 with finetuned or Peft finetuned transformer.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1335\r\n- NER also trained with peft: unfortunately, no consistent improvements to scores  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1336\r\n- depparse includes peft: no consistent improvements yet https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1337  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1344\r\n- Dynamic oracle for top-down constituent parser scheme.  Noticeable improvement in the scores for the topdown parser  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1341\r\n- Constituency parser uses peft: this produces significant improvements, close to the full benefit of finetuning the entire transformer when training constituencies.  Example improvement, 87.01 to 88.11 on ID_ICON dataset.  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1347\r\n- Scripts to build a silver dataset for the constituency parser with filtering of sentences based on model agreement among the sub-models for the ensembles used.  Preliminary work indicates an improvement in the benefits of the silver trees, with more work needed to find the optimal parameters used to build the silver dataset.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1348\r\n- Lemmatizer ignores goeswith words when training: eliminates words which are a single word, labeled with a single lemma, but split into two words in the UD training data.  Typical example would be split email addresses in the EWT training set.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1346  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1345\r\n\r\n### Features\r\n\r\n- Include SpacesAfter annotations on words in the CoNLL output of documents: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1315 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1322\r\n- Lemmatizer operates in caseless mode if all of its training data was caseless.  Most relevant to the UD Latin treebanks.  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1331  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1330\r\n- wandb support for coref   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1338\r\n- Coref annotator breaks length ties using POS if available https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1326  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc4c3de5803f27843a5050e10ccae71b3fd9c45e9\r\n\r\n### Bugfixes\r\n\r\n- Using a proxy with `download_resources_json` was broken: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1318  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1317  Thank you @ider-zh\r\n- Fix deprecation warnings for escape sequences: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1321  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1293  Thank you @sterliakov\r\n- Coref training rounding error  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1342\r\n- Top-down constituency models were broken for datasets which did not use ROOT as the top level bracket... this was only DA_Arboretum in practice  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1354\r\n- V1 of chopping up some longer texts into shorter texts for the transformers to get around length limits.  No idea if this actually produces reasonable results for words after the token limit.  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1350  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1294\r\n- Coref prediction off-by-one error for short sentences, was falsely throwing an exception at sentence breaks: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1333 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1339 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Ff1fbaaad983e58dc3fcf318200d685663fb90737\r\n- Clarify error when a language is only partially handled:  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fda01644b4ba5ba477c36e5d2736012b81bcd00d4  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1310\r\n","2024-02-25T07:38:59",{"id":210,"version":211,"summary_zh":212,"released_at":213},100977,"v1.7.0","## Neural coref processor added!\r\n\r\nConjunction-Aware Word-Level Coreference Resolution\r\nhttps:\u002F\u002Farxiv.org\u002Fabs\u002F2310.06165\r\noriginal implementation: https:\u002F\u002Fgithub.com\u002FKarelDO\u002Fwl-coref\u002Ftree\u002Fmaster\r\n\r\nUpdated form of Word-Level Coreference Resolution\r\nhttps:\u002F\u002Faclanthology.org\u002F2021.emnlp-main.605\u002F\r\noriginal implementation: https:\u002F\u002Fgithub.com\u002Fvdobrovolskii\u002Fwl-coref\r\n\r\nIf you use Stanza's coref module in your work, please be sure to cite both of the above papers.\r\n\r\nSpecial thanks to [vdobrovolskii](https:\u002F\u002Fgithub.com\u002Fvdobrovolskii), who graciously agreed to allow for integration of his work into Stanza, to @KarelDO for his support of his training enhancement, and to @Jemoka for the LoRA PEFT integration, which makes the finetuning of the transformer based coref annotator much less expensive.\r\n\r\nCurrently there is one model provided, a transformer based English model trained from OntoNotes.  
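A minimal loading sketch for the new annotator. Hedged: the `coref` processor key follows this release's documentation, but reading the chains back via `doc.coref` is an assumption to check against the docs for your installed version; the English model is downloaded on first use.

```python
def coref_chains(text):
    # deferred import so this sketch can be inspected without stanza installed
    import stanza
    # 'coref' as the processor key matches this release; the doc.coref
    # attribute for reading back the chains is an assumption - check the
    # coref documentation for the version you have installed
    nlp = stanza.Pipeline('en', processors='tokenize,coref')
    doc = nlp(text)
    return doc.coref
```

Called as `coref_chains('Stanza added a coref model. It was trained on OntoNotes.')`, this builds the English pipeline and returns the predicted coreference chains.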
The provided model is currently based on Electra-Large, as that is more harmonious with the rest of our transformer architecture.  When we have LoRA integration with POS, depparse, and the other processors, we will revisit the question of which transformer is most appropriate for English.\r\n\r\nFuture work includes ZH and AR models from OntoNotes, additional language support from UD-Coref, and lower cost non-transformer models\r\n\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1309\r\n\r\n## Interface change: English MWT\r\n\r\nEnglish now has an MWT model by default.  Text such as `won't` is now marked as a single **token**, split into two **words**, `will` and `not`.  Previously it was expected to be tokenized into two pieces, but the `Sentence` object containing that text would not have a single `Token` object connecting the two pieces.  See https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Fmwt.html and https:\u002F\u002Fstanfordnlp.github.io\u002Fstanza\u002Fdata_objects.html#token for more information.\r\n\r\nCode that used to operate with `for word in sentence.words` will continue to work as before, but `for token in sentence.tokens` will now produce **one** object for MWT such as `won't`, `cannot`, `Stanza's`, etc.  \r\n\r\nPipeline creation will not change, as MWT is automatically (but not silently) added at `Pipeline` creation time if the language and package includes MWT.\r\n\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1314\u002Fcommits\u002Ff22dceb93275fc724536b03b31c08a94617880ca  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1314\u002Fcommits\u002F27983aefe191f6abd93dd49915d2515d7c3973d1\r\n\r\n## Other updates\r\n\r\n- NetworkX representation of enhanced dependencies.  
Allows for easier usage of Semgrex on enhanced dependencies - searching over enhanced dependencies requires CoreNLP >= 4.5.6   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1295 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1298\r\n- Sentence ending punct tags improved for English to avoid labeling non-punct as punct (and POS is switched to using a DataLoader) https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1000 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1303\r\n- Optional rewriting of MWT after the MWT processing step - will give the user more control over fixing common errors.  Although we still encourage posting issues on github so we can fix them for everyone!  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1302\r\n- Remove deprecated output methods such as `conll_as_string` and `doc2conll_text`.  Use `\"{:C}\".format(doc)` instead  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fe01650f9c56382495082a9a24fa0310414c46651\r\n- Mixed OntoNotes and WW NER model for English is now the default.  Future versions may include CoNLL 2003 and CoNLL++ data as well.\r\n- Sentences now have a `doc_id` field if the document they are created from has a `doc_id`.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1314\u002Fcommits\u002F8e2201f42cb99a5a3d8358ce59501c1d88f2585e\r\n- Optional processors added in cases where the user may not want the model we have run by default.  For example, conparse for Turkish (limited training data) or coref for English (the only available model is the transformer model)  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1314\u002Fcommits\u002F3d90d2b8a82048c5cea549b654e52544ed241833\r\n\r\n## Updated requirements\r\n\r\n- Support dropped for python 3.6 and 3.7.  
The `peft` module used for finetuning the transformer used in the coref processor does not support those versions.\r\n- Added `peft` as an optional dependency to transformer based installations\r\n- Added `networkx` as a dependency for reading enhanced dependencies.  Added `toml` as a dependency for reading the coref config.\r\n","2023-12-03T06:47:22",{"id":215,"version":216,"summary_zh":217,"released_at":218},100978,"v1.6.1","V1.6.1 is a patch of a bug in the Arabic POS tagger.\r\n\r\nWe also mark Python 3.11 as supported in the `setup.py` classifiers.  **This will be the last release that supports Python 3.6**\r\n\r\n## Multiple model levels\r\n\r\nThe `package` parameter for building the `Pipeline` now has three default settings:\r\n\r\n- `default`, the same as before, where POS, depparse, and NER use the charlm, but lemma does not\r\n- `default-fast`, where POS and depparse are built without the charlm, making them substantially faster on CPU.  Some languages currently have non-charlm NER as well\r\n- `default-accurate`, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language.  
Suggestions for more transformers to use are welcome\r\n\r\nFurthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into `-fast` and `-accurate` versions for each UD dataset.\r\n\r\nPR: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1287\r\n\r\naddresses https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1259 and https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1284\r\n\r\n## Multiple output heads for one NER model\r\n\r\nThe NER models now can learn multiple output layers at once.\r\n\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1289\r\n\r\nTheoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected.  The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.\r\n\r\nResults of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:\r\n\r\n```\r\noriginal ontonotes on worldwide:   88.71  69.29\r\nsimplify-separate                  88.24  75.75\r\nsimplify-connected                 88.32  75.47\r\n```\r\n\r\n\r\nWe also produced combined models for nocharlm and with Electra as the input encoding.  The new English NER models are the packages `ontonotes-combined_nocharlm`, `ontonotes-combined_charlm`, and `ontonotes-combined_electra-large`.\r\n\r\nFuture plans include using multiple NER datasets for other models as well.\r\n\r\n## Other features\r\n\r\n- Postprocessing of proposed tokenization possible with dependency injection on the Pipeline (ty @Jemoka).  
When creating a `Pipeline`, you can now provide a `callable` via the `tokenize_postprocessor` parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the `Pipeline` https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1290\r\n\r\n- Finetuning for transformers in the NER models: have not yet found helpful settings, though https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F45ef5445f44222df862ed48c1b3743dc09f3d3fd\r\n\r\n- SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1279 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F88cd0df5da94664cb04453536212812dc97339bb\r\n\r\n- charlm for PT (improves accuracy on non-transformer models): https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc10763d0218ce87f8f257114a201cc608dbd7b3a\r\n\r\n- build models  with transformers for a few additional languages: MR, AR, PT, JA https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F45b387531c67bafa9bc41ee4d37ba0948daa9742 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F0f3761ee63c57f66630a8e94ba6276900c190a74 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc55472acbd32aa0e55d923612589d6c45dc569cc https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc10763d0218ce87f8f257114a201cc608dbd7b3a\r\n\r\n\r\n## Bugfixes\r\n\r\n- V1.6.1 fixes a bug in the Arabic POS model which was an unfortunate side effect of the NER change to allow multiple tag sets at once: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fb56f442d4d179c07411a44a342c224408eb6a6a9\r\n\r\n- Scenegraph CoreNLP connection needed to be checked before sending messages: 
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FCoreNLP\u002Fissues\u002F1346#issuecomment-1713267522 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc71bf3fdac8b782a61454c090763e8885d0e3824\r\n\r\n- `run_ete.py` was not correctly processing the charlm, meaning the whole thing wouldn't actually run  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F16f29f3dcf160f0d10a47fec501ab717adf0d4d7\r\n\r\n- Chinese NER model was pointing to the wrong pretrain https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1285 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F82a02151da17630eb515792a508a967ef70a6cef","2023-10-06T05:16:12",{"id":220,"version":221,"summary_zh":222,"released_at":223},100979,"v1.6.0","## Multiple model levels\r\n\r\nThe `package` parameter for building the `Pipeline` now has three default settings:\r\n\r\n- `default`, the same as before, where POS, depparse, and NER use the charlm, but lemma does not\r\n- `default-fast`, where POS and depparse are built without the charlm, making them substantially faster on CPU.  Some languages currently have non-charlm NER as well\r\n- `default-accurate`, where the lemmatizer also uses the charlm, and other models use transformers if we have one for that language.  
Suggestions for more transformers to use are welcome\r\n\r\nFurthermore, package dictionaries are now provided for each UD dataset which encompass the default versions of models for that dataset, although we do not further break that down into `-fast` and `-accurate` versions for each UD dataset.\r\n\r\nPR: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1287\r\n\r\naddresses https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1259 and https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1284\r\n\r\n## Multiple output heads for one NER model\r\n\r\nThe NER models now can learn multiple output layers at once.\r\n\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1289\r\n\r\nTheoretically this could be used to save a bit of time on the encoder while tagging multiple classes at once, but the main use case was to crosstrain the OntoNotes model on the WorldWide English newswire data we collected.  The effect is that the model learns to incorporate some named entities from outside the standard OntoNotes vocabulary into the main 18 class tagset, even though the WorldWide training data is only 8 classes.\r\n\r\nResults of running the OntoNotes model, with charlm but not transformer, on the OntoNotes and WorldWide test sets:\r\n\r\n```\r\noriginal ontonotes on worldwide:   88.71  69.29\r\nsimplify-separate                  88.24  75.75\r\nsimplify-connected                 88.32  75.47\r\n```\r\n\r\n\r\nWe also produced combined models for nocharlm and with Electra as the input encoding.  The new English NER models are the packages `ontonotes-combined_nocharlm`, `ontonotes-combined_charlm`, and `ontonotes-combined_electra-large`.\r\n\r\nFuture plans include using multiple NER datasets for other models as well.\r\n\r\n## Other features\r\n\r\n- Postprocessing of proposed tokenization possible with dependency injection on the Pipeline (ty @Jemoka).  
When creating a `Pipeline`, you can now provide a `callable` via the `tokenize_postprocessor` parameter, and it can adjust the candidate list of tokens to change the tokenization used by the rest of the `Pipeline` https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1290\r\n\r\n- Finetuning for transformers in the NER models: have not yet found helpful settings, though https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F45ef5445f44222df862ed48c1b3743dc09f3d3fd\r\n\r\n- SE and SME should both represent Northern Sami, a weird case where UD didn't use the standard 2 letter code https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1279 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F88cd0df5da94664cb04453536212812dc97339bb\r\n\r\n- charlm for PT (improves accuracy on non-transformer models): https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc10763d0218ce87f8f257114a201cc608dbd7b3a\r\n\r\n- build models  with transformers for a few additional languages: MR, AR, PT, JA https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F45b387531c67bafa9bc41ee4d37ba0948daa9742 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F0f3761ee63c57f66630a8e94ba6276900c190a74 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc55472acbd32aa0e55d923612589d6c45dc569cc https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc10763d0218ce87f8f257114a201cc608dbd7b3a\r\n\r\n\r\n## Bugfixes\r\n\r\n- Scenegraph CoreNLP connection needed to be checked before sending messages: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002FCoreNLP\u002Fissues\u002F1346#issuecomment-1713267522 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc71bf3fdac8b782a61454c090763e8885d0e3824\r\n\r\n- `run_ete.py` was not correctly processing the charlm, meaning the whole thing wouldn't actually run  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F16f29f3dcf160f0d10a47fec501ab717adf0d4d7\r\n\r\n- Chinese NER model was pointing to the wrong pretrain https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1285 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F82a02151da17630eb515792a508a967ef70a6cef","2023-10-03T05:11:06",{"id":225,"version":226,"summary_zh":227,"released_at":228},100980,"v1.5.1","## Features\r\n\r\ndepparse can have transformer as an embedding https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Fee171cd167900fbaac16ff4b1f2fbd1a6e97de0a\r\n\r\nLemmatizer can remember word,pos it has seen before with a flag https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1263 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fa87ffd0a4f43262457cf7eecf5555a621c6dc24e\r\n\r\nScoring scripts for Flair and spaCy NER models (requires the appropriate packages, of course)  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F63dc212b467cd549039392743a0be493cc9bc9d8  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Fc42aed569f9d376e71708b28b0fe5b478697ba05  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Feab062341480e055f93787d490ff31d923a68398\r\n\r\nSceneGraph connection for the CoreNLP client   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Fd21a95cc90443ec4737de6d7ba68a106d12fb285\r\n\r\nUpdate constituency parser to reduce the learning rate on plateau.  
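The reduce-on-plateau schedule mentioned here can be sketched in a few lines; this is a generic illustration of the idea, not the parser's actual optimizer code (the patience and factor values are made up):

```python
class ReduceOnPlateau:
    # cut the learning rate once the dev score stops improving
    def __init__(self, lr, patience=3, factor=0.5):
        self.lr = lr
        self.patience = patience
        self.factor = factor
        self.best = float('-inf')
        self.bad_epochs = 0

    def step(self, dev_score):
        # call once per epoch with the latest dev-set score
        if dev_score > self.best:
            self.best = dev_score
            self.bad_epochs = 0
        else:
            self.bad_epochs += 1
            if self.bad_epochs >= self.patience:
                self.lr *= self.factor
                self.bad_epochs = 0
        return self.lr
```

PyTorch ships the same idea as `torch.optim.lr_scheduler.ReduceLROnPlateau`, the natural choice in a real training loop.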
Fiddling with the learning rates significantly improves performance https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Ff753a4f35b7c2cf7e8e6b01da3a60f73493178e1\r\n\r\nTokenize [] based on () rules if the original dataset doesn't have [] in it  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F063b4ba3c6ce2075655a70e54c434af4ce7ac3a9\r\n\r\nAttempt to finetune the charlm when building models (have not found effective settings for this yet)  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F048fdc9c9947a154d4426007301d63d920e60db0\r\n\r\nAdd the charlm to the lemmatizer - this will not be the default, since it is slower, but it is more accurate  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Fe811f52b4cf88d985e7dbbd499fe30dbf2e76d8d   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F66add6d519deb54ca9be5fe3148023a5d7d815e4  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Ff086de2359cce16ef2718c0e6e3b5deef1345c74\r\n\r\n## Bugfixes\r\n\r\nForgot to include the lemmatizer in CoreNLP 4.5.3, now in 4.5.4 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F4dda14bd585893044708c70e30c1c3efec509863 https:\u002F\u002Fgithub.com\u002Fbjascob\u002FLemmInflect\u002Fissues\u002F14#issuecomment-1470954013\r\n\r\nprepare_ner_dataset was always creating an Armenian pipeline, even for non-Armenian languages https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F78ff85ce7eed596ad195a3f26474065717ad63b3\r\n\r\nFix an empty `bulk_process` throwing an exception  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F5e2d15d1aa59e4a1fee8bba1de60c09ba21bf53e  
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1278\r\n\r\nUnroll the recursion in the Tarjan part of the Chuliu-Edmonds algorithm - should remove stack overflow errors  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Fe0917b0967ba9752fdf489b86f9bfd19186c38eb\r\n\r\n## Minor updates\r\n\r\nPut NER and POS scores on one line to make it easier to grep for: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fda2ae33e8ef9e48842685dfed88896b646dba8c4 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F8c4cb04d38c1101318755270f3aa75c54236e3fe\r\n\r\nSwitch all pretrains to use a name which indicates their source, rather than the dataset they are used for: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Fd1c68ed01276b3cf1455d497057fbc0b82da49e5 and many others\r\n\r\nPipeline uses `torch.no_grad()` for a slight speed boost  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F36ab82edfc574d46698c5352e07d2fcb0d68d3b3\r\n\r\nGeneralize save names, which eventually allows for putting `transformer`, `charlm` or `nocharlm` in the save name - this lets us distinguish different complexities of model https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Fcc0845826973576d8d8ed279274e6509250c9ad5 for constituency, and others for the other models\r\n\r\nAdd the model's flags to the `--help` for the `run` scripts, such as https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F83c0901c6ca2827224e156477e42e403d330a16e  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F7c171dd8d066c6973a8ee18a016b65f62376ea4c   
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F8e1d112bee42f2211f5153fcc89083b97e3d2600\r\n\r\nRemove the dependency on `six`  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F6daf97142ebc94cca7114a8cda5a20bf66f7f707  (thank you @BLKSerene )\r\n\r\n## New Models\r\n\r\nVLSP constituency https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F500435d3ec1b484b0f1152a613716565022257f2\r\n\r\nVLSP constituency -> tagging https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fcb0f22d7be25af0b3b2790e3ce1b9dbc277c13a7\r\n\r\nCTB 5.1 constituency   https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002Ff2ef62b96c79fcaf0b8aa70e4662d33b26dadf31\r\n\r\nAdd support for CTB 9.0, although those models are not distributed yet https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1282\u002Fcommits\u002F1e3ea8a10b2e485bc7c79c6ab41d1f1dd8c2022f\r\n\r\nAdded an Indonesian charlm\r\n\r\nIndonesian constituency from ICON treebank https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1218\r\n\r\nAll languages with pretrained charlms now have an option to use that charlm for dependency","2023-09-08T22:22:33",{"id":230,"version":231,"summary_zh":232,"released_at":233},100981,"v1.5.0","# Ssurgeon interface\r\n\r\nHeadlining this release is the initial release of Ssurgeon, a rule-based dependency graph editing tool.  Along with the existing Semgrex integration with CoreNLP, Ssurgeon allows for rewriting of dependencies such as in the UD datasets.  
More information is in the GURT 2023 paper, https:\u002F\u002Faclanthology.org\u002F2023.tlt-1.7\u002F\r\n\r\nIn addition to this addition, there are two other CoreNLP integrations, a long list of bugfixes, a few other minor features, and a long list of constituency parser experiments which were somewhere between \"ineffective\" and \"small improvements\" and are available for people to experiment with.\r\n\r\n## CoreNLP integration:\r\n- Ssurgeon interface!  New interface allows for editing of dependency graphs using Semgrex patterns and Ssurgeon rules.    https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1205  https:\u002F\u002Faclanthology.org\u002F2023.tlt-1.7\u002F\r\n- English Morphology class (deterministic English lemmatizer) https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F6aed177731e883ce92057be7e78abdce3141a862\r\n- English constituency -> dependency converter https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F0987794c9e960b32ed75d5804dd5c586466ae061\r\n\r\n## Bugfixes:\r\n- Bugfix for older versions of torch: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F376d7ea76248131a96d23e236ab165e7d5a544bb\r\n- Bugfix for training (integration with new scoring script) https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1167 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F9c39636c438cbeb00ab7a7e8d9caa0bcd31ccc44\r\n- Demo was showing constituency parser along with dependency parsing, even with conparse off: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fcbc13b0219281f2c27e89ccf2914e13f8aa2bb1b\r\n- Replace absurdly long characters with UNK (thank you @khughitt) https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1137 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1140 \r\n- Package all relevant pretrains into default.zip - otherwise 
pretrains used by NER models which are not the default pretrain were being missed.  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F435685f875766e0b9b2b9b1d4792db1c452f9722\r\n- stanza-train NER training bugfix (wrong pretrain): https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F2757cb40edf7a4bf9f62e31eec4b3632ac5ebcb9\r\n- Pass around device everywhere instead of calling cuda().  This should fix models occasionally being split over multiple devices.  It would also allow for use of MPS, but the current torch implementation for MPS is buggy https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1209  https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1159\r\n- Fix error in preparing tokenizer datasets (thanks @dvzubarev): https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1161\r\n- Fix unnecessary slowness in preparing tokenizer datasets (again, thanks @dvzubarev): https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1162\r\n- Fix using the correct pretrain when rebuilding POS tags for a Depparse dataset (again, thanks @dvzubarev): https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1170\r\n- When using the tregex interface to corenlp, add parse if it isn't already there (again, depparse was being confused with parse): https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fb118473604d50d678c2857c0f39f59ba0cd9c2a3\r\n- Update use of emoji to match latest releases: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1195 https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fea345a88f8916c2ab2cd2e6260caa7831dfe2f23\r\n\r\n## Features:\r\n- Mechanism for resplitting tokens into MWT https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F95 
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F8fac17f625173b2c2bf1cecf611deecb37399322\r\n- CLI for tokenizing text into one paragraph per line, whitespace separated (useful for Glove, for example) https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fcfd44d17f806703b7ed6719993501366a52afbb1\r\n- `detach().cpu()` speeds things up significantly in some cases https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fccfbc56b3b312fdde1350104a0d0d5645c9c80cc\r\n- Potentially use a constituency model as a classifier - WIP research project https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1190\r\n- add an output format `\"{:C}\"` for document objects which prints out documents as CoNLL: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1169\r\n- If a constituency tree is available, include it when outputting conll format for documents: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1171\r\n- Same with sentiment: https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fabb581945a70fec335dbfadd71bf8c457fa908eb\r\n- Additional language code coverage (thank you @juanro49) https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F5802b10882026c4694a4d966e4200c48c5469b1b https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Ff06bf86b566772ea6551c663835ddb9a6f5584ff https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F32f83fa2f2333f42925323c4ac9da059dffdf1dc https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F34505758c9d8de4ca70bfbe5418448ad54af088f\r\n- Allow loading a pipeline for new languages (useful when developing a new suite of models) https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fe7fcd262a6c5f3f71b339fe989bcaa177fb378f1\r\n- 
Scri","2023-03-14T05:09:33",{"id":235,"version":236,"summary_zh":237,"released_at":238},100982,"v1.4.2","# Stanza v1.4.2: Minor version bump to improve (python) dependencies\r\n\r\n- Pipeline cache in Multilingual is a single OrderedDict\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1115#issuecomment-1239759362\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fba3f64d5f571b1dc70121551364fc89d103ca1cd\r\n\r\n- Don't require `pytest` for all installations unless needed for testing\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1120\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F8c1d9d80e2e12729f60f05b81e88e113fbdd3482\r\n\r\n- Hide SiLU and Mish imports if the version of torch installed doesn't have those nonlinearities\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1120\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F6a90ad4bacf923c88438da53219c48355b847ed3\r\n\r\n- Reorder & normalize installations in setup.py\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1124\r\n","2022-09-15T05:47:38",{"id":240,"version":241,"summary_zh":242,"released_at":243},100983,"v1.4.1","# Stanza v1.4.1: Improvements to pos, conparse, and sentiment, jupyter visualization, and wider language coverage\r\n\r\n## Overview\r\n\r\nWe improve the quality of the POS, constituency, and sentiment models, add an integration to displaCy, and add new models for a variety of languages.\r\n\r\n## New NER models\r\n\r\n- New Polish NER model based on NKJP from Karol Saputa and ryszardtuora\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1070\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1110\r\n\r\n- Make GermEval2014 the default German NER model, including an optional Bert 
version\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1018\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1022\r\n\r\n- Japanese conversion of GSD by Megagon\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1038\r\n\r\n- Marathi NER dataset from L3Cube.  Includes a Sentiment model as well\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1043\r\n\r\n- Thai conversion of LST20\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F555fc0342decad70f36f501a7ea1e29fa0c5b317\r\n\r\n- Kazakh conversion of KazNERD\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1091\u002Fcommits\u002Fde6cd25c2e5b936bc4ad2764b7b67751d0b862d7\r\n\r\n## Other new models\r\n\r\n- Sentiment conversion of Tass2020 for Spanish\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1104\r\n\r\n- VIT constituency dataset for Italian\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1091\u002Fcommits\u002F149f1440dc32d47fbabcc498cfcd316e53aca0c6\r\n... 
and many subsequent updates\r\n\r\n- Combined UD models for Hebrew\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1109\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fe4fcf003feb984f535371fb91c9e380dd187fd12\r\n\r\n- For UD models with small train dataset & larger test dataset, flip the datasets\r\nUD_Buryat-BDT UD_Kazakh-KTB UD_Kurmanji-MG UD_Ligurian-GLT UD_Upper_Sorbian-UFAL\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1030\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F9618d60d63c49ec1bfff7416e3f1ad87300c7073\r\n\r\n- Spanish conparse model from multiple sources - AnCora, LDC-NW, LDC-DF\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F47740c6252a6717f12ef1fde875cf19fa1cd67cc\r\n\r\n## Model improvements\r\n\r\n- Pretrained charlm integrated into POS.  Gives a small to decent gain for most languages without much additional cost\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1086\r\n\r\n- Pretrained charlm integrated into Sentiment.  Improves English, others not so much\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1025\r\n\r\n- LSTM, 2d maxpool as optional items in the Sentiment\r\nfrom the paper `Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling`\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1098\r\n\r\n- First learn with AdaDelta, then with another optimizer in conparse training.  
Very helpful\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fb1d10d3bdd892c7f68d2da7f4ba68a6ae3087f52\r\n\r\n- Grad clipping in conparse training\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F365066add019096332bcba0da4a626f68b70d303\r\n\r\n## Pipeline interface improvements\r\n\r\n- GPU memory savings: charlm reused between different processors in the same pipeline\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1028\r\n\r\n- Word vectors not saved in the NER models.  Saves bandwidth & disk space\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1033\r\n\r\n- Functions to return tagsets for NER and conparse models\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1066\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1073\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F36b84db71f19e37b36119e2ec63f89d1e509acb0\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F2db43c834bc8adbb8b096cf135f0fab8b8d886cb\r\n\r\n- displaCy integration with NER and dependency trees\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F20714137d81e5e63d2bcee420b22c4fd2a871306\r\n\r\n## Bugfixes\r\n\r\n- Fix that it takes forever to tokenize a single long token (catastrophic backtracking in regex)\r\nTY to Sk Adnan Hassan (VT) and Zainab Aamir (Stony Brook)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1056\r\n\r\n- Starting a new corenlp client w\u002Fo server shouldn't wait for the server to be available\r\nTY to Mariano Crosetti\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1059\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1061\r\n\r\n- Read raw glove word vectors (they have no header 
information)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1074\r\n\r\n- Ensure that illegal languages are not chosen by the LangID model\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1076\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1077\r\n\r\n- Fix cache in Multilingual pipeline\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1115\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fcdf18d8b19c92b0cfbbf987e82b0080ea7b4db32\r\n\r\n- Fix loading of previously unseen languages in Multilingual pipeline\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1101\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fe551ebe60a4d818bc5ba8880dda741cc8bd1aed7\r\n\r\n- Fix that conparse would occasionally train to NaN early in the training\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fc4d785729e42ac90f298e0ef4ab487d14fa35591\r\n\r\n## Improved","2022-09-14T16:41:36",{"id":245,"version":246,"summary_zh":247,"released_at":248},100984,"v1.4.0","# Stanza v1.4.0: Transformer integration to NER and conparse\r\n\r\n## Overview\r\n\r\nAs part of the new Stanza release, we integrate transformer inputs to the NER and conparse modules.  
In addition, we now support several additional languages for NER and conparse.\r\n\r\n## Pipeline interface improvements\r\n\r\n- Download resources.json and models into temp dirs first to avoid race conditions between multiple processors\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F213\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1001\r\n\r\n- Download models for Pipelines automatically, without needing to call `stanza.download(...)`\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F486\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F943\r\n\r\n- Add ability to turn off downloads\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F68455d895986357a2c1f496e52c4e59ee0feb165\r\n\r\n- Add a new interface where both processors and package can be set\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F917\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Ff37042924b7665bbaf006b02dcbf8904d71931a1\r\n\r\n- When using pretokenized tokens, get character offsets from text if available\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F967\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F975\r\n\r\n- If Bert or other transformers are used, cache the models rather than loading multiple times\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F980\r\n\r\n- Allow for disabling processors on individual runs of a pipeline\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F945\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F947\r\n\r\n## Other general improvements\r\n\r\n- Add # text and # sent_id to conll 
output\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fdiscussions\u002F918\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F983\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F995\r\n\r\n- Add ner to the token conll output\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fdiscussions\u002F993\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F996\r\n\r\n- Fix missing Slovak MWT model\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F971\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F5aa19ec2e6bc610576bc12d226d6f247a21dbd75\r\n\r\n- Upgrades to EN, IT, and Indonesian models\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F1003\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F1008\r\nIT improvements with the help of @attardi and @msimi\r\n\r\n- Fix improper tokenization of Chinese text with leading whitespace\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F920\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F924\r\n\r\n- Check if a CoreNLP model exists before downloading it (thank you @interNULL)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F965\r\n\r\n- Convert the run_charlm script to python\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F942\r\n\r\n- Typing and lint fixes (thank you @asears)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F833\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F856\r\n\r\n- stanza-train examples now compatible with the python training scripts\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F896\r\n\r\n## NER features\r\n\r\n- Bert integration (not by default, thank you 
@vythaihn)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F976\r\n\r\n- Swedish model (thank you @EmilStenstrom)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F912\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F857\r\n\r\n- Persian model\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F797\r\n\r\n- Danish model\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F910\u002Fcommits\u002F3783cc494ee8c6b6d062c4d652a428a04a4ee839\r\n\r\n- Norwegian model (both NB and NN)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F910\u002Fcommits\u002F31fa23e5239b10edca8ecea46e2114f9cc7b031d\r\n\r\n- Use updated Ukrainian data (thank you @gawy)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F873\r\n\r\n- Myanmar model (thank you UCSY)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F845\r\n\r\n- Training improvements for finetuning models\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F788\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F791\r\n\r\n- Fix inconsistencies in B\u002FS\u002FI\u002FE tags\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F928#issuecomment-1027987531\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F961\r\n\r\n- Add an option for multiple NER models at the same time, merging the results together\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F928\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F955\r\n\r\n## Constituency parser\r\n\r\n- Dynamic oracle (improves accuracy a bit)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F866\r\n\r\n- Missing tags now okay in the 
parser\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F862\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F04dbf4f65e417a2ceb19897ab62c4cf293187c0b\r\n\r\n- bugfix of () not being escaped when output in a tree\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Feaf134ca699aca158dc6e706878037a20bc8cbd4\r\n\r\n- charlm integration by default\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F799\r\n\r\n- Bert integration (not the default model) (thank you @vythaihn and @hungbui0411)\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F05a0b04ee6dd701ca1c7c60197be62d4c13b17b6\r\nhttps:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F0bbe8d10f895560a2bf16f542d2e3586d5d45b7e\r\n\r\n- Preemp","2022-04-23T06:01:01",{"id":250,"version":251,"summary_zh":252,"released_at":253},100985,"v1.3.0","# Overview\r\n\r\nStanza 1.3.0 introduces a language id model, a constituency parser, a dictionary in the tokenizer, and some additional features and bugfixes.\r\n\r\n## New features\r\n\r\n- **Langid model and multilingual pipeline**\r\nBased on \"A reproduction of Apple's bi-directional LSTM models for language identification in short strings.\" by Toftrup et al 2021\r\n(https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F154b0e8e59d3276744ae0c8ea56dc226f777fba8)\r\n\r\n- **Constituency parser**\r\nBased on \"In-Order Transition-based Constituent Parsing\" by Jiangming Liu and Yue Zhang.  
Currently an `en_wsj` model is available, with more to come.\r\n(https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002F90318023432d584c62986123ef414a1fa93683ca)\r\n\r\n- **Evalb interface to CoreNLP**\r\nUseful for evaluating the parser - requires CoreNLP 4.3.0 or later\r\n\r\n- **Dictionary tokenizer feature**\r\nNoticeably improved performance for ZH, VI, TH\r\n(https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F776)\r\n\r\n## Bugfixes \u002F Reliability\r\n\r\n- **HuggingFace integration**\r\nNo more git issues complaining about unavailable models!  (Hopefully)\r\n(https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Ff7af5049568f81a716106fee5403d339ca246f38)\r\n\r\n- **Sentiment processor crashes on certain inputs**\r\n(issue https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F804, fixed by https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fcommit\u002Fe232f67f3850a32a1b4f3a99e9eb4f5c5580c019)\r\n","2021-10-06T06:28:19",{"id":255,"version":256,"summary_zh":257,"released_at":258},100986,"v1.2.3","# Overview\r\n\r\nIn anticipation of a larger release with some new features, we make a small update to fix some existing bugs and add two more NER models.\r\n\r\n## Bugfixes\r\n\r\n- **Sentiment models would crash on no text** (issue https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F769, fixed by https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F781\u002Fcommits\u002F47889e3043c27f9c5abd9913016929f1857de7bf)\r\n\r\n- **Java processes as a context were not properly closed** (https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F781\u002Fcommits\u002Fa39d2ff6801a23aa73add1f710d809a9c0a793b1)\r\n\r\n## Interface improvements\r\n\r\n- **Downloading tokenize now downloads mwt for languages which require it** (issue https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fissues\u002F774, fixed by 
https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F777, from davidrft)\r\n\r\n- **NER model can finetune and save to\u002Ffrom different filenames** (https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F781\u002Fcommits\u002F0714a0134f0af6ef486b49ce934f894536e31d43)\r\n\r\n- **NER model now displays a confusion matrix at the end of training** (https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F781\u002Fcommits\u002F9bbd3f712f97cb2702a0852e1c353d4d54b4b33b)\r\n\r\n## NER models\r\n\r\n- **Afrikaans, trained in NCHLT** (https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F781\u002Fcommits\u002F6f1f04b6d674691cf9932d780da436063ebd3381)\r\n\r\n- **Italian, trained on a model from FBK** (https:\u002F\u002Fgithub.com\u002Fstanfordnlp\u002Fstanza\u002Fpull\u002F781\u002Fcommits\u002Fd9a361fd7f13105b68569fddeab650ea9bd04b7f)\r\n\r\n","2021-08-09T23:12:42"]