[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-tensorflow--text":3,"tool-tensorflow--text":64},[4,17,25,39,48,56],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":16},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",138956,2,"2026-04-05T11:33:21",[13,14,15],"开发框架","Agent","语言模型","ready",{"id":18,"name":19,"github_repo":20,"description_zh":21,"stars":22,"difficulty_score":10,"last_commit_at":23,"category_tags":24,"status":16},3704,"NextChat","ChatGPTNextWeb\u002FNextChat","NextChat 是一款轻量且极速的 AI 助手，旨在为用户提供流畅、跨平台的大模型交互体验。它完美解决了用户在多设备间切换时难以保持对话连续性，以及面对众多 AI 模型不知如何统一管理的痛点。无论是日常办公、学习辅助还是创意激发，NextChat 都能让用户随时随地通过网页、iOS、Android、Windows、MacOS 或 Linux 端无缝接入智能服务。\n\n这款工具非常适合普通用户、学生、职场人士以及需要私有化部署的企业团队使用。对于开发者而言，它也提供了便捷的自托管方案，支持一键部署到 Vercel 或 Zeabur 等平台。\n\nNextChat 的核心亮点在于其广泛的模型兼容性，原生支持 Claude、DeepSeek、GPT-4 及 Gemini Pro 等主流大模型，让用户在一个界面即可自由切换不同 AI 能力。此外，它还率先支持 MCP（Model Context Protocol）协议，增强了上下文处理能力。针对企业用户，NextChat 提供专业版解决方案，具备品牌定制、细粒度权限控制、内部知识库整合及安全审计等功能，满足公司对数据隐私和个性化管理的高标准要求。",87618,"2026-04-05T07:20:52",[13,15],{"id":26,"name":27,"github_repo":28,"description_zh":29,"stars":30,"difficulty_score":10,"last_commit_at":31,"category_tags":32,"status":16},2268,"ML-For-Beginners","microsoft\u002FML-For-Beginners","ML-For-Beginners 是由微软推出的一套系统化机器学习入门课程，旨在帮助零基础用户轻松掌握经典机器学习知识。这套课程将学习路径规划为 12 周，包含 26 节精炼课程和 52 
道配套测验，内容涵盖从基础概念到实际应用的完整流程，有效解决了初学者面对庞大知识体系时无从下手、缺乏结构化指导的痛点。\n\n无论是希望转型的开发者、需要补充算法背景的研究人员，还是对人工智能充满好奇的普通爱好者，都能从中受益。课程不仅提供了清晰的理论讲解，还强调动手实践，让用户在循序渐进中建立扎实的技能基础。其独特的亮点在于强大的多语言支持，通过自动化机制提供了包括简体中文在内的 50 多种语言版本，极大地降低了全球不同背景用户的学习门槛。此外，项目采用开源协作模式，社区活跃且内容持续更新，确保学习者能获取前沿且准确的技术资讯。如果你正寻找一条清晰、友好且专业的机器学习入门之路，ML-For-Beginners 将是理想的起点。",84991,"2026-04-05T10:45:23",[33,34,35,36,14,37,15,13,38],"图像","数据工具","视频","插件","其他","音频",{"id":40,"name":41,"github_repo":42,"description_zh":43,"stars":44,"difficulty_score":45,"last_commit_at":46,"category_tags":47,"status":16},3128,"ragflow","infiniflow\u002Fragflow","RAGFlow 是一款领先的开源检索增强生成（RAG）引擎，旨在为大语言模型构建更精准、可靠的上下文层。它巧妙地将前沿的 RAG 技术与智能体（Agent）能力相结合，不仅支持从各类文档中高效提取知识，还能让模型基于这些知识进行逻辑推理和任务执行。\n\n在大模型应用中，幻觉问题和知识滞后是常见痛点。RAGFlow 通过深度解析复杂文档结构（如表格、图表及混合排版），显著提升了信息检索的准确度，从而有效减少模型“胡编乱造”的现象，确保回答既有据可依又具备时效性。其内置的智能体机制更进一步，使系统不仅能回答问题，还能自主规划步骤解决复杂问题。\n\n这款工具特别适合开发者、企业技术团队以及 AI 研究人员使用。无论是希望快速搭建私有知识库问答系统，还是致力于探索大模型在垂直领域落地的创新者，都能从中受益。RAGFlow 提供了可视化的工作流编排界面和灵活的 API 接口，既降低了非算法背景用户的上手门槛，也满足了专业开发者对系统深度定制的需求。作为基于 Apache 2.0 协议开源的项目，它正成为连接通用大模型与行业专有知识之间的重要桥梁。",77062,3,"2026-04-04T04:44:48",[14,33,13,15,37],{"id":49,"name":50,"github_repo":51,"description_zh":52,"stars":53,"difficulty_score":45,"last_commit_at":54,"category_tags":55,"status":16},519,"PaddleOCR","PaddlePaddle\u002FPaddleOCR","PaddleOCR 是一款基于百度飞桨框架开发的高性能开源光学字符识别工具包。它的核心能力是将图片、PDF 等文档中的文字提取出来，转换成计算机可读取的结构化数据，让机器真正“看懂”图文内容。\n\n面对海量纸质或电子文档，PaddleOCR 解决了人工录入效率低、数字化成本高的问题。尤其在人工智能领域，它扮演着连接图像与大型语言模型（LLM）的桥梁角色，能将视觉信息直接转化为文本输入，助力智能问答、文档分析等应用场景落地。\n\nPaddleOCR 适合开发者、算法研究人员以及有文档自动化需求的普通用户。其技术优势十分明显：不仅支持全球 100 多种语言的识别，还能在 Windows、Linux、macOS 等多个系统上运行，并灵活适配 CPU、GPU、NPU 等各类硬件。作为一个轻量级且社区活跃的开源项目，PaddleOCR 既能满足快速集成的需求，也能支撑前沿的视觉语言研究，是处理文字识别任务的理想选择。",74913,"2026-04-05T10:44:17",[15,33,13,37],{"id":57,"name":58,"github_repo":59,"description_zh":60,"stars":61,"difficulty_score":45,"last_commit_at":62,"category_tags":63,"status":16},2181,"OpenHands","OpenHands\u002FOpenHands","OpenHands 是一个专注于 AI 
驱动开发的开源平台，旨在让智能体（Agent）像人类开发者一样理解、编写和调试代码。它解决了传统编程中重复性劳动多、环境配置复杂以及人机协作效率低等痛点，通过自动化流程显著提升开发速度。\n\n无论是希望提升编码效率的软件工程师、探索智能体技术的研究人员，还是需要快速原型验证的技术团队，都能从中受益。OpenHands 提供了灵活多样的使用方式：既可以通过命令行（CLI）或本地图形界面在个人电脑上轻松上手，体验类似 Devin 的流畅交互；也能利用其强大的 Python SDK 自定义智能体逻辑，甚至在云端大规模部署上千个智能体并行工作。\n\n其核心技术亮点在于模块化的软件智能体 SDK，这不仅构成了平台的引擎，还支持高度可组合的开发模式。此外，OpenHands 在 SWE-bench 基准测试中取得了 77.6% 的优异成绩，证明了其解决真实世界软件工程问题的能力。平台还具备完善的企业级功能，支持与 Slack、Jira 等工具集成，并提供细粒度的权限管理，适合从个人开发者到大型企业的各类用户场景。",70612,"2026-04-05T11:12:22",[15,14,13,36],{"id":65,"github_repo":66,"name":67,"description_en":68,"description_zh":69,"ai_summary_zh":70,"readme_en":71,"readme_zh":72,"quickstart_zh":73,"use_case_zh":74,"hero_image_url":75,"owner_login":76,"owner_name":76,"owner_avatar_url":77,"owner_bio":78,"owner_company":79,"owner_location":79,"owner_email":80,"owner_twitter":79,"owner_website":81,"owner_url":82,"languages":83,"stars":106,"forks":107,"last_commit_at":108,"license":109,"difficulty_score":10,"env_os":110,"env_gpu":111,"env_ram":111,"env_deps":112,"category_tags":116,"github_topics":79,"view_count":117,"oss_zip_url":79,"oss_zip_packed_at":79,"status":16,"created_at":118,"updated_at":119,"faqs":120,"releases":150},380,"tensorflow\u002Ftext","text","Making text a first-class citizen in TensorFlow.","TensorFlow Text（包名 tensorflow-text）是 TensorFlow 生态中专注于文本处理的扩展库，旨在让文本数据在 TensorFlow 中享有“一等公民”的地位。它为 TensorFlow 2.0 用户提供了一系列开箱即用的文本类与操作，涵盖 Unicode 处理、文本规范化、多种分词策略（如空格分词、脚本分词）以及 N-grams 生成等功能。\n\n许多开发者在使用 TensorFlow 构建自然语言模型时，常面临训练与推理阶段分词逻辑不一致、需额外维护预处理脚本等难题。TensorFlow Text 通过将文本处理操作直接嵌入 TensorFlow 计算图内，有效解决了这些问题，确保了数据预处理流程的统一性与高效性，无需担心环境差异导致的偏差。\n\n它特别适合 NLP 领域的开发者和研究人员使用，尤其是依赖 TensorFlow 进行序列建模的团队。不仅支持标准 Python 调用，还提供 Keras API 接口，便于快速集成到现有模型架构中。配合完善的文档和社区贡献机制，TensorFlow Text 能显著降低文本数据预处理的复杂度，提升模型开发效率","TensorFlow Text（包名 tensorflow-text）是 TensorFlow 生态中专注于文本处理的扩展库，旨在让文本数据在 TensorFlow 中享有“一等公民”的地位。它为 TensorFlow 2.0 用户提供了一系列开箱即用的文本类与操作，涵盖 Unicode 处理、文本规范化、多种分词策略（如空格分词、脚本分词）以及 N-grams 生成等功能。\n\n许多开发者在使用 TensorFlow 
构建自然语言模型时，常面临训练与推理阶段分词逻辑不一致、需额外维护预处理脚本等难题。TensorFlow Text 通过将文本处理操作直接嵌入 TensorFlow 计算图内，有效解决了这些问题，确保了数据预处理流程的统一性与高效性，无需担心环境差异导致的偏差。\n\n它特别适合 NLP 领域的开发者和研究人员使用，尤其是依赖 TensorFlow 进行序列建模的团队。不仅支持标准 Python 调用，还提供 Keras API 接口，便于快速集成到现有模型架构中。配合完善的文档和社区贡献机制，TensorFlow Text 能显著降低文本数据预处理的复杂度，提升模型开发效率。","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftensorflow_text_readme_f826ee9d7e46.png\" width=\"60%\">\u003Cbr>\u003Cbr>\n\u003C\u002Fdiv>\n\n-----------------\n\n[![PyPI version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftensorflow-text)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Ftensorflow-text)\n[![PyPI nightly version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftensorflow-text-nightly?color=informational&label=pypi%20%40%20nightly)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Ftensorflow-text-nightly)\n[![PyPI Python version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Ftensorflow-text)](https:\u002F\u002Fpypi.org\u002Fproject\u002Ftensorflow-text\u002F)\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fapi-reference-blue.svg)](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext\u002Fblob\u002Fmaster\u002Fdocs\u002Fapi_docs\u002Fpython\u002Findex.md)\n[![Contributions\nwelcome](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcontributions-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-brightgreen.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n\n\u003C!-- TODO(broken):  Uncomment when badges are made public.\n### Continuous Integration Test Status\n\n| Build      | Status |\n| ---             | ---    |\n| **Linux**   | [![Status](https:\u002F\u002Fstorage.googleapis.com\u002Ftf-text-badges\u002Fubuntu-gpu-py3.svg)] |\n| **MacOS**   | [![Status](https:\u002F\u002Fstorage.googleapis.com\u002Ftf-text-badges\u002Fubuntu-gpu-py3.svg)] |\n| **Windows**   
| [![Status](https:\u002F\u002Fstorage.googleapis.com\u002Ftf-text-badges\u002Fubuntu-gpu-py3.svg)] |\n-->\n\n# TensorFlow Text - Text processing in Tensorflow\n\n**IMPORTANT**: When installing TF Text with `pip install`, please note the\nversion of TensorFlow you are running, as you should specify the corresponding\nminor version of TF Text (eg. for tensorflow==2.3.x use tensorflow_text==2.3.x).\n\n## INDEX\n* [Introduction](#introduction)\n* [Documentation](#documentation)\n* [Unicode](#unicode)\n* [Normalization](#normalization)\n* [Tokenization](#tokenization)\n  * [Whitespace Tokenizer](#whitespacetokenizer)\n  * [UnicodeScript Tokenizer](#unicodescripttokenizer)\n  * [Unicode split](#unicode-split)\n  * [Offsets](#offsets)\n  * [TF.Data Example](#tfdata-example)\n  * [Keras API](#keras-api)\n* [Other Text Ops](#other-text-ops)\n  * [Wordshape](#wordshape)\n  * [N-grams & Sliding Window](#n-grams--sliding-window)\n* [Installation](#installation)\n  * [Install using PIP](#install-using-pip)\n  * [Build from source steps:](#build-from-source-steps)\n\n## Introduction\n\nTensorFlow Text provides a collection of text related classes and ops ready to\nuse with TensorFlow 2.0. The library can perform the preprocessing regularly\nrequired by text-based models, and includes other features useful for sequence\nmodeling not provided by core TensorFlow.\n\nThe benefit of using these ops in your text preprocessing is that they are done\nin the TensorFlow graph. You do not need to worry about tokenization in\ntraining being different than the tokenization at inference, or managing\npreprocessing scripts.\n\n## Documentation\n\nPlease visit [http:\u002F\u002Ftensorflow.org\u002Ftext](http:\u002F\u002Ftensorflow.org\u002Ftext) for all\ndocumentation. This site includes API docs, guides for working with TensorFlow\nText, as well as tutorials for building specific models.\n\n## Unicode\n\nMost ops expect that the strings are in UTF-8. 
If you're using a different\nencoding, you can use the core tensorflow transcode op to transcode into UTF-8.\nYou can also use the same op to coerce your string to structurally valid UTF-8\nif your input could be invalid.\n\n```python\ndocs = tf.constant([u'Everything not saved will be lost.'.encode('UTF-16-BE'),\n                    u'Sad☹'.encode('UTF-16-BE')])\nutf8_docs = tf.strings.unicode_transcode(docs, input_encoding='UTF-16-BE',\n                                         output_encoding='UTF-8')\n```\n\n## Normalization\n\nWhen dealing with different sources of text, it's important that the same words\nare recognized to be identical. A common technique for case-insensitive matching\nin Unicode is case folding (similar to lower-casing). (Note that case folding\ninternally applies NFKC normalization.)\n\nWe also provide Unicode normalization ops for transforming strings into a\ncanonical representation of characters, with Normalization Form KC being the\ndefault ([NFKC](http:\u002F\u002Funicode.org\u002Freports\u002Ftr15\u002F)).\n\n```python\nprint(text.case_fold_utf8(['Everything not saved will be lost.']))\nprint(text.normalize_utf8(['Äffin']))\nprint(text.normalize_utf8(['Äffin'], 'nfkd'))\n```\n\n```sh\ntf.Tensor(['everything not saved will be lost.'], shape=(1,), dtype=string)\ntf.Tensor(['\\xc3\\x84ffin'], shape=(1,), dtype=string)\ntf.Tensor(['A\\xcc\\x88ffin'], shape=(1,), dtype=string)\n```\n\n## Tokenization\n\nTokenization is the process of breaking up a string into tokens. Commonly, these\ntokens are words, numbers, and\u002For punctuation.\n\nThe main interfaces are `Tokenizer` and `TokenizerWithOffsets` which each have a\nsingle method `tokenize` and `tokenize_with_offsets` respectively. There are\nmultiple implementing tokenizers available now. Each of these implements\n`TokenizerWithOffsets` (which extends `Tokenizer`) which includes an option for\ngetting byte offsets into the original string. 
This allows the caller to know\nthe bytes in the original string the token was created from.\n\nAll of the tokenizers return RaggedTensors with the inner-most dimension of\ntokens mapping to the original individual strings. As a result, the resulting\nshape's rank is increased by one. Please review the ragged tensor guide if you\nare unfamiliar with them. https:\u002F\u002Fwww.tensorflow.org\u002Fguide\u002Fragged_tensor\n\n### WhitespaceTokenizer\n\nThis is a basic tokenizer that splits UTF-8 strings on ICU defined whitespace\ncharacters (eg. space, tab, new line).\n\n```python\ntokenizer = text.WhitespaceTokenizer()\ntokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\n```\n\n```sh\n[['everything', 'not', 'saved', 'will', 'be', 'lost.'], ['Sad\\xe2\\x98\\xb9']]\n```\n\n### UnicodeScriptTokenizer\n\nThis tokenizer splits UTF-8 strings based on Unicode script boundaries. The\nscript codes used correspond to International Components for Unicode (ICU)\nUScriptCode values. See: http:\u002F\u002Ficu-project.org\u002Fapiref\u002Ficu4c\u002Fuscript_8h.html\n\nIn practice, this is similar to the `WhitespaceTokenizer` with the most apparent\ndifference being that it will split punctuation (USCRIPT_COMMON) from language\ntexts (eg. 
USCRIPT_LATIN, USCRIPT_CYRILLIC, etc) while also separating language\ntexts from each other.\n\n```python\ntokenizer = text.UnicodeScriptTokenizer()\ntokens = tokenizer.tokenize(['everything not saved will be lost.',\n                             u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\n```\n\n```sh\n[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],\n ['Sad', '\\xe2\\x98\\xb9']]\n```\n\n### Unicode split\n\nWhen tokenizing languages without whitespace to segment words, it is common to\njust split by character, which can be accomplished using the\n[unicode_split](https:\u002F\u002Fwww.tensorflow.org\u002Fapi_docs\u002Fpython\u002Ftf\u002Fstrings\u002Funicode_split)\nop found in core.\n\n```python\ntokens = tf.strings.unicode_split([u\"仅今年前\".encode('UTF-8')], 'UTF-8')\nprint(tokens.to_list())\n```\n\n```sh\n[['\\xe4\\xbb\\x85', '\\xe4\\xbb\\x8a', '\\xe5\\xb9\\xb4', '\\xe5\\x89\\x8d']]\n```\n\n### Offsets\n\nWhen tokenizing strings, it is often desired to know where in the original\nstring the token originated from. For this reason, each tokenizer which\nimplements `TokenizerWithOffsets` has a *tokenize_with_offsets* method that will\nreturn the byte offsets along with the tokens. The start_offsets lists the bytes\nin the original string each token starts at (inclusive), and the end_offsets\nlists the bytes where each token ends at (exclusive, i.e., first byte *after*\nthe token).\n\n```python\ntokenizer = text.UnicodeScriptTokenizer()\n(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(\n    ['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\nprint(start_offsets.to_list())\nprint(end_offsets.to_list())\n```\n\n```sh\n[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],\n ['Sad', '\\xe2\\x98\\xb9']]\n[[0, 11, 15, 21, 26, 29, 33], [0, 3]]\n[[10, 14, 20, 25, 28, 33, 34], [3, 6]]\n```\n\n### TF.Data Example\n\nTokenizers work as expected with the tf.data API. 
A simple example is provided\nbelow.\n\n```python\ndocs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],\n                                           [\"It's a trap!\"]])\ntokenizer = text.WhitespaceTokenizer()\ntokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))\n# In TF 2.x, datasets are directly iterable in eager mode.\niterator = iter(tokenized_docs)\nprint(next(iterator).to_list())\nprint(next(iterator).to_list())\n```\n\n```sh\n[['Never', 'tell', 'me', 'the', 'odds.']]\n[[\"It's\", 'a', 'trap!']]\n```\n\n### Keras API\n\nWhen you use different tokenizers and ops to preprocess your data, the resulting\noutputs are Ragged Tensors. The Keras API makes it easy now to train a model\nusing Ragged Tensors without having to worry about padding or masking the data,\nby either using the ToDense layer which handles all of these for you or relying\non Keras built-in layers support for natively working on ragged data.\n\n```python\nmodel = tf.keras.Sequential([\n  tf.keras.layers.InputLayer(input_shape=(None,), dtype='int32', ragged=True),\n  text.keras.layers.ToDense(pad_value=0, mask=True),\n  tf.keras.layers.Embedding(100, 16),\n  tf.keras.layers.LSTM(32),\n  tf.keras.layers.Dense(32, activation='relu'),\n  tf.keras.layers.Dense(1, activation='sigmoid')\n])\n```\n\n## Other Text Ops\n\nTF.Text packages other useful preprocessing ops. We will review a couple below.\n\n### Wordshape\n\nA common feature used in some natural language understanding models is to see\nif the text string has a certain property. For example, a sentence breaking\nmodel might contain features which check for word capitalization or if a\npunctuation character is at the end of a string.\n\nWordshape defines a variety of useful regular expression based helper functions\nfor matching various relevant patterns in your input text. 
Here are a few\nexamples.\n\n```python\ntokenizer = text.WhitespaceTokenizer()\ntokens = tokenizer.tokenize(['Everything not saved will be lost.',\n                             u'Sad☹'.encode('UTF-8')])\n\n# Is capitalized?\nf1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)\n# Are all letters uppercased?\nf2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)\n# Does the token contain punctuation?\nf3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)\n# Is the token a number?\nf4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)\n\nprint(f1.to_list())\nprint(f2.to_list())\nprint(f3.to_list())\nprint(f4.to_list())\n```\n\n```sh\n[[True, False, False, False, False, False], [True]]\n[[False, False, False, False, False, False], [False]]\n[[False, False, False, False, False, True], [True]]\n[[False, False, False, False, False, False], [False]]\n```\n\n### N-grams & Sliding Window\n\nN-grams are sequential words given a sliding window size of *n*. When combining\nthe tokens, there are three reduction mechanisms supported. 
For text, you would\nwant to use `Reduction.STRING_JOIN` which appends the strings to each other.\nThe default separator character is a space, but this can be changed with the\n`string_separator` argument.\n\nThe other two reduction methods are most often used with numerical values, and\nthese are `Reduction.SUM` and `Reduction.MEAN`.\n\n```python\ntokenizer = text.WhitespaceTokenizer()\ntokens = tokenizer.tokenize(['Everything not saved will be lost.',\n                             u'Sad☹'.encode('UTF-8')])\n\n# Ngrams, in this case bi-gram (n = 2)\nbigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)\n\nprint(bigrams.to_list())\n```\n\n```sh\n[['Everything not', 'not saved', 'saved will', 'will be', 'be lost.'], []]\n```\n\n## Installation\n\n### Install using PIP\n\nWhen installing TF Text with `pip install`, please note the version\nof TensorFlow you are running, as you should specify the corresponding version\nof TF Text. For example, if you're using TF 2.0, install the 2.0 version of TF\nText, and if you're using TF 1.15, install the 1.15 version of TF Text.\n\n```bash\npip install -U tensorflow-text==\u003Cversion>\n```\n\n### A note about different operating system packages\n\nAfter version 2.10, we will only be providing pip packages for Linux x86_64 and\nIntel-based Macs. TensorFlow Text has always leveraged the release\ninfrastructure of the core TensorFlow package to more easily maintain compatible\nreleases with minimal maintenance, allowing the team to focus on TF Text itself\nand contributions to other parts of the TensorFlow ecosystem.\n\nFor other systems like Windows, Aarch64, and Apple Silicon Macs, TensorFlow relies on\n[build collaborators](https:\u002F\u002Fblog.tensorflow.org\u002F2022\u002F09\u002Fannouncing-tensorflow-official-build-collaborators.html),\nand so we will not be providing packages for them. 
However, we will continue to\naccept PRs to make building for these OSs easy for users, and will try to point\nto community efforts related to them.\n\n\n### Build from source steps:\n\nNote that TF Text needs to be built in the same environment as TensorFlow. Thus,\nif you manually build TF Text, it is highly recommended that you also build\nTensorFlow.\n\nIf building on MacOS, you must have coreutils installed. It is probably easiest\nto do with Homebrew.\n\n1. [build and install TensorFlow](https:\u002F\u002Fwww.tensorflow.org\u002Finstall\u002Fsource).\n1. Clone the TF Text repo:\n   ```Shell\n   git clone https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext.git\n   cd text\n   ```\n1. Run the build script to create a pip package:\n   ```Shell\n   .\u002Foss_scripts\u002Frun_build.sh\n   ```\n   After this step, there should be a `*.whl` file in the current directory, with a name similar to `tensorflow_text-2.5.0rc0-cp38-cp38-linux_x86_64.whl`.\n1. Install the package into your environment:\n   ```Shell\n   pip install .\u002Ftensorflow_text-*-*-*-os_platform.whl\n   ```\n\n### Build or test using TensorFlow's SIG docker image:\n\n1.  Pull image from\n    [Tensorflow SIG docker builds](https:\u002F\u002Fhub.docker.com\u002Fr\u002Ftensorflow\u002Fbuild\u002Ftags).\n\n1.  Run a container based on the pulled image and create a bash session.\n    This can be done by running `docker run -it {image_name} bash`. \u003Cbr \u002F>\n    `{image_name}` can be any name with `{tf_version}-python{python_version}` format.\n    An example for Python 3.10 and TF version 2.10: `2.10-python3.10`.\n1.  Clone the TF-Text Github repository inside container:  `git clone https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext.git`. \u003Cbr \u002F>\n    Once cloned, change to the working directory using `cd text\u002F`.\n1.  Run the configuration script(s): `.\u002Foss_scripts\u002Fconfigure.sh` and `.\u002Foss_scripts\u002Fprepare_tf_dep.sh`. 
\u003Cbr \u002F>\n    This will update bazel and TF dependencies to installed tensorflow in the container.\n1.  To run the tests, use the bazel command: `bazel test --test_output=errors tensorflow_text:all`. This will run all the tests declared in the `BUILD` file. \u003Cbr \u002F>\n    To run a specific test, modify the above command replacing `:all` with the test name (for example `:fast_bert_normalizer`).\n    \n1.  Build the pip package\u002Fwheel: \\\n    `bazel build --config=release_cpu_linux\n    oss_scripts\u002Fpip_package:build_pip_package` \\\n    `.\u002Fbazel-bin\u002Foss_scripts\u002Fpip_package\u002Fbuild_pip_package\n    \u002F{wheel_dir}` \u003Cbr \u002F>\n\n    Once the build is complete, you should see the wheel available under\n    `{wheel_dir}` directory.\n","\u003Cdiv align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftensorflow_text_readme_f826ee9d7e46.png\" width=\"60%\">\u003Cbr>\u003Cbr>\n\u003C\u002Fdiv>\n\n-----------------\n\n[![PyPI version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftensorflow-text)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Ftensorflow-text)\n[![PyPI nightly version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fv\u002Ftensorflow-text-nightly?color=informational&label=pypi%20%40%20nightly)](https:\u002F\u002Fbadge.fury.io\u002Fpy\u002Ftensorflow-text-nightly)\n[![PyPI Python 
version](https:\u002F\u002Fimg.shields.io\u002Fpypi\u002Fpyversions\u002Ftensorflow-text)](https:\u002F\u002Fpypi.org\u002Fproject\u002Ftensorflow-text\u002F)\n[![Documentation](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fapi-reference-blue.svg)](https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext\u002Fblob\u002Fmaster\u002Fdocs\u002Fapi_docs\u002Fpython\u002Findex.md)\n[![Contributions\nwelcome](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002Fcontributions-welcome-brightgreen.svg)](CONTRIBUTING.md)\n[![License](https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-brightgreen.svg)](https:\u002F\u002Fopensource.org\u002Flicenses\u002FApache-2.0)\n\n\u003C!-- TODO(broken): 当徽章公开时取消注释。\n### 持续集成测试状态\n\n| 构建      | 状态 |\n| ---             | ---    |\n| **Linux**   | [![Status](https:\u002F\u002Fstorage.googleapis.com\u002Ftf-text-badges\u002Fubuntu-gpu-py3.svg)] |\n| **MacOS**   | [![Status](https:\u002F\u002Fstorage.googleapis.com\u002Ftf-text-badges\u002Fubuntu-gpu-py3.svg)] |\n| **Windows**   | [![Status](https:\u002F\u002Fstorage.googleapis.com\u002Ftf-text-badges\u002Fubuntu-gpu-py3.svg)] |\n-->\n\n# TensorFlow Text - TensorFlow 中的文本处理\n\n**重要提示**：使用 `pip install` 安装 TF Text（TensorFlow Text）时，请注意您正在运行的 TensorFlow 版本，因为您应该指定相应次要版本的 TF Text（例如，对于 tensorflow==2.3.x，请使用 tensorflow_text==2.3.x）。\n\n## 目录\n* [简介](#introduction)\n* [文档](#documentation)\n* [Unicode](#unicode)\n* [规范化](#normalization)\n* [分词](#tokenization)\n  * [空白字符分词器](#whitespacetokenizer)\n  * [Unicode 脚本分词器](#unicodescripttokenizer)\n  * [Unicode 分割](#unicode-split)\n  * [偏移量](#offsets)\n  * [TF.Data 示例](#tfdata-example)\n  * [Keras API](#keras-api)\n* [其他文本操作](#other-text-ops)\n  * [单词形状](#wordshape)\n  * [N-gram 与滑动窗口](#n-grams--sliding-window)\n* [安装](#installation)\n  * [使用 PIP 安装](#install-using-pip)\n  * [从源代码构建步骤：](#build-from-source-steps)\n\n## 简介\n\nTensorFlow Text 提供了一系列与文本相关的类和操作 (ops)，可直接用于 TensorFlow 2.0。该库可以执行基于文本的模型通常所需的预处理，并包含核心 TensorFlow 
未提供的对序列建模有用的其他功能。\n\n在文本预处理中使用这些操作的好处是，它们是在 TensorFlow 计算图内完成的。您无需担心训练时的分词 (tokenization) 与推理时的分词不同，也无需管理预处理脚本。\n\n## 文档\n\n请访问 [http:\u002F\u002Ftensorflow.org\u002Ftext](http:\u002F\u002Ftensorflow.org\u002Ftext) 查看所有文档。该网站包括 API 文档、关于如何使用 TensorFlow Text 的指南，以及构建特定模型的教程。\n\n## Unicode\n\n大多数操作 (ops) 期望字符串为 UTF-8 格式。如果您使用不同的编码，可以使用核心 TensorFlow 的转码操作 (transcode op) 将其转码为 UTF-8。如果您的输入可能无效，您也可以使用相同的操作将您的字符串强制转换为结构上有效的 UTF-8。\n\n```python\ndocs = tf.constant([u'Everything not saved will be lost.'.encode('UTF-16-BE'),\n                    u'Sad☹'.encode('UTF-16-BE')])\nutf8_docs = tf.strings.unicode_transcode(docs, input_encoding='UTF-16-BE',\n                                         output_encoding='UTF-8')\n```\n\n## 规范化\n\n在处理不同来源的文本时，确保相同的单词被识别为相同非常重要。Unicode 中大小写不敏感匹配的常用技术是大小写折叠 (case folding)（类似于小写化 (lower-casing)）。（注意，大小写折叠内部应用了 NFKC 规范化。）\n\n我们还提供了 Unicode 规范化操作，用于将字符串转换为字符的标准表示形式，默认情况下为规范化形式 KC（[NFKC](http:\u002F\u002Funicode.org\u002Freports\u002Ftr15\u002F)）。\n\n```python\nprint(text.case_fold_utf8(['Everything not saved will be lost.']))\nprint(text.normalize_utf8(['Äffin']))\nprint(text.normalize_utf8(['Äffin'], 'nfkd'))\n```\n\n```sh\ntf.Tensor(['everything not saved will be lost.'], shape=(1,), dtype=string)\ntf.Tensor(['\\xc3\\x84ffin'], shape=(1,), dtype=string)\ntf.Tensor(['A\\xcc\\x88ffin'], shape=(1,), dtype=string)\n```\n\n## 分词\n\n分词 (Tokenization) 是将字符串拆分为标记 (tokens) 的过程。通常，这些标记是单词、数字和\u002F或标点符号。\n\n主要接口是 ``Tokenizer`` 和 ``TokenizerWithOffsets``，它们分别只有一个方法 ``tokenize`` 和 ``tokenize_with_offsets``。目前有多种实现的分词器可用。每一个都实现了 ``TokenizerWithOffsets``（它扩展了 ``Tokenizer``），其中包括获取原始字符串字节偏移量的选项。这使得调用者能够知道创建该标记的原始字符串中的字节。\n\n所有分词器都返回 RaggedTensors（不规则张量），其中标记的最内维映射到原始的各个字符串。因此，结果形状的秩 (rank) 增加了一。如果您不熟悉它们，请查阅 ragged tensor 指南。https:\u002F\u002Fwww.tensorflow.org\u002Fguide\u002Fragged_tensor\n\n### WhitespaceTokenizer\n\n这是一个基本的分词器，它在 ICU (International Components for Unicode) 定义的空白字符（如空格、制表符、换行）处拆分 UTF-8 字符串。\n\n```python\ntokenizer = 
text.WhitespaceTokenizer()\ntokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\n```\n\n```sh\n[['everything', 'not', 'saved', 'will', 'be', 'lost.'], ['Sad\\xe2\\x98\\xb9']]\n```\n\n### UnicodeScriptTokenizer\n\n此分词器根据 Unicode 脚本边界拆分 UTF-8 字符串。使用的脚本代码对应于 ICU（International Components for Unicode）的 UScriptCode 值。参见：http:\u002F\u002Ficu-project.org\u002Fapiref\u002Ficu4c\u002Fuscript_8h.html\n\n在实践中，这与 ``WhitespaceTokenizer`` 类似，最明显的区别在于它会区分标点符号（USCRIPT_COMMON）和语言文本（例如 USCRIPT_LATIN, USCRIPT_CYRILLIC 等），同时也会将不同的语言文本彼此分开。\n\n```python\ntokenizer = text.UnicodeScriptTokenizer()\ntokens = tokenizer.tokenize(['everything not saved will be lost.',\n                             u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\n```\n\n```sh\n[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],\n ['Sad', '\\xe2\\x98\\xb9']]\n```\n\n### Unicode 分割\n\n在对没有空格的语言进行分词以分割单词时，通常只需按字符分割，这可以使用核心中的 [unicode_split](https:\u002F\u002Fwww.tensorflow.org\u002Fapi_docs\u002Fpython\u002Ftf\u002Fstrings\u002Funicode_split) 操作符 (op) 来实现。\n\n```python\ntokens = tf.strings.unicode_split([u\"仅今年前\".encode('UTF-8')], 'UTF-8')\nprint(tokens.to_list())\n```\n\n```sh\n[['\\xe4\\xbb\\x85', '\\xe4\\xbb\\x8a', '\\xe5\\xb9\\xb4', '\\xe5\\x89\\x8d']]\n```\n\n### 偏移量\n\n在对字符串进行分词时，通常希望知道每个 token 源自原始字符串的哪个位置。因此，每个实现了 `TokenizerWithOffsets`（带偏移量的分词器）的分词器都有一个 *tokenize_with_offsets* 方法，该方法将返回字节偏移量 (byte offsets) 以及 tokens。`start_offsets` 列出每个 token 在原始字符串中开始的字节（包含），而 `end_offsets` 列出每个 token 结束的字节（不包含，即 token 之后的第一个字节）。\n\n```python\ntokenizer = text.UnicodeScriptTokenizer()\n(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(\n    ['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\nprint(start_offsets.to_list())\nprint(end_offsets.to_list())\n```\n\n```sh\n[['everything', 'not', 'saved', 'will', 'be', 'lost', '.'],\n ['Sad', '\\xe2\\x98\\xb9']]\n[[0, 11, 15, 21, 26, 29, 33], [0, 3]]\n[[10, 
14, 20, 25, 28, 33, 34], [3, 6]]\n```\n\n### TF.Data 示例\n\n分词器与 tf.data API 配合使用效果符合预期。下面提供了一个简单的示例。\n\n```python\ndocs = tf.data.Dataset.from_tensor_slices([['Never tell me the odds.'],\n                                           [\"It's a trap!\"]])\ntokenizer = text.WhitespaceTokenizer()\ntokenized_docs = docs.map(lambda x: tokenizer.tokenize(x))\n# TF 2.x 中数据集在 eager 模式下可直接迭代。\niterator = iter(tokenized_docs)\nprint(next(iterator).to_list())\nprint(next(iterator).to_list())\n```\n\n```sh\n[['Never', 'tell', 'me', 'the', 'odds.']]\n[[\"It's\", 'a', 'trap!']]\n```\n\n### Keras API\n\n当您使用不同的分词器 (tokenizer) 和操作符 (ops) 来预处理数据时，生成的输出是 Ragged Tensors（不规则张量）。Keras API 现在使得使用 Ragged Tensors 训练模型变得容易，无需担心数据的填充或掩码处理，您可以使用自动处理所有这些的 ToDense 层，或者依赖 Keras 内置层对原生处理不规则数据的支持。\n\n```python\nmodel = tf.keras.Sequential([\n  tf.keras.layers.InputLayer(input_shape=(None,), dtype='int32', ragged=True),\n  text.keras.layers.ToDense(pad_value=0, mask=True),\n  tf.keras.layers.Embedding(100, 16),\n  tf.keras.layers.LSTM(32),\n  tf.keras.layers.Dense(32, activation='relu'),\n  tf.keras.layers.Dense(1, activation='sigmoid')\n])\n```\n\n## 其他文本操作符\n\nTF.Text 包封装了其他有用的预处理操作符。下面我们将回顾几个。\n\n### Wordshape 特征\n\n某些自然语言理解模型中常用的一种特征是查看文本字符串是否具有某种属性。例如，句子分割模型可能包含检查单词大小写或标点符号是否在字符串末尾的特征。\n\nWordshape 定义了一系列基于正则表达式的有用辅助函数，用于匹配输入文本中的各种相关模式。以下是一些示例。\n\n```python\ntokenizer = text.WhitespaceTokenizer()\ntokens = tokenizer.tokenize(['Everything not saved will be lost.',\n                             u'Sad☹'.encode('UTF-8')])\n\n# Is capitalized?\nf1 = text.wordshape(tokens, text.WordShape.HAS_TITLE_CASE)\n# Are all letters uppercased?\nf2 = text.wordshape(tokens, text.WordShape.IS_UPPERCASE)\n# Does the token contain punctuation?\nf3 = text.wordshape(tokens, text.WordShape.HAS_SOME_PUNCT_OR_SYMBOL)\n# Is the token a number?\nf4 = text.wordshape(tokens, text.WordShape.IS_NUMERIC_VALUE)\n\nprint(f1.to_list())\nprint(f2.to_list())\nprint(f3.to_list())\nprint(f4.to_list())\n```\n\n```sh\n[[True, False, False, 
False, False, False], [True]]\n[[False, False, False, False, False, False], [False]]\n[[False, False, False, False, False, True], [True]]\n[[False, False, False, False, False, False], [False]]\n```\n\n### N-gram 与滑动窗口\n\n给定大小为 *n* 的滑动窗口，N-grams 是连续的单词序列。在组合 tokens 时，支持三种归约机制。对于文本，您应该使用 `Reduction.STRING_JOIN`，它将字符串彼此连接。默认分隔符字符是空格，但可以通过 `string_separator` 参数更改。\n\n另外两种归约方法通常与数值一起使用，它们是 `Reduction.SUM` 和 `Reduction.MEAN`。\n\n```python\ntokenizer = text.WhitespaceTokenizer()\ntokens = tokenizer.tokenize(['Everything not saved will be lost.',\n                             u'Sad☹'.encode('UTF-8')])\n\n# Ngrams, in this case bi-gram (n = 2)\nbigrams = text.ngrams(tokens, 2, reduction_type=text.Reduction.STRING_JOIN)\n\nprint(bigrams.to_list())\n```\n\n```sh\n[['Everything not', 'not saved', 'saved will', 'will be', 'be lost.'], []]\n```\n\n## 安装\n\n### 使用 PIP 安装\n\n当使用 `pip install` 安装 TF Text 时，请注意您正在运行的 TensorFlow 版本，因为您应该指定相应版本的 TF Text。例如，如果您使用的是 TF 2.0，请安装 TF Text 的 2.0 版本；如果您使用的是 TF 1.15，请安装 TF Text 的 1.15 版本。\n\n```bash\npip install -U tensorflow-text==\u003Cversion>\n```\n\n### 关于不同操作系统包的说明\n\n2.10 版本之后，我们将仅提供适用于 Linux x86_64 和基于 Intel 的 Mac 的 pip 包。TF Text 一直利用核心 TensorFlow 包的发布基础设施，以更轻松地维护兼容版本并减少维护工作，使团队能够专注于 TF Text 本身以及对 TensorFlow 生态系统其他部分的贡献。\n\n对于其他系统，如 Windows、Aarch64 和 Apple Silicon Mac，TensorFlow 依赖于 [构建协作者](https:\u002F\u002Fblog.tensorflow.org\u002F2022\u002F09\u002Fannouncing-tensorflow-official-build-collaborators.html)，因此我们将不为它们提供包。不过，我们将继续接受 PRs（拉取请求），以使为这些操作系统构建变得更加容易，并尝试指向相关的社区努力。\n\n### 从源码构建步骤：\n\n请注意，TF Text 需要在与 TensorFlow 相同的环境中构建。因此，\n如果您手动构建 TF Text，强烈建议您同时构建\nTensorFlow。\n\n如果在 MacOS 上构建，您必须安装 coreutils。使用 Homebrew\n可能是最简单的方法。\n\n1. [构建并安装 TensorFlow](https:\u002F\u002Fwww.tensorflow.org\u002Finstall\u002Fsource)。\n1. 克隆 TF Text 仓库：\n   ```Shell\n   git clone https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext.git\n   cd text\n   ```\n1. 
运行构建脚本以创建 pip 包：\n   ```Shell\n   .\u002Foss_scripts\u002Frun_build.sh\n   ```\n   完成此步骤后，当前目录中应该会有一个 `*.whl` 文件。文件名类似于 `tensorflow_text-2.5.0rc0-cp38-cp38-linux_x86_64.whl`。\n1. 将包安装到环境中：\n   ```Shell\n   pip install .\u002Ftensorflow_text-*-*-*-os_platform.whl\n   ```\n\n### 使用 TensorFlow 的 SIG（特别兴趣组）Docker（容器引擎）镜像构建或测试：\n\n1.  从\n    [TensorFlow SIG Docker 构建](https:\u002F\u002Fhub.docker.com\u002Fr\u002Ftensorflow\u002Fbuild\u002Ftags)\n    拉取镜像。\n\n1.  基于拉取的镜像运行一个容器并创建一个 bash 会话。\n    这可以通过运行 `docker run -it {image_name} bash` 来完成。 \u003Cbr \u002F>\n    `{image_name}` 可以是任何符合 `{tf_version}-python{python_version}` 格式的字符串。\n    Python 3.10 和 TF 版本 2.10 的一个示例为：`2.10-python3.10`。\n1.  在容器内克隆 TF-Text Github 仓库：`git clone https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext.git`。 \u003Cbr \u002F>\n    克隆完成后，使用 `cd text\u002F` 切换到工作目录。\n1.  运行配置脚本：`.\u002Foss_scripts\u002Fconfigure.sh` 和 `.\u002Foss_scripts\u002Fprepare_tf_dep.sh`。 \u003Cbr \u002F>\n    这将更新 Bazel 和 TF 依赖项以匹配容器中安装的 TensorFlow。\n1.  要运行测试，请使用 Bazel 命令：`bazel test --test_output=errors tensorflow_text:all`。这将运行 `BUILD` 文件中声明的所有测试。 \u003Cbr \u002F>\n    要运行特定测试，请修改上述命令，将 `:all` 替换为测试名称（例如 `:fast_bert_normalizer`）。\n    \n1.  
构建 pip 包\u002Fwheel： \\\n    `bazel build --config=release_cpu_linux\n    oss_scripts\u002Fpip_package:build_pip_package` \\\n    `.\u002Fbazel-bin\u002Foss_scripts\u002Fpip_package\u002Fbuild_pip_package\n    \u002F{wheel_dir}` \u003Cbr \u002F>\n\n    构建完成后，您应该在\n    `{wheel_dir}` 目录下看到可用的 wheel。","# TensorFlow Text 快速上手指南\n\n**TensorFlow Text** 是 TensorFlow 官方提供的文本处理库，提供了一系列预制的文本类和操作（Ops），可直接用于 TensorFlow 2.0 及更高版本的模型中。它支持在 TensorFlow 图内完成分词、归一化等预处理任务，确保训练与推理阶段的一致性。\n\n## 环境准备\n\n在使用本工具前，请确保满足以下系统要求：\n\n*   **操作系统**: Linux, MacOS, Windows\n*   **Python 版本**: 3.6 及以上\n*   **核心依赖**: **TensorFlow**。\n    > **重要提示**：`tensorflow-text` 的版本必须与已安装的 `tensorflow` 版本严格对应。例如，若使用 `tensorflow==2.3.x`，则需安装 `tensorflow-text==2.3.x`。\n\n## 安装步骤\n\n推荐使用 `pip` 进行安装。请根据您当前 TensorFlow 的版本号指定对应的 `tensorflow-text` 版本。\n\n```bash\npip install tensorflow-text==\u003Ctensorflow_version>\n```\n\n*示例：*\n```bash\npip install tensorflow-text==2.10.0\n```\n\n> **开发者提示**：国内用户若遇到下载缓慢，可配置国内镜像源加速安装，例如：\n> ```bash\n> pip install tensorflow-text==2.10.0 -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 基本使用\n\n### 1. 基础分词 (Tokenization)\n\n引入库后，可以使用内置的 Tokenizer 对字符串进行分割。以下示例展示了如何使用 `WhitespaceTokenizer` 进行空格分词。\n\n```python\nimport tensorflow as tf\nimport tensorflow_text as text\n\ntokenizer = text.WhitespaceTokenizer()\ntokens = tokenizer.tokenize(['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\n```\n\n### 2. 获取偏移量 (Offsets)\n\n部分分词器支持返回 Token 在原字符串中的字节偏移量，便于后续定位。\n\n```python\ntokenizer = text.UnicodeScriptTokenizer()\n(tokens, start_offsets, end_offsets) = tokenizer.tokenize_with_offsets(\n    ['everything not saved will be lost.', u'Sad☹'.encode('UTF-8')])\nprint(tokens.to_list())\nprint(start_offsets.to_list())\nprint(end_offsets.to_list())\n```\n\n### 3. 
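理解字节偏移量（纯 Python 演示）

偏移量以字节为单位，基于 UTF-8 编码的原始字符串计数。下面的纯 Python 片段演示了如何用一组偏移量从原文中取回对应的 token（示例中的偏移量为手工给出的假设值，仅用于说明切片逻辑，并非某个分词器的实际输出）：

```python
# 示意：用字节偏移量从 UTF-8 编码的原文中切出 token（偏移量为假设值）
sentence = 'everything not saved will be lost.'
data = sentence.encode('UTF-8')

# 每个 token 的起始 / 结束字节位置
start_offsets = [0, 11, 15, 21, 26, 29]
end_offsets = [10, 14, 20, 25, 28, 34]

tokens = [data[s:e].decode('UTF-8')
          for s, e in zip(start_offsets, end_offsets)]
print(tokens)
# ['everything', 'not', 'saved', 'will', 'be', 'lost.']
```

由于偏移量按字节而非字符计数，对多字节字符（如 ☹）同样适用，这也是分词器返回字节偏移量便于精确定位的原因。

### 4. 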
Keras 集成\n\nTensorFlow Text 支持与 Keras 无缝集成，可直接处理 Ragged Tensors（不规则张量），无需手动填充或掩码。\n\n```python\nmodel = tf.keras.Sequential([\n  tf.keras.layers.InputLayer(input_shape=(None,), dtype='int32', ragged=True),\n  text.keras.layers.ToDense(pad_value=0, mask=True),\n  tf.keras.layers.Embedding(100, 16),\n  tf.keras.layers.LSTM(32),\n  tf.keras.layers.Dense(32, activation='relu'),\n  tf.keras.layers.Dense(1, activation='sigmoid')\n])\n```\n\n更多详细文档和 API 说明，请访问 [http:\u002F\u002Ftensorflow.org\u002Ftext](http:\u002F\u002Ftensorflow.org\u002Ftext)。","某电商平台技术团队正在开发基于 TensorFlow 的智能客服意图识别系统，需实时处理海量多语言用户咨询文本。\n\n### 没有 text 时\n- 分词逻辑分散在独立的 Python 预处理脚本中，训练集与线上推理的分词规则偶尔不一致，导致模型准确率大幅下降。\n- 面对多语言混合输入，需要手动编写代码处理 UTF-8 编码转换和特殊符号清洗，维护成本极高且极易引入 Bug。\n- 文本预处理步骤无法融入 TensorFlow 计算图，部署时需额外加载外部预处理模块，显著增加了服务延迟和运维复杂度。\n- 不同运行环境下的 Unicode 规范化行为存在细微差异，难以保证模型在生产环境中输出结果的长期稳定性。\n\n### 使用 text 后\n- 利用 TensorFlow Text 提供的原生算子，将分词、归一化直接嵌入模型计算图，彻底消除了训练与推理的逻辑差异。\n- 内置强大的 Unicode 处理功能，自动完成大小写折叠和结构验证，开发者无需再编写繁琐的额外清洗代码。\n- 支持通过 tf.data 管道高效串联预处理步骤，大幅提升了大规模文本数据集的加载速度与流水线执行效率。\n- 统一了所有环境下的文本解析标准，确保了模型在不同服务器部署时表现始终如一，极大降低了测试验证成本。\n\nTensorFlow Text 让文本处理成为计算图的第一公民，实现了从数据清洗到模型预测的全链路一致性与高效性。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Ftensorflow_text_62c015cc.png","tensorflow","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Ftensorflow_07ed5093.png","",null,"github-admin@tensorflow.org","http:\u002F\u002Fwww.tensorflow.org","https:\u002F\u002Fgithub.com\u002Ftensorflow",[84,88,92,96,100,103],{"name":85,"color":86,"percentage":87},"C++","#f34b7d",50.3,{"name":89,"color":90,"percentage":91},"Python","#3572A5",44.2,{"name":93,"color":94,"percentage":95},"Starlark","#76d275",5.1,{"name":97,"color":98,"percentage":99},"Shell","#89e051",0.4,{"name":101,"color":79,"percentage":102},"Linker Script",0,{"name":104,"color":105,"percentage":102},"PureBasic","#5a6986",1285,373,"2026-04-03T15:44:36","Apache-2.0","Linux, macOS, Windows","未说明",{"notes":113,"python":111,"dependencies":114},"TF Text 版本必须与 
TensorFlow 版本严格对应（如 TF 2.3.x 对应 TF Text 2.3.x）；输入文本建议为 UTF-8 编码；部分分词功能依赖 ICU 系统库；模型训练输出涉及 Ragged Tensors 数据结构。",[115],"tensorflow (版本需严格匹配)",[15,34],6,"2026-03-27T02:49:30.150509","2026-04-06T06:54:42.220735",[121,126,131,136,141,145],{"id":122,"question_zh":123,"answer_zh":124,"source_url":125},1389,"如何在 Windows 系统上安装 tensorflow-text？","目前官方尚未提供通用的 Windows 预编译二进制文件。根据社区经验，使用 `tensorflow-cpu==2.4.0` 配合标准 Python 发行版可以解决安装问题。验证步骤如下：\n1. 创建虚拟环境：`py -3.8 -m venv tf-text`\n2. 激活环境并安装：`pip install tensorflow-text==2.4.0rc1; pip install tensorflow-cpu==2.4.0 \"numpy\u003C1.19.4\"`\n3. 导入测试：`import tensorflow_text as text`","https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext\u002Fissues\u002F44",{"id":127,"question_zh":128,"answer_zh":129,"source_url":130},1390,"pip 安装 tensorflow-text 时提示找不到匹配的分发版本怎么办？","这通常发生在 M1 Mac 或其他特定架构上，因为目前官方暂未提供此类平台的预构建包。如果遇到 'No matching distribution found' 错误，建议尝试从源代码自行构建并使用。同时请确认您的 Python 和 TensorFlow 版本是否处于受支持的范围，避免使用过旧的 Python 版本（如 3.5）。","https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext\u002Fissues\u002F89",{"id":132,"question_zh":133,"answer_zh":134,"source_url":135},1391,"运行时报错 'Loaded runtime CuDNN library: 8.0.5 but source was compiled with: 8.1.0' 如何解决？","这是 CuDNN 库版本不匹配导致的。解决方法取决于您的环境：\n1. Conda 环境：尝试运行 `conda uninstall cudnn` 移除冲突的库。\n2. 
本地或 Docker 环境：前往 NVIDIA 官网下载与编译版本匹配的 CuDNN 文件（如 8.1.0），并按照 NVIDIA 安装指南进行安装。如果是 Docker，可将已安装的 tar 文件复制到容器中执行安装步骤。","https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext\u002Fissues\u002F554",{"id":137,"question_zh":138,"answer_zh":139,"source_url":140},1392,"tensorflow-text 与哪些版本的 TensorFlow 兼容？","为了减少依赖冲突，建议保持主要组件版本一致。已知兼容的版本组合示例如下：\n- tensorflow-text==2.3\n- tensorflow==2.3\n- keras==2.3\n- tensorflow-hub==0.12\n使用不匹配的大版本可能导致加载错误，请优先参考官方发布的版本对应表。","https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext\u002Fissues\u002F160",{"id":142,"question_zh":143,"answer_zh":144,"source_url":140},1393,"运行单元测试时无法加载 '_text_similarity_metric_ops.so' 文件怎么办？","这通常与环境依赖或构建步骤缺失有关。维护者建议使用项目提供的 Docker 镜像来运行测试，这些镜像与 TensorFlow 官方提供的镜像一致，可避免本地环境配置差异。具体使用说明可在项目主页 README 文件的底部找到。",{"id":146,"question_zh":147,"answer_zh":148,"source_url":149},1394,"在 Docker 环境中使用 tensorflow-text 出现 'undefined symbol' 错误如何处理？","该错误常见于 TensorFlow 版本不匹配的情况。部分用户反馈在使用较新版本的 TensorFlow（如 2.11、2.12）时会遇到此问题，而旧版本（如 2.5）表现正常。建议检查 Docker 镜像中的 TensorFlow 版本是否与安装的 tensorflow-text 版本严格对应，或者尝试将 TensorFlow 降级至稳定版本（如 2.5）进行测试。","https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext\u002Fissues\u002F385",[151,156,161,166,171,176,181,186,191,196,201,206,211,216,221,226,231,236,241,246],{"id":152,"version":153,"summary_zh":154,"released_at":155},100880,"v2.20.1","# Release 2.20.1\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Fixed a memory safety bug in FastWordpieceTokenizer concerning StringVocab lifetime. This prevents temporary copies that were previously invalidating std::string_view references to internal vocabulary data, ensuring memory stability during tokenization.\r\n* Update Version for 2.20.1\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions from many people at Google, as well as:\r\n\r\nC. 
Antonio Sánchez","2026-03-10T18:19:40",{"id":157,"version":158,"summary_zh":159,"released_at":160},100881,"v2.20.0","# Release 2.20.0\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Fix nightly builds.\r\n* iterate over Unicode White_Space directly, rather than testing each of 1.1M code points\r\n* Change input size from  int16_t to int32_t to support large inputs.\r\n* Ignore SentencePiece::BYTE during encoding instead of throwing error\r\n* Fix TF-text for TF 2.20.\r\n* Update requirements with latest tensorflow-metadata.\r\n* Fix configure to pull bazelrc from TF 2.20.\r\n* update sha256 for tensorflow 2.20.0\r\n* Revert \"update sha256 for tensorflow 2.20.0\"\r\n* Change strip_prefix logic to grab sha256 properly\r\n* Limit array-record dependency to Linux x86_64 and aarch64.\r\n* Pin array-record to 0.4.1 for darwin platform\r\n* Adds build_bazel_apple_support for newer macOS builds","2026-01-28T22:34:44",{"id":162,"version":163,"summary_zh":164,"released_at":165},100882,"v2.19.0","## Bug Fixes and Other Changes\r\n\r\n* Add docker build scripts and enable aarch64 pip wheels.\r\n* Regenerate stubs with Mypy 1.13.0\r\n* Handle the punctuation definition mismatch between different Unicode versions.\r\n* Replace outdated select() on --cpu in core\u002Fkernels\u002Fsentencepiece\u002FBUILD with platform API equivalent.\r\n* cleanup of deprecated test methods\r\n* Update tensorflow version.\r\n* Limit dm-tree to 0.1.8.\r\n* Remove `srcs_version` and `python_version` attributes, as they already default to `\"PY3\"`\r\n* Remove invalid public_names_test\r\n* Fix protobuf dependency.\r\n* Update test files for new ICU version\r\n* Init `python_wheel_version_suffix_repository` for TF dependency in tf-text project after https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftensorflow\u002Fcommit\u002F805775fcb5f9272e4c52dce751b00cf7f70364f2.\r\n* Add numpy py dep and fix python dependencies.\r\n* Add explicit tf-keras dependency.\r\n* Update Version for 2.19.0-rc0\r\n* 
Revert \"Init `python_wheel_version_suffix_repository` for TF dependency in tf-text project after https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftensorflow\u002Fcommit\u002F805775fcb5f9272e4c52dce751b00cf7f70364f2.\"\r\n* Refresh requirement lockfiles\r\n* add @absl_py\u002F\u002Fabsl:app dep to tensorflow_build_info\r\n* Update Version for 2.19.0\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions from many people at Google, as well as:\r\n\r\nC. Antonio Sánchez","2025-04-04T21:04:24",{"id":167,"version":168,"summary_zh":169,"released_at":170},100883,"v2.19.0-rc0","## Bug Fixes and Other Changes\r\n\r\n* Add docker build scripts and enable aarch64 pip wheels.\r\n* Regenerate stubs with Mypy 1.13.0\r\n* Handle the punctuation definition mismatch between different Unicode versions.\r\n* Replace outdated select() on --cpu in core\u002Fkernels\u002Fsentencepiece\u002FBUILD with platform API equivalent.\r\n* cleanup of deprecated test methods\r\n* Update tensorflow version.\r\n* Limit dm-tree to 0.1.8.\r\n* Remove `srcs_version` and `python_version` attributes, as they already default to `\"PY3\"`\r\n* Remove invalid public_names_test\r\n* Fix protobuf dependency.\r\n* Update test files for new ICU version\r\n* Init `python_wheel_version_suffix_repository` for TF dependency in tf-text project after https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftensorflow\u002Fcommit\u002F805775fcb5f9272e4c52dce751b00cf7f70364f2.\r\n* Add numpy py dep and fix python dependencies.\r\n* Add explicit tf-keras dependency.\r\n* Update Version for 2.19.0-rc0\r\n* Revert \"Init `python_wheel_version_suffix_repository` for TF dependency in tf-text project after https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftensorflow\u002Fcommit\u002F805775fcb5f9272e4c52dce751b00cf7f70364f2.\"\r\n* Refresh requirement lockfiles\r\n* add @absl_py\u002F\u002Fabsl:app dep to tensorflow_build_info\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions 
from many people at Google, as well as:\r\n\r\nC. Antonio Sánchez","2025-04-04T18:45:28",{"id":172,"version":173,"summary_zh":174,"released_at":175},100884,"v2.18.1","## Major Features and Improvements\r\n\r\n* Add Python 3.12 and aarch64 support\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Add docker build scripts and enable aarch64 pip wheels.\r\n* Replace std::string_view with absl::string_view\r\n* Protect the fast wordpiece tokenizer from infinite looping.\r\n* Update lock files\r\n* Update version for 2.18.1\r\n* Set HERMETIC_PYTHON_VERSION correctly\r\n","2024-12-17T19:04:40",{"id":177,"version":178,"summary_zh":179,"released_at":180},100885,"v2.18.0","# Release 2.18.0\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Fix out-of-bounds read in whitespace tokenizer\r\n* Add unit test for fixed bounds check in IsWhitespace\r\n* Add Hermetic CUDA rules.\r\n* Remove tf\u002Flite\u002Fkernels\u002Fshim:tf_headers from tf\u002Fcore:framework\r\n* Update version\r\n* Update configure.sh","2024-10-28T21:49:45",{"id":182,"version":183,"summary_zh":184,"released_at":185},100886,"v2.18.0-rc0","# Release 2.18.0-rc0\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Fix out-of-bounds read in whitespace tokenizer\r\n* Add unit test for fixed bounds check in IsWhitespace\r\n* Add Hermetic CUDA rules.\r\n* Remove tf\u002Flite\u002Fkernels\u002Fshim:tf_headers from tf\u002Fcore:framework\r\n* Update version\r\n* Update configure.sh","2024-09-30T23:37:10",{"id":187,"version":188,"summary_zh":189,"released_at":190},100887,"v2.17.0","# Release 2.17.0\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* negative sampling excludes positive class\r\n* revert html encoding\r\n* much faster set-intersection based version\r\n* Fix notebook failure with Keras 3.\r\n* Remove `tensorflow-macos` from setup.py\r\n* Update tensorflow-macos to 2.16.1\r\n* Update version\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions from many people at Google, as well as:\r\n\r\nAlex 
Shroyer, C. Antonio Sánchez, Maggie Zhang\r\n","2024-07-15T22:42:38",{"id":192,"version":193,"summary_zh":194,"released_at":195},100888,"v2.17.0-rc0","# Release 2.17.0-rc0\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* negative sampling excludes positive class\r\n* revert html encoding\r\n* much faster set-intersection based version\r\n* Fix notebook failure with Keras 3.\r\n* Remove `tensorflow-macos` from setup.py\r\n* Update tensorflow-macos to 2.16.1\r\n* Update version\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions from many people at Google, as well as:\r\n\r\nAlex Shroyer, C. Antonio Sánchez, Maggie Zhang\r\n","2024-06-25T18:45:24",{"id":197,"version":198,"summary_zh":199,"released_at":200},100889,"v2.16.1","# Release 2.16.1\r\n\r\n## Major Features and Improvements\r\n\r\n## Breaking Changes\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Update tf-text setup scripts.\r\n* Support resource manager scoped Sentencepiece resources.\r\n* Remove `use_unique_shared_resource_name`.\r\n* Remove tensorflow_text dependency on tf_hub library.\r\n* Fix TF patch, bump TF commit\r\n* Update version\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions from many people at Google, as well as:\r\n\r\nRaviteja Gorijala","2024-03-11T20:48:59",{"id":202,"version":203,"summary_zh":204,"released_at":205},100890,"v2.16.0-rc0","# Release 2.16.0-rc0\r\n\r\n## Major Features and Improvements\r\n\r\n## Breaking Changes\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Update tf-text setup scripts.\r\n* Support resource manager scoped Sentencepiece resources.\r\n* Remove `use_unique_shared_resource_name`.\r\n* Remove tensorflow_text dependency on tf_hub library.\r\n* Fix TF patch, bump TF commit\r\n* Update version\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions from many people at Google, as well as:\r\n\r\nRaviteja 
Gorijala","2024-02-28T06:51:28",{"id":207,"version":208,"summary_zh":209,"released_at":210},100891,"v2.15.0","## Release 2.15.0\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Update TF versions and scripts to allow consistently building against tf-nightly.\r\n* No public description\r\n* Update phrase tokenizer to handle end-punctuation.\r\n* Remove private Keras imports.\r\n* Update tensorflow_hub dependency.\r\n* Sprawling .pyi updates related to pybind11 PRs #4831, #4833.\r\n* Report unsupported tensor type in RaggedTensorToTensor in Prepare.\r\n* Check in generated pyi files for some py_extension targets.\r\n* Update version\r\n* Update WORKSPACE","2023-11-15T19:00:05",{"id":212,"version":213,"summary_zh":214,"released_at":215},100892,"v2.15.0-rc0","## Release 2.15.0-rc0\r\n\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Update TF versions and scripts to allow consistently building against tf-nightly.\r\n* No public description\r\n* Update phrase tokenizer to handle end-punctuation.\r\n* Remove private Keras imports.\r\n* Update tensorflow_hub dependency.\r\n* Sprawling .pyi updates related to pybind11 PRs #4831, #4833.\r\n* Report unsupported tensor type in RaggedTensorToTensor in Prepare.\r\n* Check in generated pyi files for some py_extension targets.\r\n* Update version\r\n* Update WORKSPACE\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions from many people at Google, as well as:\r\n\r\nnallave","2023-10-26T09:17:11",{"id":217,"version":218,"summary_zh":219,"released_at":220},100893,"v2.14.0","# Release 2.14.0\r\n\r\n\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Fix nullptr dereference issue in UnicodeScriptTokenizeWithOffsetOp.\r\n* Bump tensorflow_hub to 0.13.0\r\n* Add @tensorflow\u002Fdocs-team to CODEOWNERS\r\n* Internal change\r\n* Update TF Text API page to emphasize KerasNLP as the API of first choice.\r\n* Add a note about the implementation differences.\r\n* Fix out-of-bounds absl::string_view handling in 
RegexSplitImpl\r\n* Disable `coerce_to_valid_utf8_op_test` test on mac\r\n* Update \u002Ftext\u002Ftutorials and \u002Ftext\u002Fguide to highlight KerasNLP.\r\n* Move remaining text tutorials to text\u002F\r\n* Update \u002Ftext\u002Ftutorials and \u002Ftext\u002Fguide index pages to reflect updated navigation.\r\n* Update broken image links\r\n* Creates a patch to use non_hermetic python for tf text.\r\n* Check in generated pyi files for some py_extension targets.\r\n* Update ops.Tensor references to \u002F\u002Fthird_party\u002Ftensorflow\u002Fpython\u002Fframework\u002Ftensor.py.\r\n* Remove invalid stub file\r\n* Update tensorflow-text from 2.11 to 2.13\r\n* Check in generated pyi files for some py_extension targets.\r\n* Remove py38 classifiers in setup.py\r\n* Update version\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions from many people at Google, as well as:\r\n","2023-10-06T19:23:09",{"id":222,"version":223,"summary_zh":224,"released_at":225},100894,"v2.14.0-rc0","# Release 2.14.0-rc0\r\n\r\n\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Bump tensorflow_hub to 0.13.0\r\n* Add @tensorflow\u002Fdocs-team to CODEOWNERS\r\n* Internal change\r\n* Update TF Text API page to emphasize KerasNLP as the API of first choice.\r\n* Add a note about the implementation differences.\r\n* Fix out-of-bounds absl::string_view handling in RegexSplitImpl\r\n* Disable `coerce_to_valid_utf8_op_test` test on mac\r\n* Update \u002Ftext\u002Ftutorials and \u002Ftext\u002Fguide to highlight KerasNLP.\r\n* Move remaining text tutorials to text\u002F\r\n* Update \u002Ftext\u002Ftutorials and \u002Ftext\u002Fguide index pages to reflect updated navigation.\r\n* Update broken image links\r\n* Creates a patch to use non_hermetic python for tf text.\r\n* Check in generated pyi files for some py_extension targets.\r\n* Update ops.Tensor references to \u002F\u002Fthird_party\u002Ftensorflow\u002Fpython\u002Fframework\u002Ftensor.py.\r\n* Remove invalid 
stub file\r\n* Update tensorflow-text from 2.11 to 2.13\r\n* Check in generated pyi files for some py_extension targets.\r\n* Remove py38 classifiers in setup.py\r\n* Update version\r\n\r\n","2023-08-18T04:02:00",{"id":227,"version":228,"summary_zh":229,"released_at":230},100895,"v2.13.0","# Release 2.13.0\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Update word_embeddings.ipynb with Time Series as default dashboard.\r\n* Python ops for new RoundRobinTrimmer kernels.\r\n* Fix bug where rank 1 max sequence lengths were breaking round robin trimmer.\r\n* Fix roundrobintrimmer not being linked in correctly.\r\n* Move control_flow_ops.Assert into its own file, control_flow_assert.py, as a first step in removing circular dependencies with control_flow_ops.cond.\r\n* Move control_flow_ops.while_loop into its own file, while_loop.py.\r\n* Redirect references to stack and unstack to their new location in array_ops_stack.py.\r\n* Prevent crashes with new trimmer op when max_sequence_length is set to a negative value.\r\n* Fix run_build not getting tensorflow bazel version correctly. Also removed some \"set -x\" that were added for debugging.\r\n* Add RetVec-style UTF-8 binarization\r\n* Allow pad_model_inputs to work with Tensors as well.\r\n* Move usages of python\u002Futil:util to the newly split up targets.\r\n* (Generated change) Update tf.Text versions and\u002For docs.\r\n* Fix typo in Transformer tutorial\r\n* Pin protobuf version to prevent failure. 
See https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext\u002Fissues\u002F1115 for more info.\r\n* Avoid an expensive temporary std::string.\r\n* Callout the differences compared to the reference paper.\r\n* Remove decoding_api.ipynb from tf-text docs (this belongs to TF-Models)\r\n* Altered build scripts to use python3 before python.\r\n* Removes un-used tensorflow\u002Fcore\u002Fplatform:status dependency from round_robin_trimmer.\r\n* Remove usages of tsl::Status::error_message.\r\n* Use Github API to fetch full commit hash from short\r\n* Avoid using jq in prepare_tf_dep.sh since it breaks macos builds\r\n* Update version","2023-07-06T20:43:53",{"id":232,"version":233,"summary_zh":234,"released_at":235},100896,"v2.13.0-rc0","# Release 2.13.0-rc0\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Update word_embeddings.ipynb with Time Series as default dashboard.\r\n* Python ops for new RoundRobinTrimmer kernels.\r\n* Fix bug where rank 1 max sequence lengths were breaking round robin trimmer.\r\n* Fix roundrobintrimmer not being linked in correctly.\r\n* Move control_flow_ops.Assert into its own file, control_flow_assert.py, as a first step in removing circular dependencies with control_flow_ops.cond.\r\n* Move control_flow_ops.while_loop into its own file, while_loop.py.\r\n* Redirect references to stack and unstack to their new location in array_ops_stack.py.\r\n* Prevent crashes with new trimmer op when max_sequence_length is set to a negative value.\r\n* Fix run_build not getting tensorflow bazel version correctly. Also removed some \"set -x\" that were added for debugging.\r\n* Add RetVec-style UTF-8 binarization\r\n* Allow pad_model_inputs to work with Tensors as well.\r\n* Move usages of python\u002Futil:util to the newly split up targets.\r\n* (Generated change) Update tf.Text versions and\u002For docs.\r\n* Fix typo in Transformer tutorial\r\n* Pin protobuf version to prevent failure. 
See https:\u002F\u002Fgithub.com\u002Ftensorflow\u002Ftext\u002Fissues\u002F1115 for more info.\r\n* Avoid an expensive temporary std::string.\r\n* Callout the differences compared to the reference paper.\r\n* Remove decoding_api.ipynb from tf-text docs (this belongs to TF-Models)\r\n* Altered build scripts to use python3 before python.\r\n* Removes un-used tensorflow\u002Fcore\u002Fplatform:status dependency from round_robin_trimmer.\r\n* Remove usages of tsl::Status::error_message.\r\n* Use Github API to fetch full commit hash from short\r\n* Avoid using jq in prepare_tf_dep.sh since it breaks macos builds\r\n* Update version","2023-05-16T16:44:53",{"id":237,"version":238,"summary_zh":239,"released_at":240},100897,"v2.12.1","# Release 2.12.1\r\n\r\n## Major Features and Improvements\r\n\r\n## Breaking Changes\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Replace usage of the tsl::Status constructor with a tsl::{error, errors}::Code.\r\n* Replace usage of the tsl::Status(tsl::error::Code, ...) constructor.\r\n* Update version\r\n* Pin tensorflow-datasets version to 4.8.3\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions from many people at Google, as well as:\r\n\r\nRaviteja Gorijala\r\n","2023-04-12T22:52:51",{"id":242,"version":243,"summary_zh":244,"released_at":245},100898,"v2.12.0","# Release 2.12.0\r\n\r\n## Major Features and Improvements\r\n\r\n* New PhraseTokenizer.\r\n* New ByteSplitter.split_by_offsets which splits a string using byte offsets.\r\n* New `concatenate_segments` op.\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* Updated kernel code and Python API for BoiseTagsToOffsets op\r\n* Fix the bug that we should not re-build the config in the create function.\r\n* Register kernel and ops for phrase tokenizer.\r\n* fix the issue of conversion.\r\n* Fix typos in nmt_with_attention.ipynb\r\n* MacOS TF library was renamed. 
Update build configuration.\r\n* Update tokenization_layers_test.py\r\n* (Generated change) Update tf.Text versions and\u002For docs.\r\n* Update TF Text's TF Lite guide with ops that are convertible to TF Lite.\r\n* Update transformer test size.\r\n* Fix typos in uncertainty_quantification_with_sngp_bert.ipynb\r\n* (Generated change) Update tf.Text versions and\u002For docs.\r\n* Adds LastNItemSelector an ItemSelector that selects the last n items in the batch.\r\n* Temporarily remove tests for EOS offset since this is being changed in SP.\r\n* Update test files for new ICU version.\r\n* New helper function in the Op Kernel Shim for writing out data to the output tensors.\r\n* Adds configuration flags to enable switch to Fast Wordpiece Tokenizer implementation alternative for on device\r\n* New kernels to enable TF Lite conversion for SentenceFragmenterV2 op.\r\n* Fix possible heap overflow bug in sentence fragmenter op.\r\n* Deprecate PY37 support for TF-Text\r\n* Fix BUILD file by moving tf dep in the appropriate place for FBN to prevent conflicts when building on mobile.\r\n* Clean up a couple dependencies in the kernel BUILD file.\r\n* C++ API for new kernel for the RoundRobinTrimmer which fixes a bug and makes it available for conversion to TF Lite.\r\n* New kernels for the RoundRobinTrimmer which fixes a bug and makes it available for conversion to TF Lite.\r\n* Add two functions to implementations of the OpKernelShim for accessing the name & doc string. Accessing internals directly causes problems when trying to use techniques like Object composition as the op template. 
In particular, this change is needed for improvements to the polymorphic wrapper.\r\n* Allow int32 or int64 as types for RoundRobinTrimmer ops' splits.\r\n* Extend RoundRobinTrimmer kernels to allow any type as the value.\r\n* Return empty results if get_offsets is false.\r\n* Skip-uncompressing of bazel to try and locate error for mac ci tests.\r\n* Fix scraping full commit from short commit sha\r\n* Update tensorflow-text notebooks from 2.8 to 2.11\r\n* Fix bazel version scrapping logic for .bazelversion in install_bazel.sh\r\n* Fix conditional so it works better with Apple silicon. See issue #1077 for more details.\r\n* Force osname check to always be in lower-case. See #1077\r\n\r\n**Thanks to our Contributors**\r\nThis release contains contributions from many people at Google, as well as:\r\nsynandi, tilakrayal\r\n","2023-03-27T16:25:14",{"id":247,"version":248,"summary_zh":249,"released_at":250},100899,"v2.12.0-rc0","# Release 2.12.0-rc0\r\n\r\n## Bug Fixes and Other Changes\r\n\r\n* BOISE TF op:\r\n  * Add kernel code and Python API for `BoiseTagsTpOffsets` op\r\n* Other:\r\n  * Internal change\r\n  * Add model builder for phrase tokenizer.\r\n  * Fix the bug that we should not re-build the config in the create function.\r\n  * Register kernel and ops for phrase tokenizer.\r\n  * fix the issue of conversion.\r\n  * Fix typos in nmt_with_attention.ipynb\r\n  * Fix broken link in transformer.ipynb\r\n  * MacOS TF library was renamed. 
Update build configuration.\r\n  * Update tokenization_layers_test.py\r\n  * (Generated change) Update tf.Text versions and\u002For docs.\r\n  * Update TF Text's TF Lite guide with ops that are convertible to TF Lite.\r\n  * Update transformer test size.\r\n  * Fix typos in uncertainty_quantification_with_sngp_bert.ipynb\r\n  * (Generated change) Update tf.Text versions and\u002For docs.\r\n  * Adds LastNItemSelector an ItemSelector that selects the last n items in the batch.\r\n  * Add split_by_offsets method to ByteSplitter.\r\n  * Add split_by_offsets method to ByteSplitter.\r\n  * Temporarily remove tests for EOS offset since this is being changed in SP.\r\n  * Update test files for new ICU version.\r\n  * New helper function in the Op Kernel Shim for writing out data to the output tensors.\r\n  * Add split_by_offsets method to ByteSplitter.\r\n  * Adds configuration flags to enable switch to Fast Wordpiece Tokenizer implementation alternative for on device\r\n  * New kernels to enable TF Lite conversion for SentenceFragmenterV2 op.\r\n  * Fix possible heap overflow bug in sentence fragmenter op.\r\n  * Deprecate PY37 support for TF-Text\r\n  * Fix BUILD file by moving tf dep in the appropriate place for FBN to prevent conflicts when building on mobile.\r\n  * Clean up a couple dependencies in the kernel BUILD file.\r\n  * C++ API for new kernel for the RoundRobinTrimmer which fixes a bug and makes it available for conversion to TF Lite.\r\n  * New kernels for the RoundRobinTrimmer which fixes a bug and makes it available for conversion to TF Lite.\r\n  * Add two functions to implementations of the OpKernelShim for accessing the name & doc string. Accessing internals directly causes problems when trying to use techniques like Object composition as the op template. 
In particular, this change is needed for improvements to the polymorphic wrapper.\r\n  * Allow int32 or int64 as types for RoundRobinTrimmer ops' splits.\r\n  * Extend RoundRobinTrimmer kernels to allow any type as the value.\r\n  * internal\r\n  * license rules update\r\n  * Remove license changes for now since it has broken the builds.\r\n  * Return empty results if get_offsets is false.\r\n  * tf_text:  Add a \"concatenate_segments\" function.\r\n  * Skip-uncompressing of bazel to try and locate error for mac ci tests.\r\n  * Fix scraping full commit from short commit sha\r\n  * Update tensorflow-text notebooks from 2.8 to 2.11\r\n  * Fix bazel version scrapping logic for .bazelversion in install_bazel.sh\r\n\r\n## Thanks to our Contributors\r\n\r\nThis release contains contributions from many people at Google, as well as:\r\n\r\nsynandi, tilakrayal\r\n","2023-02-21T19:40:44"]