[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-abacusai--Long-Context":3,"tool-abacusai--Long-Context":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",154349,2,"2026-04-13T23:32:16",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 
人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":72,"owner_website":77,"owner_url":78,"languages":79,"stars":92,"forks":93,"last_commit_at":94,"license":95,"difficulty_score":96,"env_os":97,"env_gpu":98,"env_ram":98,"env_deps":99,"category_tags":102,"github_topics":76,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":103,"updated_at":104,"faqs":105,"releases":106},7328,"abacusai\u002FLong-Context","Long-Context","This repository contains code and tooling for the Abacus.AI LLM Context Expansion project. Also included are evaluation scripts and benchmark tasks that evaluate a model’s information retrieval capabilities with context expansion. We also include key experimental results and instructions for reproducing and building on them.","Long-Context 是 Abacus.AI 开源的一个项目，旨在突破大型语言模型（LLM）原有的上下文长度限制。它主要解决了传统模型（如基于 RoPE 编码的 Llama）在处理超过预训练长度（例如 2048 token）的文本时，信息检索能力急剧下降的难题。\n\n该项目提供了一套完整的代码库、评估脚本及基准测试任务，帮助开发者复现实验并构建支持长上下文的模型。通过实验发现，简单的线性缩放结合指令微调（IFT）是目前最稳健的扩展方案。项目方甚至开源了经过优化的模型权重（如 Scale 16 版本），使其能有效处理长达 16k 至 24k token 的上下文，远超理论计算值。此外，项目还探索了傅里叶基截断、随机化位置向量以及 xPos 衰减机制等多种前沿技术路径。\n\nLong-Context 非常适合 AI 研究人员、大模型开发者以及需要处理长文档分析、复杂问答任务的技术团队使用。它不仅提供了可落地的模型权重，还分享了详尽的实验数据与训练策略，是社区探索长文本理解能力的重要参考资源。","# Extending LLM Context Length\n\nThe choice of how to encode positional information for transformers has been one of the key components of LLM architectures.\n\nAn area that has been interesting to us and others in the community recently is whether LLMs can be extended to longer contexts.\n\nWe have conducted a range of experiments with different schemes for extending context length capabilities of Llama, which has been pretrained on 2048 context length with the RoPE (Rotary Position Embedding) encoding. Here we share some of the results as well as the training and evaluation scripts in the hope that it will be useful to the community. For our best performing models - linear scaling with IFT at scales 4 and 16 - we are also sharing the weights in case others wish to use them, or to conduct their own tests. 
We believe the scale 16 model should perform well on real world tasks up to 16k context lengths, and potentially even up to about 20-24k context lengths.\n\n[Scale 16 model](https:\u002F\u002Fhuggingface.co\u002Fabacusai\u002FGiraffe-v1-delta-13b-scaled-16)\n\n[Technical Paper](http:\u002F\u002Farxiv.org\u002Fabs\u002F2308.10882)\n\n## Overview\n\nWe conducted a wide variety of experiments to try to extend the context length of the models. First, we tried simply using the base Llama model zero-shot. As expected, this performed well up to 2048 context length but deteriorated very rapidly afterwards.\n\nWe next investigated fine tuning approaches where we trained the model on the RedPajama dataset at context lengths of 4096. This led to expected improvements in performance up to 4096 context but again, no further.\n\nAnother approach to extending context length is to modify the RoPE encoding in some way. Here, we tried many different ideas:\n- Linear scaling, as described by kaiokendev.github.io.\n- Scaling the Fourier basis of RoPE by a power, such that low frequencies are stretched more than high frequencies.\n- Applying truncation to the Fourier basis. Our idea here was that we wanted the model to see only frequencies that were fast enough so that it got at least one full cycle during training; any slower frequencies were set to 0 (equivalent to no rotation at all, i.e. equally important at all context lengths).\n- Randomising the position vector.\n\nIn particular, we combined fine-tuning on the RedPajama dataset and instruction-fine-tuning with the Vicuna dataset with the above approaches. This is what led to the most fruitful results.\n\nFinally, we implemented and tried the approach described in the [xPos](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10554) paper. This approach adds decaying amplitude penalty terms that cause fast frequencies to have less impact at long distances than slow frequencies in the Fourier basis (see our blog post for similarity heatmaps that show this).\n\n## Highlighted Results\n\nPerhaps the most pointed observation we made is that different evaluation methodologies\u002Ftasks lead to different rankings of the approaches detailed above. This will be described in further detail below.\n\nThat said, we made the following general observations:\n\n- Linear interpolation\u002Fscaling seems to be the most robust approach for increasing model context length.\n- Using a linear scale of N does not necessarily lead to a model context length increase by a factor of N. For example, our scale 16 experiments generally stopped performing well after a context length of 16000, not 32000 (~2048 * 16). 
We have ideas for how to ameliorate this effect planned for future work.\n- Truncation and randomisation both seem to have great perplexity scores but perform less well on the retrieval task.\n- Instruction fine tuning with the Vicuna dataset improves accuracy in the retrieval context significantly at lengths which the base model is capable of handling, but cannot 'fix' the base model at lengths where it fails.\n\n## Evaluation Tasks\n\nFor evaluation we used two different datasets:\n\n- LMSys datasets (the 'lines' task) for locating a substring in the context\n- Our own open book question answering dataset, WikiQA, which is based on other open source base QA datasets\n\nIn addition, we looked at the log loss of the train and eval sets during training.\n\nFor the LMSys task, we generated new and longer testcases, up to a context length of about 25000, beyond the 16000 context testcases in the original dataset.\n\nThe **WikiQA** task is the task of answering a question based on the information given in a Wikipedia document. We have built upon the short answer format data in [Google Natural Questions](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002Fnatural-questions\u002Ftree\u002Fmaster) to construct our QA task. It is formatted as a document and a question. We ensure the answer to the question is a short answer which is either a single word or a small sentence cut and pasted directly from the document. Having the task structured as such, we can pinpoint exactly where the LLM was supposed to \"look\" for the answer in the context, and thus effectively evaluate every part of the expanded context length by carefully placing the answer in different locations.\n\nWe have selected large Wikipedia documents and have truncated them to get multiple versions of the same document with sizes varying between 2000 and 16000 tokens. For each size of the document, we also have multiple versions which place the question and the answer text at different locations, i.e. whether it occurs in the first 10%, the bulk, or the last 10% of the document. Having multiple versions of the same document allows us to get an exhaustive and fair evaluation across model sizes, and within one model's context positions, since we are intrinsically asking for the same information.\n\nA potential issue in a Wikipedia-based dataset is that the model could perhaps correctly answer from its pretrained corpus and not from context. To resolve this, we have created another “altered” dataset. This data only consists of questions which have numerical answers. Here, we change the answer and every occurrence of the answer in the document to a different number, essentially making sure that if the LLM recollects from its pretrained corpus, it gives a wrong answer. The modification is made as follows (a short illustrative sketch follows below):\n- If the answer is a year, which is quite frequent (i.e. between 1000 and 2100), we change it to a different random value within +\u002F- 10 of the original value. We treat years as a special case so as to not make the interpretation of the document absurd by messing up chronological information.\n- If the answer is any other number, we change it to a different random number which has the same number of digits.\n\nWe call our original QA task [Free Form QA (FFQA)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fabacusai\u002FWikiQA-Free_Form_QA) and the altered task [Altered Numeric QA (AltQA)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fabacusai\u002FWikiQA-Altered_Numeric_QA). 
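\n\nTo make the alteration rule above concrete, here is a minimal illustrative sketch in Python; the function name and its details are ours for illustration only and are not the repository's actual implementation.\n\n```python\nimport random\n\ndef alter_numeric_answer(answer: str) -> str:\n    # Illustrative sketch (not the repository's code) of the AltQA rule described above:\n    # years (1000-2100) are shifted by at most +\u002F- 10, while any other number is\n    # replaced by a different random number with the same number of digits.\n    value = int(answer)\n    if 1000 <= value <= 2100:  # treat years specially to keep the chronology plausible\n        candidates = [v for v in range(value - 10, value + 11) if v != value]\n        return str(random.choice(candidates))\n    n_digits = len(answer)\n    low = 10 ** (n_digits - 1) if n_digits > 1 else 0\n    high = 10 ** n_digits - 1\n    new_value = value\n    while new_value == value:  # guarantee the replacement differs from the original\n        new_value = random.randint(low, high)\n    return str(new_value)\n```\n\nAs noted above, the same replacement is applied to the answer and to every occurrence of it in the document, so the altered context stays self-consistent.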
\n\nWe evaluate success on every example in both versions of our QA task by measuring \"Presence Accuracy\", i.e. whether or not the answer is present as a substring in the model's generated answer. To run inference for our models on WikiQA and compute metrics, refer to `run_inference_WikiQA.py` and `compute_metrics_WikiQA.ipynb` [here](.\u002Fpython\u002Feval\u002Flongeval).\n\nWe are releasing these datasets on HuggingFace so others can use them to run their own long context experiments.\n\n- [Extended LMSys Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fabacusai\u002FLongChat-Lines)\n- [WikiQA Free_Form_QA Dataset](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fabacusai\u002FWikiQA-Free_Form_QA)\n- [WikiQA Altered_Numeric_QA](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fabacusai\u002FWikiQA-Altered_Numeric_QA)\n\n## Results\n\n### LMSys Eval\n\nAs a general point regarding the results below, the authors believe that small differences in accuracy on this task are not\nparticularly indicative of model ranking quality. We would generally look at the broadest trends here in interpreting the\nresults.\n\nAlso, as a baseline, standard Llama-13b only has non-zero accuracy up to 2048 context length (as does the Vicuna-instruction-\nfine-tuned version of it).\n\n#### Comparison of different scaling approaches\n\n![Comparison of different scaling approaches](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_9f080f7adf7c.png)\n\nIn the above we compare the different scaling approaches. 'Scale' refers to linear interpolation with the designated\nscaling value. We see that linear interpolation with a scale of 16 is the only one to achieve a non-zero accuracy at\ncontext lengths greater than 9000. However, this seems to come with a sacrifice of some accuracy on shorter contexts.\n\nThe power = 0.5 basis seems to work particularly well for this task at shorter contexts but has the sharpest drop-off\nin accuracy as context length increases.\n\nIt's interesting to note that scale=16 doesn't generalise quite as far as one would hope. Naively, one expects that\nfollowing the trend of scale=4 - which is non-zero up to 8192 (and this is reasonable as the original context length\nis 2048, and 8192 = 2048 * 4; beyond this, the model is seeing relative distances between keys and queries it has never\nencountered before), scale=16 should be non-zero all the way up to 2048 * 16 = 32768.\n\n#### Impact of IFT (Instruction Fine Tuning)\n\n![Impact of IFT](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_3e89cc012f61.png)\n\nIn the above we display the impact of IFT via training with the Vicuna instruction set using LoRA. We see that IFT does improve\naccuracy by a small but non-negligible margin. However, it is not sufficient to change the overall shape of the accuracy curve -\nand it does not confer any extension to the range of context lengths at which the model can achieve non-zero accuracy on this task.\n\n#### Evaluating Zero Shot at different scales than Training\n\n![Evaluating Zero Shot at different scales than Training](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_d3f0c6348813.png)\n\nIn the above, we display various experiments in which we try different scale values (for linear interpolation) at evaluation time than the\nmodel was trained on. 
The green curve is indicative of taking a base model (trained on 2048 context) and applying a scale value to it.\nIt does extend the non-zero range from 2048 to 4096, but with low accuracy throughout. In general, however, once a model has been trained\nwith a scale > 1, it seems that the model can then zero-shot to a larger scale at evaluation time quite well - very greatly increasing\nthe range of coherent context lengths (e.g. compare Train=4, Eval=8 being non-zero here at 16k context length vs being 0 for anything above\n8k two graphs above). However, this does come at the cost of accuracy drop-off, particularly for Train=16, Eval=32.\n\nThe Train=16, Eval=12 run has the longest non-zero accuracy context length we have seen. It achieves a non-zero score at a context length of\naround 20000.\n\n### WikiQA Eval\n\nIn the tables below, both models are evaluated with scale=4. However, the 'no scaling' model was not finetuned (i.e. experienced no training) at a scale > 1. The Scale=4 model did\nreceive finetuning at that expanded scale.\n\nPresence Accuracy: \n\n|Context Length | IFT with Scale=4 on FFQA | IFT No scaling on FFQA | IFT with Scale=4 on AltQA | IFT No scaling on AltQA |\n|--------------:|-------------------------:|-----------------------:|--------------------------:|------------------------:|\n|          2048 |                   0.3233 |                 0.2217 |                    0.7281 |                  0.2982 |\n|          4096 |                   0.3783 |                 0.2467 |                    0.7018 |                  0.2829 |\n|          8192 |                   0.4434 |                 0.2406 |                    0.6582 |                  0.2401 |\n|         16384 |                   0.3933 |                 0.0    |                    0.5363 |                  0.0    |\n\nNote: For 16k context length, we use a scale factor of 8 during inference. This enables expanding the original 2k context to 2*8=16k. It is interesting to point out that even though the scaled model was trained with a scale factor of 4, it can zero-shot interpolate to 16k (a scale of 8) during inference without losing too much performance. This however does not hold in the non-scaled models, as is evident from the drop in accuracy to 0 on the 16k datapoints, indicating that our scaling and context length interpolation does work.\n\n\n#### Input Context Length Stats\nAs mentioned previously, we truncate and modify the documents to have different versions of the WikiQA data. 
Each version is meant to extensively test the model's performance up to and at a certain context length, as indicated by the version name.\n\n##### FFQA\n|               | **Mean Context Length** | **Max Context Length** |\n|---------------:|-------------------------:|------------------------:|\n|  ffqa_2k.json |                 1936.71 |                   3228 |\n|  ffqa_4k.json |                 3805.06 |                   5793 |\n|  ffqa_8k.json |                 7598.98 |                   9963 |\n| ffqa_16k.json |                15000.54 |                  16178 |\n\n##### AltQA\n|                | **Mean Context Length** | **Max Context Length** |\n|----------------:|-------------------------:|------------------------:|\n|  altqa_2k.json |                 1953.73 |                   2698 |\n|  altqa_4k.json |                 3737.39 |                   5172 |\n|  altqa_8k.json |                 7481.37 |                   9619 |\n| altqa_16k.json |                15013.44 |                  16173 |\n\n#### Performance Robust to Increasing Context Length\n\n![Performance Robust to Increasing Context Length](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_49e2b0373d57.png)\n\nAs is seen above, our technique of finetuning interpolated embeddings seems to yield models that are robust to increasing input context length on the WikiQA task. We demonstrate this on both versions of the task. Since we finetune with a scale context of 4, we expect the accuracy to not drop until a 4*2048=8192 sized input. Even beyond this limit, we do see some reasonable performance. This seems to be a consequence of the periodicity of RoPE embeddings, which leads to some characteristics being extrapolatable to positions beyond the limit set by the scale context.\n\n#### Impact of Scaling Context\n\n![Impact of Scaling Context FFQA](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_bb1c339530df.png) \n\n![Impact of Scaling Context AltQA](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_a70f330e5bae.png)\n\nWe contrast models instruction finetuned with and without scale context to show that IFT with scaled context leads to a significant jump in performance. Note that for both models, we still use a scaled context (=4) during evaluation. Interestingly, even zero shot performance of the scaled RoPE embedding gives non-trivial accuracy. However, having the embeddings explicitly finetuned does yield considerable gains. We see almost a 2x improvement on FFQA and a 2.5x improvement on AltQA at all positions interpolated by the scale context factor.\n\n#### Location of Information\n\n\n### Loss curves\n\nWe trained models across all the experiments described in the overview. Not all of them seemed promising for a full evaluation.\nSome of the experiments were abandoned because the loss curves did not seem promising. In some cases we did find that the\nresults did not always align with the losses we were observing during training.\n\nThe images below show curves from a subset of the experiments we ran:\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_345998188a7b.png)\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_4db8cf70110d.png)\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_50e01b0dbc79.png)\n\nFor example, the XPOS loss never converged towards the losses seen in the other runs. 
Initially we suspected that fp16\nlacked sufficient precision to handle the XPOS coefficients. We adjusted the implementation to use fp32 for the \ncore attention dot product. This did improve the convergence but not sufficiently to have the losses match the other\nmodels. Our hypothesis is that XPOS is too different from the base positional embeddings to finetune into the embedding.\nThis is a bit surprising since XPOS can is just RoPE with a scaling factor that is a function of relative difference.\nOne experiment we started but have not completed is to start with a factor of 1.0 and slowly shift to the XPOS function\nover iterations.\n","# 扩展LLM上下文长度\n\n如何为Transformer模型编码位置信息，一直是LLM架构中的关键组成部分之一。\n\n近来，我们以及社区中的其他研究者都对这样一个问题颇感兴趣：能否将LLM的上下文长度扩展到更长？\n\n我们针对Llama模型的不同上下文长度扩展方案进行了大量实验。Llama模型在预训练时采用RoPE（旋转位置嵌入）编码，其上下文长度限制为2048。在此，我们分享部分实验结果以及相应的训练和评估脚本，希望能对社区有所帮助。对于表现最佳的两个模型——分别以缩放因子4和16进行线性缩放并结合指令微调——我们也公开了模型权重，供有兴趣的研究者使用或进一步测试。我们认为，缩放因子16的模型在实际任务中能够较好地处理高达16k的上下文长度，甚至可能在约20–24k的上下文长度上也有不错的表现。\n\n[缩放因子16的模型](https:\u002F\u002Fhuggingface.co\u002Fabacusai\u002FGiraffe-v1-delta-13b-scaled-16)\n\n[技术论文](http:\u002F\u002Farxiv.org\u002Fabs\u002F2308.10882)\n\n## 概述\n\n我们开展了多种实验，试图扩展模型的上下文长度。首先，我们直接以零样本方式使用基础Llama模型。正如预期，该模型在2048个token的上下文长度内表现良好，但超过此长度后性能迅速下降。\n\n接下来，我们尝试了微调方法，在RedPajama数据集上以4096个token的上下文长度对模型进行训练。这使得模型在4096个token的上下文长度内性能有所提升，然而再往上调时效果便不再明显。\n\n另一种扩展上下文长度的方法是修改RoPE编码。为此，我们尝试了多种思路：\n- 线性缩放，如kaiokendev.github.io所述。\n- 对RoPE的傅里叶基底按幂次缩放，使低频成分比高频成分被拉伸得更多。\n- 对傅里叶基底进行截断。我们的想法是让模型只关注那些足够快、在训练过程中至少能完成一个完整周期的频率；而更慢的频率则置为0（相当于完全不进行旋转，即在所有上下文长度上都同等重要）。\n- 随机化位置向量。\n\n特别地，我们将上述方法与在RedPajama数据集上的微调以及基于Vicuna数据集的指令微调相结合，这一组合取得了最为显著的效果。\n\n最后，我们实现了并尝试了[xPos](https:\u002F\u002Farxiv.org\u002Fabs\u002F2212.10554)论文中提出的方法。该方法通过引入随距离衰减的幅度惩罚项，使得傅里叶基底中的高频成分在长距离下影响较小，而低频成分则影响较大（详见我们的博客文章中的相似度热图）。\n\n## 重点结果\n\n我们最重要的发现之一是：不同的评估方法或任务会导致对上述方法的排名有所不同。这一点将在下文中详细说明。\n\n尽管如此，我们仍得出以下几点普遍观察：\n- 线性插值或缩放似乎是提升模型上下文长度最稳健的方法。\n- 使用线性缩放因子N并不一定意味着模型的上下文长度会相应地增加N倍。例如，我们的缩放因子16实验通常在上下文长度达到16000左右时性能便开始下降，而非32000（约2048×16）。我们已规划了一些改进这一现象的方案，将在后续工作中实现。\n- 截断和随机化虽然在困惑度指标上表现优异，但在检索任务中的表现却相对较差。\n- 基于Vicuna数据集的指令微调能够显著提升基础模型在自身可处理长度范围内的检索准确率，但对于超出其能力范围的长度，则无法“修复”模型的不足。\n\n## 评估任务\n\n在评估中，我们使用了两个不同的数据集：\n\n- LMSys 数据集（“lines”任务），用于在上下文中定位子字符串；\n- 我们的开源开放书问答数据集 WikiQA，该数据集基于其他开源的基础问答数据集构建。\n\n此外，我们还观察了训练集和验证集在以下过程中的对数损失：\n\n对于 LMSys 任务，我们生成了新的、更长的测试用例，上下文长度最长可达约 25000 个 token，超过了原始数据集中 16000 个 token 的上下文测试用例。\n\n**WikiQA** 任务是根据维基百科文档中提供的信息回答问题。我们在 [Google Natural Questions](https:\u002F\u002Fgithub.com\u002Fgoogle-research-datasets\u002Fnatural-questions\u002Ftree\u002Fmaster) 中的短答案格式数据基础上构建了我们的问答任务。该任务以文档和问题的形式呈现。我们确保问题的答案是一个简短的答案，要么是一个单独的词，要么是从文档中直接摘录的一小句话。通过这种结构化的任务设计，我们可以精确地确定大语言模型本应在上下文中“查找”答案的具体位置，从而通过将答案放置在不同位置来有效评估扩展后的各种上下文长度。\n\n我们选取了大型维基百科文档，并对其进行截断，得到同一文档的多个版本，其大小介于 2000 到 16000 个 token 之间。对于每个文档大小，我们还准备了多个版本，将问题和答案文本放置在文档的不同位置——例如，是在文档的前 10%、主体部分还是最后 10%。通过提供同一文档的多个版本，我们能够在不同模型规模之间以及在同一模型的不同上下文位置上进行全面而公平的评估，因为我们本质上始终在询问相同的信息。\n\n基于维基百科的数据集可能存在一个问题：模型可能会从其预训练语料库中正确回答问题，而不是从给定的上下文中获取答案。为了解决这个问题，我们创建了一个“修改版”数据集。该数据集仅包含答案为数字的问题。在此，我们将答案以及文档中所有出现该答案的地方都替换为一个不同的数字。这样可以确保如果大语言模型仅仅依赖其预训练语料库进行回忆，它就会给出错误的答案。具体的修改方式如下：\n\n- 如果答案是年份（通常在 1000 到 2100 年之间），我们会将其改为原值上下浮动 10 年内的一个随机值。我们将年份视为特殊情况，以避免因打乱时间顺序而导致对文档的解读变得荒谬。\n- 如果答案是其他数字，则将其替换为具有相同位数的另一个随机数字。\n\n我们将原始的问答任务称为 [自由形式问答 (FFQA)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fabacusai\u002FWikiQA-Free_Form_QA)，而修改后的任务则称为 [修改版数字问答 
(AltQA)](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fabacusai\u002FWikiQA-Altered_Numeric_QA)。\n\n我们通过测量“存在准确率”，即模型生成的答案中是否包含作为子字符串的答案，来评估这两个版本的问答任务中每个示例的成功情况。要对我们的模型在 WikiQA 上运行推理并计算指标，请参阅位于 [此处](.\u002Fpython\u002Feval\u002Flongeval) 的 `run_inference_WikiQA.py` 和 `compute_metrics_WikiQA.ipynb`。\n\n我们已在 HuggingFace 上发布了这些数据集，以便其他人也能用于开展自己的长上下文实验。\n\n- [扩展版 LMSys 数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fabacusai\u002FLongChat-Lines)\n- [WikiQA 自由形式问答数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fabacusai\u002FWikiQA-Free_Form_QA)\n- [WikiQA 修改版数字问答数据集](https:\u002F\u002Fhuggingface.co\u002Fdatasets\u002Fabacusai\u002FWikiQA-Altered_Numeric_QA)\n\n## 结果\n\n### LMSys 评估\n\n关于以下结果的一个总体说明是：作者认为，在这项任务上微小的准确率差异并不能很好地反映模型排名的质量。因此，在解释结果时，我们主要关注整体趋势。\n\n另外，作为基准，标准的 Llama-13b 模型只有在上下文长度不超过 2048 个 token 时才具有非零准确率（经过 Vicuna 指令微调的版本也是如此）。\n\n#### 不同缩放方法的比较\n\n![不同缩放方法的比较](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_9f080f7adf7c.png)\n\n上图比较了不同的缩放方法。“Scale”表示使用指定的缩放值进行线性插值。我们看到，缩放系数为 16 的线性插值是唯一一种在上下文长度超过 9000 个 token 时仍能获得非零准确率的方法。然而，这种方法似乎会牺牲一些较短上下文长度下的准确率。\n\n缩放指数为 0.5 的方法在较短上下文长度下表现尤为出色，但随着上下文长度的增加，其准确率下降得非常快。\n\n值得注意的是，缩放系数为 16 的方法并没有像预期那样广泛适用。直观来看，人们可能会认为，既然缩放系数为 4 的方法在上下文长度达到 8192 个 token 时仍有非零准确率（这也在情理之中，因为原始上下文长度为 2048 个 token，而 8192 等于 2048 乘以 4；超过这个长度后，模型会遇到前所未有的键与查询之间的相对距离），那么缩放系数为 16 的方法应该能在上下文长度达到 2048 乘以 16，即 32768 个 token 时仍然保持非零准确率。\n\n#### IFT（指令微调）的影响\n\n![IFT 的影响](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_3e89cc012f61.png)\n\n上图展示了通过 LoRA 使用 Vicuna 指令集进行微调所带来的 IFT 效果。我们看到，IFT 确实能够小幅但不可忽视地提升准确率。然而，这并不足以改变准确率曲线的整体形状，也无法扩大模型在该任务上能够达到非零准确率的上下文长度范围。\n\n#### 在不同于训练时的缩放系数下进行零样本评估\n\n![在不同于训练时的缩放系数下进行零样本评估](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_d3f0c6348813.png)\n\n上图展示了在评估时尝试使用与模型训练时不同的缩放系数（用于线性插值）的各种实验。绿色曲线表示对一个基础模型（在 2048 个 token 的上下文上训练）应用某个缩放系数的情况。这种方法确实将非零准确率的范围从 2048 个 token 扩展到了 4096 个 token，但整个范围内的准确率都很低。不过，总体而言，一旦模型已经使用大于 0 的缩放系数进行训练，它似乎可以在评估时以零样本的方式很好地适应更大的缩放系数——从而大幅提高其能够处理的连贯上下文长度范围（例如，对比上方两张图中“训练=4，评估=8”在 16000 个 token 的上下文长度下仍为非零，而在任何超过 8000 个 token 的情况下均为零）。然而，这样做也会导致准确率下降，尤其是在“训练=16，评估=32”的情况下。\n\n“训练=16，评估=12”的运行是我们所见过的非零准确率持续时间最长的案例。它在约 20000 个 token 的上下文长度下仍保持非零分数。\n\n### WikiQA 评估\n\n在下表中，两个模型均以 scale=4 的尺度进行评估。然而，“无缩放”模型并未在 scale > 1 的情况下进行微调（即未经历任何训练）。而 scale=4 的模型则确实在该扩展尺度下接受了微调。\n\n存在准确性：\n\n| 上下文长度 | 在 FFQA 上使用 scale=4 进行 IFT | 在 FFQA 上无缩放进行 IFT | 在 AltQA 上使用 scale=4 进行 IFT | 在 AltQA 上无缩放进行 IFT |\n|--------------:|-------------------------:|-----------------------:|--------------------------:|------------------------:|\n|          2048 |                   0.3233 |                 0.2217 |                    0.7281 |                  0.2982 |\n|          4096 |                   0.3783 |                 0.2467 |                    0.7018 |                  0.2829 |\n|          8192 |                   0.4434 |                 0.2406 |                    0.6582 |                  0.2401 |\n|         16384 |                   0.3933 |                 0.0    |                    0.5363 |                  0.0    |\n\n注：对于 16k 的上下文长度，我们在推理时使用了 8 倍的缩放因子。这使得原始的 2k 上下文能够扩展到 2*8=16k。值得注意的是，尽管缩放后的模型是以 4 倍的缩放因子进行训练的，但它在推理时仍能零样本地插值到 16k（即 8 倍的缩放），且性能损失不大。然而，这一点在未缩放的模型中并不成立，从 16k 数据点上准确率降至 0 即可看出。这表明我们的缩放和上下文长度插值方法确实有效。\n\n\n#### 输入上下文长度统计\n如前所述，我们对文档进行了截断和修改，以生成不同版本的 WikiQA 数据集。每个版本旨在充分测试模型在特定上下文长度下的表现，具体长度由版本名称指示。\n\n##### FFQA\n|               | **平均上下文长度** | **最大上下文长度** 
|\n|---------------:|-------------------------:|------------------------:|\n|  ffqa_2k.json |                 1936.71 |                   3228 |\n|  ffqa_4k.json |                 3805.06 |                   5793 |\n|  ffqa_8k.json |                 7598.98 |                   9963 |\n| ffqa_16k.json |                15000.54 |                  16178 |\n\n##### AltQA\n|                | **平均上下文长度** | **最大上下文长度** |\n|----------------:|-------------------------:|------------------------:|\n|  altqa_2k.json |                 1953.73 |                   2698 |\n|  altqa_4k.json |                 3737.39 |                   5172 |\n|  altqa_8k.json |                 7481.37 |                   9619 |\n| altqa_16k.json |                15013.44 |                  16173 |\n\n#### 性能对不断增加的上下文长度具有鲁棒性\n\n![性能对不断增加的上下文长度具有鲁棒性](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_49e2b0373d57.png)\n\n如上所示，我们在 WikiQA 任务上采用的插值嵌入微调技术，似乎能够生成对输入上下文长度不断增加具有鲁棒性的优秀模型。我们在该任务的两个版本上都展示了这一点。由于我们是以 4 倍的上下文长度进行微调的，因此预计准确率不会在输入大小达到 4*2048=8192 之前下降。即便超过这个限制，我们仍然可以看到一些合理的性能。这似乎是 RoPE 嵌入的周期性特征所致，它使得某些特性可以外推到超出缩放上下文所设定的限制之外的位置。\n\n#### 缩放上下文的影响\n\n![缩放上下文对 FFQA 的影响](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_bb1c339530df.png) \n\n![缩放上下文对 AltQA 的影响](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_a70f330e5bae.png)\n\n我们对比了使用和不使用缩放上下文进行指令微调的模型，结果表明，使用缩放上下文进行 IFT 能显著提升模型性能。需要注意的是，对于这两个模型，我们在评估时仍然使用了缩放上下文（scale=4）。有趣的是，即使是在零样本情况下，缩放后的 RoPE 嵌入也能给出非平凡的准确率。然而，显式地对嵌入进行微调确实带来了可观的收益。我们看到，在 FFQA 上几乎实现了 2 倍的提升，而在 AltQA 上则实现了 2.5 倍的提升，且这一优势在所有由缩放上下文因子插值出的位置上都得以体现。\n\n#### 信息的位置\n\n\n### 损失曲线\n我们在概述中描述的所有实验中都训练了模型。但并非所有实验都显示出进行全面评估的前景。\n部分实验因损失曲线表现不佳而被放弃。在某些情况下，我们还发现实验结果并不总是与训练过程中观察到的损失一致。\n\n以下图片展示了我们运行的部分实验的曲线：\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_345998188a7b.png)\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_4db8cf70110d.png)\n\n![alt text](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_readme_50e01b0dbc79.png)\n\n例如，XPOS 的损失从未收敛到与其他运行中观察到的损失水平。起初，我们怀疑是 fp16 的精度不足以处理 XPOS 系数。于是我们将实现调整为使用 fp32 来进行核心注意力点积计算。这确实改善了收敛性，但仍不足以使损失与其它模型持平。我们的假设是，XPOS 与基础位置嵌入差异过大，难以在其基础上进行微调。这有些令人意外，因为 XPOS 实际上只是带有相对差异函数缩放因子的 RoPE。我们曾启动过一项尚未完成的实验，即从缩放因子 1.0 开始，逐步通过多次迭代过渡到 XPOS 函数。","# Long-Context 快速上手指南\n\n本指南旨在帮助开发者快速使用 AbacusAI 开源的 Long-Context 项目，将 Llama 系列模型的上下文窗口扩展至 16k 甚至更长。该项目通过线性缩放（Linear Scaling）结合指令微调（IFT），显著提升了模型在长文本任务中的表现。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**: Linux (推荐 Ubuntu 20.04+) 或 macOS。\n*   **Python 版本**: Python 3.8 或更高版本。\n*   **GPU 要求**: 建议使用支持 CUDA 的 NVIDIA GPU。\n    *   运行 13B 模型进行推理：建议显存 ≥ 24GB (如 RTX 3090\u002F4090, A10)。\n    *   进行微调训练：建议多卡环境或高显存专业卡 (如 A100\u002FH100)。\n*   **前置依赖**:\n    *   PyTorch (与您的 CUDA 版本匹配)\n    *   Transformers (Hugging Face)\n    *   Accelerate\n    *   PEFT (用于 LoRA 微调)\n    *   Datasets\n\n**安装基础依赖命令：**\n\n```bash\npip install torch torchvision torchaudio --index-url https:\u002F\u002Fdownload.pytorch.org\u002Fwhl\u002Fcu118\npip install transformers accelerate peft datasets sentencepiece protobuf\n```\n\n> **提示**：国内用户可使用清华或阿里镜像源加速安装：\n> `pip install -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple \u003C包名>`\n\n## 安装步骤\n\n本项目主要提供训练脚本、评估脚本以及预训练好的模型权重。您可以直接克隆仓库并安装必要的 Python 包。\n\n1.  **克隆仓库**\n\n```bash\ngit clone https:\u002F\u002Fgithub.com\u002Fabacusai\u002FLong-Context.git\ncd Long-Context\n```\n\n2.  
**安装项目特定依赖（如有 requirements.txt）**\n\n如果仓库根目录包含 `requirements.txt`，请执行：\n\n```bash\npip install -r requirements.txt\n```\n\n3.  **验证安装**\n\n确保能够导入关键库并无报错：\n\n```bash\npython -c \"import transformers; import peft; print('Environment ready')\"\n```\n\n## 基本使用\n\n本项目最核心的用法是直接加载已发布的 **Scale 16** 模型权重，该模型在 16k 上下文长度下表现最佳。\n\n### 1. 加载预训练模型 (推理示例)\n\n以下是最简单的使用示例，展示如何加载 `Giraffe-v1-delta-13b-scaled-16` 模型并进行长文本推理。\n\n**注意**：该模型发布为 Delta 权重（基于 Llama-13b 的增量），加载时通常需要合并基础模型，或者直接使用支持加载 delta 权重的最新 Transformers 版本。以下代码假设您已拥有基础 Llama-13b 模型或直接通过 HuggingFace 加载整合后的模型（如果可用）。\n\n```python\nfrom transformers import AutoTokenizer, AutoModelForCausalLM\nimport torch\n\n# 模型标识符\nmodel_name = \"abacusai\u002FGiraffe-v1-delta-13b-scaled-16\"\n\n# 加载分词器\ntokenizer = AutoTokenizer.from_pretrained(\"huggyllama\u002Fllama-13b\") # 需指定基础 Llama 模型或使用整合版\n\n# 加载模型\n# 注意：如果是 Delta 权重，可能需要先合并权重到本地，或使用特定的加载逻辑\n# 此处演示标准加载流程，具体取决于 HF 仓库的当前配置\nmodel = AutoModelForCausalLM.from_pretrained(\n    model_name,\n    torch_dtype=torch.float16,\n    device_map=\"auto\",\n    trust_remote_code=True\n)\n\n# 准备长文本输入 (模拟 10k+ tokens 的场景)\nlong_context = \"...\" * 5000 + \"请问这篇文章的核心观点是什么？\"\n\ninputs = tokenizer(long_context, return_tensors=\"pt\").to(model.device)\n\n# 生成回答\noutputs = model.generate(\n    **inputs,\n    max_new_tokens=256,\n    do_sample=True,\n    temperature=0.7,\n    pad_token_id=tokenizer.eos_token_id\n)\n\nresponse = tokenizer.decode(outputs[0], skip_special_tokens=True)\nprint(response)\n```\n\n### 2. 使用评估脚本测试长上下文能力\n\n项目提供了专门的脚本来评估模型在长上下文任务（如 WikiQA）上的表现。\n\n**运行 WikiQA 推理与指标计算：**\n\n```bash\n# 进入评估目录\ncd python\u002Feval\u002Flongeval\n\n# 运行推理脚本 (需根据实际参数调整 model_path 和 scale)\npython run_inference_WikiQA.py \\\n    --model_path abacusai\u002FGiraffe-v1-delta-13b-scaled-16 \\\n    --scale 4 \\\n    --output_dir .\u002Fresults\n\n# 计算评估指标\njupyter nbconvert --to script compute_metrics_WikiQA.ipynb\npython compute_metrics_WikiQA.py\n```\n\n### 3. 
数据集获取\n\n为了复现实验或进行自定义测试，您可以直接从 HuggingFace 下载作者整理的长上下文数据集：\n\n*   **Extended LMSys Dataset**: `abacusai\u002FLongChat-Lines`\n*   **WikiQA (Free Form)**: `abacusai\u002FWikiQA-Free_Form_QA`\n*   **WikiQA (Altered Numeric)**: `abacusai\u002FWikiQA-Altered_Numeric_QA`\n\n**下载示例 (使用 datasets 库):**\n\n```python\nfrom datasets import load_dataset\n\n# 加载长文本问答数据集\ndataset = load_dataset(\"abacusai\u002FWikiQA-Free_Form_QA\")\nprint(dataset[\"train\"][0])\n```\n\n通过以上步骤，您即可快速上手 Long-Context 工具，体验在 16k+ 上下文长度下的 LLM 应用能力。","某法律科技团队正在开发一款智能合同审查助手，需要让模型一次性读取并分析长达数万字的复杂并购协议及附属条款。\n\n### 没有 Long-Context 时\n- **信息割裂严重**：由于原生 Llama 模型仅支持 2048 上下文，必须将长文档强行切割成碎片，导致跨章节的条款关联（如定义部分与违约责任部分）无法被模型同时捕捉。\n- **关键细节丢失**：在分段处理中，位于文档中间或末尾的关键限制性条款极易被忽略，造成“大海捞针”式的检索失败，遗漏重大法律风险。\n- **逻辑连贯性差**：模型无法理解全文脉络，只能基于局部片段生成回答，导致输出的审查意见前后矛盾或缺乏整体视角。\n- **工程复杂度高**：开发人员需编写复杂的滑动窗口算法和外部向量数据库检索逻辑来弥补模型短板，大幅增加了系统维护成本。\n\n### 使用 Long-Context 后\n- **全篇完整洞察**：借助 Long-Context 的线性扩展技术（如 Scale 16 模型），可直接输入长达 16k-24k 的完整合同，模型能精准定位并关联分散在文档首尾的定义与条款。\n- **检索精度飞跃**：基于 RedPajama 和 Vicuna 数据集的微调显著提升了长文本下的信息检索能力，确保即使是在两万字后的细微免责条款也能被准确识别。\n- **推理逻辑严密**：模型在保持全局视野的基础上进行分析，生成的审查报告逻辑连贯，能够综合全文信息判断潜在的法律冲突。\n- **架构极简高效**：无需再构建繁琐的分段检索流水线，直接调用扩展后的模型即可处理超长任务，显著降低了开发难度和延迟。\n\nLong-Context 通过突破原生位置编码限制，让大模型真正具备了“过目不忘”的长文档深度理解与分析能力。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002Fabacusai_Long-Context_9f080f7a.png","abacusai","Abacus.AI","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002Fabacusai_a303f2d0.png","Abacus.AI is a foundational AI research company that solves the hard problems that enterprises face in the AI\u002FML space.",null,"https:\u002F\u002Fabacus.ai","https:\u002F\u002Fgithub.com\u002Fabacusai",[80,84,88],{"name":81,"color":82,"percentage":83},"Python","#3572A5",92.3,{"name":85,"color":86,"percentage":87},"Jupyter Notebook","#DA5B0B",6.6,{"name":89,"color":90,"percentage":91},"Shell","#89e051",1.1,601,46,"2026-03-28T17:13:58","Apache-2.0",4,"","未说明",{"notes":100,"python":98,"dependencies":101},"README 主要介绍了扩展 LLM 上下文长度的实验方法、评估数据集（如 WikiQA）及结果，并未提供具体的安装指南或运行环境配置要求。文中提到的模型基于 Llama-13b，实际运行需参考 Llama 或 Vicuna 模型的通用硬件需求（通常需要高性能 NVIDIA GPU 和大显存）。",[],[35,14],"2026-03-27T02:49:30.150509","2026-04-14T12:28:02.008608",[],[]]