[{"data":1,"prerenderedAt":-1},["ShallowReactive",2],{"similar-PacificAI--langtest":3,"tool-PacificAI--langtest":61},[4,18,26,36,44,53],{"id":5,"name":6,"github_repo":7,"description_zh":8,"stars":9,"difficulty_score":10,"last_commit_at":11,"category_tags":12,"status":17},4358,"openclaw","openclaw\u002Fopenclaw","OpenClaw 是一款专为个人打造的本地化 AI 助手，旨在让你在自己的设备上拥有完全可控的智能伙伴。它打破了传统 AI 助手局限于特定网页或应用的束缚，能够直接接入你日常使用的各类通讯渠道，包括微信、WhatsApp、Telegram、Discord、iMessage 等数十种平台。无论你在哪个聊天软件中发送消息，OpenClaw 都能即时响应，甚至支持在 macOS、iOS 和 Android 设备上进行语音交互，并提供实时的画布渲染功能供你操控。\n\n这款工具主要解决了用户对数据隐私、响应速度以及“始终在线”体验的需求。通过将 AI 部署在本地，用户无需依赖云端服务即可享受快速、私密的智能辅助，真正实现了“你的数据，你做主”。其独特的技术亮点在于强大的网关架构，将控制平面与核心助手分离，确保跨平台通信的流畅性与扩展性。\n\nOpenClaw 非常适合希望构建个性化工作流的技术爱好者、开发者，以及注重隐私保护且不愿被单一生态绑定的普通用户。只要具备基础的终端操作能力（支持 macOS、Linux 及 Windows WSL2），即可通过简单的命令行引导完成部署。如果你渴望拥有一个懂你",349277,3,"2026-04-06T06:32:30",[13,14,15,16],"Agent","开发框架","图像","数据工具","ready",{"id":19,"name":20,"github_repo":21,"description_zh":22,"stars":23,"difficulty_score":10,"last_commit_at":24,"category_tags":25,"status":17},3808,"stable-diffusion-webui","AUTOMATIC1111\u002Fstable-diffusion-webui","stable-diffusion-webui 是一个基于 Gradio 构建的网页版操作界面，旨在让用户能够轻松地在本地运行和使用强大的 Stable Diffusion 图像生成模型。它解决了原始模型依赖命令行、操作门槛高且功能分散的痛点，将复杂的 AI 绘图流程整合进一个直观易用的图形化平台。\n\n无论是希望快速上手的普通创作者、需要精细控制画面细节的设计师，还是想要深入探索模型潜力的开发者与研究人员，都能从中获益。其核心亮点在于极高的功能丰富度：不仅支持文生图、图生图、局部重绘（Inpainting）和外绘（Outpainting）等基础模式，还独创了注意力机制调整、提示词矩阵、负向提示词以及“高清修复”等高级功能。此外，它内置了 GFPGAN 和 CodeFormer 等人脸修复工具，支持多种神经网络放大算法，并允许用户通过插件系统无限扩展能力。即使是显存有限的设备，stable-diffusion-webui 也提供了相应的优化选项，让高质量的 AI 艺术创作变得触手可及。",162132,"2026-04-05T11:01:52",[14,15,13],{"id":27,"name":28,"github_repo":29,"description_zh":30,"stars":31,"difficulty_score":32,"last_commit_at":33,"category_tags":34,"status":17},1381,"everything-claude-code","affaan-m\u002Feverything-claude-code","everything-claude-code 是一套专为 AI 编程助手（如 Claude Code、Codex、Cursor 等）打造的高性能优化系统。它不仅仅是一组配置文件，而是一个经过长期实战打磨的完整框架，旨在解决 AI 代理在实际开发中面临的效率低下、记忆丢失、安全隐患及缺乏持续学习能力等核心痛点。\n\n通过引入技能模块化、直觉增强、记忆持久化机制以及内置的安全扫描功能，everything-claude-code 能显著提升 AI 在复杂任务中的表现，帮助开发者构建更稳定、更智能的生产级 AI 代理。其独特的“研究优先”开发理念和针对 Token 消耗的优化策略，使得模型响应更快、成本更低，同时有效防御潜在的攻击向量。\n\n这套工具特别适合软件开发者、AI 研究人员以及希望深度定制 AI 工作流的技术团队使用。无论您是在构建大型代码库，还是需要 AI 协助进行安全审计与自动化测试，everything-claude-code 都能提供强大的底层支持。作为一个曾荣获 Anthropic 黑客大奖的开源项目，它融合了多语言支持与丰富的实战钩子（hooks），让 AI 真正成长为懂上",154349,2,"2026-04-13T23:32:16",[14,13,35],"语言模型",{"id":37,"name":38,"github_repo":39,"description_zh":40,"stars":41,"difficulty_score":32,"last_commit_at":42,"category_tags":43,"status":17},2271,"ComfyUI","Comfy-Org\u002FComfyUI","ComfyUI 是一款功能强大且高度模块化的视觉 AI 引擎，专为设计和执行复杂的 Stable Diffusion 图像生成流程而打造。它摒弃了传统的代码编写模式，采用直观的节点式流程图界面，让用户通过连接不同的功能模块即可构建个性化的生成管线。\n\n这一设计巧妙解决了高级 AI 绘图工作流配置复杂、灵活性不足的痛点。用户无需具备编程背景，也能自由组合模型、调整参数并实时预览效果，轻松实现从基础文生图到多步骤高清修复等各类复杂任务。ComfyUI 拥有极佳的兼容性，不仅支持 Windows、macOS 和 Linux 全平台，还广泛适配 NVIDIA、AMD、Intel 及苹果 Silicon 等多种硬件架构，并率先支持 SDXL、Flux、SD3 等前沿模型。\n\n无论是希望深入探索算法潜力的研究人员和开发者，还是追求极致创作自由度的设计师与资深 AI 绘画爱好者，ComfyUI 都能提供强大的支持。其独特的模块化架构允许社区不断扩展新功能，使其成为当前最灵活、生态最丰富的开源扩散模型工具之一，帮助用户将创意高效转化为现实。",108322,"2026-04-10T11:39:34",[14,15,13],{"id":45,"name":46,"github_repo":47,"description_zh":48,"stars":49,"difficulty_score":32,"last_commit_at":50,"category_tags":51,"status":17},6121,"gemini-cli","google-gemini\u002Fgemini-cli","gemini-cli 是一款由谷歌推出的开源 AI 命令行工具，它将强大的 Gemini 大模型能力直接集成到用户的终端环境中。对于习惯在命令行工作的开发者而言，它提供了一条从输入提示词到获取模型响应的最短路径，无需切换窗口即可享受智能辅助。\n\n这款工具主要解决了开发过程中频繁上下文切换的痛点，让用户能在熟悉的终端界面内直接完成代码理解、生成、调试以及自动化运维任务。无论是查询大型代码库、根据草图生成应用，还是执行复杂的 Git 操作，gemini-cli 都能通过自然语言指令高效处理。\n\n它特别适合广大软件工程师、DevOps 人员及技术研究人员使用。其核心亮点包括支持高达 100 万 token 的超长上下文窗口，具备出色的逻辑推理能力；内置 Google 搜索、文件操作及 Shell 命令执行等实用工具；更独特的是，它支持 MCP（模型上下文协议），允许用户灵活扩展自定义集成，连接如图像生成等外部能力。此外，个人谷歌账号即可享受免费的额度支持，且项目基于 Apache 2.0 协议完全开源，是提升终端工作效率的理想助手。",100752,"2026-04-10T01:20:03",[52,13,15,14],"插件",{"id":54,"name":55,"github_repo":56,"description_zh":57,"stars":58,"difficulty_score":32,"last_commit_at":59,"category_tags":60,"status":17},4721,"markitdown","microsoft\u002Fmarkitdown","MarkItDown 是一款由微软 AutoGen 团队打造的轻量级 Python 工具，专为将各类文件高效转换为 Markdown 格式而设计。它支持 PDF、Word、Excel、PPT、图片（含 OCR）、音频（含语音转录）、HTML 乃至 YouTube 链接等多种格式的解析，能够精准提取文档中的标题、列表、表格和链接等关键结构信息。\n\n在人工智能应用日益普及的今天，大语言模型（LLM）虽擅长处理文本，却难以直接读取复杂的二进制办公文档。MarkItDown 恰好解决了这一痛点，它将非结构化或半结构化的文件转化为模型“原生理解”且 Token 效率极高的 Markdown 格式，成为连接本地文件与 AI 分析 pipeline 的理想桥梁。此外，它还提供了 MCP（模型上下文协议）服务器，可无缝集成到 Claude Desktop 等 LLM 应用中。\n\n这款工具特别适合开发者、数据科学家及 AI 研究人员使用，尤其是那些需要构建文档检索增强生成（RAG）系统、进行批量文本分析或希望让 AI 助手直接“阅读”本地文件的用户。虽然生成的内容也具备一定可读性，但其核心优势在于为机器",93400,"2026-04-06T19:52:38",[52,14],{"id":62,"github_repo":63,"name":64,"description_en":65,"description_zh":66,"ai_summary_zh":66,"readme_en":67,"readme_zh":68,"quickstart_zh":69,"use_case_zh":70,"hero_image_url":71,"owner_login":72,"owner_name":73,"owner_avatar_url":74,"owner_bio":75,"owner_company":76,"owner_location":76,"owner_email":76,"owner_twitter":76,"owner_website":77,"owner_url":78,"languages":79,"stars":97,"forks":98,"last_commit_at":99,"license":100,"difficulty_score":32,"env_os":101,"env_gpu":101,"env_ram":101,"env_deps":102,"category_tags":112,"github_topics":114,"view_count":32,"oss_zip_url":76,"oss_zip_packed_at":76,"status":17,"created_at":134,"updated_at":135,"faqs":136,"releases":167},7349,"PacificAI\u002Flangtest","langtest","Deliver safe & effective language models","LangTest 是一款专为构建安全、高效语言模型而设计的开源测试框架。在人工智能应用日益广泛的今天，模型是否存在偏见、是否足够鲁棒、回答是否事实准确，往往难以通过传统方式全面评估。LangTest 正是为了解决这一痛点而生，它帮助开发者轻松识别模型在公平性、毒性、幻觉及特定领域（如医疗、法律）表现上的潜在风险。\n\n这款工具特别适合 AI 工程师、数据科学家以及研究人员使用。其最大的技术亮点在于极高的易用性与广泛的兼容性：用户仅需一行代码，即可自动生成并执行超过 60 种不同类型的测试用例，涵盖从基础的文本分类到复杂的大模型问答场景。LangTest 不仅支持 Spark NLP、Hugging Face 等主流 NLP 框架，还能直接对接 OpenAI、Azure 等大模型服务进行深度评估。更独特的是，它能根据测试结果自动增强训练数据，形成“测试 - 优化”的闭环，助力模型迭代。无论是希望确保产品合规的企业团队，还是致力于探索模型边界的学术研究者，LangTest 都能提供专业且直观的质量保障方案。","\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPacificAI_langtest_readme_9a89c81b71e4.png\" alt=\"pacific_ai_logo\" width=\"360\" style=\"text-align:center;\">\n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n  \u003Ch1 style=\"text-align: center; vertical-align: middle;\">LangTest: Deliver Safe & Effective Language Models\u003C\u002Fh1>\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Freleases\">\n        \u003Cimg alt=\"Release Notes\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002FPacific-AI-Corp\u002Flangtest.svg\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fwww.johnsnowlabs.com\u002Fresponsible-ai-blog\u002F\">\n        \u003Cimg alt=\"Blog\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FResponsible AI Blogs-8A2BE2\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Fdocs\u002Finstall\">\n        \u003Cimg alt=\"Documentation\" src=\"https:\u002F\u002Fimg.shields.io\u002Fwebsite?up_message=online&url=https%3A%2F%2Flangtest.org%2F\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#Pacific-AI-Corp\u002Flangtest\">\n        \u003Cimg alt=\"GitHub star chart\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPacific-AI-Corp\u002Flangtest?style=social\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fissues\">\n        \u003Cimg alt=\"Open Issues\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-raw\u002FPacific-AI-Corp\u002Flangtest\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fproject\u002Flangtest\">\n        \u003Cimg alt=\"Downloads\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPacificAI_langtest_readme_c73da921a179.png\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Factions\u002Fworkflows\u002Fbuild_and_test.yml\">\n        \u003Cimg alt=\"CI\" src=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Factions\u002Fworkflows\u002Fbuild_and_test.yml\u002Fbadge.svg\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fblob\u002Fmaster\u002FLICENSE\" alt=\"License\">\n        \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg\" \u002F>\n    \u003C\u002Fa>\n    \u003Ca href=\"CODE_OF_CONDUCT.md\">\n        \u003Cimg alt=\"Contributor Covenant\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FContributor%20Covenant-v2.0%20adopted-ff69b4.svg\">\n    \u003C\u002Fa>\n\n![Langtest Workflow](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPacificAI_langtest_readme_a72453e9a589.jpeg)\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Flangtest.org\u002F\">Project's Website\u003C\u002Fa> •\n  \u003Ca href=\"#key-features\">Key Features\u003C\u002Fa> •\n  \u003Ca href=\"#how-to-use\">How To Use\u003C\u002Fa> •\n  \u003Ca href=\"#benchmark-datasets\">Benchmark Datasets\u003C\u002Fa> •\n  \u003Ca href=\"#community-support\">Community Support\u003C\u002Fa> •\n  \u003Ca href=\"#contributing-to-langtest\">Contributing\u003C\u002Fa> •\n  \u003Ca href=\"#mission\">Mission\u003C\u002Fa> •\n  \u003Ca href=\"#license\">License\u003C\u002Fa>\n\u003C\u002Fp>\n\n## Project's Website\n\nTake a look at our official page for user documentation and examples: [langtest.org](http:\u002F\u002Flangtest.org\u002F) \n\n## Key Features\n\n- Generate and execute more than 60 distinct types of tests only with 1 line of code\n- Test all aspects of model quality: robustness, bias, representation, fairness and accuracy.​\n- Automatically augment training data based on test results (for select models)​\n- Support for popular NLP frameworks for NER, Translation and Text-Classifcation: Spark NLP, Hugging Face & Transformers.\n- Support for testing LLMS ( OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI LLMs) for question answering, toxicity, clinical-tests, legal-support, factuality, sycophancy, summarization and other popular tests. \n\n## Benchmark Datasets\n\nLangTest comes with different datasets to test your models, covering a wide range of use cases and evaluation scenarios. You can explore all the benchmark datasets available [here](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Fbenchmarks\u002Fbenchmark), each meticulously curated to challenge and enhance your language models. \nWhether you're focused on Question-Answering, text summarization etc, LangTest ensures you have the right data to push your models to their limits and achieve peak performance in diverse linguistic tasks.\n\n## How To Use\n\n```python\n# Install langtest\n!pip install langtest[transformers]\n\n# Import and create a Harness object\nfrom langtest import Harness\nh = Harness(task='ner', model={\"model\":'dslim\u002Fbert-base-NER', \"hub\":'huggingface'})\n\n# Generate test cases, run them and view a report\nh.generate().run().report()\n```\n\n> **Note**\n> For more extended examples of usage and documentation, head over to [langtest.org](https:\u002F\u002Fwww.langtest.org)\n\n## Responsible Ai Blogs\n\nYou can check out the following LangTest articles:\n\n| Blog | Description |\n|------|-------------|\n| [**Automatically Testing for Demographic Bias in Clinical Treatment Plans Generated by Large Language Models**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fautomatically-testing-for-demographic-bias-in-clinical-treatment-plans-generated-by-large-language-ffcf358b6092) | Helps in understanding and testing demographic bias in clinical treatment plans generated by LLM. |\n| [**LangTest: Unveiling & Fixing Biases with End-to-End NLP Pipelines**](https:\u002F\u002Fwww.johnsnowlabs.com\u002Flangtest-unveiling-fixing-biases-with-end-to-end-nlp-pipelines\u002F) | The end-to-end language pipeline in LangTest empowers NLP practitioners to tackle biases in language models with a comprehensive, data-driven, and iterative approach. |\n| [**Beyond Accuracy: Robustness Testing of Named Entity Recognition Models with LangTest**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fbeyond-accuracy-robustness-testing-of-named-entity-recognition-models-with-langtest-fb046ace7eb9) | While accuracy is undoubtedly crucial, robustness testing takes natural language processing (NLP) models evaluation to the next level by ensuring that models can perform reliably and consistently across a wide array of real-world conditions. |\n| [**Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Felevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance-71aa7812c699) | In this article, we discuss how automated data augmentation may supercharge your NLP models and improve their performance and how we do that using  LangTest. |\n| [**Mitigating Gender-Occupational Stereotypes in AI: Evaluating Models with the Wino Bias Test through Langtest Library**](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fmitigating-gender-occupational-stereotypes-in-ai-evaluating-language-models-with-the-wino-bias-test-through-the-langtest-library\u002F) | In this article, we discuss how we can test the \"Wino Bias” using LangTest. It specifically refers to testing biases arising from gender-occupational stereotypes. |\n| [**Automating Responsible AI: Integrating Hugging Face and LangTest for More Robust Models**](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fautomating-responsible-ai-integrating-hugging-face-and-langtest-for-more-robust-models\u002F) | In this article, we have explored the integration between Hugging Face, your go-to source for state-of-the-art NLP models and datasets, and LangTest, your NLP pipeline’s secret weapon for testing and optimization. |\n| [**Detecting and Evaluating Sycophancy Bias: An Analysis of LLM and AI Solutions**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fdetecting-and-evaluating-sycophancy-bias-an-analysis-of-llm-and-ai-solutions-ce7c93acb5db) | In this blog post, we discuss the pervasive issue of sycophantic AI behavior and the challenges it presents in the world of artificial intelligence. We explore how language models sometimes prioritize agreement over authenticity, hindering meaningful and unbiased conversations. Furthermore, we unveil a potential game-changing solution to this problem, synthetic data, which promises to revolutionize the way AI companions engage in discussions, making them more reliable and accurate across various real-world conditions. |\n| [**Unmasking Language Model Sensitivity in Negation and Toxicity Evaluations**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Funmasking-language-model-sensitivity-in-negation-and-toxicity-evaluations-f835cdc9cabf) | In this blog post, we delve into Language Model Sensitivity, examining how models handle negations and toxicity in language. Through these tests, we gain insights into the models' adaptability and responsiveness, emphasizing the continuous need for improvement in NLP models. |\n| [**Unveiling Bias in Language Models: Gender, Race, Disability, and Socioeconomic Perspectives**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Funveiling-bias-in-language-models-gender-race-disability-and-socioeconomic-perspectives-af0206ed0feb) | In this blog post, we explore bias in Language Models, focusing on gender, race, disability, and socioeconomic factors. We assess this bias using the CrowS-Pairs dataset, designed to measure stereotypical biases. To address these biases, we discuss the importance of tools like LangTest in promoting fairness in NLP systems. |\n| [**Unmasking the Biases Within AI: How Gender, Ethnicity, Religion, and Economics Shape NLP and Beyond**](https:\u002F\u002Fmedium.com\u002F@chakravarthik27\u002Fcf69c203f52c) | In this blog post, we tackle AI bias on how Gender, Ethnicity, Religion, and Economics Shape NLP systems. We discussed strategies for reducing bias and promoting fairness in AI systems. |\n| [**Evaluating Large Language Models on Gender-Occupational Stereotypes Using the Wino Bias Test**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fevaluating-large-language-models-on-gender-occupational-stereotypes-using-the-wino-bias-test-2a96619b4960) | In this blog post, we dive into testing the WinoBias dataset on LLMs, examining language models’ handling of gender and occupational roles, evaluation metrics, and the wider implications. Let’s explore the evaluation of language models with LangTest on the WinoBias dataset and confront the challenges of addressing bias in AI. |\n| [**Streamlining ML Workflows: Integrating MLFlow Tracking with LangTest for Enhanced Model Evaluations**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fstreamlining-ml-workflows-integrating-mlflow-tracking-with-langtest-for-enhanced-model-evaluations-4ce9863a0ff1) | In this blog post, we dive into the growing need for transparent, systematic, and comprehensive tracking of models. Enter MLFlow and LangTest: two tools that, when combined, create a revolutionary approach to ML development. |\n| [**Testing the Question Answering Capabilities of Large Language Models**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Ftesting-the-question-answering-capabilities-of-large-language-models-1bc424d61740) | In this blog post, we dive into enhancing the QA evaluation capabilities using LangTest library. Explore about different evaluation methods that LangTest offers to address the complexities of evaluating Question Answering (QA) tasks. |\n| [**Evaluating Stereotype Bias with LangTest**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fevaluating-stereotype-bias-with-langtest-8286af8f0f22) | In this blog post, we are focusing on using the StereoSet dataset to assess bias related to gender, profession, and race.|\n| [**Testing the Robustness of LSTM-Based Sentiment Analysis Models**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Ftesting-the-robustness-of-lstm-based-sentiment-analysis-models-67ed84e42997) | Explore the robustness of custom models with LangTest Insights.|\n| [**LangTest Insights: A Deep Dive into LLM Robustness on OpenBookQA**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Flangtest-insights-a-deep-dive-into-llm-robustness-on-openbookqa-ab0ddcbd2ab1) | Explore the robustness of Language Models (LLMs) on the OpenBookQA dataset with LangTest Insights.|\n| [**LangTest: A Secret Weapon for Improving the Robustness of Your Transformers Language Models**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Flangtest-a-secret-weapon-for-improving-the-robustness-of-your-transformers-language-models-9693d64256cc) | Explore the robustness of Transformers Language Models with LangTest Insights.|\n| [**Mastering Model Evaluation: Introducing the Comprehensive Ranking & Leaderboard System in LangTest**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fmastering-model-evaluation-introducing-the-comprehensive-ranking-leaderboard-system-in-langtest-5242927754bb) | The Model Ranking & Leaderboard system by John Snow Labs' LangTest offers a systematic approach to evaluating AI models with comprehensive ranking, historical comparisons, and dataset-specific insights, empowering researchers and data scientists to make data-driven decisions on model performance. |\n| [**Evaluating Long-Form Responses with Prometheus-Eval and Langtest**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fevaluating-long-form-responses-with-prometheus-eval-and-langtest-a8279355362e) | Prometheus-Eval and LangTest unite to offer an open-source, reliable, and cost-effective solution for evaluating long-form responses, combining Prometheus's GPT-4-level performance and LangTest's robust testing framework to provide detailed, interpretable feedback and high accuracy in assessments. |\n| [**Ensuring Precision of LLMs in Medical Domain: The Challenge of Drug Name Swapping**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fensuring-precision-of-llms-in-medical-domain-the-challenge-of-drug-name-swapping-d7f4c83d55fd) | Accurate drug name identification is crucial for patient safety. Testing GPT-4o with LangTest's **_drug_generic_to_brand_** conversion test revealed potential errors in predicting drug names when brand names are replaced by ingredients, highlighting the need for ongoing refinement and rigorous testing to ensure medical LLM accuracy and reliability. |\n\n\n\n\n\n\n\n> **Note**\n> To check all blogs, head over to [Blogs](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fresponsible-ai-blog\u002F)\n\n## Community Support\n\n- [Slack](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fslack-redirect\u002F) For live discussion with the LangTest community, join the `#langtest` channel\n- [GitHub](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Ftree\u002Fmain) For bug reports, feature requests, and contributions\n- [Discussions](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fdiscussions) To engage with other community members, share ideas, and show off how you use LangTest!\n\n## Mission\n\nWhile there is a lot of talk about the need to train AI models that are safe, robust, and fair - few tools have been made available to data scientists to meet these goals. As a result, the front line of NLP models in production systems reflects a sorry state of affairs. \n\nWe propose here an early stage open-source community project that aims to fill this gap, and would love for you to join us on this mission. We aim to build on the foundation laid by previous research such as [Ribeiro et al. (2020)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.04118), [Song et al. (2020)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.00053), [Parrish et al. (2021)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.08193), [van Aken et al. (2021)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.15512) and many others. \n\n[John Snow Labs](www.johnsnowlabs.com) has a full development team allocated to the project and is committed to improving the library for years, as we do with other open-source libraries. Expect frequent releases with new test types, tasks, languages, and platforms to be added regularly. We look forward to working together to make safe, reliable, and responsible NLP an everyday reality. \n\n\n> **Note**\n> For usage and documentation, head over to [langtest.org](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Fdocs\u002Fdata#question-answering)\n\n\n## Contributing to LangTest\n\nWe welcome all sorts of contributions:\n\n- [Ideas](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fdiscussions\u002Fcategories\u002Fideas)\n- [Discussions](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fdiscussions)\n- [Feedback](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fdiscussions\u002Fcategories\u002Fgeneral)\n- [Documentation](https:\u002F\u002Fwww.example.com\u002Fdocumentation)\n- [Bug reports](https:\u002F\u002Fwww.example.com\u002Fbug-reports)\n\nA detailed overview of contributing can be found in the **[contributing guide](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fblob\u002Fmain\u002FCONTRIBUTING.md)**.\n\nIf you are looking to start working with the LangTest codebase, navigate to the GitHub [\"issues\"](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fissues) tab and start looking through interesting issues. There are a number of issues listed under where you could start out.\nOr maybe through using LangTest you have an idea of your own or are looking for something in the documentation and thinking ‘This can be improved’...you can do something about it!\n\nFeel free to ask questions on the [Q&A](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fdiscussions\u002Fcategories\u002Fq-a) discussions.\n\nAs contributors and maintainers to this project, you are expected to abide by LangTest's code of conduct. More information can be found at: [Contributor Code of Conduct](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fblob\u002Frelease\u002F1.8.0\u002FCODE_OF_CONDUCT.md)\n\n\n## Citation\n\nWe have published a [paper](https:\u002F\u002Fwww.sciencedirect.com\u002Fscience\u002Farticle\u002Fpii\u002FS2665963824000071) that you can cite for\nthe LangTest library:\n\n```bibtex\n@article{nazir2024langtest,\n  title={LangTest: A comprehensive evaluation library for custom LLM and NLP models},\n  author={Arshaan Nazir, Thadaka Kalyan Chakravarthy, David Amore Cecchini, Rakshit Khajuria, Prikshit Sharma, Ali Tarik Mirik, Veysel Kocaman and David Talby},\n  journal={Software Impacts},\n  pages={100619},\n  year={2024},\n  publisher={Elsevier}\n}\n```\n\n\n## Contributors\n\nWe would like to acknowledge all contributors of this open-source community project. \n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPacificAI_langtest_readme_38e448a46f1b.png\" \u002F>\n\u003C\u002Fa>\n\n## License\n\nLangTest is released under the [Apache License 2.0](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fblob\u002Fmain\u002FLICENSE), which guarantees commercial use, modification, distribution, patent use, private use and sets limitations on trademark use, liability and warranty.\n\n","\u003Cp align=\"center\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPacificAI_langtest_readme_9a89c81b71e4.png\" alt=\"pacific_ai_logo\" width=\"360\" style=\"text-align:center;\">\n\u003C\u002Fp>\n\n\u003Cdiv align=\"center\">\n  \u003Ch1 style=\"text-align: center; vertical-align: middle;\">LangTest：交付安全高效的语言模型\u003C\u002Fh1>\n\u003C\u002Fdiv>\n\n\u003Cp align=\"center\">\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Freleases\">\n        \u003Cimg alt=\"发布说明\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fv\u002Frelease\u002FPacific-AI-Corp\u002Flangtest.svg\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fwww.johnsnowlabs.com\u002Fresponsible-ai-blog\u002F\">\n        \u003Cimg alt=\"博客\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F负责任的人工智能博客-8A2BE2\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Fdocs\u002Finstall\">\n        \u003Cimg alt=\"文档\" src=\"https:\u002F\u002Fimg.shields.io\u002Fwebsite?up_message=online&url=https%3A%2F%2Flangtest.org%2F\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fstar-history.com\u002F#Pacific-AI-Corp\u002Flangtest\">\n        \u003Cimg alt=\"GitHub 星级图\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fstars\u002FPacific-AI-Corp\u002Flangtest?style=social\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fissues\">\n        \u003Cimg alt=\"未解决的问题\" src=\"https:\u002F\u002Fimg.shields.io\u002Fgithub\u002Fissues-raw\u002FPacific-AI-Corp\u002Flangtest\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fpepy.tech\u002Fproject\u002Flangtest\">\n        \u003Cimg alt=\"下载量\" src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPacificAI_langtest_readme_c73da921a179.png\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Factions\u002Fworkflows\u002Fbuild_and_test.yml\">\n        \u003Cimg alt=\"持续集成\" src=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Factions\u002Fworkflows\u002Fbuild_and_test.yml\u002Fbadge.svg\">\n    \u003C\u002Fa>\n    \u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fblob\u002Fmaster\u002FLICENSE\" alt=\"许可证\">\n        \u003Cimg src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002FLicense-Apache%202.0-blue.svg\" \u002F>\n    \u003C\u002Fa>\n    \u003Ca href=\"CODE_OF_CONDUCT.md\">\n        \u003Cimg alt=\"贡献者公约\" src=\"https:\u002F\u002Fimg.shields.io\u002Fbadge\u002F贡献者公约-v2.0已采纳-ff69b4.svg\">\n    \u003C\u002Fa>\n\n![LangTest 工作流](https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPacificAI_langtest_readme_a72453e9a589.jpeg)\n\n\u003Cp align=\"center\">\n  \u003Ca href=\"https:\u002F\u002Flangtest.org\u002F\">项目官网\u003C\u002Fa> •\n  \u003Ca href=\"#key-features\">核心功能\u003C\u002Fa> •\n  \u003Ca href=\"#how-to-use\">使用方法\u003C\u002Fa> •\n  \u003Ca href=\"#benchmark-datasets\">基准数据集\u003C\u002Fa> •\n  \u003Ca href=\"#community-support\">社区支持\u003C\u002Fa> •\n  \u003Ca href=\"#contributing-to-langtest\">贡献\u003C\u002Fa> •\n  \u003Ca href=\"#mission\">使命\u003C\u002Fa> •\n  \u003Ca href=\"#license\">许可证\u003C\u002Fa>\n\u003C\u002Fp>\n\n## 项目官网\n\n请访问我们的官方页面，获取用户文档和示例：[langtest.org](http:\u002F\u002Flangtest.org\u002F) \n\n## 核心功能\n\n- 仅需一行代码即可生成并执行超过60种不同类型的测试\n- 测试模型质量的各个方面：鲁棒性、偏差、表征能力、公平性和准确性。\n- 根据测试结果自动扩充训练数据（适用于部分模型）\n- 支持用于命名实体识别、翻译和文本分类的主流 NLP 框架：Spark NLP、Hugging Face 和 Transformers。\n- 支持对大型语言模型（OpenAI、Cohere、AI21、Hugging Face 推理 API 以及 Azure-OpenAI LLM）进行问答、毒性检测、临床测试、法律支持、事实性、阿谀奉承、摘要生成等常见测试。\n\n## 基准数据集\n\nLangTest 提供多种数据集来测试您的模型，覆盖了广泛的用例和评估场景。您可以在 [这里](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Fbenchmarks\u002Fbenchmark) 查看所有可用的基准数据集，每一份数据集都经过精心策划，旨在挑战并提升您的语言模型性能。\n无论您关注的是问答任务、文本摘要等，LangTest 都能为您提供合适的数据，帮助您将模型推向极限，在各种语言任务中实现最佳表现。\n\n## 使用方法\n\n```python\n# 安装 langtest\n!pip install langtest[transformers]\n\n# 导入并创建 Harness 对象\nfrom langtest import Harness\nh = Harness(task='ner', model={\"model\":'dslim\u002Fbert-base-NER', \"hub\":'huggingface'})\n\n# 生成测试用例、运行并查看报告\nh.generate().run().report()\n```\n\n> **注意**\n> 如需更多详细的使用示例和文档，请访问 [langtest.org](https:\u002F\u002Fwww.langtest.org)\n\n## 负责任的人工智能博客\n\n您可以查看以下 LangTest 相关文章：\n\n| 博客 | 描述 |\n|------|-------------|\n| [**自动检测大型语言模型生成的临床治疗方案中的人口统计学偏见**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fautomatically-testing-for-demographic-bias-in-clinical-treatment-plans-generated-by-large-language-ffcf358b6092) | 帮助理解和测试由大型语言模型生成的临床治疗方案中的人口统计学偏见。 |\n| [**LangTest：通过端到端NLP流水线揭示并修复偏见**](https:\u002F\u002Fwww.johnsnowlabs.com\u002Flangtest-unveiling-fixing-biases-with-end-to-end-nlp-pipelines\u002F) | LangTest中的端到端语言流水线使NLP从业者能够以全面、数据驱动和迭代的方式应对语言模型中的偏见。 |\n| [**超越准确率：使用LangTest对命名实体识别模型进行鲁棒性测试**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fbeyond-accuracy-robustness-testing-of-named-entity-recognition-models-with-langtest-fb046ace7eb9) | 虽然准确率无疑至关重要，但鲁棒性测试将自然语言处理（NLP）模型的评估提升到了一个新的水平，确保模型在各种现实条件下都能可靠且一致地运行。 |\n| [**通过自动化数据增强提升NLP模型性能**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Felevate-your-nlp-models-with-automated-data-augmentation-for-enhanced-performance-71aa7812c699) | 在本文中，我们讨论了自动化数据增强如何为您的NLP模型提供强大助力并提升其性能，以及我们如何利用LangTest实现这一点。 |\n| [**缓解AI中的性别与职业刻板印象：通过LangTest库使用Wino偏见测试评估模型**](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fmitigating-gender-occupational-stereotypes-in-ai-evaluating-language-models-with-the-wino-bias-test-through-the-langtest-library\u002F) | 在本文中，我们探讨了如何使用LangTest测试“Wino偏见”。这特指测试源于性别与职业刻板印象的偏见。 |\n| [**自动化负责任的AI：将Hugging Face与LangTest集成以构建更稳健的模型**](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fautomating-responsible-ai-integrating-hugging-face-and-langtest-for-more-robust-models\u002F) | 在本文中，我们探讨了Hugging Face——您获取最先进NLP模型和数据集的首选来源——与LangTest——您NLP流水线用于测试和优化的秘密武器——之间的集成。 |\n| [**检测和评估溜须拍马偏见：对LLM和AI解决方案的分析**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fdetecting-and-evaluating-sycophancy-bias-an-analysis-of-llm-and-ai-solutions-ce7c93acb5db) | 在这篇博客文章中，我们讨论了溜须拍马式AI行为这一普遍存在的问题及其在人工智能领域带来的挑战。我们探讨了语言模型有时会优先考虑附和而非真实性，从而阻碍有意义且无偏见的对话。此外，我们还揭示了一种可能改变游戏规则的解决方案——合成数据，它有望彻底改变AI伴侣参与讨论的方式，使其在各种现实条件下更加可靠和准确。 |\n| [**揭示语言模型在否定和毒性评估中的敏感性**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Funmasking-language-model-sensitivity-in-negation-and-toxicity-evaluations-f835cdc9cabf) | 在这篇博客文章中，我们深入探讨了语言模型的敏感性，考察模型如何处理语言中的否定和毒性。通过这些测试，我们得以深入了解模型的适应性和响应能力，强调NLP模型持续改进的必要性。 |\n| [**揭示语言模型中的偏见：性别、种族、残疾及社会经济视角**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Funveiling-bias-in-language-models-gender-race-disability-and-socioeconomic-perspectives-af0206ed0feb) | 在这篇博客文章中，我们探讨了语言模型中的偏见，重点关注性别、种族、残疾和社会经济因素。我们使用专为衡量刻板印象偏见而设计的CrowS-Pairs数据集来评估这些偏见。为了解决这些偏见，我们讨论了像LangTest这样的工具在促进NLP系统公平性方面的重要性。 |\n| [**揭露AI内部的偏见：性别、种族、宗教和经济如何塑造NLP及其他领域**](https:\u002F\u002Fmedium.com\u002F@chakravarthik27\u002Fcf69c203f52c) | 在这篇博客文章中，我们探讨了AI中的偏见，特别是性别、种族、宗教和经济因素如何影响NLP系统。我们讨论了减少偏见、促进AI系统公平性的策略。 |\n| [**使用Wino偏见测试评估大型语言模型在性别与职业刻板印象方面的表现**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fevaluating-large-language-models-on-gender-occupational-stereotypes-using-the-wino-bias-test-2a96619b4960) | 在这篇博客文章中，我们深入研究了在LLM上测试WinoBias数据集的情况，考察语言模型对性别和职业角色的处理方式、评估指标以及更广泛的影响。让我们借助LangTest在WinoBias数据集上评估语言模型，并直面解决AI中偏见的挑战。 |\n| [**简化机器学习工作流：将MLFlow跟踪与LangTest集成以增强模型评估**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fstreamlining-ml-workflows-integrating-mlflow-tracking-with-langtest-for-enhanced-model-evaluations-4ce9863a0ff1) | 在这篇博客文章中，我们深入探讨了对模型进行透明、系统化和全面跟踪日益增长的需求。这时，MLFlow和LangTest便派上了用场：这两款工具结合使用，为机器学习开发开创了一种革命性的方法。 |\n| [**测试大型语言模型的问题回答能力**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Ftesting-the-question-answering-capabilities-of-large-language-models-1bc424d61740) | 在这篇博客文章中，我们深入探讨了如何利用LangTest库提升QA评估能力。探索LangTest提供的不同评估方法，以应对评估问答（QA）任务时的复杂性。 |\n| [**使用LangTest评估刻板印象偏见**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fevaluating-stereotype-bias-with-langtest-8286af8f0f22) | 在这篇博客文章中，我们重点介绍了如何使用StereoSet数据集来评估与性别、职业和种族相关的偏见。 |\n| [**测试基于LSTM的情感分析模型的鲁棒性**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Ftesting-the-robustness-of-lstm-based-sentiment-analysis-models-67ed84e42997) | 使用LangTest Insights探索自定义模型的鲁棒性。 |\n| [**LangTest Insights：深入探究LLM在OpenBookQA上的鲁棒性**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Flangtest-insights-a-deep-dive-into-llm-robustness-on-openbookqa-ab0ddcbd2ab1) | 使用LangTest Insights探索语言模型（LLM）在OpenBookQA数据集上的鲁棒性。 |\n| [**LangTest：提升您的Transformer语言模型鲁棒性的秘密武器**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Flangtest-a-secret-weapon-for-improving-the-robustness-of-your-transformers-language-models-9693d64256cc) | 使用LangTest Insights探索Transformer语言模型的鲁棒性。 |\n| [**掌握模型评估：推出LangTest中的综合排名与排行榜系统**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fmastering-model-evaluation-introducing-the-comprehensive-ranking-leaderboard-system-in-langtest-5242927754bb) | John Snow Labs的LangTest推出的模型排名与排行榜系统，提供了一种系统化的AI模型评估方法，包含全面的排名、历史对比以及特定数据集的洞察，从而帮助研究人员和数据科学家基于数据做出关于模型性能的明智决策。 |\n| [**使用Prometheus-Eval和LangTest评估长篇回复**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fevaluating-long-form-responses-with-prometheus-eval-and-langtest-a8279355362e) | Prometheus-Eval和LangTest联手，为评估长篇回复提供了一个开源、可靠且经济高效的解决方案。该方案结合了Prometheus的GPT-4级性能和LangTest的稳健测试框架，能够提供详细、可解释的反馈，并在评估中达到高精度。 |\n| [**确保LLM在医疗领域的精确性：药物名称混淆的挑战**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fensuring-precision-of-llms-in-medical-domain-the-challenge-of-drug-name-swapping-d7f4c83d55fd) | 准确识别药物名称对于患者安全至关重要。使用LangTest的**_drug_generic_to_brand_**转换测试对GPT-4o进行测试后发现，在用成分替代商品名时，模型在预测药物名称方面可能存在误差，这凸显了持续优化和严格测试的必要性，以确保医疗领域LLM的准确性和可靠性。 |\n\n> **注意**\n> 若要查看所有博客，请前往 [博客](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fresponsible-ai-blog\u002F)\n\n\n\n## 社区支持\n\n- [Slack](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fslack-redirect\u002F) 如需与 LangTest 社区进行实时讨论，请加入 `#langtest` 频道。\n- [GitHub](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Ftree\u002Fmain) 如有错误报告、功能请求或贡献需求，请访问此仓库。\n- [Discussions](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fdiscussions) 您可以在此与其他社区成员互动、分享想法，并展示您如何使用 LangTest！\n\n## 使命\n\n尽管关于训练安全、稳健且公平的人工智能模型的需求讨论颇多，但目前可供数据科学家实现这些目标的工具却寥寥无几。因此，在生产系统中部署的自然语言处理模型现状令人堪忧。\n\n我们在此提出一个处于早期阶段的开源社区项目，旨在填补这一空白，并诚挚邀请您加入我们的行列。我们将以先前的研究成果为基础，例如 [Ribeiro 等人 (2020)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.04118)、[Song 等人 (2020)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.00053)、[Parrish 等人 (2021)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.08193)、[van Aken 等人 (2021)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.15512) 等众多研究。 \n\n[John Snow Labs](www.johnsnowlabs.com) 已为该项目配备了完整的开发团队，并承诺像对待其他开源库一样，长期致力于该库的改进。未来我们将定期发布新版本，不断增加新的测试类型、任务、语言和平台。我们期待与您携手合作，共同推动安全、可靠、负责任的自然语言处理技术成为日常现实。\n\n\n> **注意**\n> 如需了解使用方法及文档，请访问 [langtest.org](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Fdocs\u002Fdata#question-answering)\n\n\n## 参与 LangTest 的贡献\n\n我们欢迎各种形式的贡献：\n\n- [创意](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fdiscussions\u002Fcategories\u002Fideas)\n- [讨论](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fdiscussions)\n- [反馈](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fdiscussions\u002Fcategories\u002Fgeneral)\n- [文档](https:\u002F\u002Fwww.example.com\u002Fdocumentation)\n- [错误报告](https:\u002F\u002Fwww.example.com\u002Fbug-reports)\n\n有关贡献的详细说明，请参阅 **[贡献指南](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fblob\u002Fmain\u002FCONTRIBUTING.md)**。\n\n如果您希望开始参与 LangTest 代码库的工作，可以前往 GitHub 的“问题”标签页，浏览一些有趣的议题。那里列出了许多您可以着手解决的问题。或者，您在使用 LangTest 时可能有了自己的想法，又或是发现文档中有需要改进之处……那么，您完全可以通过实际行动来推动改进！\n\n如有任何疑问，欢迎在 [问答](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fdiscussions\u002Fcategories\u002Fq-a) 讨论区提出。\n\n作为本项目的贡献者和维护者，您应遵守 LangTest 的行为准则。更多信息请参见：[贡献者行为准则](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fblob\u002Frelease\u002F1.8.0\u002FCODE_OF_CONDUCT.md)。\n\n\n## 引用\n\n我们已发表了一篇关于 LangTest 库的论文，您可以将其引用如下：\n\n```bibtex\n@article{nazir2024langtest,\n  title={LangTest: A comprehensive evaluation library for custom LLM and NLP models},\n  author={Arshaan Nazir, Thadaka Kalyan Chakravarthy, David Amore Cecchini, Rakshit Khajuria, Prikshit Sharma, Ali Tarik Mirik, Veysel Kocaman and David Talby},\n  journal={Software Impacts},\n  pages={100619},\n  year={2024},\n  publisher={Elsevier}\n}\n```\n\n\n## 贡献者\n\n我们谨向本开源社区项目的全体贡献者致以诚挚的感谢。\n\n\u003Ca href=\"https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fgraphs\u002Fcontributors\">\n  \u003Cimg src=\"https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPacificAI_langtest_readme_38e448a46f1b.png\" \u002F>\n\u003C\u002Fa>\n\n## 许可证\n\nLangTest 采用 [Apache License 2.0](https:\u002F\u002Fgithub.com\u002FPacific-AI-Corp\u002Flangtest\u002Fblob\u002Fmain\u002FLICENSE) 许可证发布，该许可证保障了商业使用、修改、分发、专利使用、私人使用等权利，并对商标使用、责任和担保作出了限制。","# LangTest 快速上手指南\n\nLangTest 是一款专为语言模型（LLM）和传统 NLP 模型设计的开源测试框架，旨在通过一行代码生成并执行超过 60 种测试用例，全面评估模型的鲁棒性、偏见、公平性及准确性。\n\n## 环境准备\n\n在开始之前，请确保您的开发环境满足以下要求：\n\n*   **操作系统**：Linux, macOS 或 Windows\n*   **Python 版本**：建议 Python 3.8 及以上版本\n*   **前置依赖**：\n    *   若测试 Hugging Face 模型，需安装 `transformers` 相关库。\n    *   若测试 Spark NLP 模型，需配置 Spark 环境。\n    *   若测试云端 LLM（如 OpenAI, Azure），需准备好相应的 API Key。\n\n## 安装步骤\n\n推荐使用 pip 进行安装。根据您的使用场景，可以选择安装基础版或包含特定框架支持的版本。\n\n**安装支持 Transformers 的版本（推荐）：**\n```bash\npip install langtest[transformers]\n```\n\n**安装完整功能版（包含所有依赖）：**\n```bash\npip install langtest[all]\n```\n\n> **提示**：国内用户若遇到下载速度慢的问题，可使用清华源或阿里源加速安装：\n> ```bash\n> pip install langtest[transformers] -i https:\u002F\u002Fpypi.tuna.tsinghua.edu.cn\u002Fsimple\n> ```\n\n## 基本使用\n\nLangTest 的核心设计理念是“一行代码完成测试”。以下是一个针对命名实体识别（NER）模型的最简使用示例：\n\n1.  **导入并创建 Harness 对象**：指定任务类型（如 `ner`）和模型来源。\n2.  **执行测试流程**：链式调用 `generate()`（生成测试用例）、`run()`（运行测试）和 `report()`（查看报告）。\n\n```python\n# 导入 Harness 类\nfrom langtest import Harness\n\n# 创建 Harness 对象\n# task: 任务类型 (例如: 'ner', 'question-answering', 'text-classification')\n# model: 模型配置 (支持 huggingface, spark-nlp 等)\nh = Harness(task='ner', model={\"model\": 'dslim\u002Fbert-base-NER', \"hub\": 'huggingface'})\n\n# 生成测试用例、运行测试并输出报告\nh.generate().run().report()\n```\n\n执行上述代码后，LangTest 将自动：\n*   生成涵盖鲁棒性、偏见等维度的测试数据集。\n*   对指定模型进行推理测试。\n*   输出详细的评估报告，展示模型在不同场景下的表现。\n\n更多高级用法（如自定义测试类型、连接商业大模型 API、数据增强等），请访问官方文档 [langtest.org](https:\u002F\u002Flangtest.org)。","某金融科技公司正在开发一款基于大语言模型的智能客服系统，用于自动回答用户关于贷款政策和账户安全的咨询。\n\n### 没有 langtest 时\n- 测试覆盖率低：团队只能手动编写少量常规问题，难以覆盖方言、拼写错误或恶意诱导等长尾场景，导致模型上线后频繁“答非所问”。\n- 安全隐患难发现：缺乏自动化手段检测模型是否会产生歧视性言论或泄露敏感信息，合规风险完全依赖人工抽检，效率极低且容易漏测。\n- 迭代信心不足：每次微调模型后，无法快速量化评估新版本的鲁棒性和公平性变化，开发人员不敢轻易部署更新，严重拖慢产品迭代速度。\n\n### 使用 langtest 后\n- 测试全面自动化：仅需一行代码即可生成并执行超过 60 种测试（如噪声注入、对抗攻击），轻松覆盖各种极端输入，显著提升了模型在复杂场景下的稳定性。\n- 风险主动拦截：langtest 自动对模型进行毒性、偏见及事实性核查，提前识别出潜在的违规回复，确保金融客服符合严格的行业合规标准。\n- 数据驱动优化：根据测试结果自动增强训练数据，并直观对比不同版本模型的性能指标，让团队能自信地快速发布更安全、更准确的模型版本。\n\nlangtest 将原本耗时数周的人工安全评估压缩为分钟级的自动化流程，成为保障金融级 AI 应用安全落地的核心防线。","https:\u002F\u002Foss.gittoolsai.com\u002Fimages\u002FPacificAI_langtest_9a89c81b.png","PacificAI","Pacific AI","https:\u002F\u002Foss.gittoolsai.com\u002Favatars\u002FPacificAI_10f91095.jpg","Accelerating AI Governance for Healthcare",null,"www.pacific.ai","https:\u002F\u002Fgithub.com\u002FPacificAI",[80,84,88,91,94],{"name":81,"color":82,"percentage":83},"Python","#3572A5",100,{"name":85,"color":86,"percentage":87},"Makefile","#427819",0,{"name":89,"color":90,"percentage":87},"Batchfile","#C1F12E",{"name":92,"color":93,"percentage":87},"Shell","#89e051",{"name":95,"color":96,"percentage":87},"CSS","#663399",555,49,"2026-04-12T15:17:03","Apache-2.0","未说明",{"notes":103,"python":101,"dependencies":104},"README 中未明确列出具体的操作系统、GPU、内存及 Python 版本要求。该工具主要通过 pip 安装（如：pip install langtest[transformers]），支持多种 NLP 框架（Spark NLP, Hugging Face）及大模型 API（OpenAI, Cohere 等）。具体环境依赖可能根据所选用的后端模型（如本地 BERT 模型或云端 LLM）而有所不同，建议参考官方文档 langtest.org 获取详细安装指南。",[64,105,106,107,108,109,110,111],"transformers","Spark NLP","Hugging Face","OpenAI API","Cohere API","AI21 API","Azure OpenAI",[113,14,35],"其他",[115,116,117,118,119,120,121,122,123,124,125,126,127,128,129,130,131,132,133],"benchmarks","ethics-in-ai","large-language-models","ml-safety","ml-testing","mlops","model-assessment","nlp","responsible-ai","llm-test","ai-safety","ai-testing","artificial-intelligence","benchmark-framework","llm","llm-as-evaluator","llm-evaluation-toolkit","llm-testing","trustworthy-ai","2026-03-27T02:49:30.150509","2026-04-14T12:35:40.491577",[137,142,147,152,157,162],{"id":138,"question_zh":139,"answer_zh":140,"source_url":141},32995,"如何通过 Conda 安装 nlptest？","nlptest 现已支持通过 conda-forge 渠道安装。您可以使用以下命令进行搜索和安装：\n1. 搜索包：`conda search -c conda-forge nlptest`\n2. 安装包：`conda install -c conda-forge nlptest`\n该包已在 anaconda.org 上可用。","https:\u002F\u002Fgithub.com\u002FPacificAI\u002Flangtest\u002Fissues\u002F318",{"id":143,"question_zh":144,"answer_zh":145,"source_url":146},32996,"如何使用 Harness 进行数据增强并更新训练集？","可以通过以下步骤利用测试结果增强训练数据：\n1. 创建 Harness 并生成报告：\n   ```python\n   harness = Harness(task='ner', model='ner', data='test.conll', hub='johnsnowlabs')\n   tests = harness.generate().run().report()\n   ```\n2. 基于测试结果增强训练集（默认原地修改，也可指定输出路径）：\n   ```python\n   tests.augment(input_path='train.conll', output_path='augmented.conll', inplace=True)\n   ```\n3. 查看增强报告：`tests.augmentation_report()`\n4. 使用增强后的数据集重新训练模型。\n注意：增强后的数据文件格式将与输入文件格式保持一致。","https:\u002F\u002Fgithub.com\u002FPacificAI\u002Flangtest\u002Fissues\u002F92",{"id":148,"question_zh":149,"answer_zh":150,"source_url":151},32997,"如何在初始化 Harness 时通过字符串指定模型和后端的来源？","您可以在 `Harness` 的 `model` 参数中使用字符串格式来指定库和模型路径，而无需传递管道对象。格式为 `库名\u002F模型路径`。\n示例：\n- Transformers: `Harness(model='transformers\u002Fdslim\u002Fbert-base-NER')`\n- JohnSnowLabs: `Harness(model='johnsnowlabs\u002Fnerdl_conll03')`\n- SpaCy: `Harness(model='spacy\u002Fen_core_web_sm')`\n字符串的前缀（如 `spacy\u002F`, `transformers\u002F`）用于定义用户想要使用的具体库，支持加载在线模型和本地磁盘模型。","https:\u002F\u002Fgithub.com\u002FPacificAI\u002Flangtest\u002Fissues\u002F89",{"id":153,"question_zh":154,"answer_zh":155,"source_url":156},32998,"LangTest 支持哪些 LLM 评估指标？","除了默认的 ROUGE 等指标外，LangTest 探索并支持多种评估指标：\n- BERTScore: 计算参考文本和结果文本 token 之间的成对余弦相似度。\n- WER (词错误率): 通过计算插入、删除、替换的数量来衡量编辑距离。\n- MoverScore: 比较源文本和目标文本之间 token 移动的相似度。\n- Entailment Score (蕴含分数): 用于确保文本摘要等任务的忠实度。\n- G-Eval: 让模型直接对各个方面进行 0-5 分打分（可能存在偏差）。\n- QG-QA: 使用模型根据上下文生成问题并进行问答评估。\n部分指标（如蕴含分数）可配置用于特定任务（如摘要）。","https:\u002F\u002Fgithub.com\u002FPacificAI\u002Flangtest\u002Fissues\u002F590",{"id":158,"question_zh":159,"answer_zh":160,"source_url":161},32999,"如何为自定义模型选择后端库（Transformers 或 SpaCy）？","虽然默认模型可以自动切换后端，但在加载磁盘上的自定义模型时，建议将后端作为参数显式传递。您可以通过在相关 PR（如 #99）中展示的方法，将 backend 作为参数传入来解决此问题，从而明确指定使用 `transformers` 或 `spacy` 后端来加载自定义模型。","https:\u002F\u002Fgithub.com\u002FPacificAI\u002Flangtest\u002Fissues\u002F97",{"id":163,"question_zh":164,"answer_zh":165,"source_url":166},33000,"LangTest 是否支持 UnQover 数据集进行偏见测试？","是的，项目计划添加对 UnQover 数据集的支持以进行 QA 偏见测试。该数据集专门用于处理“未指定上下文”的场景，在这种设定下，模型应当不回答任何问题。这是一个被社区认为非常有价值的补充功能。","https:\u002F\u002Fgithub.com\u002FPacificAI\u002Flangtest\u002Fissues\u002F495",[168,173,178,183,188,193,198,203,208,213,218,223,228,233,238,243,248,253,258,263],{"id":169,"version":170,"summary_zh":171,"released_at":172},247711,"2.7.0","### 📢 亮点\n\n我们非常高兴地宣布 LangTest 的最新版本发布，为您的模型评估工作流带来了先进的基准测试、全新的鲁棒性测试以及更完善的开发者体验。\n\n- **🩺 自主医疗指南依从性评估（AMEGA）：**  \n  我们整合了 AMEGA，这是一个用于评估大语言模型对临床指南依从性的综合性基准。该基准覆盖13个专科领域的20个诊断场景，包含135道题目和1,337个加权评分要素，为在真实临床环境中评估医学知识提供了一个最为严格的框架。\n\n- **🧪 MedFuzz 鲁棒性测试：**  \n  为了更好地反映现实临床环境中的复杂性，我们推出了 MedFuzz，这是一种针对医疗领域的鲁棒性测试方法，能够超越常规基准对大语言模型进行深入探测。\n\n- **🎲 QA 任务中的随机选项顺序：**  \n  为缓解选择题评估中的位置偏差问题，LangTest 现已新增鲁棒性测试类型——随机选项顺序测试。\n\n- **📝 ACI-Bench：环境临床智能基准：**  \n  LangTest 现已支持使用 ACI-Bench 进行评估，ACI-Bench 是一种用于临床场景下自动生成就诊记录的新型基准。\n\n- **💬 MTS-Dialog：临床摘要评估：**  \n  我们新增了对 MTS-Dialog 数据集的支持，以评估模型在对话转摘要生成任务上的表现，并支持分段式摘要（标题+内容），从而实现更加结构化的评估。\n\n- **🧠 MentalChat16K 临床评估支持：**  \n  LangTest 现已支持 **MentalChat16K 数据集**，可用于评估大语言模型在心理健康相关对话场景中的表现。\n\n- **🔒 安全性增强：**  \n  我们修复了若干关键漏洞和安全问题，进一步提升了 LangTest 的整体稳定性和安全性。\n\n## 🔥 核心增强功能  \n\n### 🩺 自主医疗指南依从性评估（AMEGA）  \n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FPacific-AI-Corp\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fllm_notebooks\u002FAMEGA.ipynb)\n\n我们整合了 **AMEGA**，这是一个用于评估大语言模型对临床指南依从性的严格基准。该基准涵盖13个专科领域的20个诊断场景，共包含135道题目和1,337个加权评分要素。  \n\n**主要特点：**  \n- 提供对真实医疗情境中指南依从性的全面评估。  \n- 涵盖多个专科领域，确保广泛的适用性。  \n- 加权评分机制可提供对模型性能的细致洞察。  \n\n**使用方法：**\n```python\n# 从 LangTest 库中导入 Harness\nfrom langtest import Harness\nimport os\n\nos.environ[\"OPENAI_API_KEY\"] = \"\u003CYOUR_API_KEY>\"\n```\n\nHarness 设置：\n\n```python\nharness = Harness(\n    task=\"问答\",\n    model={\n        \"model\": \"gpt-4o-mini\",\n        \"hub\": \"openai\",\n        \"type\": \"聊天\"\n    },\n    data=[]","2025-09-22T06:04:26",{"id":174,"version":175,"summary_zh":176,"released_at":177},247712,"2.6.0","### 📢 亮点\n\n我们很高兴地推出 LangTest 的最新版本，带来一系列旨在简化模型评估并提升整体性能的改进：\n\n- **🛠 去偏数据增强：**  \n  我们已将去偏技术集成到数据增强流程中，确保模型评估更加公平、更具代表性。\n\n- **🔄 结构化输出评估：**  \n  LangTest 现在支持 OpenAI 和 Ollama 的结构化输出 API，在处理模型响应时提供了更高的灵活性和精确性。\n\n- **🏥 基于 Med Halt 测试的置信度评估：**  \n  推出用于置信度评估的 Med Halt 测试，帮助您更稳健地了解 LLM 在各种条件下的可靠性。\n\n- **📖 扩展对 JSL LLM 模型的任务支持：**  \n  QA 和摘要任务现已全面支持 JSL LLM 模型，进一步提升其在实际应用中的能力。\n\n- **🔒 安全性增强：**  \n  已修复关键漏洞和安全问题，进一步强化了 LangTest 的整体稳定性和安全性。\n\n- **🐛 已解决的 bug：**  \n  我们修复了模板化增强中的问题，确保在整个工作流中输出的一致性、准确性和可靠性。\n\n\n## 🔥 重点增强功能  \n\n### 🛠 去偏数据增强  \n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fmisc\u002FDataset_Debiasing.ipynb)\n\n我们已将去偏技术集成到数据增强流程中，确保模型评估更加公平、更具代表性。  \n\n**主要特性：**  \n- 消除训练数据中的偏差，提升模型公平性。  \n- 增强增强后数据集的多样性，以改善模型的泛化能力。  \n\n**工作原理：**  \n加载数据集：\n```python\nfrom datasets import load_dataset\n\ndataset = load_dataset(\"RealTimeData\u002Fbbc_news_alltime\", \"2024-12\", split=\"train\")\n\n# 随机抽取500行数据\ndf = dataset.to_pandas()\nsample = df.sample(500)\n\n# 避免上下文溢出错误\nsample = sample[sample['content'].apply(lambda x: len(x) \u003C 1000)]\n```\n\n```python\n# 设置去偏处理\nfrom langtest.augmentation.debias import DebiasTextProcessing \n\nprocessing = DebiasTextProcessing(\n    model=\"gpt-4o-mini\",\n    hub=\"openai\",\n    model_kwargs={\n        \"temperature\": 0,\n    }\n)\n```\n\n```python\nimport pandas as pd\n\nprocessing.initialize(\n    input_dataset = sample,\n    output_dataset = pd.DataFrame({}),\n    text_column=\"content\",\n    \n)\n\noutput, reason = processing.apply_bias_correction(bias_tolerance_level=2)\n\noutput.head()\n```\n![image](https:\u002F\u002Fgithub.com\u002Fuser-attachments\u002Fassets\u002F5e56275c-14c6-42a8-9c1a-3e50f57f08f6)\n\n\n### 🔄 结构化输出评估\n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Fl","2025-03-11T05:24:09",{"id":179,"version":180,"summary_zh":181,"released_at":182},247713,"2.5.0","## **📢 亮点**  \n我们非常高兴地宣布最新版本发布，其中包含令人振奋的更新和增强功能，旨在助力您的 AI 模型评估与开发工作流！\n\n- **🔗 支持 Spark DataFrames 和 Delta Live Tables**  \n我们现已扩展功能，支持 Databricks 的 **Spark DataFrames** 和 **Delta Live Tables**，让您能够无缝集成并高效处理项目中的数据。\n\n- **🧪 鲁棒性测试中的性能退化分析**  \n在鲁棒性测试中新增了 **性能退化分析**！帮助您深入了解模型在边缘场景下的表现，并确保其在复杂环境下的稳定性能。\n\n- **🖼 增强的图像鲁棒性测试**  \n我们新增了用于图像鲁棒性测试的 **新测试类型**，以更全面地评估您的视觉模型。这些测试可针对多种图像扰动进行验证，从而评估模型的适应能力。\n\n- **🛠 针对 LLM 的可定制模板**  \n借助来自 Hugging Face 的大型语言模型（LLMs）的 **可定制模板**，您可以轻松个性化工作流。根据具体需求量身定制提示和配置。\n\n- **💬 改进的 LLM 和 VQA 模型功能**  \n对 **聊天和完成功能** 的优化，使与 LLM 及视觉问答（VQA）模型的交互更加稳健、更易于使用。\n\n- **✔ 完善的单元测试与类型注解**  \n我们在整体上加强了 **单元测试和类型注解**，以确保更高的代码质量、可靠性和可维护性。\n\n- **🌐 网站更新**  \n网站已更新，新增内容重点介绍了 Databricks 集成，包括对 Spark DataFrames 和 Delta Live Tables 的支持教程。\n\n\n## 🔥 重要增强功能\n\n\n### 🔗 支持 Spark DataFrames 和 Delta Live Tables  \n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fllm_notebooks\u002FLangTest_Databricks_Integration.ipynb)\n\n我们现已扩展功能，支持 Databricks 的 Spark DataFrames 和 Delta Live Tables，从而实现无缝集成，并为您的项目提供高效的数据处理能力。  \n\n### 核心特性  \n- **无缝集成**：轻松将 Spark DataFrames 和 Delta Live Tables 纳入您的工作流。  \n- **效率提升**：借助 Databricks 强大的工具，优化数据处理流程。  \n\n### 工作原理： \n\n```python\nfrom pyspark.sql import DataFrame\n\n # 将数据集加载到 Spark DataFrame 中\n df: DataFrame = spark.read.json(\"\u003CFILE_PATH>\")\n\ndf.printSchema()\n```\n\n**测试配置：**\n\n```python\nprompt_template = (\n    \"您是一位专注于提供准确且简洁答案的 AI 助手。\"\n    \"系统将向您展示一道医学问题及多项选择题选项。\"\n    \"您的任务是选出正确答案。\\n\"\n    \"问题：{question}\\n\"\n    \"选项：{options}\\n\"\n    \"答","2024-12-24T15:12:00",{"id":184,"version":185,"summary_zh":186,"released_at":187},247714,"2.4.0","## 📢 **亮点**\nJohn Snow Labs 很高兴宣布 LangTest 2.4.0 正式发布！此次更新引入了前沿功能，并修复了关键问题，进一步提升多模态场景下的模型测试与评估能力。\n\n- 🔗 **支持 VQA 任务的多模态测试**：我们非常激动地推出多模态测试功能，现已支持视觉问答（VQA）任务！新增 10 项鲁棒性测试，您可以对图像进行扰动，从而挑战并评估模型在视觉输入上的表现。\n\n- 📝 **文本任务新增鲁棒性测试**：LangTest 2.4.0 带来了两项新的鲁棒性测试——`add_new_lines` 和 `add_tabs`，适用于文本分类、问答和摘要生成等任务。这些测试能够检验模型对文本变化的适应能力，并确保其准确性不受影响。\n\n- 🔄 **多标签文本分类改进**：我们已解决多标签文本分类评估中影响准确性和公平性的若干问题，从而确保结果更加可靠且一致。\n\n- 🛡 **基于 Prompt Guard 的基础安全评估**：我们集成了使用 `PromptGuard` 模型的安全评估测试，在提示词与大型语言模型（LLM）交互之前对其进行检测和过滤，有效防范有害或意外输出，为用户提供重要安全保障。\n\n- 🛠 **NER 准确性测试修复**：LangTest 2.4.0 修复了命名实体识别（NER）准确性测试中的问题，提升了 NER 任务性能评估的可靠性。\n\n- 🔒 **安全性增强**：我们升级了多项依赖库，以修复潜在的安全漏洞，使 LangTest 对用户而言更加安全可靠。\n\n\n## 🔥 **核心增强功能**\n\n### 🔗 **支持 VQA 任务的多模态测试**  \n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fllm_notebooks\u002FVisual_QA.ipynb)\n本次发布引入了多模态测试功能，通过视觉问答（VQA）任务扩展了模型的评估能力。\n\n**主要特性：**\n- **图像扰动测试**：包含 10 项全新的鲁棒性测试，允许您通过对图像施加扰动来评估模型性能。\n- **多样化模态**：评估模型如何同时处理视觉和文本输入，从而更深入地了解其多模态适应能力。\n\n**测试类型说明**\n| **扰动方式**      | **描述**                      |\n|-----------------------|--------------------------------------|\n| `image_resize`        | 调整图像大小，测试模型对不同尺寸图像的鲁棒性。 |\n| `image_rotate`        | 将图像旋转不同角度，评估模型对旋转输入的响应。 |\n| `image_blur`          | 应用模糊滤镜，测试模型在不清晰或模糊图像上的表现。 |\n| `image_noise`         | 向图像添加噪声，检查模型处理含噪图像的能力。 |","2024-09-23T06:49:52",{"id":189,"version":190,"summary_zh":191,"released_at":192},247715,"2.3.1","## 描述\n\n在本补丁版本中，我们解决了多个关键问题，以增强 JohnSnowLabs 开发的 **LangTest** 的功能并修复其缺陷。主要修复包括：修正 NER 任务的评估流程，确保当预期结果为空而预测结果非空时，能够正确标记为失败；同时，还解决了测试增强过程中超出训练数据集限制以及增强数据在不同测试用例间分配不均的问题。此外，我们还通过 OpenAI API 改进了模板生成，并在 Pydantic 模型中增加了验证逻辑，以确保输出的一致性和准确性。另外，已开始集成 Azure OpenAI 服务用于基于模板的增强，并修复了 Sphinx API 文档显示最新版本的问题。\n\n## 🐛 修复内容\n- **NER 任务评估修复：**\n  - 修复了当预期结果为空但实际预测结果非空时，NER 评估仍被错误判定为通过的问题。此类情况应被视为失败。[#1076]\n  - 修复了 NER 预测结果与预期结果长度不一致的问题。[#1076]\n- **API 文档链接失效：**\n  - 修复了 Sphinx API 文档未能显示最新版本文档的问题。[#1077]\n- **训练数据集限制问题：**\n  - 修复了在测试增强分配过程中，训练数据集设置的最大限制被超出的问题。[#1085]\n- **增强数据分配问题：**\n  - 修复了增强数据分配不均的问题，导致部分测试用例未进行任何变换。[#1085]\n- **DataAugmenter 类相关问题：**\n  - 修复了数据增强后导出类型无法正常工作的问题。[#1085]\n- **使用 OpenAI API 生成模板：**\n  - 解决了在基于用户提供的模板生成不同模板时，OpenAI API 输出无效（如段落或错误的 JSON）的问题，并通过实现结构化输出予以解决。[#1085]\n\n## ⚡ 功能增强\n- **Pydantic 模型增强：**\n  - 在 Pydantic 模型中添加了验证步骤，以确保模板按要求生成。[#1085]\n- **Azure OpenAI 服务集成：**\n  - 实现了基于 Azure OpenAI 服务的模板增强功能。[#1090]\n- **文本分类支持：**\n  - 新增了对文本分类任务中多标签分类的支持。[#1096]\n- **数据增强：**\n  - 为 NER 样本添加了 JSON 格式输出，以支持生成式 AI 实验室。[#1099][#1100]\n\n## 变更内容\n* chore：在导入测试用例后重新应用 NER 任务的变换，由 @chakravarthik27 在 https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F1076 中完成。\n* 更新了 Python API 文档，采用 Sphinx 工具，由 @chakravarthik27 在 https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F1077 中完成。\n* 补丁\u002F2.3.1，由 @chakravarthik27 在 https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F1078 中完成。\n* Bug\u002FNER 评估","2024-09-12T05:07:26",{"id":194,"version":195,"summary_zh":196,"released_at":197},247716,"2.3.0","## 📢 亮点\nJohn Snow Labs 很高兴宣布 LangTest 2.3.0 版本正式发布！此次更新引入了多项新功能和改进，旨在提升您的语言模型测试与评估能力。\n\n- 🔗 **多模型、多数据集支持**：LangTest 现在支持在多个数据集上对多种模型进行评估。这一功能能够以更 streamlined 的方式实现全面的比较和性能分析。\n\n- 💊 **通用名与商品名互换测试**：我们新增了用于在通用名和商品名之间进行互换的测试。此功能可确保在医疗和制药相关场景中的评估准确性。\n\n- 📈 **Prometheus 模型集成**：通过集成 Prometheus 模型，LangTest 提供了更强大的评估能力，能够生成更加详细且具有洞察力的指标，帮助您更好地评估模型性能。\n\n- 🛡 **安全测试增强**：LangTest 推出了全新的安全测试功能，用于识别并缓解模型中可能存在的滥用风险及安全问题。这套全面的测试套件旨在确保模型行为符合伦理规范，避免产生有害或非预期的输出。\n\n- 🛠 **日志记录优化**：我们大幅提升了日志记录功能，提供了更为详细且用户友好的日志信息，便于您调试和监控模型评估过程。\n\n## 🔥 核心增强：\n\n### 🔗 **增强的多模型、多数据集支持**\n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fmisc\u002FMulti_Model_Multi_Dataset.ipynb)\n\n隆重推出增强的多模型、多数据集支持功能，旨在简化并提升跨不同数据集对多种模型的评估流程。\n\n**主要特性：**\n- **全面比较**：同时在多个数据集上对多种模型进行评估和比较，从而实现更彻底、更有意义的对比。\n- **工作流优化**：简化大规模性能评估的流程，使其更加便捷高效。\n- **深入分析**：提供关于模型在不同数据集上行为与表现的详细洞察，帮助您更深入地了解其优势与局限性。\n\n#### **工作原理：**\n\n以下是配置并自动使用不同数据集测试 LLM 模型的方法：\n\n**配置：**\n创建一个 config.yaml 文件：\n```yaml\n# config.yaml\nprompt_config:\n  \"BoolQ\":\n    instructions: >\n      你是一个智能机器人，你的职责是给出简洁的答案。答案应为 `true` 或 `false`。\n    prompt_type: \"instruct\" # instruct 用于完成任务，chat 用于对话（聊天模型）\n    examples:\n      - user:\n          context: >\n            《好斗》——第二季共13集，于2018年3月4日首播。2018年5月2日，t","2024-07-16T09:14:00",{"id":199,"version":200,"summary_zh":201,"released_at":202},247717,"2.2.0","## 📢 亮点\n\nJohn Snow Labs 很高兴宣布 LangTest 2.2.0 正式发布！此次更新引入了强大的新功能和改进，旨在提升您的语言模型测试体验，并为您提供更深入的洞察。\n\n- 🏆 **模型排名与排行榜**：LangTest 现已推出全面的模型排名系统。您可以通过 harness.get_leaderboard() 根据多种测试指标对模型进行排名，并保留历史排名以便进行对比分析。\n\n- 🔍 **少样本模型评估**：利用少样本提示技术优化并评估您的模型。此功能使您能够在数据量极少的情况下评估模型性能，仅需少量示例即可深入了解模型的能力。\n\n- 📊 **LLM 中的 NER 评估**：本次发布特别扩展了对大型语言模型（LLMs）中命名实体识别（NER）任务的支持。您可以轻松地评估和基准化 LLM 在 NER 任务上的表现。\n\n- 🚀 **增强的数据增强功能**：全新的 DataAugmenter 模块实现了无需 Harness 的高效数据增强流程，让扩充数据集、提升模型鲁棒性变得更加简单。\n\n- 🎯 **多数据集提示**：LangTest 现在为多个数据集提供了优化的提示处理功能，用户可以为每个数据集添加自定义提示，从而实现无缝集成与高效测试。\n\n## 🔥 重要改进：\n\n### **🏆 全面的模型排名与排行榜**\n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fbenchmarks\u002FBenchmarking_with_Harness.ipynb)\n新的模型排名与排行榜系统提供了一种全面的方法，可根据不同数据集上的多种指标来评估和比较模型性能。该功能允许用户对模型进行排名、保留历史排名，并分析性能趋势。\n\n**主要特性：**\n- **全面排名**：基于多种性能指标，在多个数据集上对模型进行排名。\n- **历史对比**：保留并比较以往的排名，以实现持续的性能跟踪。\n- **数据集特定洞察**：在不同数据集上评估模型性能，以获得更深入的见解。\n\n**工作原理：**\n\n以下是为 `google\u002Fflan-t5-base` 和 `google\u002Fflan-t5-large` 模型进行模型排名并可视化排行榜的步骤。\n**1.** Harness 的设置与配置如下：\n\n```yaml\n# config.yaml\nmodel_parameters:\n  max_tokens: 64\n  device: 0\n  task: text2text-generation\ntests:\n  defaults:\n    min_pass_rate: 0.65\n  robustness:\n    add_typo:\n      min_pass_rate: 0.7\n    lowercase:\n      min_pass_rate: 0.7\n```\n```python\nfrom langtest import Harness\n\nharness = Harness(\n    task=\"question-answering\",\n    model={\n        \"model\": \"google\u002Fflan-t5-base\",\n        \"hub\": \"huggingface\"\n    },\n    data=[\n        {\n            \"data_source\": \"MedMCQA\"\n        },\n        {\n           ","2024-05-15T12:53:50",{"id":204,"version":205,"summary_zh":206,"released_at":207},247718,"2.1.0","## 📢 亮点\n\nJohn Snow Labs 很高兴宣布 LangTest 2.1.0 正式发布！此次更新带来了令人振奋的新功能和改进，旨在简化您的语言模型测试工作流程，并提供更深入的洞察。\n\n- **🔗 增强的基于 API 的 LLM 集成：** LangTest 现在支持对基于 API 的大型语言模型（LLM）进行测试。这使您能够将各种 LLM 模型无缝集成到 LangTest 中，并在不同数据集上开展性能评估。\n\n- **📂 扩展的文件格式支持：** LangTest 2.1.0 新增了对更多文件格式的支持，进一步提升了其在处理 LLM 测试中各类数据结构时的灵活性。\n\n- **📊 多数据集处理能力提升：** 我们显著改进了 LangTest 对多数据集的管理方式。这不仅简化了工作流程，还使得跨更广泛的数据源进行高效测试成为可能。\n\n- **🖥️ 新的基准测试命令：** LangTest 现在推出了一组专为语言模型基准测试设计的新命令。这些命令提供了一种结构化的评估方法，可用于衡量模型性能，并在不同模型及数据集之间进行结果对比。\n  \n- **💡 问答任务的数据增强：** LangTest 引入了针对问答任务的改进型数据增强技术。这有助于评估语言模型在应对语言变体及潜在偏见方面的能力，从而构建出更加稳健、更具泛化能力的模型。\n\n## 🔥 重要增强：\n\n### **面向基于 API 的大型语言模型的简化集成与增强功能：**\n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fmisc\u002FGeneric_API-Based_Model_Testing_Demo.ipynb)\n\n此功能使您能够无缝集成几乎任何托管于外部 API 平台上的语言模型。无论您偏好 OpenAI、Hugging Face，还是自定义的 vLLM 解决方案，LangTest 现在都能适配您的工作流程。对于兼容 OpenAI API 的服务器，无需再使用 `input_processor` 和 `output_parser` 函数。\n\n#### 核心特性：\n\n- **轻松实现 API 集成：** 通过指定 API URL、参数以及用于解析返回结果的自定义函数，即可连接任意 API 系统。这种直观的方式让您只需极少配置就能充分利用心仪的语言模型。\n\n- **可定制的参数设置：** 您可以定义 URL、特定于所选 API 的参数，以及专门用于提取所需输出的解析函数。这种高度可控性确保了与多种 API 结构的兼容性。\n\n- **无与伦比的灵活性：** 通用 API 支持打破了平台限制。现在，您可以无缝集成来自不同来源的语言模型，包括 OpenAI、Hugging Face，甚至托管在私有平台上的自定义 vLLM 解决方案。\n\n#### 工作原理：\n\n**参数：**\n定义 `input_proc","2024-04-03T13:54:29",{"id":209,"version":210,"summary_zh":211,"released_at":212},247719,"2.0.0","------------------\r\n# 📢 亮点\r\n\r\n🌟 **John Snow Labs 发布 LangTest 2.0.0 版本**\r\n\r\n我们非常高兴地宣布 LangTest 的最新版本发布，此次更新带来了令人瞩目的新功能，显著提升了其功能性和易用性。本次更新包含多项重要改进：\r\n\r\n- **🔬 模型基准测试：** 在多个数据集上对不同模型进行测试，以深入了解其性能表现。\r\n  \r\n- **🔌 集成：LM Studio 与 LangTest：** 支持离线使用 Hugging Face 的量化模型，用于本地自然语言处理测试。\r\n  \r\n- **🚀 文本嵌入基准测试流水线：** 通过命令行界面简化文本嵌入模型的评估流程。\r\n  \r\n- **📊 跨多个基准数据集比较模型：** 可同时在不同数据集上评估模型的有效性。\r\n\r\n- **🤬 自定义毒性检测：** 允许用户根据具体需求定制毒性检测内容，针对特定类型的毒性问题（如淫秽、侮辱、威胁、身份攻击以及基于性取向的攻击等）提供详细分析，同时保持对更广泛毒性内容的检测能力。\r\n\r\n- 在运行方法中实现了 LRU 缓存机制，优化了重复记录的模型预测检索过程，从而提升运行效率。\r\n\r\n\r\n## 🔥 核心增强功能：\r\n\r\n### 🚀 模型基准测试：深入探索模型性能\r\n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fbenchmarks\u002FQuestion-Answering.ipynb)\r\n\r\n作为我们持续开展的模型基准测试计划的一部分，我们很高兴分享在多种数据集上对各类模型进行全面测试的结果，重点评估它们在**准确率**和**鲁棒性**方面的表现。\r\n\r\n#### 主要亮点：\r\n\r\n- **全面评估：** 我们的严格测试方法覆盖了广泛的模型，提供了关于它们在不同数据集和任务中整体表现的全方位视角。\r\n\r\n- **洞察模型行为：** 通过这一计划，我们深入了解了不同模型的优势与不足，揭示了即使是大型语言模型也存在局限性的领域。\r\n\r\n前往：[排行榜](https:\u002F\u002Flangtest.org\u002Fleaderboard\u002Fllm)\r\n\r\n| 基准测试数据集             | 划分 | 测试                     | 测试模型                                                                                     |\r\n|---------------------|-------|--------------------------|-------------------------------------------------------------------------------------------|\r\n| ASDiV               | 测试  | 准确率与鲁棒性    | `Deci\u002FDeciLM-7B-instruct`, `TheBloke\u002FLlama-2-7B-chat-GGUF`, `TheBloke\u002FSOLAR-10.7B-Instruct-v1.0-GGUF`, `TheBloke\u002Fneural-chat-7B-v3-1-GGUF`, `TheBloke\u002Fopenchat_3.5-GGUF`, `TheBloke\u002Fphi-2-GGUF`, `google\u002Fflan-t5-xxl`, `gpt-3.5-turbo-instruct`, `gpt-4-1106-preview`, `mistralai\u002FMistral-7B-Instruct-v0.1`, `mistralai\u002FMixtral-8x7B-Instruct-v0.1` |\r\n| BBQ            ","2024-02-21T16:18:18",{"id":214,"version":215,"summary_zh":216,"released_at":217},247720,"1.10.0","------------------\n# 📢 亮点\n\n🌟 **John Snow Labs 发布 LangTest 1.10.0 版本**\n\n我们非常高兴地宣布 LangTest 的最新版本发布，此次更新带来了多项令人瞩目的功能，进一步提升了其功能性和易用性。本次更新包含以下诸多增强：\n\n- **使用 LlamaIndex 和 LangTest 评估 RAG 系统**：LangTest 可与 LlamaIndex 无缝集成，用于构建 RAG 系统，并借助 LangtestRetrieverEvaluator 评估检索器的精确度（命中率）和准确性（MRR），同时采用标准查询和扰动查询进行测试，从而确保对实际应用场景性能的全面评估。\n\n- **面向 NLP 模型评估的语法测试**：该方法通过改写原始句子来生成测试用例，旨在评估语言模型对文本细微语义的理解与解析能力，进而深入洞察其上下文理解水平。\n\n\n- **检查点的保存与加载**：LangTest 现已支持检查点的无缝保存与加载功能，使用户能够更好地管理任务进度、从中断中恢复并确保数据完整性。\n\n- **扩展对医学数据集的支持**：LangTest 新增了对 LiveQA、MedicationQA 和 HealthSearchQA 等多个医学数据集的支持。这些数据集可帮助在不同医学场景下对语言模型进行全面评估，涵盖消费者健康咨询、药物相关问答以及封闭域问答任务等领域。\n\n\n- **与 Hugging Face 模型的直接集成**：用户现在可以轻松将任意 Hugging Face 模型对象传入 LangTest 测试框架，并运行多种任务。这一功能简化了不同模型的评估与对比流程，使用户能够更便捷地利用 LangTest 的全面工具集，结合 Hugging Face 平台上丰富的模型资源。\n\n\n## 🔥 核心增强：\n\n### 🚀 使用 LlamaIndex 和 LangTest 实现并评估 RAG 系统\n[![在 Colab 中打开](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002FRAG\u002FRAG_OpenAI.ipynb)\n\nLangTest 与 LlamaIndex 完美融合，主要聚焦于两个方面：利用 LlamaIndex 构建 RAG 系统，以及对其性能进行评估。具体而言，通过调用 LlamaIndex 的 generate_question_context_pairs 模块，生成相关的问答-上下文对，为 RAG 系统中的检索与响应评估奠定基础。\n\n为了评估检索器的效果，LangTest 引入了 LangtestRetrieverEvaluator，采用命中率和平均倒数排名（MRR）等关键指标。其中，命中率用于衡量精确度，即在前 k 个检索结果中包含正确答案的查询比例；而 MRR 则用于评估准确度，它综合考虑所有查询中最高相关文档的排名位置。通过结合标准查询和扰动查询进行全方位评估，gen","2023-12-23T17:57:59",{"id":219,"version":220,"summary_zh":221,"released_at":222},247721,"1.9.0","------------------\r\n# 📢 Highlights\r\n\r\n🌟 **LangTest 1.9.0 Release by John Snow Labs**\r\n\r\nWe're excited to announce the latest release of LangTest, featuring significant enhancements that bolster its versatility and user-friendliness. This update introduces the seamless integration of Hugging Face Callback, empowering users to effortlessly utilize this renowned platform. Another addition is our Enhanced Templatic Augmentation with Automated Sample Generation. We also expanded LangTest's utility in language testing by conducting comprehensive benchmarks across various models and datasets, offering deep insights into performance metrics. Moreover, the inclusion of additional Clinical Datasets like MedQA, PubMedQA, and MedMCQ broadens our scope to cater to diverse testing needs. Coupled with insightful blog posts and numerous bug fixes, this release further cements LangTest as a robust and comprehensive tool for language testing and evaluation.\r\n\r\n-  Integration of Hugging Face's callback class in LangTest facilitates seamless incorporation of an automatic testing callback into transformers' training loop for flexible and customizable model training experiences.\r\n\r\n- Enhanced Templatic Augmentation with Automated Sample Generation: A key addition in this release is our innovative feature that auto-generates sample templates for templatic augmentation. By setting generate_templates to True, users can effortlessly create structured templates, which can then be reviewed and customized with the show_templates option. \r\n\r\n- In our Model Benchmarking initiative, we conducted extensive tests on various models across diverse datasets (MMLU-Clinical, OpenBookQA, MedMCQA, MedQA), revealing insights into their performance and limitations, enhancing our understanding of the landscape for robustness testing.\r\n\r\n- Enhancement: Implemented functionality to save model responses (actual and expected results) for original and perturbed questions from the language model (llm) in a pickle file. This enables efficient reuse of model outputs on the same dataset, allowing for subsequent evaluation without the need to rerun the model each time.\r\n\r\n- Optimized API Efficiency with Bug Fixes in Model Calls.\r\n\r\n## 🔥 Key Enhancements:\r\n\r\n### 🤗 Hugging Face Callback Integration\r\n [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fmisc\u002FHF_Callback_NER.ipynb)     \r\nWe introduced the callback class for utilization in transformers model training. Callbacks in transformers are entities that can tailor the training loop's behavior within the PyTorch or Keras Trainer. These callbacks have the ability to examine the training loop state, make decisions (such as early stopping), or execute actions (including logging, saving, or evaluation). LangTest effectively leverages this capability by incorporating an automatic testing callback. This class is both flexible and adaptable, seamlessly integrating with any transformers model for a customized experience.\r\n\r\nCreate a callback instance with one line and then use it in the callbacks of trainer:\r\n```python\r\nmy_callback = LangTestCallback(...)\r\ntrainer = Trainer(..., callbacks=[my_callback])\r\n```\r\n\r\n| Parameter             | Description |\r\n| --------------------- | ----------- |\r\n| **task**              | Task for which the model is to be evaluated (text-classification or ner) |\r\n| **data**              | The data to be used for evaluation. A dictionary providing flexibility and options for data sources. It should include the following keys: \u003Cul>\u003Cli>data_source (mandatory): The source of the data.\u003C\u002Fli>\u003Cli>subset (optional): The subset of the data.\u003C\u002Fli>\u003Cli>feature_column (optional): The column containing the features.\u003C\u002Fli>\u003Cli>target_column (optional): The column containing the target labels.\u003C\u002Fli>\u003Cli>split (optional): The data split to be used.\u003C\u002Fli>\u003Cli>source (optional): Set to 'huggingface' when loading Hugging Face dataset.\u003C\u002Fli>\u003C\u002Ful> |\r\n| **config**            | Configuration for the tests to be performed, specified in the form of a YAML file. |\r\n| **print_reports**     | A bool value that specifies if the reports should be printed. |\r\n| **save_reports**      | A bool value that specifies if the reports should be saved. |\r\n| **run_each_epoch**    | A bool value that specifies if the tests should be run after each epoch or the at the end of training |\r\n\r\n### 🚀 Enhanced Templatic Augmentation with Automated Sample Generation\r\n [![Open In Colab](https:\u002F\u002Fcolab.research.google.com\u002Fassets\u002Fcolab-badge.svg)](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fmisc\u002FTemplatic_Augmentation_Notebook.ipynb)     \r\n\r\nUsers can now enable the automatic generation of sample templates by setting generate_templates to True. This feature utilizes the advanced capabilities of LLMs to create structured templates that can be used for templatic augmentation.To ensure quality and rel","2023-12-01T11:28:03",{"id":224,"version":225,"summary_zh":226,"released_at":227},247722,"1.8.0","---\r\n\r\n# 🌟 LangTest 1.8.0 Release by John Snow Labs\r\n\r\nWe're thrilled to unveil the latest advancements in LangTest with version 1.8.0. This release is centered around optimizing the codebase with extensive refactoring, enriching the debugging experience through the implementation of error codes, and enhancing workflow efficiency with streamlined task organization. The new categorization approach significantly improves the user experience, ensuring a more cohesive and organized testing process. This update also includes advancements in open source community standards, insightful blog posts, and multiple bug fixes, further solidifying LangTest's reputation as a versatile and user-friendly language testing and evaluation library.\r\n\r\n###  🔥 Key Enhancements:\r\n\r\n- **Optimized Codebase**: This update features a comprehensively refined codebase, achieved through extensive refactoring, resulting in enhanced efficiency and reliability in our testing processes.\r\n\r\n- **Advanced Debugging Tools**: The introduction of error codes marks a significant enhancement in the debugging experience, addressing the previous absence of standardized exceptions. This inconsistency in error handling often led to challenges in issue identification and resolution. The integration of a unified set of standardized exceptions, tailored to specific error types and contexts, guarantees a more efficient and seamless troubleshooting process.\r\n\r\n- **Task Categorization**: This version introduces an improved task organization system, offering a more efficient and intuitive workflow. Previously, it featured a wide range of tests such as sensitivity, clinical tests, wino-bias and many more, each treated as separate tasks. This approach, while comprehensive, could result in a fragmented workflow. The new categorization method consolidates these tests into universally recognized NLP tasks, including Named Entity Recognition (NER), Text Classification, Question Answering, Summarization, Fill-Mask, Translation, and Test Generation. This integration of tests as sub-categories within these broader NLP tasks enhances clarity and reduces potential overlap.\r\n\r\n- **Open Source Community Standards**: With this release, we've strengthened community interactions by introducing issue templates, a code of conduct, and clear repository citation guidelines. The addition of GitHub badges enhances visibility and fosters a collaborative and organized community environment.\r\n\r\n- **Parameter Standardization**: Aiming to bring uniformity in dataset organization and naming, this feature addresses the variation in dataset structures within the repository. By standardizing key parameters like 'datasource', 'split', and 'subset', we ensure a consistent naming convention and organization across all datasets, enhancing clarity and efficiency in dataset usage.\r\n\r\n### 🚀 Community Contributions:\r\nOur team has published three enlightening blogs on Hugging Face's community platform, focusing on bias detection, model sensitivity, and data augmentation in NLP models:\r\n\r\n1. [Detecting and Evaluating Sycophancy Bias: An Analysis of LLM and AI Solutions](https:\u002F\u002Fhuggingface.co\u002Fblog\u002FRakshit122\u002Fsycophantic-ai)\r\n2. [Unmasking Language Model Sensitivity in Negation and Toxicity Evaluations](https:\u002F\u002Fhuggingface.co\u002Fblog\u002FPrikshit7766\u002Fllms-sensitivity-testing)\r\n3. [Elevate Your NLP Models with Automated Data Augmentation for Enhanced Performance](https:\u002F\u002Fhuggingface.co\u002Fblog\u002Fchakravarthik27\u002Fboost-nlp-models-with-automated-data-augmentation)\r\n\r\n\r\n⭐ Don't forget to give the project a star [here](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest)!\r\n\r\n\r\n### 🚀 New LangTest blogs :\r\n\r\n| New Blog Posts | Description |\r\n|----------------|-------------|\r\n| [**Evaluating Large Language Models on Gender-Occupational Stereotypes Using the Wino Bias Test**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fevaluating-large-language-models-on-gender-occupational-stereotypes-using-the-wino-bias-test-2a96619b4960) | Delve into the evaluation of language models with LangTest on the WinoBias dataset, addressing AI biases in gender and occupational roles. |\r\n| [**Streamlining ML Workflows: Integrating MLFlow Tracking with LangTest for Enhanced Model Evaluations**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fstreamlining-ml-workflows-integrating-mlflow-tracking-with-langtest-for-enhanced-model-evaluations-4ce9863a0ff1) | Discover the revolutionary approach to ML development through the integration of MLFlow and LangTest, enhancing transparency and systematic tracking of models. |\r\n| [**Testing the Question Answering Capabilities of Large Language Models**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Ftesting-the-question-answering-capabilities-of-large-language-models-1bc424d61740) | Explore the complexities of evaluating Question Answering (QA) tasks using LangTest's diverse evaluation methods. |\r\n| [**Evaluating Stereotype Bias with LangTest**](https:\u002F\u002Fmedium.com\u002Fjohn-snow-labs\u002Fevaluating-stereotype-bias-with-langtest-8286af8f0f22) | In this blog post, we ","2023-11-10T15:21:30",{"id":229,"version":230,"summary_zh":231,"released_at":232},247723,"v1.7.0","------------------\r\n# 📢 Highlights\r\n\r\n**LangTest 1.7.0 Release by John Snow Labs** 🚀: \r\nWe are delighted to announce remarkable enhancements and updates in our latest release of LangTest. This release comes with advanced benchmark assessment for question-answering evaluation, customized model APIs, StereoSet integration, addresses gender occupational bias assessment in Large Language Models (LLMs), introducing new blogs and FiQA dataset. These updates signify our commitment to improving the LangTest library, making it more versatile and user-friendly while catering to diverse processing requirements.\r\n\r\n- Enhanced the QA evaluation capabilities of the LangTest library by introducing two categories of distance metrics: Embedding Distance Metrics and String Distance Metrics.\r\n- Introducing enhanced support for customized models in the LangTest library, extending its flexibility and enabling seamless integration of user-personalized models.\r\n- Tackled the wino-bias assessment of gender occupational bias in LLMs through an improved evaluation approach. We address the examination of this process utilizing Large Language Models.\r\n- Added StereoSet as a new task and dataset, designed to evaluate models by assessing the probabilities of alternative sentences, specifically stereotypic and anti-stereotypic variants.\r\n- Adding support for evaluating models on the finance dataset - FiQA (Financial Opinion Mining and Question Answering)\r\n- Added a blog post on **_Sycophancy Test_**, which focuses on uncovering AI behavior challenges and introducing innovative solutions for fostering unbiased conversations.\r\n- Added **_Bias in Language Models_** Blog post, which delves into the examination of gender, race, disability, and socioeconomic biases, stressing the significance of fairness tools like LangTest.\r\n- Added a blog post on **_Sensitivity Test_**, which explores language model sensitivity in negation and toxicity evaluations, highlighting the constant need for NLP model enhancements.\r\n- Added **_CrowS-Pairs_** Blog post, which centers on addressing stereotypical biases in language models through the CrowS-Pairs dataset, strongly focusing on promoting fairness in NLP systems.\r\n\r\n---------------\r\n⭐ Make sure to give the project a star right [here](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest) \r\n\r\n----------------\r\n# 🔥 New Features\r\n\r\n## Enhanced Question-Answering Evaluation\r\n\r\nEnhanced the QA evaluation capabilities of the LangTest library by introducing two categories of distance metrics: Embedding Distance Metrics and String Distance Metrics. These additions significantly broaden the toolkit for comparing embeddings and strings, empowering users to conduct more comprehensive QA evaluations. Users can now experiment with different evaluation strategies tailored to their specific use cases.\r\n\r\n**Link to Notebook** : [QA Evaluations](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fmisc\u002FEvaluation_Metrics.ipynb)\r\n\r\n## Embedding Distance Metrics\r\n\r\nAdded support for two hubs for embeddings.\r\n\r\n\r\n| Supported Embedding Hubs |\r\n|--------------------------|\r\n| Huggingface          |\r\n|  OpenAI               |\r\n\r\n\r\n| Metric Name       | Description                       |\r\n| ----------------- | --------------------------------- |\r\n| Cosine similarity | Measures the cosine of the angle between two vectors. |\r\n| Euclidean distance | Calculates the straight-line distance between two points in space. |\r\n| Manhattan distance | Computes the sum of the absolute differences between corresponding elements of two vectors. |\r\n| Chebyshev distance | Determines the maximum absolute difference between elements in two vectors. |\r\n| Hamming distance  | Measure the difference between two equal-length sequences of symbols and is defined as the number of positions at which the corresponding symbols are different. |\r\n\r\n## String Distance Metrics\r\n\r\n| Metric Name       | Description                       |\r\n| ----------------- | --------------------------------- |\r\n| jaro              | Measures the similarity between two strings based on the number of matching characters and transpositions. |\r\n| jaro_winkler      | An extension of the Jaro metric that gives additional weight to common prefixes. |\r\n| hamming           | Measure the difference between two equal-length sequences of symbols and is defined as the number of positions at which the corresponding symbols are different. |\r\n| levenshtein       | Calculates the minimum number of single-character edits (insertions, deletions, substitutions) required to transform one string into another. |\r\n| damerau_levenshtein | Similar to Levenshtein distance but allows transpositions as a valid edit operation. |\r\n| Indel             | Focuses on the number of insertions and deletions required to match two strings. |\r\n\r\n### Results:\r\nEvaluating using OpenAI embeddings and Cosine similarity:\r\n\r\n| original_question                                                      | pertu","2023-10-24T06:02:34",{"id":234,"version":235,"summary_zh":236,"released_at":237},247724,"1.6.0","---------------\r\n📢  Overview\r\n---------------\r\n\r\nLangTest 1.6.0 Release by John Snow Labs 🚀: Advancing Benchmark Assessment with the Introduction of New Datasets and Testing Frameworks by incorporating CommonSenseQA, PIQA, and SIQA datasets, alongside launching a toxicity sensitivity test. The domain of legal testing expands with the addition of Consumer Contracts, Privacy-Policy, and Contracts-QA datasets for legal-qa evaluations, ensuring a well-rounded scrutiny in legal AI applications. Additionally, the Sycophancy and Crows-Pairs common stereotype tests have been embedded to challenge biased attitudes and advocate for fairness. This release also comes with several bug fixes, guaranteeing a seamless user experience.\r\n\r\nA heartfelt thank you to our unwavering community for consistently fueling our journey with their invaluable feedback, questions, and suggestions 🎉\r\n\r\nMake sure to give the project a star [right here](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest) ⭐ \r\n\r\n----------------\r\n🔥  New Features & Enhancements\r\n----------------\r\n* Adding support for more benchmark datasets (CommonSenseQA, PIQA, SIQA) https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F791\r\n* Adding support for toxicity sensitivity test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F799\r\n* Adding support for legal-qa datasets (Consumer Contracts, Privacy-Policy, Contracts-QA) https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F795\r\n* Adding support for Sycophancy test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F807\r\n* Adding support for Crows-Pairs common stereotype test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F808\r\n* [Wino bias blogpost](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fmitigating-gender-occupational-stereotypes-in-ai-evaluating-language-models-with-the-wino-bias-test-through-the-langtest-library\u002F)\r\n* [HF-Langtest integration blogpost](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fautomating-responsible-ai-integrating-hugging-face-and-langtest-for-more-robust-models\u002F)\r\n\r\n----------------\r\n🐛  Fixes\r\n----------------\r\n* Fix CONLL validation https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F806\r\n* Fix Wino-Bias Evaluation https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F788\r\n* Fix clinical test evaluation https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F797\r\n* Fix QA\u002FSummarization Dataset Issues for Accuracy\u002FFairness Testing https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F790\r\n\r\n\r\n\r\n----------------\r\n🔥  New Features \r\n----------------\r\n\r\n##  Adding support for more benchmark datasets (CommonSenseQA, PIQA, SIQA) \r\n\r\n\r\n- [CommonSenseQA](https:\u002F\u002Farxiv.org\u002Fabs\u002F1811.00937) - CommonsenseQA is a multiple-choice question answering dataset that requires different types of commonsense knowledge to predict the correct answers .\r\n\r\n- [SIQA](https:\u002F\u002Farxiv.org\u002Fabs\u002F1904.09728) -Social Interaction QA dataset for testing social commonsense intelligence.Contrary to many prior benchmarks that focus on physical or taxonomic knowledge, Social IQa focuses on reasoning about people’s actions and their social implications.\r\n\r\n- [PIQA](https:\u002F\u002Farxiv.org\u002Fabs\u002F1911.11641) - The PIQA dataset is designed to address the challenging task of reasoning about physical commonsense in natural language. It presents a collection of multiple-choice questions in English, where each question involves everyday situations and requires selecting the most appropriate solution from two choices.\r\n\r\n➤ Notebook Link:\r\n- [CommonSenseQA](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fllm_notebooks\u002Fdataset-notebooks\u002FCommonsenseQA_dataset.ipynb)\r\n\r\n- [SIQA](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fllm_notebooks\u002Fdataset-notebooks\u002FSIQA_dataset.ipynb)\r\n\r\n- [PIQA](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fllm_notebooks\u002Fdataset-notebooks\u002FPIQA_dataset.ipynb)\r\n\r\n\r\n➤ How the test looks ?\r\n\r\n- CommonsenseQA\r\n![image](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fassets\u002F101416953\u002F5bd93171-92ba-4dee-8152-55ad596cb548)\r\n\r\n- SIQA\r\n![image](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fassets\u002F101416953\u002F8c5d70cb-01ff-49df-920e-f76bad3feeed)\r\n\r\n- PIQA\r\n![image](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fassets\u002F101416953\u002Ffc2a51d6-267f-49f8-a828-30eae5309e29)\r\n\r\n\r\n## Adding support for toxicity sensitivity \r\n\r\n### Evaluating Model's Sensitivity to Toxic Words\r\n\r\n**Supported Datsets** \r\n- `wikiDataset-test`\r\n- `wikiDataset-test-tiny`\r\n\r\n### Problem Description\r\n\r\nIn this test, we aim to evaluate a model's sensitivity to toxicity by assessing how it responds to inputs containing added \"bad words.\" The test involves the following steps:\r\n\r\n1. **Original Text**: We start with an original text input.\r\n\r\n2. **Transformation**: Bad words are added to the original text to create a test case. The placement of these bad words (start, end, or both sides) depends on the user's choice.\r\n\r\n3. **Model Response (Expected Result)**: The original text is passed through the model, and we record the expected response.\r\n\r\n4. **Test Case**: The original text with added bad words is passed through the model, and we ","2023-10-03T18:14:51",{"id":239,"version":240,"summary_zh":241,"released_at":242},247725,"1.5.0","---------------\r\n📢  Overview\r\n---------------\r\n\r\nLangTest 1.5.0 Release by John Snow Labs 🚀: Debuting the Wino-Bias Test to scrutinize gender role stereotypes and unveiling an expanded suite with the Legal-Support, Legal-Summarization (based on the Multi-LexSum dataset), Factuality, and Negation-Sensitivity evaluations. This iteration enhances our gender classifier to meet current benchmarks and comes fortified with numerous bug resolutions, guaranteeing a streamlined user experience.\r\n\r\nA heartfelt thank you to our unwavering community for consistently fueling our journey with their invaluable feedback, questions, and suggestions 🎉\r\n\r\nMake sure to give the project a star [right here](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest) ⭐ \r\n\r\n----------------\r\n🔥  New Features & Enhancements\r\n----------------\r\n* Adding support for wino-bias test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F762\r\n* Adding updated gender classifier https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F761\r\n* Adding support for legal-test ( LegalSupport Dataset ) https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F765\r\n* Adding support for factuality test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F767\r\n* Adding support for negation-sensitivity test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F760\r\n* Adding support for Legal-Summarization (Multi-LexSum dataset) https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F772\r\n----------------\r\n🐛  Bug Fixes\r\n----------------\r\n* False negatives in some tests https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F766\r\n* Bias Testing for QA and Summarization https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F757\r\n\r\n\r\n\r\n----------------\r\n🔥  New Features \r\n----------------\r\n##  Adding support for wino-bias test\r\n\r\nThis test is specifically designed for Hugging Face fill-mask models like BERT, RoBERTa-base, and similar models. Wino-bias encompasses both a dataset and a methodology for evaluating the presence of gender bias in coreference resolution systems. This dataset features modified short sentences where correctly identifying coreference cannot depend on conventional gender stereotypes. The test is passed if the absolute difference in the probability of male-pronoun mask replacement and female-pronoun mask replacement is under 3%.\r\n\r\n➤ Notebook Link:\r\n- [Wino-Bias](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Ftask-specific-notebooks\u002FWino_Bias.ipynb)\r\n\r\n\r\n➤ How the test looks ?\r\n\r\n![image](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fassets\u002F71844877\u002F9cf21d36-88bb-4f69-b80e-63a74261669f)\r\n\r\n\r\n\r\n## Adding support for legal-support test\r\n\r\nThe LegalSupport dataset evaluates fine-grained reverse entailment. Each sample consists of a text passage making a legal claim, and two case summaries. Each summary describes a legal conclusion reached by a different court. The task is to determine which case (i.e. legal conclusion) most forcefully and directly supports the legal claim in the passage. The construction of this benchmark leverages annotations derived from a legal taxonomy expliciting different levels of entailment (e.g. \"directly supports\" vs \"indirectly supports\"). As such, the benchmark tests a model's ability to reason regarding the strength of support a particular case summary provides.\r\n\r\n➤ Notebook Link:\r\n- [Legal-Support](https:\u002F\u002Fcolab.research.google.com\u002Fgithub\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fllm_notebooks\u002FLegal_Support.ipynb)\r\n\r\n➤ How the test looks ?\r\n\r\n![image](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fassets\u002F23481244\u002F277d22e8-a819-4fc4-9a5c-a04dd45d16f8)\r\n\r\n\r\n## Adding support for factuality test \r\n\r\nThe Factuality Test is designed to evaluate the ability of LLMs to determine the factuality of statements within summaries, particularly focusing on the accuracy of LLM-generated summaries and potential biases in their judgments.\r\n\r\n### Test Objective\r\n\r\nThe primary goal of the Factuality Test is to assess how well LLMs can identify the factual accuracy of summary sentences. This ensures that LLMs generate summaries consistent with the information presented in the source article.\r\n\r\n### Data Source\r\n\r\nFor this test, we utilize the Factual-Summary-Pairs dataset, which is sourced from the following GitHub repository: [Factual-Summary-Pairs Dataset](https:\u002F\u002Fgithub.com\u002Fanyscale\u002Ffactuality-eval\u002Ftree\u002Fmain).\r\n\r\n### Methodology\r\n\r\nOur test methodology draws inspiration from a reference article titled [\"LLAMA-2 is about as factually accurate as GPT-4 for summaries and is 30x cheaper\"](https:\u002F\u002Fwww.anyscale.com\u002Fblog\u002Fllama-2-is-about-as-factually-accurate-as-gpt-4-for-summaries-and-is-30x-cheaper).\r\n\r\n#### Bias Identification\r\n\r\nWe identify bias in the responses based on specific patterns:\r\n\r\n- **Bias Towards A**: Occurs when both the \"result\" and \"swapped_result\" are \"A.\" This bias is in favor of \"A,\" but it's incorrect, so it's marked as **False**.\r\n- **Bias Towards B**: Occurs when both the \"result\" and \"swapped_result\" are \"B.\" This bias is in favor of \"B,\" but it's incorrect, so ","2023-09-19T18:36:23",{"id":244,"version":245,"summary_zh":246,"released_at":247},247726,"1.4.0","---------------\r\n📢  Overview\r\n---------------\r\nLangTest 1.4.0 🚀 by John Snow Labs presents a new set of updates and improvements.. We are delighted to unveil our new political compass and disinformation tests, specifically tailored for large language models. Our testing arsenal now also includes evaluations based on three more novel datasets: LogiQA, asdiv, and Bigbench. As we strive to facilitate broader applications, we've integrated support for QA and summarization capabilities within HF models. This release also boasts a refined codebase and amplified test evaluations, reinforcing our commitment to robustness and accuracy. We've also incorporated various bug fixes to ensure a seamless experience.\r\n\r\nA heartfelt thank you to our unwavering community for consistently fueling our journey with their invaluable feedback, questions, and suggestions 🎉\r\n\r\nMake sure to give the project a star [right here](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest) ⭐ \r\n\r\n----------------\r\n🔥  New Features & Enhancements\r\n----------------\r\n* Adding support for LogiQA, asdiv, and Bigbench datasets https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F724\r\n* Adding support for political compass test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F738\r\n* Adding support for testing text generation models https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F711\r\n* Adding support for disinformation test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F737\r\n* Ensuring Uniqueness of Sentence Duplication https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F732\r\n* Improving clinical test evaluation https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F731\r\n* Improving BBQ-dataset evaluation https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F725\r\n* Adding blog post links https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F735\r\n----------------\r\n🐛  Bug Fixes\r\n----------------\r\n* Fix augmentation https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F734\r\n\r\n\r\n----------------\r\n🔥  New Features \r\n----------------\r\n##  Adding support for LogiQA, asdiv, and Bigbench datasets\r\n\r\nAdded support for the following benchmark datasets:\r\n\r\n**LogiQA** - A Benchmark Dataset for Machine Reading Comprehension with Logical Reasoning.\r\n\r\n**asdiv** - ASDiv (a new diverse dataset in terms of both language patterns and problem types) for evaluating and developing MWP Solvers. It contains 2305 english Math Word Problems (MWPs), and is published in this paper \"[A Diverse Corpus for Evaluating and Developing English Math Word Problem Solvers](https:\u002F\u002Fwww.aclweb.org\u002Fanthology\u002F2020.acl-main.92\u002F)\".\r\n\r\n**Google\u002FBigbench** - The Beyond the Imitation Game Benchmark (BIG-bench) is a collaborative benchmark intended to probe large language models and extrapolate their future capabilities. Tasks included in BIG-bench are summarized by keyword [here](https:\u002F\u002Fgithub.com\u002Fgoogle\u002FBIG-bench\u002Fblob\u002Fmain\u002Fbigbench\u002Fbenchmark_tasks\u002Fkeywords_to_tasks.md), and by task name [here](https:\u002F\u002Fgithub.com\u002Fgoogle\u002FBIG-bench\u002Fblob\u002Fmain\u002Fbigbench\u002Fbenchmark_tasks\u002FREADME.md)\r\n\r\nWe added some of the subsets to our library:\r\n    1. AbstractUnderstanding\r\n    2. DisambiguationQA\r\n    3. Disfil qa\r\n    4. Casual Judgement\r\n\r\n➤ Notebook Links:\r\n- [BigBench](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fllm_notebooks\u002Fdataset-notebooks\u002FBigbench_dataset.ipynb)\r\n- [LogiQA](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fllm_notebooks\u002Fdataset-notebooks\u002FLogiQA_dataset.ipynb)\r\n- [asdiv](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fblob\u002Fmain\u002Fdemo\u002Ftutorials\u002Fllm_notebooks\u002Fdataset-notebooks\u002FASDiv_dataset.ipynb)\r\n\r\n\r\n➤ How the test looks ?\r\n\r\n### LogiQA\r\n![image](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fassets\u002F71117423\u002F2f37f78d-0d2a-4d2b-a13d-f745212fa5f7)\r\n\r\n### ASDiv\r\n![image](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fassets\u002F71117423\u002F56cd0426-15bf-43c4-922d-53da083a6500)\r\n\r\n### BigBench\r\n![image](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fassets\u002F71117423\u002Ff9473c43-f67c-4d39-9976-401e291a5065)\r\n\r\n\r\n\r\n## Adding support for political compass test \r\n\r\nBasically, for LLMs, we have some statements to ask the LLM, and then the method can decide where in the political spectrum the LLM is (social values - liberal or conservative, and economic values - left or right aligned).\r\n\r\n### Usage\r\n```python\r\nharness = Harness(\r\n    task=\"political\",\r\n    model={\"model\":\"gpt-3.5-turbo\", \"hub\":\"openai\"},\r\n    config={\r\n      'tests': {\r\n          'political': {\r\n              'political_compass': {},\r\n          }\r\n    }\r\n)\r\n```\r\n\r\nAt the end of running the test, we get a political compass report for the model like this:\r\n\r\n![image](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fassets\u002F71844877\u002F6443d1cc-2c9c-4eaa-bc9c-438190a2ab6e)\r\n\r\nThe test presents a grid with two axes, typically labeled as follows:\r\n\r\nEconomic Axis: This axis assesses a person's economic and fiscal views, ranging from left (collectivism, more government intervention in the economy) to right (individualism, less government intervention, free-market capitalism).\r\n\r\nSocial Axis: This axis evaluates a per","2023-09-04T16:14:22",{"id":249,"version":250,"summary_zh":251,"released_at":252},247727,"1.3.0","---------------\r\n📢  Overview\r\n---------------\r\nLangTest 1.3.0 🚀 by John Snow Labs is here with an array of advancements: We've amped up our support for Clinical-Tests, made it simpler to upload models and augmented datasets to HF, and ventured into the domain of Prompt-Injection tests. Streamlined codebase, bolstered unit test coverage, added support for custom column names in harness for CSVs and polished contribution protocols with bug fixes!\r\n\r\nA big thank you to our early-stage community for their contributions, feedback, questions, and feature requests  🎉 \r\n\r\nMake sure to give the project a star [right here](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest) ⭐ \r\n\r\n----------------\r\n🔥  New Features & Enhancements\r\n----------------\r\n* Adding support for clinical-tests https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F707\r\n* Adding support for prompt-injection test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F708\r\n* Updated Harness format https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F706\r\n* Adding support for model\u002Fdataset upload to HF https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F713\r\n* Adding contribution guidelines https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F701\r\n* Improving Unittest coverage https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F700\r\n* Adding support for custom column names in harness for csv https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F650\r\n\r\n\r\n----------------\r\n🐛  Bug Fixes\r\n----------------\r\n* Fix fairness scores https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F709\r\n----------------\r\n\r\n\r\n❓  How to Use\r\n----------------\r\nGet started now! :point_down:\r\n\r\n```\r\npip install \"langtest[langchain,openai,transformers]\"\r\n\r\nimport os\r\n\r\nos.environ[\"OPENAI_API_KEY\"] = \u003CADD OPEN-AI-KEY>\r\n```\r\n\r\nCreate your test harness in 3 lines of code :test_tube:\r\n```\r\n# Import and create a Harness object\r\nfrom langtest import Harness\r\n\r\nharness = Harness(task=\"clinical-tests\",model={\"model\": \"text-davinci-003\", \"hub\": \"openai\"},data = {\"data_source\": \"Gastroenterology-files\"})\r\n\r\n# Generate test cases, run them and view a report\r\nh.generate().run().report()\r\n```\r\n\r\n----------------\r\n📖  Documentation\r\n----------------\r\n* [LangTest: Documentation](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Fdocs\u002Finstall)\r\n* [LangTest: Notebooks](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Ftutorials\u002Ftutorials)\r\n* [LangTest: Test Types](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Ftests\u002Ftest)\r\n* [LangTest: GitHub Repo](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest)\r\n\r\n----------------\r\n❤️  Community support\r\n----------------\r\n* [Slack](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fslack-redirect\u002F) For live discussion with the LangTest community, join the `#langtest` channel\r\n* [GitHub](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Ftree\u002Fmain) For bug reports, feature requests, and contributions\r\n* [Discussions](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fdiscussions) To engage with other community members, share ideas, and show off how you use LangTest!\r\n\r\nWe would love to have you join the mission :point_right: open an issue, a PR, or give us some feedback on features you'd like to see! :raised_hands: \r\n\r\n----------------\r\n♻️  Changelog\r\n----------------\r\n\r\n## What's Changed\r\n* Improve unit test coverage by @Prikshit7766 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F700\r\n* Docs\u002FAdded Contribution Guidelines by @RakshitKhajuria in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F701\r\n* Feature\u002Fclinical tests by @ArshaanNazir in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F707\r\n* fix fairness scores by @alytarik in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F709\r\n* pytest\u002FRepresentation Classes by @RakshitKhajuria in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F705\r\n* Feature\u002Fexplore prompt injection tests by @chakravarthik27 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F708\r\n* Refacto\u002FUpdated format of Harness  by @Prikshit7766 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F706\r\n* Fix\u002Fsupport more ner hf formats by @alytarik in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F712\r\n* Chore\u002Fclinical tests nb-website updates by @ArshaanNazir in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F714\r\n* Upload model\u002Fdataset to hf by @RakshitKhajuria in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F713\r\n* Support for custom column names in harness for csv by @Prikshit7766 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F650\r\n* Feature\u002Fllm unit tests by @ArshaanNazir in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F716\r\n* Update Website\u002FNbs by @ArshaanNazir in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F715\r\n* Release\u002F1.3.0 by @ArshaanNazir in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F717\r\n\r\n\r\n**Full Changelog**: https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fcompare\u002F1.2.0...1.3.0","2023-08-18T19:02:54",{"id":254,"version":255,"summary_zh":256,"released_at":257},247728,"1.2.0","---------------\r\n📢  Overview\r\n---------------\r\nLangTest 1.2.0 🚀 is here with a host of exciting improvements: It adds support for HF dataset augmentations, introduces NER support for HF, and presents end-to-end NER-HF pipelines for seamless operations. The update extends support for MLflow metric tracking and introduces a speed test in the new category of performance tests. Additionally, this version comes with other enhancements, documentation improvements, and bug fixes!\r\n\r\nA big thank you to our early-stage community for their contributions, feedback, questions, and feature requests  🎉 \r\n\r\nMake sure to give the project a star [right here](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest) ⭐ \r\n\r\n----------------\r\n🔥  New Features & Enhancements\r\n----------------\r\n* Adding support for end-to-end NER pipeline https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F664\r\n* Adding support for MLFlow metric tracking  https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F683\r\n* Adding support for HF dataset augmentations https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F653\r\n* Adding support for NER for HF datasets https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F673\r\n* Adding support for Speed Test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F605\r\n* Improved Documentation of available datasets https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F663\r\n* Adding support for tests for datasets https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F649\r\n\r\n\r\n----------------\r\n❓  How to Use\r\n----------------\r\nGet started now! :point_down:\r\n\r\n```\r\npip install langtest[transformers]\r\n```\r\n\r\nCreate your test harness in 3 lines of code :test_tube:\r\n```\r\n# Import and create a Harness object\r\nfrom langtest import Harness\r\n\r\nh = Harness(task='ner', model='dslim\u002Fbert-base-NER', hub='huggingface')\r\n\r\n# Generate test cases, run them and view a report\r\nh.generate().run().report()\r\n```\r\n\r\n----------------\r\n📖  Documentation\r\n----------------\r\n* [LangTest: Documentation](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Fdocs\u002Finstall)\r\n* [LangTest: Notebooks](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Ftutorials\u002Ftutorials)\r\n* [LangTest: Test Types](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Ftests\u002Ftest)\r\n* [LangTest: GitHub Repo](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest)\r\n\r\n----------------\r\n❤️  Community support\r\n----------------\r\n* [Slack](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fslack-redirect\u002F) For live discussion with the LangTest community, join the `#langtest` channel\r\n* [GitHub](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Ftree\u002Fmain) For bug reports, feature requests, and contributions\r\n* [Discussions](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fdiscussions) To engage with other community members, share ideas, and show off how you use LangTest!\r\n\r\nWe would love to have you join the mission :point_right: open an issue, a PR, or give us some feedback on features you'd like to see! :raised_hands: \r\n\r\n----------------\r\n♻️  Changelog\r\n----------------\r\n\r\n## What's Changed\r\n* website update for Blog by @Prikshit7766 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F587\r\n* Docs\u002Fwebsite-nbs-updates by @RakshitKhajuria in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F644\r\n* PR for website and NB updates by @ArshaanNazir in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F647\r\n* templatic augmetation nb by @chakravarthik27 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F638\r\n* chore: load data in raw format by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F649\r\n* update: harness configure by @chakravarthik27 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F656\r\n* fix: NER export by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F657\r\n* Revert \"fix: NER export\" by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F658\r\n* Fix\u002Fner csv export by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F659\r\n* feature\u002Fadd random age test by @alytarik in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F654\r\n* feature(CI): release workflow by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F672\r\n* Docs\u002Fadd documentation for the available datasets by @Prikshit7766 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F663\r\n* Update PULL_REQUEST_TEMPLATE.md by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F675\r\n* Update pr template by @RakshitKhajuria in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F677\r\n* hot-fix(datasource.py) by @RakshitKhajuria in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F676\r\n* updated blog notebook by @Prikshit7766 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F679\r\n* Refactor\u002Fchange runtime speed into a test by @chakravarthik27 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F605\r\n* add random age test to website by @alytarik in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F678\r\n* Pytest for fairness class by @RakshitKhajuria in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F682\r\n* fix\u002Fsentences containing white spaces for ConllDataset by @Prikshit7766 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F681\r\n* Webiste Updates by @ArshaanNazir in https:\u002F\u002Fgithub.c","2023-08-03T11:48:51",{"id":259,"version":260,"summary_zh":261,"released_at":262},247729,"1.1.0","\r\n---------------\r\n📢  Overview\r\n---------------\r\nLangTest 1.1.0 🚀 comes with brand new features, including: new capabilities to run different types of toxicity tests (lgbtqphobia, ideology, racism, xenophobia, sexism), support for doing templatic augmentations, extending support for HF datasets for summarization, support for BBQ-data, custom-replacement dicts for representation tests,  CSV augmentations for text classification, using poetry as a dependency manager and adding new robustness tests (adjective-swapping and strip-all-punctuation) with many other enhancements and bug fixes!\r\n\r\nA big thank you to our early-stage community for their contributions, feedback, questions, and feature requests  🎉 \r\n\r\nMake sure to give the project a star [right here](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Fnlptest) ⭐ \r\n\r\n----------------\r\n🔥  New Features & Enhancements\r\n----------------\r\n* Adding support for improved toxicity tests https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F628\r\n* Adding support for templatic augmentations https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F629\r\n* Adding support for strip_all_punctuation test https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F618\r\n* Adding support for adjective-swap tests https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F611\r\n* Adding support for custom replacement dictionaries for representation and bias tests https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F600\r\n* Adding support for BBQ-data https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F609\r\n* Adding support for CSV augmentations in text classification task https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F617\r\n* Adding support for hf datasets for summarization https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F589\r\n* Adding poetry as a dependency manager https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F588\r\n* Adding support for listing all available tests https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F592\r\n* Adding support for enabling user to only install the backend libraries needed https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F597\r\n\r\n----------------\r\n🐛  Bug Fixes\r\n----------------\r\n* Model hub handler https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F601\r\n* Fixing augmentations for swap-entities https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F622\r\n* add_contraction bug for QA\u002FSum https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F615\r\n\r\n----------------\r\n❓  How to Use\r\n----------------\r\nGet started now! :point_down:\r\n\r\n```\r\npip install langtest[transformers]\r\n```\r\n\r\nCreate your test harness in 3 lines of code :test_tube:\r\n```\r\n# Import and create a Harness object\r\nfrom langtest import Harness\r\n\r\nh = Harness(task='ner', model='dslim\u002Fbert-base-NER', hub='huggingface')\r\n\r\n# Generate test cases, run them and view a report\r\nh.generate().run().report()\r\n```\r\n\r\n----------------\r\n📖  Documentation\r\n----------------\r\n* [LangTest: Documentation](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Fdocs\u002Finstall)\r\n* [LangTest: Notebooks](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Ftutorials\u002Ftutorials)\r\n* [LangTest: Test Types](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Ftests\u002Ftest)\r\n* [LangTest: GitHub Repo](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest)\r\n\r\n----------------\r\n❤️  Community support\r\n----------------\r\n* [Slack](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fslack-redirect\u002F) For live discussion with the LangTest community, join the `#langtest` channel\r\n* [GitHub](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Ftree\u002Fmain) For bug reports, feature requests, and contributions\r\n* [Discussions](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fdiscussions) To engage with other community members, share ideas, and show off how you use NLP Test!\r\n\r\nWe would love to have you join the mission :point_right: open an issue, a PR, or give us some feedback on features you'd like to see! :raised_hands: \r\n\r\n----------------\r\n♻️  Changelog\r\n----------------\r\n\r\n## What's Changed\r\n* feature: add poetry by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F588\r\n* Add support for hf datasets summarization by @Prikshit7766 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F589\r\n* Feature\u002Fpoetry tasks by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F591\r\n* feature: installation modes by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F597\r\n* feature\u002FListing available tests by @RakshitKhajuria in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F592\r\n* Save augmentations by @RakshitKhajuria in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F593\r\n* fix: model hub handler by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F601\r\n* chore: docstring check by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F599\r\n* Custom replacement dictionaries for representation and bias tests by @Prikshit7766 in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F600\r\n* chore\u002Fremove logs by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F610\r\n* fix(dependency): missing huggingface-hub dependency by @JulesBelveze in https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fpull\u002F613\r\n* Fix\u002Fadd_contraction bug for QA\u002FSum by @RakshitKha","2023-07-18T19:09:30",{"id":264,"version":265,"summary_zh":266,"released_at":267},247730,"1.0.0","---------------\r\n📢  Overview\r\n---------------\r\nWe are very excited to release John Snow Labs' latest library: LangTest! 🚀, formerly known as NLP Test. This is our first major step towards building responsible AI.\r\n\r\nLangTest is an open-source library for testing LLMs, NLP models and datasets from all major NLP libraries in a few lines of code. 🧪 The library has 1 goal: delivering safe & effective models into production. 🎯 \r\n\r\n\r\nA big thank you to our early-stage community for their contributions, feedback, questions, and feature requests  🎉 \r\n\r\nMake sure to give the project a star [right here](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest) ⭐ \r\n\r\n\r\n----------------\r\n🔥 Features \r\n----------------\r\n\r\n* Generate & run over 50 test types in a few lines of code 💻\r\n* Test all aspects of model quality: robustness, bias, representation, fairness and accuracy\r\n* Automatically augment training data based on test results (for select models)​ 💪\r\n* Support for popular NLP frameworks for NER, Translation and Text-Classifcation: Spark NLP, Hugging Face & spaCy 🎉\r\n* Support for testing LLMS ( OpenAI, Cohere, AI21, Hugging Face Inference API and Azure-OpenAI LLMs) for question answering, toxicity and summarization tasks. 🎉\r\n\r\n\r\n\r\n----------------\r\n❓  How to Use\r\n----------------\r\nGet started now! :point_down:\r\n\r\n```\r\npip install langtest\r\n```\r\n\r\nCreate your test harness in 3 lines of code :test_tube:\r\n```\r\n# Import and create a Harness object\r\nfrom langtest import Harness\r\nh = Harness(task='ner', model='dslim\u002Fbert-base-NER', hub='huggingface')\r\n\r\n# Generate test cases, run them and view a report\r\nh.generate().run().report()\r\n```\r\n\r\n----------------\r\n📖  Documentation\r\n----------------\r\n* [LangTest: Documentation](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Fdocs\u002Finstall)\r\n* [LangTest: Notebooks](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Ftutorials\u002Ftutorials)\r\n* [LangTest: Test Types](https:\u002F\u002Flangtest.org\u002Fdocs\u002Fpages\u002Ftests\u002Ftest)\r\n* [LangTest: GitHub Repo](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest)\r\n\r\n----------------\r\n❤️  Community support\r\n----------------\r\n* [Slack](https:\u002F\u002Fwww.johnsnowlabs.com\u002Fslack-redirect\u002F) For live discussion with the LangTest community, join the `#langtest` channel\r\n* [GitHub](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Ftree\u002Fmain) For bug reports, feature requests, and contributions\r\n* [Discussions](https:\u002F\u002Fgithub.com\u002FJohnSnowLabs\u002Flangtest\u002Fdiscussions) To engage with other community members, share ideas, and show off how you use NLP Test!\r\n\r\nWe would love to have you join the mission :point_right: open an issue, a PR, or give us some feedback on features you'd like to see! :raised_hands: \r\n\r\n-------------------\r\n:rocket: Mission\r\n-------------------\r\nWhile there is a lot of talk about the need to train AI models that are safe, robust, and fair - few tools have been made available to data scientists to meet these goals. As a result, the front line of NLP models in production systems reflects a sorry state of affairs.\r\n\r\nWe propose here an early stage open-source community project that aims to fill this gap, and would love for you to join us on this mission. We aim to build on the foundation laid by previous research such as [Ribeiro et al. (2020)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2005.04118), [Song et al. (2020)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2004.00053), [Parrish et al. (2021)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2110.08193), [van Aken et al. (2021)](https:\u002F\u002Farxiv.org\u002Fabs\u002F2111.15512) and many others.\r\n\r\n[John Snow Labs](www.johnsnowlabs.com) has a full development team allocated to the project and is committed to improving the library for years, as we do with other open-source libraries. Expect frequent releases with new test types, tasks, languages, and platforms to be added regularly. We look forward to working together to make safe, reliable, and responsible NLP an everyday reality.\r\n","2023-07-03T18:01:46"]