We are excited to launch our advanced guardrail model, VirtueGuard-Text-Lite. This innovative guardrail model sets a new standard, surpassing existing state-of-the-art models in safety protection performance while operating at unprecedented speed. (Please talk to our team about VirtueGuard-Text-Pro if you are interested.)
In the rapidly evolving landscape of AI, ensuring that models comply with safety and security standards is crucial. VirtueGuard-Text-Lite is designed to provide a robust framework that actively monitors and regulates AI outputs, keeping them aligned with established safety and security protocols. Leveraging dynamic risk assessment and contextual awareness, the model not only blocks harmful or inappropriate input and output content but also adapts to emerging threats in real time. As shown in the figure below, VirtueGuard-Text-Lite achieves over 10% improvement in AUPRC on standard benchmarks such as the OpenAI Moderation and ToxicChat datasets while running more than 30 times faster than models like LlamaGuard. This proactive approach to AI safety represents a significant step forward in maintaining trust and reliability in AI systems, protecting users while unlocking the full potential of AI technologies.
Overall Performance
Building on its exceptional performance, VirtueGuard-Text-Lite showcases its superiority across various safety benchmarks. As highlighted in the detailed comparison table, VirtueGuard-Text-Lite achieves the best results on critical public benchmarks, with 0.948 AUPRC on the OpenAI Moderation dataset and 0.912 AUPRC on the ToxicChat dataset. It outperforms other leading models, such as Llama Guard 3 8B and ShieldGemma 9B, by substantial margins. Notably, VirtueGuard-Text-Lite also stands out for minimizing false positives, with an industry-leading false positive rate of only 0.007 on the Overkill benchmark. This combination of high accuracy in detecting risky content and a low false positive rate makes VirtueGuard-Text-Lite an ideal choice for real-world applications.
| Model | OpenAI Mod AUPRC (↑) | ToxicChat AUPRC (↑) | Overkill AUPRC (↑) | TwinSafety AUPRC (↑) | Aegis AUPRC (↑) |
| --- | --- | --- | --- | --- | --- |
| VirtueGuard Text-Lite | 0.948 | 0.912 | 0.918 | 0.796 | 0.907 |
| Llama Guard 1 7B | 0.796 | 0.651 | 0.862 | 0.715 | 0.897 |
| Llama Guard 2 8B | 0.803 | 0.525 | 0.893 | 0.764 | 0.883 |
| Llama Guard 3 8B | 0.820 | 0.570 | 0.886 | 0.796 | 0.903 |
| ShieldGemma 2B | 0.630 | 0.597 | 0.894 | 0.701 | 0.889 |
| ShieldGemma 9B | 0.894 | 0.767 | 0.914 | 0.735 | 0.904 |
| ShieldGemma 27B | 0.689 | 0.653 | 0.884 | 0.715 | 0.876 |
| OpenAI Moderation API | 0.870 | 0.562 | 0.804 | 0.607 | 0.845 |
| Perspective API | 0.787 | 0.499 | 0.567 | 0.583 | 0.825 |
| ToxicChat T5 | 0.742 | 0.563 | 0.796 | 0.607 | 0.755 |
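All of the numbers above are AUPRC, the area under the precision-recall curve, which is typically computed from per-prompt risk scores against the benchmark's unsafe/safe labels. As a minimal sketch of how such a score is obtained, the snippet below uses scikit-learn's average_precision_score; the labels and scores in it are illustrative placeholders, not benchmark data.

# Minimal sketch: AUPRC (average precision) for a moderation model.
# Labels and scores are illustrative placeholders, not benchmark data.
from sklearn.metrics import average_precision_score

y_true = [1, 0, 1, 1, 0, 0, 1, 0]                           # 1 = unsafe, 0 = safe
y_score = [0.92, 0.10, 0.78, 0.66, 0.31, 0.05, 0.88, 0.44]  # model's unsafe probability per prompt

auprc = average_precision_score(y_true, y_score)            # area under the precision-recall curve
print(f"AUPRC: {auprc:.3f}")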
Risk Categories & Jailbreak
| VirtueGuard-Text-Lite Risk Categories | |
| --- | --- |
| S1 (Violent Crimes) | S2 (Non-Violent Crimes) |
| S3 (Sex-Related Crimes) | S4 (Child Sexual Exploitation) |
| S5 (Specialized Advice) | S6 (Privacy) |
| S7 (Intellectual Property) | S8 (Indiscriminate Weapons) |
| S9 (Hate) | S10 (Suicide & Self-Harm) |
| S11 (Sexual Content) | S12 (Jailbreak Prompts) |
VirtueGuard-Text-Lite covers a comprehensive range of 12 risk categories, including 11 categories from the MLCommons taxonomy and an additional “Jailbreak Prompts” category. This extra category is specifically designed to detect and prevent jailbreak attacks on AI models, adding crucial protection against emerging threats to Large Language Model systems.
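For applications that need to turn the short codes above into human-readable labels, a simple lookup table built from this list is enough. The sketch below is an illustrative helper, not part of the API; the dictionary name and example string are our own.

# Hypothetical helper: map VirtueGuard-Text-Lite category codes to readable names.
RISK_CATEGORIES = {
    "S1": "Violent Crimes",
    "S2": "Non-Violent Crimes",
    "S3": "Sex-Related Crimes",
    "S4": "Child Sexual Exploitation",
    "S5": "Specialized Advice",
    "S6": "Privacy",
    "S7": "Intellectual Property",
    "S8": "Indiscriminate Weapons",
    "S9": "Hate",
    "S10": "Suicide & Self-Harm",
    "S11": "Sexual Content",
    "S12": "Jailbreak Prompts",
}

# Example: expand a verdict such as "S2, S9" into readable labels
flagged = "S2, S9"
print([RISK_CATEGORIES[code.strip()] for code in flagged.split(",")])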
Although VirtueGuard-Text-Lite is not designed as a specialized jailbreak detection model, it excels in this area as well. VirtueGuard-Text-Lite achieves near-perfect performance with a 0.99 AUPRC score on the jackhhao/jailbreak-classification dataset. Notably, this surpasses leading specialized jailbreak detection models, including Deepset, ProtectAI, LlamaPromptGuard, and the jackhhao/jailbreak-classifier.
Another significant advantage of VirtueGuard-Text-Lite over specialized jailbreak detection models is its ability to maintain a low false positive rate. Specialized models, trained primarily on jailbreak or similar tasks, often lack exposure to the diverse range of prompts encountered in real-world applications. As a result, models like Deepset and LlamaPromptGuard tend to misclassify benign prompts as threats, producing a high rate of false alarms. In contrast, VirtueGuard-Text-Lite achieves a remarkably low false positive rate of 0.022. This precision ensures robust security without compromising user experience, making it an ideal solution for real-world applications where both safety protection and usability are critical.
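For clarity on the metric: the false positive rate here is the fraction of benign prompts that get flagged as threats. The snippet below is a minimal sketch of how it can be computed; the labels and predictions are illustrative placeholders, not benchmark data.

# Minimal sketch: false positive rate = FP / (FP + TN) for a jailbreak detector.
# Labels and predictions are illustrative placeholders, not benchmark data.
from sklearn.metrics import confusion_matrix

y_true = [0, 0, 0, 1, 1, 0, 0, 1]   # 1 = jailbreak prompt, 0 = benign prompt
y_pred = [0, 1, 0, 1, 1, 0, 0, 1]   # detector decisions

tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"False positive rate: {fp / (fp + tn):.3f}")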
LlamaGuard Compatibility
VirtueGuard-Text-Lite offers seamless compatibility with the open-source Llama Guard model in both input and output formats, simplifying the process for developers to upgrade their safety tooling. By simply replacing the API call, developers can tap into VirtueGuard-Text-Lite's superior performance with no additional integration effort. This plug-and-play compatibility ensures a cost-effective, near-zero-effort transition to a more effective text moderation AI safety solution. The examples below show equivalent calls in Python, JavaScript, and cURL.
# Python example
import os
import requests

# Read the API key from the environment rather than hard-coding it
API_KEY = os.environ.get('VIRTUEAI_API_KEY')
API_URL = "http://api.virtueai.io/textguardlite"

headers = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {API_KEY}"
}

# The text to be moderated goes in the "message" field
data = {
    "message": "###Your prompt for moderation###"
}

response = requests.post(API_URL, json=data, headers=headers)
print(response.json())
// JavaScript (Node.js) example
import axios from 'axios';

// Read the API key from the environment rather than hard-coding it
const API_KEY = process.env.VIRTUEAI_API_KEY;
const API_URL = "http://api.virtueai.io/textguardlite";

const headers = {
    "Content-Type": "application/json",
    "Authorization": `Bearer ${API_KEY}`
};

// The text to be moderated goes in the "message" field
const data = {
    message: "###Your prompt for moderation###"
};

axios.post(API_URL, data, { headers })
    .then(response => console.log(response.data))
    .catch(error => console.error('Error:', error));
# cURL example
curl -X POST "http://api.virtueai.io/textguardlite" \
    -H "Content-Type: application/json" \
    -H "Authorization: Bearer YOUR_API_KEY" \
    -d '{"message": "###Your prompt for moderation###"}'
Safe Output Format

safe

Unsafe Output Format

unsafe
S2, S9
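Because the verdict follows the Llama Guard-style format shown above, downstream handling stays simple. The sketch below assumes the verdict text has already been extracted from the API response; the helper name and return structure are illustrative, not part of a documented SDK.

# Hypothetical helper: interpret a Llama Guard-style verdict string.
def parse_verdict(verdict: str) -> dict:
    lines = verdict.strip().splitlines()
    if lines and lines[0].strip().lower() == "unsafe":
        # Unsafe verdicts list the violated categories on the next line, e.g. "S2, S9"
        categories = [c.strip() for c in lines[1].split(",")] if len(lines) > 1 else []
        return {"safe": False, "categories": categories}
    return {"safe": True, "categories": []}

print(parse_verdict("safe"))             # {'safe': True, 'categories': []}
print(parse_verdict("unsafe\nS2, S9"))   # {'safe': False, 'categories': ['S2', 'S9']}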