This blog provides a summary of the latest red-teaming evaluations by our VirtueRed on three leading Anthropic models—Claude 3.7 Sonnet, Claude 3.7 Sonnet Thinking, and Claude 3.5. For the full report, visit here.
As AI models and systems evolve rapidly, ensuring their safe & secure deployment is more critical than ever. At VirtueAI, our proprietary VirtueRed platform, which hosts more than 100 red-teaming algorithms, automatically tests foundation models against broad risk categories, such as regulation-based and use-case–driven risks.
Today, we compare Anthropic’s newest releases to highlight their safety profiles and areas for enhancement and draw insights on whether current reasoning could improve the safety & security of foundation models.
VirtueRed: Automated red-teaming testing platform for foundation models and applications
VirtueRed automatically generates challenging data based on our advanced proprietary red-teaming algorithms for large language models and multimodal foundation models. VirtueRed covers broad risk categories, which mainly follow two principles:
- Regulatory compliance risks (e.g., adherence to frameworks like the EU AI Act, GDPR, company policies)
- Use-case–driven risks (e.g., hallucination, bias, privacy, adversarial robustness, and brand risks)
VirtueRed allows us to pinpoint both systemic weaknesses and real-world operational challenges. The risk categories and metrics are customizable to industrial usages.
Do reasoning capabilities lead to safer models?
Anthropic’s release of Claude 3.7 Sonnet Thinking introduces a groundbreaking “hybrid reasoning model” that offers users the flexibility to choose between rapid responses and more deliberate, step-by-step reasoning. This versatility enhances its performance across various domains, including complex problem-solving and creative tasks. Here are the key findings from the red-teaming evaluation by VirtueRed:
- Thinking vs. Non-thinking: In our direct comparison between Claude 3.7 Sonnet (without thinking) and Claude 3.7 Sonnet (thinking), they maintain very similar safety profiles. This observation is a bit surprising given the expectations around enhanced reasoning, indicating that the new thinking mode does not yet translate into a significant safety advantage.
- Analysis on risk categories: On the EU AI Act and AI company policy based evaluation risk categories, Claude 3.7 Sonnet (thinking) achieves a small improvement (around 10%), which implies the alignment and reasoning targets of the model with the thinking process.
- Claude 3.7 vs. Claude 3.5: Both versions of Claude 3.7 Sonnet significantly reduce instances of over-cautiousness and false refusals compared to Claude 3.5. This improvement allows the model to more accurately distinguish between genuinely harmful requests and benign ones that previous versions might have incorrectly flagged, resulting in a more helpful model while maintaining strong safety alignment.
Takeaway: Despite its innovative promise, the integrated Thinking mode in Claude 3.7 Sonnet delivers performance nearly identical to its non-thinking counterpart, suggesting that further refinements are needed to realize the potential safety benefits of current reasoning. In the meantime, Claude 3.7 notably reduces over-cautiousness and false refusals compared to Claude 3.5 while maintaining core safeguards.
Detailed Red-teaming Analysis on Claude Sonnet Models: Key Findings
Reduction in Over-Cautiousness (False Refusals)
- Claude 3.7 Sonnet demonstrates marked improvement in distinguishing between genuinely harmful requests and benign ones, significantly reducing false refusals compared to Claude 3.5.
- This improvement enables engagement with fiction, thought experiments, and educational scenarios that pose no actual harm, which increases usability without compromising overall safety protections.

Virtue AI has collaborated with Stanford to develop the leaderboard for regulatory compliance risks, AIR-Bench, which has been integrated with Stanford HELM.

EU AI Act Compliance Risk Assessments
- Both Claude 3.7 variants demonstrate clear adherence to regulatory guidelines, with robust safeguards in areas such as automated decision-making and privacy handling.
- In contrast, Claude 3.5 occasionally produced outputs—such as borderline automated assessments—that hint at potential regulatory ambiguities (e.g., outputs resembling credit-risk evaluations).
Takeaway: Anthropic’s latest Claude 3.7 variants offer stronger regulatory safeguards compared to Claude 3.5.
Advice in Heavily Regulated Industries
- Despite Claude 3.7 Sonnet achieving good safety scores in other categories, it still shows weaknesses when providing advice in heavily regulated industries, where it should maintain stricter boundaries.
- The safety performance in this category shows minimal improvement from Claude 3.5 to Claude 3.7, with both models receiving nearly identical scores in our evaluations.

Claude 3.7 provides undesired advice in regulated industry
Takeaway: We recommend that Anthropic implement more specialized guardrails for regulated industry contexts to address these persistent vulnerabilities for future Claude models.
Hallucination
- When faced with ambiguous or adversarial prompts, both Claude 3.7 variants deliver more accurate, contextually grounded outputs, while VirtueRed is able to identify some failure cases.
- Claude 3.5, on the other hand, sometimes generates hallucinated answers under stress.

Claude 3.7 outputs hallucinated answers
Takeaway: Enhanced design in Claude 3.7 translates to superior factual reliability compared to Claude 3.5.
Fairness & Bias
- All three models registered low bias levels, effectively handling sensitive topics with minimal disparity.
Takeaway: In terms of fairness and bias, Anthropic’s models consistently perform at a safe baseline, with no material differences warranting further intervention.
Privacy & Security
- Both Claude 3.7 variants incorporate proactive privacy safeguards; for example, when prompted to reveal sensitive data, they quickly revert to safe outputs without extraneous disclosure, with some failure cases identified by VirtueRed.
- Although Claude 3.5 upholds strong security measures, our tests revealed occasional leakage before engaging these safeguards under stress conditions.

PII leakage example for Claude 3.7
Takeaway: Claude 3.7 offers a more reliable baseline for privacy and security, making them better suited for privacy-sensitive applications.
Societal Harmfulness
- All models effectively filter out extremist content and misinformation, consistently aligning with safe content guidelines.
Takeaway: Anthropic’s models reliably minimizes societal harm, with each variant demonstrating strong content safety alignment.
Robustness
- Stress tests show that both Claude 3.7 Sonnet and Claude 3.7 Sonnet Thinking maintain resilience against general prompt injections and manipulation attempts.
- In contrast, Claude 3.5 displayed more variability— producing less secure outputs when subjected to adversarial challenges.
Takeaway: For environments where adversarial challenges are paramount, Claude 3.7 establishes a more robust safety foundation.
Conclusions
Our comprehensive red-teaming analysis indicates that Anthropic’s latest Claude 3.7 models exhibit an improved safety profile compared to Claude 3.5. Both the 3.7 Sonnet and 3.7 Sonnet Thinking variants align well with regulatory standards, effectively mitigate operational risks, and demonstrate low levels of bias and societal harm.
While the differences between the two 3.7 models are subtle, Claude 3.7 Sonnet Thinking showed a slightly higher level of safety, particularly in handling regulatory risk categories related to, say, the EU AI Act. However, despite its enhanced reasoning capabilities, the Thinking mode does not contribute meaningfully to improving the model’s safety, which poses interesting challenges.
Safe and Secure AI Deployments with VirtueAI
Need assistance in ensuring the safe and secure deployment of AI models & applications? Learn more about our red-teaming and guardrail solutions by contacting our team at contact@virtueai.com.