When AI systems reach their limits, they produce alarming results

  • Gemini Pro 2.5 often produced dangerous output when requests were framed as simple instructions
  • ChatGPT models often gave partial compliance, couched in sociological explanations
  • Claude Opus and Claude Sonnet rejected the most damaging requests but showed weaknesses of their own

People rely on modern AI systems for everyday learning and support, often assuming that strong safety measures are always in place.

Researchers at Cybernews ran a structured series of adversarial tests to see whether leading AI tools could be made to produce malicious or illegal output.


Each trial was limited to a single one-minute interaction window, leaving little room for back-and-forth.

Partial and full compliance across models

The tests covered categories such as stereotypes, hate speech, self-harm, cruelty, sexual content and various forms of crime.

Each response was stored in a separate folder using fixed file-naming conventions to allow clear comparisons, and a consistent scoring system tracked whether a model fully complied, partially complied, or refused the request.
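The article does not publish the researchers' actual tooling, but a minimal Python sketch of that kind of bookkeeping might look like the following; the folder layout, file names, and the three Score labels are assumptions for illustration, not Cybernews' methodology.

```python
# Illustrative sketch only: one folder per test category, fixed file names
# per trial, and a three-level compliance score. Names and layout are
# assumptions, not the researchers' actual setup.
from enum import Enum
from pathlib import Path

class Score(Enum):
    FULL_COMPLIANCE = "full"        # model produced the requested harmful content
    PARTIAL_COMPLIANCE = "partial"  # hedged or indirect engagement with the request
    REFUSAL = "refusal"             # request rejected outright

def save_response(base_dir: str, category: str, model: str,
                  trial: int, response: str, score: Score) -> Path:
    """Store one trial's response under <base_dir>/<category>/ with a fixed
    naming convention so runs can be compared across models."""
    folder = Path(base_dir) / category
    folder.mkdir(parents=True, exist_ok=True)
    path = folder / f"{model}_trial{trial:02d}_{score.value}.txt"
    path.write_text(response, encoding="utf-8")
    return path

# Example: record a refusal from a hypothetical model in the "hate_speech" category.
save_response("results", "hate_speech", "model_a", 1,
              "I can't help with that.", Score.REFUSAL)
```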

Results varied widely across categories. Firm refusals were common, but many models showed weaknesses when requests were softened, rephrased, or disguised as analysis.


ChatGPT-5 and ChatGPT-4o often offered sociological or hedging explanations rather than outright refusals, which was scored as partial compliance.

Gemini Pro 2.5 stood out for the wrong reasons, often giving direct answers even to prompts with clearly malicious framing.

In contrast, Claude Opus and Claude Sonnet held firm in the stereotype tests but were less consistent when requests were presented as academic research.

The hate speech tests showed the same pattern: the Claude models performed best, while Gemini Pro 2.5 again proved the most vulnerable.


ChatGPT models typically gave polite or indirect responses that still went along with the prompt's message.

Milder language proved far more effective than outright slurs at getting around the models' guardrails.

Similar weaknesses emerged in the self-harm tests, where indirectly framed questions often bypassed filters and produced dangerous content.

Crime-related categories showed large differences between the models, with some giving detailed explanations of hacking, financial fraud, or smuggling when the intent was disguised as research or observation.

Drug-related prompts drew stricter refusals, although ChatGPT-4o still returned borderline results more often than the others. Stalking was the lowest-risk category overall, with nearly all models rejecting such requests.

The results show that AI tools can still answer malicious questions when those questions are framed the right way.

Because filters can be bypassed through simple rewording, these systems can still leak malicious information.

Even partial compliance becomes risky if the information disclosed relates to illegal activity, or to situations where people would normally rely on tools such as anti-phishing protection or a firewall to stay safe.
