// prompt injection

All signals tagged with this topic

Attackers exploit chatbot personalities to bypass safety guardrails

Adversaries are discovering that LLMs' conversational personas—designed to feel helpful and engaging—create exploitable gaps in safety training. Rather than attacking the underlying model, they're jailbreaking through social engineering of the interface itself, asking chatbots to roleplay as "uncensored" versions or to explain harmful content "for educational purposes." The mismatch is structural: these systems are trained on broad safety principles but deployed as conversational personas optimized for user engagement. The persona becomes a liability.

Google's AI Overviews break on basic command words

Google's summarization feature fails on simple imperatives like "disregard," "ignore," and "skip," suggesting the underlying models either lack instruction-following capability or are overcorrecting against prompt injection. This reveals a core design tension: making AI outputs responsive to user intent versus resistant to adversarial manipulation. Google has chosen lockdown over functionality in these cases.