Safety

MindGuard: open-source guardrails that understand clinical context

António Farinhas

Introduction

Large language models are increasingly used for mental health support, from emotional support to coaching to psychotherapy-style conversations. Frontier models are good at sustaining a coherent multi-turn dialogue, but coherence is not the same as clinical appropriateness, and nowhere is that gap more consequential than in mental health safety. Unlike trained clinicians, who follow established protocols for assessing and responding to risk, current models have no reliable mechanism for incorporating such risk protocols in their responses.

A common way to mitigate this is a guardrail model: a lightweight classifier that monitors messages and triggers an intervention when it detects harmful content. These work well for content moderation, but general-purpose guardrails are poorly suited to mental health. They sort content into broad harm categories like violence, hate, or self-harm, and they optimize for detecting that a sensitive topic is present rather than judging whether a clinically meaningful risk signal is present in context. The result is a system that cannot tell a therapeutic disclosure from a genuine clinical crisis. A person describing a past suicide attempt in the course of processing it is not the same as a person expressing current intent, and a guardrail that treats those two the same will either interrupt care it should not interrupt or miss the signal that actually matters.

We built MindGuard to close that gap. First released in February 2026 and strengthened since through expert peer review, it is a family of lightweight, open-source safety classifiers developed with PhD-level licensed clinical psychologists, built around three contributions: a clinically grounded risk taxonomy, a benchmark of multi-turn conversations annotated turn by turn by clinical experts, and an automated red-teaming framework for measuring how a classifier changes downstream system behavior. This post is the in-depth, standalone account of how all of it works and what we found. A shorter, high-level version is in the original announcement. We are open-sourcing everything, because we believe safety in healthcare AI should be a shared foundation rather than a proprietary advantage.

A clinically grounded risk taxonomy

Working with licensed clinical psychologists, we built a risk taxonomy that captures clinically actionable risk while staying simple enough to interpret consistently and to drive system behavior. It defines three categories based on the appropriate clinical response: safe, self-harm, and harm to others.

Safe covers messages with no imminent risk to self or others. This includes the bulk of therapeutic conversation: common topics like stress, work, and relationships, depression or anxiety symptoms without suicidal ideation, historical trauma disclosures without current risk indicators, and metaphorical language like "I'm drowning in work." Self-harm covers imminent risk of suicide or self-harm, from direct or indirect ideation to method inquiries to discussion of past attempts linked to current ideation. Harm to others covers imminent risk to identifiable others, including threats, violent ideation with a target, or abuse and neglect of dependents. Figure 1 illustrates the proposed taxonomy.

Figure 1. The MindGuard risk taxonomy. Three response-driven categories, Safe, Self-Harm, and Harm to Others, with representative examples of the content that falls under each.

These specific categories are not arbitrary. The split between self-directed and other-directed harm reflects a fundamental clinical distinction: self-harm responses center on collaborative safety planning within a therapeutic relationship, while credible threats toward identifiable others and indicators of abuse or neglect trigger duty-to-protect and mandated-reporting obligations that are ethically, legally, and procedurally different. Most safety taxonomies collapse other-directed harms under a broad violence label, obscuring a distinction that determines the correct response. And the explicit safe category encodes a clinical, rather than content-moderation, notion of safety. Mental health conversations are full of intense emotional material, and treating all of it as a safety event leads to unnecessary escalation, disrupted rapport, and reduced willingness to disclose. Defining what does not require intervention is what lets a system stay engaged when staying engaged is the therapeutic thing to do.

MindGuard-testset: expert-annotated evaluation data

To evaluate safety classifiers under realistic conditions, we built MindGuard-testset, a benchmark of multi-turn mental health conversations annotated at the turn level by licensed clinical psychologists. To our knowledge it is the first public benchmark with turn-level clinical risk labels for multi-turn mental health support.

For data collection, we engaged 10 licensed clinical psychologists to interact directly with a clinician language model, simulating realistic therapeutic conversations. They adopted diverse patient archetypes varying in symptom profile, background, and risk level, and produced both low-risk and high-risk conversations. This captures conversational dynamics and clinically plausible risk expressions that purely synthetic generation struggles to produce, and it draws on a competency psychologists already have from training with simulated and standardized patients.

A separate group of three licensed clinical psychologists then annotated each conversation at the turn level using the taxonomy. Annotators saw the full preceding context for each user message but not the model's response to the current turn, which reduces bias from the model's output. Final labels were set by majority vote, with 94.4% unanimous agreement and a Krippendorff's alpha of 0.57. We used clinical experts rather than crowd workers deliberately: distinguishing a borderline case from an unsafe one, and reading a single turn in light of the whole conversation, is exactly the judgment that defines clinical practice and is hard to replicate with non-expert annotators.

The dataset comprises 1,134 annotated user turns across 67 conversations, with an average of 16.9 turns each. The class balance is realistic: 96.3% of turns are safe and 3.7% are unsafe (1.8% self-harm, 1.9% harm to others). That imbalance is clinically meaningful, since acute crisis disclosures are rare, but coverage of high-risk interactions stays strong: about a quarter of conversations (25.4%) contain at least one unsafe turn.

Building MindGuard: synthetic data with clinical supervision

High-risk scenarios that fit the taxonomy are rare in real conversations, so for training data we generate synthetic multi-turn dialogues through a controlled two-agent setup. A patient language model (PLM) follows a predefined clinical scenario while a clinician language model (CLM) responds as an AI therapist without access to that scenario. Each scenario specifies the patient's presentation, emotional state, and communication style; a risk category and finer-grained subcategory such as direct ideation, passive ideation, or metaphorical language; a maximum dialogue length; and a target trajectory such as gradual escalation, sustained ambiguity, or de-escalation. We wrote roughly 300 scenarios with clinical input and generated multiple distinct conversations from each by varying model instantiations and sampling, which captures stylistic diversity while holding clinical intent constant.

After generating each conversation, a judge language model (JLM) labels every user message under the taxonomy. We present all user turns as a numbered list so the judge can assess each message in the context of the full interaction and account for how risk accumulates across turns. This is intentionally different from deployment, where a classifier operates turn by turn with no access to the future. Using full-conversation context for labeling produces higher-quality supervision, and we aggregate the judge across five samples by majority vote for robustness on borderline cases.

A natural question is whether an LLM judge is reliable enough to supervise training. We checked it against the human consensus on MindGuard-testset using the same labeling setup. The judge reaches a Cohen's kappa of 0.62, which sits inside the range of pairwise agreement among the three human annotators (0.51 to 0.67). Adding the judge as a fourth rater moves Krippendorff's alpha only from 0.57 to 0.55, so it introduces no more disagreement than the clinicians introduce among themselves.

The final training set is 5,812 labeled user turns: 62.1% safe, 21.3% self-harm, and 16.5% harm to others. The safe category deliberately includes emotionally intense but non-crisis content, such as trauma histories, metaphorical references to harm, and passive ideation accompanied by strong protective factors, so the models learn the clinical distinction between distressing content and actionable risk. We fine-tune the classifiers from Qwen3Guard-Gen at 4B and 8B parameters for three epochs with supervised learning.

Results: turn-level risk classification

We evaluate at two levels: turn-level classification accuracy in isolation, and system-level safety inside a multi-turn adversarial system. We start with the turn level, comparing against the safety classifiers that support custom category definitions: Llama Guard 3 at 1B and 8B, and gpt-oss-safeguard at 20B and 120B. We report AUROC and false positive rate at high recall (the threshold-independent metrics that suit safety, where the deployment threshold varies by application), with conversation-level cluster bootstrap confidence intervals on every metric.

Across all metrics, MindGuard outperforms the baselines (Table 1, with ROC curves in Figure 2). MindGuard 8B reaches 0.982 AUROC and MindGuard 4B reaches 0.981, ahead of Llama Guard 3 8B (0.970) and gpt-oss-safeguard 120B (0.960), and the ranking holds under conversation-level cluster bootstrap confidence intervals, which matters given how few unsafe turns there are. Even the 4B model beats every baseline, including one roughly 30 times its size. At 90% true positive rate, MindGuard's false positive rate sits between 3.1% and 4.1%, a 2 to 26 times reduction relative to the general-purpose safeguards, with similar gains at 95% recall.

Table 1. Turn-level risk classification on MindGuard-testset. AUROC (higher is better) and false positive rate at 90% and 95% true positive rate (lower is better), against the safety classifiers that support custom category definitions. Both MindGuard models lead on every metric, and MindGuard 4B outperforms baselines up to 30 times its size.

Figure 2. ROC curves for turn-level risk classification, with a zoom on the low-false-positive region (inset). Markers show the operating points at 90% and 95% true positive rate. In the high-recall regime that matters for safety, MindGuard separates cleanly from the general-purpose safeguards, holding the lowest false positive rate at fixed recall.

Aggregate accuracy can hide whether a classifier flags risk early enough to act on, so we also measure detection timing on the conversations that contain at least one human-annotated unsafe turn. We report whether the classifier ever flags the conversation, and whether it flags risk at or before the first annotated unsafe turn. MindGuard is the only family that detects every unsafe conversation, and MindGuard 8B flags risk at or before the first unsafe turn in 94% of cases, catching escalation about as early as human annotators do. Llama Guard 3 8B, despite a competitive 0.970 AUROC, manages only 41% on that early-detection measure. The summary number can look strong while the classifier still arrives too late.

The practical reading: in mental health settings, high recall is necessary because you cannot afford to miss a crisis signal, but a high false positive rate at that recall means escalating benign conversations, eroding rapport, and adding clinician burden. Lower false positives at fixed high recall translate directly into more usable safety thresholds, and early detection turns response from retroactive into timely.

Results: system-level safety with automated red teaming

Turn-level accuracy is necessary but not sufficient, because a classifier is deployed as part of a larger system. We therefore evaluate at the system level using automated red teaming, and we release the framework so others can do the same. An attacker patient model engages a target clinician model, configured with or without a safety classifier, using a gradual-escalation strategy: the conversation opens with innocuous content like academic or workplace stress and introduces explicit crisis content only at a predetermined injection turn. When the classifier flags a message as unsafe, a developer message injects the matching clinical protocol (see Figure 3). The attack library spans 67 self-harm protocols across 11 subcategories and 78 harm-to-others protocols across 11 subcategories, each run five times, for 725 multi-turn interactions. We score two things with an LLM judge: attack success (the system fails to detect the crisis and continues standard engagement) and harmful engagement (the system actively engages with harmful content in a way that could normalize or facilitate it).

Figure 3. Detect-then-inject in action. MindGuard flags an escalating caregiver conversation as Harm to Others and triggers injection of the matching protocol. With the protocol, the model surfaces appropriate support and safety resources; without it, it misses the risk. Injection happens only on the flagged turn.

Placing MindGuard in the loop produces large, consistent improvements (Figure 4). On GLM-4.6, MindGuard 4B cuts attack success from 25.1% to 7.6%, a 70% reduction, against 44% for the strongest general-purpose baseline (gpt-oss-safeguard 120B), and cuts harmful engagement from 13.7% to 3.3%, a 76% reduction, against 50% for the best baseline. The trend holds across other base models, including Qwen3-235B-Instruct and Qwen3-235B-Thinking. Once again, MindGuard 4B beats gpt-oss-safeguard 120B, which underscores that task-specific supervision can outweigh raw scale.

Figure 4. System-level safety under automated red teaming. Attack success and harmful engagement (both lower is better) across three base counselor models, with no classifier and with each safety classifier in the loop. MindGuard yields the largest reductions in every setting, and MindGuard 4B outperforms gpt-oss-safeguard 120B, a model roughly 30 times its size. LG = Llama Guard 3; GPT = gpt-oss-safeguard.

We verified these results three ways. Under one-sided permutation tests with attack-level resampling, MindGuard 8B significantly improves over every open-source baseline across all three base models (30 of 30 comparisons, p < 0.05), and MindGuard 4B is significant in 27 of 30. Because we use an LLM judge, self-preference bias is a concern, so we swapped the system-level judge from GLM-4.6 to GPT-5: the ordering and significance hold (MindGuard stays significantly better than every open-source baseline in 56 of 60 pairwise tests), so the gains are not an artifact of within-family alignment. Finally, against frontier proprietary classifiers, GPT-5 shows no statistically significant advantage over MindGuard on any metric or counselor (0 of 12), and Claude Sonnet 4.5 is significantly better only on attack success rate in 4 of 6 comparisons. A small, open, task-specific classifier holds its own against much larger closed models.

Detection in the loop, on a benchmark we did not design

The red-teaming results show MindGuard works inside a system we built. The obvious question is whether the same idea (detect risk, then act on it) holds up on an independent benchmark, so we tested it on VERA-MH, Spring Health's multi-turn, clinically grounded suicide-risk benchmark. VERA-MH role-plays patients across a range of disclosure styles, has the model under test respond over a full conversation, and scores the transcript against a clinician-developed rubric.

The setup mirrors deployment. MindGuard runs on every turn, benign turns proceed with a clean context, and when it flags self-harm risk a developer message injects the matching clinical protocol, grounded in established suicide-risk practice such as the Columbia Protocol and the Zero Suicide framework, so the protocol arrives exactly when it is needed rather than sitting resident in the system prompt on every turn. Claude Sonnet scores about 60 on VERA-MH with no instruction and about 68 with a naive mental-health system prompt. With MindGuard detection plus the injected protocol, the same model rises to about 80, into the 80 to 82 band Spring reports as state of the art (Figure 5). We are not claiming to beat them, and we would not assert which base model underlies their headline number. The useful point is that a lightweight detection-and-injection layer reaches that band on a model that is neither the strongest available nor one designed specifically for mental health. 

Figure 5. Same model, different safety strategy. Claude Sonnet scored by VERA-MH (higher is better) under three configurations: no protocol (60.2), a naive mental-health system prompt (68.1), and MindGuard detection with protocol injection (80.4). Detection-triggered injection lifts the same underlying model by about 20 points, into the state-of-the-art band Spring Health reports (~80 to 82).

The gain is informative about where safety comes from. Broken out by rubric dimension, the model is already strong at what depends on its own clinical instincts: risk detection is near the ceiling before you tell it anything. Almost the entire improvement lands in the dimensions that are pure protocol script, like guiding the user toward specific care steps and following expected AI boundaries. Much of what looks like a safety gap is really a knowledge-of-protocol gap, and the binding constraint is knowing the protocol, not capability. That is exactly the constraint a reliable detector plus on-demand injection is built to remove, and it is why the false-positive reductions above matter here: detect-then-inject is only as good as a detector that can run at high recall without flooding benign conversation with false alarms.

This complements rather than competes with VERA-MH. The benchmark is a standardized yardstick for adherence to one suicide-risk protocol; MindGuard is a deployable layer that detects risk and runs the protocol the situation calls for, with a design that extends to other categories of harm where the right clinical response differs. Both get there by the same mechanism, getting the right protocol in front of a capable model at the right moment, and detection-first is what makes a strong score a statement about a shippable system rather than a test-time configuration.

Limitations

MindGuard represents meaningful progress, and several limitations are worth stating plainly.

Turn-level, not longitudinal. Our classifiers operate at the turn level within a single conversation. They do not model risk across multiple sessions or an evolving therapeutic relationship, which is how clinical risk assessment actually works over time. Longitudinal safety mechanisms that detect gradual escalation across sessions are a priority for future work.

Detection is not the whole of intervention. Our system-level evaluation uses a coarse developer-message intervention as a stand-in for the nuanced response a real clinician provides. In mental health contexts, disengagement or refusal can itself be a safety failure by disrupting access to support. Richer intervention strategies grounded in real clinical protocols are needed for deployment, and the VERA-MH experiment is an early step in that direction.

The detector is a single point of dependency. Detect-then-inject is only as safe as its detector. If MindGuard misses, nothing fires. We treat this as a reason to invest in recall rather than a reason to avoid the architecture, but it is a real concentration of risk that any deployment should account for.

A complement to clinical judgment, not a replacement. MindGuard is a context-sensitive signal detector that supports safety-relevant decisions. It is not a diagnostic or treatment-recommendation system, and it does not replace the accountability, judgment, and longitudinal context that professional mental health care requires.

What's next

Two directions follow directly from this work. The first is richer, clinically grounded intervention, moving beyond the coarse developer-message stand-in toward response strategies that maintain therapeutic engagement while escalating appropriately, with per-category protocols routed by the detector. The second is longitudinal safety, developing mechanisms that track gradual risk escalation across sessions rather than within a single conversation.

Open-source release

We are open-sourcing everything needed to reproduce, extend, or build on this work: the trained MindGuard models (4B and 8B), the clinically annotated MindGuard-testset, the risk taxonomy itself, and the automated red-teaming framework for system-level safety evaluation.

Read the full paper, and access the data and models, from the links in the resources below.

Acknowledgements

This work was led by the Sword Health AI Research Team in close collaboration with our Clinical Team, whose input shaped every stage: the risk taxonomy, the annotation campaigns that built and validated MindGuard-testset, the synthetic data design, and the clinical protocol behind the VERA-MH experiments. MindGuard would not exist without them. We also thank the reviewers whose feedback made the paper stronger.

Footnotes

  1. 1

    MindGuard paper (arXiv): arxiv.org/abs/2602.00950

  2. 2
  3. 3

    Models and MindGuard-testset (Hugging Face): huggingface.co/swordhealth

  4. 4

    VERA-MH benchmark, Spring Health (arXiv): arxiv.org/abs/2602.05088

  5. 5
  6. 6
  7. 7
  8. 8

    Columbia Protocol (C-SSRS): cssrs.columbia.edu

  9. 9

    Zero Suicide framework: zerosuicide.edc.org

  10. 10

    988 Suicide & Crisis Lifeline: 988lifeline.org

Portugal 2020Norte 2020European UnionPlano de Recuperação e ResiliênciaRepública PortuguesaNext Generation EU