Agents

May 18, 2026 • 10 min read

How we built our Care Coordinator agent

Inside the multi-agent AI system resolving member support at scale.

Written by

Lead Machine Learning Engineer, Sword Health

Introduction

Care Coordinators (CCs) are the human backbone of member support at Sword. Their mission is critical: remove blockers to care so that members, who engage with Sword solutions, can follow their prescribed programs with success, while providing a delighting support experience. To achieve that, CCs handle support tickets that arrive through SMS, email, app messages, and more. They gather the right context, decide what actions are needed, and either resolve the issue themselves — managing returns, scheduling, updating member information — or escalate to Clinical Specialists or other teams when necessary.

But solving a ticket is rarely a simple, one-step process. A CC often needs to identify the problem, consult internal policies, verify the member’s status (program enrollment, device health, billing details, etc.), follow a prescribed set of resolution steps, use internal tools, request additional context from the member and finally confirm that everything is truly resolved.

To make things even more complex, some cases are critical or highly intricate — for example, checking program eligibility, shipping replacement devices, or requesting input from a Therapist before proceeding with a resolution. After all, Sword operates in a clinical context.

Adding to this challenge, Sword’s product offering spans multiple solutions so the agent needs to be adaptable and context-aware. You get the idea: this is a hard problem.

The sustained exponential growth that Sword has experienced has put increasing pressure on our CCs. One option would have been to scale the team at the same pace — but that approach would have quickly made operations overly complex, slowed down resolution times, and ultimately led to a subpar member experience. As such, true to Sword's ethos, we set out to solve this problem with AI, and by building the AI CC Agent we were able to:

Reply to and resolve member issues almost instantly
Scale effortlessly as Sword continues to grow
Overall maintain our delighting support experience

Next, we’ll share a high-level overview of how we built the AI CC Agent, along with some of the key learnings and insights we gained along the way. We’ll start by describing our multi-agent architecture that enables the system to reason, act, and collaborate much like a human CC would. Then, we’ll dive into the shared knowledge base that powers its decision-making and the orchestration layer that allows it to perform real actions in our backend systems. Finally, we’ll briefly discuss how we evaluate, monitor, and continuously improve the agent, and share a few thoughts for the future.

Building a multi-agent system

Hard-coding a single monolithic agent to handle every possible case type quickly becomes brittle and expensive to maintain, and it could also prove overly complex for a single agent to replicate the nuanced workflows human CCs go through. Instead, we designed a multi-agent architecture that adapts flexibly to any situation. And what if it can’t? The system immediately escalates to a human CC and lets the member know that someone’s working on it, as shown in the figure below.

Figure 1. Overview of the AI CC Agent capabilities

The beauty of this approach is that we only had to build one multi-agent system — once. From there, it’s just a matter of making sure the right resolution protocols exist in the knowledge base and that the system has access to the tools it needs. This design also facilitates observability, allowing us to pinpoint where our architecture struggles most and fix issues efficiently.

Here’s what the architecture looks like:

Figure 2. High-level overview of the AI CC Agent architecture

One of the biggest advantages of a multi-agent setup is specialization. Each agent focuses on a particular task, which keeps responsibilities clear and dramatically improves performance.

Meet the agents

Some of our agents follow the ReAct framework. That means they use “reasoning and acting” — thinking out loud and leveraging tools iteratively — until they get the job done.

Supervisor: Determines the member’s intent from incoming messages and ensures the issue is resolved end-to-end. It can search for resolution protocols, answer directly, or delegate tasks to a specialized agent. If an issue can’t be solved, it escalates to a human CC. Sometimes, it even decides not to reply (say, when a member simply reacts with a 👍).
Specialized: Executes resolution procedures, especially those requiring specific tools. Many of these involve interacting directly with Sword’s back-office systems — creating orders or updating records. By giving specialized agents the ability to act, not just reason, the system moves beyond simple Q&A and becomes a true problem-solver. If the agent hits a dead end, it hands control back to the supervisor.
Reply: Formats responses according to CC communication guidelines and the message channel (SMS, email, etc.).
Guardrails: Validates every reply against predefined rules (checking for things like hallucinations or unintended clinical advice). If something’s off, it rejects the response, sends feedback, and routes control back to the supervisor.

During the resolution of a case, agents actively call tools and exchange information. This not only helps them solve issues more effectively, but also makes the process transparent for developers, who can observe the chain of reasoning and spot potential issues.

Example of the multi-agent system solving the problem

Member: Hi, I tried to pull up the link below on UPS, it won't work. Please send a new link.
Supervisor Agent: [member name] is requesting a new UPS return label for her older Sword kit, as the previous link provided did not work. I need to find a protocol that can assist in generating or providing a new return label for the kit. Let's search for relevant protocols to address this issue. [Here the Supervisor Agent will search the knowledge base for resolution protocols related to providing return labels]
Supervisor Agent: search protocols (query=return label for Sword kit, UPS label generation, older kit return process)
Supervisor Agent: [member name] is requesting a new UPS return label for her older Sword kit because the previous link provided did not work. I found a protocol that can assist in generating or providing a new return label for the kit. I will delegate this task to a specialized agent using the protocol ID '1f2b4a7d-7207-80c7-9e48-cc7f022e364d'.

After the Supervisor Agent has retrieved an adequate protocol, it will delegate its execution to the Specialized Agent, which has access to tools and is carefully prompted to follow procedures.

Specialized Agent: The protocol assigned to me is for troubleshooting when a return label or order is unavailable or invalid. The steps involve creating a new return order and obtaining a return label. If these steps fail, it involves confirming the user's address or escalating the issue.
Here's what I will do next:
1. Attempt to create a new return order for [member name]’s older kit using the create_return_order tool.
2. If successful, I will then use the get_return_label tool to obtain the return label link.
3. If unsuccessful, I will follow the protocol's instructions for further troubleshooting or escalation.
Let's start by creating a new return order.
[The Specialized Agent can call tools that will interact with Sword’s backend systems]
Specialized Agent: tool call create_return_order
Specialized Agent: tool call get_return_label
Specialized Agent: The protocol assigned to me is for troubleshooting when a return label or order is unavailable or invalid. The steps involve creating a new return order and obtaining a return label. If these steps fail, it involves confirming the user's address or escalating the issue.
Here's what I have done so far:
I successfully created a new return order for [member name]’s older kit using the create_return_order tool and generated a return label with get_return_label.
Next, I will provide [member name] with the return label link and instructions for returning the kit.
Hello [member name]!
My name is Mary , an AI Care Coordinator from the Sword team. We’re reaching out to help you with the return of your kit.
Here's the multi-method return label link: https://www.ups.com/uel/llp/[...]
You can return your kit by either using the return label included in the kit, print this one at home, or present the scannable code at UPS.
To return your kit, please follow these steps:
(…)

The Reply Agent formats the Specialized Agent reply as required for the communication channel and according to a set of guidelines.

Reply Agent: Hi [member name],
I hope you're having a good day! I'm sorry to hear that the link isn't working for you. I've generated a new return label, and you can access it using this link:
https://www.ups.com/uel/llp/[...]
To return your kit, you have a few options:
You can pack the Digital Therapist with all accessories securely in any box, including the hotspot (if provided). Be sure to remove all extra labels from the outside of the package.
If you choose the paperless barcode option, simply access the barcode from the link above, take the package to a UPS location, and have them scan the code from your smartphone.
If you use a printed label, just print it from the link above and attach it to any sturdy shipping box. Then, find your closest UPS location at ups.com/dropoff, select Drop Off, and take your package to the UPS Store, UPS Customer Center, UPS Alliances, or an Authorized Shipping Outlet near you.
Please reach out if you have any other questions or need further assistance. I'm here to help!
Best regards.

Finally, the Guardrails Agent approves the message generated, given a set of rules.

Guardrails Agent: Message accepted

The flow is entirely sequential. We want the system to respond quickly, but it doesn’t need to be lighting-fast. Much like a human CC, the process is: understand the member’s intent, choose the right resolution protocol, follow it step by step, and then reply. Running multiple agents in parallel might sound exciting, but in practice, it sacrifices the incremental context gained step after step. That context is crucial — and the resulting speed is still orders of magnitude superior to that of a human CC.

The knowledge base

Behind every resolved case lies our carefully maintained knowledge base — the single source of truth that both CCs and the AI rely upon. This knowledge base holds more than a thousand articles, ranging from detailed device troubleshooting guides to billing workflows and clinical support procedures involving therapists. For a human CC, it would be impossible to memorize all of this information, yet having it documented and searchable is what allows them to handle such a wide variety of cases with consistency.

The AI CC Agent taps into this very same resource. By sharing the knowledge base between humans and AI, we ensure that the way an issue is solved does not depend on who is handling it. In fact, there are whole processes around the knowledge base and its shared governance, such as AI friendly formatting guidelines, how to add specific instructions for humans, how to modify live articles, etc. Additionally, this ensures our care teams do not need to maintain multiple sources of truth: when a resolution flow changes, the knowledge base article is updated, and the new flow can be used by both the AI and Human CCs.

Example of knowledge base article

Member reports that once they click to start a session, it shows a ''No sessions available'' error.

Description

To be followed when the member clicks to start a session and the error “No sessions available” appears.

Troubleshooting steps

First step:

1.1: If you don’t have enough information to analyze the issue - Apply macro:

Hi {{ticket.requester.first_name}}, This is {{ticket.assignee.first_name}}, your Care Coordinator from Sword, I hope this email finds you well! Thank you for reaching out regarding this matter. In order to better understand what is happening, can you confirm if the error "No sessions available"/"Repeat today's session" is coming up? Or are you in need of new exercises? Looking forward to hearing from you.

1.2 - If you have sufficient info - Proceed to the next step.

Second step:

2.1: If you don’t have enough information to analyze the issue - Apply macro:

Hi {{ticket.requester.first_name}}, This is {{ticket.assignee.first_name}}, your Care Coordinator from Thrive, I hope this email finds you well! I understand you are having some issues with your tablet, I apologize for that. I have updated your tablet on my end, could you please fully restart it and try performing another session to see if the problem persists? In order to restart it, please press the power button for 1-2s until it opens the pop menu to "Power off" or "Restart". Then click Restart. Let me know if this works, I'll be waiting for updates.

2.2 - Wait for the member’s feedback before proceeding. If the issue persists - proceed to the next step.

In practice, this means that regardless of who is taking the lead, members receive consistent, reliable answers rooted in the same set of resolution protocols. The knowledge base becomes the foundation that enables scalability without sacrificing quality.

Finding the correct article

Accurately surfacing the right knowledge base article is one of the most critical challenges for the AI CC Agent. The system uses Retrieval-Augmented Generation (RAG), combining hybrid search with dense embeddings to quickly locate the most relevant pieces of information. This allows the AI to ground its reasoning in real, up-to-date knowledge rather than relying only on general language patterns. We initially relied on cosine similarity over embeddings, which captures intent-level matches well but occasionally misses the correct articles. To improve retrieval performance, we then implemented hybrid search, combining dense semantic retrieval with traditional keyword-based search and merging the results into a single ranked list. This approach increases both coverage and reliability by balancing semantic generalization with lexical precision — a key advantage given Sword’s specialized terminology and internal jargon.

Hybrid Search

A request for a UPS return label may semantically match articles about “returning a kit,” while keyword matching ensures terms like “UPS”, “label”, and “kit” are explicitly considered, helping the system surface the exact procedural article needed.

There’s another challenge, though: similarity search often retrieves documents describing the problem rather than the solution, because the supervisor searches for the member’s problem — since it does not know the solution until it finds the correct article.

For example: if a member reports their Sword kit is displaying a specific error message, the agent will search for documents about the error message. Now, if the resolution for this particular problem consists in replacing the kit, we cannot expect the agent to find a document titled “how to replace a kit”, but rather “kit displays error message”. So we need to make sure documents are framed from the problem’s perspective and contain the solution inside.

Thus we invested some time in ensuring all knowledge base document titles and descriptions are written from the member’s perspective to better align with the agent’s search queries. Additionally, we also worked on query optimization and prompt engineering. By carefully designing the supervisor’s prompts and including relevant context — such as Sword processes and internal terminology — we guided query generation toward producing better search inputs. Through iterative tuning, we arrived at an approach that allows the supervisor to more reliably surface the correct articles by generating better search queries and relying on general Sword context that might guide towards the solution.

We also tuned the number of articles returned to strike the right balance between precision and recall. Returning too few risks missing the correct article, while too many introduce noise and increase costs. Evaluation on an annotated dataset of conversations with labeled correct articles allows us to measure which approaches retrieve the right article and at what rank, providing clear insights about winning methodologies, as depicted below.

Figure 3. Example of different technical approaches compared to help determine the best strategy

Finally, we are also retrieving similar historical conversations. By indexing past successfully resolved cases and linking them to the articles actually used, the system can surface documents that have proven effective for analogous issues. Combining this case-based retrieval with hybrid search on the knowledge base further enhances both precision and recall, giving the agent a stronger foundation to find the correct article among thousands — a straw in a haystack. In particular, upgrading similarity search to hybrid search, adding a reranking step and combining results with similar historical tickets, has boosted the rate of correct protocol delegation from supervisor to specialized agent by 68%! Below is a diagram illustrating the approach:

Figure 4. Overview of Supervisor query being used to retrieve protocols from the Knowledge Base, as well as from similar conversations from the past

The tools

For the AI CC Agent to be truly effective, it needs more than reasoning and knowledge retrieval — it needs to act. That’s where the orchestration layer comes in. This layer sits between the AI agents and Sword’s backend systems, serving as the bridge that turns decisions into actions.

Instead of relying on hard-coded integrations, we expose backend capabilities through OpenAPI specifications. From these specifications, tools are dynamically generated and made available to the AI agent. This means the AI has a clear, structured way to interact with internal systems, whether it’s scheduling a session, updating a member’s information, or triggering a device replacement.

Below is a diagram depicting how the specialized agent executes a resolution protocol by leveraging all the necessary tools required by the protocol itself, as obtained directly from the knowledge base. Note how the tools are created dynamically, based on API docs, and passed to the agent right when it needs them, as shown in the figure below. This is very important, because LLM performance deteriorates as the size of the tool catalog increases, hence why we only pass the tools the specialized agent needs to follow a given resolution protocol — and no more.

Figure 5. Overview of Supervisor query being used to retrieve protocols from the Knowledge Base, as well as from similar conversations from the past

By treating the orchestration layer as a flexible connector, we ensure the system remains both extensible and resilient. New tools can be added in a plug-and-play fashion by simply updating the resolution protocols that use them, and existing tools naturally evolve as backend services change and their API specifications get updated. In short, this is what transforms the AI CC Agent from a merely knowledgeable assistant into a truly capable operator.

Langgraph, MCP and Langfuse

When building the AI CC Agent, we decided to rely on LangGraph and LangChain. These frameworks do introduce some overhead compared to a fully custom implementation, but the trade-off is worth it. They give us a common language for agents across the whole AI team, provide abstractions that evolve and facilitate workflows, and help enforce good design principles across the system. This way, we can move faster without reinventing the wheel.

As for Model Context Protocol (MCP), it wasn’t the right fit for our use case. MCP is designed for scenarios where we want to expose our services to AI agents, but our use case is internal and our architecture already solves that problem with the orchestration layer and dynamically generated tools: the agent only needs to interact with a single internal API containing an endpoint for each tool. Keeping things simpler avoids unnecessary complexity while still giving us all the flexibility and control we need.

We’re also using Langfuse to provide full observability over our multi-agent system. Langfuse allows us to trace the execution of each agent, inspect why the system generated a particular reply, monitor tool usage, latency, costs, and debug complex flows end-to-end. This visibility is critical for diagnosing issues, optimizing performance, and ensuring reliability, especially when multiple agents interact and exchange information when handling a support case.

Evaluating the agent and improving it

Building the AI CC Agent is only half the challenge. The other half is making sure it keeps getting better over time, while never breaking what already works. To do that, we run three complementary evaluation processes, each with a very specific purpose.

Development cycle

The first lens is focused on product improvement. Every time the agent escalates a ticket, it means there’s a limitation we need to learn from — in other words: something the agent cannot yet do. We audit a sample of these escalated tickets every week, classify them into an exhaustive set of root causes, and perform deep dives into the most relevant issues. This process allows us to identify where the agent is falling short, prioritize fixes, and feed concrete actions into the next iteration cycle.

Figure 6. Example of escalation causes

Regression and A/B testing

We make sure that progress never comes at the expense of reliability. Each time we release a major feature or improvement, the agent goes through regression testing to ensure it hasn’t lost the ability to solve the problems it had already mastered. The principle is simple: we rerun the agent on historical tickets and check that resolution is correctly achieved, as before. This safeguard protects both the member experience and overall business performance, even as the system continues to evolve week after week. Along the way, we had to solve some problems, such as the need to replay and simulate historical tool calls in test environments.

When we have enough confidence that a change will not introduce regressions, we occasionally perform A/B tests to quantify its marginal impact in the field and determine whether we should ship it or not.

Business metrics

The final lens is purely from a business standpoint: how fast is AI compared to human CCs in resolving tickets? How satisfied are our members with the responses? How many tickets can the agent handle correctly from beginning to end? To answer this, we collect data and run systematic audits of closed tickets, performed by our CCs. These audits feed into weekly metrics that give us a clear, objective view of how much value the agent is delivering in production. After pausing for a while to build the agent’s architecture, our focus has now shifted toward expanding its capabilities. We’re adding new tools on a weekly basis, and it’s been incredibly rewarding to watch the business performance metrics climb week after week. Each new integration not only increases the agent’s autonomy but also reduces the load on our CCs!

In the plots below, the trend is unmistakable: week after week, our agent has been solving more and more tickets, completely autonomously. This improvement is clearly reflected in the time-to-first-reply plot: thanks to the agent, in under a year we’ve reached a point where the vast majority of tickets receive a response in under five minutes. And we’re achieving this without compromising member satisfaction, as shown in the final plot. These metrics confirm that our work is creating real, meaningful impact, and that every single feature we ship each week is helping our members access better healthcare.

Figure 7. % all tickets correctly handled by the AI CC Agent, end-to-end

Figure 8. Evolution of time to 1st reply in all tickets

Figure 9. Customer Satisfaction Score (CSAT) measured on a 1–5 scale, where 1 = Very Dissatisfied and 5 = Very Satisfied.

Some thoughts for the future

Looking ahead, there are some major areas where we see big opportunities for improvement and we find worth mentioning.

First, retrieval. The agent is only as good as its ability to find the right protocol in our knowledge base. We are currently experimenting with navigating a decision-tree versus performing a similarity search on a flat set of documents. Results so far seem promising:

We transfer some of the reasoning burdens out of the model and into knowledge base by introducing structure. In other words: the agent only needs to follow instructions, step by step, with progressive context disclosure, instead of having to search and reason about the member and knowledge base documents.
We can investigate retrieval issues by tracking where in the decision-tree the agent is systematically taking the wrong turn and fix the context issue at the source. In fact, with this approach we can empower the team responsible for the knowledge base to diagnose and fix retrieval issues, instead of relying on the engineers to do it.

Second, multi-question handling. Today’s architecture is built around a single question at a time: the supervisor searches for a protocol, chooses one, hands it off to a specialized agent, and the flow proceeds downstream until a reply is generated. If a member asks two or three questions in the same message, the extras are left unanswered. Solving this will require a new layer in the architecture — something like a superorchestrating agent that can extract multiple intents, break them down, and delegate each one to a supervisor agent. With that in place, the rest of the pipeline can run as usual, but now with true multi-intent support.

All these areas are ambitious, but they’re exactly the kind of challenges that make building this agent exciting.

Conclusion

Building the AI CC Agent has been a journey of combining human expertise, AI technology, and thoughtful engineering. By leveraging a multi-agent architecture, a rich shared knowledge base, and an increasing set of tools, we’ve created a system that not only scales with Sword’s growth but also maintains the consistency and quality that members expect — plus immediate responses!

Looking ahead, enhancements in retrieval, multi-intent handling, and improving automated evals promise to make the system even more capable and responsive.

Ultimately, the AI CC Agent is not just a tool — it’s a force multiplier for our human Care Coordinators, empowering them to focus on the complex, high-touch cases where their empathy and expertise truly matter, while providing members with faster, more reliable support every day.

Acknowledgements

Written by Paulo Cardoso, in collaboration with Clara Matos, and based on the feature developed by Paulo Cardoso, Miguel Sabino, Catarina Morgado, Sebastião Caldas, José Nave, Rita Catarino, Ana Costa, Renata Rocha, Adriana Ramos, Clara Matos, Ivo Gabriel and Luís Ungaro.

This work reflects the collective effort of multiple teams across Sword who make the AI Care Coordinator possible. Special thanks to our Care Coordinators, whose ongoing feedback continues to shape and improve the agent, as well as Member Ops, AI Engineering, Sword Desk and Infrastructure, without whom the AI CC agent would not have been possible.

Share this Article