Testing and optimization

Quality workflow

Reliable agents are built through repeatable tests, not one good demo call. Cover happy paths, vague callers, missing information, tool timing, routing, and fallback behavior.

Keep changes only when they improve the test set without breaking another scenario.

Testing is how you turn a promising agent into a reliable production workflow. Do not rely on one successful demo conversation; test the situations that customers actually create.

What to test

For every agent, test:

  • common questions,
  • vague questions,
  • missing information,
  • wrong assumptions,
  • tool usage,
  • tool failure,
  • knowledge-base answers,
  • escalation or forwarding,
  • refusal and boundary cases.
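The categories above can be kept as a small, explicit registry so gaps are visible at a glance. A minimal sketch, assuming nothing about your agent platform; the `Scenario` dataclass, category names, and example expectations are all illustrative:

```python
# Illustrative scenario registry: each entry records what is tested and what
# a passing run should look like. Names and categories are assumptions.
from dataclasses import dataclass


@dataclass
class Scenario:
    name: str
    category: str          # e.g. "common", "vague", "tool_failure"
    user_turns: list[str]  # scripted caller messages
    expectation: str       # what a passing run looks like


SCENARIOS = [
    Scenario("opening hours", "common",
             ["What time do you open?"],
             "answers from the knowledge base"),
    Scenario("vague request", "vague",
             ["I need help with my thing."],
             "asks a clarifying question before acting"),
    Scenario("tool outage", "tool_failure",
             ["Book me for tomorrow at 9."],
             "apologizes and offers a fallback when the booking tool errors"),
]


def coverage(scenarios):
    """Return the set of tested categories, to spot gaps in the checklist."""
    return {s.category for s in scenarios}


print(sorted(coverage(SCENARIOS)))
```

Comparing `coverage(SCENARIOS)` against the full checklist above shows immediately which categories (here: missing information, wrong assumptions, escalation, refusals) still lack a test case.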

Prompt test cases

Prompt tests are reusable conversations or scenarios that help you compare behavior after changes.

Useful test cases include:

Scenario: why it matters

  • Happy path: confirms the main workflow still works.
  • Ambiguous caller: checks clarification behavior.
  • Missing details: checks data collection.
  • Out-of-scope request: checks boundaries.
  • Tool required: checks that the tool is triggered with the right parameters.
  • Tool not allowed: checks that the agent does not act too early.
  • Human handover: checks escalation behavior.
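The "Tool required" and "Tool not allowed" cases can be replayed mechanically. A hedged sketch of such a harness; `fake_agent` is a stand-in for your real agent endpoint, and the tool name is invented for illustration:

```python
# Replay harness sketch: run one scripted caller message against an agent
# callable and record whether a tool was invoked. All names are assumptions.
def fake_agent(message, tools):
    # Stand-in logic: only booking requests should trigger the booking tool.
    if "book" in message.lower():
        tools["book_appointment"](message)
        return "Booked."
    return "How can I help?"


def run_scenario(agent, message, expect_tool_call):
    calls = []
    tools = {"book_appointment": lambda m: calls.append(m)}
    reply = agent(message, tools)
    tool_called = len(calls) > 0
    return {"reply": reply, "tool_called": tool_called,
            "passed": tool_called == expect_tool_call}


# "Tool required": a booking request must trigger the tool.
print(run_scenario(fake_agent, "Please book a table for two.", True)["passed"])
# "Tool not allowed": a plain question must not trigger the tool.
print(run_scenario(fake_agent, "What are your opening hours?", False)["passed"])
```

The same structure extends to the other rows: swap the expectation from "tool called" to "asked a clarifying question" or "offered a handover".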

Optimization workflow

Use a structured workflow:

  1. Identify one problem.
  2. Create or update a test case for it.
  3. Change the prompt, knowledge, parameter, or tool instruction.
  4. Run the same tests again.
  5. Keep the change only if it improves the result without breaking another scenario.
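Step 5 can be made mechanical. A minimal sketch of the accept/reject rule, assuming each variant's results are recorded as a scenario-name-to-pass/fail map; the scenario names are illustrative:

```python
# Accept a change only if it fixes at least one failing scenario
# and does not regress any previously passing one.
def accept_change(baseline: dict, candidate: dict) -> bool:
    improved = any(candidate[k] and not baseline[k] for k in baseline)
    regressed = any(baseline[k] and not candidate[k] for k in baseline)
    return improved and not regressed


baseline = {"happy_path": True, "vague_caller": False, "handover": True}
candidate = {"happy_path": True, "vague_caller": True, "handover": True}
print(accept_change(baseline, candidate))  # fixed vague_caller, no regressions
```

Note that a change which both fixes one scenario and breaks another is rejected; under this rule it should be split into smaller changes and re-tested.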

Comparing variants

When comparing a baseline with an improved variant, look for:

  • correctness,
  • instruction following,
  • concise answers,
  • appropriate tool usage,
  • safe fallback behavior,
  • better caller experience.

Do not choose a variant only because it sounds more polished. It must also be more reliable.
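One way to keep "polished" from outweighing "reliable" is a weighted scorecard over the criteria above. A sketch under stated assumptions: the weights, the 0–2 rating scale, and the example ratings are all invented for illustration:

```python
# Illustrative scorecard: reliability criteria weigh more than polish.
WEIGHTS = {
    "correctness": 3, "instruction_following": 3, "tool_usage": 2,
    "fallback": 2, "conciseness": 1, "caller_experience": 1,
}


def score(ratings: dict) -> int:
    # ratings: criterion -> 0 (fail), 1 (partial), 2 (good)
    return sum(WEIGHTS[c] * r for c, r in ratings.items())


baseline = {"correctness": 2, "instruction_following": 1, "tool_usage": 2,
            "fallback": 1, "conciseness": 2, "caller_experience": 2}
variant = {"correctness": 2, "instruction_following": 2, "tool_usage": 2,
           "fallback": 2, "conciseness": 1, "caller_experience": 2}

# The variant is slightly less concise but follows instructions and falls
# back more reliably, so it still wins under these weights.
print(score(variant) > score(baseline))
```

Tune the weights to your own priorities; the point is that the decision is recorded and repeatable rather than based on which transcript reads better.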

Voice QA

For voice agents, also test the live experience:

  • the greeting cannot be interrupted when interruption is disabled,
  • caller interruptions are handled naturally,
  • the agent asks one question at a time,
  • hold messages played during forwarding are clear,
  • speech speed and accent fit the audience,
  • long answers are not exhausting to listen to.

Release checklist

  • Prompt tests cover the main workflows.
  • Knowledge-base questions were tested.
  • Tools were tested for success and failure.
  • Routing was tested for default and override paths.
  • Voice behavior was tested with real calls when applicable.
  • A rollback plan exists for major changes.