
April 26, 2022

Why Isn’t My Chatbot Working?

Alison Houston, Data Model Analyst

This article was originally published on QBox’s blog, prior to Cyara’s acquisition of QBox. Learn more about Cyara + QBox.


You’ve spent endless weeks, if not months, tirelessly building, training, and testing your chatbot, and it’s finally ready to be launched into the real world.  

But then after assessing its real-world performance, you’re a little deflated as it’s not quite hitting the mark. We’ve all been there, and it’s really frustrating when you consider the number of hours spent on perfecting your chatbot.  

And, of course, it leaves your customers very frustrated too.

Cyara’s conversational AI optimization solutions allow you to test, train, and monitor your chatbots to assure CX quality at scale.


From a chatbot builder point of view, you need to understand what influence each utterance has (and even what influence each word in each utterance has) on your model’s performance.  

Most of the popular NLP providers use a variety of different models and algorithms that are virtually impossible for chatbot builders to fathom. 

And it’s also very difficult and time-consuming to do a deep dive into your training data and analyze its learning value. 

But you really do need to do a thorough analysis if you want to make your chatbot smarter.

Here are three techniques: two to measure your bot’s performance and one for visualizing the results of its performance.

1. Cross-Validation Testing

This involves preparing a separate labelled dataset of utterances that your model hasn’t been trained on (this data would typically be real user questions), and then testing it against your chatbot to assess your model performance and see if there are any gaps in your bot’s knowledge. Cross-validation testing has its advantages and disadvantages though:

Advantage

  • If you have a large file from many user interactions over many months, you are likely to have a great dataset that represents all the intents/subject areas you wish to cover.

Disadvantages

  • If you are at the early stage of your chatbot model, you may still be in the process of creating new intents and splitting or merging some existing intents. Each time this happens, you’ll have to update your cross-validation file, and this can be time-consuming.
  • Keeping this dataset out of the training data can be a challenge. Naturally, if you know the dataset, you’ll check the test results, and where a test fails, you may be tempted to train your model on the failures, with the risk of overfitting it to the cross-validation dataset.
  • It will only be as good as the data in it.
  • Ideally, cross-validation testing will give meaningful and accurate results, but there’s no way of accurately predicting what your bot will encounter in the future, and you will constantly have to update your cross-validation dataset. You’ll also have to keep testing on your cross-validation dataset to monitor for any regression.
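As an illustrative sketch, a cross-validation test can be as simple as replaying a held-out labelled dataset through your bot’s prediction endpoint and counting per-intent misses. The `predict_intent` function and the sample utterances below are hypothetical placeholders, not a real provider API:

```python
from collections import Counter

# Hypothetical stand-in for your NLP provider's prediction call.
def predict_intent(utterance: str) -> str:
    # A keyword match as a placeholder for a real API call to your bot.
    return "billing" if "invoice" in utterance.lower() else "other"

# Held-out, labelled utterances the model was NOT trained on.
validation_set = [
    ("Where can I download my invoice?", "billing"),
    ("My invoice total looks wrong", "billing"),
    ("I was charged twice this month", "billing"),
    ("What time do you open?", "other"),
]

correct = 0
errors_by_intent = Counter()
for utterance, expected in validation_set:
    predicted = predict_intent(utterance)
    if predicted == expected:
        correct += 1
    else:
        errors_by_intent[expected] += 1  # a gap in this intent's coverage

accuracy = correct / len(validation_set)
print(f"accuracy: {accuracy:.0%}")                      # accuracy: 75%
print("intents with misses:", dict(errors_by_intent))   # {'billing': 1}
```

The per-intent error count is the useful part: it points you at the intents whose training data needs attention, not just an overall score.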

2. K-Fold Cross-Validation Testing

K-fold cross-validation testing solves some of the issues mentioned above. 

It generates test data from your training data automatically, by removing some utterances from their intents and holding them back as test data. 

It then evaluates your training data by dividing it into a number of sub-samples (or folds), using one fold at a time as the test dataset for a model trained on the remaining folds. 

For example, you might divide your training data into 10 equal folds (you could use more or fewer, but 10 is common), and then perform 10 separate tests, each time holding back one fold as test data and training your model on the other nine. 

This means that all training data will become test data at one point. 

This technique helps you to see weaknesses in your data. 


Again, it has its advantages and disadvantages.

Advantages

  • If you don’t have test data, and you are in the early stage of model building, it generates the test data itself.
  • K-fold is a known technique.

Disadvantages

  • It’s time-consuming.
  • K-fold doesn’t work well with low levels of training data (fewer than 200 samples per class or intent).
  • Each change to the training data reshuffles which fold each utterance lands in, so the randomization of the folding introduces variation in your test results. This makes it difficult to understand the learning value of your new changes.
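The fold mechanics themselves are straightforward. This sketch uses toy data and stubs out the training and evaluation calls (`train_bot` and `evaluate` are hypothetical), but it shows how every utterance ends up in exactly one test fold:

```python
import random

# Toy labelled training data (utterance, intent); an assumption for illustration.
training_data = [(f"utterance {i}", "intent_a" if i % 2 else "intent_b")
                 for i in range(20)]

K = 5
random.seed(0)                 # fix the shuffle so runs are reproducible
random.shuffle(training_data)

# Split into K roughly equal folds.
folds = [training_data[i::K] for i in range(K)]

for held_out, test_fold in enumerate(folds):
    train_split = [ex for i, fold in enumerate(folds) if i != held_out
                   for ex in fold]
    # model = train_bot(train_split)   # hypothetical training call
    # evaluate(model, test_fold)       # hypothetical evaluation call
    print(f"fold {held_out}: train={len(train_split)} test={len(test_fold)}")
```

Note the fixed random seed: without it, every re-shuffle changes fold membership, which is exactly the source of the test-result variation described above.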

3. Visualizing Your Test Results Through a Confusion Matrix

This technique allows you to visualize the performance of the intent predictions in the form of a table. 

To build a confusion matrix, you’d use a test validation dataset. Each piece of data in your dataset needs a predicted outcome (the intent that the data should return) and an actual outcome (the intent that the data actually returns in your model).

From the predicted and actual outcomes, you will get a count of the number of correct and incorrect predictions.

These numbers are then organized into a table, or matrix where:


  • True positive (TP) for correctly predicted event values
  • False positive (FP) for incorrectly predicted event values
  • True negative (TN) for correctly predicted no-event values
  • False negative (FN) for incorrectly predicted no-event values

IMPORTANT: You’ll want to think about how you categorize the right and wrong predictions that fall below your chatbot’s confidence threshold.

These values can then be used to calculate further classification metrics such as precision, recall, and F1 score. 

Precision highlights any false-positive problems you may have in your model and recall highlights any false-negative problems. The F1 score is an overall measure of a model’s accuracy. 
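Computing these metrics for a single intent is a few lines once you have the predicted and actual outcomes. The intent labels below are toy data for illustration, scoring one intent in a one-vs-rest view:

```python
# Toy actual vs. predicted intents; an assumption for illustration.
actual    = ["billing", "billing", "other", "other",   "billing", "other"]
predicted = ["billing", "other",   "other", "billing", "billing", "other"]

TARGET = "billing"  # score one intent in a one-vs-rest view
tp = sum(a == TARGET and p == TARGET for a, p in zip(actual, predicted))
fp = sum(a != TARGET and p == TARGET for a, p in zip(actual, predicted))
fn = sum(a == TARGET and p != TARGET for a, p in zip(actual, predicted))
tn = sum(a != TARGET and p != TARGET for a, p in zip(actual, predicted))

precision = tp / (tp + fp)  # penalized by false positives
recall    = tp / (tp + fn)  # penalized by false negatives
f1 = 2 * precision * recall / (precision + recall)
print(f"TP={tp} FP={fp} TN={tn} FN={fn}")
print(f"precision={precision:.2f} recall={recall:.2f} F1={f1:.2f}")
```

Repeating this per intent gives you the per-row view of a full confusion matrix, with the F1 score balancing the two error types in a single number.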


The confusion matrix technique has its own advantages and disadvantages:

Advantage

  • It provides a good visual summary of your bot’s performance.

Disadvantages

  • It is a very time-consuming task.
  • Calculations for these additional metrics are quite complex.
  • It can be quite challenging to interpret unless you’re familiar with the statistics involved.
  • It won’t necessarily help you see why an utterance isn’t working.

In summary, the cross-validation, K-fold, and confusion matrix methods for diagnosing and improving a chatbot are very time-consuming, and difficult to understand if you’re not a statistician. 

Also, you’re likely to find a lot of issues that need fixing, and you’ll probably want to fix them all at once — but this in turn will generate challenges as you try to understand which changes worked/helped your model and which didn’t. 

You also need to think about the possible further regression (the ripple effect of changing data in one part of the model modifying the performance of the rest of the model) of your chatbot, and it’s very difficult to identify and unpick these newly created problems.

Modern tools are arriving on the market that can analyze and benchmark your chatbot training data, giving insight by visualizing where your chatbot does and doesn’t perform, and why. 

You can see your chatbot’s performance at the model, intent, and utterance level, down to word-by-word analysis, in clear and easy-to-understand visuals.

Contact us to find out how you can use our solutions to accelerate chatbot development and assure quality at scale.

