This article was originally published on QBox’s blog, prior to Cyara’s acquisition of QBox. Learn more about Cyara + QBox.
You’ve spent endless weeks, if not months, tirelessly building, training, and testing your chatbot, and it’s finally ready to be launched into the real world.
But then after assessing its real-world performance, you’re a little deflated as it’s not quite hitting the mark. We’ve all been there, and it’s really frustrating when you consider the number of hours spent on perfecting your chatbot.
And, of course, it leaves your customers very frustrated too.
Cyara’s conversational AI optimization solution allows you to test, train, and monitor your chatbots to assure CX quality at scale.
From a chatbot builder point of view, you need to understand what influence each utterance has (and even what influence each word in each utterance has) on your model’s performance.
Most of the popular NLP providers use a variety of different models and algorithms that are virtually impossible for chatbot builders to fathom.
And it’s also very difficult and time-consuming to do a deep dive of your training data and analyze its learning value.
But you really do need to do a thorough analysis if you want to make your chatbot smarter.
Here are three techniques: two to measure your bot’s performance and one for visualizing the results of its performance.
1. Cross-Validation Testing
This involves preparing a separate labelled dataset of utterances that your model hasn’t been trained on (typically real user questions), and then running it against your chatbot to assess your model’s performance and spot any gaps in your bot’s knowledge. Cross-validation testing has its advantages and disadvantages, though:
Advantage
- If you have a large file from many user interactions over many months, you are likely to have a great dataset that represents all the intents/subject areas you wish to cover.
Disadvantages
- If you are at the early stage of your chatbot model, you may still be in the process of creating new intents and splitting or merging some existing intents. Each time this happens, you’ll have to update your cross-validation file, and this can be time-consuming.
- Keeping this dataset out of the training data can be a challenge. Naturally, if you know the dataset, you’ll check the test results, and where a test fails you may be tempted to train your model on that data, potentially overfitting it to the cross-validation dataset.
- It will only be as good as the data in it.
- Ideally, cross-validation testing will give meaningful and accurate results, but there’s no way of accurately predicting what your bot will encounter in the future, and you will constantly have to update your cross-validation dataset. You’ll also have to keep testing on your cross-validation dataset to monitor for any regression.
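To make this concrete, here is a minimal Python sketch of a cross-validation test run. It assumes a labelled CSV of real user questions and a `predict_intent()` callable that queries your NLP provider and returns the top intent; both the CSV layout and the function are hypothetical placeholders you would swap for your own export format and API call.

```python
import csv

def evaluate_cross_validation_set(test_file, predict_intent):
    """Run a labelled test set against the bot and report simple accuracy.

    test_file: CSV with columns 'utterance' and 'expected_intent'
               (hypothetical layout - adapt to your own export format).
    predict_intent: callable that sends an utterance to your NLP provider
                    and returns the predicted intent name.
    """
    with open(test_file, newline="", encoding="utf-8") as f:
        rows = list(csv.DictReader(f))

    correct, failures = 0, []
    for row in rows:
        predicted = predict_intent(row["utterance"])
        if predicted == row["expected_intent"]:
            correct += 1
        else:
            failures.append((row["utterance"], row["expected_intent"], predicted))

    accuracy = correct / len(rows) if rows else 0.0
    print(f"Accuracy: {accuracy:.2%} ({correct}/{len(rows)})")
    for utterance, expected, predicted in failures:
        print(f"MISS: '{utterance}' expected={expected} got={predicted}")
```

Keeping the failures list is the useful part in practice: it tells you which intents are missing coverage, not just the headline accuracy number.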
2. K-Fold Cross-Validation Testing
K-fold cross-validation testing solves some of the issues mentioned above.
It generates test data from your training data automatically, by temporarily removing some utterances from their intents and using them as test data.
It evaluates your training data by dividing it into a number of sub-samples (or folds), then using one fold at a time as the test dataset while the model is trained on the remaining folds.
For example, you might divide your training data into 10 equal folds (you could use more or fewer folds, but 10 is commonly used), and then perform 10 separate tests, each time holding back one fold and testing a model trained on the other nine against the data in the held-back fold.
This means that all training data will become test data at one point.
This technique helps you to see weaknesses in your data.
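As a rough illustration of the fold mechanics, here is a minimal sketch using scikit-learn’s StratifiedKFold. The `train_model` and `predict` callables are hypothetical placeholders for whatever training and query calls your NLP provider exposes, and the sketch assumes every intent has at least as many utterances as there are folds.

```python
from sklearn.model_selection import StratifiedKFold

def k_fold_evaluate(utterances, intents, train_model, predict, k=10):
    """Hold back each fold once as test data, training on the rest.

    utterances, intents: parallel lists of training examples and their labels.
    train_model(utts, labels) -> model   (placeholder for your provider's training call)
    predict(model, utterance) -> intent  (placeholder for your provider's query call)
    """
    # Stratified folding keeps the intent distribution similar in every fold.
    skf = StratifiedKFold(n_splits=k, shuffle=True, random_state=42)
    scores = []

    for fold, (train_idx, test_idx) in enumerate(skf.split(utterances, intents), 1):
        model = train_model([utterances[i] for i in train_idx],
                            [intents[i] for i in train_idx])
        hits = sum(predict(model, utterances[i]) == intents[i] for i in test_idx)
        score = hits / len(test_idx)
        scores.append(score)
        print(f"Fold {fold}: {score:.2%}")

    print(f"Mean accuracy over {k} folds: {sum(scores) / k:.2%}")
```

Because every utterance is tested exactly once, the per-fold scores also show you whether your model’s performance is stable or swings wildly depending on which data it was trained on.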
Again, it has its advantages and disadvantages.
Advantages
- If you don’t have test data, and you are in the early stage of model building, it generates the test data itself.
- K-fold is a known technique.
Disadvantages
- It’s time-consuming.
- K-fold doesn’t work well with low levels of training data (fewer than 200 samples per class or intent).
- Each change to the training data reshuffles which fold an utterance ends up in, and this randomization of the folding generates variation in your test results. That makes it difficult to understand the learning value of your latest changes.
3. Visualizing Your Test Results Through a Confusion Matrix
This technique allows you to visualize the performance of the intent predictions in the form of a table.
To build a confusion matrix, you’d use a test validation dataset. Each piece of data in your dataset needs a predicted outcome (the intent that the data should return) and an actual outcome (the intent that the data actually returns in your model).
From the predicted and actual outcomes, you will get a count of the number of correct and incorrect predictions.
These numbers are then organized into a table, or matrix, where each row represents an actual (expected) intent and each column represents a predicted intent. Cells on the diagonal count correct predictions, while every off-diagonal cell counts utterances that were confused between two intents. For any single intent, this gives you its true positives (the diagonal cell), false negatives (the rest of its row), and false positives (the rest of its column).
IMPORTANT: You’ll want to think about how you categorize the right and wrong predictions that fall below your chatbot’s confidence threshold.
These values can then be used to calculate further classification metrics such as precision, recall, and F1 score.
Precision highlights any false-positive problems you may have in your model, and recall highlights any false-negative problems. The F1 score combines the two into an overall measure of a model’s accuracy.
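If you already have the predicted and actual intents side by side, libraries such as scikit-learn can build the matrix and compute these metrics for you. Here is a minimal sketch; the intent names and label lists are invented example data, not taken from any real model.

```python
from sklearn.metrics import confusion_matrix, classification_report

# actual: the intent each test utterance *should* return
# predicted: the intent the model actually returned
actual    = ["greeting", "greeting", "order_status", "refund", "refund", "order_status"]
predicted = ["greeting", "refund",   "order_status", "refund", "greeting", "order_status"]

labels = sorted(set(actual))

# Rows = actual intent, columns = predicted intent; the diagonal holds
# correct predictions, everything off-diagonal is a confusion.
print(confusion_matrix(actual, predicted, labels=labels))

# Per-intent precision, recall, and F1, plus overall averages.
print(classification_report(actual, predicted, labels=labels))
```

Reading the report per intent is usually more revealing than the overall averages, because one weak intent can hide behind an otherwise healthy model-level score.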
The confusion matrix technique has its own advantages and disadvantages:
Advantage
- It provides a clear visual summary of your bot’s performance.
Disadvantages
- It is a very time-consuming task.
- Calculations for these additional metrics are quite complex.
- It can be quite challenging to interpret unless you’re familiar with the statistics involved.
- It won’t necessarily help you see why an utterance isn’t working.
In summary, the cross-validation, K-fold, and confusion matrix methods for diagnosing and improving a chatbot are very time-consuming, and difficult to understand if you’re not a statistician.
Also, you’re likely to find a lot of issues that need fixing, and you’ll probably want to fix them all at once — but this in turn will generate challenges as you try to understand which changes worked/helped your model and which didn’t.
You also need to think about the possible further regression (the ripple effect of changing data in one part of the model modifying the performance of the rest of the model) of your chatbot, and it’s very difficult to identify and unpick these newly created problems.
Modern tools are arriving on the market that can analyze and benchmark your chatbot training data, and give insight by visualizing where your chatbot does and doesn’t perform, and why.
You can see your chatbot’s performance at model level, intent level, and utterance level, even down to word-by-word analysis, in clear and easy-to-understand visuals.
Contact us to find out how you can use our solutions to accelerate chatbot development and assure quality at scale.