This article was originally published on QBox’s blog, prior to Cyara’s acquisition of QBox.
Once your chatbot has been built and trained, and particularly if you have used advanced tooling, you may not yet have needed to test your chatbot model with data outside of its training set. But there will come a time when you need to simulate real-world interactions to get a more accurate measure of how well your chatbot might perform once it’s live. This is called cross-validation testing.
This test data could consist of:
- A set of real user utterances that would have been set aside before the chatbot was built;
- A set of in-house utterances devised before or during the chatbot build;
- A collection of real user utterances once the chatbot has been launched.
Incidentally, a word of warning for those of you who devise your cross-validation dataset in-house: to keep bias out of the cross-validation data, we recommend that this dataset is not created by anyone directly associated with the chatbot build, because they will tend to phrase utterances the way the training data already does. A top tip is to get other colleagues involved (or family and friends!). Simply give them a brief explanation of each intent (but not too much detail) and ask them to list as many different ways as they can think of to ask about each one.
The cross-validation data is then run against your chatbot to evaluate its performance. It will help to identify any blind spots in your training data: perhaps new concepts (key words or phrases) that have been missed, or new ways of expressing the existing concepts within the intents. It can also reveal whether your chatbot is overfitting, meaning the model is so fine-tuned to its existing training data that its performance on new data suffers.
Whichever way the cross-validation data is created, it’s vital that the data covers every intent in your chatbot model, to ensure all intents are thoroughly tested.
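As a rough illustration of running cross-validation data against a chatbot, the sketch below scores each intent separately and lists any intents that have no test utterances at all. It assumes the chatbot can be called as a function that takes an utterance and returns an intent name; `toy_predict`, the intent names and the utterances are all hypothetical stand-ins, not part of any particular platform.

```python
from collections import defaultdict


def per_intent_accuracy(examples, predict, all_intents):
    """Score cross-validation examples per intent.

    examples: list of (utterance, expected_intent) pairs.
    predict: hypothetical callable mapping an utterance to an intent name.
    all_intents: every intent defined in the chatbot model.
    """
    totals = defaultdict(int)
    correct = defaultdict(int)
    for utterance, expected in examples:
        totals[expected] += 1
        if predict(utterance) == expected:
            correct[expected] += 1
    accuracy = {intent: correct[intent] / totals[intent] for intent in totals}
    # Intents with no cross-validation utterances are not being tested at all.
    untested = sorted(set(all_intents) - set(totals))
    return accuracy, untested


def toy_predict(utterance):
    """Stand-in for a real chatbot call (assumption: keyword matching only)."""
    text = utterance.lower()
    if "balance" in text:
        return "check_balance"
    if "card" in text:
        return "lost_card"
    return "fallback"


examples = [
    ("What's my balance?", "check_balance"),
    ("How much money do I have?", "check_balance"),
    ("I lost my card", "lost_card"),
]
accuracy, untested = per_intent_accuracy(
    examples, toy_predict, ["check_balance", "lost_card", "open_account"]
)
print(accuracy)  # {'check_balance': 0.5, 'lost_card': 1.0}
print(untested)  # ['open_account']
```

Here the second utterance exposes a blind spot (no "balance" keyword), and `open_account` shows up as an intent with no cross-validation coverage.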
But How Much Data is Needed?
We would recommend aiming for at least as many cross-validation utterances as training utterances in each intent. For example, if you have an intent with 30 training utterances, you should have at least 30 cross-validation utterances for that intent. For your short-tail intents (the intents you anticipate being returned the most frequently), or the more complex intents, try to increase the number of cross-validation utterances to 2 or even 3 times the amount of training data, or even more. The more the better! But this probably won’t be an overnight process. The dataset should be expanded over time, with utterances collected in conjunction with audits and reports from your live user logs. When collecting utterances from your live user logs, always try to pick a selection that features very diverse language, while still being valid in its subject matter, to ensure your chatbot is tested to its limits.
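The sizing guideline above can be checked mechanically. The following is a minimal sketch, assuming both datasets are available as lists of (utterance, intent) pairs; the function name `coverage_gaps`, the `min_ratio` parameter and the sample data are hypothetical.

```python
import math
from collections import Counter


def coverage_gaps(training, cross_validation, min_ratio=1.0):
    """Return how many more cross-validation utterances each intent needs.

    training, cross_validation: lists of (utterance, intent) pairs.
    min_ratio: required cross-validation utterances per training utterance
    (1.0 for the baseline, 2.0 or 3.0 for frequent or complex intents).
    """
    train_counts = Counter(intent for _, intent in training)
    cv_counts = Counter(intent for _, intent in cross_validation)
    gaps = {}
    for intent, n_train in train_counts.items():
        needed = math.ceil(n_train * min_ratio)
        have = cv_counts[intent]  # a Counter returns 0 for missing intents
        if have < needed:
            gaps[intent] = needed - have
    return gaps


# Hypothetical data: 30 and 10 training utterances, 12 and 10 for testing.
training = [(f"train {i}", "check_balance") for i in range(30)]
training += [(f"train {i}", "lost_card") for i in range(10)]
cross_validation = [(f"cv {i}", "check_balance") for i in range(12)]
cross_validation += [(f"cv {i}", "lost_card") for i in range(10)]

baseline_gaps = coverage_gaps(training, cross_validation)
stricter_gaps = coverage_gaps(training, cross_validation, min_ratio=2.0)
print(baseline_gaps)  # {'check_balance': 18}
print(stricter_gaps)  # {'check_balance': 48, 'lost_card': 10}
```

Running a check like this whenever new utterances are harvested from live logs shows which intents still fall short of the target.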
In addition to evaluating chatbot performance, cross-validation testing has other uses too. A key one is to identify regressions when you make major changes to your chatbot. For example, you might want to scale up the chatbot at some point. Once you’ve added lots of new intents, you’ll need to make sure cross-validation utterances that were returning the correct intents before are still performing just as well after the updates. So, it’s recommended you test your model with the same cross-validation data before and after making such updates. In fact, you should get into the habit of regular cross-validation testing, even if you’re just making minor tweaks in your model to improve performance. This will help to give you peace of mind that any changes you’re making won’t be detrimental to the rest of the model.
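A before-and-after comparison of this kind might look like the sketch below. It assumes you can query both versions of the model with the same utterances; the two dictionaries stand in for real model calls, and every name and intent here is hypothetical.

```python
def find_regressions(examples, predict_before, predict_after):
    """List utterances that were correct before an update but are wrong after.

    examples: list of (utterance, expected_intent) pairs.
    predict_before, predict_after: hypothetical callables mapping an
    utterance to the intent returned by the old and new model versions.
    """
    regressions = []
    for utterance, expected in examples:
        before = predict_before(utterance)
        after = predict_after(utterance)
        if before == expected and after != expected:
            regressions.append((utterance, expected, after))
    return regressions


# Stand-ins for the two model versions (assumption: lookups, not real models).
old_model = {
    "What's my balance?": "check_balance",
    "I lost my card": "lost_card",
}
new_model = {
    "What's my balance?": "check_balance",
    "I lost my card": "freeze_card",  # a newly added intent now wins
}
examples = [
    ("What's my balance?", "check_balance"),
    ("I lost my card", "lost_card"),
]

regressions = find_regressions(examples, old_model.get, new_model.get)
print(regressions)  # [('I lost my card', 'lost_card', 'freeze_card')]
```

Reporting the intent that is now returned, not just the failure, makes it easier to see which new intent is competing with an existing one.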
Another key use of cross-validation testing is to help determine a suitable confidence threshold for your chatbot. This would involve producing an ROC curve by plotting the results of the cross-validation test on a graph at various confidence thresholds. The area under that curve (AUC) summarizes performance across all thresholds in a single number. You can then determine the optimum confidence threshold for your particular needs. For example, if you want a very accurate chatbot you’ll probably want to increase the confidence threshold to minimize the risk of giving incorrect answers to your customers. And from the ROC curve you’ll be able to understand the trade-off of having that higher threshold, namely that more correct answers will also be withheld. This is a very short explanation of the ROC curve, and you can read more here.
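To make the threshold trade-off concrete, here is a small sketch that computes ROC points from cross-validation results. It rests on one particular framing, which is an assumption rather than the only option: each result is the confidence of the top-ranked intent plus a flag saying whether that intent was correct, and the chatbot only answers when the confidence meets the threshold. The results listed are invented for illustration.

```python
def roc_points(results, thresholds):
    """Compute (threshold, false positive rate, true positive rate) triples.

    results: list of (confidence, is_correct) pairs, one per utterance.
    A "true positive" is a correct answer given at or above the threshold;
    a "false positive" is an incorrect answer given at or above it.
    """
    n_correct = sum(1 for _, ok in results if ok)
    n_wrong = len(results) - n_correct
    points = []
    for t in thresholds:
        tp = sum(1 for conf, ok in results if ok and conf >= t)
        fp = sum(1 for conf, ok in results if not ok and conf >= t)
        tpr = tp / n_correct if n_correct else 0.0
        fpr = fp / n_wrong if n_wrong else 0.0
        points.append((t, fpr, tpr))
    return points


# Hypothetical cross-validation results: (confidence, was the intent correct?)
results = [
    (0.95, True), (0.90, True), (0.80, False), (0.70, True),
    (0.60, False), (0.40, True), (0.30, False), (0.20, False),
]

points = roc_points(results, [0.25, 0.5, 0.75])
for threshold, fpr, tpr in points:
    print(f"threshold={threshold:.2f}  FPR={fpr:.2f}  TPR={tpr:.2f}")
# threshold=0.25  FPR=0.75  TPR=1.00
# threshold=0.50  FPR=0.50  TPR=0.75
# threshold=0.75  FPR=0.25  TPR=0.50
```

Raising the threshold from 0.25 to 0.75 cuts the share of wrong answers that get through from 75% to 25%, but it also halves the share of correct answers that are delivered.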
In summary, cross-validation testing is a very useful way of assessing the effectiveness of your chatbot model, but it is essential to have a good-quality dataset that tests every intent with as many diverse utterances as you can possibly gather.