This article was originally published on QBox’s blog, prior to Cyara’s acquisition of QBox.
You’ve worked hard to make improvements to your chatbot model, and it is now scoring very well for correctness (and hopefully for confidence and clarity, if applicable) in automated tests. But your work isn’t finished just because overall model correctness has reached 80% or more. The next step in your chatbot improvement journey is to start cross-validation testing.
Cross-validation testing is recommended because it will not only reveal blind spots in your training data; it will also help you identify whether your chatbot model is overfit (a model that is finely tuned to its existing training dataset but performs poorly when faced with new data, even if that data strays only slightly from the training set).
It is important to look out for an overfit model because it can be deceptive, lulling the chatbot builder or trainer into a false sense of security. On the face of it, the model looks like a success because the overall scores in automated testing are high. In reality, its predictive power is weak: the model has not learned enough from the existing training set to apply that knowledge successfully in the real world.
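If your chatbot platform does not expose cross-validation directly, you can approximate it offline on an export of your training data. The sketch below is a minimal example, not a definitive implementation: it assumes your utterances and intent labels are available as plain Python lists and uses a simple scikit-learn pipeline as a stand-in for your real NLU model.

```python
# Minimal k-fold cross-validation sketch (assumes scikit-learn is installed).
# "utterances" and "intents" are hypothetical names for your exported training data.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline

utterances = [
    "I want to change my address",
    "please update my address",
    "what time do you open",
    "when are you open on weekends",
]
intents = ["change_address", "change_address", "opening_hours", "opening_hours"]

# A simple bag-of-words classifier stands in for your chatbot's NLU model.
model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Each fold holds back a slice of the training data and tests the model on it,
# simulating utterances the model has never seen. With a real dataset you would
# typically use 5 or 10 folds rather than 2.
cv = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
scores = cross_val_score(model, utterances, intents, cv=cv, scoring="accuracy")

print(f"Per-fold correctness: {scores}")
print(f"Mean cross-validation correctness: {scores.mean():.0%}")
```

The mean score across folds is the number you will want to compare against your automated test results, as described next.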
When you start running cross-validation tests, you should expect your overall correctness score to be a little lower than the automated correctness score. Anything up to 10% lower is considered acceptable (and natural; it is simply not possible to think up every single permutation your customers will use to express themselves within your chatbot model!). BUT… if your cross-validation test is more than 10% lower for overall correctness than your latest automated test, it could mean your model is overfit.
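The comparison itself is easy to script as a quick sanity check. The snippet below is a minimal sketch with made-up scores; the 10% threshold is simply the rule of thumb described above.

```python
# Hypothetical scores from your latest runs (replace with your own results).
automated_correctness = 0.86         # correctness from automated regression tests
cross_validation_correctness = 0.71  # mean correctness from cross-validation

gap = automated_correctness - cross_validation_correctness

if gap > 0.10:
    print(f"Gap of {gap:.0%}: the model may be overfit; review the training data.")
else:
    print(f"Gap of {gap:.0%}: within the expected range.")
```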
We often see client models that are overfitting. Sometimes you can tell just by looking at the existing training data: there are many very similar utterances expressing the same concept, or the vocabulary is very limited. For example, a model may have an intent for change-of-address requests with many utterances containing “change” but none with the past tense “changed”, or with synonyms like “update/updated” and “amend/amended”. Usually, though, it is not immediately obvious that a model is overfitting until a cross-validation test is done, which is why this is a crucial next step in the chatbot improvement journey.
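One rough way to surface the limited-vocabulary symptom before you run a full cross-validation test is to measure how varied each intent’s utterances are. The sketch below is a simple heuristic with illustrative intent names and utterances, not a substitute for proper testing: a low share of unique words within an intent suggests many near-duplicate utterances.

```python
# Illustrative training data: the "change_address" intent reuses the same few
# words, which is one symptom of an intent prone to overfitting.
training_data = {
    "change_address": [
        "I want to change my address",
        "change my address please",
        "can you change my address",
    ],
    "opening_hours": [
        "what time do you open",
        "are you open on Sunday",
        "when does the branch close",
    ],
}

for intent, utterances in training_data.items():
    # Flatten all utterances for this intent into a single list of lowercase words.
    words = [w.lower() for u in utterances for w in u.split()]
    unique_ratio = len(set(words)) / len(words)
    print(f"{intent}: {unique_ratio:.0%} unique words across {len(utterances)} utterances")
```

Intents that score noticeably lower than the rest are good candidates for adding varied phrasings, tenses, and synonyms before you retrain and re-test.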