This article was originally published on Botium’s blog on June 10, 2020, prior to Cyara’s acquisition of Botium.
Audience: This article is for you if you have an existing Watson Assistant skill and want to check its training data for consistency and performance. It also shows ways to improve the performance of your skill. We will use Botium to:
- download the training data from your Watson Assistant skill to Botium
- run static and dynamic analytics on it
- present the results in the Botium Coach dashboard
- augment the training data to improve the performance
- upload the training data from Botium to your Watson Assistant skill
- and finally, validate the improvements
Attention: In machine learning, it typically makes no sense to use training data for testing, but with the approach outlined here you will detect serious flaws in the training data itself. It will not tell you how your skill will perform in production later. For that you have to invest some more effort to prepare good test data (or use the data sets included in Botium).
Step 1: Open a Channel to your Watson Assistant Skill
Open the Chatbots menu in Botium, click the Register New Chatbot button and select IBM Watson Assistant API in the technology selection field. You can find everything you need in your IBM Cloud Console – in most cases you will authenticate with an IAM API Key – see the IBM docs on how to get yours. It is important to select Assistant V1 as SDK Version.
When everything is in place, click the Say Hello button to verify your credentials.
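To give an idea of what such a credentials check involves at the API level, here is a minimal sketch that builds (but does not send) a single message request against the Watson Assistant V1 REST API. How Botium implements its Say Hello button is not shown in this article, so treat this as an illustration only. The function name `build_hello_request`, the example URL and the version date are assumptions chosen for the sketch.

```python
import base64
import json
import urllib.request


def build_hello_request(service_url, workspace_id, api_key, version="2020-04-01"):
    """Build (without sending) a Watson Assistant V1 message request.

    The version date is an assumption; use the one that matches your skill.
    """
    endpoint = (
        f"{service_url.rstrip('/')}/v1/workspaces/{workspace_id}"
        f"/message?version={version}"
    )
    # IAM API keys can be used with HTTP Basic auth and the literal user "apikey".
    token = base64.b64encode(f"apikey:{api_key}".encode("utf-8")).decode("ascii")
    # alternate_intents asks Watson to return the full ranked intent list.
    payload = {"input": {"text": "hello"}, "alternate_intents": True}
    return urllib.request.Request(
        endpoint,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Basic {token}",
        },
        method="POST",
    )
```

Passing the returned object to `urllib.request.urlopen` would actually send the message, which counts as an API call on your IBM plan.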
Step 2: Use the Test Case Wizard to Download Training Data to Botium Box
Open the Test Case Wizard, expand the Conversation Model Downloader section, enter a name for the new Test Set and click on Download from IBM Watson Assistant. Make sure the option Create new Test project with this Test Set is enabled to save you a few extra clicks later.
At this point, Botium will download the intents and user examples from your skill and build a Test Set in Botium out of it. It will do a static analysis of the training data as well – in case it identifies obvious problems (such as duplicate utterances, empty user examples list…) it will immediately warn you about it. You can now see the Test Set statistics.
By switching to the Test Cases tab, you should now see something familiar – the intent list as you named it in your Watson Assistant skill, as well as the user examples for them!
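The kind of static checks described in this step can be sketched in a few lines. The example below is not Botium's implementation. It assumes a dictionary shaped like a Watson Assistant workspace export (an `intents` list whose entries have an `intent` name and an `examples` list of `{"text": ...}` objects), and the function name `static_issues` is made up for illustration.

```python
from collections import defaultdict


def static_issues(workspace):
    """Find empty intents and duplicate user examples in a workspace dict."""
    empty_intents = []
    repeated_within = set()          # (utterance, intent) listed more than once
    intents_by_text = defaultdict(set)  # normalized utterance -> intent names

    for intent in workspace.get("intents", []):
        name = intent["intent"]
        examples = intent.get("examples", [])
        if not examples:
            empty_intents.append(name)
        for example in examples:
            # Normalize case and whitespace so "Hello " and "hello" match.
            key = " ".join(example["text"].lower().split())
            if name in intents_by_text[key]:
                repeated_within.add((key, name))
            intents_by_text[key].add(name)

    # Utterances that appear under more than one intent.
    shared = {
        text: sorted(names)
        for text, names in intents_by_text.items()
        if len(names) > 1
    }
    return {
        "empty_intents": empty_intents,
        "shared_between_intents": shared,
        "repeated_within_intent": sorted(repeated_within),
    }
```

An utterance reported under `shared_between_intents` is a likely candidate for the mismatch risks discussed in Step 4.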
Step 3: Run a First Test Session
In the Test Set Dashboard, click the Start Test Session button and watch the test session progress. Botium will send all of your training data, one user example after the other, to the Watson Assistant skill and flag any irregularities – for example, if a user example resolves to a different intent than expected.
Attention: Depending on the Watson Assistant plan in place you will have to pay for each API call!
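Conceptually, a test session of this kind boils down to the loop sketched below: send every user example, compare the predicted intent with the expected one, and collect the differences. This is a simplified sketch, not Botium's code. `find_mismatches` is a hypothetical name, and `classify` stands in for whatever function sends one utterance to your skill and returns the name of the top intent.

```python
def find_mismatches(test_cases, classify):
    """Return the test cases whose predicted intent differs from the expected one.

    test_cases: iterable of (utterance, expected_intent) pairs.
    classify:   callable taking an utterance and returning an intent name.
                Against a real skill, every call is one billable API call.
    """
    mismatches = []
    for utterance, expected in test_cases:
        predicted = classify(utterance)
        if predicted != expected:
            mismatches.append(
                {"utterance": utterance, "expected": expected, "predicted": predicted}
            )
    return mismatches


# A stub classifier so the sketch runs without any network access.
def stub_classify(utterance):
    return "goodbye" if "day" in utterance else "greeting"
```

With the stub above, `find_mismatches([("hello", "greeting"), ("good day", "greeting")], stub_classify)` reports only the second test case.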
Step 4: Open the results in Botium Coach Dashboard
In Botium Coach Dashboard you will now receive some hints about what might be wrong with your training data. You won’t get much out of the confidence score evaluations, as we are testing with training data, so anything but a very high average confidence score would be really surprising (or alarming).
You can have a look at the confusion matrix, but for the same reason as above, you won’t really get valuable hints there.
You definitely should have a look into the Mismatch Probability Risks section – this is about user examples with a high risk of matching an incorrect intent, which usually would trigger disambiguation in Watson Assistant (if this feature is enabled) – this shouldn’t happen for training data. Botium Coach evaluates the alternate intents lists to show you any intents and user examples that are very likely to cause a mismatch due to similar confidence scores. In the example below you can see that the user examples “good day”, “can I teach you” and some more have a mismatch probability score of 1, which means that Watson Assistant is not able to distinguish between two intents.
Clicking inside the chart will show the alternate intents list Watson Assistant returned for this user example – so there is definitely something odd with your training data – most likely, the user example “good day” is included in the user examples of both intents.
You can find other valuable insights in the Intent Mismatch Probability Risks and Alternative Intents lists – these highlight any intents that often appear in a prominent position in the alternate intents lists of other intents.
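To make the idea of a mismatch probability more tangible, here is one possible way to compute such a score from the ranked intent list Watson Assistant returns when alternate intents are requested. Botium Coach's actual formula is not given in this article. The ratio of the runner-up confidence to the top confidence used below is an assumption, chosen because it yields 1 when two intents are tied. The function names and the threshold of 0.8 are hypothetical as well.

```python
from collections import Counter


def _ranked(intents):
    # Highest confidence first.
    return sorted(intents, key=lambda item: item["confidence"], reverse=True)


def mismatch_score(intents):
    """Assumed score: runner-up confidence divided by top confidence (0..1)."""
    ranked = _ranked(intents)
    if len(ranked) < 2 or ranked[0]["confidence"] <= 0:
        return 0.0
    return ranked[1]["confidence"] / ranked[0]["confidence"]


def risky_intent_pairs(results, threshold=0.8):
    """Count (expected intent, strongest rival) pairs among risky user examples.

    results: iterable of (expected_intent, intents) where intents is a list
             of {"intent": name, "confidence": float} dictionaries.
    """
    pairs = Counter()
    for expected, intents in results:
        if mismatch_score(intents) < threshold:
            continue
        rival = next(
            (i["intent"] for i in _ranked(intents) if i["intent"] != expected),
            None,
        )
        if rival is not None:
            pairs[(expected, rival)] += 1
    return pairs
```

Pairs with a high count point to two intents whose user examples overlap and should be reviewed together.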
Step 5: Augment Training Data
You can now choose to:
- augment your training data in your Watson Assistant workspace and then download the augmented training data to Botium Box again
- or use the included Botium Box tools to augment the training data in Botium Box and upload it to your Watson Assistant with the Test Case Wizard
When doing it in the Watson Assistant workspace, use the Test Case wizard again to either overwrite the Test Set data or create a new one.
When doing it in Botium, you can use the Botium tools to augment the training data:
- Use the Test Case Designer to manually add and remove user examples to the training data
- Use the Paraphraser to generate additional user examples and add them to the training data
- Merge training data from the included Botium Box data sets, available for various domains in various languages
- Merge training data from other sources
For example, you can now identify the intents for which there is not enough training data available (fewer than 10 user examples), and use the Paraphraser to generate more.
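Finding the intents with too few user examples is a simple count. The sketch below again assumes a dictionary shaped like a Watson Assistant workspace export. The function name `sparse_intents` is made up, and the default threshold of 10 is taken from the text above.

```python
def sparse_intents(workspace, minimum=10):
    """List (intent, example_count) for intents with fewer than `minimum` examples."""
    counts = {
        intent["intent"]: len(intent.get("examples", []))
        for intent in workspace.get("intents", [])
    }
    sparse = [(name, count) for name, count in counts.items() if count < minimum]
    # Sparsest intents first, ties broken alphabetically.
    return sorted(sparse, key=lambda item: (item[1], item[0]))
```

The intents at the top of the returned list are the first candidates for the Paraphraser.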
To upload the additional training data to your Watson Assistant workspace, open the Test Set Dashboard, click on Upload Conversation Model to Chatbot Provider and select your IBM Watson chatbot as registered in the first step of this tutorial. You can now choose to:
- Create a new blank workspace
- Copy and extend the existing workspace
- Merge user examples into the existing workspace
All three of them are valid choices, but I would not recommend the third option for workspaces that are currently live, because the merged user examples would change the skill your users are talking to before you have validated them.
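The difference between working on a copy and merging in place can be illustrated with a small sketch. It merges additional user examples into a deep copy of a workspace-export-style dictionary and leaves the original untouched, which mirrors the idea behind the copy-and-extend option. This is not Botium's upload logic, and `merge_examples` is a hypothetical helper.

```python
import copy


def merge_examples(workspace, additions):
    """Return a new workspace dict with extra user examples merged in.

    additions: {intent_name: [utterance, ...]}. Unknown intents are created.
    Duplicates (ignoring case and surrounding whitespace) are skipped.
    """
    merged = copy.deepcopy(workspace)  # never modify the original
    by_name = {i["intent"]: i for i in merged.setdefault("intents", [])}

    for name, utterances in additions.items():
        intent = by_name.get(name)
        if intent is None:
            intent = {"intent": name, "examples": []}
            merged["intents"].append(intent)
            by_name[name] = intent
        existing = {e["text"].strip().lower() for e in intent.setdefault("examples", [])}
        for text in utterances:
            key = text.strip().lower()
            if key and key not in existing:
                intent["examples"].append({"text": text.strip()})
                existing.add(key)
    return merged
```

Because the function returns a new dictionary, you can run the static checks on the result before anything is uploaded.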
Step 6: Validate Improvements
Again, run a test session with your augmented training data and open it in Botium Coach when it is ready. In Botium Coach, you can select a secondary test session for performance comparison. It will show which intents and user examples now work better or worse than before.
As usual in Botium Coach, you can drill down to the single user example level to trace any improvements or deteriorations.
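A comparison of two test sessions can be sketched as a per-intent accuracy difference. The example below assumes each session is available as a list of (expected intent, predicted intent) pairs. That data shape and the function names are assumptions for illustration and do not describe how Botium Coach stores its results.

```python
def intent_accuracy(session):
    """Map each expected intent to the share of its examples predicted correctly."""
    totals, correct = {}, {}
    for expected, predicted in session:
        totals[expected] = totals.get(expected, 0) + 1
        correct[expected] = correct.get(expected, 0) + (expected == predicted)
    return {name: correct[name] / totals[name] for name in totals}


def compare_sessions(before, after):
    """Return {intent: accuracy change} for intents present in both sessions."""
    acc_before = intent_accuracy(before)
    acc_after = intent_accuracy(after)
    changes = {}
    for name in sorted(set(acc_before) & set(acc_after)):
        delta = acc_after[name] - acc_before[name]
        if delta != 0:
            changes[name] = delta  # positive = improved, negative = got worse
    return changes
```

A negative value for an intent that you did not touch is worth a closer look, since added user examples for one intent can pull utterances away from another.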
Conclusion
While it usually makes no sense to test a machine learning model with data it has been trained on, it can help to visualize issues with the training data itself, such as mismatch probabilities and duplicate user examples. Flaws in the training data will have a negative impact on the overall NLU performance of your chatbot for sure.