This article was originally published on QBox’s blog, prior to Cyara’s acquisition of QBox. Learn more about Cyara + QBox.
Welcome to part 2 of our blog series summarizing the changes you’ll experience as you upgrade your Microsoft chatbot from the old LUIS service to the shiny new Conversational Language Understanding (CLU) service in Azure Cognitive Service for Language.
Part 1 of this blog discussed which features are modified or even removed in CLU, and how the interface you build your bot in has changed. We also briefly touched upon the two model training modes available in English, standard and advanced, which we’ll come back to in this second part.
In part 2 we’re first going to demonstrate the (very simple!) process of importing your LUIS chatbot into the new CLU service. Then we’ll take the same bot we just imported and use our tool to compare its performance in both LUIS and CLU, while also changing the volume of training data to assess whether CLU’s NLP does indeed require less, as Microsoft has suggested.
Importing a LUIS Chatbot into CLU
If you’ve already gone through the process of exporting your bot from LUIS as a .json file, then the first half of this process will be very familiar to you.
Here in fig. 1, we have our demo LUIS chatbot “Bee,” who handles queries about an AI event, answering questions like “What does AI mean?” and “What companies are attending?”
We want to convert Bee into a CLU bot, so to do that we need to export it from LUIS. As shown in fig. 2, when viewing all your apps (bots) on the luis.ai/applications page you can select a bot to export as a .json file with the “Export” drop-down menu.
Now that Bee has been exported as a .json, we go to the language.cognitive.azure.com/clu/projects page (shown in fig. 3) where we can view all our current CLU projects (bots) and also import new ones.
When you click on “Import” here you’ll be prompted to select a .json file, so just select the one you exported from LUIS and away you go. Here in fig. 4 we can now see Bee as it appears, having just been imported to CLU. Helpfully, the “Entities used with this intent” column shows all the annotated (machine learned) entities that have been tagged in each intent, which LUIS did not show on its equivalent page. We’d encourage you to make a lot more use of these kinds of entities moving forward, as the improved NLP engine underneath CLU could make these kinds of context-sensitive entities more effective with less training data. But also keep in mind that the entity structure is slightly different in CLU, so be sure to check any entities you tagged while in LUIS to make sure their structure is still appropriate for your use case.
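Before importing, it can be useful to check what the exported file actually contains. The following is a minimal sketch, assuming the LUIS export keeps its labelled examples in a top-level `utterances` list whose items carry `intent` and `entities` fields; check your own export, as the schema may differ between versions. The intent names, entity names and sample data are hypothetical.

```python
from collections import Counter, defaultdict


def summarize_luis_export(app: dict):
    """Return (utterance count per intent, entity names labelled per intent)."""
    counts = Counter()
    entities = defaultdict(set)
    for example in app.get("utterances", []):
        intent = example["intent"]
        counts[intent] += 1
        # Each label is assumed to name its entity under the "entity" key.
        for label in example.get("entities", []):
            entities[intent].add(label["entity"])
    return counts, dict(entities)


# Hypothetical miniature export, standing in for the real .json file.
sample_export = {
    "utterances": [
        {"text": "What does AI mean?", "intent": "DefineAI", "entities": []},
        {"text": "Explain AI to me", "intent": "DefineAI", "entities": []},
        {
            "text": "Is Contoso attending?",
            "intent": "ListCompanies",
            "entities": [{"entity": "company", "startPos": 3, "endPos": 9}],
        },
    ]
}

counts, entities = summarize_luis_export(sample_export)
print(counts)
print(entities)
```

For a real export, load the file with `json.load` and pass the resulting dictionary to the same function. Comparing this summary with the “Entities used with this intent” column after import is one way to spot entities whose structure changed.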
Comparing the Impact of Downsizing Training Data in LUIS/CLU
Now that we have the ability to test CLU models and have just created a version of the same model in both LUIS and CLU, we can start testing the theory that CLU should require less training data to get similar model performance.
We decided to assess this by making a “downsized” version of Bee, in which we removed 20% of the training data from each intent at random. We then tested the performance of:
- Both the original Bee and the downsized Bee
- Each version first in LUIS, then in CLU with standard training, and finally in CLU with advanced training
- Each combination under both automated tests (assessing model performance using only the training data) and cross-validation tests (assessing model performance on test data).
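The downsizing step can be sketched roughly as follows. This is an illustration rather than the exact procedure we ran: it assumes the training data is a list of dictionaries with an `intent` key, rounds the number of removals per intent to the nearest whole utterance, and uses a fixed seed so the sample is repeatable. The function name and sample data are hypothetical.

```python
import random
from collections import defaultdict


def downsize(utterances, fraction=0.2, seed=0):
    """Randomly drop `fraction` of the utterances from each intent."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    by_intent = defaultdict(list)
    for utterance in utterances:
        by_intent[utterance["intent"]].append(utterance)

    kept = []
    for items in by_intent.values():
        n_remove = round(len(items) * fraction)
        removed = set(rng.sample(range(len(items)), n_remove))
        kept.extend(u for i, u in enumerate(items) if i not in removed)
    return kept


# Hypothetical training set: 10 utterances for one intent, 5 for another.
training = [{"text": f"a{i}", "intent": "A"} for i in range(10)]
training += [{"text": f"b{i}", "intent": "B"} for i in range(5)]

smaller = downsize(training)
print(len(training), "->", len(smaller))
```

Sampling within each intent, rather than across the whole set, keeps the balance between intents the same in the original and downsized models.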
We do not know exactly what is different under the hood of CLU, but if CLU is indeed using a more modern, advanced and powerful NLP engine, then we would expect LUIS to exhibit a more extreme drop in performance after reducing the training data. CLU, on the other hand, should be able to better recognize concepts with fewer training examples, and thus experience a smaller drop in performance when training data is reduced.
We also wanted to examine the difference in performance between standard and advanced training, so we’d have a better idea of what the strategy should be when carrying out tests on CLU models in the future.
Results
When the original Bee model and the downsized Bee model were tested using LUIS, CLU standard training and CLU advanced training, the scores came back as summarized in table 1:
The percentage loss from model downsizing was calculated from these scores, and is summarized in the graph in fig. 5:
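For reference, a percentage loss of this kind can be computed as the drop in score relative to the original score. The sketch below assumes that definition; the scores used are purely illustrative and are not taken from table 1.

```python
def percentage_loss(original: float, downsized: float) -> float:
    """Drop from `original` to `downsized` as a percentage of `original`.

    A negative result means the score went up after downsizing.
    """
    if original <= 0:
        raise ValueError("original score must be positive")
    return (original - downsized) / original * 100


# Illustrative values only: a 7-point drop from a score of 90.
loss = percentage_loss(90, 83)
print(f"{loss:.1f}% loss")

# A score that rises after downsizing yields a negative loss.
gain = percentage_loss(80, 82)
print(f"{gain:.1f}% loss")
```

Expressing the drop relative to the starting score makes models with different baselines easier to compare than raw point differences alone.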
Results – LUIS
The Bee model performs well under automated testing, since it was already optimized for LUIS, achieving all 3 scores in the 90s. However, its performance on cross-validation (CV) testing is much lower, scoring in the 60s to low 80s, as it was never going to be a production bot and we didn’t push the fine-tuning. There are several “weak” concepts within this test data (i.e., not explicitly covered within the training data), which makes them harder for LUIS to recognize.
When Bee is downsized in LUIS, its model score drops by an average of 3.6 points, with the largest drop being in automated test correctness, which fell by 7 points. This suggests that LUIS’s predictive accuracy takes a hit when the volume of training data is reduced.
Results – CLU (standard training)
When the next version of Bee was created using the standard training mode on CLU, confidence remained high, but the correctness scores dropped in the automated tests. The most significant difference is the extremely poor clarity scores that result from standard training, which require closer examination.
Fig. 6 shows an example of this from one of the CV tests on a standard-trained CLU version of Bee. If standard training is used, the most probable intents will still be returned as rankings by CLU, but they tend to be extremely close together in confidence. The clarity score represents model stability, sometimes described as “potential for confusion,” and is derived from how far apart the confidence values are for each ranked intent.
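To make the idea concrete, here is a simplified stand-in for a clarity-style measure: the gap between the top two intent confidences. This is an assumption made for illustration only and is not the formula our platform uses; the intent names and confidence values are hypothetical.

```python
def top_two_margin(confidences: dict) -> float:
    """Gap between the highest and second-highest intent confidence."""
    if len(confidences) < 2:
        raise ValueError("need at least two ranked intents")
    ranked = sorted(confidences.values(), reverse=True)
    return ranked[0] - ranked[1]


# Hypothetical rankings for a single utterance.
tightly_packed = {"DefineAI": 0.34, "ListCompanies": 0.33, "EventDate": 0.33}
well_separated = {"DefineAI": 0.90, "ListCompanies": 0.06, "EventDate": 0.04}

print(round(top_two_margin(tightly_packed), 2))
print(round(top_two_margin(well_separated), 2))
```

With tightly packed confidences the margin is close to zero even when the top-ranked intent is correct, which is the pattern that drags a clarity-style score down.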
This tendency towards very similar confidence values appears to be unique to CLU’s standard training mode, and does not emerge when advanced training is used. The results of standard training can still be informative, giving the user a good indication of where errors could arise based on which utterances are misclassified, but the clarity value is unlikely to be of much use here.
When standard-trained CLU is compared with LUIS, its overall scores are poorer. Nevertheless, it is worth noting that the standard-trained version of Bee sees a smaller drop in scores than LUIS when the model is downsized. For example, automated correctness dropped by 7 points (an 8% loss) when the model was downsized in LUIS, but by only 3 points (a loss of less than 4%) when it was downsized in standard-trained CLU. This does indeed suggest that CLU is less dependent on large volumes of training data than LUIS is, even when only the basic standard training mode is used.
Results – CLU (advanced training)
When the advanced-trained CLU version of Bee is tested, not only does it consistently return the highest scores of the three model versions, but it also shows only a very small drop in scores when the model is downsized (an average of 0.8 points). The automated clarity score actually increased when the model was downsized, and correctness decreased by only 2 points, a loss of less than 2%. Even the downsized advanced-trained version of Bee in CLU scored better than the full-sized version of Bee in LUIS.
The advanced training in CLU also does not have the issue with low clarity scores that the standard training does, so we can get a clearer picture of which intents are most likely to get confused with one another.
Conclusions
By converting one of our own chatbots into a CLU model, we were able to confirm some details of how we should expect CLU to perform differently from LUIS, and how this will be reflected in our platform. In summary:
- Importing your LUIS model into CLU is incredibly quick and easy. However, if you use some of the deprecated features like patterns, or make heavy use of overlapping entities, be sure to check Microsoft’s CLU documentation (and part 1 of this blog) to see how your model will be altered in the process.
- Advanced training on CLU achieves the best scores of the 3 options considered here, even with 20% less training data.
- Even when training data was reduced, the advanced-trained CLU model did score better than the full-sized LUIS version of the same model. This suggests that CLU can indeed perform better with less training data.
- Reducing model size does result in the biggest loss in performance for LUIS, while both the standard and advanced-trained models in CLU demonstrated more stable scores even after the number of training utterances was decreased.
- Standard training on CLU is quick and free, but correctness and especially clarity scores can drop even when compared with LUIS.
With regard to this final point, we’d still recommend using standard training for most of your CLU tests in order to help identify weak concepts, given the time and expense of advanced training and the fact that the findings about weak concepts are the same. A final test should then be carried out with advanced training to get the “true” model score, and that model would be the version you deploy to live interactions.