It took just sixteen hours for Microsoft’s shiny new chatbot Tay to be shut down for spewing profanity. The chatbot had been released on the social media platform X, then known as Twitter, following extensive evaluation and stress testing under different conditions to ensure that interacting with it would be a positive experience. Unfortunately, the testing plan had not bargained on a coordinated attack that exploited the chatbot’s vulnerability when exposed to a torrent of offensive material. Tay soon began tweeting wildly inappropriate words and images and was taken offline within hours.
The chatbot’s failure highlights just how hard yet imperative it can be to test and evaluate a model before real-world deployment. With the recent influx of accessible “off-the-shelf” machine learning algorithms, building AI models, generative AI models in particular, is now relatively straightforward. However, the simplicity with which models are deployed belies the complexity of evaluating them. Deploying a model anywhere outside the data and context it has been trained on can be risky if its performance is not evaluated. The evaluation process requires clear definitions of good performance as well as an understanding of the potential risks, and it can throw up unexpected requirements for the test data. Not only are the subtle nuances of the initial evaluation important; once a model is deployed, a process also needs to be in place so that it can be monitored over time.
Know your goals
The first point to note is that checking how well the output from an AI model matches the data in the training set is not an adequate indication of how well it will perform once deployed on other data. The problem can be illustrated with a simple model based on the equation that best fits a training data set. Data values are inevitably subject to measurement uncertainties and local conditions that add various types of noise, so measuring how closely the line defined by that best-fitting equation matches the training data falls short of adequate evaluation: the more perfectly a model matches this noisy data, the less well it will fit an alternative set of data, a scenario described as “overfitting”. Similarly, what a machine learning or AI model learns when it optimises its fit to the training data may not generalise.
There are a number of possible approaches and factors to take into account when sourcing test data, but the first thing to consider when drawing up a process for evaluating an AI model is its objective. With this objective in mind it is then possible to pin down an appropriate measure of performance, which will shape how the test data is used to evaluate the model. Among the factors distinguishing the different measures, some are suitable when the objective is to classify (e.g. is a person at high or low risk based on their health data?) while others are useful for models that estimate or predict a quantity (e.g. what is the estimated height of a child given their parents’ heights?).
Classification model performance can be measured using accuracy, confusion matrices, sensitivity, specificity and the receiver operating characteristic (ROC). Classification accuracy summarises the performance of a classification model as the number of cases the model classifies correctly divided by the total number of cases in the test set. However, this can be a blunt tool, as there are situations where the cost or consequence of an error differs depending on its direction. Confusion matrices are helpful for exploring how well the model classifies each of the classes. The confusion matrix sums up the number of cases the model classifies correctly within each class, for example how many actual high-risk cases are correctly classified as high risk by the model. The cases the model classifies as high risk that are not in fact high risk are referred to as false positives. In the context of medical tests (e.g. the Covid lateral flow tests), testing positive for a condition that is not actually there is potentially less damaging than testing negative when the condition is there.
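To make this concrete, here is a minimal Python sketch, using scikit-learn and invented labels (1 = high risk, 0 = low risk), that computes accuracy and a confusion matrix for a binary classifier’s test-set predictions.

```python
# A minimal sketch: accuracy and a confusion matrix for a binary
# "high risk" / "low risk" classifier, on invented labels.
from sklearn.metrics import accuracy_score, confusion_matrix

# 1 = high risk, 0 = low risk (illustrative labels only)
y_true = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
y_pred = [1, 0, 0, 1, 0, 1, 1, 0, 0, 0]

print("Accuracy:", accuracy_score(y_true, y_pred))  # correctly classified / total

# Rows of the matrix are the true classes, columns the predicted classes
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"True positives: {tp}, False positives: {fp}")
print(f"True negatives: {tn}, False negatives: {fn}")
```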
Additionally, sensitivity and specificity can provide a more detailed look at model performance. Sensitivity is the proportion of cases labelled as positive that the model classifies as positive, whereas specificity is the proportion labelled as negative that it classifies as negative. It is also useful to visualise model performance, and the receiver operating characteristic (ROC) provides a method to do just that: it plots the true positive rate against the false positive rate for the model. This can be further summarised in a single value, the area under the curve (AUC); the larger the AUC, the better the model is performing.
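The sketch below, again on invented data, computes sensitivity and specificity from the confusion matrix and summarises the ROC with the AUC; the 0.5 probability threshold is an illustrative assumption.

```python
# A sketch of sensitivity, specificity and AUC on invented data.
# Sensitivity = TP / (TP + FN); specificity = TN / (TN + FP).
from sklearn.metrics import confusion_matrix, roc_auc_score

y_true   = [1, 0, 1, 1, 0, 0, 1, 0, 0, 1]
# Scores (e.g. predicted probabilities of being high risk) rather than hard labels
y_scores = [0.9, 0.2, 0.4, 0.8, 0.1, 0.6, 0.7, 0.3, 0.2, 0.4]

# Turn scores into labels at an assumed 0.5 threshold for sensitivity/specificity
y_pred = [1 if s >= 0.5 else 0 for s in y_scores]
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("Sensitivity:", tp / (tp + fn))
print("Specificity:", tn / (tn + fp))

# AUC summarises the ROC curve across all possible thresholds
print("AUC:", roc_auc_score(y_true, y_scores))
```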
Deciding whether accuracy is enough or whether there is a need to delve into the direction of the errors depends on the context of the model’s deployment. Other examples in medicine include the risk models developed to assess an individual’s risk of a specific medical condition, such as QRISK [1], which calculates a person’s risk of having a heart attack or stroke over the next 10 years. Here model performance needs to go beyond accuracy and consider the direction of the errors the model makes. Is it better to tell someone they may be at risk of disease X, run a blood test and rule it out (a false positive), than to tell them they are not at risk and not check (a false negative)? All this needs to be considered and factored into the validation of the model. A good overview of performance evaluation is given by Flach (2019) [2]. It is also worth noting that a systematic direction to its errors can land an algorithm in ethical trouble.
When evaluating the performance of models that estimate a numerical value (e.g. the height of a child from the heights of the parents), the measures used are based on how far the model’s estimates are from the actual values (which are known for the test data). There are then a multitude of ways of summarising that quantity. The mean square error (MSE) is the average squared difference between the model’s estimates and the actual values in the data. Other variations include the root mean square error (RMSE) and the mean absolute error (MAE). The RMSE is simply the square root of the MSE, while the MAE takes a different approach and averages the absolute errors (i.e. the error magnitudes). All three measures involve averaging over the rows in the data. Depending on the context, one of these measures may be better suited than the others. For example, the MSE is sensitive to outliers and so can be skewed by a small number of extreme values, which may be useful if extreme errors need highlighting, whereas the RMSE (like the MAE) has the advantage of being measured in the same units as the variable the model is designed to estimate.
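As a small worked example, the Python sketch below computes the MSE, RMSE and MAE for a handful of invented child-height estimates; note how the single outlying estimate inflates the MSE far more than the MAE.

```python
# A minimal sketch of MSE, RMSE and MAE on invented heights (in cm).
import numpy as np

actual    = np.array([150.0, 160.0, 165.0, 172.0, 180.0])
estimated = np.array([152.0, 158.0, 170.0, 171.0, 199.0])  # last estimate is an outlier

errors = estimated - actual
mse  = np.mean(errors ** 2)     # average squared error
rmse = np.sqrt(mse)             # back in the original units (cm)
mae  = np.mean(np.abs(errors))  # average error magnitude, also in cm

print(f"MSE: {mse:.1f}, RMSE: {rmse:.1f} cm, MAE: {mae:.1f} cm")
```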
Large language models (LLMs, e.g. Gemini, ChatGPT) are also models trained on a data set, and as such they too need to be evaluated and monitored. Whereas the models discussed so far have some standard metrics, evaluating LLMs is more challenging as there are a multitude of benchmarks and metrics [3]. When an LLM is used to answer questions (when you ask a chatbot a question), monitoring the performance of the model (the trained LLM) can involve many dimensions. Is the answer correct? Is the answer clear? Is the answer biased? The possible metrics are varied and not as simple to capture in one measure. It is also possible to use one LLM to evaluate or score another LLM’s answer to a question. However, this adds its own risk, as LLMs are not 100% accurate or consistent themselves, and they can hallucinate.
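One possible shape for such an “LLM as judge” check is sketched below. The call_llm helper and the scoring rubric are hypothetical placeholders rather than a real API, and, as noted above, the judging model’s own inaccuracies mean its scores should be treated with caution.

```python
# A sketch of using one LLM to score another LLM's answer ("LLM as judge").
# `call_llm` is a hypothetical placeholder for whichever LLM API is in use;
# here it just returns a canned string so the sketch runs end to end.
def call_llm(prompt: str) -> str:
    return "Correctness: 4/5. Clarity: 5/5. Bias: none detected."  # stand-in response

def judge_answer(question: str, answer: str) -> str:
    # Ask the judging LLM to rate the answer against a simple, assumed rubric
    prompt = (
        "You are evaluating the answer to a question.\n"
        f"Question: {question}\n"
        f"Answer: {answer}\n"
        "Rate the answer from 1 (poor) to 5 (excellent) for correctness, "
        "clarity and absence of bias, and briefly justify each score."
    )
    return call_llm(prompt)

print(judge_answer("What is the capital of France?", "Paris is the capital of France."))
```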
Getting data right
Not only is separate test data needed for an evaluation, but care is needed to ensure the test data is suitably representative. Similar requirements apply to the test data as to the original training data: the data set needs to be representative of the context the model will be deployed in. For instance, if an algorithm is being developed to handle photos from the UK, training and testing it on photos where the sun always shines may cause problems. The model needs to be trained and tested on a set of photos that includes rain and clouds, otherwise it cannot be assumed it will reliably classify such photos when they appear during deployment in the real world. Getting the training and test data right may mean using a smaller, more curated set rather than simply one that contains everything available.
These data sets also need reliable labelling, i.e. the rows of data need to be labelled accurately so that the model’s performance can be assessed objectively against a trusted “ground truth”. For example, to evaluate the performance of a fraud transaction classification model using accuracy as the performance metric, we need a reliable test data set in which the true fraud transactions are identified, so we can evaluate how good the model is at detecting them. A data set in which transactions are not accurately labelled as fraud or not is of little help. Given that some commercial LLMs are trained on all the data on the “internet”, it is worth asking whether a smaller, more curated and specific training set would be better for model performance, as well as more ethical and safer.
Several approaches for generating test data take the training and test data as distinct subsets of the same initial data set [4]. There are different ways of doing this to make the most of the available data and evaluate the model as systematically and exhaustively as possible. Perhaps the simplest is a hold-out set, which involves taking a random subset of all the available data to use for testing the model. Depending on how much data is available, this can be 50% of the data or less.
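A minimal sketch of a hold-out split, using scikit-learn and a synthetic data set for illustration (the 30% test fraction and fixed random seed are arbitrary choices):

```python
# A sketch of a simple hold-out split on synthetic data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=1000, random_state=0)  # toy data, for illustration

X_train, X_test, y_train, y_test = train_test_split(
    X, y,
    test_size=0.3,    # hold back 30% of the data for testing
    random_state=42,  # fixed seed so the split is reproducible
)
# Fit the model on (X_train, y_train) and evaluate it on (X_test, y_test).
```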
A slightly more sophisticated approach is k-fold cross-validation, which involves splitting all the available data into k subsets and then performing k iterations: in each iteration a different one of the k subsets is used as the test data to evaluate a model trained on the remaining (k-1)/k of the data. The performance of the model can then be averaged over the k iterations. (The measure of performance can be, say, accuracy or sensitivity, depending on the context.) For example, if k is 3 then the data is split into three parts, and each iteration takes a different two-thirds of the data as training data to build the model and the remaining third as test data to evaluate it.
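A minimal sketch of 3-fold cross-validation with scikit-learn, again on synthetic data and using accuracy as the illustrative performance measure:

```python
# A sketch of 3-fold cross-validation on synthetic data.
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

X, y = make_classification(n_samples=1000, random_state=0)
model = LogisticRegression(max_iter=1000)

# Each of the 3 folds takes a turn as the test set; the other 2/3 train the model
scores = cross_val_score(model, X, y, cv=3, scoring="accuracy")
print("Accuracy per fold:", scores)
print("Mean accuracy:", scores.mean())
```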
The bootstrap is a more computationally intensive approach: it involves creating multiple samples by randomly sampling with replacement from the original data. Typically, hundreds or thousands of such samples are generated, and each will be different. These samples provide multiple versions of the training and test data, so the model can be evaluated across all these variations. Because the bootstrap relies on sampling with replacement, a given row of the original data can appear multiple times in the training or test data of one iteration and not at all in others. As with k-fold cross-validation, the performance of the model can then be averaged over these iterations; it is important that the bootstrap does not rely on only a handful of them. Both the bootstrap and cross-validation offer an opportunity to see how sensitive the model’s performance is to the characteristics of the test data, but when the available data sets are small, the bootstrap provides a more robust way of estimating the model’s performance.
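The sketch below illustrates one common way to run a bootstrap evaluation, assuming synthetic data and a logistic regression model: each iteration trains on a sample drawn with replacement and tests on the rows that the sample happened to leave out (the “out-of-bag” rows).

```python
# A sketch of bootstrap evaluation: resample with replacement many times,
# train on each bootstrap sample and test on the rows it left out.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = make_classification(n_samples=500, random_state=0)
rng = np.random.default_rng(0)
scores = []

for _ in range(200):  # hundreds of iterations, not just a handful
    idx = rng.choice(len(X), size=len(X), replace=True)  # sample with replacement
    oob = np.setdiff1d(np.arange(len(X)), idx)           # "out-of-bag" rows not sampled
    model = LogisticRegression(max_iter=1000).fit(X[idx], y[idx])
    scores.append(accuracy_score(y[oob], model.predict(X[oob])))

print("Mean accuracy over bootstrap samples:", np.mean(scores))
```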
An approach that is useful for testing whether the model’s performance is sensitive to time is the time-based split. This involves a “sliding window” that splits the data into back-to-back time periods, so that the model is trained on one period and tested on the period that follows. Using back-to-back windows also ensures that the data the model is trained on is separate from the data it is tested on.
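A minimal sketch of time-based splits using scikit-learn’s TimeSeriesSplit, assuming the rows are already ordered by time:

```python
# A sketch of time-based splits: each split trains on an earlier window
# and tests on the period that immediately follows it.
import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X = np.arange(12).reshape(-1, 1)  # 12 time-ordered observations (illustrative)

for train_idx, test_idx in TimeSeriesSplit(n_splits=3).split(X):
    print("train:", train_idx, "-> test:", test_idx)
# Test periods always come after the training period, so no future data
# leaks into the model being evaluated.
```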
Maintained monitoring
Once an algorithm has been let loose it can be difficult to maintain rigorous monitoring, but it is worth highlighting the importance of taking on that challenge and some promising approaches to it. Some of the same metrics will apply to keep a handle on the myriad issues that could arise. These range from the banal, such as data input errors, to the complex, such as model drift.
In the first case, if a model makes use of data fed into it from another system (e.g. a billing system), any update to that other system can affect model performance. Identifying this involves checking that the characteristics of the data used to train the model and the latest data fed into it are not too dissimilar, since a difference in the data, such as an increase by a factor of ten or a hundred, can cause the algorithm to fail. The magnitude of acceptable change in the data will depend on the context. Such a step change (due to a source system update) in one of the model inputs can be identified and is potentially an easy fix.
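A simple check of this kind might look like the sketch below; the feature values and the factor-of-ten threshold are purely illustrative assumptions.

```python
# A sketch of a sanity check on incoming data: compare the scale of a feature
# in the latest batch against the training data and flag a step change.
# The factor-of-10 threshold is an assumption and should be set per context.
import numpy as np

train_amounts  = np.array([20.0, 35.0, 50.0, 42.0, 28.0])      # e.g. bills in pounds
latest_amounts = np.array([2100.0, 3400.0, 5200.0, 4100.0])    # e.g. now arriving in pence

ratio = np.mean(latest_amounts) / np.mean(train_amounts)
if ratio > 10 or ratio < 0.1:
    print(f"Warning: feature scale changed by a factor of ~{ratio:.0f}; "
          "check the upstream system before trusting the model's outputs.")
```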
Model drift is more complex, as real-world data evolves over time. There are two types of model drift: data drift and concept drift. Data drift refers to the change that can occur in the data itself over time, whilst concept drift [5] is a deterioration or change in the relationship between the target variable and the input variables of a model. In the context of billing data, an example of data drift could be the addition of new price plans or phones to the data, whilst concept drift arises when the relationship between the outcome (for instance, leaving one mobile phone provider for another) and its underlying factors changes. In the mobile phone provider market, concept drift might mean that leaving for another provider is no longer dictated so much by price sensitivity as by the type of network. Both types of drift lead to a deterioration in the model’s performance as time goes by. Performance monitoring of the model is key to detecting model drift, but differentiating between data and concept drift requires additional, specialist approaches. Some of these are outlined in Rotalinti et al. (2022) [6] and Davis et al. (2020) [7].
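One way to flag data drift in a single input feature is a two-sample Kolmogorov-Smirnov test comparing the training-time distribution with recent data, as sketched below on synthetic values. The 0.05 significance level is a conventional assumption, and a check like this does not by itself distinguish data drift from concept drift.

```python
# A sketch of detecting data drift in one input feature with a two-sample
# Kolmogorov-Smirnov test, comparing training data against recent data.
import numpy as np
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
training_feature = rng.normal(loc=50, scale=10, size=1000)  # distribution at training time
recent_feature   = rng.normal(loc=60, scale=10, size=1000)  # distribution has shifted

stat, p_value = ks_2samp(training_feature, recent_feature)
if p_value < 0.05:  # conventional significance level, assumed for illustration
    print("Possible data drift: the feature's distribution has changed.")
```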
In some cases, refreshing a model to account for the change in the underlying data (both training and test) can be quick and easy. However, if concept drift is detected, it may take more than a model refresh, as the relationship between the variable we are trying to model and the explanatory data has changed. This may involve finding new data sources and could lead to significant changes in the model, for example moving from a regression model to a neural network. Deciding to rebuild or retrain a model can also have an environmental impact (particularly for the more resource-intensive models such as deep learning and LLMs). Either way, where models are subject to peer review or some form of governance, this can be a more onerous task.
Even with each step in a model’s evaluation stringently adhered to, it is also important to assess the context of its deployment for risks and rogue scenarios that might break it or, as in the case of Tay, despoil it. And like all other stages of evaluation, this should happen not just at the time of deployment but also over time. When models (machine learning or otherwise) are used to inform or make important decisions, providing information on how and when the model was evaluated, and how it is monitored, should be standard practice: not just to avoid the wasted expense of another broken AI model left on the shelf, but more importantly to safeguard the welfare of those who come into contact with it.
About the author
Isabel Sassoon is senior lecturer in the Department of Computer Science, Brunel University London.
Copyright and licence
© 2024 Royal Statistical Society
This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.
How to cite
Sassoon, Isabel. 2024. “Evaluation essentials for safe and reliable AI model performance.” Real World Data Science, May 21, 2024.
References
1. Hippisley-Cox, J., Coupland, C. and Brindle, P. Development and validation of QRISK3 risk prediction algorithms to estimate future risk of cardiovascular disease: prospective cohort study. BMJ (2017). doi: https://doi.org/10.1136/bmj.j2099
2. Flach, P. Performance evaluation in machine learning: the good, the bad, the ugly, and the way forward. Proceedings of the AAAI Conference on Artificial Intelligence, 9808-9814 (2019). doi: https://doi.org/10.1609/aaai.v33i01.33019808
3. Chang, Y. et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology (2023). doi: https://doi.org/10.1145/3641289
4. Witten, I. H., Frank, E. and Hall, M. A. Data Mining: Practical Machine Learning Tools and Techniques. Morgan Kaufmann (2011).
5. Bayram, F., Ahmed, B. S. and Kassler, A. From concept drift to model degradation: An overview on performance-aware drift detectors. Knowledge-Based Systems (2022). doi: https://doi.org/10.1016/j.knosys.2022.108632
6. Rotalinti, Y., Tucker, A., Lonergan, M., Myles, P. and Branson, R. Detecting drift in healthcare AI models based on data availability. Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 243-258. Springer Nature Switzerland (2022). doi: https://doi.org/10.1007/978-3-031-23633-4_17
7. Davis, S. E., Greevy Jr, R. A., Lasko, T. A., Walsh, C. G. and Matheny, M. E. Detection of calibration drift in clinical prediction models to inform model updating. Journal of Biomedical Informatics (2020). doi: https://doi.org/10.1016/j.jbi.2020.103611