Undergraduate student Yifan (Rosetta) Hu was responsible for writing the Python script that pre-processes the 2015–2016 UPC, EC, and PPC data for training neural network models. Her script randomly sampled five negative EC descriptions for every positive match between a UPC and EC code. Professor Mandy Korpusik performed the remaining work, including setting up the environment, training the BERT model, and evaluation. Hu spent roughly 10 hours on the competition, and Korpusik spent roughly 40 hours (plus many additional hours running and monitoring the training and testing scripts).
Our perspective on the challenge
The goal of this challenge is to use machine learning and natural language processing (NLP) to link language-based entries in the IRI and FNDDS databases. Our proposed approach is based on our prior work using deep learning models to map users’ natural language meal descriptions to the FNDDS database (Korpusik, Collins, and Glass 2017b) to retrieve nutrition information in a spoken diet tracking system. In the past, we found a trade-off between accuracy and cost, leading us to select convolutional neural networks over recurrent long short-term memory (LSTM) networks – with nearly 10x as many parameters and 2x the training time required, LSTMs achieved slightly lower performance on semantic tagging and food database mapping on meals in the breakfast category. Here, we propose to investigate state-of-the-art transformers, specifically the contextual embedding model (i.e., the entire sentence is used as context to generate the embedding) known as BERT (Bidirectional Encoder Representations from Transformers, Devlin et al. 2018).
Our approach
Our approach is to fine-tune a large pre-trained BERT language model on the food data. BERT was originally trained on a massive amount of text for a masked language modelling task (i.e., predicting words that have been masked out of a sentence). It relies on a transformer model, which uses an “attention” mechanism to identify which words the model should pay the most “attention” to. We are specifically using BERT for binary sequence classification, which refers to predicting a label (i.e., classification) for a sequence of words. In our case, during fine-tuning (i.e., training the model further on our own dataset) we feed the model pairs of sentences (where one sentence is the UPC description of a food item and the other is a candidate EC description), and the model performs binary classification, predicting whether the two descriptions are a match (i.e., 1) or not (i.e., 0). We start with the 2015–2016 ground truth PPC data for positive examples, and randomly sample five negative examples per positive example.
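To make the pair-construction step concrete, here is a minimal sketch of this kind of pre-processing. The file and column names (ppc_2015_2016.csv, upc_description, ec_description) are illustrative assumptions, not the actual schema used in the challenge.

```python
import random
import pandas as pd

# Hypothetical file and column names; the real PPC/EC schemas may differ.
ppc = pd.read_csv("ppc_2015_2016.csv")   # ground-truth UPC–EC matches
ec = pd.read_csv("ec_2015_2016.csv")     # all EC descriptions

ec_descriptions = ec["ec_description"].tolist()

pairs = []
for _, row in ppc.iterrows():
    upc_text, ec_text = row["upc_description"], row["ec_description"]
    # One positive pair per ground-truth match ...
    pairs.append((upc_text, ec_text, 1))
    # ... plus five randomly sampled negative EC descriptions.
    negatives = random.sample([d for d in ec_descriptions if d != ec_text], 5)
    for neg in negatives:
        pairs.append((upc_text, neg, 0))

pd.DataFrame(pairs, columns=["upc", "ec", "label"]).to_csv("train_pairs.csv", index=False)
```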
Training methods
Since we used a neural network model, the only features passed to the model were the tokenized words of the UPC and EC food descriptions themselves – we did not conduct any manual feature engineering (Dong and Liu 2018). The data were split 90/10 into training and validation sets, where the validation data was used to tune the model’s hyperparameters. We started with a randomly sampled set of 16,000 pairs, a batch size of 16 (i.e., the model trains on batches of 16 samples at a time), AdamW (Loshchilov and Hutter 2017) as the optimizer (which adaptively updates the learning rate, or how large each update to the model’s parameters should be), a linear schedule with warmup (i.e., starting with a small learning rate in the first few epochs because of the large variance in the early stages of training, L. Liu et al. 2019), and one epoch (i.e., the number of passes the model makes through all the training data). We then added the next randomly sampled set of 16,000 pairs to obtain a model trained on 32,000 data points, and finally reached a total of 48,000 data samples used for training. Each pair of sequences was tokenized with the pre-trained BERT tokenizer, with the special CLS and SEP tokens (where CLS is a learned vector that is typically passed to downstream layers for the final classification, and SEP is a learned vector that separates the two input sequences), and was padded with zeros to the maximum input sequence length of 240 tokens, so that every input sequence had the same length.
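The following is a rough sketch of this training setup using the transformers library, assuming the (upc, ec, label) pairs built in the earlier sketch; the learning rate and the warmup proportion are illustrative assumptions, not values reported in this post.

```python
import torch
from torch.utils.data import DataLoader, TensorDataset, random_split
from transformers import BertTokenizer, BertForSequenceClassification, get_linear_schedule_with_warmup

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

# Tokenize each (UPC, EC) pair as [CLS] upc [SEP] ec [SEP], zero-padded to 240 tokens.
enc = tokenizer(
    [p[0] for p in pairs], [p[1] for p in pairs],
    padding="max_length", truncation=True, max_length=240, return_tensors="pt",
)
labels = torch.tensor([p[2] for p in pairs])

dataset = TensorDataset(enc["input_ids"], enc["attention_mask"], enc["token_type_ids"], labels)
n_val = len(dataset) // 10
train_set, val_set = random_split(dataset, [len(dataset) - n_val, n_val])  # 90/10 split

loader = DataLoader(train_set, batch_size=16, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)  # learning rate is an assumption
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=len(loader) // 10,  # warmup fraction is an assumption
    num_training_steps=len(loader),      # one epoch
)

# One epoch of fine-tuning.
model.train()
for input_ids, attention_mask, token_type_ids, y in loader:
    optimizer.zero_grad()
    out = model(input_ids=input_ids, attention_mask=attention_mask,
                token_type_ids=token_type_ids, labels=y)
    out.loss.backward()
    optimizer.step()
    scheduler.step()
```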
Model development approach
We faced many challenges due to the secure nature of the ADRF environment. First, since our approach relies on BERT, we were blocked by errors with the local BERT installation. Typically, BERT is downloaded from the web as the program runs; for this challenge, however, BERT had to be installed locally for security reasons. To fix the errors, the BERT models needed to be cloned with git lfs clone instead of a standard git clone, so that the large weight files were actually downloaded rather than left as pointer files.
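Once the model files are on local disk, loading them is a matter of pointing the from_pretrained calls at the cloned directory rather than a model name; a minimal sketch, with a hypothetical path:

```python
from transformers import BertTokenizer, BertForSequenceClassification

# Path to the locally cloned model repository (cloned with `git lfs clone`
# so the weight files are real, not LFS pointers). The path is hypothetical.
local_path = "/path/to/bert-base-uncased"

tokenizer = BertTokenizer.from_pretrained(local_path)
model = BertForSequenceClassification.from_pretrained(local_path, num_labels=2)
```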
Second, we were unable to retrieve the test data from the database due to SQLAlchemy errors. We found a workaround by using DBeaver directly to save database tables as Excel spreadsheets, rather than accessing the database tables through Python.
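After the export, the spreadsheets can simply be read back into Python with pandas; a minimal sketch with hypothetical file names:

```python
import pandas as pd

# Tables exported from DBeaver as spreadsheets (file names are hypothetical).
upc_test = pd.read_excel("upc_2017_2018.xlsx")
ec_test = pd.read_excel("ec_2017_2018.xlsx")
```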
Finally, we needed a GPU in order to efficiently train our BERT models. However, we initially only had a CPU, so there was a delay while the GPU configuration was set up. Once the GPU image was set up, there was still a CUDA error when running the BERT model during training. We determined that the model was too big to fit into GPU memory, so we identified a workaround using gradient checkpointing (trading off computation speed for memory) with the transformers library’s Trainer and TrainingArguments. Unfortunately, the version of transformers we were using did not yet include these tools, and the library was not updated until less than a week before the deadline, so we still had to train the model on the CPU.
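For reference, this is roughly what the gradient-checkpointing workaround looks like with a recent enough transformers release; it is a sketch, not our final training code, and it assumes a train_dataset whose items are dicts of model inputs (input_ids, attention_mask, token_type_ids, labels).

```python
from transformers import Trainer, TrainingArguments

args = TrainingArguments(
    output_dir="checkpoints",
    per_device_train_batch_size=16,
    num_train_epochs=1,
    gradient_checkpointing=True,  # recompute activations to save GPU memory
)
trainer = Trainer(model=model, args=args, train_dataset=train_dataset)  # train_dataset is a placeholder
trainer.train()
```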
Because we could not run jobs in the background, we checkpointed our models every five batches during training, and likewise saved the model’s predictions to a CSV file every five batches during evaluation.
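A minimal sketch of this save-every-five-batches pattern, reusing the model, optimizer, and validation split from the training sketch above (the output file names are ours):

```python
import csv
import torch
from torch.utils.data import DataLoader

SAVE_EVERY = 5  # batches

# Inside the training loop above, after each optimizer step:
#     if (step + 1) % SAVE_EVERY == 0:
#         torch.save({"model_state": model.state_dict(),
#                     "optimizer_state": optimizer.state_dict()}, "checkpoint.pt")

# Evaluation: flush predictions to CSV every five batches so partial results
# survive if the foreground session is interrupted.
val_loader = DataLoader(val_set, batch_size=16)
scores = []
model.eval()
with torch.no_grad():
    for step, (input_ids, attention_mask, token_type_ids, y) in enumerate(val_loader):
        logits = model(input_ids=input_ids, attention_mask=attention_mask,
                       token_type_ids=token_type_ids).logits
        scores.extend(logits[:, 1].tolist())  # score for the "match" (label 1) class
        if (step + 1) % SAVE_EVERY == 0:
            with open("predictions.csv", "w", newline="") as f:
                csv.writer(f).writerows([[s] for s in scores])
```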
Find the code in the Real World Data Science GitHub repository.
Our results
After training, the 48K model (so called because it was trained on 48,000 data samples) was used at test time by ranking all possible 2017–2018 EC descriptions for each unseen UPC description. The rankings were obtained from the model’s output value – the higher the output (or confidence), the more highly we ranked that EC description. To speed up the ranking process, we used blocking (i.e., only ranking a subset of all possible matches), specifically requiring an exact word match with one of the first six words in the UPC description (which appeared to be the most important), and we fed all candidate matches through the model in a single batch per UPC description. Since we still did not have sufficient time to complete evaluation on the full set of test UPC descriptions, we implemented an expedited evaluation that only considers the first 10 matching EC descriptions in the BERT ranking process (which we call BERT-FAST). We also report results for the slower evaluation method (BERT-SLOW) that considers all EC descriptions matching at least one of the first six words in a given UPC description, but note that these results are based on just a small subset of the total test set. See Table 1 below for our results, where @5 means the metric is computed over the top five ranked EC descriptions (e.g., Success@5 is how often the correct match was ranked among the top five). See Table 2 for an estimate of how long it takes to train and test the model on a CPU.
Model | Success@5 | NDCG@5 |
---|---|---|
BERT-FAST | 0.057 | 0.047 |
BERT-SLOW | 0.537 | 0.412 |
Task | Time |
---|---|
Training (on 48K samples) | 16 hours |
Testing (BERT-FAST) | 52 hours |
Testing (BERT-SLOW) | 63 days |
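To make the ranking procedure described above concrete, here is a rough sketch of blocking plus batched scoring for a single UPC description, assuming the fine-tuned model and tokenizer from earlier. The function and parameter names are ours, reading a match confidence off the softmax over the two logits is our assumption about the scoring, and max_candidates emulates the BERT-FAST cut-off.

```python
import torch

def rank_candidates(upc_text, ec_descriptions, model, tokenizer,
                    top_k=5, max_candidates=None):
    """Rank candidate EC descriptions for a single UPC description."""
    # Blocking: keep only EC descriptions that share a word with the
    # first six words of the UPC description.
    block_words = set(upc_text.lower().split()[:6])
    candidates = [ec for ec in ec_descriptions
                  if block_words & set(ec.lower().split())]
    if max_candidates is not None:          # BERT-FAST-style cut-off
        candidates = candidates[:max_candidates]
    if not candidates:
        return []

    # Score every surviving candidate in one batch per UPC description.
    enc = tokenizer([upc_text] * len(candidates), candidates,
                    padding="max_length", truncation=True, max_length=240,
                    return_tensors="pt")
    with torch.no_grad():
        logits = model(**enc).logits
    scores = torch.softmax(logits, dim=-1)[:, 1]  # confidence that the pair is a match

    order = torch.argsort(scores, descending=True).tolist()
    return [(candidates[i], scores[i].item()) for i in order[:top_k]]
```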
Future work/refinement
In the future, with more time available, we would train on all the data, not just our limited dataset of 48,000 pairs, and would evaluate on the held-out test set against the full set of possible EC matches that share one or more words with the UPC description. We would compare against baseline word embedding methods such as word2vec (Mikolov et al. 2017) and GloVe (Pennington, Socher, and Manning 2014), and we would explore hierarchical prediction methods for improving efficiency and accuracy. Specifically, we would first train a classifier to predict the generic food category, and then train finer-grained models to predict specific foods within that category. Finally, we are exploring multi-modal transformer-based approaches that accept two input modalities (i.e., food images and text descriptions of a meal) for predicting the best UPC match.
Lessons learned
We recommend that future challenges provide every team with both a CPU and a GPU in their workspace, to avoid transitioning from one to the other midway through the challenge. In addition, if possible, it would be very helpful to provide a mechanism for running jobs in the background. Finally, it may be useful for teams to submit snippets of code along with library package names, in order for the installations to be tested properly beforehand.
- About the authors
- Yifan (Rosetta) Hu is an undergraduate student and Mandy Korpusik is an assistant professor of computer science at Loyola Marymount University’s Seaver College of Science and Engineering.
- Copyright and licence
- © 2023 Yifan Hu and Mandy Korpusik
This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Thumbnail photo by Peter Bond on Unsplash.
- How to cite
- Hu, Yifan, and Mandy Korpusik. 2023. “Food for Thought: Third place winners – Loyola Marymount.” Real World Data Science, August 21, 2023. URL