Food for Thought: First place winners – Auburn Big Data

Auburn University’s team of PhD students and faculty describe their winning solution to the Food for Thought challenge: random forest classifiers.

Machine learning
Natural language processing
Public policy
Health and wellbeing
Author

Alex Knipper, Naman Bansal, Jingyi Zheng, Wenying Li, and Shubhra Kanti Karmaker

Published

August 21, 2023

The Auburn Big Data team from Auburn University consists of five members, including three assistant professors: Dr Wenying Li of the Department of Agricultural Economics and Rural Sociology, Dr Jingyi Zheng of the Department of Mathematics and Statistics, and Dr Shubhra Kanti Karmaker of the Department of Computer Science and Software Engineering. Additionally, the team comprises two PhD students, Naman Bansal and Alex Knipper, who are affiliated with Dr Karmaker’s big data lab at Auburn University.

We estimate that our team spent approximately 1,400 hours on this project.

Our perspective on the challenge

At the start of this competition, we decided to test three general approaches, in the order listed:

  1. A heuristic approach, where we use only the data and a defined similarity metric to predict which FNDDS label a given IRI item should have.

  2. A simpler modeling approach, where we train a simple statistical classifier, such as a random forest (Parmar, Katariya, and Patel 2019) or logistic regression, to predict the FNDDS label for a given IRI item. For this method, we opted for a random forest as our statistical model: it is a straightforward baseline that has shown decent performance across a wide range of classification tasks. As it turned out, this approach was quite robust and accurate, so we kept it as the main model for this approach.

  3. A large language modeling approach, where we train a model like BERT (Devlin et al. 2018) to map the descriptions for given IRI and FNDDS items to the FNDDS category the supplied IRI item belongs to.

Our approach

As we explored the data provided, we opted to use the given 2017–2018 PPC dataset as our primary dataset for both training and testing. To ensure a fair evaluation of the model, we randomly split the dataset into 60% training samples and 40% testing samples, making sure our training process never sees the testing dataset. For evaluating our models, we adopted the competition’s metrics: Success@5 and NDCG@5. After months of testing, our statistical classifier (approach #2) proved itself to be the model that both processes the data fastest and achieves the highest performance on our testing metrics.
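
To make these metrics concrete, here is a minimal sketch of how Success@5 and NDCG@5 can be computed for a single item with one correct label; the function names and example codes below are illustrative, not part of the competition tooling.

```python
# Minimal sketch of Success@5 and NDCG@5 for one item with a single correct label.
import numpy as np

def success_at_k(true_label, ranked_labels, k=5):
    # 1 if the correct FNDDS code appears anywhere in the top-k predictions.
    return float(true_label in ranked_labels[:k])

def ndcg_at_k(true_label, ranked_labels, k=5):
    # With a single relevant item, the ideal DCG is 1 (correct code at rank 1),
    # so NDCG reduces to 1 / log2(rank + 1) for the rank of the correct code.
    for rank, label in enumerate(ranked_labels[:k], start=1):
        if label == true_label:
            return 1.0 / np.log2(rank + 1)
    return 0.0

# Hypothetical example: the correct code is ranked third out of five guesses.
ranked = ["11111111", "22222222", "12345678", "33333333", "44444444"]
print(success_at_k("12345678", ranked))  # 1.0
print(ndcg_at_k("12345678", ranked))     # 0.5
```

Averaging these per-item scores over the test set gives set-level figures like those we report below.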

This approach, at a high level, takes in the provided data (among other configuration parameters), formats the data in a computer-readable format – converting the IRI and FNDDS descriptions to a numerical representation with word embeddings (Devlin et al. 2018; Mikolov et al. 2013; Pennington, Socher, and Manning 2014) and then using that numerical representation to calculate the distances between each description – and then trains a classification model (random forest (Parmar, Katariya, and Patel 2019)/neural network (Schmidhuber 2015)) that can predict an FNDDS label for a given IRI item.

In terms of data, our approach uses the FNDDS/IRI descriptions, combining them into a single “description” field, and the IRI item’s categorical fields – department, aisle, category, product, brand, manufacturer, and parent company – to further discern between items.

While most industrial methods require use of a graphics processing unit (graphics card, or GPU) to perform this kind of processing, our primary method only requires the computer’s internal processor (CPU) to function properly. With that in mind, to achieve the best possible performance on our test metrics, the most time-consuming operations are run in parallel. The time taken to train our primary model can likely be further improved if we parallelize these operations across a GPU, with the only downside being the imposition of a GPU requirement for systems aiming to run this method.
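
As an illustration of the kind of CPU-side parallelism we mean, here is a minimal sketch that spreads the per-description embedding step across all available cores; the use of joblib and the simplified helper below are illustrative choices, not a description of our exact implementation.

```python
# Minimal sketch of CPU-only parallelism over the per-item embedding step.
import numpy as np
from joblib import Parallel, delayed

def embed_description(description, word_vectors, dim=300):
    # Average the word vectors of the tokens we have embeddings for
    # (word_vectors is assumed to be a dict mapping token -> 300-d vector).
    tokens = [t for t in description.lower().split() if t in word_vectors]
    return np.mean([word_vectors[t] for t in tokens], axis=0) if tokens else np.zeros(dim)

def embed_all(descriptions, word_vectors, n_jobs=-1):
    # n_jobs=-1 uses every CPU core; no GPU is required.
    rows = Parallel(n_jobs=n_jobs)(
        delayed(embed_description)(d, word_vectors) for d in descriptions
    )
    return np.vstack(rows)
```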

In addition to our primary method, our team has worked with alternate approaches on the GPU (using BERT (Devlin et al. 2018), neural networks (Schmidhuber 2015), etc.) to either: 1) speed up the time it takes to process and make inferences for the data, achieving similar performance on our test metrics, or 2) achieve higher performance, likely at a cost to the time it takes to process everything. Our reasoning behind doing so is that if a simple statistical model performs well, then a larger language model should be able to demonstrate a higher performance on our test metrics without much of an increase in training time. At the current time, these methods are still unable to match the performance/efficiency tradeoff of our primary method.

After exploring alternate methods to no avail, our team decided to focus again on our primary method, the random forest (Parmar, Katariya, and Patel 2019), and a secondary method, a feed-forward neural network mapping our input features (X) to the FNDDS labels (Y) (Schmidhuber 2015), and to optimize their training hyperparameters for the dataset. Our aim was to see which of these already-implemented, easier-to-run methods would offer the better performance/efficiency tradeoff once its training hyperparameters were fully tuned. This resulted in a marginal increase in training time (+20-30 minutes) and a roughly 5% increase in performance for our still-highest-performing model, the random forest.
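
For reference, the hyperparameter tuning step looks roughly like the sketch below; the search space and scoring choice shown here are illustrative stand-ins rather than our exact configuration.

```python
# Minimal sketch of a randomized hyperparameter search for the random forest
# (the grid values and scoring metric are illustrative assumptions).
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

param_distributions = {
    "n_estimators": [200, 400, 800],
    "max_depth": [None, 20, 40],
    "min_samples_leaf": [1, 2, 5],
    "max_features": ["sqrt", "log2"],
}

search = RandomizedSearchCV(
    RandomForestClassifier(n_jobs=-1, random_state=42),
    param_distributions=param_distributions,
    n_iter=20,
    scoring="accuracy",  # stand-in; the competition ranks entries by Success@5/NDCG@5
    cv=3,
    n_jobs=-1,
)
# search.fit(X_train, y_train)  # X_train/y_train: processed features and FNDDS labels
# tuned_model = search.best_estimator_
```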

Overall, our primary method – the random forest – gave us an approximate training time (including data pre-processing) of 4 hours 30 minutes for our ~38,000 IRI item training set, and an approximate inference time of 15 minutes on our testing set of ~15,000 IRI items. Furthermore, our method gave us a Success@5 score of 0.789 and an NDCG@5 score of 0.705 on our testing set.

Key features

Here is a list of the key features we utilize, along with the type of data we treat each one as.

  • FNDDS
    • food_code – identifier
    • main_food_description – text
    • additional_food_description – text
    • ingredient_description – text
  • IRI
    • upc – identifier
    • upcdesc – text
    • dept – categorical
    • aisle – categorical
    • category – categorical
    • product – categorical
    • brand – categorical
    • manufacturer – categorical
    • parent – categorical

The intuition behind using these particular features is that the text-based descriptions provide the majority of the “meaning” of the item. By converting each description to a numerical representation (Mikolov et al. 2013; Pennington, Socher, and Manning 2014), we can then calculate the similarity between each “meaning” to determine which FNDDS label is most similar to the IRI item provided. However, that alone is not enough. The categorical features on the IRI item help to further enhance the model’s classifications using the logic and categories people use in places like grocery stores. For example, given an item whose aisle is “fruit” and brand is “Dole”, the item could reasonably be expected to be something like “peaches” rather than “broccoli”.

Feature selection

Aforementioned intuition aside, our feature selection was rather naive, in that we manually examined the data and removed any redundant text features before doing anything else. After that, we decided to use the description fields as “text” data to comprise the main “meaning” of the item, represented numerically after converting the text using a word embedding (Mikolov et al. 2013; Pennington, Socher, and Manning 2014). We also decided to use the non-description fields (aisle, category, etc.) as “categorical” data that would be turned into its own numerical representation, allowing our model to discern between items in much the same way people do.

Feature transformations

Our feature transformations are also relatively simple. First, we combine all description fields for each item to make one large description, and then use a word embedding method (like GloVe (Pennington, Socher, and Manning 2014) or BERT (Devlin et al. 2018)) to convert the description into a numerical representation, resulting in a 300-dimensional GloVe or 768-dimensional BERT vector of numbers for each description. Then, for each IRI item, we calculate the cosine and Euclidean distances from each FNDDS item, resulting in two vectors, both equal in length to the original FNDDS data (in this case, two vectors of length ~7,300). The intuition behind this is that while cosine and Euclidean distances can tell us similar things, providing both of these sets of distances to the model should allow it to pick up on a more nuanced set of relationships between the IRI and FNDDS items.
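
A minimal sketch of this step, assuming the per-description embeddings have already been computed (the array and function names are ours for illustration):

```python
# Minimal sketch of the distance features built from the description embeddings.
import numpy as np
from scipy.spatial.distance import cdist

def distance_features(iri_emb, fndds_emb):
    # iri_emb:   (n_iri, 300) averaged GloVe vectors, one row per IRI description
    # fndds_emb: (n_fndds, 300) averaged GloVe vectors, one row per FNDDS description
    cos = cdist(iri_emb, fndds_emb, metric="cosine")     # (n_iri, ~7,300)
    euc = cdist(iri_emb, fndds_emb, metric="euclidean")  # (n_iri, ~7,300)
    # Each IRI item gets both distance vectors, concatenated side by side.
    return np.hstack([cos, euc])                         # (n_iri, ~14,600)
```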

For categorical data, we take all unique values in each field and assign each an ID number. While that is often not the best practice for making a numerical representation out of categorical data (Potdar, Pardawala, and Pai 2017), it seemed to work well enough for the downstream model.
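
Concretely, the encoding amounts to something like the following sketch; pandas factorize is one way to do it, and the implementation details here are illustrative.

```python
# Minimal sketch of the integer ID encoding for the IRI categorical fields.
import pandas as pd

CATEGORICAL_COLS = ["dept", "aisle", "category", "product", "brand", "manufacturer", "parent"]

def encode_categoricals(iri_df: pd.DataFrame) -> pd.DataFrame:
    encoded = iri_df[CATEGORICAL_COLS].copy()
    for col in CATEGORICAL_COLS:
        # Assign every unique value in the column its own integer ID.
        encoded[col] = pd.factorize(encoded[col])[0]
    return encoded
```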

Regardless, the aforementioned feature transformations give us roughly 14,900 features in total if we use GloVe and roughly 15,300 features if we use BERT. Both feature sets can then be sent to the downstream random forest/neural network to start classifying items.

It should be noted that processing the data is by far the most time-consuming part of our method. The data processing times for each embedding are as follows:

  • GloVe: ~3 hours
  • BERT: ~6 hours

Due to BERT both taking much longer to process the data and performing worse than our GloVe embeddings on the classification task, we opt to use GloVe embeddings for our primary method. Our only theoretical explanation here is that, since BERT is better at context-dependent tasks (Wang, Nulty, and Lillis 2021), it likely expects something similar to well-structured sentences as input, which is not what the IRI/FNDDS descriptions are. GloVe, being a method that depends less on context (Mikolov et al. 2013; Pennington, Socher, and Manning 2014), should fare better when the input text is not a well-formed sentence.

Training methods

Once the data has been processed, we collect the following data for each IRI item:

  • UPC code
  • Description (converted to numerical representation)
  • Categorical variables (converted to numerical representation)
  • Distances to each FNDDS item

Once that has been collected for each IRI item, we can finally use our classification model. We initialize our model and begin the training process with the IRI data mentioned above and the target FNDDS labels for each one, so the model knows what the “correct” answer is for the given data. Once the model has trained on our training dataset, we save the model and it is ready for use.
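
The training step itself is standard; a minimal sketch with scikit-learn is shown below, where the toy data, tree count, and file name are illustrative stand-ins for our actual feature matrix and settings.

```python
# Minimal sketch of training and saving the random forest classifier.
import joblib
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Illustrative stand-ins: in our pipeline, X_train stacks the description
# embeddings, the categorical IDs, and the FNDDS distance vectors for each IRI
# item, and y_train holds the target FNDDS food_code for each item.
rng = np.random.default_rng(0)
X_train = rng.normal(size=(100, 50))    # toy feature matrix
y_train = rng.integers(0, 5, size=100)  # toy FNDDS labels

model = RandomForestClassifier(n_estimators=500, n_jobs=-1, random_state=42)
model.fit(X_train, y_train)
joblib.dump(model, "iri_to_fndds_rf.joblib")  # saved model, ready for inference
```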

This part of training takes much less time than preparing the data, since calculating the embeddings requires far more computation than training a random forest model. The training times for each method are as follows:

  • Random Forest: ~1 hour 15 minutes
  • Neural Network: ~25 minutes

Despite the neural network taking far less time to train than the random forest, it still scores lower on the scoring metrics than the random forest, so we opt to continue using the random forest model as our primary method.

General approach to developing the model

Since the linkage problem involves mapping tens of thousands of items to a smaller category set of a few thousand items, we decided to frame this problem as a multi-class classification problem (Aly 2005), where we then rank the top “k” most probable class mappings, as requested by the competition ruleset.
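
In practice, the ranking falls out of the classifier’s per-class probabilities; a minimal sketch (variable names illustrative):

```python
# Minimal sketch of turning class probabilities into a top-k FNDDS ranking.
import numpy as np

def top_k_fndds(model, X, k=5):
    proba = model.predict_proba(X)                       # (n_items, n_classes)
    top_idx = np.argsort(proba, axis=1)[:, ::-1][:, :k]  # indices of the k largest probabilities
    return model.classes_[top_idx]                       # ranked FNDDS codes, one row per item
```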

Most of the usable data available to us is text, so we need a method that can use that text-based information to map IRI items to FNDDS classes accurately. To best accomplish this, we opt to use word embedding techniques to calculate an average numerical representation for each text description (both IRI and FNDDS), so we can calculate distances between each pair of descriptions, giving our model a sense of how similar the descriptions are.

The key “trick” to the model

Since text descriptions hold the most information that can be used to link between an IRI item and an FNDDS item, finding a way to calculate the similarity between each description is paramount to making this method work.

Both distance calculation methods used in this work, cosine and Euclidean distance, are very similar in the type of information encoded, the only major difference being that cosine distance is implicitly normalized and Euclidean distance is not (Qian et al. 2004).
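
In fact, once the vectors are L2-normalized, squared Euclidean distance is exactly twice the cosine distance; a quick numerical check:

```python
# For unit-length vectors: ||u - v||^2 = 2 * (1 - cos(u, v)).
import numpy as np

rng = np.random.default_rng(1)
u, v = rng.normal(size=300), rng.normal(size=300)
u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)

cosine_distance = 1.0 - u @ v
squared_euclidean = np.sum((u - v) ** 2)
print(np.isclose(squared_euclidean, 2.0 * cosine_distance))  # True
```

Since averaged description vectors need not be unit length, the two distances are not fully redundant in practice.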

Notable observations

Just by building the ranking from the cosine similarities between each IRI item and all FNDDS items, we can achieve a Success@5 performance of 0.234 and an NDCG@5 performance of 0.312. The other features, together with the random forest classifier, add extra discriminative power on top of this baseline.
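
For reference, that embedding-only baseline is essentially the following sketch (names illustrative):

```python
# Minimal sketch of the cosine-similarity-only baseline ranking.
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

def baseline_top_k(iri_emb, fndds_emb, fndds_codes, k=5):
    sims = cosine_similarity(iri_emb, fndds_emb)        # (n_iri, n_fndds)
    top_idx = np.argsort(sims, axis=1)[:, ::-1][:, :k]  # most similar FNDDS items first
    return np.asarray(fndds_codes)[top_idx]             # top-k FNDDS codes per IRI item
```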

Data disclaimer

Our current method only uses the data readily available from the 2017–2018 dataset, which we acknowledge is intended for testing. To remedy this, we further split this dataset into train/test sets and report results on our unseen test subset for our primary performance metrics. This gives a decent look into how the model will perform on unseen data.

Our results

Approximate training time

Overall, our approximate training time for our primary method is 4 hours 30 minutes, broken down (approximately) as follows:

  1. Reading data from database: 30 seconds
  2. Calculating ~7,300 FNDDS description embeddings: 15 minutes 45 seconds
  3. Calculating ~38,000 IRI description embeddings and similarity scores: 2 hours 20 minutes 45 seconds
  4. Formatting calculated data for the random forest classifier: 35 minutes
  5. Training the random forest classifier: 1 hour 15 minutes

Approximate inference time

Our approximate inference time for our primary method is 15 minutes to make inferences for ~15,000 IRI items.

S@5 & NDCG@5 performance

This is how our best-performing model (GloVe + random forest) performs at the current time on the testing set:

  • NDCG@5: 0.705
  • Success@5: 0.789

When we evaluate that same model on the full PPC dataset we were provided (~38,000 items), we get the following scores:

  • NDCG@5: 0.879
  • Success@5: 0.916

(Note: The full PPC dataset contains approximately 15,000 items that we used to train the model, so these scores are not as representative of our method’s performance as the previous scores.)

Future work/refinement

As mentioned previously, we only used the given 2017–2018 PPC dataset as our primary dataset for both training and testing. Going forward, we would like to include datasets from previous years as well, which we believe would further increase our model performance. Additionally, the datasets generated from this research have the potential to inform and support additional studies from a variety of perspectives, including nutrition, consumer research, and public health. Further research utilizing these datasets has the potential to make significant contributions to our understanding of consumer behavior and the role of food and nutrient consumption in overall health and well-being.

Lessons learned

It was interesting that the random forest model performed better than the vanilla neural network model. This shows that a simple solution can work better, depending on the application. This observation is in line with the well-established principle in machine learning that the choice of model should be guided by the nature of the problem and the characteristics of the data. In this case, the random forest model, being a simpler and more interpretable model, was better suited to the problem at hand and was able to outperform the more complex neural network model. These results underscore the importance of careful model selection and the need to consider both the complexity of the model and the specific requirements of the problem when choosing an algorithm for a particular application.

About the authors
Alex Knipper and Naman Bansal are PhD students, and Jingyi Zheng, Wenying Li, and Shubhra Kanti Karmaker are assistant professors at Auburn University.
Copyright and licence
© 2023 Alex Knipper, Naman Bansal, Jingyi Zheng, Wenying Li, and Shubhra Kanti Karmaker

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Thumbnail photo by nrd on Unsplash.

How to cite
Knipper, Alex, Naman Bansal, Jingyi Zheng, Wenying Li, and Shubhra Kanti Karmaker. 2023. “Food for Thought: First place winners – Auburn Big Data.” Real World Data Science, August 21, 2023. URL

References

Aly, M. 2005. “Survey on Multiclass Classification Methods, Tech. Rep.” California Institute of Technology.
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” CoRR abs/1810.04805. http://arxiv.org/abs/1810.04805.
Mikolov, T., K. Chen, G. Corrado, and J. Dean. 2013. “Efficient Estimation of Word Representations in Vector Space.” https://arxiv.org/abs/1301.3781.
Parmar, A., R. Katariya, and V. Patel. 2019. “A Review on Random Forest: An Ensemble Classifier.” In International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI) 2018, edited by J. Hemanth, X. Fernando, P. Lafata, and Z. Baig, 758–63. Cham: Springer International Publishing.
Pennington, J., R. Socher, and C. Manning. 2014. “GloVe: Global Vectors for Word Representation.” In Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP), 1532–43. Doha, Qatar: Association for Computational Linguistics. https://doi.org/10.3115/v1/D14-1162.
Potdar, K., T. S. Pardawala, and C. D. Pai. 2017. “A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers.” International Journal of Computer Applications 175 (4): 7–9. https://doi.org/10.5120/ijca2017915495.
Qian, G., S. Sural, Y. Gu, and S. Pramanik. 2004. “Similarity Between Euclidean and Cosine Angle Distance for Nearest Neighbor Queries.” In Proceedings of the 2004 ACM Symposium on Applied Computing, 1232–37. SAC ’04. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/967900.968151.
Schmidhuber, J. 2015. “Deep Learning in Neural Networks: An Overview.” Neural Networks 61: 85–117. https://doi.org/10.1016/j.neunet.2014.09.003.
Wang, C., P. Nulty, and D. Lillis. 2021. “A Comparative Study on Word Embeddings in Deep Learning for Text Classification.” In Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval, 37–46. NLPIR ’20. New York, NY, USA: Association for Computing Machinery. https://doi.org/10.1145/3443279.3443304.