Food for Thought: Second place winners – DeepFFTLink

PhD students and faculty from Worcester Polytechnic Institute and Indiana University Bloomington describe their solution to the Food for Thought challenge: an ensemble of base string matching and BERT models.

Machine learning
Natural language processing
Public policy
Health and wellbeing
Author

Yang Wu, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu

Published

August 21, 2023

DeepFFTLink team members: Yang Wu and Kai Zhang are PhD students at Worcester Polytechnic Institute. Aishwarya Budhkar is a PhD student at Indiana University Bloomington. Xuhong Zhang is an assistant professor at Indiana University Bloomington. Xiaozhong Liu is an associate professor at Worcester Polytechnic Institute.

Perspective on the challenge

Text matching is an essential task in natural language processing (NLP; Pang et al. 2016), and record linkage across different sources is an equally central task in data science. Machine learning techniques allow data to be combined faster and more cheaply than manual linkage. However, in the context of the Food for Thought challenge, existing methods for matching universal product codes (UPCs) to ensemble codes (ECs) require every UPC to be compared with every EC (Figure 1a). Such approaches can be computationally expensive to train, particularly when the data is noisy. Here, we propose an ensemble model with a category-based adapter to tackle this problem, drawing on the category information included in the UPC and EC data. The category-based adapter first matches each UPC against only a small and reliable set of candidate ECs (Figure 1b); an ensemble model then makes the final predictions for the UPC-EC matches. Our proposed approach achieves competitive performance compared with state-of-the-art models.


Figure 1: A toy example of our method. Panel (a) shows the traditional matching method, while panel (b) shows our proposed ensemble model with a category-based adapter. With the help of the adapter, UPC 1 only needs to be matched with EC 1 and EC 3.

Our approach

We propose a two-step framework to address this problem. First, a category-based adapter retrieves a reliable set of candidate ECs for each UPC. Then, an ensemble model (Dietterich 2000) makes a prediction for each UPC-EC pair.

Category-based adapter

Using the 2015–2016 UPC-EC data, we created a knowledge base: a pairwise table of UPC categories and ECs used to generate candidate ECs. Within this setting, each UPC category is, on average, related to only 32 ECs. This knowledge base is then used as context to further filter the candidate ECs. Note that new ECs are introduced year by year; because the contextual information for these new ECs does not exist in our knowledge base, they are also kept as potential candidates in the UPC-EC matching task.
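
To make the adapter concrete, below is a minimal sketch of how such a knowledge base could be built and queried. The data frame `pairs_2015_2016`, its column names, and the `candidate_ecs` helper are illustrative placeholders rather than the team's actual schema.

```python
from collections import defaultdict

import pandas as pd

# Hypothetical 2015-2016 linkage table with one row per matched UPC-EC pair.
# Column names are illustrative placeholders, not the challenge's real schema.
pairs_2015_2016 = pd.DataFrame(
    {
        "upc_category": ["frozen pizza", "frozen pizza", "yogurt"],
        "ec_code": ["EC1", "EC3", "EC7"],
    }
)

# Build the knowledge base: UPC category -> set of candidate ECs.
knowledge_base = defaultdict(set)
for category, ec in zip(pairs_2015_2016["upc_category"], pairs_2015_2016["ec_code"]):
    knowledge_base[category].add(ec)


def candidate_ecs(upc_category, new_ecs=()):
    """Return the candidate ECs for a UPC's category.

    ECs introduced after 2015-2016 are always appended, because the
    knowledge base has no contextual information about them.
    """
    return knowledge_base.get(upc_category, set()) | set(new_ecs)


print(candidate_ecs("frozen pizza", new_ecs={"EC9"}))  # {'EC1', 'EC3', 'EC9'}
```

With a mapping like this, each UPC only needs to be scored against the candidates returned by `candidate_ecs`, rather than against every EC.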

Ensemble model

We ensemble a base-string matching model and a BERT model. BERT is a deep learning model for natural language processing (Devlin et al. 2018). In the base-string matching model, we used the term frequency-inverse document frequency (TF-IDF) of each UPC and EC description as features and calculated a pairwise cosine similarity, which measures how close two vector representations are. In parallel, we used features extracted from the UPC and EC descriptions to fine-tune the BERT base model and calculated the cosine similarity between each UPC embedding and EC embedding. We then rank the ECs for each UPC by their similarity scores.
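
The sketch below illustrates both scoring routes on toy descriptions. It uses scikit-learn for the TF-IDF cosine similarities and a generic sentence-transformers checkpoint as a stand-in for the team's fine-tuned BERT base model; the model name and the example descriptions are assumptions for illustration only.

```python
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Toy descriptions; in practice these would be the UPC and candidate EC texts.
upc_descriptions = ["frozen cheese pizza 12 inch"]
ec_descriptions = ["pizza, cheese, frozen", "yogurt, plain, lowfat", "pizza, pepperoni, frozen"]

# Base-string model: TF-IDF features + pairwise cosine similarity.
vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(upc_descriptions + ec_descriptions)
upc_vecs, ec_vecs = tfidf[: len(upc_descriptions)], tfidf[len(upc_descriptions):]
string_scores = cosine_similarity(upc_vecs, ec_vecs)  # shape (n_upcs, n_ecs)

# BERT-style model: embed the descriptions and compare the embeddings.
# "all-MiniLM-L6-v2" is only a stand-in for the fine-tuned BERT base model.
encoder = SentenceTransformer("all-MiniLM-L6-v2")
upc_emb = encoder.encode(upc_descriptions)
ec_emb = encoder.encode(ec_descriptions)
bert_scores = cosine_similarity(upc_emb, ec_emb)

# Rank the candidate ECs for the first UPC under each model.
print(np.argsort(-string_scores[0]))  # base-string ranking
print(np.argsort(-bert_scores[0]))    # BERT ranking
```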

Figure 2: The framework of our proposed model. A two-step strategy is used to make the final prediction.

Our results

We randomly selected 500 samples from the 2017–2018 UPC-EC data to learn the ensemble weights for each model. Two fusion functions were used to combine the base-string and BERT models:

\[ C = aX + bY \tag{1}\]

\[ C = a\log(X) + b\log(Y). \tag{2}\]

\(C\) denotes the final confidence score, \(X\) and \(Y\) denote the base-string similarity score and the BERT similarity score, respectively, and \(a\) and \(b\) are the corresponding model weights for the base-string and BERT models.
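
As a concrete illustration, here is a minimal sketch of how the fusion in equation (1) could be applied and how the weights might be chosen by a simple grid search on a held-out sample. It additionally assumes \(b = 1 - a\), which is consistent with, but not confirmed by, the reported weights; the validation arrays and the `success_at_5` helper are hypothetical.

```python
import numpy as np


def fuse(a, string_scores, bert_scores):
    """Equation (1): C = a * X + b * Y, assuming b = 1 - a."""
    return a * string_scores + (1.0 - a) * bert_scores


def success_at_5(scores, true_index):
    """1 if the true EC appears among the top 5 ranked candidates, else 0."""
    return int(true_index in np.argsort(-scores)[:5])


# Hypothetical validation set: per-UPC score vectors over 40 candidate ECs
# plus the index of the correct EC (random data, for illustration only).
rng = np.random.default_rng(0)
val_string = [rng.random(40) for _ in range(500)]
val_bert = [rng.random(40) for _ in range(500)]
val_truth = [rng.integers(40) for _ in range(500)]

# Grid search over the fusion weight a.
best_a, best_score = 0.0, -1.0
for a in np.linspace(0.0, 1.0, 101):
    score = np.mean([success_at_5(fuse(a, x, y), t)
                     for x, y, t in zip(val_string, val_bert, val_truth)])
    if score > best_score:
        best_a, best_score = a, score

print(f"best a = {best_a:.2f}, b = {1 - best_a:.2f}")
```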

Function (1) achieved the better Success@5. The learned weights for the base-string model and the BERT model are 0.738 and 0.262, respectively, indicating that the base-string model contributes more than the BERT model to the ensemble's predictions. The prediction results for the 2017–2018 data are:

  • Success@5: 0.727
  • NDCG@5: 0.528

The total computation time was 6 hours.
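
For reference, the two reported metrics can be computed per UPC from a ranked candidate list as in the short sketch below. This follows the standard definitions with binary relevance (an assumption; the official challenge scorer may differ in its handling of graded relevance).

```python
import numpy as np


def success_at_k(ranked_ecs, relevant_ecs, k=5):
    """1 if any relevant EC appears in the top-k candidates, else 0."""
    return int(any(ec in relevant_ecs for ec in ranked_ecs[:k]))


def ndcg_at_k(ranked_ecs, relevant_ecs, k=5):
    """NDCG@k with binary relevance (an assumption made for this sketch)."""
    gains = [1.0 if ec in relevant_ecs else 0.0 for ec in ranked_ecs[:k]]
    dcg = sum(g / np.log2(i + 2) for i, g in enumerate(gains))
    ideal = sum(1.0 / np.log2(i + 2) for i in range(min(len(relevant_ecs), k)))
    return dcg / ideal if ideal > 0 else 0.0


# Hypothetical example: the correct EC is ranked third for this UPC.
ranked = ["EC4", "EC1", "EC3", "EC9", "EC7"]
print(success_at_k(ranked, {"EC3"}))          # 1
print(round(ndcg_at_k(ranked, {"EC3"}), 3))   # 0.5
```

Averaging these per-UPC values over the evaluation set gives the Success@5 and NDCG@5 figures reported above.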

Future work

Our next step will focus on adding the newly generated EC data into our knowledge base, which should make the model's predictions for UPC-EC matching more stable. Our current model is unsupervised: it ranks matches by cosine similarity, so no labels are needed in the training process. In future work, however, we plan to label some instances and tackle the UPC-EC matching task in a supervised manner.

Lessons learned

  1. If the data is not complex, simple models may outperform complex ones. For example, in our experiments we found that the base-string model outperforms a single RoBERTa (Liu et al. 2019) or BERT model. However, our ensemble model outperforms each individual model, since model fusion aggregates information from multiple models.

  2. Multi-label models may not work well on the UPC-EC data. In our early work, we treated the UPC-EC matching task as a multi-label problem: each EC was given a binary label indicating whether or not it was an appropriate match, and the UPC-EC pairs were mapped into a multi-label table. However, we found that most UPCs have a one-to-one relationship with an EC. A multi-label model, the Label-Specific Attention Network (LSAN; Xiao et al. 2019), performed worse than the base-string model on both the Success@5 and NDCG@5 metrics.

About the authors
Yang Wu and Kai Zhang are PhD students, and Xiaozhong Liu is an associate professor at Worcester Polytechnic Institute. Aishwarya Budhkar is a PhD student and Xuhong Zhang is an assistant professor at Indiana University Bloomington.
Copyright and licence
© 2023 Yang Wu, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Thumbnail photo by Hanson Lu on Unsplash.

How to cite
Wu, Yang, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu. 2023. “Food for Thought: Second place winners – DeepFFTLink.” Real World Data Science, August 21, 2023. URL

References

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” CoRR abs/1810.04805. http://arxiv.org/abs/1810.04805.
Dietterich, T. G. 2000. “Ensemble Methods in Machine Learning.” In Multiple Classifier Systems, 1–15. Berlin, Heidelberg: Springer Berlin Heidelberg.
Liu, Y., M. Ott, N. Goyal, J. Du, M. Joshi, D. Chen, O. Levy, M. Lewis, L. Zettlemoyer, and V. Stoyanov. 2019. “RoBERTa: A Robustly Optimized BERT Pretraining Approach.” CoRR abs/1907.11692. http://arxiv.org/abs/1907.11692.
Pang, L., Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng. 2016. “Text Matching as Image Recognition.” CoRR abs/1602.06359. http://arxiv.org/abs/1602.06359.
Xiao, L., X. Huang, B. Chen, and L. Jing. 2019. “Label-Specific Document Representation for Multi-Label Text Classification.” In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 466–75. Hong Kong, China: Association for Computational Linguistics. https://doi.org/10.18653/v1/D19-1044.