DeepFFTLink team members: Yang Wu and Kai Zhang are PhD students at Worcester Polytechnic Institute. Aishwarya Budhkar is a PhD student at Indiana University Bloomington. Xuhong Zhang is an assistant professor at Indiana University Bloomington. Xiaozhong Liu is an associate professor at Worcester Polytechnic Institute.
Perspective on the challenge
Text matching is an essential task in natural language processing (NLP; Pang et al. 2016), while record linkage across different sources is an essential task in data science. Machine learning techniques allow data to be combined faster and more cheaply than manual linkage. However, in the context of the Food for Thought challenge, existing methods for matching universal product codes (UPCs) to ensemble codes (ECs) require every UPC to be compared with every EC (Figure 1a). Such approaches can be computationally expensive to train, especially when the data is noisy. Here, we propose an ensemble model with a category-based adapter to tackle this problem, drawing on the category information included in the UPC and EC data. The category-based adapter first matches each UPC against only a small, reliable set of candidate ECs (Figure 1b); an ensemble model then makes predictions for the resulting UPC-EC pairs. Our proposed approach achieves competitive performance compared with state-of-the-art models.
Our approach
We propose a two-step framework to address this problem. First, a category-based adapter retrieves a set of reliable candidate ECs for each UPC. Then, an ensemble model (Dietterich 2000) makes a prediction for each UPC-EC pair.
Category-based adapter
Using the 2015–2016 UPC-EC data, we created a knowledge base: a table pairing each UPC category with the ECs it has been linked to, from which candidate ECs are generated. Within this setting, each UPC category is related, on average, to only 32 ECs. This knowledge base is then used as context to further filter the candidate ECs. Note that new ECs are introduced year by year; because their contextual information does not exist in our knowledge base, they must also be included among the potential ECs in the UPC-EC matching task.
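As a rough illustration, here is a minimal sketch of how such a category-based adapter could work, assuming the linked 2015–2016 data is available as (UPC category, EC) pairs. The field names and values are hypothetical, not the authors' actual schema.

```python
from collections import defaultdict

# Illustrative linked pairs from the 2015-2016 data (invented values).
linked_2015_2016 = [
    ("carbonated beverages", "EC_1001"),
    ("carbonated beverages", "EC_1002"),
    ("bread and baked goods", "EC_2001"),
]

# Knowledge base: UPC category -> set of ECs historically linked to it.
knowledge_base = defaultdict(set)
for category, ec in linked_2015_2016:
    knowledge_base[category].add(ec)

def candidate_ecs(upc_category, new_ecs=()):
    """Candidate ECs for a UPC: the ECs historically linked to its
    category, plus any newly introduced ECs absent from the knowledge base."""
    return knowledge_base.get(upc_category, set()) | set(new_ecs)

print(candidate_ecs("carbonated beverages", new_ecs={"EC_1003"}))
```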
Ensemble model
We ensemble a base-string match model and a BERT model. BERT is a deep learning model for natural language processing (Devlin et al. 2018). In the base-string match model, we used the term frequency–inverse document frequency (TF-IDF) of each UPC and EC description as features to calculate pairwise cosine similarity, a measure of closeness between instances. Meanwhile, we used features extracted from the UPC and EC descriptions to fine-tune the BERT base model and calculated the cosine similarity between each UPC and EC embedding. We then rank the ECs for each UPC by their similarity scores.
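As a rough sketch of the two base scorers (not the authors' released code; see the GitHub repository below for that), the following computes both similarity matrices with scikit-learn and Hugging Face Transformers. The descriptions are invented, and the BERT fine-tuning step is omitted for brevity.

```python
import torch
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
from transformers import AutoModel, AutoTokenizer

# Toy descriptions standing in for UPC and candidate-EC text.
upc_texts = ["diet cola soft drink 12 oz", "whole wheat sandwich bread"]
ec_texts = ["soft drink, cola, sugar-free", "bread, whole wheat", "milk, whole"]

# Base-string model: TF-IDF vectors over all descriptions, scored with
# pairwise cosine similarity.
vectorizer = TfidfVectorizer().fit(upc_texts + ec_texts)
base_string_scores = cosine_similarity(
    vectorizer.transform(upc_texts), vectorizer.transform(ec_texts)
)

# BERT model: mean-pooled token embeddings, again scored with cosine
# similarity. The off-the-shelf bert-base checkpoint is used here in
# place of the fine-tuned model.
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

def embed(texts):
    # Mean-pool the last hidden states over non-padding tokens.
    batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state
    mask = batch["attention_mask"].unsqueeze(-1)
    return ((hidden * mask).sum(1) / mask.sum(1)).numpy()

bert_scores = cosine_similarity(embed(upc_texts), embed(ec_texts))

# For each UPC, rank candidate ECs by similarity (highest first).
ranking = base_string_scores.argsort(axis=1)[:, ::-1]
```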
Find the code in the Real World Data Science GitHub repository.
Our results
We randomly selected 500 samples from the 2017–2018 UPC-EC data to train the ensemble weight for each model. Two functions were evaluated for fusing the base-string and BERT model scores:
\[ C = a X + b Y \tag{1}\]
\[ C = a \log(X) + b \log(Y) \text{. } \tag{2}\]
\(C\) denotes the final confidence score, \(X\) and \(Y\) represent base_string_similarity_score and BERT_similarity_score, respectively, and \(a\) and \(b\) are the corresponding weights for the base-string and BERT models.
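As a concrete illustration, here is a minimal sketch of the fusion in equation (1), using the ensemble weights reported below; the similarity scores themselves are invented.

```python
import numpy as np

# Fuse the two scorers with the reported ensemble weights (equation 1).
a, b = 0.738, 0.262
X = np.array([0.91, 0.40, 0.15])  # base-string similarity scores (illustrative)
Y = np.array([0.85, 0.55, 0.30])  # BERT similarity scores (illustrative)
C = a * X + b * Y                  # final confidence scores
top5 = np.argsort(-C)[:5]          # candidate ECs ranked by confidence
```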
A better Success@5 was achieved with function (1). The learned weights for the base-string and BERT models are 0.738 and 0.262, respectively, indicating that the base-string model contributes more than the BERT model to the ensemble's predictions (a sketch of this weight search follows the metrics below). The prediction results for the 2017–2018 data are:
- Success@5: 0.727
- NDCG@5: 0.528
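For completeness, here is a minimal sketch of how the ensemble weight could be fitted by grid search on a validation sample, assuming \(b = 1 - a\) (consistent with the reported weights) and a Success@5 helper; the data shapes and the constraint on \(b\) are our assumptions, not necessarily the authors' exact procedure.

```python
import numpy as np

def success_at_k(C, gold, k=5):
    # Fraction of UPCs whose true EC appears among the top-k candidates.
    topk = np.argsort(-C, axis=1)[:, :k]
    return float(np.mean([g in row for g, row in zip(gold, topk)]))

# Synthetic stand-ins for the 500 validation samples (illustrative only).
rng = np.random.default_rng(0)
X_val = rng.random((500, 32))    # base-string scores: 500 UPCs x 32 candidates
Y_val = rng.random((500, 32))    # BERT scores for the same pairs
gold = rng.integers(0, 32, 500)  # index of the true EC for each UPC

# Grid-search a in [0, 1] with b = 1 - a, keeping the weight that
# maximises Success@5 on the validation sample.
best_a = max(np.linspace(0, 1, 101),
             key=lambda a: success_at_k(a * X_val + (1 - a) * Y_val, gold))
print(f"a = {best_a:.3f}, b = {1 - best_a:.3f}")
```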
Total computation time was 6 hours.
Future work
Our next step will focus on adding newly generated EC data to our knowledge base, which should make the model's UPC-EC matching predictions more stable. Our model is unsupervised: it ranks matches by cosine similarity, so no instance-level labels are needed during training. In future work, however, we will label some instances so that the UPC-EC matching task can also be handled in a supervised manner.
Lessons learned
If the data is not complex, simple models may outperform complex ones. For example, in our experiments we found that the base-string model outperforms a standalone RoBERTa (Liu et al. 2019) or BERT model. However, our ensemble model outperforms each individual model, since model fusion aggregates information from multiple models.
Multi-label models may not work well on UPC-EC data. In our early work, we treated the UPC-EC matching task as a multi-label problem: each EC received a binary label indicating whether or not it was an appropriate match, and UPC-EC pairs were mapped into a multi-label table. However, we found that most UPCs have a one-to-one relationship with an EC. The performance of a multi-label model, the Label-Specific Attention Network (LSAN; Xiao et al. 2019), was lower than that of the base-string model on both the Success@5 and NDCG@5 metrics.
- About the authors
- Yang Wu and Kai Zhang are PhD students, and Xiaozhong Liu is an associate professor at Worcester Polytechnic Institute. Aishwarya Budhkar is a PhD student and Xuhong Zhang is an assistant professor at Indiana University Bloomington.
- Copyright and licence
- © 2023 Yang Wu, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu
- This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Thumbnail photo by Hanson Lu on Unsplash.
- How to cite
- Wu, Yang, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu. 2023. “Food for Thought: Second place winners – DeepFFTLink.” Real World Data Science, August 21, 2023. URL