Food for Thought: Competition and challenge design

From structure to setup, to metrics, results, and lessons learned, Zheyuan Zhang and Uyen Le give an overview of the design of the Food for Thought competition.

Machine learning
Natural language processing
Public policy
Health and wellbeing
Author

Zheyuan Zhang and Uyen Le

Published

August 21, 2023

Since 2014, the professional services firm Westat, Inc. has been developing the Purchase to Plate Crosswalk (PPC) for the United States Department of Agriculture (USDA) Economic Research Service (ERS). The PPC links the retail food transactions database from IRI’s InfoScan service and the USDA Food and Nutrient Database for Dietary Studies (FNDDS). However, the current linkage process uses only partly automated data matching, meaning it is resource intensive and time consuming, and requires manual review.

With sponsorship from ERS, Westat partnered with the Coleridge Initiative to host the Food for Thought competition to challenge researchers and data scientists to use machine learning and natural language processing to find accurate and efficient methods for creating the PPC. Figure 1 provides a visual overview of the challenge set by the competition.

Figure 1: Overview of the Food for Thought Competition Challenge.

The one-to-many matching task at the heart of the competition throws up many challenges for researchers to wrestle with. Because the IRI data contains food transactions from partnered retail establishments for over 350,000 items, matches must be made from limited data features: categories, providers, and semantically inconsistent descriptions consisting of short phrases. Consider this hypothetical example: IRI product-related information about a (fictional) “Cheesy Hashbrowns Hamburger Helper, 5.5 Oz Box” needs to be linked to FNDDS nutrition-related information found under “Mixed dishes – meat, poultry, seafood: Mixed meat dishes”. Figure 2 demonstrates how the two databases are linked to create the PPC. As can be seen, there is no common word that easily indicates that “Cheesy Hashbrowns Hamburger Helper…” should be matched with “Mixed dishes…”, and such cases exist in all IRI tables used for the challenge, from 2012 through 2018.

Figure 2: Each universal product code (UPC) from the IRI data could match to only one ensemble code (EC) from the FNDDS data, whereas one EC code could match to multiple UPCs.
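
To make the structure concrete, here is a toy sketch of the linkage, using entirely fictitious UPC and EC codes: every UPC points to exactly one EC, while a single EC can collect many UPCs.

```python
# Toy illustration of the PPC's one-to-many structure (all codes fictitious):
# each UPC maps to exactly one EC, but one EC may be matched by many UPCs.
ppc_links = {
    "upc_0001": "EC_1001",
    "upc_0002": "EC_1001",  # a second product linked to the same EC
    "upc_0003": "EC_2002",
}

ec_to_upcs = {}
for upc, ec in ppc_links.items():
    ec_to_upcs.setdefault(ec, []).append(upc)

print(ec_to_upcs)
# {'EC_1001': ['upc_0001', 'upc_0002'], 'EC_2002': ['upc_0003']}
```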

Also, because nutritionists or food scientists will always need to review the matches, regardless of the matching method used, it was important that our evaluation of proposed matching methods focused not only on the predictive accuracy of models but also on metrics that would lead participants to develop models that help qualified reviewers reduce their workloads.

Organising the competition was also a challenge in its own right, for data privacy reasons. IRI scanner data contains sensitive information, such as store name, location, unit price, and weekly quantity sold for each item. This ruled out using existing online platforms like Kaggle, DrivenData or AIcrowd to host the competition, and instead required a private secure data enclave to ensure the safe use of sensitive and confidential data assets. The need for such an environment imposed capacity constraints on the competition, meaning only dozens of teams could be invited to take part, whereas on open platforms it is common to have thousands of teams competing and sharing ideas and code.

Competition structure

The competition ran over 10 months and consisted of three separate challenges: two interim, one final. Applications opened in September 2021, and the competition started in January 2022. Submission deadlines for the first and second interim challenges were in July and September 2022, respectively. For these rounds, participants submitted preliminary solutions for evaluation based solely on quantitative metrics, and two awards of $10,000 were given to the highest-scoring teams. The deadline for the final challenge was in October 2022. Here, solutions were evaluated by the scientific review board based on three judging criteria: quantitative metrics, transferability, and innovation. First, second, and third place winners received awards of $30,000, $1,500, and $1,000 respectively. Final presentations were given at the Food for Thought symposium in December 2022.

The competition was run entirely within the Coleridge Initiative’s Administrative Data Research Facility (ADRF), which was established by the United States Census Bureau to inform the decision-making of the Commission on Evidence-Based Policymaking under the Evidence Act. ADRF follows the Five Safes Framework: safe projects, safe people, safe data, safe settings, and safe outputs.

In keeping with this framework, participants were provided with ADRF login credentials after signing the relevant data use agreements during the onboarding process. All participants were required to agree to the ADRF terms of use, to complete security training, and to pass a security training assessment prior to accessing the challenge data. Participants’ access within ADRF was limited to the challenge environment and data only. There was no internet access, so Coleridge Initiative ensured that any packages requested by teams were available for use within the environment after passing security review. All code and documentation could only be exported from ADRF after export reviews by both Coleridge Initiative and USDA staff. At the end of each challenge, teams submitted write-ups and supporting files by placing all the necessary submission files in their ADRF team folder. Detailed submission instructions are available via the Real World Data Science GitHub repository.

Metrics

Submissions were evaluated by Coleridge Initiative and technical review and subject review boards based on the following criteria:

  • Quantitative metrics measured the predictive accuracy and runtime of the model.
  • Transferability measured the quality of documentation and code, and the ability of individuals not involved in model development to replicate and implement the team’s approach.
  • Innovation measured the novelty and creativity of the model in addressing the linkage problem.

Technical review was overseen by faculty members from computer science and engineering departments of top US universities. Subject review was handled by subject matter experts from USDA and Westat.

From a quantitative perspective, the most common way to evaluate machine learning competition submissions is model predictive accuracy. However, a single metric is typically an incomplete description of a real-world task, and simple predictive accuracy can easily hide significant differences between models. To select the most appropriate official challenge metrics, Coleridge Initiative reviewed the literature on the use of evaluation measures in machine learning competitions for both classification and ranking tasks. Success at 5 (S@5) and Normalized Discounted Cumulative Gain at 5 (NDCG@5) scores were ultimately chosen as the quantitative metrics.

The metrics were applied as follows: models proposed by each team were tasked with outputting five potential FNDDS matches for each IRI code, ordered from most likely to least likely. S@5 and NDCG@5 are broadly similar – both measure whether a correct match is present among the five proposed matches. However, S@5 does not take rank position into account: it only considers whether the five proposed FNDDS matches contain the correct FNDDS response. NDCG@5 does take rank into account, measuring how highly the correct FNDDS response is ranked among the five proposed matches. Both measures range from 0 to 1 (or 0% to 100%). A model gets full credit on S@5 as long as its list contains the correct FNDDS option, whereas NDCG@5 penalizes the model when the correct match is ranked lower on the list of five proposed matches.
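
To illustrate how the two metrics behave, here is a minimal sketch of how S@5 and NDCG@5 can be computed for a single ranked list of candidate matches. The UPC and EC codes are hypothetical, and this is one standard formulation of the metrics rather than the official evaluation script.

```python
import math

def success_at_k(ranked_ecs, true_ec, k=5):
    """S@k: 1 if the correct EC appears anywhere in the top-k candidates, else 0."""
    return int(true_ec in ranked_ecs[:k])

def ndcg_at_k(ranked_ecs, true_ec, k=5):
    """NDCG@k with a single relevant item: credit is discounted by the
    (1-indexed) rank of the correct EC; the ideal DCG is 1, so no further
    normalisation is needed."""
    for rank, ec in enumerate(ranked_ecs[:k], start=1):
        if ec == true_ec:
            return 1.0 / math.log2(rank + 1)
    return 0.0

# Hypothetical submission: five candidate EC codes per UPC, most likely first.
predictions = {"upc_0001": ["EC_1001", "EC_2002", "EC_3003", "EC_4004", "EC_5005"]}
ground_truth = {"upc_0001": "EC_2002"}

for upc, ranked in predictions.items():
    print(upc,
          "S@5 =", success_at_k(ranked, ground_truth[upc]),
          "NDCG@5 =", round(ndcg_at_k(ranked, ground_truth[upc]), 3))
# upc_0001 S@5 = 1 NDCG@5 = 0.631  (correct match present, but ranked second)
```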

Technical description

Environment setup

Coleridge Initiative solicited technical requirements from participants at the challenge application stage to prepare the ADRF environment as much as possible before the competition began. Each team was asked to share anticipated workspace specifications and software library requests in their application package. From this we identified, reviewed, and installed the requested Python and R packages, libraries, and library components (e.g., pre-trained models, training data) that were not yet available within ADRF.

The setup of graphics processing units (GPUs) was also a critical part of competition preparation. We created an environment with 16 gibibytes (GiB) of GPU memory for each team. Our technology team met with multiple teams several times to discuss computing environment configurations and ensure the GPUs worked properly. None of this effort was wasted: without GPU access, it would have been impossible for teams to use state-of-the-art pre-trained models such as Bidirectional Encoder Representations from Transformers (BERT; Devlin et al. 2018).
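
The finalists’ solutions are described later in this collection; purely as an illustration of the kind of GPU-backed workload the environment had to support, the sketch below embeds a product description with a pre-trained BERT model via the Hugging Face transformers library. The library choice and the item text are our own illustrative assumptions, not any team’s actual pipeline.

```python
# Illustration only: embedding a (fictional) IRI item description with a
# pre-trained BERT model on the GPU, using the Hugging Face transformers library.
import torch
from transformers import AutoTokenizer, AutoModel

device = "cuda" if torch.cuda.is_available() else "cpu"
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased").to(device)
model.eval()

descriptions = ["CHEESY HASHBROWNS HAMBURGER HELPER 5.5 OZ BOX"]  # fictional item text
batch = tokenizer(descriptions, padding=True, truncation=True,
                  return_tensors="pt").to(device)
with torch.no_grad():
    # Use the [CLS] token vector as a fixed-length embedding of the description.
    embeddings = model(**batch).last_hidden_state[:, 0, :]
print(embeddings.shape)  # torch.Size([1, 768])
```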

We completed the setup of new team workspaces, each customized to the individual team’s resource and library requirements, including GPU configuration. Isolating and customizing workspaces was vital because teams might request different versions of libraries that could conflict with one another. We made sure the configurations were in place before the challenge began because such data challenges are bursty in nature (MacAvaney et al. 2021), and handling support requests inside the private data enclave risked causing delays. We hoped to avoid a flood of requests in the opening phase of the competition so as to give participants a better experience, though we did of course provide participants with instructions on how to request additional libraries during the challenge period.

Supporting materials

In addition to environment preparation, we made available a set of supporting documentation, including the IRI, PPC, and FNDDS codebooks, technical reports, and related publications to help teams understand the challenge datasets. The FNDDS codebook pooled information on variable availability, coding, and descriptions across dataset files and years. It also included internal Westat food category coding difficulty ratings, notes on the created PPC codes, and general remarks and observations about the UPC codes, EC codes, and datasets that might otherwise take analysts time to discover on their own.

We developed a baseline model to demonstrate the challenge task and the expected outputs – both outside of ADRF using FNDDS and fictitious data in place of IRI data, and an analogous model using FNDDS and IRI data within the ADRF secure environment. Moreover, we provided the teams with an evaluation script to read in their submissions and evaluate them for predictive accuracy against the public test set using S@5 and NDCG@5 challenge metrics. Finally, we held multiple webinars during the course of the challenge to explain next steps, address participant questions, solicit feedback, and provide general support. Multiple teams also met with our technology team to clarify ADRF-related questions or troubleshoot technical issues.

(Baseline model, toolkits, and evaluation script are available from the Real World Data Science GitHub repository.)
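
The official baseline is available from the repository linked above and is not reproduced here. As a rough indication of what a very simple matcher of this kind can look like, the sketch below ranks fictitious EC descriptions against a fictitious IRI item using TF-IDF and cosine similarity.

```python
# Illustration only: a very simple TF-IDF nearest-neighbour matcher, using
# fictitious EC codes and descriptions and a fictitious IRI item.
# This is NOT the official baseline model from the GitHub repository.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

ec_codes = ["EC_1", "EC_2", "EC_3"]
ec_texts = ["Mixed dishes - meat, poultry, seafood: Mixed meat dishes",
            "Hamburger on bun, with cheese",
            "Potato, hash browns"]
iri_texts = ["CHEESY HASHBROWNS HAMBURGER HELPER 5.5 OZ BOX"]

vectorizer = TfidfVectorizer(lowercase=True, ngram_range=(1, 2))
ec_matrix = vectorizer.fit_transform(ec_texts)      # fit vocabulary on EC descriptions
iri_matrix = vectorizer.transform(iri_texts)        # project IRI descriptions into it

scores = cosine_similarity(iri_matrix, ec_matrix)   # one row of scores per IRI item
top5 = scores.argsort(axis=1)[:, ::-1][:, :5]       # indices of best-scoring ECs first
print([[ec_codes[j] for j in row] for row in top5])
```

Note that a purely lexical matcher like this would rank the “Hamburger on bun, with cheese” category above the correct “Mixed dishes” category for the hypothetical item from Figure 2, which echoes the no-common-word difficulty described earlier.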

Data splitting

To mimic the real-world scenario, in which the data change over time and USDA provides the most recent data available, the competition used the 2012–2016 IRI data as the training set and the 2017–2018 IRI data as the test set. To make sure that models were generalizable and not simply overfit to the test set, we split the test set into private and public test sets, guaranteeing that models were evaluated on completely hidden data. To keep the distributions of the two sets similar, we first divided the data into quintiles based on EC code frequencies and then randomly sampled 80% of the records in each group, without repetition, for placement into the private test set. Later in the competition, because of computation limits, we further shrank the private test set to 40% of its original size using the same data-splitting method.
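
A minimal sketch of this stratified split, assuming the test records sit in a pandas DataFrame with a hypothetical ec_code column (the actual splitting script is not reproduced here):

```python
# Sketch of a frequency-stratified public/private split. The column name
# `ec_code` is a hypothetical stand-in for the EC identifier in the test data.
import pandas as pd

def split_test_set(test_df: pd.DataFrame, private_frac: float = 0.8, seed: int = 42):
    # Frequency of each EC code in the test data.
    freq = test_df["ec_code"].map(test_df["ec_code"].value_counts())
    # Bin records into (up to) five quintiles of EC frequency so that both
    # splits contain a similar mix of common and rare EC codes.
    strata = pd.qcut(freq, q=5, labels=False, duplicates="drop")
    # Sample the private share within each stratum, without replacement.
    private_idx = (test_df.groupby(strata, group_keys=False)
                          .apply(lambda g: g.sample(frac=private_frac, random_state=seed))
                          .index)
    private_set = test_df.loc[private_idx]
    public_set = test_df.drop(private_idx)
    return public_set, private_set
```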

Judging

In the first two rounds, submissions were evaluated on the quantitative metrics described above. Coleridge Initiative was responsible for running the evaluation script, making sure not to re-train the model or modify its configuration in any way, and only applying the model to predict the private test set. Prediction results were then compared against the ground truth to obtain the private scores.

The final challenge was reviewed by the scientific review board on all three judging criteria. Submitted models were first evaluated by Coleridge Initiative in the same way as in the first two rounds, with model runtime also recorded as an assessment of model cost. The scientific review board then assessed the quality of documentation and code and the ability to replicate and implement the team’s approach, and scored the models for innovation and creativity in addressing the linkage problem. Lastly, scores were summarized, and the scientific review board discussed and decided the winners of the competition.

Results

The next few articles in this collection walk readers through the solutions proposed by competition finalists. Figure 3 provides a brief summary.

Figure 3: Top competitors and their solutions to the Food for Thought challenge.

Lessons learned

It was undoubtedly challenging for teams to work with highly secured data in a private data enclave for this data challenge. We solicited feedback from the teams and summarized the issues we experienced throughout the competition, together with the solutions used to resolve them. Our main lessons learned are below; we hope this summary can help inform future competitions.

  • Environmental factors: The installation and setup of packages, libraries, and resources, as well as the configuration of GPUs, system dependencies, and workspace design, were expected to take a long time, as each team had its own needs. To accelerate the process, we requested a list of specific package and environment requirements from the teams in advance. Even so, because of the complexity of the system configurations the teams required, environment setup took longer than expected, and the challenge deadlines had to be postponed a few times to accommodate this.

  • Time commitment: Twelve teams were selected to participate in the challenge, but only three remained in the final challenge. Other than one team that was disqualified for violating the ADRF terms of use agreement, eight dropped out because of other commitments and insufficient time to participate meaningfully. To ensure security, ADRF does not allow jobs to run in the background, which also added to the time commitment required of teams. To encourage teams to take part in the final challenge, we gave out additional awards for second and third places.

  • Computing resource limit: One issue encountered in evaluating submitted models was the computing resource limits imposed by the secured nature of the data enclave. The original private test dataset was four times larger than the public test dataset, making it infeasible to evaluate within the available resources. To overcome this issue, given the fixed resource constraints, we decided to reduce the private test set to 40% of its original size. It would have been helpful, though, if the competition had set a model runtime limit at the outset, so that participants would have been encouraged to build simpler yet effective models.

  • Supporting code: Although the initial baseline model we provided was extremely simple, we found it helped participants a lot in the initial phase – yet there is room for improvement. Specifically, supporting code should use all relevant data tables and should make clear the main function for running the code, especially how the model should be tested. The teams trained only on the main table, the only table used in the baseline model, and did not touch the other supporting table; had we included that table in the baseline model, it could have helped participants make better use of the data as well. In addition, a baseline model should be intuitive for participants to follow, allowing evaluators to easily swap the public test set for the private test set without any programming modifications.

About the authors
Zheyuan Zhang and Uyen Le are research scientists at the Coleridge Initiative.
Copyright and licence
© 2023 Zheyuan Zhang and Uyen Le

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.

How to cite
Zhang, Zheyuan, and Uyen Le. 2023. “Food for Thought: Competition and challenge design.” Real World Data Science, August 21, 2023. URL

References

Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. “BERT: Pre-Training of Deep Bidirectional Transformers for Language Understanding.” CoRR abs/1810.04805. http://arxiv.org/abs/1810.04805.
MacAvaney, S., A. Mittu, G. Coppersmith, J. Leintz, and P. Resnik. 2021. “Community-Level Research on Suicidality Prediction in a Secure Environment: Overview of the CLPsych 2021 Shared Task.” In Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access.