Food for Thought: The importance of the Purchase to Plate Suite

Andrea Carlson and Thea Palmer Zimmerman outline the policy issues driving the development of the Purchase to Plate Suite of data products and why linking retail scanner data to nutrition information is a time-consuming but crucial task.

Machine learning
Natural language processing
Public policy
Health and wellbeing
Author

Andrea Carlson and Thea Palmer Zimmerman

Published

August 21, 2023

Disclaimer

The findings and conclusions in this publication are those of the authors and should not be construed to represent any official USDA or US Government determination or policy. This research was supported by the US Department of Agriculture’s Economic Research Service and Center for Nutrition Policy and Promotion. Findings should not be attributed to Circana (formerly IRI).

About 600,000 deaths per year in the United States are related to chronic diseases that are linked to poor dietary choices. Many other individuals suffer from diet-related health conditions, which may limit their ability to work, learn, and be physically active (US Department of Agriculture and US Department of Health and Human Services 2020). In recognition of the link between diet and health, in 1974 the Senate Select Committee on Nutrition and Human Needs, originally formed to eliminate hunger, expanded its focus to improving eating habits, nutrition policy and the national diet. Since 1980, the Dietary Guidelines for Americans have been released every five years by the US Departments of Agriculture (USDA) and Health and Human Services (DHHS). The guidelines present “advice on what to eat and drink to meet nutrient needs, promote health, and prevent disease”.

Because there can be economic and social barriers to maintaining a healthy diet, USDA promotes Food and Nutrition Security so that everyone has consistent and equitable access to healthy, safe, and affordable foods that promote optimal health and well-being. A set of data tools called the Purchase to Plate Suite (PPS) supports these goals by enabling the update of the Thrifty Food Plan (TFP), which estimates how much a budget-conscious family of four needs to spend on groceries to ensure a healthy diet. The TFP market basket – consisting of the specific amounts of various food categories required by the plan – forms the basis of the maximum allotment for the Supplemental Nutrition Assistance Program (SNAP, formerly known as the “Food Stamps” program), which provided financial support towards the cost of groceries for over 41 million individuals in almost 22 million households in fiscal year 2022.

The 2018 Farm Act (Agriculture Improvement Act of 2018) requires that USDA reevaluate the TFP every five years using current food composition, consumption patterns, dietary guidance, and food prices, and using approved scientific methods. USDA’s Economic Research Service (ERS) was charged with estimating the current food prices using retail food scanner data (Levin et al. 2018; Muth et al. 2016) and utilized the PPS for this task. The most recent TFP update was released in August 2021 and the revised cost of the market basket was the first non-inflation adjustment increase in benefits for SNAP in over 40 years (US Department of Agriculture 2021).

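The PPS combines datasets to enhance research related to the economics of food and nutrition. There are four primary components of the suite: the Purchase to Plate Crosswalk (PPC), the Purchase to Plate Price Tool (PPPT), the Purchase to Plate National Average Prices for NHANES (PP-NAP), and the Purchase to Plate Ingredient Tool (PPIT).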

The PPC allows researchers to measure the healthfulness of store purchases. On average, US consumers acquire about 75% of their calories from retail stores, and a number of studies link the availability of foods at home to the healthfulness of the overall diet (e.g., Gattshall et al. 2008; Hanson et al. 2005). Thus, understanding the healthfulness of store purchases allows us to understand differences between consumers who purchase healthy foods and those who purchase less healthy foods, and may contribute to better policies that promote healthier food purchases. While healthier diets are linked to a lower risk of disease outcomes (Reedy et al. 2014), other factors such as health care access may also be contributors (Cleary, Liu, and Carlson 2022).

The PPC also forms the basis of the price tool, PPPT – which allows researchers to estimate custom prices for dietary recall studies – and a new ERS data product, the PP-NAP. The national average prices from PP-NAP are used in reevaluating the TFP. By using the PP-NAP with 24-hour dietary recall information from surveys such as What We Eat in America (WWEIA) – the dietary component of the nationally representative National Health and Nutrition Examination Survey (NHANES)¹ – researchers can examine the relationship between the cost of food, dietary intake, and chronic diseases linked to poor diets. The price estimates also allow researchers to develop cost-effective healthy diets such as MyPlate Kitchen.

The final component of the Purchase to Plate Suite, the ingredient tool (PPIT), breaks dietary recall-reported foods back into purchasable ingredients, based on US retail food purchases. The PPIT is also used in the reevaluation of the TFP, and by researchers who want to look at the relationship between reported ingestion of grocery items, cost and disease outcomes using WWEIA/NHANES. More information on the development of the PPC is available in two papers by Carlson et al. (2019, 2022).

The Food for Thought competition aimed to support the development of the PPC – and thus policy-oriented research – by linking retail food scanner data to the USDA nutrition data used to analyze NHANES dietary recall data, specifically the Food and Nutrient Database for Dietary Studies (FNDDS) (US Department of Agriculture, Agricultural Research Service 2018, 2020). In particular, the competition set out to use artificial intelligence (AI) to reduce the human resources needed to create the links for the PPC, while still maintaining the high-quality standards required for reevaluating the TFP and for data published by ERS (which is one of 13 Principal Statistical Agencies in the United States Federal Government).

Methods used to date

On the surface, the linking process may appear simple: both the FNDDS and retail food scanner data are databases of food. But the scanner data are produced for market research, and the FNDDS for dietary studies. The scanner data include about 350,000 items with sales each year, while the FNDDS has only 10,000–15,000 items. Scanner data relate to specific products, while FNDDS items are often more general. The two datasets also have different hierarchical structures. The FNDDS hierarchy is based around major food groups: dairy; meat, poultry and seafood; eggs; nuts and legumes; grains; fruits; vegetables; fats and oils; and sugars, sweets, and beverages. Items fall into the groups regardless of preparation method or form. That is, broccoli prepared from frozen and from fresh both appear in the vegetable group, and for some fruits and vegetables, the fresh, frozen, canned and dried forms are the same FNDDS item. Vegetable-based mixed dishes, such as broccoli and carrot stir-fry or soup, are also classified in the vegetable group. The scanner data, on the other hand, classify foods by grocery aisle. That is, fresh and frozen broccoli are classified in different areas: produce and frozen vegetables. Similarly, when sold as a prepared food, the broccoli and carrot stir-fry may be found in the frozen entrées, as a kit in either the frozen or produce section, in refrigerated foods, or in all of these.

To allow researchers to import the FNDDS nutrient data into the scanner data, a one-to-many match between FNDDS and scanner data items was needed. The food descriptions in the scanner data include brand names and package sizes and are written as a consumer would pronounce them – e.g., fresh and crisp broccoli florets, ready-cut, 10 oz – versus a more general FNDDS description such as “Broccoli, raw”. (Also linked to the “Broccoli, raw” code would be broccoli sold with stems attached, broccoli spears, and any other way raw broccoli is sold.) In the scanner data, the Universal Product Code (UPC) and the European Article Number (EAN) can link items between tables within the scanner data, as well as between datasets of grocery items, such as the USDA Global Branded Food Products Database, a component of USDA’s FoodData Central. However, these codes are not related to the FNDDS codes, or to any other column within the FNDDS. In other words, before development of the PPC, there were no established linking identifiers.
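The one-to-many relationship described above can be pictured as a lookup table keyed by scanner item. The sketch below is illustrative only: the UPCs, the FNDDS code `F001`, and the calorie value are hypothetical placeholders, not values taken from the PPC, the scanner data, or the FNDDS.

```python
from collections import defaultdict

# Hypothetical FNDDS code and placeholder nutrient value, for illustration only.
fndds_nutrients = {
    "F001": {"description": "Broccoli, raw", "kcal_per_100g": 30.0},
}

# Hypothetical crosswalk rows: scanner item UPC -> FNDDS code.
# Each scanner item has one FNDDS code; one FNDDS code serves many items.
crosswalk = {
    "000000000011": "F001",  # fresh and crisp broccoli florets, ready-cut, 10 oz
    "000000000028": "F001",  # broccoli spears
    "000000000035": "F001",  # broccoli sold with stems attached
}


def nutrients_for_upc(upc):
    """Return the nutrient record linked to a scanner item, or None if unmatched."""
    code = crosswalk.get(upc)
    return fndds_nutrients.get(code) if code is not None else None


def items_by_fndds_code(links):
    """Invert the crosswalk to list the scanner items attached to each FNDDS code."""
    grouped = defaultdict(list)
    for upc, code in links.items():
        grouped[code].append(upc)
    return dict(grouped)
```

Looking up any of the three hypothetical UPCs returns the same "Broccoli, raw" record, which is how nutrient data would flow from one FNDDS item to many retail products.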

Figure 1 shows the process USDA uses to develop matches between scanner data and FNDDS.

Figure 1: Process currently used to create the matches between the USDA Food and Nutrient Database for Dietary Studies (FNDDS) and the retail scanner data (labelled “IRI” for the IRI InfoScan and Consumer Network) product dictionaries. Source: Author provided.

We start the linking process by categorizing the scanner data items into homogeneous groups to make the first round of automated matching more efficient. To save time, we use the second-lowest hierarchical category in the scanner data, which generally divides items within a grocery aisle into homogeneous groups such as produce, canned beans, baking mixes, and bread. Once the linking categories for scanner data are established, we select appropriate items from the FNDDS. Since the FNDDS is highly structured, this selection is usually straightforward.
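As a rough sketch of this grouping step, suppose each scanner item carries a category path ordered from broadest to narrowest. The paths and UPCs below are hypothetical; the actual scanner data hierarchy has its own levels and names.

```python
from collections import defaultdict

# Hypothetical product dictionary rows:
# (UPC, category path ordered from broadest to narrowest level).
scanner_items = [
    ("000000000011", ("Edible", "Produce", "Broccoli")),
    ("000000000028", ("Edible", "Produce", "Carrots")),
    ("000000000042", ("Edible", "Frozen vegetables", "Broccoli")),
    ("000000000059", ("Edible", "Canned beans", "Black beans")),
]


def linking_groups(items):
    """Group UPCs by the second-lowest level of their category path."""
    groups = defaultdict(list)
    for upc, path in items:
        groups[path[-2]].append(upc)
    return dict(groups)


groups = linking_groups(scanner_items)
```

Here the fresh and frozen broccoli land in different linking groups ("Produce" and "Frozen vegetables"), mirroring the aisle-based classification of the scanner data.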

Our next step is to use semantic matching to create a search table that aligns similar terms within the IRI product dictionary and FNDDS. This first requires that we extract attributes from the FNDDS descriptions into fields similar to those in the scanner data product dictionary. The FNDDS descriptions are found across multiple columns because they are added as the need arises to provide examples of brand names or alternative descriptions of foods which help code the foods WWEIA participants report eating. We manually create matching tables that link terms used in FNDDS to those used in the scanner data, organized by the fields defined in the restructured FNDDS. We then use this table as the basis of a probabilistic matching process. For example, when linking the produce group, “fresh” in the scanner data would be aligned with “raw” and “prepared from fresh” and NOT “prepared from frozen” in the FNDDS, and “broccoli florets” would also be aligned with “raw” and “broccoli”. Since the FNDDS is designed to code the foods individuals report eating, many of the foods in the FNDDS are already prepared and result in descriptions such as “broccoli, steamed, prepared from fresh” or “broccoli, boiled, prepared from frozen”.
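To make the idea of a search table concrete, here is a minimal sketch using the produce example above. The table layout (`include` and `exclude` lists per scanner term) and the function name are assumptions made for illustration; the actual tables are organized by the fields of the restructured FNDDS and are more detailed.

```python
# Hypothetical search table for the produce linking group.
# Each scanner term lists FNDDS terms it aligns with and terms it must not match.
produce_table = {
    "fresh": {
        "include": ["raw", "prepared from fresh"],
        "exclude": ["prepared from frozen"],
    },
    "broccoli florets": {
        "include": ["raw", "broccoli"],
        "exclude": [],
    },
}


def is_candidate(scanner_terms, fndds_description, table):
    """Return True if the FNDDS description contains an aligned term
    and no excluded term for any of the scanner terms."""
    text = fndds_description.lower()
    found_aligned_term = False
    for term in scanner_terms:
        rule = table.get(term)
        if rule is None:
            continue  # term not covered by this search table
        if any(excluded in text for excluded in rule["exclude"]):
            return False
        if any(included in text for included in rule["include"]):
            found_aligned_term = True
    return found_aligned_term
```

A filter this simple only narrows the candidates. It would still let "Cauliflower, raw" through for a fresh broccoli item, because "fresh" aligns with "raw", so scoring and review remain necessary.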

Once the linking table is established, the probabilistic match process returns the single best possible match for each item in the scanner data. For example, a match between fresh broccoli florets and frozen broccoli would have a lower probability score than “broccoli, raw”. Because these matches form the basis of major USDA policies, we cannot accept an error rate of more than 5 percent, and lower is preferred. To reach that goal, nutritionists review every match to make sure the probabilistic match did not return a match between cauliflower florets and fresh broccoli, say, or that a broccoli and carrot stir-fry is not matched to a dish with broccoli, carrots, and chicken. The correct matches, such as the one between fresh broccoli florets and raw broccoli, are set aside while the items with an incorrect match, such as cauliflower florets and the broccoli and carrot stir-fry, are used to revise the search table. Revisions might include adding (NOT chicken) to the broccoli and carrot stir-fry dish. Mixed dishes, such as the broccoli and carrot stir-fry, pose particular challenges because there is a wide variety of similar products available in the grocery store. After a few rounds of revising the search table and running the probabilistic match process, it is more efficient to use a manual match, established by one nutritionist and reviewed by another, after which the match is assumed to be correct.
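The sketch below illustrates the "single best match" idea and the 5 percent error check. It is not the probabilistic matching method used for the PPC, which is not specified here. As a stand-in, it scores candidates by token overlap (Jaccard similarity) after applying a tiny, hypothetical term alignment.

```python
import re

# Hypothetical alignment of scanner terms to FNDDS terms (a tiny search table).
ALIGNED = {"fresh": "raw"}
# Hypothetical scanner terms that describe the cut and are ignored when scoring.
IGNORED = {"florets"}


def tokens(text):
    """Split a description into a set of lowercase words."""
    return set(re.findall(r"[a-z]+", text.lower()))


def aligned_tokens(scanner_description):
    """Translate scanner words into FNDDS vocabulary and drop ignored words."""
    return {ALIGNED.get(t, t) for t in tokens(scanner_description) if t not in IGNORED}


def score(scanner_description, fndds_description):
    """Jaccard similarity between aligned scanner words and FNDDS words."""
    a = aligned_tokens(scanner_description)
    b = tokens(fndds_description)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def best_match(scanner_description, fndds_descriptions):
    """Return the single highest-scoring FNDDS description."""
    return max(fndds_descriptions, key=lambda d: score(scanner_description, d))


def error_rate(review_results):
    """Share of reviewed matches marked incorrect (False) by the reviewers."""
    return review_results.count(False) / len(review_results)


candidates = [
    "Broccoli, raw",
    "Broccoli, boiled, prepared from frozen",
    "Cauliflower, raw",
]
chosen = best_match("fresh broccoli florets", candidates)
```

With these toy inputs, "Broccoli, raw" scores highest and the frozen description scores lowest. In a review where 1 of 20 matches is marked incorrect, `error_rate` returns 0.05, which sits exactly at the stated ceiling.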

The process improved with each new wave of FNDDS and IRI data. Our first creation of the PPC linked the FNDDS 2011/12 to the 2013 IRI retail scanner data. Subsequent waves started with the previous search table, and the resulting matches were reviewed by nutritionists. We also used more fields in the IRI product dictionary to create the homogeneous linking groups and made modifications to these groups with each wave. During each wave we experimented to find the most cost-effective number of rounds of probabilistic matching. For some linking groups it took less human time to manually match from the start, while for other groups it was more efficient to do multiple rounds of improvements to the search table. Starting with the most recent wave (matching FNDDS 2017/18 to the 2017 and 2018 retail scanner data), we assumed previous matches appearing in the newer data were correct. Although this assumption held for most matches, a later check showed that previous matches still need to be reviewed before an item is removed from the list of scanner data items needing FNDDS matches. In the future we intend to explore methods developed by the participants of the Food for Thought competition.

Linking challenges

An ongoing challenge is that both the scanner data and the FNDDS undergo substantive changes each year. Tables that work with one cycle of FNDDS and scanner data will need revisions before they can be used with the next cycle, so both the previous matches and the search tables need to be reviewed and revised with each new effort. Changes to the scanner data that impact our current method include dropped and added items, data corrections, and revisions to the categories that form the basis of the homogeneous linking groups. In addition, there are errors such as incorrect food descriptions, conflicting package size information, and changes in the item description from year to year. Since the FNDDS is designed to support dietary recall studies, revisions reflect both changes to available foods and the level of detail respondents can provide. These revisions result in dropped or added food codes, changes to food descriptions that affect which scanner data items match to the FNDDS items, and revisions to recipes used in the nutrient coding, which affect the number of retail ingredients available in the FNDDS.
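One way to picture the year-to-year bookkeeping is to compare two product dictionaries and sort items by what happened to them. This is a sketch under simple assumptions (each dictionary maps a UPC to a description, and all values are hypothetical), not the procedure used for the PPC.

```python
def compare_waves(previous, current):
    """Classify scanner items by how they changed between two waves.

    Both arguments map a UPC to its item description.
    """
    prev_ids, curr_ids = set(previous), set(current)
    shared = prev_ids & curr_ids
    return {
        "dropped": sorted(prev_ids - curr_ids),
        "added": sorted(curr_ids - prev_ids),
        "changed": sorted(u for u in shared if previous[u] != current[u]),
        "unchanged": sorted(u for u in shared if previous[u] == current[u]),
    }


# Hypothetical product dictionaries for two consecutive waves.
wave_a = {
    "000000000011": "broccoli florets 10 oz",
    "000000000028": "broccoli spears 12 oz",
    "000000000035": "cauliflower florets 10 oz",
}
wave_b = {
    "000000000011": "broccoli florets 10 oz",
    "000000000028": "broccoli and carrot stir-fry kit 12 oz",
    "000000000042": "frozen broccoli cuts 16 oz",
}

changes = compare_waves(wave_a, wave_b)
```

Items in the "added" and "changed" lists would need new or re-checked matches. The "unchanged" items are candidates for carrying a match forward, subject to the review described earlier.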

Of the four parts of the PPS, establishing the matches is the most time-consuming task and constitutes at least 60 percent of the total budget. In the most recent round, we had 168 categories and each one went through two to three automated matching rounds; after each round, nutritionists spent an average of two hours reviewing the matches. This adds up to somewhere between 670 and 1,000 hours of review time. After the automated matching and review, manual matching requires an additional 300 hours. Reducing the amount of time required to establish matches and link the FNDDS and retail scanner datasets may lead to significant time savings, resulting in faster data availability. That, in turn, could allow more timely policy-based research and allow the mandated revision of the Thrifty Food Plan to continue with the most recent food price data.
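The review-time range follows directly from the figures in the paragraph above. The short calculation below reproduces it, treating the two-hour average as if it applied to every category and round, and adds the manual matching hours to give a rough total.

```python
CATEGORIES = 168        # linking categories in the most recent round
HOURS_PER_REVIEW = 2    # average review time per category per round
MANUAL_HOURS = 300      # additional manual matching time


def review_hours(rounds):
    """Total review hours if every category goes through `rounds` rounds."""
    return CATEGORIES * rounds * HOURS_PER_REVIEW


low, high = review_hours(2), review_hours(3)
print(low, high)                                # 672 1008
print(low + MANUAL_HOURS, high + MANUAL_HOURS)  # 972 1308
```

On these assumptions, review plus manual matching comes to roughly 970 to 1,310 hours per wave.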

About the authors
Andrea Carlson is an agricultural economist in the Food Markets Branch of the Food Economics Division in USDA’s Economic Research Service. She is the project lead for the Purchase to Plate Suite, which allows users to import USDA nutrient and food composition data into retail food scanner data acquired by USDA and estimate individual food prices for dietary intake data.

Thea Palmer Zimmerman is a senior study director and research nutritionist at Westat.

Image credit
Thumbnail photo by Kenny Eliason on Unsplash.
How to cite
Carlson, Andrea, and Thea Palmer Zimmerman. 2023. “Food for Thought: The importance of the Purchase to Plate Suite.” Real World Data Science, August 21, 2023. URL

Acknowledgements

The research presented in this compendium supports the Purchase to Plate Suite of data products. Carlson has been privileged to both develop and lead this project over the course of her career, but it is not a solo project. Many thanks to the Linkages Team from USDA’s Economic Research Service (Christopher Lowe, Mark Denbaly, Elina Page, and Catherine Cullinane Thomas), the Center for Nutrition Policy and Promotion (Kristin Koegel, Kevin Kuczynski, Kevin Meyers Mathieu, TusaRebecca Pannucci), and our contractor Westat, Inc. (Thea Palmer Zimmerman, Carina E. Tornow, Amber Brown McFadden, Caitlin Carter, Viji Narayanaswamy, Lindsay McDougal, Elisha Lubar, Lynnea Brumby, Raquel Brown, and Maria Tamburri). Many others have supported this project over the years.

References

Carlson, A. C., E. T. Page, T. P. Zimmerman, C. E. Tornow, and S. Hermansen. 2019. “Linking USDA Nutrition Databases to IRI Household-Based and Store-Based Scanner Data.” Technical bulletin 1952. US Department of Agriculture, Economic Research Service.
Carlson, A. C., C. E. Tornow, E. T. Page, A. Brown McFadden, and T. Palmer Zimmerman. 2022. “Development of the Purchase to Plate Crosswalk and Price Tool: Estimating Prices for the National Health and Nutrition Examination Survey (NHANES) Foods and Measuring the Healthfulness of Retail Food Purchases.” Journal of Food Composition and Analysis 106: 104344. https://doi.org/10.1016/j.jfca.2021.104344.
Cleary, R., Y. Liu, and A. Carlson. 2022. “Differences in the Distribution of Nutrition Between Households Above and Below Poverty.” Agricultural and Applied Economic Association Annual Meeting. Anaheim, CA. https://ageconsearch.umn.edu/record/322267.
Gattshall, M. L., J. A. Shoup, J. A. Marshall, L. A. Crane, and P. A. Estabrooks. 2008. “Validation of a Survey Instrument to Assess Home Environments for Physical Activity and Healthy Eating in Overweight Children.” International Journal of Behavioral Nutrition and Physical Activity 5 (3). https://doi.org/10.1186/1479-5868-5-3.
Hanson, N. I., D. Neumark-Sztainer, M. E. Eisenberg, M. Story, and M. Wall. 2005. “Associations Between Parental Report of the Home Food Environment and Adolescent Intakes of Fruits, Vegetables and Dairy Foods.” Public Health Nutrition 8 (1). https://doi.org/10.1079/PHN2005661.
Levin, D., D. Noriega, C. Dicken, A. Okrent, M. Harding, and M. Lovenheim. 2018. “Examining Store Scanner Data: A Comparison of the IRI Infoscan Data with Other Data Sets, 2008-12.” Technical bulletin 1949. US Department of Agriculture, Economic Research Service.
Muth, M. K., M. Sweitzer, D. Brown, K. Capogrossi, S. Karns, D. Levin, A. Okrent, P. Siegel, and C. Zhen. 2016. “Understanding IRI Household-Based and Store-Based Scanner Data.” Technical bulletin 1942. US Department of Agriculture, Economic Research Service.
Reedy, J., S. M. Krebs-Smith, P. E. Miller, A. D. Liese, L. L. Kahle, Y. Park, and A. F. Subar. 2014. “Higher Diet Quality Is Associated with Decreased Risk of All-Cause, Cardiovascular Disease, and Cancer Mortality Among Older Adults.” The Journal of Nutrition 144 (6): 881–89. https://doi.org/10.3945/jn.113.189407.
US Department of Agriculture. 2021. “Thrifty Food Plan, 2021.” Food and Nutrition Service 916. US Department of Agriculture. https://FNS.usda.gov/TFP.
US Department of Agriculture, Agricultural Research Service. 2018. “USDA Food and Nutrient Database for Dietary Studies 2015-2016.” US Department of Agriculture, Agricultural Research Service. https://www.ars.usda.gov/nea/bhnrc/fsrg.
———. 2020. “USDA Food and Nutrient Database for Dietary Studies 2017-2018.” US Department of Agriculture, Agricultural Research Service. https://www.ars.usda.gov/nea/bhnrc/fsrg.
US Department of Agriculture and US Department of Health and Human Services. 2020. “Dietary Guidelines for Americans, 2020-2025.” 9th edition. US Department of Agriculture and US Department of Health and Human Services. https://DietaryGuidelines.gov.

Footnotes

  1. NHANES is a multi-module continuous survey conducted by the Centers for Disease Control and Prevention. In addition to the WWEIA, NHANES includes a four-hour complete medical exam including a health history, and a blood and urine analysis.