<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Real World Data Science</title>
<link>https://realworlddatascience.net/applied-insights/case-studies/</link>
<atom:link href="https://realworlddatascience.net/applied-insights/case-studies/index.xml" rel="self" type="application/rss+xml"/>
<description></description>
<image>
<url>https://realworlddatascience.net/images/rwds-logo-150px.png</url>
<title>Real World Data Science</title>
<link>https://realworlddatascience.net/applied-insights/case-studies/</link>
<height>83</height>
<width>144</width>
</image>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Wed, 11 Feb 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Understanding and Addressing Algorithmic Bias: a Credit Scoring Case Study</title>
  <dc:creator>Devin Partida</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/algorithmic_bias_credit_scoring.html</link>
  <description><![CDATA[ 





<p>When you apply for a credit card or a loan, algorithms work in the background to determine financial worthiness. Despite increasing advancements, these are <a href="https://hai.stanford.edu/news/how-flawed-data-aggravates-inequality-credit">still imperfect</a> due to inherent biases. As data science students and professionals, you’ll inevitably face similar issues relating to biased data sets and should know how to combat them. What are some of the most effective techniques, and why do they matter?</p>
<section id="the-critical-issue-of-algorithmic-bias-in-credit-scoring-models" class="level2">
<h2 class="anchored" data-anchor-id="the-critical-issue-of-algorithmic-bias-in-credit-scoring-models">The Critical Issue of Algorithmic Bias in Credit Scoring Models</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/images/thumbcredit.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>One of the most concerning aspects of algorithmic bias in credit scoring is that adversely affected parties usually have little or no recourse for appealing unfavorable decisions. Most widely used algorithms still cannot explain how they reached a specific decision, leaving applicants in the dark and forcing them to trust opaque technology, even as it potentially ruins lives. This challenge underscores the need for greater transparency in <a href="https://www.psecu.com/learn/whats-in-a-credit-score">how credit scoring models are developed</a> and deployed, and it makes proactive mitigation strategies essential to ensuring fairness and equity in financial outcomes.</p>
<p>A November 2025 academic review <a href="https://giesbusiness.illinois.edu/news/2025/11/12/new-research-reveals-widespread-bias--inefficiency-in-credit-scoring-and-mortgage-lending">revealed numerous flaws in financial algorithms</a> and confirmed various impacts. They included systematic disadvantages for minority groups and miscalibrated credit scores for individual borrowers. The researchers also discovered that these issues appeared despite the financial technology industry’s promises of superior efficiency.</p>
<p>One of the cited studies consistently showed that female applicants received credit scores six to eight points lower than their male counterparts. The researchers determined that the associated effects diminished economic welfare and that the ramifications continued for multiple borrowing cycles. Another investigation revealed persistent disparities across minority groups, regardless of the applicant’s chosen lender type.</p>
<p>Elsewhere, researchers examined the effects of using large language models to evaluate applicants’ loan data. This approach regularly <a href="https://news.lehigh.edu/ai-exhibits-racial-bias-in-mortgage-underwriting-decisions">recommended charging higher interest rates</a> to Black applicants or denying their applications. It did not make the same suggestions for identical white applicants. These examples demonstrate why data science professionals must remain constantly aware of the potential for bias and uphold fairness by mitigating it wherever possible.</p>
</section>
<section id="practical-tips-for-bias-detection-and-mitigation" class="level2">
<h2 class="anchored" data-anchor-id="practical-tips-for-bias-detection-and-mitigation">Practical Tips for Bias Detection and Mitigation</h2>
<p>Sources of bias in credit scoring data and algorithms are more common than you might think. They can include:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/images/infographic.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>One straightforward way to identify bias in training data is to be aware of the most common types and build algorithms to be less reliant on them when possible. Regularly reviewing the training data is similarly effective because it can catch biases before they have real-life effects.</p>
<p>You can also perform a disparate impact analysis to mitigate bias, which compares outcome rates for a group with fewer privileges or less representation against those of a reference group. Dividing the proportion of one group that received an adverse outcome by the corresponding proportion for the other group yields a ratio; the further that ratio departs from parity, the stronger the evidence of bias.</p>
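<p>As a concrete illustration, this kind of check takes only a few lines. The following is a minimal sketch, assuming binary approve/deny decisions and comparing favorable-outcome rates against the common “four-fifths rule” threshold; the group data and function names are hypothetical.</p>

```python
def favorable_rate(outcomes):
    """Share of applicants who received a favorable (approved) outcome."""
    return sum(1 for o in outcomes if o == "approved") / len(outcomes)

def disparate_impact_ratio(protected_group, reference_group):
    """Ratio of favorable-outcome rates: protected group / reference group."""
    return favorable_rate(protected_group) / favorable_rate(reference_group)

# Hypothetical decisions for two applicant groups
protected = ["approved", "denied", "denied", "approved", "denied"]
reference = ["approved", "approved", "denied", "approved", "approved"]

ratio = disparate_impact_ratio(protected, reference)
print(f"Disparate impact ratio: {ratio:.2f}")  # 0.40 / 0.80 = 0.50
if ratio < 0.8:  # the four-fifths rule of thumb
    print("Potential disparate impact - investigate further.")
```

<p>A ratio below roughly 0.8 is a screening heuristic, not proof of discrimination; it flags where deeper analysis of the model and its training data is warranted.</p>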
<p>Creating a set of fairness metrics is another practical mitigation approach, as it encourages data scientists to understand the impacts of various aspects of a person’s background that they can and cannot control. The examined attributes could include someone’s employment history, income and debt, but also the extent to which they experienced equal opportunities.</p>
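<p>To make the idea of a fairness metric set concrete, here is a minimal sketch of two widely used metrics, assuming labeled repayment outcomes and model approval decisions per group; all names and data are hypothetical.</p>

```python
def rate(values):
    """Mean of a list of 0/1 indicators."""
    return sum(values) / len(values)

def true_positive_rate(y_true, y_pred):
    """Among truly creditworthy applicants, the share the model approves."""
    return rate([p for t, p in zip(y_true, y_pred) if t == 1])

def demographic_parity_gap(pred_a, pred_b):
    """Difference in overall approval rates between two groups."""
    return rate(pred_a) - rate(pred_b)

def equal_opportunity_gap(true_a, pred_a, true_b, pred_b):
    """Difference in true positive rates between two groups."""
    return true_positive_rate(true_a, pred_a) - true_positive_rate(true_b, pred_b)

# Hypothetical labels (1 = repaid) and decisions (1 = approved) per group
true_a, pred_a = [1, 1, 0, 1], [1, 0, 0, 1]
true_b, pred_b = [1, 1, 0, 1], [1, 1, 0, 1]

print(demographic_parity_gap(pred_a, pred_b))                  # approval-rate gap
print(equal_opportunity_gap(true_a, pred_a, true_b, pred_b))   # TPR gap
```

<p>Gaps near zero indicate parity on that metric; different metrics can disagree, which is why a set of them, chosen with the attributes applicants can and cannot control in mind, is more informative than any single number.</p>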
<p>Improvements in explainable artificial intelligence will further both bias detection and mitigation. Developers then see <a href="https://rehack.com/ai/explainable-ai/">the heavily weighted but irrelevant factors</a> that could lead to unfair outcomes and correct issues early.</p>
</section>
<section id="considerations-to-reduce-algorithmic-bias-in-credit-scoring" class="level2">
<h2 class="anchored" data-anchor-id="considerations-to-reduce-algorithmic-bias-in-credit-scoring">Considerations to Reduce Algorithmic Bias in Credit Scoring</h2>
<p>Reducing algorithmic bias in credit scoring requires cooperation across roles and departments and includes those who will use the tools containing the algorithms. Committing to specific postprocessing steps empowers people to become more familiar with an algorithm’s functionality and potential shortcomings rather than automatically trusting the results. Those developing the algorithms should prioritize transparency by designing explainable models when possible and making them accessible enough for the expected audience.</p>
<p>Staying abreast of recent research shapes data scientists’ efforts by helping them understand the possibilities. In a 2024 case study, MIT researchers created a new <a href="https://news.mit.edu/2024/researchers-reduce-bias-ai-models-while-preserving-improving-accuracy-1211">technique that identifies and eliminates</a> the specific attributes of training data that are the strongest contributors to a model’s biases about minority subgroups. This approach also preserves overall accuracy because it preserves more of the data compared to other options.</p>
<p>The developers confirmed that the technique can find hidden bias sources in training datasets with unlabeled information. This capability is significant because the data used by many applications lacks labels. They envision combining their technique with other approaches to improve fairness in high-stakes situations. This detail makes it well-suited for the financial industry because many of the associated decisions alter people’s lives and opportunities.</p>
<p>Maintaining responsible data science practices requires equipping professionals with the skills to detect and mitigate bias in an evolving technological landscape. A 2025 study involved a biased dataset that contained a <a href="https://www.psu.edu/news/bellisario-college-communications/story/most-users-cannot-identify-ai-bias-even-training-data">disproportionately high number</a> of white people with happy faces. This issue caused the AI algorithm to correlate race with emotional expressions.</p>
<p>The results of three experiments with human participants showed that most individuals did not notice the bias. This result shows why data scientists need ongoing education to spot less-obvious examples.</p>
</section>
<section id="stay-vigilant-to-maintain-fairness" class="level2">
<h2 class="anchored" data-anchor-id="stay-vigilant-to-maintain-fairness">Stay Vigilant to Maintain Fairness</h2>
<p>Your work on algorithms for credit scoring could adversely affect people’s lives and leave them with no way to contest unfavorable outcomes. Being a responsible data scientist means understanding the numerous risk factors and the controllable factors to minimize harm. Remaining aware of emerging AI applications in the financial industry and regularly meeting with colleagues to discuss ways forward increases fairness for everyone.</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author:</dt>
<dd>
<a href="https://devinpartida.com/">Devin Partida</a> is a data science and technology writer, as well as the Editor-in-Chief of ReHack.com. Her work has been featured on Hackernoon, TechTarget, DZone and others.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Devin Partida <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<p><strong>How to cite</strong> :<br>
Partida, Devin. 2026. “<strong>Understanding and Addressing Algorithmic Bias: a Credit Scoring Case Study</strong>.” <em>Real World Data Science</em>, 2026. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/algorithmic_bias_credit_scoring.html">URL</a></p>
</div>
</div>
</div>


</div>
</section>

 ]]></description>
  <category>Ethics</category>
  <category>Algorithms</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/algorithmic_bias_credit_scoring.html</guid>
  <pubDate>Wed, 11 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/images/thumbcredit.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Why 95% Of AI Projects Fail and How to Change the Odds</title>
  <dc:creator>Lee Clewley</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/why-95-percent-of-ai-projects-fail.html</link>
  <description><![CDATA[ 





<p>Artificial intelligence is now capable of performing substantive work across scientific, medical, industrial and economic domains, yet organisational experience remains uneven. Most large firms have experimented with AI; very few report material gains. MIT’s NANDA study of enterprise generative AI estimates that only 5 percent of custom tools reach production with measurable impact on profit and loss <span class="citation" data-cites="mitnanda2025">(1)</span>. Early analysis from MIT’s Iceberg project points in the same direction at task level: current systems could already support far more work than they do today, but observed use remains shallow, concentrated in a narrow set of roles and often confined to standalone ‘copilot’ tools rather than embedded in core workflows <span class="citation" data-cites="chopra2025">(2)</span>.</p>
<p>For anyone who has sat through AI vendor demonstrations, the pattern is familiar: a procession of polished prototypes that rarely change how important decisions are made. As one Chief Information Officer put it <span class="citation" data-cites="mitnanda2025">(1)</span>: ‘We’ve seen dozens of demos this year. Maybe one or two are genuinely useful. The rest are wrappers or science projects.’</p>
<p>Two caveats matter here. Many pilots are exploratory by design, so failure to reach production is not necessarily a failure in a scientific sense. Profit-based metrics also miss scientific and operational learning, which often matters more in research-intensive organisations <span class="citation" data-cites="ransbotham2020">(3)</span>; <span class="citation" data-cites="bcg2024">(4)</span>; <span class="citation" data-cites="deloitte2024">(5)</span>; <span class="citation" data-cites="schlegel2023">(6)</span>. Even allowing for those points, evidence across independent surveys is remarkably consistent: most organisations struggle to turn AI model capability into repeated value, and with an estimated 95% failure rate, the question becomes: how do we change the odds? <span class="citation" data-cites="mitnanda2025">(1)</span>; <span class="citation" data-cites="ransbotham2020">(3)</span>; <span class="citation" data-cites="bcg2024">(4)</span>; <span class="citation" data-cites="deloitte2024">(5)</span>; <span class="citation" data-cites="schlegel2023">(6)</span>.</p>
<section id="three-reasons-for-failure" class="level2">
<h2 class="anchored" data-anchor-id="three-reasons-for-failure">Three Reasons for Failure</h2>
<p>There are three important reasons why so many projects fail that are often overlooked in the literature. First, the problem is often mis-specified: it is framed by technologists or vendors rather than co-owned by the domain experts who understand the decision and bear the consequences. Second, leadership expectations are frequently misaligned: short time horizons and demands for certainty collide with a technology that improves through iteration and organisational learning. Third, many deployments are brittle: they assume stability in a domain defined by rapid model change and rising user expectations, when what is needed is an engineered system designed to adapt.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/images/text1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>This article draws on two decades of work building AI systems for drug discovery at GSK <span class="citation" data-cites="gskjules2024">(7)</span> and at Tangram <span class="citation" data-cites="tangram2025">(8)</span> to argue that success rests on three principles, in roughly this order:</p>
<ul>
<li>integrated subject matter expertise;</li>
<li>patient and informed executive leadership;</li>
<li>and building AI systems that learn with the organisation.</li>
</ul>
<p>The sections that follow develop each element in the context of drug discovery and show how an AI platform can help move real projects into the small minority that deliver value.</p>
</section>
<section id="essential-contexts-outside-our-focus" class="level2">
<h2 class="anchored" data-anchor-id="essential-contexts-outside-our-focus">Essential Contexts Outside Our Focus</h2>
<p>Before turning to the three principles developed below, it is worth acknowledging four adjacent domains that I will not treat in detail here, each of which now has a substantial literature of its own. First, organisational scholars have long shown that new information systems reshape power, status and discretion, making the politics of implementation as important as the technology itself <span class="citation" data-cites="markus1983">(9)</span>; <span class="citation" data-cites="orlikowski1992">(10)</span>. Second, governance, regulation and responsible AI practice, which includes everything from model documentation and auditability to privacy, robustness and safety, have become central determinants of what can be deployed in practice, especially in regulated sectors <span class="citation" data-cites="paleyes2022">(11)</span>; <span class="citation" data-cites="ey2025">(12)</span>; <span class="citation" data-cites="capgemini2024">(13)</span>. Third, there is an emerging body of work on workforce transformation: how AI complements or displaces skills, how hybrid human–AI roles are designed, and how training, trust and professional bodies mediate adoption <span class="citation" data-cites="chopra2025">(2)</span>; <span class="citation" data-cites="ransbotham2020">(3)</span>; <span class="citation" data-cites="bcg2024">(4)</span>; <span class="citation" data-cites="deloitte2024">(5)</span>. Finally, the question of how to measure value and learn at portfolio scale with existing legacy IT systems (through experimentation, counterfactuals and disciplined comparisons between use cases) is itself a rich field that extends well beyond any single organisation. <span class="citation" data-cites="ransbotham2020">(3)</span>; <span class="citation" data-cites="bcg2024">(4)</span>; <span class="citation" data-cites="deloitte2024">(5)</span>; <span class="citation" data-cites="schlegel2023">(6)</span>; <span class="citation" data-cites="davenport2018">(14)</span>. 
Each of these strands is critical to understanding why AI succeeds, stalls or remains at a proof-of-concept level.</p>
<p>To maintain focus, I concentrate on three overlooked questions arising from direct experience: how to organise subject matter expertise such that the enterprise owns its AI; how to cultivate genuine leadership ownership; and how to engineer systems that learn and adapt rather than remain isolated demonstrations.</p>
</section>
<section id="principle-1-the-importance-of-building-with-subject-matter-experts" class="level2">
<h2 class="anchored" data-anchor-id="principle-1-the-importance-of-building-with-subject-matter-experts">Principle 1: The importance of building with subject matter experts</h2>
<p>The hardest part of building an AI platform is not the models or the engineers but assembling the subject matter experts (SMEs) who will frame and judge the work. Most commentary treats SMEs as validators, brought in at the end to bless a prototype. It rarely explains how to organise a molecular biologist, a clinician and a chemist so that they can state, in plain terms, what counts as an acceptable outcome.</p>
<p>Most commentators are not operators. They observe patterns across organisations but do not live with the consequences of poor SME integration. This distance between writing and practice shows up in the surveys. Foundry’s 2024 State of the CIO, summarised in MIT Sloan Management Review, reports that 85% of IT leaders see the CIO role as a driver of change, yet only 28% list leading transformation as their top priority <span class="citation" data-cites="foundry2024">(15)</span>. The people commenting on AI often sit with strategy decks rather than with the unglamorous work of managing technological change and cross-functional coordination.</p>
<p>Drug discovery starkly exposes the gap. The relevant team is wide and requires exceptional coordination. Business and portfolio leaders understand how projects absorb capital and create value whereas molecular biologists and geneticists judge whether a gene is plausibly causal for a disease. Clinicians think through trial design and patient risk. Chemists know what can be made and delivered. Statisticians, AI engineers and data scientists understand models, data pipelines, experimental design and evaluation. This diversity is a strength but requires a lot more from leaders of such teams. When these groups work as separate silos, the result is a generic set of tools whose outputs are not trusted by the users and whose inputs are irrelevant. When these experts can operate as a single team, the conversation starts with a simple set of questions. Which decisions are we trying to improve? How will we know if we have succeeded? What data and statistical methods count as acceptable evidence? Which risks are we prepared to take and which are not negotiable?</p>
<p>At Tangram, that joint framing often collapses into one critical choice: which disease do we want to target for drug development, and which gene is driving it? That decision already embeds genetics, hepatocyte biology, chemistry, clinical feasibility and commercial context. The role of AI and engineering is then precise. It is to help the group search the vast hypothesis space, structure the evidence and quantify uncertainty, while leaving the final judgement with experts who feel they own the AI platform, can see why the AI has come to the conclusions it has, and also own the consequences.</p>
</section>
<section id="principle-2-patient-and-strategic-executive-leadership." class="level2">
<h2 class="anchored" data-anchor-id="principle-2-patient-and-strategic-executive-leadership.">Principle 2: Patient and strategic executive leadership.</h2>
<p>The second element is executive patience. Research from MIT Sloan shows that firms gaining value from AI tend to run more projects, over more years, with a sustained focus on learning how people and AI work together. <span class="citation" data-cites="ransbotham2020">(3)</span> Leaders in these organisations accept that early returns are small and uneven. They invest in a pipeline of use cases rather than a single bet. They resist what researchers have called the “last mile problem”: AI projects that reach technical proof of concept but never change how work is done. <span class="citation" data-cites="davenport2018">(14)</span> In life sciences this is acute. Discovery timelines are long, data are messy and early signals are faint. Leaders who expect quick, clean returns tend to cycle through pilots without ever building an asset that scientists trust.</p>
<p>Patience does not mean passivity; actually, the opposite is true. It means choosing a small number of important decisions, funding cross-functional teams to attack them, and holding the bar for quality high. It requires senior sponsorship to unblock data access, align incentives across discovery, clinical and commercial groups, and shield long term work from quarterly fashion cycles. When those conditions are in place, AI stops being a sequence of demonstrations and starts to become part of how the organisation thinks: the informed leader knows the difference.</p>
</section>
<section id="principle-3-building-ai-systems-that-learn-with-the-organisation" class="level2">
<h2 class="anchored" data-anchor-id="principle-3-building-ai-systems-that-learn-with-the-organisation">Principle 3: Building AI Systems that learn with the organisation</h2>
<p>The third element is the AI engineering. The MIT findings on the 95 percent figure are instructive: most generative AI projects fail not because the models are weak but because the systems around them are brittle. <span class="citation" data-cites="mitnanda2025">(1)</span>; <span class="citation" data-cites="paleyes2022">(11)</span> Foundation models are dropped into existing workflows with minimal adaptation. There is limited monitoring. Data quality is assumed rather than measured. When something breaks (as it inevitably will in non-deterministic systems) teams revert to manual work.</p>
<p>Modern AI engineering starts from the opposite assumption. Models and tools will change quickly. The surrounding stack must absorb that change without being rebuilt each time. The strategy must be designed so that, when the product director hears that a new technology has shipped or an LLM has improved, it is always a good day.</p>
<p><strong>Build only what you must.</strong></p>
<p>The sensible principle is to build only the components where your domain expertise creates defensible value. Everything else should be bought. But buying is not effortless: integration, monitoring and vendor management drain teams unless you are staffed for it. Organisations that succeed at scale partner with vendors offering systems that learn and adapt; they focus on workflow integration; they deploy tools where process alignment is easiest.</p>
<p>So, assuming you have the right staff working together, supportive leaders and good vendor relationships, the next problem is how to build the platform itself.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/images/text2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p><strong>A useful pattern for a modern AI platform has four parts:</strong></p>
<ul>
<li><strong>First, the data stack.</strong> Discovery teams need a small number of trusted stores for data with clear provenance and at least basic quality checks. For early target selection this means human genetics, expression data, interaction networks, preclinical phenotypes, clinical outcomes and internal experiments. Traceability and reproducibility are critical here. Any claim a model makes about a gene and disease pair should be traceable back to specific pieces of evidence.</li>
<li><strong>Second, the platform needs to be built on modular services rather than monoliths.</strong> Each service has a single responsibility and can be swapped when a better service appears. This keeps the cost of change low and allows teams to combine external tools with internal components in a controlled way.</li>
<li><strong>Third, the system needs to have continuous evaluation.</strong> Every component that answers questions is tested on held-out tasks, with simple metrics for accuracy, faithfulness, and recall, and monitored continually. There should be repeat measures and other tests of robustness. <span class="citation" data-cites="bolton2024">(16)</span> There is no reason not to report error bars in AI and yet they are rarely part of AI publications. Where this matters most is at the interface with non-determinism, inherent in large language models. A good medical AI assistant should give consistent answers even when questions are phrased differently. It should also say it does not know when the information is unclear or incomplete. <span class="citation" data-cites="bolton2024">(16)</span>; <span class="citation" data-cites="ji2023">(17)</span>; <span class="citation" data-cites="gskrambla2024">(18)</span></li>
<li><strong>Fourth, include memory and reinforcement learning so that the system learns.</strong> This is the most difficult component to implement and the one most often deferred. A system that cannot learn from use will make the same mistake repeatedly. Even the most patient users will lose trust and patience. But building memory into production systems, where the model retains context across sessions and improves from feedback, requires specialist expertise in reinforcement learning, retrieval-augmented generation with persistent stores, and the infrastructure to support online learning without catastrophic forgetting <span class="citation" data-cites="ouyang2022">(19)</span>. These skills are in high demand and short supply. The alternative is a system that feels potentially useful in demonstrations but frustrates users in daily work.</li>
</ul>
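<p>The evaluation point above is cheap to act on. The following is a minimal sketch of a percentile-bootstrap confidence interval for a held-out accuracy score, using only the standard library; the held-out results and parameter choices are hypothetical.</p>

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 scores."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(scores)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Hypothetical held-out results: 1 = the system answered correctly
held_out = [1] * 78 + [0] * 22

mean, (lo, hi) = bootstrap_ci(held_out)
print(f"accuracy = {mean:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

<p>Reporting the interval alongside the point estimate makes regressions between model versions distinguishable from resampling noise, which is exactly the discipline continuous evaluation requires.</p>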
<p>For this to work, engineering teams need to stay in constant contact with biologists, geneticists, clinicians, chemists and portfolio managers. Together they decide what error rate is acceptable for a triage tool, what form of uncertainty estimate a portfolio board will respect and where human review is mandatory, for example before a new target enters serious preclinical work. Work on human–AI interaction design reinforces this point: systems should explain what they can and cannot do, expose their confidence and make it easy for users to correct them. The hardest part is that the first version is almost never right. Cross-functional teams need patience and ownership. They contribute real examples, refine prompts and evaluation sets, and expect the system to learn from its mistakes. The AI platform must be useful enough, early enough, that experts are willing to spend scarce attention improving it.</p>
</section>
<section id="a-worked-example-target-indication-pairing-in-sirna" class="level2">
<h2 class="anchored" data-anchor-id="a-worked-example-target-indication-pairing-in-sirna">A worked example: target-indication pairing in siRNA</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/images/illu.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>At Tangram we built LLibra OS, an internal system designed to surface and assess new small interfering RNA targets <span class="citation" data-cites="tangram2025">(8)</span>. siRNA refers to short double-stranded RNA molecules that can silence a specific gene via the RNA interference pathway. The purpose is narrow: the AI needs to help scientists identify medicines worth taking forward.</p>
<p>In early discovery, hypothesis generation often reduces to a single question: which target and disease pair should we move into the siRNA pipeline? The question sounds simple. It is not.</p>
<p>A purely data-driven approach can surface millions of candidates. Genome-wide association, expression atlases, and protein interaction networks will produce statistical associations at scale. But association is not mechanism. Two things can correlate because they share upstream causes, because they sit in the same pathway without being rate-limiting, or because of confounding in the data.</p>
<p>Plausibility requires a different kind of evidence. If we modulate this target, what functional change should we observe at the cellular or tissue level? Does that functional phenotype connect credibly to the disease we care about? For our purposes, the chain of reasoning must pass through liver biology: does knockdown of this gene alter a measurable secretory or metabolic function, and does that function relate to the clinical phenotype we wish to treat? And is there a real unmet need for patients? <span class="citation" data-cites="crooke2021">(20)</span></p>
<p>The AI assists in answering these questions. It helps the team hold multiple threads of conditional evidence in view simultaneously. It retrieves, reasons and summarises over tens of millions of papers in the literature, joins and flags inconsistencies between thousands of data sources and quantifies uncertainty where the evidence is thin. Whilst the AI platform does the work of thousands of researchers and continually learns on the job, the expert judgement remains with the SMEs. The SME is central and is given all the reasons why and how the AI found a piece of evidence. The AI structures and expands the space in which that judgement operates. When it works well the AI platform uncovers the non-obvious connections that researchers may never have found.</p>
</section>
<section id="conclusions" class="level2">
<h2 class="anchored" data-anchor-id="conclusions">Conclusions</h2>
<p>Seen from this angle, the 95 percent figure is not a verdict on AI technology but a statistic about organisational design: how rarely good questions, high quality data, diverse experts and committed leaders are brought together at the same time. The systems described in this essay matter, but they are secondary. The primary determinant of value is whether biologists, clinicians, chemists, data scientists and portfolio leaders sit together, own the same objectives and are backed by executives willing to invest over years rather than quarters. Where that integrated team is absent, even elegant architectures will fail. Where they are present, imperfect tools still move the needle.</p>
<p>Much commentary on AI spends its energy on model choice, technical detail and tooling. This article has argued that the more important work is organisational: deciding which decisions to improve, agreeing what counts as acceptable evidence, and creating cross-functional teams that can live with the consequences. The few organisations that succeed treat AI as an experiment in decision making rather than a procurement exercise. They expect the stack to change and the vendors to turn over, but they hold fast to the team, the questions and the discipline.</p>
<p>For Real World Data Science readers, the implication is direct. AI projects fail when nobody owns the estimand, the counterfactual and the error bars. AI doesn’t need to be perfect, but it needs to be good enough. As the statistician George E. P. Box famously observed: “All models are wrong, but some are useful”. Usefulness here depends on design, discipline and humility as much as model choice. Statisticians, data scientists and methodologists can reclaim the narrative by insisting not only that every AI project begins with a clear question, a credible experiment and a plan to learn, but also that these are held collectively by an integrated team with visible executive backing. That is how more organisations move into the 5 percent.</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author:</dt>
<dd>
<a href="https://www.linkedin.com/in/lee-clewley-988bbb18/">Lee Clewley (PhD)</a> is VP of Applied AI &amp; Informatics at <a href="https://tangramtx.com/">Tangram Therapeutics</a>, where he led the design and deployment of LLibra, a multi-LLM, agentic system for early discovery. Formerly Head of Applied AI at <a href="https://www.gsk.com/en-gb/">GSK</a>, he is a member of the Real World Data Science <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/10/18/meet-the-team.html">editorial board</a>.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Lee Clewley<br>
<a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<p><strong>How to cite</strong> :<br>
Clewley, Lee. 2026. “<strong>Why 95% Of AI Projects Fail and How to Change the Odds</strong>.” <em>Real World Data Science</em>, 2026. <a href="https://realworlddatascience.net/applied-insights/tutorials/posts/2026/12/why-95-percent-of-ai-projects-fail.html">URL</a></p>
</div>
</div>
</div>


</div>

</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body">
<div id="ref-mitnanda2025" class="csl-entry">
<div class="csl-left-margin">1. </div><div class="csl-right-inline">MIT NANDA. The GenAI divide: State of AI in business 2025. Massachusetts Institute of Technology; 2025.</div>
</div>
<div id="ref-chopra2025" class="csl-entry">
<div class="csl-left-margin">2. </div><div class="csl-right-inline"><span class="nocase">Chopra A et al.</span> Measuring skills-centered exposure in the AI economy. MIT Project Iceberg; Oak Ridge National Laboratory; 2025.</div>
</div>
<div id="ref-ransbotham2020" class="csl-entry">
<div class="csl-left-margin">3. </div><div class="csl-right-inline">Ransbotham S, Khodabandeh S, Kiron D, Candelon F, Chu M, LaFountain B. Expanding AI’s impact with organizational learning. MIT Sloan Management Review. 2020.</div>
</div>
<div id="ref-bcg2024" class="csl-entry">
<div class="csl-left-margin">4. </div><div class="csl-right-inline"><span class="nocase">Bellefonds N de et al.</span> Where’s the value in AI? Boston Consulting Group; 2024.</div>
</div>
<div id="ref-deloitte2024" class="csl-entry">
<div class="csl-left-margin">5. </div><div class="csl-right-inline">Deloitte AI Institute. The state of generative AI in the enterprise: Now decides next. Deloitte; 2024.</div>
</div>
<div id="ref-schlegel2023" class="csl-entry">
<div class="csl-left-margin">6. </div><div class="csl-right-inline">Schlegel D, Schuler K, Westenberger J. Failure factors of AI projects: Results from expert interviews. International Journal of Information Systems and Project Management. 2023;11(3):25–40.</div>
</div>
<div id="ref-gskjules2024" class="csl-entry">
<div class="csl-left-margin">7. </div><div class="csl-right-inline">GSK.ai. JulesOS: GSK’s agent-based operating system [Internet]. 2024. Available from: <a href="https://www.gsk.ai">https://www.gsk.ai</a></div>
</div>
<div id="ref-tangram2025" class="csl-entry">
<div class="csl-left-margin">8. </div><div class="csl-right-inline">Tangram Therapeutics. LLibra OS: Identifying the right targets. 2025.</div>
</div>
<div id="ref-markus1983" class="csl-entry">
<div class="csl-left-margin">9. </div><div class="csl-right-inline">Markus ML. Power, politics, and MIS implementation. Communications of the ACM. 1983;26(6):430–44.</div>
</div>
<div id="ref-orlikowski1992" class="csl-entry">
<div class="csl-left-margin">10. </div><div class="csl-right-inline">Orlikowski WJ. The duality of technology: Rethinking the concept of technology in organizations. Organization Science. 1992;3(3):398–427.</div>
</div>
<div id="ref-paleyes2022" class="csl-entry">
<div class="csl-left-margin">11. </div><div class="csl-right-inline">Paleyes A, Urma RG, Lawrence ND. Challenges in deploying machine learning: A survey of case studies. ACM Computing Surveys. 2022;55(6).</div>
</div>
<div id="ref-ey2025" class="csl-entry">
<div class="csl-left-margin">12. </div><div class="csl-right-inline">EY. How responsible AI translates investment into impact. Ernst &amp; Young; 2025.</div>
</div>
<div id="ref-capgemini2024" class="csl-entry">
<div class="csl-left-margin">13. </div><div class="csl-right-inline">Capgemini Research Institute. Generative AI in organizations 2024. Capgemini; 2024.</div>
</div>
<div id="ref-davenport2018" class="csl-entry">
<div class="csl-left-margin">14. </div><div class="csl-right-inline">Davenport TH, Ronanki R. Artificial intelligence for the real world. Harvard Business Review. 2018;96(1):108–16.</div>
</div>
<div id="ref-foundry2024" class="csl-entry">
<div class="csl-left-margin">15. </div><div class="csl-right-inline">Foundry. State of the CIO survey 2024. Foundry; 2024.</div>
</div>
<div id="ref-bolton2024" class="csl-entry">
<div class="csl-left-margin">16. </div><div class="csl-right-inline">Bolton WJ, Poyiadzi R, Morrell ER, Bergen Gonzalez Bueno G van, Goetz L. RAmBLA: A framework for evaluating the reliability of LLMs as assistants in the biomedical domain. arXiv preprint. 2024.</div>
</div>
<div id="ref-ji2023" class="csl-entry">
<div class="csl-left-margin">17. </div><div class="csl-right-inline"><span class="nocase">Ji Z, Lee N, Frieske R, et al.</span> Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12).</div>
</div>
<div id="ref-gskrambla2024" class="csl-entry">
<div class="csl-left-margin">18. </div><div class="csl-right-inline">GSK.ai. RAmBLA: Evaluating the reliability of LLMs as biomedical assistants. 2024.</div>
</div>
<div id="ref-ouyang2022" class="csl-entry">
<div class="csl-left-margin">19. </div><div class="csl-right-inline"><span class="nocase">Ouyang L, Wu J, Jiang X, et al.</span> Training language models to follow instructions with human feedback. In: Advances in neural information processing systems. 2022. p. 27730–44.</div>
</div>
<div id="ref-crooke2021" class="csl-entry">
<div class="csl-left-margin">20. </div><div class="csl-right-inline">Crooke ST, Liang XH, Baker BF, Crooke RM. Antisense technology: A review. Journal of Biological Chemistry. 2021;296:100416.</div>
</div>
</div></section></div> ]]></description>
  <category>Viewpoints</category>
  <category>AI</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/why-95-percent-of-ai-projects-fail.html</guid>
  <pubDate>Mon, 12 Jan 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/images/thumb95.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Deploying LLMs for Nonprofits: 10 Lessons from Knowbot</title>
  <dc:creator>Annie Flynn</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2025/12/17/deploying_llms_nonprofits.html</link>
  <description><![CDATA[ 





<p><em>Based on <a href="https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/MHF-interview.html">our conversation</a> with <a href="https://www.mikehudsonfoundation.org/">Mike Hudson</a>, Founder of <a href="https://www.mikehudsonfoundation.org">MHF</a>, <a href="https://www.knowbot.uk/">Knowbot</a> and <a href="https://www.testramp.org/">TestRAMP</a>.</em></p>
<p>Large language models (LLMs) are a form of generative artificial intelligence (AI) that offer transformative opportunities for organisations with complex information ecosystems. But deploying them responsibly requires technical pragmatism, cultural awareness, and a respect for context.</p>
<p>Specialist AI donor the <a href="https://www.mikehudsonfoundation.org/">Mike Hudson Foundation</a> has developed an LLM-powered ‘answer engine’ that sits on websites and answers users’ questions. Our recent conversation with MHF’s Founder surfaced some valuable insights for data science practitioners working in this sphere.</p>
<section id="start-with-the-simplest-possible-use-case" class="level2">
<h2 class="anchored" data-anchor-id="start-with-the-simplest-possible-use-case">1. Start With the Simplest Possible Use Case</h2>
<p>One of Knowbot’s core design principles was <em>minimal friction</em>. MHF looked for a “gateway use case”: a low-risk, easy-to-understand tool that organisations could immediately see value in and adopt quickly.</p>
<p><strong>Practitioner takeaway:</strong> Don’t begin with the most ambitious AI project your organisation can imagine. Begin with an easy, low-risk project that still delivers value, and treat it as a learning experience.</p>
</section>
<section id="culture-matters-more-than-budget" class="level2">
<h2 class="anchored" data-anchor-id="culture-matters-more-than-budget">2. Culture Matters More Than Budget</h2>
<p>Hudson notes that AI readiness among nonprofits varies widely and isn’t correlated with organisational size. Some large charities are slow to innovate due to bureaucracy; some small ones are enthusiastic but unlikely to benefit.</p>
<p><strong>Practitioner takeaway:</strong> When planning an LLM deployment, assess <em>cultural readiness</em>, not just technical readiness. Ask:</p>
<ul>
<li><p>Who are the internal champions?</p></li>
<li><p>How much AI literacy exists?</p></li>
<li><p>How cautious is the organisation by default?</p></li>
</ul>
<p>This will drive adoption far more than infrastructure.</p>
</section>
<section id="build-for-trust-first-then-functionality" class="level2">
<h2 class="anchored" data-anchor-id="build-for-trust-first-then-functionality">3. Build for Trust First, Then Functionality</h2>
<p>The biggest obstacle MHF faced wasn’t the model, the infrastructure, or the code. It was <em>accessing the right decision-makers</em> and establishing trust in LLMs in general and Knowbot in particular.</p>
<p><strong>Practitioner takeaway:</strong> AI deployments in nonprofits are trust projects as much as technical ones. Practitioners should:</p>
<ul>
<li><p>Engage early with leadership.</p></li>
<li><p>Be explicit about risks and mitigations.</p></li>
<li><p>Provide clear, responsible documentation.</p></li>
<li><p>Avoid overclaiming what the model can do.</p></li>
</ul>
<p>The more transparent the process, the smoother the adoption.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/12/17/images/LLM2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</section>
<section id="where-appropriate-restrict-the-models-knowledge-domain-to-reduce-risk" class="level2">
<h2 class="anchored" data-anchor-id="where-appropriate-restrict-the-models-knowledge-domain-to-reduce-risk">4. Where Appropriate, Restrict the Model’s Knowledge Domain to Reduce Risk</h2>
<p>Knowbot deliberately confines itself to the content on the host organisation’s website(s), plus general internal LLM knowledge. It doesn’t trawl the open internet. This dramatically limits opportunities for hallucinations, unsafe advice, or reputational risk.</p>
<p><strong>Practitioner takeaway:</strong> Whenever possible, design LLM answer engine systems that operate on <em>curated, organisation-owned content</em>. Domain restriction is one of the most effective forms of practical AI safety.</p>
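<p>A minimal sketch of what such domain restriction can look like in code, assuming a small dictionary of organisation-owned page texts and a simple word-overlap retriever (the pages, the overlap rule, and the refusal message are all illustrative; Knowbot’s actual pipeline is not public):</p>

```python
import re

# Hypothetical organisation-owned content the engine is allowed to draw on.
SITE_PAGES = {
    "/services": "We offer mental health support groups every Tuesday evening.",
    "/about": "Founded in 2001, the charity supports young carers across the UK.",
}

def _words(text: str) -> set:
    """Lowercase word tokens, stripped of punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, pages: dict, min_overlap: int = 2) -> list:
    """Return page texts sharing at least min_overlap words with the question."""
    q = _words(question)
    return [t for t in pages.values() if len(q & _words(t)) >= min_overlap]

def build_prompt(question: str, pages: dict) -> str:
    """Ground the LLM prompt in retrieved site content only; refuse otherwise."""
    context = retrieve(question, pages)
    if not context:
        return "REFUSED: no relevant site content found for this question."
    return ("Answer using ONLY the context below. If it is insufficient, say so.\n"
            "Context:\n" + "\n".join(context) + "\nQuestion: " + question)
```

<p>The key design point is the refusal branch: when nothing on the curated site matches, the engine declines rather than falling back on the model’s general knowledge.</p>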
</section>
<section id="expect-surprising-user-behaviour-and-design-for-it" class="level2">
<h2 class="anchored" data-anchor-id="expect-surprising-user-behaviour-and-design-for-it">5. Expect Surprising User Behaviour — And Design for It</h2>
<p>One of the unexpected patterns in early usage: people asked Knowbot, <em>“Who are you?”</em> This prompted the team to add a new prompt component and to require every partner to host a <em>“What is Knowbot?”</em> page.</p>
<p><strong>Practitioner takeaway:</strong> Build processes for:</p>
<ul>
<li><p>Unexpected inputs</p></li>
<li><p>Prompt evolution</p></li>
<li><p>Iterative refinement</p></li>
</ul>
<p>LLM deployment is never “set and forget.”</p>
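<p>The “Who are you?” fix above amounts to prompt evolution: keeping the system prompt as composable components so new ones can be added as usage reveals gaps. A sketch, with entirely invented wording (this is not Knowbot’s actual prompt):</p>

```python
# Hypothetical identity component added after users asked "Who are you?".
IDENTITY = ("You are Knowbot, an AI answer engine that responds using this "
            "organisation's website content. You are not a human.")

def build_system_prompt(components: list) -> str:
    """Join prompt components; new components slot in without rewriting the rest."""
    return "\n\n".join(components)

base = ["Answer concisely.", "Cite the source page for every answer."]
prompt = build_system_prompt([IDENTITY] + base)
```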
</section>
<section id="technological-timing-matters-and-keeps-improving" class="level2">
<h2 class="anchored" data-anchor-id="technological-timing-matters-and-keeps-improving">6. Technological Timing Matters — And Keeps Improving</h2>
<p>Hudson emphasised that many capabilities now considered standard (e.g.&nbsp;long context length, ring-fenced access to specific types of knowledge, ease of server deployment) would have been impossible even a year earlier. The tools needed to fulfil a nonprofit’s evolving needs often appear in the LLM ecosystem soon after the nonprofit requests new functionality, making the decision whether to ‘build custom’ or ‘wait’ a tricky one.</p>
<p><strong>Practitioner takeaway:</strong> Stay current. Model capabilities, guardrails, and hosting options evolve at high speed. What was impossible last quarter may be trivial today.</p>
</section>
<section id="value-impact-over-volume" class="level2">
<h2 class="anchored" data-anchor-id="value-impact-over-volume">7. Value Impact Over Volume</h2>
<p>Knowbot’s LLM processing costs MHF money, and so Knowbot’s team evaluates success not just by the number of questions answered but by the <em>relevance</em> of those questions to valuable decision-making. A tool that helps a policymaker or researcher retrieve something critical can have outsized impact.</p>
<p><strong>Practitioner takeaway:</strong> When measuring impact, develop metrics that capture qualitative value, not just quantitative usage. For example, you might consider:</p>
<ul>
<li><p>Complexity of queries</p></li>
<li><p>Decision relevance</p></li>
<li><p>Equity of access</p></li>
<li><p>Whether the tool reduces burden on staff</p></li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/12/17/images/LLM3.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</section>
<section id="in-fast-moving-environments-admit-what-you-dont-know" class="level2">
<h2 class="anchored" data-anchor-id="in-fast-moving-environments-admit-what-you-dont-know">8. In Fast-Moving Environments, Admit What You Don’t Know</h2>
<p>Both Knowbot and TestRAMP were built in contexts where knowledge was changing daily. Hudson emphasises the importance of asking “naïve” questions, learning quickly, and not pretending expertise where there is none.</p>
<p><strong>Practitioner takeaway:</strong> Cultivate humility. Curiosity and fast learning beat early certainty. Pair technical exploration with organisational openness about unknowns.</p>
</section>
<section id="relationships-and-partnerships-are-everything" class="level2">
<h2 class="anchored" data-anchor-id="relationships-and-partnerships-are-everything">9. Relationships and Partnerships Are Everything</h2>
<p>Across both initiatives, success depended less on algorithms and more on building new human relationships.</p>
<p><strong>Practitioner takeaway:</strong> AI for public good is a team sport. Map stakeholders. Share progress transparently. Community buy-in creates technical resilience.</p>
</section>
<section id="the-next-frontier-agentic-ai" class="level2">
<h2 class="anchored" data-anchor-id="the-next-frontier-agentic-ai">10. The Next Frontier: Agentic AI</h2>
<p>Hudson argues we’re at a turning point where AI will expand from being “retrieval engines” to becoming “agentic systems that can do things.” With that shift comes both opportunity and new categories of risk.</p>
<p><strong>Practitioner takeaway:</strong> Prepare now for agentic systems. Start with controlled automation, clear constraints, auditable logs, and robust governance. Retrieval is only the beginning.</p>
<p><a href="https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/MHF-interview.html">Read our full conversation with Mike Hudson here.</a></p>
<p><a href="https://www.mikehudsonfoundation.org/">Find out more about the Mike Hudson Foundation here.</a></p>
<p><em>Mike Hudson is an entrepreneur in technology &amp; electronic markets. He now uses his expertise to help solve social problems. Mike founded TestRAMP, a pandemic nonprofit social market described as a “major contribution to Covid PCR testing &amp; genomic sequencing”, and donated its £2.4mn profits to charity. Mike is a Fellow of ZSL &amp; adviser to its CEO. He is an honorary Research Fellow at City, University of London. Mike is a member of the Responsible AI Institute. He is a Foundation Fellow at St Antony’s College, University of Oxford.</em></p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/index.html">Explore more data science ideas</a></p>
</div>


</section>

 ]]></description>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2025/12/17/deploying_llms_nonprofits.html</guid>
  <pubDate>Wed, 17 Dec 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/12/17/images/LLMthumb.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Deploying Agentic AI - What Worked, What Broke, and What We Learned</title>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2025/08/12/deploying-agentic-ai.html</link>
  <description><![CDATA[ 





<section id="we-built-agentic-systems.-heres-what-broke." class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="we-built-agentic-systems.-heres-what-broke."><span class="header-section-number">1</span> We Built Agentic Systems. Here’s What Broke.</h2>
<p>When Agentic AI started dominating research papers, demos, and conference talks, I was curious but cautious. The idea of intelligent agents, autonomous systems powered by large language models that can plan, reason, and take actions using tools, sounded brilliant in theory. But I wanted to know what happened when you used them. Not in a toy notebook or a slick demo, but in real projects, with real constraints, where things needed to work reliably and repeatably.</p>
<p>In my role as Clinical AI &amp; Data Scientist at Bayezian Limited, I work at the intersection of data science, statistical modelling, and clinical AI governance, with a strong emphasis on regulatory-aligned standards such as CDISC. I have been directly involved in deploying agentic systems into environments where trust and reproducibility are not optional. These include real-time protocol compliance, CDISC mapping, and regulatory workflows. We gave agents real jobs. We let them loose on messy documents. And then we watched them work, fail, learn, and (sometimes) recover.</p>
<p>This article is not a critique of Agentic AI as a concept. I believe Agentic AI has potential value, but I also believe it demands more critical evaluation. That means assessing these systems in conditions that mirror the real world, not in benchmark papers filled with sanitised datasets. It means observing what happens when agents are under pressure, when they face ambiguity, and when their outputs have real consequences. What follows is not speculation about what Agentic AI might become a decade from now. It is a candid reflection on what it feels like to use these systems today. It is about watching a chain of prompts unravel or a multi-agent system drop the baton halfway through a task. If we want Agentic AI to be trustworthy, robust, and practical, then our standards for evaluating it must be shaped by lived experience rather than theoretical ideals.</p>
</section>
<section id="what-agentic-ai-looks-like-in-practice" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="what-agentic-ai-looks-like-in-practice"><span class="header-section-number">2</span> What Agentic AI Looks Like in Practice</h2>
<p>If you’re imagining robots in lab coats, that’s not quite what this is. It is more like releasing a highly motivated intern into a complex archive with partial instructions, limited supervision, and the freedom to decide which filing cabinets, databases, or tools to open next. It is messy. It is unpredictable. And it sometimes surprises you with just how resourceful or confused it can get. Agentic AI systems are purpose-built setups where a large language model is given a task and enough autonomy to decide how to approach it. That might mean choosing which tools to use, when to use them, and how to adapt when things go off-script. You are not just sending one prompt and getting an answer. You are watching a system reason, remember, call APIs, retry when things go wrong, and ideally, get to a useful result.</p>
<p>At Bayezian, we have explored this in several internal projects, including generating clinical codes from statistical analysis plans and study specifications, monitoring synthetic Electronic Health Records (EHRs) for rule violations, and running chained reasoning loops to validate document alignment. These efforts reflect the reality of building LLM agents into safety-critical and compliance-heavy workflows. Across these deployments, the question is never just “can it do the task” but “can it do the task reliably, interpretably, and safely in context”.</p>
<p>Broader research has followed similar directions. In clinical pharmacology and translational sciences, researchers have explored how AI agents can automate modelling and trial design while keeping a human in the loop, offering blueprints for scalable, compliant agentic workflows. In the context of patient-facing systems, agentic retrieval-augmented generation has improved the quality and safety of educational materials, with LLMs acting as both generators and validators of content. Other teams have used multi-agent systems to simulate cross-disciplinary collaboration, where each AI agent brings a different scientific role to the design and validation of therapeutic molecules such as SARS-CoV-2 nanobodies.</p>
<p>Some of the systems we built used agent frameworks like LangChain or LlamaIndex. Others were bespoke combinations of APIs, function libraries, memory stores, and prompt stacks wired together to mimic workflow behaviour. Regardless of the architecture, the core structure remained the same. The agent was given a task, a bit of autonomy, and access to tools, and then left to figure things out. Sometimes it worked. Sometimes it did not. That gap between intention and execution is where most of the interesting lessons sit.</p>
<p>In the next section, I describe one of those deployments in more detail: a multi-agent system used to monitor data flow in a simulated clinical trial setting.</p>
</section>
<section id="case-study-monitoring-protocol-deviations-with-agentic-ai" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="case-study-monitoring-protocol-deviations-with-agentic-ai"><span class="header-section-number">3</span> Case Study: Monitoring Protocol Deviations with Agentic AI</h2>
<p><strong>Why We Built It</strong></p>
<p>Clinical trials generate a stream of complex data, from scheduled lab results to adverse event logs. Hidden in that stream are subtle signs that something may be off: a visit occurred too late, a test was skipped, or a dose changed when it shouldn’t have. These are protocol deviations, and catching them quickly matters. They can affect safety, skew outcomes, and trigger regulatory scrutiny.</p>
<p>Traditionally, reviewing these events is a painstaking task. Study teams trawl through spreadsheets and timelines, cross-referencing against lengthy protocol documents. It is time-consuming, easy to miss context, and prone to delay. We wondered whether an AI-driven approach could act like a vigilant reviewer. Not to replace the team, but to help it focus on what truly needed attention.</p>
<p>Our motivation was twofold. First, to introduce earlier, more consistent detection without relying on rule-based systems that often buckle under real-world variability. Second, to test whether a group of coordinated language model agents, each with a clear focus, could carry out this work at scale while still being interpretable and auditable.</p>
<p>To do that, we built the system from the ground up. We designed a pipeline that could ingest clinical documents, extract key protocol elements, embed them for semantic search, and store them in structured form. That created the foundation for agents to work not just as readers of data, but as context-aware monitors. Understanding whether a missed Electrocardiogram (ECG) or a delayed Day 7 visit violated the protocol required more than lookup tables. It required reasoning. It required memory. It required agents built with intent.</p>
<p>What emerged was a system designed not just to scan data, but to think with constraints, assess context, and escalate issues when the boundaries of the trial were breached. The goal was not perfection, but partnership. A system that could flag what mattered, explain why, and stay open to human feedback.</p>
<p><strong>How It Was Set Up</strong></p>
<p>The system was built around a group of focused agents, each responsible for checking a specific type of protocol rule. Rather than relying on one large model to do everything, we broke the task into smaller parts. One agent reviewed visit timing. Another checked medication use. Others handled inclusion criteria, missed procedures, or serious adverse events. This made each agent easier to understand, easier to test, and less likely to be overwhelmed by conflicting information.</p>
<p>Before any agents could be activated, however, an early classifier was introduced to determine what type of document had arrived. Was it a screening form or a post-randomisation visit report? That initial decision shaped the downstream path. If it was a screening file, the system activated the inclusion and exclusion criteria checker. If it was a visit document, it was handed off to agents responsible for tracking timing, treatment exposure, scheduled procedures, and adverse events.</p>
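<p>The classify-then-route step can be sketched as follows. The document types, agent names, and keyword rule here are illustrative assumptions, not the production classifier:</p>

```python
# Which checking agents to activate for each document type (names invented).
AGENTS_BY_DOC_TYPE = {
    "screening": ["inclusion_exclusion_checker"],
    "visit": ["visit_timing", "treatment_exposure", "procedures", "adverse_events"],
}

def classify(document: str) -> str:
    """Stand-in classifier: is this a screening form or a visit report?"""
    text = document.lower()
    if "screening" in text or "eligibility" in text:
        return "screening"
    return "visit"

def route(document: str) -> list:
    """Return the set of agents that should review this document."""
    return AGENTS_BY_DOC_TYPE[classify(document)]
```

<p>Keeping routing explicit like this, rather than letting one large model decide everything, is what makes each downstream agent small enough to test in isolation.</p>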
<p>These agents did not operate in isolation. They worked on top of a pipeline that handled the messy reality of clinical data. Documents in different formats were extracted, cleaned, and converted into structured representations. Tables and free text were processed together. Key elements from study protocols were embedded and stored to allow flexible retrieval later. This gave the agents access to a searchable memory of what the trial actually required.</p>
<p>While many agentic systems today rely heavily on frameworks like LangChain or LlamaIndex, our system was built from the ground up to suit the demands of clinical oversight and regulatory traceability. We avoided packaged orchestration frameworks. Instead, we constructed a lightweight pipeline using well-tested Python tools, giving us more control over transparency and integration. For semantic memory and search, protocol content was indexed using FAISS, a vector store optimised for fast similarity-based retrieval. This allowed each agent to fetch relevant rules dynamically and reason through them with appropriate context.</p>
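<p>As a hedged illustration of that retrieval pattern: the sketch below substitutes a bag-of-words cosine similarity for the learned embeddings and FAISS index used in the real system. The protocol rules are invented examples:</p>

```python
import re
import numpy as np

# Toy protocol rules standing in for embedded protocol content.
RULES = [
    "Visit 3 must occur within 7 days, plus or minus a 2 day window, of randomisation.",
    "A 12-lead ECG is required at screening and again at Day 7.",
    "Any rescue medication use must be recorded within 24 hours.",
]

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

VOCAB = sorted({w for r in RULES for w in tokens(r)})

def embed(text):
    """Count vocabulary words and L2-normalise, mimicking a dense embedding."""
    v = np.array([tokens(text).count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

INDEX = np.stack([embed(r) for r in RULES])  # one row per protocol rule

def fetch_rules(query, k=1):
    """Return the k rules most similar to the query, as an agent would fetch them."""
    sims = INDEX @ embed(query)
    return [RULES[i] for i in np.argsort(sims)[::-1][:k]]
```

<p>The real pipeline replaces the counting step with model embeddings and the matrix product with a FAISS similarity search, but the fetch-relevant-rules-then-reason shape is the same.</p>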
<p>When patient data flowed in, the classifier directed the document to the appropriate agents. If any agent spotted something unusual, it could escalate the case to a second agent responsible for suggesting possible actions. That might mean logging the issue, generating a report, or prompting a review from the study team. Throughout, a human remained involved to validate decisions and interpret edge cases that needed nuance.</p>
<p>We did not assume the agents would get everything right. The idea was to create a process where AI could handle the repetitive scanning and flagging, leaving people to focus on the work that demanded clinical judgement. The combination of structured memory, clear responsibilities, document classification, and human oversight formed the backbone of the system.</p>
<p>Figure 1 illustrates a two-phase agentic system architecture, where protocol documents are first parsed, structured, and embedded into a searchable memory (green), enabling real-time agents (orange) to classify incoming clinical data from the Clinical Trial Management System (CTMS), reason over protocol rules, detect deviations, and escalate issues with human oversight.</p>
<div id="fig-cde" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/08/12/images/figure-1-sa.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: System Architecture and Agent Flow
</figcaption>
</figure>
</div>
<p><strong>Where It Got Complicated</strong></p>
<p>In early tests, the system did what it was built to do. It scanned incoming records, spotted missing data, flagged unexpected medication use, and pointed out deviations that might otherwise have slipped through. On structured examples, it handled the checks with speed and consistency.</p>
<p>But as we moved closer to real trial conditions, the gaps started to show. The agents were trained to recognise rules, but real-world data rarely plays by the book. Information arrived out of order. Visit dates overlapped. Exceptions buried in footnotes became critical. Suddenly, a task that looked simple in isolation became tangled in edge cases.</p>
<p>One of the most frequent problems was handover failure. A deviation might be correctly identified by the first agent, only to be lost or misunderstood by the next. A flagged issue would travel halfway through the chain and then disappear or be misclassified because the follow-up agent missed a piece of context. These were not coding errors. They were coordination breakdowns, small lapses in memory between steps that led to big differences in outcome.</p>
<p>We also found that decisions based on time windows were especially fragile. An agent could recognise that a visit was missing, but not always remember whether the protocol allowed a buffer. That kind of reasoning depended on holding specific details in working memory. Without it, the agents began to misfire, sometimes raising the alarm too early, other times not at all.</p>
<p>None of this was surprising. We had built the system to learn from its own limitations. But seeing those moments play out across agents, in ways that were subtle and sometimes difficult to trace, helped surface the exact places where autonomy met ambiguity and where structure gave way to noise.</p>
<p><strong>A Glimpse Into the Details</strong></p>
<p>One case brought the system’s limits into focus. A monitoring agent flagged a protocol deviation for a missing lab test on Day 14. On the surface, it looked like a valid call. The entry for that day was missing, and the protocol required a test at that visit. The alert was logged, and the case moved on to the next agent in the chain.</p>
<p>But there was a catch.</p>
<p>The protocol did call for a Day 14 lab, but it also allowed a two-day window either side. That detail had been extracted earlier and embedded in the system’s memory. However, at the moment of evaluation, that context was not carried through. The agent saw an empty cell for Day 14 and treated it as a breach. It did not recall that a test on Day 13, which had already been recorded, fulfilled the requirement.</p>
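<p>The window-aware check the agent failed to apply is simple to state in code. A hypothetical sketch, with dates and names of our own invention rather than anything from the production system:</p>

```python
from datetime import date, timedelta

def lab_requirement_met(scheduled: date, window_days: int,
                        recorded: list[date]) -> bool:
    """True if any recorded lab test falls within ±window_days of the visit."""
    return any(abs((d - scheduled).days) <= window_days for d in recorded)

day0 = date(2025, 1, 1)
day14 = day0 + timedelta(days=14)             # scheduled Day 14 visit
recorded_tests = [day0 + timedelta(days=13)]  # test actually done on Day 13

# A naive exact-date check fires a false positive;
# the window-aware check does not.
assert day14 not in recorded_tests                    # what the agent saw
assert lab_requirement_met(day14, 2, recorded_tests)  # what the protocol allowed
```

The logic is trivial once the window is in hand; the failure was that the window never reached the agent at evaluation time.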
<p>This was not a failure of logic. It was a failure of coordination. The information the agent needed was available, but not in the right place at the right time. The memory had thinned just enough between steps to turn a routine variation into a false positive.</p>
<p>From a human perspective, the decision would have been easy. A reviewer would glance at the timeline, check the visit window, and move on. But for the agent, the absence of a test on the exact date triggered a response. It did not understand flexibility unless that flexibility was made explicit in the prompt it received.</p>
<p>That small oversight rippled through the process. It triggered an unnecessary escalation, pulled attention away from genuine issues, and reminded us that autonomy without memory is not the same as understanding.</p>
<p><strong>How We Measured Success</strong></p>
<p>To understand how well the system was performing, we needed something to compare it against. So we asked clinical reviewers to go through a set of patient records and mark any protocol deviations they spotted. This gave us a reference set, a gold standard, that we could use to test the agents.</p>
<p>We then ran the same data through the system and tracked how often it matched the human reviewers. When the agent flagged something that was also noted by a reviewer, we counted it as a hit. If it missed something important or raised a false alarm, we marked it accordingly. This gave us basic measures like sensitivity and specificity: in plain terms, how good the system was at picking up real issues and how well it avoided false ones.</p>
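<p>Computing those two measures from a reviewer gold standard is a straightforward set comparison. A minimal sketch, with record identifiers invented for the example:</p>

```python
def sensitivity_specificity(gold: set[str], flagged: set[str],
                            all_records: set[str]) -> tuple[float, float]:
    tp = len(gold & flagged)                # real deviations the system caught
    fn = len(gold - flagged)                # real deviations it missed
    fp = len(flagged - gold)                # false alarms
    tn = len(all_records - gold - flagged)  # clean records left alone
    return tp / (tp + fn), tn / (tn + fp)

records = {f"rec{i}" for i in range(10)}
reviewer_deviations = {"rec1", "rec3", "rec7"}   # the gold standard
agent_flags = {"rec1", "rec3", "rec5"}           # what the system raised

sens, spec = sensitivity_specificity(reviewer_deviations, agent_flags, records)
print(sens, spec)  # here 2/3 caught, 6/7 of clean records left alone
```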
<p>But we also looked at the process itself. It was not just about whether a single agent made the right call, but whether the information made it through the chain. We tracked handovers between agents, how often a detected issue was correctly passed along, whether follow-up steps were triggered, and whether the right output was produced in the end.</p>
<p>This helped us see where the system worked as intended and where things broke down, even when the core detection was accurate. It was never just a question of getting the right answer. It was also about getting it to the right place.</p>
<p><strong>What We Changed Along the Way</strong></p>
<p>Once we understood where things were going wrong, we made a few targeted changes to steady the system.</p>
<p>First, we introduced structured memory snapshots. These acted like running notes that captured key protocol rules and exceptions at each stage. Rather than expecting every agent to remember what came before, we gave them a shared space to refer back to. This made it easier to hold onto details like visit windows or exemption clauses, even as the task moved between agents.</p>
<p>We also moved beyond rigid prompt templates. Early versions of the system leaned heavily on predefined phrasing, which limited the agents’ flexibility. Over time, we allowed the agents to generate their own sets of questions and reason through the answers independently. This gave them more space to interpret ambiguous situations and respond with a clearer sense of context, rather than relying on tightly scripted instructions. Alongside this, we rewrote prompts to be clearer and more grounded in the original trial language. Ambiguity in wording was often enough to derail performance, so small tweaks, such as phrasing things the way a study nurse might, made a noticeable difference.</p>
<p>We then added stronger handoff signals. These were markers that told the next agent what had just happened, what context was essential, and what action was expected. It was a bit like writing a handover note for a colleague. Without that, agents sometimes acted without full context or missed the point altogether.</p>
<p>Finally, we built in simple checks to track what happened after an alert was raised. Did the follow-up agent respond? Was the right report generated? If not, where did the thread break? These checks gave us better visibility into system behaviour and helped us spot patterns that weren’t obvious from the output alone.</p>
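<p>The “handover note for a colleague” idea lends itself to a small structured record passed from agent to agent. The sketch below is purely illustrative; the field names and agent names are ours, not the system’s:</p>

```python
from dataclasses import dataclass, field

@dataclass
class HandoffNote:
    """A structured handover record passed between agents, so essential
    context survives each step instead of living only in the previous
    agent's prompt. All field names here are illustrative."""
    issue_id: str
    summary: str                # what was just detected
    essential_context: dict     # e.g. visit windows, exemption clauses
    expected_action: str        # what the next agent should do
    trail: list = field(default_factory=list)  # agents that handled it

    def pass_to(self, agent_name: str) -> "HandoffNote":
        self.trail.append(agent_name)
        return self

note = HandoffNote(
    issue_id="dev-042",
    summary="No lab test recorded on Day 14",
    essential_context={"visit_window_days": 2, "tests_on": ["Day 13"]},
    expected_action="check visit window before escalating",
)
note.pass_to("monitoring_agent").pass_to("action_agent")
print(note.trail)
```

Because the trail and context travel with the issue, a lost or misrouted flag becomes visible as a broken record rather than a silent gap.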
<p>None of these changes made the system perfect. But they helped close the loop. Errors became easier to trace. Fixes became faster to test. And confidence grew that when something went wrong, we would know where to look.</p>
<p><strong>What It Taught Us</strong></p>
<p>The system did not live up to the hype, and it was not flawless, but it proved genuinely useful. It spotted patterns early. It highlighted things we might have overlooked. And, just as importantly, it changed how people interacted with the data. Rather than spending hours checking every line, reviewers began focusing on the edge cases and thinking more critically about how to respond. The role shifted from manual detective work to something closer to intelligent triage.</p>
<p>What agentic AI brought to the table was not magic, but structure. It added pace to routine checks, consistency to decisions, and visibility into what had been flagged and why. Every alert came with a traceable rationale, every step with a record. That made it easier to explain what the system had done and why, which in turn made it easier to trust.</p>
<p>At the same time, it reminded us what agents still cannot do. They do not infer the way people do. They do not fill in blanks or read between the lines. But they do follow instructions. They do handle repetition. They do maintain logic across complex checks. And in clinical research, where consistency matters just as much as cleverness, that counts for a lot.</p>
<p>This experience did not make us think agentic systems were ready to run trials alone. But it did show us they could support the process in a way that was measurable, transparent, and worth building on.</p>
<p><strong>What This Taught Us About Evaluation</strong></p>
<p>Working with agentic systems made one thing especially clear. The way most people assess language models does not prepare you for what happens when those models are placed inside a real workflow.</p>
<p>It is easy enough to test for accuracy or coherence in response to a single prompt. But those surface checks do not reflect what it takes to complete a task that unfolds over time. When an agent is making decisions, juggling memory, switching between tools, and coordinating with others, a different kind of evaluation is needed.</p>
<p>We began paying attention to the sorts of things that rarely make it into research papers. Could the agent perform the same task consistently across repeated attempts? Did it remember what had just happened a few steps earlier? When one component passed information to another, did it land correctly? Did the agent use the right tool when the moment called for it, even without being told explicitly?</p>
<p>These were not academic concerns. They were practical indicators of whether the system would hold up under pressure. So we built simple ways to track them.</p>
<p>We looked at how stable the agent remained from one run to the next. We measured how often a person needed to step in. We checked whether the agent could retrieve details it had already encountered. And we monitored how information moved through the system, from one part to another, without being lost or altered along the way.</p>
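<p>None of these signals needs heavy machinery. Run-to-run stability, for example, reduces to asking how often repeated runs of the same task agree. A minimal sketch, with outcome labels invented for the example:</p>

```python
from collections import Counter

def run_consistency(outcomes: list[str]) -> float:
    """Fraction of repeated runs that agree with the most common outcome:
    a crude stability signal for the same task run several times."""
    if not outcomes:
        return 0.0
    _, most_common = Counter(outcomes).most_common(1)[0]
    return most_common / len(outcomes)

# Five repeated runs of the same deviation check (labels illustrative).
runs = ["flag", "flag", "flag", "no_flag", "flag"]
print(run_consistency(runs))  # 0.8
```

The analogous counts for human interventions and successful handovers were tracked the same way: simple ratios over logged events.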
<p>None of this required complex metrics. But each of these signals told us more about how the system behaved in real use than any benchmark ever did.</p>
</section>
<section id="a-call-for-practical-evaluation-standards" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="a-call-for-practical-evaluation-standards"><span class="header-section-number">4</span> A Call for Practical Evaluation Standards</h2>
<p>If we want reliable ways to judge these systems, we need to start from what happens when they are used in the real world. Much of the current thinking around evaluating agentic AI remains too abstract. It often focuses on what the system is supposed to do in principle, not what it manages to do in practice. But the most useful insights emerge when things fall apart. When an agent loses track of its task, forgets what just happened, or takes an unexpected turn under pressure.</p>
<p><a href="https://sakana.ai/ai-scientist/">A recent assessment of Sakana.ai’s AI Scientist</a> made this point sharply. The system promised end-to-end research automation, from forming hypotheses to writing up results. It was an ambitious step forward. But <a href="https://arxiv.org/html/2502.14297v1">when tested</a>, it fell short in important ways. It skimmed literature without depth, misunderstood experimental methods, and stitched together reports that looked complete but were riddled with basic errors. One reviewer said it read like something written in a hurry by a student who had not done the reading. The outcome was not a failure of intent, but a reminder that sophisticated language does not always reflect sound reasoning.</p>
<p>Instead of designing evaluation methods in isolation, we should begin with real scenarios. That means observing where agents stumble, how they recover, and whether they can carry through when steps are long and outcomes matter. It means showing the messy bits, not just polished results. Tools that help us retrace decisions, inspect memory, and understand what went wrong are just as important as the outputs themselves.</p>
<p>Only by starting from lived use with its uncertainty, complexity, and human oversight, can we build evaluation methods that truly reflect what it means for these systems to be useful.</p>
</section>
<section id="closing-thoughts-from-the-field" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="closing-thoughts-from-the-field"><span class="header-section-number">5</span> Closing Thoughts from the Field</h2>
<p>Agentic AI carries genuine promise, but even a single deployment can reveal how much distance there is between ambition and execution. These systems can be impressively capable in some moments and surprisingly brittle in others. And in domains where decisions must be precise and timelines matter, that brittleness is more than an inconvenience; it introduces real risk.</p>
<p>The lessons from our experience were not abstract. They came from watching one system try to handle a demanding, high-context task and seeing where it stumbled. It was not a matter of poor design or unrealistic expectations. The complexity was built in, the kind that only becomes visible once a system moves beyond isolated prompts and into continuous workflows.</p>
<p>That is why evaluation needs to begin with real use. With lived attempts, not controlled tests. With unexpected behaviours, not just benchmark scores. As practitioners, we have a front-row seat to what breaks, what improves with small tweaks, and what truly helps. That view should help shape how the field evolves.</p>
<p>If agentic systems are to mature, the stories of where they struggled and how we adapted cannot sit on the sidelines. They are part of how progress happens. And they may be the clearest indicators of what needs to change next.</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
<dl>
<dt>About the authors</dt>
<dd>
<a href="https://www.linkedin.com/in/francis-osei-b2b02116a/"><strong>Francis Osei</strong></a> is the Lead Clinical AI Scientist and Researcher at Bayezian Limited, where he designs and builds intelligent systems to support clinical trial automation, regulatory compliance, and the safe, transparent use of AI in healthcare. His work brings together data science, statistical modelling, and real-world clinical insight to help organisations adopt AI they can understand, trust, and act on.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2025 Francis Osei
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://www.shutterstock.com/g/donut8449">khunkornStudio</a> on <a href="https://www.shutterstock.com/image-photo/ai-chatbot-technology-virtual-assistant-customer-2582430481">Shutterstock</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Osei, F. (2025). “Deploying Agentic AI: What Worked, What Broke, and What We Learned”, Real World Data Science, August 12, 2025. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/08/12/deploying-agentic-ai.html">URL</a>
</dd>
</dl>
</div>


</section>

 ]]></description>
  <category>Reproducibility</category>
  <category>Data Analysis</category>
  <category>Machine learning</category>
  <category>Statistics</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2025/08/12/deploying-agentic-ai.html</guid>
  <pubDate>Tue, 12 Aug 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/08/12/images/agentic-ai.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Defining Purposes and Uses to Support the Development of Statistical Products in a 21st Century Census Curated Data Enterprise Environment</title>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/22/development-plan-2.html</link>
  <description><![CDATA[ 





<center>
Acknowledgments: This research was sponsored by the <br> United States Census Bureau Agreement No.&nbsp;01-21-MOU-06 and <br> Alfred P. Sloan Foundation Grant No.&nbsp;G-2022-19536
</center>
<p><br> <br> <em>The views expressed in this article are those of the authors and not the Census Bureau.</em></p>
<section id="summing-it-up" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="summing-it-up"><span class="header-section-number">1</span> Summing it up</h2>
<p>We end where we began in the first article of our series. Through this four-part series, we introduced a Curated Data Enterprise (CDE) Framework (see Figure&nbsp;1) that can guide the development and dissemination of statistics broadly applicable to addressing social and economic issues while ensuring replicability and reusability. The CDE provides the scaffold for scaling the statistical product development of interest to the US Census Bureau and broadly applies to official statistics agencies <span class="citation" data-cites="keller2022bold">(Keller et al. 2022)</span>. We illustrated this through a use case on climate resiliency of skilled nursing facilities, highlighting the replicability and reusability of the capabilities that would benefit from inclusion in a CDE.</p>
<div id="fig-cde" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/22/images/figure-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: The CDE Framework starts with the purposes &amp; uses of the statistical products. The outer rectangle identifies the guiding principles for ethical, transparent, reproducible statistical product development and dissemination. The inner rectangle identifies the statistical product development steps.
</figcaption>
</figure>
</div>
<p>As noted in the first three articles, the process begins with articulating purposes and uses through stakeholder engagement and continues by leveraging that engagement, including subject matter expertise, to inform statistical product development. Eliciting purposes and uses from stakeholders and data users is facilitated by asking questions such as: &nbsp;</p>
<ol type="1">
<li><p>What questions keep you awake at night because you don’t have data insights to address them? What are those purposes and uses that you need statistical products to support?</p></li>
<li><p>How do we collaborate and engage with you to better understand your needs and help you identify gaps in understanding regarding purpose and use?</p></li>
<li><p>How do we prioritize what statistical products to develop first?</p></li>
</ol>
<p>Examples of purposes and uses that drive new statistical products include accurately measuring gig employment <span class="citation" data-cites="salvo2022gig">(Salvo et al. 2022a)</span>, migration due to extreme climate events <span class="citation" data-cites="salvo2022migration">(Salvo et al. 2022b)</span>, the various dimensions of housing affordability <span class="citation" data-cites="wu2023housing">(Wu et al. 2023)</span>, and addressing the undercount of young children <span class="citation" data-cites="Salvo2023children">(Salvo et al. 2023)</span>. Other topics that require multiple sources and types of data include creating a household living budget based on the minimum necessary to ensure an adequate standard of living <span class="citation" data-cites="lancaster2023HLB">(Lancaster et al. 2023)</span> and using this budget as a starting point for measuring insecurity across components such as food or housing <span class="citation" data-cites="montalvo2023">(Montalvo et al. 2023)</span>.</p>
</section>
<section id="developing-an-end-to-end-e2e-curation-system" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="developing-an-end-to-end-e2e-curation-system"><span class="header-section-number">2</span> Developing an end-to-end (E2E) curation system</h2>
<p>Purposes and uses defined in use cases are important to support the rapid development of statistical products. These use cases will capture the imagination of those working to address today’s critical issues and advance public understanding and trust in federal statistics.&nbsp;The preceding section provides examples of purposes and uses for which we have developed use cases.</p>
<p>Use cases are a powerful mechanism to promote methodological research to develop and implement capabilities needed in a CDE. The objectives are to undertake research projects that have the potential to create statistical products with explicit purposes and uses that will exercise the end-to-end (E2E) curation components.</p>
<p>When implemented, these proposed use cases will demonstrate a sequence of capabilities needed to build the CDE, such as agile data discovery, reusing modules and data (including synthetic data), tracking the provenance of collected and generated data, reusing synthetic data and methods to integrate many types of data, conducting statistical analysis involving heterogeneous data integration, and reviewing data and statistical results with an equity and ethics lens. These steps will be captured in an end-to-end curation system.</p>
<ol type="1">
<li><strong>Criteria for developing and evaluating use cases that will uncover the capabilities and research necessary to develop the CDE</strong></li>
</ol>
<p>Criteria are needed to evaluate use cases and to guide partnerships with researchers and stakeholders in developing and implementing the capabilities to capture in the CDE. The use cases chosen, once curated, need to provide unique insight into CDE capabilities and statistical product development. The capabilities to be developed include addressing some purpose and use that no single source of information can resolve, generating practical diagnostics to improve existing methods, creating pilot software, and validating new and improved statistical products. These criteria, developed through listening sessions and discussions with experts, guide the prioritization and selection of use cases and their evaluation after curation (see Table 2) <span class="citation" data-cites="keller2022bold">(Keller et al. 2022)</span>.</p>
<table class="caption-top table">
<caption>Table 2. Criteria for Selecting and Prioritizing Use Cases to Identify CDE Capabilities</caption>
<colgroup>
<col style="width: 100%">
</colgroup>
<tbody>
<tr class="odd">
<td><strong>Value and feasibility of the CDE approach described in the existing research (potential use case)</strong> to address emerging or long-standing issues, ie, its purpose and use over and above existing approaches to address high-priority problems.</td>
</tr>
<tr class="even">
<td><strong>Stakeholders’</strong> challenges and issues as the source of purposes and uses.</td>
</tr>
<tr class="odd">
<td><strong>Subject matter experts</strong> to advise on the approach and implementation.</td>
</tr>
<tr class="even">
<td><strong>Partners to access data</strong> from local and state governments, non-profit organizations, and the private sector, and strategies to overcome legal and administrative barriers to such access that benefit both the providers and recipients of the data.</td>
</tr>
<tr class="odd">
<td><strong>Survey, administrative, opportunity, and procedural data</strong> from multiple sources (eg, local, state, federal, third-party) to address the purpose and use (issue) in an integrated way. There are well-defined data ingestion and governance requirements.</td>
</tr>
<tr class="even">
<td><strong>Computation and measurement requirements for statistical products</strong> include the unit(s) of analysis and their characteristics, temporal sequence, geocoded location data, and methods for imputations, projections, and statistical analysis.</td>
</tr>
<tr class="odd">
<td><strong>Equity and ethical dimensions are considered</strong> at each step to ensure that the use case provides fair and accurate representation across groups and an assessment that the potential benefits outweigh the potential harm.</td>
</tr>
<tr class="even">
<td><strong>Evidence of CDE capabilities</strong> to be built, including the code, data, and documentation to create the statistical products, which can be described in the curation step.</td>
</tr>
<tr class="odd">
<td><strong>Statistical products</strong> include integrated data sources, indicators, maps, visualizations, storytelling and analysis.</td>
</tr>
<tr class="even">
<td>Potential viability of proposed <strong>dissemination platforms</strong> for interactive access to data products at all levels of data acumen <span class="citation" data-cites="keller2021acumen">(Keller and Shipp 2021)</span> while adhering to confidentiality and privacy rules.</td>
</tr>
</tbody>
</table>
<ol start="2" type="1">
<li><strong>An end-to-end curation process</strong></li>
</ol>
<p>Curation is an end-to-end process defined by the context of the purposes and uses that document the decisions and trade-offs at each step in the CDE Framework. The following curation definition will be used as it serves the CDE’s vision.</p>
<p><strong><em>Curation</em></strong> involves documenting, for each statistical product, the <strong>inputs</strong> from which the product is derived, the <strong>wrangling</strong> used to transform the information into product, and the <strong>statistical product</strong> itself. Purposes and uses provide the context for each statistic and statistical product.</p>
<p>This definition has evolved from numerous stakeholder discussions via listening sessions and discussions with Census Bureau staff. <span class="citation" data-cites="nusser2024curation faniel2019context nasem2022transparency">(Nusser et al. forthcoming; Faniel et al. 2019; NASEM 2022)</span>.</p>
<p>As use cases are curated, the CDE capabilities will evolve to quickly develop statistical products. These curated use cases are integral to developing an E2E curation process for the CDE. &nbsp;</p>
<ol start="3" type="1">
<li><strong>Invitation to contribute purpose and use ideas for developing new statistical products</strong></li>
</ol>
<p>The CDE development aims to curate a significant number of use cases that address social and economic issues and have the potential to define capabilities to be built in the CDE. Initially, the Census Bureau is seeking ideas for purposes and uses to define these use cases and statistical products.</p>
<p>The skilled nursing facility use case included code, data, and documentation to calculate the probability of workers getting to work during a weather event, resilience indicators at the county or sub-county level, alternative skilled nursing home deficiency measures, and other capabilities.</p>
<p><strong>Incorporating capabilities in the CDE</strong></p>
<p>To accelerate the development of statistical products, the Census Bureau will develop use cases to articulate and create CDE capabilities. This requires identifying those valuable nuggets for learning and quickly translating and incorporating this information into the CDE. Examples of critical capabilities of interest are learning about the utility of synthetic data, the ability to aggregate data into custom geographies, and combining different units of analysis. The expected outcome is the creation of an innovative 21<sup>st</sup> Century Census Curated Data Enterprise focused on purposes and uses that overcome the limitations and challenges of today’s survey-alone model. &nbsp;</p>
<p>The 21<sup>st</sup> Century Census Curated Data Enterprise development presents an opportunity for researchers to help drive the development of the CDE as the foundation for creating new statistical products. The US Census Bureau is seeking ideas for purposes and uses that will define new statistical products. They are interested in research projects (use cases) that are guided by the CDE framework as potential new statistical products. They want to learn from and understand your experiences in using the CDE framework, for example, what worked well, what challenges you faced, how each step in the framework was curated, and what capabilities are replicable and reusable for developing and enhancing statistical products.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/19/use-case-2.html">← Part 3: Climate resiliency of skilled nursing facilities</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Stephanie Shipp</strong> leads the Curated Data Enterprise research portfolio and collaborates with the US Census. She is an economist with experience in data science, survey statistics, public policy, innovation, ethics, and evaluation.
</dd>
<dd>
<strong>Joseph Salvo</strong> is a demographer with experience in US Census Bureau statistics and data. He presents on demographic subjects to a wide range of groups and has managed major demographic projects involving the analysis of large data sets for local applications.
</dd>
<dd>
<strong>Vicki Lancaster</strong> is a statistician with expertise in experimental design, linear models, computation, visualizations, data analysis, and interpretation.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Stephanie Shipp
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@goumbik">Lukas Blazek</a> on <a href="https://unsplash.com/photos/turned-on-black-and-grey-laptop-computer-mcSDtbWXUZU">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Shipp S, Salvo J, Lancaster V (2024). “Statistical Products in a 21<sup>st</sup> Century Census Curated Data Enterprise Environment”, Real World Data Science, November 22, 2024. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/22/development-plan-2.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-faniel2019context" class="csl-entry">
Faniel, Ixchel M, Rebecca D Frank, and Elizabeth Yakel. 2019. <span>“Context from the Data Reuser’s Point of View.”</span> <em>Journal of Documentation</em> 75 (6): 1274–97. <a href="https://doi.org/10.1108/JD-08-2018-0133">https://doi.org/10.1108/JD-08-2018-0133</a>.
</div>
<div id="ref-keller2022bold" class="csl-entry">
Keller, Sallie, Kenneth Prewitt, John Thompson, et al. 2022. <span>“A 21st Century Census Curated Data Enterprise. A Bold New Approach to Create Official Statistics. Technical Report.”</span> <em>Proceedings of the Biocomplexity Institute</em> BI-2022-1115: 297–323. <a href="https://doi.org/10.18130/r174-yk24">https://doi.org/10.18130/r174-yk24</a>.
</div>
<div id="ref-keller2021acumen" class="csl-entry">
Keller, Sallie, and Stephanie Shipp. 2021. <span>“Data Acumen in Action.”</span> <em>Notices of the American Mathematical Society</em>. <a href="https://www.ams.org/journals/notices/202109/noti2353/noti2353.html?adat=October%202021&amp;trk=2353&amp;galt=feature&amp;cat=feature&amp;pdfissue=202109&amp;pdffile=rnoti-p1468.pdf">https://www.ams.org/journals/notices/202109/noti2353/noti2353.html</a>.
</div>
<div id="ref-lancaster2023HLB" class="csl-entry">
Lancaster, V., M. Montalvo, J. Salvo, and S. Shipp. 2023. <span>“The Importance of Household Living Budget in the Context of Measuring Economic Vulnerability: A Census Curated Data Enterprise Use Case Demonstration.”</span> <em>Proceedings of the Biocomplexity Institute</em> Technical Report. TR# BI-2023-258. <a href="https://doi.org/10.18130/p43z-c742">https://doi.org/10.18130/p43z-c742</a>.
</div>
<div id="ref-montalvo2023" class="csl-entry">
Montalvo, Cesar, Vicki Lancaster, Joseph Salvo, and Stephanie Shipp. 2023. <span>“The Importance of Household Living Budget in the Context of Food Insecurity: A Census Curated Data Enterprise Use Case Demonstration.”</span> <em>Proceedings of the Biocomplexity Institute, Technical Report BI-2023-261</em>. <a href="https://doi.org/10.18130/2kgx-tv50">https://doi.org/10.18130/2kgx-tv50</a>.
</div>
<div id="ref-nasem2022transparency" class="csl-entry">
NASEM. 2022. <span>“Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies.”</span> <em>National Academies of Science, Engineering, and Medicine</em>. <a href="https://doi.org/10.1162/99608f92.17405bb6">https://doi.org/10.1162/99608f92.17405bb6</a>.
</div>
<div id="ref-nusser2024curation" class="csl-entry">
Nusser, S., S. Keller, S. Shipp, Z. Zhu, and E. Wu. Forthcoming. <span>“Curation in the Context of the Census Curated Data Enterprise (CDE).”</span> <em>TBD</em>.
</div>
<div id="ref-Salvo2023children" class="csl-entry">
Salvo, J., V. Lancaster, and S. Shipp. 2023. <span>“The Net Undercount of Children Under 5 Years of Age in the Decennial Census: An Art of the Possible Use Case.”</span> <em>Proceedings of the Biocomplexity Institute</em> Technical Report. TR# BI-2023-000. <a href="https://doi.org/10.18130/nzyj-m621">https://doi.org/10.18130/nzyj-m621</a>.
</div>
<div id="ref-salvo2022gig" class="csl-entry">
Salvo, J., S. Shipp, and S. Zhang. 2022a. <span>“Defining the Role of Gig Employment in the Post-Pandemic World of Work.”</span> <em>Proceedings of the Biocomplexity Institute</em> Technical Report BI 2022-026 (2022a). <a href="https://doi.org/10.18130/wkx0-4y46">https://doi.org/10.18130/wkx0-4y46</a>.
</div>
<div id="ref-salvo2022migration" class="csl-entry">
Salvo, J., S. Shipp, and S. Zhang. 2022b. <span>“Building a Case Study of Domestic Migration and the Curated Data TR# 2022-027 - Essential Elements.”</span> <em>Proceedings of the Biocomplexity Institute</em> Technical Report BI 2022-027 (2022b). <a href="https://doi.org/10.18130/bcwa-gt69">https://doi.org/10.18130/bcwa-gt69</a>.
</div>
<div id="ref-wu2023housing" class="csl-entry">
Wu, E., J. Salvo, V. Lancaster, and S. Shipp. 2023. <span>“Housing Affordability – an Art of the Possible Use Case to Develop the 21st Century Census Curated Data Enterprise.”</span> <em>Proceedings of the Biocomplexity Institute</em> Technical Report BI-2023-262. <a href="https://doi.org/10.18130/qgkd-va29">https://doi.org/10.18130/qgkd-va29</a>.
</div>
</div></section></div> ]]></description>
  <category>Public Policy</category>
  <category>Data Analysis</category>
  <category>Data Integration</category>
  <category>Curation</category>
  <category>Statistical Products</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/22/development-plan-2.html</guid>
  <pubDate>Fri, 22 Nov 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/22/images/figure-1.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Translating the Curated Data Model into Practice - Climate resiliency of skilled nursing facilities</title>
  <dc:creator>Vicki Lancaster, Stephanie Shipp, Sallie Keller, Henning Mortveit, Samarth Swarup, Aaron Schroeder, and Dawen Xie &lt;br /&gt; University of Virginia, Biocomplexity Institute</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/use-case-2.html</link>
  <description><![CDATA[ 





<center>
Acknowledgments: This research was sponsored by: <br> United States Census Bureau Agreement No.&nbsp;01-21-MOU-06 and <br> Alfred P. Sloan Foundation Grant No.&nbsp;G-2022-19536
</center>
<p><br> <br></p>
<section id="introduction" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction"><span class="header-section-number">1</span> Introduction</h2>
<p>Here, we demonstrate how the CDE Framework can be implemented for a research use case related to skilled nursing facilities. The framework provides the guiding principles for ethical, transparent, and reproducible research and dissemination and the research process for developing the statistical product.</p>
<p>Across the US, federally regulated skilled nursing facilities (SNFs) provide essential care, rehabilitation, and related health services to about 1.3 million people. An SNF is a facility that meets specific federal regulatory certification requirements that enable it to provide short-term inpatient care and services to patients who require medical, nursing, or rehabilitative services. Their patients can be among the most vulnerable members of our society, and yet, historically, SNFs have not been incorporated into existing emergency response systems. For example, during the 2004 Florida hurricane season, SNFs were given the same priority as day spas for restoring electricity, telephones, water, and other essential services <span class="citation" data-cites="hyer2006establishing">(Hyer et al. 2006)</span>. Even worse were the deaths of SNF residents in Louisiana following Hurricanes Katrina and Rita in 2005 <span class="citation" data-cites="dosa2008controversy">(Dosa et al. 2008)</span>. The problem has persisted: 12 SNF residents died in Florida as a result of Hurricane Irma (2017), and 15 died in Louisiana when evacuated to a warehouse during Hurricane Ida (2021). In both instances, the deaths were attributed to extreme heat and lack of electricity <span class="citation" data-cites="skarha2021association">(Skarha et al. 2021)</span>.</p>
<p>These events prompted the <span class="citation" data-cites="sheet2022protecting">(The White House 2022)</span> initiative, <em>Protecting Seniors by Improving Safety and Quality of Care in the Nation’s Nursing Homes</em>, stating, ‘All people deserve to be treated with dignity and respect and to have access to quality medical care.’</p>
<p>However, there are questions that need to be addressed to best protect SNFs and their residents. For example, how resilient are SNFs in extreme climate events? This use case demonstration shows how we built a new statistical product to address this question using the CDE Framework <span class="citation" data-cites="lancaster2023CDE">(Lancaster et al. 2023)</span>.</p>
</section>
<section id="purposes-and-uses" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="purposes-and-uses"><span class="header-section-number">2</span> Purposes and uses</h2>
<p>A skilled nursing facility (SNF) is a federally regulated nursing facility with the staff and equipment to provide skilled nursing care, skilled rehabilitation services, and other related health services <span class="citation" data-cites="cmsglossary">(Medicare &amp; Medicaid Services 2023)</span>. The context of this use case is to create a baseline picture of SNFs in Virginia and then integrate information on the risk of extreme flood events to assess facility and community preparedness – for example, how likely are the nursing staff<sup>1</sup> to make it to the facility in the event of a flood?</p>
<p>This use case has two parts. The first creates a baseline data picture of SNFs, bringing together data about the residents, nursing staff, and SNF characteristics. The second addresses two issues raised in the <span class="citation" data-cites="sheet2022protecting">(The White House 2022)</span> initiative: emergency preparedness and nurse staffing. We frame these issues into three purpose and use questions with the ultimate goal of creating statistical products that address these questions:</p>
<ol type="1">
<li><p>Can SNF workers get to work during an extreme flood event?</p></li>
<li><p>Are SNFs prepared for a flood emergency?</p></li>
<li><p>Can communities support SNFs during an emergency?</p></li>
</ol>
</section>
<section id="statistical-product-development-stages" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="statistical-product-development-stages"><span class="header-section-number">3</span> Statistical product development stages</h2>
<p><strong>Subject matter input and literature review</strong></p>
<p>The subject matter experts consulted included nursing facility administrators, SNF resident advocates, demographers, and researchers. Our discussions and literature review informed us of the many federal policies governing SNFs regarding inspections and data reporting requirements (procedural data). In addition, we were told about non-public data sources on residents and SNF staff that were aggregated to the SNF level and provided to the public under a grant from the National Institute on Aging. This information was important since we had yet to come across this source in our data discovery process. The dialogue with experts and our literature review helped us generate a ‘wish list’ of variables that informed our data discovery process, which we visualized as a conceptual data map (see Figure&nbsp;1).</p>
<div id="fig-data" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-data-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-3.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-data-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Conceptual Data Map Aligned to Purpose and Use: The conceptual data map displays the results of our data discovery. The team identified the data needs informed by expert elicitation and literature review. For this use case, data discovery took three phases: (1) create a data picture of SNF owners, nursing staff, and residents, and the communities the facilities reside in; (2) identify the potential risks of severe flood events, coastal and riverine; and (3) identify the potential weaknesses in the SNF’s and community’s ability to respond.
</figcaption>
</figure>
</div>
<p><strong>Data discovery</strong></p>
<p>Data discovery focused on identifying data sources to address the purpose and use questions and was informed by the conceptual data map.</p>
<p>For the first question – Can SNF workers get to work during an extreme flood event? – we discovered and used proprietary synthetic population, transportation route, and building data sources, along with publicly available flood data. The <a href="https://developer.here.com/documentation">HERE Premium Streets</a> proprietary data include information about roads, such as road type, speed limits, number of lanes, etc. The proprietary synthetic population data and Building Knowledge Base (BKB) are used to identify where SNF workers live and work and to map transportation routes from home to work <span class="citation" data-cites="mortveitNSSAC">(Mortveit et al. 2023)</span>. Publicly available data from the Federal Emergency Management Agency (FEMA) provided flooding risk estimates along the routes from nursing staff homes to the SNF.</p>
<p>For the second question – Are SNFs prepared for a flood emergency? – we used Centers for Medicare &amp; Medicaid Services (CMS) SNF inspection and deficiency data as a proxy for preparedness. We also examined SNF residents’ physical and mental health to assess SNF emergency preparedness. For example, if most residents faced mobility challenges, the SNF would need more resources available during an emergency to move residents to a safer facility. We used data about residents from the Long Term Care Focus <span class="citation" data-cites="brown2022ltcfocus">(LTCFocus 2022)</span> Public Use Data sponsored by the National Institute on Aging (Brown University 2022).</p>
<p>We used data to measure community resilience, assets, and risks by geography at the county, city, and census tract levels to address the third question, Can communities support SNFs during an emergency? These data included:</p>
<ul>
<li>Health professional shortage areas (HRSA 2022)</li>
<li>Shelter facilities and emergency service providers data <span class="citation" data-cites="dhs2022hifld">(Homeland Security: Geospatial Management Office 2022)</span></li>
<li>Community Resilience Indicator Analysis and National Risk Index for Natural Hazards <span class="citation" data-cites="FEMA2022a">(FEMA 2022)</span>.</li>
</ul>
<p>All data are provided in a <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/tree/main/data">GitHub</a> repository along with their metadata, except for the three proprietary data sources. Articles about how the synthetic estimates are constructed are provided for two of these proprietary data sources. The third data source was obtained from a private-sector vendor whose data and documentation are proprietary; a link is provided to their website.</p>
<p><strong>Data ingest and governance</strong></p>
<p>All the public data, metadata, code, statistical products, data processes, and relevant literature on SNF policies and regulations are stored in a <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/tree/main">GitHub</a> repository.</p>
<p>In our experience, data wrangling is the most time-consuming and challenging part of product development. This speaks directly to a benefit of the CDE: once a researcher has wrangled together multiple data sources, the result can be made available to other researchers.</p>
<p>The two predominant data wrangling issues for this use case were reconciling data sources that contain data on the same topic and creating linkages between data sources. For example, we reviewed three hospital data sources:</p>
<ol type="1">
<li><a href="https://hifld-geoplatform.opendata.arcgis.com/">Homeland Security Infrastructure Foundation-Level Data</a> (HIFLD) (DHS 2022)</li>
<li><a href="https://healthdata.gov/dataset/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/6xf2-c3ie">HealthData.gov - COVID-19 Reported Patient Impact and Hospital Capacity by State</a> (HHS 2022)</li>
<li><a href="https://vhha.com/about-virginia-hospitals/">Map of VHHA Hospital and Health System Members</a> (Virginia Hospital &amp; Healthcare Association 2022)</li>
</ol>
<p>We observed inconsistencies and omissions across the three data sources, including:</p>
<ul>
<li>non-standard hospital names and hospital classification types</li>
<li>inconsistent availability of hospital IDs (such as Medicare Provider Number)</li>
<li>conflicting geographic information, including address, latitude, and longitude.</li>
</ul>
<p>We did not attempt to reconcile these inconsistencies for the demonstration but decided to use a single source for shelter facility and emergency service provider data. We used <a href="https://hifld-geoplatform.opendata.arcgis.com/">HIFLD</a> data since they provided the most current data (DHS 2022). The use of these data reinforces the purpose of the use case – to illuminate the challenges in creating statistical products and what the Census Bureau would need to consider.</p>
<p>Similar inconsistencies made it difficult to link data sources using geographic variables. For example, we used shelter facility and emergency service provider data sources from the HIFLD – including hospitals, Red Cross chapter facilities, National Shelter System Facilities, emergency medical service stations, fire stations, and urgent care facilities – to calculate a metric for potential community support. The goal was to place each facility in a Virginia county or independent city. Virginia is divided into 95 counties and 38 independent cities, the latter considered county-equivalents for census purposes, and in some cases a county and a city share the same name (e.g., Richmond County and Richmond City, each in a different location in Virginia). It was necessary to <a href="https://en.wikipedia.org/wiki/Canonicalization">canonicalize</a> the county and city names (when available), which meant aligning upper and lower cases, removing unnecessary characters, and distinguishing between county and city.<sup>2</sup></p>
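<p>As an illustration of the canonicalization step just described (not the authors’ actual code), a minimal Python sketch might normalize raw locality strings while preserving the county versus independent-city distinction; the helper name and input formats are hypothetical:</p>

```python
import re

# Hypothetical helper (not the authors' code): canonicalize Virginia
# locality names so records from different sources can be matched,
# preserving the county vs. independent-city distinction.
def canonicalize_locality(raw: str) -> str:
    name = raw.strip().lower()
    name = re.sub(r"[.,()]", " ", name)                # drop punctuation
    name = re.sub(r"\bva\b|\bvirginia\b", " ", name)   # drop state suffixes
    name = re.sub(r"\s+", " ", name).strip()           # collapse whitespace
    if name.endswith(" city"):
        return name[:-len(" city")].strip().title() + " City"
    if name.endswith(" county"):
        return name[:-len(" county")].strip().title() + " County"
    return name.title()

print(canonicalize_locality("RICHMOND CITY"))        # Richmond City
print(canonicalize_locality("Richmond county, VA"))  # Richmond County
```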
<p>A further challenge in locating shelter facilities and emergency service providers within a county or independent city was that the sources used different variables to identify location (latitude and longitude, address, ZIP code<sup>3</sup>, Federal Information Processing Standard (FIPS) code, and county/city name). In cases where a data source had only a ZIP or FIPS code, a Department of Housing and Urban Development crosswalk was used to link the two codes; in other cases, a crosswalk that linked non-independent cities and towns to counties was used; and in others, a crosswalk that linked FIPS codes to counties and independent cities. Researchers would benefit from exhaustive crosswalks between all variables on the same topic, such as location variables, facility names, and identification numbers, to reduce the time spent on data wrangling.</p>
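<p>A hedged sketch of this crosswalk linkage, assuming a HUD-style ZIP-to-county table in which a ZIP can span several counties with a residential-address share (<code>res_ratio</code>); all names and values are illustrative:</p>

```python
import pandas as pd

# Illustrative linkage (not the authors' code): facilities with only a ZIP
# are mapped to county-level FIPS codes via a HUD-style crosswalk.
facilities = pd.DataFrame({
    "facility": ["Shelter A", "Fire Station B"],
    "zip": ["22401", "23185"],
})
# Hypothetical crosswalk rows: a ZIP can span several counties, with
# res_ratio giving each county's share of the ZIP's residential addresses.
crosswalk = pd.DataFrame({
    "zip": ["22401", "22401", "23185"],
    "county_fips": ["51630", "51177", "51830"],
    "res_ratio": [0.8, 0.2, 1.0],
})
# Keep the dominant county for each ZIP, then link.
best = crosswalk.loc[crosswalk.groupby("zip")["res_ratio"].idxmax()]
linked = facilities.merge(best[["zip", "county_fips"]], on="zip", how="left")
print(linked)
```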
<p>Data products related to popular indices, such as climate disaster risk and community resilience, are operationalized differently across the various departments and agencies within the federal and state governments and the private and non-profit sectors. It is an enormous task to review the methodology and technical reports (if available) to understand their differences and decide which versions are most relevant (fitness-for-purpose) for a particular use case. After reviewing the options for this use case, we determined that the National Risk Index for riverine and coastal floods from FEMA was the best option for climate risk estimates. The detailed technical report, <em>National Risk Index Technical Document</em> <span class="citation" data-cites="FEMA2021risk">(FEMA 2021)</span>, provides a clear assessment of the assumptions and limitations of the data and a description of how the risk estimates were derived. Researchers would benefit from guidance on the numerous constructions of indices on the same topic. A use case on a specific index topic could be used to highlight differences and similarities among indices, which would help with data wrangling and fitness-for-use. Ideally, the use case could benchmark the various constructions and provide a statistical assessment.</p>
<section id="question-1-can-snf-workers-get-to-work-during-an-extreme-flooding-event" class="level3" data-number="3.1">
<h3 data-number="3.1" class="anchored" data-anchor-id="question-1-can-snf-workers-get-to-work-during-an-extreme-flooding-event"><span class="header-section-number">3.1</span> <strong>Question 1: Can SNF workers get to work during an extreme flooding event?</strong></h3>
<p>Sufficient nursing staffing is critical to ensuring resident safety and quality of care.</p>
<p>Since proprietary synthetic population data and commercial sector digitized mapping data were used to construct the routes SNF nursing staff are likely to take from home to work, only an outline of the computational process used to identify the routes is provided. Publicly available data from FEMA were used to estimate flooding risk along a particular route. Below is a general description of the modeling steps and the proprietary data used to assess SNF vulnerability as a function of the nursing staff’s inability to report to work due to the transportation infrastructure <span class="citation" data-cites="choupani2016population">(Choupani and Mamdoohi 2016)</span>.</p>
<p><strong>Computational modules</strong></p>
<p>Here is the basic outline of the process that uses proprietary data that starts at network construction and ends with routes. For more details, see the GitHub repository: <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/blob/main/documents/products/processes/commute_vulnerability/algorithm.md">Vulnerability of SNFs concerning Commuting</a>.</p>
<ol type="1">
<li>Extract network data from HERE (2021 Q1 in this use case).</li>
<li>Process the extracted data to form a network suitable for routing. This includes inference of speed limits for road links where such data is missing.</li>
<li>Prepare origin-destination pairs. In this case, the list pairs each worker’s home and work locations. The person is constructed in the synthetic population pipeline, and residences and workplaces are derived through the data fusion process used to construct the NSSAC building database.</li>
<li>Construct routes using the Quest router.</li>
</ol>
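<p>The routing step itself relies on proprietary data and routers, but the underlying idea is shortest-path search over a weighted road network. A toy stand-in, with hypothetical locations and travel times:</p>

```python
import heapq

# Toy Dijkstra router standing in for the proprietary routers named above.
# The road network, locations, and travel times (minutes) are hypothetical.
def shortest_route(graph, origin, dest):
    # graph maps node -> list of (neighbor, minutes) edges
    heap = [(0.0, origin, [origin])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dest:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, minutes in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(heap, (cost + minutes, nxt, path + [nxt]))
    return float("inf"), []

roads = {
    "home_1": [("junction", 12.0), ("bypass", 9.0)],
    "junction": [("snf", 8.0)],
    "bypass": [("snf", 15.0)],
}
cost, route = shortest_route(roads, "home_1", "snf")
print(cost, route)  # 20.0 ['home_1', 'junction', 'snf']
```

<p>A production router handles millions of such origin-destination pairs in parallel over a statewide network; the principle is the same.</p>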
<p>Once the routes to an SNF were established, the expected number of nursing staff at an SNF during a flood event could be calculated as the sum of the probabilities of each worker being able to commute to work during a flood event. A computational model was developed using the following data:</p>
<ul>
<li>SNF locations in Virginia from the Centers for Medicare &amp; Medicaid Services (CMS);</li>
<li>Home locations of workers at each SNF assigned from the synthetic population and Building Knowledge Base <span class="citation" data-cites="beckman1996creating mortveitNSSAC">(Beckman et al. 1996; Mortveit et al. 2023)</span>;</li>
<li>Virginia road networks; and</li>
<li>FEMA census tract-level riverine and coastal flood risks.</li>
</ul>
<p>Using router software, we computed each nursing staff member’s likely route to their SNF from the HERE Virginia road network. Routers are commonly used within transportation and traffic simulators. The router software used for this demonstration is a highly parallelizable router previously developed at BI NSSAC, known as the Simba router <span class="citation" data-cites="barrett2013planning">(<span class="nocase">Barrett et al.</span> 2013)</span>.</p>
<p>The FEMA risk data provide the riverine and coastal flood risks for each census tract in Virginia. Given the routes, the FEMA riverine and coastal flood risks were used to estimate the probability of the nursing staff making it to work. The FEMA technical document <em>National Risk Index Technical Document</em> <span class="citation" data-cites="FEMA2021risk">(FEMA 2021)</span> describes how natural hazard risks are calculated. We use these risk estimates, which range from 0 to 100, divided by 100, as a proxy for the probability that a worker cannot reach the SNF. A risk of zero means there is zero probability of being unable to reach the SNF due to an extreme flood event; a risk of 100 indicates the roads are underwater, and the probability of being unable to reach the SNF is one. The maximum risks along transportation routes leading to an SNF range from 0 to 47 for riverine flooding and 0 to 40 for coastal flooding. We take the combined value of the maximum riverine and coastal flood risks along a worker’s transportation routes, divided by 100, as the worker’s probability of not getting to work during a flooding event.</p>
<p>Since we do not have data on the exact home locations of the nursing staff, we estimated how many could reach the facility by taking a random sample (whose size is the CMS average daily nursing staff<sup>4</sup> for an SNF) from the possible routes identified using the HERE Virginia road network. We calculated the average with a 95% nonparametric confidence interval. The 283 SNFs used in our research have a combined average daily nursing staff of 12,609. Using the above approach, we estimated that 10,005 (95% CI: 9,013, 10,700), or 79%, can get to work during an extreme flood event. The percentage of an individual SNF’s nursing staff who can make it to work ranges from 48% to 93%.</p>
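<p>The sampling procedure above can be sketched as follows; the route risks here are synthetic stand-ins for the FEMA-derived values, and the percentile interval is one common way to form a nonparametric confidence interval:</p>

```python
import random
import statistics

random.seed(1)  # deterministic for the sketch

# Synthetic maximum combined flood risks (0-100) for 101 possible routes;
# real values come from FEMA riverine + coastal risks along HERE routes.
route_risks = [random.uniform(0, 67) for _ in range(101)]
daily_staff = 41  # e.g., a CMS average daily nursing staff count

def expected_staff(risks, n):
    sample = random.sample(risks, n)        # one route per worker
    # risk / 100 = probability of NOT reaching the SNF
    return sum(1 - r / 100 for r in sample)

# Repeat the sampling and take percentiles for a nonparametric interval.
draws = sorted(expected_staff(route_risks, daily_staff) for _ in range(2000))
estimate = statistics.mean(draws)
lo, hi = draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]
print(f"expected staff ~ {estimate:.0f}, 95% CI [{lo:.0f}, {hi:.0f}]")
```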
<p>Figure&nbsp;2 visualizes this analysis for the 283 SNFs ordered by the observed average daily nursing staff numbers at the facility from smallest to largest, displayed using the orange line. The black line indicates the expected number in an extreme flood event and the 95% nonparametric confidence interval (grey band). The code for Figure&nbsp;2 is provided in the <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/blob/main/source_code/analyses/VA_Probability_of_Getting_to_SNF.R">GitHub</a> repository.</p>
<div id="fig-ns" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ns-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-4.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="2000">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ns-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: SNF Average Observed and Expected Average Daily Nursing Staff Numbers: The horizontal axis is ordered by the size of the nursing staff at the facility from smallest to largest. The orange line displays the observed average daily nursing staff numbers. The black line displays the estimated numbers in the event of an extreme coastal and/or riverine flood event. The grey band is the 95% nonparametric confidence interval.
</figcaption>
</figure>
</div>
<p>For example, in King George County, the SNF is Heritage Hall King George (Federal Provider Number 495300 in Figure&nbsp;3), located near the Potomac River, which opens to the Chesapeake Bay. According to CMS, the Heritage Hall King George facility has an average daily skilled nursing staff of 41. Using the HERE Virginia road network, we identified 101 routes the staff could use to reach the facility. The combined maximum coastal and riverine flood risks along these routes ranged from 5.6 to 66.7; a random sample of 41 from the 101 routes gives an average probability of reaching the facility of 0.74, with a 95% nonparametric confidence interval of [0.65, 0.80]. These were used to estimate the average number of nursing staff at the facility during a flood event, 30, along with a 95% nonparametric confidence interval of [14, 38]. Publicly available data from the Federal Emergency Management Agency (FEMA) provided flooding risk estimates along the routes from the nursing staff’s homes to the SNF, along with proprietary road and building information.</p>
<div id="fig-map" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-map-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-5.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-map-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: An Example of Nursing Staff Routes to Heritage Hall King George SNF: Routes that workers can take to work at Heritage Hall King George SNF, FPN 495300 (identified with the black oval). The risk level of each road is identified with a color: low (blue), medium-low (yellow), medium (orange), medium-high (red), and high (dark red). The risk scores are used to calculate the probability of a worker getting to work during an extreme flood event using publicly available FEMA data and proprietary road and building data.
</figcaption>
</figure>
</div>
</section>
<section id="question-2.-are-snfs-prepared-for-emergencies" class="level3" data-number="3.2">
<h3 data-number="3.2" class="anchored" data-anchor-id="question-2.-are-snfs-prepared-for-emergencies"><span class="header-section-number">3.2</span> <strong>Question 2. Are SNFs prepared for emergencies?</strong></h3>
<p>To address this question, we examined how prepared SNFs are for emergencies using annual inspection and deficiency data as a proxy for preparedness. CMS issues deficiencies to SNFs that fail to meet federal Medicare and Medicaid preparedness standards. Every deficiency is classified into one of 12 categories based on the scope and severity of the deficiency. There are two broad types of non-health-related deficiencies:</p>
<ul>
<li><p>Emergency Preparedness Deficiencies – There are four elements of emergency preparedness. They cover an emergency plan, policies and procedures, a communication plan, and training and testing.</p></li>
<li><p>Fire Life Safety Code – A set of fire protection requirements designed to provide a reasonable degree of safety from fire. They cover construction, protection, and operational features designed to provide safety from fire, smoke, and panic.</p></li>
</ul>
<p>We calculated separate Emergency Preparedness and Fire Life Safety Code deficiency indices, then combined them into a single index to measure SNF preparedness and distinguish between high- and low-performing SNFs. The computation of the indices has four steps.</p>
<ol type="1">
<li><p><em>Number of deficiencies</em>: For each SNF, the total number of deficiencies over the period 2018-2022 was divided by the number of SNF inspections over the same period to estimate the average number of deficiencies per inspection.</p></li>
<li><p><em>Time to resolve deficiencies</em>: We next computed the average number of days it took to resolve each deficiency.</p></li>
<li><p><em>Scope and severity of deficiencies</em>: We then transformed the deficiency letter inspection rating for scope and severity to a numerical weight using the CMS technical guide, <em>Care Compare Nursing Home Five-Star Quality Rating System</em> <span class="citation" data-cites="CMS2022design">(Medicare &amp; Medicaid Services 2022)</span>, and averaged the ratings.</p></li>
<li><p>The estimates from these three steps were summed to compute separate Emergency Preparedness and Fire Life Safety Code deficiency indices (see Figure&nbsp;4) and are provided for reuse in a .csv file on <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/blob/main/documents/products/processes/derived_variables/va_snf_deficiency_indices_k_e.csv">GitHub</a>.</p></li>
</ol>
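<p>A minimal sketch of these four steps in pandas might look as follows. The table of deficiency citations, its column names, and the inspection counts are all hypothetical toy values, not the actual CMS file layout.</p>

```python
import pandas as pd

# Hypothetical deficiency citations: one row per deficiency per SNF.
# Column names and values are illustrative only.
defs = pd.DataFrame({
    "snf_id": ["A", "A", "A", "B"],
    "days_to_resolve": [30, 10, 50, 5],     # step 2 input
    "severity_weight": [4, 1, 8, 1],        # step 3: numeric weight from the letter rating
})
inspections = pd.Series({"A": 3, "B": 2})   # inspections per SNF over the period

per_snf = defs.groupby("snf_id").agg(
    n_deficiencies=("days_to_resolve", "size"),
    mean_days=("days_to_resolve", "mean"),       # step 2
    mean_severity=("severity_weight", "mean"),   # step 3
)
# Step 1: average number of deficiencies per inspection (aligns on snf_id).
per_snf["defs_per_inspection"] = per_snf["n_deficiencies"] / inspections

# Step 4: sum the three components into a single deficiency index.
per_snf["index"] = (
    per_snf["defs_per_inspection"] + per_snf["mean_days"] + per_snf["mean_severity"]
)
```

<p>In the study, the same computation would be run separately for the Emergency Preparedness and the Fire Life Safety Code deficiencies before the two indices are combined.</p>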
<p>Figure&nbsp;4 displays the results of an exploratory data analysis for each index. These analyses assessed fitness-for-use; we wanted to construct an indicator with sufficient variability to discriminate between high- and low-performing SNFs. Figure&nbsp;4 shows that we accomplished this: there are SNFs with indices well outside the main body of the data. We then summed the Emergency Preparedness and Fire Life Safety Code indices and categorized the result into high, medium, low, and no deficiencies.</p>
<div id="fig-def" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-def-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-6.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="900">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-def-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;4: Exploratory Data Analysis Visualizations for the Emergency Preparedness and Fire Life Safety Code Deficiencies
</figcaption>
</figure>
</div>
</section>
<section id="question-3-can-communities-support-snfs-during-emergencies" class="level3" data-number="3.3">
<h3 data-number="3.3" class="anchored" data-anchor-id="question-3-can-communities-support-snfs-during-emergencies"><span class="header-section-number">3.3</span> <strong>Question 3: Can communities support SNFs during emergencies?</strong></h3>
<p>To answer this question, we computed a community resiliency index using the US Census American Community Survey and the guidance provided by the Homeland Security document <em>Community Resilience Indicator Analysis: County-Level Analysis of Commonly Used Indicators from Peer-Reviewed Research</em> <span class="citation" data-cites="edgemon2018community">(Edgemon et al. 2018)</span>. The index was constructed by summing the county (census tract) level percentages for the following variables:</p>
<ul>
<li>fraction employed</li>
<li>fraction with no disability</li>
<li>fraction with a high school diploma or greater</li>
<li>fraction of households with at least one vehicle</li>
<li>reversed Gini index, so all indicators point in a positive direction</li>
</ul>
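<p>As a rough illustration, the summation (with the Gini coefficient reversed so that larger values indicate more resilience) can be written as below. The tract identifiers, variable names, and percentages are invented for the example and are not drawn from the ACS.</p>

```python
import pandas as pd

# Hypothetical tract-level percentages (0-100 scale); values are illustrative.
acs = pd.DataFrame({
    "tract": ["51159.01", "51760.02"],
    "pct_employed": [58.0, 62.0],
    "pct_no_disability": [85.0, 90.0],
    "pct_hs_or_more": [88.0, 92.0],
    "pct_households_vehicle": [91.0, 78.0],
    "gini": [0.45, 0.40],                   # Gini on its usual 0-1 scale
}).set_index("tract")

# Reverse the Gini so all indicators point in a positive direction,
# and put it on the same 0-100 scale as the other components.
acs["gini_reversed"] = (1.0 - acs["gini"]) * 100.0

components = ["pct_employed", "pct_no_disability", "pct_hs_or_more",
              "pct_households_vehicle", "gini_reversed"]
acs["resilience_index"] = acs[components].sum(axis=1)
```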
<p>Figure&nbsp;5 displays the combined deficiency index (Emergency Preparedness plus Fire Life Safety Code) for each SNF, overlaid on a choropleth map of the community resilience index at the census tract level. We also examined the number of shelter facilities and emergency service providers and the availability of medical staff per 10,000 residents. We constructed isochrones to establish the travel time from each SNF to these potential sources of support. Working on this component of the use case highlighted the need for cross-agency data, pointing to the utility of future strategic partnering between the US Census Bureau, CMS, and FEMA.</p>
<div id="fig-cri" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cri-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-7.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cri-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;5: 2020 Population Resilience Composite Index for Virginia Census Tracts: The light yellow tracts are the least resilient, and the dark green are the most resilient. The locations of the 283 SNFs are identified with filled circles; orange circles are those with the highest deficiency index and grey circles are those with no deficiencies.
</figcaption>
</figure>
</div>
<p>In addition to describing the population using a resilience index, we also developed a measure to present the number of shelter facilities and emergency service providers (data from Homeland Security / Homeland Infrastructure Foundation-Level Data) and the availability of medical doctors (MDs) and doctors of osteopathic medicine (DOs) who provide direct patient care (HRSA 2022) (Figure&nbsp;6).</p>
<p>The number of MDs and DOs is used to designate primary care health professional shortage areas. HRSA defines these as contiguous areas where primary medical care professionals are overutilized, excessively distant, or otherwise inaccessible to the population of the area under consideration. Figure&nbsp;6 (bottom) shows that approximately one-third of the counties and independent cities have health professional shortage areas across their entire boundary, and another 40 percent have shortages within parts of their boundaries.</p>
<div id="fig-help" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-help-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-8.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="1000">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-help-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;6: Assessment of the number of shelter facilities and emergency service providers per 10,000 population (top) and medically underserved areas (bottom): On both maps, the lighter the color, the more in need is the population of shelter facilities and emergency services (top chart) or health professionals (bottom chart). The location of the 283 SNFs are identified with filled circles, orange circles are those with the highest deficiency index and grey circles are those with no deficiencies.
</figcaption>
</figure>
</div>
</section>
</section>
<section id="guiding-principles-for-ethical-transparent-reproducible-statistical-product-development-and-dissemination." class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="guiding-principles-for-ethical-transparent-reproducible-statistical-product-development-and-dissemination."><span class="header-section-number">4</span> Guiding principles for ethical, transparent, reproducible statistical product development and dissemination.</h2>
<p><strong>Communication</strong></p>
<p>We communicated results throughout the Demonstration Use Case research with our Census CDE Working Group (composed of former Census Bureau Directors, a former Communications Director, and academic and industry census experts), with the Census Bureau, and at conferences such as the annual Federal Committee on Statistical Methodology meeting, and we shared drafts to seek input and ideas. The discussions and presentations helped to shape ideas and advance our thinking about how best to address the purpose and use questions.</p>
<p><strong>Stakeholder engagement</strong></p>
<p>We engaged stakeholders by sharing our research and results through conference presentations at the American Community Survey Data Users Conference and the Applied Public Data Users Conference.&nbsp;We also shared this demonstration project at Listening Sessions with stakeholders as an example of statistical product development. The Listening Sessions bring together 7 to 12 stakeholders by topic (e.g., children’s health) or function (e.g., state demographers) to seek their ideas for new statistical products.</p>
<p><strong>Equity and ethics</strong></p>
<p>As described in the Introduction, ethics and equity issues drew us to develop this Use Case. Here we focus on equity and ethics vis-a-vis the data choices and analyses. With regard to ethical considerations in our data discovery process, fitness-for-purpose evaluation, and analyses, two questions arose:</p>
<ol type="1">
<li><p>What role does synthetic data have to play, and how do you benchmark it to evaluate fitness-for-purpose?</p></li>
<li><p>How do you construct and evaluate an index with the goal of identifying vulnerable populations?</p></li>
</ol>
<p>Realizing the importance of nursing staff levels, we questioned whether the synthetic data were biased and unrepresentative of SNF residents and employees. We benchmarked the synthetic SNF nursing staff numbers against those submitted quarterly to CMS and observed they were biased low, so we decided to use the CMS data. These data were used to estimate the average number of nursing staff who could reach the facility during an extreme flood event (Figure&nbsp;2).</p>
<p>In this use case, we were fortunate to have the “truth” to benchmark the synthetic data for the average daily nursing staff at each SNF. This was not the case for the home locations of the nursing staff; since we had no way to benchmark the synthetic locations, we did not use them. Ideally, we would use the actual addresses of SNF employees. Instead, we used a simulation to estimate the average risks over routes leading to the SNF. This approach could be replaced with (or benchmarked against) the Census commuting data sets (eg, <a href="https://www.census.gov/topics/employment/commuting/guidance/flows.html">Commuting Flows</a> or the <a href="https://lehd.ces.census.gov/data/">LEHD Origin-Destination Employment Statistics</a>), with the home census tract used as the starting point for each worker. For both the number of nursing staff and their home locations, it is impossible to identify potential biases that would result in the inequitable allocation of emergency rescue resources without a thorough understanding of how the synthetic data were generated.</p>
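<p>The benchmarking step described above amounts to a simple bias check: compare the synthetic estimates against the reported values for the same facilities. A sketch with made-up staffing numbers follows (the actual comparison used the quarterly CMS submissions):</p>

```python
import numpy as np

# Hypothetical average daily nursing staff for four SNFs; values are invented.
synthetic = np.array([40.0, 55.0, 32.0, 70.0])   # synthetic-data estimates
cms       = np.array([45.0, 60.0, 35.0, 76.0])   # CMS-reported "truth"

bias = (synthetic - cms).mean()     # mean error; negative => biased low
rel_bias = bias / cms.mean()        # bias relative to the CMS mean

print(f"mean bias: {bias:.2f} staff ({rel_bias:.1%} of the CMS mean)")
```

<p>A consistently negative mean error, as found for the synthetic staffing counts, is what motivated substituting the CMS data.</p>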
<p>How one evaluates the equity of an index is a more challenging task. Questions that need to be addressed include:</p>
<ol type="1">
<li><p>How do you select the variables used to construct an indicator to guide an equitable allocation of technical assistance?</p></li>
<li><p>What relationship between these variables is important?</p></li>
<li><p>What are the differences across the numerous publicly available resilience estimators? Do some lead to a more equitable allocation of technical assistance in the event of an extreme climate event?</p></li>
<li><p>How do you validate a resilience estimator?</p></li>
</ol>
<p>The technical document <em>Community Resilience Indicator Analysis: County-Level Analysis of Commonly Used Indicators from Peer-Reviewed Research</em> <span class="citation" data-cites="edgemon2018community">(Edgemon et al. 2018)</span> identified the 20 most commonly selected variables for constructing resilience estimators from peer-reviewed research. Future research will need to validate these indices against past extreme climate events.</p>
<p><strong>Privacy and confidentiality</strong></p>
<p>We did not conduct a full disclosure review. However, some data are proprietary and could not be released; instead, we describe how those data were used.</p>
<p><strong>Dissemination</strong></p>
<p>We disseminated the final version of the use case in the University of Virginia Libra Open repository <span class="citation" data-cites="lancaster2023CDE">(Lancaster et al. 2023)</span>.</p>
<p><strong>Curation</strong></p>
<p>Curation involves documenting all steps of the process so that they can be repeated, validated, reused, or extended. The final report explains the process in words. Curation must also provide the data, metadata, source code, and products. This led us to construct a GitHub repository. A <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/blob/main/README.pdf">README</a> file guides the reader through the material and provides instructions for replicating the research results. Note that the README file must be downloaded for the hyperlinks to work.</p>
</section>
<section id="using-the-snf-statistical-product" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="using-the-snf-statistical-product"><span class="header-section-number">5</span> Using the SNF statistical product</h2>
<p>This potential statistical product has many uses. Federal policymakers and administrators regulate SNFs; however, they do not always realize the cost impacts and the need for increased resources to meet these regulations. For example, by reviewing the aggregate inspection deficiency metrics, policymakers can target resources where they are most needed. Providing additional funding to pay workers more, improve their facilities, and address inspection deficiencies would improve the quality of SNFs.</p>
<p>The media and advocacy groups play a role in highlighting good and bad cases of SNF care or where communities do not have adequate assets to support SNFs during an emergency event. For example, a <em>New Yorker</em> article <span class="citation" data-cites="rafiei2022private">(Rafiei 2022)</span> highlighted how nursing homes decline dramatically when bought by private equity owners. The GAO (September 22, 2023) recently identified the need for more information about private equity ownership in CMS data – a gap that CMS needs to address. And, of course, researchers and analysts are essential for conducting research that leads to creating and improving statistical products around SNFs. By releasing a regularly scheduled SNF statistical product, the changes in SNFs over time can be monitored.</p>
</section>
<section id="what-cde-capabilities-have-this-use-case-demonstrated" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="what-cde-capabilities-have-this-use-case-demonstrated"><span class="header-section-number">6</span> What CDE capabilities has this use case demonstrated?</h2>
<p>As demonstrated by this use case, the CDE Framework is a powerful process for guiding and curating the development of statistics to address complex purposes and uses. Additionally, use cases help illuminate technical capabilities that should be present in the data enterprise to facilitate and accelerate the reuse of data and methods in the development and dissemination of new statistical products.</p>
<p>This CDE demonstration is the first of many use cases needed to define and develop CDE capabilities. Underlying each use case is the curation process. Curation documents each step, including decisions that may involve trade-offs. Curation preserves and adds value to the data. This includes organizing to facilitate data discovery and easy access; providing metadata to enable the reuse in scientific and programmatic research; enhancing the value of the data enterprise through linkages between datasets; and mapping the network of interconnections between datasets, research outputs, researchers, and institutions. Over time, a searchable curation system will be needed as a foundation for creating statistical products in the CDE.</p>
<p>The types of products from a use case that can benefit the larger community are only limited by the creativity of the researchers and stakeholders carrying out the use case. The products from this use case are reusable code; integrated data sets across diverse topics for each SNF; maps and other visualizations; statistical products such as SNF deficiency indices and various indices that measure community and SNF resilience; the probability of a worker reaching an SNF in the event of extreme flooding; and a GitHub repo that provides easy access to all these products plus relevant metadata, literature, and government documents and regulations.</p>
<p>Conducting this use case has been an eye-opening experience regarding the amount and quality of publicly available data for addressing our research questions. The statistical capabilities and products flowing from diverse use cases can only be identified as the program progresses.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/08/what-is-CDE-2.html">← Part 2: What is the CDE?</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/22/development-plan-2.html">Part 4: Census Curated Data Enterprise Environment →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Vicki Lancaster</strong> is a statistician with expertise in experimental design, linear models, computation, visualizations, data analysis, and interpretation. She works with scientists at federal agencies on projects requiring statistical skills and creativity, eg, defining skilled technical workforce using novel data sources.
</dd>
<dd>
<strong>Stephanie Shipp</strong> leads the Curated Data Enterprise research portfolio and collaborates with the US Census. She is an economist with experience in data science, survey statistics, public policy, innovation, ethics, and evaluation.
</dd>
<dd>
<strong>Sallie Keller</strong> is the Chief Scientist and Associate Director of Research and Methodology at the US Census Bureau. She is a statistician with research interest in social and decision informatics, statistics underpinnings of data science, and data access and confidentiality. Sallie Keller was at the University of Virginia when this work was conducted.
</dd>
<dd>
<strong>Aaron Schroeder</strong> has experience in the technologies and related policies of information and data integration and systems analysis, including policy and program development and implementation.
</dd>
<dd>
<strong>Henning Mortveit</strong> develops massively interacting systems and the mathematics supporting rigorous analysis and understanding of their stability and resiliency.
</dd>
<dd>
<strong>Samarth Swarup</strong> conducts research in computational social science, resiliency and sustainability, and simulation analytics.
</dd>
<dd>
<strong>Dawen Xie</strong> develops geographic information systems, visual analytics, information management systems, and databases, with a current focus on building dynamic web systems.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Stephanie Shipp
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://www.shutterstock.com/g/Ground+Picture">Ground Picture</a> on <a href="https://www.shutterstock.com/image-photo/lovely-nurse-assisting-senior-man-get-2006404274">Shutterstock</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Lancaster V, Shipp S, Keller S et al.&nbsp;(2024). “Translating the Curated Data Model into Practice - climate resiliency of skilled nursing facilities” Real World Data Science, November 19, 2024. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/use-case-2.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-barrett2013planning" class="csl-entry">
<span class="nocase">Barrett, Christopher, Keith Bisset, Shridhar Chandan, et al.</span> 2013. <span>“Planning and Response in the Aftermath of a Large Crisis: An Agent-Based Informatics Framework.”</span> <em>2013 Winter Simulations Conference (WSC)</em>, 1515–26.
</div>
<div id="ref-beckman1996creating" class="csl-entry">
Beckman, Richard J, Keith A Baggerly, and Michael D McKay. 1996. <span>“Creating Synthetic Baseline Populations.”</span> <em>Transportation Research Part A: Policy and Practice</em> 30 (6): 415–29.
</div>
<div id="ref-choupani2016population" class="csl-entry">
Choupani, Abdoul-Ahad, and Amir Reza Mamdoohi. 2016. <span>“Population Synthesis Using Iterative Proportional Fitting (IPF): A Review and Future Research.”</span> <em>Transportation Research Procedia</em> 17: 223–33.
</div>
<div id="ref-dosa2008controversy" class="csl-entry">
Dosa, David M, Kathryn Hyer, Lisa M Brown, Andrew W Artenstein, LuMarie Polivka-West, and Vincent Mor. 2008. <span>“The Controversy Inherent in Managing Frail Nursing Home Residents During Complex Hurricane Emergencies.”</span> <em>Journal of the American Medical Directors Association</em> 9 (8): 599–604. <a href="https://pubmed.ncbi.nlm.nih.gov/19083295/">https://pubmed.ncbi.nlm.nih.gov/19083295/</a>.
</div>
<div id="ref-edgemon2018community" class="csl-entry">
Edgemon, Lesley, Carol Freeman, Carmella Burdi, Trail, and Kyle Pfeiffer. 2018. <span>“Community Resilience Indicator Analysis: County-Level Analysis of Commonly Used Indicators from Peer-Reviewed Research.”</span> <em>Argonne National Laboratory</em>. <a href="https://www.researchgate.net/publication/331232094_Community_Resilience_Indicator_Analysis_County-Level_Analysis_of_Commonly_Used_Indicators_From_Peer-Reviewed_Research">https://www.researchgate.net/publication/331232094_Community_Resilience_Indicator_Analysis_County-Level_Analysis_of_Commonly_Used_Indicators_From_Peer-Reviewed_Research</a>.
</div>
<div id="ref-FEMA2021risk" class="csl-entry">
FEMA. 2021. <span>“National Risk Index Technical Documentation.”</span> Federal Emergency Management Agency. <a href="https://www.fema.gov/sites/default/files/documents/fema_national-risk-index_technical-documentation.pdf">https://www.fema.gov/sites/default/files/documents/fema_national-risk-index_technical-documentation.pdf</a>.
</div>
<div id="ref-FEMA2022a" class="csl-entry">
FEMA. 2022. <span>“Community Resilience Indicator Analysis: Commonly Used Indicators from Peer-Reviewed Research: Updated for Research Published 2003-2021.”</span> Federal Emergency Management Agency. <a href="https://www.fema.gov/sites/default/files/documents/fema_2022-community-resilience-indicator-analysis.pdf">https://www.fema.gov/sites/default/files/documents/fema_2022-community-resilience-indicator-analysis.pdf</a>.
</div>
<div id="ref-dhs2022hifld" class="csl-entry">
Homeland Security: Geospatial Management Office, Department of. 2022. <span>“Homeland Security Infrastructure Foundation-Level Data Open Data.”</span> <a href="https://hifld-geoplatform.opendata.arcgis.com/">https://hifld-geoplatform.opendata.arcgis.com/</a>.
</div>
<div id="ref-hyer2006establishing" class="csl-entry">
Hyer, Kathryn, Lisa M Brown, Amy Berman, and LuMarie Polivka-West. 2006. <span>“Establishing and Refining Hurricane Response Systems for Long-Term Care Facilities: The John a. Hartford Foundation Was the Lead Funder of a Hurricane Summit to Focus on the Neglected Needs of the Elderly.”</span> <em>Health Affairs</em> 25 (Suppl1): W407–11. <a href="https://www.healthaffairs.org/doi/full/10.1377/hlthaff.25.w407?casa_token=XbJ2j-CdtssAAAAA:USJMJsZq_jlYlQlASQt4O4OYJcq_AOKjpXOx5tTMUIZxoNVXZCzj1_ejtQyLHrnTg6B1BygFuuGZ">https://www.healthaffairs.org/doi/full/10.1377/hlthaff.25.w407?casa_token=XbJ2j-CdtssAAAAA:USJMJsZq_jlYlQlASQt4O4OYJcq_AOKjpXOx5tTMUIZxoNVXZCzj1_ejtQyLHrnTg6B1BygFuuGZ</a>.
</div>
<div id="ref-lancaster2023CDE" class="csl-entry">
Lancaster, V., S. Shipp, S. Keller, et al. 2023. <em>Census Curated Data Enterprise Use Case Demonstration: Climate Resiliency of Skilled Nursing Facilities</em>. TR 2023-53. <a href="https://doi.org/10.18130/ce97-sp05">https://doi.org/10.18130/ce97-sp05</a>.
</div>
<div id="ref-brown2022ltcfocus" class="csl-entry">
LTCFocus, Brown University. 2022. <span>“Who We Are.”</span> <a href="https://ltcfocus.org/about">https://ltcfocus.org/about</a>.
</div>
<div id="ref-CMS2022design" class="csl-entry">
Medicare &amp; Medicaid Services, Centers for. 2022. <span>“Design for Care Compare Nursing Home Five-Star Quality Rating System: Technical Users’ Guide.”</span> <a href="https://www.cms.gov/medicare/provider-enrollment-and-certification/certificationandcomplianc/downloads/usersguide.pdf">https://www.cms.gov/medicare/provider-enrollment-and-certification/certificationandcomplianc/downloads/usersguide.pdf</a>.
</div>
<div id="ref-cmsglossary" class="csl-entry">
Medicare &amp; Medicaid Services, Centers for. 2023. <span>“CMS Glossary.”</span> <a href="https://www.cms.gov/glossary?term=skilled+nursing+facility&amp;items_per_page=10&amp;viewmode=grid ">https://www.cms.gov/glossary?term=skilled+nursing+facility&amp;items_per_page=10&amp;viewmode=grid </a>.
</div>
<div id="ref-mortveitNSSAC" class="csl-entry">
Mortveit, H., D. Xie, and M. Marathe. 2023. <em>NSSAC Building Knowledge Base: Modeling and Implementation</em>.
</div>
<div id="ref-rafiei2022private" class="csl-entry">
Rafiei, Y. 2022. <span>“When Private Equity Takes over a Nursing Home.”</span> <em>New Yorker</em> 2022: 333. <a href="https://www.newyorker.com/news/dispatch/when-private-equity-takes-over-a-nursing-home">https://www.newyorker.com/news/dispatch/when-private-equity-takes-over-a-nursing-home</a>.
</div>
<div id="ref-skarha2021association" class="csl-entry">
Skarha, Julianne, Lily Gordon, Nazmus Sakib, et al. 2021. <span>“Association of Power Outage with Mortality and Hospitalizations Among Florida Nursing Home Residents After Hurricane Irma.”</span> <em>JAMA Health Forum</em> 2: e213900–213900. <a href="https://jamanetwork.com/journals/jama-health-forum/fullarticle/2786665">https://jamanetwork.com/journals/jama-health-forum/fullarticle/2786665</a>.
</div>
<div id="ref-sheet2022protecting" class="csl-entry">
The White House. 2022. <span>“Protecting Seniors by Improving Safety and Quality of Care in the Nation’s Nursing Homes.”</span> <a href="https://www.whitehouse.gov/briefing-room/statements-releases/2022/02/28/fact-sheet-protecting-seniors-and-people-with-disabilities-by-improving-safety-and-quality-of-care-in-the-nations-nursing-homes/">https://www.whitehouse.gov/briefing-room/statements-releases/2022/02/28/fact-sheet-protecting-seniors-and-people-with-disabilities-by-improving-safety-and-quality-of-care-in-the-nations-nursing-homes/</a>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Nursing staff includes medical aides and technicians, certified nursing assistants, licensed practical nurses (LPNs), LPNs with administrative duties, registered nurses (RNs), RNs with administrative duties, and the RN director of nursing.↩︎</p></li>
<li id="fn2"><p>For example, distinguishing county from city when the name is the same could be done using State/County FIPS codes. Richmond County is 51159; Richmond City is 51760.↩︎</p></li>
<li id="fn3"><p>ZIP code is a system of postal codes used by the United States Postal Service. <em>ZIP</em> was chosen to indicate mail travels more quickly when senders use the postal code.↩︎</p></li>
<li id="fn4"><p>Average Daily Nursing Staff is the daily number of Medical Aides and Technicians, CNAs, LPNs, LPNs with administrative duties, RNs, RNs with administrative duties, and RN Director of Nursing averaged over three months.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Public Policy</category>
  <category>Data Analysis</category>
  <category>Data Integration</category>
  <category>Curation</category>
  <category>Statistical Products</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/use-case-2.html</guid>
  <pubDate>Tue, 19 Nov 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/nurse-thumbnail.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Advancing Data Science in Official Statistics – What is the Curated Data Enterprise?</title>
  <dc:creator>Sallie Keller, Stephanie Shipp, Vicki Lancaster, and Joseph Salvo &lt;br /&gt; University of Virginia</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/what-is-CDE-2.html</link>
  <description><![CDATA[ 





<center>
Acknowledgments: This research was sponsored by: <br> United States Census Bureau Agreement No.&nbsp;01-21-MOU-06 and <br> Alfred P. Sloan Foundation Grant No.&nbsp;G-2022-19536
</center>
<p><br> <br></p>
<p><em>The views expressed in this perspective are those of the authors and not the Census Bureau.</em></p>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Today, official statistics – tables, reports and microdata – are produced using data from a single survey. These surveys are foundational for researchers and policymakers. However, many questions cannot be answered by surveys alone. For example, creating a picture of how prepared skilled nursing facilities (SNFs) are for climate emergencies requires wrangling all types of data about the facilities and their communities. (<em>Note: A skilled nursing facility is a facility that meets specific federal regulatory certification requirements that enable it to provide short-term inpatient care and services to patients who require medical, nursing, or rehabilitative services.</em>) This includes SNF data on the number and dates of inspections, deficiencies, residents’ mental and physical health, the number of nursing staff and where they live; community assets data on the number of shelter facilities, health professionals, and emergency service providers; and community risks data on the probability of an extreme climate event. How can we create new statistical products useful to policymakers, emergency responders, skilled nursing facility staff, and others to inform their decisions?</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Official statistics
</div>
</div>
<div class="callout-body-container callout-body">
<p>Official statistics are essential for a democratic society as they provide economic, demographic, social, and environmental data about the government, the economy, and the environment. Official statistical agencies should compile and make these statistics available impartially to honor the right to public information.</p>
<p>Objective, reliable, and accessible official statistics instill confidence in the integrity of government and public decision-making regarding a country’s economic, social, and environmental situation at national and international levels. They should be widely available and meet the needs of various users <span class="citation" data-cites="UnitedNations2024">(United Nations 2024)</span>.</p>
</div>
</div>
<p>With the explosion of available data, there is an opportunity to combine all types of information to create statistical products that address cross-cutting topics for a wide range of purposes and uses. The US Census Bureau is modernizing and transforming its enterprise system to accommodate a new way to produce statistical products that take advantage of all data types: designed surveys and censuses, public and private administrative data, opportunity data scraped from the internet, and procedural data <span class="citation" data-cites="keller2022bold">(Keller et al. 2022)</span>.</p>
<blockquote class="blockquote">
<p><em>‘We are moving towards a single enterprise, data-centric operation that enables us to funnel data from many sources in a single data lake using common collection and ingestion platforms… This is the essence of <strong>a curated data approach</strong> — assemble, assess, and fill in the gaps to create quality statistical data.’</em></p>
</blockquote>
<blockquote class="blockquote">
<p><strong>Robert Santos,</strong> Director, US Census Bureau</p>
</blockquote>
<p>This curated approach is embodied in the Curated Data Enterprise (CDE). The Curated Data Enterprise Framework in Figure&nbsp;1 provides a guide for creating statistical products that enable the full integration of data from many sources <span class="citation" data-cites="keller2020doing">(Keller et al. 2020)</span>. At the heart of the framework are the purposes and uses that provide the context and driving force for developing the statistical product. The outer rectangle in Figure&nbsp;1 identifies the guiding principles for ethical, transparent and reproducible product development and dissemination. The inner rectangle identifies the steps in the statistical product development, including integrating primary and secondary data sources. The arrows convey that the process is not always linear; rather, it is iterative, and new information may be discovered at any point, requiring reevaluating and updating prior steps. Our Social and Decision Analytics research group in the Biocomplexity Institute has developed, tested, and refined the CDE (data science) Framework through our research since 2013 <span class="citation" data-cites="keller2017building keller2020doing">(Keller et al. 2017, 2020)</span>. The proposed use of the CDE to develop statistical products at the US Census Bureau is in its early stages.</p>
<div id="fig-cde" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/images/figure-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: The CDE Framework starts with the purposes &amp; uses of the statistical products. The outer rectangle identifies the guiding principles for ethical, transparent, reproducible statistical product development and dissemination. The inner rectangle identifies the statistical product development steps.
</figcaption>
</figure>
</div>
<p>The next article in this series will put the CDE Framework into practice by demonstrating a use case on skilled nursing facilities’ preparedness for emergencies during extreme climate events. As a prelude to that article, Figure&nbsp;2 visualizes the statistical product development steps as applied in that use case.</p>
<div id="fig-ex" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ex-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/images/figure-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ex-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Example: Steps in the statistical product development for the skilled nursing facility use case. The diagram describes the steps applied to a use case on the resilience of skilled nursing facilities. Section 3 of this series describes the steps in detail.
</figcaption>
</figure>
</div>
<p>The CDE Framework’s guiding principles and research steps are described below.</p>
<p><strong>Guiding principles</strong>:</p>
<ul>
<li>Purposes and uses</li>
<li>Stakeholders</li>
<li>Curation</li>
<li>Equity and ethics</li>
<li>Privacy and confidentiality</li>
<li>Communications and dissemination</li>
</ul>
<p><strong>Research steps</strong>:</p>
<ul>
<li>Subject matter input</li>
<li>Data discovery</li>
<li>Data ingestion &amp; Governance</li>
<li>Data wrangling</li>
<li>Fitness-for-purpose</li>
<li>Statistics development</li>
</ul>
</section>
<section id="guiding-principles" class="level2">
<h2 class="anchored" data-anchor-id="guiding-principles">Guiding principles</h2>
<section id="sec-gp1" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp1">Purposes and uses</h3>
<p>The CDE is centered on developing statistical products to meet specific purposes and uses. Researchers and stakeholders propose the purposes and uses, defining the ‘why’ for developing statistics and statistical products. They include questions or issues that the statistics should be designed to support and are clarified by documented best practices, literature reviews and conversations with subject matter experts.</p>
</section>
<section id="sec-gp2" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp2">Stakeholders</h3>
<p>Stakeholders include individuals, groups, and organizations that have the potential to affect or be affected by the outcome of the research. Engaging stakeholders is crucial for fostering the connection and trust that can lead to better decision making. <span class="citation" data-cites="kujala2022stakeholder">Kujala et al. (2022)</span> best described the principle of stakeholder engagement: ‘Stakeholder engagement refers to the aims, activities, and impacts of stakeholder relations in a moral, strategic, and pragmatic manner.’ When placed within the CDE context and represented in the Framework, collaborative engagement with stakeholders occurs at all stages of product development to better understand what the final product needs to look like. Further, product development is not a linear process but occurs through successive waves of iteration with users.</p>
<p>Forming partnerships with stakeholders is instrumental in identifying requirements and implementing statistical products. This requires listening to community voices in an active engagement strategy.<sup>1</sup> Of necessity, these partnerships entail collaboration, such as creative and collaborative problem-solving workshops and the development of innovative digital tools vetted by networks of users.<sup>2</sup></p>
</section>
<section id="sec-gp3" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp3">Curation</h3>
<p>The broad meaning of curation is the act of organizing, documenting and maintaining a collection of artifacts. The artifacts of the development and dissemination of statistics or statistical products include all the components in Figure&nbsp;1, from meeting with stakeholders to formulating the purposes and uses to creating and disseminating the statistical products. Maintaining the artifacts is the essence of the CDE. <em>Every step in the process should be documented and easily accessible in a repository, for example, GitHub, for the work to be transparent and reproducible</em>. Curation in the context of the CDE is an end-to-end activity. It involves documenting the purpose and use, providing the context for acquiring, wrangling, and archiving data from many sources to support the development of statistical products. It will include metadata <span class="citation" data-cites="cannon2013">(Cannon 2013)</span>, the code used to read and write the data, and the code that ingested the data from the source and prepared it for analysis.</p>
<p><em>Curation steps</em></p>
<ul>
<li>Document the development of the research questions, why this research is important, and how it supports the purposes and uses and resulting statistical product.</li>
<li>Document the context for the purposes and uses, ie, a policy directive, stakeholder request, policy evaluation, etc.</li>
<li>What stakeholder engagement and transparency are built into the process?</li>
</ul>
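<p>One lightweight way to act on these curation steps is to keep each artifact as a machine-readable record alongside the project repository. The sketch below is a minimal illustration under assumed, hypothetical field names; it is not a Census Bureau schema.</p>

```python
import json
import datetime

def curation_record(step, description, context, artifacts):
    """Build one machine-readable curation log entry.

    All field names here are illustrative, not a standard schema.
    """
    return {
        "step": step,                # e.g., "purposes-and-uses"
        "description": description,  # why this step was taken
        "context": context,          # policy directive, stakeholder request, ...
        "artifacts": artifacts,      # files, meeting notes, code paths
        "recorded": datetime.date.today().isoformat(),
    }

log = [
    curation_record(
        step="purposes-and-uses",
        description="Assess SNF preparedness for extreme climate events",
        context="stakeholder request",
        artifacts=["notes/stakeholder-meeting.md"],
    )
]

# Serialize so the log can live in the repository (e.g., GitHub)
# and keep the work transparent and reproducible.
print(json.dumps(log, indent=2))
```

<p>Because each entry is plain JSON, the log can be versioned with the code and data it documents.</p>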
</section>
<section id="sec-gp4" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp4">Equity and ethics</h3>
<p>An ethics review ensures dialogue on this topic throughout the statistical product development and dissemination life cycle. It involves teams of researchers and stakeholders across many areas of expertise, each with its own research integrity norms and practices, which requires that ethics be woven into every aspect of the CDE. An <em>equity</em> review ensures that underserved groups are represented and that biases inherent in various data sources are acknowledged.</p>
<p><em>Curation questions</em></p>
<ul>
<li>What are the project’s expected benefits to the ‘public good’? Do they outweigh potential risks to specific sub-populations, eg, individuals, firms and their locations by different levels of geography?</li>
<li>Are there implicit assumptions and biases regarding the studied communities in framing the project and associated data sources? If yes, how will they be addressed?</li>
<li>What type of institutional approval process and contracts are needed? What statistical quality standards and confidentiality standards will be needed? For an explanation of the Institutional Review Board, see Note&nbsp;1.</li>
</ul>
<p>An ethics checklist can help with this process. Links to ethics checklists are provided below.</p>
<ul>
<li>University of Virginia, Biocomplexity Institute, <a href="https://biocomplexity.virginia.edu/sites/default/files/sda/UVA%20SDAD%20EthicsChecklist%2018May2022.pdf">Social and Decision Analytics Division Data Science Project Ethics Tool</a></li>
<li>United Kingdom Government, <a href="https://www.gov.uk/government/publications/data-ethics-framework#full-publication-update-history">Data Ethics Framework</a></li>
</ul>
</section>
<section id="sec-gp5" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp5">Privacy and confidentiality</h3>
<p>Privacy is about the individual, whereas confidentiality is about the individual’s information. Privacy refers to an individual’s desire to control their information. Confidentiality refers to the researcher’s agreement with the individual, which could be an agency like the Census Bureau, regarding how their information will be handled, managed, and disseminated <span class="citation" data-cites="keller2016does">(Keller et al. 2016)</span>. This is a guiding principle because it needs to be considered and embraced at the earliest possible stages of statistical product development and will impact dissemination choices.</p>
<p><em>Curation questions</em></p>
<ul>
<li>What steps are taken to ensure the privacy and confidentiality of the data?</li>
<li>What statistical methods (if any) are used to ensure the privacy and confidentiality of the data?</li>
<li>How do the methods chosen to protect confidentiality affect the purposes and uses of the data?</li>
<li>What stakeholder engagement and transparency are built into the process?</li>
<li>Does the context surrounding the purposes, uses, and anticipated data sources require an Institutional Review Board (IRB) review and approval? If yes, is it archived?</li>
</ul>
<div id="nte-irb" class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note&nbsp;1: Institutional Review Board
</div>
</div>
<div class="callout-body-container callout-body">
<p>In the United States, institutional review boards (IRBs) assess the ethics and safety of research studies involving human subjects, such as behavioral studies or clinical trials for new drugs or medical devices. Today, the definition of human subjects has evolved to include secondary data, such as administrative data collected for other purposes, eg, local property data collected for tax purposes.</p>
<p>The Belmont Commission was convened in the late 1970s after the ethical failures of many research projects that involved vulnerable populations surfaced. The Belmont Commission issued three principles for the conduct of ethical research:</p>
<ul>
<li><p><strong>Respect for people</strong> — treating people as autonomous and honoring their wishes</p></li>
<li><p><strong>Beneficence</strong> — understanding the risks and benefits of the study and weighing the balance between (1) doing no harm and (2) maximizing possible benefits and minimizing possible harms</p></li>
<li><p><strong>Justice</strong> — deciding if the risks and benefits of research are distributed fairly.</p></li>
</ul>
<p>These principles were translated to a set of regulations called the Common Rule that govern federally-funded research. The Belmont Commission provided the foundation for IRB principles and focused on research involving human subjects in experiments and studies. IRB approval is required to be eligible for federal grants and contracts. Many universities also require IRB review for research conducted by faculty, students, and researchers <span class="citation" data-cites="shipp2023making">(Shipp et al. 2023)</span>.</p>
</div>
</div>
</section>
<section id="sec-gp6" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp6">Communication and dissemination</h3>
<p><em>Communication</em> involves sharing data, statistical method choices, well-documented code, and working papers; <em>dissemination</em> occurs through research team meetings, stakeholder engagements, conference presentations, publications, webinars, websites, and social media. As a principle, communication and dissemination are critical to ensure that statistical product development processes and findings are transparent and reproducible <span class="citation" data-cites="berman2016realizing">(<span class="nocase">Berman et al.</span> 2016)</span>. An essential facet of this step is to tell the story of the analysis by conveying the context, purpose, and implications of the research and findings <span class="citation" data-cites="berinato2019data wing2019data nasem2022transparency">(Berinato 2019; Wing 2019; NASEM 2022)</span>.</p>
<p><em>Curation questions</em></p>
<ul>
<li>Are the meeting notes, statistical products, code, reports, and presentations archived in a repository?</li>
<li>Briefly describe what did not work in this process, eg, data wrangling challenges where data sources could not be integrated, data source changes after a fitness-for-purpose assessment, analyses that were changed because assumptions were not met, etc.</li>
<li>Have project methods and outputs been made as transparent as possible?</li>
<li>Are the potential limitations of the research clearly presented?</li>
<li>Should the research be used as the basis for an institutional or policy action? Why or why not?</li>
<li>Have the predicted benefits and social costs to all potentially affected communities been considered?</li>
</ul>
</section>
</section>
<section id="research-steps" class="level2">
<h2 class="anchored" data-anchor-id="research-steps">Research steps</h2>
<section id="sec-rs1" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs1">Subject matter input</h3>
<p>Subject matter (domain) expertise plays a role in translating the information acquired into understanding the underlying phenomena in the data <span class="citation" data-cites="box1978statistics">(<span class="nocase">Box et al.</span> 1978)</span>. Domain knowledge provides the context to define, evaluate and interpret the findings at each research stage <span class="citation" data-cites="leonelli2019data snee2014follow">(Leonelli 2019; Snee et al. 2014)</span>. Subject matter input can be obtained through a review of the literature, talking to experts, or learning about their work at conferences or other convenings. Subject matter experts are different from stakeholders; both provide important input to identifying and clarifying purposes and uses.</p>
<p><em>Curation steps</em></p>
<ul>
<li>Document the meetings with subject matter experts and stakeholders.</li>
<li>Document the literature search methods and the results of the literature review.</li>
<li>Document choices made during the development of the products.</li>
<li>Were subject matter experts and stakeholders recruited from underrepresented groups?</li>
</ul>
</section>
<section id="sec-rs2" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs2">Data discovery</h3>
<p>Data discovery identifies potential sources that address the research goals defined by purposes and uses. Data sources include the following types <span class="citation" data-cites="keller2020doing">(Keller et al. 2020)</span>.</p>
<ol type="1">
<li><p>Designed data are collected using statistically designed methods, such as surveys, censuses, and data generated from an experimental or quasi-experimental design, such as a clinical trial or agricultural field study.</p></li>
<li><p>Administrative data are collected for the administration of an organization or program by entities such as government agencies.</p></li>
<li><p>Opportunity data are derived from internet-based information, such as websites, wearable and other sensor devices, and social media, and captured through application programming interfaces (APIs) and web scraping, eg, geocoded place-based data, transportation routes, and other data sources.</p></li>
<li><p>Procedural data are processes and policies, such as a change in health care coverage, a data repository policy outlining procedures and the metadata required to store data, or a responsible AI policy.</p></li>
</ol>
<p>The goal of the data discovery process is to think broadly and imaginatively about all data types and to capture the variety of data sources that could be useful for the problem. There are three steps in the data discovery process <span class="citation" data-cites="keller2016does">(Keller et al. 2016)</span>.</p>
<ol type="1">
<li><p>Identify potential data sources and make an inventory.</p></li>
<li><p>Create a set of questions to screen the data sources to ensure the data meet the criteria for use.</p></li>
<li><p>Select and acquire the data sources that meet the screening criteria.</p></li>
</ol>
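<p>The three discovery steps can be sketched as a small screening routine. Everything below, the example sources and the screening criteria alike, is hypothetical, chosen only to illustrate the inventory, screen, and select flow.</p>

```python
# Step 1: an inventory of candidate data sources (hypothetical entries).
inventory = [
    {"name": "SNF inspections", "type": "administrative",
     "coverage": "national", "documented": True},
    {"name": "Shelter locations scrape", "type": "opportunity",
     "coverage": "state", "documented": False},
    {"name": "Climate risk index", "type": "designed",
     "coverage": "national", "documented": True},
]

# Step 2: screening questions, expressed as predicates a source must pass.
screens = [
    lambda s: s["documented"],              # is metadata available?
    lambda s: s["coverage"] == "national",  # does coverage match the purpose?
]

# Step 3: select and acquire the sources that meet every screening criterion.
selected = [s for s in inventory if all(check(s) for check in screens)]
print([s["name"] for s in selected])  # ['SNF inspections', 'Climate risk index']
```

<p>In practice the screening questions would come from the purposes and uses and from stakeholder input, not from a fixed list.</p>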
<p><em>Curation steps</em></p>
<ul>
<li>Describe your data discovery process and reasoning behind the selected data sources.
<ul>
<li>Do underrepresented groups have adequate geographic coverage? If not, are there methods, such as synthetic data, you can use to provide adequate coverage?</li>
<li>Have checks and balances been established to identify and address implicit biases in the data and interpretation of the data? Has the team engaged in discussion and provided insights across their diverse perspectives?</li>
</ul></li>
<li>Describe the assumptions that need to be made to use these data sources.</li>
<li>Identify and document the paradata and metadata that describe each data source. Paradata describe how the data were collected, while metadata are ‘data about data’: information about the data’s content, data dictionaries and technical documents that will help the user assess its fitness for purpose <span class="citation" data-cites="cannon2013 nasem2022transparency">(Cannon 2013; NASEM 2022)</span>.</li>
<li>Discuss data sources you would have used if they were available.</li>
</ul>
</section>
<section id="sec-rs3" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs3">Data ingest and governance</h3>
<p>Data ingestion is the process of bringing data into the data management platform(s) for use. Data governance establishes and adheres to rules and procedures regarding data access, dissemination and destruction.</p>
<p><em>Curation steps</em></p>
<ul>
<li>Document policies and institutional agreements for data use.
<ul>
<li>Have team members reviewed data use agreements, standard operating procedures (SOPs), and data management plans? Are they fair?</li>
<li>Do additional procedures need to be defined for this project?</li>
</ul></li>
<li>Document the code and processes used to ingest the data sources and manage governance.</li>
</ul>
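<p>A minimal sketch of ingestion with a provenance record is shown below. The file name and provenance fields are assumptions for illustration; a production platform would also record the data use agreement and access rules that govern the source.</p>

```python
import csv
import datetime
import hashlib
import io

def ingest(raw_bytes, source_name):
    """Ingest one delimited file and record minimal provenance."""
    rows = list(csv.DictReader(io.StringIO(raw_bytes.decode("utf-8"))))
    provenance = {
        "source": source_name,
        # A checksum supports fixity checks during governance reviews.
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "rows": len(rows),
        "ingested": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    return rows, provenance

raw = b"facility_id,beds\nF001,120\nF002,85\n"
rows, prov = ingest(raw, "snf_capacity.csv")
print(prov["rows"])  # 2
```

<p>Keeping the checksum and row count beside the ingested data makes it possible to verify later that the archived source is the one actually analyzed.</p>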
</section>
<section id="sec-rs4" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs4">Data wrangling</h3>
<p>Data wrangling includes the activities of profiling, preparing, linking and exploring data, used to assess the data’s quality and representativeness and what analyses the data can support.</p>
<table class="caption-top table">
<caption>Table 1. Activities of data wrangling</caption>
<colgroup>
<col style="width: 31%">
<col style="width: 14%">
<col style="width: 30%">
<col style="width: 21%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: center;">Profiling</th>
<th style="text-align: center;">Preparing</th>
<th style="text-align: center;">Linking</th>
<th style="text-align: center;">Exploring</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;"><ul>
<li>data quality</li>
<li>data structure</li>
<li>meta data, paradata, and provenance</li>
</ul></td>
<td style="text-align: center;"><ul>
<li>cleaning</li>
<li>transforming</li>
<li>structuring</li>
</ul></td>
<td style="text-align: center;"><ul>
<li>ontology selection &amp; alignment</li>
<li>entity resolution / harmonization</li>
</ul></td>
<td style="text-align: center;"><ul>
<li>visualizations</li>
<li>descriptive statistics</li>
<li>characterizations</li>
</ul></td>
</tr>
</tbody>
</table>
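<p>The four activities in Table 1 can be illustrated in a few lines. The records and field names below are hypothetical; real SNF data would demand far more careful entity resolution and imputation choices.</p>

```python
import statistics

# Two hypothetical sources keyed on facility_id.
inspections = [
    {"facility_id": "F001", "deficiencies": "3"},
    {"facility_id": "F002", "deficiencies": ""},   # missing value
    {"facility_id": "F003", "deficiencies": "7"},
]
staffing = {"F001": 45, "F002": 30, "F003": 52}

# Profiling: how complete is each field?
missing = sum(1 for r in inspections if r["deficiencies"] == "")
print(f"missing deficiency counts: {missing} of {len(inspections)}")

# Preparing: clean and type-convert, making the missing-data choice explicit.
for r in inspections:
    r["deficiencies"] = int(r["deficiencies"]) if r["deficiencies"] else None

# Linking: a simple exact-key join; real linkage may need harmonization.
linked = [{**r, "staff": staffing.get(r["facility_id"])} for r in inspections]

# Exploring: descriptive statistics on the non-missing values.
values = [r["deficiencies"] for r in linked if r["deficiencies"] is not None]
print(statistics.mean(values))  # 5
```

<p>Each choice made here, how missing values are coded, which key links the sources, belongs in the curation record for the wrangling step.</p>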
<p><em>Curation steps</em></p>
<ul>
<li>Describe any data quality issues within the stated purpose and use context and how they were resolved. This can include statistical solutions like imputing missing data, identifying outliers or constructing synthetic populations.
<ul>
<li>How representative are the data?</li>
<li>What populations are and are not covered?</li>
</ul></li>
<li>Describe any issues with the wrangling process and how they were resolved.</li>
<li>Document the code used to wrangle the data and describe how it was validated.</li>
<li>Document assumptions made regarding the transformation and use of the data.</li>
</ul>
</section>
<section id="sec-rs5" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs5">Fitness-for-purpose</h3>
<p>Fitness-for-purpose starts with assessing the constraints imposed on the data by the particular statistical methods used and the population to which the inferences extend. It is a function of the modeling approach and of the models’ data quality and data coverage (representativeness) needs. The statistical product’s ‘fitness-for-purpose’ involves those on the receiving end of the data helping identify issues germane to the data application, such as identifying biases affecting equity. For example, given known differences in their availability, does using administrative records lead to better modeling outcomes for some groups more than others? What can be done to compensate for such bias?</p>
<p><em>Curation steps</em></p>
<ul>
<li>Document the constraints and limitations of the data.
<ul>
<li>What are the limitations of the results? Are the results useful, given the purpose of the study?</li>
</ul></li>
<li>Discuss the populations to which any inferences will generalize.
<ul>
<li>Do the statistical results support the potential benefits of the study previously stated?</li>
<li>Do any data require revisiting the question of potential biases being introduced through the choice of data sets and variables?</li>
</ul></li>
</ul>
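<p>One concrete fitness-for-purpose check is to compare group coverage in the assembled data against a benchmark population. The sketch below uses hypothetical shares and a hypothetical tolerance; the point is the comparison, not the numbers.</p>

```python
# Hypothetical benchmark shares vs. shares observed in the assembled data.
benchmark = {"urban": 0.80, "rural": 0.20}
observed = {"urban": 0.90, "rural": 0.10}

def coverage_gaps(benchmark, observed, tolerance=0.05):
    """Flag groups whose observed share deviates from the benchmark
    by more than the stated tolerance."""
    return {
        group: round(observed.get(group, 0.0) - share, 3)
        for group, share in benchmark.items()
        if abs(observed.get(group, 0.0) - share) > tolerance
    }

gaps = coverage_gaps(benchmark, observed)
print(gaps)  # {'urban': 0.1, 'rural': -0.1}
```

<p>A flagged gap, here the under-representation of rural facilities, would prompt revisiting the data sources or applying a statistical adjustment before inferences are drawn.</p>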
</section>
<section id="sec-rs6" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs6">Statistics development</h3>
<p>The development of statistics and statistical products for dissemination is a function of the research questions, the data’s limitations and the assumptions of the statistical method(s) used.</p>
<p><em>Curation steps</em></p>
<ul>
<li>Describe the statistical methods planned and used and how the method assumptions were evaluated.</li>
<li>Discuss the conclusions of the statistical analyses and any inferences that can be made from the disseminated statistical products.</li>
<li>Discuss how the statistics support the purposes and uses driving the development of the products.</li>
</ul>
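<p>As a small illustration of evaluating a method’s assumptions before disseminating a statistic: the mean with an approximate 95% interval below is only reported after a crude symmetry check. The data, the proxy for skewness, and the threshold are all assumptions made for illustration.</p>

```python
import statistics

def summarize(values, skew_threshold=0.3):
    """Report the mean with an approximate 95% interval, after checking
    a crude symmetry assumption (mean vs. median, in SD units)."""
    mean, median = statistics.mean(values), statistics.median(values)
    sd = statistics.stdev(values)
    skew_proxy = abs(mean - median) / sd if sd else 0.0
    if skew_proxy > skew_threshold:
        raise ValueError("distribution looks skewed; the mean may mislead")
    se = sd / len(values) ** 0.5
    return {"mean": mean, "lower": mean - 2 * se, "upper": mean + 2 * se}

print(summarize([4, 5, 5, 6, 6, 7, 7, 8]))
```

<p>Recording why an assumption check passed (or what was done when it failed) is itself a curation step supporting the disseminated product.</p>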
<p>Here, we have defined the CDE and provided a conceptual walk through of the framework from Figure&nbsp;1. In the next article, we will put the CDE Framework into practice through a demonstration use case on the resilience of skilled nursing facilities.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/01/policy-problem.html">← Part 1: The policy problem</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/19/use-case-2.html">Part 3: Climate resiliency of skilled nursing facilities →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<p><strong>Sallie Keller</strong> is the Chief Scientist and Associate Director of Research and Methodology at the US Census Bureau. She is a statistician with research interest in social and decision informatics, statistics underpinnings of data science, and data access and confidentiality. Sallie Keller was at the University of Virginia when this work was conducted.</p>
</dd>
<dd>
<p><strong>Stephanie Shipp</strong> leads the Curated Data Enterprise research portfolio and collaborates with the US Census. She is an economist with experience in data science, survey statistics, public policy, innovation, ethics, and evaluation.</p>
</dd>
<dd>
<p><strong>Vicki Lancaster</strong> is a statistician with expertise in experimental design, linear models, computation, visualizations, data analysis, and interpretation. She works with scientists at federal agencies on projects requiring statistical skills and creativity, eg, defining skilled technical workforce using novel data sources.</p>
</dd>
<dd>
<strong>Joseph Salvo</strong> is a demographer with experience in US Census Bureau statistics and data. He presents on demographic subjects to a wide range of groups and has managed major demographic projects involving the analysis of large data sets for local applications.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
<p>© 2024 Stephanie Shipp</p>
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://www.shutterstock.com/g/Chaay_Tee">Chay_Tee</a> on <a href="https://www.shutterstock.com/image-photo/back-rear-view-young-asian-woman-2170748613">Shutterstock</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Keller S, Shipp S, Lancaster V, Salvo J (2024). “Advancing Data Science in Official Statistics – What is the Curated Data Enterprise?” Real World Data Science, November 8, 2024. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/what-is-CDE-2.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>
</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-berinato2019data" class="csl-entry">
Berinato, Scott. 2019. <span>“Data Science and the Art of Persuasion: Organizations Struggle to Communicate the Insights in All the Information They’ve Amassed. Here’s Why, and How to Fix It.”</span> <em>Harvard Business Review</em> 97 (1). <a href="https://hbr.org/2019/01/data-science-and-the-art-of-persuasion">https://hbr.org/2019/01/data-science-and-the-art-of-persuasion</a>.
</div>
<div id="ref-berman2016realizing" class="csl-entry">
<span class="nocase">Berman, Francine, Rob Rutenbar, Henrik Christensen, et al.</span> 2016. <span>“Realizing the Potential of Data Science: Final Report from the National Science Foundation Computer and Information Science and Engineering Advisory Committee Data Science Working Group.”</span> <em>National Science Foundation Computer and Information Science and Engineering Advisory Committee Report</em>.
</div>
<div id="ref-box1978statistics" class="csl-entry">
<span class="nocase">Box, George EP, William H Hunter, Stuart Hunter, et al.</span> 1978. <em>Statistics for Experimenters</em>. Vol. 664. John Wiley; sons New York.
</div>
<div id="ref-cannon2013" class="csl-entry">
Cannon, Sandra. 2013. <em>Defining <span>“Core”</span> Metadata: What Is Needed to Make Data Discoverable. Paper Presented at the Federal CASIC Workshops (Survey Uses of Metadata)</em>. <a href="https://www.census.gov/fedcasic/fc2013/">https://www.census.gov/fedcasic/fc2013/</a>.
</div>
<div id="ref-keller2017building" class="csl-entry">
Keller, Sallie, Vicki Lancaster, and Stephanie Shipp. 2017. <span>“Building Capacity for Data-Driven Governance: Creating a New Foundation for Democracy.”</span> <em>Statistics and Public Policy</em> 4 (1): 1–11.
</div>
<div id="ref-keller2022bold" class="csl-entry">
Keller, Sallie, Kenneth Prewitt, John Thompson, et al. 2022. <span>“A 21st Century Census Curated Data Enterprise. A Bold New Approach to Create Official Statistics. Technical Report.”</span> <em>Proceedings of the Biocomplexity Institute</em> BI-2022-1115: 297–323. <a href="https://doi.org/10.18130/r174-yk24">https://doi.org/10.18130/r174-yk24</a>.
</div>
<div id="ref-keller2020doing" class="csl-entry">
Keller, Sallie, Stephanie S Shipp, Aaron D Schroeder, and Gizem Korkmaz. 2020. <span>“Doing Data Science: A Framework and Case Study.”</span> <em>Harvard Data Science Review</em> 2 (1). <a href="https://doi.org/10.1162/99608f92.2d83f7f5">https://doi.org/10.1162/99608f92.2d83f7f5</a>.
</div>
<div id="ref-keller2016does" class="csl-entry">
Keller, Sallie, Stephanie Shipp, and Aaron Schroeder. 2016. <span>“Does Big Data Change the Privacy Landscape? A Review of the Issues.”</span> <em>Annual Review of Statistics and Its Application</em> 3: 161–80. <a href="https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-041715-033453">https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-041715-033453</a>.
</div>
<div id="ref-kujala2022stakeholder" class="csl-entry">
Kujala, Johanna, Sybille Sachs, Heta Leinonen, Anna Heikkinen, and Daniel Laude. 2022. <span>“Stakeholder Engagement: Past, Present, and Future.”</span> <em>Business &amp; Society</em> 61 (5): 1136–96. <a href="https://doi.org/10.1177/00076503211066595">https://doi.org/10.1177/00076503211066595</a>.
</div>
<div id="ref-leonelli2019data" class="csl-entry">
Leonelli, Sabina. 2019. <em>Data Governance Is Key to Interpretation: Reconceptualizing Data in Data Science</em>. <a href="https://doi.org/10.1162/99608f92.17405bb6">https://doi.org/10.1162/99608f92.17405bb6</a>.
</div>
<div id="ref-nasem2022transparency" class="csl-entry">
NASEM. 2022. <span>“Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies.”</span> <em>National Academies of Science, Engineering, and Medicine</em>. <a href="https://doi.org/10.1162/99608f92.17405bb6">https://doi.org/10.1162/99608f92.17405bb6</a>.
</div>
<div id="ref-shipp2023making" class="csl-entry">
Shipp, Stephanie, Donna LaLonde, and Wendy Martinez. 2023. <span>“Making Ethical Decisions Is Hard!”</span> <em>CHANCE</em> 36 (4): 42–50. <a href="https://www.tandfonline.com/eprint/D5KR3XFRUG2QV4FVCKQI/full?target=10.1080/09332480.2023.2290955">https://www.tandfonline.com/eprint/D5KR3XFRUG2QV4FVCKQI/full?target=10.1080/09332480.2023.2290955</a>.
</div>
<div id="ref-snee2014follow" class="csl-entry">
Snee, Ronald D, Richard D DeVeaux, and Roger W Hoerl. 2014. <span>“Follow the Fundamentals.”</span> <em>Quality Progress</em> 47 (1): 24–28. <a href="https://search-proquest-com.proxy01.its.virginia.edu/docview/1491963574?accountid=14678">https://search-proquest-com.proxy01.its.virginia.edu/docview/1491963574?accountid=14678</a>.
</div>
<div id="ref-UnitedNations2024" class="csl-entry">
United Nations. 2024. <em>Development of a National Statistical System, Principle 1 - Relevance, Impartiality and Equal Access</em>. <a href="https://unstats.un.org/unsd/goodprac/bpaboutpr.asp?RecId=1">https://unstats.un.org/unsd/goodprac/bpaboutpr.asp?RecId=1</a>.
</div>
<div id="ref-wing2019data" class="csl-entry">
Wing, Jeannette M. 2019. <span>“The Data Life Cycle.”</span> <em>Harvard Data Science Review</em> 1 (1): 6. <a href="https://doi.org/10.1162/99608f92.e26845b4">https://doi.org/10.1162/99608f92.e26845b4</a>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p><a href="https://www.census.gov/newsroom/blogs/director/2023/01/a-look-ahead-2023.html" class="uri">https://www.census.gov/newsroom/blogs/director/2023/01/a-look-ahead-2023.html</a>&nbsp;↩︎</p></li>
<li id="fn2"><p><a href="https://www.census.gov/partners/act.html" class="uri">https://www.census.gov/partners/act.html</a>↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Public Policy</category>
  <category>Data Analysis</category>
  <category>Data Integration</category>
  <category>Curation</category>
  <category>Statistical Products</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/what-is-CDE-2.html</guid>
  <pubDate>Fri, 08 Nov 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/images/screen.thumbnail.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Advancing Data Science in Official Statistics – The Policy Problem</title>
  <dc:creator>Sallie Keller, Stephanie Shipp, Vicki Lancaster and Joseph Salvo &lt;br /&gt; University of Virginia</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/01/policy-problem.html</link>
  <description><![CDATA[ 





<center>
Acknowledgments: This research was sponsored by: <br> United States Census Bureau Agreement No.&nbsp;01-21-MOU-06 and <br> Alfred P. Sloan Foundation Grant No.&nbsp;G-2022-19536
</center>
<p><br> <br> <em>The views expressed in this article are those of the authors and not the Census Bureau.</em></p>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Two centuries ago, when the Framers of the US Constitution laid the cornerstone for the federal statistical system, they could not have imagined the complexity of questions future generations would want to ask or the variety of data sources available to address them. Back in 1787, counting the population and apportioning state seats in the House of Representatives were the most urgent tasks before the young nation, and so a requirement for a decennial census was written into the Constitution. Today, the census continues to serve its original purpose – but the purposes and uses for census data have exploded.</p>
<p>Questions we now seek to answer go beyond what the census (or surveys) alone can hope to address. Even with the multitude of other surveys commissioned by today’s US Census Bureau, researchers and policymakers find themselves looking to novel sources of data – from structured numeric data in traditional databases to unstructured text documents scraped from the internet – to explore issues such as how prepared nursing homes and communities are for extreme climate events, eg, hurricanes, wildfires, or floods. Wrangling these sources together with traditionally designed data, such as censuses and surveys, can fill data gaps, improve the quality and usefulness of statistical products, speed up their dissemination, and inspire the creation of new types of statistical products.</p>
<p>That is the impetus for developing the Curated Data Enterprise (CDE), an innovation in data science aimed at creating statistical products from all data types and building the infrastructure to support them. The Curated Data Enterprise, as the name implies, includes an end-to-end curation model to capture the complete statistical product development process. The CDE is designed to enable data discovery and retrieval, data quality assessment across multiple and diverse sources of information, and the reuse of data and models over time to accelerate statistical product development. The US Census Bureau has partnered with the University of Virginia, a working group of former Census Bureau Directors, a Communication Director, and university, non-profit and industry experts to develop this approach.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>The US <a href="https://www.census.gov/">Census Bureau</a>
</div>
</div>
<div class="callout-body-container callout-body">
<p>The US Census Bureau provides the latest official statistics, facts, and figures about America’s people, places, and economy. It collects data for 130 surveys annually and the decennial census that gives the Bureau its name. The US Census Bureau collects data from households, businesses, governments and non-profit organizations. For each survey, tabulations and margins of error are published in news releases and reports. Public-use microdata subject to disclosure rules are provided for household and demographic surveys. Microdata for economic and household surveys, without disclosure rules applied, are accessible to researchers through the <a href="https://www.census.gov/about/adrm/fsrdc.html">Federal Statistical Research Data Centers</a>.</p>
<p>Statistical agencies in other countries are also modernizing their surveys and statistical product development. See a summary of selected countries <span class="citation" data-cites="Lanman2023">(Lanman et al. 2023)</span>.</p>
</div>
</div>
</section>
<section id="a-new-approach" class="level2">
<h2 class="anchored" data-anchor-id="a-new-approach">A new approach</h2>
<p>To realize the CDE vision, the development of statistical products will address stakeholder questions using all data types – designed surveys and censuses, public and private administrative data, opportunity data scraped from the internet and procedural data <span class="citation" data-cites="keller2022bold">(Keller et al. 2022)</span>. This new approach aligns with the US Census Bureau’s modernization and transformation <span class="citation" data-cites="thieme2022technology">(Thieme 2022)</span> while maintaining the fundamental responsibilities of statistical agencies <span class="citation" data-cites="management2023fundamentals">(OMB 2023)</span>. It is also consistent with a conclusion by the NASEM <em>Panel on the Implications of Using Multiple Data Sources for Major Survey Programs</em>: ‘The quality of statistics produced from multiple data sources depends on properties of the individual sources as well as the methods used to combine them. A new framework of quality standards and guidelines is needed to evaluate such data sources’ fitness for use’ <span class="citation" data-cites="NASEM2023">(NASEM 2023, 192)</span>.</p>
<p>The CDE approach provides such a framework to address many of the challenges that official statistics face today, as well as demonstrate that they are poised to adopt a new approach to producing official statistics. For example:</p>
<ul>
<li><p>The timeliness and frequency of our official statistics are insufficient when there are shocks to the economy, such as the Covid-19 pandemic, when retrospective survey data were of limited usefulness. Federal agencies responded during the pandemic with relevance and agility by creating and launching fast-response Household Pulse Surveys that met immediate needs for data, trading off quality for timeliness <span class="citation" data-cites="Groshen2021Future">(Groshen 2021)</span>. Public engagement and support for these new relevant and timely data products at a time of crisis were essential to the success of this new statistical product.</p></li>
<li><p>The policy environment has responded to technological, social, and survey changes by encouraging efficient use of existing data, reuse, sharing and furthering open data principles. Researchers are now creating innovative statistical products using multiple data sources to better address the US’s needs and interests. The Commission on Evidence-Based Policymaking <span class="citation" data-cites="abraham2018promise">(Abraham et al. 2018)</span> and the Federal Data Strategy <span class="citation" data-cites="FedDataStrat">(<span>“Federal Data Strategy, Leveraging Data as a Strategic Asset”</span> 2021)</span> recommendations encourage agencies to permit access to data to undertake evaluation and research studies.</p></li>
<li><p>Techniques such as rapid scanning, text recognition, user-friendly uploads, and new devices, sensors, and systems can now record and transcribe data in real time. Using these techniques, governments and corporations now routinely and instantaneously collect and store data on behaviors and states as varied as purchase transactions, climate and road conditions, healthcare plan utilization, and land use and zoning. Extensive digitization and recording, better system connectedness and interactivity, and increased human-computer interaction can result in faster data accumulation, enhancing the usability of private and public administrative data while maintaining privacy and confidentiality <span class="citation" data-cites="brady2019challenge jarmin2019evolving">(Brady 2019; Jarmin 2019)</span>. &nbsp;</p></li>
<li><p>New techniques and data sources can transform statistical agencies ‘from the 20th-century survey-centric model to a 21st-century model that blends structured survey data with administrative and unstructured alternative digital data sources’, leading to better measures of the gig economy, retail sales, healthcare, workforce, and tools and methods to integrate multiple data sources while maintaining privacy and confidentiality <span class="citation" data-cites="jarmin2019evolving">(Jarmin 2019)</span>.</p></li>
</ul>
<p>The next three articles in this series will:</p>
<ul>
<li><p>provide an overview of the CDE and its corresponding framework</p></li>
<li><p>put the CDE Framework into practice through a demonstration use case on the resilience of skilled nursing facilities</p></li>
<li><p>describe our next steps for developing the CDE through a use case research program.</p></li>
</ul>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/08/what-is-CDE-2.html">Part 2: What is the Curated Data Enterprise? →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<p><strong>Sallie Keller</strong> is the Chief Scientist and Associate Director of Research and Methodology at the US Census Bureau. She is a statistician with research interest in social and decision informatics, statistics underpinnings of data science, and data access and confidentiality. Sallie Keller was at the University of Virginia when this work was conducted.</p>
</dd>
<dd>
<p><strong>Stephanie Shipp</strong> leads the Curated Data Enterprise research portfolio and collaborates with the US Census. She is an economist with experience in data science, survey statistics, public policy, innovation, ethics, and evaluation.</p>
</dd>
<dd>
<p><strong>Vicki Lancaster</strong> is a statistician with expertise in experimental design, linear models, computation, visualizations, data analysis, and interpretation. She works with scientists at federal agencies on projects requiring statistical skills and creativity, eg, defining skilled technical workforce using novel data sources.</p>
</dd>
<dd>
<strong>Joseph Salvo</strong> is a demographer with experience in US Census Bureau statistics and data. He presents on demographic subjects to a wide range of groups and has managed major demographic projects involving the analysis of large data sets for local applications.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
<p>© 2024 Stephanie Shipp</p>
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@goumbik">Lukas Blazek</a> on <a href="https://unsplash.com/photos/turned-on-black-and-grey-laptop-computer-mcSDtbWXUZU">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Keller S, Shipp S, Lancaster V, Salvo J (2024). “Advancing Data Science in Official Statistics: The Policy Problem.” Real World Data Science, November 01, 2024. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/01/policy-problem.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-abraham2018promise" class="csl-entry">
Abraham, Katherine G, Ron Haskins, Sherry Glied, et al. 2018. <span>“The Promise of Evidence-Based Policymaking: Report of the Commission on Evidence-Based Policymaking.”</span> <em>Washington, DC: Commission on Evidence-Based Policymaking</em>. <a href="https://www.cep.gov/content/dam/cep/report/cep-final-report.pdf">https://www.cep.gov/content/dam/cep/report/cep-final-report.pdf</a>.
</div>
<div id="ref-brady2019challenge" class="csl-entry">
Brady, Henry E. 2019. <span>“The Challenge of Big Data and Data Science.”</span> <em>Annual Review of Political Science</em> 22: 297–323. <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev-polisci-090216-023229">https://www.annualreviews.org/doi/abs/10.1146/annurev-polisci-090216-023229</a>.
</div>
<div id="ref-FedDataStrat" class="csl-entry">
<span>“Federal Data Strategy, Leveraging Data as a Strategic Asset.”</span> 2021. <a href="https://strategy.data.gov/">https://strategy.data.gov/</a>.
</div>
<div id="ref-Groshen2021Future" class="csl-entry">
Groshen, Erica L. 2021. <span>“The <span>Future</span> of <span>Official</span> <span>Statistics</span>.”</span> <em>Harvard Data Science Review</em> 3 (4). <a href="https://doi.org/10.1162/99608f92.591917c6">https://doi.org/10.1162/99608f92.591917c6</a>.
</div>
<div id="ref-jarmin2019evolving" class="csl-entry">
Jarmin, Ron S. 2019. <span>“Evolving Measurement for an Evolving Economy: Thoughts on 21st Century US Economic Statistics.”</span> <em>Journal of Economic Perspectives</em> 33 (1): 165–84.
</div>
<div id="ref-keller2022bold" class="csl-entry">
Keller, Sallie, Kenneth Prewitt, John Thompson, et al. 2022. <span>“A 21st Century Census Curated Data Enterprise. A Bold New Approach to Create Official Statistics. Technical Report.”</span> <em>Proceedings of the Biocomplexity Institute</em> BI-2022-1115: 297–323. <a href="https://doi.org/10.18130/r174-yk24">https://doi.org/10.18130/r174-yk24</a>.
</div>
<div id="ref-Lanman2023" class="csl-entry">
Lanman, Kathryn, Olivia Davis, and Stephanie Shipp. 2023. <span>“What Can We Learn from Other Countries about How They Are Using Administrative Data to Supplement, Enhance, or Create New Data Products?”</span> <em>Proceedings of the Biocomplexity Institute</em>. <a href="https://doi.org/10.18130/2n54-sc22">https://doi.org/10.18130/2n54-sc22</a>.
</div>
<div id="ref-NASEM2023" class="csl-entry">
NASEM. 2023. <span>“Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources.”</span> <em>National Academies of Sciences, Engineering, and Medicine</em>. <a href="https://doi.org/10.17226/26804">https://doi.org/10.17226/26804</a>.
</div>
<div id="ref-management2023fundamentals" class="csl-entry">
OMB. 2023. <span>“Fundamental Responsibilities of Recognized Statistical Agencies and Units.”</span> <em>Federal Register: The Daily Journal of the US Government</em>, 56708–44. <a href="https://www.federalregister.gov/documents/2023/08/18/2023-17664/fundamental-responsibilities-of-recognized-statistical-agencies-and-units">https://www.federalregister.gov/documents/2023/08/18/2023-17664/fundamental-responsibilities-of-recognized-statistical-agencies-and-units</a>.
</div>
<div id="ref-thieme2022technology" class="csl-entry">
Thieme, Michael. 2022. <em>Technology Transformations at the Census Bureau: Building a Modern, Data-Centric Ecosystem</em>. <a href="https://www.census.gov/newsroom/blogs/research-matters/2022/10/technology-transformation.html">https://www.census.gov/newsroom/blogs/research-matters/2022/10/technology-transformation.html</a>.
</div>
</div></section></div> ]]></description>
  <category>Public Policy</category>
  <category>Data Analysis</category>
  <category>Data Integration</category>
  <category>Curation</category>
  <category>Statistical Products</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/01/policy-problem.html</guid>
  <pubDate>Fri, 01 Nov 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/01/images/laptop-thumbnail.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Forecasting the Health Needs of a Changing Population</title>
  <dc:creator>Luke Shaw (BNSSG ICB), Rich Wood (BNSSG ICB, University of Bath), Christos Vasilakis (University of Bath), Zehra Onen Dumlu (University of Bath)</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/dpm.html</link>
  <description><![CDATA[ 





<section id="background" class="level2">
<h2 class="anchored" data-anchor-id="background">Background</h2>
<p>Decisions around medium- and long-term allocation of healthcare resources are fraught with challenges and uncertainties, which helps explain the prevalence of blunt resource allocations based on across-the-board annual percentage uplifts.</p>
<p>The Bristol, North Somerset, South Gloucestershire Integrated Care Board (BNSSG ICB – we love elaborate acronyms in the National Health Service!), in the south west of England, is part of the local NHS apparatus responsible for planning for the current and future health needs of its one million resident population.</p>
<div id="fig-bnssg-map" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-bnssg-map-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="tv-iframe-container">
  <iframe class="responsive-iframe" src="images/bnssg-map.html" title="fig-bnssg-map" width="80%" height="500"></iframe>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-bnssg-map-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: A map of the area covered by BNSSG, spanning three local authorities and home to about 1 million people.
</figcaption>
</figure>
</div>
</section>
<section id="population-segmentation" class="level2">
<h2 class="anchored" data-anchor-id="population-segmentation">Population Segmentation</h2>
<p>Before tackling the complex problem of forecasting healthcare resources into the future, we first need to understand the current situation regarding the distribution of health needs.</p>
<p>While every individual has a unique set of circumstances, population segmentation is an approach used to help understand overall need by combining individuals into different groups, based on certain criteria.</p>
<p>We use the <a href="https://pubmed.ncbi.nlm.nih.gov/32015079/">Cambridge Multimorbidity Score</a> which is a metric designed to summarise the presence of multiple health conditions, known as multimorbidity. Using that score, which applies different weights to different health conditions, we <a href="https://www.tandfonline.com/doi/full/10.1080/20479700.2023.2232980">previously</a> found a way of splitting the adult (17+) population into five Core Segments, with <span style="color:#77A033;"><strong>Core Segment 1</strong></span> patients having the lowest score and being the least ill and <span style="color:#FF6C53;"><strong>Core Segment 5</strong></span> being those with the most multimorbidity.</p>
<p>Applied to the BNSSG adult population (of around 750K individuals), the following interesting properties were found:</p>
<ol type="1">
<li><strong>Halving</strong>: Going up one segment results in roughly half the number of people in that segment</li>
<li><strong>Doubling</strong>: Going up one segment results in roughly twice the NHS monetary spend per person per year</li>
</ol>
<p>We can see this in Figure&nbsp;2.</p>
<div id="fig-halving-doubling" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Table showing the 5 Core Segments with CS1 having a Cambridge Score of <0.09, 52% of the population and £300 mean annual spend per person as the first row. This then changes by row through to CS5 having a Cambridge Score of >2.94 with 3% of the population and £5600 mean annual spend per person as the last row. The proportion of population column roughly halves row-by-row; the mean annual spend per person roughly doubles row-by-row.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-halving-doubling-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/images/halving-doubling-no-arrows.png" class="img-fluid figure-img" alt="Table showing the 5 Core Segments with CS1 having a Cambridge Score of <0.09, 52% of the population and £300 mean annual spend per person as the first row. This then changes by row through to CS5 having a Cambridge Score of >2.94 with 3% of the population and £5600 mean annual spend per person as the last row. The proportion of population column roughly halves row-by-row; the mean annual spend per person roughly doubles row-by-row.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-halving-doubling-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Halving-Doubling Effect of the Core Segments
</figcaption>
</figure>
</div>
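<p>The halving-doubling pattern can be reproduced with a few lines of arithmetic. A minimal sketch in Python, seeded only with the Core Segment 1 values reported in Figure 2 (52% of the population, roughly £300 spend per person per year); all generated rows are approximations, not the published figures:</p>

```python
# Generate approximate Figure 2 rows from the halving-doubling pattern:
# each step up a Core Segment halves the population share and doubles the
# mean annual spend per person. Seed values are the CS1 row of Figure 2.
share, spend = 0.52, 300

for seg in range(1, 6):
    print(f"CS{seg}: {share:5.1%} of population, ~£{spend} per person per year")
    share /= 2   # halving: each segment holds about half as many people
    spend *= 2   # doubling: each segment costs about twice as much per person
```

<p>The generated final row (roughly 3% of the population, about £4,800 per person) is close to the 3% and £5,600 reported in Figure 2, showing the pattern is an approximation rather than an exact rule. One consequence is that share × spend stays roughly constant, so each segment accounts for a broadly similar slice of total spend.</p>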
</section>
<section id="sec-creating-the-model" class="level2">
<h2 class="anchored" data-anchor-id="sec-creating-the-model">Creating The Model</h2>
<p>To forecast health needs of the population, in terms of how many people will be in which Core Segment in what future year, the Dynamic Population Model (DPM) takes information from two different sources:</p>
<ol type="1">
<li><p>The Office for National Statistics <a href="https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationprojections/datasets/localauthoritiesinenglandtable2">projections</a> for our area. From this, we get yearly projections for not just the total 17+ population, but also the predicted number of people turning 17 (and so entering our model), deaths, and in- and out-ward migration.</p></li>
<li><p>NHS patient attribute and activity data, stored in the <a href="https://bnssghealthiertogether.org.uk/population-health-management/">System Wide Dataset</a> (SWD). This gives us: past and current information on the adult population’s NHS healthcare usage; the Core Segment breakdown of our current and past populations; the proportion of those turning 17, migrating, and dying that are in each Core Segment. From this, we estimate the historical rates of transition within Core Segments, which is essentially the yearly number of people getting sicker or healthier.</p></li>
</ol>
<p>By synthesising these pieces of data, we create our DPM forecast. Starting from the most up-to-date Core Segment population breakdown, the model takes yearly time steps into the future, at each time step using the inputs to estimate how many people will be in each Core Segment. This modelling approach, with discrete time steps and movements between states, could be set up as a Markov chain, although here we have formulated it as a set of difference equations, through which the outflow of each Core Segment population at each time step is deterministic. The design was led by <a href="https://researchportal.bath.ac.uk/en/persons/zehra-onen-dumlu">Zehra</a> and <a href="https://researchportal.bath.ac.uk/en/persons/christos-vasilakis">Christos</a>, through a collaboration between the NHS and the <a href="https://www.bath.ac.uk/research-centres/centre-for-healthcare-innovation-and-improvement-chi2/">Centre for Healthcare Innovation and Improvement (CHI2)</a> at the University of Bath.</p>
<p>The model can be thought of as having the following inputs:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 31%">
<col style="width: 59%">
<col style="width: 8%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Model Input</th>
<th style="text-align: left;">Description</th>
<th style="text-align: left;">Data Source</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">initial population</td>
<td style="text-align: left;">The starting number of people in each Core Segment</td>
<td style="text-align: left;">SWD</td>
</tr>
<tr class="even">
<td style="text-align: left;">inner transition matrix</td>
<td style="text-align: left;">The yearly proportions of people moving from one Core Segment to another</td>
<td style="text-align: left;">SWD</td>
</tr>
<tr class="odd">
<td style="text-align: left;">births, net migration, deaths - numbers</td>
<td style="text-align: left;">The yearly number of people moving in and out of the area</td>
<td style="text-align: left;">ONS</td>
</tr>
<tr class="even">
<td style="text-align: left;">births, net migration, deaths - proportions</td>
<td style="text-align: left;">The proportion of births/migrations/deaths that come from each Core Segment group</td>
<td style="text-align: left;">SWD</td>
</tr>
</tbody>
</table>
<p>From these inputs, it deterministically outputs the yearly forecasts for the number of people in each Core Segment. From these yearly Core Segment population figures, we can also forecast use by point of delivery, by taking historic SWD information on the activity used by each Core Segment and assuming that this usage pattern stays the same into the future.</p>
<p>We combine these population health segment projections – i.e., how many people will be in which Core Segment in what future year – with recent NHS healthcare usage data to yield forecasted changes for various delivery points, like Emergency Department (ED) visits or maternity service appointments.</p>
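<p>One yearly time step of these difference equations can be sketched as: redistribute the current population vector with the inner transition matrix, then add the entrants and net migration and subtract the deaths according to their segment mixes. A minimal illustration in Python, where every number is a made-up placeholder rather than a real SWD or ONS input:</p>

```python
import numpy as np

# Starting people per Core Segment, scaled to a population of 1,000
# (illustrative only -- real values come from the SWD).
pop = np.array([520.0, 240.0, 130.0, 70.0, 40.0])

# Inner transition matrix: T[i, j] = yearly proportion of Core Segment i+1
# moving to Core Segment j+1. Mostly diagonal, since most people stay in the
# same segment year-on-year. Each row sums to 1. Values are placeholders.
T = np.array([
    [0.93, 0.06, 0.01, 0.00, 0.00],
    [0.04, 0.88, 0.06, 0.02, 0.00],
    [0.01, 0.05, 0.86, 0.06, 0.02],
    [0.00, 0.01, 0.06, 0.86, 0.07],
    [0.00, 0.00, 0.02, 0.08, 0.90],
])

# Yearly totals for flows in and out of the area (ONS-style numbers) ...
entrants, net_migration, deaths = 12.0, 5.0, 10.0
# ... and the Core Segment mix of each flow (SWD-style proportions).
p_entrants  = np.array([0.90, 0.08, 0.02, 0.00, 0.00])
p_migration = np.array([0.70, 0.20, 0.07, 0.02, 0.01])
p_deaths    = np.array([0.05, 0.10, 0.20, 0.30, 0.35])

def step(pop):
    """Advance the Core Segment population vector by one year."""
    moved = pop @ T  # redistribute people between segments
    return (moved
            + entrants * p_entrants
            + net_migration * p_migration
            - deaths * p_deaths)

pop_next = step(pop)
```

<p>Because every row of the transition matrix sums to one, the within-segment redistribution conserves the total population; only the entry and exit flows change its overall size. Iterating <code>step</code> year by year produces the forecast.</p>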
</section>
<section id="findings" class="level2">
<h2 class="anchored" data-anchor-id="findings">Findings</h2>
<p>The first output of the model is the population forecast for each Core Segment, as plotted in Figure&nbsp;3. The visualisation is a type of Sankey diagram called an alluvial plot, which shows the proportion of people moving between the Core Segments each year. As is to be expected, the majority of individuals stay in the same Core Segment year-on-year, as the process of acquiring conditions and developing multimorbidity takes place over many years and decades.</p>
<p>The concerning insight shown in Figure&nbsp;3 is that all Core Segments apart from (the most healthy) <span style="color:#77A033;"><strong>Core Segment 1</strong></span> are due to increase in size, with <span style="color:#FF6C53;"><strong>Core Segment 5</strong></span> having the largest percentage increase over the next 20 years. While, at first glance, this could be attributed to the effect of an ageing population, in which people are staying alive for longer, we will see in the next set of results that this does not wholly explain the forecasted Core Segment changes.</p>
<div id="fig-sankey" class="quarto-float quarto-figure quarto-figure-center anchored" alt="over 20 years when scaled to 1000 population initially we have that population changes in the following ways: CS1 decreases from 520 to 490, CS2 increases from 240 to 310, CS3 increased from 130 to 180, CS4 increases from 70 to 110, CS5 increases from 40 to 60.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-sankey-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/images/dpm-sankey-no-title.png" class="img-fluid figure-img" alt="over 20 years when scaled to 1000 population initially we have that population changes in the following ways: CS1 decreases from 520 to 490, CS2 increases from 240 to 310, CS3 increased from 130 to 180, CS4 increases from 70 to 110, CS5 increases from 40 to 60.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-sankey-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: All Core Segments, except the most healthy (CS1), are forecast to increase in size. BNSSG Population rescaled to have an initial population of 1,000.
</figcaption>
</figure>
</div>
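<p>To make the mechanics concrete, the year-on-year movement between Core Segments can be thought of as a Markov-style transition process. The sketch below uses entirely hypothetical transition rates (the real rates are estimated empirically from the linked BNSSG data, and the full model also accounts for births, deaths and migration) to project a population of 1,000 forward 20 years:</p>

```python
import numpy as np

# Hypothetical annual transition matrix between the five Core Segments
# (row = current segment, column = next year's segment). These values are
# illustrative only; births, deaths and migration are ignored here.
P = np.array([
    [0.97, 0.02, 0.01, 0.00, 0.00],
    [0.00, 0.95, 0.03, 0.01, 0.01],
    [0.00, 0.00, 0.94, 0.04, 0.02],
    [0.00, 0.00, 0.00, 0.94, 0.06],
    [0.00, 0.00, 0.00, 0.00, 1.00],
])

# Initial population scaled to 1,000, matching Figure 3
pop = np.array([520.0, 240.0, 130.0, 70.0, 40.0])

# Apply the annual transitions 20 times
for _ in range(20):
    pop = pop @ P

print(pop.round())
```

<p>Even modest annual transition rates, compounded over 20 years, shift a substantial share of the population into the less healthy segments, which is the pattern visible in the alluvial plot.</p>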
<p>By applying the typical NHS healthcare usage per Core Segment to the projections of Figure&nbsp;3, we derive the expected future healthcare usage for various healthcare settings (Figure&nbsp;4). By overlaying the equivalent projections due solely to demographic factors (both for total population size and capturing the effect of Age and Sex), we see that the DPM projections for increased resource use are not solely attributable to an ageing and growing population, but also to a population becoming gradually less healthy over time.</p>
<p>Specifically, from Figure&nbsp;4 we can glean the following insights:</p>
<ol type="a">
<li><p>In all areas except Maternity, the DPM forecasts increased use beyond that driven by a growing, ageing population. Maternity is the exception because it closely follows the forecast demographic changes, specifically the number of women of childbearing age.</p></li>
<li><p>For Community contacts, which have the highest proportion of use from <span style="color:#FF6C53;"><strong>Core Segment 5</strong></span> patients, the DPM forecasts the largest increase into the future. This is because, relative to its current size, <span style="color:#FF6C53;"><strong>Core Segment 5</strong></span> is set to grow the most, and this has the largest impact on Community contacts, which include home visits to support rehabilitation and services to manage long-term mobility issues, such as physiotherapy.</p></li>
<li><p>Whilst Secondary Elective and Non-Elective activity is forecast to grow at similar rates, the Carbon and Cost values are forecast to grow more for Secondary Non-Elective due to the average Carbon and Cost usage per person in Core Segment 5 being higher. In this context ‘Secondary’ is a hospital stay, with ‘Elective’ being planned and ‘Non-Elective’ being unplanned. For example, a hip replacement is elective whereas an admission following a road traffic accident is non-elective.</p></li>
</ol>
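<p>The calculation behind Figure&nbsp;4 is, at its core, per-person usage rates applied to the segment populations. A minimal sketch, with hypothetical per-person contact rates standing in for the real NHS figures:</p>

```python
import numpy as np

# Hypothetical annual contacts per person for Core Segments 1-5;
# the real rates come from linked NHS activity data.
contacts_per_person = np.array([1, 3, 6, 12, 30])

pop_now    = np.array([520, 240, 130, 70, 40])   # per 1,000 population today
pop_future = np.array([490, 310, 180, 110, 60])  # Figure 3 projection, 20 years

demand_now = contacts_per_person @ pop_now        # 4060 contacts
demand_future = contacts_per_person @ pop_future  # 5620 contacts

# Demand grows by roughly 38% while the population grows by only 15%,
# because growth is concentrated in the high-usage segments.
print(demand_now, demand_future)
```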
<div id="fig-pod-forecasts" class="quarto-float quarto-figure quarto-figure-center anchored" alt="the image shows 15 separate graphs, with the columns being Community, Maternity, Secondary Elective, Secondary Non-Elective and Total, and the rows being Activity, Carbon, and Cost. All graphs have similar overall shape of increase into the future, but with different gradients.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-pod-forecasts-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/images/dpm-pod-forecasts.png" class="img-fluid figure-img" alt="the image shows 15 separate graphs, with the columns being Community, Maternity, Secondary Elective, Secondary Non-Elective and Total, and the rows being Activity, Carbon, and Cost. All graphs have similar overall shape of increase into the future, but with different gradients.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-pod-forecasts-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;4: Forecasts by activity, carbon, and cost for four different points of delivery.
</figcaption>
</figure>
</div>
</section>
<section id="limitations" class="level2">
<h2 class="anchored" data-anchor-id="limitations">Limitations</h2>
<blockquote class="blockquote">
<p>It’s difficult to make predictions, especially about the future.</p>
<p>– <cite>Danish Proverb</cite></p>
</blockquote>
<p>As with any modelling or forecasting method, there are limitations to be mindful of.</p>
<ol type="1">
<li><p>The cost and activity usage estimates are made under the assumption that we will continue to deliver services as they are currently being delivered. We know this isn’t going to be true, as healthcare-seeking behaviour evolves over time, with younger people accessing healthcare in different ways to previous generations. On top of that, healthcare advances can result in significant changes in healthcare provision, in ways unaccounted for within this model.</p></li>
<li><p>The model is tied to ONS forecasts for population change, and robust forecasting is hard. It is difficult to estimate what the population will look like in 20 years’ time, or the influence of uncertain and unknown future local development and housing plans. Having said this, population forecasts tend to be robust; one way to see this is that everyone who will be an adult by the end of the 20-year forecast has already been born.</p></li>
<li><p>The DPM does not explicitly account for the interaction of demand and capacity: it simply predicts future healthcare resource requirement assuming that health needs of a given Core Segment patient are met in the same way they are met now. This is an essential assumption to help ensure legitimate use of the empirically derived Core Segment transition rates. However, it inevitably limits practical use, as flexing demand and capacity assumptions is of importance to planners and service managers.</p></li>
<li><p>It is not possible to validate the model on historic data: firstly because of point 3 above, but also because we only have good quality SWD information for the past two years, so we cannot reliably look further back and create a forecast to check against what actually happened.</p></li>
<li><p>Whilst it is possible to use the model in other healthcare systems and geographic areas, the underlying data required to generate the Core Segments is non-trivial, so significant data pipelining may be needed to create local model inputs, as explained above in Section&nbsp;3.</p></li>
</ol>
</section>
<section id="what-next" class="level2">
<h2 class="anchored" data-anchor-id="what-next">What Next</h2>
<p>We have already generated local use cases for the DPM, forecasting for different geographical areas and specific hospital trusts. We envisage the DPM becoming a standard tool in most forward planning initiatives and will continue to refine the model as more information becomes available, both for calibration and validation.</p>
<p>Outside of BNSSG, we are keen to disseminate our modelling approach to others who may be interested, as well as to expand our collaborations. There are also other innovative approaches in this space, such as the <a href="https://www.health.org.uk/publications/health-in-2040">Health in 2040</a> report by the Health Foundation, which works at England level and uses the same ONS forecasts, but with a different ‘microsimulation’ modelling approach.</p>
<blockquote class="blockquote">
<p>If long-term forecasting in the NHS is of interest to you and your work, we’d love to chat! Please get in touch at <a href="mailto:bnssg.analytics@nhs.net">bnssg.analytics@nhs.net</a></p>
</blockquote>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>Reliably forecasting longer-term population health needs and healthcare resource requirements is essential if the NHS is to effectively plan for tomorrow’s problems today.</p>
<p>While this is undoubtedly a difficult problem – both conceptually and statistically – our modelling, undertaken through an academic-NHS collaboration, demonstrates that there are alternatives beyond the commonly-used but simplistic approaches based only on demographic factors.</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Luke Shaw</strong> is a Data Scientist working in the NHS.
</dd>
<dd>
<strong>Rich Wood</strong> is Head of Modelling Analytics at BNSSG ICB and Senior Visiting Research Follow at University of Bath School of Management.
</dd>
<dd>
<strong>Christos Vasilakis</strong> is Director of the Centre for Healthcare Innovation and Improvement (CHI2), and Professor at the University of Bath School of Management.
</dd>
<dd>
<strong>Zehra Onen Dumlu</strong> is a Research Associate at CHI2 and Lecturer at the University of Bath.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Luke Shaw
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Shaw, Luke, et al. 2024. “Forecasting the Health Needs of a Changing Population.” Real World Data Science, May 08, 2024. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/dpm.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>Health and wellbeing</category>
  <category>Forecasting</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/dpm.html</guid>
  <pubDate>Wed, 08 May 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/images/doctor-patient-thumbnail.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Deduplicating and linking large datasets using Splink</title>
  <dc:creator>Robin Linacre</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/splink.html</link>
  <description><![CDATA[ 





<p>In 2019, the data linking team at the Ministry of Justice was challenged to develop a new data linking methodology to produce new, higher quality linked datasets from the justice system.</p>
<p>The ultimate goal was to share new linked datasets with academic researchers, as part of the ADR UK-funded <a href="https://www.gov.uk/guidance/ministry-of-justice-data-first">Data First programme</a>. These datasets – which include data from prisons, probation, and the criminal and family courts – are now available, and researchers can <a href="https://www.gov.uk/government/publications/moj-data-first-application-form-for-secure-access-to-data">apply for secure access</a>.</p>
<p>The linking methodology is widely applicable and has been published as a free and open source software package called <a href="https://github.com/moj-analytical-services/splink">Splink</a>. The software applies statistical best practice to accurately and quickly link and deduplicate large datasets. The software has now been downloaded over 7 million times, and has been used widely in government, academia and the private sector.</p>
<section id="the-problem" class="level2">
<h2 class="anchored" data-anchor-id="the-problem">The problem</h2>
<p>Data duplication is a ubiquitous problem affecting data quality. Organisations often have multiple records that refer to the same entity but no unique identifier that ties these entities together. Data entry errors and other issues mean that variations usually exist, so the records belonging to a single entity aren’t necessarily identical.</p>
<p>For example, in a company, customer data may have been entered multiple times in multiple different databases, with different spellings of names, different addresses, and other typos. The inability to identify which records belong to each customer presents a data quality problem at all stages of data analysis – from basic questions such as counting the number of unique customers, through to advanced statistical analysis.</p>
<p>With the growing size of datasets held by many organisations, any solution must be able to work on very large datasets of tens of millions of records or more.</p>
</section>
<section id="approach" class="level2">
<h2 class="anchored" data-anchor-id="approach">Approach</h2>
<p>In collaboration with academic experts, the team started with desk research into data linking theory and practice, and a review of existing open source software implementations.</p>
<p>One of the most common theoretical approaches described in the literature is the Fellegi-Sunter model. This statistical model has a long history of application for high profile, important record linking tasks such as in the US Census Bureau and the UK Office for National Statistics (ONS).</p>
<p>The model takes pairwise comparisons of records as an input, and outputs a match score between 0 and 1, which (loosely) can be interpreted as the probability of the two records being a match. Since the record comparison can be either two records from the same dataset, or records from different datasets, this is applicable to both deduplication and linkage problems.</p>
<p>An important benefit of the model is explainability. The model uses a number of parameters, each of which <a href="https://www.robinlinacre.com/partial_match_weights/">has an intuitive explanation</a> that can be understood by a non-technical audience. The relative simplicity of the model also means it is easier to understand and explain how biases in linkage may occur, such as varying levels of accuracy for different ethnic groups.</p>
<section id="example" class="level3">
<h3 class="anchored" data-anchor-id="example">Example</h3>
<p>Consider the following simple record comparison. Are these records a match?</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/images/record_comparison.png" class="img-fluid figure-img"></p>
<figcaption><strong>Figure 1</strong>: Colour coded comparison of two records.</figcaption>
</figure>
</div>
<p>The parameters of the model are known as partial match weights, which capture the strength of the evidence in favour or against these records being a match.</p>
<p>They can be represented in a chart as follows, in which the highlighted bars correspond to the above example record comparison:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/images/partial_match_weights.png" class="img-fluid figure-img"></p>
<figcaption><strong>Figure 2</strong>: Chart showing partial match weights of model.</figcaption>
</figure>
</div>
<p>We can see, for example, that the first name (Robin vs Robyn) is not an exact match, but they have a Jaro-Winkler similarity of above 0.9. As a result, the model ‘activates’ the corresponding partial match weight (in orange). This lends some evidence in favour of a match, but the partial match weight is not as strong as it would have been for an exact match.</p>
<p>Similarly we can see that the non-match on gender leads to the activation (in purple) of a strong negative partial match weight.</p>
<p>The activated partial match weight can then be represented in a waterfall chart as follows, which shows how the final match score is calculated:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/images/waterfall.png" class="img-fluid figure-img"></p>
<figcaption><strong>Figure 3</strong>: Waterfall chart showing how partial match weights combine to calculate the final prediction.</figcaption>
</figure>
</div>
<p>The parameter estimates in these charts all have intuitive explanations:</p>
<ul>
<li>The partial match weight on first name is positive, but relatively weak. This makes sense, because the first names are a fuzzy match, not an exact match, so this provides only moderate evidence in favour of the record being a match.</li>
<li>The match weight for the exact match on postcode is stronger than the equivalent weight for surname. This is because the cardinality of the postcode field in the underlying data is higher than the cardinality for surname, so matches on postcode are less likely to occur by chance than matches on surname.</li>
<li>The negative match weight for the mismatch on gender is relatively strong. This reflects the fact that, in this dataset, it’s uncommon for the ‘gender’ field to mismatch amongst truly matching records.</li>
</ul>
<p>The final result is that the model predicts these records are a match, but with only 94% probability: it’s not sure. Most examples would be less ambiguous than this one, and would have a match probability very close to either 0 or 1.</p>
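<p>For readers who want the arithmetic behind the waterfall chart: partial match weights are log<sub>2</sub> Bayes factors, so the final match probability comes from adding the activated weights to a prior weight and converting the resulting log-odds back to a probability. The weights below are invented for illustration (Splink estimates the real ones from the data), chosen so the result lands near the 94% of the example:</p>

```python
# Hypothetical partial match weights (log2 Bayes factors); in Splink
# these are estimated from the data, not set by hand.
prior_match_weight = -7.0  # log2 prior odds of a random pair matching
activated_weights = {
    "first_name (fuzzy match)": 3.0,
    "surname (exact match)": 8.0,
    "postcode (exact match)": 10.0,
    "gender (mismatch)": -10.0,
}

# The waterfall chart accumulates these left to right
total = prior_match_weight + sum(activated_weights.values())

# Convert log2 odds into a match probability
probability = 2**total / (1 + 2**total)
print(round(probability, 3))  # → 0.941
```

<p>Note how a single strong negative weight (the gender mismatch) can offset several positive ones, which is exactly the behaviour the waterfall chart makes visible.</p>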
<p>For further details of the theory behind the Fellegi-Sunter model, and a deep dive into the intuitive explanations of the model, I have developed a <a href="https://www.robinlinacre.com/intro_to_probabilistic_linkage/">series of interactive tutorials</a>.</p>
</section>
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">Implementation</h2>
<p>Through our desk research and open source software review, we identified an existing software package, <a href="https://github.com/kosukeimai/fastLink">fastLink</a>, which implements the Fellegi-Sunter model. Unfortunately, it is not able to handle very large datasets of more than a few hundred thousand records.</p>
<p>Inspired by the popularity of fastLink, the team quickly realised that the methodology it was developing was generally applicable and could be valuable to a wide range of users if published as a software package.</p>
<p>As we spoke to colleagues across government and beyond, we found record linkage and deduplication problems are pervasive, and crop up in many different guises, meaning that any software needed to be very general and flexible.</p>
<p>The result is Splink, a Python package that implements the Fellegi-Sunter model and enables parameters to be estimated using the Expectation Maximisation algorithm.</p>
<p>The package is free to use, <a href="https://github.com/moj-analytical-services/splink">and open source</a>. It is accompanied by <a href="https://moj-analytical-services.github.io/splink/index.html">detailed documentation</a>, including a <a href="https://moj-analytical-services.github.io/splink/demos/tutorials/00_Tutorial_Introduction.html">tutorial</a> and a set of <a href="https://moj-analytical-services.github.io/splink/demos/examples/examples_index.html">examples</a>.</p>
<p>Splink makes no assumptions about the type of entity being linked, so it is very flexible. We are aware of its use to match data on a variety of entity types including persons, companies, financial transactions and court cases.</p>
<p>The package closely follows the statistical approach described in fastLink. In particular it implements the same mathematical model and likelihood functions described in the <a href="http://imai.fas.harvard.edu/research/files/linkage.pdf">fastLink paper</a> (see pages 354 to 357), with a comprehensive suite of tests to ensure correctness of the implementation.</p>
<p>In addition, Splink introduces a number of innovations:</p>
<ul>
<li>Able to work at massive scale – with proven examples of its use on over 100 million records.</li>
<li>Extremely fast – capable of linking 1 million records on a laptop in around a minute.</li>
<li><a href="https://moj-analytical-services.github.io/splink/charts/index.html">Comprehensive graphical output</a> showing parameter estimates and iteration history make it easier to understand the model and diagnose statistical issues.</li>
<li><a href="https://moj-analytical-services.github.io/splink/charts/waterfall_chart.html">A waterfall chart</a> which can be generated for any record pair, which explains how the estimated match probability is derived.</li>
<li>Support for deduplication, linking, and a combination of both, including support for deduplicating and linking multiple datasets.</li>
<li>Greater customisability of record comparisons, including the ability <a href="https://moj-analytical-services.github.io/splink/topic_guides/comparisons/customising_comparisons.html">to specify custom, user defined comparison functions.</a></li>
<li>Term frequency adjustments on any number of columns.</li>
<li>It’s possible to save a model once it’s been estimated – enabling a model to be estimated, quality assured, and then reused as new data becomes available.</li>
<li>A <a href="https://moj-analytical-services.github.io/splink/">companion website</a> provides a complete description of the various configuration options, and examples of how to achieve different linking objectives.</li>
</ul>
</section>
<section id="using-splink" class="level2">
<h2 class="anchored" data-anchor-id="using-splink">Using Splink</h2>
<p><a href="https://moj-analytical-services.github.io/splink/">Full documentation</a> and <a href="https://moj-analytical-services.github.io/splink/demos/tutorials/00_Tutorial_Introduction.html">a tutorial</a> are available for Splink, but the following snippet gives a simple example of Splink in action:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> splink.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> splink_datasets</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> splink.duckdb.blocking_rule_library <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> block_on</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> splink.duckdb.comparison_library <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> (</span>
<span id="cb1-4">    exact_match,</span>
<span id="cb1-5">    jaro_winkler_at_thresholds,</span>
<span id="cb1-6">    levenshtein_at_thresholds,</span>
<span id="cb1-7">)</span>
<span id="cb1-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> splink.duckdb.linker <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DuckDBLinker</span>
<span id="cb1-9"></span>
<span id="cb1-10">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> splink_datasets.fake_1000</span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Specify a data linkage model</span></span>
<span id="cb1-13">settings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb1-14">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"link_type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dedupe_only"</span>,</span>
<span id="cb1-15">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blocking_rules_to_generate_predictions"</span>: [</span>
<span id="cb1-16">      block_on(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"first_name"</span>),</span>
<span id="cb1-17">      block_on(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"surname"</span>),</span>
<span id="cb1-18">    ],</span>
<span id="cb1-19">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"comparisons"</span>: [</span>
<span id="cb1-20">        jaro_winkler_at_thresholds(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"first_name"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb1-21">        jaro_winkler_at_thresholds(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"surname"</span>),</span>
<span id="cb1-22">        levenshtein_at_thresholds(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dob"</span>),</span>
<span id="cb1-23">        exact_match(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"city"</span>, term_frequency_adjustments<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>),</span>
<span id="cb1-24">        exact_match(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"email"</span>),</span>
<span id="cb1-25">    ],</span>
<span id="cb1-26">}</span>
<span id="cb1-27"></span>
<span id="cb1-28">linker <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DuckDBLinker(df, settings)</span>
<span id="cb1-29"></span>
<span id="cb1-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Estimate model parameters</span></span>
<span id="cb1-31"></span>
<span id="cb1-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Direct estimation using random sampling can be used for the u probabilities</span></span>
<span id="cb1-33">linker.estimate_u_using_random_sampling(target_rows<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e6</span>)</span>
<span id="cb1-34"></span>
<span id="cb1-35"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Expectation maximisation is used to train the m values</span></span>
<span id="cb1-36">br_training <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_on([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"first_name"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"surname"</span>])</span>
<span id="cb1-37">linker.estimate_parameters_using_expectation_maximisation(br_training)</span>
<span id="cb1-38"></span>
<span id="cb1-39">br_training <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_on(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dob"</span>)</span>
<span id="cb1-40">linker.estimate_parameters_using_expectation_maximisation(br_training)</span>
<span id="cb1-41"></span>
<span id="cb1-42"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use the model to compute pairwise match scores</span></span>
<span id="cb1-43">pairwise_predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> linker.predict()</span>
<span id="cb1-44"></span>
<span id="cb1-45"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Cluster the match scores into groups to produce a synthetic unique person id</span></span>
<span id="cb1-46">clusters <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> linker.cluster_pairwise_predictions_at_threshold(</span>
<span id="cb1-47">  pairwise_predictions, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span></span>
<span id="cb1-48">)</span>
<span id="cb1-49">clusters.as_pandas_dataframe(limit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
<p>The example shows the flexibility of Splink, and how various types of configuration can be used:</p>
<ul>
<li><strong>How should different data fields be compared?</strong> In this example, the Jaro-Winkler distance is used for names, whereas Levenshtein is used for date of birth since Jaro-Winkler is not appropriate for numeric data.</li>
<li><strong>What blocking rules should be used?</strong> Blocking rules are the primary determinants of how fast Splink will run, but there is a trade-off between speed and accuracy. In this case, the input data is small, so the blocking rules are loose.</li>
<li><strong>How should the model parameters be estimated?</strong> In this case, the user has no labels for supervised training, and so uses the unsupervised Expectation Maximisation approach.</li>
<li><strong>Is clustering needed?</strong> In this case, each person may potentially have many duplicates, so clustering is used. This creates an estimated (synthetic) unique identifier for each entity (person) in the input dataset.</li>
</ul>
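<p>The effect of blocking rules is worth seeing in miniature. A blocking rule restricts pairwise comparisons to pairs that agree on some field, and multiple rules are unioned together, mirroring the <code>blocking_rules_to_generate_predictions</code> list above. This is a plain-Python sketch, not Splink’s actual SQL-based implementation:</p>

```python
from collections import defaultdict
from itertools import combinations

# Toy records with a plausible duplicate (ids 1-3 may be the same person)
records = [
    {"id": 1, "first_name": "Robin", "surname": "Linacre"},
    {"id": 2, "first_name": "Robyn", "surname": "Linacre"},
    {"id": 3, "first_name": "Robin", "surname": "Linaker"},
    {"id": 4, "first_name": "David", "surname": "Smith"},
]

def candidate_pairs(records, key):
    """Pairs agreeing on `key` -- far fewer than all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for r in records:
        blocks[r[key]].append(r["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Union of two blocking rules, as in the settings dictionary
pairs = candidate_pairs(records, "first_name") | candidate_pairs(records, "surname")
print(sorted(pairs))  # → [(1, 2), (1, 3)] rather than all 6 possible pairs
```

<p>The trade-off is also visible here: no rule brings records 2 and 3 together, so that pair is never scored, which is why overlapping blocking rules are typically combined.</p>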
</section>
<section id="outcomes" class="level2">
<h2 class="anchored" data-anchor-id="outcomes">Outcomes</h2>
<p>Splink has been used to link some of the largest datasets held by the Ministry of Justice as part of the <a href="https://www.gov.uk/guidance/ministry-of-justice-data-first">Data First programme</a>, and researchers are now <a href="https://www.gov.uk/government/publications/moj-data-first-application-form-for-secure-access-to-data">able to apply for secure access to these datasets</a>. Research using this data <a href="https://www.ons.gov.uk/aboutus/whatwedo/statistics/requestingstatistics/onsresearchexcellenceaward">won the ONS Linked Administrative Data Award at the 2022 Research Excellence Awards</a>.</p>
<p>More widely, the demand for Splink has been higher than we expected – with over 7 million downloads. It has been used in other government departments including the Office for National Statistics and internationally, the private sector, and published academic research from top international universities.</p>
<p>Splink has also had external contributions from over 30 people, including staff at the Australian Bureau of Statistics, DataBricks, other government departments, academics, and various private sector consultancies.</p>
<div class="callout callout-style-simple callout-note" style="margin-top: 2.25rem;">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p><strong>Editor’s note</strong>: For more on data linkage, <a href="https://realworlddatascience.net/foundation-frontiers/interviews/posts/2023/10/16/data-sharing-in-gov.html">check out our interview with Helen Miller-Bakewell of the UK Office for Statistics Regulation</a>, discussing the OSR report, <a href="https://osr.statisticsauthority.gov.uk/publication/data-sharing-and-linkage-for-the-public-good/">Data Sharing and Linkage for the Public Good</a>.</p>
</div>
</div>
</div>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<strong>Robin Linacre</strong> is an economist, data scientist and data engineer based at the UK Ministry of Justice. He is the lead author of Splink.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Robin Linacre
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@possessedphotography?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Possessed Photography</a> on <a href="https://unsplash.com/photos/yellow-metal-chain-NwpSBZMhc-M?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Linacre, Robin. 2023. “Deduplicating and linking large datasets using Splink.” Real World Data Science, November 22, 2023. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/splink.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>Crime and justice</category>
  <category>Data quality</category>
  <category>Data linkage</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/splink.html</guid>
  <pubDate>Wed, 22 Nov 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/images/possessed-photography-NwpSBZMhc-M-unsplash.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Learning from failure: ‘Red flags’ in body-worn camera data</title>
  <dc:creator>Noah Wright</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/learning-from-failure.html</link>
  <description><![CDATA[ 





<p>Incarcerated youth are an exceptionally vulnerable population, and body-worn cameras are an important tool of accountability both for those incarcerated and the staff who supervise them. In 2018 the Texas Juvenile Justice Department (TJJD) deployed body-worn cameras for the first time, and this is a case study of how the agency developed a methodology for measuring the success of the camera rollout. This is also a case study of analysis failure, as it became clear that real-world implementation problems were corrupting the data and rendering the methodology unusable. However, the process of working through the causes of this failure helped the agency identify previously unrecognized problems and ultimately proved to be of great benefit. The purpose of this case study is to demonstrate how negative findings can still be incredibly useful in real-world settings.</p>
<div class="callout callout-style-simple callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Why body-worn cameras?
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>Body-worn cameras became a standard tool of policing in the US in the mid-2010s. By recording officer interactions with the public, law enforcement agencies could achieve a greater degree of accountability. Not only could credible claims of police abuse against civilians be easily verified, the argument went, but false accusations would decline as well, saving law enforcement agencies time and resources that would otherwise be wasted on spurious allegations. Initial studies seemed to support this argument.</p>
<p>TJJD faced similar issues to law enforcement agencies, and body-worn cameras seemed like they could be a useful tool. Secure youth residential facilities in Texas all had overhead cameras, but these were very old (they still ran on tape) and captured no audio. This presented a number of problems when it came to deciphering contested incidents, not to mention that these cameras had clearly not prevented any of the agency’s prior scandals from taking place. TJJD received special funding from the legislature to roll out body-worn cameras system-wide, and all juvenile correctional officers were required to wear one.</p>
</div>
</div>
</div>
<section id="background" class="level2">
<h2 class="anchored" data-anchor-id="background">Background</h2>
<p>From the outset of the rollout of body-worn cameras, TJJD faced a major issue with implementation: in 2019, body-worn cameras were an established tool for law enforcement, but there was very little literature or best practice to draw from for their use in a correctional environment. Unlike police officers, juvenile correctional officers (JCOs) deal directly with their charges for virtually their entire shift. In an eight-hour shift, a police officer might record a few calls and traffic stops. A juvenile correctional officer, on the other hand, would record for almost eight consecutive hours. And, because TJJD recorded round-the-clock for hundreds of employees at a time, this added up very quickly to <em>a lot</em> of footage.</p>
<p>For example, a typical dorm in a correctional center might have four JCOs assigned to it. Across a single week, these four JCOs would be expected to record at least 160 hours of footage.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/jcos-recording-totals.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A table illustrating working hours over the course of a week for four juvenile correctional officers"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Figure 1:</strong> Four JCOs x 40 hours per week = 160 hours of footage.</p>
</div>
<p>This was replicated across every dorm. Three dorms, for example, would produce nearly 500 hours of footage, as seen below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/jcos-recording-totals-across-dorms.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A table illustrating working hours over the course of a week for four juvenile correctional officers in each of three dorms in one juvenile correctional facility"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Figure 2:</strong> Three dorms x four JCOs x 40 hours per week = 480 hours of footage.</p>
</div>
<p>Finally, we had more than one facility. Four facilities with three dorms each would produce nearly 2,000 hours of footage every week.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/jcos-recording-totals-across-facilities.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A table illustrating working hours over the course of a week for four juvenile correctional officers in each of three dorms in four separate juvenile correctional facilities"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Figure 3:</strong> Four facilities x three dorms x four JCOs x 40 hours per week = 1,920 hours of footage.</p>
</div>
<p>In actuality, we had a total of five facilities each with over a dozen dorms producing an anticipated <strong>17,000 hours</strong> of footage every week – an impossible amount to monitor manually.</p>
<p>As a result, footage review had to be done in a limited, reactive manner. If our monitoring team received an incident report, they could easily zero in on the cameras of the officers involved and review the incident accordingly. But our executive team had hoped to be able to use the footage proactively, looking for “red flags” in order to <em>prevent</em> potential abuses instead of only responding to allegations.</p>
<p>Because the agency had no way of automating the monitoring of footage, any proactive analysis had to be metadata-based. But what to look for in the metadata? Once again, the lack of best-practice literature left us in the lurch. So, we brainstormed ideas for “red flags” and came up with the following that could be screened for using camera metadata:</p>
<ol type="1">
<li><p><strong>Minimal quantity of footage</strong> – our camera policy required correctional officers to have their cameras on at all times in the presence of youth. No footage meant they weren’t using their cameras.</p></li>
<li><p><strong>Frequently turning the camera on and off</strong> – a correctional officer working a dorm should have their cameras always on when around youth and not be turning them on and off repeatedly.</p></li>
<li><p><strong>Large gaps between clips</strong> – it defeats the purpose of having cameras if they’re not turned on.</p></li>
</ol>
<p>In addition, we came up with a fourth red flag, which could be screened for by comparing camera metadata with shift-tracking metadata:</p>
<ol start="4" type="1">
<li><strong>Mismatch between clips recorded and shifts worked</strong> – the agency had very recently rolled out a new shift tracking software. We should expect to see the hours logged by the body cameras roughly match the shift hours worked.</li>
</ol>
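<p>The fourth red flag reduces to a simple per-employee comparison: total footage hours against total shift hours over the same window. A minimal Python sketch of that screen is below (field names and figures are invented for illustration; this is not the agency’s code or schema).</p>

```python
from datetime import datetime

def total_hours(intervals):
    """Sum the durations of (start, end) datetime pairs, in hours."""
    return sum((end - start).total_seconds() for start, end in intervals) / 3600

def shift_footage_ratio(clips, shifts):
    """Fraction of logged shift time covered by footage; well below 1 is a red flag."""
    return total_hours(clips) / total_hours(shifts)

# Hypothetical day: four hours recorded against an eight-hour shift.
clips = [(datetime(2019, 4, 1, 6, 0), datetime(2019, 4, 1, 10, 0))]
shifts = [(datetime(2019, 4, 1, 6, 0), datetime(2019, 4, 1, 14, 0))]
ratio = shift_footage_ratio(clips, shifts)  # 0.5
```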
</section>
<section id="analysis-part-1-quality-control-and-footage-analysis" class="level2">
<h2 class="anchored" data-anchor-id="analysis-part-1-quality-control-and-footage-analysis">Analysis, part 1: Quality control and footage analysis</h2>
<p>For this analysis, I gathered the most recent three weeks of body-worn camera data – which, at the time, covered April 1–21, 2019. I also pulled data from Shifthound (our shift management software) covering the same time period. Finally, I gathered HR data from CAPPS, the system that most of the State of Texas used at the time for personnel management and finance.<sup>1</sup> I then performed some quality control work, summarized in the dropdown box below.</p>
<div class="callout callout-style-simple callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-2-contents" aria-controls="callout-2" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Initial quality control steps
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-2" class="callout-2-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p><code>skimr</code> is a helpful <code>R</code> package for exploratory analysis that gives summary statistics for every variable in a data frame, including missing values. After using the <code>skim</code> function on clip data, shift data, and HR data, I noticed that the clip data had some missing values for employee ID. This was an error which pointed to data entry mistakes – body-worn cameras do not record footage on their own, after all, so employee IDs should be assigned to each clip.</p>
<p>From here I compared the employee ID field in the clip data to the employee ID field in the HR data. Somewhat surprisingly, IDs existed in the clip data that did not correspond to any entries in the HR data, indicating yet more data entry mistakes – the HR data is the ground truth for all employee IDs. I checked the shift data for the same error – employee IDs that did not exist in the HR data – and found the same problem.</p>
<p>As well as employee IDs that did not exist in the HR data, I also looked for employee IDs in the footage and shift data which related to staff who were not actually employed between April 1–21, 2019. I found some examples of this, which indicated yet more errors: staff cannot use a body-worn camera or log a shift if they have yet to begin working or if they have been terminated (system permissions are revoked upon leaving employment).</p>
<p>I made a list of every erroneous ID to pass off to HR and monitoring staff before excluding them from the subsequent analysis. In total, 10.6% of clips, representing 11.3% of total footage, had to be excluded due to these initial data quality issues, foreshadowing the further problems the analysis would uncover.</p>
<p>The full analysis script <a href="https://t.ly/BUNRZ">can be found on GitHub</a>.</p>
</div>
</div>
</div>
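<p>The ID checks described in the dropdown above are essentially set differences (anti-joins): any employee ID in the clip or shift data that is missing, or absent from the HR data, points to a data entry error. A small Python sketch of the idea, with invented IDs:</p>

```python
# HR data is the ground truth for valid employee IDs.
hr_ids = {"9001005", "9001006", "9001007"}

# Clip metadata may contain missing IDs (None) or IDs unknown to HR.
clip_ids = ["9001005", "9001005", None, "9999999"]

missing = [i for i in clip_ids if i is None]                  # data entry gaps
unknown = sorted({i for i in clip_ids if i is not None} - hr_ids)  # anti-join vs HR
# missing -> [None]; unknown -> ["9999999"]
```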
<p>In order to operationalize the “red flags” from our brainstorming session, I needed to see what exactly the cameras captured in their metadata. The variables most relevant to our purposes were:</p>
<ul>
<li>Clip start</li>
<li>Clip end</li>
<li>Camera used</li>
<li>Who was assigned to the camera at the time</li>
<li>The role of the person assigned to the camera</li>
</ul>
<p>Using these fields, I first created the following <strong>aggregations per employee ID</strong>:</p>
<div class="quarto-layout-panel" data-layout-nrow="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/film.png" class="img-fluid figure-img" alt="graphical icon representing a strip of film"></p>
<figcaption><strong>Number of clips</strong> = Number of clips recorded.</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/calendar.png" class="img-fluid figure-img" alt="graphical icon representing a calendar"></p>
<figcaption><strong>Days with footage</strong> = Number of discrete dates that appear in these clips.</figcaption>
</figure>
</div>
</div>
</div>
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/clock.png" class="img-fluid figure-img" alt="graphical icon representing a stopwatch"></p>
<figcaption><strong>Footage hours</strong> = Total duration of all shot footage.</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/caution.png" class="img-fluid figure-img" alt="graphical icon of an exclamation mark inside a circle"></p>
<figcaption><strong>Significant gaps</strong> = Number of clips where the gap between the previous clip’s end and the current clip’s start was greater than 15 minutes but less than eight hours.</figcaption>
</figure>
</div>
</div>
</div>
</div>
<p>I used these aggregations to devise the following <strong>staff metrics</strong>:</p>
<div class="quarto-layout-panel" data-layout-nrow="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/clip-per-day.png" class="img-fluid figure-img" alt="graphical icons representing a strip of film and a calendar"></p>
<figcaption><strong>Clips per day</strong> = Number of clips / Days with footage.</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/footage-per-day.png" class="img-fluid figure-img" alt="graphical icons representing a stopwatch and a calendar"></p>
<figcaption><strong>Footage per day</strong> = Footage hours / Days with footage.</figcaption>
</figure>
</div>
</div>
</div>
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/avg-clip-length.png" class="img-fluid figure-img" alt="graphical icons representing a stopwatch and a strip of film"></p>
<figcaption><strong>Average clip length</strong> = Footage hours / Number of clips.</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/gaps-per-day.png" class="img-fluid figure-img" alt="graphical icons of an exclamation mark inside a circle and a calendar"></p>
<figcaption><strong>Gaps per day</strong> = Gaps / Days with footage.</figcaption>
</figure>
</div>
</div>
</div>
</div>
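<p>The aggregations and metrics above can be sketched for a single employee in a few lines of Python (an illustration under invented data, not the analysis script, which was written in R). Each clip is a (start, end) pair; a significant gap is one of more than 15 minutes but less than eight hours.</p>

```python
from datetime import datetime, timedelta

# Hypothetical clips for one employee, sorted by start time.
clips = [
    (datetime(2019, 4, 1, 6, 0), datetime(2019, 4, 1, 8, 30)),
    (datetime(2019, 4, 1, 13, 0), datetime(2019, 4, 1, 14, 0)),  # 4.5 h gap before this clip
    (datetime(2019, 4, 2, 6, 0), datetime(2019, 4, 2, 10, 0)),
]
clips.sort()

# Aggregations per employee ID.
n_clips = len(clips)
days_with_footage = len({start.date() for start, _ in clips})
footage_hours = sum((end - start).total_seconds() for start, end in clips) / 3600
gaps = sum(
    timedelta(minutes=15) < (nxt_start - prev_end) < timedelta(hours=8)
    for (_, prev_end), (nxt_start, _) in zip(clips, clips[1:])
)

# Derived staff metrics.
clips_per_day = n_clips / days_with_footage          # 1.5
footage_per_day = footage_hours / days_with_footage  # 3.75
avg_clip_length = footage_hours / n_clips            # 2.5
gaps_per_day = gaps / days_with_footage              # 0.5
```

Note that the 16-hour overnight gap between the last clip of April 1 and the first clip of April 2 is not counted: gaps longer than eight hours are assumed to fall between shifts rather than within one.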
<p>Once I established these metrics for each employee, I looked at their respective distributions. Standard staff shift lengths at the time were eight hours. If staff were using their cameras appropriately, we would expect to see distributions centered around clip lengths of about an hour, eight or fewer clips per day, and 8-12 footage hours per day. We would also expect to see zero significant gaps.</p>
<details>
<summary>
Show the code
</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb1-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{r}</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-3"></span>
<span id="cb1-4">Footage_Metrics_by_Employee <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Output/Footage Metrics by Employee.csv"</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6">Footage_Metrics_by_Employee <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb1-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Clips, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Days_With_Footage, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Footage_Hours, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Gaps) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb1-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Employee_ID, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Metric"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Value"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb1-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> Value)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">facet_wrap</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>Metric, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">scales =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"free"</span>)</span>
<span id="cb1-12"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
</details>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/fig-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Four histograms of key metrics - average clip length, average number of clips per day, average footage hours per day, and average gaps per day."></p>
</figure>
</div>
<p>By eyeballing the distributions, I could tell most staff were recording fewer than 10 clips per day, shooting about 0.5–2 hours for each clip, for a total of 2–10 hours of daily footage, with the majority of employees having less than one significant gap per day. Superficially, this appeared to provide evidence of widespread attempts at complying with the body-worn camera policy and no systemic rejection or resistance. If this were indeed the case, then we could turn our attention to individual outliers.</p>
<p>First, though, we thought we would attempt to validate this initial impression by testing another assumption. If each employee works on average 40 hours per week – a substantial underestimate given how common overtime was – we should expect, over a three-week period, to see about 120 hours of footage per employee in the dataset. This is <em>not</em> what we found.</p>
<p>Average footage per employee was 70.2 hours over the three-week period, meaning that the average employee was recording less than 60% of shift hours worked. With so many hours going unrecorded for unknown reasons, we needed to investigate further.</p>
<p>Surely the shift data would clarify this…</p>
</section>
<section id="analysis-part-2-footage-and-shift-comparison" class="level2">
<h2 class="anchored" data-anchor-id="analysis-part-2-footage-and-shift-comparison">Analysis, part 2: Footage and shift comparison</h2>
<p>With the data on shifts worked from our timekeeping system, I could theoretically compare actual shifts worked to the amount of footage recorded. If there were patterns in where the gaps in footage fell, that comparison might help to explain why.</p>
<p>In order to join the shift data to the camera data, I needed a common unit of analysis beyond “Employee ID.” Using only this value would produce a nonsensical table that joined up every clip of footage to every shift worked.</p>
<p>For example, let’s take employee #9001005 at Facility Epsilon between April 1–3. This employee has the following clips recorded during that time period:</p>
<div class="table-responsive">
<table class="caption-top table">
<thead>
<tr class="header">
<th style="text-align: left;">Employee_ID</th>
<th style="text-align: left;">Clip_ID</th>
<th style="text-align: left;">Clip_Start</th>
<th style="text-align: left;">Clip_End</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">156421</td>
<td style="text-align: left;">2019-04-01 05:54:34</td>
<td style="text-align: left;">2019-04-01 08:34:34</td>
</tr>
<tr class="even">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">155093</td>
<td style="text-align: left;">2019-04-01 08:40:59</td>
<td style="text-align: left;">2019-04-01 08:54:51</td>
</tr>
<tr class="odd">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">151419</td>
<td style="text-align: left;">2019-04-01 09:03:16</td>
<td style="text-align: left;">2019-04-01 11:00:30</td>
</tr>
<tr class="even">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">153133</td>
<td style="text-align: left;">2019-04-01 11:10:09</td>
<td style="text-align: left;">2019-04-01 12:39:51</td>
</tr>
<tr class="odd">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">151088</td>
<td style="text-align: left;">2019-04-01 12:57:51</td>
<td style="text-align: left;">2019-04-01 14:06:44</td>
</tr>
<tr class="even">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">150947</td>
<td style="text-align: left;">2019-04-02 05:56:34</td>
<td style="text-align: left;">2019-04-02 09:48:50</td>
</tr>
<tr class="odd">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">151699</td>
<td style="text-align: left;">2019-04-02 09:54:23</td>
<td style="text-align: left;">2019-04-02 12:17:15</td>
</tr>
</tbody>
</table>
</div>
<p>We can join this to a similar table of shifts logged. This particular employee had the following shifts scheduled from April 1–3:</p>
<div class="table-responsive">
<table class="caption-top table">
<thead>
<tr class="header">
<th style="text-align: left;">Employee_ID</th>
<th style="text-align: left;">Shift_ID</th>
<th style="text-align: left;">Shift_Start</th>
<th style="text-align: left;">Shift_End</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E050603</td>
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 14:00:00</td>
</tr>
<tr class="even">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E051303</td>
<td style="text-align: left;">2019-04-02 06:00:00</td>
<td style="text-align: left;">2019-04-02 14:00:00</td>
</tr>
</tbody>
</table>
</div>
<p>The table shows two eight-hour morning shifts from 6:00 am to 2:00 pm. We can join the two tables together by ID on a messy many-to-many join, but that tells us nothing about how much they overlap (or fail to overlap) without extensive additional work. For example, we have a unique identifier for employee clip (Clip_ID) and employee shift (Shift_ID), but what we need is a unique identifier that can be used to join the two. Fortunately, for this particular data we can <em>create</em> a unique identifier since both clips and shifts are fundamentally measures of <em>time</em>. While Employee_ID is not in itself unique (i.e., one employee can have multiple clips attached to that ID), Employee_ID combined with time of day is unique. A person can only be in one place at a time, after all!</p>
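<p>The unfolding idea can be sketched in a few lines of Python (purely illustrative; the actual analysis was done in R): expand each (employee, start, end) interval into one row per hour touched, so that clips and shifts can be joined on the (employee, hour) pair.</p>

```python
from datetime import datetime, timedelta

def to_hours(employee_id, start, end):
    """Yield one (employee_id, hour) pair per clock hour the interval touches."""
    hour = start.replace(minute=0, second=0, microsecond=0)  # floor to the hour
    while hour < end:
        yield (employee_id, hour)
        hour += timedelta(hours=1)

# An eight-hour shift unfolds into eight joinable employee-hour rows.
shift = list(to_hours("9001005",
                      datetime(2019, 4, 1, 6, 0),
                      datetime(2019, 4, 1, 14, 0)))
n_units = len(shift)  # 8
```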
<p>To reshape the data for joining, I created a function that takes any data frame with a start and end column and unfolds it into discrete units of time. Using the code below to create the “Interval_Convert” function, the shift data above for employee 9001005 converts into one entry per hour of the day per shift. As a result, two eight-hour shifts get turned into 16 employee hours (a sample of which is shown below).</p>
<details>
<summary>
Show the code
</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb2-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{r}</span></span>
<span id="cb2-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(sqldf)</span>
<span id="cb2-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(lubridate)</span>
<span id="cb2-4"></span>
<span id="cb2-5">Interval_Convert <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(DF, Start_Col, End_Col, Int_Unit, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Int_Length =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) {</span>
<span id="cb2-7">  Start_Col2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">enquo</span>(Start_Col)</span>
<span id="cb2-8">  End_Col2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">enquo</span>(End_Col)</span>
<span id="cb2-9">  </span>
<span id="cb2-10">  Start_End <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> DF <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-11">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-12">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarize</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Min_Start =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">min</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2),</span>
<span id="cb2-13">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Max_End =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">max</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>End_Col2)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-14">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Start =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">floor_date</span>(Min_Start, Int_Unit),</span>
<span id="cb2-15">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">End =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ceiling_date</span>(Max_End, Int_Unit))</span>
<span id="cb2-16">  </span>
<span id="cb2-17">  DF <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> DF <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-18">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Single =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>End_Col2)</span>
<span id="cb2-19">  </span>
<span id="cb2-20">  Interval_Table <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Interval_Start =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq.POSIXt</span>(Start_End<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>Start[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], Start_End<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>End[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_c</span>(Int_Length, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>, Int_Unit))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-21">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Interval_End =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lead</span>(Interval_Start)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-22">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(Interval_End))</span>
<span id="cb2-23">  </span>
<span id="cb2-24">  by <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">join_by</span>(Interval_Start <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>End_Col2, Interval_End <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2)  </span>
<span id="cb2-25">  </span>
<span id="cb2-26">  Interval_Data_Table <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> Interval_Table <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb2-27">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(DF, by) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb2-28">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Seconds_Duration_Within_Interval =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">if_else</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>End_Col2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> Interval_End, Interval_End, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>End_Col2) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span></span>
<span id="cb2-29">             <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">if_else</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> Interval_Start, Interval_Start, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-30">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>(Single <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> Interval_End <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2),</span>
<span id="cb2-31">           <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(Seconds_Duration_Within_Interval) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb2-32">  </span>
<span id="cb2-33">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(Interval_Data_Table)</span>
<span id="cb2-34">}</span>
<span id="cb2-35"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
</details>
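The core of the function is the overlap calculation: each record is matched against every hourly interval it touches, and the duration within each interval is clipped to the record's boundaries. The same idea can be sketched outside of R; the minimal Python version below is an illustration only (the function name and sample shift are invented for this sketch, not part of the original analysis):

```python
from datetime import datetime, timedelta

def interval_convert(records, unit=timedelta(hours=1)):
    """Unfold (id, start, end) records into one row per time unit,
    clipping the first and last rows to the record boundaries."""
    rows = []
    for record_id, start, end in records:
        # Align the first interval to the top of the unit (here, the hour)
        cursor = start.replace(minute=0, second=0, microsecond=0)
        while cursor < end:
            interval_end = cursor + unit
            # Overlap of [start, end] with the current interval
            overlap = min(end, interval_end) - max(start, cursor)
            if overlap > timedelta(0):
                rows.append((record_id, cursor, interval_end,
                             int(overlap.total_seconds())))
            cursor = interval_end
    return rows

# One eight-hour shift unfolds into eight hourly rows of 3,600 seconds each
shift = [("E050603", datetime(2019, 4, 1, 6, 0), datetime(2019, 4, 1, 14, 0))]
for row in interval_convert(shift)[:2]:
    print(row)
```

A shift that starts or ends mid-hour simply gets a shorter first or last row, which is what the clipping handles.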
<div class="table-responsive">
<table class="caption-top table">
<colgroup>
<col style="width: 13%">
<col style="width: 12%">
<col style="width: 13%">
<col style="width: 9%">
<col style="width: 11%">
<col style="width: 10%">
<col style="width: 29%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Interval_Start</th>
<th style="text-align: left;">Interval_End</th>
<th style="text-align: left;">Employee_ID</th>
<th style="text-align: left;">Shift_ID</th>
<th style="text-align: left;">Shift_Start</th>
<th style="text-align: left;">Shift_End</th>
<th style="text-align: left;">Seconds_Duration_Within_Interval</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 07:00:00</td>
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E050603</td>
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 14:00:00</td>
<td style="text-align: left;">3600 secs</td>
</tr>
<tr class="even">
<td style="text-align: left;">2019-04-01 07:00:00</td>
<td style="text-align: left;">2019-04-01 08:00:00</td>
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E050603</td>
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 14:00:00</td>
<td style="text-align: left;">3600 secs</td>
</tr>
<tr class="odd">
<td style="text-align: left;">2019-04-01 08:00:00</td>
<td style="text-align: left;">2019-04-01 09:00:00</td>
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E050603</td>
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 14:00:00</td>
<td style="text-align: left;">3600 secs</td>
</tr>
<tr class="even">
<td style="text-align: left;">2019-04-01 09:00:00</td>
<td style="text-align: left;">2019-04-01 10:00:00</td>
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E050603</td>
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 14:00:00</td>
<td style="text-align: left;">3600 secs</td>
</tr>
<tr class="odd">
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
</tr>
</tbody>
</table>
</div>
<p>The footage could be converted in the same manner, breaking both the shift data and the clip data down into an hour-by-hour view that allowed direct comparison. Using this new format, I joined the full tables of footage and shifts to determine how much footage was recorded with no corresponding shift in the timekeeping system.</p>
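Conceptually, finding footage hours with no corresponding shift is an anti-join on employee and hourly interval. A minimal Python sketch of the idea, using invented toy rows rather than the real data:

```python
def hours_without_match(footage_hours, shift_hours):
    """Count hourly footage rows with no shift row for the same
    employee and interval (a set-based anti-join).
    Rows are (employee_id, interval_start) tuples."""
    return len(set(footage_hours) - set(shift_hours))

# Toy data: three footage hours, only two covered by logged shifts
footage = [(9001005, "2019-04-01 06:00"), (9001005, "2019-04-01 07:00"),
           (9001005, "2019-04-01 08:00")]
shifts = [(9001005, "2019-04-01 06:00"), (9001005, "2019-04-01 07:00")]
print(hours_without_match(footage, shifts))  # 1
```

Running the same anti-join in the opposite direction gives the shift hours with no matching footage, which is the second comparison below.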
<div class="table-responsive">
<table class="caption-top table">
<colgroup>
<col style="width: 18%">
<col style="width: 34%">
<col style="width: 47%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">HR_Location</th>
<th style="text-align: left;">Footage_Hours_No_Shift</th>
<th style="text-align: left;">Employee_IDs_With_Missing_Shift</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Alpha</td>
<td style="text-align: left;">1805</td>
<td style="text-align: left;">122</td>
</tr>
<tr class="even">
<td style="text-align: left;">Beta</td>
<td style="text-align: left;">3749</td>
<td style="text-align: left;">114</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Delta</td>
<td style="text-align: left;">1208</td>
<td style="text-align: left;">133</td>
</tr>
<tr class="even">
<td style="text-align: left;">Epsilon</td>
<td style="text-align: left;">2899</td>
<td style="text-align: left;">157</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Gamma</td>
<td style="text-align: left;">4153</td>
<td style="text-align: left;">170</td>
</tr>
</tbody>
</table>
</div>
<p>To summarize what the table is telling us: Almost every employee has footage hours that do not match with logged shifts, totaling nearly 14,000 hours when you add up the Footage_Hours_No_Shift column. But what about the opposite case? How many shift hours were logged with no corresponding footage?</p>
<div class="table-responsive">
<table class="caption-top table">
<colgroup>
<col style="width: 18%">
<col style="width: 33%">
<col style="width: 48%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">HR_Location</th>
<th style="text-align: left;">Shift_Hours_No_Footage</th>
<th style="text-align: left;">Employee_IDs_With_Missing_Footage</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Alpha</td>
<td style="text-align: left;">7338</td>
<td style="text-align: left;">127</td>
</tr>
<tr class="even">
<td style="text-align: left;">Beta</td>
<td style="text-align: left;">6014</td>
<td style="text-align: left;">118</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Delta</td>
<td style="text-align: left;">12830</td>
<td style="text-align: left;">141</td>
</tr>
<tr class="even">
<td style="text-align: left;">Epsilon</td>
<td style="text-align: left;">9000</td>
<td style="text-align: left;">168</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Gamma</td>
<td style="text-align: left;">11960</td>
<td style="text-align: left;">183</td>
</tr>
</tbody>
</table>
</div>
<p>Oh dear. Again, almost every employee has logged shift hours with no footage: 47,000 hours in total. To put it another way, that’s an entire work week per employee not showing up in camera footage.</p>
<p>At this point, we could probably rule out deliberate noncompliance. The clip data already implied that most employees were following the policy, and our facility leadership would surely have noticed a mass refusal large enough to show up this clearly in the data.</p>
<p>One way to check for deliberate noncompliance would be to first exclude shifts with no footage at all. This would rule out total mismatches, where – for whatever reason – the logged shifts had failed to overlap with any recorded clips. For the remaining shifts that <em>do</em> contain footage, we could look at the proportion of each shift covered by footage: if an eight-hour shift had four hours of recorded footage associated with it, then 50% of that shift had been recorded. The following histogram shows the distribution of employees by the percentage of their shift-hours they recorded (counting only shifts with a nonzero amount of footage).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/fig-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A histogram of average percent of shifts recorded, excluding shifts with no recorded footage."></p>
</figure>
</div>
<p>As it turned out, most employees recorded the majority of their matching shifts, a finding that roughly aligns with the initial clip analysis. So, what explains the 14,000 hours of footage with no shifts, and the 47,000 hours of shifts with no footage?</p>
</section>
<section id="causes-of-failure" class="level2">
<h2 class="anchored" data-anchor-id="causes-of-failure">Causes of failure</h2>
<p>Here, I believed, we had reached the end of what I could do with data alone, and so I presented these findings (or lack thereof) to executive leadership. The failure to gather reliable data from linking the clip data to the shift data prompted follow-ups into what exactly was going wrong. As it turned out, <em>many</em> things were going wrong.</p>
<p>First, a number of technical problems plagued the early rollout of the cameras:</p>
<ul>
<li><p>All of our facilities suffered from high turnover, and camera ownership was not consistently updated. Employees who no longer worked at the agency could therefore appear in the clip data – somebody else had taken over their camera but had not put their name and ID on it.</p></li>
<li><p>We had no way of telling if a camera was not recording due to being docked and recharging or not recording due to being switched off.</p></li>
<li><p>In the early days of the rollout, footage got assigned to an employee based on the owner of the <em>dock</em>, not the camera. In other words, if Employee A had recorded their shift with their camera but uploaded the footage using a dock assigned to Employee B then the footage would show up in the system as belonging to Employee B.</p></li>
</ul>
<p>The shift data was, unsurprisingly, even worse, and it was here we came across our most important finding. While the evidence showed that there wasn’t any widespread non-compliance with the use of the cameras, there <em>was</em> widespread non-compliance with the use of our shift management software. Details are included in the dropdown box below.</p>
<div class="callout callout-style-simple callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-3-contents" aria-controls="callout-3" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Quality issues in shift tracking data
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-3" class="callout-3-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>Our HR system, CAPPS, had a feature that tracked hours worked in order to calculate leave and overtime pay. However, CAPPS was a statewide application designed for 9–5 office workers, and could not capture the irregular working hours of our staff (much less aid in planning future shifts). We had obtained separate shift management software to fill these gaps, but had not realized how inconsistently it was being used. All facilities were required to have their employees log their shifts, but some followed through on this better than others. And even for those that did make a good-faith effort at follow-through, quality control was nonexistent.</p>
<p>In CAPPS, time entry determined pay, so strong incentives existed to ensure accurate entry. But for our shift management software, no incentives existed at all for making sure that entries were correct. For example, a correctional officer could have a 40-hour work week scheduled in the shift software but miss the entire week due to an injury, and the software would still show them as having worked 40 hours that week. Nobody bothered to go back and correct these types of errors because there was no reason to.</p>
<p>The software was intended to be used proactively for planning purposes, not after-the-fact for logging and tracking purposes. Thus, it produced data that was totally inconsistent with actual hours worked, which became apparent when compared to data (like body-worn camera footage) that tracked actual hours on the floor.</p>
<p>In the end, we had to rethink a number of aspects of the shift software’s implementation. In the process of these fixes, leadership also came to make explicit that the software’s primary purpose was to help facilities schedule future shifts, not audit hours worked after the fact (which CAPPS already did, just on a day-by-day basis as opposed to an hour-by-hour basis). This analysis was the only time we attempted to use the shift data in this manner.</p>
</div>
</div>
</div>
</section>
<section id="what-we-learned-from-failure" class="level2">
<h2 class="anchored" data-anchor-id="what-we-learned-from-failure">What we learned from failure</h2>
<p>Whatever means we used to monitor compliance with the camera policy, we learned that it couldn’t be fully automated. The agency followed up this analysis with a random sampling approach, in which monitors would randomly select times of day when they knew a given staff member would have to have their camera turned on, and actually watch the associated clips. This review process confirmed the first impressions from the statistical review above: most employees <em>were</em> making good-faith attempts to comply with the policy despite technical glitches, short-staffing, and administrative confusion. It also confirmed that proactive monitoring of correctional officers was a human process which had to come from supervisors and staff.</p>
<p>The one piece of the analysis we did use going forward was the clip analysis (converted into a Power BI dashboard and included in the <a href="https://github.com/enndubbs/Body-Worn-Camera-Monitoring">GitHub repository</a> for this article), but only as a supplement for already-launched investigations, not a prompt for one. Body-worn camera footage remained immensely useful for investigations after-the-fact, but inconsistencies in clip data were not, in and of themselves, particularly noteworthy “red flags.” At the end of the day, analytics can contextualize and enhance human judgment, but it cannot replace it.</p>
<p>In academia, the bias in favor of positive findings is well-documented. The failure to find something, or a lack of statistical significance, does not lend itself to publication in the same way that a novel discovery does. But, in an applied setting, where results matter more than publication criteria, negative findings can be highly insightful. They can falsify erroneous assumptions, bring unknown problems to light, and prompt the creation of new processes and tools. In this context, a failure is only truly a failure if nothing is learned from it.</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<strong>Noah Wright</strong> is a data scientist with the Texas Juvenile Justice Department. He is interested in the applications of data science to public policy in the context of real-world constraints, and the ethics thereof (ethics being highly relevant in his line of work). He can be reached on <a href="https://www.linkedin.com/in/noahdwright/">LinkedIn</a>.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Noah Wright
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Wright, Noah. 2023. “Learning from failure: ‘Red flags’ in body-worn camera data.” Real World Data Science, November 16, 2023. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/learning-from-failure.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>The underlying data for the analysis as presented in this article was requested through the Texas Public Information Act and went through TJJD’s approval process for ensuring anonymity of records. It is available on <a href="https://github.com/enndubbs/Body-Worn-Camera-Monitoring">GitHub</a> along with the rest of the code used to write this article.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Crime and justice</category>
  <category>Public policy</category>
  <category>Data quality</category>
  <category>Data analysis</category>
  <category>Monitoring</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/learning-from-failure.html</guid>
  <pubDate>Thu, 16 Nov 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/bodycam-monitor.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: The value of competitions for confidential data</title>
  <dc:creator>Steven Bedrick, Ophir Frieder, Julia Lane, and Philip Resnik</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/06-value-of-competitions.html</link>
  <description><![CDATA[ 





<p>We are witnessing a sea change in data collection practices by both governments and businesses – from purposeful collection (through surveys and censuses, for example) to opportunistic (drawing on web and social media data, and administrative datasets). This shift has made clear the importance of record linkage – a government might, for example, look to link records held by its various departments to understand how citizens make use of the gamut of public services.</p>
<p>However, creating manual linkages between datasets can be prohibitively expensive, time consuming, and subject to human constraints and bias. Machine learning (ML) techniques offer the potential to combine data better, faster, and more cheaply. But, as the recently released <a href="https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf">National AI Research Resources Task Force report</a> highlights, it is important to have an open and transparent approach to ensure that unintended biases do not occur.</p>
<p>In other words, ML tools are not a substitute for thoughtful analysis. Both private and public producers of a linked dataset have to determine the level of linkage quality – such as what precision/recall tradeoff is best for the intended purpose (that is, the balance between false-positive links and failure to cover links that should be there), how much processing time and cost is acceptable, and how to address coverage issues. The challenge is made more difficult by the idiosyncrasies of heterogeneous datasets, and more difficult yet when datasets to be linked include confidential data <span class="citation" data-cites="10.1257/jel.20171350 DBLP:books/sp/ChristenRS20">(Christensen and Miguel 2018; Christen et al. 2020)</span>.</p>
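To make the precision/recall tradeoff concrete: treating linkage as a decision over candidate record pairs, precision is the share of proposed links that are correct, and recall is the share of true links that are found. A small sketch with invented pairs (illustrative only, not drawn from any dataset discussed here):

```python
def precision_recall(proposed, truth):
    """Precision and recall over sets of (record_a, record_b) link pairs."""
    true_positives = len(proposed & truth)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

truth = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
proposed = {(1, "a"), (2, "b"), (5, "e")}  # one false positive, two missed links
precision, recall = precision_recall(proposed, truth)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.67, recall=0.50
```

A looser matching rule would raise recall at the cost of precision, and vice versa; which point on that curve is acceptable is exactly the kind of decision that belongs to domain experts, not to the matching algorithm.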
<p>And, of course, an ML solution is never the end of the road: many data linkage scenarios are highly dynamic, involving use cases, datasets, and technical ecosystems that change and evolve over time; effective use of ML in practice necessitates an ongoing and continuous investment <span class="citation" data-cites="DBLP:journals/corr/abs-2112-01716">(Koch et al. 2021)</span>. Because techniques are constantly improving, producers need to keep abreast of new approaches. A model that is working well today may no longer work in a year because of changes in the data, or because the organizational needs have changed so that a certain type of error is no longer acceptable. As Sculley et al.&nbsp;point out, “it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning” <span class="citation" data-cites="43146">(Sculley et al. 2014)</span>.</p>
<p>Also important is that record linkage is not seen as a technical problem relegated to the realm of computer scientists to solve. The full engagement of domain experts in designing the optimization problem, identifying measures of success, and evaluating the quality of the results is absolutely critical, as is building an understanding of the pros and cons of different measures <span class="citation" data-cites="10.1371/journal.pone.0249833 10.1007/s11222-017-9746-6">(Schafer et al. 2021; Hand and Christen 2018)</span>. There will need to be much learning by doing in “sandbox” environments, and back and forth communication across communities to achieve successful outcomes, as noted in the <a href="https://www.bea.gov/system/files/2022-10/acdeb-year-2-report.pdf">recommendations of the Advisory Committee on Data for Evidence Building</a> (a screenshot of which is shown in Figure 1).</p>
<p><a href="images/pt6-fig1.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt6-fig1.png" class="img-fluid" width="700"></a></p>
<div class="figure-caption">
<p><strong>Figure 1:</strong> A recommendation for building an “innovation sandbox” as part of the creation of a new National Secure Data Service in the United States.</p>
</div>
<p>Despite the importance of trial and error and transparency about linkage quality, there is no handbook that guides domain experts in how to design such sandboxes. There is a very real need for agreed-upon, domain-independent guidelines, or better yet, official standards to evaluate sandboxes. Those standards would define “who” could and would conduct the evaluation, and help guarantee independence and repeatability. And while innovation challenges have been embraced by the federal government, the devil can be very much in the details <span class="citation" data-cites="4138bca6-f7b7-3af8-a96c-5e2544823c5c">(Williams 2012)</span>.</p>
<p>It is for this reason that the approach taken in the Food for Thought linkage competition, and described in this compendium, provides an important first step towards a well specified, replicable framework for achieving high quality outcomes. In that respect it joins other recent efforts to bring together community-level research on shared sensitive data <span class="citation" data-cites="macavaney-etal-2021-community tsakalidis-etal-2022-overview">(MacAvaney et al. 2021; Tsakalidis et al. 2022)</span>. This competition, like those, helped bring to the foreground both the opportunities and challenges of doing research in secure sandboxes with sensitive data. Notably, these exercises highlight a kind of cultural tension between secure, managed environments, on the one hand, and unfettered machine learning research, on the other. The need for flexibility and agility in computational research bumps up against the need for advance planning and careful step-by-step processes in environments with well-defined data governance rules, and one of the key lessons learned is that the tradeoffs here need to be recognized and planned for.</p>
<p>This particular competition was important for a number of other reasons. Thanks to its organization as a competition, complete with prizes and bragging rights for strongly performing teams, it attracted new eyes from computer science and data science to think about how to address a critical real-world linkage problem. It offered the potential to produce approaches that were scalable, transparent, and reproducible. The engagement of domain experts and statisticians meant that it will be possible to conduct an informed error analysis, to explicitly relate the performance metrics in the task to the problem being solved in the real world, and to bring in the expertise of survey methodologists to think about the possible adjustments. And because it identified different approaches of addressing the same problem, it created an environment for new innovative ideas.</p>
<p>More generally, in addition to the excitement of the new approaches, this exercise laid bare the fragility of linkages in general and highlighted the importance of secure sandboxes for confidential data. While the promise of privacy-preserving technologies is alluring as <a href="https://www.bea.gov/system/files/2022-10/acdeb-year-2-report.pdf">an alternative to bringing confidential data together in one place</a>, such approaches are likely too immature to deploy ad hoc until a better understanding is established of how to translate real-world problems and their associated data into well-defined tasks, how to measure quality, and particularly how to assess the impact of match quality on different subgroups <span class="citation" data-cites="10.1145/3433638">(Domingo-Ferrer et al. 2021)</span>. The scientific profession has gone through too painful a lesson with the premature application of differential privacy techniques to ignore the lessons that can be learned from a careful and systematic analysis of different approaches <span class="citation" data-cites="10.1145/3433638 van_riper 10.1257/pandp.20191107 giles2022faking">(Domingo-Ferrer et al. 2021; Van Riper et al. 2020; Ruggles et al. 2019; Giles et al. 2022)</span>.</p>
<p>We hope that the articles in this collection provide not only the first steps towards a handbook of best practices, but also an inspiration to share lessons learned, so that success can be emulated, and failures understood and avoided.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/05-third-place-winners.html">← Part 5: Third place winners</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Steven Bedrick</strong> is an associate professor in Oregon Health and Science University’s Department of Medical Informatics and Clinical Epidemiology.
</dd>
<dd>
<strong>Ophir Frieder</strong> is a professor in Georgetown University’s Department of Computer Science, and in the Department of Biostatistics, Bioinformatics &amp; Biomathematics at Georgetown University Medical Center.
</dd>
<dd>
<strong>Julia Lane</strong> is a professor at the NYU Wagner Graduate School of Public Service and a NYU Provostial Fellow for Innovation Analytics. She co-founded the Coleridge Initiative.
</dd>
<dd>
<strong>Philip Resnik</strong> holds a joint appointment as professor in the University of Maryland Institute for Advanced Computer Studies and the Department of Linguistics, and an affiliate professor appointment in computer science.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Steven Bedrick, Ophir Frieder, Julia Lane, and Philip Resnik
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@alexandru_tugui?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Alexandru Tugui</a> on <a href="https://unsplash.com/photos/-inuQpBGbgI?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Bedrick, Steven, Ophir Frieder, Julia Lane, and Philip Resnik. 2023. “Food for Thought: The value of competitions for confidential data.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/06-value-of-competitions.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>




<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-DBLP:books/sp/ChristenRS20" class="csl-entry">
Christen, P., T. Ranbaduge, and R. Schnell. 2020. <em>Linking Sensitive Data - Methods and Techniques for Practical Privacy-Preserving Information Sharing</em>. Springer. <a href="https://doi.org/10.1007/978-3-030-59706-1">https://doi.org/10.1007/978-3-030-59706-1</a>.
</div>
<div id="ref-10.1257/jel.20171350" class="csl-entry">
Christensen, G., and E. Miguel. 2018. <span>“Transparency, Reproducibility, and the Credibility of Economics Research.”</span> <em>Journal of Economic Literature</em> 56 (3): 920–80. <a href="https://doi.org/10.1257/jel.20171350">https://doi.org/10.1257/jel.20171350</a>.
</div>
<div id="ref-10.1145/3433638" class="csl-entry">
Domingo-Ferrer, J., D. Sánchez, and A. Blanco-Justicia. 2021. <span>“The Limits of Differential Privacy (and Its Misuse in Data Release and Machine Learning).”</span> <em>Communications of the ACM</em> (New York, NY, USA) 64 (7): 33–35. <a href="https://doi.org/10.1145/3433638">https://doi.org/10.1145/3433638</a>.
</div>
<div id="ref-giles2022faking" class="csl-entry">
Giles, O., K. Hosseini, G. Mingas, et al. 2022. <em>Faking Feature Importance: A Cautionary Tale on the Use of Differentially-Private Synthetic Data</em>. <a href="https://arxiv.org/abs/2203.01363">https://arxiv.org/abs/2203.01363</a>.
</div>
<div id="ref-10.1007/s11222-017-9746-6" class="csl-entry">
Hand, D., and P. Christen. 2018. <span>“A Note on Using the f-Measure for Evaluating Record Linkage Algorithms.”</span> <em>Statistics and Computing</em> (USA) 28 (3): 539–47. <a href="https://doi.org/10.1007/s11222-017-9746-6">https://doi.org/10.1007/s11222-017-9746-6</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-2112-01716" class="csl-entry">
Koch, B., E. Denton, A. Hanna, and J. G. Foster. 2021. <span>“Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research.”</span> <em>CoRR</em> abs/2112.01716. <a href="https://arxiv.org/abs/2112.01716">https://arxiv.org/abs/2112.01716</a>.
</div>
<div id="ref-macavaney-etal-2021-community" class="csl-entry">
MacAvaney, S., A. Mittu, G. Coppersmith, J. Leintz, and P. Resnik. 2021. <span>“Community-Level Research on Suicidality Prediction in a Secure Environment: Overview of the <span>CLP</span>sych 2021 Shared Task.”</span> <em>Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access</em> (Online), June, 70–80. <a href="https://doi.org/10.18653/v1/2021.clpsych-1.7">https://doi.org/10.18653/v1/2021.clpsych-1.7</a>.
</div>
<div id="ref-10.1257/pandp.20191107" class="csl-entry">
Ruggles, S., C. Fitch, D. Magnuson, and J. Schroeder. 2019. <span>“Differential Privacy and Census Data: Implications for Social and Economic Research.”</span> <em>AEA Papers and Proceedings</em> 109 (May): 403–8. <a href="https://doi.org/10.1257/pandp.20191107">https://doi.org/10.1257/pandp.20191107</a>.
</div>
<div id="ref-10.1371/journal.pone.0249833" class="csl-entry">
Schafer, K. M., G. Kennedy, A. Gallyer, and P. Resnik. 2021. <span>“A Direct Comparison of Theory-Driven and Machine Learning Prediction of Suicide: A Meta-Analysis.”</span> <em>PLOS ONE</em> 16 (4): 1–23. <a href="https://doi.org/10.1371/journal.pone.0249833">https://doi.org/10.1371/journal.pone.0249833</a>.
</div>
<div id="ref-43146" class="csl-entry">
Sculley, D., G. Holt, D. Golovin, et al. 2014. <span>“Machine Learning: The High Interest Credit Card of Technical Debt.”</span> <em>SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop)</em>.
</div>
<div id="ref-tsakalidis-etal-2022-overview" class="csl-entry">
Tsakalidis, A., J. Chim, I. M. Bilal, et al. 2022. <span>“Overview of the <span>CLP</span>sych 2022 Shared Task: Capturing Moments of Change in Longitudinal User Posts.”</span> <em>Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology</em> (Seattle, USA), July, 184–98. <a href="https://doi.org/10.18653/v1/2022.clpsych-1.16">https://doi.org/10.18653/v1/2022.clpsych-1.16</a>.
</div>
<div id="ref-van_riper" class="csl-entry">
Van Riper, D., T. Kugler, J. Schroeder, and S. Ruggles. 2020. <span>“Differential Privacy and Racial Residential Segregation.”</span> <em>2020 APPAM Fall Research Conference</em>.
</div>
<div id="ref-4138bca6-f7b7-3af8-a96c-5e2544823c5c" class="csl-entry">
Williams, H. 2012. <span>“Innovation Inducement Prizes: Connecting Research to Policy.”</span> <em>Journal of Policy Analysis and Management</em> 31 (3): 752–76. <a href="http://www.jstor.org/stable/41653827">http://www.jstor.org/stable/41653827</a>.
</div>
</div></section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/06-value-of-competitions.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/06-value-of-competitions.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: Third place winners – Loyola Marymount</title>
  <dc:creator>Yifan Hu and Mandy Korpusik</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/05-third-place-winners.html</link>
  <description><![CDATA[ 





<p>Undergraduate student Yifan (Rosetta) Hu was responsible for writing the Python script that pre-processes the 2015–2016 UPC, EC, and PPC data for training neural network models. Her script randomly sampled five negative EC descriptions for every positive match between a UPC and EC code. Professor Mandy Korpusik performed the remaining work, including setting up the environment, training the BERT model, and running the evaluation. Hu spent roughly 10 hours on the competition; Korpusik spent roughly 40 hours (plus many additional hours running and monitoring the training and testing scripts).</p>
<section id="our-perspective-on-the-challenge" class="level2">
<h2 class="anchored" data-anchor-id="our-perspective-on-the-challenge">Our perspective on the challenge</h2>
<p>The goal of this challenge is to use machine learning and natural language processing (NLP) to link language-based entries in the IRI and FNDDS databases. Our proposed approach is based on our prior work using deep learning models to map users’ natural language meal descriptions to the FNDDS database <span class="citation" data-cites="7953245">(Korpusik et al. 2017b)</span> to retrieve nutrition information in a spoken diet tracking system. In the past, we found a trade-off between accuracy and cost, leading us to select convolutional neural networks over recurrent long short-term memory (LSTM) networks – with nearly 10x as many parameters and 2x the training time required, LSTMs achieved slightly lower performance on semantic tagging and food database mapping on meals in the breakfast category. Here, we propose to investigate state-of-the-art transformers, specifically the contextual embedding model (i.e., the entire sentence is used as context to generate the embedding) known as BERT <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(Bidirectional Encoder Representations from Transformers, Devlin et al. 2018)</span>.</p>
<section id="related-work" class="level5">
<h5 class="anchored" data-anchor-id="related-work">Related work</h5>
<p>Within the past few years, several papers have come out that learn contextual representations of sentences, where the entire sentence is used to generate embeddings.</p>
<p>ELMo <span class="citation" data-cites="DBLP:journals/corr/abs-1802-05365">(Peters et al. 2018)</span> uses a linear combination of vectors extracted from intermediate layer representations of a bidirectional LSTM trained on a large text corpus as a language model; in this feature-based approach, the ELMo vector of the full input sentence is concatenated with the standard context-independent token representations and passed through a task-dependent model for final prediction. This showed performance improvement over state-of-the-art on six NLP tasks, including question answering, textual entailment, and sentiment analysis.</p>
<p>OpenAI GPT <span class="citation" data-cites="radford2018improving">(Radford et al. 2018)</span> is a fine-tuning approach, where they first pre-train a multi-layer transformer <span class="citation" data-cites="NIPS2017_3f5ee243">(Vaswani et al. 2017)</span> as a language model on a large text corpus, and then conduct supervised fine-tuning on the specific task of interest, with a linear softmax layer on top of the pre-trained transformer.</p>
<p>Google’s BERT <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(2018)</span> is a fine-tuning approach similar to GPT, but with the key difference that instead of combining separately trained forward and backward transformers, they instead use a masked language model for pre-training, where they randomly masked out input tokens and predicted only those tokens. They demonstrated state-of-the-art performance on 11 NLP tasks, including the CoNLL 2003 named entity recognition task, which is similar to our semantic tagging task.</p>
<p>Finally, many models have recently been developed that improve upon BERT, including RoBERTa <span class="citation" data-cites="DBLP:journals/corr/abs-1907-11692">(which improves BERT’s pre-training by using bigger batches and more data, Y. Liu et al. 2019)</span>, XLNet <span class="citation" data-cites="NEURIPS2019_dc6a7e65">(which uses Transformer-XL and avoids BERT’s pretrain-finetune discrepancy through learning a truly bidirectional context via permutations over the factorization order, Yang et al. 2019)</span>, and ALBERT <span class="citation" data-cites="DBLP:journals/corr/abs-1909-11942">(a lightweight BERT, Lan et al. 2019)</span>.</p>
<p>In our prior work on language understanding for nutrition <span class="citation" data-cites="7078635 7472843 7902155 korpusik17_interspeech 8461769 8721137">(Korpusik et al. 2014, 2016, 2017a; Korpusik and Glass 2017, 2018, 2019)</span>, we used a similar binary classification approach for learning embeddings, which were then used at test time to map from user-described meals to USDA food database matches, but with convolutional neural networks (CNNs) instead of BERT. (BERT was not created until 2018, and due to limited memory available for deployment, we needed a smaller model than even BERT base, which has 100 million parameters.) Further work demonstrated that BERT outperformed CNNs on several language understanding tasks, including nutrition <span class="citation" data-cites="korpusik19_interspeech">(Korpusik et al. 2019)</span>.</p>
</section>
</section>
<section id="our-approach" class="level2">
<h2 class="anchored" data-anchor-id="our-approach">Our approach</h2>
<p>Our approach is to fine-tune a large pre-trained BERT language model on the food data. BERT was originally trained on a massive amount of text for a language modelling task (i.e., predicting which word should come next in a sentence). It relies on a transformer model, which uses an “attention” mechanism to identify which words the model should pay the most “attention” to. We are specifically using BERT for binary sequence classification, which refers to predicting a label (i.e., classification) for a sequence of words. In our case, during fine-tuning (i.e., training the model further on our own dataset) we will feed the model pairs of sentences (where one sentence is the UPC description of a food item and the other is the EC description of another food item), and the model will perform binary classification, predicting whether the sentences are a match (i.e., 1) or not (i.e., 0). We start with the 2015–2016 ground truth PPC data for positive examples, and five randomly sampled negative examples per positive example.</p>
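As an illustration of how such labeled training pairs might be assembled, here is a minimal sketch; the function name, variable names, and toy descriptions are ours, not the competition codebase's:

```python
import random

def build_pairs(ppc_matches, ec_pool, n_neg=5, seed=0):
    """Label (UPC description, EC description) pairs for binary
    classification: 1 = ground-truth match, 0 = sampled negative."""
    rng = random.Random(seed)
    pairs = []
    for upc_desc, ec_desc in ppc_matches:
        pairs.append((upc_desc, ec_desc, 1))
        # sample EC descriptions that are NOT the true match
        candidates = [e for e in ec_pool if e != ec_desc]
        for neg in rng.sample(candidates, n_neg):
            pairs.append((upc_desc, neg, 0))
    return pairs

matches = [("ORGANIC SKIM MILK 64 OZ", "Milk, fat free (skim)")]
ec_pool = ["Milk, fat free (skim)", "Cheese, cheddar", "Bread, white",
           "Apple, raw", "Yogurt, plain", "Rice, white, cooked"]
pairs = build_pairs(matches, ec_pool)
print(len(pairs))  # 6: one positive plus five negatives
```

Each tuple then becomes one sentence-pair input to the classifier, with the third element as its label.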
<section id="training-methods" class="level5">
<h5 class="anchored" data-anchor-id="training-methods">Training methods</h5>
<p>Since we used a neural network model, the only features passed into our model were the tokenized words themselves of the EC and UPC food descriptions – we did not conduct any manual feature engineering <span class="citation" data-cites="dong_liu">(Dong and Liu 2018)</span>. The model was trained on a 90/10 split into 90% training and 10% validation data, where the validation data was used as a test set to fine-tune the model’s hyperparameters. We started with a randomly sampled set of 16,000 pairs, batch size of 16 (i.e., the model would train on batches of 16 samples at a time), AdamW <span class="citation" data-cites="DBLP:journals/corr/abs-1711-05101">(Loshchilov and Hutter 2017)</span> as the optimizer (which adaptively updates the learning rate, or how large the update should be to the model’s parameters), a linear schedule with warmup <span class="citation" data-cites="DBLP:journals/corr/abs-1908-03265">(i.e., starting with a small learning rate in the first few epochs of training due to large variance in early stages of training, L. Liu et al. 2019)</span>, and one epoch (i.e., the number of times the model passes through all the training data). We then added the next randomly sampled set of 16,000 pairs to get a model trained on 32,000 data points. Finally, we reached a total of 48,000 data samples used for training. Each pair of sequences was tokenized with the pre-trained BERT tokenizer, with the special CLS and SEP tokens (where CLS is a learned vector that is typically passed to downstream layers for final classification, and SEP is a learned vector that separates two input sequences), and was padded with zeros to the maximum length input sequence of 240 tokens, so that each input sequence would be the same length.</p>
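The pairing-and-padding scheme can be illustrated with a toy whitespace tokenizer (the actual pipeline uses BERT's pre-trained WordPiece tokenizer; the token IDs here are invented purely for the sketch):

```python
def encode_pair(upc_desc, ec_desc, vocab, max_len=240):
    """Mimic BERT pair encoding: [CLS] upc [SEP] ec [SEP],
    then zero-pad every sequence to the same fixed length."""
    tokens = (["[CLS]"] + upc_desc.lower().split()
              + ["[SEP]"] + ec_desc.lower().split() + ["[SEP]"])
    ids = [vocab.setdefault(t, len(vocab) + 1) for t in tokens]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))  # attention mask
    ids += [0] * (max_len - len(ids))                   # pad token id = 0
    return ids, mask

vocab = {}
ids, mask = encode_pair("organic skim milk", "milk fat free skim", vocab)
print(len(ids), sum(mask))  # 240 10
```

The attention mask tells the model which positions hold real tokens and which are padding, so padded batches of uniform length can be processed together.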
</section>
<section id="model-development-approach" class="level5">
<h5 class="anchored" data-anchor-id="model-development-approach">Model development approach</h5>
<p>We faced many challenges due to the secure nature of the ADRF environment. Since our approach relies on BERT, we were initially blocked by errors in the local BERT installation. Typically, BERT is downloaded from the web as the program runs; for this challenge, however, BERT had to be installed locally for security reasons. To fix the errors, the BERT models needed to be installed with <code>git lfs clone</code> rather than a plain <code>git clone</code>, so that the large model weight files were fetched correctly.</p>
<p>Second, we were unable to retrieve the test data from the database due to SQLAlchemy errors. We found a workaround by using DBeaver directly to save database tables as Excel spreadsheets, rather than accessing the database tables through Python.</p>
<p>Finally, we needed a GPU in order to efficiently train our BERT models. However, we initially only had a CPU, so there was a delay due to setting up the GPU configuration. Once the GPU image was set up, there was still a CUDA error when running the BERT model during training. We determined that the model was too big to fit into GPU memory, so we found a workaround using gradient checkpointing (trading off computation speed for memory) with the transformers library’s Trainer and TrainingArguments. Unfortunately, the version of transformers we were using did not have these tools, and the library was not updated until less than a week before the deadline, so we still had to train the model on the CPU.</p>
<p>To deal with the inability to run jobs in the background, we checkpointed our models every five batches, and likewise saved the model's predictions to a CSV file every five batches during evaluation.</p>
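A sketch of that checkpointed evaluation loop; the file name, batch contents, and the stand-in scoring function are illustrative only:

```python
import csv

def evaluate_with_flushes(batches, score_fn, out_path, every=5):
    """Append predictions to a CSV every `every` batches, so an
    interrupted foreground run loses at most `every` batches of work."""
    buffer, last_flush = [], 0
    for i, batch in enumerate(batches, start=1):
        buffer.extend((item, score_fn(item)) for item in batch)
        if i % every == 0:
            with open(out_path, "a", newline="") as f:
                csv.writer(f).writerows(buffer)
            buffer, last_flush = [], i
    if buffer:  # flush any leftover batches at the end
        with open(out_path, "a", newline="") as f:
            csv.writer(f).writerows(buffer)
    return last_flush

batches = [[f"upc-{i}-{j}" for j in range(2)] for i in range(7)]
done = evaluate_with_flushes(batches, len, "predictions_demo.csv")
print(done)  # 5: the last full checkpoint was after batch 5
```

Appending rather than rewriting means a killed process can resume from the row count already on disk.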
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>Find the code in the <a href="https://github.com/realworlddatascience/realworlddatascience.github.io/tree/main/case-studies/posts/2023/08/21/_code">Real World Data Science GitHub repository</a>.</p>
</div>
</div>
</div>
</section>
</section>
<section id="our-results" class="level2">
<h2 class="anchored" data-anchor-id="our-results">Our results</h2>
<p>After training, the 48K model (so-called because it was trained on 48,000 data samples) was used at test time to rank all possible 2017–18 EC descriptions given an unseen UPC description. The rankings were obtained from the model’s output value – the higher the output (or confidence), the more highly we ranked that EC description. To speed up the ranking process, we used blocking (i.e., only ranking a subset of all possible matches), specifically with exact word matches (using only the first six words in the UPC description, which appeared to be the most important), and fed all possible matches through the model in one batch per UPC description. Since we still did not have sufficient time to complete evaluation on the full set of test UPC descriptions, we implemented an expedited evaluation that only considered the first 10 matching EC descriptions in the BERT ranking process (which we call BERT-FAST). We also report results for the slower evaluation method that considers all EC descriptions that match at least one of the first six words in a given UPC description, but note that these results are based on just a small subset of the total test set. See Table 1 below for our results, where Success@5 (S@5) indicates how often the correct match was ranked among the top five. See Table 2 for an estimate of how long it takes to train and test the model on a CPU.</p>
<div class="figure-caption">
<p><strong>Table 1:</strong> Success@5 and NDCG@5 for BERT, both for fast evaluation over the whole test set, and slower evaluation on a smaller subset (711 UPCs out of 37,693 total).</p>
</div>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th style="text-align: center;">Success@5</th>
<th style="text-align: center;">NDCG@5</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>BERT-FAST</td>
<td style="text-align: center;">0.057</td>
<td style="text-align: center;">0.047</td>
</tr>
<tr class="even">
<td>BERT-SLOW</td>
<td style="text-align: center;">0.537</td>
<td style="text-align: center;">0.412</td>
</tr>
</tbody>
</table>
<p><br>
</p>
<div class="figure-caption">
<p><strong>Table 2:</strong> An estimate of the time required to train and test the model.</p>
</div>
<table class="caption-top table">
<thead>
<tr class="header">
<th></th>
<th>Time</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Training (on 48K samples)</td>
<td>16 hours</td>
</tr>
<tr class="even">
<td>Testing (BERT-FAST)</td>
<td>52 hours</td>
</tr>
<tr class="odd">
<td>Testing (BERT-SLOW)</td>
<td>63 days</td>
</tr>
</tbody>
</table>
<p><br>
</p>
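The blocking-and-ranking procedure described above can be sketched as follows, with a trivial word-overlap scorer standing in for the trained BERT model's confidence (function names and toy data are ours):

```python
def blocked_candidates(upc_desc, ec_descs, n_words=6):
    """Blocking: keep only EC descriptions that share at least one of
    the first six words of the UPC description."""
    key = set(upc_desc.lower().split()[:n_words])
    return [ec for ec in ec_descs if key & set(ec.lower().split())]

def rank_top_k(upc_desc, ec_descs, score_fn, k=5):
    """Score only the blocked candidates, highest confidence first."""
    cands = blocked_candidates(upc_desc, ec_descs)
    return sorted(cands, key=lambda ec: score_fn(upc_desc, ec),
                  reverse=True)[:k]

# stand-in scorer: number of shared words (the real system would use
# the fine-tuned BERT model's output here)
overlap = lambda u, e: len(set(u.lower().split()) & set(e.lower().split()))

ecs = ["milk fat free skim", "cheese cheddar", "milk whole", "bread white"]
top = rank_top_k("organic skim milk 64 oz carton", ecs, overlap)
print(top)  # ['milk fat free skim', 'milk whole']
```

Blocking shrinks the candidate set before the expensive model call, which is what makes the difference between the BERT-FAST and BERT-SLOW running times above.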
</section>
<section id="future-workrefinement" class="level2">
<h2 class="anchored" data-anchor-id="future-workrefinement">Future work/refinement</h2>
<p>In the future, with more time available, we would train on all of the data, not just our limited dataset of 48,000 pairs, and would evaluate on the held-out test set with the full set of possible EC matches that have one or more words in common with the UPC description. We would compare against baseline word embedding methods such as word2vec <span class="citation" data-cites="DBLP:journals/corr/abs-1712-09405">(Mikolov et al. 2017)</span> and GloVe <span class="citation" data-cites="pennington-etal-2014-glove">(Pennington et al. 2014)</span>, and we would explore hierarchical prediction methods for improving efficiency and accuracy. Specifically, we would first train a classifier to predict the generic food category, and then train finer-grained models to predict specific foods within that category. Finally, we are exploring multi-modal transformer-based approaches that allow two input modalities (i.e., food images and text descriptions of a meal) for predicting the best UPC match.</p>
</section>
<section id="lessons-learned" class="level2">
<h2 class="anchored" data-anchor-id="lessons-learned">Lessons learned</h2>
<p>We recommend that future challenges provide every team with both a CPU and a GPU in their workspace, to avoid transitioning from one to the other midway through the challenge. In addition, if possible, it would be very helpful to provide a mechanism for running jobs in the background. Finally, it may be useful for teams to submit snippets of code along with library package names, in order for the installations to be tested properly beforehand.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/04-second-place-winners.html">← Part 4: Second place winners</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/06-value-of-competitions.html">Part 6: The value of competitions →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Yifan (Rosetta) Hu</strong> is an undergraduate student and <strong>Mandy Korpusik</strong> is an assistant professor of computer science at Loyola Marymount University’s Seaver College of Science and Engineering.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Yifan Hu and Mandy Korpusik
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@pvsbond?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Peter Bond</a> on <a href="https://unsplash.com/photos/KfvknMhkmw0?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Hu, Yifan, and Mandy Korpusik. 2023. “Food for Thought: Third place winners – Loyola Marymount.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/05-third-place-winners.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-DBLP:journals/corr/abs-1810-04805" class="csl-entry">
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. <span>“<span>BERT:</span> Pre-Training of Deep Bidirectional Transformers for Language Understanding.”</span> <em>CoRR</em> abs/1810.04805. <a href="http://arxiv.org/abs/1810.04805">http://arxiv.org/abs/1810.04805</a>.
</div>
<div id="ref-dong_liu" class="csl-entry">
Dong, G., and H. Liu, eds. 2018. <em>Feature Engineering for Machine Learning and Data Analytics</em>. First edition. CRC Press.
</div>
<div id="ref-korpusik17_interspeech" class="csl-entry">
Korpusik, M., Z. Collins, and J. Glass. 2017a. <span>“<span class="nocase">Character-Based Embedding Models and Reranking Strategies for Understanding Natural Language Meal Descriptions</span>.”</span> <em>Proceedings of Interspeech</em>, 3320–24. <a href="https://doi.org/10.21437/Interspeech.2017-422">https://doi.org/10.21437/Interspeech.2017-422</a>.
</div>
<div id="ref-7953245" class="csl-entry">
Korpusik, M., Z. Collins, and J. Glass. 2017b. <span>“Semantic Mapping of Natural Language Input to Database Entries via Convolutional Neural Networks.”</span> <em>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 5685–89. <a href="https://doi.org/10.1109/ICASSP.2017.7953245">https://doi.org/10.1109/ICASSP.2017.7953245</a>.
</div>
<div id="ref-7902155" class="csl-entry">
Korpusik, M., and J. Glass. 2017. <span>“Spoken Language Understanding for a Nutrition Dialogue System.”</span> <em>IEEE/ACM Transactions on Audio, Speech, and Language Processing</em> 25 (7): 1450–61. <a href="https://doi.org/10.1109/TASLP.2017.2694699">https://doi.org/10.1109/TASLP.2017.2694699</a>.
</div>
<div id="ref-8461769" class="csl-entry">
Korpusik, M., and J. Glass. 2018. <span>“Convolutional Neural Networks and Multitask Strategies for Semantic Mapping of Natural Language Input to a Structured Database.”</span> <em>2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 6174–78. <a href="https://doi.org/10.1109/ICASSP.2018.8461769">https://doi.org/10.1109/ICASSP.2018.8461769</a>.
</div>
<div id="ref-8721137" class="csl-entry">
Korpusik, M., and J. Glass. 2019. <span>“Deep Learning for Database Mapping and Asking Clarification Questions in Dialogue Systems.”</span> <em>IEEE/ACM Transactions on Audio, Speech, and Language Processing</em> 27 (8): 1321–34. <a href="https://doi.org/10.1109/TASLP.2019.2918618">https://doi.org/10.1109/TASLP.2019.2918618</a>.
</div>
<div id="ref-7472843" class="csl-entry">
Korpusik, M., C. Huang, M. Price, and J. Glass. 2016. <span>“Distributional Semantics for Understanding Spoken Meal Descriptions.”</span> <em>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 6070–74. <a href="https://doi.org/10.1109/ICASSP.2016.7472843">https://doi.org/10.1109/ICASSP.2016.7472843</a>.
</div>
<div id="ref-korpusik19_interspeech" class="csl-entry">
Korpusik, M., Z. Liu, and J. Glass. 2019. <span>“<span class="nocase">A Comparison of Deep Learning Methods for Language Understanding</span>.”</span> <em>Proceedings of Interspeech</em>, 849–53. <a href="https://doi.org/10.21437/Interspeech.2019-1262">https://doi.org/10.21437/Interspeech.2019-1262</a>.
</div>
<div id="ref-7078635" class="csl-entry">
Korpusik, M., N. Schmidt, J. Drexler, S. Cyphers, and J. Glass. 2014. <span>“Data Collection and Language Understanding of Food Descriptions.”</span> <em>2014 IEEE Spoken Language Technology Workshop (SLT)</em>, 560–65. <a href="https://doi.org/10.1109/SLT.2014.7078635">https://doi.org/10.1109/SLT.2014.7078635</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1909-11942" class="csl-entry">
Lan, Z., M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. 2019. <span>“<span>ALBERT:</span> <span>A</span> Lite <span>BERT</span> for Self-Supervised Learning of Language Representations.”</span> <em>CoRR</em> abs/1909.11942. <a href="http://arxiv.org/abs/1909.11942">http://arxiv.org/abs/1909.11942</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1908-03265" class="csl-entry">
Liu, L., H. Jiang, P. He, et al. 2019. <span>“On the Variance of the Adaptive Learning Rate and Beyond.”</span> <em>CoRR</em> abs/1908.03265. <a href="http://arxiv.org/abs/1908.03265">http://arxiv.org/abs/1908.03265</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1907-11692" class="csl-entry">
Liu, Y., M. Ott, N. Goyal, et al. 2019. <span>“RoBERTa: <span>A</span> Robustly Optimized <span>BERT</span> Pretraining Approach.”</span> <em>CoRR</em> abs/1907.11692. <a href="http://arxiv.org/abs/1907.11692">http://arxiv.org/abs/1907.11692</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1711-05101" class="csl-entry">
Loshchilov, I., and F. Hutter. 2017. <span>“Fixing Weight Decay Regularization in Adam.”</span> <em>CoRR</em> abs/1711.05101. <a href="http://arxiv.org/abs/1711.05101">http://arxiv.org/abs/1711.05101</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1712-09405" class="csl-entry">
Mikolov, T., E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. 2017. <span>“Advances in Pre-Training Distributed Word Representations.”</span> <em>CoRR</em> abs/1712.09405. <a href="http://arxiv.org/abs/1712.09405">http://arxiv.org/abs/1712.09405</a>.
</div>
<div id="ref-pennington-etal-2014-glove" class="csl-entry">
Pennington, J., R. Socher, and C. Manning. 2014. <span>“<span>G</span>lo<span>V</span>e: Global Vectors for Word Representation.”</span> <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (<span>EMNLP</span>)</em> (Doha, Qatar), October, 1532–43. <a href="https://doi.org/10.3115/v1/D14-1162">https://doi.org/10.3115/v1/D14-1162</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1802-05365" class="csl-entry">
Peters, M. E., M. Neumann, M. Iyyer, et al. 2018. <span>“Deep Contextualized Word Representations.”</span> <em>CoRR</em> abs/1802.05365. <a href="http://arxiv.org/abs/1802.05365">http://arxiv.org/abs/1802.05365</a>.
</div>
<div id="ref-radford2018improving" class="csl-entry">
Radford, A., K. Narasimhan, T. Salimans, and I. Sutskever. 2018. <em>Improving Language Understanding by Generative Pre-Training</em>. <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf</a>.
</div>
<div id="ref-NIPS2017_3f5ee243" class="csl-entry">
Vaswani, A., N. Shazeer, N. Parmar, et al. 2017. <span>“Attention Is All You Need.”</span> In <em>Advances in Neural Information Processing Systems</em>, edited by I. Guyon, U. Von Luxburg, S. Bengio, et al., vol. 30. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf</a>.
</div>
<div id="ref-NEURIPS2019_dc6a7e65" class="csl-entry">
Yang, Z., Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. 2019. <span>“XLNet: Generalized Autoregressive Pretraining for Language Understanding.”</span> In <em>Advances in Neural Information Processing Systems</em>, <span class="nocase">edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett</span>, vol. 32. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper_files/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf">https://proceedings.neurips.cc/paper_files/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf</a>.
</div>
</div></section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/05-third-place-winners.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/05-lm.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: Second place winners – DeepFFTLink</title>
  <dc:creator>Yang Wu, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/04-second-place-winners.html</link>
  <description><![CDATA[ 





<p>DeepFFTLink team members: Yang Wu and Kai Zhang are PhD students at Worcester Polytechnic Institute. Aishwarya Budhkar is a PhD student at Indiana University Bloomington. Xuhong Zhang is an assistant professor at Indiana University Bloomington. Xiaozhong Liu is an associate professor at Worcester Polytechnic Institute.</p>
<section id="perspective-on-the-challenge" class="level2">
<h2 class="anchored" data-anchor-id="perspective-on-the-challenge">Perspective on the challenge</h2>
<p>Text matching is an essential task in natural language processing <span class="citation" data-cites="DBLP:journals/corr/PangLGXWC16">(NLP, Pang et al. 2016)</span>, while record linkage across different sources is an essential task in data science. Machine learning techniques allow people to combine data faster and more cheaply than manual linkage. However, in the context of the Food for Thought challenge, existing methods for matching universal product codes (UPCs) to ensemble codes (ECs) require every UPC to be compared with every EC (Figure 1a). Such approaches can be computationally expensive to train, particularly when the data is noisy. Here, we propose an ensemble model with a category-based adapter to tackle this problem, drawing on the category information included in the UPC and EC data. The category-based adapter first matches each UPC against only a small, reliable set of candidate ECs (Figure 1b); an ensemble model then makes predictions for UPC-EC matching. Our proposed approach achieves competitive performance compared with state-of-the-art models.</p>
<div class="quarto-layout-panel" data-layout-ncol="2" style="padding-top: 1em; margin-bottom: 0;">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/pt4-fig1a.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt4-fig1a.png" class="img-fluid figure-img"></a></p>
<figcaption>(a)</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/pt4-fig1b.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt4-fig1b.png" class="img-fluid figure-img"></a></p>
<figcaption>(b)</figcaption>
</figure>
</div>
</div>
</div>
</div>
<div class="figure-caption">
<p><strong>Figure 1:</strong> A toy example of our method. Panel (a) shows the traditional matching method, while (b) is our proposed ensemble model with category-based adapter. With the help of the adapter, UPC 1 only needs to be matched with EC 1 and EC 3.</p>
</div>
</section>
<section id="our-approach" class="level2">
<h2 class="anchored" data-anchor-id="our-approach">Our approach</h2>
<p>We propose a two-step framework to address this problem. To begin with, we use a category-based adapter to get reliable candidate ECs for each UPC. Then, an ensemble model <span class="citation" data-cites="10.1007/3-540-45014-9_1">(Dietterich 2000)</span> is deployed to make a prediction for each UPC-EC pair.</p>
<section id="category-based-adapter" class="level5">
<h5 class="anchored" data-anchor-id="category-based-adapter">Category-based adapter</h5>
<p>Using the 2015–2016 UPC-EC data, we created a knowledge base: a pairwise table of UPC categories and ECs, used to generate candidate ECs. Within this setting, each UPC category is, on average, related to only 32 ECs. This knowledge base is then used as context to further filter the candidate ECs. Note that new ECs are introduced each year; these must also be treated as potential matches in the UPC-EC matching task, even though contextual information for them does not yet exist in our knowledge base.</p>
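<p>As a minimal sketch (with hypothetical category names, not the team’s actual code), the adapter reduces to a lookup table from UPC category to the ECs historically linked with it, with newly introduced ECs appended to every candidate set:</p>

```python
from collections import defaultdict

def build_knowledge_base(linked_pairs):
    """linked_pairs: iterable of (upc_category, ec_code) from historical data."""
    kb = defaultdict(set)
    for category, ec in linked_pairs:
        kb[category].add(ec)
    return kb

def candidate_ecs(kb, upc_category, new_ecs=()):
    """Candidate ECs for a UPC: ECs seen with its category, plus any new ECs."""
    return kb.get(upc_category, set()) | set(new_ecs)

# Toy data mirroring Figure 1: UPC 1 ("yogurt") only needs EC1 and EC3.
kb = build_knowledge_base([("yogurt", "EC1"), ("yogurt", "EC3"), ("bread", "EC2")])
print(sorted(candidate_ecs(kb, "yogurt")))  # ['EC1', 'EC3']
```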
</section>
<section id="ensembled-model" class="level5">
<h5 class="anchored" data-anchor-id="ensembled-model">Ensemble model</h5>
<p>We ensemble the base-string match and BERT models. BERT is a deep learning model for natural language processing <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(Devlin et al. 2018)</span>. In the base-string match model, we use the term frequency–inverse document frequency (TF-IDF) of each UPC and EC description as features to compute pairwise cosine similarity, a measure of how close two instances are. In parallel, we use features extracted from the UPC and EC descriptions to fine-tune the BERT base model and compute the cosine similarity between each UPC and EC embedding. We then rank the ECs for each UPC by their similarity scores.</p>
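<p>The base-string matcher can be sketched roughly as follows with scikit-learn, using hypothetical descriptions; this illustrates the technique rather than reproducing the team’s code:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

upc_descriptions = ["greek yogurt plain nonfat"]
ec_descriptions = ["yogurt, greek, plain, nonfat", "bread, whole wheat", "milk, whole"]

# Fit a shared vocabulary over both sides, then vectorize each separately.
vectorizer = TfidfVectorizer().fit(upc_descriptions + ec_descriptions)
upc_vecs = vectorizer.transform(upc_descriptions)
ec_vecs = vectorizer.transform(ec_descriptions)

scores = cosine_similarity(upc_vecs, ec_vecs)  # shape (n_upc, n_ec)
ranked = scores[0].argsort()[::-1]             # EC indices, best match first
print(ranked[0])  # 0: the first EC description is the closest match
```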
<p><a href="images/pt4-fig2.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt4-fig2.png" class="img-fluid"></a></p>
<div class="figure-caption">
<p><strong>Figure 2:</strong> The framework of our proposed model. A two-step strategy is used to make the final prediction.</p>
</div>
<div class="callout callout-style-simple callout-note" style="margin-top: 2rem;">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>Find the code in the <a href="https://github.com/realworlddatascience/realworlddatascience.github.io/tree/main/case-studies/posts/2023/08/21/_code">Real World Data Science GitHub repository</a>.</p>
</div>
</div>
</div>
</section>
</section>
<section id="our-results" class="level2">
<h2 class="anchored" data-anchor-id="our-results">Our results</h2>
<p>We randomly selected 500 samples from the 2017–2018 UPC-EC data to learn the ensemble weight for each model. Two fusion functions were evaluated for combining the base-string and BERT models:</p>
<p><span id="eq-first"><img src="https://latex.codecogs.com/png.latex?%0AC%20=%20a%20*%20X%20+%20b%20*%20Y%20%20%0A%5Ctag%7B1%7D"></span></p>
<p><span id="eq-second"><img src="https://latex.codecogs.com/png.latex?%0AC%20=%20%20a%20*%20log(X)%20+%20b%20*%20log(Y)%20%5Ctext%7B.%20%7D%0A%5Ctag%7B2%7D"></span></p>
<p><img src="https://latex.codecogs.com/png.latex?C"> denotes the final confidence score. <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y"> represent <em>base_string_similarity_score</em> and <em>BERT_similarity_score</em>, respectively. <img src="https://latex.codecogs.com/png.latex?a"> and <img src="https://latex.codecogs.com/png.latex?b"> are the corresponding model weights for the base_string and BERT models.</p>
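<p>Function (1) is straightforward to apply in code. The sketch below uses the learned weights reported in the results (0.738 and 0.262) with hypothetical similarity scores over three candidate ECs:</p>

```python
import numpy as np

def fuse(base_scores, bert_scores, a=0.738, b=0.262):
    """Function (1): C = a * X + b * Y, applied element-wise over candidate ECs."""
    return a * np.asarray(base_scores) + b * np.asarray(bert_scores)

base = [0.9, 0.2, 0.4]   # base-string cosine similarities (hypothetical)
bert = [0.6, 0.8, 0.3]   # BERT cosine similarities (hypothetical)
confidence = fuse(base, bert)
print(int(confidence.argmax()))  # 0: the first EC has the highest fused confidence
```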
<p>A better Success@5 is achieved with function (1). The learned weights for the base-string and BERT models are 0.738 and 0.262, respectively, indicating that the base-string model contributes more to the ensemble’s predictions than the BERT model. The prediction results for the 2017–2018 data are:</p>
<ul>
<li>Success@5: 0.727</li>
<li>NDCG@5: 0.528</li>
</ul>
<p>Computation time is 6 hours.</p>
</section>
<section id="future-work" class="level2">
<h2 class="anchored" data-anchor-id="future-work">Future work</h2>
<p>Our next step will focus on adding newly generated EC data to our knowledge base, which should make the model’s UPC-EC predictions more stable. Our model is an unsupervised method: it ranks matches by cosine similarity, so no instance labels are needed during training. In future work, however, we will try labelling some instances to tackle the UPC-EC matching task in a supervised manner.</p>
</section>
<section id="lessons-learned" class="level2">
<h2 class="anchored" data-anchor-id="lessons-learned">Lessons learned</h2>
<ol type="1">
<li><p><strong>If the data is not complex, simple models may outperform complex models.</strong> For example, in our experiment, we found that the base-string model outperforms single RoBERTa <span class="citation" data-cites="DBLP:journals/corr/abs-1907-11692">(Liu et al. 2019)</span> or BERT models. However, our ensemble model can outperform each individual model since model fusion allows information aggregation from multiple models.</p></li>
<li><p><strong>Multi-label models may not work well on UPC-EC data.</strong> In our early work, we treated the UPC-EC matching task as a multi-label problem: we gave each EC a binary label indicating whether it was an appropriate match, and mapped UPC and EC pairs into a multi-label table. However, we found that UPCs and ECs maintain a one-to-one relationship for most UPCs. The performance of a multi-label model, i.e., the Label-Specific Attention Network <span class="citation" data-cites="xiao-etal-2019-label">(LSAN, Xiao et al. 2019)</span>, is lower than that of the base-string model on both the Success@5 and NDCG@5 metrics.</p></li>
</ol>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/03-first-place-winners.html">← Part 3: First place winners</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/05-third-place-winners.html">Part 5: Third place winners →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Yang Wu</strong> and <strong>Kai Zhang</strong> are PhD students, and <strong>Xiaozhong Liu</strong> is an associate professor at Worcester Polytechnic Institute. <strong>Aishwarya Budhkar</strong> is a PhD student and <strong>Xuhong Zhang</strong> is an assistant professor at Indiana University Bloomington.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Yang Wu, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@hansonluu?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Hanson Lu</a> on <a href="https://unsplash.com/photos/sq5P00L7lXc?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Wu, Yang, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu. 2023. “Food for Thought: Second place winners – DeepFFTLink.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/04-second-place-winners.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-DBLP:journals/corr/abs-1810-04805" class="csl-entry">
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. <span>“<span>BERT:</span> Pre-Training of Deep Bidirectional Transformers for Language Understanding.”</span> <em>CoRR</em> abs/1810.04805. <a href="http://arxiv.org/abs/1810.04805">http://arxiv.org/abs/1810.04805</a>.
</div>
<div id="ref-10.1007/3-540-45014-9_1" class="csl-entry">
Dietterich, T. G. 2000. <span>“Ensemble Methods in Machine Learning.”</span> <em>Multiple Classifier Systems</em> (Berlin, Heidelberg), 1–15.
</div>
<div id="ref-DBLP:journals/corr/abs-1907-11692" class="csl-entry">
Liu, Y., M. Ott, N. Goyal, et al. 2019. <span>“RoBERTa: <span>A</span> Robustly Optimized <span>BERT</span> Pretraining Approach.”</span> <em>CoRR</em> abs/1907.11692. <a href="http://arxiv.org/abs/1907.11692">http://arxiv.org/abs/1907.11692</a>.
</div>
<div id="ref-DBLP:journals/corr/PangLGXWC16" class="csl-entry">
Pang, L., Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng. 2016. <span>“Text Matching as Image Recognition.”</span> <em>CoRR</em> abs/1602.06359. <a href="http://arxiv.org/abs/1602.06359">http://arxiv.org/abs/1602.06359</a>.
</div>
<div id="ref-xiao-etal-2019-label" class="csl-entry">
Xiao, L., X. Huang, B. Chen, and L. Jing. 2019. <span>“Label-Specific Document Representation for Multi-Label Text Classification.”</span> <em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</em> (Hong Kong, China), November, 466–75. <a href="https://doi.org/10.18653/v1/D19-1044">https://doi.org/10.18653/v1/D19-1044</a>.
</div>
</div></section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/04-second-place-winners.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/04-deepfftlink.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: First place winners – Auburn Big Data</title>
  <dc:creator>Alex Knipper, Naman Bansal, Jingyi Zheng, Wenying Li, and Shubhra Kanti Karmaker</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/03-first-place-winners.html</link>
  <description><![CDATA[ 





<p>The Auburn Big Data team from Auburn University consists of five members, including three assistant professors: Dr Wenying Li of the Department of Agricultural Economics and Rural Sociology, Dr Jingyi Zheng of the Department of Mathematics and Statistics, and Dr Shubhra Kanti Karmaker of the Department of Computer Science and Software Engineering. Additionally, the team comprises two PhD students, Naman Bansal and Alex Knipper, who are affiliated with Dr Karmaker’s big data lab at Auburn University.</p>
<p>We estimate that our team spent approximately 1,400 hours on this project.</p>
<section id="our-perspective-on-the-challenge" class="level2">
<h2 class="anchored" data-anchor-id="our-perspective-on-the-challenge">Our perspective on the challenge</h2>
<p>At the start of this competition, we decided to test three general approaches, in the order listed:</p>
<ol type="1">
<li><p>A heuristic approach, where we use only the data and a defined similarity metric to predict which FNDDS label a given IRI item should have.</p></li>
<li><p>A simpler modeling approach, where we train a simple statistical classifier, such as a random forest <span class="citation" data-cites="10.1007/978-3-030-03146-6_86">(Parmar et al. 2019)</span> or logistic regression, to predict the FNDDS label for a given IRI item. We opted for a random forest as our statistical model: it is a simple baseline that has shown decent performance across a wide range of classification tasks. As it turned out, this approach was quite robust and accurate, so we kept it as our main model.</p></li>
<li><p>A large language modeling approach, where we train a model like BERT <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(Devlin et al. 2018)</span> to map the descriptions for given IRI and FNDDS items to the FNDDS category the supplied IRI item belongs to.</p></li>
</ol>
</section>
<section id="our-approach" class="level2">
<h2 class="anchored" data-anchor-id="our-approach">Our approach</h2>
<p>As we explored the data provided, we opted to use the given 2017–2018 PPC dataset as our primary dataset for both training and testing. To ensure a fair evaluation of the model, we randomly split the dataset into 60% training samples and 40% testing samples, making sure our training process never sees the testing dataset. For evaluating our models, we adopted the competition’s metrics: Success@5 and NDCG@5. After months of testing, our statistical classifier (approach #2) proved itself to be the model that both processes the data fastest and achieves the highest performance on our testing metrics.</p>
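<p>For reference, the two competition metrics can be sketched as follows, under the assumption of a single correct FNDDS code per IRI item (hypothetical ranked codes):</p>

```python
import math

def success_at_k(ranked_codes, true_code, k=5):
    """1 if the true code appears in the top-k predictions, else 0."""
    return int(true_code in ranked_codes[:k])

def ndcg_at_k(ranked_codes, true_code, k=5):
    """With one relevant item, DCG = 1/log2(rank + 1) and the ideal DCG is 1."""
    for rank, code in enumerate(ranked_codes[:k], start=1):
        if code == true_code:
            return 1.0 / math.log2(rank + 1)
    return 0.0

preds = ["F2", "F1", "F3", "F4", "F5"]   # hypothetical ranked FNDDS codes
print(success_at_k(preds, "F1"))         # 1
print(round(ndcg_at_k(preds, "F1"), 3))  # 0.631 (true code at rank 2)
```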
<p>This approach, at a high level, takes in the provided data (among other configuration parameters), formats the data in a computer-readable format – converting the IRI and FNDDS descriptions to a numerical representation with word embeddings <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805 mikolov2013efficient pennington-etal-2014-glove">(2018; Mikolov et al. 2013; Pennington et al. 2014)</span> and then using that numerical representation to calculate the distances between each description – and then trains a classification model (random forest <span class="citation" data-cites="10.1007/978-3-030-03146-6_86">(2019)</span>/neural network <span class="citation" data-cites="SCHMIDHUBER201585">(Schmidhuber 2015)</span>) that can predict an FNDDS label for a given IRI item.</p>
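<p>A condensed sketch of this pipeline, under the simplifying assumption that the features have already been reduced to a numeric matrix (random stand-in data, not the real features):</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))    # stand-in numeric feature matrix
y = rng.integers(0, 6, size=200)  # stand-in FNDDS label ids

# 60/40 split, mirroring the evaluation setup described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
preds = model.predict(X_te)  # one predicted FNDDS label per test item
print(preds.shape)  # (80,)
```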
<p>In terms of data, our approach uses the FNDDS/IRI descriptions, combining them into a single “description” field, and the IRI item’s categorical items – department, aisle, category, product, brand, manufacturer, and parent company – to further discern between items.</p>
<p>While most industrial methods require use of a graphics processing unit (graphics card, or GPU) to perform this kind of processing, our primary method only requires the computer’s internal processor (CPU) to function properly. With that in mind, to achieve the best possible performance on our test metrics, the most time-consuming operations are run in parallel. The time taken to train our primary model can likely be further improved if we parallelize these operations across a GPU, with the only downside being the imposition of a GPU requirement for systems aiming to run this method.</p>
<p>In addition to our primary method, our team has explored alternate approaches on the GPU (using BERT <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(2018)</span>, neural networks <span class="citation" data-cites="SCHMIDHUBER201585">(2015)</span>, etc.) to either: 1) speed up data processing and inference while achieving similar performance on our test metrics, or 2) achieve higher performance, likely at some cost in processing time. Our reasoning is that if a simple statistical model performs well, a larger language model should achieve higher performance on our test metrics without much increase in training time. So far, however, these methods have been unable to match the performance/efficiency tradeoff of our primary method.</p>
<p>After exploring alternate methods to no avail, our team decided to refocus on our primary method, the random forest <span class="citation" data-cites="10.1007/978-3-030-03146-6_86">(2019)</span>, and a secondary method, a feed-forward neural network mapping our input features (X) to the FNDDS labels (Y) <span class="citation" data-cites="SCHMIDHUBER201585">(2015)</span>, and to optimize their training hyperparameters for the dataset. Our aim is to see which of our already-implemented, easier-to-run downstream methods best balances performance and efficiency once its training hyperparameters are fully optimized. This has resulted in a marginal increase in training time (+20–30 minutes) and a roughly 5% increase in performance for our still-highest-performing model, the random forest.</p>
<p>Overall, our primary method – the random forest – gave us an approximate training time (including data pre-processing) of 4 hours 30 minutes for our ~38,000 IRI item training set, and an approximate inference time of 15 minutes on our testing set of ~15,000 IRI items. Furthermore, our method gave us a Success@5 score of 0.789 and an NDCG@5 score of 0.705 on our testing set.</p>
<section id="key-features" class="level5">
<h5 class="anchored" data-anchor-id="key-features">Key features</h5>
<p>Here is a list of the key features we utilize, along with the type of data we treat each one as:</p>
<ul>
<li>FNDDS
<ul>
<li>food_code – identifier</li>
<li>main_food_description – text</li>
<li>additional_food_description – text</li>
<li>ingredient_description – text</li>
</ul></li>
<li>IRI
<ul>
<li>upc – identifier</li>
<li>upcdesc – text</li>
<li>dept – categorical</li>
<li>aisle – categorical</li>
<li>category – categorical</li>
<li>product – categorical</li>
<li>brand – categorical</li>
<li>manufacturer – categorical</li>
<li>parent – categorical</li>
</ul></li>
</ul>
<p>The intuition behind using these particular features is that the text-based descriptions provide the majority of the “meaning” of the item. By converting each description to a numerical representation <span class="citation" data-cites="mikolov2013efficient pennington-etal-2014-glove">(2013; 2014)</span>, we can then calculate the similarity between each “meaning” to determine which FNDDS label is most similar to the IRI item provided. However, that alone is not enough. The categorical features on the IRI item help to further enhance the model’s classifications using the logic and categories people use in places like grocery stores. For example, if given an item whose aisle was “fruit” and brand was “Dole”, the item could be reasonably expected to be something like “peaches” over something like “broccoli”.</p>
</section>
<section id="feature-selection" class="level5">
<h5 class="anchored" data-anchor-id="feature-selection">Feature selection</h5>
<p>Aforementioned intuition aside, our feature selection was rather naive: we manually examined the data and removed any redundant text features before doing anything else. We then used the description fields as “text” data to capture the main “meaning” of each item, represented numerically by converting the text with a word embedding <span class="citation" data-cites="mikolov2013efficient pennington-etal-2014-glove">(2013; 2014)</span>. We also used the non-description fields (aisle, category, etc.) as “categorical” data, each turned into its own numerical representation, allowing our model to discern between items using categorization schemes similar to those people use.</p>
</section>
<section id="feature-transformations" class="level5">
<h5 class="anchored" data-anchor-id="feature-transformations">Feature transformations</h5>
<p>Our feature transformations are also relatively simple. First, we combine all description fields for each item to make one large description, and then use a word embedding method (like GloVe <span class="citation" data-cites="pennington-etal-2014-glove">(2014)</span> or BERT <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(2018)</span>) to convert the description into a numerical representation, resulting in a 300-dimensional GloVe or 768-dimensional BERT vector of numbers for each description. Then, for each IRI item, we calculate the cosine and Euclidean distances from each FNDDS item, resulting in two vectors, both equal in length to the original FNDDS data (in this case, two vectors of length ~7,300). The intuition behind this is that while cosine and Euclidean distances can tell us similar things, providing both of these sets of distances to the model should allow it to pick up on a more nuanced set of relationships between the IRI and FNDDS items.</p>
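<p>A minimal sketch of the distance features, using random stand-in arrays in place of real GloVe embeddings: for one IRI embedding, we compute cosine and Euclidean distances to every FNDDS embedding, yielding two feature vectors whose combined length is twice the size of the FNDDS data.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
iri_vec = rng.normal(size=300)             # one IRI description embedding
fndds_vecs = rng.normal(size=(7300, 300))  # all FNDDS description embeddings

# Cosine distance 1 - (u.v)/(|u||v|) against every FNDDS vector at once.
norms = np.linalg.norm(fndds_vecs, axis=1) * np.linalg.norm(iri_vec)
cos_dists = 1.0 - (fndds_vecs @ iri_vec) / norms
# Euclidean distance to every FNDDS vector.
euc_dists = np.linalg.norm(fndds_vecs - iri_vec, axis=1)

features = np.concatenate([cos_dists, euc_dists])  # 2 * 7300 distance features
print(features.shape)  # (14600,)
```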
<p>For categorical data, we take all unique values in each field and assign them an ID number. While that is often not the best practice for making a numerical representation out of categorical data <span class="citation" data-cites="10.5120/ijca2017915495">(Potdar et al. 2017)</span>, it seemed to work for the downstream model.</p>
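<p>This integer-ID encoding amounts to simple label encoding; a minimal sketch with hypothetical aisle values:</p>

```python
def encode_categorical(values):
    """Assign each unique value an integer ID, in order of first appearance."""
    ids = {}
    for v in values:
        ids.setdefault(v, len(ids))
    return [ids[v] for v in values], ids

codes, mapping = encode_categorical(["fruit", "dairy", "fruit", "bakery"])
print(codes)  # [0, 1, 0, 2]
```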
<p>Altogether, these feature transformations give us roughly 14,900 features when we use GloVe and roughly 15,300 features when we use BERT. Either feature set can then be sent to the downstream random forest/neural network to start classifying items.</p>
<p>It should be noted that processing the data is by far the most time-consuming part of our method. The data processing times for each embedding are as follows:</p>
<ul>
<li>GloVe: ~3 hours</li>
<li>BERT: ~6 hours</li>
</ul>
<p>Because BERT both takes longer to process the data and performs worse than our GloVe embeddings on the classification task, we opted to use GloVe embeddings for our primary method. Our best theoretical explanation is that since BERT is better at context-dependent tasks <span class="citation" data-cites="10.1145/3443279.3443304">(Wang et al. 2021)</span>, it likely expects input resembling well-structured sentences, which the IRI/FNDDS descriptions are not. GloVe, by contrast, depends less on context <span class="citation" data-cites="mikolov2013efficient pennington-etal-2014-glove">(2013; 2014)</span> and so should perform better when the input text is not a well-formed sentence.</p>
</section>
<section id="training-methods" class="level5">
<h5 class="anchored" data-anchor-id="training-methods">Training methods</h5>
<p>Once the data has been processed, we collect the following data for each IRI item:</p>
<ul>
<li>UPC code</li>
<li>Description (converted to numerical representation)</li>
<li>Categorical variables (converted to numerical representation)</li>
<li>Distances to each FNDDS item</li>
</ul>
<p>With that collected for each IRI item, we can finally use our classification model. We initialize the model and train it on the IRI data described above, together with the target FNDDS label for each item, so the model knows the “correct” answer for the given data. Once training is complete, we save the model and it is ready for use.</p>
<p>This part of training takes much less time than preparing the data, since calculating the embeddings requires far more computation than training a random forest model. The training times for each method are as follows:</p>
<ul>
<li>Random Forest: ~1 hour 15 minutes</li>
<li>Neural Network: ~25 minutes</li>
</ul>
<p>Despite the neural network taking far less time to train than the random forest, it still scores lower on the evaluation metrics, so we opt to continue using the random forest model as our primary method.</p>
</section>
<section id="general-approach-to-developing-the-model" class="level5">
<h5 class="anchored" data-anchor-id="general-approach-to-developing-the-model">General approach to developing the model</h5>
<p>Since the linkage problem involves mapping tens of thousands of items to a smaller category set of a few thousand items, we decided to frame this problem as a multi-class classification problem <span class="citation" data-cites="aly2005survey">(Aly 2005)</span>, where we then rank the top “k” most probable class mappings, as requested by the competition ruleset.</p>
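<p>The top-“k” ranking step can be sketched from any probabilistic classifier’s output; the labels and probability row below are hypothetical:</p>

```python
import numpy as np

def top_k_classes(proba_row, classes, k=5):
    """Return the k class labels with the highest predicted probability."""
    order = np.argsort(proba_row)[::-1][:k]
    return [classes[i] for i in order]

classes = ["F100", "F200", "F300", "F400"]
proba = np.array([0.1, 0.5, 0.3, 0.1])
print(top_k_classes(proba, classes, k=3))  # ['F200', 'F300', 'F400']
```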
<p>Most of the usable data available to us is text, so we need a method that can exploit that text-based information to map classes accurately. To accomplish this, we use word embedding techniques to compute an average numerical representation of each text description (both IRI and FNDDS), so we can calculate distances between descriptions, giving our model a sense of how similar each pair of descriptions is.</p>
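<p>The averaging step can be sketched as follows; the toy two-dimensional vectors stand in for pretrained GloVe embeddings, and the descriptions are made up for illustration:</p>

```python
import numpy as np

# Toy embedding table standing in for pretrained GloVe vectors.
emb = {"broccoli": np.array([1.0, 0.0]),
       "raw":      np.array([0.8, 0.2]),
       "frozen":   np.array([0.1, 0.9])}

def embed(description):
    """Average the vectors of every in-vocabulary token in a description."""
    vecs = [emb[w] for w in description.lower().split() if w in emb]
    return np.mean(vecs, axis=0)

def cosine_distance(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

iri_vec = embed("broccoli raw")
fndds_vec = embed("broccoli frozen")
d = cosine_distance(iri_vec, fndds_vec)   # small distance = similar descriptions
```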
</section>
<section id="the-key-trick-to-the-model" class="level5">
<h5 class="anchored" data-anchor-id="the-key-trick-to-the-model">The key “trick” to the model</h5>
<p>Since text descriptions hold the most information that can be used to link between an IRI item and an FNDDS item, finding a way to calculate the similarity between each description is paramount to making this method work.</p>
<p>Both distance calculation methods used in this work, cosine and Euclidean distance, are very similar in the type of information encoded, the only major difference being that cosine distance is implicitly normalized and Euclidean distance is not <span class="citation" data-cites="10.1145/967900.968151">(Qian et al. 2004)</span>.</p>
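<p>The “implicitly normalized” point can be verified directly: once vectors are unit-normalized, squared Euclidean distance equals exactly twice the cosine distance, so the two encode the same information up to scale. A quick numerical check (arbitrary random vectors, not the article’s data):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=3), rng.normal(size=3)

cos_dist = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After unit-normalizing, ||a_u - b_u||^2 = 2 - 2*cos(theta) = 2 * cosine distance.
a_u, b_u = a / np.linalg.norm(a), b / np.linalg.norm(b)
eucl_sq = np.sum((a_u - b_u) ** 2)

assert np.isclose(eucl_sq, 2 * cos_dist)
```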
</section>
<section id="notable-observations" class="level5">
<h5 class="anchored" data-anchor-id="notable-observations">Notable observations</h5>
<p>Just by building the ranking from the cosine similarities between each IRI item and all FNDDS items, we can achieve a Success@5 of 0.234 and an NDCG@5 of 0.312. Supplying the remaining features to the random forest classifier then adds extra discriminative power to the model.</p>
</section>
<section id="data-disclaimer" class="level5">
<h5 class="anchored" data-anchor-id="data-disclaimer">Data disclaimer</h5>
<p>Our current method only uses the data readily available from the 2017–2018 dataset, which we acknowledge is intended for testing. To remedy this, we further split this dataset into train/test sets and report our primary performance metrics on the unseen test subset. This gives a reasonable indication of how the model will perform on unseen data.</p>
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>Find the code in the <a href="https://github.com/realworlddatascience/realworlddatascience.github.io/tree/main/case-studies/posts/2023/08/21/_code">Real World Data Science GitHub repository</a>.</p>
</div>
</div>
</div>
</section>
</section>
<section id="our-results" class="level2">
<h2 class="anchored" data-anchor-id="our-results">Our results</h2>
<section id="approximate-training-time" class="level5">
<h5 class="anchored" data-anchor-id="approximate-training-time">Approximate training time</h5>
<p>Overall, the training time for our primary method is roughly 4 hours 30 minutes, broken down (approximately) as follows:</p>
<ol type="1">
<li>Reading data from database: 30 seconds</li>
<li>Calculating ~7,300 FNDDS description embeddings: 15 minutes 45 seconds</li>
<li>Calculating ~38,000 IRI description embeddings and similarity scores: 2 hours 20 minutes 45 seconds</li>
<li>Formatting calculated data for the random forest classifier: 35 minutes</li>
<li>Training the random forest classifier: 1 hour 15 minutes</li>
</ol>
</section>
<section id="approximate-inference-time" class="level5">
<h5 class="anchored" data-anchor-id="approximate-inference-time">Approximate inference time</h5>
<p>Our approximate inference time for our primary method is 15 minutes to make inferences for ~15,000 IRI items.</p>
</section>
<section id="s5-ndcg5-performance" class="level5">
<h5 class="anchored" data-anchor-id="s5-ndcg5-performance">S@5 &amp; NDCG@5 performance</h5>
<p>This is how our best-performing model (GloVe + random forest) currently performs on the test set:</p>
<ul>
<li>NDCG@5: 0.705</li>
<li>Success@5: 0.789</li>
</ul>
<p>When we evaluate that same model on the full PPC dataset we were provided (~38,000 items), we get the following scores:</p>
<ul>
<li>NDCG@5: 0.879</li>
<li>Success@5: 0.916</li>
</ul>
<p>(Note: The full PPC dataset contains approximately 15,000 items that we used to train the model, so these scores are not as representative of our method’s performance as the previous scores.)</p>
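<p>For readers unfamiliar with the two metrics, they can be sketched as follows for a single item whose ranking contains exactly one relevant FNDDS label (function names are ours, not the competition’s):</p>

```python
import numpy as np

def success_at_k(ranked, true_label, k=5):
    """1 if the true FNDDS label appears anywhere in the top-k ranking, else 0."""
    return int(true_label in ranked[:k])

def ndcg_at_k(ranked, true_label, k=5):
    """With a single relevant item, NDCG@k reduces to 1/log2(rank + 1)."""
    for i, label in enumerate(ranked[:k]):
        if label == true_label:
            return 1.0 / np.log2(i + 2)   # rank i+1 -> discount log2(i+2)
    return 0.0

# e.g. true label ranked 2nd: Success@5 = 1, NDCG@5 = 1/log2(3) ~ 0.63
```

<p>The reported scores are these per-item values averaged over the evaluation set.</p>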
</section>
</section>
<section id="future-workrefinement" class="level2">
<h2 class="anchored" data-anchor-id="future-workrefinement">Future work/refinement</h2>
<p>As mentioned previously, we only used the given 2017–2018 PPC dataset for both training and testing. Going forward, we would like to include datasets from previous years as well, which we believe would further improve model performance. Additionally, the datasets generated from this research have the potential to inform and support further studies from a variety of perspectives, including nutrition, consumer research, and public health, and could make significant contributions to our understanding of consumer behavior and the role of food and nutrient consumption in overall health and well-being.</p>
</section>
<section id="lessons-learned" class="level2">
<h2 class="anchored" data-anchor-id="lessons-learned">Lessons learned</h2>
<p>It was interesting that the random forest model outperformed the vanilla neural network model, showing that a simpler solution can work better, depending on the application. This is in line with the well-established principle in machine learning that the choice of model should be guided by the nature of the problem and the characteristics of the data: here, the random forest, being simpler and more interpretable, was better suited to the problem at hand and outperformed the more complex neural network. These results underscore the importance of careful model selection, weighing both the complexity of the model and the specific requirements of the problem when choosing an algorithm for a particular application.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/02-competition-design.html">← Part 2: Competition design</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/04-second-place-winners.html">Part 4: Second place winners →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Alex Knipper</strong> and <strong>Naman Bansal</strong> are PhD students, and <strong>Jingyi Zheng</strong>, <strong>Wenying Li</strong>, and <strong>Shubhra Kanti Karmaker</strong> are assistant professors at Auburn University.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Alex Knipper, Naman Bansal, Jingyi Zheng, Wenying Li, and Shubhra Kanti Karmaker
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@nicotitto?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">nrd</a> on <a href="https://unsplash.com/photos/D6Tu_L3chLE?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Knipper, Alex, Naman Bansal, Jingyi Zheng, Wenying Li, and Shubhra Kanti Karmaker. 2023. “Food for Thought: First place winners – Auburn Big Data.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/03-first-place-winners.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-aly2005survey" class="csl-entry">
Aly, M. 2005. <span>“Survey on Multiclass Classification Methods, Tech. Rep.”</span> <em>California Institute of Technology</em>.
</div>
<div id="ref-DBLP:journals/corr/abs-1810-04805" class="csl-entry">
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. <span>“<span>BERT:</span> Pre-Training of Deep Bidirectional Transformers for Language Understanding.”</span> <em>CoRR</em> abs/1810.04805. <a href="http://arxiv.org/abs/1810.04805">http://arxiv.org/abs/1810.04805</a>.
</div>
<div id="ref-mikolov2013efficient" class="csl-entry">
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. <em>Efficient Estimation of Word Representations in Vector Space</em>. <a href="https://arxiv.org/abs/1301.3781">https://arxiv.org/abs/1301.3781</a>.
</div>
<div id="ref-10.1007/978-3-030-03146-6_86" class="csl-entry">
Parmar, A., R. Katariya, and V. Patel. 2019. <span>“A Review on Random Forest: An Ensemble Classifier.”</span> In <em>International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI) 2018</em>, edited by J. Hemanth, X. Fernando, P. Lafata, and Z. Baig. Springer International Publishing.
</div>
<div id="ref-pennington-etal-2014-glove" class="csl-entry">
Pennington, J., R. Socher, and C. Manning. 2014. <span>“<span>G</span>lo<span>V</span>e: Global Vectors for Word Representation.”</span> <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (<span>EMNLP</span>)</em> (Doha, Qatar), October, 1532–43. <a href="https://doi.org/10.3115/v1/D14-1162">https://doi.org/10.3115/v1/D14-1162</a>.
</div>
<div id="ref-10.5120/ijca2017915495" class="csl-entry">
Potdar, K., T. S. Pardawala, and C. D. Pai. 2017. <span>“A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers.”</span> <em>International Journal of Computer Applications</em> (New York, USA) 175 (4): 7–9. <a href="https://doi.org/10.5120/ijca2017915495">https://doi.org/10.5120/ijca2017915495</a>.
</div>
<div id="ref-10.1145/967900.968151" class="csl-entry">
Qian, G., S. Sural, Y. Gu, and S. Pramanik. 2004. <span>“Similarity Between Euclidean and Cosine Angle Distance for Nearest Neighbor Queries.”</span> <em>Proceedings of the 2004 ACM Symposium on Applied Computing</em> (New York, NY, USA), SAC ’04, 1232–37. <a href="https://doi.org/10.1145/967900.968151">https://doi.org/10.1145/967900.968151</a>.
</div>
<div id="ref-SCHMIDHUBER201585" class="csl-entry">
Schmidhuber, J. 2015. <span>“Deep Learning in Neural Networks: An Overview.”</span> <em>Neural Networks</em> 61: 85–117. <a href="https://doi.org/10.1016/j.neunet.2014.09.003">https://doi.org/10.1016/j.neunet.2014.09.003</a>.
</div>
<div id="ref-10.1145/3443279.3443304" class="csl-entry">
Wang, C., P. Nulty, and D. Lillis. 2021. <span>“A Comparative Study on Word Embeddings in Deep Learning for Text Classification.”</span> <em>Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval</em> (New York, NY, USA), NLPIR ’20, 37–46. <a href="https://doi.org/10.1145/3443279.3443304">https://doi.org/10.1145/3443279.3443304</a>.
</div>
</div></section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/03-first-place-winners.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/03-auburn.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>The Food for Thought Challenge: Using AI to support evidence-based food and nutrition policy</title>
  <dc:creator>Brian Tarran and Julia Lane</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/00-food-for-thought.html</link>
  <description><![CDATA[ 





<p>There’s a saying: “You are what you eat.” Its meaning is somewhat open to interpretation, as with many such sayings, but it is typically used to make the point that if you want to <em>be</em> well, you need to eat well. Nutrition scientists and dieticians spend their careers trying to figure out what “eating well” looks like – the foods the human body needs, in what quantities, and how best to consume them. Their research informs advice and guidance issued by health professionals and governments. Ultimately, though, the choice of what to eat falls to us – individuals and families – and our choices are often determined by our tastes, the availability of foodstuffs in our local stores, their price and affordability.</p>
<p>So, what exactly <em>do</em> we eat? Answers come from a variety of sources. In the United States, there are dietary recall studies such as the <a href="https://www.cdc.gov/nchs/nhanes/index.htm">National Health and Nutrition Examination Survey</a>, which asks a sample of respondents to report their food and beverage consumption over a set period of time. There are also organisations like <a href="https://www.iriworldwide.com/en-gb">IRI</a> that collect point-of-sale data from retail stores on the actual food and drink being sold to consumers. By and large, this information comes from barcodes on product packaging being scanned at checkouts, so it is often referred to as “scanner data”.</p>
<p>This data – from dietary recall studies and retail scanners – is valuable: once we know what people are eating, we can check the nutritional content of those foods and build up a picture of what the diet of a typical individual or family looks like and how it compares to the diet recommended by doctors and policymakers. And, if we know what other foodstuffs are available, how much they cost, and the nutritional value of those items, we can work out how much families need to spend, and on what, in order to eat well and, hopefully, be well.</p>
<p>Figuring all this out is where something called the Purchase to Plate Crosswalk (PPC) comes in. It’s a key tool for understanding the “<a href="https://www.sciencedirect.com/science/article/pii/S0889157521005445">healthfulness of retail food purchases</a>” and it does this by linking IRI scanner data on what people buy with data on the nutritional content of those foods, as recorded in the US Department of Agriculture’s Food and Nutrient Database for Dietary Studies (FNDDS). But there’s a catch: scanner data is collected about hundreds of thousands of food products, whereas the FNDDS has nutritional profile information for only a few thousand items. Linking these two datasets therefore gives rise to a one-to-many matching problem – a problem that takes several hundred person-hours to resolve.</p>
<p>What if machine learning can help? That question inspired a competition, the Food for Thought Challenge, organized by the Coleridge Initiative, a nonprofit organization working with governments to ensure that data are more effectively used for public decision-making. Researchers and data scientists were invited to use machine learning and natural language processing to more efficiently link data on supermarket products to nutrient databases.</p>
<p>This collection of articles tells the story of the <a href="https://coleridgeinitiative.org/projects/food-for-thought">Food for Thought Challenge</a>. We begin by exploring the <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/01-purchase-to-plate.html">policy issues</a> that drive the development of the PPC – the need to understand the national diet, developing healthy diet plans, and costing up those plans – and the issues posed by record linkage. Next, we learn about <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/02-competition-design.html">the nature of the challenge and the structure of the competition in more detail</a>, and then the <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/03-first-place-winners.html">three</a> <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/04-second-place-winners.html">winning</a> <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/05-third-place-winners.html">teams</a> walk us through their solutions. We end the collection with some closing thoughts on <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/06-value-of-competitions.html">the value of competitions for addressing data scientific challenges in the public sector</a>.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/01-purchase-to-plate.html">Part 1: The Purchase to Plate Suite →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Brian Tarran</strong> is editor of Real World Data Science, and head of data science platform at the Royal Statistical Society.
</dd>
<dd>
<strong>Julia Lane</strong> is a professor at the NYU Wagner Graduate School of Public Service and a NYU Provostial Fellow for Innovation Analytics. She co-founded the Coleridge Initiative, whose goal is to use data to transform the way governments access and use data for the social good through training programs, research projects and a secure data facility. She recently served on the Advisory Committee on Data for Evidence Building and the National AI Research Resources Task Force.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Royal Statistical Society and Julia Lane
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@melaniesylim?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Melanie Lim</a> on <a href="https://unsplash.com/photos/246b6c6IeC0?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian, and Julia Lane. 2023. “The Food for Thought Challenge: Using AI to support evidence-based food and nutrition policy.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/00-food-for-thought.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



 ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/00-food-for-thought.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/00-shopping.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: The importance of the Purchase to Plate Suite</title>
  <dc:creator>Andrea Carlson and Thea Palmer Zimmerman</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/01-purchase-to-plate.html</link>
  <description><![CDATA[ 





<div class="callout callout-style-default callout-important callout-titled" style="margin-top: 0;">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>Disclaimer
</div>
</div>
<div class="callout-body-container callout-body">
<p>The findings and conclusions in this publication are those of the authors and should not be construed to represent any official USDA or US Government determination or policy. This research was supported by the US Department of Agriculture’s Economic Research Service and Center for Nutrition, Policy and Promotion. Findings should not be attributed to Circana (formerly IRI).</p>
</div>
</div>
<p>About 600,000 <a href="https://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm">deaths per year in the United States</a> are related to chronic diseases that are linked to poor dietary choices. Many other individuals suffer from diet-related health conditions, which may limit their ability to work, learn, and be physically active <span class="citation" data-cites="usda_2020">(US Department of Agriculture and US Department of Health and Human Services 2020)</span>. In recognition of the link between diet and health, in 1974 the Senate Select Committee on Nutrition and Human Needs, originally formed to eliminate hunger, expanded its focus to improving eating habits, nutrition policy and the national diet. Since 1980, the Dietary Guidelines for Americans have been released every five years by the US Departments of Agriculture (USDA) and Health and Human Services (DHHS). The guidelines present “<a href="https://www.dietaryguidelines.gov/">advice on what to eat and drink to meet nutrient needs, promote health, and prevent disease</a>”.</p>
<p>Because there can be economic and social barriers to maintaining a healthy diet, USDA promotes <a href="https://www.usda.gov/nutrition-security">Food and Nutrition Security</a> so that everyone has consistent and equitable access to healthy, safe, and affordable foods that promote optimal health and well-being. A set of data tools called the <a href="https://www.ers.usda.gov/data-products/purchase-to-plate/">Purchase to Plate Suite</a> (PPS) supports these goals by enabling the update of the <a href="https://www.fns.usda.gov/snap/thriftyfoodplan#:~:text=What%20is%20the%20Thrifty%20Food,lowest%20cost%20of%20the%20four.">Thrifty Food Plan</a> (TFP), which estimates how much a budget-conscious family of four needs to spend on groceries to ensure a healthy diet. The TFP market basket – consisting of the specific amounts of various food categories required by the plan – forms the basis of the maximum allotment for the Supplemental Nutrition Assistance Program (SNAP, formerly known as the “Food Stamps” program), which provided financial support towards the cost of groceries for <a href="https://www.fns.usda.gov/pd/supplemental-nutrition-assistance-program-snap">over 41 million individuals in almost 22 million households in fiscal year 2022</a>.</p>
<p>The 2018 Farm Act (Agriculture Improvement Act of 2018) requires that USDA reevaluate the TFP every five years using current food composition, consumption patterns, dietary guidance, and food prices, and using approved scientific methods. USDA’s Economic Research Service (ERS) was charged with estimating the current food prices using retail food scanner data <span class="citation" data-cites="levin_et_al_2018 muth_et_al_2016">(Levin et al. 2018; Muth et al. 2016)</span> and utilized the PPS for this task. The most recent TFP update was released in August 2021 and the revised cost of the market basket was the first non-inflation adjustment increase in benefits for SNAP in over 40 years <span class="citation" data-cites="thrifty_food_plan_2021">(US Department of Agriculture 2021)</span>.</p>
<p>The PPS combines datasets to enhance research related to the economics of food and nutrition. There are four primary components of the suite:</p>
<ul>
<li>Purchase to Plate Crosswalk (PPC),</li>
<li>Purchase to Plate Price Tool (PPPT),</li>
<li>Purchase to Plate National Average Prices (PP-NAP) for the National Health and Nutrition Examination Survey (NHANES), and</li>
<li>Purchase to Plate Ingredient Tool (PPIT).</li>
</ul>
<p>The PPC allows researchers to measure the healthfulness of store purchases. On average <a href="https://www.ers.usda.gov/data-products/foodaps-national-household-food-acquisition-and-purchase-survey/summary-findings/#calories">US consumers acquire about 75% of their calories from retail stores</a>, and there are a number of studies linking the availability of foods at home to the healthfulness of the overall diet <span class="citation" data-cites="gattshall_et_al_2008 hanson_et_al_2005">(e.g., Gattshall et al. 2008; Hanson et al. 2005)</span>. Thus, understanding the healthfulness of store purchases allows us to understand differences in consumers who purchase healthy versus less healthy foods, and may contribute to better policies that promote healthier food purchases. While healthier diets are linked to a lower risk of disease outcomes <span class="citation" data-cites="REEDY2014881">(Reedy et al. 2014)</span>, other factors such as health care access may also be contributors <span class="citation" data-cites="cleary_et_al_2022">(Cleary et al. 2022)</span>. The PPC also forms the basis of the price tool, PPPT – which allows researchers to estimate custom prices for dietary recall studies – and a new ERS data product, the <a href="https://www.ers.usda.gov/data-products/purchase-to-plate/">PP-NAP</a>. The national average prices from PP-NAP are used in reevaluating the TFP. By using the PP-NAP with 24-hour dietary recall information from surveys such as What We Eat in America (<a href="https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-human-nutrition-research-center/food-surveys-research-group/docs/wweianhanes-overview/">WWEIA</a>) – the dietary component of the nationally representative <a href="https://www.cdc.gov/nchs/nhanes/index.htm">National Health and Nutrition Examination Survey</a> (NHANES)<sup>1</sup> – researchers can examine the relationship between the cost of food, dietary intake, and chronic diseases linked to poor diets. 
The price estimates also allow researchers to develop cost-effective healthy diets such as <a href="https://www.myplate.gov/myplate-kitchen/recipes">MyPlate Kitchen</a>. The final component of the Purchase to Plate Suite, the ingredient tool (PPIT), breaks dietary recall-reported foods back into purchasable ingredients, based on US retail food purchases. The PPIT is also used in the revaluation of the TFP, and by researchers who want to look at the relationship between reported ingestion of grocery items, cost and disease outcomes using WWEIA/NHANES. More information on the development of the PPC is available in two papers by Carlson et al. <span class="citation" data-cites="carlson_et_al_2019 carlson_et_al_2022">(2019, 2022)</span>.</p>
<p>The Food for Thought competition aimed to support the development of the PPC – and thus policy-oriented research – by linking retail food scanner data to the USDA nutrition data used to analyze NHANES dietary recall data, specifically the Food and Nutrient Database for Dietary Studies (FNDDS) <span class="citation" data-cites="fndds_2018 fndds_2020">(2018, 2020)</span>. In particular, the competition set out to use artificial intelligence (AI) to reduce the human effort involved in creating the links for the PPC, while still maintaining the high-quality standards required for reevaluating the TFP and for data published by ERS (which is one of 13 Principal Statistical Agencies in the United States Federal Government).</p>
<section id="methods-used-to-date" class="level2">
<h2 class="anchored" data-anchor-id="methods-used-to-date">Methods used to date</h2>
<p>On the surface, the linking process may appear simple: both the FNDDS and retail food scanner data are databases of food. But the scanner data are produced for market research, and the FNDDS for dietary studies. The scanner data include about 350,000 items with sales each year, while the FNDDS has only 10,000–15,000 items. Scanner data relates to specific products, while FNDDS items are often more general. Both datasets have different hierarchical structures – the FNDDS hierarchy is based around major food groups: dairy; meat, poultry and seafood; eggs; nuts and legumes; grains; fruits; vegetables; fats and oils; and sugars, sweets, and beverages. Items fall into the groups regardless of preparation method or form. That is, broccoli prepared from frozen and from fresh both appear in the vegetable group, and for some fruits and vegetables, the fresh, frozen, canned and dried forms are the same FNDDS item. Vegetable-based mixed dishes, such as broccoli and carrot stir-fry or soup, are also classified in the vegetable group. On the other hand, the scanner data classifies foods by grocery aisle. That is, fresh and frozen broccoli are classified in different areas: produce and frozen vegetables. Similarly, when sold as a prepared food, the broccoli and carrot stir-fry may be found in the frozen entrées, as a kit in either the frozen or produce section, refrigerated foods, or all of these.</p>
<p>To allow researchers to import the FNDDS nutrient data into the scanner data, a one-to-many match between FNDDS and scanner data items was needed. The food descriptions in the scanner data include brand names and package sizes and are written as a consumer would pronounce them – e.g., fresh and crisp broccoli florets, ready-cut, 10 oz – versus a more general FNDDS description such as “Broccoli, raw”. (Also linked to the “Broccoli, raw” code would be broccoli sold with stems attached, broccoli spears, and any other way raw broccoli is sold.) In the scanner data, the Universal Product Code (UPC) and the European Article Number (EAN) can link items between tables within the scanner data, as well as between datasets of grocery items, such as the USDA Global Branded Foods Product Database, a component of <a href="https://fdc.nal.usda.gov/index.html">USDA’s Food Data Central</a>. However, these codes are not related to the FNDDS codes, or any other column within the FNDDS. In other words, before development of the PPC, there were no established linking identifiers.</p>
<p>Figure 1 shows the process USDA uses to develop matches between scanner data and FNDDS.</p>
<p><a href="images/pt1-fig1.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt1-fig1.png" class="img-fluid" width="700"></a></p>
<div class="figure-caption">
<p><strong>Figure 1:</strong> Process currently used to create the matches between the USDA Food and Nutrient Database for Dietary Studies (FNDDS) and the retail scanner data (labelled “IRI” for the IRI InfoScan and Consumer Network) product dictionaries. Source: Author provided.</p>
</div>
<p>We start the linking process by categorizing the scanner data items into homogeneous groups to make the first round of automated matching more efficient. To save time, we use the second-lowest hierarchical category in the scanner data, which generally divides items within a grocery aisle into homogeneous groups such as produce, canned beans, baking mixes, and bread. Once the linking categories for the scanner data are established, we select appropriate items from the FNDDS. Since the FNDDS is highly structured, this selection is usually straightforward.</p>
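<p>As a minimal illustration (not USDA’s actual tooling), this grouping step amounts to bucketing items by a chosen hierarchy field. The record structure and function names below are hypothetical:</p>

```python
from collections import defaultdict

def group_by_category(items, category_of):
    """Bucket scanner-data items into homogeneous linking groups,
    keyed by a chosen hierarchy category (e.g., produce, bread)."""
    groups = defaultdict(list)
    for item in items:
        groups[category_of(item)].append(item)
    return dict(groups)

# Hypothetical (description, category) records standing in for the
# scanner data's second-lowest hierarchy level.
items = [("fresh broccoli florets", "produce"),
         ("whole wheat bread", "bread"),
         ("russet potatoes", "produce")]
groups = group_by_category(items, lambda it: it[1])
```

<p>Each resulting group can then be matched against the corresponding slice of the FNDDS.</p>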
<p>Our next step is to use semantic matching to create a search table that aligns similar terms in the IRI product dictionary and the FNDDS. This first requires that we extract attributes from the FNDDS descriptions into fields similar to those in the scanner data product dictionary. The FNDDS descriptions are spread across multiple columns because entries are added as the need arises to provide brand-name examples or alternative descriptions of foods, which help code the foods WWEIA participants report eating. We manually create matching tables that link terms used in the FNDDS to those used in the scanner data, organized by the fields defined in the restructured FNDDS. We then use these tables as the basis of a probabilistic matching process. For example, when linking the produce group, “fresh” in the scanner data would be aligned with “raw” and “prepared from fresh” and NOT “prepared from frozen” in the FNDDS, and “broccoli florets” would be aligned with “raw” and “broccoli”. Since the FNDDS is designed to code the foods individuals report eating, many of the foods in the FNDDS are already prepared, resulting in descriptions such as “broccoli, steamed, prepared from fresh” or “broccoli, boiled, prepared from frozen”.</p>
<p>Once the linking table is established, the probabilistic match process returns the single best possible match for each item in the scanner data. For example, a match between fresh broccoli florets and frozen broccoli would have a lower probability score than “broccoli, raw”. Because these matches form the basis of major USDA policies, we cannot accept an error rate of more than 5 percent, and lower is preferred. To reach that goal, nutritionists review every match to make sure the probabilistic match did not return a match between cauliflower florets and fresh broccoli, say, or that a broccoli and carrot stir-fry is not matched to a dish with broccoli, carrots, and chicken. The correct matches, such as the one between fresh broccoli florets and raw broccoli, are set aside while the items with an incorrect match, such as cauliflower florets and the broccoli and carrot stir-fry, are used to revise the search table. Revisions might include adding (NOT chicken) to the broccoli and carrot stir-fry dish. Mixed dishes — such as the broccoli and carrot stir-fry — pose particular challenges because there are a wide variety of similar products available in the grocery store. After a few rounds of revising the search table and running the probabilistic match process, it is more efficient to use a manual match, established by one nutritionist and reviewed by another, after which the match is assumed to be correct.</p>
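<p>The search-table-plus-review loop described above can be sketched in a few lines. The code below is a simplified stand-in for the probabilistic matcher – it scores candidates with a plain string-similarity ratio from Python’s standard library and applies exclusion terms such as “NOT prepared from frozen”; the function names and data are illustrative, not the production system:</p>

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude stand-in for a probabilistic match score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(scanner_desc, fndds_items, exclude_terms=()):
    """Return the highest-scoring FNDDS description for a scanner-data
    item, skipping candidates that contain an exclusion term."""
    candidates = [item for item in fndds_items
                  if not any(t.lower() in item.lower() for t in exclude_terms)]
    if not candidates:
        return None  # no acceptable match: fall back to manual review
    return max(candidates, key=lambda item: similarity(scanner_desc, item))

match = best_match(
    "fresh and crisp broccoli florets, ready-cut, 10 oz",
    ["Broccoli, raw",
     "Broccoli, boiled, prepared from frozen",
     "Cauliflower, raw"],
    exclude_terms=["prepared from frozen"],  # from the search table: fresh, not frozen
)
```

<p>In the real workflow the scores come from the probabilistic match over the aligned search-table terms, and every returned match still goes to a nutritionist for review.</p>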
<p>The process improved with each new wave of FNDDS and IRI data. Our first version of the PPC linked the FNDDS 2011/12 to the 2013 IRI retail scanner data. Subsequent waves started with the previous search table, and the resulting matches were reviewed by nutritionists. We also used more fields in the IRI product dictionary to create the homogeneous linking groups and modified these groups with each wave. During each wave we experimented to find the number of rounds of probabilistic matching that was most cost effective. For some linking groups it took less human time to match manually from the start, while for other groups it was more efficient to do multiple rounds of improvements to the search table. Starting with the most recent wave (matching FNDDS 2017/18 to the 2017 and 2018 retail scanner data), we assumed previous matches appearing in the newer data were correct. Although this assumption held for most matches, a review showed that previous matches should still be checked before an item is removed from the list of scanner data items needing FNDDS matches. In the future we intend to explore methods developed by the participants of the Food for Thought competition.</p>
</section>
<section id="linking-challenges" class="level2">
<h2 class="anchored" data-anchor-id="linking-challenges">Linking challenges</h2>
<p>An ongoing challenge in the linking process is that both the scanner data and the FNDDS undergo substantive changes each year, meaning that the previous matches and search tables must be reviewed and revised with each new effort: tables that work with one cycle of FNDDS and scanner data need revisions before they can be used with the next. Changes to the scanner data that affect our current method include dropped and added items, data corrections, and revisions to the categories that form the basis of the homogeneous linking groups. In addition, there are errors such as incorrect food descriptions, conflicting package size information, and changes in item descriptions from year to year. Since the FNDDS is designed to support dietary recall studies, revisions reflect both changes to available foods and the level of detail respondents can provide. These revisions result in dropped or added food codes, changes to food descriptions that affect which scanner data items match to FNDDS items, and revisions to recipes used in the nutrient coding, which affect the number of retail ingredients available in the FNDDS.</p>
<p>Of the four parts of the PPS, establishing the matches is the most time-consuming task and accounts for at least 60 percent of the total budget. In the most recent round, we had 168 categories, and each one went through 2–3 automated matching rounds; after each round, nutritionists spent an average of two hours reviewing the matches. This adds up to somewhere between 670 and 1,000 hours of review time. After the automated review, manual matching requires an additional 300 hours. Reducing the time required to establish matches and link the FNDDS and retail scanner datasets could yield significant savings and make the data available sooner. That, in turn, would allow more timely policy-based research and ensure that the mandated revision of the Thrifty Food Plan can continue with the most recent food price data.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/00-food-for-thought.html">← Introduction</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/02-competition-design.html">Part 2: Competition design →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Andrea Carlson</strong> is an agricultural economist in the Food Markets Branch of the Food Economics Division in USDA’s Economic Research Service. She is the project lead for the Purchase to Plate Suite, which allows users to import USDA nutrient and food composition data into retail food scanner data acquired by USDA and estimate individual food prices for dietary intake data.
</dd>
<dd>
<strong>Thea Palmer Zimmerman</strong> is a senior study director and research nutritionist at Westat.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Image credit</dt>
<dd>
Thumbnail photo by <a href="https://unsplash.com/@neonbrand?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Kenny Eliason</a> on <a href="https://unsplash.com/photos/SvhXD3kPSTY?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Carlson, Andrea, and Thea Palmer Zimmerman. 2023. “Food for Thought: The importance of the Purchase to Plate Suite.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/01-purchase-to-plate.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>
</section>
<section id="acknowledgements" class="level2">
<h2 class="anchored" data-anchor-id="acknowledgements">Acknowledgements</h2>
<p>The research presented in this compendium supports the Purchase to Plate Suite of data products. Carlson has been privileged to both develop and lead this project over the course of her career, but it is not a solo project. Many thanks to the Linkages Team from USDA’s Economic Research Service (Christopher Lowe, Mark Denbaly, Elina Page, and Catherine Cullinane Thomas), the Center for Nutrition Policy and Promotion (Kristin Koegel, Kevin Kuczynski, Kevin Meyers Mathieu, TusaRebecca Pannucci), and our contractor Westat, Inc.&nbsp;(Thea Palmer Zimmerman, Carina E. Tornow, Amber Brown McFadden, Caitlin Carter, Viji Narayanaswamy, Lindsay McDougal, Elisha Lubar, Lynnea Brumby, Raquel Brown, and Maria Tamburri). Many others have supported this project over the years.</p>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-carlson_et_al_2019" class="csl-entry">
Carlson, A. C., E. T. Page, T. P. Zimmerman, C. E. Tornow, and S. Hermansen. 2019. <em>Linking USDA Nutrition Databases to IRI Household-Based and Store-Based Scanner Data</em>. Technical Bulletin No. 1952. US Department of Agriculture, Economic Research Service.
</div>
<div id="ref-carlson_et_al_2022" class="csl-entry">
Carlson, A. C., C. E. Tornow, E. T. Page, A. Brown McFadden, and T. Palmer Zimmerman. 2022. <span>“Development of the Purchase to Plate Crosswalk and Price Tool: Estimating Prices for the National Health and Nutrition Examination Survey (NHANES) Foods and Measuring the Healthfulness of Retail Food Purchases.”</span> <em>Journal of Food Composition and Analysis</em> 106: 104344. <a href="https://doi.org/10.1016/j.jfca.2021.104344">https://doi.org/10.1016/j.jfca.2021.104344</a>.
</div>
<div id="ref-cleary_et_al_2022" class="csl-entry">
Cleary, R., Y. Liu, and A. Carlson. 2022. <em>Differences in the Distribution of Nutrition Between Households Above and Below Poverty</em>. Agricultural and Applied Economic Association Annual Meeting. Anaheim, CA. <a href="https://ageconsearch.umn.edu/record/322267">https://ageconsearch.umn.edu/record/322267</a>.
</div>
<div id="ref-gattshall_et_al_2008" class="csl-entry">
Gattshall, M. L., J. A. Shoup, J. A. Marshall, L. A. Crane, and P. A. Estabrooks. 2008. <span>“Validation of a Survey Instrument to Assess Home Environments for Physical Activity and Healthy Eating in Overweight Children.”</span> <em>International Journal of Behavioral Nutrition and Physical Activity</em> 5 (3). <a href="https://doi.org/10.1186/1479-5868-5-3">https://doi.org/10.1186/1479-5868-5-3</a>.
</div>
<div id="ref-hanson_et_al_2005" class="csl-entry">
Hanson, N. I., D. Neumark-Sztainer, M. E. Eisenberg, M. Story, and M. Wall. 2005. <span>“Associations Between Parental Report of the Home Food Environment and Adolescent Intakes of Fruits, Vegetables and Dairy Foods.”</span> <em>Public Health Nutrition</em> 8 (1). <a href="https://doi.org/10.1079/PHN2005661">https://doi.org/10.1079/PHN2005661</a>.
</div>
<div id="ref-levin_et_al_2018" class="csl-entry">
Levin, D., D. Noriega, C. Dicken, A. Okrent, M. Harding, and M. Lovenheim. 2018. <em>Examining Store Scanner Data: A Comparison of the IRI Infoscan Data with Other Data Sets, 2008-12</em>. Technical Bulletin No. 1949. US Department of Agriculture, Economic Research Service.
</div>
<div id="ref-muth_et_al_2016" class="csl-entry">
Muth, M. K., M. Sweitzer, D. Brown, et al. 2016. <em>Understanding IRI Household-Based and Store-Based Scanner Data</em>. Technical Bulletin No. 1942. US Department of Agriculture, Economic Research Service.
</div>
<div id="ref-REEDY2014881" class="csl-entry">
Reedy, J., S. M. Krebs-Smith, P. E. Miller, et al. 2014. <span>“Higher Diet Quality Is Associated with Decreased Risk of All-Cause, Cardiovascular Disease, and Cancer Mortality Among Older Adults.”</span> <em>The Journal of Nutrition</em> 144 (6): 881–89. <a href="https://doi.org/10.3945/jn.113.189407">https://doi.org/10.3945/jn.113.189407</a>.
</div>
<div id="ref-thrifty_food_plan_2021" class="csl-entry">
US Department of Agriculture. 2021. <em>Thrifty Food Plan, 2021</em>. Food and Nutrition Service No. 916. US Department of Agriculture. <a href="https://FNS.usda.gov/TFP">https://FNS.usda.gov/TFP</a>.
</div>
<div id="ref-fndds_2018" class="csl-entry">
US Department of Agriculture, Agricultural Research Service. 2018. <em>USDA Food and Nutrient Database for Dietary Studies 2015-2016</em>. US Department of Agriculture, Agricultural Research Service. <a href="https://www.ars.usda.gov/nea/bhnrc/fsrg">https://www.ars.usda.gov/nea/bhnrc/fsrg</a>.
</div>
<div id="ref-fndds_2020" class="csl-entry">
US Department of Agriculture, Agricultural Research Service. 2020. <em>USDA Food and Nutrient Database for Dietary Studies 2017-2018</em>. US Department of Agriculture, Agricultural Research Service. <a href="https://www.ars.usda.gov/nea/bhnrc/fsrg">https://www.ars.usda.gov/nea/bhnrc/fsrg</a>.
</div>
<div id="ref-usda_2020" class="csl-entry">
US Department of Agriculture and US Department of Health and Human Services. 2020. <em>Dietary Guidelines for Americans, 2020-2025</em>. 9th edition. <span>US Department of Agriculture and US Department of Health and Human Services</span>. <a href="https://DietaryGuidelines.gov">https://DietaryGuidelines.gov</a>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>NHANES is a multi-module continuous survey conducted by the Centers for Disease Control and Prevention. In addition to the WWEIA, NHANES includes a four-hour complete medical exam including a health history, and a blood and urine analysis.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/01-purchase-to-plate.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/01-pps.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: Competition and challenge design</title>
  <dc:creator>Zheyuan Zhang and Uyen Le</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/02-competition-design.html</link>
  <description><![CDATA[ 





<p>Since 2014, the professional services firm Westat, Inc.&nbsp;has been developing the Purchase to Plate Crosswalk (PPC) for the United States Department of Agriculture (USDA) Economic Research Service (ERS). The PPC links the retail food transactions database from IRI’s InfoScan service to the USDA Food and Nutrient Database for Dietary Studies (FNDDS). However, the current linkage process is only partly automated, making it resource intensive and time consuming, and requiring manual review.</p>
<p>With sponsorship from ERS, Westat partnered with the Coleridge Initiative to host the Food for Thought competition to challenge researchers and data scientists to use machine learning and natural language processing to find accurate and efficient methods for creating the PPC. Figure 1 provides a visual overview of the challenge set by the competition.</p>
<p><a href="images/pt2-fig1.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt2-fig1.png" class="img-fluid" width="700"></a></p>
<div class="figure-caption">
<p><strong>Figure 1:</strong> Overview of the Food for Thought Competition Challenge.</p>
</div>
<p>The one-to-many matching task that is central to the competition poses many challenges for researchers to wrestle with. Because the IRI data contain food transactions collected from partnered retail establishments for over 350,000 items, matches must be made from limited data features: categories, providers, and semantically inconsistent descriptions consisting of short phrases. Consider this hypothetical example: IRI product-related information about a (fictional) “Cheesy Hashbrowns Hamburger Helper, 5.5 Oz Box” needs to be linked to FNDDS nutrition-related information found under “Mixed dishes – meat, poultry, seafood: Mixed meat dishes”. Figure 2 demonstrates how the two databases are linked to create the PPC. As can be seen, there is no common word that easily indicates that “Cheesy Hashbrowns Hamburger Helper…” should be matched with “Mixed dishes…”, and such cases exist in all IRI tables used for the challenge, from 2012 through 2018.</p>
<p><a href="images/pt2-fig2.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt2-fig2.png" class="img-fluid" width="700"></a></p>
<div class="figure-caption">
<p><strong>Figure 2:</strong> Each universal product code (UPC) from the IRI data could match to only one ensemble code (EC) from the FNDDS data, whereas one EC code could match to multiple UPCs.</p>
</div>
<p>Also, because nutritionists or food scientists will always need to review the matching, regardless of the matching method used, it was important that our evaluation of proposed matching methods focused both on the predictive accuracy of models and on metrics that would lead participants to develop models that help qualified reviewers reduce their workloads.</p>
<p>Organising the competition was also a challenge in its own right, for data privacy reasons. IRI scanner data contains sensitive information, such as store name, location, unit price, and weekly quantity sold for each item. This ruled out using existing online platforms like Kaggle, DrivenData or AIcrowd to host the competition, and instead required a private secure data enclave to ensure the safe use of sensitive and confidential data assets. The need for such an environment imposed capacity constraints on the competition, meaning only dozens of teams could be invited to take part, whereas on open platforms it is common to have thousands of teams competing and sharing ideas and code.</p>
<section id="competition-structure" class="level2">
<h2 class="anchored" data-anchor-id="competition-structure">Competition structure</h2>
<p>The competition ran over 10 months and consisted of three separate challenges: two interim, one final. Applications opened in September 2021, and the competition started in January 2022. Submission deadlines for the first and second interim challenges were in July and September 2022, respectively. For these rounds, participants submitted preliminary solutions for evaluation based solely on quantitative metrics, and two awards of $10,000 were given to the highest-scoring teams. The deadline for the final challenge was in October 2022. Here, solutions were evaluated by the scientific review board based on three judging criteria: quantitative metrics, transferability, and innovation. First, second, and third place winners received awards of $30,000, $1,500, and $1,000 respectively. Final presentations were given at the Food for Thought symposium in December 2022.</p>
<p>The competition was run entirely within the Coleridge Initiative’s Administrative Data Research Facility (ADRF), which was established by the United States Census Bureau to inform the decision-making of the Commission on Evidence-Based Policy under the Evidence Act. ADRF follows the Five Safes Framework: safe projects, safe people, safe data, safe settings, and safe outputs.</p>
<p>In keeping with this framework, participants were provided with ADRF login credentials after signing the relevant data use agreements during the onboarding process. All participants were required to agree to the ADRF terms of use, to complete security training, and to pass a security training assessment prior to accessing the challenge data. Participants’ access within ADRF was limited to the challenge environment and data only. There was no internet access, so Coleridge Initiative ensured that any packages requested by teams were available for use within the environment after passing security review. All codes and documentation were only allowed to be exported outside ADRF after export reviews from both Coleridge Initiative and USDA staff. At the end of each challenge, the teams submitted write-ups and supporting files by placing all the necessary submission files in their ADRF team folder. Detailed submission instructions are available via the <a href="https://github.com/realworlddatascience/realworlddatascience.github.io/tree/main/case-studies/posts/2023/08/21/_code">Real World Data Science GitHub repository</a>.</p>
</section>
<section id="metrics" class="level2">
<h2 class="anchored" data-anchor-id="metrics">Metrics</h2>
<p>Submissions were evaluated by Coleridge Initiative and technical review and subject review boards based on the following criteria:</p>
<ul>
<li><strong>Quantitative metrics</strong> were used to measure the predictive accuracy and runtime of the model.<br>
</li>
<li><strong>Transferability</strong> measured the quality of documentation and code, and the ability of individuals who are not involved in model development to replicate and implement the team’s approach.<br>
</li>
<li><strong>Innovation</strong> measured novelty and creativity of the model in addressing the linkage problem.</li>
</ul>
<p>Technical review was overseen by faculty members from computer science and engineering departments of top US universities. Subject review was handled by subject matter experts from USDA and Westat.</p>
<p>From a quantitative perspective, the most common way to evaluate machine learning competition submissions is to use model predictive accuracy. However, single metrics are typically incomplete descriptions of real-world tasks, and they can easily hide significant differences between models which simple predictive accuracy cannot capture. To select the most appropriate official challenge metrics, Coleridge Initiative reviewed the literature on the use of evaluation measures in both classification and ranking task machine learning competitions. Success at 5 (S@5) and Normalized Discounted Cumulative Gain at 5 (NDCG@5) scores were ultimately used as the quantitative metrics.</p>
<p>The metrics were applied as follows: models proposed by each team were tasked with outputting five potential FNDDS matches for each IRI code, ordered from most likely to least likely. S@5 and NDCG@5 are broadly similar – both measure whether a correct match is present in the five proposed matches. However, S@5 does not take rank position into account and only considers whether the five proposed FNDDS matches contain the correct FNDDS response, whereas NDCG@5 also measures how highly the correct FNDDS response is ranked among the five proposed matches. Both measures range from 0 to 1 (or 0% to 100%). Models get full credit for S@5 as long as the five proposed matches contain the correct FNDDS option; NDCG@5 penalizes models when the correct match is ranked lower on the list of five proposed matches.</p>
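<p>With a single correct match per IRI code (binary relevance, one relevant item, so the ideal DCG is 1), both metrics reduce to a few lines. A minimal sketch:</p>

```python
import math

def success_at_k(ranked, correct, k=5):
    """S@k: 1.0 if the correct match appears anywhere in the top k, else 0.0."""
    return 1.0 if correct in ranked[:k] else 0.0

def ndcg_at_k(ranked, correct, k=5):
    """NDCG@k with one relevant item: the gain is discounted by
    1/log2(rank + 1), so rank 1 scores 1.0 and lower ranks score less."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == correct:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

<p>For example, a model that lists the correct match second still gets S@5 = 1.0, but its NDCG@5 drops to 1/log<sub>2</sub>(3) ≈ 0.63.</p>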
</section>
<section id="technical-description" class="level2">
<h2 class="anchored" data-anchor-id="technical-description">Technical description</h2>
<section id="environment-setup" class="level5">
<h5 class="anchored" data-anchor-id="environment-setup">Environment setup</h5>
<p>Coleridge Initiative solicited technical requirements from participants at the challenge application stage to prepare the ADRF environment as much as possible before the competition began. Each team was asked to share anticipated workspace specifications and software library requests in their application package. From this we identified, reviewed, and installed the requested Python and R packages, libraries, and library components (e.g., pre-trained models, training data) that were not yet available within ADRF.</p>
<p>The setup of graphics processing units (GPUs) was also a critical part of competition preparation. We created an environment with 16 gibibyte (GiB) of GPU memory for each team. Our technology team met with multiple teams several times to discuss computing environment configurations to ensure the GPU could work properly. None of these efforts was wasted: without GPU access, it would be impossible for teams to use state-of-the-art pre-trained models such as the Bidirectional Encoder Representations from Transformers <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(BERT, Devlin et al. 2018)</span>.</p>
<p>We completed the setup of new team workspaces, each customized to the individual team’s resource and library requirements, including GPU configuration. The isolation and customization of workspaces was vital because teams may request different versions of libraries that potentially have version conflict with other libraries. We ensured the configurations were all set before the challenge began because such data challenges are bursty in nature <span class="citation" data-cites="macavaney_et_al_2021">(Macavaney et al. 2021)</span>, and handling support requests in the private data enclave risked causing delays. We hoped to avoid receiving too many requests in the beginning phase of the competition in order to give participants a better experience, though we did of course provide participants with instructions on how to request additional libraries during the challenge period.</p>
</section>
<section id="supporting-materials" class="level5">
<h5 class="anchored" data-anchor-id="supporting-materials">Supporting materials</h5>
<p>In addition to environment preparation, we made available a list of supporting documentation, including IRI, PPC, and FNDDS codebooks, technical reports, and related publications that could help teams understand the challenge datasets. The FNDDS codebook pooled information on variable availability, coding, and descriptions across dataset files and years. It also included internal Westat food category coding difficulty ratings and notes on created PPC codes and provided UPC code, EC code, and general dataset remarks and observations that may take time for analysts to discover on their own.</p>
<p>We developed a baseline model to demonstrate the challenge task and the expected outputs – both outside of ADRF using FNDDS and fictitious data in place of IRI data, and an analogous model using FNDDS and IRI data within the ADRF secure environment. Moreover, we provided the teams with an evaluation script to read in their submissions and evaluate them for predictive accuracy against the public test set using S@5 and NDCG@5 challenge metrics. Finally, we held multiple webinars during the course of the challenge to explain next steps, address participant questions, solicit feedback, and provide general support. Multiple teams also met with our technology team to clarify ADRF-related questions or troubleshoot technical issues.</p>
<p>(Baseline model, toolkits, and evaluation script are available from the <a href="https://github.com/realworlddatascience/realworlddatascience.github.io/tree/main/case-studies/posts/2023/08/21/_code">Real World Data Science GitHub repository</a>.)</p>
</section>
<section id="data-splitting" class="level5">
<h5 class="anchored" data-anchor-id="data-splitting">Data splitting</h5>
<p>To mimic the real-world scenario, the competition used the 2012–2016 IRI data as the training set and the 2017–2018 IRI data as the test set, since the data change over time and USDA could provide the most recent data available. To make sure that models were generalizable and not simply overfit to the test set, we split the test set into private and public test sets, guaranteeing that the models were evaluated on completely hidden data. To keep the distributions of the two sets similar, we first divided the data into five quintiles based on EC code frequencies and then randomly sampled 80% of records in each group without repetition for placement into the private test set. Later in the competition, because of computing limits, we further shrank the private test set to 40% of its original size using the same data-splitting method.</p>
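<p>The splitting scheme described above – binning EC codes by frequency, then sampling within each bin – can be sketched as follows. This is an illustrative reconstruction, not the competition’s actual script; the record structure and function names are assumed:</p>

```python
import math
import random
from collections import Counter

def stratified_private_split(records, ec_of, frac=0.8, n_bins=5, seed=0):
    """Split records into (private, public) test sets while roughly
    preserving the EC-code frequency distribution: bin EC codes into
    frequency quantiles, then sample `frac` of each bin's records
    without replacement into the private set."""
    rng = random.Random(seed)
    freq = Counter(ec_of(r) for r in records)
    ecs = sorted(freq, key=lambda ec: (freq[ec], ec))  # rank ECs by frequency
    per_bin = math.ceil(len(ecs) / n_bins)
    bins = [set(ecs[i * per_bin:(i + 1) * per_bin]) for i in range(n_bins)]
    private, public = [], []
    for b in bins:
        group = [r for r in records if ec_of(r) in b]
        rng.shuffle(group)
        cut = int(len(group) * frac)
        private.extend(group[:cut])
        public.extend(group[cut:])
    return private, public
```

<p>Shrinking the private set to 40% of its original size would then just be a second pass with a smaller sampling fraction.</p>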
</section>
<section id="judging" class="level5">
<h5 class="anchored" data-anchor-id="judging">Judging</h5>
<p>In the first two rounds, submissions were evaluated on the quantitative metrics described above. Coleridge Initiative was responsible for running the evaluation script – making sure not to re-train the model or modify the configs in any way, and only applying the model to predict the private test set. Prediction results were then compared against the ground truth to obtain the private scores.</p>
<p>The final challenge was reviewed by the scientific review board on all three judging criteria. Submitted models were first evaluated by Coleridge Initiative in the same way as in the first two rounds. The runtime of each model was also recorded as an assessment of model cost. The scientific review board then assessed the models on the quality of documentation, the quality of code, and the ability to replicate and implement the team’s approach, and scored the models for innovation and creativity in addressing the linkage problem. Lastly, scores were summarized and the scientific review board discussed and decided the winners of the competition.</p>
</section>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<p>The next few articles in this collection walk readers through the solutions proposed by competition finalists. Figure 3 provides a brief summary.</p>
<p><a href="images/pt2-fig3.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt2-fig3.png" class="img-fluid" width="700"></a></p>
<div class="figure-caption">
<p><strong>Figure 3:</strong> Top competitors and their solutions to the Food for Thought challenge.</p>
</div>
</section>
<section id="lessons-learned" class="level2">
<h2 class="anchored" data-anchor-id="lessons-learned">Lessons learned</h2>
<p>It was undoubtedly challenging for teams to work with highly secured data in a private data enclave for this data challenge. We solicited feedback from teams and summarized the issues we experienced throughout the competition, together with the solutions used to resolve them. Below are our main lessons learned; we hope this summary can serve to inform future competitions.</p>
<ul>
<li><p><strong>Environmental factors:</strong> The installation and setup of packages, libraries, and resources, as well as the configuration of GPUs, system dependencies, and workspace design were expected to take a long time as each team had their own needs. To accelerate the process, we requested a list of specific package and environment requirements from the teams in advance. However, due to the complexity of the system configuration required by the teams, environment setup took longer than expected. Thus, the challenge deadlines had to be postponed a few times to accommodate this.</p></li>
<li><p><strong>Time commitment:</strong> Twelve teams were selected to participate in the challenge, but only three remained by the final round. One team was disqualified for violating the ADRF terms-of-use agreement; the other eight dropped out because of competing commitments and insufficient time to participate meaningfully. For security reasons, ADRF does not allow jobs to run in the background, which further adds to the time teams must commit. To encourage participation in the final challenge, we offered additional awards for second and third place.</p></li>
<li><p><strong>Computing resource limit:</strong> One issue encountered in evaluating submitted models was the computing resource limits imposed by the secure data enclave. The original private test dataset is four times larger than the public test dataset, which made full evaluation infeasible. Given the fixed resource constraints, we decided to reduce the private test set to 40% of its original size. In hindsight, it would have been helpful to set a model runtime limit at the outset, so that participants could build simpler yet still effective models.</p></li>
<li><p><strong>Supporting code:</strong> Although the initial baseline model we provided was extremely simple, it helped participants a great deal in the initial phase – yet there is room for improvement. Specifically, supporting code should use all relevant data tables and should specify the main function for running the code, especially how the model should be tested. The teams trained only on the main table – the only table used in the baseline model – and did not touch the other supporting table. Had we included that table in the baseline model, it could have helped participants make better use of the data. In addition, a baseline model should be intuitive for participants to follow, allowing evaluators to swap the public test set for the private test set without any programming modifications.</p></li>
</ul>
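<p>The test-set reduction described above can be sketched as a seeded random subsample. This is a minimal illustration under assumed details, not the Coleridge Initiative’s actual evaluation code: fixing the seed keeps the reduced set identical across evaluation runs, so every team is scored on the same 40% of records.</p>

```python
import random

def subsample(rows, fraction=0.4, seed=42):
    """Return a reproducible random subset of the test set.

    A fixed seed makes the reduced set identical across runs, so all
    submitted models are evaluated on exactly the same records.
    """
    rng = random.Random(seed)
    k = int(len(rows) * fraction)
    return rng.sample(rows, k)

test_rows = list(range(1000))  # stand-in for the private test records
reduced = subsample(test_rows)
print(len(reduced))  # 400
```

<p>A stratified sample (e.g., preserving the mix of product categories) would be a natural refinement if the linkage difficulty varies across strata.</p>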
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/01-purchase-to-plate.html">← Part 1: Purchase to Plate</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/03-first-place-winners.html">Part 3: First place winners →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Zheyuan Zhang</strong> and <strong>Uyen Le</strong> are research scientists at the Coleridge Initiative.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Zheyuan Zhang and Uyen Le
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Zhang, Zheyuan, and Uyen Le. 2023. “Food for Thought: Competition and challenge design.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/02-competition-design.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-DBLP:journals/corr/abs-1810-04805" class="csl-entry">
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. <span>“<span>BERT:</span> Pre-Training of Deep Bidirectional Transformers for Language Understanding.”</span> <em>CoRR</em> abs/1810.04805. <a href="http://arxiv.org/abs/1810.04805">http://arxiv.org/abs/1810.04805</a>.
</div>
<div id="ref-macavaney_et_al_2021" class="csl-entry">
Macavaney, S., A. Mittu, G. Coppersmith, J. Leintz, and P. Resnik. 2021. <em>Community-Level Research on Suicidality Prediction in a Secure Environment: Overview of the CLPsych 2021 Shared Task</em>. In Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access.
</div>
</div></section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/02-competition-design.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt2-intro.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>The road to reproducible research: hazards to avoid and tools to get you there safely</title>
  <dc:creator>Davit Svanidze, Andre Python, et al.</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/road-to-reproducible-research.html</link>
  <description><![CDATA[ 





<p>Reproducibility, or “<a href="https://www.nsf.gov/sbe/AC_Materials/SBE_Robust_and_Reliable_Research_Report.pdf">the ability of a researcher to duplicate the results of a prior study using the same materials as the original investigator</a>”, is critical for sharing and building upon scientific findings. Reproducibility not only verifies the correctness of processes leading to results but also serves as a prerequisite for assessing generalisability to other datasets or contexts. This we refer to as replicability, or “<a href="https://www.nsf.gov/sbe/AC_Materials/SBE_Robust_and_Reliable_Research_Report.pdf">the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected</a>”. Reproducibility, which is the focus of our work here, can be challenging – especially in the context of deep learning. This article, and associated material, aims to provide practical advice for overcoming these challenges.</p>
<p>Our story begins with Davit Svanidze, a master’s degree student in economics at the London School of Economics (LSE). Davit’s efforts to make his bachelor’s thesis reproducible inspired this article, and we hope readers will be able to learn from his experience and apply those lessons to their own work. Davit will demonstrate the use of Jupyter notebooks, GitHub, and other relevant tools to ensure reproducibility. He will walk us through code documentation, data management, and version control with Git. And he will share best practices for collaboration, peer review, and dissemination of results.</p>
<p>Davit’s story starts here, but there is much more for the interested reader to discover. At certain points in this article, we will direct readers to other resources, namely a <a href="https://github.com/dsvanidze/replicability/blob/master/notebooks/main.ipynb">Jupyter notebook</a> and <a href="https://github.com/dsvanidze/replicability">GitHub repository</a> which contain all the instructions, data and code necessary to reproduce Davit’s research. Together, these components offer a comprehensive overview of the thought process and technical implementation required for reproducibility. While there is no one-size-fits-all approach, the principles remain consistent.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/computer.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A young man sits in front of a computer keyboard, surrounded by monitors and books and with computer cables covering various surfaces"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Credit:</strong> Discord software, Midjourney bot.</p>
</div>
<section id="davits-journey-towards-reproducibility" class="level2">
<h2 class="anchored" data-anchor-id="davits-journey-towards-reproducibility">Davit’s journey towards reproducibility</h2>
<section id="more-power-please" class="level3">
<h3 class="anchored" data-anchor-id="more-power-please">More power, please</h3>
<p>The focus of my bachelor’s thesis was to better understand the initial spread of Covid-19 in China using deep learning algorithms. I was keen to make my work reproducible, but not only for my own sake. The “reproducibility crisis” is a well-documented problem in science as a whole,<sup>1</sup> <sup>2</sup> <sup>3</sup> <sup>4</sup> with studies suggesting that around one-third of social science studies published between 2010 and 2015 in top journals like <em>Nature</em> and <em>Science</em> could not be reproduced.<sup>5</sup> Results that cannot be reproduced are not necessarily “wrong”. But if findings cannot be reproduced, we cannot be sure of their validity.</p>
<p>For <a href="https://github.com/dsvanidze/replicability/blob/master/notebooks/main.ipynb">my own research project</a>, I gathered all the data and started working on my computer. Once I had built the algorithms and began training the models, my first challenge to reproducibility was computational: training models on my local computer was taking far too long, and I needed a faster, more powerful solution to submit my thesis on time. Fortunately, I could access the university server to train the algorithms. Once training was complete, I could generate the results on my local computer, since producing maps and tables was not so demanding. However…</p>
</section>
<section id="bloody-paths" class="level3">
<h3 class="anchored" data-anchor-id="bloody-paths">Bloody paths!</h3>
<p>In switching between machines and computing environments, I soon encountered an issue with my code: the <a href="https://en.wikipedia.org/wiki/Path_(computing)">paths</a>, or file directory locations, for the trained algorithms had been hardcoded! As I quickly discovered, hardcoding a path can lead to issues when the code is run in a different environment, as the path might not exist in the new environment.</p>
<p>As my code grew longer, I overlooked the path names linked to the algorithms that were generating the results. This mistake – which would have been easily corrected if spotted earlier – produced incorrect outputs. Such errors could have enormous negative implications in a public health context, where evidence-based decisions have real impacts on human lives. It was at this point that I realised that my code was the fundamental pillar of the validity of my empirical work. How can someone trust my work if they are not able to verify it?</p>
<p>The following dummy code demonstrates the hardcoding issue:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb1-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Hardcoded path</span></span>
<span id="cb1-3">file_path <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/user/notebooks/toydata.csv"</span></span>
<span id="cb1-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">try</span>:</span>
<span id="cb1-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(file_path) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>:</span>
<span id="cb1-6">        data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>.read()</span>
<span id="cb1-7">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(data)</span>
<span id="cb1-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">except</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">FileNotFoundError</span>:</span>
<span id="cb1-9">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"File not found"</span>)</span>
<span id="cb1-10"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/hardcoded-paths-1.gif" class="img-fluid"></p>
<p>In the code above, a dummy file (<code>toydata.csv</code>) is used. The dummy file contains data on the prices of three different toys, but only the path of the file is relevant to this example. If the hardcoded file path – <code>"/user/notebooks/toydata.csv"</code> – exists on the machine being used, the code will run just fine. But when run in a different environment without that path, the code will raise a <code>FileNotFoundError</code> and print <code>"File not found"</code>. Better code that uses relative paths can be written as:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb2-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb2-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Relative path</span></span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb2-4"></span>
<span id="cb2-5">file_path <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> os.path.join(os.getcwd(), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"toydata.csv"</span>)</span>
<span id="cb2-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">try</span>:</span>
<span id="cb2-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(file_path) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>:</span>
<span id="cb2-8">        data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>.read()</span>
<span id="cb2-9">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(data)</span>
<span id="cb2-10"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">except</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">FileNotFoundError</span>:</span>
<span id="cb2-11">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"File not found"</span>)</span>
<span id="cb2-12"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/hardcoded-paths-2.gif" class="img-fluid"></p>
<p>You can see that this code has successfully imported data from the dataset <code>toydata.csv</code> and printed its two columns (toy and price) and three rows.</p>
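<p>As an aside (not part of the original thesis code), the same relative-path idea can also be expressed with Python’s <code>pathlib</code> module, a commonly recommended alternative to <code>os.path</code>:</p>

```python
from pathlib import Path

# Build the path relative to the current working directory,
# so the code works unchanged on any machine.
file_path = Path.cwd() / "toydata.csv"
try:
    data = file_path.read_text()
    print(data)
except FileNotFoundError:
    print("File not found")
```

<p><code>Path</code> objects overload <code>/</code> for joining, handle platform-specific separators automatically, and offer convenience methods such as <code>read_text()</code>, which makes path-handling mistakes easier to spot.</p>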
<p>The following example is a simplified version of what happened when I wrote code to train several models, store the results and run a procedure to compare results with the predictive performance of a benchmark model:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb3-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb3-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set an arbitrary predictive performance value of a benchmark model</span></span>
<span id="cb3-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># and accept/reject models if the results are above/below the value.</span></span>
<span id="cb3-4">benchmark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set the model details in one place for a better overview</span></span>
<span id="cb3-6">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb3-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"simple"</span>}, </span>
<span id="cb3-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model2"</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model2"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"complex"</span>}</span>
<span id="cb3-9">}</span>
<span id="cb3-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set the current model to "model1" to use it for training and check its results</span></span>
<span id="cb3-11">current_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>]</span>
<span id="cb3-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train a simple model for "model1" and a complex model for "model2"</span></span>
<span id="cb3-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Training result of the "model1" is 30 and for "model2" is 70</span></span>
<span id="cb3-14">model_structure <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>])</span>
<span id="cb3-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Save the model and its result in a .csv file</span></span>
<span id="cb3-16">model_structure.to_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/all/notebooks/results-of-model1.csv'</span>, index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb3-17"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb4-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the model result and compare with benchmark</span></span>
<span id="cb4-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Model name: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>]))</span>
<span id="cb4-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Model type: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>]))</span>
<span id="cb4-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the result of the current model</span></span>
<span id="cb4-6">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/all/notebooks/results-of-model2.csv'</span>).iloc[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb4-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Result: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(result))</span>
<span id="cb4-8"></span>
<span id="cb4-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> benchmark:</span>
<span id="cb4-10">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\033</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">[3;32m&gt;&gt;&gt; Result is better than the benchmark -&gt; Accept the model and use it for calculations"</span>)</span>
<span id="cb4-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb4-12">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\033</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">[3;31m&gt;&gt;&gt; Result is NOT better than the benchmark -&gt; Reject the model as it is not optimal"</span>)</span>
<span id="cb4-13"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/hardcoded-paths-3.gif" class="img-fluid"></p>
<p>Everything looks fine at a glance. But if you examine the code carefully, you may spot the problem. When I first coded the procedure (training the model, saving and loading the results), I hardcoded the paths and had to change them for each tested model. First, I trained <code>model2</code>, a complex model, and tested it against the benchmark (70 &gt; 50 → accepted). I then repeated the procedure for <code>model1</code> (a simple model). Its result appeared identical to that of <code>model2</code>, so I kept <code>model1</code>, following the <a href="https://www.sciencedirect.com/topics/computer-science/parsimony-principle">parsimony principle</a>.</p>
<p>However, in the line that loads the result for the current model (line 5 of the second cell), I forgot to amend the path and so mistakenly loaded the result of <code>model2</code>. As a consequence, I accepted a model that should have been rejected. These wrong results then propagated through the rest of the code, including all the charts and maps and the conclusions of my analysis.</p>
<p>A small coding error like this can therefore be fatal to an analysis. Below is the corrected code:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb5-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb5-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set an arbitrary predictive performance value of a benchmark model</span></span>
<span id="cb5-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># and accept/reject models if the results are above/below the value.</span></span>
<span id="cb5-5">benchmark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb5-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set the model details (INCLUDING PATHS) in one place for a better overview</span></span>
<span id="cb5-7">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb5-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"simple"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path"</span>: os.path.join(os.getcwd(), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"results-of-model1.csv"</span>)}, </span>
<span id="cb5-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model2"</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model2"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"complex"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path"</span>: os.path.join(os.getcwd(), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"results-of-model2.csv"</span>)}</span>
<span id="cb5-10">}</span>
<span id="cb5-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set the current model to "model1" to use it for training and check its results</span></span>
<span id="cb5-12">current_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>]</span>
<span id="cb5-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train a simple model for "model1" and a complex model for "model2"</span></span>
<span id="cb5-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Training result of the "model1" is 30 and for "model2" is 70</span></span>
<span id="cb5-15">model_structure <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>])</span>
<span id="cb5-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Save the model and its result in a .csv file</span></span>
<span id="cb5-17">model_structure.to_csv(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path"</span>], index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb5-18"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb6-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb6-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the model result and compare with the benchmark</span></span>
<span id="cb6-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Model name: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>]))</span>
<span id="cb6-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Model type: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>]))</span>
<span id="cb6-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the result of the current model WITH a VARIABLE PATH</span></span>
<span id="cb6-6">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path"</span>]).iloc[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb6-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Result: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(result))</span>
<span id="cb6-8"></span>
<span id="cb6-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> benchmark:</span>
<span id="cb6-10">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\033</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">[3;32m&gt;&gt;&gt; Result is better than the benchmark -&gt; Accept the model and use it for calculations"</span>)</span>
<span id="cb6-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb6-12">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\033</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">[3;31m&gt;&gt;&gt; Result is NOT better than the benchmark -&gt; Reject the model as it is not optimal"</span>)</span>
<span id="cb6-13"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/hardcoded-paths-4.gif" class="img-fluid"></p>
<p>Here, the paths are stored with the other model details (lines 7–8, first cell), so we can use them as variables whenever we need them (e.g., line 16, first cell, and line 5, second cell). Now, when the current model is set to <code>model1</code> (line 11, first cell), everything is adjusted automatically. Likewise, if the path details need to change, we only need to change them once and everything else is updated automatically. The code now correctly states that <code>model1</code> performs worse than the benchmark and is therefore rejected, so we should keep <code>model2</code>, which performs better.</p>
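<p>Stripped of the notebook styling, the pattern is a small one. The sketch below (with illustrative model and file names, not the exact code from the cells above) shows the same idea: store each path once, alongside the other model details, and derive everything else from the dictionary:</p>

```python
import os

# Store each model's details, including its output path, in one place
models = {
    "model1": {"name": "Model 1", "type": "simple",
               "path": os.path.join(os.getcwd(), "results-of-model1.csv")},
    "model2": {"name": "Model 2", "type": "complex",
               "path": os.path.join(os.getcwd(), "results-of-model2.csv")},
}

# Switching models is now a one-line change; every path follows automatically
current_model = models["model1"]
print(current_model["path"])
```

Because the path is looked up through <code>current_model</code>, training, saving, and loading all stay consistent with whichever model is selected.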
<p>I managed to catch this error in time, but it can often be difficult to spot our own mistakes. That is why making code available to others is crucial: a code review by a second (or third) pair of eyes can save everyone a lot of time and prevent incorrect results and conclusions from spreading.</p>
</section>
<section id="solving-compatibility-chaos-with-docker" class="level3">
<h3 class="anchored" data-anchor-id="solving-compatibility-chaos-with-docker">Solving compatibility chaos with Docker</h3>
<p>One might think it would be easy to copy code from one computer to another and run it without difficulty, but it turns out to be a real headache. Different operating systems on my local computer and the university server caused multiple compatibility issues, which were very time-consuming to resolve. The university server ran Ubuntu, a Linux distribution, which was not compatible with my macOS-based code editor. Moreover, the server did not support the Python programming language – and all the deep learning packages that I needed – in the same way as my macOS computer did.</p>
<p>As a remedy, I used Docker containers, which allowed me to create a virtual environment with all the necessary packages and dependencies installed. This way, I could deploy the same environment on different hardware and make use of its processing power. To get started with Docker, I first had to install it on my local computer. The installation process is straightforward and <a href="https://docs.docker.com/desktop/">the Docker website</a> provides step-by-step instructions for different operating systems. In fact, I found the Docker website very helpful, with lots of resources and tutorials available. Once Docker was installed, it was easy to create virtual environments for my project and work with my code, libraries, and packages without any compatibility issues. Not only did Docker containers save me a lot of time and effort, but they could also make it easier for others to reproduce my work.</p>
<p>Below is an example of a Dockerfile that recreates an environment with Python 3.7 on Linux. It specifies which operations should be carried out, how, and in what order, to generate the environment with all the Python packages required to run the main Python script, <code>main.py</code>.</p>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/Dockerfile-example.png" class="img-fluid" alt="An example of a Dockerfile, showing the various steps required to recreate the correct environment for running Python file, main.py."></p>
<div class="figure-caption">
<p>An example of a Dockerfile.</p>
</div>
<p>In this example, by downloading the project, including the Dockerfile, anyone can run <code>main.py</code> without installing packages or worrying about what OS was used for development or which Python version should be installed. You can view Docker as a great robot chef: show it a recipe (Dockerfile), provide the ingredients (project files), push the start button (to build the container) and wait to sample the results.</p>
</section>
<section id="why-does-nobody-check-your-code" class="level3">
<h3 class="anchored" data-anchor-id="why-does-nobody-check-your-code">Why does nobody check your code?</h3>
<p>Even after implementing Docker, I still faced another challenge to reproducibility: making the verification process for my code easy enough that it could be done by anyone, without them needing a degree in computer science! Increasingly, there is an expectation for researchers to share their code so that results can be reproduced, but there are as yet no widely accepted or enforced standards on how to make code readable and reusable. However, if we are to embrace the concept of reproducibility, we must write and publish code under the assumption that someone, somewhere – boss, team member, journal reviewer, reader – will want to rerun our code. And, if we expect that someone will want to rerun our code (and hopefully check it), we should ensure that the code is readable and does not take too long to run.</p>
<p>If your code <em>does</em> take too long to run, some operations can often be accelerated – for example, by reducing the size of the datasets or by implementing computationally efficient data processing approaches (e.g., using <a href="https://pytorch.org/">PyTorch</a>). Aim for a running time of a few minutes – about as long as it takes to make a cup of tea or coffee. Of course, if data needs to be reduced to save computational time, the person rerunning your code won’t generate the same results as in your original analysis, so this will not achieve reproducibility <em>sensu stricto</em>. However, as long as you state clearly what the expected results from the reduced dataset are, your peers can at least inspect your code and offer feedback, and this marks a step towards reproducibility.</p>
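<p>Even a reduced dataset can be drawn reproducibly. The sketch below is illustrative (the array and sample size are made up): it uses a seeded generator so that every rerun sees the same 500-row subsample, and the reduced-data results are therefore stable across machines:</p>

```python
import numpy as np

# Pretend this is a large dataset: 10,000 rows, 5 features
rng = np.random.default_rng(seed=0)
data = rng.normal(size=(10_000, 5))

# Draw a fixed, seeded 5% subsample so every rerun sees the same reduced data
subsample_rng = np.random.default_rng(seed=42)
idx = subsample_rng.choice(len(data), size=500, replace=False)
reduced = data[idx]

print(reduced.shape)  # (500, 5)
```

Stating the subsample seed alongside the expected results lets reviewers verify the quick version of the analysis exactly.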
<p>We should also make sure our code is free from bugs – both the kind that might lead to errors in analysis and also those that stop the code running to completion. Bugs can occur for various reasons. For example, some code chunks written on a Windows machine may not properly execute on a macOS machine because the former uses <code>\</code> for file paths, while the latter uses <code>/</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb7-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb7-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Path works on macOS/Linux</span></span>
<span id="cb7-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"../../all/notebooks/toydata.csv"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb7-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f.read())</span>
<span id="cb7-5"></span>
<span id="cb7-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Path works only on Windows    </span></span>
<span id="cb7-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">..</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\.</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">.</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\a</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">ll</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">otebooks</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\t</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">oydata</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">.</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">csv"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb7-8">   <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f.read())</span>
<span id="cb7-9"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/path-cross-platform-compatibility-1.gif" class="img-fluid"></p>
<p>Here, only the macOS/Linux version works, since the code this capture was taken from was implemented on a Linux server. There are alternatives, however. The code below works on macOS, Linux, and also Windows machines:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb8-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb8-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pathlib <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Path</span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Path works on every OS: macOS/Linux/Windows</span></span>
<span id="cb8-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># It will automatically replace the path to "..\..\all\notebooks\toydata.csv" when it runs on Windows</span></span>
<span id="cb8-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(Path(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"../../all/notebooks/toydata.csv"</span>), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb8-7">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f.read())</span>
<span id="cb8-8"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/path-cross-platform-compatibility-2.gif" class="img-fluid"></p>
<p>The <code>pathlib</code> module is part of Python’s standard library, so it needs no extra installation – and even this step is of course unnecessary if you build a Docker container for your project, as discussed in the previous section.</p>
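<p>As an aside, <code>pathlib</code> can also build paths from components, so no separator character appears in the code at all; a brief sketch using the same illustrative file name:</p>

```python
from pathlib import Path

# Build the path from components; pathlib inserts the correct
# separator for whichever OS the code runs on
p = Path("..") / ".." / "all" / "notebooks" / "toydata.csv"
print(p)  # forward slashes on macOS/Linux, backslashes on Windows
```

This avoids both the hard-coded <code>/</code> versus <code>\</code> problem and the temptation to paste absolute paths.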
</section>
<section id="jupyter-king-of-the-notebooks" class="level3">
<h3 class="anchored" data-anchor-id="jupyter-king-of-the-notebooks">Jupyter, King of the Notebooks</h3>
<p>By this stage in my project, I was feeling that I’d made good progress towards ensuring that my work would be reproducible. I’d expended a lot of effort to make my code readable, efficient, and also absent of bugs (or, at least, this is what I was hoping for). I’d also built a Docker container to allow others to replicate my computing environment and rerun the analysis. Still, I wanted to make sure there were no barriers that would prevent people – my supervisors, in particular – from being able to review the work I had done for my undergraduate thesis. What I wanted was a way to present a complete narrative of my project that was easy to understand and follow. For this, I turned to Jupyter Notebook.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/jupyter.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A rendering of the god Jupiter, holding a pencil and sat in front of an open laptop computer"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Credit:</strong> Discord software, Midjourney bot.</p>
</div>
<p>Jupyter notebooks combine <a href="https://www.markdownguide.org/cheat-sheet/">Markdown text</a>, code, and visualisations. The notebook itself can sit within an online directory of folders and files that contain all the data and code related to a project, allowing readers to understand the processes behind the work and also access the raw resources. From <a href="https://github.com/dsvanidze/replicability/blob/master/notebooks/main.ipynb">the notebook I produced</a>, readers can see exactly what I did, how I did it, and what my results were.</p>
<p>While creating my notebook, I was able to experiment with my code and iterate quickly. Code cells within a document can be run interactively, which allowed me to try out different approaches to solving a problem and see the results almost in real time. I could also get feedback from others and try out new ideas without having to spend a lot of time writing and debugging code.</p>
</section>
<section id="version-control-with-git-and-github" class="level3">
<h3 class="anchored" data-anchor-id="version-control-with-git-and-github">Version control with Git and GitHub</h3>
<p>My Jupyter notebook and associated folders and files are all available via <a href="https://github.com/dsvanidze/replicability">GitHub</a>. <a href="https://git-scm.com/">Git</a> is a version control system that allows you to keep track of changes to your code over time, while GitHub is a web-based platform that provides a central repository for storing and sharing code. With Git and GitHub, I was able to version my code and collaborate with others without the risk of losing any work. I really couldn’t afford to redo the entire year I spent on my dissertation!</p>
<p>Git and GitHub are great for reproducibility. By sharing code via these platforms, others can access your work, verify it and reproduce your results without risking changing or, worse, destroying your work – whether partially or completely. These tools also make it easy for others to build on your work if they want to develop your research further. You can also use Git and GitHub to share or promote your results across a wider community. Storing and sharing your code in this way also makes it simple to keep track of the different versions of your code and to see how your work has evolved.</p>
<p>The following illustration shows the tracking of very simple changes in a Python file. The previous version of the code is shown on the left; the new version is shown on the right. Additions and deletions are highlighted in green and red, and with <code>+</code> and <code>-</code> symbols, respectively.</p>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/Git-example.png" class="img-fluid" alt="A simple example of GitHub version tracking, showing how changes to a file are tracked and highlighted"></p>
<div class="figure-caption">
<p>A simple example of GitHub version tracking.</p>
</div>
</section>
</section>
<section id="the-deep-learning-challenge" class="level2">
<h2 class="anchored" data-anchor-id="the-deep-learning-challenge">The deep learning challenge</h2>
<p>So far, this article has dealt with barriers to reproducibility – and ways around them – that will apply to most, if not all, modern research projects. While I’d encourage any scientist to adopt these practices in their own work, it is important to stress that these alone cannot guarantee reproducibility. In cases where standard statistical procedures are used within statistical software packages, reproducibility is often achievable. However, in reality, even when following the same procedures, differences in outputs can occur, and identifying the reasons for this may be challenging. Cooking offers a simple analogy: subtle changes in room temperature or ingredient quality from one day to the next can impact the final product.</p>
<p>One of the challenges for research projects employing machine learning and deep learning algorithms is that outputs can be influenced by the randomness that is inherent in these approaches. Consider the four portraits below, generated by the Midjourney bot.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/DL_bkg.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Four portraits rendered by generative AI. Each portrait looks broadly similar, but there are noticeable differences in facial features and the abstract patterns overlaid on the portraits"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Credit:</strong> Discord software, Midjourney bot.</p>
</div>
<p>Each portrait looks broadly similar at first glance. However, upon closer inspection, critical differences emerge. These differences arise because deep learning models rely on numerous interconnected layers to learn intricate patterns and representations. Slight random perturbations, such as initial parameter values or changes in data samples, can propagate through the network, leading to different decisions during the learning process. As a result, even seemingly negligible randomness can amplify and manifest as considerable differences in the final output, as with the distinct features of the portraits.</p>
<p>Randomness is not necessarily a bad thing – it mitigates overfitting and helps predictions to be generalised. However, it does present an additional barrier to reproducibility. If you cannot get the same results using the same raw materials – data, code, packages and computing environment – then you might have good reasons to doubt the validity of the findings.</p>
<p>There are many elements of an analysis in which randomness may be present and lead to different results. For example, in a classification (where your dependent variable is binary, e.g., success/failure coded as 1 and 0) or a regression (where your dependent variable is continuous, e.g., temperature measurements of 10.1°C, 2.8°C, etc.), you might need to split your data into training and testing sets. The training set is used to estimate the model (hyper)parameters and the testing set is used to compute the performance of the model. The split is usually operationalised as a random selection of rows of your data. So, in principle, each time you split your data into training and testing sets, you may end up with different rows in each set. Differences in the training set may therefore lead to different values of the model (hyper)parameters and affect the predictive performance that is measured from the testing set. Likewise, differences in the testing set may lead to variations in the predictive performance scores, which in turn lead to potentially different interpretations and, ultimately, different decisions if the results are used for that purpose.</p>
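<p>To make this concrete, here is a minimal sketch of a seeded train/test split using only <code>numpy</code> (scikit-learn users would typically pass a <code>random_state</code> to <code>train_test_split</code> instead); the helper function and data size are illustrative:</p>

```python
import numpy as np

def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Return reproducible (train, test) row indices for a dataset of n_rows."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)  # seeded shuffle of row indices
    n_test = int(n_rows * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = train_test_split_indices(100, seed=0)
# The same seed always yields the same split, machine to machine
train_again, test_again = train_test_split_indices(100, seed=0)
print((test_idx == test_again).all())  # True
```

With the seed fixed, anyone rerunning the analysis trains and evaluates on exactly the same rows.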
<p>This aspect of randomness in the training of models is relatively well known. But randomness may hide in other parts of code. One such example is illustrated below. Here, using Python, we set the seed to 0 using <code>np.random.seed(0)</code>. The <code>random.seed()</code> function from the package <code>numpy</code> (abbreviated <code>np</code>) fixes the state of the random number generator so that it creates identical random numbers independently of the machine you use, for any number of executions. A seed value is an initial input, or starting point, used by a pseudorandom number generator to generate a sequence of random numbers. It is often an integer or a timestamp. The generator takes this seed value and uses it to produce a deterministic series of numbers that appear to be random but can be recreated by using the same seed value. Without a seed value, the generator is typically initialised from the current system time. The animation below generates two random arrays, <code>arr1</code> and <code>arr2</code>, using <code>np.random.rand(3,2)</code>, where the values <code>3,2</code> indicate that we want an array with 3 rows and 2 columns.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb9-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb9-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb9-3"></span>
<span id="cb9-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Set the seed number e.g. to 0</span></span>
<span id="cb9-5">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb9-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate random array</span></span>
<span id="cb9-7">arr1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb9-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print("Array 1:")</span></span>
<span id="cb9-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print(arr1)</span></span>
<span id="cb9-10"></span>
<span id="cb9-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Set the seed number as before to get the same results</span></span>
<span id="cb9-12">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb9-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate another random array</span></span>
<span id="cb9-14">arr2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb9-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print("\nArray 2:")</span></span>
<span id="cb9-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print(arr2)</span></span>
<span id="cb9-17"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/randomisation-with-seed.gif" class="img-fluid"></p>
<p>If you run the code yourself multiple times, the values of <code>arr1</code> and <code>arr2</code> should remain identical. If this is not the case, check that the seed value is set to 0 in lines 4 and 11. These identical results are possible because we set the seed value to 0, which ensures that the random number generator produces the same sequence of numbers each time the code is run. Now, let’s look at what happens if we remove the line <code>np.random.seed(0)</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb10-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb10-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Generate random array</span></span>
<span id="cb10-3">arr1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb10-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print("Array 1:")</span></span>
<span id="cb10-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print(arr1)</span></span>
<span id="cb10-6"></span>
<span id="cb10-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Generate another random array</span></span>
<span id="cb10-8">arr2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb10-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print("\nArray 2:")</span></span>
<span id="cb10-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print(arr2)</span></span>
<span id="cb10-11"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/randomisation-without-seed.gif" class="img-fluid"></p>
<p>Here, the values of <code>arr1</code> and <code>arr2</code> will be different each time we run the code since the seed value was not set and is therefore changing over time.</p>
<p>This short example demonstrates how randomness that is controlled by the seed value can affect your code. Unless randomness is required, e.g., to capture some uncertainty in the results, setting the seed value will contribute to making your work reproducible. I also find it helpful to document the seed number I use in my code so that I can easily reproduce my findings in the future. If you are currently working on code that involves random number generators, it may be worth checking it and making any necessary changes. In our work (see code chunk 9 in <a href="https://github.com/dsvanidze/replicability/blob/master/notebooks/main.ipynb">the Jupyter notebook</a>) we set the seed value in a general way, using a framework (config), so that our code always uses the same seed to train our algorithm.</p>
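<p>The general idea of such a config-based setup can be sketched as follows (the <code>CONFIG</code> dictionary and helper name here are illustrative, not the exact code from the notebook):</p>

```python
import random

import numpy as np

# One central place for every setting that affects reproducibility
CONFIG = {"seed": 0}

def set_all_seeds(seed):
    """Seed every random number generator the project uses, in one place."""
    random.seed(seed)     # Python's built-in generator
    np.random.seed(seed)  # numpy's global generator
    # Frameworks have their own, e.g. torch.manual_seed(seed) for PyTorch

set_all_seeds(CONFIG["seed"])
print(np.random.rand(2, 2))
```

Keeping the seed in a single config both documents it and guarantees that every part of the pipeline draws from the same, reproducible state.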
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>We hope you have enjoyed learning more about our quest for reproducibility. We have explained why reproducibility matters and provided tips for how to achieve it – or, at least, work towards it. We have also introduced a few important issues that you are likely to encounter on your own path to reproducibility. In summary, we have covered:</p>
<ul>
<li>The importance of having relative instead of hard-coded paths in code.</li>
<li>Operating system compatibility issues, which can be solved by using Docker containers for a consistent computing environment.</li>
<li>The convenience of Jupyter notebooks for code editing – particularly useful for data science projects and work using deep learning because of the ability to include text and code in the same document and make the work accessible to everyone (so long as they have an internet connection).</li>
<li>The need for version control using, for example, Git and GitHub, which allows you to keep track of changes in your code and collaborate with others efficiently.</li>
<li>The importance of setting the seed values in random number generators.</li>
</ul>
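To make the first point concrete, here is a minimal sketch (the directory and file names are hypothetical) of building a path relative to a project root with Python's <code>pathlib</code>, instead of hard-coding an absolute path:

```python
from pathlib import Path

# Hard-coded, machine-specific path -- breaks on anyone else's computer:
# data_path = Path("/Users/alice/projects/replicability/data/input.csv")

# Relative path, built from a root the code determines at runtime.
# In a script, Path(__file__).resolve().parent is a common choice of root;
# here we use the current working directory so the snippet runs anywhere.
project_root = Path.cwd()
data_path = project_root / "data" / "input.csv"
print(data_path)
```

Because the root is computed rather than written out, the same code works wherever the repository is cloned, which is exactly what a Docker container or a collaborator's machine needs.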
<p>The graphic below provides a visual overview of the different components of our study and shows how each component works with the others to support reproducibility.</p>
<p><a href="images/docker-workflow.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/docker-workflow.png" class="img-fluid" alt="A diagrammatic overview of the interlinking systems and processes created by the authors to allow their research to be reproduced"></a></p>
<p>We use (A) the version control system, Git, and its hosting service, GitHub, which enables a team to share code with peers, efficiently track and synchronise code changes between local and server machines, and reset the project to a working state in case something breaks. Docker containers (B) include all necessary objects (engine, data, and scripts). Docker needs to be installed (plain-line arrows) by all users (project leader, collaborator(s), reviewer(s), and public user(s)) on their local machines (C); and (D) we use a user-friendly interface (JupyterLab) deployed from a local machine to facilitate the operations required to reproduce the work. The project leader and collaborators can edit (upload/download) the project files stored on the GitHub server (plain-line arrows) while reviewers and public users can only read the files (dotted-line arrows).</p>
<p>Now, it is over to you. Our <a href="https://github.com/dsvanidze/replicability/blob/master/notebooks/main.ipynb">Jupyter notebook</a> provides a walkthrough of our research. Our <a href="https://github.com/dsvanidze/replicability">GitHub repository</a> has all the data, code and other files you need to reproduce our work, and this <a href="https://github.com/dsvanidze/replicability#readme">README file</a> will help you get started.</p>
<p>And with that, we wish you all the best on the road to reproducibility!</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Davit Svanidze</strong> is a master’s degree student in economics at the London School of Economics (LSE). <strong>Andre Python</strong> is a young professor of statistics at Zhejiang University’s Center for Data Science. <strong>Christoph Weisser</strong> is a senior data scientist at BASF. <strong>Benjamin Säfken</strong> is professor of statistics at TU Clausthal. <strong>Thomas Kneib</strong> is professor of statistics and dean of research at the Faculty of Business and Economic Sciences at Goettingen University. <strong>Junfen Fu</strong> is professor of pediatrics, chief physician and director of the Endocrinology Department of Children’s Hospital, Zhejiang University, School of Medicine.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-12">
<dl>
<dt>Acknowledgement</dt>
<dd>
Andre Python has been funded by the National Natural Science Foundation of China (82273731), the National Key Research and Development Program of China (2021YFC2701905) and Zhejiang University global partnership fund (188170-11103).
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Davit Svanidze, Andre Python, Christoph Weisser, Benjamin Säfken, Thomas Kneib, and Junfen Fu.
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Svanidze, Davit, Andre Python, Christoph Weisser, Benjamin Säfken, Thomas Kneib, and Junfen Fu. 2023. “The road to reproducible research: hazards to avoid and tools to get you there safely.” Real World Data Science, June 15, 2023. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/road-to-reproducible-research.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">References</h2>

<ol>
<li id="fn1"><p>Peng, Roger D. 2011. “Reproducible Research in Computational Science.” <em>Science</em> 334 (6060): 1226–1227.↩︎</p></li>
<li id="fn2"><p>Ioannidis, John P. A., Sander Greenland, Mark A. Hlatky, Muin J. Khoury, Malcolm R. Macleod, David Moher, Kenneth F. Schulz, and Robert Tibshirani. 2014. “Increasing Value and Reducing Waste in Research Design, Conduct, and Analysis.” <em>The Lancet</em> 383 (9912): 166–175.↩︎</p></li>
<li id="fn3"><p>Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” <em>Science</em> 349 (6251): aac4716.↩︎</p></li>
<li id="fn4"><p>Baker, Monya. 2016. “Reproducibility Crisis?” <em>Nature</em> 533 (26): 353–366.↩︎</p></li>
<li id="fn5"><p>Camerer, Colin F., Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Gideon Nave, Brian A. Nosek, Thomas Pfeiffer, <em>et al</em>. 2018. “Evaluating the Replicability of Social Science Experiments in <em>Nature</em> and <em>Science</em> between 2010 and 2015.” <em>Nature Human Behaviour</em> 2: 637–644.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Deep learning</category>
  <category>Reproducibility</category>
  <category>Coding</category>
  <category>Collaboration</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/road-to-reproducible-research.html</guid>
  <pubDate>Thu, 15 Jun 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/computer-intro.png" medium="image" type="image/png" height="105" width="144"/>
</item>
</channel>
</rss>
