Advancing Data Science in Official Statistics – The Policy Problem

In this article, we introduce a new approach for the US Census Bureau to produce statistical products using survey, administrative, procedural and opportunity data. This is the first in a series of four articles describing the development of the Curated Data Enterprise Framework.

Public Policy
Data Analysis
Data Integration
Curation
Statistical Products
Author

Sallie Keller, Stephanie Shipp, Vicki Lancaster and Joseph Salvo
University of Virginia

Published

November 1, 2024

Acknowledgments: This research was sponsored by the:
United States Census Bureau Agreement No. 01-21-MOU-06 and
Alfred P. Sloan Foundation Grant No. G-2022-19536



The views expressed in this artice are those of the authors and not the Census Bureau.

Introduction

Two centuries ago, when the Framers of the US Constitution laid the cornerstone for the federal statistical system, they could not have imagined the complexity of questions future generations would want to ask or the variety of data sources available to address them. Back in 1787, counting the population and apportioning state seats in the House of Representatives were the most urgent tasks before the young nation, and so a requirement for a decennial census was written into the Constitution. Now, 233 years later, the census continues to serve its original purpose – but purposes and uses for census data have exploded.

Questions we now seek to answer go beyond what the census (or surveys) alone can hope to address. Even with the multitude of other surveys commissioned by today’s US Census Bureau, researchers and policymakers find themselves looking to novel sources of data – from structured numeric data in traditional databases to unstructured text documents scraped from the internet – to explore issues such as understanding how prepared nursing homes and communities are for extreme climate events,eg, hurricanes, wildfires, or floods. Wrangling these sources with traditionally designed data, such as censuses and surveys, can fill data gaps, improve the quality and usefulness of statistical products, speed up their dissemination, and inspire the creation of new types of statistical products.

That is the impetus for developing the Curated Data Enterprise (CDE), an innovation in data science aimed at creating statistical products from all data types and building the infrastructure to support them. The Curated Data Enterprise, as the name implies, includes an end-to-end curation model to capture the complete statistical product development process. The CDE is designed to enable data discovery and retrieval, data quality assessment across multiple and diverse sources of information, and the reuse of data and models over time to accelerate statistical product development. The US Census Bureau has partnered with the University of Virginia, a working group of former Census Bureau Directors, a Communication Director, and university, non-profit and industry experts to develop this approach.

The US Census Bureau

The US Census Bureau provides the latest official statistics, facts, and figures about America’s people, places, and economy. It collects data for 130 surveys annually and the decennial census that gives the Bureau its name. The US Census Bureau collects data from households, businesses, governments and non-profit organizations. For each survey, tabulations and margins of error are published in news releases and reports. Public-use microdata subject to disclosure rules are provided for household and demographic surveys. Microdata for economic and household surveys, without disclosure rules applied, are accessible to researchers through the Federal Statistical Research Data Centers.

Statistical agencies in other countries are also modernizing their surveys and statistical product development. See a summary of selected countries (Lanman, Davis, and Shipp 2023).

A new approach

To realize the CDE vision, the development of statistical products will address stakeholder questions using all data types – designed surveys and censuses, public and private administrative data, opportunity data scraped from the internet and procedural data (Keller et al. 2022). This new approach aligns with the US Census Bureau’s modernization and transformation (Thieme 2022) while maintaining the fundamental responsibilities of statistical agencies (OMB 2023). It is also consistent with a conclusion by the NASEM Panel on the Implications of Using Multiple Data Sources for Major Survey Programs: ‘The quality of statistics produced from multiple data sources depends on properties of the individual sources as well as the methods used to combine them. A new framework of quality standards and guidelines is needed to evaluate such data sources’ fitness for use’ (NASEM 2023, 192).

The CDE approach provides such a framework to address many of the challenges that official statistics face today, as well as demonstrate that they are poised to adopt a new approach to producing official statistics. For example:

  • The timeliness and frequency of our official statistics are insufficient when there are shocks to the economy, such as the Covid-19 pandemic, when retrospective survey data were of limited usefulness. Federal agencies responded during the pandemic with relevance and agility by creating and launching fast-response Household Pulse Surveys that met immediate needs for data, trading off timeliness for quality (Groshen 2021). Public engagement and support for these new relevant and timely data products at a time of crisis were essential to the success of this new statistical product.

  • The policy environment has responded to technological, social, and survey changes by encouraging efficient use of existing data, reuse, sharing and furthering open data principles. Researchers are now creating innovative statistical products using multiple data sources to better address the US’s needs and interests. The Commission on Evidence-Based Policymaking (Abraham et al. 2018) and the Federal Data Strategy (“Federal Data Strategy, Leveraging Data as a Strategic Asset” 2021) recommendations encourage agencies to permit access to data to undertake evaluation and research studies.

  • Techniques such as rapid scanning, text recognition, user-friendly uploads, and new devices, sensors, and systems can now record and transcribe data in real time. Using these techniques, governments and corporations now routinely and instantaneously collect and store data on behaviors and states as varied as purchase transactions, climate and road conditions, healthcare plan utilization, and land use and zoning. Extensive digitization and recording, better system connectedness and interactivity, and increased human-computer interaction can result in faster data accumulation, enhancing the usability of private and public administrative data while maintaining privacy and confidentiality (Brady 2019; Jarmin 2019).  

  • New techniques and data sources can transform statistical agencies ‘from the 20th-century survey-centric model to a 21st-century model that blends structured survey data with administrative and unstructured alternative digital data sources’, leading to better measures of the gig economy, retail sales, healthcare, workforce, and tools and methods to integrate multiple data sources while maintaining privacy and confidentiality (Jarmin 2019).

The next three articles in this series will:

  • provide an overview of the CDE and its corresponding framework

  • put the CDE Framework into practice through a demonstration use case on the resilience of skilled nursing facilities

  • describe our next steps for developing the CDE through a use case research program.

About the authors

Sallie Keller is the Chief Scientist and Associate Director of Research and Methodology at the US Census Bureau. She is a statistician with research interest in social and decision informatics, statistics underpinnings of data science, and data access and confidentiality. Sallie Keller was at the University of Virginia when this work was conducted.

Stephanie Shipp leads the Curated Data Enterprise research portfolio and collaborates with the US Census. She is an economist with experience in data science, survey statistics, public policy, innovation, ethics, and evaluation.

Vicki Lancaster is a statistician with expertise in experimental design, linear models, computation, visualizations, data analysis, and interpretation. She works with scientists at federal agencies on projects requiring statistical skills and creativity, eg, defining skilled technical workforce using novel data sources.

Joseph Salvo is a demographer with experience in US Census Bureau statistics and data. He makes presentations on demographic subjects to a wide range of groups about managing major demographic projects involving the analysis of large data sets for local applications.

Copyright and licence

© 2024 Stephanie Shipp

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Thumbnail photo by Lukas Blazek on Unsplash.

How to cite

Keller S, Shipp S, Lancaster V, Salvo J (2024). “Advancing Data Science in Official Statistics: The Policy Problem.” Real World Data Science, November 01, 2024. URL

References

Abraham, Katherine G, Ron Haskins, Sherry Glied, Robert M Groves, Robert Hahn, Hilary Hoynes, and KR Wallin. 2018. “The Promise of Evidence-Based Policymaking: Report of the Commission on Evidence-Based Policymaking.” Washington, DC: Commission on Evidence-Based Policymaking. https://www.cep.gov/ content/dam/cep/report/cep-final-report.pdf.
Brady, Henry E. 2019. “The Challenge of Big Data and Data Science.” Annual Review of Political Science 22: 297–323. https://www.annualreviews.org/doi/abs/10.1146/annurev-polisci-090216-023229.
“Federal Data Strategy, Leveraging Data as a Strategic Asset.” 2021. 2021. https://strategy.data.gov/.
Groshen, Erica L. 2021. “The Future of Official Statistics.” Harvard Data Science Review 3 (4). https://doi.org/10.1162/99608f92.591917c6.
Jarmin, Ron S. 2019. “Evolving Measurement for an Evolving Economy: Thoughts on 21st Century US Economic Statistics.” Journal of Economic Perspectives 33 (1): 165–84.
Keller, Sallie, Kenneth Prewitt, John Thompson, Steve Jost, Christopher Barrett, Sarah Nusser, Joseph Salvo, and Stephanie Shipp. 2022. “A 21st Century Census Curated Data Enterprise. A Bold New Approach to Create Official Statistics. Technical Report.” Proceedings of the Biocomplexity Institute BI-2022-1115: 297–323. https://doi.org/10.18130/r174-yk24.
Lanman, Kathryn, Olivia Davis, and Stephanie Shipp. 2023. “What Can We Learn from Other Countries about How They Are Using Administrative Data to Supplement, Enhance, or Create New Data Products?” Proceedings of the Biocomplexity Institute. https://doi.org/10.18130/2n54-sc22.
NASEM. 2023. “Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources.” National Academies of Science, Engineering, and Medicine. https://doi.org/10.17226/26804.
OMB. 2023. “Fundamental Responsibilities of Recognized Statistical Agencies and Units.” Federal Register: The Daily Journal of the US Government, 56708–44. https://www.federalregister.gov/documents/2023/08/18/2023-17664/fundamental-responsibilities-of-recognized-statistical-agencies-and-units .
Thieme, Michael. 2022. “Technology Transformations at the Census Bureau: Building a Modern, Data-Centric Ecosystem.” hhttps://www.census.gov/newsroom/blogs/research-matters/2022/10/technology-transformation.html .