Food for Thought: The value of competitions for confidential data

The Food for Thought Challenge attracted new eyes from computer science and data science to think about how to address a critical real-world data linkage problem. And, by identifying different ways of addressing the same problem, it created an environment for innovative new ideas.

Machine learning
Natural language processing
Public policy
Health and wellbeing
Authors

Steven Bedrick, Ophir Frieder, Julia Lane, and Philip Resnik

Published

August 21, 2023

We are witnessing a sea change in data collection practices by both governments and businesses – from purposeful collection (through surveys and censuses, for example) to opportunistic (drawing on web and social media data, and administrative datasets). This shift has made clear the importance of record linkage – a government might, for example, look to link records held by its various departments to understand how citizens make use of the gamut of public services.

However, creating manual linkages between datasets can be prohibitively expensive, time consuming, and subject to human constraints and bias. Machine learning (ML) techniques offer the potential to combine data better, faster, and more cheaply. But, as the recently released National AI Research Resource Task Force report highlights, it is important to take an open and transparent approach to ensure that unintended biases do not creep in.
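To give a flavor of what automated linkage involves, here is a minimal, hypothetical sketch in Python: it links two toy datasets on a fuzzy name comparison, using the standard library's string matcher as a stand-in for the far more sophisticated models that competition teams actually built. The records, field names, and threshold are all invented for illustration.

```python
from difflib import SequenceMatcher

# Hypothetical toy records; real applications involve large, messy,
# and often confidential datasets.
survey_records = [
    {"id": "S1", "name": "Acme Foods Inc.", "city": "Portland"},
    {"id": "S2", "name": "Beta Grocers", "city": "Salem"},
]
admin_records = [
    {"id": "A1", "name": "ACME Foods Incorporated", "city": "Portland"},
    {"id": "A2", "name": "Gamma Wholesale", "city": "Eugene"},
]

def name_similarity(a: str, b: str) -> float:
    """Crude string similarity in [0, 1]; a stand-in for a trained matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

THRESHOLD = 0.6  # chosen arbitrarily for this illustration

links = []
for s in survey_records:
    for a in admin_records:
        # "Blocking" on city keeps the number of comparisons manageable.
        if s["city"] != a["city"]:
            continue
        score = name_similarity(s["name"], a["name"])
        if score >= THRESHOLD:
            links.append((s["id"], a["id"], round(score, 2)))

print(links)  # [('S1', 'A1', 0.74)]
```

Even in this toy setting, small decisions (the similarity threshold, the choice to block on city) change which links are made, which is exactly where domain judgement is needed.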

In other words, ML tools are not a substitute for thoughtful analysis. Both private and public producers of a linked dataset have to determine the required level of linkage quality: what precision/recall tradeoff is best for the intended purpose (that is, the balance between false-positive links and missed links that should have been made), how much processing time and cost is acceptable, and how to address coverage issues. The challenge is made more difficult by the idiosyncrasies of heterogeneous datasets, and more difficult still when the datasets to be linked include confidential data (Christensen and Miguel 2018; Christen, Ranbaduge, and Schnell 2020).
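To make that tradeoff concrete, here is a minimal sketch, with entirely made-up predicted and true link sets, of how precision and recall might be computed for a single linkage run. Raising a matcher's threshold typically trades recall for precision, and vice versa.

```python
# Hypothetical example: links proposed by a matcher vs. a hand-curated
# ground truth. Precision penalizes false-positive links; recall penalizes
# true links the matcher failed to find.
predicted_links = {("S1", "A1"), ("S2", "A7"), ("S3", "A3")}
true_links = {("S1", "A1"), ("S3", "A3"), ("S4", "A9")}

true_positives = predicted_links & true_links

precision = len(true_positives) / len(predicted_links)  # 2/3, about 0.67
recall = len(true_positives) / len(true_links)           # 2/3, about 0.67

# The F-measure combines the two, though Hand and Christen (2018) caution
# against relying on it uncritically for record linkage evaluation.
f_measure = 2 * precision * recall / (precision + recall)

print(f"precision={precision:.2f} recall={recall:.2f} F={f_measure:.2f}")
```

In practice the ground truth is itself expensive to build and may over-represent easy cases, which is one reason the interpretation of such scores needs domain expertise rather than a single headline number.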

And, of course, an ML solution is never the end of the road: many data linkage scenarios are highly dynamic, involving use cases, datasets, and technical ecosystems that change and evolve over time, so effective use of ML in practice requires ongoing investment (Koch et al. 2021). Because techniques are constantly improving, producers need to keep abreast of new approaches. A model that works well today may no longer work in a year, because the data have changed or because organizational needs have shifted so that a certain type of error is no longer acceptable. As Sculley et al. point out, “it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning” (Sculley et al. 2014).

It is also important that record linkage not be seen as a purely technical problem, relegated to the realm of computer scientists to solve. The full engagement of domain experts in designing the optimization problem, identifying measures of success, and evaluating the quality of the results is absolutely critical, as is building an understanding of the pros and cons of different measures (Schafer et al. 2021; Hand and Christen 2018). Achieving successful outcomes will require much learning by doing in “sandbox” environments, and back-and-forth communication across communities, as noted in the recommendations of the Advisory Committee on Data for Evidence Building (a screenshot of which is shown in Figure 1).

Figure 1: A recommendation for building an “innovation sandbox” as part of the creation of a new National Secure Data Service in the United States.

Despite the importance of trial and error and of transparency about linkage quality, there is no handbook that guides domain experts in how to design such sandboxes. There is a very real need for agreed-upon, domain-independent guidelines or, better yet, official standards for evaluating sandboxes. Those standards would define who could and would conduct the evaluation, and would help guarantee independence and repeatability. And while innovation challenges have been embraced by the federal government, the devil can be very much in the details (Williams 2012).

It is for this reason that the approach taken in the Food for Thought linkage competition, and described in this compendium, provides an important first step towards a well-specified, replicable framework for achieving high-quality outcomes. In that respect it joins other recent efforts to bring together community-level research on shared sensitive data (MacAvaney et al. 2021; Tsakalidis et al. 2022). This competition, like those, helped bring to the foreground both the opportunities and the challenges of doing research in secure sandboxes with sensitive data. Notably, these exercises highlight a cultural tension between secure, managed environments on the one hand and unfettered machine learning research on the other. The need for flexibility and agility in computational research bumps up against the need for advance planning and careful step-by-step processes in environments with well-defined data governance rules; one of the key lessons learned is that these tradeoffs need to be recognized and planned for.

This particular competition was important for a number of other reasons. Its organization as a competition, complete with prizes and bragging rights for strongly performing teams, attracted new eyes from computer science and data science to a critical real-world linkage problem. It offered the potential to produce approaches that were scalable, transparent, and reproducible. The engagement of domain experts and statisticians means it will be possible to conduct an informed error analysis, to relate the performance metrics in the task explicitly to the problem being solved in the real world, and to bring in the expertise of survey methodologists to think about possible adjustments. And because it identified different approaches to addressing the same problem, it created an environment for innovative new ideas.

More generally, beyond the excitement of the new approaches, this exercise laid bare the fragility of linkages and highlighted the importance of secure sandboxes for confidential data. While the promise of privacy-preserving technologies is alluring as an alternative to bringing confidential data together in one place, such approaches are likely too immature to deploy ad hoc until we better understand how to translate real-world problems and their associated data into well-defined tasks, how to measure quality, and, in particular, how to assess the impact of match quality on different subgroups (Domingo-Ferrer, Sánchez, and Blanco-Justicia 2021). The scientific profession has gone through too painful a lesson with the premature application of differential privacy techniques to ignore what can be learned from a careful and systematic analysis of different approaches (Domingo-Ferrer, Sánchez, and Blanco-Justicia 2021; Van Riper et al. 2020; Ruggles et al. 2019; Giles et al. 2022).

We hope that the articles in this collection provide not only the first steps towards a handbook of best practices, but also an inspiration to share lessons learned, so that success can be emulated, and failures understood and avoided.

About the authors
Steven Bedrick is an associate professor in Oregon Health and Science University’s Department of Medical Informatics and Clinical Epidemiology.

Ophir Frieder is a professor in Georgetown University’s Department of Computer Science, and in the Department of Biostatistics, Bioinformatics & Biomathematics at Georgetown University Medical Center.

Julia Lane is a professor at the NYU Wagner Graduate School of Public Service and an NYU Provostial Fellow for Innovation Analytics. She co-founded the Coleridge Initiative.

Philip Resnik holds a joint appointment as professor in the University of Maryland Institute for Advanced Computer Studies and the Department of Linguistics, and an affiliate professor appointment in computer science.

Copyright and licence
© 2023 Steven Bedrick, Ophir Frieder, Julia Lane, and Philip Resnik

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence. Thumbnail photo by Alexandru Tugui on Unsplash.

How to cite
Bedrick, Steven, Ophir Frieder, Julia Lane, and Philip Resnik. 2023. “Food for Thought: The value of competitions for confidential data.” Real World Data Science, August 21, 2023. URL

References

Christen, P., T. Ranbaduge, and R. Schnell. 2020. Linking Sensitive Data - Methods and Techniques for Practical Privacy-Preserving Information Sharing. Springer. https://doi.org/10.1007/978-3-030-59706-1.
Christensen, G., and E. Miguel. 2018. “Transparency, Reproducibility, and the Credibility of Economics Research.” Journal of Economic Literature 56 (3): 920–80. https://doi.org/10.1257/jel.20171350.
Domingo-Ferrer, J., D. Sánchez, and A. Blanco-Justicia. 2021. “The Limits of Differential Privacy (and Its Misuse in Data Release and Machine Learning).” Communications of the ACM 64 (7): 33–35. https://doi.org/10.1145/3433638.
Giles, O., K. Hosseini, G. Mingas, O. Strickson, L. Bowler, C. Rangel Smith, H. Wilde, et al. 2022. “Faking Feature Importance: A Cautionary Tale on the Use of Differentially-Private Synthetic Data.” https://arxiv.org/abs/2203.01363.
Hand, D., and P. Christen. 2018. “A Note on Using the f-Measure for Evaluating Record Linkage Algorithms.” Statistics and Computing 28 (3): 539–47. https://doi.org/10.1007/s11222-017-9746-6.
Koch, B., E. Denton, A. Hanna, and J. G. Foster. 2021. “Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research.” CoRR abs/2112.01716. https://arxiv.org/abs/2112.01716.
MacAvaney, S., A. Mittu, G. Coppersmith, J. Leintz, and P. Resnik. 2021. “Community-Level Research on Suicidality Prediction in a Secure Environment: Overview of the CLPsych 2021 Shared Task.” In Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access, 70–80. Online: Association for Computational Linguistics. https://doi.org/10.18653/v1/2021.clpsych-1.7.
Ruggles, S., C. Fitch, D. Magnuson, and J. Schroeder. 2019. “Differential Privacy and Census Data: Implications for Social and Economic Research.” AEA Papers and Proceedings 109 (May): 403–8. https://doi.org/10.1257/pandp.20191107.
Schafer, K. M., G. Kennedy, A. Gallyer, and P. Resnik. 2021. “A Direct Comparison of Theory-Driven and Machine Learning Prediction of Suicide: A Meta-Analysis.” PLOS ONE 16 (4): 1–23. https://doi.org/10.1371/journal.pone.0249833.
Sculley, D., G. Holt, D. Golovin, E. Davydov, T. Phillips, D. Ebner, V. Chaudhary, and M. Young. 2014. “Machine Learning: The High Interest Credit Card of Technical Debt.” In SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop).
Tsakalidis, A., J. Chim, I. M. Bilal, A. Zirikly, D. Atzil-Slonim, F. Nanni, P. Resnik, et al. 2022. “Overview of the CLPsych 2022 Shared Task: Capturing Moments of Change in Longitudinal User Posts.” In Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology, 184–98. Seattle, USA: Association for Computational Linguistics. https://doi.org/10.18653/v1/2022.clpsych-1.16.
Van Riper, D., T. Kugler, J. Schroeder, and S. Ruggles. 2020. “Differential Privacy and Racial Residential Segregation.” In 2020 APPAM Fall Research Conference.
Williams, H. 2012. “Innovation Inducement Prizes: Connecting Research to Policy.” Journal of Policy Analysis and Management 31 (3): 752–76. http://www.jstor.org/stable/41653827.