<?xml version="1.0" encoding="UTF-8"?>
<rss  xmlns:atom="http://www.w3.org/2005/Atom" 
      xmlns:media="http://search.yahoo.com/mrss/" 
      xmlns:content="http://purl.org/rss/1.0/modules/content/" 
      xmlns:dc="http://purl.org/dc/elements/1.1/" 
      version="2.0">
<channel>
<title>Real World Data Science</title>
<link>https://realworlddatascience.net/applied-insights/case-studies/</link>
<atom:link href="https://realworlddatascience.net/applied-insights/case-studies/index.xml" rel="self" type="application/rss+xml"/>
<description></description>
<image>
<url>https://realworlddatascience.net/images/rwds-logo-150px.png</url>
<title>Real World Data Science</title>
<link>https://realworlddatascience.net/applied-insights/case-studies/</link>
<height>83</height>
<width>144</width>
</image>
<generator>quarto-1.9.37</generator>
<lastBuildDate>Wed, 11 Feb 2026 00:00:00 GMT</lastBuildDate>
<item>
  <title>Understanding and Addressing Algorithmic Bias: a Credit Scoring Case Study</title>
  <dc:creator>Devin Partida</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/algorithmic_bias_credit_scoring.html</link>
  <description><![CDATA[ 





<p>When you apply for a credit card or a loan, algorithms work in the background to determine financial worthiness. Despite increasing advancements, these are <a href="https://hai.stanford.edu/news/how-flawed-data-aggravates-inequality-credit">still imperfect</a> due to inherent biases. As data science students and professionals, you’ll inevitably face similar issues relating to biased data sets and should know how to combat them. What are some of the most effective techniques, and why do they matter?</p>
<section id="the-critical-issue-of-algorithmic-bias-in-credit-scoring-models" class="level2">
<h2 class="anchored" data-anchor-id="the-critical-issue-of-algorithmic-bias-in-credit-scoring-models">The Critical Issue of Algorithmic Bias in Credit Scoring Models</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/images/thumbcredit.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>One of the most concerning aspects of algorithmic bias in credit scoring is that adversely affected parties usually have little or no recourse for appealing unfavorable decisions. Most widely used algorithms still cannot explain how they reached a specific decision, leaving applicants in the dark and forcing them to trust opaque technology, even as it potentially ruins lives. This challenge underscores the need for greater transparency in <a href="https://www.psecu.com/learn/whats-in-a-credit-score">how credit scoring models are developed</a> and deployed, and it makes proactive mitigation strategies essential to ensuring fairness and equity in financial outcomes.</p>
<p>A November 2025 academic review <a href="https://giesbusiness.illinois.edu/news/2025/11/12/new-research-reveals-widespread-bias--inefficiency-in-credit-scoring-and-mortgage-lending">revealed numerous flaws in financial algorithms</a> and confirmed various impacts. They included systematic disadvantages for minority groups and miscalibrated credit scores for individual borrowers. The researchers also discovered that these issues appeared despite the financial technology industry’s promises of superior efficiency.</p>
<p>One of the cited studies consistently showed that female applicants received credit scores six to eight points lower than their male counterparts. The researchers determined that the associated effects diminished economic welfare and that the ramifications continued for multiple borrowing cycles. Another investigation revealed persistent disparities across minority groups, regardless of the applicant’s chosen lender type.</p>
<p>Elsewhere, researchers examined the effects of using large language models to evaluate applicants’ loan data. This approach regularly <a href="https://news.lehigh.edu/ai-exhibits-racial-bias-in-mortgage-underwriting-decisions">recommended charging higher interest rates</a> to Black applicants or denying their applications. It did not make the same suggestions for identical white applicants. These examples demonstrate why data science professionals must remain constantly aware of the potential for bias and uphold fairness by mitigating it wherever possible.</p>
</section>
<section id="practical-tips-for-bias-detection-and-mitigation" class="level2">
<h2 class="anchored" data-anchor-id="practical-tips-for-bias-detection-and-mitigation">Practical Tips for Bias Detection and Mitigation</h2>
<p>Sources of bias in credit scoring data and algorithms are more common than you might think. They can include:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/images/infographic.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>One straightforward way to identify bias in training data is to be aware of the most common types and build algorithms to be less reliant on them when possible. Regularly reviewing the training data is similarly effective because it can catch biases before they have real-life effects.</p>
<p>You can also perform a disparate impact analysis to mitigate bias, which compares outcome rates for a group with fewer privileges or less representation against those of a reference group. Dividing the proportion of one group that received an adverse outcome by the corresponding proportion for the other group yields a ratio; the further that ratio departs from parity, the stronger the evidence of bias.</p>
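<p>As a concrete illustration, this kind of check takes only a few lines. The following is a minimal sketch, assuming binary approve/deny decisions and comparing favorable-outcome rates against the common “four-fifths rule” threshold; the group data and function names are hypothetical.</p>

```python
def favorable_rate(outcomes):
    """Share of applicants who received a favorable (approved) outcome."""
    return sum(1 for o in outcomes if o == "approved") / len(outcomes)

def disparate_impact_ratio(protected_group, reference_group):
    """Ratio of favorable-outcome rates: protected group / reference group."""
    return favorable_rate(protected_group) / favorable_rate(reference_group)

# Hypothetical decisions for two applicant groups
protected = ["approved", "denied", "denied", "approved", "denied"]
reference = ["approved", "approved", "denied", "approved", "approved"]

ratio = disparate_impact_ratio(protected, reference)
print(f"Disparate impact ratio: {ratio:.2f}")  # 0.40 / 0.80 = 0.50
if ratio < 0.8:  # the four-fifths rule of thumb
    print("Potential disparate impact - investigate further.")
```

<p>A ratio below roughly 0.8 is a screening heuristic, not proof of discrimination; it flags where deeper analysis of the model and its training data is warranted.</p>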
<p>Creating a set of fairness metrics is another practical mitigation approach, as it encourages data scientists to understand the impacts of various aspects of a person’s background that they can and cannot control. The examined attributes could include someone’s employment history, income and debt, but also the extent to which they experienced equal opportunities.</p>
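<p>To make the idea of a fairness metric set concrete, here is a minimal sketch of two widely used metrics, assuming labeled repayment outcomes and model approval decisions per group; all names and data are hypothetical.</p>

```python
def rate(values):
    """Mean of a list of 0/1 indicators."""
    return sum(values) / len(values)

def true_positive_rate(y_true, y_pred):
    """Among truly creditworthy applicants, the share the model approves."""
    return rate([p for t, p in zip(y_true, y_pred) if t == 1])

def demographic_parity_gap(pred_a, pred_b):
    """Difference in overall approval rates between two groups."""
    return rate(pred_a) - rate(pred_b)

def equal_opportunity_gap(true_a, pred_a, true_b, pred_b):
    """Difference in true positive rates between two groups."""
    return true_positive_rate(true_a, pred_a) - true_positive_rate(true_b, pred_b)

# Hypothetical labels (1 = repaid) and decisions (1 = approved) per group
true_a, pred_a = [1, 1, 0, 1], [1, 0, 0, 1]
true_b, pred_b = [1, 1, 0, 1], [1, 1, 0, 1]

print(demographic_parity_gap(pred_a, pred_b))                  # approval-rate gap
print(equal_opportunity_gap(true_a, pred_a, true_b, pred_b))   # TPR gap
```

<p>Gaps near zero indicate parity on that metric; different metrics can disagree, which is why a set of them, chosen with the attributes applicants can and cannot control in mind, is more informative than any single number.</p>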
<p>Improvements in explainable artificial intelligence will further both bias detection and mitigation. Developers then see <a href="https://rehack.com/ai/explainable-ai/">the heavily weighted but irrelevant factors</a> that could lead to unfair outcomes and correct issues early.</p>
</section>
<section id="considerations-to-reduce-algorithmic-bias-in-credit-scoring" class="level2">
<h2 class="anchored" data-anchor-id="considerations-to-reduce-algorithmic-bias-in-credit-scoring">Considerations to Reduce Algorithmic Bias in Credit Scoring</h2>
<p>Reducing algorithmic bias in credit scoring requires cooperation across roles and departments and includes those who will use the tools containing the algorithms. Committing to specific postprocessing steps empowers people to become more familiar with an algorithm’s functionality and potential shortcomings rather than automatically trusting the results. Those developing the algorithms should prioritize transparency by designing explainable models when possible and making them accessible enough for the expected audience.</p>
<p>Staying abreast of recent research shapes data scientists’ efforts by helping them understand the possibilities. In a 2024 case study, MIT researchers created a new <a href="https://news.mit.edu/2024/researchers-reduce-bias-ai-models-while-preserving-improving-accuracy-1211">technique that identifies and eliminates</a> the specific attributes of training data that are the strongest contributors to a model’s biases about minority subgroups. This approach also preserves overall accuracy because it preserves more of the data compared to other options.</p>
<p>The developers confirmed that the technique can find hidden bias sources in training datasets with unlabeled information. This capability is significant because the data used by many applications lacks labels. They envision combining their technique with other approaches to improve fairness in high-stakes situations. This detail makes it well-suited for the financial industry because many of the associated decisions alter people’s lives and opportunities.</p>
<p>Maintaining responsible data science practices requires equipping professionals with the skills to detect and mitigate bias in an evolving technological landscape. A 2025 study involved a biased dataset that contained a <a href="https://www.psu.edu/news/bellisario-college-communications/story/most-users-cannot-identify-ai-bias-even-training-data">disproportionately high number</a> of white people with happy faces. This issue caused the AI algorithm to correlate race with emotional expressions.</p>
<p>The results of three experiments with human participants showed that most individuals did not notice the bias. This result shows why data scientists need ongoing education to spot less-obvious examples.</p>
</section>
<section id="stay-vigilant-to-maintain-fairness" class="level2">
<h2 class="anchored" data-anchor-id="stay-vigilant-to-maintain-fairness">Stay Vigilant to Maintain Fairness</h2>
<p>Your work on algorithms for credit scoring could adversely affect people’s lives and leave them with no way to contest unfavorable outcomes. Being a responsible data scientist means understanding the numerous risk factors and the controllable factors to minimize harm. Remaining aware of emerging AI applications in the financial industry and regularly meeting with colleagues to discuss ways forward increases fairness for everyone.</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author:</dt>
<dd>
<a href="https://devinpartida.com/">Devin Partida</a> is a data science and technology writer, as well as the Editor-in-Chief of ReHack.com. Her work has been featured on Hackernoon, TechTarget, DZone and others.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Devin Partida <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<p><strong>How to cite</strong> :<br>
Partida, Devin. 2026. “<strong>Understanding and Addressing Algorithmic Bias: a Credit Scoring Case Study</strong>.” <em>Real World Data Science</em>, 2026. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/algorithmic_bias_credit_scoring.html">URL</a></p>
</div>
</div>
</div>


</div>
</section>

 ]]></description>
  <category>Ethics</category>
  <category>Algorithms</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/algorithmic_bias_credit_scoring.html</guid>
  <pubDate>Wed, 11 Feb 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/02/11/images/thumbcredit.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Why 95% Of AI Projects Fail and How to Change the Odds</title>
  <dc:creator>Lee Clewley</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/why-95-percent-of-ai-projects-fail.html</link>
  <description><![CDATA[ 





<p>Artificial intelligence is now capable of performing substantive work across scientific, medical, industrial and economic domains, yet organisational experience remains uneven. Most large firms have experimented with AI; very few report material gains. MIT’s NANDA study of enterprise generative AI estimates that only 5 percent of custom tools reach production with measurable impact on profit and loss <span class="citation" data-cites="mitnanda2025">(1)</span>. Early analysis from MIT’s Iceberg project points in the same direction at task level: current systems could already support far more work than they do today, but observed use remains shallow, concentrated in a narrow set of roles and often confined to standalone ‘copilot’ tools rather than embedded in core workflows <span class="citation" data-cites="chopra2025">(2)</span>.</p>
<p>For anyone who has sat through AI vendor demonstrations, the pattern is familiar: a procession of polished prototypes that rarely change how important decisions are made. As one Chief Information Officer put it <span class="citation" data-cites="mitnanda2025">(1)</span>: ‘We’ve seen dozens of demos this year. Maybe one or two are genuinely useful. The rest are wrappers or science projects.’</p>
<p>Two caveats matter here. Many pilots are exploratory by design, so failure to reach production is not necessarily a failure in a scientific sense. Profit-based metrics also miss scientific and operational learning, which often matters more in research-intensive organisations <span class="citation" data-cites="ransbotham2020">(3)</span>; <span class="citation" data-cites="bcg2024">(4)</span>; <span class="citation" data-cites="deloitte2024">(5)</span>; <span class="citation" data-cites="schlegel2023">(6)</span>. Even allowing for those points, evidence across independent surveys is remarkably consistent: most organisations struggle to turn AI model capability into repeated value, and with an estimated 95% failure rate, the question becomes: how do we change the odds? <span class="citation" data-cites="mitnanda2025">(1)</span>; <span class="citation" data-cites="ransbotham2020">(3)</span>; <span class="citation" data-cites="bcg2024">(4)</span>; <span class="citation" data-cites="deloitte2024">(5)</span>; <span class="citation" data-cites="schlegel2023">(6)</span>.</p>
<section id="three-reasons-for-failure" class="level2">
<h2 class="anchored" data-anchor-id="three-reasons-for-failure">Three Reasons for Failure</h2>
<p>There are three important reasons why so many projects fail that are often overlooked in the literature. First, the problem is often mis-specified: it is framed by technologists or vendors rather than co-owned by the domain experts who understand the decision and bear the consequences. Second, leadership expectations are frequently misaligned: short time horizons and demands for certainty collide with a technology that improves through iteration and organisational learning. Third, many deployments are brittle: they assume stability in a domain defined by rapid model change and rising user expectations, when what is needed is an engineered system designed to adapt.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/images/text1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>This article draws on two decades of work building AI systems for drug discovery at GSK <span class="citation" data-cites="gskjules2024">(7)</span> and at Tangram <span class="citation" data-cites="tangram2025">(8)</span> to argue that success rests on three principles, in roughly this order:</p>
<ul>
<li>integrated subject matter expertise;</li>
<li>patient and informed executive leadership;</li>
<li>and building AI systems that learn with the organisation.</li>
</ul>
<p>The sections that follow develop each element in the context of drug discovery and show how an AI platform can help move real projects into the small minority that deliver value.</p>
</section>
<section id="essential-contexts-outside-our-focus" class="level2">
<h2 class="anchored" data-anchor-id="essential-contexts-outside-our-focus">Essential Contexts Outside Our Focus</h2>
<p>Before turning to the three principles developed below, it is worth acknowledging four adjacent domains that I will not treat in detail here, each of which now has a substantial literature of its own. First, organisational scholars have long shown that new information systems reshape power, status and discretion, making the politics of implementation as important as the technology itself <span class="citation" data-cites="markus1983">(9)</span>; <span class="citation" data-cites="orlikowski1992">(10)</span>. Second, governance, regulation and responsible AI practice, which includes everything from model documentation and auditability to privacy, robustness and safety, have become central determinants of what can be deployed in practice, especially in regulated sectors <span class="citation" data-cites="paleyes2022">(11)</span>; <span class="citation" data-cites="ey2025">(12)</span>; <span class="citation" data-cites="capgemini2024">(13)</span>. Third, there is an emerging body of work on workforce transformation: how AI complements or displaces skills, how hybrid human–AI roles are designed, and how training, trust and professional bodies mediate adoption <span class="citation" data-cites="chopra2025">(2)</span>; <span class="citation" data-cites="ransbotham2020">(3)</span>; <span class="citation" data-cites="bcg2024">(4)</span>; <span class="citation" data-cites="deloitte2024">(5)</span>. Finally, the question of how to measure value and learn at portfolio scale with existing legacy IT systems (through experimentation, counterfactuals and disciplined comparisons between use cases) is itself a rich field that extends well beyond any single organisation. <span class="citation" data-cites="ransbotham2020">(3)</span>; <span class="citation" data-cites="bcg2024">(4)</span>; <span class="citation" data-cites="deloitte2024">(5)</span>; <span class="citation" data-cites="schlegel2023">(6)</span>; <span class="citation" data-cites="davenport2018">(14)</span>. 
Each of these strands is critical to understanding why AI succeeds, stalls or remains at a proof-of-concept level.</p>
<p>To maintain focus, I concentrate on three overlooked questions arising from direct experience: how to organise subject matter expertise such that the enterprise owns its AI; how to cultivate genuine leadership ownership; and how to engineer systems that learn and adapt rather than remain isolated demonstrations.</p>
</section>
<section id="principle-1-the-importance-of-building-with-subject-matter-experts" class="level2">
<h2 class="anchored" data-anchor-id="principle-1-the-importance-of-building-with-subject-matter-experts">Principle 1: The importance of building with subject matter experts</h2>
<p>The hardest part of building an AI platform is not the models or the engineers but assembling the subject matter experts (SMEs) who will frame and judge the work. Most commentary treats SMEs as validators, brought in at the end to bless a prototype. It rarely explains how to organise a molecular biologist, a clinician and a chemist so that they can state, in plain terms, what counts as an acceptable outcome.</p>
<p>Most commentators are not operators. They observe patterns across organisations but do not live with the consequences of poor SME integration. This distance between writing and practice shows up in the surveys. Foundry’s 2024 State of the CIO, summarised in MIT Sloan Management Review, reports that 85% of IT leaders see the CIO role as a driver of change, yet only 28% list leading transformation as their top priority <span class="citation" data-cites="foundry2024">(15)</span>. The people commenting on AI often sit with strategy decks rather than with the unglamorous work of managing technological change and cross-functional coordination.</p>
<p>Drug discovery starkly exposes the gap. The relevant team is wide and requires exceptional coordination. Business and portfolio leaders understand how projects absorb capital and create value whereas molecular biologists and geneticists judge whether a gene is plausibly causal for a disease. Clinicians think through trial design and patient risk. Chemists know what can be made and delivered. Statisticians, AI engineers and data scientists understand models, data pipelines, experimental design and evaluation. This diversity is a strength but requires a lot more from leaders of such teams. When these groups work as separate silos, the result is a generic set of tools whose outputs are not trusted by the users and whose inputs are irrelevant. When these experts can operate as a single team, the conversation starts with a simple set of questions. Which decisions are we trying to improve? How will we know if we have succeeded? What data and statistical methods count as acceptable evidence? Which risks are we prepared to take and which are not negotiable?</p>
<p>At Tangram, that joint framing often collapses into one critical choice: which disease do we want to target for drug development, and which gene is driving it? That decision already embeds genetics, hepatocyte biology, chemistry, clinical feasibility and commercial context. The role of AI and engineering is then precise. It is to help the group search the vast hypothesis space, structure the evidence and quantify uncertainty, while leaving the final judgement with experts who feel they own the AI platform, can see why the AI has come to the conclusions it has, and also own the consequences.</p>
</section>
<section id="principle-2-patient-and-strategic-executive-leadership." class="level2">
<h2 class="anchored" data-anchor-id="principle-2-patient-and-strategic-executive-leadership.">Principle 2: Patient and strategic executive leadership.</h2>
<p>The second element is executive patience. Research from MIT Sloan shows that firms gaining value from AI tend to run more projects, over more years, with a sustained focus on learning how people and AI work together. <span class="citation" data-cites="ransbotham2020">(3)</span> Leaders in these organisations accept that early returns are small and uneven. They invest in a pipeline of use cases rather than a single bet. They resist what researchers have called the “last mile problem”: AI projects that reach technical proof of concept but never change how work is done. <span class="citation" data-cites="davenport2018">(14)</span> In life sciences this is acute. Discovery timelines are long, data are messy and early signals are faint. Leaders who expect quick, clean returns tend to cycle through pilots without ever building an asset that scientists trust.</p>
<p>Patience does not mean passivity; actually, the opposite is true. It means choosing a small number of important decisions, funding cross-functional teams to attack them, and holding the bar for quality high. It requires senior sponsorship to unblock data access, align incentives across discovery, clinical and commercial groups, and shield long term work from quarterly fashion cycles. When those conditions are in place, AI stops being a sequence of demonstrations and starts to become part of how the organisation thinks: the informed leader knows the difference.</p>
</section>
<section id="principle-3-building-ai-systems-that-learn-with-the-organisation" class="level2">
<h2 class="anchored" data-anchor-id="principle-3-building-ai-systems-that-learn-with-the-organisation">Principle 3: Building AI Systems that learn with the organisation</h2>
<p>The third element is the AI engineering. The MIT findings on the 95 percent figure are instructive: most generative AI projects fail not because the models are weak but because the systems around them are brittle. <span class="citation" data-cites="mitnanda2025">(1)</span>; <span class="citation" data-cites="paleyes2022">(11)</span> Foundation models are dropped into existing workflows with minimal adaptation. There is limited monitoring. Data quality is assumed rather than measured. When something breaks (as it inevitably will in non-deterministic systems) teams revert to manual work.</p>
<p>Modern AI engineering starts from the opposite assumption. Models and tools will change quickly. The surrounding stack must absorb that change without being rebuilt each time. The strategy must be designed so that, when the product director hears that a new technology has shipped or an LLM has improved, it is always a good day.</p>
<p><strong>Build only what you must.</strong></p>
<p>The sensible principle is to build only the components where your domain expertise creates defensible value. Everything else should be bought. But buying is not effortless: integration, monitoring and vendor management drain teams unless you are staffed for it. Organisations that succeed at scale partner with vendors offering systems that learn and adapt; they focus on workflow integration; they deploy tools where process alignment is easiest.</p>
<p>So, assuming you have the right staff working together, supportive leaders and good vendor relationships, the next problem is how to build the platform itself.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/images/text2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p><strong>A useful pattern for a modern AI platform has four parts:</strong></p>
<ul>
<li><strong>First, the data stack.</strong> Discovery teams need a small number of trusted stores for data with clear provenance and at least basic quality checks. For early target selection this means human genetics, expression data, interaction networks, preclinical phenotypes, clinical outcomes and internal experiments. Traceability and reproducibility are critical here. Any claim a model makes about a gene and disease pair should be traceable back to specific pieces of evidence.</li>
<li><strong>Second, the platform needs to be built on modular services rather than monoliths.</strong> Each service has a single responsibility and can be swapped when a better service appears. This keeps the cost of change low and allows teams to combine external tools with internal components in a controlled way.</li>
<li><strong>Third, the system needs to have continuous evaluation.</strong> Every component that answers questions is tested on held-out tasks, with simple metrics for accuracy, faithfulness, and recall, and monitored continually. There should be repeat measures and other tests of robustness. <span class="citation" data-cites="bolton2024">(16)</span> There is no reason not to report error bars in AI and yet they are rarely part of AI publications. Where this matters most is at the interface with non-determinism, inherent in large language models. A good medical AI assistant should give consistent answers even when questions are phrased differently. It should also say it does not know when the information is unclear or incomplete. <span class="citation" data-cites="bolton2024">(16)</span>; <span class="citation" data-cites="ji2023">(17)</span>; <span class="citation" data-cites="gskrambla2024">(18)</span></li>
<li><strong>Fourth, include memory and reinforcement learning so that the system learns.</strong> This is the most difficult component to implement and the one most often deferred. A system that cannot learn from use will make the same mistake repeatedly. Even the most patient users will lose trust and patience. But building memory into production systems, where the model retains context across sessions and improves from feedback, requires specialist expertise in reinforcement learning, retrieval-augmented generation with persistent stores, and the infrastructure to support online learning without catastrophic forgetting <span class="citation" data-cites="ouyang2022">(19)</span>. These skills are in high demand and short supply. The alternative is a system that feels potentially useful in demonstrations but frustrates users in daily work.</li>
</ul>
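<p>The evaluation point above is cheap to act on. The following is a minimal sketch of a percentile-bootstrap confidence interval for a held-out accuracy score, using only the standard library; the held-out results and parameter choices are hypothetical.</p>

```python
import random

def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of 0/1 scores."""
    rng = random.Random(seed)  # fixed seed for reproducibility
    n = len(scores)
    # Resample with replacement and record the mean of each resample
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(scores) / n, (lo, hi)

# Hypothetical held-out results: 1 = the system answered correctly
held_out = [1] * 78 + [0] * 22

mean, (lo, hi) = bootstrap_ci(held_out)
print(f"accuracy = {mean:.2f} (95% CI {lo:.2f}-{hi:.2f})")
```

<p>Reporting the interval alongside the point estimate makes regressions between model versions distinguishable from resampling noise, which is exactly the discipline continuous evaluation requires.</p>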
<p>For this to work, engineering teams need to stay in constant contact with biologists, geneticists, clinicians, chemists and portfolio managers. Together they decide what error rate is acceptable for a triage tool, what form of uncertainty estimate a portfolio board will respect and where human review is mandatory, for example before a new target enters serious preclinical work. Work on human–AI interaction design reinforces this point: systems should explain what they can and cannot do, expose their confidence and make it easy for users to correct them. The hardest part is that the first version is almost never right. Cross-functional teams need patience and ownership. They contribute real examples, refine prompts and evaluation sets, and expect the system to learn from its mistakes. The AI platform must be useful enough, early enough, that experts are willing to spend scarce attention improving it.</p>
</section>
<section id="a-worked-example-target-indication-pairing-in-sirna" class="level2">
<h2 class="anchored" data-anchor-id="a-worked-example-target-indication-pairing-in-sirna">A worked example: target-indication pairing in siRNA</h2>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/images/illu.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
<p>At Tangram we built LLibra OS, an internal system designed to surface and assess new small interfering RNA targets <span class="citation" data-cites="tangram2025">(8)</span>. siRNA refers to short double-stranded RNA molecules that can silence a specific gene via the RNA interference pathway. The purpose is narrow: the AI needs to help scientists identify medicines worth taking forward.</p>
<p>In early discovery, hypothesis generation often reduces to a single question: which target and disease pair should we move into the siRNA pipeline? The question sounds simple. It is not.</p>
<p>A purely data-driven approach can surface millions of candidates. Genome-wide association, expression atlases, and protein interaction networks will produce statistical associations at scale. But association is not mechanism. Two things can correlate because they share upstream causes, because they sit in the same pathway without being rate-limiting, or because of confounding in the data.</p>
<p>Plausibility requires a different kind of evidence. If we modulate this target, what functional change should we observe at the cellular or tissue level? Does that functional phenotype connect credibly to the disease we care about? For our purposes, the chain of reasoning must pass through liver biology: does knockdown of this gene alter a measurable secretory or metabolic function, and does that function relate to the clinical phenotype we wish to treat? And is there a real unmet need for patients? <span class="citation" data-cites="crooke2021">(20)</span></p>
<p>The AI assists in answering these questions. It helps the team hold multiple threads of conditional evidence in view simultaneously. It retrieves, reasons and summarises over tens of millions of papers in the literature, joins and flags inconsistencies between thousands of data sources and quantifies uncertainty where the evidence is thin. Whilst the AI platform does the work of thousands of researchers and continually learns on the job, the expert judgement remains with the SMEs. The SME is central and is given all the reasons why and how the AI found a piece of evidence. The AI structures and expands the space in which that judgement operates. When it works well the AI platform uncovers the non-obvious connections that researchers may never have found.</p>
</section>
<section id="conclusions" class="level2">
<h2 class="anchored" data-anchor-id="conclusions">Conclusions</h2>
<p>Seen from this angle, the 95 percent figure is not a verdict on AI technology but a statistic about organisational design: how rarely good questions, high quality data, diverse experts and committed leaders are brought together at the same time. The systems described in this essay matter, but they are secondary. The primary determinant of value is whether biologists, clinicians, chemists, data scientists and portfolio leaders sit together, own the same objectives and are backed by executives willing to invest over years rather than quarters. Where that integrated team is absent, even elegant architectures will fail. Where they are present, imperfect tools still move the needle.</p>
<p>Much commentary on AI spends its energy on model choice, technical detail and tooling. This article has argued that the more important work is organisational: deciding which decisions to improve, agreeing what counts as acceptable evidence, and creating cross-functional teams that can live with the consequences. The few organisations that succeed treat AI as an experiment in decision making rather than a procurement exercise. They expect the stack to change and the vendors to turn over, but they hold fast to the team, the questions and the discipline.</p>
<p>For Real World Data Science readers, the implication is direct. AI projects fail when nobody owns the estimand, the counterfactual and the error bars. AI doesn’t need to be perfect, but it needs to be good enough. As the statistician George E. P. Box famously observed: “All models are wrong, but some are useful”. Usefulness here depends on design, discipline and humility as much as model choice. Statisticians, data scientists and methodologists can reclaim the narrative by insisting not only that every AI project begins with a clear question, a credible experiment and a plan to learn, but also that these are held collectively by an integrated team with visible executive backing. That is how more organisations move into the 5 percent.</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/index.html">Explore more data science ideas</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author:</dt>
<dd>
<a href="https://www.linkedin.com/in/lee-clewley-988bbb18/">Lee Clewley (PhD)</a> is VP of Applied AI &amp; Informatics at <a href="https://tangramtx.com/">Tangram Therapeutics</a>, where he led the design and deployment of LLibra, a multi-LLM, agentic system for early discovery. Formerly Head of Applied AI at <a href="https://www.gsk.com/en-gb/">GSK</a>, he is a member of the Real World Data Science <a href="https://realworlddatascience.net/the-pulse/editors-blog/posts/2022/10/18/meet-the-team.html">editorial board</a>.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<p><strong>Copyright and licence</strong> : © 2026 Lee Clewley<br>
<a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"> <img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"> </a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;">International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<p><strong>How to cite</strong> :<br>
Clewley, Lee. 2026. “<strong>Why 95% Of AI Projects Fail and How to Change the Odds</strong>.” <em>Real World Data Science</em>, 2026. <a href="https://realworlddatascience.net/applied-insights/tutorials/posts/2026/12/why-95-percent-of-ai-projects-fail.html">URL</a></p>
</div>
</div>
</div>


</div>

</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body">
<div id="ref-mitnanda2025" class="csl-entry">
<div class="csl-left-margin">1. </div><div class="csl-right-inline">MIT NANDA. The GenAI divide: State of AI in business 2025. Massachusetts Institute of Technology; 2025.</div>
</div>
<div id="ref-chopra2025" class="csl-entry">
<div class="csl-left-margin">2. </div><div class="csl-right-inline"><span class="nocase">Chopra A et al.</span> Measuring skills-centered exposure in the AI economy. MIT Project Iceberg; Oak Ridge National Laboratory; 2025.</div>
</div>
<div id="ref-ransbotham2020" class="csl-entry">
<div class="csl-left-margin">3. </div><div class="csl-right-inline">Ransbotham S, Khodabandeh S, Kiron D, Candelon F, Chu M, LaFountain B. Expanding AI’s impact with organizational learning. MIT Sloan Management Review. 2020.</div>
</div>
<div id="ref-bcg2024" class="csl-entry">
<div class="csl-left-margin">4. </div><div class="csl-right-inline"><span class="nocase">Bellefonds N de et al.</span> Where’s the value in AI? Boston Consulting Group; 2024.</div>
</div>
<div id="ref-deloitte2024" class="csl-entry">
<div class="csl-left-margin">5. </div><div class="csl-right-inline">Deloitte AI Institute. The state of generative AI in the enterprise: Now decides next. Deloitte; 2024.</div>
</div>
<div id="ref-schlegel2023" class="csl-entry">
<div class="csl-left-margin">6. </div><div class="csl-right-inline">Schlegel D, Schuler K, Westenberger J. Failure factors of AI projects: Results from expert interviews. International Journal of Information Systems and Project Management. 2023;11(3):25–40.</div>
</div>
<div id="ref-gskjules2024" class="csl-entry">
<div class="csl-left-margin">7. </div><div class="csl-right-inline">GSK.ai. JulesOS: GSK’s agent-based operating system [Internet]. 2024. Available from: <a href="https://www.gsk.ai">https://www.gsk.ai</a></div>
</div>
<div id="ref-tangram2025" class="csl-entry">
<div class="csl-left-margin">8. </div><div class="csl-right-inline">Tangram Therapeutics. LLibra OS: Identifying the right targets. 2025.</div>
</div>
<div id="ref-markus1983" class="csl-entry">
<div class="csl-left-margin">9. </div><div class="csl-right-inline">Markus ML. Power, politics, and MIS implementation. Communications of the ACM. 1983;26(6):430–44.</div>
</div>
<div id="ref-orlikowski1992" class="csl-entry">
<div class="csl-left-margin">10. </div><div class="csl-right-inline">Orlikowski WJ. The duality of technology: Rethinking the concept of technology in organizations. Organization Science. 1992;3(3):398–427.</div>
</div>
<div id="ref-paleyes2022" class="csl-entry">
<div class="csl-left-margin">11. </div><div class="csl-right-inline">Paleyes A, Urma RG, Lawrence ND. Challenges in deploying machine learning: A survey of case studies. ACM Computing Surveys. 2022;55(6).</div>
</div>
<div id="ref-ey2025" class="csl-entry">
<div class="csl-left-margin">12. </div><div class="csl-right-inline">EY. How responsible AI translates investment into impact. Ernst &amp; Young; 2025.</div>
</div>
<div id="ref-capgemini2024" class="csl-entry">
<div class="csl-left-margin">13. </div><div class="csl-right-inline">Capgemini Research Institute. Generative AI in organizations 2024. Capgemini; 2024.</div>
</div>
<div id="ref-davenport2018" class="csl-entry">
<div class="csl-left-margin">14. </div><div class="csl-right-inline">Davenport TH, Ronanki R. Artificial intelligence for the real world. Harvard Business Review. 2018;96(1):108–16.</div>
</div>
<div id="ref-foundry2024" class="csl-entry">
<div class="csl-left-margin">15. </div><div class="csl-right-inline">Foundry. State of the CIO survey 2024. Foundry; 2024.</div>
</div>
<div id="ref-bolton2024" class="csl-entry">
<div class="csl-left-margin">16. </div><div class="csl-right-inline">Bolton WJ, Poyiadzi R, Morrell ER, Bergen Gonzalez Bueno G van, Goetz L. RAmBLA: A framework for evaluating the reliability of LLMs as assistants in the biomedical domain. arXiv preprint. 2024.</div>
</div>
<div id="ref-ji2023" class="csl-entry">
<div class="csl-left-margin">17. </div><div class="csl-right-inline"><span class="nocase">Ji Z, Lee N, Frieske R, et al.</span> Survey of hallucination in natural language generation. ACM Computing Surveys. 2023;55(12).</div>
</div>
<div id="ref-gskrambla2024" class="csl-entry">
<div class="csl-left-margin">18. </div><div class="csl-right-inline">GSK.ai. RAmBLA: Evaluating the reliability of LLMs as biomedical assistants. 2024.</div>
</div>
<div id="ref-ouyang2022" class="csl-entry">
<div class="csl-left-margin">19. </div><div class="csl-right-inline"><span class="nocase">Ouyang L, Wu J, Jiang X, et al.</span> Training language models to follow instructions with human feedback. In: Advances in neural information processing systems. 2022. p. 27730–44.</div>
</div>
<div id="ref-crooke2021" class="csl-entry">
<div class="csl-left-margin">20. </div><div class="csl-right-inline">Crooke ST, Liang XH, Baker BF, Crooke RM. Antisense technology: A review. Journal of Biological Chemistry. 2021;296:100416.</div>
</div>
</div></section></div> ]]></description>
  <category>Viewpoints</category>
  <category>AI</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/why-95-percent-of-ai-projects-fail.html</guid>
  <pubDate>Mon, 12 Jan 2026 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2026/01/12/images/thumb95.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Deploying LLMs for Nonprofits: 10 Lessons from Knowbot</title>
  <dc:creator>Annie Flynn</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2025/12/17/deploying_llms_nonprofits.html</link>
  <description><![CDATA[ 





<p><em>Based on <a href="https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/MHF-interview.html">our conversation</a> with <a href="https://www.mikehudsonfoundation.org/">Mike Hudson</a>, Founder of <a href="https://www.mikehudsonfoundation.org">MHF</a>, <a href="https://www.knowbot.uk/">Knowbot</a> and <a href="https://www.testramp.org/">TestRAMP</a>.</em></p>
<p>Large language models (LLMs) are a form of generative artificial intelligence (AI) that offer transformative opportunities for organisations with complex information ecosystems. But deploying them responsibly requires technical pragmatism, cultural awareness, and a respect for context.</p>
<p>Specialist AI donor the <a href="https://www.mikehudsonfoundation.org/">Mike Hudson Foundation</a> has developed an LLM-powered ‘answer engine’ that sits on websites and answers users’ questions. Our recent conversation with MHF’s Founder surfaced some valuable insights for data science practitioners working in this sphere.</p>
<section id="start-with-the-simplest-possible-use-case" class="level2">
<h2 class="anchored" data-anchor-id="start-with-the-simplest-possible-use-case">1. Start With the Simplest Possible Use Case</h2>
<p>One of Knowbot’s core design principles was <em>minimal friction</em>. MHF looked for a “gateway use case”: a low-risk, easy-to-understand tool that organisations could immediately see value in and adopt quickly.</p>
<p><strong>Practitioner takeaway:</strong> Don’t begin with the most ambitious AI project your organisation can imagine. Begin with an easy, low-risk project that still delivers value, and treat it as a learning experience.</p>
</section>
<section id="culture-matters-more-than-budget" class="level2">
<h2 class="anchored" data-anchor-id="culture-matters-more-than-budget">2. Culture Matters More Than Budget</h2>
<p>Hudson notes that AI readiness among nonprofits varies widely and isn’t correlated with organisational size. Some large charities are slow to innovate due to bureaucracy; some small ones are enthusiastic but unlikely to benefit.</p>
<p><strong>Practitioner takeaway:</strong> When planning an LLM deployment, assess <em>cultural readiness</em>, not just technical readiness. Ask:</p>
<ul>
<li><p>Who are the internal champions?</p></li>
<li><p>How much AI literacy exists?</p></li>
<li><p>How cautious is the organisation by default?</p></li>
</ul>
<p>This will drive adoption far more than infrastructure.</p>
</section>
<section id="build-for-trust-first-then-functionality" class="level2">
<h2 class="anchored" data-anchor-id="build-for-trust-first-then-functionality">3. Build for Trust First, Then Functionality</h2>
<p>The biggest obstacle MHF faced wasn’t the model, the infrastructure, or the code. It was <em>accessing the right decision-makers</em> and establishing trust in LLMs in general and Knowbot in particular.</p>
<p><strong>Practitioner takeaway:</strong> AI deployments in nonprofits are trust projects as much as technical ones. Practitioners should:</p>
<ul>
<li><p>Engage early with leadership.</p></li>
<li><p>Be explicit about risks and mitigations.</p></li>
<li><p>Provide clear, responsible documentation.</p></li>
<li><p>Avoid overclaiming what the model can do.</p></li>
</ul>
<p>The more transparent the process, the smoother the adoption.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/12/17/images/LLM2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</section>
<section id="where-appropriate-restrict-the-models-knowledge-domain-to-reduce-risk" class="level2">
<h2 class="anchored" data-anchor-id="where-appropriate-restrict-the-models-knowledge-domain-to-reduce-risk">4. Where Appropriate, Restrict the Model’s Knowledge Domain to Reduce Risk</h2>
<p>Knowbot deliberately confines itself to the content on the host organisation’s website(s), plus general internal LLM knowledge. It doesn’t trawl the open internet. This dramatically limits opportunities for hallucinations, unsafe advice, or reputational risk.</p>
<p><strong>Practitioner takeaway:</strong> Whenever possible, design LLM answer engine systems that operate on <em>curated, organisation-owned content</em>. Domain restriction is one of the most effective forms of practical AI safety.</p>
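<p>A minimal sketch of what such domain restriction can look like in code, assuming a small dictionary of organisation-owned page texts and a simple word-overlap retriever (the pages, the overlap rule, and the refusal message are all illustrative; Knowbot’s actual pipeline is not public):</p>

```python
import re

# Hypothetical organisation-owned content the engine is allowed to draw on.
SITE_PAGES = {
    "/services": "We offer mental health support groups every Tuesday evening.",
    "/about": "Founded in 2001, the charity supports young carers across the UK.",
}

def _words(text: str) -> set:
    """Lowercase word tokens, stripped of punctuation."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(question: str, pages: dict, min_overlap: int = 2) -> list:
    """Return page texts sharing at least min_overlap words with the question."""
    q = _words(question)
    return [t for t in pages.values() if len(q & _words(t)) >= min_overlap]

def build_prompt(question: str, pages: dict) -> str:
    """Ground the LLM prompt in retrieved site content only; refuse otherwise."""
    context = retrieve(question, pages)
    if not context:
        return "REFUSED: no relevant site content found for this question."
    return ("Answer using ONLY the context below. If it is insufficient, say so.\n"
            "Context:\n" + "\n".join(context) + "\nQuestion: " + question)
```

<p>The key design point is the refusal branch: when nothing on the curated site matches, the engine declines rather than falling back on the model’s general knowledge.</p>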
</section>
<section id="expect-surprising-user-behaviour-and-design-for-it" class="level2">
<h2 class="anchored" data-anchor-id="expect-surprising-user-behaviour-and-design-for-it">5. Expect Surprising User Behaviour — And Design for It</h2>
<p>One of the unexpected patterns in early usage: people asked Knowbot, <em>“Who are you?”</em> This prompted the team to add a new prompt component and to require every partner to host a <em>“What is Knowbot?”</em> page.</p>
<p><strong>Practitioner takeaway:</strong> Build processes for:</p>
<ul>
<li><p>Unexpected inputs</p></li>
<li><p>Prompt evolution</p></li>
<li><p>Iterative refinement</p></li>
</ul>
<p>LLM deployment is never “set and forget.”</p>
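<p>The “Who are you?” fix above amounts to prompt evolution: keeping the system prompt as composable components so new ones can be added as usage reveals gaps. A sketch, with entirely invented wording (this is not Knowbot’s actual prompt):</p>

```python
# Hypothetical identity component added after users asked "Who are you?".
IDENTITY = ("You are Knowbot, an AI answer engine that responds using this "
            "organisation's website content. You are not a human.")

def build_system_prompt(components: list) -> str:
    """Join prompt components; new components slot in without rewriting the rest."""
    return "\n\n".join(components)

base = ["Answer concisely.", "Cite the source page for every answer."]
prompt = build_system_prompt([IDENTITY] + base)
```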
</section>
<section id="technological-timing-matters-and-keeps-improving" class="level2">
<h2 class="anchored" data-anchor-id="technological-timing-matters-and-keeps-improving">6. Technological Timing Matters — And Keeps Improving</h2>
<p>Hudson emphasised that many capabilities now considered standard (e.g.&nbsp;long context length, ring-fenced access to specific types of knowledge, ease of server deployment) would have been impossible even a year earlier. The tools needed to fulfil a nonprofit’s evolving needs often appear in the LLM ecosystem soon after the nonprofit requests new functionality, making the decision whether to ‘build custom’ or ‘wait’ a tricky one.</p>
<p><strong>Practitioner takeaway:</strong> Stay current. Model capabilities, guardrails, and hosting options evolve at high speed. What was impossible last quarter may be trivial today.</p>
</section>
<section id="value-impact-over-volume" class="level2">
<h2 class="anchored" data-anchor-id="value-impact-over-volume">7. Value Impact Over Volume</h2>
<p>Knowbot’s LLM processing costs MHF money, and so Knowbot’s team evaluates success not just by the number of questions answered but by the <em>relevance</em> of those questions to valuable decision-making. A tool that helps a policymaker or researcher retrieve something critical can have outsized impact.</p>
<p><strong>Practitioner takeaway:</strong> When measuring impact, develop metrics that capture qualitative value, not just quantitative usage. For example, you might consider:</p>
<ul>
<li><p>Complexity of queries</p></li>
<li><p>Decision relevance</p></li>
<li><p>Equity of access</p></li>
<li><p>Whether the tool reduces burden on staff</p></li>
</ul>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/12/17/images/LLM3.png" class="img-fluid quarto-figure quarto-figure-center figure-img" style="width:80.0%"></p>
</figure>
</div>
</section>
<section id="in-fast-moving-environments-admit-what-you-dont-know" class="level2">
<h2 class="anchored" data-anchor-id="in-fast-moving-environments-admit-what-you-dont-know">8. In Fast-Moving Environments, Admit What You Don’t Know</h2>
<p>Both Knowbot and TestRAMP were built in contexts where knowledge was changing daily. Hudson emphasises the importance of asking “naïve” questions, learning quickly, and not pretending expertise where there is none.</p>
<p><strong>Practitioner takeaway:</strong> Cultivate humility. Curiosity and fast learning beat early certainty. Pair technical exploration with organisational openness about unknowns.</p>
</section>
<section id="relationships-and-partnerships-are-everything" class="level2">
<h2 class="anchored" data-anchor-id="relationships-and-partnerships-are-everything">9. Relationships and Partnerships Are Everything</h2>
<p>Across both initiatives, success depended less on algorithms and more on building new human relationships.</p>
<p><strong>Practitioner takeaway:</strong> AI for public good is a team sport. Map stakeholders. Share progress transparently. Community buy-in creates technical resilience.</p>
</section>
<section id="the-next-frontier-agentic-ai" class="level2">
<h2 class="anchored" data-anchor-id="the-next-frontier-agentic-ai">10. The Next Frontier: Agentic AI</h2>
<p>Hudson argues we’re at a turning point where AI will expand from being “retrieval engines” to becoming “agentic systems that can do things.” With that shift comes both opportunity and new categories of risk.</p>
<p><strong>Practitioner takeaway:</strong> Prepare now for agentic systems. Start with controlled automation, clear constraints, auditable logs, and robust governance. Retrieval is only the beginning.</p>
<p><a href="https://realworlddatascience.net/foundation-frontiers/posts/2025/11/27/MHF-interview.html">Read our full conversation with Mike Hudson here.</a></p>
<p><a href="https://www.mikehudsonfoundation.org/">Find out more about the Mike Hudson Foundation here.</a></p>
<p><em>Mike Hudson is an entrepreneur in technology &amp; electronic markets. He now uses his expertise to help solve social problems. Mike founded TestRAMP, a pandemic nonprofit social market described as a “major contribution to Covid PCR testing &amp; genomic sequencing”, and donated its £2.4mn profits to charity. Mike is a Fellow of ZSL &amp; adviser to its CEO. He is an honorary Research Fellow at City, University of London. Mike is a member of the Responsible AI Institute. He is a Foundation Fellow at St Antony’s College, University of Oxford.</em></p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/index.html">Explore more data science ideas</a></p>
</div>


</section>

 ]]></description>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2025/12/17/deploying_llms_nonprofits.html</guid>
  <pubDate>Wed, 17 Dec 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/12/17/images/LLMthumb.png" medium="image" type="image/png" height="96" width="144"/>
</item>
<item>
  <title>Deploying Agentic AI - What Worked, What Broke, and What We Learned</title>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2025/08/12/deploying-agentic-ai.html</link>
  <description><![CDATA[ 





<section id="we-built-agentic-systems.-heres-what-broke." class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="we-built-agentic-systems.-heres-what-broke."><span class="header-section-number">1</span> We Built Agentic Systems. Here’s What Broke.</h2>
<p>When Agentic AI started dominating research papers, demos, and conference talks, I was curious but cautious. The idea of intelligent agents, autonomous systems powered by large language models that can plan, reason, and take actions using tools, sounded brilliant in theory. But I wanted to know what happened when you used them. Not in a toy notebook or a slick demo, but in real projects, with real constraints, where things needed to work reliably and repeatably.</p>
<p>In my role as Clinical AI &amp; Data Scientist at Bayezian Limited, I work at the intersection of data science, statistical modelling, and clinical AI governance, with a strong emphasis on regulatory-aligned standards such as CDISC. I have been directly involved in deploying agentic systems into environments where trust and reproducibility are not optional. These include real-time protocol compliance, CDISC mapping, and regulatory workflows. We gave agents real jobs. We let them loose on messy documents. And then we watched them work, fail, learn, and (sometimes) recover.</p>
<p>This article is not a critique of Agentic AI as a concept. I believe Agentic AI has potential value, but I also believe it demands more critical evaluation. That means assessing these systems in conditions that mirror the real world, not in benchmark papers filled with sanitised datasets. It means observing what happens when agents are under pressure, when they face ambiguity, and when their outputs have real consequences. What follows is not speculation about what Agentic AI might become a decade from now. It is a candid reflection on what it feels like to use these systems today. It is about watching a chain of prompts unravel or a multi-agent system drop the baton halfway through a task. If we want Agentic AI to be trustworthy, robust, and practical, then our standards for evaluating it must be shaped by lived experience rather than theoretical ideals.</p>
</section>
<section id="what-agentic-ai-looks-like-in-practice" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="what-agentic-ai-looks-like-in-practice"><span class="header-section-number">2</span> What Agentic AI Looks Like in Practice</h2>
<p>If you’re imagining robots in lab coats, that’s not quite what this is. It is more like releasing a highly motivated intern into a complex archive with partial instructions, limited supervision, and the freedom to decide which filing cabinets, databases, or tools to open next. It is messy. It is unpredictable. And it sometimes surprises you with just how resourceful or confused it can get. Agentic AI systems are purpose-built setups where a large language model is given a task and enough autonomy to decide how to approach it. That might mean choosing which tools to use, when to use them, and how to adapt when things go off-script. You are not just sending one prompt and getting an answer. You are watching a system reason, remember, call APIs, retry when things go wrong, and ideally, get to a useful result.</p>
<p>At Bayezian, we have explored this in several internal projects, including generating clinical codes from statistical analysis plans and study specifications, monitoring synthetic Electronic Health Records (EHRs) for rule violations, and running chained reasoning loops to validate document alignment. These efforts reflect the reality of building LLM agents into safety-critical and compliance-heavy workflows. Across these deployments, the question is never just “can it do the task” but “can it do the task reliably, interpretably, and safely in context”.</p>
<p>Broader research has followed similar directions. In clinical pharmacology and translational sciences, researchers have explored how AI agents can automate modelling and trial design while keeping a human in the loop, offering blueprints for scalable, compliant agentic workflows. In the context of patient-facing systems, agentic retrieval-augmented generation has improved the quality and safety of educational materials, with LLMs acting as both generators and validators of content. Other teams have used multi-agent systems to simulate cross-disciplinary collaboration, where each AI agent brings a different scientific role to the design and validation of therapeutic molecules such as SARS-CoV-2 nanobodies.</p>
<p>Some of the systems we built used agent frameworks like LangChain or LlamaIndex. Others were bespoke combinations of APIs, function libraries, memory stores, and prompt stacks wired together to mimic workflow behaviour. Regardless of the architecture, the core structure remained the same. The agent was given a task, a bit of autonomy, and access to tools, and then left to figure things out. Sometimes it worked. Sometimes it did not. That gap between intention and execution is where most of the interesting lessons sit.</p>
<p>In the next section, I describe one of those deployments in more detail: a multi-agent system used to monitor data flow in a simulated clinical trial setting.</p>
</section>
<section id="case-study-monitoring-protocol-deviations-with-agentic-ai" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="case-study-monitoring-protocol-deviations-with-agentic-ai"><span class="header-section-number">3</span> Case Study: Monitoring Protocol Deviations with Agentic AI</h2>
<p><strong>Why We Built It</strong></p>
<p>Clinical trials generate a stream of complex data, from scheduled lab results to adverse event logs. Hidden in that stream are subtle signs that something may be off: a visit occurred too late, a test was skipped, or a dose changed when it shouldn’t have. These are protocol deviations, and catching them quickly matters. They can affect safety, skew outcomes, and trigger regulatory scrutiny.</p>
<p>Traditionally, reviewing these events is a painstaking task. Study teams trawl through spreadsheets and timelines, cross-referencing against lengthy protocol documents. It is time-consuming, easy to miss context, and prone to delay. We wondered whether an AI-driven approach could act like a vigilant reviewer. Not to replace the team, but to help it focus on what truly needed attention.</p>
<p>Our motivation was twofold. First, to introduce earlier, more consistent detection without relying on rule-based systems that often buckle under real-world variability. Second, to test whether a group of coordinated language model agents, each with a clear focus, could carry out this work at scale while still being interpretable and auditable.</p>
<p>To do that, we built the system from the ground up. We designed a pipeline that could ingest clinical documents, extract key protocol elements, embed them for semantic search, and store them in structured form. That created the foundation for agents to work not just as readers of data, but as context-aware monitors. Understanding whether a missed Electrocardiogram (ECG) or a delayed Day 7 visit violated the protocol required more than lookup tables. It required reasoning. It required memory. It required agents built with intent.</p>
<p>What emerged was a system designed not just to scan data, but to think with constraints, assess context, and escalate issues when the boundaries of the trial were breached. The goal was not perfection, but partnership. A system that could flag what mattered, explain why, and stay open to human feedback.</p>
<p><strong>How It Was Set Up</strong></p>
<p>The system was built around a group of focused agents, each responsible for checking a specific type of protocol rule. Rather than relying on one large model to do everything, we broke the task into smaller parts. One agent reviewed visit timing. Another checked medication use. Others handled inclusion criteria, missed procedures, or serious adverse events. This made each agent easier to understand, easier to test, and less likely to be overwhelmed by conflicting information.</p>
<p>Before any agents could be activated, however, an early classifier was introduced to determine what type of document had arrived. Was it a screening form or a post-randomisation visit report? That initial decision shaped the downstream path. If it was a screening file, the system activated the inclusion and exclusion criteria checker. If it was a visit document, it was handed off to agents responsible for tracking timing, treatment exposure, scheduled procedures, and adverse events.</p>
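<p>The classify-then-route step can be sketched as follows. The document types, agent names, and keyword rule here are illustrative assumptions, not the production classifier:</p>

```python
# Which checking agents to activate for each document type (names invented).
AGENTS_BY_DOC_TYPE = {
    "screening": ["inclusion_exclusion_checker"],
    "visit": ["visit_timing", "treatment_exposure", "procedures", "adverse_events"],
}

def classify(document: str) -> str:
    """Stand-in classifier: is this a screening form or a visit report?"""
    text = document.lower()
    if "screening" in text or "eligibility" in text:
        return "screening"
    return "visit"

def route(document: str) -> list:
    """Return the set of agents that should review this document."""
    return AGENTS_BY_DOC_TYPE[classify(document)]
```

<p>Keeping routing explicit like this, rather than letting one large model decide everything, is what makes each downstream agent small enough to test in isolation.</p>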
<p>These agents did not operate in isolation. They worked on top of a pipeline that handled the messy reality of clinical data. Documents in different formats were extracted, cleaned, and converted into structured representations. Tables and free text were processed together. Key elements from study protocols were embedded and stored to allow flexible retrieval later. This gave the agents access to a searchable memory of what the trial actually required.</p>
<p>While many agentic systems today rely heavily on frameworks like LangChain or LlamaIndex, our system was built from the ground up to suit the demands of clinical oversight and regulatory traceability. We avoided packaged orchestration frameworks. Instead, we constructed a lightweight pipeline using well-tested Python tools, giving us more control over transparency and integration. For semantic memory and search, protocol content was indexed using FAISS, a vector store optimised for fast similarity-based retrieval. This allowed each agent to fetch relevant rules dynamically and reason through them with appropriate context.</p>
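<p>As a hedged illustration of that retrieval pattern: the sketch below substitutes a bag-of-words cosine similarity for the learned embeddings and FAISS index used in the real system. The protocol rules are invented examples:</p>

```python
import re
import numpy as np

# Toy protocol rules standing in for embedded protocol content.
RULES = [
    "Visit 3 must occur within 7 days, plus or minus a 2 day window, of randomisation.",
    "A 12-lead ECG is required at screening and again at Day 7.",
    "Any rescue medication use must be recorded within 24 hours.",
]

def tokens(text):
    return re.findall(r"[a-z0-9]+", text.lower())

VOCAB = sorted({w for r in RULES for w in tokens(r)})

def embed(text):
    """Count vocabulary words and L2-normalise, mimicking a dense embedding."""
    v = np.array([tokens(text).count(w) for w in VOCAB], dtype=float)
    n = np.linalg.norm(v)
    return v / n if n else v

INDEX = np.stack([embed(r) for r in RULES])  # one row per protocol rule

def fetch_rules(query, k=1):
    """Return the k rules most similar to the query, as an agent would fetch them."""
    sims = INDEX @ embed(query)
    return [RULES[i] for i in np.argsort(sims)[::-1][:k]]
```

<p>The real pipeline replaces the counting step with model embeddings and the matrix product with a FAISS similarity search, but the fetch-relevant-rules-then-reason shape is the same.</p>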
<p>When patient data flowed in, the classifier directed the document to the appropriate agents. If any agent spotted something unusual, it could escalate the case to a second agent responsible for suggesting possible actions. That might mean logging the issue, generating a report, or prompting a review from the study team. Throughout, a human remained involved to validate decisions and interpret edge cases that needed nuance.</p>
<p>We did not assume the agents would get everything right. The idea was to create a process where AI could handle the repetitive scanning and flagging, leaving people to focus on the work that demanded clinical judgement. The combination of structured memory, clear responsibilities, document classification, and human oversight formed the backbone of the system.</p>
<p>Figure 1 illustrates a two-phase agentic system architecture, where protocol documents are first parsed, structured, and embedded into a searchable memory (green), enabling real-time agents (orange) to classify incoming clinical data from the Clinical Trial Management System (CTMS), reason over protocol rules, detect deviations, and escalate issues with human oversight.</p>
<div id="fig-cde" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/08/12/images/figure-1-sa.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: System Architecture and Agent Flow
</figcaption>
</figure>
</div>
<p><strong>Where It Got Complicated</strong></p>
<p>In early tests, the system did what it was built to do. It scanned incoming records, spotted missing data, flagged unexpected medication use, and pointed out deviations that might otherwise have slipped through. On structured examples, it handled the checks with speed and consistency.</p>
<p>But as we moved closer to real trial conditions, the gaps started to show. The agents were trained to recognise rules, but real-world data rarely plays by the book. Information arrived out of order. Visit dates overlapped. Exceptions buried in footnotes became critical. Suddenly, a task that looked simple in isolation became tangled in edge cases.</p>
<p>One of the most frequent problems was handover failure. A deviation might be correctly identified by the first agent, only to be lost or misunderstood by the next. A flagged issue would travel halfway through the chain and then disappear or be misclassified because the follow-up agent missed a piece of context. These were not coding errors. They were coordination breakdowns, small lapses in memory between steps that led to big differences in outcome.</p>
<p>We also found that decisions based on time windows were especially fragile. An agent could recognise that a visit was missing, but not always remember whether the protocol allowed a buffer. That kind of reasoning depended on holding specific details in working memory. Without it, the agents began to misfire, sometimes raising the alarm too early, other times not at all.</p>
<p>None of this was surprising. We had built the system to learn from its own limitations. But seeing those moments play out across agents, in ways that were subtle and sometimes difficult to trace, helped surface the exact places where autonomy met ambiguity and where structure gave way to noise.</p>
<p><strong>A Glimpse Into the Details</strong></p>
<p>One case brought the system’s limits into focus. A monitoring agent flagged a protocol deviation for a missing lab test on Day 14. On the surface, it looked like a valid call. The entry for that day was missing, and the protocol required a test at that visit. The alert was logged, and the case moved on to the next agent in the chain.</p>
<p>But there was a catch.</p>
<p>The protocol did call for a Day 14 lab, but it also allowed a two-day window either side. That detail had been extracted earlier and embedded in the system’s memory. However, at the moment of evaluation, that context was not carried through. The agent saw an empty cell for Day 14 and treated it as a breach. It did not recall that a test on Day 13, which had already been recorded, fulfilled the requirement.</p>
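<p>The window-aware check the agent failed to apply is simple to state in code. A hypothetical sketch, with dates and names of our own invention rather than anything from the production system:</p>

```python
from datetime import date, timedelta

def lab_requirement_met(scheduled: date, window_days: int,
                        recorded: list[date]) -> bool:
    """True if any recorded lab test falls within ±window_days of the visit."""
    return any(abs((d - scheduled).days) <= window_days for d in recorded)

day0 = date(2025, 1, 1)
day14 = day0 + timedelta(days=14)             # scheduled Day 14 visit
recorded_tests = [day0 + timedelta(days=13)]  # test actually done on Day 13

# A naive exact-date check fires a false positive;
# the window-aware check does not.
assert day14 not in recorded_tests                    # what the agent saw
assert lab_requirement_met(day14, 2, recorded_tests)  # what the protocol allowed
```

The logic is trivial once the window is in hand; the failure was that the window never reached the agent at evaluation time.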
<p>This was not a failure of logic. It was a failure of coordination. The information the agent needed was available, but not in the right place at the right time. The memory had thinned just enough between steps to turn a routine variation into a false positive.</p>
<p>From a human perspective, the decision would have been easy. A reviewer would glance at the timeline, check the visit window, and move on. But for the agent, the absence of a test on the exact date triggered a response. It did not understand flexibility unless that flexibility was made explicit in the prompt it received.</p>
<p>That small oversight rippled through the process. It triggered an unnecessary escalation, pulled attention away from genuine issues, and reminded us that autonomy without memory is not the same as understanding.</p>
<p><strong>How We Measured Success</strong></p>
<p>To understand how well the system was performing, we needed something to compare it against. So we asked clinical reviewers to go through a set of patient records and mark any protocol deviations they spotted. This gave us a reference set, a gold standard, that we could use to test the agents.</p>
<p>We then ran the same data through the system and tracked how often it matched the human reviewers. When the agent flagged something that was also noted by a reviewer, we counted it as a hit. If it missed something important or raised a false alarm, we marked it accordingly. This gave us basic measures like sensitivity and specificity: in plain terms, how good the system was at picking up real issues and how well it avoided false ones.</p>
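<p>Computing those two measures from a reviewer gold standard is a straightforward set comparison. A minimal sketch, with record identifiers invented for the example:</p>

```python
def sensitivity_specificity(gold: set[str], flagged: set[str],
                            all_records: set[str]) -> tuple[float, float]:
    tp = len(gold & flagged)                # real deviations the system caught
    fn = len(gold - flagged)                # real deviations it missed
    fp = len(flagged - gold)                # false alarms
    tn = len(all_records - gold - flagged)  # clean records left alone
    return tp / (tp + fn), tn / (tn + fp)

records = {f"rec{i}" for i in range(10)}
reviewer_deviations = {"rec1", "rec3", "rec7"}   # the gold standard
agent_flags = {"rec1", "rec3", "rec5"}           # what the system raised

sens, spec = sensitivity_specificity(reviewer_deviations, agent_flags, records)
print(sens, spec)  # here 2/3 caught, 6/7 of clean records left alone
```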
<p>But we also looked at the process itself. It was not just about whether a single agent made the right call, but whether the information made it through the chain. We tracked handovers between agents, how often a detected issue was correctly passed along, whether follow-up steps were triggered, and whether the right output was produced in the end.</p>
<p>This helped us see where the system worked as intended and where things broke down, even when the core detection was accurate. It was never just a question of getting the right answer. It was also about getting it to the right place.</p>
<p><strong>What We Changed Along the Way</strong></p>
<p>Once we understood where things were going wrong, we made a few targeted changes to steady the system.</p>
<p>First, we introduced structured memory snapshots. These acted like running notes that captured key protocol rules and exceptions at each stage. Rather than expecting every agent to remember what came before, we gave them a shared space to refer back to. This made it easier to hold onto details like visit windows or exemption clauses, even as the task moved between agents.</p>
<p>We also moved beyond rigid prompt templates. Early versions of the system leaned heavily on predefined phrasing, which limited the agents’ flexibility. Over time, we allowed the agents to generate their own sets of questions and reason through the answers independently. This gave them more space to interpret ambiguous situations and respond with a clearer sense of context, rather than relying on tightly scripted instructions. Alongside this, we rewrote prompts to be clearer and more grounded in the original trial language. Ambiguity in wording was often enough to derail performance, so small tweaks, such as phrasing things the way a study nurse might, made a noticeable difference.</p>
<p>We then added stronger handoff signals. These were markers that told the next agent what had just happened, what context was essential, and what action was expected. It was a bit like writing a handover note for a colleague. Without that, agents sometimes acted without full context or missed the point altogether.</p>
<p>Finally, we built in simple checks to track what happened after an alert was raised. Did the follow-up agent respond? Was the right report generated? If not, where did the thread break? These checks gave us better visibility into system behaviour and helped us spot patterns that weren’t obvious from the output alone.</p>
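<p>The “handover note for a colleague” idea lends itself to a small structured record passed from agent to agent. The sketch below is purely illustrative; the field names and agent names are ours, not the system’s:</p>

```python
from dataclasses import dataclass, field

@dataclass
class HandoffNote:
    """A structured handover record passed between agents, so essential
    context survives each step instead of living only in the previous
    agent's prompt. All field names here are illustrative."""
    issue_id: str
    summary: str                # what was just detected
    essential_context: dict     # e.g. visit windows, exemption clauses
    expected_action: str        # what the next agent should do
    trail: list = field(default_factory=list)  # agents that handled it

    def pass_to(self, agent_name: str) -> "HandoffNote":
        self.trail.append(agent_name)
        return self

note = HandoffNote(
    issue_id="dev-042",
    summary="No lab test recorded on Day 14",
    essential_context={"visit_window_days": 2, "tests_on": ["Day 13"]},
    expected_action="check visit window before escalating",
)
note.pass_to("monitoring_agent").pass_to("action_agent")
print(note.trail)
```

Because the trail and context travel with the issue, a lost or misrouted flag becomes visible as a broken record rather than a silent gap.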
<p>None of these changes made the system perfect. But they helped close the loop. Errors became easier to trace. Fixes became faster to test. And confidence grew that when something went wrong, we would know where to look.</p>
<p><strong>What It Taught Us</strong></p>
<p>The system did not live up to the hype, and it was not flawless, but it proved genuinely useful. It spotted patterns early. It highlighted things we might have overlooked. And, just as importantly, it changed how people interacted with the data. Rather than spending hours checking every line, reviewers began focusing on the edge cases and thinking more critically about how to respond. The role shifted from manual detective work to something closer to intelligent triage.</p>
<p>What agentic AI brought to the table was not magic, but structure. It added pace to routine checks, consistency to decisions, and visibility into what had been flagged and why. Every alert came with a traceable rationale, every step with a record. That made it easier to explain what the system had done and why, which in turn made it easier to trust.</p>
<p>At the same time, it reminded us what agents still cannot do. They do not infer the way people do. They do not fill in blanks or read between the lines. But they do follow instructions. They do handle repetition. They do maintain logic across complex checks. And in clinical research, where consistency matters just as much as cleverness, that counts for a lot.</p>
<p>This experience did not make us think agentic systems were ready to run trials alone. But it did show us they could support the process in a way that was measurable, transparent, and worth building on.</p>
<p><strong>What This Taught Us About Evaluation</strong></p>
<p>Working with agentic systems made one thing especially clear. The way most people assess language models does not prepare you for what happens when those models are placed inside a real workflow.</p>
<p>It is easy enough to test for accuracy or coherence in response to a single prompt. But those surface checks do not reflect what it takes to complete a task that unfolds over time. When an agent is making decisions, juggling memory, switching between tools, and coordinating with others, a different kind of evaluation is needed.</p>
<p>We began paying attention to the sorts of things that rarely make it into research papers. Could the agent perform the same task consistently across repeated attempts? Did it remember what had just happened a few steps earlier? When one component passed information to another, did it land correctly? Did the agent use the right tool when the moment called for it, even without being told explicitly?</p>
<p>These were not academic concerns. They were practical indicators of whether the system would hold up under pressure. So we built simple ways to track them.</p>
<p>We looked at how stable the agent remained from one run to the next. We measured how often a person needed to step in. We checked whether the agent could retrieve details it had already encountered. And we monitored how information moved through the system, from one part to another, without being lost or altered along the way.</p>
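<p>None of these signals needs heavy machinery. Run-to-run stability, for example, reduces to asking how often repeated runs of the same task agree. A minimal sketch, with outcome labels invented for the example:</p>

```python
from collections import Counter

def run_consistency(outcomes: list[str]) -> float:
    """Fraction of repeated runs that agree with the most common outcome:
    a crude stability signal for the same task run several times."""
    if not outcomes:
        return 0.0
    _, most_common = Counter(outcomes).most_common(1)[0]
    return most_common / len(outcomes)

# Five repeated runs of the same deviation check (labels illustrative).
runs = ["flag", "flag", "flag", "no_flag", "flag"]
print(run_consistency(runs))  # 0.8
```

The analogous counts for human interventions and successful handovers were tracked the same way: simple ratios over logged events.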
<p>None of this required complex metrics. But each of these signals told us more about how the system behaved in real use than any benchmark ever did.</p>
</section>
<section id="a-call-for-practical-evaluation-standards" class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="a-call-for-practical-evaluation-standards"><span class="header-section-number">4</span> A Call for Practical Evaluation Standards</h2>
<p>If we want reliable ways to judge these systems, we need to start from what happens when they are used in the real world. Much of the current thinking around evaluating agentic AI remains too abstract. It often focuses on what the system is supposed to do in principle, not what it manages to do in practice. But the most useful insights emerge when things fall apart. When an agent loses track of its task, forgets what just happened, or takes an unexpected turn under pressure.</p>
<p><a href="https://sakana.ai/ai-scientist/">A recent assessment of Sakana.ai’s AI Scientist</a> made this point sharply. The system promised end-to-end research automation, from forming hypotheses to writing up results. It was an ambitious step forward. But <a href="https://arxiv.org/html/2502.14297v1">when tested</a>, it fell short in important ways. It skimmed literature without depth, misunderstood experimental methods, and stitched together reports that looked complete but were riddled with basic errors. One reviewer said it read like something written in a hurry by a student who had not done the reading. The outcome was not a failure of intent, but a reminder that sophisticated language does not always reflect sound reasoning.</p>
<p>Instead of designing evaluation methods in isolation, we should begin with real scenarios. That means observing where agents stumble, how they recover, and whether they can carry through when steps are long and outcomes matter. It means showing the messy bits, not just polished results. Tools that help us retrace decisions, inspect memory, and understand what went wrong are just as important as the outputs themselves.</p>
<p>Only by starting from lived use with its uncertainty, complexity, and human oversight, can we build evaluation methods that truly reflect what it means for these systems to be useful.</p>
</section>
<section id="closing-thoughts-from-the-field" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="closing-thoughts-from-the-field"><span class="header-section-number">5</span> Closing Thoughts from the Field</h2>
<p>Agentic AI carries genuine promise, but even a single deployment can reveal how much distance there is between ambition and execution. These systems can be impressively capable in some moments and surprisingly brittle in others. And in domains where decisions must be precise and timelines matter, that brittleness is more than an inconvenience; it introduces real risk.</p>
<p>The lessons from our experience were not abstract. They came from watching one system try to handle a demanding, high-context task and seeing where it stumbled. It was not a matter of poor design or unrealistic expectations. The complexity was built in, the kind that only becomes visible once a system moves beyond isolated prompts and into continuous workflows.</p>
<p>That is why evaluation needs to begin with real use. With lived attempts, not controlled tests. With unexpected behaviours, not just benchmark scores. As practitioners, we have a front-row seat to what breaks, what improves with small tweaks, and what truly helps. That view should help shape how the field evolves.</p>
<p>If agentic systems are to mature, the stories of where they struggled and how we adapted cannot sit on the sidelines. They are part of how progress happens. And they may be the clearest indicators of what needs to change next.</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
<dl>
<dt>About the authors</dt>
<dd>
<a href="https://www.linkedin.com/in/francis-osei-b2b02116a/"><strong>Francis Osei</strong></a> is the Lead Clinical AI Scientist and Researcher at Bayezian Limited, where he designs and builds intelligent systems to support clinical trial automation, regulatory compliance, and the safe, transparent use of AI in healthcare. His work brings together data science, statistical modelling, and real-world clinical insight to help organisations adopt AI they can understand, trust, and act on.
</dd>
</dl>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2025 Francis Osei
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://www.shutterstock.com/g/donut8449">khunkornStudio</a> on <a href="https://www.shutterstock.com/image-photo/ai-chatbot-technology-virtual-assistant-customer-2582430481">Shutterstock</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Osei, F. (2025). “Deploying Agentic AI: What Worked, What Broke, and What We Learned”, Real World Data Science, August 12, 2025. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/08/12/deploying-agentic-ai.html">URL</a>
</dd>
</dl>
</div>


</section>

 ]]></description>
  <category>Reproducibility</category>
  <category>Data Analysis</category>
  <category>Machine learning</category>
  <category>Statistics</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2025/08/12/deploying-agentic-ai.html</guid>
  <pubDate>Tue, 12 Aug 2025 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2025/08/12/images/agentic-ai.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Defining Purposes and Uses to Support the Development of Statistical Products in a 21st Century Census Curated Data Enterprise Environment</title>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/22/development-plan-2.html</link>
  <description><![CDATA[ 





<center>
Acknowledgments: This research was sponsored by the <br> United States Census Bureau Agreement No.&nbsp;01-21-MOU-06 and <br> Alfred P. Sloan Foundation Grant No.&nbsp;G-2022-19536
</center>
<p><br> <br> <em>The views expressed in this article are those of the authors and not the Census Bureau.</em></p>
<section id="summing-it-up" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="summing-it-up"><span class="header-section-number">1</span> Summing it up</h2>
<p>We end where we began in the first article of our series. Through this four-part series, we introduced a Curated Data Enterprise (CDE) Framework (see Figure&nbsp;1) that can guide the development and dissemination of statistics broadly applicable to addressing social and economic issues while ensuring replicability and reusability. The CDE provides the scaffold for scaling the statistical product development of interest to the US Census Bureau and broadly applies to official statistics agencies <span class="citation" data-cites="keller2022bold">(Keller et al. 2022)</span>. We illustrated this through a use case on climate resiliency of skilled nursing facilities, highlighting the replicability and reusability of the capabilities that would benefit from inclusion in a CDE.</p>
<div id="fig-cde" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/22/images/figure-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: The CDE Framework starts with the purposes &amp; uses of the statistical products. The outer rectangle identifies the guiding principles for ethical, transparent, reproducible statistical product development and dissemination. The inner rectangle identifies the statistical product development steps.
</figcaption>
</figure>
</div>
<p>As noted in the first three articles, the process begins with articulating purposes and uses through stakeholder engagement and continues by leveraging that engagement, including subject matter expertise, to inform statistical product development. Eliciting purposes and uses from stakeholders and data users is facilitated by asking questions such as: &nbsp;</p>
<ol type="1">
<li><p>What questions keep you awake at night because you don’t have data insights to address them? What are those purposes and uses that you need statistical products to support?</p></li>
<li><p>How do we collaborate and engage with you to better understand your needs and help you identify gaps in understanding regarding purpose and use?</p></li>
<li><p>How do we prioritize what statistical products to develop first?</p></li>
</ol>
<p>Examples of purposes and uses that drive new statistical products include accurately measuring gig employment <span class="citation" data-cites="salvo2022gig">(Salvo et al. 2022a)</span>, migration due to extreme climate events <span class="citation" data-cites="salvo2022migration">(Salvo et al. 2022b)</span>, the various dimensions of housing affordability <span class="citation" data-cites="wu2023housing">(Wu et al. 2023)</span>, and addressing the undercount of young children <span class="citation" data-cites="Salvo2023children">(Salvo et al. 2023)</span>. Other topics that require multiple sources and types of data include creating a household living budget based on the minimum necessary to ensure an adequate standard of living <span class="citation" data-cites="lancaster2023HLB">(Lancaster et al. 2023)</span> and using this budget as a starting point for measuring insecurity across components such as food or housing <span class="citation" data-cites="montalvo2023">(Montalvo et al. 2023)</span>.</p>
</section>
<section id="developing-an-end-to-end-e2e-curation-system" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="developing-an-end-to-end-e2e-curation-system"><span class="header-section-number">2</span> Developing an end-to-end (E2E) curation system</h2>
<p>Purposes and uses defined in use cases are important to support the rapid development of statistical products. These use cases will capture the imagination of those working to address today’s critical issues and advance public understanding and trust in federal statistics.&nbsp;The preceding section provides examples of purposes and uses for which we have developed use cases.</p>
<p>Use cases are a powerful mechanism to promote methodological research to develop and implement capabilities needed in a CDE. The objectives are to undertake research projects that have the potential to create statistical products with explicit purposes and uses that will exercise the end-to-end (E2E) curation components.</p>
<p>When implemented, these proposed use cases will demonstrate a sequence of capabilities needed to build the CDE, such as agile data discovery, reusing modules and data (including synthetic data), tracking the provenance of collected and generated data, reusing synthetic data and methods to integrate many types of data, conducting statistical analysis involving heterogeneous data integration, and reviewing data and statistical results with an equity and ethics lens. These steps will be captured in an end-to-end curation system.</p>
<ol type="1">
<li><strong>Criteria for developing and evaluating use cases that will uncover the capabilities and research necessary to develop the CDE</strong></li>
</ol>
<p>Criteria are needed to evaluate use cases and to guide partnerships with researchers and stakeholders in developing and implementing the capabilities to capture in the CDE. The use cases chosen, once curated, need to provide unique insight into CDE capabilities and statistical product development. The capabilities to be developed include addressing some purpose and use that no single source of information can resolve, generating practical diagnostics to improve existing methods, creating pilot software, and validating new and improved statistical products. These criteria, developed through listening sessions and discussions with experts, guide the prioritization and selection of use cases and their evaluation after curation (see Table 2) <span class="citation" data-cites="keller2022bold">(Keller et al. 2022)</span>.</p>
<table class="caption-top table">
<caption>Table 2. Criteria for Selecting and Prioritizing Use Cases to Identify CDE Capabilities</caption>
<colgroup>
<col style="width: 100%">
</colgroup>
<tbody>
<tr class="odd">
<td><strong>Value and feasibility of the CDE approach described in the existing research (potential use case)</strong> to address emerging or long-standing issues, ie, its purpose and use over and above existing approaches to address high-priority problems.</td>
</tr>
<tr class="even">
<td><strong>Stakeholders’</strong> challenges and issues as the source of purposes and uses.</td>
</tr>
<tr class="odd">
<td><strong>Subject matter experts</strong> to advise on the approach and implementation.</td>
</tr>
<tr class="even">
<td><strong>Partners to access data</strong> from local and state governments, non-profit organizations, and the private sector, and strategies to overcome legal and administrative barriers to such access that benefit both the providers and recipients of the data.</td>
</tr>
<tr class="odd">
<td><strong>Survey, administrative, opportunity, and procedural data</strong> from multiple sources (eg, local, state, federal, third-party) to address the purpose and use (issue) in an integrated way. There are well-defined data ingestion and governance requirements.</td>
</tr>
<tr class="even">
<td><strong>Computation and measurement requirements for statistical products</strong> include the unit(s) of analysis and their characteristics, temporal sequence, geocoded location data, and methods for imputations, projections, and statistical analysis.</td>
</tr>
<tr class="odd">
<td><strong>Equity and ethical dimensions are considered</strong> at each step to ensure that the use case provides fair and accurate representation across groups and an assessment that the potential benefits outweigh the potential harm.</td>
</tr>
<tr class="even">
<td><strong>Evidence of CDE capabilities</strong> to be built, including the code, data, and documentation to create the statistical products, which can be described in the curation step.</td>
</tr>
<tr class="odd">
<td><strong>Statistical products</strong> include integrated data sources, indicators, maps, visualizations, storytelling and analysis.</td>
</tr>
<tr class="even">
<td>Potential viability of proposed <strong>dissemination platforms</strong> for interactive access to data products at all levels of data acumen <span class="citation" data-cites="keller2021acumen">(Keller and Shipp 2021)</span> while adhering to confidentiality and privacy rules.</td>
</tr>
</tbody>
</table>
<ol start="2" type="1">
<li><strong>An end-to-end curation process</strong></li>
</ol>
<p>Curation is an end-to-end process defined by the context of the purposes and uses that document the decisions and trade-offs at each step in the CDE Framework. The following curation definition will be used as it serves the CDE’s vision.</p>
<p><strong><em>Curation</em></strong> involves documenting, for each statistical product, the <strong>inputs</strong> from which the product is derived, the <strong>wrangling</strong> used to transform the information into product, and the <strong>statistical product</strong> itself. Purposes and uses provide the context for each statistic and statistical product.</p>
<p>This definition has evolved from numerous stakeholder discussions via listening sessions and discussions with Census Bureau staff. <span class="citation" data-cites="nusser2024curation faniel2019context nasem2022transparency">(Nusser et al. forthcoming; Faniel et al. 2019; NASEM 2022)</span>.</p>
<p>As use cases are curated, the CDE capabilities will evolve to quickly develop statistical products. These curated use cases are integral to developing an E2E curation process for the CDE. &nbsp;</p>
<ol start="3" type="1">
<li><strong>Invitation to contribute purpose and use ideas for developing new statistical products</strong></li>
</ol>
<p>The CDE development aims to curate a significant number of use cases that address social and economic issues and have the potential to define capabilities to be built in the CDE. Initially, the Census Bureau is seeking ideas for purposes and uses to define these use cases and statistical products.</p>
<p>The skilled nursing facility use case included code, data, and documentation to calculate the probability of workers getting to work during a weather event, resilience indicators at the county or sub-county level, alternative skilled nursing home deficiency measures, and other capabilities.</p>
<p><strong>Incorporating capabilities in the CDE</strong></p>
<p>To accelerate the development of statistical products, the Census Bureau will develop use cases to articulate and create CDE capabilities. This requires identifying those valuable nuggets for learning and quickly translating and incorporating this information into the CDE. Examples of critical capabilities of interest are learning about the utility of synthetic data, the ability to aggregate data into custom geographies, and combining different units of analysis. The expected outcome is the creation of an innovative 21<sup>st</sup> Century Census Curated Data Enterprise focused on purposes and uses that overcome the limitations and challenges of today’s survey-alone model. &nbsp;</p>
<p>The 21<sup>st</sup> Century Census Curated Data Enterprise development presents an opportunity for researchers to help drive the development of the CDE as the foundation for creating new statistical products. The US Census Bureau is seeking ideas for purposes and uses that will define new statistical products. They are interested in research projects (use cases) that are guided by the CDE framework as potential new statistical products. They want to learn from and understand your experiences in using the CDE framework, for example, what worked well, what challenges you faced, how each step in the framework was curated, and what capabilities are replicable and reusable for developing and enhancing statistical products.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/19/use-case-2.html">← Part 3: Climate resiliency of skilled nursing facilities</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Stephanie Shipp</strong> leads the Curated Data Enterprise research portfolio and collaborates with the US Census. She is an economist with experience in data science, survey statistics, public policy, innovation, ethics, and evaluation.
</dd>
<dd>
<strong>Joseph Salvo</strong> is a demographer with experience in US Census Bureau statistics and data. He presents on demographic subjects to a wide range of groups and has managed major demographic projects involving the analysis of large data sets for local applications.
</dd>
<dd>
<strong>Vicki Lancaster</strong> is a statistician with expertise in experimental design, linear models, computation, visualizations, data analysis, and interpretation.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Stephanie Shipp
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@goumbik">Lukas Blazek</a> on <a href="https://unsplash.com/photos/turned-on-black-and-grey-laptop-computer-mcSDtbWXUZU">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Shipp S, Salvo J, Lancaster V (2024). “Statistical Products in a 21<sup>st</sup> Century Census Curated Data Enterprise Environment”, Real World Data Science, November 22, 2024. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/22/development-plan-2.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-faniel2019context" class="csl-entry">
Faniel, Ixchel M, Rebecca D Frank, and Elizabeth Yakel. 2019. <span>“Context from the Data Reuser’s Point of View.”</span> <em>Journal of Documentation</em> 75 (6): 1274–97. <a href="https://doi.org/10.1108/JD-08-2018-0133">https://doi.org/10.1108/JD-08-2018-0133</a>.
</div>
<div id="ref-keller2022bold" class="csl-entry">
Keller, Sallie, Kenneth Prewitt, John Thompson, et al. 2022. <span>“A 21st Century Census Curated Data Enterprise. A Bold New Approach to Create Official Statistics. Technical Report.”</span> <em>Proceedings of the Biocomplexity Institute</em> BI-2022-1115: 297–323. <a href="https://doi.org/10.18130/r174-yk24">https://doi.org/10.18130/r174-yk24</a>.
</div>
<div id="ref-keller2021acumen" class="csl-entry">
Keller, Sallie, and Stephanie Shipp. 2021. <span>“Data Acumen in Action.”</span> <em>Notices of the American Mathematical Society</em>. <a href="https://www.ams.org/journals/notices/202109/noti2353/noti2353.html?adat=October%202021&amp;trk=2353&amp;galt=feature&amp;cat=feature&amp;pdfissue=202109&amp;pdffile=rnoti-p1468.pdf">https://www.ams.org/journals/notices/202109/noti2353/noti2353.html</a>.
</div>
<div id="ref-lancaster2023HLB" class="csl-entry">
Lancaster, V., M. Montalvo, J. Salvo, and S. Shipp. 2023. <span>“The Importance of Household Living Budget in the Context of Measuring Economic Vulnerability: A Census Curated Data Enterprise Use Case Demonstration.”</span> <em>Proceedings of the Biocomplexity Institute</em> Technical Report. TR# BI-2023-258. <a href="https://doi.org/10.18130/p43z-c742">https://doi.org/10.18130/p43z-c742</a>.
</div>
<div id="ref-montalvo2023" class="csl-entry">
Montalvo, Cesar, Vicki Lancaster, Joseph Salvo, and Stephanie Shipp. 2023. <span>“The Importance of Household Living Budget in the Context of Food Insecurity: A Census Curated Data Enterprise Use Case Demonstration.”</span> <em>Proceedings of the Biocomplexity Institute, Technical Report BI-2023-261</em>. <a href="https://doi.org/10.18130/2kgx-tv50">https://doi.org/10.18130/2kgx-tv50</a>.
</div>
<div id="ref-nasem2022transparency" class="csl-entry">
NASEM. 2022. <span>“Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies.”</span> <em>National Academies of Science, Engineering, and Medicine</em>. <a href="https://doi.org/10.1162/99608f92.17405bb6">https://doi.org/10.1162/99608f92.17405bb6</a>.
</div>
<div id="ref-nusser2024curation" class="csl-entry">
Nusser, S., S. Keller, S. Shipp, Z. Zhu, and E. Wu. Forthcoming. <span>“Curation in the Context of the Census Curated Data Enterprise (CDE).”</span> <em>TBD</em>.
</div>
<div id="ref-Salvo2023children" class="csl-entry">
Salvo, J., V. Lancaster, and S. Shipp. 2023. <span>“The Net Undercount of Children Under 5 Years of Age in the Decennial Census: An Art of the Possible Use Case.”</span> <em>Proceedings of the Biocomplexity Institute</em> Technical Report. TR# BI-2023-000. <a href="https://doi.org/10.18130/nzyj-m621">https://doi.org/10.18130/nzyj-m621</a>.
</div>
<div id="ref-salvo2022gig" class="csl-entry">
Salvo, J., S. Shipp, and S. Zhang. 2022a. <span>“Defining the Role of Gig Employment in the Post-Pandemic World of Work.”</span> <em>Proceedings of the Biocomplexity Institute</em> Technical Report BI 2022-026 (2022a). <a href="https://doi.org/10.18130/wkx0-4y46">https://doi.org/10.18130/wkx0-4y46</a>.
</div>
<div id="ref-salvo2022migration" class="csl-entry">
Salvo, J., S. Shipp, and S. Zhang. 2022b. <span>“Building a Case Study of Domestic Migration and the Curated Data TR# 2022-027 - Essential Elements.”</span> <em>Proceedings of the Biocomplexity Institute</em> Technical Report BI 2022-027 (2022b). <a href="https://doi.org/10.18130/bcwa-gt69">https://doi.org/10.18130/bcwa-gt69</a>.
</div>
<div id="ref-wu2023housing" class="csl-entry">
Wu, E., J. Salvo, V. Lancaster, and S. Shipp. 2023. <span>“Housing Affordability – an Art of the Possible Use Case to Develop the 21st Century Census Curated Data Enterprise.”</span> <em>Proceedings of the Biocomplexity Institute</em> Technical Report BI-2023-262. <a href="https://doi.org/10.18130/qgkd-va29">https://doi.org/10.18130/qgkd-va29</a>.
</div>
</div></section></div> ]]></description>
  <category>Public Policy</category>
  <category>Data Analysis</category>
  <category>Data Integration</category>
  <category>Curation</category>
  <category>Statistical Products</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/22/development-plan-2.html</guid>
  <pubDate>Fri, 22 Nov 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/22/images/figure-1.png" medium="image" type="image/png" height="81" width="144"/>
</item>
<item>
  <title>Translating the Curated Data Model into Practice - Climate resiliency of skilled nursing facilities</title>
  <dc:creator>Vicki Lancaster, Stephanie Shipp, Sallie Keller, Henning Mortveit, Samarth Swarup, Aaron Schroeder, and Dawen Xie &lt;br /&gt; University of Virginia, Biocomplexity Institute</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/use-case-2.html</link>
  <description><![CDATA[ 





<center>
Acknowledgments: This research was sponsored by: <br> United States Census Bureau Agreement No.&nbsp;01-21-MOU-06 and <br> Alfred P. Sloan Foundation Grant No.&nbsp;G-2022-19536
</center>
<p><br> <br></p>
<section id="introduction" class="level2" data-number="1">
<h2 data-number="1" class="anchored" data-anchor-id="introduction"><span class="header-section-number">1</span> Introduction</h2>
<p>Here, we demonstrate how the CDE Framework can be implemented for a research use case related to skilled nursing facilities. The framework provides the guiding principles for ethical, transparent, and reproducible research and dissemination and the research process for developing the statistical product.</p>
<p>Across the US, federally regulated skilled nursing facilities (SNFs) provide essential care, rehabilitation, and related health services to about 1.3 million people. An SNF is a facility that meets specific federal regulatory certification requirements that enable it to provide short-term inpatient care and services to patients who require medical, nursing, or rehabilitative services. Their patients can be among the most vulnerable members of our society, and yet, historically, SNFs have not been incorporated into existing emergency response systems. For example, during the 2004 Florida hurricane season, SNFs were given the same priority as day spas for restoring electricity, telephones, water, and other essential services <span class="citation" data-cites="hyer2006establishing">(Hyer et al. 2006)</span>. Even worse were the deaths of SNF residents in Louisiana following Hurricanes Katrina and Rita in 2005 <span class="citation" data-cites="dosa2008controversy">(Dosa et al. 2008)</span>. The problem has persisted: 12 SNF residents died in Florida as a result of Hurricane Irma (2017), and 15 died in Louisiana when evacuated to a warehouse during Hurricane Ida (2021). In both instances, the deaths were attributed to extreme heat and lack of electricity <span class="citation" data-cites="skarha2021association">(Skarha et al. 2021)</span>.</p>
<p>These events prompted the <span class="citation" data-cites="sheet2022protecting">(The White House 2022)</span> initiative, <em>Protecting Seniors by Improving Safety and Quality of Care in the Nation’s Nursing Homes</em>, stating, ‘All people deserve to be treated with dignity and respect and to have access to quality medical care.’</p>
<p>However, there are questions that need to be addressed to best protect SNFs and their residents. For example, how resilient are SNFs in extreme climate events? This use case demonstration shows how we built a new statistical product to address this question using the CDE Framework <span class="citation" data-cites="lancaster2023CDE">(Lancaster et al. 2023)</span>.</p>
</section>
<section id="purposes-and-uses" class="level2" data-number="2">
<h2 data-number="2" class="anchored" data-anchor-id="purposes-and-uses"><span class="header-section-number">2</span> Purposes and uses</h2>
<p>A skilled nursing facility (SNF) is a federally regulated nursing facility with the staff and equipment to provide skilled nursing care, skilled rehabilitation services, and other related health services <span class="citation" data-cites="cmsglossary">(Medicare &amp; Medicaid Services 2023)</span>. The context of this use case is to create a baseline picture of SNFs in Virginia and then integrate information on the risk of extreme flood events to assess facility and community preparedness – for example, how likely are the nursing staff<sup>1</sup> to make it to the facility in the event of a flood?</p>
<p>This use case has two parts. The first creates a baseline data picture of SNFs, bringing together data about the residents, nursing staff, and SNF characteristics. The second addresses two issues raised in the <span class="citation" data-cites="sheet2022protecting">(The White House 2022)</span> initiative: emergency preparedness and nurse staffing. We frame these issues into three purpose and use questions with the ultimate goal of creating statistical products that address these questions:</p>
<ol type="1">
<li><p>Can SNF workers get to work during an extreme flood event?</p></li>
<li><p>Are SNFs prepared for a flood emergency?</p></li>
<li><p>Can communities support SNFs during an emergency?</p></li>
</ol>
</section>
<section id="statistical-product-development-stages" class="level2" data-number="3">
<h2 data-number="3" class="anchored" data-anchor-id="statistical-product-development-stages"><span class="header-section-number">3</span> Statistical product development stages</h2>
<p><strong>Subject matter input and literature review</strong></p>
<p>The subject matter experts consulted included nursing facility administrators, SNF resident advocates, demographers, and researchers. Our discussions and literature review informed us of the many federal policies governing SNFs regarding inspections and data reporting requirements (procedural data). In addition, we were told about non-public data sources on residents and SNF staff that were aggregated to the SNF level and provided to the public under a grant from the National Institute on Aging. This information was important since we had yet to come across this source in our data discovery process. The dialogue with experts and our literature review helped us generate a ‘wish list’ of variables that informed our data discovery process, which we visualized as a conceptual data map (see Figure&nbsp;1).</p>
<div id="fig-data" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-data-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-3.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-data-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: Conceptual Data Map Aligned to Purpose and Use: The conceptual data map displays the results of our data discovery. The team identified the data needs informed by expert elicitation and literature review. For this use case, data discovery took three phases: (1) create a data picture of SNF owners, nursing staff, and residents, and the communities the facilities reside in; (2) identify the potential risks of severe flood events, coastal and riverine; and (3) identify the potential weaknesses in the SNF’s and community’s ability to respond.
</figcaption>
</figure>
</div>
<p><strong>Data discovery</strong></p>
<p>Data discovery focused on identifying data sources to address the purpose and use questions and was informed by the conceptual data map.</p>
<p>For the first question – Can SNF workers get to work during an extreme flood event? – we discovered and used proprietary synthetic population, transportation route, and building data sources, along with publicly available flood data. The <a href="https://developer.here.com/documentation">HERE Premium Streets</a> proprietary data include information about roads, such as road type, speed limits, number of lanes, etc. The proprietary synthetic population data and Building Knowledge Base (BKB) are used to identify where SNF workers live and work and to map transportation routes from home to work <span class="citation" data-cites="mortveitNSSAC">(Mortveit et al. 2023)</span>. Publicly available data from the Federal Emergency Management Agency (FEMA) provided flooding risk estimates along the routes from nursing staff homes to the SNF.</p>
<p>For the second question – Are SNFs prepared for a flood emergency? – we used Centers for Medicare &amp; Medicaid Services (CMS) SNF inspection and deficiency data as a proxy for preparedness. We also examined SNF residents’ physical and mental health to assess SNF emergency preparedness. For example, if most residents faced mobility challenges, the SNF would need more resources available during an emergency to move residents to a safer facility. We used data about residents from the Long Term Care Focus <span class="citation" data-cites="brown2022ltcfocus">(LTCFocus 2022)</span> Public Use Data sponsored by the National Institute on Aging (Brown University 2022).</p>
<p>We used data to measure community resilience, assets, and risks by geography at the county, city, and census tract levels to address the third question, Can communities support SNFs during an emergency? These data included:</p>
<ul>
<li>Health professional shortage areas (HRSA 2022)</li>
<li>Shelter facilities and emergency service providers data <span class="citation" data-cites="dhs2022hifld">(Homeland Security: Geospatial Management Office 2022)</span></li>
<li>Community Resilience Indicator Analysis and National Risk Index for Natural Hazards <span class="citation" data-cites="FEMA2022a">(FEMA 2022)</span>.</li>
</ul>
<p>All data are provided in a <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/tree/main/data">GitHub</a> repository along with their metadata, except for the three proprietary data sources. Articles about how the synthetic estimates are constructed are provided for two of these proprietary data sources. The third data source was obtained from a private-sector vendor whose data and documentation are proprietary; a link is provided to their website.</p>
<p><strong>Data ingest and governance</strong></p>
<p>All the public data, metadata, code, statistical products, data processes, and relevant literature on SNF policies and regulations are stored in a <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/tree/main">GitHub</a> repository.</p>
<p>In our experience, data wrangling is the most time-consuming and challenging part of product development. This speaks directly to a benefit of the CDE: once a researcher has wrangled together multiple data sources, the result can be made available to other researchers.</p>
<p>The two predominant data wrangling issues for this use case were reconciling data sources that contain data on the same topic and creating linkages between data sources. For example, we reviewed three hospital data sources:</p>
<ol type="1">
<li><a href="https://hifld-geoplatform.opendata.arcgis.com/">Homeland Security Infrastructure Foundation-Level Data</a> (HIFLD) (DHS 2022)</li>
<li><a href="https://healthdata.gov/dataset/COVID-19-Reported-Patient-Impact-and-Hospital-Capa/6xf2-c3ie">HealthData.gov - COVID-19 Reported Patient Impact and Hospital Capacity by State</a> (HHS 2022)</li>
<li><a href="https://vhha.com/about-virginia-hospitals/">Map of VHHA Hospital and Health System Members</a> (Virginia Hospital &amp; Healthcare Association 2022)</li>
</ol>
<p>We observed inconsistencies and omissions across the three data sources, including:</p>
<ul>
<li>non-standard hospital names and hospital classification types</li>
<li>inconsistent availability of hospital IDs (such as Medicare Provider Number)</li>
<li>conflicting geographic information, including address, latitude, and longitude.</li>
</ul>
<p>We did not attempt to reconcile these inconsistencies for the demonstration but decided to use a single source for shelter facility and emergency service provider data. We used <a href="https://hifld-geoplatform.opendata.arcgis.com/">HIFLD</a> data since they provided the most current data (DHS 2022). The use of these data reinforces the purpose of the use case – to illuminate the challenges in creating statistical products and what the Census Bureau would need to consider.</p>
<p>Similar inconsistencies made it difficult to link data sources using geographic variables. For example, we used shelter facility and emergency service provider data sources from the HIFLD – including hospitals, Red Cross chapter facilities, National Shelter System Facilities, emergency medical service stations, fire stations, and urgent care facilities – to calculate a metric for potential community support. The goal was to place each facility in a Virginia county or independent city. Virginia is divided into 95 counties and 38 independent cities, the latter considered county-equivalents for census purposes, and in some cases a county and a city share the same name (e.g., Richmond County and Richmond City, each in a different location in Virginia). It was necessary to <a href="https://en.wikipedia.org/wiki/Canonicalization">canonicalize</a> the county and city names (when available), which meant aligning upper and lower cases, removing unnecessary characters, and distinguishing between county and city.<sup>2</sup></p>
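<p>As an illustration of the canonicalization step just described (not the authors’ actual code), a minimal Python sketch might normalize raw locality strings while preserving the county versus independent-city distinction; the helper name and input formats are hypothetical:</p>

```python
import re

# Hypothetical helper (not the authors' code): canonicalize Virginia
# locality names so records from different sources can be matched,
# preserving the county vs. independent-city distinction.
def canonicalize_locality(raw: str) -> str:
    name = raw.strip().lower()
    name = re.sub(r"[.,()]", " ", name)                # drop punctuation
    name = re.sub(r"\bva\b|\bvirginia\b", " ", name)   # drop state suffixes
    name = re.sub(r"\s+", " ", name).strip()           # collapse whitespace
    if name.endswith(" city"):
        return name[:-len(" city")].strip().title() + " City"
    if name.endswith(" county"):
        return name[:-len(" county")].strip().title() + " County"
    return name.title()

print(canonicalize_locality("RICHMOND CITY"))        # Richmond City
print(canonicalize_locality("Richmond county, VA"))  # Richmond County
```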
<p>A further challenge in locating shelter facilities and emergency service providers within a county or independent city was that the sources used different variables to identify location (latitude and longitude, address, ZIP code<sup>3</sup>, Federal Information Processing Standard (FIPS) code, and county/city name). In cases where a data source had only a ZIP or FIPS code, a Department of Housing and Urban Development crosswalk was used to link the two codes; in other cases, a crosswalk that linked non-independent cities and towns to counties was used; and in others, a crosswalk that linked FIPS codes to counties and independent cities. Researchers would benefit from exhaustive crosswalks between all variables on the same topic, such as location variables, facility names, and identification numbers, to reduce the time spent on data wrangling.</p>
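<p>A hedged sketch of this crosswalk linkage, assuming a HUD-style ZIP-to-county table in which a ZIP can span several counties with a residential-address share (<code>res_ratio</code>); all names and values are illustrative:</p>

```python
import pandas as pd

# Illustrative linkage (not the authors' code): facilities with only a ZIP
# are mapped to county-level FIPS codes via a HUD-style crosswalk.
facilities = pd.DataFrame({
    "facility": ["Shelter A", "Fire Station B"],
    "zip": ["22401", "23185"],
})
# Hypothetical crosswalk rows: a ZIP can span several counties, with
# res_ratio giving each county's share of the ZIP's residential addresses.
crosswalk = pd.DataFrame({
    "zip": ["22401", "22401", "23185"],
    "county_fips": ["51630", "51177", "51830"],
    "res_ratio": [0.8, 0.2, 1.0],
})
# Keep the dominant county for each ZIP, then link.
best = crosswalk.loc[crosswalk.groupby("zip")["res_ratio"].idxmax()]
linked = facilities.merge(best[["zip", "county_fips"]], on="zip", how="left")
print(linked)
```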
<p>Data products related to popular indices, such as climate disaster risk and community resilience, are operationalized differently across the various departments and agencies within the federal and state governments and the private and non-profit sectors. It is an enormous task to review the methodology and technical reports (if available) to understand their differences and decide which versions are most relevant (fitness-for-purpose) for a particular use case. After reviewing the options for this use case, we determined that the National Risk Index for riverine and coastal floods from FEMA was the best option for climate risk estimates. The detailed technical report, <em>National Risk Index Technical Document</em> <span class="citation" data-cites="FEMA2021risk">(FEMA 2021)</span>, provides a clear assessment of the assumptions and limitations of the data and a description of how the risk estimates were derived. Researchers would benefit from guidance on the numerous constructions of indices on the same topic. A use case on a specific index topic could be used to highlight differences and similarities among indices, which would help with data wrangling and fitness-for-use. Ideally, the use case could benchmark the various constructions and provide a statistical assessment.</p>
<section id="question-1-can-snf-workers-get-to-work-during-an-extreme-flooding-event" class="level3" data-number="3.1">
<h3 data-number="3.1" class="anchored" data-anchor-id="question-1-can-snf-workers-get-to-work-during-an-extreme-flooding-event"><span class="header-section-number">3.1</span> <strong>Question 1: Can SNF workers get to work during an extreme flooding event?</strong></h3>
<p>Sufficient nursing staffing is critical to ensuring resident safety and quality of care.</p>
<p>Since proprietary synthetic population data and commercial sector digitized mapping data were used to construct the routes SNF nursing staff are likely to take from home to work, only an outline of the computational process used to identify the routes is provided. Publicly available data from FEMA were used to estimate flooding risk along a particular route. Below is a general description of the modeling steps and the proprietary data used to assess SNF vulnerability as a function of the nursing staff’s inability to report to work due to the transportation infrastructure <span class="citation" data-cites="choupani2016population">(Choupani and Mamdoohi 2016)</span>.</p>
<p><strong>Computational modules</strong></p>
<p>Here is the basic outline of the process that uses proprietary data that starts at network construction and ends with routes. For more details, see the GitHub repository: <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/blob/main/documents/products/processes/commute_vulnerability/algorithm.md">Vulnerability of SNFs concerning Commuting</a>.</p>
<ol type="1">
<li>Extract network data from HERE (2021 Q1 in this use case).</li>
<li>Process the extracted data to form a network suitable for routing. This includes inference of speed limits for road links where such data is missing.</li>
<li>Prepare origin-destination pairs. In this case, the list pairs each worker’s home and work locations. The person is constructed in the synthetic population pipeline, and residences and workplaces are derived through the data fusion process used to construct the NSSAC building database.</li>
<li>Construct routes using the Quest router.</li>
</ol>
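<p>The routing step itself relies on proprietary data and routers, but the underlying idea is shortest-path search over a weighted road network. A toy stand-in, with hypothetical locations and travel times:</p>

```python
import heapq

# Toy Dijkstra router standing in for the proprietary routers named above.
# The road network, locations, and travel times (minutes) are hypothetical.
def shortest_route(graph, origin, dest):
    # graph maps node -> list of (neighbor, minutes) edges
    heap = [(0.0, origin, [origin])]
    visited = set()
    while heap:
        cost, node, path = heapq.heappop(heap)
        if node == dest:
            return cost, path
        if node in visited:
            continue
        visited.add(node)
        for nxt, minutes in graph.get(node, []):
            if nxt not in visited:
                heapq.heappush(heap, (cost + minutes, nxt, path + [nxt]))
    return float("inf"), []

roads = {
    "home_1": [("junction", 12.0), ("bypass", 9.0)],
    "junction": [("snf", 8.0)],
    "bypass": [("snf", 15.0)],
}
cost, route = shortest_route(roads, "home_1", "snf")
print(cost, route)  # 20.0 ['home_1', 'junction', 'snf']
```

<p>A production router handles millions of such origin-destination pairs in parallel over a statewide network; the principle is the same.</p>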
<p>Once the routes to an SNF were established, the expected number of nursing staff at an SNF during a flood event could be calculated as the sum of the probabilities of each worker being able to commute to work during a flood event. A computational model was developed using the following data:</p>
<ul>
<li>SNF locations in Virginia from the Centers for Medicare &amp; Medicaid Services (CMS);</li>
<li>Home locations of workers at each SNF assigned from the synthetic population and Building Knowledge Base <span class="citation" data-cites="beckman1996creating mortveitNSSAC">(Beckman et al. 1996; Mortveit et al. 2023)</span>;</li>
<li>Virginia road networks; and</li>
<li>FEMA census tract-level riverine and coastal flood risks.</li>
</ul>
<p>Using router software, we computed each nursing staff member’s likely route to their SNF from the HERE Virginia road network. Routers are commonly used within transportation and traffic simulators. The router software used for this demonstration is a highly parallelizable router previously developed at BI NSSAC, known as the Simba router <span class="citation" data-cites="barrett2013planning">(<span class="nocase">Barrett et al.</span> 2013)</span>.</p>
<p>The FEMA risk data provide the riverine and coastal flood risks for each census tract in Virginia. Given the routes, the FEMA riverine and coastal flood risks were used to estimate the probability of the nursing staff making it to work. The FEMA technical document <em>National Risk Index Technical Document</em> <span class="citation" data-cites="FEMA2021risk">(FEMA 2021)</span> describes how natural hazard risks are calculated. We use these risk estimates, which range from 0 to 100, divided by 100, as a proxy for the probability that a worker cannot reach the SNF. A risk of zero means there is zero probability of being unable to reach the SNF due to an extreme flood event; a risk of 100 indicates the roads are underwater, and the probability of being unable to reach the SNF is one. The maximum risks along transportation routes leading to an SNF range from 0 to 47 for riverine flooding and 0 to 40 for coastal flooding. We take the combined value of the maximum riverine and coastal flood risks along a worker’s transportation routes, divided by 100, as the worker’s probability of not getting to work during a flooding event.</p>
<p>Since we do not have data on the exact home locations of the nursing staff, we estimated how many could reach the facility by taking a random sample (whose size is the CMS average daily nursing staff<sup>4</sup> for an SNF) from the possible routes identified using the HERE Virginia road network. We calculated the average with a 95% nonparametric confidence interval. The 283 SNFs used in our research have a combined average daily nursing staff of 12,609. Using the above approach, we estimated that 10,005 (95% CI: 9,013, 10,700), or 79%, can get to work during an extreme flood event. The percentage of an individual SNF’s nursing staff who can make it to work ranges from 48% to 93%.</p>
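<p>The sampling procedure above can be sketched as follows; the route risks here are synthetic stand-ins for the FEMA-derived values, and the percentile interval is one common way to form a nonparametric confidence interval:</p>

```python
import random
import statistics

random.seed(1)  # deterministic for the sketch

# Synthetic maximum combined flood risks (0-100) for 101 possible routes;
# real values come from FEMA riverine + coastal risks along HERE routes.
route_risks = [random.uniform(0, 67) for _ in range(101)]
daily_staff = 41  # e.g., a CMS average daily nursing staff count

def expected_staff(risks, n):
    sample = random.sample(risks, n)        # one route per worker
    # risk / 100 = probability of NOT reaching the SNF
    return sum(1 - r / 100 for r in sample)

# Repeat the sampling and take percentiles for a nonparametric interval.
draws = sorted(expected_staff(route_risks, daily_staff) for _ in range(2000))
estimate = statistics.mean(draws)
lo, hi = draws[int(0.025 * len(draws))], draws[int(0.975 * len(draws))]
print(f"expected staff ~ {estimate:.0f}, 95% CI [{lo:.0f}, {hi:.0f}]")
```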
<p>Figure&nbsp;2 visualizes this analysis for the 283 SNFs ordered by the observed average daily nursing staff numbers at the facility from smallest to largest, displayed using the orange line. The black line indicates the expected number in an extreme flood event and the 95% nonparametric confidence interval (grey band). The code for Figure&nbsp;2 is provided in the <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/blob/main/source_code/analyses/VA_Probability_of_Getting_to_SNF.R">GitHub</a> repository.</p>
<div id="fig-ns" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ns-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-4.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="2000">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ns-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: SNF Average Observed and Expected Average Daily Nursing Staff Numbers: The horizontal axis is ordered by the size of the nursing staff at the facility from smallest to largest. The orange line displays the observed average daily nursing staff numbers. The black line displays the estimated numbers in the event of an extreme coastal and/or riverine flood event. The grey band is the 95% nonparametric confidence interval.
</figcaption>
</figure>
</div>
<p>For example, in King George County, the SNF is Heritage Hall King George (Federal Provider Number 495300 in Figure&nbsp;3), located near the Potomac River, which opens to the Chesapeake Bay. According to CMS, the Heritage Hall King George facility has an average daily skilled nursing staff of 41. Using the HERE Virginia road network, we identified 101 routes the staff could use to reach the facility. The combined maximum coastal and riverine flood risks along these routes ranged from 5.6 to 66.7; a random sample of 41 from the 101 routes gives an average probability of reaching the facility of 0.74, with a 95% nonparametric confidence interval of [0.65, 0.80]. These were used to estimate the average number of nursing staff at the facility during a flood event, 30, along with a 95% nonparametric confidence interval of [14, 38]. Publicly available data from the Federal Emergency Management Agency (FEMA) provided flooding risk estimates along the routes from the nursing staff’s homes to the SNF, along with proprietary road and building information.</p>
<div id="fig-map" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-map-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-5.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-map-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: An Example of Nursing Staff Routes to Heritage Hall King George SNF: Routes that workers can take to work at Heritage Hall King George SNF, FPN 495300 (identified with the black oval). The risk level of each road is identified with a color: low (blue), medium-low (yellow), medium (orange), medium-high (red), and high (dark red). The risk scores are used to calculate the probability of a worker getting to work during an extreme flood event using publicly available FEMA data and proprietary road and building data.
</figcaption>
</figure>
</div>
</section>
<section id="question-2.-are-snfs-prepared-for-emergencies" class="level3" data-number="3.2">
<h3 data-number="3.2" class="anchored" data-anchor-id="question-2.-are-snfs-prepared-for-emergencies"><span class="header-section-number">3.2</span> <strong>Question 2. Are SNFs prepared for emergencies?</strong></h3>
<p>To address this question, we examined how prepared SNFs are for emergencies using annual inspection and deficiency data as a proxy for preparedness. CMS issues deficiencies to SNFs that fail to meet federal Medicare and Medicaid preparedness standards. Every deficiency is classified into one of 12 categories based on the scope and severity of the deficiency. There are two broad types of non-health-related deficiencies:</p>
<ul>
<li><p>Emergency Preparedness Deficiencies – There are four elements of emergency preparedness. They cover an emergency plan, policies and procedures, a communication plan, and training and testing.</p></li>
<li><p>Fire Life Safety Code – A set of fire protection requirements designed to provide a reasonable degree of safety from fire. They cover construction, protection, and operational features designed to provide safety from fire, smoke, and panic.</p></li>
</ul>
<p>We calculated separate Emergency Preparedness and Fire Life Safety Code deficiency indices, then combined them into a single index to measure SNF preparedness and distinguish between high- and low-performing SNFs. The computation of the indices has four steps.</p>
<ol type="1">
<li><p><em>Number of deficiencies</em>: For each SNF, the total number of deficiencies over the period 2018-2022 was divided by the number of SNF inspections over the same period to estimate the average number of deficiencies per inspection.</p></li>
<li><p><em>Time to resolve deficiencies</em>: We next computed the average number of days it took to resolve each deficiency.</p></li>
<li><p><em>Scope and severity of deficiencies</em>: We then transformed the deficiency letter inspection rating for scope and severity to a numerical weight using the CMS technical guide, <em>Care Compare Nursing Home Five-Star Quality Rating System</em> <span class="citation" data-cites="CMS2022design">(Medicare &amp; Medicaid Services 2022)</span>, and averaged the ratings.</p></li>
<li><p>The estimates from these three steps were summed to compute separate Emergency Preparedness and Fire Life Safety Code deficiency indices (see Figure&nbsp;4) and are provided for reuse in a .csv file on <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/blob/main/documents/products/processes/derived_variables/va_snf_deficiency_indices_k_e.csv">GitHub</a>.</p></li>
</ol>
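<p>A minimal sketch of these four steps in pandas might look as follows. The table of deficiency citations, its column names, and the inspection counts are all hypothetical toy values, not the actual CMS file layout.</p>

```python
import pandas as pd

# Hypothetical deficiency citations: one row per deficiency per SNF.
# Column names and values are illustrative only.
defs = pd.DataFrame({
    "snf_id": ["A", "A", "A", "B"],
    "days_to_resolve": [30, 10, 50, 5],     # step 2 input
    "severity_weight": [4, 1, 8, 1],        # step 3: numeric weight from the letter rating
})
inspections = pd.Series({"A": 3, "B": 2})   # inspections per SNF over the period

per_snf = defs.groupby("snf_id").agg(
    n_deficiencies=("days_to_resolve", "size"),
    mean_days=("days_to_resolve", "mean"),       # step 2
    mean_severity=("severity_weight", "mean"),   # step 3
)
# Step 1: average number of deficiencies per inspection (aligns on snf_id).
per_snf["defs_per_inspection"] = per_snf["n_deficiencies"] / inspections

# Step 4: sum the three components into a single deficiency index.
per_snf["index"] = (
    per_snf["defs_per_inspection"] + per_snf["mean_days"] + per_snf["mean_severity"]
)
```

<p>In the study, the same computation would be run separately for the Emergency Preparedness and the Fire Life Safety Code deficiencies before the two indices are combined.</p>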
<p>Figure&nbsp;4 displays the results of an exploratory data analysis for each index. These analyses assessed fitness-for-use; we wanted to construct an indicator with sufficient variability to discriminate between high- and low-performing SNFs. Figure&nbsp;4 shows that we accomplished this: there are SNFs with indices well outside the main body of the data. We then summed the Emergency Preparedness and Fire Life Safety Code indices and categorized the result into high, medium, low, and no deficiencies.</p>
<div id="fig-def" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-def-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-6.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="900">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-def-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;4: Exploratory Data Analysis Visualizations for the Emergency Preparedness and Fire Life Safety Code Deficiencies
</figcaption>
</figure>
</div>
</section>
<section id="question-3-can-communities-support-snfs-during-emergencies" class="level3" data-number="3.3">
<h3 data-number="3.3" class="anchored" data-anchor-id="question-3-can-communities-support-snfs-during-emergencies"><span class="header-section-number">3.3</span> <strong>Question 3: Can communities support SNFs during emergencies?</strong></h3>
<p>To answer this question, we computed a community resiliency index using the US Census American Community Survey and the guidance provided by the Homeland Security document <em>Community Resilience Indicator Analysis: County-Level Analysis of Commonly Used Indicators from Peer-Reviewed Research</em> <span class="citation" data-cites="edgemon2018community">(Edgemon et al. 2018)</span>. The index was constructed by summing the county (census tract) level percentages for the following variables:</p>
<ul>
<li>fraction employed</li>
<li>fraction with no disability</li>
<li>fraction with a high school diploma or greater</li>
<li>fraction of households with at least one vehicle</li>
<li>reversed Gini index, so all indicators point in a positive direction</li>
</ul>
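<p>As a rough illustration, the summation (with the Gini coefficient reversed so that larger values indicate more resilience) can be written as below. The tract identifiers, variable names, and percentages are invented for the example and are not drawn from the ACS.</p>

```python
import pandas as pd

# Hypothetical tract-level percentages (0-100 scale); values are illustrative.
acs = pd.DataFrame({
    "tract": ["51159.01", "51760.02"],
    "pct_employed": [58.0, 62.0],
    "pct_no_disability": [85.0, 90.0],
    "pct_hs_or_more": [88.0, 92.0],
    "pct_households_vehicle": [91.0, 78.0],
    "gini": [0.45, 0.40],                   # Gini on its usual 0-1 scale
}).set_index("tract")

# Reverse the Gini so all indicators point in a positive direction,
# and put it on the same 0-100 scale as the other components.
acs["gini_reversed"] = (1.0 - acs["gini"]) * 100.0

components = ["pct_employed", "pct_no_disability", "pct_hs_or_more",
              "pct_households_vehicle", "gini_reversed"]
acs["resilience_index"] = acs[components].sum(axis=1)
```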
<p>Figure&nbsp;5 displays the combined deficiency index (Emergency Preparedness plus Fire Life Safety Code) for each SNF, overlaid on a choropleth map of the community resilience index at the census tract level. We also examined the number of shelter facilities and emergency service providers and the availability of medical staff per 10,000 residents. We constructed isochrones to establish the travel time from each SNF to these potential sources of support. Working on this component of the use case highlighted the need for cross-agency data, pointing to the utility of future strategic partnering between the US Census Bureau, CMS, and FEMA.</p>
<div id="fig-cri" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cri-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-7.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cri-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;5: 2020 Population Resilience Composite Index for Virginia Census Tracts: The light yellow tracts are the least resilient, and the dark green are the most resilient. The locations of the 283 SNFs are identified with filled circles; orange circles are those with the highest deficiency index and grey circles are those with no deficiencies.
</figcaption>
</figure>
</div>
<p>In addition to describing the population using a resilience index, we also developed a measure to present the number of shelter facilities and emergency service providers (data from Homeland Security / Homeland Infrastructure Foundation-Level Data) and the availability of medical doctors (MDs) and doctors of osteopathic medicine (DOs) who provide direct patient care (HRSA 2022) (Figure&nbsp;6).</p>
<p>The number of MDs and DOs is used to designate primary care health professional shortage areas. HRSA defines these as contiguous areas where primary medical care professionals are overutilized, excessively distant, or otherwise inaccessible to the population of the area under consideration. Figure&nbsp;6 (bottom) shows that approximately one-third of the counties and independent cities have health professional shortage areas across their entire boundary, and another 40 percent have shortages within parts of their boundaries.</p>
<div id="fig-help" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-help-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/figure-8.png" class="img-fluid quarto-figure quarto-figure-center figure-img" width="1000">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-help-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;6: Assessment of the number of shelter facilities and emergency service providers per 10,000 population (top) and medically underserved areas (bottom): On both maps, the lighter the color, the more in need is the population of shelter facilities and emergency services (top chart) or health professionals (bottom chart). The location of the 283 SNFs are identified with filled circles, orange circles are those with the highest deficiency index and grey circles are those with no deficiencies.
</figcaption>
</figure>
</div>
</section>
</section>
<section id="guiding-principles-for-ethical-transparent-reproducible-statistical-product-development-and-dissemination." class="level2" data-number="4">
<h2 data-number="4" class="anchored" data-anchor-id="guiding-principles-for-ethical-transparent-reproducible-statistical-product-development-and-dissemination."><span class="header-section-number">4</span> Guiding principles for ethical, transparent, reproducible statistical product development and dissemination.</h2>
<p><strong>Communication</strong></p>
<p>We communicated results throughout the Demonstration Use Case research with our Census CDE Working Group (composed of former Census Bureau Directors, a former Communications Director, and academic and industry census experts), with the Census Bureau, and at conferences such as the annual Federal Committee on Statistical Methodology meeting, and we shared drafts to seek input and ideas. The discussions and presentations helped to shape ideas and advance our thinking about how best to address the purpose and use questions.</p>
<p><strong>Stakeholder engagement</strong></p>
<p>We engaged stakeholders by sharing our research and results through conference presentations at the American Community Survey Data Users Conference and the Applied Public Data Users Conference.&nbsp;We also shared this demonstration project at Listening Sessions with stakeholders as an example of statistical product development. The Listening Sessions bring together 7 to 12 stakeholders by topic (e.g., children’s health) or function (e.g., state demographers) to seek their ideas for new statistical products.</p>
<p><strong>Equity and ethics</strong></p>
<p>As described in the Introduction, ethics and equity issues drew us to develop this Use Case. Here we focus on equity and ethics vis-a-vis the data choices and analyses. With regard to ethical considerations in our data discovery process, fitness-for-purpose evaluation, and analyses, two questions arose:</p>
<ol type="1">
<li><p>What role does synthetic data have to play, and how do you benchmark it to evaluate fitness-for-purpose?</p></li>
<li><p>How do you construct and evaluate an index with the goal of identifying vulnerable populations?</p></li>
</ol>
<p>Realizing the importance of nursing staff levels, we questioned whether the synthetic data were biased and unrepresentative of SNF residents and employees. We benchmarked the synthetic SNF nursing staff numbers against those submitted quarterly to CMS and observed they were biased low, so we decided to use the CMS data. These data were used to estimate the average number of nursing staff who could reach the facility during an extreme flood event (Figure&nbsp;2).</p>
<p>In this use case, we were fortunate to have the “truth” to benchmark the synthetic data for the average daily nursing staff at each SNF. This was not the case for the home locations of the nursing staff; since we had no way to benchmark the synthetic locations, we did not use them. Ideally, we would use the actual addresses of SNF employees. Instead, we used a simulation to estimate the average risks over routes leading to the SNF. This approach could be replaced with (or benchmarked against) the Census commuting data sets (eg, <a href="https://www.census.gov/topics/employment/commuting/guidance/flows.html">Commuting Flows</a> or the <a href="https://lehd.ces.census.gov/data/">LEHD Origin-Destination Employment Statistics</a>), with the home census tract used as the starting point for each worker. For both the number of nursing staff and their home locations, it is impossible to identify potential biases that would result in the inequitable allocation of emergency rescue resources without a thorough understanding of how the synthetic data were generated.</p>
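<p>The benchmarking step described above amounts to a simple bias check: compare the synthetic estimates against the reported values for the same facilities. A sketch with made-up staffing numbers follows (the actual comparison used the quarterly CMS submissions):</p>

```python
import numpy as np

# Hypothetical average daily nursing staff for four SNFs; values are invented.
synthetic = np.array([40.0, 55.0, 32.0, 70.0])   # synthetic-data estimates
cms       = np.array([45.0, 60.0, 35.0, 76.0])   # CMS-reported "truth"

bias = (synthetic - cms).mean()     # mean error; negative => biased low
rel_bias = bias / cms.mean()        # bias relative to the CMS mean

print(f"mean bias: {bias:.2f} staff ({rel_bias:.1%} of the CMS mean)")
```

<p>A consistently negative mean error, as found for the synthetic staffing counts, is what motivated substituting the CMS data.</p>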
<p>How one evaluates the equity of an index is a more challenging task. Questions that need to be addressed include:</p>
<ol type="1">
<li><p>How do you select the variables used to construct an indicator to guide an equitable allocation of technical assistance?</p></li>
<li><p>What relationship between these variables is important?</p></li>
<li><p>What are the differences across the numerous publicly available resilience estimators? Do some lead to a more equitable allocation of technical assistance in the event of an extreme climate event?</p></li>
<li><p>How do you validate a resilience estimator?</p></li>
</ol>
<p>The technical document <em>Community Resilience Indicator Analysis: County-Level Analysis of Commonly Used Indicators from Peer-Reviewed Research</em> <span class="citation" data-cites="edgemon2018community">(Edgemon et al. 2018)</span> identified the 20 most commonly selected variables for constructing resilience estimators from peer-reviewed research. Future research will need to validate these indices against past extreme climate events.</p>
<p><strong>Privacy and confidentiality</strong></p>
<p>We did not conduct a full disclosure review. However, some data are proprietary and could not be released; instead, we describe how those data were used.</p>
<p><strong>Dissemination</strong></p>
<p>We disseminated the final version of the use case in the University of Virginia Libra Open repository <span class="citation" data-cites="lancaster2023CDE">(Lancaster et al. 2023)</span>.</p>
<p><strong>Curation</strong></p>
<p>Curation involves documenting all steps of the process so that they can be repeated, validated, reused, or extended. The final report explains the process in words. Curation must also provide the data, metadata, source code, and products. This led us to construct a GitHub repository. A <a href="https://github.com/uva-bi-sdad/census_cde_demo_2/blob/main/README.pdf">README</a> file guides the reader through the material and provides instructions for replicating the research results. Note that the README file must be downloaded for the hyperlinks to work.</p>
</section>
<section id="using-the-snf-statistical-product" class="level2" data-number="5">
<h2 data-number="5" class="anchored" data-anchor-id="using-the-snf-statistical-product"><span class="header-section-number">5</span> Using the SNF statistical product</h2>
<p>This potential statistical product has many uses. Federal policymakers and administrators regulate SNFs; however, they do not always realize the cost impacts and the need for increased resources to meet these regulations. For example, by reviewing the aggregate inspection deficiency metrics, policymakers can target resources where they are most needed. Providing additional funding to pay workers more, improve their facilities, and address inspection deficiencies would improve the quality of SNFs.</p>
<p>The media and advocacy groups play a role in highlighting good and bad cases of SNF care or where communities do not have adequate assets to support SNFs during an emergency event. For example, a <em>New Yorker</em> article <span class="citation" data-cites="rafiei2022private">(Rafiei 2022)</span> highlighted how nursing homes decline dramatically when bought by private equity owners. The GAO (September 22, 2023) recently identified the need for more information about private equity ownership in CMS data – a gap that CMS needs to address. And, of course, researchers and analysts are essential for conducting research that leads to creating and improving statistical products around SNFs. By releasing a regularly scheduled SNF statistical product, the changes in SNFs over time can be monitored.</p>
</section>
<section id="what-cde-capabilities-have-this-use-case-demonstrated" class="level2" data-number="6">
<h2 data-number="6" class="anchored" data-anchor-id="what-cde-capabilities-have-this-use-case-demonstrated"><span class="header-section-number">6</span> What CDE capabilities has this use case demonstrated?</h2>
<p>As demonstrated by this use case, the CDE Framework is a powerful process for guiding and curating the development of statistics to address complex purposes and uses. Additionally, use cases help illuminate technical capabilities that should be present in the data enterprise to facilitate and accelerate the reuse of data and methods in the development and dissemination of new statistical products.</p>
<p>This CDE demonstration is the first of many use cases needed to define and develop CDE capabilities. Underlying each use case is the curation process. Curation documents each step, including decisions that may involve trade-offs. Curation preserves and adds value to the data. This includes organizing to facilitate data discovery and easy access; providing metadata to enable the reuse in scientific and programmatic research; enhancing the value of the data enterprise through linkages between datasets; and mapping the network of interconnections between datasets, research outputs, researchers, and institutions. Over time, a searchable curation system will be needed as a foundation for creating statistical products in the CDE.</p>
<p>The types of products from a use case that can benefit the larger community are only limited by the creativity of the researchers and stakeholders carrying out the use case. The products from this use case are reusable code; integrated data sets across diverse topics for each SNF; maps and other visualizations; statistical products such as SNF deficiency indices and various indices that measure community and SNF resilience; the probability of a worker reaching an SNF in the event of extreme flooding; and a GitHub repo that provides easy access to all these products plus relevant metadata, literature, and government documents and regulations.</p>
<p>Conducting this use case has been an eye-opening experience regarding the amount and quality of publicly available data for addressing our research questions. The statistical capabilities and products flowing from diverse use cases can only be identified as the program progresses.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/08/what-is-CDE-2.html">← Part 2: What is the CDE?</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/22/development-plan-2.html">Part 4: Census Curated Data Enterprise Environment →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Vicki Lancaster</strong> is a statistician with expertise in experimental design, linear models, computation, visualizations, data analysis, and interpretation. She works with scientists at federal agencies on projects requiring statistical skills and creativity, eg, defining skilled technical workforce using novel data sources.
</dd>
<dd>
<strong>Stephanie Shipp</strong> leads the Curated Data Enterprise research portfolio and collaborates with the US Census. She is an economist with experience in data science, survey statistics, public policy, innovation, ethics, and evaluation.
</dd>
<dd>
<strong>Sallie Keller</strong> is the Chief Scientist and Associate Director of Research and Methodology at the US Census Bureau. She is a statistician with research interest in social and decision informatics, statistics underpinnings of data science, and data access and confidentiality. Sallie Keller was at the University of Virginia when this work was conducted.
</dd>
<dd>
<strong>Aaron Schroeder</strong> has experience in the technologies and related policies of information and data integration and systems analysis, including policy and program development and implementation.
</dd>
<dd>
<strong>Henning Mortveit</strong> develops massively interacting systems and the mathematics supporting rigorous analysis and understanding of their stability and resiliency.
</dd>
<dd>
<strong>Samarth Swarup</strong> conducts research in computational social science, resiliency and sustainability, and simulation analytics.
</dd>
<dd>
<strong>Dawen Xie</strong> develops geographic information systems, visual analytics, information management systems, and databases, with a current focus on building dynamic web systems.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Stephanie Shipp
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://www.shutterstock.com/g/Ground+Picture">Ground Picture</a> on <a href="https://www.shutterstock.com/image-photo/lovely-nurse-assisting-senior-man-get-2006404274">Shutterstock</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Lancaster V, Shipp S, Keller S et al.&nbsp;(2024). “Translating the Curated Data Model into Practice - climate resiliency of skilled nursing facilities” Real World Data Science, November 19, 2024. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/use-case-2.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-barrett2013planning" class="csl-entry">
<span class="nocase">Barrett, Christopher, Keith Bisset, Shridhar Chandan, et al.</span> 2013. <span>“Planning and Response in the Aftermath of a Large Crisis: An Agent-Based Informatics Framework.”</span> <em>2013 Winter Simulations Conference (WSC)</em>, 1515–26.
</div>
<div id="ref-beckman1996creating" class="csl-entry">
Beckman, Richard J, Keith A Baggerly, and Michael D McKay. 1996. <span>“Creating Synthetic Baseline Populations.”</span> <em>Transportation Research Part A: Policy and Practice</em> 30 (6): 415–29.
</div>
<div id="ref-choupani2016population" class="csl-entry">
Choupani, Abdoul-Ahad, and Amir Reza Mamdoohi. 2016. <span>“Population Synthesis Using Iterative Proportional Fitting (IPF): A Review and Future Research.”</span> <em>Transportation Research Procedia</em> 17: 223–33.
</div>
<div id="ref-dosa2008controversy" class="csl-entry">
Dosa, David M, Kathryn Hyer, Lisa M Brown, Andrew W Artenstein, LuMarie Polivka-West, and Vincent Mor. 2008. <span>“The Controversy Inherent in Managing Frail Nursing Home Residents During Complex Hurricane Emergencies.”</span> <em>Journal of the American Medical Directors Association</em> 9 (8): 599–604. <a href="https://pubmed.ncbi.nlm.nih.gov/19083295/">https://pubmed.ncbi.nlm.nih.gov/19083295/</a>.
</div>
<div id="ref-edgemon2018community" class="csl-entry">
Edgemon, Lesley, Carol Freeman, Carmella Burdi, Trail, and Kyle Pfeiffer. 2018. <span>“Community Resilience Indicator Analysis: County-Level Analysis of Commonly Used Indicators from Peer-Reviewed Research.”</span> <em>Argonne National Laboratory</em>. <a href="https://www.researchgate.net/publication/331232094_Community_Resilience_Indicator_Analysis_County-Level_Analysis_of_Commonly_Used_Indicators_From_Peer-Reviewed_Research">https://www.researchgate.net/publication/331232094_Community_Resilience_Indicator_Analysis_County-Level_Analysis_of_Commonly_Used_Indicators_From_Peer-Reviewed_Research</a>.
</div>
<div id="ref-FEMA2021risk" class="csl-entry">
FEMA. 2021. <span>“National Risk Index Technical Documentation.”</span> Federal Emergency Management Agency. <a href="https://www.fema.gov/sites/default/files/documents/fema_national-risk-index_technical-documentation.pdf">https://www.fema.gov/sites/default/files/documents/fema_national-risk-index_technical-documentation.pdf</a>.
</div>
<div id="ref-FEMA2022a" class="csl-entry">
FEMA. 2022. <span>“Community Resilience Indicator Analysis: Commonly Used Indicators from Peer-Reviewed Research: Updated for Research Published 2003-2021.”</span> Federal Emergency Management Agency. <a href="https://www.fema.gov/sites/default/files/documents/fema_2022-community-resilience-indicator-analysis.pdf">https://www.fema.gov/sites/default/files/documents/fema_2022-community-resilience-indicator-analysis.pdf</a>.
</div>
<div id="ref-dhs2022hifld" class="csl-entry">
Homeland Security: Geospatial Management Office, Department of. 2022. <span>“Homeland Security Infrastructure Foundation-Level Data Open Data.”</span> <a href="https://hifld-geoplatform.opendata.arcgis.com/">https://hifld-geoplatform.opendata.arcgis.com/</a>.
</div>
<div id="ref-hyer2006establishing" class="csl-entry">
Hyer, Kathryn, Lisa M Brown, Amy Berman, and LuMarie Polivka-West. 2006. <span>“Establishing and Refining Hurricane Response Systems for Long-Term Care Facilities: The John a. Hartford Foundation Was the Lead Funder of a Hurricane Summit to Focus on the Neglected Needs of the Elderly.”</span> <em>Health Affairs</em> 25 (Suppl1): W407–11. <a href="https://www.healthaffairs.org/doi/full/10.1377/hlthaff.25.w407?casa_token=XbJ2j-CdtssAAAAA:USJMJsZq_jlYlQlASQt4O4OYJcq_AOKjpXOx5tTMUIZxoNVXZCzj1_ejtQyLHrnTg6B1BygFuuGZ">https://www.healthaffairs.org/doi/full/10.1377/hlthaff.25.w407?casa_token=XbJ2j-CdtssAAAAA:USJMJsZq_jlYlQlASQt4O4OYJcq_AOKjpXOx5tTMUIZxoNVXZCzj1_ejtQyLHrnTg6B1BygFuuGZ</a>.
</div>
<div id="ref-lancaster2023CDE" class="csl-entry">
Lancaster, V., S. Shipp, S. Keller, et al. 2023. <em>Census Curated Data Enterprise Use Case Demonstration: Climate Resiliency of Skilled Nursing Facilities</em>. TR 2023-53. <a href="https://doi.org/10.18130/ce97-sp05">https://doi.org/10.18130/ce97-sp05</a>.
</div>
<div id="ref-brown2022ltcfocus" class="csl-entry">
LTCFocus, Brown University. 2022. <span>“Who We Are.”</span> <a href="https://ltcfocus.org/about">https://ltcfocus.org/about</a>.
</div>
<div id="ref-CMS2022design" class="csl-entry">
Medicare &amp; Medicaid Services, Centers for. 2022. <span>“Design for Care Compare Nursing Home Five-Star Quality Rating System: Technical Users’ Guide.”</span> <a href="https://www.cms.gov/medicare/provider-enrollment-and-certification/certificationandcomplianc/downloads/usersguide.pdf">https://www.cms.gov/medicare/provider-enrollment-and-certification/certificationandcomplianc/downloads/usersguide.pdf</a>.
</div>
<div id="ref-cmsglossary" class="csl-entry">
Medicare &amp; Medicaid Services, Centers for. 2023. <span>“CMS Glossary.”</span> <a href="https://www.cms.gov/glossary?term=skilled+nursing+facility&amp;items_per_page=10&amp;viewmode=grid ">https://www.cms.gov/glossary?term=skilled+nursing+facility&amp;items_per_page=10&amp;viewmode=grid </a>.
</div>
<div id="ref-mortveitNSSAC" class="csl-entry">
Mortveit, H., D. Xie, and M. Marathe. 2023. <em>NSSAC Building Knowledge Base: Modeling and Implementation</em>.
</div>
<div id="ref-rafiei2022private" class="csl-entry">
Rafiei, Y. 2022. <span>“When Private Equity Takes over a Nursing Home.”</span> <em>New Yorker</em> 2022: 333. <a href="https://www.newyorker.com/news/dispatch/when-private-equity-takes-over-a-nursing-home">https://www.newyorker.com/news/dispatch/when-private-equity-takes-over-a-nursing-home</a>.
</div>
<div id="ref-skarha2021association" class="csl-entry">
Skarha, Julianne, Lily Gordon, Nazmus Sakib, et al. 2021. <span>“Association of Power Outage with Mortality and Hospitalizations Among Florida Nursing Home Residents After Hurricane Irma.”</span> <em>JAMA Health Forum</em> 2: e213900–213900. <a href="https://jamanetwork.com/journals/jama-health-forum/fullarticle/2786665">https://jamanetwork.com/journals/jama-health-forum/fullarticle/2786665</a>.
</div>
<div id="ref-sheet2022protecting" class="csl-entry">
The White House. 2022. <span>“Protecting Seniors by Improving Safety and Quality of Care in the Nation’s Nursing Homes.”</span> <a href="https://www.whitehouse.gov/briefing-room/statements-releases/2022/02/28/fact-sheet-protecting-seniors-and-people-with-disabilities-by-improving-safety-and-quality-of-care-in-the-nations-nursing-homes/">https://www.whitehouse.gov/briefing-room/statements-releases/2022/02/28/fact-sheet-protecting-seniors-and-people-with-disabilities-by-improving-safety-and-quality-of-care-in-the-nations-nursing-homes/</a>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>Nursing staff includes medical aides and technicians, certified nursing assistants, licensed practical nurses (LPNs), LPNs with administrative duties, registered nurses (RNs), RNs with administrative duties, and the RN director of nursing.↩︎</p></li>
<li id="fn2"><p>For example, distinguishing county from city when the name is the same could be done using State/County FIPS codes. Richmond County is 51159; Richmond City is 51760.↩︎</p></li>
<li id="fn3"><p>ZIP code is a system of postal codes used by the United States Postal Service. <em>ZIP</em> was chosen to indicate mail travels more quickly when senders use the postal code.↩︎</p></li>
<li id="fn4"><p>Average Daily Nursing Staff is the daily number of Medical Aides and Technicians, CNAs, LPNs, LPNs with administrative duties, RNs, RNs with administrative duties, and RN Director of Nursing averaged over three months.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Public Policy</category>
  <category>Data Analysis</category>
  <category>Data Integration</category>
  <category>Curation</category>
  <category>Statistical Products</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/use-case-2.html</guid>
  <pubDate>Tue, 19 Nov 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/19/images/nurse-thumbnail.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Advancing Data Science in Official Statistics – What is the Curated Data Enterprise?</title>
  <dc:creator>Sallie Keller, Stephanie Shipp, Vicki Lancaster, and Joseph Salvo &lt;br /&gt; University of Virginia</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/what-is-CDE-2.html</link>
  <description><![CDATA[ 





<center>
Acknowledgments: This research was sponsored by: <br> United States Census Bureau Agreement No.&nbsp;01-21-MOU-06 and <br> Alfred P. Sloan Foundation Grant No.&nbsp;G-2022-19536
</center>
<p><br> <br></p>
<p><em>The views expressed in this perspective are those of the authors and not the Census Bureau.</em></p>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Today, official statistics – tables, reports and microdata – are produced using data from a single survey. These surveys are foundational for researchers and policymakers. However, many questions cannot be answered by surveys alone. For example, creating a picture of how prepared skilled nursing facilities (SNFs) are for climate emergencies requires wrangling all types of data about the facilities and their communities. (<em>Note: A skilled nursing facility is a facility that meets specific federal regulatory certification requirements that enable it to provide short-term inpatient care and services to patients who require medical, nursing, or rehabilitative services.</em>) This includes SNF data on the number and dates of inspections, deficiencies, residents’ mental and physical health, the number of nursing staff and where they live; community assets data on the number of shelter facilities, health professionals, and emergency service providers; and community risks data on the probability of an extreme climate event. How can we create new statistical products useful to policymakers, emergency responders, skilled nursing facility staff, and others to inform their decisions?</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Official statistics
</div>
</div>
<div class="callout-body-container callout-body">
<p>Official statistics are essential for a democratic society as they provide economic, demographic, social, and environmental data about the government, the economy, and the environment. Official statistical agencies should compile and make these statistics available impartially to honor the right to public information.</p>
<p>Objective, reliable, and accessible official statistics instill confidence in the integrity of government and public decision-making regarding a country’s economic, social, and environmental situation at national and international levels. They should be widely available and meet the needs of various users <span class="citation" data-cites="UnitedNations2024">(United Nations 2024)</span>.</p>
</div>
</div>
<p>With the explosion of available data, there is an opportunity to combine all types of information to create statistical products that address cross-cutting topics for a wide range of purposes and uses. The US Census Bureau is modernizing and transforming its enterprise system to accommodate a new way to produce statistical products that take advantage of all data types: designed surveys and censuses, public and private administrative data, opportunity data scraped from the internet, and procedural data <span class="citation" data-cites="keller2022bold">(Keller et al. 2022)</span>.</p>
<blockquote class="blockquote">
<p><em>‘We are moving towards a single enterprise, data-centric operation that enables us to funnel data from many sources in a single data lake using common collection and ingestion platforms… This is the essence of <strong>a curated data approach</strong> — assemble, assess, and fill in the gaps to create quality statistical data.’</em></p>
</blockquote>
<blockquote class="blockquote">
<p><strong>Robert Santos,</strong> Director, US Census Bureau</p>
</blockquote>
<p>This curated approach is embodied in the Curated Data Enterprise (CDE). The Curated Data Enterprise Framework in Figure&nbsp;1 provides a guide for creating statistical products that enable the full integration of data from many sources <span class="citation" data-cites="keller2020doing">(Keller et al. 2020)</span>. At the heart of the framework are the purposes and uses that provide the context and driving force for developing the statistical product. The outer rectangle in Figure&nbsp;1 identifies the guiding principles for ethical, transparent and reproducible product development and dissemination. The inner rectangle identifies the steps in the statistical product development, including integrating primary and secondary data sources. The arrows convey that the process is not always linear; rather, it is iterative, and new information may be discovered at any point, requiring reevaluating and updating prior steps. Our Social and Decision Analytics research group in the Biocomplexity Institute has developed, tested, and refined the CDE (data science) Framework through our research since 2013 <span class="citation" data-cites="keller2017building keller2020doing">(Keller et al. 2017, 2020)</span>. The proposed use of the CDE to develop statistical products at the US Census Bureau is in its early stages.</p>
<div id="fig-cde" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/images/figure-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-cde-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: The CDE Framework starts with the purposes &amp; uses of the statistical products. The outer rectangle identifies the guiding principles for ethical, transparent, reproducible statistical product development and dissemination. The inner rectangle identifies the statistical product development steps.
</figcaption>
</figure>
</div>
<p>The next article in this series will put the CDE Framework into practice by demonstrating a use case on skilled nursing facilities’ preparedness for emergencies during extreme climate events. As a prelude to that article, Figure&nbsp;2 visualizes the statistical product development steps as applied in that use case.</p>
<div id="fig-ex" class="quarto-float quarto-figure quarto-figure-center anchored" data-fig-align="center">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-ex-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/images/figure-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-ex-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Example: Steps in the statistical product development for the skilled nursing facility use case. The diagram describes the steps applied to a use case on the resilience of skilled nursing facilities. Section 3 of this series describes the steps in detail.
</figcaption>
</figure>
</div>
<p>The CDE Framework’s guiding principles and research steps are described below.</p>
<p><strong>Guiding principles</strong>:</p>
<ul>
<li>Purposes and uses</li>
<li>Stakeholders</li>
<li>Curation</li>
<li>Equity and ethics</li>
<li>Privacy and confidentiality</li>
<li>Communications and dissemination</li>
</ul>
<p><strong>Research steps</strong>:</p>
<ul>
<li>Subject matter input</li>
<li>Data discovery</li>
<li>Data ingestion &amp; Governance</li>
<li>Data wrangling</li>
<li>Fitness-for-purpose</li>
<li>Statistics development</li>
</ul>
</section>
<section id="guiding-principles" class="level2">
<h2 class="anchored" data-anchor-id="guiding-principles">Guiding principles</h2>
<section id="sec-gp1" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp1">Purposes and uses</h3>
<p>The CDE is centered on developing statistical products to meet specific purposes and uses. Researchers and stakeholders propose the purposes and uses, defining the ‘why’ for developing statistics and statistical products. They include questions or issues that the statistics should be designed to support and are clarified by documented best practices, literature reviews and conversations with subject matter experts.</p>
</section>
<section id="sec-gp2" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp2">Stakeholders</h3>
<p>Stakeholders include individuals, groups, and organizations that have the potential to affect or be affected by the outcome of the research. Engaging stakeholders is crucial for fostering the connection and trust that can lead to better decision making. <span class="citation" data-cites="kujala2022stakeholder">Kujala et al. (2022)</span> best described the principle of stakeholder engagement: ‘Stakeholder engagement refers to the aims, activities, and impacts of stakeholder relations in a moral, strategic, and pragmatic manner.’ When placed within the CDE context and represented in the Framework, collaborative engagement with stakeholders occurs at all stages of product development to better understand what the final product needs to look like. Further, product development is not a linear process but occurs through successive waves of iteration with users.</p>
<p>Forming partnerships with stakeholders is instrumental in identifying requirements and implementing statistical products. This requires listening to community voices in an active engagement strategy.<sup>1</sup> Of necessity, these partnerships entail collaboration, such as creative and collaborative problem-solving workshops and the development of innovative digital tools vetted by networks of users.<sup>2</sup></p>
</section>
<section id="sec-gp3" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp3">Curation</h3>
<p>The broad meaning of curation is the act of organizing, documenting and maintaining a collection of artifacts. The artifacts of the development and dissemination of statistics or statistical products include all the components in Figure&nbsp;1, from meeting with stakeholders to formulating the purposes and uses to creating and disseminating the statistical products. Maintaining the artifacts is the essence of the CDE. <em>Every step in the process should be documented and easily accessible in a repository, for example, GitHub, for the work to be transparent and reproducible</em>. Curation in the context of the CDE is an end-to-end activity. It involves documenting the purpose and use, providing the context for acquiring, wrangling, and archiving data from many sources to support the development of statistical products. It will include metadata <span class="citation" data-cites="cannon2013">(Cannon 2013)</span>, the code used to read and write the data, and the code that ingested the data from the source and prepared it for analysis.</p>
<p><em>Curation steps</em></p>
<ul>
<li>Document the development of the research questions, why this research is important, and how it supports the purposes and uses and resulting statistical product.</li>
<li>Document the context for the purposes and uses, ie, a policy directive, stakeholder request, policy evaluation, etc.</li>
<li>What stakeholder engagement and transparency are built into the process?</li>
</ul>
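<p>One lightweight way to act on these curation steps is to keep each artifact as a machine-readable record alongside the project repository. The sketch below is a minimal illustration under assumed, hypothetical field names; it is not a Census Bureau schema.</p>

```python
import json
import datetime

def curation_record(step, description, context, artifacts):
    """Build one machine-readable curation log entry.

    All field names here are illustrative, not a standard schema.
    """
    return {
        "step": step,                # e.g., "purposes-and-uses"
        "description": description,  # why this step was taken
        "context": context,          # policy directive, stakeholder request, ...
        "artifacts": artifacts,      # files, meeting notes, code paths
        "recorded": datetime.date.today().isoformat(),
    }

log = [
    curation_record(
        step="purposes-and-uses",
        description="Assess SNF preparedness for extreme climate events",
        context="stakeholder request",
        artifacts=["notes/stakeholder-meeting.md"],
    )
]

# Serialize so the log can live in the repository (e.g., GitHub)
# and keep the work transparent and reproducible.
print(json.dumps(log, indent=2))
```

<p>Because each entry is plain JSON, the log can be versioned with the code and data it documents.</p>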
</section>
<section id="sec-gp4" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp4">Equity and ethics</h3>
<p>An ethics review ensures dialogue on this topic throughout the statistical product development and dissemination life cycle. It involves teams of researchers and stakeholders across many areas of expertise, each with its own research integrity norms and practices, which requires that ethics be woven into every aspect of the CDE. An <em>equity</em> review ensures that underserved groups are represented and that biases inherent in various data sources are acknowledged.</p>
<p><em>Curation questions</em></p>
<ul>
<li>What are the project’s expected benefits to the ‘public good’? Do they outweigh potential risks to specific sub-populations, eg, individuals, firms and their locations by different levels of geography?</li>
<li>Are there implicit assumptions and biases regarding the studied communities in framing the project and associated data sources? If yes, how will they be addressed?</li>
<li>What type of institutional approval process and contracts are needed? What statistical quality standards and confidentiality standards will be needed? For an explanation of the Institutional Review Board, see Note&nbsp;1.</li>
</ul>
<p>An ethics checklist can help with this process. Links to ethics checklists are provided below.</p>
<ul>
<li>University of Virginia, Biocomplexity Institute, <a href="https://biocomplexity.virginia.edu/sites/default/files/sda/UVA%20SDAD%20EthicsChecklist%2018May2022.pdf">Social and Decision Analytics Division Data Science Project Ethics Tool</a></li>
<li>United Kingdom Government, <a href="https://www.gov.uk/government/publications/data-ethics-framework#full-publication-update-history">Data Ethics Framework</a></li>
</ul>
</section>
<section id="sec-gp5" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp5">Privacy and confidentiality</h3>
<p>Privacy is about the individual, whereas confidentiality is about the individual’s information. Privacy refers to an individual’s desire to control their information. Confidentiality refers to the researcher’s agreement with the individual, which could be an agency like the Census Bureau, regarding how their information will be handled, managed, and disseminated <span class="citation" data-cites="keller2016does">(Keller et al. 2016)</span>. This is a guiding principle because it needs to be considered and embraced at the earliest possible stages of statistical product development and will impact dissemination choices.</p>
<p><em>Curation questions</em></p>
<ul>
<li>What steps are taken to ensure the privacy and confidentiality of the data?</li>
<li>What statistical methods (if any) are used to ensure the privacy and confidentiality of the data?</li>
<li>How do the methods chosen to protect confidentiality affect the purposes and uses of the data?</li>
<li>What stakeholder engagement and transparency are built into the process?</li>
<li>Does the context surrounding the purposes, uses, and anticipated data sources require an Institutional Review Board (IRB) review and approval? If yes, is it archived?</li>
</ul>
<div id="nte-irb" class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
Note&nbsp;1: Institutional Review Board
</div>
</div>
<div class="callout-body-container callout-body">
<p>In the United States, institutional review boards (IRBs) assess the ethics and safety of research studies involving human subjects, such as behavioral studies or clinical trials for new drugs or medical devices. Today, the definition of human subjects has evolved to include secondary data, such as administrative data collected for other purposes, eg, local property data collected for tax purposes.</p>
<p>The Belmont Commission was convened in the late 1970s after the ethical failures of many research projects that involved vulnerable populations surfaced. The Belmont Commission issued three principles for the conduct of ethical research:</p>
<ul>
<li><p><strong>Respect for people</strong> — treating people as autonomous and honoring their wishes</p></li>
<li><p><strong>Beneficence</strong> — understanding the risks and benefits of the study and weighing the balance between (1) doing no harm and (2) maximizing possible benefits and minimizing possible harms</p></li>
<li><p><strong>Justice</strong> — deciding if the risks and benefits of research are distributed fairly.</p></li>
</ul>
<p>These principles were translated to a set of regulations called the Common Rule that govern federally-funded research. The Belmont Commission provided the foundation for IRB principles and focused on research involving human subjects in experiments and studies. IRB approval is required to be eligible for federal grants and contracts. Many universities also require IRB review for research conducted by faculty, students, and researchers <span class="citation" data-cites="shipp2023making">(Shipp et al. 2023)</span>.</p>
</div>
</div>
</section>
<section id="sec-gp6" class="level3">
<h3 class="anchored" data-anchor-id="sec-gp6">Communication and dissemination</h3>
<p><em>Communication</em> involves sharing data, statistical method choices, well-documented code, and working papers; <em>dissemination</em> occurs through research team meetings, stakeholder engagements, conference presentations, publications, webinars, websites, and social media. As a principle, communication and dissemination are critical to ensure that statistical product development processes and findings are transparent and reproducible <span class="citation" data-cites="berman2016realizing">(<span class="nocase">Berman et al.</span> 2016)</span>. An essential facet of this step is to tell the story of the analysis by conveying the context, purpose, and implications of the research and findings <span class="citation" data-cites="berinato2019data wing2019data nasem2022transparency">(Berinato 2019; Wing 2019; NASEM 2022)</span>.</p>
<p><em>Curation questions</em></p>
<ul>
<li>Are the meeting notes, statistical products, code, reports, and presentations archived in a repository?</li>
<li>Briefly describe what did not work in this process, eg, data wrangling challenges where data sources could not be integrated, data source changes after a fitness-for-purpose assessment, analyses that were changed because assumptions were not met, etc.</li>
<li>Have project methods and outputs been made as transparent as possible?</li>
<li>Are the potential limitations of the research clearly presented?</li>
<li>Should the research be used as the basis for an institutional or policy action? Why or why not?</li>
<li>Have the predicted benefits and social costs to all potentially affected communities been considered?</li>
</ul>
</section>
</section>
<section id="research-steps" class="level2">
<h2 class="anchored" data-anchor-id="research-steps">Research steps</h2>
<section id="sec-rs1" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs1">Subject matter input</h3>
<p>Subject matter (domain) expertise plays a role in translating the information acquired into understanding the underlying phenomena in the data <span class="citation" data-cites="box1978statistics">(<span class="nocase">Box et al.</span> 1978)</span>. Domain knowledge provides the context to define, evaluate and interpret the findings at each research stage <span class="citation" data-cites="leonelli2019data snee2014follow">(Leonelli 2019; Snee et al. 2014)</span>. Subject matter input can be obtained through a review of the literature, talking to experts, or learning about their work at conferences or other convenings. Subject matter experts are different from stakeholders; both provide important input to identifying and clarifying purposes and uses.</p>
<p><em>Curation steps</em></p>
<ul>
<li>Document the meetings with subject matter experts and stakeholders.</li>
<li>Document the literature search methods and the results of the literature review.</li>
<li>Document choices made during the development of the products.</li>
<li>Were subject matter experts and stakeholders recruited from underrepresented groups?</li>
</ul>
</section>
<section id="sec-rs2" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs2">Data discovery</h3>
<p>Data discovery identifies potential sources that address the research goals defined by purposes and uses. Data sources include the following types <span class="citation" data-cites="keller2020doing">(Keller et al. 2020)</span>.</p>
<ol type="1">
<li><p>Designed data are collected using statistically designed methods, such as surveys, censuses, and data generated from an experimental or quasi-experimental design, such as a clinical trial or agricultural field study.</p></li>
<li><p>Administrative data are collected for the administration of an organization or program by entities such as government agencies.</p></li>
<li><p>Opportunity data are derived from internet-based information, such as websites, wearable and other sensor devices, and social media, and captured through application programming interfaces (APIs) and web scraping, eg, geocoded place-based data, transportation routes, and other data sources.</p></li>
<li><p>Procedural data are processes and policies, such as a change in health care coverage, a data repository policy outlining procedures and the metadata required to store data, or a responsible AI policy.</p></li>
</ol>
<p>The goal of the data discovery process is to think broadly and imaginatively about all data types and to capture the variety of data sources that could be useful for the problem. There are three steps in the data discovery process <span class="citation" data-cites="keller2016does">(Keller et al. 2016)</span>.</p>
<ol type="1">
<li><p>Identify potential data sources and make an inventory.</p></li>
<li><p>Create a set of questions to screen the data sources to ensure the data meet the criteria for use.</p></li>
<li><p>Select and acquire the data sources that meet the screening criteria.</p></li>
</ol>
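<p>The three discovery steps can be sketched as a small screening routine. Everything below, the example sources and the screening criteria alike, is hypothetical, chosen only to illustrate the inventory, screen, and select flow.</p>

```python
# Step 1: an inventory of candidate data sources (hypothetical entries).
inventory = [
    {"name": "SNF inspections", "type": "administrative",
     "coverage": "national", "documented": True},
    {"name": "Shelter locations scrape", "type": "opportunity",
     "coverage": "state", "documented": False},
    {"name": "Climate risk index", "type": "designed",
     "coverage": "national", "documented": True},
]

# Step 2: screening questions, expressed as predicates a source must pass.
screens = [
    lambda s: s["documented"],              # is metadata available?
    lambda s: s["coverage"] == "national",  # does coverage match the purpose?
]

# Step 3: select and acquire the sources that meet every screening criterion.
selected = [s for s in inventory if all(check(s) for check in screens)]
print([s["name"] for s in selected])  # ['SNF inspections', 'Climate risk index']
```

<p>In practice the screening questions would come from the purposes and uses and from stakeholder input, not from a fixed list.</p>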
<p><em>Curation steps</em></p>
<ul>
<li>Describe your data discovery process and reasoning behind the selected data sources.
<ul>
<li>Do underrepresented groups have adequate geographic coverage? If not, are there methods, such as synthetic data, you can use to provide adequate coverage?</li>
<li>Have checks and balances been established to identify and address implicit biases in the data and interpretation of the data? Has the team engaged in discussion and provided insights across their diverse perspectives?</li>
</ul></li>
<li>Describe the assumptions that need to be made to use these data sources.</li>
<li>Identify and document the paradata and metadata that describe each data source. Paradata describe how the data were collected, while metadata are ‘data about data’: information about the data’s content, data dictionaries and technical documents that will help the user assess its fitness for purpose <span class="citation" data-cites="cannon2013 nasem2022transparency">(Cannon 2013; NASEM 2022)</span>.</li>
<li>Discuss data sources you would have used if they were available.</li>
</ul>
</section>
<section id="sec-rs3" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs3">Data ingest and governance</h3>
<p>Data ingestion is the process of bringing data into the data management platform(s) for use. Data governance establishes and adheres to rules and procedures regarding data access, dissemination and destruction.</p>
<p><em>Curation steps</em></p>
<ul>
<li>Document policies and institutional agreements for data use.
<ul>
<li>Have team members reviewed data use agreements, standard operating procedures (SOPs), and data management plans? Are they fair?</li>
<li>Do additional procedures need to be defined for this project?</li>
</ul></li>
<li>Document the code and processes used to ingest the data sources and manage governance.</li>
</ul>
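<p>A minimal sketch of ingestion with a provenance record is shown below. The file name and provenance fields are assumptions for illustration; a production platform would also record the data use agreement and access rules that govern the source.</p>

```python
import csv
import datetime
import hashlib
import io

def ingest(raw_bytes, source_name):
    """Ingest one delimited file and record minimal provenance."""
    rows = list(csv.DictReader(io.StringIO(raw_bytes.decode("utf-8"))))
    provenance = {
        "source": source_name,
        # A checksum supports fixity checks during governance reviews.
        "sha256": hashlib.sha256(raw_bytes).hexdigest(),
        "rows": len(rows),
        "ingested": datetime.datetime.now(datetime.timezone.utc).isoformat(),
    }
    return rows, provenance

raw = b"facility_id,beds\nF001,120\nF002,85\n"
rows, prov = ingest(raw, "snf_capacity.csv")
print(prov["rows"])  # 2
```

<p>Keeping the checksum and row count beside the ingested data makes it possible to verify later that the archived source is the one actually analyzed.</p>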
</section>
<section id="sec-rs4" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs4">Data wrangling</h3>
<p>Data wrangling includes the activities of profiling, preparing, linking and exploring data, used to assess the data’s quality and representativeness and what analyses the data can support.</p>
<table class="caption-top table">
<caption>Table 1. Activities of data wrangling</caption>
<colgroup>
<col style="width: 31%">
<col style="width: 14%">
<col style="width: 30%">
<col style="width: 21%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: center;">Profiling</th>
<th style="text-align: center;">Preparing</th>
<th style="text-align: center;">Linking</th>
<th style="text-align: center;">Exploring</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: center;"><ul>
<li>data quality</li>
<li>data structure</li>
<li>meta data, paradata, and provenance</li>
</ul></td>
<td style="text-align: center;"><ul>
<li>cleaning</li>
<li>transforming</li>
<li>structuring</li>
</ul></td>
<td style="text-align: center;"><ul>
<li>ontology selection &amp; alignment</li>
<li>entity resolution / harmonization</li>
</ul></td>
<td style="text-align: center;"><ul>
<li>visualizations</li>
<li>descriptive statistics</li>
<li>characterizations</li>
</ul></td>
</tr>
</tbody>
</table>
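<p>The four activities in Table 1 can be illustrated in a few lines. The records and field names below are hypothetical; real SNF data would demand far more careful entity resolution and imputation choices.</p>

```python
import statistics

# Two hypothetical sources keyed on facility_id.
inspections = [
    {"facility_id": "F001", "deficiencies": "3"},
    {"facility_id": "F002", "deficiencies": ""},   # missing value
    {"facility_id": "F003", "deficiencies": "7"},
]
staffing = {"F001": 45, "F002": 30, "F003": 52}

# Profiling: how complete is each field?
missing = sum(1 for r in inspections if r["deficiencies"] == "")
print(f"missing deficiency counts: {missing} of {len(inspections)}")

# Preparing: clean and type-convert, making the missing-data choice explicit.
for r in inspections:
    r["deficiencies"] = int(r["deficiencies"]) if r["deficiencies"] else None

# Linking: a simple exact-key join; real linkage may need harmonization.
linked = [{**r, "staff": staffing.get(r["facility_id"])} for r in inspections]

# Exploring: descriptive statistics on the non-missing values.
values = [r["deficiencies"] for r in linked if r["deficiencies"] is not None]
print(statistics.mean(values))  # 5
```

<p>Each choice made here, how missing values are coded, which key links the sources, belongs in the curation record for the wrangling step.</p>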
<p><em>Curation steps</em></p>
<ul>
<li>Describe any data quality issues within the stated purpose and use context and how they were resolved. This can include statistical solutions like imputing missing data, identifying outliers or constructing synthetic populations.
<ul>
<li>How representative are the data?</li>
<li>What populations are and are not covered?</li>
</ul></li>
<li>Describe any issues with the wrangling process and how they were resolved.</li>
<li>Document the code used to wrangle the data and describe how it was validated.</li>
<li>Document assumptions made regarding the transformation and use of the data.</li>
</ul>
</section>
<section id="sec-rs5" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs5">Fitness-for-purpose</h3>
<p>Fitness-for-purpose starts with assessing the constraints imposed on the data by the particular statistical methods used and the population to which the inferences extend. It is a function of the modeling approach and of the models’ data quality and data coverage (representativeness) needs. The statistical product’s ‘fitness-for-purpose’ involves those on the receiving end of the data helping identify issues germane to the data application, such as identifying biases affecting equity. For example, given known differences in their availability, does using administrative records lead to better modeling outcomes for some groups more than others? What can be done to compensate for such bias?</p>
<p><em>Curation steps</em></p>
<ul>
<li>Document the constraints and limitations of the data.
<ul>
<li>What are the limitations of the results? Are the results useful, given the purpose of the study?</li>
</ul></li>
<li>Discuss the populations to which any inferences will generalize.
<ul>
<li>Do the statistical results support the potential benefits of the study previously stated?</li>
<li>Do any data require revisiting the question of potential biases being introduced through the choice of data sets and variables?</li>
</ul></li>
</ul>
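<p>One concrete fitness-for-purpose check is to compare group coverage in the assembled data against a benchmark population. The sketch below uses hypothetical shares and a hypothetical tolerance; the point is the comparison, not the numbers.</p>

```python
# Hypothetical benchmark shares vs. shares observed in the assembled data.
benchmark = {"urban": 0.80, "rural": 0.20}
observed = {"urban": 0.90, "rural": 0.10}

def coverage_gaps(benchmark, observed, tolerance=0.05):
    """Flag groups whose observed share deviates from the benchmark
    by more than the stated tolerance."""
    return {
        group: round(observed.get(group, 0.0) - share, 3)
        for group, share in benchmark.items()
        if abs(observed.get(group, 0.0) - share) > tolerance
    }

gaps = coverage_gaps(benchmark, observed)
print(gaps)  # {'urban': 0.1, 'rural': -0.1}
```

<p>A flagged gap, here the under-representation of rural facilities, would prompt revisiting the data sources or applying a statistical adjustment before inferences are drawn.</p>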
</section>
<section id="sec-rs6" class="level3">
<h3 class="anchored" data-anchor-id="sec-rs6">Statistics development</h3>
<p>The development of statistics and statistical products for dissemination is a function of the research questions, the data’s limitations and the assumptions of the statistical method(s) used.</p>
<p><em>Curation steps</em></p>
<ul>
<li>Describe the statistical methods planned and used and how the method assumptions were evaluated.</li>
<li>Discuss the conclusions of the statistical analyses and any inferences that can be made from the disseminated statistical products.</li>
<li>Discuss how the statistics support the purposes and uses driving the development of the products.</li>
</ul>
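<p>As a small illustration of evaluating a method’s assumptions before disseminating a statistic: the mean with an approximate 95% interval below is only reported after a crude symmetry check. The data, the proxy for skewness, and the threshold are all assumptions made for illustration.</p>

```python
import statistics

def summarize(values, skew_threshold=0.3):
    """Report the mean with an approximate 95% interval, after checking
    a crude symmetry assumption (mean vs. median, in SD units)."""
    mean, median = statistics.mean(values), statistics.median(values)
    sd = statistics.stdev(values)
    skew_proxy = abs(mean - median) / sd if sd else 0.0
    if skew_proxy > skew_threshold:
        raise ValueError("distribution looks skewed; the mean may mislead")
    se = sd / len(values) ** 0.5
    return {"mean": mean, "lower": mean - 2 * se, "upper": mean + 2 * se}

print(summarize([4, 5, 5, 6, 6, 7, 7, 8]))
```

<p>Recording why an assumption check passed (or what was done when it failed) is itself a curation step supporting the disseminated product.</p>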
<p>Here, we have defined the CDE and provided a conceptual walk through of the framework from Figure&nbsp;1. In the next article, we will put the CDE Framework into practice through a demonstration use case on the resilience of skilled nursing facilities.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/01/policy-problem.html">← Part 1: The policy problem</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/19/use-case-2.html">Part 3: Climate resiliency of skilled nursing facilities →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<p><strong>Sallie Keller</strong> is the Chief Scientist and Associate Director of Research and Methodology at the US Census Bureau. She is a statistician with research interest in social and decision informatics, statistics underpinnings of data science, and data access and confidentiality. Sallie Keller was at the University of Virginia when this work was conducted.</p>
</dd>
<dd>
<p><strong>Stephanie Shipp</strong> leads the Curated Data Enterprise research portfolio and collaborates with the US Census. She is an economist with experience in data science, survey statistics, public policy, innovation, ethics, and evaluation.</p>
</dd>
<dd>
<p><strong>Vicki Lancaster</strong> is a statistician with expertise in experimental design, linear models, computation, visualizations, data analysis, and interpretation. She works with scientists at federal agencies on projects requiring statistical skills and creativity, eg, defining skilled technical workforce using novel data sources.</p>
</dd>
<dd>
<strong>Joseph Salvo</strong> is a demographer with experience in US Census Bureau statistics and data. He presents on demographic subjects to a wide range of groups and has managed major demographic projects involving the analysis of large data sets for local applications.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
<p>© 2024 Stephanie Shipp</p>
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://www.shutterstock.com/g/Chaay_Tee">Chay_Tee</a> on <a href="https://www.shutterstock.com/image-photo/back-rear-view-young-asian-woman-2170748613">Shutterstock</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Keller S, Shipp S, Lancaster V, Salvo J (2024). “Advancing Data Science in Official Statistics – What is the Curated Data Enterprise?” Real World Data Science, November 8, 2024. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/what-is-CDE-2.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>
</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-berinato2019data" class="csl-entry">
Berinato, Scott. 2019. <span>“Data Science and the Art of Persuasion: Organizations Struggle to Communicate the Insights in All the Information They’ve Amassed. Here’s Why, and How to Fix It.”</span> <em>Harvard Business Review</em> 97 (1). <a href="https://hbr.org/2019/01/data-science-and-the-art-of-persuasion">https://hbr.org/2019/01/data-science-and-the-art-of-persuasion</a>.
</div>
<div id="ref-berman2016realizing" class="csl-entry">
<span class="nocase">Berman, Francine, Rob Rutenbar, Henrik Christensen, et al.</span> 2016. <span>“Realizing the Potential of Data Science: Final Report from the National Science Foundation Computer and Information Science and Engineering Advisory Committee Data Science Working Group.”</span> <em>National Science Foundation Computer and Information Science and Engineering Advisory Committee Report</em>.
</div>
<div id="ref-box1978statistics" class="csl-entry">
<span class="nocase">Box, George EP, William H Hunter, Stuart Hunter, et al.</span> 1978. <em>Statistics for Experimenters</em>. Vol. 664. John Wiley; sons New York.
</div>
<div id="ref-cannon2013" class="csl-entry">
Cannon, Sandra. 2013. <em>Defining <span>“Core”</span> Metadata: What Is Needed to Make Data Discoverable. Paper Presented at the Federal CASIC Workshops (Survey Uses of Metadata)</em>. <a href="https://www.census.gov/fedcasic/fc2013/">https://www.census.gov/fedcasic/fc2013/</a>.
</div>
<div id="ref-keller2017building" class="csl-entry">
Keller, Sallie, Vicki Lancaster, and Stephanie Shipp. 2017. <span>“Building Capacity for Data-Driven Governance: Creating a New Foundation for Democracy.”</span> <em>Statistics and Public Policy</em> 4 (1): 1–11.
</div>
<div id="ref-keller2022bold" class="csl-entry">
Keller, Sallie, Kenneth Prewitt, John Thompson, et al. 2022. <span>“A 21st Century Census Curated Data Enterprise. A Bold New Approach to Create Official Statistics. Technical Report.”</span> <em>Proceedings of the Biocomplexity Institute</em> BI-2022-1115: 297–323. <a href="https://doi.org/10.18130/r174-yk24">https://doi.org/10.18130/r174-yk24</a>.
</div>
<div id="ref-keller2020doing" class="csl-entry">
Keller, Sallie, Stephanie S Shipp, Aaron D Schroeder, and Gizem Korkmaz. 2020. <span>“Doing Data Science: A Framework and Case Study.”</span> <em>Harvard Data Science Review</em> 2 (1). <a href="https://doi.org/10.1162/99608f92.2d83f7f5">https://doi.org/10.1162/99608f92.2d83f7f5</a>.
</div>
<div id="ref-keller2016does" class="csl-entry">
Keller, Sallie, Stephanie Shipp, and Aaron Schroeder. 2016. <span>“Does Big Data Change the Privacy Landscape? A Review of the Issues.”</span> <em>Annual Review of Statistics and Its Application</em> 3: 161–80. <a href="https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-041715-033453">https://www.annualreviews.org/content/journals/10.1146/annurev-statistics-041715-033453</a>.
</div>
<div id="ref-kujala2022stakeholder" class="csl-entry">
Kujala, Johanna, Sybille Sachs, Heta Leinonen, Anna Heikkinen, and Daniel Laude. 2022. <span>“Stakeholder Engagement: Past, Present, and Future.”</span> <em>Business &amp; Society</em> 61 (5): 1136–96. <a href="https://doi.org/10.1177/00076503211066595">https://doi.org/10.1177/00076503211066595</a>.
</div>
<div id="ref-leonelli2019data" class="csl-entry">
Leonelli, Sabina. 2019. <em>Data Governance Is Key to Interpretation: Reconceptualizing Data in Data Science</em>. <a href="https://doi.org/10.1162/99608f92.17405bb6">https://doi.org/10.1162/99608f92.17405bb6</a>.
</div>
<div id="ref-nasem2022transparency" class="csl-entry">
NASEM. 2022. <span>“Transparency in Statistical Information for the National Center for Science and Engineering Statistics and All Federal Statistical Agencies.”</span> <em>National Academies of Science, Engineering, and Medicine</em>. <a href="https://doi.org/10.1162/99608f92.17405bb6">https://doi.org/10.1162/99608f92.17405bb6</a>.
</div>
<div id="ref-shipp2023making" class="csl-entry">
Shipp, Stephanie, Donna LaLonde, and Wendy Martinez. 2023. <span>“Making Ethical Decisions Is Hard!”</span> <em>CHANCE</em> 36 (4): 42–50. <a href="https://www.tandfonline.com/eprint/D5KR3XFRUG2QV4FVCKQI/full?target=10.1080/09332480.2023.2290955">https://www.tandfonline.com/eprint/D5KR3XFRUG2QV4FVCKQI/full?target=10.1080/09332480.2023.2290955</a>.
</div>
<div id="ref-snee2014follow" class="csl-entry">
Snee, Ronald D, Richard D DeVeaux, and Roger W Hoerl. 2014. <span>“Follow the Fundamentals.”</span> <em>Quality Progress</em> 47 (1): 24–28. <a href="https://search-proquest-com.proxy01.its.virginia.edu/docview/1491963574?accountid=14678">https://search-proquest-com.proxy01.its.virginia.edu/docview/1491963574?accountid=14678</a>.
</div>
<div id="ref-UnitedNations2024" class="csl-entry">
United Nations. 2024. <em>Development of a National Statistical System, Principle 1 - Relevance, Impartiality and Equal Access</em>. <a href="https://unstats.un.org/unsd/goodprac/bpaboutpr.asp?RecId=1">https://unstats.un.org/unsd/goodprac/bpaboutpr.asp?RecId=1</a>.
</div>
<div id="ref-wing2019data" class="csl-entry">
Wing, Jeannette M. 2019. <span>“The Data Life Cycle.”</span> <em>Harvard Data Science Review</em> 1 (1): 6. <a href="https://doi.org/10.1162/99608f92.e26845b4">https://doi.org/10.1162/99608f92.e26845b4</a>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p><a href="https://www.census.gov/newsroom/blogs/director/2023/01/a-look-ahead-2023.html" class="uri">https://www.census.gov/newsroom/blogs/director/2023/01/a-look-ahead-2023.html</a>&nbsp;↩︎</p></li>
<li id="fn2"><p><a href="https://www.census.gov/partners/act.html" class="uri">https://www.census.gov/partners/act.html</a>↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Public Policy</category>
  <category>Data Analysis</category>
  <category>Data Integration</category>
  <category>Curation</category>
  <category>Statistical Products</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/what-is-CDE-2.html</guid>
  <pubDate>Fri, 08 Nov 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/08/images/screen.thumbnail.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Advancing Data Science in Official Statistics – The Policy Problem</title>
  <dc:creator>Sallie Keller, Stephanie Shipp, Vicki Lancaster and Joseph Salvo &lt;br /&gt; University of Virginia</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/01/policy-problem.html</link>
  <description><![CDATA[ 





<center>
Acknowledgments: This research was sponsored by: <br> United States Census Bureau Agreement No.&nbsp;01-21-MOU-06 and <br> Alfred P. Sloan Foundation Grant No.&nbsp;G-2022-19536
</center>
<p><br> <br> <em>The views expressed in this article are those of the authors and not the Census Bureau.</em></p>
<section id="introduction" class="level2">
<h2 class="anchored" data-anchor-id="introduction">Introduction</h2>
<p>Two centuries ago, when the Framers of the US Constitution laid the cornerstone for the federal statistical system, they could not have imagined the complexity of questions future generations would want to ask or the variety of data sources available to address them. Back in 1787, counting the population and apportioning state seats in the House of Representatives were the most urgent tasks before the young nation, and so a requirement for a decennial census was written into the Constitution. Today, the census continues to serve its original purpose – but the purposes and uses for census data have exploded.</p>
<p>Questions we now seek to answer go beyond what the census (or surveys) alone can hope to address. Even with the multitude of other surveys commissioned by today’s US Census Bureau, researchers and policymakers find themselves looking to novel sources of data – from structured numeric data in traditional databases to unstructured text documents scraped from the internet – to explore issues such as how prepared nursing homes and communities are for extreme climate events, eg, hurricanes, wildfires, or floods. Wrangling these sources together with traditionally designed data, such as censuses and surveys, can fill data gaps, improve the quality and usefulness of statistical products, speed up their dissemination, and inspire the creation of new types of statistical products.</p>
<p>That is the impetus for developing the Curated Data Enterprise (CDE), an innovation in data science aimed at creating statistical products from all data types and building the infrastructure to support them. The Curated Data Enterprise, as the name implies, includes an end-to-end curation model to capture the complete statistical product development process. The CDE is designed to enable data discovery and retrieval, data quality assessment across multiple and diverse sources of information, and the reuse of data and models over time to accelerate statistical product development. The US Census Bureau has partnered with the University of Virginia, a working group of former Census Bureau Directors, a Communication Director, and university, non-profit and industry experts to develop this approach.</p>
<div class="callout callout-style-default callout-note callout-titled">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>The US <a href="https://www.census.gov/">Census Bureau</a>
</div>
</div>
<div class="callout-body-container callout-body">
<p>The US Census Bureau provides the latest official statistics, facts, and figures about America’s people, places, and economy. It collects data for 130 surveys annually and the decennial census that gives the Bureau its name. The US Census Bureau collects data from households, businesses, governments and non-profit organizations. For each survey, tabulations and margins of error are published in news releases and reports. Public-use microdata subject to disclosure rules are provided for household and demographic surveys. Microdata for economic and household surveys, without disclosure rules applied, are accessible to researchers through the <a href="https://www.census.gov/about/adrm/fsrdc.html">Federal Statistical Research Data Centers</a>.</p>
<p>Statistical agencies in other countries are also modernizing their surveys and statistical product development. See a summary of selected countries <span class="citation" data-cites="Lanman2023">(Lanman et al. 2023)</span>.</p>
</div>
</div>
</section>
<section id="a-new-approach" class="level2">
<h2 class="anchored" data-anchor-id="a-new-approach">A new approach</h2>
<p>To realize the CDE vision, the development of statistical products will address stakeholder questions using all data types – designed surveys and censuses, public and private administrative data, opportunity data scraped from the internet and procedural data <span class="citation" data-cites="keller2022bold">(Keller et al. 2022)</span>. This new approach aligns with the US Census Bureau’s modernization and transformation <span class="citation" data-cites="thieme2022technology">(Thieme 2022)</span> while maintaining the fundamental responsibilities of statistical agencies <span class="citation" data-cites="management2023fundamentals">(OMB 2023)</span>. It is also consistent with a conclusion by the NASEM <em>Panel on the Implications of Using Multiple Data Sources for Major Survey Programs</em>: ‘The quality of statistics produced from multiple data sources depends on properties of the individual sources as well as the methods used to combine them. A new framework of quality standards and guidelines is needed to evaluate such data sources’ fitness for use’ <span class="citation" data-cites="NASEM2023">(NASEM 2023, 192)</span>.</p>
<p>The CDE approach provides such a framework to address many of the challenges that official statistics face today, as well as demonstrate that they are poised to adopt a new approach to producing official statistics. For example:</p>
<ul>
<li><p>The timeliness and frequency of our official statistics are insufficient when there are shocks to the economy, such as the Covid-19 pandemic, when retrospective survey data were of limited usefulness. Federal agencies responded during the pandemic with relevance and agility by creating and launching fast-response Household Pulse Surveys that met immediate needs for data, trading off quality for timeliness <span class="citation" data-cites="Groshen2021Future">(Groshen 2021)</span>. Public engagement and support for these new relevant and timely data products at a time of crisis were essential to the success of this new statistical product.</p></li>
<li><p>The policy environment has responded to technological, social, and survey changes by encouraging efficient use of existing data, reuse, sharing and furthering open data principles. Researchers are now creating innovative statistical products using multiple data sources to better address the US’s needs and interests. The Commission on Evidence-Based Policymaking <span class="citation" data-cites="abraham2018promise">(Abraham et al. 2018)</span> and the Federal Data Strategy <span class="citation" data-cites="FedDataStrat">(<span>“Federal Data Strategy, Leveraging Data as a Strategic Asset”</span> 2021)</span> recommendations encourage agencies to permit access to data to undertake evaluation and research studies.</p></li>
<li><p>Techniques such as rapid scanning, text recognition, user-friendly uploads, and new devices, sensors, and systems can now record and transcribe data in real time. Using these techniques, governments and corporations now routinely and instantaneously collect and store data on behaviors and states as varied as purchase transactions, climate and road conditions, healthcare plan utilization, and land use and zoning. Extensive digitization and recording, better system connectedness and interactivity, and increased human-computer interaction can result in faster data accumulation, enhancing the usability of private and public administrative data while maintaining privacy and confidentiality <span class="citation" data-cites="brady2019challenge jarmin2019evolving">(Brady 2019; Jarmin 2019)</span>. &nbsp;</p></li>
<li><p>New techniques and data sources can transform statistical agencies ‘from the 20th-century survey-centric model to a 21st-century model that blends structured survey data with administrative and unstructured alternative digital data sources’, leading to better measures of the gig economy, retail sales, healthcare, workforce, and tools and methods to integrate multiple data sources while maintaining privacy and confidentiality <span class="citation" data-cites="jarmin2019evolving">(Jarmin 2019)</span>.</p></li>
</ul>
<p>The next three articles in this series will:</p>
<ul>
<li><p>provide an overview of the CDE and its corresponding framework</p></li>
<li><p>put the CDE Framework into practice through a demonstration use case on the resilience of skilled nursing facilities</p></li>
<li><p>describe our next steps for developing the CDE through a use case research program.</p></li>
</ul>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2024/11/08/what-is-CDE-2.html">Part 2: What is the Curated Data Enterprise? →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<p><strong>Sallie Keller</strong> is the Chief Scientist and Associate Director of Research and Methodology at the US Census Bureau. She is a statistician with research interest in social and decision informatics, statistics underpinnings of data science, and data access and confidentiality. Sallie Keller was at the University of Virginia when this work was conducted.</p>
</dd>
<dd>
<p><strong>Stephanie Shipp</strong> leads the Curated Data Enterprise research portfolio and collaborates with the US Census. She is an economist with experience in data science, survey statistics, public policy, innovation, ethics, and evaluation.</p>
</dd>
<dd>
<p><strong>Vicki Lancaster</strong> is a statistician with expertise in experimental design, linear models, computation, visualizations, data analysis, and interpretation. She works with scientists at federal agencies on projects requiring statistical skills and creativity, eg, defining skilled technical workforce using novel data sources.</p>
</dd>
<dd>
<strong>Joseph Salvo</strong> is a demographer with experience in US Census Bureau statistics and data. He presents on demographic subjects to a wide range of groups and has managed major demographic projects involving the analysis of large data sets for local applications.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
<p>© 2024 Stephanie Shipp</p>
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@goumbik">Lukas Blazek</a> on <a href="https://unsplash.com/photos/turned-on-black-and-grey-laptop-computer-mcSDtbWXUZU">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Keller S, Shipp S, Lancaster V, Salvo J (2024). “Advancing Data Science in Official Statistics: The Policy Problem.” Real World Data Science, November 01, 2024. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/01/policy-problem.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-abraham2018promise" class="csl-entry">
Abraham, Katherine G, Ron Haskins, Sherry Glied, et al. 2018. <span>“The Promise of Evidence-Based Policymaking: Report of the Commission on Evidence-Based Policymaking.”</span> <em>Washington, DC: Commission on Evidence-Based Policymaking</em>. <a href="https://www.cep.gov/content/dam/cep/report/cep-final-report.pdf">https://www.cep.gov/content/dam/cep/report/cep-final-report.pdf</a>.
</div>
<div id="ref-brady2019challenge" class="csl-entry">
Brady, Henry E. 2019. <span>“The Challenge of Big Data and Data Science.”</span> <em>Annual Review of Political Science</em> 22: 297–323. <a href="https://www.annualreviews.org/doi/abs/10.1146/annurev-polisci-090216-023229">https://www.annualreviews.org/doi/abs/10.1146/annurev-polisci-090216-023229</a>.
</div>
<div id="ref-FedDataStrat" class="csl-entry">
<span>“Federal Data Strategy, Leveraging Data as a Strategic Asset.”</span> 2021. <a href="https://strategy.data.gov/">https://strategy.data.gov/</a>.
</div>
<div id="ref-Groshen2021Future" class="csl-entry">
Groshen, Erica L. 2021. <span>“The <span>Future</span> of <span>Official</span> <span>Statistics</span>.”</span> <em>Harvard Data Science Review</em> 3 (4). <a href="https://doi.org/10.1162/99608f92.591917c6">https://doi.org/10.1162/99608f92.591917c6</a>.
</div>
<div id="ref-jarmin2019evolving" class="csl-entry">
Jarmin, Ron S. 2019. <span>“Evolving Measurement for an Evolving Economy: Thoughts on 21st Century US Economic Statistics.”</span> <em>Journal of Economic Perspectives</em> 33 (1): 165–84.
</div>
<div id="ref-keller2022bold" class="csl-entry">
Keller, Sallie, Kenneth Prewitt, John Thompson, et al. 2022. <span>“A 21st Century Census Curated Data Enterprise. A Bold New Approach to Create Official Statistics. Technical Report.”</span> <em>Proceedings of the Biocomplexity Institute</em> BI-2022-1115: 297–323. <a href="https://doi.org/10.18130/r174-yk24">https://doi.org/10.18130/r174-yk24</a>.
</div>
<div id="ref-Lanman2023" class="csl-entry">
Lanman, Kathryn, Olivia Davis, and Stephanie Shipp. 2023. <span>“What Can We Learn from Other Countries about How They Are Using Administrative Data to Supplement, Enhance, or Create New Data Products?”</span> <em>Proceedings of the Biocomplexity Institute</em>. <a href="https://doi.org/10.18130/2n54-sc22">https://doi.org/10.18130/2n54-sc22</a>.
</div>
<div id="ref-NASEM2023" class="csl-entry">
NASEM. 2023. <span>“Toward a 21st Century National Data Infrastructure: Enhancing Survey Programs by Using Multiple Data Sources.”</span> <em>National Academies of Sciences, Engineering, and Medicine</em>. <a href="https://doi.org/10.17226/26804">https://doi.org/10.17226/26804</a>.
</div>
<div id="ref-management2023fundamentals" class="csl-entry">
OMB. 2023. <span>“Fundamental Responsibilities of Recognized Statistical Agencies and Units.”</span> <em>Federal Register: The Daily Journal of the US Government</em>, 56708–44. <a href="https://www.federalregister.gov/documents/2023/08/18/2023-17664/fundamental-responsibilities-of-recognized-statistical-agencies-and-units">https://www.federalregister.gov/documents/2023/08/18/2023-17664/fundamental-responsibilities-of-recognized-statistical-agencies-and-units</a>.
</div>
<div id="ref-thieme2022technology" class="csl-entry">
Thieme, Michael. 2022. <em>Technology Transformations at the Census Bureau: Building a Modern, Data-Centric Ecosystem</em>. <a href="https://www.census.gov/newsroom/blogs/research-matters/2022/10/technology-transformation.html">https://www.census.gov/newsroom/blogs/research-matters/2022/10/technology-transformation.html</a>.
</div>
</div></section></div> ]]></description>
  <category>Public Policy</category>
  <category>Data Analysis</category>
  <category>Data Integration</category>
  <category>Curation</category>
  <category>Statistical Products</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/01/policy-problem.html</guid>
  <pubDate>Fri, 01 Nov 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/11/01/images/laptop-thumbnail.jpg" medium="image" type="image/jpeg"/>
</item>
<item>
  <title>Forecasting the Health Needs of a Changing Population</title>
  <dc:creator>Luke Shaw (BNSSG ICB), Rich Wood (BNSSG ICB, University of Bath), Christos Vasilakis (University of Bath), Zehra Onen Dumlu (University of Bath)</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/dpm.html</link>
  <description><![CDATA[ 





<section id="background" class="level2">
<h2 class="anchored" data-anchor-id="background">Background</h2>
<p>Decisions around medium- and long-term allocation of healthcare resources are fraught with challenges and uncertainties, which helps explain the prevalence of blunt resource allocations based on across-the-board annual percentage uplifts.</p>
<p>The Bristol, North Somerset, South Gloucestershire Integrated Care Board (BNSSG ICB – we love elaborate acronyms in the National Health Service!), in the south west of England, is part of the local NHS apparatus responsible for planning for the current and future health needs of its one million resident population.</p>
<div id="fig-bnssg-map" class="quarto-float quarto-figure quarto-figure-center anchored">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-bnssg-map-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<div class="tv-iframe-container">
  <iframe class="responsive-iframe" src="images/bnssg-map.html" title="fig-bnssg-map" width="80%" height="500"></iframe>
</div>
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-bnssg-map-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;1: A map of the area covered by BNSSG, spanning three local authorities and home to about 1 million people.
</figcaption>
</figure>
</div>
</section>
<section id="population-segmentation" class="level2">
<h2 class="anchored" data-anchor-id="population-segmentation">Population Segmentation</h2>
<p>Before tackling the complex problem of forecasting healthcare resources into the future, we first need to understand the current situation regarding the distribution of health needs.</p>
<p>While every individual has a unique set of circumstances, population segmentation is an approach used to help understand overall need by combining individuals into different groups, based on certain criteria.</p>
<p>We use the <a href="https://pubmed.ncbi.nlm.nih.gov/32015079/">Cambridge Multimorbidity Score</a> which is a metric designed to summarise the presence of multiple health conditions, known as multimorbidity. Using that score, which applies different weights to different health conditions, we <a href="https://www.tandfonline.com/doi/full/10.1080/20479700.2023.2232980">previously</a> found a way of splitting the adult (17+) population into five Core Segments, with <span style="color:#77A033;"><strong>Core Segment 1</strong></span> patients having the lowest score and being the least ill and <span style="color:#FF6C53;"><strong>Core Segment 5</strong></span> being those with the most multimorbidity.</p>
<p>Applied to the BNSSG adult population (of around 750K individuals), the following interesting properties were found:</p>
<ol type="1">
<li><strong>Halving</strong>: Going up one segment results in roughly half the number of people in that segment</li>
<li><strong>Doubling</strong>: Going up one segment results in roughly twice the NHS monetary spend per person per year</li>
</ol>
<p>We can see this in Figure&nbsp;2.</p>
<div id="fig-halving-doubling" class="quarto-float quarto-figure quarto-figure-center anchored" alt="Table showing the 5 Core Segments with CS1 having a Cambridge Score of <0.09, 52% of the population and £300 mean annual spend per person as the first row. This then changes by row through to CS5 having a Cambridge Score of >2.94 with 3% of the population and £5600 mean annual spend per person as the last row. The proportion of population column roughly halves row-by-row; the mean annual spend per person roughly doubles row-by-row.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-halving-doubling-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/images/halving-doubling-no-arrows.png" class="img-fluid figure-img" alt="Table showing the 5 Core Segments with CS1 having a Cambridge Score of <0.09, 52% of the population and £300 mean annual spend per person as the first row. This then changes by row through to CS5 having a Cambridge Score of >2.94 with 3% of the population and £5600 mean annual spend per person as the last row. The proportion of population column roughly halves row-by-row; the mean annual spend per person roughly doubles row-by-row.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-halving-doubling-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;2: Halving-Doubling Effect of the Core Segments
</figcaption>
</figure>
</div>
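<p>The halving-doubling pattern can be reproduced with a few lines of arithmetic. A minimal sketch in Python, seeded only with the Core Segment 1 values reported in Figure 2 (52% of the population, roughly £300 spend per person per year); all generated rows are approximations, not the published figures:</p>

```python
# Generate approximate Figure 2 rows from the halving-doubling pattern:
# each step up a Core Segment halves the population share and doubles the
# mean annual spend per person. Seed values are the CS1 row of Figure 2.
share, spend = 0.52, 300

for seg in range(1, 6):
    print(f"CS{seg}: {share:5.1%} of population, ~£{spend} per person per year")
    share /= 2   # halving: each segment holds about half as many people
    spend *= 2   # doubling: each segment costs about twice as much per person
```

<p>The generated final row (roughly 3% of the population, about £4,800 per person) is close to the 3% and £5,600 reported in Figure 2, showing the pattern is an approximation rather than an exact rule. One consequence is that share × spend stays roughly constant, so each segment accounts for a broadly similar slice of total spend.</p>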
</section>
<section id="sec-creating-the-model" class="level2">
<h2 class="anchored" data-anchor-id="sec-creating-the-model">Creating The Model</h2>
<p>To forecast health needs of the population, in terms of how many people will be in which Core Segment in what future year, the Dynamic Population Model (DPM) takes information from two different sources:</p>
<ol type="1">
<li><p>The Office for National Statistics <a href="https://www.ons.gov.uk/peoplepopulationandcommunity/populationandmigration/populationprojections/datasets/localauthoritiesinenglandtable2">projections</a> for our area. From this, we get yearly projections for not just the total 17+ population, but also the predicted number of people turning 17 (and so entering our model), deaths, and in- and out-ward migration.</p></li>
<li><p>NHS patient attribute and activity data, stored in the <a href="https://bnssghealthiertogether.org.uk/population-health-management/">System Wide Dataset</a> (SWD). This gives us: past and current information on the adult population’s NHS healthcare usage; the Core Segment breakdown of our current and past populations; the proportion of those turning 17, migrating, and dying that are in each Core Segment. From this, we estimate the historical rates of transition within Core Segments, which is essentially the yearly number of people getting sicker or healthier.</p></li>
</ol>
<p>By synthesising these pieces of data, we create our DPM forecast. Starting from the most up-to-date Core Segment population breakdown, the model takes yearly time steps into the future, at each time step using the inputs to estimate how many people will be in each Core Segment. This modelling approach, with discrete time steps and movements between states, could be set up as a Markov chain, although here we have formulated it as a set of difference equations, through which the outflow of each Core Segment population at each time step is deterministic. The design was led by <a href="https://researchportal.bath.ac.uk/en/persons/zehra-onen-dumlu">Zehra</a> and <a href="https://researchportal.bath.ac.uk/en/persons/christos-vasilakis">Christos</a>, through a collaboration between the NHS and the <a href="https://www.bath.ac.uk/research-centres/centre-for-healthcare-innovation-and-improvement-chi2/">Centre for Healthcare Innovation and Improvement (CHI2)</a> at the University of Bath.</p>
<p>The model can be thought of as having the following inputs:</p>
<table class="caption-top table">
<colgroup>
<col style="width: 31%">
<col style="width: 59%">
<col style="width: 8%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Model Input</th>
<th style="text-align: left;">Description</th>
<th style="text-align: left;">Data Source</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">initial population</td>
<td style="text-align: left;">The starting number of people in each Core Segment</td>
<td style="text-align: left;">SWD</td>
</tr>
<tr class="even">
<td style="text-align: left;">inner transition matrix</td>
<td style="text-align: left;">The yearly proportions of people moving from one Core Segment to another</td>
<td style="text-align: left;">SWD</td>
</tr>
<tr class="odd">
<td style="text-align: left;">births, net migration, deaths - numbers</td>
<td style="text-align: left;">The yearly number of people moving in and out of the area</td>
<td style="text-align: left;">ONS</td>
</tr>
<tr class="even">
<td style="text-align: left;">births, net migration, deaths - proportions</td>
<td style="text-align: left;">The proportion of births/migrations/deaths that come from each Core Segment group</td>
<td style="text-align: left;">SWD</td>
</tr>
</tbody>
</table>
<p>From these inputs, it deterministically outputs the yearly forecasts for the number of people in each Core Segment. From these yearly Core Segment population figures, we can also forecast use by point of delivery, by taking historic SWD information on the activity used by each Core Segment and assuming that this usage pattern stays the same into the future.</p>
<p>We combine these population health segment projections – i.e., how many people will be in which Core Segment in what future year – with recent NHS healthcare usage data to yield forecasted changes for various delivery points, like Emergency Department (ED) visits or maternity service appointments.</p>
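<p>One yearly time step of these difference equations can be sketched as: redistribute the current population vector with the inner transition matrix, then add the entrants and net migration and subtract the deaths according to their segment mixes. A minimal illustration in Python, where every number is a made-up placeholder rather than a real SWD or ONS input:</p>

```python
import numpy as np

# Starting people per Core Segment, scaled to a population of 1,000
# (illustrative only -- real values come from the SWD).
pop = np.array([520.0, 240.0, 130.0, 70.0, 40.0])

# Inner transition matrix: T[i, j] = yearly proportion of Core Segment i+1
# moving to Core Segment j+1. Mostly diagonal, since most people stay in the
# same segment year-on-year. Each row sums to 1. Values are placeholders.
T = np.array([
    [0.93, 0.06, 0.01, 0.00, 0.00],
    [0.04, 0.88, 0.06, 0.02, 0.00],
    [0.01, 0.05, 0.86, 0.06, 0.02],
    [0.00, 0.01, 0.06, 0.86, 0.07],
    [0.00, 0.00, 0.02, 0.08, 0.90],
])

# Yearly totals for flows in and out of the area (ONS-style numbers) ...
entrants, net_migration, deaths = 12.0, 5.0, 10.0
# ... and the Core Segment mix of each flow (SWD-style proportions).
p_entrants  = np.array([0.90, 0.08, 0.02, 0.00, 0.00])
p_migration = np.array([0.70, 0.20, 0.07, 0.02, 0.01])
p_deaths    = np.array([0.05, 0.10, 0.20, 0.30, 0.35])

def step(pop):
    """Advance the Core Segment population vector by one year."""
    moved = pop @ T  # redistribute people between segments
    return (moved
            + entrants * p_entrants
            + net_migration * p_migration
            - deaths * p_deaths)

pop_next = step(pop)
```

<p>Because every row of the transition matrix sums to one, the within-segment redistribution conserves the total population; only the entry and exit flows change its overall size. Iterating <code>step</code> year by year produces the forecast.</p>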
</section>
<section id="findings" class="level2">
<h2 class="anchored" data-anchor-id="findings">Findings</h2>
<p>The first output of the model is the population forecast for each Core Segment, as plotted in Figure&nbsp;3. The visualisation is a type of Sankey diagram called an alluvial plot, which shows the proportion of people moving between the Core Segments each year. As is to be expected, the majority of individuals stay in the same Core Segment year-on-year, as the process of acquiring conditions and developing multimorbidity takes place over many years and decades.</p>
<p>The concerning insight shown in Figure&nbsp;3 is that all Core Segments apart from (the most healthy) <span style="color:#77A033;"><strong>Core Segment 1</strong></span> are due to increase in size, with <span style="color:#FF6C53;"><strong>Core Segment 5</strong></span> having the largest percentage increase over the next 20 years. While, at first glance, this could be attributed to the effect of an ageing population, in which people are staying alive for longer, we will see in the next set of results that this does not wholly explain the forecasted Core Segment changes.</p>
<div id="fig-sankey" class="quarto-float quarto-figure quarto-figure-center anchored" alt="over 20 years when scaled to 1000 population initially we have that population changes in the following ways: CS1 decreases from 520 to 490, CS2 increases from 240 to 310, CS3 increased from 130 to 180, CS4 increases from 70 to 110, CS5 increases from 40 to 60.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-sankey-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/images/dpm-sankey-no-title.png" class="img-fluid figure-img" alt="over 20 years when scaled to 1000 population initially we have that population changes in the following ways: CS1 decreases from 520 to 490, CS2 increases from 240 to 310, CS3 increased from 130 to 180, CS4 increases from 70 to 110, CS5 increases from 40 to 60.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-sankey-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;3: All Core Segments, except the most healthy (CS1), are forecast to increase in size. BNSSG Population rescaled to have an initial population of 1,000.
</figcaption>
</figure>
</div>
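<p>To make the mechanics concrete, the year-on-year movement between Core Segments can be thought of as a Markov-style transition process. The sketch below uses entirely hypothetical transition rates (the real rates are estimated empirically from the linked BNSSG data, and the full model also accounts for births, deaths and migration) to project a population of 1,000 forward 20 years:</p>

```python
import numpy as np

# Hypothetical annual transition matrix between the five Core Segments
# (row = current segment, column = next year's segment). These values are
# illustrative only; births, deaths and migration are ignored here.
P = np.array([
    [0.97, 0.02, 0.01, 0.00, 0.00],
    [0.00, 0.95, 0.03, 0.01, 0.01],
    [0.00, 0.00, 0.94, 0.04, 0.02],
    [0.00, 0.00, 0.00, 0.94, 0.06],
    [0.00, 0.00, 0.00, 0.00, 1.00],
])

# Initial population scaled to 1,000, matching Figure 3
pop = np.array([520.0, 240.0, 130.0, 70.0, 40.0])

# Apply the annual transitions 20 times
for _ in range(20):
    pop = pop @ P

print(pop.round())
```

<p>Even modest annual transition rates, compounded over 20 years, shift a substantial share of the population into the less healthy segments, which is the pattern visible in the alluvial plot.</p>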
<p>By applying the typical NHS healthcare usage per Core Segment to the projections of Figure&nbsp;3, we derive the expected future healthcare usage for various healthcare settings (Figure&nbsp;4). By overlaying the equivalent projections due solely to demographic factors (both for total population size and capturing the effect of Age and Sex), we see that the DPM projections for increased resource use are not solely attributable to an ageing and growing population, but also to a population becoming gradually less healthy over time.</p>
<p>Specifically, from Figure&nbsp;4 we can glean the following insights:</p>
<ol type="a">
<li><p>In all areas except Maternity, the DPM forecasts increased use beyond that driven by a growing, ageing population. Maternity is the exception because it closely follows the forecast demographic changes, specifically the number of women of childbearing age.</p></li>
<li><p>For Community contacts, which have the highest proportion of use from <span style="color:#FF6C53;"><strong>Core Segment 5</strong></span> patients, the DPM forecasts the largest increase into the future. This is because, relative to its current size, <span style="color:#FF6C53;"><strong>Core Segment 5</strong></span> is set to grow the most, and this has the largest impact on Community contacts, which include home visits to support rehabilitation and services to manage long-term mobility issues, such as physiotherapy.</p></li>
<li><p>Whilst Secondary Elective and Non-Elective activity is forecast to grow at similar rates, the Carbon and Cost values are forecast to grow more for Secondary Non-Elective due to the average Carbon and Cost usage per person in Core Segment 5 being higher. In this context ‘Secondary’ is a hospital stay, with ‘Elective’ being planned and ‘Non-Elective’ being unplanned. For example, a hip replacement is elective whereas an admission following a road traffic accident is non-elective.</p></li>
</ol>
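<p>The calculation behind Figure&nbsp;4 is, at its core, per-person usage rates applied to the segment populations. A minimal sketch, with hypothetical per-person contact rates standing in for the real NHS figures:</p>

```python
import numpy as np

# Hypothetical annual contacts per person for Core Segments 1-5;
# the real rates come from linked NHS activity data.
contacts_per_person = np.array([1, 3, 6, 12, 30])

pop_now    = np.array([520, 240, 130, 70, 40])   # per 1,000 population today
pop_future = np.array([490, 310, 180, 110, 60])  # Figure 3 projection, 20 years

demand_now = contacts_per_person @ pop_now        # 4060 contacts
demand_future = contacts_per_person @ pop_future  # 5620 contacts

# Demand grows by roughly 38% while the population grows by only 15%,
# because growth is concentrated in the high-usage segments.
print(demand_now, demand_future)
```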
<div id="fig-pod-forecasts" class="quarto-float quarto-figure quarto-figure-center anchored" alt="the image shows 15 separate graphs, with the columns being Community, Maternity, Secondary Elective, Secondary Non-Elective and Total, and the rows being Activity, Carbon, and Cost. All graphs have similar overall shape of increase into the future, but with different gradients.">
<figure class="quarto-float quarto-float-fig figure">
<div aria-describedby="fig-pod-forecasts-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
<img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/images/dpm-pod-forecasts.png" class="img-fluid figure-img" alt="the image shows 15 separate graphs, with the columns being Community, Maternity, Secondary Elective, Secondary Non-Elective and Total, and the rows being Activity, Carbon, and Cost. All graphs have similar overall shape of increase into the future, but with different gradients.">
</div>
<figcaption class="quarto-float-caption-bottom quarto-float-caption quarto-float-fig" id="fig-pod-forecasts-caption-0ceaefa1-69ba-4598-a22c-09a6ac19f8ca">
Figure&nbsp;4: Forecasts by activity, carbon, and cost for four different points of delivery.
</figcaption>
</figure>
</div>
</section>
<section id="limitations" class="level2">
<h2 class="anchored" data-anchor-id="limitations">Limitations</h2>
<blockquote class="blockquote">
<p>It’s difficult to make predictions, especially about the future.</p>
<p>– <cite>Danish Proverb</cite></p>
</blockquote>
<p>As with any modelling or forecasting method, there are limitations to be mindful of.</p>
<ol type="1">
<li><p>The cost and activity usage estimates are made under the assumption that we will continue to deliver services as they are currently being delivered. We know this isn’t going to be true, as healthcare-seeking behaviour evolves over time, with younger people accessing healthcare in different ways to previous generations. On top of that, healthcare advances can result in significant changes in healthcare provision, in ways unaccounted for within this model.</p></li>
<li><p>The model is tied to ONS forecasts for population change, and robust forecasting is hard. It is difficult to estimate what the population will look like in 20 years’ time, or the influence of uncertain and unknown future local development and housing plans. Having said this, population forecasts tend to be robust; one way to see this is that everyone who will be an adult by the end of the 20-year forecast has already been born.</p></li>
<li><p>The DPM does not explicitly account for the interaction of demand and capacity: it simply predicts future healthcare resource requirement assuming that health needs of a given Core Segment patient are met in the same way they are met now. This is an essential assumption to help ensure legitimate use of the empirically derived Core Segment transition rates. However, it inevitably limits practical use, as flexing demand and capacity assumptions is of importance to planners and service managers.</p></li>
<li><p>It is not possible to validate the model on historic data: firstly because of point 3 above, but also because we only have good quality SWD information for the past two years, so we cannot reliably look further back and create a forecast to check against what actually happened.</p></li>
<li><p>Whilst it is possible to use the model in other healthcare systems and geographic areas, the underlying data required to generate the Core Segments is non-trivial, so significant data pipelining may be needed to create local model inputs, as explained above in Section&nbsp;3.</p></li>
</ol>
</section>
<section id="what-next" class="level2">
<h2 class="anchored" data-anchor-id="what-next">What Next</h2>
<p>We have already generated local use cases for the DPM, forecasting for different geographical areas and specific hospital trusts. We envisage the DPM becoming a standard tool in most forward planning initiatives and will continue to refine the model as more information becomes available, both for calibration and validation.</p>
<p>Outside of BNSSG, we are keen to disseminate our modelling approach to others who may be interested, as well as to expand our collaborations. There are also other innovative approaches in this space, such as the <a href="https://www.health.org.uk/publications/health-in-2040">Health in 2040</a> report by the Health Foundation, which works at England level and uses the same ONS forecasts, but with a different ‘microsimulation’ modelling approach.</p>
<blockquote class="blockquote">
<p>If long-term forecasting in the NHS is of interest to you and your work, we’d love to chat! Please get in touch at <a href="mailto:bnssg.analytics@nhs.net">bnssg.analytics@nhs.net</a></p>
</blockquote>
</section>
<section id="summary" class="level2">
<h2 class="anchored" data-anchor-id="summary">Summary</h2>
<p>Reliably forecasting longer-term population health needs and healthcare resource requirements is essential if the NHS is to effectively plan for tomorrow’s problems today.</p>
<p>While this is undoubtedly a difficult problem – both conceptually and statistically – our modelling, undertaken through an academic-NHS collaboration, demonstrates that there are alternatives beyond the commonly-used but simplistic approaches based only on demographic factors.</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Luke Shaw</strong> is a Data Scientist working in the NHS.
</dd>
<dd>
<strong>Rich Wood</strong> is Head of Modelling Analytics at BNSSG ICB and Senior Visiting Research Follow at University of Bath School of Management.
</dd>
<dd>
<strong>Christos Vasilakis</strong> is Director of the Centre for Healthcare Innovation and Improvement (CHI2), and Professor at the University of Bath School of Management.
</dd>
<dd>
<strong>Zehra Onen Dumlu</strong> is a Research Associate at CHI2 and Lecturer at the University of Bath.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2024 Luke Shaw
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Shaw, Luke, et al. 2024. “Forecasting the Health Needs of a Changing Population.” Real World Data Science, May 08, 2024. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/dpm.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>Health and wellbeing</category>
  <category>Forecasting</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/dpm.html</guid>
  <pubDate>Wed, 08 May 2024 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2024/05/08/images/doctor-patient-thumbnail.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Deduplicating and linking large datasets using Splink</title>
  <dc:creator>Robin Linacre</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/splink.html</link>
  <description><![CDATA[ 





<p>In 2019, the data linking team at the Ministry of Justice was challenged to develop a new data linking methodology to produce new, higher quality linked datasets from the justice system.</p>
<p>The ultimate goal was to share new linked datasets with academic researchers, as part of the ADR UK-funded <a href="https://www.gov.uk/guidance/ministry-of-justice-data-first">Data First programme</a>. These datasets – which include data from prisons, probation, and the criminal and family courts – are now available, and researchers can <a href="https://www.gov.uk/government/publications/moj-data-first-application-form-for-secure-access-to-data">apply for secure access</a>.</p>
<p>The linking methodology is widely applicable and has been published as a free and open source software package called <a href="https://github.com/moj-analytical-services/splink">Splink</a>. The software applies statistical best practice to accurately and quickly link and deduplicate large datasets. The software has now been downloaded over 7 million times, and has been used widely in government, academia and the private sector.</p>
<section id="the-problem" class="level2">
<h2 class="anchored" data-anchor-id="the-problem">The problem</h2>
<p>Data duplication is a ubiquitous problem affecting data quality. Organisations often have multiple records that refer to the same entity but no unique identifier that ties these entities together. Data entry errors and other issues mean that variations usually exist, so the records belonging to a single entity aren’t necessarily identical.</p>
<p>For example, in a company, customer data may have been entered multiple times in multiple different databases, with different spellings of names, different addresses, and other typos. The inability to identify which records belong to each customer presents a data quality problem at all stages of data analysis – from basic questions such as counting the number of unique customers, through to advanced statistical analysis.</p>
<p>With the growing size of datasets held by many organisations, any solution must be able to work on very large datasets of tens of millions of records or more.</p>
</section>
<section id="approach" class="level2">
<h2 class="anchored" data-anchor-id="approach">Approach</h2>
<p>In collaboration with academic experts, the team started with desk research into data linking theory and practice, and a review of existing open source software implementations.</p>
<p>One of the most common theoretical approaches described in the literature is the Fellegi-Sunter model. This statistical model has a long history of application for high profile, important record linking tasks such as in the US Census Bureau and the UK Office for National Statistics (ONS).</p>
<p>The model takes pairwise comparisons of records as an input, and outputs a match score between 0 and 1, which (loosely) can be interpreted as the probability of the two records being a match. Since the record comparison can be either two records from the same dataset, or records from different datasets, this is applicable to both deduplication and linkage problems.</p>
<p>An important benefit of the model is explainability. The model uses a number of parameters, each of which <a href="https://www.robinlinacre.com/partial_match_weights/">has an intuitive explanation</a> that can be understood by a non-technical audience. The relative simplicity of the model also means it is easier to understand and explain how biases in linkage may occur, such as varying levels of accuracy for different ethnic groups.</p>
<section id="example" class="level3">
<h3 class="anchored" data-anchor-id="example">Example</h3>
<p>Consider the following simple record comparison. Are these records a match?</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/images/record_comparison.png" class="img-fluid figure-img"></p>
<figcaption><strong>Figure 1</strong>: Colour coded comparison of two records.</figcaption>
</figure>
</div>
<p>The parameters of the model are known as partial match weights, which capture the strength of the evidence in favour or against these records being a match.</p>
<p>They can be represented in a chart as follows, in which the highlighted bars correspond to the above example record comparison:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/images/partial_match_weights.png" class="img-fluid figure-img"></p>
<figcaption><strong>Figure 2</strong>: Chart showing partial match weights of model.</figcaption>
</figure>
</div>
<p>We can see, for example, that the first name (Robin vs Robyn) is not an exact match, but they have a Jaro-Winkler similarity of above 0.9. As a result, the model ‘activates’ the corresponding partial match weight (in orange). This lends some evidence in favour of a match, but the partial match weight is not as strong as it would have been for an exact match.</p>
<p>Similarly we can see that the non-match on gender leads to the activation (in purple) of a strong negative partial match weight.</p>
<p>The activated partial match weight can then be represented in a waterfall chart as follows, which shows how the final match score is calculated:</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/images/waterfall.png" class="img-fluid figure-img"></p>
<figcaption><strong>Figure 3</strong>: Waterfall chart showing how partial match weights combine to calculate the final prediction.</figcaption>
</figure>
</div>
<p>The parameter estimates in these charts all have intuitive explanations:</p>
<ul>
<li>The partial match weight on first name is positive, but relatively weak. This makes sense, because the first names are a fuzzy match, not an exact match, so this provides only moderate evidence in favour of the record being a match.</li>
<li>The match weight for the exact match on postcode is stronger than the equivalent weight for surname. This is because the cardinality of the postcode field in the underlying data is higher than the cardinality for surname, so matches on postcode are less likely to occur by chance than matches on surname.</li>
<li>The negative match weight for the mismatch on gender is relatively strong. This reflects the fact that, in this dataset, it’s uncommon for the ‘gender’ field to mismatch amongst truly matching records.</li>
</ul>
<p>The final result is that the model predicts these records are a match, but with only 94% probability: it’s not sure. Most examples would be less ambiguous than this one, and would have a match probability very close to either 0 or 1.</p>
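<p>For readers who want the arithmetic behind the waterfall chart: partial match weights are log<sub>2</sub> Bayes factors, so the final match probability comes from adding the activated weights to a prior weight and converting the resulting log-odds back to a probability. The weights below are invented for illustration (Splink estimates the real ones from the data), chosen so the result lands near the 94% of the example:</p>

```python
# Hypothetical partial match weights (log2 Bayes factors); in Splink
# these are estimated from the data, not set by hand.
prior_match_weight = -7.0  # log2 prior odds of a random pair matching
activated_weights = {
    "first_name (fuzzy match)": 3.0,
    "surname (exact match)": 8.0,
    "postcode (exact match)": 10.0,
    "gender (mismatch)": -10.0,
}

# The waterfall chart accumulates these left to right
total = prior_match_weight + sum(activated_weights.values())

# Convert log2 odds into a match probability
probability = 2**total / (1 + 2**total)
print(round(probability, 3))  # → 0.941
```

<p>Note how a single strong negative weight (the gender mismatch) can offset several positive ones, which is exactly the behaviour the waterfall chart makes visible.</p>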
<p>For further details of the theory behind the Fellegi-Sunter model, and a deep dive into the intuitive explanations of the model, I have developed a <a href="https://www.robinlinacre.com/intro_to_probabilistic_linkage/">series of interactive tutorials</a>.</p>
</section>
</section>
<section id="implementation" class="level2">
<h2 class="anchored" data-anchor-id="implementation">Implementation</h2>
<p>Through our desk research and open source software review, we identified an existing software package, <a href="https://github.com/kosukeimai/fastLink">fastLink</a>, which implements the Fellegi-Sunter model. Unfortunately, it is not able to handle very large datasets of more than a few hundred thousand records.</p>
<p>Inspired by the popularity of fastLink, the team quickly realised that the methodology it was developing was generally applicable and could be valuable to a wide range of users if published as a software package.</p>
<p>As we spoke to colleagues across government and beyond, we found record linkage and deduplication problems are pervasive, and crop up in many different guises, meaning that any software needed to be very general and flexible.</p>
<p>The result is Splink, a Python package that implements the Fellegi-Sunter model and enables parameters to be estimated using the Expectation Maximisation algorithm.</p>
<p>The package is free to use, <a href="https://github.com/moj-analytical-services/splink">and open source</a>. It is accompanied by <a href="https://moj-analytical-services.github.io/splink/index.html">detailed documentation</a>, including a <a href="https://moj-analytical-services.github.io/splink/demos/tutorials/00_Tutorial_Introduction.html">tutorial</a> and a set of <a href="https://moj-analytical-services.github.io/splink/demos/examples/examples_index.html">examples</a>.</p>
<p>Splink makes no assumptions about the type of entity being linked, so it is very flexible. We are aware of its use to match data on a variety of entity types including persons, companies, financial transactions and court cases.</p>
<p>The package closely follows the statistical approach described in fastLink. In particular it implements the same mathematical model and likelihood functions described in the <a href="http://imai.fas.harvard.edu/research/files/linkage.pdf">fastLink paper</a> (see pages 354 to 357), with a comprehensive suite of tests to ensure correctness of the implementation.</p>
<p>In addition, Splink introduces a number of innovations:</p>
<ul>
<li>Able to work at massive scale – with proven examples of its use on over 100 million records.</li>
<li>Extremely fast – capable of linking 1 million records on a laptop in around a minute.</li>
<li><a href="https://moj-analytical-services.github.io/splink/charts/index.html">Comprehensive graphical output</a> showing parameter estimates and iteration history make it easier to understand the model and diagnose statistical issues.</li>
<li><a href="https://moj-analytical-services.github.io/splink/charts/waterfall_chart.html">A waterfall chart</a> which can be generated for any record pair, which explains how the estimated match probability is derived.</li>
<li>Support for deduplication, linking, and a combination of both, including support for deduplicating and linking multiple datasets.</li>
<li>Greater customisability of record comparisons, including the ability <a href="https://moj-analytical-services.github.io/splink/topic_guides/comparisons/customising_comparisons.html">to specify custom, user defined comparison functions.</a></li>
<li>Term frequency adjustments on any number of columns.</li>
<li>It’s possible to save a model once it’s been estimated – enabling a model to be estimated, quality assured, and then reused as new data becomes available.</li>
<li>A <a href="https://moj-analytical-services.github.io/splink/">companion website</a> provides a complete description of the various configuration options, and examples of how to achieve different linking objectives.</li>
</ul>
</section>
<section id="using-splink" class="level2">
<h2 class="anchored" data-anchor-id="using-splink">Using Splink</h2>
<p><a href="https://moj-analytical-services.github.io/splink/">Full documentation</a> and <a href="https://moj-analytical-services.github.io/splink/demos/tutorials/00_Tutorial_Introduction.html">a tutorial</a> are available for Splink, but the following snippet gives a simple example of Splink in action:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode python code-with-copy"><code class="sourceCode python"><span id="cb1-1"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> splink.datasets <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> splink_datasets</span>
<span id="cb1-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> splink.duckdb.blocking_rule_library <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> block_on</span>
<span id="cb1-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> splink.duckdb.comparison_library <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> (</span>
<span id="cb1-4">    exact_match,</span>
<span id="cb1-5">    jaro_winkler_at_thresholds,</span>
<span id="cb1-6">    levenshtein_at_thresholds,</span>
<span id="cb1-7">)</span>
<span id="cb1-8"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> splink.duckdb.linker <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> DuckDBLinker</span>
<span id="cb1-9"></span>
<span id="cb1-10">df <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> splink_datasets.fake_1000</span>
<span id="cb1-11"></span>
<span id="cb1-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Specify a data linkage model</span></span>
<span id="cb1-13">settings <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb1-14">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"link_type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dedupe_only"</span>,</span>
<span id="cb1-15">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"blocking_rules_to_generate_predictions"</span>: [</span>
<span id="cb1-16">      block_on(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"first_name"</span>),</span>
<span id="cb1-17">      block_on(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"surname"</span>),</span>
<span id="cb1-18">    ],</span>
<span id="cb1-19">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"comparisons"</span>: [</span>
<span id="cb1-20">        jaro_winkler_at_thresholds(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"first_name"</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>),</span>
<span id="cb1-21">        jaro_winkler_at_thresholds(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"surname"</span>),</span>
<span id="cb1-22">        levenshtein_at_thresholds(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dob"</span>),</span>
<span id="cb1-23">        exact_match(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"city"</span>, term_frequency_adjustments<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">True</span>),</span>
<span id="cb1-24">        exact_match(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"email"</span>),</span>
<span id="cb1-25">    ],</span>
<span id="cb1-26">}</span>
<span id="cb1-27"></span>
<span id="cb1-28">linker <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> DuckDBLinker(df, settings)</span>
<span id="cb1-29"></span>
<span id="cb1-30"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Estimate model parameters</span></span>
<span id="cb1-31"></span>
<span id="cb1-32"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Direct estimation using random sampling can be used for the u probabilities</span></span>
<span id="cb1-33">linker.estimate_u_using_random_sampling(target_rows<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">1e6</span>)</span>
<span id="cb1-34"></span>
<span id="cb1-35"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Expectation maximisation is used to train the m values</span></span>
<span id="cb1-36">br_training <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_on([<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"first_name"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"surname"</span>])</span>
<span id="cb1-37">linker.estimate_parameters_using_expectation_maximisation(br_training)</span>
<span id="cb1-38"></span>
<span id="cb1-39">br_training <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> block_on(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"dob"</span>)</span>
<span id="cb1-40">linker.estimate_parameters_using_expectation_maximisation(br_training)</span>
<span id="cb1-41"></span>
<span id="cb1-42"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Use the model to compute pairwise match scores</span></span>
<span id="cb1-43">pairwise_predictions <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> linker.predict()</span>
<span id="cb1-44"></span>
<span id="cb1-45"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Cluster the match scores into groups to produce a synthetic unique person id</span></span>
<span id="cb1-46">clusters <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> linker.cluster_pairwise_predictions_at_threshold(</span>
<span id="cb1-47">  pairwise_predictions, <span class="fl" style="color: #AD0000;
background-color: null;
font-style: inherit;">0.95</span></span>
<span id="cb1-48">)</span>
<span id="cb1-49">clusters.as_pandas_dataframe(limit<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">5</span>)</span></code></pre></div></div>
<p>The example shows the flexibility of Splink, and how various types of configuration can be used:</p>
<ul>
<li><strong>How should different data fields be compared?</strong> In this example, the Jaro-Winkler distance is used for names, whereas Levenshtein is used for date of birth since Jaro-Winkler is not appropriate for numeric data.</li>
<li><strong>What blocking rules should be used?</strong> Blocking rules are the primary determinants of how fast Splink will run, but there is a trade-off between speed and accuracy. In this case, the input data is small, so the blocking rules are loose.</li>
<li><strong>How should the model parameters be estimated?</strong> In this case, the user has no labels for supervised training, and so uses the unsupervised Expectation Maximisation approach.</li>
<li><strong>Is clustering needed?</strong> In this case, each person may potentially have many duplicates, so clustering is used. This creates an estimated (synthetic) unique identifier for each entity (person) in the input dataset.</li>
</ul>
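<p>The effect of blocking rules is worth seeing in miniature. A blocking rule restricts pairwise comparisons to pairs that agree on some field, and multiple rules are unioned together, mirroring the <code>blocking_rules_to_generate_predictions</code> list above. This is a plain-Python sketch, not Splink’s actual SQL-based implementation:</p>

```python
from collections import defaultdict
from itertools import combinations

# Toy records with a plausible duplicate (ids 1-3 may be the same person)
records = [
    {"id": 1, "first_name": "Robin", "surname": "Linacre"},
    {"id": 2, "first_name": "Robyn", "surname": "Linacre"},
    {"id": 3, "first_name": "Robin", "surname": "Linaker"},
    {"id": 4, "first_name": "David", "surname": "Smith"},
]

def candidate_pairs(records, key):
    """Pairs agreeing on `key` -- far fewer than all n*(n-1)/2 pairs."""
    blocks = defaultdict(list)
    for r in records:
        blocks[r[key]].append(r["id"])
    pairs = set()
    for ids in blocks.values():
        pairs.update(combinations(sorted(ids), 2))
    return pairs

# Union of two blocking rules, as in the settings dictionary
pairs = candidate_pairs(records, "first_name") | candidate_pairs(records, "surname")
print(sorted(pairs))  # → [(1, 2), (1, 3)] rather than all 6 possible pairs
```

<p>The trade-off is also visible here: no rule brings records 2 and 3 together, so that pair is never scored, which is why overlapping blocking rules are typically combined.</p>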
</section>
<section id="outcomes" class="level2">
<h2 class="anchored" data-anchor-id="outcomes">Outcomes</h2>
<p>Splink has been used to link some of the largest datasets held by the Ministry of Justice as part of the <a href="https://www.gov.uk/guidance/ministry-of-justice-data-first">Data First programme</a>, and researchers are now <a href="https://www.gov.uk/government/publications/moj-data-first-application-form-for-secure-access-to-data">able to apply for secure access to these datasets</a>. Research using this data <a href="https://www.ons.gov.uk/aboutus/whatwedo/statistics/requestingstatistics/onsresearchexcellenceaward">won the ONS Linked Administrative Data Award at the 2022 Research Excellence Awards</a>.</p>
<p>More widely, the demand for Splink has been higher than we expected – with over 7 million downloads. It has been used in other government departments including the Office for National Statistics and internationally, the private sector, and published academic research from top international universities.</p>
<p>Splink has also had external contributions from over 30 people, including staff at the Australian Bureau of Statistics, DataBricks, other government departments, academics, and various private sector consultancies.</p>
<div class="callout callout-style-simple callout-note" style="margin-top: 2.25rem;">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p><strong>Editor’s note</strong>: For more on data linkage, <a href="https://realworlddatascience.net/foundation-frontiers/interviews/posts/2023/10/16/data-sharing-in-gov.html">check out our interview with Helen Miller-Bakewell of the UK Office for Statistics Regulation</a>, discussing the OSR report, <a href="https://osr.statisticsauthority.gov.uk/publication/data-sharing-and-linkage-for-the-public-good/">Data Sharing and Linkage for the Public Good</a>.</p>
</div>
</div>
</div>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<strong>Robin Linacre</strong> is an economist, data scientist and data engineer based at the UK Ministry of Justice. He is the lead author of Splink.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Robin Linacre
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@possessedphotography?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Possessed Photography</a> on <a href="https://unsplash.com/photos/yellow-metal-chain-NwpSBZMhc-M?utm_content=creditCopyText&amp;utm_medium=referral&amp;utm_source=unsplash">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Linacre, Robin. 2023. “Deduplicating and linking large datasets using Splink.” Real World Data Science, November 22, 2023. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/splink.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>

 ]]></description>
  <category>Crime and justice</category>
  <category>Data quality</category>
  <category>Data linkage</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/splink.html</guid>
  <pubDate>Wed, 22 Nov 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/22/images/possessed-photography-NwpSBZMhc-M-unsplash.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Learning from failure: ‘Red flags’ in body-worn camera data</title>
  <dc:creator>Noah Wright</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/learning-from-failure.html</link>
  <description><![CDATA[ 





<p>Incarcerated youth are an exceptionally vulnerable population, and body-worn cameras are an important tool of accountability both for those incarcerated and the staff who supervise them. In 2018 the Texas Juvenile Justice Department (TJJD) deployed body-worn cameras for the first time, and this is a case study of how the agency developed a methodology for measuring the success of the camera rollout. This is also a case study of analysis failure, as it became clear that real-world implementation problems were corrupting the data and rendering the methodology unusable. However, the process of working through the causes of this failure helped the agency identify previously unrecognized problems and ultimately proved to be of great benefit. The purpose of this case study is to demonstrate how negative findings can still be incredibly useful in real-world settings.</p>
<div class="callout callout-style-simple callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-1-contents" aria-controls="callout-1" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Why body-worn cameras?
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-1" class="callout-1-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>Body-worn cameras became a standard tool of policing in the US in the mid-2010s. By recording officer interactions with the public, law enforcement agencies could achieve a greater degree of accountability. Not only could credible claims of police abuse against civilians be easily verified, the argument went, but false accusations would decline as well, saving law enforcement agencies time and resources that would otherwise be wasted on spurious allegations. Initial studies seemed to support this argument.</p>
<p>TJJD faced similar issues to law enforcement agencies, and body-worn cameras seemed like they could be a useful tool. Secure youth residential facilities in Texas all had overhead cameras, but these were very old (they still ran on tape) and captured no audio. This presented a number of problems when it came to deciphering contested incidents, not to mention that these cameras had clearly not prevented any of the agency’s prior scandals from taking place. TJJD received special funding from the legislature to roll out body-worn cameras system-wide, and all juvenile correctional officers were required to wear one.</p>
</div>
</div>
</div>
<section id="background" class="level2">
<h2 class="anchored" data-anchor-id="background">Background</h2>
<p>From the outset of the rollout of body-worn cameras, TJJD faced a major issue with implementation: in 2019, body-worn cameras were an established tool for law enforcement, but there was very little literature or best practice to draw from for their use in a correctional environment. Unlike police officers, juvenile correctional officers (JCOs) deal directly with their charges for virtually their entire shift. In an eight-hour shift, a police officer might record a few calls and traffic stops. A juvenile correctional officer, on the other hand, would record for almost eight consecutive hours. And, because TJJD recorded round-the-clock for hundreds of employees at a time, this added up very quickly to <em>a lot</em> of footage.</p>
<p>For example, a typical dorm in a correctional center might have four JCOs assigned to it. Across a single week, these four JCOs would be expected to record at least 160 hours of footage.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/jcos-recording-totals.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A table illustrating working hours over the course of a week for four juvenile correctional officers"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Figure 1:</strong> Four JCOs x 40 hours per week = 160 hours of footage.</p>
</div>
<p>This was replicated across every dorm. Three dorms, for example, would produce nearly 500 hours of footage, as seen below.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/jcos-recording-totals-across-dorms.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A table illustrating working hours over the course of a week for four juvenile correctional officers in each of three dorms in one juvenile correctional facility"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Figure 2:</strong> Three dorms x four JCOs x 40 hours per week = 480 hours of footage.</p>
</div>
<p>Finally, we had more than one facility. Four facilities with three dorms each would produce nearly 2,000 hours of footage every week.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/jcos-recording-totals-across-facilities.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A table illustrating working hours over the course of a week for four juvenile correctional officers in each of three dorms in four separate juvenile correctional facilities"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Figure 3:</strong> Four facilities x three dorms x four JCOs x 40 hours per week = 1,920 hours of footage.</p>
</div>
<p>In actuality, we had a total of five facilities each with over a dozen dorms producing an anticipated <strong>17,000 hours</strong> of footage every week – an impossible amount to monitor manually.</p>
<p>As a result, footage review had to be done in a limited, reactive manner. If our monitoring team received an incident report, they could easily zero in on the cameras of the officers involved and review the incident accordingly. But our executive team had hoped to be able to use the footage proactively, looking for “red flags” in order to <em>prevent</em> potential abuses instead of only responding to allegations.</p>
<p>Because the agency had no way of automating the monitoring of footage, any proactive analysis had to be metadata-based. But what to look for in the metadata? Once again, the lack of best-practice literature left us in the lurch. So, we brainstormed ideas for “red flags” and came up with the following that could be screened for using camera metadata:</p>
<ol type="1">
<li><p><strong>Minimal quantity of footage</strong> – our camera policy required correctional officers to have their cameras on at all times in the presence of youth. No footage meant they weren’t using their cameras.</p></li>
<li><p><strong>Frequently turning the camera on and off</strong> – a correctional officer working a dorm should have their cameras always on when around youth and not be turning them on and off repeatedly.</p></li>
<li><p><strong>Large gaps between clips</strong> – it defeats the purpose of having cameras if they’re not turned on.</p></li>
</ol>
<p>In addition, we came up with a fourth red flag, which could be screened for by comparing camera metadata with shift-tracking metadata:</p>
<ol start="4" type="1">
<li><strong>Mismatch between clips recorded and shifts worked</strong> – the agency had very recently rolled out a new shift tracking software. We should expect to see the hours logged by the body cameras roughly match the shift hours worked.</li>
</ol>
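<p>The fourth red flag reduces to a simple per-employee comparison: total footage hours against total shift hours over the same window. A minimal Python sketch of that screen is below (field names and figures are invented for illustration; this is not the agency’s code or schema).</p>

```python
from datetime import datetime

def total_hours(intervals):
    """Sum the durations of (start, end) datetime pairs, in hours."""
    return sum((end - start).total_seconds() for start, end in intervals) / 3600

def shift_footage_ratio(clips, shifts):
    """Fraction of logged shift time covered by footage; well below 1 is a red flag."""
    return total_hours(clips) / total_hours(shifts)

# Hypothetical day: four hours recorded against an eight-hour shift.
clips = [(datetime(2019, 4, 1, 6, 0), datetime(2019, 4, 1, 10, 0))]
shifts = [(datetime(2019, 4, 1, 6, 0), datetime(2019, 4, 1, 14, 0))]
ratio = shift_footage_ratio(clips, shifts)  # 0.5
```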
</section>
<section id="analysis-part-1-quality-control-and-footage-analysis" class="level2">
<h2 class="anchored" data-anchor-id="analysis-part-1-quality-control-and-footage-analysis">Analysis, part 1: Quality control and footage analysis</h2>
<p>For this analysis, I gathered the most recent three weeks of body-worn camera data – which, at the time, covered April 1–21, 2019. I also pulled data from Shifthound (our shift management software) covering the same time period. Finally, I gathered HR data from CAPPS, the system that most of the State of Texas used at the time for personnel management and finance.<sup>1</sup> I then performed some quality control work, summarized in the dropdown box below.</p>
<div class="callout callout-style-simple callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-2-contents" aria-controls="callout-2" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Initial quality control steps
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-2" class="callout-2-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p><code>skimr</code> is a helpful <code>R</code> package for exploratory analysis that gives summary statistics for every variable in a data frame, including missing values. After using the <code>skim</code> function on clip data, shift data, and HR data, I noticed that the clip data had some missing values for employee ID. This was an error which pointed to data entry mistakes – body-worn cameras do not record footage on their own, after all, so employee IDs should be assigned to each clip.</p>
<p>From here I compared the employee ID field in the clip data to the employee ID field in the HR data. Somewhat surprisingly, IDs existed in the clip data that did not correspond to any entries in the HR data, indicating yet more data entry mistakes – the HR data is the ground truth for all employee IDs. I checked the shift data for the same error – employee IDs that did not exist in the HR data – and found the same problem.</p>
<p>As well as employee IDs that did not exist in the HR data, I also looked for employee IDs in the footage and shift data which related to staff who were not actually employed between April 1–21, 2019. I found some examples of this, which indicated yet more errors: staff cannot use a body-worn camera or log a shift if they have yet to begin working or if they have been terminated (system permissions are revoked upon leaving employment).</p>
<p>I made a list of every erroneous ID to pass off to HR and monitoring staff before excluding them from the subsequent analysis. In total, 10.6% of clips, representing 11.3% of total footage, had to be excluded due to these initial data quality issues, foreshadowing the further problems the analysis would uncover.</p>
<p>The full analysis script <a href="https://t.ly/BUNRZ">can be found on GitHub</a>.</p>
</div>
</div>
</div>
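<p>The ID checks described in the dropdown above are essentially set differences (anti-joins): any employee ID in the clip or shift data that is missing, or absent from the HR data, points to a data entry error. A small Python sketch of the idea, with invented IDs:</p>

```python
# HR data is the ground truth for valid employee IDs.
hr_ids = {"9001005", "9001006", "9001007"}

# Clip metadata may contain missing IDs (None) or IDs unknown to HR.
clip_ids = ["9001005", "9001005", None, "9999999"]

missing = [i for i in clip_ids if i is None]                  # data entry gaps
unknown = sorted({i for i in clip_ids if i is not None} - hr_ids)  # anti-join vs HR
# missing -> [None]; unknown -> ["9999999"]
```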
<p>In order to operationalize the “red flags” from our brainstorming session, I needed to see what exactly the cameras captured in their metadata. The variables most relevant to our purposes were:</p>
<ul>
<li>Clip start</li>
<li>Clip end</li>
<li>Camera used</li>
<li>Who was assigned to the camera at the time</li>
<li>The role of the person assigned to the camera</li>
</ul>
<p>Using these fields, I first created the following <strong>aggregations per employee ID</strong>:</p>
<div class="quarto-layout-panel" data-layout-nrow="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/film.png" class="img-fluid figure-img" alt="graphical icon representing a strip of film"></p>
<figcaption><strong>Number of clips</strong> = Number of clips recorded.</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/calendar.png" class="img-fluid figure-img" alt="graphical icon representing a calendar"></p>
<figcaption><strong>Days with footage</strong> = Number of discrete dates that appear in these clips.</figcaption>
</figure>
</div>
</div>
</div>
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/clock.png" class="img-fluid figure-img" alt="graphical icon representing a stopwatch"></p>
<figcaption><strong>Footage hours</strong> = Total duration of all shot footage.</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/caution.png" class="img-fluid figure-img" alt="graphical icon of an exclamation mark inside a circle"></p>
<figcaption><strong>Significant gaps</strong> = Number of clips where the gap between the previous clip’s end and the current clip’s start was greater than 15 minutes but less than eight hours.</figcaption>
</figure>
</div>
</div>
</div>
</div>
<p>I used these aggregations to devise the following <strong>staff metrics</strong>:</p>
<div class="quarto-layout-panel" data-layout-nrow="2">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/clip-per-day.png" class="img-fluid figure-img" alt="graphical icons representing a strip of film and a calendar"></p>
<figcaption><strong>Clips per day</strong> = Number of clips / Days with footage.</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/footage-per-day.png" class="img-fluid figure-img" alt="graphical icons representing a stopwatch and a calendar"></p>
<figcaption><strong>Footage per day</strong> = Footage hours / Days with footage.</figcaption>
</figure>
</div>
</div>
</div>
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/avg-clip-length.png" class="img-fluid figure-img" alt="graphical icons representing a stopwatch and a strip of film"></p>
<figcaption><strong>Average clip length</strong> = Footage hours / Number of clips.</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-left">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/gaps-per-day.png" class="img-fluid figure-img" alt="graphical icons of an exclamation mark inside a circle and a calendar"></p>
<figcaption><strong>Gaps per day</strong> = Gaps / Days with footage.</figcaption>
</figure>
</div>
</div>
</div>
</div>
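<p>The aggregations and metrics above can be sketched for a single employee in a few lines of Python (an illustration under invented data, not the analysis script, which was written in R). Each clip is a (start, end) pair; a significant gap is one of more than 15 minutes but less than eight hours.</p>

```python
from datetime import datetime, timedelta

# Hypothetical clips for one employee, sorted by start time.
clips = [
    (datetime(2019, 4, 1, 6, 0), datetime(2019, 4, 1, 8, 30)),
    (datetime(2019, 4, 1, 13, 0), datetime(2019, 4, 1, 14, 0)),  # 4.5 h gap before this clip
    (datetime(2019, 4, 2, 6, 0), datetime(2019, 4, 2, 10, 0)),
]
clips.sort()

# Aggregations per employee ID.
n_clips = len(clips)
days_with_footage = len({start.date() for start, _ in clips})
footage_hours = sum((end - start).total_seconds() for start, end in clips) / 3600
gaps = sum(
    timedelta(minutes=15) < (nxt_start - prev_end) < timedelta(hours=8)
    for (_, prev_end), (nxt_start, _) in zip(clips, clips[1:])
)

# Derived staff metrics.
clips_per_day = n_clips / days_with_footage          # 1.5
footage_per_day = footage_hours / days_with_footage  # 3.75
avg_clip_length = footage_hours / n_clips            # 2.5
gaps_per_day = gaps / days_with_footage              # 0.5
```

Note that the 16-hour overnight gap between the last clip of April 1 and the first clip of April 2 is not counted: gaps longer than eight hours are assumed to fall between shifts rather than within one.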
<p>Once I established these metrics for each employee, I looked at their respective distributions. Standard staff shift lengths at the time were eight hours. If staff were using their cameras appropriately, we would expect to see distributions centered around clip lengths of about an hour, eight or fewer clips per day, and 8-12 footage hours per day. We would also expect to see zero significant gaps.</p>
<details>
<summary>
Show the code
</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb1-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{r}</span></span>
<span id="cb1-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(tidyverse)</span>
<span id="cb1-3"></span>
<span id="cb1-4">Footage_Metrics_by_Employee <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">read_csv</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Output/Footage Metrics by Employee.csv"</span>)</span>
<span id="cb1-5"></span>
<span id="cb1-6">Footage_Metrics_by_Employee <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb1-7">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">select</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Clips, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Days_With_Footage, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Footage_Hours, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Gaps) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb1-8">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">pivot_longer</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span>Employee_ID, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">names_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Metric"</span>, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">values_to =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Value"</span>) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb1-9">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ggplot</span>(<span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">aes</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">x =</span> Value)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-10">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">geom_histogram</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">+</span></span>
<span id="cb1-11">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">facet_wrap</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">~</span>Metric, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">scales =</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"free"</span>)</span>
<span id="cb1-12"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
</details>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/fig-1.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Four histograms of key metrics - average clip length, average number of clips per day, average footage hours per day, and average gaps per day."></p>
</figure>
</div>
<p>By eyeballing the distributions, I could tell most staff were recording fewer than 10 clips per day, shooting about 0.5–2 hours for each clip, for a total of 2–10 hours of daily footage, with the majority of employees having less than one significant gap per day. Superficially, this appeared to provide evidence of widespread attempts at complying with the body-worn camera policy and no systemic rejection or resistance. If this were indeed the case, then we could turn our attention to individual outliers.</p>
<p>First, though, we thought we would attempt to validate this initial impression by testing another assumption. If each employee works on average 40 hours per week – a substantial underestimate given how common overtime was – we should expect, over a three-week period, to see about 120 hours of footage per employee in the dataset. This is <em>not</em> what we found.</p>
<p>Average footage per employee was 70.2 hours over the three-week period, meaning that the average employee was recording less than 60% of shift hours worked. With so many hours going unrecorded for unknown reasons, we needed to investigate further.</p>
<p>Surely the shift data would clarify this…</p>
</section>
<section id="analysis-part-2-footage-and-shift-comparison" class="level2">
<h2 class="anchored" data-anchor-id="analysis-part-2-footage-and-shift-comparison">Analysis, part 2: Footage and shift comparison</h2>
<p>With the data on shifts worked from our timekeeping system, I could theoretically compare actual shifts worked to the amount of footage recorded. If there were patterns in where the gaps in footage fell, that comparison might help to explain why.</p>
<p>In order to join the shift data to the camera data, I needed a common unit of analysis beyond “Employee ID.” Using only this value would produce a nonsensical table that joined up every clip of footage to every shift worked.</p>
<p>For example, let’s take employee #9001005 at Facility Epsilon between April 1–3. This employee has the following clips recorded during that time period:</p>
<div class="table-responsive">
<table class="caption-top table">
<thead>
<tr class="header">
<th style="text-align: left;">Employee_ID</th>
<th style="text-align: left;">Clip_ID</th>
<th style="text-align: left;">Clip_Start</th>
<th style="text-align: left;">Clip_End</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">156421</td>
<td style="text-align: left;">2019-04-01 05:54:34</td>
<td style="text-align: left;">2019-04-01 08:34:34</td>
</tr>
<tr class="even">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">155093</td>
<td style="text-align: left;">2019-04-01 08:40:59</td>
<td style="text-align: left;">2019-04-01 08:54:51</td>
</tr>
<tr class="odd">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">151419</td>
<td style="text-align: left;">2019-04-01 09:03:16</td>
<td style="text-align: left;">2019-04-01 11:00:30</td>
</tr>
<tr class="even">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">153133</td>
<td style="text-align: left;">2019-04-01 11:10:09</td>
<td style="text-align: left;">2019-04-01 12:39:51</td>
</tr>
<tr class="odd">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">151088</td>
<td style="text-align: left;">2019-04-01 12:57:51</td>
<td style="text-align: left;">2019-04-01 14:06:44</td>
</tr>
<tr class="even">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">150947</td>
<td style="text-align: left;">2019-04-02 05:56:34</td>
<td style="text-align: left;">2019-04-02 09:48:50</td>
</tr>
<tr class="odd">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">151699</td>
<td style="text-align: left;">2019-04-02 09:54:23</td>
<td style="text-align: left;">2019-04-02 12:17:15</td>
</tr>
</tbody>
</table>
</div>
<p>We can join this to a similar table of shifts logged. This particular employee had the following shifts scheduled from April 1–3:</p>
<div class="table-responsive">
<table class="caption-top table">
<thead>
<tr class="header">
<th style="text-align: left;">Employee_ID</th>
<th style="text-align: left;">Shift_ID</th>
<th style="text-align: left;">Shift_Start</th>
<th style="text-align: left;">Shift_End</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E050603</td>
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 14:00:00</td>
</tr>
<tr class="even">
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E051303</td>
<td style="text-align: left;">2019-04-02 06:00:00</td>
<td style="text-align: left;">2019-04-02 14:00:00</td>
</tr>
</tbody>
</table>
</div>
<p>The table shows two eight-hour morning shifts from 6:00 am to 2:00 pm. We can join the two tables together by ID on a messy many-to-many join, but that tells us nothing about how much they overlap (or fail to overlap) without extensive additional work. For example, we have a unique identifier for employee clip (Clip_ID) and employee shift (Shift_ID), but what we need is a unique identifier that can be used to join the two. Fortunately, for this particular data we can <em>create</em> a unique identifier since both clips and shifts are fundamentally measures of <em>time</em>. While Employee_ID is not in itself unique (i.e., one employee can have multiple clips attached to that ID), Employee_ID combined with time of day is unique. A person can only be in one place at a time, after all!</p>
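<p>The unfolding idea can be sketched in a few lines of Python (purely illustrative; the actual analysis was done in R): expand each (employee, start, end) interval into one row per hour touched, so that clips and shifts can be joined on the (employee, hour) pair.</p>

```python
from datetime import datetime, timedelta

def to_hours(employee_id, start, end):
    """Yield one (employee_id, hour) pair per clock hour the interval touches."""
    hour = start.replace(minute=0, second=0, microsecond=0)  # floor to the hour
    while hour < end:
        yield (employee_id, hour)
        hour += timedelta(hours=1)

# An eight-hour shift unfolds into eight joinable employee-hour rows.
shift = list(to_hours("9001005",
                      datetime(2019, 4, 1, 6, 0),
                      datetime(2019, 4, 1, 14, 0)))
n_units = len(shift)  # 8
```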
<p>To reshape the data for joining, I created a function that takes any data frame with a start and end column and unfolds it into discrete units of time. Using the code below to create the “Interval_Convert” function, the shift data above for employee 9001005 converts into one entry per hour of the day per shift. As a result, two eight-hour shifts get turned into 16 employee hours (a sample of which is shown below).</p>
<details>
<summary>
Show the code
</summary>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb2-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{r}</span></span>
<span id="cb2-2"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(sqldf)</span>
<span id="cb2-3"><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">library</span>(lubridate)</span>
<span id="cb2-4"></span>
<span id="cb2-5">Interval_Convert <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">function</span>(DF, Start_Col, End_Col, Int_Unit, <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Int_Length =</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>) {</span>
<span id="cb2-7">  Start_Col2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">enquo</span>(Start_Col)</span>
<span id="cb2-8">  End_Col2 <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">enquo</span>(End_Col)</span>
<span id="cb2-9">  </span>
<span id="cb2-10">  Start_End <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> DF <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-11">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ungroup</span>() <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-12">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">summarize</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Min_Start =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">min</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2),</span>
<span id="cb2-13">              <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Max_End =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">max</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>End_Col2)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-14">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Start =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">floor_date</span>(Min_Start, Int_Unit),</span>
<span id="cb2-15">           <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">End =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">ceiling_date</span>(Max_End, Int_Unit))</span>
<span id="cb2-16">  </span>
<span id="cb2-17">  DF <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> DF <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-18">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Single =</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>End_Col2)</span>
<span id="cb2-19">  </span>
<span id="cb2-20">  Interval_Table <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">data.frame</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Interval_Start =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">seq.POSIXt</span>(Start_End<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>Start[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], Start_End<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">$</span>End[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">1</span>], <span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">by =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">str_c</span>(Int_Length, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">" "</span>, Int_Unit))) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-21">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Interval_End =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">lead</span>(Interval_Start)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-22">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span><span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">is.na</span>(Interval_End))</span>
<span id="cb2-23">  </span>
<span id="cb2-24">  by <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">join_by</span>(Interval_Start <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;=</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>End_Col2, Interval_End <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;=</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2)  </span>
<span id="cb2-25">  </span>
<span id="cb2-26">  Interval_Data_Table <span class="ot" style="color: #003B4F;
background-color: null;
font-style: inherit;">&lt;-</span> Interval_Table <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb2-27">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">left_join</span>(DF, by) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span> </span>
<span id="cb2-28">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">mutate</span>(<span class="at" style="color: #657422;
background-color: null;
font-style: inherit;">Seconds_Duration_Within_Interval =</span> <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">if_else</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>End_Col2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> Interval_End, Interval_End, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>End_Col2) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">-</span></span>
<span id="cb2-29">             <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">if_else</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2 <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&lt;</span> Interval_Start, Interval_Start, <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2)) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">%&gt;%</span></span>
<span id="cb2-30">    <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">filter</span>(<span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!</span>(Single <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&amp;</span> Interval_End <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">==</span> <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">!!</span>Start_Col2),</span>
<span id="cb2-31">           <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">as.numeric</span>(Seconds_Duration_Within_Interval) <span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb2-32">  </span>
<span id="cb2-33">  <span class="fu" style="color: #4758AB;
background-color: null;
font-style: inherit;">return</span>(Interval_Data_Table)</span>
<span id="cb2-34">}</span>
<span id="cb2-35"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
</details>
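The core of the function is the overlap calculation: each record is matched against every hourly interval it touches, and the duration within each interval is clipped to the record's boundaries. The same idea can be sketched outside of R; the minimal Python version below is an illustration only (the function name and sample shift are invented for this sketch, not part of the original analysis):

```python
from datetime import datetime, timedelta

def interval_convert(records, unit=timedelta(hours=1)):
    """Unfold (id, start, end) records into one row per time unit,
    clipping the first and last rows to the record boundaries."""
    rows = []
    for record_id, start, end in records:
        # Align the first interval to the top of the unit (here, the hour)
        cursor = start.replace(minute=0, second=0, microsecond=0)
        while cursor < end:
            interval_end = cursor + unit
            # Overlap of [start, end] with the current interval
            overlap = min(end, interval_end) - max(start, cursor)
            if overlap > timedelta(0):
                rows.append((record_id, cursor, interval_end,
                             int(overlap.total_seconds())))
            cursor = interval_end
    return rows

# One eight-hour shift unfolds into eight hourly rows of 3,600 seconds each
shift = [("E050603", datetime(2019, 4, 1, 6, 0), datetime(2019, 4, 1, 14, 0))]
for row in interval_convert(shift)[:2]:
    print(row)
```

A shift that starts or ends mid-hour simply gets a shorter first or last row, which is what the clipping handles.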
<div class="table-responsive">
<table class="caption-top table">
<colgroup>
<col style="width: 13%">
<col style="width: 12%">
<col style="width: 13%">
<col style="width: 9%">
<col style="width: 11%">
<col style="width: 10%">
<col style="width: 29%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">Interval_Start</th>
<th style="text-align: left;">Interval_End</th>
<th style="text-align: left;">Employee_ID</th>
<th style="text-align: left;">Shift_ID</th>
<th style="text-align: left;">Shift_Start</th>
<th style="text-align: left;">Shift_End</th>
<th style="text-align: left;">Seconds_Duration_Within_Interval</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 07:00:00</td>
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E050603</td>
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 14:00:00</td>
<td style="text-align: left;">3600 secs</td>
</tr>
<tr class="even">
<td style="text-align: left;">2019-04-01 07:00:00</td>
<td style="text-align: left;">2019-04-01 08:00:00</td>
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E050603</td>
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 14:00:00</td>
<td style="text-align: left;">3600 secs</td>
</tr>
<tr class="odd">
<td style="text-align: left;">2019-04-01 08:00:00</td>
<td style="text-align: left;">2019-04-01 09:00:00</td>
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E050603</td>
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 14:00:00</td>
<td style="text-align: left;">3600 secs</td>
</tr>
<tr class="even">
<td style="text-align: left;">2019-04-01 09:00:00</td>
<td style="text-align: left;">2019-04-01 10:00:00</td>
<td style="text-align: left;">9001005</td>
<td style="text-align: left;">E050603</td>
<td style="text-align: left;">2019-04-01 06:00:00</td>
<td style="text-align: left;">2019-04-01 14:00:00</td>
<td style="text-align: left;">3600 secs</td>
</tr>
<tr class="odd">
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
<td style="text-align: left;">…</td>
</tr>
</tbody>
</table>
</div>
<p>The footage could be converted in the same manner, breaking both the shift data and the clip data down into an hour-by-hour view that allowed direct comparison. Using this new format, I joined the full tables of footage and shifts to determine how much footage was recorded with no corresponding shift in the timekeeping system.</p>
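Conceptually, finding footage hours with no corresponding shift is an anti-join on employee and hourly interval. A minimal Python sketch of the idea, using invented toy rows rather than the real data:

```python
def hours_without_match(footage_hours, shift_hours):
    """Count hourly footage rows with no shift row for the same
    employee and interval (a set-based anti-join).
    Rows are (employee_id, interval_start) tuples."""
    return len(set(footage_hours) - set(shift_hours))

# Toy data: three footage hours, only two covered by logged shifts
footage = [(9001005, "2019-04-01 06:00"), (9001005, "2019-04-01 07:00"),
           (9001005, "2019-04-01 08:00")]
shifts = [(9001005, "2019-04-01 06:00"), (9001005, "2019-04-01 07:00")]
print(hours_without_match(footage, shifts))  # 1
```

Running the same anti-join in the opposite direction gives the shift hours with no matching footage, which is the second comparison below.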
<div class="table-responsive">
<table class="caption-top table">
<colgroup>
<col style="width: 18%">
<col style="width: 34%">
<col style="width: 47%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">HR_Location</th>
<th style="text-align: left;">Footage_Hours_No_Shift</th>
<th style="text-align: left;">Employee_IDs_With_Missing_Shift</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Alpha</td>
<td style="text-align: left;">1805</td>
<td style="text-align: left;">122</td>
</tr>
<tr class="even">
<td style="text-align: left;">Beta</td>
<td style="text-align: left;">3749</td>
<td style="text-align: left;">114</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Delta</td>
<td style="text-align: left;">1208</td>
<td style="text-align: left;">133</td>
</tr>
<tr class="even">
<td style="text-align: left;">Epsilon</td>
<td style="text-align: left;">2899</td>
<td style="text-align: left;">157</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Gamma</td>
<td style="text-align: left;">4153</td>
<td style="text-align: left;">170</td>
</tr>
</tbody>
</table>
</div>
<p>To summarize what the table is telling us: Almost every employee has footage hours that do not match with logged shifts, totaling nearly 14,000 hours when you add up the Footage_Hours_No_Shift column. But what about the opposite case? How many shift hours were logged with no corresponding footage?</p>
<div class="table-responsive">
<table class="caption-top table">
<colgroup>
<col style="width: 18%">
<col style="width: 33%">
<col style="width: 48%">
</colgroup>
<thead>
<tr class="header">
<th style="text-align: left;">HR_Location</th>
<th style="text-align: left;">Shift_Hours_No_Footage</th>
<th style="text-align: left;">Employee_IDs_With_Missing_Footage</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td style="text-align: left;">Alpha</td>
<td style="text-align: left;">7338</td>
<td style="text-align: left;">127</td>
</tr>
<tr class="even">
<td style="text-align: left;">Beta</td>
<td style="text-align: left;">6014</td>
<td style="text-align: left;">118</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Delta</td>
<td style="text-align: left;">12830</td>
<td style="text-align: left;">141</td>
</tr>
<tr class="even">
<td style="text-align: left;">Epsilon</td>
<td style="text-align: left;">9000</td>
<td style="text-align: left;">168</td>
</tr>
<tr class="odd">
<td style="text-align: left;">Gamma</td>
<td style="text-align: left;">11960</td>
<td style="text-align: left;">183</td>
</tr>
</tbody>
</table>
</div>
<p>Oh dear. Again, almost every employee has logged shift hours with no footage: 47,000 hours in total. To put it another way, that’s an entire work week per employee not showing up in camera footage.</p>
<p>At this point, we could probably rule out deliberate noncompliance. The clip data already implied that most employees were following the policy, and our facility leadership would surely have noticed a mass refusal large enough to show up this clearly in the data.</p>
<p>One way to check for deliberate noncompliance would be to first exclude shifts with no footage at all. This would rule out total mismatches, where – for whatever reason – the logged shifts had failed to overlap with any recorded clips. For the remaining shifts that <em>do</em> contain footage, we could look at the proportion of each shift covered by footage: if an eight-hour shift had four hours of recorded footage associated with it, then 50% of that shift had been recorded. The following histogram shows the distribution of employees by the percentage of their shift-hours they recorded (counting only shifts with a nonzero amount of footage).</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/fig-2.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A histogram of average percent of shifts recorded, excluding shifts with no recorded footage."></p>
</figure>
</div>
<p>As it turned out, most employees recorded the majority of their matching shifts, a finding that roughly aligns with the initial clip analysis. So, what explains the 14,000 hours of footage with no shifts, and the 47,000 hours of shifts with no footage?</p>
</section>
<section id="causes-of-failure" class="level2">
<h2 class="anchored" data-anchor-id="causes-of-failure">Causes of failure</h2>
<p>Here, I believed, we had reached the end of what I could do with data alone, and so I presented these findings (or lack thereof) to executive leadership. The failure to gather reliable data from linking the clip data to the shift data prompted follow-ups into what exactly was going wrong. As it turned out, <em>many</em> things were going wrong.</p>
<p>First, a number of technical problems plagued the early rollout of the cameras:</p>
<ul>
<li><p>All of our facilities suffered from high turnover, and camera ownership was not consistently updated. Employees who no longer worked at the agency could therefore appear in the clip data – somebody else had taken over their camera but had not put their name and ID on it.</p></li>
<li><p>We had no way of telling if a camera was not recording due to being docked and recharging or not recording due to being switched off.</p></li>
<li><p>In the early days of the rollout, footage got assigned to an employee based on the owner of the <em>dock</em>, not the camera. In other words, if Employee A had recorded their shift with their camera but uploaded the footage using a dock assigned to Employee B then the footage would show up in the system as belonging to Employee B.</p></li>
</ul>
<p>The shift data was, unsurprisingly, even worse, and it was here we came across our most important finding. While the evidence showed that there wasn’t any widespread non-compliance with the use of the cameras, there <em>was</em> widespread non-compliance with the use of our shift management software. Details are included in the dropdown box below.</p>
<div class="callout callout-style-simple callout-note callout-titled">
<div class="callout-header d-flex align-content-center collapsed" data-bs-toggle="collapse" data-bs-target=".callout-3-contents" aria-controls="callout-3" aria-expanded="false" aria-label="Toggle callout">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Note</span>Quality issues in shift tracking data
</div>
<div class="callout-btn-toggle d-inline-block border-0 py-1 ps-1 pe-0 float-end"><i class="callout-toggle"></i></div>
</div>
<div id="callout-3" class="callout-3-contents callout-collapse collapse">
<div class="callout-body-container callout-body">
<p>Our HR system, CAPPS, had a feature that tracked hours worked in order to calculate leave and overtime pay. However, CAPPS was a statewide application designed for 9–5 office workers, and could not capture the irregular working hours of our staff (much less aid in planning future shifts). We had obtained separate shift management software to fill these gaps, but had not realized how inconsistently it was being used. All facilities were required to have their employees log their shifts, but some followed through on this better than others. And even for those that did make a good-faith effort at follow-through, quality control was nonexistent.</p>
<p>In CAPPS, time entry determined pay, so strong incentives existed to ensure accurate entry. But for our shift management software, no incentives existed at all for making sure that entries were correct. For example, a correctional officer could have a 40-hour work week scheduled in the shift software but miss the entire week due to an injury, and the software would still show them as having worked 40 hours that week. Nobody bothered to go back and correct these types of errors because there was no reason to.</p>
<p>The software was intended to be used proactively for planning purposes, not after-the-fact for logging and tracking purposes. Thus, it produced data that was totally inconsistent with actual hours worked, which became apparent when compared to data (like body-worn camera footage) that tracked actual hours on the floor.</p>
<p>In the end, we had to rethink a number of aspects of the shift software’s implementation. In the process of these fixes, leadership also came to make explicit that the software’s primary purpose was to help facilities schedule future shifts, not audit hours worked after the fact (which CAPPS already did, just on a day-by-day basis as opposed to an hour-by-hour basis). This analysis was the only time we attempted to use the shift data in this manner.</p>
</div>
</div>
</div>
</section>
<section id="what-we-learned-from-failure" class="level2">
<h2 class="anchored" data-anchor-id="what-we-learned-from-failure">What we learned from failure</h2>
<p>Whatever means we used to monitor compliance with the camera policy, we learned that it couldn’t be fully automated. The agency followed up this analysis with a random sampling approach, in which monitors would randomly select times of day when they knew a given staff member would have to have their camera turned on, and actually watch the associated clips. This review process confirmed the first impressions from the statistical review above: most employees <em>were</em> making good-faith attempts to comply with the policy despite technical glitches, short-staffing, and administrative confusion. It also confirmed that proactive monitoring of correctional officers was a human process which had to come from supervisors and staff.</p>
<p>The one piece of the analysis we did use going forward was the clip analysis (converted into a Power BI dashboard and included in the <a href="https://github.com/enndubbs/Body-Worn-Camera-Monitoring">GitHub repository</a> for this article), but only as a supplement for already-launched investigations, not a prompt for one. Body-worn camera footage remained immensely useful for investigations after-the-fact, but inconsistencies in clip data were not, in and of themselves, particularly noteworthy “red flags.” At the end of the day, analytics can contextualize and enhance human judgment, but it cannot replace it.</p>
<p>In academia, the bias in favor of positive findings is well-documented. The failure to find something, or a lack of statistical significance, does not lend itself to publication in the same way that a novel discovery does. But, in an applied setting, where results matter more than publication criteria, negative findings can be highly insightful. They can falsify erroneous assumptions, bring unknown problems to light, and prompt the creation of new processes and tools. In this context, a failure is only truly a failure if nothing is learned from it.</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the author</dt>
<dd>
<strong>Noah Wright</strong> is a data scientist with the Texas Juvenile Justice Department. He is interested in the applications of data science to public policy in the context of real-world constraints, and the ethics thereof (ethics being highly relevant in his line of work). He can be reached on <a href="https://www.linkedin.com/in/noahdwright/">LinkedIn</a>.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Noah Wright
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Wright, Noah. 2023. “Learning from failure: ‘Red flags’ in body-worn camera data.” Real World Data Science, November 16, 2023. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/learning-from-failure.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>The underlying data for the analysis as presented in this article was requested through the Texas Public Information Act and went through TJJD’s approval process for ensuring anonymity of records. It is available on <a href="https://github.com/enndubbs/Body-Worn-Camera-Monitoring">GitHub</a> along with the rest of the code used to write this article.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Crime and justice</category>
  <category>Public policy</category>
  <category>Data quality</category>
  <category>Data analysis</category>
  <category>Monitoring</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/learning-from-failure.html</guid>
  <pubDate>Thu, 16 Nov 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/11/16/images/bodycam-monitor.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: The value of competitions for confidential data</title>
  <dc:creator>Steven Bedrick, Ophir Frieder, Julia Lane, and Philip Resnik</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/06-value-of-competitions.html</link>
  <description><![CDATA[ 





<p>We are witnessing a sea change in data collection practices by both governments and businesses – from purposeful collection (through surveys and censuses, for example) to opportunistic (drawing on web and social media data, and administrative datasets). This shift has made clear the importance of record linkage – a government might, for example, look to link records held by its various departments to understand how citizens make use of the gamut of public services.</p>
<p>However, creating manual linkages between datasets can be prohibitively expensive, time consuming, and subject to human constraints and bias. Machine learning (ML) techniques offer the potential to combine data better, faster, and more cheaply. But, as the recently released <a href="https://www.ai.gov/wp-content/uploads/2023/01/NAIRR-TF-Final-Report-2023.pdf">National AI Research Resources Task Force report</a> highlights, it is important to have an open and transparent approach to ensure that unintended biases do not occur.</p>
<p>In other words, ML tools are not a substitute for thoughtful analysis. Both private and public producers of a linked dataset have to determine the level of linkage quality – such as what precision/recall tradeoff is best for the intended purpose (that is, the balance between false-positive links and failure to cover links that should be there), how much processing time and cost is acceptable, and how to address coverage issues. The challenge is made more difficult by the idiosyncrasies of heterogeneous datasets, and more difficult yet when datasets to be linked include confidential data <span class="citation" data-cites="10.1257/jel.20171350 DBLP:books/sp/ChristenRS20">(Christensen and Miguel 2018; Christen et al. 2020)</span>.</p>
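To make the precision/recall tradeoff concrete: treating linkage as a decision over candidate record pairs, precision is the share of proposed links that are correct, and recall is the share of true links that are found. A small sketch with invented pairs (illustrative only, not drawn from any dataset discussed here):

```python
def precision_recall(proposed, truth):
    """Precision and recall over sets of (record_a, record_b) link pairs."""
    true_positives = len(proposed & truth)
    precision = true_positives / len(proposed) if proposed else 0.0
    recall = true_positives / len(truth) if truth else 0.0
    return precision, recall

truth = {(1, "a"), (2, "b"), (3, "c"), (4, "d")}
proposed = {(1, "a"), (2, "b"), (5, "e")}  # one false positive, two missed links
precision, recall = precision_recall(proposed, truth)
print(f"precision={precision:.2f}, recall={recall:.2f}")  # precision=0.67, recall=0.50
```

A looser matching rule would raise recall at the cost of precision, and vice versa; which point on that curve is acceptable is exactly the kind of decision that belongs to domain experts, not to the matching algorithm.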
<p>And, of course, an ML solution is never the end of the road: many data linkage scenarios are highly dynamic, involving use cases, datasets, and technical ecosystems that change and evolve over time; effective use of ML in practice necessitates an ongoing and continuous investment <span class="citation" data-cites="DBLP:journals/corr/abs-2112-01716">(Koch et al. 2021)</span>. Because techniques are constantly improving, producers need to keep abreast of new approaches. A model that is working well today may no longer work in a year because of changes in the data, or because the organizational needs have changed so that a certain type of error is no longer acceptable. As Sculley et al.&nbsp;point out, “it is remarkably easy to incur massive ongoing maintenance costs at the system level when applying machine learning” <span class="citation" data-cites="43146">(Sculley et al. 2014)</span>.</p>
<p>Also important is that record linkage is not seen as a technical problem relegated to the realm of computer scientists to solve. The full engagement of domain experts in designing the optimization problem, identifying measures of success, and evaluating the quality of the results is absolutely critical, as is building an understanding of the pros and cons of different measures <span class="citation" data-cites="10.1371/journal.pone.0249833 10.1007/s11222-017-9746-6">(Schafer et al. 2021; Hand and Christen 2018)</span>. There will need to be much learning by doing in “sandbox” environments, and back and forth communication across communities to achieve successful outcomes, as noted in the <a href="https://www.bea.gov/system/files/2022-10/acdeb-year-2-report.pdf">recommendations of the Advisory Committee on Data for Evidence Building</a> (a screenshot of which is shown in Figure 1).</p>
<p><a href="images/pt6-fig1.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt6-fig1.png" class="img-fluid" width="700"></a></p>
<div class="figure-caption">
<p><strong>Figure 1:</strong> A recommendation for building an “innovation sandbox” as part of the creation of a new National Secure Data Service in the United States.</p>
</div>
<p>Despite the importance of trial and error and transparency about linkage quality, there is no handbook that guides domain experts in how to design such sandboxes. There is a very real need for agreed-upon, domain-independent guidelines, or better yet, official standards to evaluate sandboxes. Those standards would define “who” could and would conduct the evaluation, and help guarantee independence and repeatability. And while innovation challenges have been embraced by the federal government, the devil can be very much in the details <span class="citation" data-cites="4138bca6-f7b7-3af8-a96c-5e2544823c5c">(Williams 2012)</span>.</p>
<p>It is for this reason that the approach taken in the Food for Thought linkage competition, and described in this compendium, provides an important first step towards a well specified, replicable framework for achieving high quality outcomes. In that respect it joins other recent efforts to bring together community-level research on shared sensitive data <span class="citation" data-cites="macavaney-etal-2021-community tsakalidis-etal-2022-overview">(MacAvaney et al. 2021; Tsakalidis et al. 2022)</span>. This competition, like those, helped bring to the foreground both the opportunities and challenges of doing research in secure sandboxes with sensitive data. Notably, these exercises highlight a kind of cultural tension between secure, managed environments, on the one hand, and unfettered machine learning research, on the other. The need for flexibility and agility in computational research bumps up against the need for advance planning and careful step-by-step processes in environments with well-defined data governance rules, and one of the key lessons learned is that the tradeoffs here need to be recognized and planned for.</p>
<p>This particular competition was important for a number of other reasons. Thanks to its organization as a competition, complete with prizes and bragging rights for strongly performing teams, it attracted new eyes from computer science and data science to think about how to address a critical real-world linkage problem. It offered the potential to produce approaches that were scalable, transparent, and reproducible. The engagement of domain experts and statisticians meant that it will be possible to conduct an informed error analysis, to explicitly relate the performance metrics in the task to the problem being solved in the real world, and to bring in the expertise of survey methodologists to think about the possible adjustments. And because it identified different approaches of addressing the same problem, it created an environment for new innovative ideas.</p>
<p>More generally, in addition to the excitement of the new approaches, this exercise laid bare the fragility of linkages in general and highlighted the importance of secure sandboxes for confidential data. While the promise of privacy-preserving technologies is alluring as <a href="https://www.bea.gov/system/files/2022-10/acdeb-year-2-report.pdf">an alternative to bringing confidential data together in one place</a>, such approaches are likely too immature to deploy ad hoc until a better understanding is established of how to translate real-world problems and their associated data into well-defined tasks, how to measure quality, and particularly how to assess the impact of match quality on different subgroups <span class="citation" data-cites="10.1145/3433638">(Domingo-Ferrer et al. 2021)</span>. The scientific profession has gone through too painful a lesson with the premature application of differential privacy techniques to ignore the lessons that can be learned from a careful and systematic analysis of different approaches <span class="citation" data-cites="10.1145/3433638 van_riper 10.1257/pandp.20191107 giles2022faking">(Domingo-Ferrer et al. 2021; Van Riper et al. 2020; Ruggles et al. 2019; Giles et al. 2022)</span>.</p>
<p>We hope that the articles in this collection provide not only the first steps towards a handbook of best practices, but also an inspiration to share lessons learned, so that success can be emulated, and failures understood and avoided.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/05-third-place-winners.html">← Part 5: Third place winners</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Steven Bedrick</strong> is an associate professor in Oregon Health and Science University’s Department of Medical Informatics and Clinical Epidemiology.
</dd>
<dd>
<strong>Ophir Frieder</strong> is a professor in Georgetown University’s Department of Computer Science, and in the Department of Biostatistics, Bioinformatics &amp; Biomathematics at Georgetown University Medical Center.
</dd>
<dd>
<strong>Julia Lane</strong> is a professor at the NYU Wagner Graduate School of Public Service and a NYU Provostial Fellow for Innovation Analytics. She co-founded the Coleridge Initiative.
</dd>
<dd>
<strong>Philip Resnik</strong> holds a joint appointment as professor in the University of Maryland Institute for Advanced Computer Studies and the Department of Linguistics, and an affiliate professor appointment in computer science.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Steven Bedrick, Ophir Frieder, Julia Lane, and Philip Resnik
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@alexandru_tugui?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Alexandru Tugui</a> on <a href="https://unsplash.com/photos/-inuQpBGbgI?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Bedrick, Steven, Ophir Frieder, Julia Lane, and Philip Resnik. 2023. “Food for Thought: The value of competitions for confidential data.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/06-value-of-competitions.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>




<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-DBLP:books/sp/ChristenRS20" class="csl-entry">
Christen, P., T. Ranbaduge, and R. Schnell. 2020. <em>Linking Sensitive Data - Methods and Techniques for Practical Privacy-Preserving Information Sharing</em>. Springer. <a href="https://doi.org/10.1007/978-3-030-59706-1">https://doi.org/10.1007/978-3-030-59706-1</a>.
</div>
<div id="ref-10.1257/jel.20171350" class="csl-entry">
Christensen, G., and E. Miguel. 2018. <span>“Transparency, Reproducibility, and the Credibility of Economics Research.”</span> <em>Journal of Economic Literature</em> 56 (3): 920–80. <a href="https://doi.org/10.1257/jel.20171350">https://doi.org/10.1257/jel.20171350</a>.
</div>
<div id="ref-10.1145/3433638" class="csl-entry">
Domingo-Ferrer, J., D. Sánchez, and A. Blanco-Justicia. 2021. <span>“The Limits of Differential Privacy (and Its Misuse in Data Release and Machine Learning).”</span> <em>Communications of the ACM</em> (New York, NY, USA) 64 (7): 33–35. <a href="https://doi.org/10.1145/3433638">https://doi.org/10.1145/3433638</a>.
</div>
<div id="ref-giles2022faking" class="csl-entry">
Giles, O., K. Hosseini, G. Mingas, et al. 2022. <em>Faking Feature Importance: A Cautionary Tale on the Use of Differentially-Private Synthetic Data</em>. <a href="https://arxiv.org/abs/2203.01363">https://arxiv.org/abs/2203.01363</a>.
</div>
<div id="ref-10.1007/s11222-017-9746-6" class="csl-entry">
Hand, D., and P. Christen. 2018. <span>“A Note on Using the f-Measure for Evaluating Record Linkage Algorithms.”</span> <em>Statistics and Computing</em> (USA) 28 (3): 539–47. <a href="https://doi.org/10.1007/s11222-017-9746-6">https://doi.org/10.1007/s11222-017-9746-6</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-2112-01716" class="csl-entry">
Koch, B., E. Denton, A. Hanna, and J. G. Foster. 2021. <span>“Reduced, Reused and Recycled: The Life of a Dataset in Machine Learning Research.”</span> <em>CoRR</em> abs/2112.01716. <a href="https://arxiv.org/abs/2112.01716">https://arxiv.org/abs/2112.01716</a>.
</div>
<div id="ref-macavaney-etal-2021-community" class="csl-entry">
MacAvaney, S., A. Mittu, G. Coppersmith, J. Leintz, and P. Resnik. 2021. <span>“Community-Level Research on Suicidality Prediction in a Secure Environment: Overview of the <span>CLP</span>sych 2021 Shared Task.”</span> <em>Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access</em> (Online), June, 70–80. <a href="https://doi.org/10.18653/v1/2021.clpsych-1.7">https://doi.org/10.18653/v1/2021.clpsych-1.7</a>.
</div>
<div id="ref-10.1257/pandp.20191107" class="csl-entry">
Ruggles, S., C. Fitch, D. Magnuson, and J. Schroeder. 2019. <span>“Differential Privacy and Census Data: Implications for Social and Economic Research.”</span> <em>AEA Papers and Proceedings</em> 109 (May): 403–8. <a href="https://doi.org/10.1257/pandp.20191107">https://doi.org/10.1257/pandp.20191107</a>.
</div>
<div id="ref-10.1371/journal.pone.0249833" class="csl-entry">
Schafer, K. M., G. Kennedy, A. Gallyer, and P. Resnik. 2021. <span>“A Direct Comparison of Theory-Driven and Machine Learning Prediction of Suicide: A Meta-Analysis.”</span> <em>PLOS ONE</em> 16 (4): 1–23. <a href="https://doi.org/10.1371/journal.pone.0249833">https://doi.org/10.1371/journal.pone.0249833</a>.
</div>
<div id="ref-43146" class="csl-entry">
Sculley, D., G. Holt, D. Golovin, et al. 2014. <span>“Machine Learning: The High Interest Credit Card of Technical Debt.”</span> <em>SE4ML: Software Engineering for Machine Learning (NIPS 2014 Workshop)</em>.
</div>
<div id="ref-tsakalidis-etal-2022-overview" class="csl-entry">
Tsakalidis, A., J. Chim, I. M. Bilal, et al. 2022. <span>“Overview of the <span>CLP</span>sych 2022 Shared Task: Capturing Moments of Change in Longitudinal User Posts.”</span> <em>Proceedings of the Eighth Workshop on Computational Linguistics and Clinical Psychology</em> (Seattle, USA), July, 184–98. <a href="https://doi.org/10.18653/v1/2022.clpsych-1.16">https://doi.org/10.18653/v1/2022.clpsych-1.16</a>.
</div>
<div id="ref-van_riper" class="csl-entry">
Van Riper, D., T. Kugler, J. Schroeder, and S. Ruggles. 2020. <span>“Differential Privacy and Racial Residential Segregation.”</span> <em>2020 APPAM Fall Research Conference</em>.
</div>
<div id="ref-4138bca6-f7b7-3af8-a96c-5e2544823c5c" class="csl-entry">
Williams, H. 2012. <span>“Innovation Inducement Prizes: Connecting Research to Policy.”</span> <em>Journal of Policy Analysis and Management</em> 31 (3): 752–76. <a href="http://www.jstor.org/stable/41653827">http://www.jstor.org/stable/41653827</a>.
</div>
</div></section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/06-value-of-competitions.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/06-value-of-competitions.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: Third place winners – Loyola Marymount</title>
  <dc:creator>Yifan Hu and Mandy Korpusik</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/05-third-place-winners.html</link>
  <description><![CDATA[ 





<p>Undergraduate student Yifan (Rosetta) Hu was responsible for writing the Python script that pre-processes the 2015–2016 UPC, EC, and PPC data for training neural network models. Her script randomly sampled five negative EC descriptions for every positive match between a UPC and EC code. Professor Mandy Korpusik performed the remaining work, including setting up the environment, training the BERT model, and running the evaluation. Hu spent roughly 10 hours on the competition; Korpusik spent roughly 40 hours (plus many additional hours running and monitoring the training and testing scripts).</p>
<section id="our-perspective-on-the-challenge" class="level2">
<h2 class="anchored" data-anchor-id="our-perspective-on-the-challenge">Our perspective on the challenge</h2>
<p>The goal of this challenge is to use machine learning and natural language processing (NLP) to link language-based entries in the IRI and FNDDS databases. Our proposed approach is based on our prior work using deep learning models to map users’ natural language meal descriptions to the FNDDS database <span class="citation" data-cites="7953245">(Korpusik et al. 2017b)</span> to retrieve nutrition information in a spoken diet tracking system. In the past, we found a trade-off between accuracy and cost, leading us to select convolutional neural networks over recurrent long short-term memory (LSTM) networks – with nearly 10x as many parameters and 2x the training time required, LSTMs achieved slightly lower performance on semantic tagging and food database mapping on meals in the breakfast category. Here, we propose to investigate state-of-the-art transformers, specifically the contextual embedding model (i.e., the entire sentence is used as context to generate the embedding) known as BERT <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(Bidirectional Encoder Representations from Transformers, Devlin et al. 2018)</span>.</p>
<section id="related-work" class="level5">
<h5 class="anchored" data-anchor-id="related-work">Related work</h5>
<p>Within the past few years, several papers have come out that learn contextual representations of sentences, where the entire sentence is used to generate embeddings.</p>
<p>ELMo <span class="citation" data-cites="DBLP:journals/corr/abs-1802-05365">(Peters et al. 2018)</span> uses a linear combination of vectors extracted from intermediate layer representations of a bidirectional LSTM trained on a large text corpus as a language model; in this feature-based approach, the ELMo vector of the full input sentence is concatenated with the standard context-independent token representations and passed through a task-dependent model for final prediction. This showed performance improvement over state-of-the-art on six NLP tasks, including question answering, textual entailment, and sentiment analysis.</p>
<p>OpenAI GPT <span class="citation" data-cites="radford2018improving">(Radford et al. 2018)</span> is a fine-tuning approach, where they first pre-train a multi-layer transformer <span class="citation" data-cites="NIPS2017_3f5ee243">(Vaswani et al. 2017)</span> as a language model on a large text corpus, and then conduct supervised fine-tuning on the specific task of interest, with a linear softmax layer on top of the pre-trained transformer.</p>
<p>Google’s BERT <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(2018)</span> is a fine-tuning approach similar to GPT, but with the key difference that instead of combining separately trained forward and backward transformers, they instead use a masked language model for pre-training, where they randomly masked out input tokens and predicted only those tokens. They demonstrated state-of-the-art performance on 11 NLP tasks, including the CoNLL 2003 named entity recognition task, which is similar to our semantic tagging task.</p>
<p>Finally, many models have recently been developed that improve upon BERT, including RoBERTa <span class="citation" data-cites="DBLP:journals/corr/abs-1907-11692">(which improves BERT’s pre-training by using bigger batches and more data, Y. Liu et al. 2019)</span>, XLNet <span class="citation" data-cites="NEURIPS2019_dc6a7e65">(which uses Transformer-XL and avoids BERT’s pretrain-finetune discrepancy through learning a truly bidirectional context via permutations over the factorization order, Yang et al. 2019)</span>, and ALBERT <span class="citation" data-cites="DBLP:journals/corr/abs-1909-11942">(a lightweight BERT, Lan et al. 2019)</span>.</p>
<p>In our prior work on language understanding for nutrition <span class="citation" data-cites="7078635 7472843 7902155 korpusik17_interspeech 8461769 8721137">(Korpusik et al. 2014, 2016, 2017a; Korpusik and Glass 2017, 2018, 2019)</span>, we used a similar binary classification approach for learning embeddings, which were then used at test time to map from user-described meals to USDA food database matches, but with convolutional neural networks (CNNs) instead of BERT. (BERT was not created until 2018, and due to limited memory available for deployment, we needed a smaller model than even BERT base, which has 100 million parameters.) Further work demonstrated that BERT outperformed CNNs on several language understanding tasks, including nutrition <span class="citation" data-cites="korpusik19_interspeech">(Korpusik et al. 2019)</span>.</p>
</section>
</section>
<section id="our-approach" class="level2">
<h2 class="anchored" data-anchor-id="our-approach">Our approach</h2>
<p>Our approach is to fine-tune a large pre-trained BERT language model on the food data. BERT was originally trained on a massive amount of text for a language modelling task (i.e., predicting which word should come next in a sentence). It relies on a transformer model, which uses an “attention” mechanism to identify which words the model should pay the most “attention” to. We are specifically using BERT for binary sequence classification, which refers to predicting a label (i.e., classification) for a sequence of words. In our case, during fine-tuning (i.e., training the model further on our own dataset) we will feed the model pairs of sentences (where one sentence is the UPC description of a food item and the other is the EC description of another food item), and the model will perform binary classification, predicting whether the sentences are a match (i.e., 1) or not (i.e., 0). We start with the 2015–2016 ground truth PPC data for positive examples, and five randomly sampled negative examples per positive example.</p>
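As an illustration of how such labeled training pairs might be assembled, here is a minimal sketch; the function name, variable names, and toy descriptions are ours, not the competition codebase's:

```python
import random

def build_pairs(ppc_matches, ec_pool, n_neg=5, seed=0):
    """Label (UPC description, EC description) pairs for binary
    classification: 1 = ground-truth match, 0 = sampled negative."""
    rng = random.Random(seed)
    pairs = []
    for upc_desc, ec_desc in ppc_matches:
        pairs.append((upc_desc, ec_desc, 1))
        # sample EC descriptions that are NOT the true match
        candidates = [e for e in ec_pool if e != ec_desc]
        for neg in rng.sample(candidates, n_neg):
            pairs.append((upc_desc, neg, 0))
    return pairs

matches = [("ORGANIC SKIM MILK 64 OZ", "Milk, fat free (skim)")]
ec_pool = ["Milk, fat free (skim)", "Cheese, cheddar", "Bread, white",
           "Apple, raw", "Yogurt, plain", "Rice, white, cooked"]
pairs = build_pairs(matches, ec_pool)
print(len(pairs))  # 6: one positive plus five negatives
```

Each tuple then becomes one sentence-pair input to the classifier, with the third element as its label.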
<section id="training-methods" class="level5">
<h5 class="anchored" data-anchor-id="training-methods">Training methods</h5>
<p>Since we used a neural network model, the only features passed into our model were the tokenized words themselves of the EC and UPC food descriptions – we did not conduct any manual feature engineering <span class="citation" data-cites="dong_liu">(Dong and Liu 2018)</span>. The model was trained on a 90/10 split into 90% training and 10% validation data, where the validation data was used as a test set to fine-tune the model’s hyperparameters. We started with a randomly sampled set of 16,000 pairs, batch size of 16 (i.e., the model would train on batches of 16 samples at a time), AdamW <span class="citation" data-cites="DBLP:journals/corr/abs-1711-05101">(Loshchilov and Hutter 2017)</span> as the optimizer (which adaptively updates the learning rate, or how large the update should be to the model’s parameters), a linear schedule with warmup <span class="citation" data-cites="DBLP:journals/corr/abs-1908-03265">(i.e., starting with a small learning rate in the first few epochs of training due to large variance in early stages of training, L. Liu et al. 2019)</span>, and one epoch (i.e., the number of times the model passes through all the training data). We then added the next randomly sampled set of 16,000 pairs to get a model trained on 32,000 data points. Finally, we reached a total of 48,000 data samples used for training. Each pair of sequences was tokenized with the pre-trained BERT tokenizer, with the special CLS and SEP tokens (where CLS is a learned vector that is typically passed to downstream layers for final classification, and SEP is a learned vector that separates two input sequences), and was padded with zeros to the maximum length input sequence of 240 tokens, so that each input sequence would be the same length.</p>
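The pairing-and-padding scheme can be illustrated with a toy whitespace tokenizer (the actual pipeline uses BERT's pre-trained WordPiece tokenizer; the token IDs here are invented purely for the sketch):

```python
def encode_pair(upc_desc, ec_desc, vocab, max_len=240):
    """Mimic BERT pair encoding: [CLS] upc [SEP] ec [SEP],
    then zero-pad every sequence to the same fixed length."""
    tokens = (["[CLS]"] + upc_desc.lower().split()
              + ["[SEP]"] + ec_desc.lower().split() + ["[SEP]"])
    ids = [vocab.setdefault(t, len(vocab) + 1) for t in tokens]
    mask = [1] * len(ids) + [0] * (max_len - len(ids))  # attention mask
    ids += [0] * (max_len - len(ids))                   # pad token id = 0
    return ids, mask

vocab = {}
ids, mask = encode_pair("organic skim milk", "milk fat free skim", vocab)
print(len(ids), sum(mask))  # 240 10
```

The attention mask tells the model which positions hold real tokens and which are padding, so padded batches of uniform length can be processed together.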
</section>
<section id="model-development-approach" class="level5">
<h5 class="anchored" data-anchor-id="model-development-approach">Model development approach</h5>
<p>We faced many challenges due to the secure nature of the ADRF environment. Since our approach relies on BERT, we were initially blocked by errors in the local BERT installation. Typically, BERT is downloaded from the web as the program runs; for this challenge, however, BERT had to be installed locally for security reasons. To fix the errors, the BERT models needed to be installed with <code>git lfs clone</code> rather than a plain <code>git clone</code>, so that the large model weight files were fetched correctly.</p>
<p>Second, we were unable to retrieve the test data from the database due to SQLAlchemy errors. We found a workaround by using DBeaver directly to save database tables as Excel spreadsheets, rather than accessing the database tables through Python.</p>
<p>Finally, we needed a GPU in order to efficiently train our BERT models. However, we initially only had a CPU, so there was a delay due to setting up the GPU configuration. Once the GPU image was set up, there was still a CUDA error when running the BERT model during training. We determined that the model was too big to fit into GPU memory, so we found a workaround using gradient checkpointing (trading off computation speed for memory) with the transformers library’s Trainer and TrainingArguments. Unfortunately, the version of transformers we were using did not have these tools, and the library was not updated until less than a week before the deadline, so we still had to train the model on the CPU.</p>
<p>To deal with the inability to run jobs in the background, we checkpointed our models every five batches, and likewise saved the model's predictions to a CSV file every five batches during evaluation.</p>
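A sketch of that checkpointed evaluation loop; the file name, batch contents, and the stand-in scoring function are illustrative only:

```python
import csv

def evaluate_with_flushes(batches, score_fn, out_path, every=5):
    """Append predictions to a CSV every `every` batches, so an
    interrupted foreground run loses at most `every` batches of work."""
    buffer, last_flush = [], 0
    for i, batch in enumerate(batches, start=1):
        buffer.extend((item, score_fn(item)) for item in batch)
        if i % every == 0:
            with open(out_path, "a", newline="") as f:
                csv.writer(f).writerows(buffer)
            buffer, last_flush = [], i
    if buffer:  # flush any leftover batches at the end
        with open(out_path, "a", newline="") as f:
            csv.writer(f).writerows(buffer)
    return last_flush

batches = [[f"upc-{i}-{j}" for j in range(2)] for i in range(7)]
done = evaluate_with_flushes(batches, len, "predictions_demo.csv")
print(done)  # 5: the last full checkpoint was after batch 5
```

Appending rather than rewriting means a killed process can resume from the row count already on disk.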
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>Find the code in the <a href="https://github.com/realworlddatascience/realworlddatascience.github.io/tree/main/case-studies/posts/2023/08/21/_code">Real World Data Science GitHub repository</a>.</p>
</div>
</div>
</div>
</section>
</section>
<section id="our-results" class="level2">
<h2 class="anchored" data-anchor-id="our-results">Our results</h2>
<p>After training, the 48K model (so-called because it was trained on 48,000 data samples) was used at test time to rank all possible 2017–18 EC descriptions given an unseen UPC description. The rankings were obtained from the model’s output value – the higher the output (or confidence), the more highly we ranked that EC description. To speed up the ranking process, we used blocking (i.e., only ranking a subset of all possible matches), specifically with exact word matches (using only the first six words in the UPC description, which appeared to be the most important), and fed all possible matches through the model in one batch per UPC description. Since we still did not have sufficient time to complete evaluation on the full set of test UPC descriptions, we implemented an expedited evaluation that only considered the first 10 matching EC descriptions in the BERT ranking process (which we call BERT-FAST). We also report results for the slower evaluation method that considers all EC descriptions that match at least one of the first six words in a given UPC description, but note that these results are based on just a small subset of the total test set. See Table 1 below for our results, where Success@5 (S@5) indicates how often the correct match was ranked among the top five. See Table 2 for an estimate of how long it takes to train and test the model on a CPU.</p>
<div class="figure-caption">
<p><strong>Table 1:</strong> Success@5 and NDCG@5 for BERT, both for fast evaluation over the whole test set, and slower evaluation on a smaller subset (711 UPCs out of 37,693 total).</p>
</div>
<table class="caption-top table">
<thead>
<tr class="header">
<th>Model</th>
<th style="text-align: center;">Success@5</th>
<th style="text-align: center;">NDCG@5</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>BERT-FAST</td>
<td style="text-align: center;">0.057</td>
<td style="text-align: center;">0.047</td>
</tr>
<tr class="even">
<td>BERT-SLOW</td>
<td style="text-align: center;">0.537</td>
<td style="text-align: center;">0.412</td>
</tr>
</tbody>
</table>
<p><br>
</p>
<div class="figure-caption">
<p><strong>Table 2:</strong> An estimate of the time required to train and test the model.</p>
</div>
<table class="caption-top table">
<thead>
<tr class="header">
<th></th>
<th>Time</th>
</tr>
</thead>
<tbody>
<tr class="odd">
<td>Training (on 48K samples)</td>
<td>16 hours</td>
</tr>
<tr class="even">
<td>Testing (BERT-FAST)</td>
<td>52 hours</td>
</tr>
<tr class="odd">
<td>Testing (BERT-SLOW)</td>
<td>63 days</td>
</tr>
</tbody>
</table>
<p><br>
</p>
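The blocking-and-ranking procedure described above can be sketched as follows, with a trivial word-overlap scorer standing in for the trained BERT model's confidence (function names and toy data are ours):

```python
def blocked_candidates(upc_desc, ec_descs, n_words=6):
    """Blocking: keep only EC descriptions that share at least one of
    the first six words of the UPC description."""
    key = set(upc_desc.lower().split()[:n_words])
    return [ec for ec in ec_descs if key & set(ec.lower().split())]

def rank_top_k(upc_desc, ec_descs, score_fn, k=5):
    """Score only the blocked candidates, highest confidence first."""
    cands = blocked_candidates(upc_desc, ec_descs)
    return sorted(cands, key=lambda ec: score_fn(upc_desc, ec),
                  reverse=True)[:k]

# stand-in scorer: number of shared words (the real system would use
# the fine-tuned BERT model's output here)
overlap = lambda u, e: len(set(u.lower().split()) & set(e.lower().split()))

ecs = ["milk fat free skim", "cheese cheddar", "milk whole", "bread white"]
top = rank_top_k("organic skim milk 64 oz carton", ecs, overlap)
print(top)  # ['milk fat free skim', 'milk whole']
```

Blocking shrinks the candidate set before the expensive model call, which is what makes the difference between the BERT-FAST and BERT-SLOW running times above.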
</section>
<section id="future-workrefinement" class="level2">
<h2 class="anchored" data-anchor-id="future-workrefinement">Future work/refinement</h2>
<p>In the future, with more time available, we would train on all of the data, not just our limited dataset of 48,000 pairs, and would evaluate on the held-out test set with the full set of possible EC matches that have one or more words in common with the UPC description. We would compare against baseline word embedding methods such as word2vec <span class="citation" data-cites="DBLP:journals/corr/abs-1712-09405">(Mikolov et al. 2017)</span> and GloVe <span class="citation" data-cites="pennington-etal-2014-glove">(Pennington et al. 2014)</span>, and we would explore hierarchical prediction methods for improving efficiency and accuracy. Specifically, we would first train a classifier to predict the generic food category, and then train finer-grained models to predict specific foods within that category. Finally, we are exploring multi-modal transformer-based approaches that allow two input modalities (i.e., food images and text descriptions of a meal) for predicting the best UPC match.</p>
</section>
<section id="lessons-learned" class="level2">
<h2 class="anchored" data-anchor-id="lessons-learned">Lessons learned</h2>
<p>We recommend that future challenges provide every team with both a CPU and a GPU in their workspace, to avoid transitioning from one to the other midway through the challenge. In addition, if possible, it would be very helpful to provide a mechanism for running jobs in the background. Finally, it may be useful for teams to submit snippets of code along with library package names, in order for the installations to be tested properly beforehand.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/04-second-place-winners.html">← Part 4: Second place winners</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/06-value-of-competitions.html">Part 6: The value of competitions →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Yifan (Rosetta) Hu</strong> is an undergraduate student and <strong>Mandy Korpusik</strong> is an assistant professor of computer science at Loyola Marymount University’s Seaver College of Science and Engineering.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Yifan Hu and Mandy Korpusik
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@pvsbond?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Peter Bond</a> on <a href="https://unsplash.com/photos/KfvknMhkmw0?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Hu, Yifan, and Mandy Korpusik. 2023. “Food for Thought: Third place winners – Loyola Marymount.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/05-third-place-winners.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-DBLP:journals/corr/abs-1810-04805" class="csl-entry">
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. <span>“<span>BERT:</span> Pre-Training of Deep Bidirectional Transformers for Language Understanding.”</span> <em>CoRR</em> abs/1810.04805. <a href="http://arxiv.org/abs/1810.04805">http://arxiv.org/abs/1810.04805</a>.
</div>
<div id="ref-dong_liu" class="csl-entry">
Dong, G., and H. Liu, eds. 2018. <em>Feature Engineering for Machine Learning and Data Analytics</em>. First edition. CRC Press.
</div>
<div id="ref-korpusik17_interspeech" class="csl-entry">
Korpusik, M., Z. Collins, and J. Glass. 2017a. <span>“<span class="nocase">Character-Based Embedding Models and Reranking Strategies for Understanding Natural Language Meal Descriptions</span>.”</span> <em>Proceedings of Interspeech</em>, 3320–24. <a href="https://doi.org/10.21437/Interspeech.2017-422">https://doi.org/10.21437/Interspeech.2017-422</a>.
</div>
<div id="ref-7953245" class="csl-entry">
Korpusik, M., Z. Collins, and J. Glass. 2017b. <span>“Semantic Mapping of Natural Language Input to Database Entries via Convolutional Neural Networks.”</span> <em>2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 5685–89. <a href="https://doi.org/10.1109/ICASSP.2017.7953245">https://doi.org/10.1109/ICASSP.2017.7953245</a>.
</div>
<div id="ref-7902155" class="csl-entry">
Korpusik, M., and J. Glass. 2017. <span>“Spoken Language Understanding for a Nutrition Dialogue System.”</span> <em>IEEE/ACM Transactions on Audio, Speech, and Language Processing</em> 25 (7): 1450–61. <a href="https://doi.org/10.1109/TASLP.2017.2694699">https://doi.org/10.1109/TASLP.2017.2694699</a>.
</div>
<div id="ref-8461769" class="csl-entry">
Korpusik, M., and J. Glass. 2018. <span>“Convolutional Neural Networks and Multitask Strategies for Semantic Mapping of Natural Language Input to a Structured Database.”</span> <em>2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 6174–78. <a href="https://doi.org/10.1109/ICASSP.2018.8461769">https://doi.org/10.1109/ICASSP.2018.8461769</a>.
</div>
<div id="ref-8721137" class="csl-entry">
Korpusik, M., and J. Glass. 2019. <span>“Deep Learning for Database Mapping and Asking Clarification Questions in Dialogue Systems.”</span> <em>IEEE/ACM Transactions on Audio, Speech, and Language Processing</em> 27 (8): 1321–34. <a href="https://doi.org/10.1109/TASLP.2019.2918618">https://doi.org/10.1109/TASLP.2019.2918618</a>.
</div>
<div id="ref-7472843" class="csl-entry">
Korpusik, M., C. Huang, M. Price, and J. Glass. 2016. <span>“Distributional Semantics for Understanding Spoken Meal Descriptions.”</span> <em>2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)</em>, 6070–74. <a href="https://doi.org/10.1109/ICASSP.2016.7472843">https://doi.org/10.1109/ICASSP.2016.7472843</a>.
</div>
<div id="ref-korpusik19_interspeech" class="csl-entry">
Korpusik, M., Z. Liu, and J. Glass. 2019. <span>“<span class="nocase">A Comparison of Deep Learning Methods for Language Understanding</span>.”</span> <em>Proceedings of Interspeech</em>, 849–53. <a href="https://doi.org/10.21437/Interspeech.2019-1262">https://doi.org/10.21437/Interspeech.2019-1262</a>.
</div>
<div id="ref-7078635" class="csl-entry">
Korpusik, M., N. Schmidt, J. Drexler, S. Cyphers, and J. Glass. 2014. <span>“Data Collection and Language Understanding of Food Descriptions.”</span> <em>2014 IEEE Spoken Language Technology Workshop (SLT)</em>, 560–65. <a href="https://doi.org/10.1109/SLT.2014.7078635">https://doi.org/10.1109/SLT.2014.7078635</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1909-11942" class="csl-entry">
Lan, Z., M. Chen, S. Goodman, K. Gimpel, P. Sharma, and R. Soricut. 2019. <span>“<span>ALBERT:</span> <span>A</span> Lite <span>BERT</span> for Self-Supervised Learning of Language Representations.”</span> <em>CoRR</em> abs/1909.11942. <a href="http://arxiv.org/abs/1909.11942">http://arxiv.org/abs/1909.11942</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1908-03265" class="csl-entry">
Liu, L., H. Jiang, P. He, et al. 2019. <span>“On the Variance of the Adaptive Learning Rate and Beyond.”</span> <em>CoRR</em> abs/1908.03265. <a href="http://arxiv.org/abs/1908.03265">http://arxiv.org/abs/1908.03265</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1907-11692" class="csl-entry">
Liu, Y., M. Ott, N. Goyal, et al. 2019. <span>“RoBERTa: <span>A</span> Robustly Optimized <span>BERT</span> Pretraining Approach.”</span> <em>CoRR</em> abs/1907.11692. <a href="http://arxiv.org/abs/1907.11692">http://arxiv.org/abs/1907.11692</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1711-05101" class="csl-entry">
Loshchilov, I., and F. Hutter. 2017. <span>“Fixing Weight Decay Regularization in Adam.”</span> <em>CoRR</em> abs/1711.05101. <a href="http://arxiv.org/abs/1711.05101">http://arxiv.org/abs/1711.05101</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1712-09405" class="csl-entry">
Mikolov, T., E. Grave, P. Bojanowski, C. Puhrsch, and A. Joulin. 2017. <span>“Advances in Pre-Training Distributed Word Representations.”</span> <em>CoRR</em> abs/1712.09405. <a href="http://arxiv.org/abs/1712.09405">http://arxiv.org/abs/1712.09405</a>.
</div>
<div id="ref-pennington-etal-2014-glove" class="csl-entry">
Pennington, J., R. Socher, and C. Manning. 2014. <span>“<span>G</span>lo<span>V</span>e: Global Vectors for Word Representation.”</span> <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (<span>EMNLP</span>)</em> (Doha, Qatar), October, 1532–43. <a href="https://doi.org/10.3115/v1/D14-1162">https://doi.org/10.3115/v1/D14-1162</a>.
</div>
<div id="ref-DBLP:journals/corr/abs-1802-05365" class="csl-entry">
Peters, M. E., M. Neumann, M. Iyyer, et al. 2018. <span>“Deep Contextualized Word Representations.”</span> <em>CoRR</em> abs/1802.05365. <a href="http://arxiv.org/abs/1802.05365">http://arxiv.org/abs/1802.05365</a>.
</div>
<div id="ref-radford2018improving" class="csl-entry">
Radford, A., K. Narasimhan, T. Salimans, and I. Sutskever. 2018. <em>Improving Language Understanding by Generative Pre-Training</em>. <a href="https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf">https://s3-us-west-2.amazonaws.com/openai-assets/research-covers/language-unsupervised/language_understanding_paper.pdf</a>.
</div>
<div id="ref-NIPS2017_3f5ee243" class="csl-entry">
Vaswani, A., N. Shazeer, N. Parmar, et al. 2017. <span>“Attention Is All You Need.”</span> In <em>Advances in Neural Information Processing Systems</em>, edited by I. Guyon, U. Von Luxburg, S. Bengio, et al., vol. 30. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf">https://proceedings.neurips.cc/paper_files/paper/2017/file/3f5ee243547dee91fbd053c1c4a845aa-Paper.pdf</a>.
</div>
<div id="ref-NEURIPS2019_dc6a7e65" class="csl-entry">
Yang, Z., Z. Dai, Y. Yang, J. Carbonell, R. R. Salakhutdinov, and Q. V. Le. 2019. <span>“XLNet: Generalized Autoregressive Pretraining for Language Understanding.”</span> In <em>Advances in Neural Information Processing Systems</em>, <span class="nocase">edited by H. Wallach, H. Larochelle, A. Beygelzimer, F. d'Alché-Buc, E. Fox, and R. Garnett</span>, vol. 32. Curran Associates, Inc. <a href="https://proceedings.neurips.cc/paper_files/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf">https://proceedings.neurips.cc/paper_files/paper/2019/file/dc6a7e655d7e5840e66733e9ee67cc69-Paper.pdf</a>.
</div>
</div></section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/05-third-place-winners.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/05-lm.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: Second place winners – DeepFFTLink</title>
  <dc:creator>Yang Wu, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/04-second-place-winners.html</link>
  <description><![CDATA[ 





<p>DeepFFTLink team members: Yang Wu and Kai Zhang are PhD students at Worcester Polytechnic Institute. Aishwarya Budhkar is a PhD student at Indiana University Bloomington. Xuhong Zhang is an assistant professor at Indiana University Bloomington. Xiaozhong Liu is an associate professor at Worcester Polytechnic Institute.</p>
<section id="perspective-on-the-challenge" class="level2">
<h2 class="anchored" data-anchor-id="perspective-on-the-challenge">Perspective on the challenge</h2>
<p>Text matching is an essential task in natural language processing <span class="citation" data-cites="DBLP:journals/corr/PangLGXWC16">(NLP, Pang et al. 2016)</span>, while record linkage across different sources is an essential task in data science. Machine learning techniques allow people to combine data faster and more cheaply than manual linkage. However, in the context of the Food for Thought challenge, existing methods for matching universal product codes (UPCs) to ensemble codes (ECs) require every UPC to be compared with every EC (Figure 1a). Such approaches can be computationally expensive to train, particularly when the data is noisy. Here, we propose an ensemble model with a category-based adapter to tackle this problem, drawing on the category information included in the UPC and EC data. The category-based adapter first matches each UPC against only a small, reliable set of candidate ECs (Figure 1b); an ensemble model then makes predictions for UPC-EC matching. Our proposed approach achieves competitive performance compared with state-of-the-art models.</p>
<div class="quarto-layout-panel" data-layout-ncol="2" style="padding-top: 1em; margin-bottom: 0;">
<div class="quarto-layout-row">
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/pt4-fig1a.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt4-fig1a.png" class="img-fluid figure-img"></a></p>
<figcaption>(a)</figcaption>
</figure>
</div>
</div>
<div class="quarto-layout-cell" style="flex-basis: 50.0%;justify-content: flex-start;">
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><a href="images/pt4-fig1b.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt4-fig1b.png" class="img-fluid figure-img"></a></p>
<figcaption>(b)</figcaption>
</figure>
</div>
</div>
</div>
</div>
<div class="figure-caption">
<p><strong>Figure 1:</strong> A toy example of our method. Panel (a) shows the traditional matching method, while (b) is our proposed ensemble model with category-based adapter. With the help of the adapter, UPC 1 only needs to be matched with EC 1 and EC 3.</p>
</div>
</section>
<section id="our-approach" class="level2">
<h2 class="anchored" data-anchor-id="our-approach">Our approach</h2>
<p>We propose a two-step framework to address this problem. To begin with, we use a category-based adapter to get reliable candidate ECs for each UPC. Then, an ensemble model <span class="citation" data-cites="10.1007/3-540-45014-9_1">(Dietterich 2000)</span> is deployed to make a prediction for each UPC-EC pair.</p>
<section id="category-based-adapter" class="level5">
<h5 class="anchored" data-anchor-id="category-based-adapter">Category-based adapter</h5>
<p>Using the 2015–2016 UPC-EC data, we created a knowledge base: a pairwise table of UPC categories and ECs, used to generate candidate ECs. Within this setting, each UPC category is, on average, related to only 32 ECs. This knowledge base is then used as context to further filter the candidate ECs. Note that new ECs are introduced each year; these must also be treated as potential matches in the UPC-EC matching task, even though contextual information for them does not yet exist in our knowledge base.</p>
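<p>As a minimal sketch (with hypothetical category names, not the team’s actual code), the adapter reduces to a lookup table from UPC category to the ECs historically linked with it, with newly introduced ECs appended to every candidate set:</p>

```python
from collections import defaultdict

def build_knowledge_base(linked_pairs):
    """linked_pairs: iterable of (upc_category, ec_code) from historical data."""
    kb = defaultdict(set)
    for category, ec in linked_pairs:
        kb[category].add(ec)
    return kb

def candidate_ecs(kb, upc_category, new_ecs=()):
    """Candidate ECs for a UPC: ECs seen with its category, plus any new ECs."""
    return kb.get(upc_category, set()) | set(new_ecs)

# Toy data mirroring Figure 1: UPC 1 ("yogurt") only needs EC1 and EC3.
kb = build_knowledge_base([("yogurt", "EC1"), ("yogurt", "EC3"), ("bread", "EC2")])
print(sorted(candidate_ecs(kb, "yogurt")))  # ['EC1', 'EC3']
```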
</section>
<section id="ensembled-model" class="level5">
<h5 class="anchored" data-anchor-id="ensembled-model">Ensemble model</h5>
<p>We ensemble the base-string match and BERT models. BERT is a deep learning model for natural language processing <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(Devlin et al. 2018)</span>. In the base-string match model, we use the term frequency–inverse document frequency (TF-IDF) of each UPC and EC description as features to compute pairwise cosine similarity, a measure of how close two instances are. In parallel, we use features extracted from the UPC and EC descriptions to fine-tune the BERT base model and compute the cosine similarity between each UPC and EC embedding. We then rank the ECs for each UPC by their similarity scores.</p>
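<p>The base-string matcher can be sketched roughly as follows with scikit-learn, using hypothetical descriptions; this illustrates the technique rather than reproducing the team’s code:</p>

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

upc_descriptions = ["greek yogurt plain nonfat"]
ec_descriptions = ["yogurt, greek, plain, nonfat", "bread, whole wheat", "milk, whole"]

# Fit a shared vocabulary over both sides, then vectorize each separately.
vectorizer = TfidfVectorizer().fit(upc_descriptions + ec_descriptions)
upc_vecs = vectorizer.transform(upc_descriptions)
ec_vecs = vectorizer.transform(ec_descriptions)

scores = cosine_similarity(upc_vecs, ec_vecs)  # shape (n_upc, n_ec)
ranked = scores[0].argsort()[::-1]             # EC indices, best match first
print(ranked[0])  # 0: the first EC description is the closest match
```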
<p><a href="images/pt4-fig2.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt4-fig2.png" class="img-fluid"></a></p>
<div class="figure-caption">
<p><strong>Figure 2:</strong> The framework of our proposed model. A two-step strategy is used to make the final prediction.</p>
</div>
<div class="callout callout-style-simple callout-note" style="margin-top: 2rem;">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>Find the code in the <a href="https://github.com/realworlddatascience/realworlddatascience.github.io/tree/main/case-studies/posts/2023/08/21/_code">Real World Data Science GitHub repository</a>.</p>
</div>
</div>
</div>
</section>
</section>
<section id="our-results" class="level2">
<h2 class="anchored" data-anchor-id="our-results">Our results</h2>
<p>We randomly selected 500 samples from the 2017–2018 UPC-EC data to learn the ensemble weight for each model. Two fusion functions were evaluated for combining the base-string and BERT models:</p>
<p><span id="eq-first"><img src="https://latex.codecogs.com/png.latex?%0AC%20=%20a%20*%20X%20+%20b%20*%20Y%20%20%0A%5Ctag%7B1%7D"></span></p>
<p><span id="eq-second"><img src="https://latex.codecogs.com/png.latex?%0AC%20=%20%20a%20*%20log(X)%20+%20b%20*%20log(Y)%20%5Ctext%7B.%20%7D%0A%5Ctag%7B2%7D"></span></p>
<p><img src="https://latex.codecogs.com/png.latex?C"> denotes the final confidence score. <img src="https://latex.codecogs.com/png.latex?X"> and <img src="https://latex.codecogs.com/png.latex?Y"> represent <em>base_string_similarity_score</em> and <em>BERT_similarity_score</em>, respectively. <img src="https://latex.codecogs.com/png.latex?a"> and <img src="https://latex.codecogs.com/png.latex?b"> are the corresponding model weights for the base_string and BERT models.</p>
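<p>Function (1) is straightforward to apply in code. The sketch below uses the learned weights reported in the results (0.738 and 0.262) with hypothetical similarity scores over three candidate ECs:</p>

```python
import numpy as np

def fuse(base_scores, bert_scores, a=0.738, b=0.262):
    """Function (1): C = a * X + b * Y, applied element-wise over candidate ECs."""
    return a * np.asarray(base_scores) + b * np.asarray(bert_scores)

base = [0.9, 0.2, 0.4]   # base-string cosine similarities (hypothetical)
bert = [0.6, 0.8, 0.3]   # BERT cosine similarities (hypothetical)
confidence = fuse(base, bert)
print(int(confidence.argmax()))  # 0: the first EC has the highest fused confidence
```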
<p>A better Success@5 is achieved with function (1). The learned weights for the base-string and BERT models are 0.738 and 0.262, respectively, indicating that the base-string model contributes more to the ensemble’s predictions than the BERT model. The prediction results for the 2017–2018 data are:</p>
<ul>
<li>Success@5: 0.727</li>
<li>NDCG@5: 0.528</li>
</ul>
<p>Computation time is 6 hours.</p>
</section>
<section id="future-work" class="level2">
<h2 class="anchored" data-anchor-id="future-work">Future work</h2>
<p>Our next step will focus on adding newly generated EC data to our knowledge base, which should make the model’s UPC-EC predictions more stable. Our model is an unsupervised method: it ranks matches by cosine similarity, so no instance labels are needed during training. In future work, however, we will try labelling some instances to tackle the UPC-EC matching task in a supervised manner.</p>
</section>
<section id="lessons-learned" class="level2">
<h2 class="anchored" data-anchor-id="lessons-learned">Lessons learned</h2>
<ol type="1">
<li><p><strong>If the data is not complex, simple models may outperform complex models.</strong> For example, in our experiment, we found that the base-string model outperforms single RoBERTa <span class="citation" data-cites="DBLP:journals/corr/abs-1907-11692">(Liu et al. 2019)</span> or BERT models. However, our ensemble model can outperform each individual model since model fusion allows information aggregation from multiple models.</p></li>
<li><p><strong>Multi-label models may not work well on UPC-EC data.</strong> In our early work, we treated the UPC-EC matching task as a multi-label problem: we gave each EC a binary label indicating whether it was an appropriate match, and mapped UPC and EC pairs into a multi-label table. However, we found that UPCs and ECs maintain a one-to-one relationship for most UPCs. The performance of a multi-label model, i.e., the Label-Specific Attention Network <span class="citation" data-cites="xiao-etal-2019-label">(LSAN, Xiao et al. 2019)</span>, is lower than that of the base-string model on both the Success@5 and NDCG@5 metrics.</p></li>
</ol>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/03-first-place-winners.html">← Part 3: First place winners</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/05-third-place-winners.html">Part 5: Third place winners →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Yang Wu</strong> and <strong>Kai Zhang</strong> are PhD students, and <strong>Xiaozhong Liu</strong> is an associate professor at Worcester Polytechnic Institute. <strong>Aishwarya Budhkar</strong> is a PhD student and <strong>Xuhong Zhang</strong> is an assistant professor at Indiana University Bloomington.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Yang Wu, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@hansonluu?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Hanson Lu</a> on <a href="https://unsplash.com/photos/sq5P00L7lXc?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Wu, Yang, Aishwarya Budhkar, Kai Zhang, Xuhong Zhang, and Xiaozhong Liu. 2023. “Food for Thought: Second place winners – DeepFFTLink.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/04-second-place-winners.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-DBLP:journals/corr/abs-1810-04805" class="csl-entry">
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. <span>“<span>BERT:</span> Pre-Training of Deep Bidirectional Transformers for Language Understanding.”</span> <em>CoRR</em> abs/1810.04805. <a href="http://arxiv.org/abs/1810.04805">http://arxiv.org/abs/1810.04805</a>.
</div>
<div id="ref-10.1007/3-540-45014-9_1" class="csl-entry">
Dietterich, T. G. 2000. <span>“Ensemble Methods in Machine Learning.”</span> <em>Multiple Classifier Systems</em> (Berlin, Heidelberg), 1–15.
</div>
<div id="ref-DBLP:journals/corr/abs-1907-11692" class="csl-entry">
Liu, Y., M. Ott, N. Goyal, et al. 2019. <span>“RoBERTa: <span>A</span> Robustly Optimized <span>BERT</span> Pretraining Approach.”</span> <em>CoRR</em> abs/1907.11692. <a href="http://arxiv.org/abs/1907.11692">http://arxiv.org/abs/1907.11692</a>.
</div>
<div id="ref-DBLP:journals/corr/PangLGXWC16" class="csl-entry">
Pang, L., Y. Lan, J. Guo, J. Xu, S. Wan, and X. Cheng. 2016. <span>“Text Matching as Image Recognition.”</span> <em>CoRR</em> abs/1602.06359. <a href="http://arxiv.org/abs/1602.06359">http://arxiv.org/abs/1602.06359</a>.
</div>
<div id="ref-xiao-etal-2019-label" class="csl-entry">
Xiao, L., X. Huang, B. Chen, and L. Jing. 2019. <span>“Label-Specific Document Representation for Multi-Label Text Classification.”</span> <em>Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP)</em> (Hong Kong, China), November, 466–75. <a href="https://doi.org/10.18653/v1/D19-1044">https://doi.org/10.18653/v1/D19-1044</a>.
</div>
</div></section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/04-second-place-winners.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/04-deepfftlink.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: First place winners – Auburn Big Data</title>
  <dc:creator>Alex Knipper, Naman Bansal, Jingyi Zheng, Wenying Li, and Shubhra Kanti Karmaker</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/03-first-place-winners.html</link>
  <description><![CDATA[ 





<p>The Auburn Big Data team from Auburn University consists of five members, including three assistant professors: Dr Wenying Li of the Department of Agricultural Economics and Rural Sociology, Dr Jingyi Zheng of the Department of Mathematics and Statistics, and Dr Shubhra Kanti Karmaker of the Department of Computer Science and Software Engineering. Additionally, the team comprises two PhD students, Naman Bansal and Alex Knipper, who are affiliated with Dr Karmaker’s big data lab at Auburn University.</p>
<p>We estimate that our team spent approximately 1,400 hours on this project.</p>
<section id="our-perspective-on-the-challenge" class="level2">
<h2 class="anchored" data-anchor-id="our-perspective-on-the-challenge">Our perspective on the challenge</h2>
<p>At the start of this competition, we decided to test three general approaches, in the order listed:</p>
<ol type="1">
<li><p>A heuristic approach, where we use only the data and a defined similarity metric to predict which FNDDS label a given IRI item should have.</p></li>
<li><p>A simpler modeling approach, where we train a simple statistical classifier, such as a random forest <span class="citation" data-cites="10.1007/978-3-030-03146-6_86">(Parmar et al. 2019)</span> or logistic regression, to predict the FNDDS label for a given IRI item. We opted for a random forest as our statistical model: it is a simple baseline that has shown decent performance across a wide range of classification tasks. As it turned out, this approach was quite robust and accurate, so we kept it as our main model.</p></li>
<li><p>A large language modeling approach, where we train a model like BERT <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(Devlin et al. 2018)</span> to map the descriptions for given IRI and FNDDS items to the FNDDS category the supplied IRI item belongs to.</p></li>
</ol>
</section>
<section id="our-approach" class="level2">
<h2 class="anchored" data-anchor-id="our-approach">Our approach</h2>
<p>As we explored the data provided, we opted to use the given 2017–2018 PPC dataset as our primary dataset for both training and testing. To ensure a fair evaluation of the model, we randomly split the dataset into 60% training samples and 40% testing samples, making sure our training process never sees the testing dataset. For evaluating our models, we adopted the competition’s metrics: Success@5 and NDCG@5. After months of testing, our statistical classifier (approach #2) proved itself to be the model that both processes the data fastest and achieves the highest performance on our testing metrics.</p>
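<p>For reference, the two competition metrics can be sketched as follows, under the assumption of a single correct FNDDS code per IRI item (hypothetical ranked codes):</p>

```python
import math

def success_at_k(ranked_codes, true_code, k=5):
    """1 if the true code appears in the top-k predictions, else 0."""
    return int(true_code in ranked_codes[:k])

def ndcg_at_k(ranked_codes, true_code, k=5):
    """With one relevant item, DCG = 1/log2(rank + 1) and the ideal DCG is 1."""
    for rank, code in enumerate(ranked_codes[:k], start=1):
        if code == true_code:
            return 1.0 / math.log2(rank + 1)
    return 0.0

preds = ["F2", "F1", "F3", "F4", "F5"]   # hypothetical ranked FNDDS codes
print(success_at_k(preds, "F1"))         # 1
print(round(ndcg_at_k(preds, "F1"), 3))  # 0.631 (true code at rank 2)
```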
<p>This approach, at a high level, takes in the provided data (among other configuration parameters), formats the data in a computer-readable format – converting the IRI and FNDDS descriptions to a numerical representation with word embeddings <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805 mikolov2013efficient pennington-etal-2014-glove">(2018; Mikolov et al. 2013; Pennington et al. 2014)</span> and then using that numerical representation to calculate the distances between each description – and then trains a classification model (random forest <span class="citation" data-cites="10.1007/978-3-030-03146-6_86">(2019)</span>/neural network <span class="citation" data-cites="SCHMIDHUBER201585">(Schmidhuber 2015)</span>) that can predict an FNDDS label for a given IRI item.</p>
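<p>A condensed sketch of this pipeline, under the simplifying assumption that the features have already been reduced to a numeric matrix (random stand-in data, not the real features):</p>

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))    # stand-in numeric feature matrix
y = rng.integers(0, 6, size=200)  # stand-in FNDDS label ids

# 60/40 split, mirroring the evaluation setup described above.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, train_size=0.6, random_state=0)

model = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_tr, y_tr)
preds = model.predict(X_te)  # one predicted FNDDS label per test item
print(preds.shape)  # (80,)
```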
<p>In terms of data, our approach uses the FNDDS/IRI descriptions, combining them into a single “description” field, and the IRI item’s categorical items – department, aisle, category, product, brand, manufacturer, and parent company – to further discern between items.</p>
<p>While most industrial methods require use of a graphics processing unit (graphics card, or GPU) to perform this kind of processing, our primary method only requires the computer’s internal processor (CPU) to function properly. With that in mind, to achieve the best possible performance on our test metrics, the most time-consuming operations are run in parallel. The time taken to train our primary model can likely be further improved if we parallelize these operations across a GPU, with the only downside being the imposition of a GPU requirement for systems aiming to run this method.</p>
<p>In addition to our primary method, our team has explored alternate approaches on the GPU (using BERT <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(2018)</span>, neural networks <span class="citation" data-cites="SCHMIDHUBER201585">(2015)</span>, etc.) to either: 1) speed up data processing and inference while achieving similar performance on our test metrics, or 2) achieve higher performance, likely at some cost in processing time. Our reasoning is that if a simple statistical model performs well, a larger language model should achieve higher performance on our test metrics without much increase in training time. So far, however, these methods have been unable to match the performance/efficiency tradeoff of our primary method.</p>
<p>After exploring alternate methods to no avail, our team decided to refocus on our primary method, the random forest <span class="citation" data-cites="10.1007/978-3-030-03146-6_86">(2019)</span>, and a secondary method, a feed-forward neural network mapping our input features (X) to the FNDDS labels (Y) <span class="citation" data-cites="SCHMIDHUBER201585">(2015)</span>, and to optimize their training hyperparameters for the dataset. Our aim is to see which of our already-implemented, easier-to-run downstream methods best balances performance and efficiency once its training hyperparameters are fully optimized. This has resulted in a marginal increase in training time (+20–30 minutes) and a roughly 5% increase in performance for our still-highest-performing model, the random forest.</p>
<p>Overall, our primary method – the random forest – gave us an approximate training time (including data pre-processing) of 4 hours 30 minutes for our ~38,000 IRI item training set, and an approximate inference time of 15 minutes on our testing set of ~15,000 IRI items. Furthermore, our method gave us a Success@5 score of 0.789 and an NDCG@5 score of 0.705 on our testing set.</p>
<section id="key-features" class="level5">
<h5 class="anchored" data-anchor-id="key-features">Key features</h5>
<p>Here is a list of the key features we utilize, along with the type of data we treat each one as:</p>
<ul>
<li>FNDDS
<ul>
<li>food_code – identifier</li>
<li>main_food_description – text</li>
<li>additional_food_description – text</li>
<li>ingredient_description – text</li>
</ul></li>
<li>IRI
<ul>
<li>upc – identifier</li>
<li>upcdesc – text</li>
<li>dept – categorical</li>
<li>aisle – categorical</li>
<li>category – categorical</li>
<li>product – categorical</li>
<li>brand – categorical</li>
<li>manufacturer – categorical</li>
<li>parent – categorical</li>
</ul></li>
</ul>
<p>The intuition behind using these particular features is that the text-based descriptions provide the majority of the “meaning” of the item. By converting each description to a numerical representation <span class="citation" data-cites="mikolov2013efficient pennington-etal-2014-glove">(2013; 2014)</span>, we can then calculate the similarity between each “meaning” to determine which FNDDS label is most similar to the IRI item provided. However, that alone is not enough. The categorical features on the IRI item help to further enhance the model’s classifications using the logic and categories people use in places like grocery stores. For example, if given an item whose aisle was “fruit” and brand was “Dole”, the item could be reasonably expected to be something like “peaches” over something like “broccoli”.</p>
</section>
<section id="feature-selection" class="level5">
<h5 class="anchored" data-anchor-id="feature-selection">Feature selection</h5>
<p>Aforementioned intuition aside, our feature selection was rather naive: we manually examined the data and removed any redundant text features before doing anything else. We then used the description fields as “text” data to capture the main “meaning” of each item, represented numerically by converting the text with a word embedding <span class="citation" data-cites="mikolov2013efficient pennington-etal-2014-glove">(2013; 2014)</span>. We also used the non-description fields (aisle, category, etc.) as “categorical” data, each turned into its own numerical representation, allowing our model to discern between items using categorization schemes similar to those people use.</p>
</section>
<section id="feature-transformations" class="level5">
<h5 class="anchored" data-anchor-id="feature-transformations">Feature transformations</h5>
<p>Our feature transformations are also relatively simple. First, we combine all description fields for each item to make one large description, and then use a word embedding method (like GloVe <span class="citation" data-cites="pennington-etal-2014-glove">(2014)</span> or BERT <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(2018)</span>) to convert the description into a numerical representation, resulting in a 300-dimensional GloVe or 768-dimensional BERT vector of numbers for each description. Then, for each IRI item, we calculate the cosine and Euclidean distances from each FNDDS item, resulting in two vectors, both equal in length to the original FNDDS data (in this case, two vectors of length ~7,300). The intuition behind this is that while cosine and Euclidean distances can tell us similar things, providing both of these sets of distances to the model should allow it to pick up on a more nuanced set of relationships between the IRI and FNDDS items.</p>
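<p>A minimal sketch of the distance features, using random stand-in arrays in place of real GloVe embeddings: for one IRI embedding, we compute cosine and Euclidean distances to every FNDDS embedding, yielding two feature vectors whose combined length is twice the size of the FNDDS data.</p>

```python
import numpy as np

rng = np.random.default_rng(0)
iri_vec = rng.normal(size=300)             # one IRI description embedding
fndds_vecs = rng.normal(size=(7300, 300))  # all FNDDS description embeddings

# Cosine distance 1 - (u.v)/(|u||v|) against every FNDDS vector at once.
norms = np.linalg.norm(fndds_vecs, axis=1) * np.linalg.norm(iri_vec)
cos_dists = 1.0 - (fndds_vecs @ iri_vec) / norms
# Euclidean distance to every FNDDS vector.
euc_dists = np.linalg.norm(fndds_vecs - iri_vec, axis=1)

features = np.concatenate([cos_dists, euc_dists])  # 2 * 7300 distance features
print(features.shape)  # (14600,)
```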
<p>For categorical data, we take all unique values in each field and assign them an ID number. While that is often not the best practice for making a numerical representation out of categorical data <span class="citation" data-cites="10.5120/ijca2017915495">(Potdar et al. 2017)</span>, it seemed to work for the downstream model.</p>
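<p>This integer-ID encoding amounts to simple label encoding; a minimal sketch with hypothetical aisle values:</p>

```python
def encode_categorical(values):
    """Assign each unique value an integer ID, in order of first appearance."""
    ids = {}
    for v in values:
        ids.setdefault(v, len(ids))
    return [ids[v] for v in values], ids

codes, mapping = encode_categorical(["fruit", "dairy", "fruit", "bakery"])
print(codes)  # [0, 1, 0, 2]
```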
<p>Altogether, these feature transformations give us roughly 14,900 features when we use GloVe and roughly 15,300 features when we use BERT. Either feature set can then be sent to the downstream random forest/neural network to start classifying items.</p>
<p>It should be noted that processing the data is by far the most time-consuming part of our method. The data processing times for each embedding are as follows:</p>
<ul>
<li>GloVe: ~3 hours</li>
<li>BERT: ~6 hours</li>
</ul>
<p>Because BERT both takes longer to process the data and performs worse than our GloVe embeddings on the classification task, we opted to use GloVe embeddings for our primary method. Our best theoretical explanation is that since BERT is better at context-dependent tasks <span class="citation" data-cites="10.1145/3443279.3443304">(Wang et al. 2021)</span>, it likely expects input resembling well-structured sentences, which the IRI/FNDDS descriptions are not. GloVe, by contrast, depends less on context <span class="citation" data-cites="mikolov2013efficient pennington-etal-2014-glove">(2013; 2014)</span> and so should perform better when the input text is not a well-formed sentence.</p>
</section>
<section id="training-methods" class="level5">
<h5 class="anchored" data-anchor-id="training-methods">Training methods</h5>
<p>Once the data has been processed, we collect the following data for each IRI item:</p>
<ul>
<li>UPC code</li>
<li>Description (converted to numerical representation)</li>
<li>Categorical variables (converted to numerical representation)</li>
<li>Distances to each FNDDS item</li>
</ul>
<p>With that collected for each IRI item, we can finally use our classification model. We initialize the model and train it on the IRI data described above, together with the target FNDDS label for each item, so the model knows the “correct” answer for the given data. Once training is complete, we save the model and it is ready for use.</p>
<p>This part of training takes much less time than preparing the data, since calculating the embeddings requires far more computation than training a random forest model. The training times for each method are as follows:</p>
<ul>
<li>Random Forest: ~1 hour 15 minutes</li>
<li>Neural Network: ~25 minutes</li>
</ul>
<p>Despite the neural network taking far less time to train than the random forest, it still scores lower on the evaluation metrics, so we opt to continue using the random forest model as our primary method.</p>
</section>
<section id="general-approach-to-developing-the-model" class="level5">
<h5 class="anchored" data-anchor-id="general-approach-to-developing-the-model">General approach to developing the model</h5>
<p>Since the linkage problem involves mapping tens of thousands of items to a smaller category set of a few thousand items, we decided to frame this problem as a multi-class classification problem <span class="citation" data-cites="aly2005survey">(Aly 2005)</span>, where we then rank the top “k” most probable class mappings, as requested by the competition ruleset.</p>
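<p>The top-“k” ranking step can be sketched from any probabilistic classifier’s output; the labels and probability row below are hypothetical:</p>

```python
import numpy as np

def top_k_classes(proba_row, classes, k=5):
    """Return the k class labels with the highest predicted probability."""
    order = np.argsort(proba_row)[::-1][:k]
    return [classes[i] for i in order]

classes = ["F100", "F200", "F300", "F400"]
proba = np.array([0.1, 0.5, 0.3, 0.1])
print(top_k_classes(proba, classes, k=3))  # ['F200', 'F300', 'F400']
```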
<p>Most of the usable data available to us is text, so we need a method that can exploit that text-based information to map classes accurately. To accomplish this, we use word embedding techniques to compute an average numerical representation of each text description (both IRI and FNDDS), so we can calculate distances between descriptions, giving our model a sense of how similar each pair of descriptions is.</p>
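<p>The averaging step can be sketched as follows; the toy two-dimensional vectors stand in for pretrained GloVe embeddings, and the descriptions are made up for illustration:</p>

```python
import numpy as np

# Toy embedding table standing in for pretrained GloVe vectors.
emb = {"broccoli": np.array([1.0, 0.0]),
       "raw":      np.array([0.8, 0.2]),
       "frozen":   np.array([0.1, 0.9])}

def embed(description):
    """Average the vectors of every in-vocabulary token in a description."""
    vecs = [emb[w] for w in description.lower().split() if w in emb]
    return np.mean(vecs, axis=0)

def cosine_distance(a, b):
    return 1.0 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

iri_vec = embed("broccoli raw")
fndds_vec = embed("broccoli frozen")
d = cosine_distance(iri_vec, fndds_vec)   # small distance = similar descriptions
```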
</section>
<section id="the-key-trick-to-the-model" class="level5">
<h5 class="anchored" data-anchor-id="the-key-trick-to-the-model">The key “trick” to the model</h5>
<p>Since text descriptions hold the most information that can be used to link between an IRI item and an FNDDS item, finding a way to calculate the similarity between each description is paramount to making this method work.</p>
<p>Both distance calculation methods used in this work, cosine and Euclidean distance, are very similar in the type of information encoded, the only major difference being that cosine distance is implicitly normalized and Euclidean distance is not <span class="citation" data-cites="10.1145/967900.968151">(Qian et al. 2004)</span>.</p>
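<p>The “implicitly normalized” point can be verified directly: once vectors are unit-normalized, squared Euclidean distance equals exactly twice the cosine distance, so the two encode the same information up to scale. A quick numerical check (arbitrary random vectors, not the article’s data):</p>

```python
import numpy as np

rng = np.random.default_rng(1)
a, b = rng.normal(size=3), rng.normal(size=3)

cos_dist = 1 - a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

# After unit-normalizing, ||a_u - b_u||^2 = 2 - 2*cos(theta) = 2 * cosine distance.
a_u, b_u = a / np.linalg.norm(a), b / np.linalg.norm(b)
eucl_sq = np.sum((a_u - b_u) ** 2)

assert np.isclose(eucl_sq, 2 * cos_dist)
```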
</section>
<section id="notable-observations" class="level5">
<h5 class="anchored" data-anchor-id="notable-observations">Notable observations</h5>
<p>Just by building the ranking from the cosine similarities between each IRI item and all FNDDS items, we can achieve a Success@5 of 0.234 and an NDCG@5 of 0.312. Supplying the remaining features to the random forest classifier then adds extra discriminative power to the model.</p>
</section>
<section id="data-disclaimer" class="level5">
<h5 class="anchored" data-anchor-id="data-disclaimer">Data disclaimer</h5>
<p>Our current method only uses the data readily available from the 2017–2018 dataset, which we acknowledge is intended for testing. To remedy this, we further split this dataset into train/test sets and report our primary performance metrics on the unseen test subset. This gives a reasonable indication of how the model will perform on unseen data.</p>
<div class="callout callout-style-simple callout-note">
<div class="callout-body d-flex">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-body-container">
<p>Find the code in the <a href="https://github.com/realworlddatascience/realworlddatascience.github.io/tree/main/case-studies/posts/2023/08/21/_code">Real World Data Science GitHub repository</a>.</p>
</div>
</div>
</div>
</section>
</section>
<section id="our-results" class="level2">
<h2 class="anchored" data-anchor-id="our-results">Our results</h2>
<section id="approximate-training-time" class="level5">
<h5 class="anchored" data-anchor-id="approximate-training-time">Approximate training time</h5>
<p>Overall, the training time for our primary method is roughly 4 hours 30 minutes, broken down (approximately) as follows:</p>
<ol type="1">
<li>Reading data from database: 30 seconds</li>
<li>Calculating ~7,300 FNDDS description embeddings: 15 minutes 45 seconds</li>
<li>Calculating ~38,000 IRI description embeddings and similarity scores: 2 hours 20 minutes 45 seconds</li>
<li>Formatting calculated data for the random forest classifier: 35 minutes</li>
<li>Training the random forest classifier: 1 hour 15 minutes</li>
</ol>
</section>
<section id="approximate-inference-time" class="level5">
<h5 class="anchored" data-anchor-id="approximate-inference-time">Approximate inference time</h5>
<p>Our approximate inference time for our primary method is 15 minutes to make inferences for ~15,000 IRI items.</p>
</section>
<section id="s5-ndcg5-performance" class="level5">
<h5 class="anchored" data-anchor-id="s5-ndcg5-performance">S@5 &amp; NDCG@5 performance</h5>
<p>This is how our best-performing model (GloVe + random forest) currently performs on the test set:</p>
<ul>
<li>NDCG@5: 0.705</li>
<li>Success@5: 0.789</li>
</ul>
<p>When we evaluate that same model on the full PPC dataset we were provided (~38,000 items), we get the following scores:</p>
<ul>
<li>NDCG@5: 0.879</li>
<li>Success@5: 0.916</li>
</ul>
<p>(Note: The full PPC dataset contains approximately 15,000 items that we used to train the model, so these scores are not as representative of our method’s performance as the previous scores.)</p>
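<p>For readers unfamiliar with the two metrics, they can be sketched as follows for a single item whose ranking contains exactly one relevant FNDDS label (function names are ours, not the competition’s):</p>

```python
import numpy as np

def success_at_k(ranked, true_label, k=5):
    """1 if the true FNDDS label appears anywhere in the top-k ranking, else 0."""
    return int(true_label in ranked[:k])

def ndcg_at_k(ranked, true_label, k=5):
    """With a single relevant item, NDCG@k reduces to 1/log2(rank + 1)."""
    for i, label in enumerate(ranked[:k]):
        if label == true_label:
            return 1.0 / np.log2(i + 2)   # rank i+1 -> discount log2(i+2)
    return 0.0

# e.g. true label ranked 2nd: Success@5 = 1, NDCG@5 = 1/log2(3) ~ 0.63
```

<p>The reported scores are these per-item values averaged over the evaluation set.</p>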
</section>
</section>
<section id="future-workrefinement" class="level2">
<h2 class="anchored" data-anchor-id="future-workrefinement">Future work/refinement</h2>
<p>As mentioned previously, we only used the given 2017–2018 PPC dataset for both training and testing. Going forward, we would like to include datasets from previous years as well, which we believe would further improve model performance. Additionally, the datasets generated from this research have the potential to inform and support further studies from a variety of perspectives, including nutrition, consumer research, and public health, and could make significant contributions to our understanding of consumer behavior and the role of food and nutrient consumption in overall health and well-being.</p>
</section>
<section id="lessons-learned" class="level2">
<h2 class="anchored" data-anchor-id="lessons-learned">Lessons learned</h2>
<p>It was interesting that the random forest model outperformed the vanilla neural network model, showing that a simpler solution can work better, depending on the application. This is in line with the well-established principle in machine learning that the choice of model should be guided by the nature of the problem and the characteristics of the data: here, the random forest, being simpler and more interpretable, was better suited to the problem at hand and outperformed the more complex neural network. These results underscore the importance of careful model selection, weighing both the complexity of the model and the specific requirements of the problem when choosing an algorithm for a particular application.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/02-competition-design.html">← Part 2: Competition design</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/04-second-place-winners.html">Part 4: Second place winners →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Alex Knipper</strong> and <strong>Naman Bansal</strong> are PhD students, and <strong>Jingyi Zheng</strong>, <strong>Wenying Li</strong>, and <strong>Shubhra Kanti Karmaker</strong> are assistant professors at Auburn University.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Alex Knipper, Naman Bansal, Jingyi Zheng, Wenying Li, and Shubhra Kanti Karmaker
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@nicotitto?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">nrd</a> on <a href="https://unsplash.com/photos/D6Tu_L3chLE?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Knipper, Alex, Naman Bansal, Jingyi Zheng, Wenying Li, and Shubhra Kanti Karmaker. 2023. “Food for Thought: First place winners – Auburn Big Data.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/03-first-place-winners.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-aly2005survey" class="csl-entry">
Aly, M. 2005. <span>“Survey on Multiclass Classification Methods, Tech. Rep.”</span> <em>California Institute of Technology</em>.
</div>
<div id="ref-DBLP:journals/corr/abs-1810-04805" class="csl-entry">
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. <span>“<span>BERT:</span> Pre-Training of Deep Bidirectional Transformers for Language Understanding.”</span> <em>CoRR</em> abs/1810.04805. <a href="http://arxiv.org/abs/1810.04805">http://arxiv.org/abs/1810.04805</a>.
</div>
<div id="ref-mikolov2013efficient" class="csl-entry">
Mikolov, Tomas, Kai Chen, Greg Corrado, and Jeffrey Dean. 2013. <em>Efficient Estimation of Word Representations in Vector Space</em>. <a href="https://arxiv.org/abs/1301.3781">https://arxiv.org/abs/1301.3781</a>.
</div>
<div id="ref-10.1007/978-3-030-03146-6_86" class="csl-entry">
Parmar, A., R. Katariya, and V. Patel. 2019. <span>“A Review on Random Forest: An Ensemble Classifier.”</span> In <em>International Conference on Intelligent Data Communication Technologies and Internet of Things (ICICI) 2018</em>, edited by J. Hemanth, X. Fernando, P. Lafata, and Z. Baig. Springer International Publishing.
</div>
<div id="ref-pennington-etal-2014-glove" class="csl-entry">
Pennington, J., R. Socher, and C. Manning. 2014. <span>“<span>G</span>lo<span>V</span>e: Global Vectors for Word Representation.”</span> <em>Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (<span>EMNLP</span>)</em> (Doha, Qatar), October, 1532–43. <a href="https://doi.org/10.3115/v1/D14-1162">https://doi.org/10.3115/v1/D14-1162</a>.
</div>
<div id="ref-10.5120/ijca2017915495" class="csl-entry">
Potdar, K., T. S. Pardawala, and C. D. Pai. 2017. <span>“A Comparative Study of Categorical Variable Encoding Techniques for Neural Network Classifiers.”</span> <em>International Journal of Computer Applications</em> (New York, USA) 175 (4): 7–9. <a href="https://doi.org/10.5120/ijca2017915495">https://doi.org/10.5120/ijca2017915495</a>.
</div>
<div id="ref-10.1145/967900.968151" class="csl-entry">
Qian, G., S. Sural, Y. Gu, and S. Pramanik. 2004. <span>“Similarity Between Euclidean and Cosine Angle Distance for Nearest Neighbor Queries.”</span> <em>Proceedings of the 2004 ACM Symposium on Applied Computing</em> (New York, NY, USA), SAC ’04, 1232–37. <a href="https://doi.org/10.1145/967900.968151">https://doi.org/10.1145/967900.968151</a>.
</div>
<div id="ref-SCHMIDHUBER201585" class="csl-entry">
Schmidhuber, J. 2015. <span>“Deep Learning in Neural Networks: An Overview.”</span> <em>Neural Networks</em> 61: 85–117. <a href="https://doi.org/10.1016/j.neunet.2014.09.003">https://doi.org/10.1016/j.neunet.2014.09.003</a>.
</div>
<div id="ref-10.1145/3443279.3443304" class="csl-entry">
Wang, C., P. Nulty, and D. Lillis. 2021. <span>“A Comparative Study on Word Embeddings in Deep Learning for Text Classification.”</span> <em>Proceedings of the 4th International Conference on Natural Language Processing and Information Retrieval</em> (New York, NY, USA), NLPIR ’20, 37–46. <a href="https://doi.org/10.1145/3443279.3443304">https://doi.org/10.1145/3443279.3443304</a>.
</div>
</div></section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/03-first-place-winners.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/03-auburn.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>The Food for Thought Challenge: Using AI to support evidence-based food and nutrition policy</title>
  <dc:creator>Brian Tarran and Julia Lane</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/00-food-for-thought.html</link>
  <description><![CDATA[ 





<p>There’s a saying: “You are what you eat.” Its meaning is somewhat open to interpretation, as with many such sayings, but it is typically used to make the point that if you want to <em>be</em> well, you need to eat well. Nutrition scientists and dieticians spend their careers trying to figure out what “eating well” looks like – the foods the human body needs, in what quantities, and how best to consume them. Their research informs advice and guidance issued by health professionals and governments. Ultimately, though, the choice of what to eat falls to us – individuals and families – and our choices are often determined by our tastes, the availability of foodstuffs in our local stores, their price and affordability.</p>
<p>So, what exactly <em>do</em> we eat? Answers come from a variety of sources. In the United States, there are dietary recall studies such as the <a href="https://www.cdc.gov/nchs/nhanes/index.htm">National Health and Nutrition Examination Survey</a>, which asks a sample of respondents to report their food and beverage consumption over a set period of time. There are also organisations like <a href="https://www.iriworldwide.com/en-gb">IRI</a> that collect point-of-sale data from retail stores on the actual food and drink being sold to consumers. By and large, this information comes from barcodes on product packaging being scanned at checkouts, so it is often referred to as “scanner data”.</p>
<p>This data – from dietary recall studies and retail scanners – is valuable: once we know what people are eating, we can check the nutritional content of those foods and build up a picture of what the diet of a typical individual or family looks like and how it compares to the diet recommended by doctors and policymakers. And, if we know what other foodstuffs are available, how much they cost, and the nutritional value of those items, we can work out how much families need to spend, and on what, in order to eat well and, hopefully, be well.</p>
<p>Figuring all this out is where something called the Purchase to Plate Crosswalk (PPC) comes in. It’s a key tool for understanding the “<a href="https://www.sciencedirect.com/science/article/pii/S0889157521005445">healthfulness of retail food purchases</a>” and it does this by linking IRI scanner data on what people buy with data on the nutritional content of those foods, as recorded in the US Department of Agriculture’s Food and Nutrient Database for Dietary Studies (FNDDS). But there’s a catch: scanner data is collected about hundreds of thousands of food products, whereas the FNDDS has nutritional profile information for only a few thousand items. Linking these two datasets therefore gives rise to a one-to-many matching problem – a problem that takes several hundred person-hours to resolve.</p>
<p>What if machine learning can help? That question inspired a competition, the Food for Thought Challenge, organized by the Coleridge Initiative, a nonprofit organization working with governments to ensure that data are more effectively used for public decision-making. Researchers and data scientists were invited to use machine learning and natural language processing to more efficiently link data on supermarket products to nutrient databases.</p>
<p>This collection of articles tells the story of the <a href="https://coleridgeinitiative.org/projects/food-for-thought">Food for Thought Challenge</a>. We begin by exploring the <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/01-purchase-to-plate.html">policy issues</a> that drive the development of the PPC – the need to understand the national diet, developing healthy diet plans, and costing up those plans – and the issues posed by record linkage. Next, we learn about <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/02-competition-design.html">the nature of the challenge and the structure of the competition in more detail</a>, and then the <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/03-first-place-winners.html">three</a> <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/04-second-place-winners.html">winning</a> <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/05-third-place-winners.html">teams</a> walk us through their solutions. We end the collection with some closing thoughts on <a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/06-value-of-competitions.html">the value of competitions for addressing data scientific challenges in the public sector</a>.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/01-purchase-to-plate.html">Part 1: The Purchase to Plate Suite →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Brian Tarran</strong> is editor of Real World Data Science, and head of data science platform at the Royal Statistical Society.
</dd>
<dd>
<strong>Julia Lane</strong> is a professor at the NYU Wagner Graduate School of Public Service and a NYU Provostial Fellow for Innovation Analytics. She co-founded the Coleridge Initiative, whose goal is to use data to transform the way governments access and use data for the social good through training programs, research projects and a secure data facility. She recently served on the Advisory Committee on Data for Evidence Building and the National AI Research Resources Task Force.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Royal Statistical Society and Julia Lane
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>. Thumbnail photo by <a href="https://unsplash.com/@melaniesylim?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Melanie Lim</a> on <a href="https://unsplash.com/photos/246b6c6IeC0?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Tarran, Brian, and Julia Lane. 2023. “The Food for Thought Challenge: Using AI to support evidence-based food and nutrition policy.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/00-food-for-thought.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



 ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/00-food-for-thought.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/00-shopping.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: The importance of the Purchase to Plate Suite</title>
  <dc:creator>Andrea Carlson and Thea Palmer Zimmerman</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/01-purchase-to-plate.html</link>
  <description><![CDATA[ 





<div class="callout callout-style-default callout-important callout-titled" style="margin-top: 0;">
<div class="callout-header d-flex align-content-center">
<div class="callout-icon-container">
<i class="callout-icon"></i>
</div>
<div class="callout-title-container flex-fill">
<span class="screen-reader-only">Important</span>Disclaimer
</div>
</div>
<div class="callout-body-container callout-body">
<p>The findings and conclusions in this publication are those of the authors and should not be construed to represent any official USDA or US Government determination or policy. This research was supported by the US Department of Agriculture’s Economic Research Service and Center for Nutrition, Policy and Promotion. Findings should not be attributed to Circana (formerly IRI).</p>
</div>
</div>
<p>About 600,000 <a href="https://www.cdc.gov/nchs/fastats/leading-causes-of-death.htm">deaths per year in the United States</a> are related to chronic diseases that are linked to poor dietary choices. Many other individuals suffer from diet-related health conditions, which may limit their ability to work, learn, and be physically active <span class="citation" data-cites="usda_2020">(US Department of Agriculture and US Department of Health and Human Services 2020)</span>. In recognition of the link between diet and health, in 1974 the Senate Select Committee on Nutrition and Human Needs, originally formed to eliminate hunger, expanded its focus to improving eating habits, nutrition policy and the national diet. Since 1980, the Dietary Guidelines for Americans have been released every five years by the US Departments of Agriculture (USDA) and Health and Human Services (DHHS). The guidelines present “<a href="https://www.dietaryguidelines.gov/">advice on what to eat and drink to meet nutrient needs, promote health, and prevent disease</a>”.</p>
<p>Because there can be economic and social barriers to maintaining a healthy diet, USDA promotes <a href="https://www.usda.gov/nutrition-security">Food and Nutrition Security</a> so that everyone has consistent and equitable access to healthy, safe, and affordable foods that promote optimal health and well-being. A set of data tools called the <a href="https://www.ers.usda.gov/data-products/purchase-to-plate/">Purchase to Plate Suite</a> (PPS) supports these goals by enabling the update of the <a href="https://www.fns.usda.gov/snap/thriftyfoodplan#:~:text=What%20is%20the%20Thrifty%20Food,lowest%20cost%20of%20the%20four.">Thrifty Food Plan</a> (TFP), which estimates how much a budget-conscious family of four needs to spend on groceries to ensure a healthy diet. The TFP market basket – consisting of the specific amounts of various food categories required by the plan – forms the basis of the maximum allotment for the Supplemental Nutrition Assistance Program (SNAP, formerly known as the “Food Stamps” program), which provided financial support towards the cost of groceries for <a href="https://www.fns.usda.gov/pd/supplemental-nutrition-assistance-program-snap">over 41 million individuals in almost 22 million households in fiscal year 2022</a>.</p>
<p>The 2018 Farm Act (Agriculture Improvement Act of 2018) requires that USDA reevaluate the TFP every five years using current food composition, consumption patterns, dietary guidance, and food prices, and using approved scientific methods. USDA’s Economic Research Service (ERS) was charged with estimating the current food prices using retail food scanner data <span class="citation" data-cites="levin_et_al_2018 muth_et_al_2016">(Levin et al. 2018; Muth et al. 2016)</span> and utilized the PPS for this task. The most recent TFP update was released in August 2021 and the revised cost of the market basket was the first non-inflation adjustment increase in benefits for SNAP in over 40 years <span class="citation" data-cites="thrifty_food_plan_2021">(US Department of Agriculture 2021)</span>.</p>
<p>The PPS combines datasets to enhance research related to the economics of food and nutrition. There are four primary components of the suite:</p>
<ul>
<li>Purchase to Plate Crosswalk (PPC),</li>
<li>Purchase to Plate Price Tool (PPPT),</li>
<li>Purchase to Plate National Average Prices (PP-NAP) for the National Health and Nutrition Examination Survey (NHANES), and</li>
<li>Purchase to Plate Ingredient Tool (PPIT).</li>
</ul>
<p>The PPC allows researchers to measure the healthfulness of store purchases. On average <a href="https://www.ers.usda.gov/data-products/foodaps-national-household-food-acquisition-and-purchase-survey/summary-findings/#calories">US consumers acquire about 75% of their calories from retail stores</a>, and there are a number of studies linking the availability of foods at home to the healthfulness of the overall diet <span class="citation" data-cites="gattshall_et_al_2008 hanson_et_al_2005">(e.g., Gattshall et al. 2008; Hanson et al. 2005)</span>. Thus, understanding the healthfulness of store purchases allows us to understand differences in consumers who purchase healthy versus less healthy foods, and may contribute to better policies that promote healthier food purchases. While healthier diets are linked to a lower risk of disease outcomes <span class="citation" data-cites="REEDY2014881">(Reedy et al. 2014)</span>, other factors such as health care access may also be contributors <span class="citation" data-cites="cleary_et_al_2022">(Cleary et al. 2022)</span>. The PPC also forms the basis of the price tool, PPPT – which allows researchers to estimate custom prices for dietary recall studies – and a new ERS data product, the <a href="https://www.ers.usda.gov/data-products/purchase-to-plate/">PP-NAP</a>. The national average prices from PP-NAP are used in reevaluating the TFP. By using the PP-NAP with 24-hour dietary recall information from surveys such as What We Eat in America (<a href="https://www.ars.usda.gov/northeast-area/beltsville-md-bhnrc/beltsville-human-nutrition-research-center/food-surveys-research-group/docs/wweianhanes-overview/">WWEIA</a>) – the dietary component of the nationally representative <a href="https://www.cdc.gov/nchs/nhanes/index.htm">National Health and Nutrition Examination Survey</a> (NHANES)<sup>1</sup> – researchers can examine the relationship between the cost of food, dietary intake, and chronic diseases linked to poor diets. 
The price estimates also allow researchers to develop cost-effective healthy diets such as <a href="https://www.myplate.gov/myplate-kitchen/recipes">MyPlate Kitchen</a>. The final component of the Purchase to Plate Suite, the ingredient tool (PPIT), breaks dietary recall-reported foods back into purchasable ingredients, based on US retail food purchases. The PPIT is also used in the revaluation of the TFP, and by researchers who want to look at the relationship between reported ingestion of grocery items, cost and disease outcomes using WWEIA/NHANES. More information on the development of the PPC is available in two papers by Carlson et al. <span class="citation" data-cites="carlson_et_al_2019 carlson_et_al_2022">(2019, 2022)</span>.</p>
<p>The Food for Thought competition aimed to support the development of the PPC – and thus policy-oriented research – by linking retail food scanner data to the USDA nutrition data used to analyze NHANES dietary recall data, specifically the Food and Nutrient Database for Dietary Studies (FNDDS) <span class="citation" data-cites="fndds_2018 fndds_2020">(2018, 2020)</span>. In particular, the competition set out to use artificial intelligence (AI) to reduce the human effort involved in creating the links for the PPC, while still maintaining the high-quality standards required for reevaluating the TFP and for data published by ERS (which is one of 13 Principal Statistical Agencies in the United States Federal Government).</p>
<section id="methods-used-to-date" class="level2">
<h2 class="anchored" data-anchor-id="methods-used-to-date">Methods used to date</h2>
<p>On the surface, the linking process may appear simple: both the FNDDS and retail food scanner data are databases of food. But the scanner data are produced for market research, and the FNDDS for dietary studies. The scanner data include about 350,000 items with sales each year, while the FNDDS has only 10,000–15,000 items. Scanner data relates to specific products, while FNDDS items are often more general. Both datasets have different hierarchical structures – the FNDDS hierarchy is based around major food groups: dairy; meat, poultry and seafood; eggs; nuts and legumes; grains; fruits; vegetables; fats and oils; and sugars, sweets, and beverages. Items fall into the groups regardless of preparation method or form. That is, broccoli prepared from frozen and from fresh both appear in the vegetable group, and for some fruits and vegetables, the fresh, frozen, canned and dried forms are the same FNDDS item. Vegetable-based mixed dishes, such as broccoli and carrot stir-fry or soup, are also classified in the vegetable group. On the other hand, the scanner data classifies foods by grocery aisle. That is, fresh and frozen broccoli are classified in different areas: produce and frozen vegetables. Similarly, when sold as a prepared food, the broccoli and carrot stir-fry may be found in the frozen entrées, as a kit in either the frozen or produce section, refrigerated foods, or all of these.</p>
<p>To allow researchers to import the FNDDS nutrient data into the scanner data, a one-to-many match between FNDDS and scanner data items was needed. The food descriptions in the scanner data include brand names and package sizes and are written as a consumer would pronounce them – e.g., fresh and crisp broccoli florets, ready-cut, 10 oz – versus a more general FNDDS description such as “Broccoli, raw”. (Also linked to the “Broccoli, raw” code would be broccoli sold with stems attached, broccoli spears, and any other way raw broccoli is sold.) In the scanner data, the Universal Product Code (UPC) and the European Article Number (EAN) can link items between tables within the scanner data, as well as between datasets of grocery items, such as the USDA Global Branded Foods Product Database, a component of <a href="https://fdc.nal.usda.gov/index.html">USDA’s Food Data Central</a>. However, these codes are not related to the FNDDS codes, or any other column within the FNDDS. In other words, before development of the PPC, there were no established linking identifiers.</p>
<p>Figure 1 shows the process USDA uses to develop matches between scanner data and FNDDS.</p>
<p><a href="images/pt1-fig1.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt1-fig1.png" class="img-fluid" width="700"></a></p>
<div class="figure-caption">
<p><strong>Figure 1:</strong> Process currently used to create the matches between the USDA Food and Nutrient Database for Dietary Studies (FNDDS) and the retail scanner data (labelled “IRI” for the IRI InfoScan and Consumer Network) product dictionaries. Source: Author provided.</p>
</div>
<p>We start the linking process by categorizing the scanner data items into homogeneous groups to make the first round of automated matching more efficient. To save time, we use the second-lowest hierarchical category in the scanner data, which generally divides items within a grocery aisle into homogeneous groups such as produce, canned beans, baking mixes, and bread. Once the linking categories for the scanner data are established, we select appropriate items from the FNDDS. Since the FNDDS is highly structured, this selection is usually straightforward.</p>
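<p>As a minimal illustration (not USDA’s actual tooling), this grouping step amounts to bucketing items by a chosen hierarchy field. The record structure and function names below are hypothetical:</p>

```python
from collections import defaultdict

def group_by_category(items, category_of):
    """Bucket scanner-data items into homogeneous linking groups,
    keyed by a chosen hierarchy category (e.g., produce, bread)."""
    groups = defaultdict(list)
    for item in items:
        groups[category_of(item)].append(item)
    return dict(groups)

# Hypothetical (description, category) records standing in for the
# scanner data's second-lowest hierarchy level.
items = [("fresh broccoli florets", "produce"),
         ("whole wheat bread", "bread"),
         ("russet potatoes", "produce")]
groups = group_by_category(items, lambda it: it[1])
```

<p>Each resulting group can then be matched against the corresponding slice of the FNDDS.</p>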
<p>Our next step is to use semantic matching to create a search table that aligns similar terms in the IRI product dictionary and the FNDDS. This first requires that we extract attributes from the FNDDS descriptions into fields similar to those in the scanner data product dictionary. The FNDDS descriptions are spread across multiple columns because entries are added as the need arises to provide brand-name examples or alternative descriptions of foods, which help code the foods WWEIA participants report eating. We manually create matching tables that link terms used in the FNDDS to those used in the scanner data, organized by the fields defined in the restructured FNDDS. We then use these tables as the basis of a probabilistic matching process. For example, when linking the produce group, “fresh” in the scanner data would be aligned with “raw” and “prepared from fresh” and NOT “prepared from frozen” in the FNDDS, and “broccoli florets” would be aligned with “raw” and “broccoli”. Since the FNDDS is designed to code the foods individuals report eating, many of the foods in the FNDDS are already prepared, resulting in descriptions such as “broccoli, steamed, prepared from fresh” or “broccoli, boiled, prepared from frozen”.</p>
<p>Once the linking table is established, the probabilistic match process returns the single best possible match for each item in the scanner data. For example, a match between fresh broccoli florets and frozen broccoli would have a lower probability score than “broccoli, raw”. Because these matches form the basis of major USDA policies, we cannot accept an error rate of more than 5 percent, and lower is preferred. To reach that goal, nutritionists review every match to make sure the probabilistic match did not return a match between cauliflower florets and fresh broccoli, say, or that a broccoli and carrot stir-fry is not matched to a dish with broccoli, carrots, and chicken. The correct matches, such as the one between fresh broccoli florets and raw broccoli, are set aside while the items with an incorrect match, such as cauliflower florets and the broccoli and carrot stir-fry, are used to revise the search table. Revisions might include adding (NOT chicken) to the broccoli and carrot stir-fry dish. Mixed dishes — such as the broccoli and carrot stir-fry — pose particular challenges because there are a wide variety of similar products available in the grocery store. After a few rounds of revising the search table and running the probabilistic match process, it is more efficient to use a manual match, established by one nutritionist and reviewed by another, after which the match is assumed to be correct.</p>
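<p>The search-table-plus-review loop described above can be sketched in a few lines. The code below is a simplified stand-in for the probabilistic matcher – it scores candidates with a plain string-similarity ratio from Python’s standard library and applies exclusion terms such as “NOT prepared from frozen”; the function names and data are illustrative, not the production system:</p>

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Crude stand-in for a probabilistic match score."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def best_match(scanner_desc, fndds_items, exclude_terms=()):
    """Return the highest-scoring FNDDS description for a scanner-data
    item, skipping candidates that contain an exclusion term."""
    candidates = [item for item in fndds_items
                  if not any(t.lower() in item.lower() for t in exclude_terms)]
    if not candidates:
        return None  # no acceptable match: fall back to manual review
    return max(candidates, key=lambda item: similarity(scanner_desc, item))

match = best_match(
    "fresh and crisp broccoli florets, ready-cut, 10 oz",
    ["Broccoli, raw",
     "Broccoli, boiled, prepared from frozen",
     "Cauliflower, raw"],
    exclude_terms=["prepared from frozen"],  # from the search table: fresh, not frozen
)
```

<p>In the real workflow the scores come from the probabilistic match over the aligned search-table terms, and every returned match still goes to a nutritionist for review.</p>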
<p>The process improved with each new wave of FNDDS and IRI data. Our first version of the PPC linked the FNDDS 2011/12 to the 2013 IRI retail scanner data. Subsequent waves started with the previous search table, and the resulting matches were reviewed by nutritionists. We also used more fields in the IRI product dictionary to create the homogeneous linking groups and modified these groups with each wave. During each wave we experimented to find the number of rounds of probabilistic matching that was most cost effective. For some linking groups it took less human time to match manually from the start, while for other groups it was more efficient to do multiple rounds of improvements to the search table. Starting with the most recent wave (matching FNDDS 2017/18 to the 2017 and 2018 retail scanner data), we assumed previous matches appearing in the newer data were correct. Although this assumption held for most matches, a review showed that previous matches should still be checked before an item is removed from the list of scanner data items needing FNDDS matches. In the future we intend to explore methods developed by the participants of the Food for Thought competition.</p>
</section>
<section id="linking-challenges" class="level2">
<h2 class="anchored" data-anchor-id="linking-challenges">Linking challenges</h2>
<p>An ongoing challenge in the linking process is that both the scanner data and the FNDDS undergo substantive changes each year, meaning that the previous matches and search tables must be reviewed and revised with each new effort: tables that work with one cycle of FNDDS and scanner data need revisions before they can be used with the next. Changes to the scanner data that affect our current method include dropped and added items, data corrections, and revisions to the categories that form the basis of the homogeneous linking groups. In addition, there are errors such as incorrect food descriptions, conflicting package size information, and changes in item descriptions from year to year. Since the FNDDS is designed to support dietary recall studies, revisions reflect both changes to available foods and the level of detail respondents can provide. These revisions result in dropped or added food codes, changes to food descriptions that affect which scanner data items match to FNDDS items, and revisions to recipes used in the nutrient coding, which affect the number of retail ingredients available in the FNDDS.</p>
<p>Of the four parts of the PPS, establishing the matches is the most time-consuming task and accounts for at least 60 percent of the total budget. In the most recent round, we had 168 categories, and each one went through 2–3 automated matching rounds; after each round, nutritionists spent an average of two hours reviewing the matches. This adds up to somewhere between 670 and 1,000 hours of review time. After the automated review, manual matching requires an additional 300 hours. Reducing the time required to establish matches and link the FNDDS and retail scanner datasets could yield significant savings and make the data available sooner. That, in turn, would allow more timely policy-based research and ensure that the mandated revision of the Thrifty Food Plan can continue with the most recent food price data.</p>
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/00-food-for-thought.html">← Introduction</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/02-competition-design.html">Part 2: Competition design →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Andrea Carlson</strong> is an agricultural economist in the Food Markets Branch of the Food Economics Division in USDA’s Economic Research Service. She is the project lead for the Purchase to Plate Suite, which allows users to import USDA nutrient and food composition data into retail food scanner data acquired by USDA and estimate individual food prices for dietary intake data.
</dd>
<dd>
<strong>Thea Palmer Zimmerman</strong> is a senior study director and research nutritionist at Westat.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Image credit</dt>
<dd>
Thumbnail photo by <a href="https://unsplash.com/@neonbrand?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Kenny Eliason</a> on <a href="https://unsplash.com/photos/SvhXD3kPSTY?utm_source=unsplash&amp;utm_medium=referral&amp;utm_content=creditCopyText">Unsplash</a>.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Carlson, Andrea, and Thea Palmer Zimmerman. 2023. “Food for Thought: The importance of the Purchase to Plate Suite.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/01-purchase-to-plate.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>
</section>
<section id="acknowledgements" class="level2">
<h2 class="anchored" data-anchor-id="acknowledgements">Acknowledgements</h2>
<p>The research presented in this compendium supports the Purchase to Plate Suite of data products. Carlson has been privileged to both develop and lead this project over the course of her career, but it is not a solo project. Many thanks to the Linkages Team from USDA’s Economic Research Service (Christopher Lowe, Mark Denbaly, Elina Page, and Catherine Cullinane Thomas), the Center for Nutrition Policy and Promotion (Kristin Koegel, Kevin Kuczynski, Kevin Meyers Mathieu, TusaRebecca Pannucci), and our contractor Westat, Inc.&nbsp;(Thea Palmer Zimmerman, Carina E. Tornow, Amber Brown McFadden, Caitlin Carter, Viji Narayanaswamy, Lindsay McDougal, Elisha Lubar, Lynnea Brumby, Raquel Brown, and Maria Tamburri). Many others have supported this project over the years.</p>



</section>


<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-carlson_et_al_2019" class="csl-entry">
Carlson, A. C., E. T. Page, T. P. Zimmerman, C. E. Tornow, and S. Hermansen. 2019. <em>Linking USDA Nutrition Databases to IRI Household-Based and Store-Based Scanner Data</em>. Technical Bulletin No. 1952. US Department of Agriculture, Economic Research Service.
</div>
<div id="ref-carlson_et_al_2022" class="csl-entry">
Carlson, A. C., C. E. Tornow, E. T. Page, A. Brown McFadden, and T. Palmer Zimmerman. 2022. <span>“Development of the Purchase to Plate Crosswalk and Price Tool: Estimating Prices for the National Health and Nutrition Examination Survey (NHANES) Foods and Measuring the Healthfulness of Retail Food Purchases.”</span> <em>Journal of Food Composition and Analysis</em> 106: 104344. <a href="https://doi.org/10.1016/j.jfca.2021.104344">https://doi.org/10.1016/j.jfca.2021.104344</a>.
</div>
<div id="ref-cleary_et_al_2022" class="csl-entry">
Cleary, R., Y. Liu, and A. Carlson. 2022. <em>Differences in the Distribution of Nutrition Between Households Above and Below Poverty</em>. Agricultural and Applied Economic Association Annual Meeting. Anaheim, CA. <a href="https://ageconsearch.umn.edu/record/322267">https://ageconsearch.umn.edu/record/322267</a>.
</div>
<div id="ref-gattshall_et_al_2008" class="csl-entry">
Gattshall, M. L., J. A. Shoup, J. A. Marshall, L. A. Crane, and P. A. Estabrooks. 2008. <span>“Validation of a Survey Instrument to Assess Home Environments for Physical Activity and Healthy Eating in Overweight Children.”</span> <em>International Journal of Behavioral Nutrition and Physical Activity</em> 5 (3). <a href="https://doi.org/10.1186/1479-5868-5-3">https://doi.org/10.1186/1479-5868-5-3</a>.
</div>
<div id="ref-hanson_et_al_2005" class="csl-entry">
Hanson, N. I., D. Neumark-Sztainer, M. E. Eisenberg, M. Story, and M. Wall. 2005. <span>“Associations Between Parental Report of the Home Food Environment and Adolescent Intakes of Fruits, Vegetables and Dairy Foods.”</span> <em>Public Health Nutrition</em> 8 (1). <a href="https://doi.org/10.1079/PHN2005661">https://doi.org/10.1079/PHN2005661</a>.
</div>
<div id="ref-levin_et_al_2018" class="csl-entry">
Levin, D., D. Noriega, C. Dicken, A. Okrent, M. Harding, and M. Lovenheim. 2018. <em>Examining Store Scanner Data: A Comparison of the IRI Infoscan Data with Other Data Sets, 2008-12</em>. Technical Bulletin No. 1949. US Department of Agriculture, Economic Research Service.
</div>
<div id="ref-muth_et_al_2016" class="csl-entry">
Muth, M. K., M. Sweitzer, D. Brown, et al. 2016. <em>Understanding IRI Household-Based and Store-Based Scanner Data</em>. Technical Bulletin No. 1942. US Department of Agriculture, Economic Research Service.
</div>
<div id="ref-REEDY2014881" class="csl-entry">
Reedy, J., S. M. Krebs-Smith, P. E. Miller, et al. 2014. <span>“Higher Diet Quality Is Associated with Decreased Risk of All-Cause, Cardiovascular Disease, and Cancer Mortality Among Older Adults.”</span> <em>The Journal of Nutrition</em> 144 (6): 881–89. <a href="https://doi.org/10.3945/jn.113.189407">https://doi.org/10.3945/jn.113.189407</a>.
</div>
<div id="ref-thrifty_food_plan_2021" class="csl-entry">
US Department of Agriculture. 2021. <em>Thrifty Food Plan, 2021</em>. Food and Nutrition Service No. 916. US Department of Agriculture. <a href="https://FNS.usda.gov/TFP">https://FNS.usda.gov/TFP</a>.
</div>
<div id="ref-fndds_2018" class="csl-entry">
US Department of Agriculture, Agricultural Research Service. 2018. <em>USDA Food and Nutrient Database for Dietary Studies 2015-2016</em>. US Department of Agriculture, Agricultural Research Service. <a href="https://www.ars.usda.gov/nea/bhnrc/fsrg">https://www.ars.usda.gov/nea/bhnrc/fsrg</a>.
</div>
<div id="ref-fndds_2020" class="csl-entry">
US Department of Agriculture, Agricultural Research Service. 2020. <em>USDA Food and Nutrient Database for Dietary Studies 2017-2018</em>. US Department of Agriculture, Agricultural Research Service. <a href="https://www.ars.usda.gov/nea/bhnrc/fsrg">https://www.ars.usda.gov/nea/bhnrc/fsrg</a>.
</div>
<div id="ref-usda_2020" class="csl-entry">
US Department of Agriculture and US Department of Health and Human Services. 2020. <em>Dietary Guidelines for Americans, 2020-2025</em>. 9th edition. <span>US Department of Agriculture and US Department of Health and Human Services</span>. <a href="https://DietaryGuidelines.gov">https://DietaryGuidelines.gov</a>.
</div>
</div></section><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">Footnotes</h2>

<ol>
<li id="fn1"><p>NHANES is a multi-module continuous survey conducted by the Centers for Disease Control and Prevention. In addition to the WWEIA, NHANES includes a four-hour complete medical exam including a health history, and a blood and urine analysis.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/01-purchase-to-plate.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/01-pps.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>Food for Thought: Competition and challenge design</title>
  <dc:creator>Zheyuan Zhang and Uyen Le</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/02-competition-design.html</link>
  <description><![CDATA[ 





<p>Since 2014, the professional services firm Westat, Inc.&nbsp;has been developing the Purchase to Plate Crosswalk (PPC) for the United States Department of Agriculture (USDA) Economic Research Service (ERS). The PPC links the retail food transactions database from IRI’s InfoScan service to the USDA Food and Nutrient Database for Dietary Studies (FNDDS). However, the current linkage process is only partly automated, making it resource intensive and time consuming, and requiring manual review.</p>
<p>With sponsorship from ERS, Westat partnered with the Coleridge Initiative to host the Food for Thought competition to challenge researchers and data scientists to use machine learning and natural language processing to find accurate and efficient methods for creating the PPC. Figure 1 provides a visual overview of the challenge set by the competition.</p>
<p><a href="images/pt2-fig1.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt2-fig1.png" class="img-fluid" width="700"></a></p>
<div class="figure-caption">
<p><strong>Figure 1:</strong> Overview of the Food for Thought Competition Challenge.</p>
</div>
<p>The one-to-many matching task that is central to the competition poses many challenges for researchers to wrestle with. Because the IRI data contain food transactions collected from partnered retail establishments for over 350,000 items, matches must be made from limited data features: categories, providers, and semantically inconsistent descriptions consisting of short phrases. Consider this hypothetical example: IRI product-related information about a (fictional) “Cheesy Hashbrowns Hamburger Helper, 5.5 Oz Box” needs to be linked to FNDDS nutrition-related information found under “Mixed dishes – meat, poultry, seafood: Mixed meat dishes”. Figure 2 demonstrates how the two databases are linked to create the PPC. As can be seen, there is no common word that easily indicates that “Cheesy Hashbrowns Hamburger Helper…” should be matched with “Mixed dishes…”, and such cases exist in all IRI tables used for the challenge, from 2012 through 2018.</p>
<p><a href="images/pt2-fig2.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt2-fig2.png" class="img-fluid" width="700"></a></p>
<div class="figure-caption">
<p><strong>Figure 2:</strong> Each universal product code (UPC) from the IRI data could match to only one ensemble code (EC) from the FNDDS data, whereas one EC code could match to multiple UPCs.</p>
</div>
<p>Also, because nutritionists or food scientists will always need to review the matching, regardless of the matching method used, it was important that our evaluation of proposed matching methods focused both on the predictive accuracy of models and on metrics that would lead participants to develop models that help qualified reviewers reduce their workloads.</p>
<p>Organising the competition was also a challenge in its own right, for data privacy reasons. IRI scanner data contains sensitive information, such as store name, location, unit price, and weekly quantity sold for each item. This ruled out using existing online platforms like Kaggle, DrivenData or AIcrowd to host the competition, and instead required a private secure data enclave to ensure the safe use of sensitive and confidential data assets. The need for such an environment imposed capacity constraints on the competition, meaning only dozens of teams could be invited to take part, whereas on open platforms it is common to have thousands of teams competing and sharing ideas and code.</p>
<section id="competition-structure" class="level2">
<h2 class="anchored" data-anchor-id="competition-structure">Competition structure</h2>
<p>The competition ran over 10 months and consisted of three separate challenges: two interim, one final. Applications opened in September 2021, and the competition started in January 2022. Submission deadlines for the first and second interim challenges were in July and September 2022, respectively. For these rounds, participants submitted preliminary solutions for evaluation based solely on quantitative metrics, and two awards of $10,000 were given to the highest-scoring teams. The deadline for the final challenge was in October 2022. Here, solutions were evaluated by the scientific review board based on three judging criteria: quantitative metrics, transferability, and innovation. First, second, and third place winners received awards of $30,000, $1,500, and $1,000 respectively. Final presentations were given at the Food for Thought symposium in December 2022.</p>
<p>The competition was run entirely within the Coleridge Initiative’s Administrative Data Research Facility (ADRF), which was established by the United States Census Bureau to inform the decision-making of the Commission on Evidence-Based Policy under the Evidence Act. ADRF follows the Five Safes Framework: safe projects, safe people, safe data, safe settings, and safe outputs.</p>
<p>In keeping with this framework, participants were provided with ADRF login credentials after signing the relevant data use agreements during the onboarding process. All participants were required to agree to the ADRF terms of use, to complete security training, and to pass a security training assessment prior to accessing the challenge data. Participants’ access within ADRF was limited to the challenge environment and data only. There was no internet access, so Coleridge Initiative ensured that any packages requested by teams were available for use within the environment after passing security review. All codes and documentation were only allowed to be exported outside ADRF after export reviews from both Coleridge Initiative and USDA staff. At the end of each challenge, the teams submitted write-ups and supporting files by placing all the necessary submission files in their ADRF team folder. Detailed submission instructions are available via the <a href="https://github.com/realworlddatascience/realworlddatascience.github.io/tree/main/case-studies/posts/2023/08/21/_code">Real World Data Science GitHub repository</a>.</p>
</section>
<section id="metrics" class="level2">
<h2 class="anchored" data-anchor-id="metrics">Metrics</h2>
<p>Submissions were evaluated by Coleridge Initiative and technical review and subject review boards based on the following criteria:</p>
<ul>
<li><strong>Quantitative metrics</strong> were used to measure the predictive accuracy and runtime of the model.<br>
</li>
<li><strong>Transferability</strong> measured the quality of documentation and code, and the ability of individuals who are not involved in model development to replicate and implement the team’s approach.<br>
</li>
<li><strong>Innovation</strong> measured novelty and creativity of the model in addressing the linkage problem.</li>
</ul>
<p>Technical review was overseen by faculty members from computer science and engineering departments of top US universities. Subject review was handled by subject matter experts from USDA and Westat.</p>
<p>From a quantitative perspective, the most common way to evaluate machine learning competition submissions is to use model predictive accuracy. However, single metrics are typically incomplete descriptions of real-world tasks, and they can easily hide significant differences between models which simple predictive accuracy cannot capture. To select the most appropriate official challenge metrics, Coleridge Initiative reviewed the literature on the use of evaluation measures in both classification and ranking task machine learning competitions. Success at 5 (S@5) and Normalized Discounted Cumulative Gain at 5 (NDCG@5) scores were ultimately used as the quantitative metrics.</p>
<p>The metrics were applied as follows: models proposed by each team were tasked with outputting five potential FNDDS matches for each IRI code, ordered from most likely to least likely. S@5 and NDCG@5 are broadly similar – both measure whether a correct match is present in the five proposed matches. However, S@5 does not take rank position into account and only considers whether the five proposed FNDDS matches contain the correct FNDDS response, whereas NDCG@5 also measures how highly the correct FNDDS response is ranked among the five proposed matches. Both measures range from 0 to 1 (or 0% to 100%). Models get full credit for S@5 as long as the five proposed matches contain the correct FNDDS option; NDCG@5 penalizes models when the correct match is ranked lower on the list of five proposed matches.</p>
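<p>With a single correct match per IRI code (binary relevance, one relevant item, so the ideal DCG is 1), both metrics reduce to a few lines. A minimal sketch:</p>

```python
import math

def success_at_k(ranked, correct, k=5):
    """S@k: 1.0 if the correct match appears anywhere in the top k, else 0.0."""
    return 1.0 if correct in ranked[:k] else 0.0

def ndcg_at_k(ranked, correct, k=5):
    """NDCG@k with one relevant item: the gain is discounted by
    1/log2(rank + 1), so rank 1 scores 1.0 and lower ranks score less."""
    for rank, item in enumerate(ranked[:k], start=1):
        if item == correct:
            return 1.0 / math.log2(rank + 1)
    return 0.0
```

<p>For example, a model that lists the correct match second still gets S@5 = 1.0, but its NDCG@5 drops to 1/log<sub>2</sub>(3) ≈ 0.63.</p>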
</section>
<section id="technical-description" class="level2">
<h2 class="anchored" data-anchor-id="technical-description">Technical description</h2>
<section id="environment-setup" class="level5">
<h5 class="anchored" data-anchor-id="environment-setup">Environment setup</h5>
<p>Coleridge Initiative solicited technical requirements from participants at the challenge application stage to prepare the ADRF environment as much as possible before the competition began. Each team was asked to share anticipated workspace specifications and software library requests in their application package. From this we identified, reviewed, and installed the requested Python and R packages, libraries, and library components (e.g., pre-trained models, training data) that were not yet available within ADRF.</p>
<p>The setup of graphics processing units (GPUs) was also a critical part of competition preparation. We created an environment with 16 gibibyte (GiB) of GPU memory for each team. Our technology team met with multiple teams several times to discuss computing environment configurations to ensure the GPU could work properly. None of these efforts was wasted: without GPU access, it would be impossible for teams to use state-of-the-art pre-trained models such as the Bidirectional Encoder Representations from Transformers <span class="citation" data-cites="DBLP:journals/corr/abs-1810-04805">(BERT, Devlin et al. 2018)</span>.</p>
<p>We completed the setup of new team workspaces, each customized to the individual team’s resource and library requirements, including GPU configuration. The isolation and customization of workspaces was vital because teams may request different versions of libraries that potentially have version conflict with other libraries. We ensured the configurations were all set before the challenge began because such data challenges are bursty in nature <span class="citation" data-cites="macavaney_et_al_2021">(Macavaney et al. 2021)</span>, and handling support requests in the private data enclave risked causing delays. We hoped to avoid receiving too many requests in the beginning phase of the competition in order to give participants a better experience, though we did of course provide participants with instructions on how to request additional libraries during the challenge period.</p>
</section>
<section id="supporting-materials" class="level5">
<h5 class="anchored" data-anchor-id="supporting-materials">Supporting materials</h5>
<p>In addition to environment preparation, we made available a list of supporting documentation, including IRI, PPC, and FNDDS codebooks, technical reports, and related publications that could help teams understand the challenge datasets. The FNDDS codebook pooled information on variable availability, coding, and descriptions across dataset files and years. It also included internal Westat food category coding difficulty ratings and notes on created PPC codes and provided UPC code, EC code, and general dataset remarks and observations that may take time for analysts to discover on their own.</p>
<p>We developed a baseline model to demonstrate the challenge task and the expected outputs – both outside of ADRF using FNDDS and fictitious data in place of IRI data, and an analogous model using FNDDS and IRI data within the ADRF secure environment. Moreover, we provided the teams with an evaluation script to read in their submissions and evaluate them for predictive accuracy against the public test set using S@5 and NDCG@5 challenge metrics. Finally, we held multiple webinars during the course of the challenge to explain next steps, address participant questions, solicit feedback, and provide general support. Multiple teams also met with our technology team to clarify ADRF-related questions or troubleshoot technical issues.</p>
<p>(Baseline model, toolkits, and evaluation script are available from the <a href="https://github.com/realworlddatascience/realworlddatascience.github.io/tree/main/case-studies/posts/2023/08/21/_code">Real World Data Science GitHub repository</a>.)</p>
</section>
<section id="data-splitting" class="level5">
<h5 class="anchored" data-anchor-id="data-splitting">Data splitting</h5>
<p>To mimic the real-world scenario, the competition used the 2012–2016 IRI data as the training set and the 2017–2018 IRI data as the test set, since the data change over time and USDA could provide the most recent data available. To make sure that models were generalizable and not simply overfit to the test set, we split the test set into private and public test sets, guaranteeing that the models were evaluated on completely hidden data. To keep the distributions of the two sets similar, we first divided the data into five quintiles based on EC code frequencies and then randomly sampled 80% of records in each group without repetition for placement into the private test set. Later in the competition, because of computing limits, we further shrank the private test set to 40% of its original size using the same data-splitting method.</p>
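<p>The splitting scheme described above – binning EC codes by frequency, then sampling within each bin – can be sketched as follows. This is an illustrative reconstruction, not the competition’s actual script; the record structure and function names are assumed:</p>

```python
import math
import random
from collections import Counter

def stratified_private_split(records, ec_of, frac=0.8, n_bins=5, seed=0):
    """Split records into (private, public) test sets while roughly
    preserving the EC-code frequency distribution: bin EC codes into
    frequency quantiles, then sample `frac` of each bin's records
    without replacement into the private set."""
    rng = random.Random(seed)
    freq = Counter(ec_of(r) for r in records)
    ecs = sorted(freq, key=lambda ec: (freq[ec], ec))  # rank ECs by frequency
    per_bin = math.ceil(len(ecs) / n_bins)
    bins = [set(ecs[i * per_bin:(i + 1) * per_bin]) for i in range(n_bins)]
    private, public = [], []
    for b in bins:
        group = [r for r in records if ec_of(r) in b]
        rng.shuffle(group)
        cut = int(len(group) * frac)
        private.extend(group[:cut])
        public.extend(group[cut:])
    return private, public
```

<p>Shrinking the private set to 40% of its original size would then just be a second pass with a smaller sampling fraction.</p>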
</section>
<section id="judging" class="level5">
<h5 class="anchored" data-anchor-id="judging">Judging</h5>
<p>In the first two rounds, submissions were evaluated on the quantitative metrics described above. Coleridge Initiative was responsible for running the evaluation script – making sure not to re-train the model or modify the configs in any way, and only applying the model to predict the private test set. Prediction results were then compared against the ground truth to obtain the private scores.</p>
<p>The final challenge was reviewed by the scientific review board on all three judging criteria. Submitted models were first evaluated by Coleridge Initiative in the same way as in the first two rounds. The runtime of each model was also recorded as an assessment of model cost. The scientific review board then assessed the models on the quality of documentation, the quality of code, and the ability to replicate and implement the team’s approach, and scored the models for innovation and creativity in addressing the linkage problem. Lastly, scores were summarized and the scientific review board discussed and decided the winners of the competition.</p>
</section>
</section>
<section id="results" class="level2">
<h2 class="anchored" data-anchor-id="results">Results</h2>
<p>The next few articles in this collection walk readers through the solutions proposed by competition finalists. Figure 3 provides a brief summary.</p>
<p><a href="images/pt2-fig3.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt2-fig3.png" class="img-fluid" width="700"></a></p>
<div class="figure-caption">
<p><strong>Figure 3:</strong> Top competitors and their solutions to the Food for Thought challenge.</p>
</div>
</section>
<section id="lessons-learned" class="level2">
<h2 class="anchored" data-anchor-id="lessons-learned">Lessons learned</h2>
<p>It was undoubtedly challenging for teams to work with highly secured data in a private data enclave for this data challenge. We solicited feedback from teams and summarized the issues we experienced throughout the competition, together with the solutions used to resolve them. Below are our main lessons learned; we hope this summary can serve to inform future competitions.</p>
<ul>
<li><p><strong>Environmental factors:</strong> The installation and setup of packages, libraries, and resources, as well as the configuration of GPUs, system dependencies, and workspace design were expected to take a long time as each team had their own needs. To accelerate the process, we requested a list of specific package and environment requirements from the teams in advance. However, due to the complexity of the system configuration required by the teams, environment setup took longer than expected. Thus, the challenge deadlines had to be postponed a few times to accommodate this.</p></li>
<li><p><strong>Time commitment:</strong> Twelve teams were selected to participate in the challenge, but only three remained by the final round. One team was disqualified for violating the ADRF terms-of-use agreement; the other eight dropped out because of competing commitments and insufficient time to participate meaningfully. For security reasons, ADRF does not allow jobs to run in the background, which further adds to the time teams must commit. To encourage participation in the final challenge, we offered additional awards for second and third place.</p></li>
<li><p><strong>Computing resource limit:</strong> One issue encountered in evaluating submitted models was the computing resource limits imposed by the secure data enclave. The original private test dataset is four times larger than the public test dataset, which made full evaluation infeasible. Given the fixed resource constraints, we decided to reduce the private test set to 40% of its original size. In hindsight, it would have been helpful to set a model runtime limit at the outset, so that participants could build simpler yet still effective models.</p></li>
<li><p><strong>Supporting code:</strong> Although the initial baseline model we provided was extremely simple, it helped participants a great deal in the initial phase – yet there is room for improvement. Specifically, supporting code should use all relevant data tables and should specify the main function for running the code, especially how the model should be tested. The teams trained only on the main table – the only table used in the baseline model – and did not touch the other supporting table. Had we included that table in the baseline model, it could have helped participants make better use of the data. In addition, a baseline model should be intuitive for participants to follow, allowing evaluators to swap the public test set for the private test set without any programming modifications.</p></li>
</ul>
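<p>The test-set reduction described above can be sketched as a seeded random subsample. This is a minimal illustration under assumed details, not the Coleridge Initiative’s actual evaluation code: fixing the seed keeps the reduced set identical across evaluation runs, so every team is scored on the same 40% of records.</p>

```python
import random

def subsample(rows, fraction=0.4, seed=42):
    """Return a reproducible random subset of the test set.

    A fixed seed makes the reduced set identical across runs, so all
    submitted models are evaluated on exactly the same records.
    """
    rng = random.Random(seed)
    k = int(len(rows) * fraction)
    return rng.sample(rows, k)

test_rows = list(range(1000))  # stand-in for the private test records
reduced = subsample(test_rows)
print(len(reduced))  # 400
```

<p>A stratified sample (e.g., preserving the mix of product categories) would be a natural refinement if the linkage difficulty varies across strata.</p>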
<div class="nav-btn-container">
<div class="grid">
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/01-purchase-to-plate.html">← Part 1: Purchase to Plate</a></p>
</div>
</div>
<div class="g-col-12 g-col-sm-6">
<div class="nav-btn">
<p><a href="../../../../../../applied-insights/case-studies/posts/2023/08/21/03-first-place-winners.html">Part 3: First place winners →</a></p>
</div>
</div>
</div>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Zheyuan Zhang</strong> and <strong>Uyen Le</strong> are research scientists at the Coleridge Initiative.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Zheyuan Zhang and Uyen Le
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1" style="height:22px!important;vertical-align:text-bottom;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1" style="height:22px!important;margin-left:3px;vertical-align:text-bottom;"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Zhang, Zheyuan, and Uyen Le. 2023. “Food for Thought: Competition and challenge design.” Real World Data Science, August 21, 2023. <a href="https://realworlddatascience.net/the-pulse/case-studies/posts/2023/08/21/02-competition-design.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>



</section>

<div id="quarto-appendix" class="default"><section class="quarto-appendix-contents" id="quarto-bibliography"><h2 class="anchored quarto-appendix-heading">References</h2><div id="refs" class="references csl-bib-body hanging-indent">
<div id="ref-DBLP:journals/corr/abs-1810-04805" class="csl-entry">
Devlin, J., M.-W. Chang, K. Lee, and K. Toutanova. 2018. <span>“<span>BERT:</span> Pre-Training of Deep Bidirectional Transformers for Language Understanding.”</span> <em>CoRR</em> abs/1810.04805. <a href="http://arxiv.org/abs/1810.04805">http://arxiv.org/abs/1810.04805</a>.
</div>
<div id="ref-macavaney_et_al_2021" class="csl-entry">
Macavaney, S., A. Mittu, G. Coppersmith, J. Leintz, and P. Resnik. 2021. <em>Community-Level Research on Suicidality Prediction in a Secure Environment: Overview of the CLPsych 2021 Shared Task</em>. In Proceedings of the Seventh Workshop on Computational Linguistics and Clinical Psychology: Improving Access.
</div>
</div></section></div> ]]></description>
  <category>Machine learning</category>
  <category>Natural language processing</category>
  <category>Public policy</category>
  <category>Health and wellbeing</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/02-competition-design.html</guid>
  <pubDate>Mon, 21 Aug 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/08/21/images/pt2-intro.png" medium="image" type="image/png" height="105" width="144"/>
</item>
<item>
  <title>The road to reproducible research: hazards to avoid and tools to get you there safely</title>
  <dc:creator>Davit Svanidze, Andre Python, et al.</dc:creator>
  <link>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/road-to-reproducible-research.html</link>
  <description><![CDATA[ 





<p>Reproducibility, or “<a href="https://www.nsf.gov/sbe/AC_Materials/SBE_Robust_and_Reliable_Research_Report.pdf">the ability of a researcher to duplicate the results of a prior study using the same materials as the original investigator</a>”, is critical for sharing and building upon scientific findings. Reproducibility not only verifies the correctness of processes leading to results but also serves as a prerequisite for assessing generalisability to other datasets or contexts. This we refer to as replicability, or “<a href="https://www.nsf.gov/sbe/AC_Materials/SBE_Robust_and_Reliable_Research_Report.pdf">the ability of a researcher to duplicate the results of a prior study if the same procedures are followed but new data are collected</a>”. Reproducibility, which is the focus of our work here, can be challenging – especially in the context of deep learning. This article, and associated material, aims to provide practical advice for overcoming these challenges.</p>
<p>Our story begins with Davit Svanidze, a master’s degree student in economics at the London School of Economics (LSE). Davit’s efforts to make his bachelor’s thesis reproducible inspired this article, and we hope readers will be able to learn from his experience and apply those lessons to their own work. Davit will demonstrate the use of Jupyter notebooks, GitHub, and other relevant tools to ensure reproducibility. He will walk us through code documentation, data management, and version control with Git. And he will share best practices for collaboration, peer review, and dissemination of results.</p>
<p>Davit’s story starts here, but there is much more for the interested reader to discover. At certain points in this article, we will direct readers to other resources, namely a <a href="https://github.com/dsvanidze/replicability/blob/master/notebooks/main.ipynb">Jupyter notebook</a> and <a href="https://github.com/dsvanidze/replicability">GitHub repository</a> which contain all the instructions, data and code necessary to reproduce Davit’s research. Together, these components offer a comprehensive overview of the thought process and technical implementation required for reproducibility. While there is no one-size-fits-all approach, the principles remain consistent.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/computer.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A young man sits in front of a computer keyboard, surrounded by monitors and books and with computer cables covering various surfaces"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Credit:</strong> Discord software, Midjourney bot.</p>
</div>
<section id="davits-journey-towards-reproducibility" class="level2">
<h2 class="anchored" data-anchor-id="davits-journey-towards-reproducibility">Davit’s journey towards reproducibility</h2>
<section id="more-power-please" class="level3">
<h3 class="anchored" data-anchor-id="more-power-please">More power, please</h3>
<p>The focus of my bachelor’s thesis was to better understand the initial spread of Covid-19 in China using deep learning algorithms. I was keen to make my work reproducible, but not only for my own sake. The “reproducibility crisis” is a well-documented problem in science as a whole,<sup>1</sup> <sup>2</sup> <sup>3</sup> <sup>4</sup> with studies suggesting that around one-third of social science studies published between 2010 and 2015 in top journals like <em>Nature</em> and <em>Science</em> could not be reproduced.<sup>5</sup> Results that cannot be reproduced are not necessarily “wrong”. But if findings cannot be reproduced, we cannot be sure of their validity.</p>
<p>For <a href="https://github.com/dsvanidze/replicability/blob/master/notebooks/main.ipynb">my own research project</a>, I gathered all the data and started working on my computer. Once I had built the algorithms and began training the models, my first challenge to reproducibility was computational: training models on my local computer was taking far too long, and I needed a faster, more powerful solution to submit my thesis on time. Fortunately, I could access the university server to train the algorithms. Once training was complete, I could generate the results on my local computer, since producing maps and tables was not so demanding. However…</p>
</section>
<section id="bloody-paths" class="level3">
<h3 class="anchored" data-anchor-id="bloody-paths">Bloody paths!</h3>
<p>In switching between machines and computing environments, I soon encountered an issue with my code: the <a href="https://en.wikipedia.org/wiki/Path_(computing)">paths</a>, or file directory locations, for the trained algorithms had been hardcoded! As I quickly discovered, hardcoding a path can lead to issues when the code is run in a different environment, as the path might not exist in the new environment.</p>
<p>As my code grew longer, I overlooked the path names linked to the algorithms that were generating the results. This mistake – which would have been easily corrected if spotted earlier – produced incorrect outputs. Such errors could have enormous negative implications in a public health context, where evidence-based decisions have real impacts on human lives. It was at this point that I realised that my code was the fundamental pillar of the validity of my empirical work. How can someone trust my work if they are not able to verify it?</p>
<p>The following dummy code demonstrates the hardcoding issue:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb1" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb1-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb1-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Hardcoded path</span></span>
<span id="cb1-3">file_path <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"/user/notebooks/toydata.csv"</span></span>
<span id="cb1-4"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">try</span>:</span>
<span id="cb1-5">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(file_path) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>:</span>
<span id="cb1-6">        data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>.read()</span>
<span id="cb1-7">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(data)</span>
<span id="cb1-8"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">except</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">FileNotFoundError</span>:</span>
<span id="cb1-9">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"File not found"</span>)</span>
<span id="cb1-10"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/hardcoded-paths-1.gif" class="img-fluid"></p>
<p>In the code above, a dummy file (<code>toydata.csv</code>) is used. The dummy file contains data on the prices of three different toys, but only the path of the file is relevant to this example. If the hardcoded file path – <code>"/user/notebooks/toydata.csv"</code> – exists on the machine being used, the code will run just fine. But when run in a different environment without that path, the code will raise a <code>FileNotFoundError</code> and print <code>"File not found"</code>. Better code that uses relative paths can be written as:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb2" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb2-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb2-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Relative path</span></span>
<span id="cb2-3"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb2-4"></span>
<span id="cb2-5">file_path <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> os.path.join(os.getcwd(), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"toydata.csv"</span>)</span>
<span id="cb2-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">try</span>:</span>
<span id="cb2-7">    <span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(file_path) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>:</span>
<span id="cb2-8">        data <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">file</span>.read()</span>
<span id="cb2-9">        <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(data)</span>
<span id="cb2-10"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">except</span> <span class="pp" style="color: #AD0000;
background-color: null;
font-style: inherit;">FileNotFoundError</span>:</span>
<span id="cb2-11">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"File not found"</span>)</span>
<span id="cb2-12"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/hardcoded-paths-2.gif" class="img-fluid"></p>
<p>You can see that this code has successfully imported data from the dataset <code>toydata.csv</code> and printed its two columns (toy and price) and three rows.</p>
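<p>As an aside (not part of the original thesis code), the same relative-path idea can also be expressed with Python’s <code>pathlib</code> module, a commonly recommended alternative to <code>os.path</code>:</p>

```python
from pathlib import Path

# Build the path relative to the current working directory,
# so the code works unchanged on any machine.
file_path = Path.cwd() / "toydata.csv"
try:
    data = file_path.read_text()
    print(data)
except FileNotFoundError:
    print("File not found")
```

<p><code>Path</code> objects overload <code>/</code> for joining, handle platform-specific separators automatically, and offer convenience methods such as <code>read_text()</code>, which makes path-handling mistakes easier to spot.</p>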
<p>The following example is a simplified version of what happened when I wrote code to train several models, store the results and run a procedure to compare results with the predictive performance of a benchmark model:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb3" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb3-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb3-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set an arbitrary predictive performance value of a benchmark model</span></span>
<span id="cb3-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># and accept/reject models if the results are above/below the value.</span></span>
<span id="cb3-4">benchmark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb3-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set the model details in one place for a better overview</span></span>
<span id="cb3-6">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb3-7">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"simple"</span>}, </span>
<span id="cb3-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model2"</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model2"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"complex"</span>}</span>
<span id="cb3-9">}</span>
<span id="cb3-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set the current model to "model1" to use it for training and check its results</span></span>
<span id="cb3-11">current_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>]</span>
<span id="cb3-12"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train a simple model for "model1" and a complex model for "model2"</span></span>
<span id="cb3-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Training result of the "model1" is 30 and for "model2" is 70</span></span>
<span id="cb3-14">model_structure <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>])</span>
<span id="cb3-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Save the model and its result in a .csv file</span></span>
<span id="cb3-16">model_structure.to_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/all/notebooks/results-of-model1.csv'</span>, index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb3-17"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb4" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb4-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb4-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the model result and compare with benchmark</span></span>
<span id="cb4-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Model name: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>]))</span>
<span id="cb4-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Model type: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>]))</span>
<span id="cb4-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the result of the current model</span></span>
<span id="cb4-6">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">'/all/notebooks/results-of-model2.csv'</span>).iloc[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb4-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Result: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(result))</span>
<span id="cb4-8"></span>
<span id="cb4-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> benchmark:</span>
<span id="cb4-10">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\033</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">[3;32m&gt;&gt;&gt; Result is better than the benchmark -&gt; Accept the model and use it for calculations"</span>)</span>
<span id="cb4-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb4-12">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\033</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">[3;31m&gt;&gt;&gt; Result is NOT better than the benchmark -&gt; Reject the model as it is not optimal"</span>)</span>
<span id="cb4-13"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/hardcoded-paths-3.gif" class="img-fluid"></p>
<p>Everything looks fine at a glance. But if you examine the code carefully, you may spot the problem. When I first coded the procedure (training the model, saving and loading the results), I hardcoded the paths and had to change them for each tested model. First, I trained <code>model2</code>, a complex model, and tested it against the benchmark (70 &gt; 50 → accepted). I then repeated the procedure for <code>model1</code> (a simple model). Its result appeared identical to that of <code>model2</code>, so I kept <code>model1</code>, following the <a href="https://www.sciencedirect.com/topics/computer-science/parsimony-principle">parsimony principle</a>.</p>
<p>However, in the line that loads the result for the current model (line 5 of the second cell), I forgot to amend the path and so mistakenly loaded the result of <code>model2</code>. As a consequence, I accepted a model that should have been rejected. These wrong results then propagated through the rest of the code, including all the charts and maps and the conclusions of my analysis.</p>
<p>A small coding error like this can therefore be fatal to an analysis. Below is the corrected code:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb5" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb5-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb5-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> os</span>
<span id="cb5-3"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set an arbitrary predictive performance value of a benchmark model</span></span>
<span id="cb5-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># and accept/reject models if the results are above/below the value.</span></span>
<span id="cb5-5">benchmark <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">50</span></span>
<span id="cb5-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set the model details (INCLUDING PATHS) in one place for a better overview</span></span>
<span id="cb5-7">model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> {</span>
<span id="cb5-8">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"simple"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path"</span>: os.path.join(os.getcwd(), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"results-of-model1.csv"</span>)}, </span>
<span id="cb5-9">    <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model2"</span>: {<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model2"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>: <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"complex"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path"</span>: os.path.join(os.getcwd(), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"results-of-model2.csv"</span>)}</span>
<span id="cb5-10">}</span>
<span id="cb5-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Set the current model to "model1" to use it for training and check its results</span></span>
<span id="cb5-12">current_model <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"model1"</span>]</span>
<span id="cb5-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Train a simple model for "model1" and a complex model for "model2"</span></span>
<span id="cb5-14"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Training result of the "model1" is 30 and for "model2" is 70</span></span>
<span id="cb5-15">model_structure <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> train(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>])</span>
<span id="cb5-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Save the model and its result in a .csv file</span></span>
<span id="cb5-17">model_structure.to_csv(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path"</span>], index<span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span><span class="va" style="color: #111111;
background-color: null;
font-style: inherit;">False</span>)</span>
<span id="cb5-18"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb6" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb6-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb6-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Get the model result and compare with the benchmark</span></span>
<span id="cb6-3"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Model name: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"name"</span>]))</span>
<span id="cb6-4"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Model type: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"type"</span>]))</span>
<span id="cb6-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Load the result of the current model WITH a VARIABLE PATH</span></span>
<span id="cb6-6">result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> pd.read_csv(current_model[<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"path"</span>]).iloc[<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>, <span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>]</span>
<span id="cb6-7"><span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"Result: </span><span class="sc" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">{}</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span>.<span class="bu" style="color: null;
background-color: null;
font-style: inherit;">format</span>(result))</span>
<span id="cb6-8"></span>
<span id="cb6-9"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">if</span> result <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">&gt;</span> benchmark:</span>
<span id="cb6-10">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\033</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">[3;32m&gt;&gt;&gt; Result is better than the benchmark -&gt; Accept the model and use it for calculations"</span>)</span>
<span id="cb6-11"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">else</span>:</span>
<span id="cb6-12">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\033</span><span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">[3;31m&gt;&gt;&gt; Result is NOT better than the benchmark -&gt; Reject the model as it is not optimal"</span>)</span>
<span id="cb6-13"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/hardcoded-paths-4.gif" class="img-fluid"></p>
<p>Here, the paths are stored with the other model details (lines 7–8, first cell), so we can use them as variables whenever we need them (e.g., line 16, first cell, and line 5, second cell). Now, when the current model is set to <code>model1</code> (line 11, first cell), everything is adjusted automatically. Likewise, if the path details need to change, we only need to change them once and everything else is updated automatically. The code now correctly states that <code>model1</code> performs worse than the benchmark and is therefore rejected, so we should keep <code>model2</code>, which performs better.</p>
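<p>Stripped of the notebook styling, the pattern is a small one. The sketch below (with illustrative model and file names, not the exact code from the cells above) shows the same idea: store each path once, alongside the other model details, and derive everything else from the dictionary:</p>

```python
import os

# Store each model's details, including its output path, in one place
models = {
    "model1": {"name": "Model 1", "type": "simple",
               "path": os.path.join(os.getcwd(), "results-of-model1.csv")},
    "model2": {"name": "Model 2", "type": "complex",
               "path": os.path.join(os.getcwd(), "results-of-model2.csv")},
}

# Switching models is now a one-line change; every path follows automatically
current_model = models["model1"]
print(current_model["path"])
```

Because the path is looked up through <code>current_model</code>, training, saving, and loading all stay consistent with whichever model is selected.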
<p>I managed to catch this error in time, but it can often be difficult to spot our own mistakes. That is why making code available to others is crucial: a code review by a second (or third) pair of eyes can save everyone a lot of time and prevent incorrect results and conclusions from spreading.</p>
</section>
<section id="solving-compatibility-chaos-with-docker" class="level3">
<h3 class="anchored" data-anchor-id="solving-compatibility-chaos-with-docker">Solving compatibility chaos with Docker</h3>
<p>One might think it would be easy to copy code from one computer to another and run it without difficulty, but it turns out to be a real headache. Different operating systems on my local computer and the university server caused multiple compatibility issues, which were very time-consuming to resolve. The university server ran Ubuntu, a Linux distribution, which was not compatible with my macOS-based code editor. Moreover, the server did not support the Python programming language – and all the deep learning packages that I needed – in the same way as my macOS computer did.</p>
<p>As a remedy, I used Docker containers, which allowed me to create a virtual environment with all the necessary packages and dependencies installed. This way, I could deploy the same environment on different hardware and make use of its processing power. To get started with Docker, I first had to install it on my local computer. The installation process is straightforward and <a href="https://docs.docker.com/desktop/">the Docker website</a> provides step-by-step instructions for different operating systems. In fact, I found the Docker website very helpful, with lots of resources and tutorials available. Once Docker was installed, it was easy to create virtual environments for my project and work with my code, libraries, and packages without any compatibility issues. Not only did Docker containers save me a lot of time and effort, but they could also make it easier for others to reproduce my work.</p>
<p>Below is an example of a Dockerfile that recreates an environment with Python 3.7 on Linux. It specifies which operations should be carried out, how, and in what order, to generate the environment with all the Python packages required to run the main Python script, <code>main.py</code>.</p>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/Dockerfile-example.png" class="img-fluid" alt="An example of a Dockerfile, showing the various steps required to recreate the correct environment for running Python file, main.py."></p>
<div class="figure-caption">
<p>An example of a Dockerfile.</p>
</div>
<p>In this example, by downloading the project, including the Dockerfile, anyone can run <code>main.py</code> without installing packages or worrying about what OS was used for development or which Python version should be installed. You can view Docker as a great robot chef: show it a recipe (Dockerfile), provide the ingredients (project files), push the start button (to build the container) and wait to sample the results.</p>
</section>
<section id="why-does-nobody-check-your-code" class="level3">
<h3 class="anchored" data-anchor-id="why-does-nobody-check-your-code">Why does nobody check your code?</h3>
<p>Even after implementing Docker, I still faced another challenge to reproducibility: making the verification process for my code easy enough that it could be done by anyone, without them needing a degree in computer science! Increasingly, there is an expectation for researchers to share their code so that results can be reproduced, but there are as yet no widely accepted or enforced standards on how to make code readable and reusable. However, if we are to embrace the concept of reproducibility, we must write and publish code under the assumption that someone, somewhere – boss, team member, journal reviewer, reader – will want to rerun our code. And, if we expect that someone will want to rerun our code (and hopefully check it), we should ensure that the code is readable and does not take too long to run.</p>
<p>If your code <em>does</em> take too long to run, some operations can often be accelerated – for example, by reducing the size of the datasets or by implementing computationally efficient data processing approaches (e.g., using <a href="https://pytorch.org/">PyTorch</a>). Aim for a running time of a few minutes – about as long as it takes to make a cup of tea or coffee. Of course, if data needs to be reduced to save computational time, the person rerunning your code won’t generate the same results as in your original analysis, so this will not achieve reproducibility <em>sensu stricto</em>. However, as long as you state clearly what the expected results from the reduced dataset are, your peers can at least inspect your code and offer feedback, and this marks a step towards reproducibility.</p>
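<p>Even a reduced dataset can be drawn reproducibly. The sketch below is illustrative (the array and sample size are made up): it uses a seeded generator so that every rerun sees the same 500-row subsample, and the reduced-data results are therefore stable across machines:</p>

```python
import numpy as np

# Pretend this is a large dataset: 10,000 rows, 5 features
rng = np.random.default_rng(seed=0)
data = rng.normal(size=(10_000, 5))

# Draw a fixed, seeded 5% subsample so every rerun sees the same reduced data
subsample_rng = np.random.default_rng(seed=42)
idx = subsample_rng.choice(len(data), size=500, replace=False)
reduced = data[idx]

print(reduced.shape)  # (500, 5)
```

Stating the subsample seed alongside the expected results lets reviewers verify the quick version of the analysis exactly.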
<p>We should also make sure our code is free from bugs – both the kind that might lead to errors in analysis and also those that stop the code running to completion. Bugs can occur for various reasons. For example, some code chunks written on a Windows machine may not properly execute on a macOS machine because the former uses <code>\</code> for file paths, while the latter uses <code>/</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb7" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb7-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb7-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Path works on macOS/Linux</span></span>
<span id="cb7-3"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"../../all/notebooks/toydata.csv"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb7-4">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f.read())</span>
<span id="cb7-5"></span>
<span id="cb7-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Path works only on Windows    </span></span>
<span id="cb7-7"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(<span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">r"</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">..</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\.</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">.</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\a</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">ll</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\n</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">otebooks</span><span class="ch" style="color: #20794D;
background-color: null;
font-style: inherit;">\t</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">oydata</span><span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">.</span><span class="vs" style="color: #20794D;
background-color: null;
font-style: inherit;">csv"</span>, <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb7-8">   <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f.read())</span>
<span id="cb7-9"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/path-cross-platform-compatibility-1.gif" class="img-fluid"></p>
<p>Here, only the macOS/Linux version works, since the code this capture was taken from was implemented on a Linux server. There are alternatives, however. The code below works on macOS, Linux, and also Windows machines:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb8" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb8-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb8-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">from</span> pathlib <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> Path</span>
<span id="cb8-3"></span>
<span id="cb8-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Path works on every OS: macOS/Linux/Windows</span></span>
<span id="cb8-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># It will automatically replace the path to "..\..\all\notebooks\toydata.csv" when it runs on Windows</span></span>
<span id="cb8-6"><span class="cf" style="color: #003B4F;
background-color: null;
font-weight: bold;
font-style: inherit;">with</span> <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">open</span>(Path(<span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"../../all/notebooks/toydata.csv"</span>), <span class="st" style="color: #20794D;
background-color: null;
font-style: inherit;">"r"</span>) <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> f:</span>
<span id="cb8-7">    <span class="bu" style="color: null;
background-color: null;
font-style: inherit;">print</span>(f.read())</span>
<span id="cb8-8"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/path-cross-platform-compatibility-2.gif" class="img-fluid"></p>
<p>The <code>pathlib</code> module is part of Python’s standard library, so it needs no extra installation – and even this step is of course unnecessary if you build a Docker container for your project, as discussed in the previous section.</p>
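<p>As an aside, <code>pathlib</code> can also build paths from components, so no separator character appears in the code at all; a brief sketch using the same illustrative file name:</p>

```python
from pathlib import Path

# Build the path from components; pathlib inserts the correct
# separator for whichever OS the code runs on
p = Path("..") / ".." / "all" / "notebooks" / "toydata.csv"
print(p)  # forward slashes on macOS/Linux, backslashes on Windows
```

This avoids both the hard-coded <code>/</code> versus <code>\</code> problem and the temptation to paste absolute paths.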
</section>
<section id="jupyter-king-of-the-notebooks" class="level3">
<h3 class="anchored" data-anchor-id="jupyter-king-of-the-notebooks">Jupyter, King of the Notebooks</h3>
<p>By this stage in my project, I was feeling that I’d made good progress towards ensuring that my work would be reproducible. I’d expended a lot of effort to make my code readable, efficient, and also absent of bugs (or, at least, this is what I was hoping for). I’d also built a Docker container to allow others to replicate my computing environment and rerun the analysis. Still, I wanted to make sure there were no barriers that would prevent people – my supervisors, in particular – from being able to review the work I had done for my undergraduate thesis. What I wanted was a way to present a complete narrative of my project that was easy to understand and follow. For this, I turned to Jupyter Notebook.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/jupyter.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="A rendering of the god Jupiter, holding a pencil and sat in front of an open laptop computer"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Credit:</strong> Discord software, Midjourney bot.</p>
</div>
<p>Jupyter notebooks combine <a href="https://www.markdownguide.org/cheat-sheet/">Markdown text</a>, code, and visualisations. The notebook itself can sit within an online directory of folders and files that contain all the data and code related to a project, allowing readers to understand the processes behind the work and also access the raw resources. From <a href="https://github.com/dsvanidze/replicability/blob/master/notebooks/main.ipynb">the notebook I produced</a>, readers can see exactly what I did, how I did it, and what my results were.</p>
<p>While creating my notebook, I was able to experiment with my code and iterate quickly. Code cells within a document can be run interactively, which allowed me to try out different approaches to solving a problem and see the results almost in real time. I could also get feedback from others and try out new ideas without having to spend a lot of time writing and debugging code.</p>
</section>
<section id="version-control-with-git-and-github" class="level3">
<h3 class="anchored" data-anchor-id="version-control-with-git-and-github">Version control with Git and GitHub</h3>
<p>My Jupyter notebook and associated folders and files are all available via <a href="https://github.com/dsvanidze/replicability">GitHub</a>. <a href="https://git-scm.com/">Git</a> is a version control system that allows you to keep track of changes to your code over time, while GitHub is a web-based platform that provides a central repository for storing and sharing code. With Git and GitHub, I was able to version my code and collaborate with others without the risk of losing any work. I really couldn’t afford to redo the entire year I spent on my dissertation!</p>
<p>Git and GitHub are great for reproducibility. By sharing code via these platforms, others can access your work, verify it and reproduce your results without risking changing or, worse, destroying your work – whether partially or completely. These tools also make it easy for others to build on your work if they want to develop your research further. You can also use Git and GitHub to share or promote your results across a wider community. Storing and sharing your code in this way also makes it simple to keep track of the different versions of your code and to see how your work has evolved.</p>
<p>The following illustration shows the tracking of very simple changes in a Python file. The previous version of the code is shown on the left; the new version is shown on the right. Additions and deletions are highlighted in green and red, and with <code>+</code> and <code>-</code> symbols, respectively.</p>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/Git-example.png" class="img-fluid" alt="A simple example of GitHub version tracking, showing how changes to a file are tracked and highlighted"></p>
<div class="figure-caption">
<p>A simple example of GitHub version tracking.</p>
</div>
</section>
</section>
<section id="the-deep-learning-challenge" class="level2">
<h2 class="anchored" data-anchor-id="the-deep-learning-challenge">The deep learning challenge</h2>
<p>So far, this article has dealt with barriers to reproducibility – and ways around them – that will apply to most, if not all, modern research projects. While I’d encourage any scientist to adopt these practices in their own work, it is important to stress that these alone cannot guarantee reproducibility. In cases where standard statistical procedures are used within statistical software packages, reproducibility is often achievable. However, in reality, even when following the same procedures, differences in outputs can occur, and identifying the reasons for this may be challenging. Cooking offers a simple analogy: subtle changes in room temperature or ingredient quality from one day to the next can impact the final product.</p>
<p>One of the challenges for research projects employing machine learning and deep learning algorithms is that outputs can be influenced by the randomness that is inherent in these approaches. Consider the four portraits below, generated by the Midjourney bot.</p>
<div class="quarto-figure quarto-figure-center">
<figure class="figure">
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/DL_bkg.png" class="img-fluid quarto-figure quarto-figure-center figure-img" alt="Four portraits rendered by generative AI. Each portrait looks broadly similar, but there are noticeable differences in facial features and the abstract patterns overlaid on the portraits"></p>
</figure>
</div>
<div class="figure-caption" style="text-align: center;">
<p><strong>Credit:</strong> Discord software, Midjourney bot.</p>
</div>
<p>Each portrait looks broadly similar at first glance. However, upon closer inspection, critical differences emerge. These differences arise because deep learning models rely on numerous interconnected layers to learn intricate patterns and representations. Slight random perturbations, such as initial parameter values or changes in data samples, can propagate through the network, leading to different decisions during the learning process. As a result, even seemingly negligible randomness can amplify and manifest as considerable differences in the final output, as with the distinct features of the portraits.</p>
<p>Randomness is not necessarily a bad thing – it mitigates overfitting and helps predictions to be generalised. However, it does present an additional barrier to reproducibility. If you cannot get the same results using the same raw materials – data, code, packages and computing environment – then you might have good reasons to doubt the validity of the findings.</p>
<p>There are many elements of an analysis in which randomness may be present and lead to different results. For example, in a classification (where your dependent variable is binary, e.g., success/failure coded as 1 and 0) or a regression (where your dependent variable is continuous, e.g., temperature measurements of 10.1°C, 2.8°C, etc.), you might need to split your data into training and testing sets. The training set is used to estimate the model (hyper)parameters and the testing set is used to compute the performance of the model. The split is usually operationalised as a random selection of rows of your data. So, in principle, each time you split your data into training and testing sets, you may end up with different rows in each set. Differences in the training set may therefore lead to different values of the model (hyper)parameters and affect the predictive performance that is measured from the testing set. Likewise, differences in the testing set may lead to variations in the predictive performance scores, which in turn lead to potentially different interpretations and, ultimately, different decisions if the results are used for that purpose.</p>
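<p>To make this concrete, here is a minimal sketch of a seeded train/test split using only <code>numpy</code> (scikit-learn users would typically pass a <code>random_state</code> to <code>train_test_split</code> instead); the helper function and data size are illustrative:</p>

```python
import numpy as np

def train_test_split_indices(n_rows, test_fraction=0.2, seed=0):
    """Return reproducible (train, test) row indices for a dataset of n_rows."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_rows)  # seeded shuffle of row indices
    n_test = int(n_rows * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

train_idx, test_idx = train_test_split_indices(100, seed=0)
# The same seed always yields the same split, machine to machine
train_again, test_again = train_test_split_indices(100, seed=0)
print((test_idx == test_again).all())  # True
```

With the seed fixed, anyone rerunning the analysis trains and evaluates on exactly the same rows.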
<p>This aspect of randomness in the training of models is relatively well known. But randomness may hide in other parts of code. One such example is illustrated below. Here, using Python, we set the seed to 0 using <code>np.random.seed(0)</code>. The <code>random.seed()</code> function from the package <code>numpy</code> (abbreviated <code>np</code>) fixes the state of the random number generator so that it creates identical random numbers independently of the machine you use, for any number of executions. A seed value is an initial input, or starting point, used by a pseudorandom number generator to generate a sequence of random numbers. It is often an integer or a timestamp. The generator takes this seed value and uses it to produce a deterministic series of numbers that appear to be random but can be recreated by using the same seed value. Without a seed value, the generator is typically initialised from the current system time. The animation below generates two random arrays, <code>arr1</code> and <code>arr2</code>, using <code>np.random.rand(3,2)</code>, where the values <code>3,2</code> indicate that we want an array with 3 rows and 2 columns.</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb9" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb9-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb9-2"><span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">import</span> numpy <span class="im" style="color: #00769E;
background-color: null;
font-style: inherit;">as</span> np</span>
<span id="cb9-3"></span>
<span id="cb9-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Set the seed number e.g. to 0</span></span>
<span id="cb9-5">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb9-6"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate random array</span></span>
<span id="cb9-7">arr1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb9-8"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print("Array 1:")</span></span>
<span id="cb9-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print(arr1)</span></span>
<span id="cb9-10"></span>
<span id="cb9-11"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Set the seed number as before to get the same results</span></span>
<span id="cb9-12">np.random.seed(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">0</span>)</span>
<span id="cb9-13"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;"># Generate another random array</span></span>
<span id="cb9-14">arr2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb9-15"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print("\nArray 2:")</span></span>
<span id="cb9-16"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print(arr2)</span></span>
<span id="cb9-17"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/randomisation-with-seed.gif" class="img-fluid"></p>
<p>If you run the code yourself multiple times, the values of <code>arr1</code> and <code>arr2</code> should remain identical. If this is not the case, check that the seed value is set to 0 in lines 4 and 11. These identical results are possible because we set the seed value to 0, which ensures that the random number generator produces the same sequence of numbers each time the code is run. Now, let’s look at what happens if we remove the line <code>np.random.seed(0)</code>:</p>
<div class="code-copy-outer-scaffold"><div class="sourceCode" id="cb10" style="background: #f1f3f5;"><pre class="sourceCode markdown code-with-copy"><code class="sourceCode markdown"><span id="cb10-1"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```{python}</span></span>
<span id="cb10-2"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Generate random array</span></span>
<span id="cb10-3">arr1 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb10-4"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print("Array 1:")</span></span>
<span id="cb10-5"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print(arr1)</span></span>
<span id="cb10-6"></span>
<span id="cb10-7"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">#Generate another random array</span></span>
<span id="cb10-8">arr2 <span class="op" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">=</span> np.random.rand(<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">3</span>,<span class="dv" style="color: #AD0000;
background-color: null;
font-style: inherit;">2</span>)</span>
<span id="cb10-9"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print("\nArray 2:")</span></span>
<span id="cb10-10"><span class="co" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">## print(arr2)</span></span>
<span id="cb10-11"><span class="in" style="color: #5E5E5E;
background-color: null;
font-style: inherit;">```</span></span></code></pre></div></div>
<p><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/randomisation-without-seed.gif" class="img-fluid"></p>
<p>Here, the values of <code>arr1</code> and <code>arr2</code> will be different each time we run the code since the seed value was not set and is therefore changing over time.</p>
<p>This short example demonstrates how randomness that is controlled by the seed value can affect your code. Unless randomness is required, e.g., to capture some uncertainty in the results, setting the seed value will contribute to making your work reproducible. I also find it helpful to document the seed number I use in my code so that I can easily reproduce my findings in the future. If you are currently working on code that involves random number generators, it may be worth checking it and making any necessary changes. In our work (see code chunk 9 in <a href="https://github.com/dsvanidze/replicability/blob/master/notebooks/main.ipynb">the Jupyter notebook</a>) we set the seed value in a general way, using a framework (config), so that our code always uses the same seed to train our algorithm.</p>
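<p>The general idea of such a config-based setup can be sketched as follows (the <code>CONFIG</code> dictionary and helper name here are illustrative, not the exact code from the notebook):</p>

```python
import random

import numpy as np

# One central place for every setting that affects reproducibility
CONFIG = {"seed": 0}

def set_all_seeds(seed):
    """Seed every random number generator the project uses, in one place."""
    random.seed(seed)     # Python's built-in generator
    np.random.seed(seed)  # numpy's global generator
    # Frameworks have their own, e.g. torch.manual_seed(seed) for PyTorch

set_all_seeds(CONFIG["seed"])
print(np.random.rand(2, 2))
```

Keeping the seed in a single config both documents it and guarantees that every part of the pipeline draws from the same, reproducible state.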
</section>
<section id="conclusion" class="level2">
<h2 class="anchored" data-anchor-id="conclusion">Conclusion</h2>
<p>We hope you have enjoyed learning more about our quest for reproducibility. We have explained why reproducibility matters and provided tips for how to achieve it – or, at least, work towards it. We have also introduced a few important issues that you are likely to encounter on your own path to reproducibility. In summary, we have covered:</p>
<ul>
<li>The importance of having relative instead of hard-coded paths in code.</li>
<li>Operating system compatibility issues, which can be solved by using Docker containers for a consistent computing environment.</li>
<li>The convenience of Jupyter notebooks for code editing – particularly useful for data science projects and work using deep learning because of the ability to include text and code in the same document and make the work accessible to everyone (so long as they have an internet connection).</li>
<li>The need for version control using, for example, Git and GitHub, which allows you to keep track of changes in your code and collaborate with others efficiently.</li>
<li>The importance of setting the seed values in random number generators.</li>
</ul>
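To make the first point concrete, here is a minimal sketch (the directory and file names are hypothetical) of building a path relative to a project root with Python's <code>pathlib</code>, instead of hard-coding an absolute path:

```python
from pathlib import Path

# Hard-coded, machine-specific path -- breaks on anyone else's computer:
# data_path = Path("/Users/alice/projects/replicability/data/input.csv")

# Relative path, built from a root the code determines at runtime.
# In a script, Path(__file__).resolve().parent is a common choice of root;
# here we use the current working directory so the snippet runs anywhere.
project_root = Path.cwd()
data_path = project_root / "data" / "input.csv"
print(data_path)
```

Because the root is computed rather than written out, the same code works wherever the repository is cloned, which is exactly what a Docker container or a collaborator's machine needs.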
<p>The graphic below provides a visual overview of the different components of our study and shows how each component works with the others to support reproducibility.</p>
<p><a href="images/docker-workflow.png"><img src="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/docker-workflow.png" class="img-fluid" alt="A diagrammatic overview of the interlinking systems and processes created by the authors to allow their research to be reproduced"></a></p>
<p>We use (A) the version control system, Git, and its hosting service, GitHub, which enables a team to share code with peers, efficiently track and synchronise code changes between local and server machines, and reset the project to a working state in case something breaks. Docker containers (B) include all necessary objects (engine, data, and scripts). Docker needs to be installed (plain-line arrows) by all users (project leader, collaborator(s), reviewer(s), and public user(s)) on their local machines (C); and (D) we use a user-friendly interface (JupyterLab) deployed from a local machine to facilitate the operations required to reproduce the work. The project leader and collaborators can edit (upload/download) the project files stored on the GitHub server (plain-line arrows) while reviewers and public users can only read the files (dotted-line arrows).</p>
<p>Now, it is over to you. Our <a href="https://github.com/dsvanidze/replicability/blob/master/notebooks/main.ipynb">Jupyter notebook</a> provides a walkthrough of our research. Our <a href="https://github.com/dsvanidze/replicability">GitHub repository</a> has all the data, code and other files you need to reproduce our work, and this <a href="https://github.com/dsvanidze/replicability#readme">README file</a> will help you get started.</p>
<p>And with that, we wish you all the best on the road to reproducibility!</p>
<div class="article-btn">
<p><a href="../../../../../../applied-insights/case-studies/index.html">Find more case studies</a></p>
</div>
<div class="further-info">
<div class="grid">
<div class="g-col-12 g-col-md-12">
<dl>
<dt>About the authors</dt>
<dd>
<strong>Davit Svanidze</strong> is a master’s degree student in economics at the London School of Economics (LSE). <strong>Andre Python</strong> is a young professor of statistics at Zhejiang University’s Center for Data Science. <strong>Christoph Weisser</strong> is a senior data scientist at BASF. <strong>Benjamin Säfken</strong> is professor of statistics at TU Clausthal. <strong>Thomas Kneib</strong> is professor of statistics and dean of research at the Faculty of Business and Economic Sciences at Goettingen University. <strong>Junfen Fu</strong> is professor of pediatrics, chief physician and director of the Endocrinology Department of Children’s Hospital, Zhejiang University, School of Medicine.
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-12">
<dl>
<dt>Acknowledgement</dt>
<dd>
Andre Python has been funded by the National Natural Science Foundation of China (82273731), the National Key Research and Development Program of China (2021YFC2701905) and Zhejiang University global partnership fund (188170-11103).
</dd>
</dl>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>Copyright and licence</dt>
<dd>
© 2023 Davit Svanidze, Andre Python, Christoph Weisser, Benjamin Säfken, Thomas Kneib, and Junfen Fu.
</dd>
</dl>
<p><a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> <img style="height:22px!important;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/cc.svg?ref=chooser-v1"><img style="height:22px!important;margin-left:3px;vertical-align:text-bottom;" src="https://mirrors.creativecommons.org/presskit/icons/by.svg?ref=chooser-v1"></a> This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) <a href="http://creativecommons.org/licenses/by/4.0/?ref=chooser-v1" target="_blank" rel="license noopener noreferrer" style="display:inline-block;"> International licence</a>.</p>
</div>
<div class="g-col-12 g-col-md-6">
<dl>
<dt>How to cite</dt>
<dd>
Svanidze, Davit, Andre Python, Christoph Weisser, Benjamin Säfken, Thomas Kneib, and Junfen Fu. 2023. “The road to reproducible research: hazards to avoid and tools to get you there safely.” Real World Data Science, June 15, 2023. <a href="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/road-to-reproducible-research.html">URL</a>
</dd>
</dl>
</div>
</div>
</div>


</section>


<div id="quarto-appendix" class="default"><section id="footnotes" class="footnotes footnotes-end-of-document"><h2 class="anchored quarto-appendix-heading">References</h2>

<ol>
<li id="fn1"><p>Peng, Roger D. 2011. “Reproducible Research in Computational Science.” <em>Science</em> 334 (6060): 1226–1227.↩︎</p></li>
<li id="fn2"><p>Ioannidis, John P. A., Sander Greenland, Mark A. Hlatky, Muin J. Khoury, Malcolm R. Macleod, David Moher, Kenneth F. Schulz, and Robert Tibshirani. 2014. “Increasing Value and Reducing Waste in Research Design, Conduct, and Analysis.” <em>The Lancet</em> 383 (9912): 166–175.↩︎</p></li>
<li id="fn3"><p>Open Science Collaboration. 2015. “Estimating the Reproducibility of Psychological Science.” <em>Science</em> 349 (6251): aac4716.↩︎</p></li>
<li id="fn4"><p>Baker, Monya. 2016. “Reproducibility Crisis?” <em>Nature</em> 533 (26): 353–366.↩︎</p></li>
<li id="fn5"><p>Camerer, Colin F., Anna Dreber, Felix Holzmeister, Teck-Hua Ho, Jürgen Huber, Magnus Johannesson, Michael Kirchler, Gideon Nave, Brian A. Nosek, Thomas Pfeiffer, <em>et al</em>. 2018. “Evaluating the Replicability of Social Science Experiments in <em>Nature</em> and <em>Science</em> between 2010 and 2015.” <em>Nature Human Behaviour</em> 2: 637–644.↩︎</p></li>
</ol>
</section></div> ]]></description>
  <category>Deep learning</category>
  <category>Reproducibility</category>
  <category>Coding</category>
  <category>Collaboration</category>
  <guid>https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/road-to-reproducible-research.html</guid>
  <pubDate>Thu, 15 Jun 2023 00:00:00 GMT</pubDate>
  <media:content url="https://realworlddatascience.net/applied-insights/case-studies/posts/2023/06/15/images/computer-intro.png" medium="image" type="image/png" height="105" width="144"/>
</item>
</channel>
</rss>
