The politics of performance measurement

Funding decisions, particularly for projects in the public sector, should be informed by data on what works and what doesn’t. But performance assessment is rarely straightforward – especially when people’s lives and livelihoods may be impacted by the decisions that are made. Data scientist Noah Wright recounts the trials and tribulations of managing and meeting stakeholder expectations in developing a performance measurement program for specialty courts in Texas.

Problem definition
Stakeholder communication
Relationship management
Author

Noah Wright

Published

April 18, 2023

At the beginning of 2016, the Criminal Justice Division (CJD) of the Texas Governor’s Office received news all government agencies dread: budgets were to be cut. CJD oversaw a grant program that funded specialty courts throughout the state, but it was now being told that the program’s budget of $10.6m would be reduced by 20%, to $8.5m, by 2018.

How should these cuts be distributed among grant holders? CJD had no meaningful performance data on which to base its decisions, and I would know: I had been hired by the agency just a few months earlier to analyze grant performance. Still, decisions needed to be made. We had to come up with a plan of action, and the clock was ticking…

This is a story of making opportunity out of crisis – of the interaction not just between theory of change and technical implementation, but also with the “political” process of negotiating these changes with stakeholders in a way that led to better decisions. Through careful outreach and continuous communication, we developed a data collection and performance assessment process that enabled us to allocate budget cuts in a manner that was widely accepted.

The story ends on a bittersweet note. But, along the way, there are lessons to be learned about how to find common ground, manage expectations, forge productive working partnerships, and sustain a data science project longer term.


Step 1: Consider your options

Texas had over 150 specialty courts in 2016, providing a program of specialized services – usually drug treatment – to offenders as an alternative to incarceration. About half of the state’s specialty courts received CJD grant funds (and about half of grantees received 100% of their program budget from our grants). Funding cuts of the size we needed to make would not go over well with them. Any changes to the program would have to run a gauntlet of decision-makers including advisory boards, interest groups, and professional associations, most with contacts in the legislature.

Complicating this situation further, CJD didn’t even make the final funding decisions. We administered the grants, but the merit review process fell to the Specialty Courts Advisory Council, an appointed group of specialty court staff and related experts who annually scored the grant applications we received. We needed to get them on board.

The way our Executive Director saw it, we had three options to implement the cut in a way that could get us buy-in from stakeholders and the Advisory Council:

  1. Cut across the board. The Advisory Council would employ the same scoring method as the previous year but reduce each grant amount by 20%.

This option would leave long-running grantees scrambling to make up for this shortfall by reducing services, laying off staff, or spending more of their limited local funds. Worse, it would punish all grantees equally – our most successful programs would be arbitrarily defunded.

  2. Fewer grants. Grants were scored based on the quality of their application, and all grants that passed a certain threshold got funded. The Advisory Council would employ the same scoring method as the previous year, but instead of funding the top $10.6m worth of grants, they would fund the top $8.5m worth.

This seemed a less bad option than cutting across the board, but we would still run into the problem of arbitrarily defunding successful programs. Grants near the bottom of the Advisory Council’s cutoff that got funded the previous year would be denied renewal only because the goalposts had moved.

  3. Targeted funding. The Advisory Council would incorporate performance data and statewide strategic plan alignment into their scoring method and make cuts accordingly.

At the time, the Advisory Council did not take performance into consideration when scoring grant applications. They agreed in theory that a grant requesting its tenth annual renewal should perhaps at some point be assessed on its outcomes, but they had never seen CJD commit to a rigorous performance assessment process before. We administered the grants, not them, so without our commitment to develop a performance assessment process, and their trust in that commitment, this would not be a viable option.

After due consideration, option 3 emerged as the favorite of our Executive Director. On the face of it, this seemed the most “objective” approach to take. We would let the data decide who gets funded and who doesn’t, rather than cutting arbitrarily. But that would be a fallacious argument. Data does not decide. It might inform our decisions, but it would be up to us to choose the structure of the performance measurement process: what aspects to focus on, what data to collect, what benchmarks to set – all of which would later help determine funding decisions. And in any funding decision, politics inevitably plays a role.

Politics is, in its broadest sense, the negotiated decision-making between groups with opposing interests. And in developing our performance measurement process we would encounter a variety of interests – from the Advisory Council down to the grantees themselves. Success would require us to acknowledge stakeholder perspectives and address or manage them appropriately. Planning decisions made in the early phases of a project as a result of political processes directly influence the type and scope of analysis a data scientist will eventually be able to perform, so it behooves the data scientist to participate in these processes!

Step 2: Engage stakeholders and define performance

Having settled on our preferred option, our Executive Director convened a strategy session with the Advisory Council to discuss how to proceed as part of a broader strategic plan. The session began by achieving consensus on high-level goals such as “fund strategically”, “focus on success”, “build capacity”, etc. The session also helped the Advisory Council and CJD alike clarify how CJD ought to fit into the specialty courts field going forward. CJD would develop its performance assessment system to help the Advisory Council target funding, but that would come as part of a larger plan that included capacity building, training and technical assistance, helping courts obtain non-CJD sources of funding, and steering grantees toward established best practice.

We left the meeting with a very basic plan that looked good on paper. Our Executive Director set to work persuading our external stakeholders of the wisdom of this new strategic direction. Meanwhile, I had to build a performance assessment process that people could trust.

CJD had no formally designated standards to measure performance data against. However, drug courts have been around for decades and there existed a large body of research supporting the program model.1 Offering supervised drug treatment instead of incarceration had been repeatedly shown to cost less money and lower recidivism rates. I performed a literature review and spoke with numerous subject matter experts to get started on defining program-specific performance metrics.

I was conscious that imposing metrics without any feedback or input from affected parties would all but guarantee bad-faith engagement, especially if those metrics were tied to funding. A problem inherent to any performance measurement is that once something gets measured as a performance outcome, it warps the very processes it is intended to measure. This phenomenon happens so frequently that the phrase “Campbell’s Law” was coined to describe it in 1979.2 Think of standardized testing at schools: once the government ties test performance to school funding, it creates powerful incentives for schools to improve test scores at any cost. Even in the absence of outright cheating, struggling schools feel massive pressure to adjust their curriculum, to the point where they teach test score optimization strategies more than math, language, history, and science.

I consistently heard from specialty court scholars and practitioners alike that arrest recidivism would be the ideal outcome measure. On paper, recidivism was a direct expression of long-term program success and could also be used as an outcome variable for classification modeling. And, in practice, a court could do little to manipulate recidivism. Courts do not make arrests – police make arrests. Once a specialty court participant finished a program, the court itself no longer intervened in their lives. If a participant got arrested within 1-3 years of completion, the program had no say in the matter.

This, however, presented an implementation problem: one-year recidivism data would, by definition, take a year past the point of implementation to collect, i.e., not soon enough to inform our cuts. And while recidivism was the best measure of success, it could not be the only measure. Recidivism was, after all, a stochastic process not within the court’s control – a crime wave or other systemic factors could move recidivism up and make it look like a successful court had actually failed. We would have to use something else as well.

The National Association of Drug Court Professionals (NADCP) publishes a book of best practice standards, and our stakeholders identified a court’s adherence to these standards as another strong performance measure. These criteria, unlike recidivism, were directly under the program’s control. Does your program have the recommended staff? Does your program drug test participants frequently enough to guarantee sobriety? Does your program meet with participants regularly enough? Do you offer a continuum of services instead of a “one-size-fits-all” approach?

In addition to being much easier to measure than recidivism, best practice adherence also resists Campbell’s Law by avoiding outcome measurement. In our school metaphor, this would be like measuring school performance based on student-to-teacher ratio, variety of course offerings, attendance rates, and teacher qualifications. Far from perfect, but measuring a variety of elements that predict success and taking them as a whole represents a vast improvement over a single, easily-gamed outcome measure.

But to operationalize these standards, we would have to have good data.

Step 3: Update processes and collect data

We inherited a longstanding process in which grantees had to fill out a form every six months asking them to report performance data. This is a screenshot of what that form looked like:

A screenshot of a data table containing example data collected from grantees in the Texas specialty courts funding program, prior to the development of a new performance measurement process.

No additional definitions or instructions were provided, leaving grantees with many questions: Does the request for “annual data” mean as of fiscal year or calendar year? What counts as a person being “assessed for eligibility”? And so on. Grantees did not know the answers, and neither did we. And these were the more straightforward measures. The form went on for 10 pages, most of which asked grantees to report extensively on information they had already provided as part of their application.

This disaster of an assessment process did have a silver lining. When we announced we were throwing out these forms entirely we faced almost no pushback from grantees.

We knew from the start that our new assessment process would need to collect individual-level participant data instead of aggregated measures. Even with clear definitions, 75 grants would mean 75 different aggregations at work. Asking the grantees to report their individual-level participant data in a consistent format, and doing the aggregations ourselves, meant a single, consistent aggregation.

But we needed to establish trust with grantees before making this request. Strictly speaking, we could mandate the reporting of this data. However, if that angered enough of our grantees, they or their contacts might take it up with our bosses at the Governor’s Office, and our bosses could cancel any plan we came up with if they thought it was not worth the fuss. So, from day one we communicated clearly to all grantees that we would maintain total transparency when it came to definitions and calculations. Before we used any calculated metric to assess performance we would send it to the grantees themselves to review for accuracy.

To avoid the vagueness and inscrutability that characterized the old reporting process, every piece of data we asked for in the new process had a clear written definition and specific reason for being asked. These reasons usually amounted to some combination of best practices, Advisory Council recommendations, and grantee suggestions.

Implementing the new process was far from easy, however. We faced numerous administrative and technical barriers. Texas courts at this time did not share a common case management system, so we couldn’t just get a data export from everybody. Meanwhile, the Governor’s Office banned all of its divisions from any use of the cloud. This forced us to build a more labor-intensive reporting process, in which courts would obtain blank Excel templates with required data fields. Courts had either to fill out these templates by hand or export their case management data and reconfigure it to template specifications. Then, courts submitted their data for review, and we sent back anything with formatting problems to be corrected.
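To give a flavor of what that review step involved, here is a minimal sketch of the kind of automated checks one might run on a submitted template. The column names, allowed values, and error messages are hypothetical illustrations, not the actual CJD template specification.

```python
import pandas as pd

# Hypothetical template fields -- the actual CJD Excel template is not
# reproduced in this article.
REQUIRED_COLUMNS = {"participant_id", "entry_date", "exit_date", "status", "risk_level"}
ALLOWED_STATUS = {"active", "graduated", "discharged"}

def validate_submission(path: str) -> list[str]:
    """Return a list of formatting problems found in one court's Excel submission."""
    errors = []
    df = pd.read_excel(path)

    # 1. Every required column must be present.
    missing = REQUIRED_COLUMNS - set(df.columns)
    if missing:
        errors.append(f"Missing required columns: {sorted(missing)}")
        return errors  # little point checking further without the columns

    # 2. Dates must parse; blank exit dates are fine for active participants.
    for col in ("entry_date", "exit_date"):
        parsed = pd.to_datetime(df[col], errors="coerce")
        bad = parsed.isna() & df[col].notna()
        if bad.any():
            errors.append(f"{col}: {int(bad.sum())} rows are not valid dates")

    # 3. Status must be one of the defined values.
    bad_status = ~df["status"].astype(str).str.lower().isin(ALLOWED_STATUS)
    if bad_status.any():
        errors.append(f"status: {int(bad_status.sum())} rows have unrecognized values")

    # 4. Each participant should appear only once.
    if df["participant_id"].duplicated().any():
        errors.append("participant_id: duplicate identifiers found")

    return errors

# Example usage: problems = validate_submission("court_submission.xlsx")
```

In practice, checks like these were paired with written definitions for every field, so that a rejected submission came back with a reason the court could act on.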

We collected preliminary data at the six-month mark – data that would not count toward performance measurement – and made further adjustments based on the results. A majority of courts had some kind of data error in this first submission. Specific definitions of data fields had to be written and rewritten using grantee feedback over the course of the year, leading to significant changes between the six-month reports and the year-end reports.

Importantly, we had developed reporting requirements iteratively with participation from grantees and the Advisory Council from the start. By mid-2017 we had so successfully achieved buy-in that only one grantee court’s judge refused to give us data (the court’s grant manager later sent it to us).

Step 4: Analyze and report findings

In the course of this process, we established the benchmarks in Table 1, based on best practices and on what was needed to justify funding. Because this was our initial rollout, we set the specific values low, so that they would function more as minimum standards than as targets.

Table 1: Specialty court best practices translated into quantitative measures.

| Benchmark | Best practice | Rationale |
|---|---|---|
| 1. Number of participants | 10+ | CJD decision: programs should be of sufficient size to justify a grant |
| 2. Number of graduates | 5+ | CJD decision: programs should be of sufficient size to justify a grant |
| 3. Graduation rate | 20%-90% | CJD decision: 0% and 100% success rates are both red flags |
| 4. Average amount of time graduates spent in program (in months) | 12-24 | NADCP best practice recommended program lengths of 1-2 years |
| 5. Percent of graduates employed, seeking education, or supported through family, partner, SSI, etc. | 100% | NADCP best practice recommended against releasing participants without financial support, which all but guarantees relapse or rearrest |
| 6. Percent of participants with “low-risk” assessment score | 0% | NADCP best practice recommended moderate- or high-risk participants; research had shown that low-risk participants get little benefit |
| 7. Average sessions per participant per month | 1+ | NADCP best practice recommended sessions be held at least monthly |

Grantee performance data for each benchmark would be generated from the individual-level data that courts sent us. Crucially, we sent our aggregations back to grantees for confirmation prior to using them in any kind of evaluation, alongside the program-wide average and the best practice values for comparison (see the example in Table 2). If something didn’t look right, they had the chance to let us know before we took their numbers as final.

Table 2: Specialty court best practices compared with program-wide averages and grantee reported values.

| Benchmark | Best practice | Program-wide average | Grantee reported values |
|---|---|---|---|
| 1. Number of participants | 10+ | 89 | 96 |
| 2. Number of graduates | 5+ | 25 | 27 |
| 3. Graduation rate | 20%-90% | 71% | 56% |
| 4. Average amount of time graduates spent in program (in months) | 12-24 | 17 | 14 |
| 5. Percent of graduates employed, seeking education, or supported through family, partner, SSI, etc. | 100% | 95% | 100% |
| 6. Percent of participants with “low-risk” assessment score | 0% | 18% | 2% |
| 7. Average sessions per participant per month | 1+ | 2 | 3.7 |
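As a rough illustration of what a “single aggregation” looked like in practice, the sketch below derives a few of the Table 1 benchmark values from individual-level records. The field names and sample rows are hypothetical; the actual CJD data fields are not reproduced in this article.

```python
import pandas as pd

# Hypothetical individual-level participant records for two grantees.
participants = pd.DataFrame({
    "grantee": ["Court A"] * 4 + ["Court B"] * 3,
    "status": ["graduated", "graduated", "discharged", "active",
               "graduated", "discharged", "active"],
    "months_in_program": [14, 18, 6, 9, 13, 4, 7],
    "risk_level": ["high", "moderate", "moderate", "low",
                   "high", "low", "moderate"],
    "sessions_per_month": [2.1, 1.5, 0.8, 1.2, 3.0, 0.5, 1.1],
})

def benchmark_metrics(df: pd.DataFrame) -> pd.Series:
    """Aggregate one grantee's records into a subset of the Table 1 benchmarks."""
    grads = df[df["status"] == "graduated"]
    exited = df[df["status"].isin(["graduated", "discharged"])]
    return pd.Series({
        "participants": len(df),                                    # benchmark 1
        "graduates": len(grads),                                    # benchmark 2
        "graduation_rate": len(grads) / max(len(exited), 1),        # benchmark 3
        "avg_months_graduates": grads["months_in_program"].mean(),  # benchmark 4
        "pct_low_risk": (df["risk_level"] == "low").mean(),         # benchmark 6
        "avg_sessions_per_month": df["sessions_per_month"].mean(),  # benchmark 7
    })

# One aggregation function, applied identically to every grantee's raw records,
# replaces 75 grantees each doing their own (possibly inconsistent) calculations.
report = participants.groupby("grantee").apply(benchmark_metrics).round(2)
print(report)
```

Each grantee’s resulting row could then be compared against the Table 1 thresholds and sent back to the court for confirmation before feeding into any funding recommendation.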

In the end, we found seven grants that we could unequivocally recommend be cut. Two of the seven had effectively never gotten off the ground, and served almost no participants the entire year. The other five served mostly low-risk participants, the type of people that research had shown do not benefit from specialty court programs. Some of these grantees were inevitably disappointed at the decision, but we had so actively worked within the field to develop and justify our processes that they understood why the decision had been made.

Factors for success

In the span of one year, CJD went from collecting a large volume of useless data to collecting a specific, targeted set of data informed by best practices. The new collection process had high grantee compliance and stakeholder buy-in.

The following factors proved essential to getting to a place where we had useful, reliable data upon which to base future data science efforts:

  1. Discontent with status quo

    The Advisory Council wanted CJD to play a more active support role in the field. Meanwhile, everyone disliked the existing performance assessment process. As a result, most of the challenges we faced along the way related to implementation rather than to anyone defending the status quo on its merits.

  2. A catalyst for change

    Despite existing discontent, it took a funding shortfall to kickstart the process of change. Without that catalyst, it is unlikely we could have created this system from scratch.

  3. Continuous, high-quality communication

    We could impose rules and requirements all day long, but without good faith engagement from the grantees we could never collect the quality of data we needed. Note that “continuous communication” does not mean “tell them everything you do at every point”. People become overwhelmed by torrents of information.

  4. Humility and flexibility

    Had we begun this process assuming we had all of the answers, we would have been dead in the water. Continuous outreach and willingness to take criticism and suggestions shaped the process as it progressed, ultimately producing a better end-product than we could have devised on our own.

  5. An established program model

    Drug courts have been around for decades, with a vast body of supporting research and a community of practitioners and scholars we could speak to. That meant we could focus on implementation and execution instead of determining if the model worked or not.

  6. Strong leadership support

    From the very beginning, we could not have accomplished what we did without the full support and advocacy of our Executive Director.

Coda: Why knowledge transfer is vital

I wish I could write a follow-up article about how we started using classification modeling to identify the most successful programs and to promote better approaches and practices; about how we iterated the process through multiple funding cycles, tuning and perfecting it to better meet stakeholder needs. But I cannot.

The performance assessment system we built had some major weaknesses from the outset. It was labor intensive, not required by law, produced no immediate benefit to the agency itself, and was so new it had yet to be entrenched in agency practice. In other words, no institutional incentives worked in its favor. Only the continual push of our Executive Director and myself kept this new performance assessment system going, and once we left the agency, it foundered.

Still, the experience taught me much. I learned first and foremost that programs do not sustain themselves. Most of our attention had been focused on building up the best process we could. Only a minimal effort had been spent on institutionalizing and sustaining it. We had written documentation but no fundamental changes in policy or rule. We had undertaken groundbreaking efforts and built relationships, but had not planned for any meaningful knowledge transfer to other staff. While we had intended to eventually do these things, fate took us away before we could get them in place.

For any kind of change to last, sustainability must be built in from the start. In the moment, these actions can seem low-priority. Policy and rule changes can be arduous and time-consuming. Knowledge transfer from one stably employed staff member to another feels redundant and wasteful. But without embedding sustainability, no success will outlast the individual people pushing for it.


About the author
Noah Wright is a data scientist with the Texas Juvenile Justice Department. He is interested in the applications of data science to public policy in the context of real-world constraints, and the ethics thereof (ethics being highly relevant in his line of work). He can be reached on LinkedIn.
Copyright and licence
© 2023 Noah Wright

This article is licensed under a Creative Commons Attribution 4.0 (CC BY 4.0) International licence.

How to cite
Wright, Noah. 2023. “The politics of performance measurement.” Real World Data Science, April 18, 2023. URL

Footnotes

  1. Some newer types of courts (Commercial Sexual Exploitation, Mental Health, Veterans) had a much more limited body of research and had to be accommodated separately. For the sake of keeping this narrative coherent I’m focusing on drug courts, which were the majority of our programs.

  2. Rodamar, Jeffery. 2018. “There ought to be a law! Campbell versus Goodhart.” Significance 15 (6): 9. https://doi.org/10.1111/j.1740-9713.2018.01205.x