In Evaluation, The Perfect is the Enemy of the Good

There are so many ways that the perfect is the enemy of the good that no one can possibly count them.  This is also a problem in many situations in program evaluation – counting things perfectly gets in the way of counting them well enough.

I came to this realization when I recently read an evaluation report written by some long-time friends and colleagues of mine summing up the “Centers of Excellence” program of The John A. Hartford Foundation.  The program was designed to help build up the number of physician faculty specializing in geriatrics (the care of older adults) so as to improve research, teaching, and care all across the health system for a part of the population that is not well served by today’s health care.  The theory was that some academic centers had excellent resources for training future faculty (e.g., senior faculty mentors, access to technical training, and a strong pipeline of fellows[i]) and, if given extra resources, could grow faculty for the field and seed other institutions.

At its high-water mark, the program funded more than 20 academic health centers around the US with $300,000 in annual, lightly restricted/general support – $150,000 of Foundation money and $150,000 in institutional match.  Over time, the amounts, centers, and mechanisms varied a bit, but the CoEs lasted more or less intact for 30 years (’88-’16), during which time the Foundation contributed $57.7 million.  The Foundation’s purpose in commissioning a final evaluation was largely to celebrate the program, salute its many accomplishments, and give a “roughly right” picture of the careers of the alumni of the program.  The evaluators are strong professionals, experienced in their fields and with great track records, and they largely succeeded in meeting their charge.  Nevertheless, thanks to an accumulation of decisions taken many years before they arrived on the scene and an overdose of hope, they wound up needing to make lemonade out of lemons.  I think there are some lessons to be learned.

Evaluating a program like this poses many, many technical challenges.  It is neither desirable nor practical to randomly assign academic centers to receive the grant support or not (although I know some observers thought we were doing that anyway ☺).  Nor within centers could fellows and junior faculty be randomly assigned to receive support from the CoE pot of money.  One also faces what is called “censoring” in trying to look at outcomes at any one point in time (e.g., today) – that is, the period of observation is cut off (censored) at wildly different lengths for different participants.  In other words, the earliest participants in the program could have been followed for 28 years and even possibly finished their working careers by 2017, but many of the scholars who entered the program in its final years had only begun theirs.

These technical issues have technical solutions with a range of credibility.  And for the purposes of the Foundation – simply to document what happened in the careers of participants – some of the issues and solutions are irrelevant, especially at the end of a program where, for the original funder, there is no longer an opportunity to directly apply its lessons.  Nonetheless, the fundamental problem with the evaluation that was performed was its overreach.  Not overreach in an effort to draw conclusions the data could never support, such as inferences about causality or generalizability of the results, but simply in its hope of reaching all of the alumni – a victory of hope over experience.

The evaluation had the appropriately limited and conservative goal of simply describing the activities and progress of the 1,164 scholars who had participated in the program over its lifetime.  To meet this goal, the Foundation, its consultants, and the national program office set out to survey every one of the alumni.  At the start, the work was handicapped by a common, but frustrating, part of human life – the inability to see the future.  Back when the program started in 1988, nobody at the Foundation was thinking of long-term evaluation (at that point, the Foundation didn’t think much about any kind of summative evaluations and reports at all).  There was little thought given to maintaining contact information for scholar alumni, and no funding provided for the expensive and difficult work of keeping an eye on members of a longitudinal cohort through moves, name changes, and death.  So, for lack of contact information, the eventual survey of career achievements and request for a current academic curriculum vitae could be sent only to the most recent 878 of the 1,164 total scholars (creating a 24.6% loss to follow-up at the get-go).

And it only got worse: of the 878 people sent invitations to complete an online survey and to submit their CVs, 336 and 282 responded, respectively.  Response rates of 38% and 32% are actually quite decent for “cold” voluntary surveys, and I’m sure that achieving them took a great deal of time, effort, and paper cuts.  While other, more recent Foundation scholars programs outside of medicine (e.g., nursing and social work), with far more emphasis on collective identity and interpersonal ties, *had* achieved much higher response rates in similar voluntary surveys, the CoE scholars program was substantially different.  Because it was a general support grant, the local center directors decided how money would be spent, and branding of the program was highly variable (some scholars probably were never told their support came from the Foundation or were unaware of the National Program Office).  I would argue that the fact that the response rates were going to be in this range (or below) was already known, or at least knowable, from the Foundation’s prior long-term follow-up efforts[ii].

So, despite the good faith effort to reach everyone, the very approach deeply undermines even the descriptive interpretation of the results.  For example, on a key issue of keeping scholars in the field of aging/geriatrics, the estimate offered was that 97% (N=327) of respondents said “yes” to continuing in some role of influence (e.g., research, teaching, or policy).  Given the context of being asked for a CV and the almost certain relationship of being lost to follow-up to academic success (even in the subset when there was an email available), this estimate is obviously wrong.  It has to be somewhere between 28.1% (327/1164) and 97% but there is just no way to know exactly where.
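The bounding argument above can be sketched in a few lines of Python. This is only an illustration using the counts quoted in the text; the variable names are made up for the example. Note that treating 97% as the ceiling assumes that scholars who were never heard from did no better than those who responded. The strictly logical ceiling, in which every unobserved scholar stayed in the field, is slightly higher, and the sketch computes it as well.

```python
# Counts quoted in the text above.
total_scholars = 1164   # everyone who ever participated in the program
respondents = 336       # scholars who completed the online survey
said_yes = 327          # respondents reporting a continuing role in aging

# Rate observed among respondents only (the reported 97%).
observed_rate = said_yes / respondents

# Floor: assume every scholar who was not heard from left the field.
lower_bound = said_yes / total_scholars

# Strict logical ceiling: assume every scholar who was not heard from stayed.
# (The text uses the observed rate as its ceiling instead, which assumes
# non-respondents did no better than respondents.)
not_observed = total_scholars - respondents
strict_upper_bound = (said_yes + not_observed) / total_scholars

print(f"observed among respondents: {observed_rate:.1%}")
print(f"floor if all non-respondents left: {lower_bound:.1%}")
print(f"ceiling if all non-respondents stayed: {strict_upper_bound:.1%}")
```

Whichever ceiling one prefers, the interval is far too wide to be a useful description of the cohort, which is the point of the paragraph above.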

A second metric is the so-called “leverage” generated by the program.  In JAHF parlance this term had come to mean the grant revenue earned by funded scholars that could be thought of as the ROI for the program grant funds “invested.”  In this case, simply counting grants reported on the 282 CVs collected, the total was calculated to be $1.1 BILLION[iii].  Now there is nothing wrong with this number – it is a big fat fun number and, on a $57.7 million investment, it is just terrific.  But it is clearly a very low estimate of reality.  Showing great restraint, and knowing to quit while they were ahead, the authors do not make the obvious ~3x calculation to try to estimate the full “leverage” that would have been documented if all of the participants had reported.  But the real figure *is* clearly much higher than $1.1 billion; we just don’t know how much higher.

However, the evaluation does make the analogous calculation when considering how many trainees the scholars influenced.  Observing that the total number of trainees touched by the group responding to the online survey in 2014-2015 was 16,123 (ranging from medical students in classes to mentees at the junior faculty level), they created a “rough estimate” assuming that the non-responding alumni would have similar training influence.  Thus they calculated that the full cohort would have reached 55,000 people with their geriatrics expertise in that year.  I am sure that the assumption of proportionality between the responders and non-responders is more reasonable in the case of teaching than it would be in the case of winning grants.  But it is still going to be systematically off, leaving us with an estimate of somewhere between 16,000 and 55,000 in that year[iv].
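For readers who want to see the arithmetic, here is a rough sketch of a proportional scale-up of this kind. It assumes the report scaled the respondents’ total by the ratio of the full cohort to the survey respondents; that is a guess about the method, since the text only gives the inputs and the rounded result.

```python
# Counts quoted in the text above.
total_scholars = 1164      # full cohort over the life of the program
respondents = 336          # scholars who completed the online survey
trainees_reported = 16123  # trainees reached by respondents in 2014-2015

# Assumed method: every non-respondent reaches as many trainees, on average,
# as a respondent does, so the total scales with head count.
scale_factor = total_scholars / respondents
proportional_estimate = trainees_reported * scale_factor

print(f"scale factor: {scale_factor:.2f}")
print(f"proportional estimate: {proportional_estimate:,.0f}")
print(f"floor (respondents only): {trainees_reported:,}")
```

The result lands in the neighborhood of the report’s 55,000, but it is only as good as the proportionality assumption behind it.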

Because this evaluation depended upon a strong response rate where it was unreasonable to hope for one, it both understates and overstates the results of the CoE program in ways for which it is impossible to appropriately adjust.

So what would have been better than this predictable disappointment?

I argued strongly, when I was still involved, that instead of making a futile effort to reach all the participants, the evaluation should focus its limited staff time and scarce resources on a truly representative sample of the population of interest.

If the effort that was used to collect a ~30% response rate from the 1,164 people had been used to collect responses from a planned, representative sample of 300 people, the estimates would be affected by sampling error, but not by the same uninterpretable bias due to respondent self-selection.  With 300 respondents the margin of error would be +/- 4.5% for percentage estimates around 80%.  Margin of error at the 95% confidence level for percentages is calculated using the formula $\pm 1.96 \times \sqrt{p(1-p)/n}$, where $p$ is the estimated proportion and $n$ is the number of respondents.  So if we were to find that 80% of the respondents were still in the field of geriatrics, our sampling error would be $1.96 \times \sqrt{0.8 \times 0.2 / 300} \approx 0.045$ and we would have 95% confidence that the true proportion was between 75.5% and 84.5%.  In fact, even if we had a sample of only 100, the margin of error at the 95% confidence level would still be less than +/- 10% (+/- 7.8%, to be precise)[v],[vi].
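The margin-of-error figures in this paragraph, and the finite-population figure in the endnotes, can be checked with a short sketch. The function name and its parameters are illustrative; the formula is the standard normal-approximation margin of error for a proportion, with the usual finite population correction applied when a population size is supplied.

```python
import math

def margin_of_error(p, n, z=1.96, population=None):
    """Normal-approximation margin of error for a sample proportion.

    p: estimated proportion (e.g. 0.8)
    n: number of respondents in a simple random sample
    z: critical value (1.96 for 95% confidence)
    population: if given, apply the finite population correction
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    if population is not None:
        # The correction shrinks the error when the sample is a large
        # share of the whole population being described.
        standard_error *= math.sqrt((population - n) / (population - 1))
    return z * standard_error

moe_300 = margin_of_error(0.8, 300)
moe_100 = margin_of_error(0.8, 100)
moe_300_finite = margin_of_error(0.8, 300, population=1164)

print(f"n=300: +/- {moe_300:.1%}")
print(f"n=100: +/- {moe_100:.1%}")
print(f"n=300 of 1,164 (finite population): +/- {moe_300_finite:.1%}")
```

These figures assume a simple random sample with full response; non-response within the planned sample would reintroduce some of the bias discussed earlier.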

With the number of people to be reached cut down to a manageable size, one could spend a lot more time badgering people to complete their surveys, offer $$ incentives to participate, or try any number of other strategies.  I also feel pretty sure that in these days of Google, social media, and PubMed, and with the benefit of national licensing boards and membership associations, 40-45 of 50 randomly selected people from the initial group with no contact info could be run to earth (“Hi! Are you the John Smith who did a fellowship in geriatrics at Harvard’s Beth Israel in 1989?”).  In fairness, I must admit that you won’t get 100% of a planned sample either, but if you can crank it up to 90% of the random sample participating, response bias has much less room for mischief.

Sampling strategies also give some special opportunities.  For fun, one could stratify the sampling by era, gender, etc. to be sure that you had some room for comparisons.  And for extra credit – you could even oversample some rare populations (e.g., racial minority members, bench scientists, second career trainees) and, by weighting the responses, make use of the extra precision of measurement without bollixing up the overall estimates.  Even without the bells and whistles, a good random sample of 100-300 beats an ad hoc group of 300 created by unknowable, but surely biased, processes of self-selection.  It may sound small and embarrassing to settle for such a limitation at the outset, but it is just fine for government (or even Foundation) work.
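As a rough sketch of how stratifying, oversampling, and weighting fit together: draw a fixed number of people at random from each stratum, give each sampled person a weight equal to the number of people they stand in for, and use those weights when computing overall estimates. Everything concrete below is hypothetical: the stratum names, the stratum sizes (chosen only so they add up to 1,164), the allocation, and the helper function names are invented for illustration and are not taken from the report.

```python
import random

def stratified_sample(roster, allocation, seed=0):
    """Draw a simple random sample of a fixed size within each stratum.

    roster: list of (person_id, stratum) pairs
    allocation: dict mapping stratum -> number of people to sample
    Returns a list of (person_id, stratum, weight) tuples, where weight is
    the number of people in the population each sampled person represents.
    """
    rng = random.Random(seed)
    by_stratum = {}
    for person, stratum in roster:
        by_stratum.setdefault(stratum, []).append(person)
    sample = []
    for stratum, n in allocation.items():
        members = by_stratum[stratum]
        weight = len(members) / n  # inverse of the sampling fraction
        for person in rng.sample(members, n):
            sample.append((person, stratum, weight))
    return sample

def weighted_proportion(sample, outcomes):
    """Weighted share of sampled people with outcome 1 (outcomes: id -> 0/1)."""
    total_weight = sum(w for _, _, w in sample)
    return sum(w * outcomes[p] for p, _, w in sample) / total_weight

# Hypothetical strata and sizes, invented for this example.
stratum_sizes = {"clinician-educator": 900, "health services": 200, "bench scientist": 64}
roster = [(f"{s}-{i}", s) for s, size in stratum_sizes.items() for i in range(size)]

# Oversample the rare group: 40 of 64 bench scientists, versus 180 of 900.
allocation = {"clinician-educator": 180, "health services": 80, "bench scientist": 40}
sample = stratified_sample(roster, allocation)

print(len(sample), "sampled; weights sum to", round(sum(w for _, _, w in sample)))
```

Because each weight is the inverse of the sampling fraction, the oversampled group gets enough respondents for its own estimates while counting only for its true share in the overall figures.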

Finally, if I were sponsoring this kind of work in the future, I would consider reducing the sample even further and changing the evaluation team’s commission to add some kind of comparison that could make the results more meaningful.  If we can get estimates at +/- 10% for a measure like “having a continuing role in academic geriatrics” with only 100 respondents, could we use our freed-up resources to look at the overall continuation rate in academic medicine?  I don’t always feel that causality is an answerable or even important question in Foundation work, but given that there are only a few hundred fellows certified in geriatrics each year, it wouldn’t be hard to get information to create a cohort of fellows who didn’t get the CoE “leg-up” so as to give some kind of reference comparison.

I think I know why these sampling strategies are not as popular as they should be.  I think people are embarrassed by the admission at the design stage that even beneficiaries of a program like this one won’t respond to follow-up.  (I personally know board members who concluded that because the response rate was low, the program couldn’t have been good, even for follow-ups done many years later.)  Sampling is a mildly technical concept that adds a layer of mystery that “survey” doesn’t have.  I also suspect that accepting the certainty of sampling error feels like an upfront loss, and there is a natural human tendency to try to avoid such a certain loss, even if the logically expected outcome of the uncertain option is worse.  We all tend to hope that “this time it will be different” and make the perfect the enemy of the good enough.

So don’t do it.  Hope is not a plan – give it up.  Embrace limitations and make them work for you.  Put your faith in random sampling – it works.



[i] In academic geriatric medicine, people finish college (~4 years), medical school (~4 years), residency in internal or family medicine (~3 years), a clinical/research fellowship (in geriatrics the time shrank from 2 to 1 year during the CoE program), and then somewhere along the way or as a junior faculty person may get further research or educational training in an MPH or some similar program.  The recruitment pipeline was a constant problem – one of the main frustrations of trying to build the field of geriatrics was getting good people to start down the pathway and finish all 12+ years of it.
[ii] For example, the survey of alumni of fellowships in geriatrics who trained between 1990 and 1998, reported 15 years earlier (in 2002) by Medina-Walpole, Barker, Katz, et al., already had a 63% response rate, and voluntary surveys of medical students who did a summer research experience in geriatrics had several times shown even lower response rates.
[iii] And for the sake of simplicity, I am just ignoring the unequal observation periods of early versus later scholars.  If we could estimate a good “dollars per year” figure for the whole group from our observations, we could get an even nicer number and even project it forwards to the end of each person’s career, if we wished to assume a constant earning rate and funding environment.
[iv] And this “rough estimate” also doesn’t consider the element of time.  Why is the reach in 2014-2015 more important than all the prior years?
[v] And these calculations don’t even take advantage of the fact that, for descriptive purposes, the sample estimate is only intended to describe the 1,164 participants in the program, not generalize to the entire universe of people who might have participated in the program.  When generalizing to a finite population, using a substantial fraction of the population (e.g., about 26% = 300 out of 1,164), you actually get a smaller margin of error – e.g., +/- 3.9%.
[vi] Similar processes can be used to estimate sampling error around estimates such as dollars collected and trainees influenced, although there is also an unclear bias created by the different duration of observation for participants depending upon year of entry that would need to be accounted for as well.