In Evaluation, The Perfect is the Enemy of the Good

There are so many ways that the perfect is the enemy of the good that no one can possibly count them.  This is also a problem in many situations in program evaluation – counting things perfectly gets in the way of counting them well enough.

“The enemy of a good plan is the dream of a perfect plan.” – Carl von Clausewitz

I came to this realization when I recently read an evaluation report, written by some long-time friends and colleagues of mine, summing up the “Centers of Excellence” (CoE) program of The John A. Hartford Foundation.  The program was designed to help build up the number of physician faculty specializing in geriatrics (the care of older adults) so as to improve research, teaching, and care all across the health system for a part of the population that is not well served by today’s health care.  The theory was that some academic centers had excellent resources for training future faculty (e.g., senior faculty mentors, access to technical training, and a strong pipeline of fellows[i]) and, if given extra resources, could grow faculty for the field and seed other institutions.

At its high-water mark, the program funded more than 20 academic health centers around the US with $300,000 in annual, lightly restricted/general support – $150,000 of Foundation money and $150,000 in institutional match. Over time, the amounts, centers, and mechanisms varied a bit, but the CoEs lasted more or less intact for 30 years (’88-’16), during which time the Foundation contributed $57.7 million.  The Foundation’s purpose in commissioning a final evaluation was largely to celebrate the program, salute its many accomplishments, and give a “roughly right” picture of the careers of the alumni of the program.  The evaluators are strong professionals, experienced in their fields and with great track records, and they largely succeeded in meeting their charge.  Nevertheless, thanks to an accumulation of decisions taken many years before they arrived on the scene and an overdose of hope, they wound up needing to make lemonade out of lemons.  I think there are some lessons to be learned.

Evaluating a program like this poses many, many technical challenges.  It is neither desirable nor practical to randomly assign academic centers to receive the grant support or not (although I know some observers thought we were doing that anyway 🙂).  Nor, within centers, could fellows and junior faculty be randomly assigned to receive support from the CoE pot of money.  One also faces what is called “censoring” in trying to look at outcomes at any one point in time (e.g., today) – that is, the period of observation is cut off (censored) at wildly different lengths.  In other words, the earliest participants in the program could have been followed for 28 years and possibly even finished their working careers by 2017, but many of the scholars who entered the program in its final years had only begun theirs.

These technical issues have technical solutions with a range of credibility.  And for the purposes of the Foundation – simply to document what happened in the careers of participants – some of the issues and solutions are irrelevant, especially at the end of a program where, for the original funder, there is no longer an opportunity to directly apply its lessons.  Nonetheless, the fundamental problem with the evaluation that was performed was its overreach.  Not overreach in an effort to draw conclusions the data could never support, such as inferences about causality or the generalizability of the results, but simply in its hope of reaching all of the alumni – a victory of hope over experience.

The evaluation had the appropriately limited and conservative goal of simply describing the activities and progress of the 1,164 scholars who had participated in the program over its lifetime.  To meet this goal, the Foundation, its consultants, and the national program office set out to survey every one of the alumni.  At the start, the work was handicapped by a common, but frustrating, part of human life – the inability to see the future.  Back when the program started in 1988, nobody at the Foundation was thinking of long-term evaluation (at that point, the Foundation didn’t think much about any kind of summative evaluations and reports at all).  Little thought was given to maintaining contact information for scholar alumni, and no funding was provided for the expensive and difficult work of keeping an eye on the members of a longitudinal cohort through moves, name changes, and death.  So the eventual survey of career achievements and request for a current academic curriculum vitae could be sent to only the most recent 878 of the 1,164 total scholars, due to lack of contact information (creating a 24.6% loss to follow-up at the get-go).

And it only got worse: of the 878 people sent invitations to complete an online survey and to submit their CVs, 336 and 282 responded, respectively.  Response rates of 38% and 32% are actually quite decent for “cold” voluntary surveys, and I’m sure that achieving them took a great deal of time, effort, and paper cuts.  While other, more recent Foundation scholars’ programs outside of medicine (e.g., nursing and social work), with far more emphasis on collective identity and interpersonal ties, *had* achieved much higher response rates in similar voluntary surveys, the CoE scholars program was substantially different.  As a general support grant, the local center directors decided how the money would be spent, and branding of the program was highly variable (some scholars probably were never told their support came from the Foundation, or were unaware of the National Program Office).  I would argue that the fact that the response rates were going to be in this range (or below) was already known, or knowable, from similar long-term follow-up efforts of the Foundation in the past[ii].

So, despite the good-faith effort to reach everyone, the very approach deeply undermines even the descriptive interpretation of the results.  For example, on the key issue of keeping scholars in the field of aging/geriatrics, the estimate offered was that 97% (N=327) of respondents said “yes” to continuing in some role of influence (e.g., research, teaching, or policy).  Given the context of being asked for a CV, and the almost certain relationship between being lost to follow-up and academic success (even in the subset for which an email address was available), this estimate is obviously wrong.  It has to be somewhere between 28.1% (327/1,164) and 97%, but there is just no way to know exactly where.
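As a quick check on these bounds, here is a minimal sketch of the arithmetic, using only the figures already cited in the text:

```python
# Bounding the "still in the field" estimate from the figures above.
# Lower bound: assume every scholar not responding had left the field.
# Upper bound: take the respondents' rate at face value.
total_scholars = 1164    # all program participants over its lifetime
survey_responses = 336   # completed online surveys
yes_responses = 327      # respondents reporting a continuing role

lower_bound = yes_responses / total_scholars
upper_bound = yes_responses / survey_responses

print(f"{lower_bound:.1%} to {upper_bound:.1%}")  # -> 28.1% to 97.3%
```

The spread between the two bounds is the whole problem: self-selected response leaves a roughly 69-point window that no amount of after-the-fact analysis can close.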

A second metric is the so-called “leverage” generated by the program.  In JAHF parlance, this term had come to mean the grant revenue earned by funded scholars, which could be thought of as the ROI for the program grant funds “invested.”  In this case, simply counting grants reported on the 282 CVs collected, the total was calculated to be $1.1 BILLION[iii].  Now there is nothing wrong with this number – it is a big fat fun number, and on a $57.7 million investment, it is just terrific.  But it is clearly a very low estimate of reality.  Showing great restraint, and knowing to quit while they were ahead, the authors do not make the obvious ~3x calculation to try to estimate the full “leverage” that would have been documented if all of the participants had reported.  But the real figure *is* clearly much higher than $1.1 billion; we just don’t know how much.

However, the evaluation does make the analogous calculation when considering how many trainees the scholars influenced.  Observing that the total number of trainees touched by the group responding to the online survey in 2014-2015 was 16,123 (ranging from medical students in classes to mentees at the junior faculty level), they created a “rough estimate” assuming that the non-responding alumni would have similar training influence.  Thus they calculated that the full cohort would have reached 55,000 people with their geriatrics expertise in that year.  I am sure that the assumption of proportionality between responders and non-responders is more reasonable in the case of teaching than it was in the case of winning grants. But it is still going to be systematically off, leaving us with an estimate of somewhere between 16,000 and 55,000 in that year[iv].
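The report’s scale-up can be sketched the same way; note that the denominator of 336 online-survey respondents is my assumption about how the proportional extrapolation was done (the report only says non-responders were assumed to have similar influence):

```python
# Scaling respondents' trainee counts up to the full cohort, assuming
# non-respondents taught at the same rate (the report's assumption).
trainees_reported = 16123  # trainees reached by online-survey respondents, 2014-15
respondents = 336          # online-survey respondents (my assumed denominator)
cohort = 1164              # all scholars over the life of the program

full_cohort_estimate = trainees_reported * cohort / respondents
print(round(full_cohort_estimate))  # -> 55855, i.e., the report's ~55,000
```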

This evaluation, because it depended upon a strong response rate where it was unreasonable to hope for one, both understates and overstates the results of the CoE program, in ways for which it is impossible to appropriately adjust.


So what would have been better than this predictable disappointment?

I argued strongly, when I was still involved, that instead of making a futile effort to reach all the participants, the evaluation should focus its limited staff time and scarce resources on a truly representative sample of the population of interest.

If the effort that was used to collect a ~30% response rate from the 1,164 people had instead been used to collect responses from a planned, representative sample of 300 people, the estimates would have been affected by sampling error, but not by the same uninterpretable bias due to respondent self-selection.  With 300 respondents, the margin of error would be +/- 4.5% for percentage estimates around 80%.  The margin of error at the 95% confidence level for a percentage is calculated using the formula +/- 1.96 * sqrt(p(1-p)/n).  So if we were to find that 80% of the respondents were still in the field of geriatrics, our sampling error would be +/- 1.96 * sqrt(0.8 * 0.2 / 300) ≈ +/- 4.5%, and we would have 95% confidence that the true proportion was between 75.5% and 84.5%.  In fact, even if we had a sample of only 100, the margin of error at the 95% confidence level would still be less than +/- 10% (+/- 7.8%, to be precise)[v],[vi].
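For readers who want to check these figures, a minimal sketch of the margin-of-error formula:

```python
import math

def margin_of_error(p, n, z=1.96):
    """95% margin of error for a proportion p estimated from n respondents."""
    return z * math.sqrt(p * (1 - p) / n)

print(f"{margin_of_error(0.80, 300):.1%}")  # -> 4.5%
print(f"{margin_of_error(0.80, 100):.1%}")  # -> 7.8%
```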

With the number of people to be reached cut down to a manageable size, one could spend a lot more time badgering people to complete their surveys, offer $$ incentives to participate, or try any number of other strategies.  I also feel pretty sure that in these days of Google, social media, and PubMed, and with the benefit of national licensing boards and membership associations, 40-45 of 50 randomly selected people from the initial group with no contact info could be run to earth (“Hi! Are you the John Smith who did a fellowship in geriatrics at Harvard’s Beth Israel in 1989?”).  In fairness, I must admit that you won’t get 100% of a planned sample either, but if you can crank participation up to 90% of the random sample, response bias has much less room for mischief.

Sampling strategies also offer some special opportunities.  For fun, one could stratify the sampling by era, gender, etc. to be sure you had some room for comparisons.  And for extra credit, you could even oversample some rare populations (e.g., racial minority members, bench scientists, second-career trainees) and gain extra precision of measurement without bollixing up the overall estimates, simply by weighting the responses.  Even without the bells and whistles, a good random sample of 100-300 beats an ad hoc group of 300 created by unknowable, but surely biased, processes of self-selection.  It may sound small and embarrassing to settle for such a limitation at the outset, but it is just fine for government (or even Foundation) work.
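To illustrate the weighting idea, here is a minimal sketch with invented numbers (the strata, sample sizes, and outcomes are all hypothetical): a rare stratum is oversampled for precision, and the strata are then recombined using their population shares rather than their sample shares.

```python
# Hypothetical example: stratum B is 10% of the population but half the sample.
# Weighting by population share keeps the overall estimate honest.
pop_share = {"A": 0.90, "B": 0.10}  # true population proportions (assumed)
sampled   = {"A": 150,  "B": 150}   # B deliberately oversampled
stayed    = {"A": 120,  "B": 135}   # respondents still in the field (invented)

# Per-stratum rates, then combine with population weights, not sample shares.
rates = {s: stayed[s] / sampled[s] for s in sampled}
overall = sum(pop_share[s] * rates[s] for s in rates)

print(f"{overall:.1%}")  # -> 81.0% (the unweighted pool would give 85.0%)
```

The unweighted pooled rate (85%) would overstate retention here because the oversampled rare stratum happens to do better; the population weights undo the oversampling.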

Finally, if I were sponsoring this kind of work in the future, I would consider reducing the sample even further and changing the evaluation team’s charge to add some kind of comparison that could make the results more meaningful.  If we can get estimates at +/- 10% for a measure like “having a continuing role in academic geriatrics” with only 100 respondents, could we use our freed-up resources to look at the overall continuation rate in academic medicine?  I don’t always feel that causality is an answerable or even important question in Foundation work, but given that there are only a few hundred fellows certified in geriatrics each year, it wouldn’t be hard to get the information to create a cohort of fellows who didn’t get the CoE “leg up,” so as to provide some kind of reference comparison.

I think I know why these sampling strategies are not as popular as they should be.  I think people are embarrassed by the admission, at the design stage, that even the beneficiaries of a program like this one won’t respond to follow-up.  (I personally know board members who concluded that because the response rate was low, the program couldn’t have been good – even for follow-ups done many years later.)  Sampling is a mildly technical concept that adds a layer of mystery that “survey” doesn’t have.  I also suspect that accepting the certainty of sampling error feels like an upfront loss, and there is a natural human tendency to try to avoid such a certain loss, even when the logically expected outcome of the uncertain option is worse.  We all tend to hope that “this time it will be different” and make the perfect the enemy of the good enough.

So don’t do it.  Hope is not a plan – give it up.  Embrace limitations and make them work for you.  Put your faith in random sampling – it works.



[i] In academic geriatric medicine, people finish college (~4 years), medical school (~4 years), a residency in internal or family medicine (~3 years), and a clinical/research fellowship (in geriatrics, the time shrank from 2 years to 1 during the CoE program), and then, somewhere along the way or as a junior faculty member, may get further research or educational training in an MPH or some similar program.  The recruitment pipeline was a constant problem – one of the main frustrations of trying to build the field of geriatrics was getting good people to start down the pathway and finish all 12+ years of it.
[ii] For example, the survey of fellows trained in geriatrics between 1990 and 1998, reported 15 years earlier by Medina-Walpole, Barker, Katz, et al. in 2002, had a 63% response rate, and voluntary surveys of medical students who did a summer research experience in geriatrics had several times shown even lower response rates.
[iii] And for the sake of simplicity, I am just ignoring the unequal observation periods of early versus later scholars.  If we could estimate a good “dollars per year” figure for the whole group from our observations, we could get an even nicer number and even project it forwards to the end of each person’s career, if we wished to assume a constant earning rate and funding environment.
[iv] And this “rough estimate” also doesn’t consider the element of time.  Why is the reach in 2014-2015 more important than all the prior years?
[v] And these calculations don’t even take advantage of the fact that, for descriptive purposes, the sample estimate is only intended to describe the 1,164 participants in the program, not to generalize to the entire universe of people who might have participated.  When generalizing to a finite population using a substantial fraction of that population (e.g., ~26% = 300 out of 1,164), you actually get a smaller margin of error – e.g., +/- 3.9%.
[vi] Similar methods can be used to estimate the sampling error around estimates such as dollars collected and trainees influenced, although there is also an unclear bias, created by the different durations of observation for participants depending upon their year of entry, that would need to be accounted for as well.
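A sketch of the finite population correction mentioned in footnote [v], which shrinks the margin of error when the sample is a large fraction of the 1,164-person population:

```python
import math

def moe_fpc(p, n, N, z=1.96):
    """95% margin of error for a proportion, with the finite population
    correction for a sample of n drawn from a population of N."""
    fpc = math.sqrt((N - n) / (N - 1))
    return z * math.sqrt(p * (1 - p) / n) * fpc

print(f"{moe_fpc(0.80, 300, 1164):.1%}")  # -> 3.9%
```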

Writing about Philanthropy – What is this ‘Sustainability’ thing?

As I wrote last week, writing usefully about philanthropy is hard. But a recent controversy has inspired me.  Late last year, various news outlets carried the story that a generous donor, The Suder Foundation, creator of the First Scholars program of scholarships and support for “first generation” college students, was suing some of its grantee institutions for failing to live up to the grant-based agreement to continue the program after the external funding ended.  Commentators talked about the perfidy of the grantee institutions or the naiveté and bad math of the grantor (in failing to endow the programs).

This controversy highlights the broader issue of “sustainability” in philanthropy and non-profit activities. Sustainability seems to me to be one of those overused, “magic” words – people know it’s a good thing, but don’t think much further to try to understand exactly what kind of thing it is. At the end of the day, for a philanthropic funder, a program activity is sustainable when there is someone else who at some point will be willing to pay for it to go on. That “sustainability” funding may be from earned revenue, from dipping into an undifferentiated stream of individual donations, or somewhere else, but someone has to pay. There is no free lunch. And it is always harder than it sounds.

Sustainability is also not purely a program property, but a product of the program (costs and benefits), residing in an organization, in the context of its environment (funders, stakeholders, government).[1] Particularly as venture capital has increasingly become the metaphor for philanthropy (I almost wrote “philanthropic investments,” which shows that I, too, have been sipping the Kool-Aid), the VC notion of being “taken out” gains currency. The idea is that “investment” money is always moving on, with some other investor taking over, the way VC firms succeed each other at different stages of a start-up’s development. This occasionally may end in the Holy Grail of a successful IPO. Unfortunately, one of the over-extensions of the VC metaphor is that in the non-profit/philanthropic situation, an excellent program that delivers great impact can easily exist WITHOUT a viable new funding source to carry it on. And while an IPO is like winning the lottery in the real start-up sector, there isn’t even a real analogue to an IPO for us.

In the case of the Suder Foundation’s grants and the Universities’ response (or lack of response), part of the problem seems to be this overextension of the financial metaphor.

The initial phase of external funding for the First Scholars program is described as the seed phase, in which the Suder Foundation would pay for the start-up costs. According to news reports and its website, the First Scholars program was substantially more effective at keeping first-generation college students enrolled than preexisting efforts. While this success is described as “ROI,” there is no actual short-term financial return to use to sustain the program. Metaphoric returns on investment are not spendable on real programs.


The second phase of First Scholars is described as “University Self-Funding,” but it had no specified mechanism for raising new funding. (There was some discussion of Suder support for University “fundraising,” but having been in those shoes, I can testify to how very hard it is to get one funder to replace another – it feels ignominious and uninspiring to most potential replacements. I’ve done it several times, but I have also faced intense board criticism for it.) So, to the extent that there was a plan, I deduce that the grantor and grantees expected to redirect existing scholarship and student support funding into a locally controlled version of the First Scholars program.[2] This is actually not an unreasonable plan. Finding new and more productive ways to spend existing resources is easy to support in concept, but the concept doesn’t do justice to the complexity of such changes.

An example from my experience.

In 1999 at the John A. Hartford Foundation, I was staff lead for the IMPACT trial, an eight-site randomized clinical trial of depression treatment that still remains the largest such study in the US. Given our interest at the Foundation, we wanted the evidence from the trial to change national practice and policy (still working on that), but we also wanted the participating sites to maintain the program if the evidence showed that it produced superior outcomes for patients. Like Suder, we paid directly for the services in the trial phase, and we wanted the grantees to keep them going.

For four years, as I traipsed around on my annual site visits, I would always ask the grantee team and whatever institutional leaders they could round up: “If the model works, will you keep it?” And at least what I heard was “yes.” I distinctly recall a senior service leader at a large Midwestern health system who was always very clear about the institution’s commitment both to evidence-based practice (meaning we would have to wait for results) and to high-quality care (meaning that if the results were good, they would keep the program in place).

Well, when the trial phase came to an end with not just good results but GREAT results – twice as effective as usual care – this senior leader had retired and gone on a medical mission to Africa. I was in the very same position as the Suder Foundation relative to the University of Alabama, where the entire development and leadership staff had turned over and nobody remembered any commitment to “sustain” the program.  Needless to say, whatever minimal internal discretionary funds might have been under the control of this stakeholder were not forthcoming.

Does this make the institution evil, or me naïve? (I certainly was less experienced, but I already knew sustainability was an issue.)  I would say “neither.”  Sustaining a program under these conditions, or those of the Suder Foundation grants, requires recognizing a few realities.

  1. Academic/health institutions expect to get grant money. They are good at doing grant-funded projects; faculty and staff can “sell” time to new projects in a very flexible way, unlike in most companies, where existing staff are dedicated to ongoing functions. Senior administrators sign off on many, many such grants each year, rarely with serious consideration of what they are agreeing to.
  2. However, ongoing functions, such as standard health care practices or standing scholarship funds, are controlled by other stakeholders, separate from the grant-seeking/grant-management leaders. This is part of what gives institutions flexibility – they insulate core activities as much as possible from the vagaries of grant funding.

It is not unreasonable, despite commentary to the contrary, to expect an institution to change the way it spends its own money depending upon the results of a grant-funded project; it is just very complicated. Despite being within one institution, you essentially face the same problem of using evidence to encourage program adoption without a grant when you want to achieve such “sustainability.” And if, at the end of the day, there simply aren’t enough shiftable resources to maintain the program, it won’t be sustained. A good business analysis and plan at the outset might be helpful, but should philanthropic dollars only go to those activities where someone else will be willing to pay down the road?



[1] The early days of philanthropy are sometimes thought to be a golden era when better solutions to social problems were relatively easily taken up by government (e.g., painted road dividing lines and the 911 emergency system both started as philanthropic efforts but are now sustained by tax dollars). However, I’m sure it never was easy then, and it does still happen now on occasion.


[2] This was then to lead to the Third Phase, a National Network of operating “franchised” sites that would do program research, quality assurance, etc. – but again without the actual dollar flow of a franchise operation: another over-extension of the financial metaphor.