It often falls to the management scientist to evaluate how well a program (in the private, non-profit, or government sector) is performing.  There are many ways to go about this task; this article discusses some of them.

Different analysis techniques can be applied to the evaluation task, and as we shall see, the choice of technique matters.  I do not address the details of technique in this article; my emphasis today is on the variety of evaluation philosophies (principles).

This is important for two reasons:

First, you want to ensure you are using the same vocabulary and concepts as your client (the beneficiary of your research). Second, because some evaluation philosophies are more intellectually defensible than others, you will want to choose your approach in an informed manner. Current events highlight the importance of these matters, and I will comment on them at the end of the column.

Any program uses resources (“R” in the figure), processes these resources in some manner (“P”), and produces outcomes, “O.”

                 Resources (R)  →  Process (P)  →  Outcomes (O)


The process may be known and visible (“white box”), or unknown and invisible (“black box”).  Evaluations can be based on:

·      The program’s own R, P, and/or O;

·      A comparison of the program’s R, P, and/or O against the legislation or mission statement that led to the program’s creation;

·      A comparison of the program’s R, P, and/or O against an “industry standard,” which is some kind of summarizing function of the R, P, and O of similar programs in the same or other organizations;

·      A comparison to the state of affairs that would obtain if the program did not exist. Practically, this means finding and measuring a comparable population to whom the program is not available. (For a lovely treatment of the logic and complications of this approach, see );

or on a combination of these.

Process-based evaluation.  You may be surprised (and, I hope, dismayed) to know that many government and non-profit programs are evaluated on only one dimension: Whether an organization and procedures have been put in place to serve the program’s ostensible mission.  Thus, a program can gain a positive evaluation just by having been created – even if it has never benefited (or even served) any clients!

These process-based, white-box evaluations consider only P, and do not necessarily involve comparisons to other programs.  Such evaluation mechanisms, which may even be specified in the legislation that created the program, can serve to protect a young program which has not had time to prove itself in terms of actual achievement.  On the other hand, political interests can restrict evaluations to the process-based in order to protect a useless pork-barrel program.  It is probably wise not to involve yourself with this kind of evaluation, especially if you hope for scientific publication. Or a scientific reputation.

Outcomes-based evaluation.  The rankings of MBA programs, for example, depend largely on the starting salaries of graduates, and the number of job offers per student at the time of graduation.  These are both “outcomes,” O. 

Universities game these rankings by admitting only applicants who already have high-paying jobs.  These applicants, if they do not fail miserably in graduate school, are bound to get more and higher-paying offers upon graduation.  In other words, the universities manipulate the outcomes by manipulating the inputs (resources), knowing that the latter are invisible to the evaluation process.

The rankings are then useful to students and their parents only to the extent that they indicate which schools are attended by career high-flyers. The rankings indicate nothing about quality education.

Similar things happen when primary or secondary schools require students to achieve a certain score on a standardized test before they may graduate.  District officials may then point with pride to the number of graduates and their average test scores.  However, teachers have been “teaching to the test,” thus robbing their students of the benefits of a liberal and well-rounded education. This is essentially a misconstruction of O leading to a distortion of P.

Outcomes-based evaluations may be white-box or black-box. They may present interesting problems for management research (for example, the universities’ strategies for manipulating the resource inputs), but the evaluation philosophy itself is hardly rigorous.

One exception is peer review of research. This is output-oriented, and rightly so. Journal reviewers do not care whether it cost $1 or $1 million to produce the submitted paper, but only whether the research is correct and important.

Resource- and process-based evaluation.  The organizations that accredit universities and university programs look at process P (whether the registrar keeps more or less accurate records) and at resources (the size of the library collection, the number of student computer stations, the number of tennis courts, and so on). Curiously, they rarely look directly at learning outcomes O, but do concern themselves with whether the university has a procedure P for measuring learning outcomes! 

By favoring programs with high admission standards (that is, programs that reject a high proportion of less-qualified applicants), accreditors actually encourage the kind of inputs manipulation mentioned above.  A program that excels at transforming low performers or impaired students has little chance of winning accreditation. 

At this writing, the US Department of Education (which accredits the accreditors) is urging accrediting organizations to consider learning outcomes more explicitly than before.  This is probably a step in the right direction, but the accrediting organizations are resisting it.

Input-output based evaluation.  At the scientific pinnacle of evaluation philosophies are those that base evaluations on some operationalization of “O divided by R” or “O minus R.” Organizations that effectively turn small amounts of resources into substantial outcomes earn high evaluations.  The usual business measures of return on investment and return on assets are input-output based evaluations.  So are economic measures of productivity, and operations-research based measures of efficiency.

Input-output based evaluations tell an organization, “Be creative with your processes, because we don’t even look at them.  Just get the maximum of outcomes per unit of resources.”  Of course, evaluators could just as well say, “Use the minimum of resources per unit of outcome.”  Either way, these evaluations are conducive to innovation, encouraging the evaluated program to develop new and better processes.

The pursuit of maximum output for minimum input can result in processes that are exploitative (child labor or sweatshops in the manufacture of clothing) and create undesirable externalities (strip mining, urban sprawl, traffic congestion).  Though input-output based evaluations are the most rigorous scientifically, they are not always the best from an ethical viewpoint. Questions for the ethical scientist then become: Would a different evaluation method encourage more ethical behavior in the program?  Should the “black box” be broken open and a light shone in?

Input-output based evaluations are as applicable to non-profit programs as to for-profit entities. One example: The Austin Technology Incubator, a joint program of the University of Texas and the City of Austin, cradled start-up companies that grew and created high-paying jobs.  The incubator bragged, “Two thousand dollars of public investment creates one job at ATI.”  This compared with the $20,000 required to create one job in most other public job-creation programs, and cast ATI in a good light.

Other O/R or O-R measures might include “person-hours to resolve one postal customer complaint,” or “average reduction in work accidents after four hours of safety training.”

Naturally, real programs use several kinds of resources and produce several kinds of outcomes, not all of which can be expressed in dollar units (especially in non-profit situations).  Nevertheless, they can be analyzed rigorously.  For example, one student’s 2006 dissertation, a particularly good one, used a combination of Data Envelopment Analysis (DEA) and the Malmquist productivity index to evaluate the performance of Egyptian banks.
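To make the comparative, input-output approach concrete, here is a minimal sketch of the CCR model of DEA (the class of method the dissertation used) applied to three hypothetical programs with a single input and a single output. All figures, and the choice of one input and one output, are illustrative assumptions, not data from any real study:

```python
# Minimal DEA (CCR, input-oriented multiplier form) via linear programming.
# For each program j we solve: max u·y_j  s.t.  v·x_j = 1,
# u·y_k - v·x_k <= 0 for all programs k, and u, v >= 0.
import numpy as np
from scipy.optimize import linprog

# Hypothetical data: rows are programs (DMUs).
inputs = np.array([[2.0], [4.0], [8.0]])   # e.g., public funds consumed
outputs = np.array([[1.0], [1.0], [2.0]])  # e.g., jobs created

def ccr_efficiency(j, X, Y):
    """CCR efficiency score of program j, in (0, 1]."""
    n, m = X.shape  # n programs, m inputs
    _, s = Y.shape  # s outputs
    # Decision variables: [v_1..v_m (input weights), u_1..u_s (output weights)]
    c = np.concatenate([np.zeros(m), -Y[j]])           # linprog minimizes, so -u·y_j
    A_ub = np.hstack([-X, Y])                          # row k: u·y_k - v·x_k <= 0
    b_ub = np.zeros(n)
    A_eq = np.concatenate([X[j], np.zeros(s)]).reshape(1, -1)  # v·x_j = 1
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=(0, None))
    return -res.fun

for j in range(len(inputs)):
    print(f"Program {j}: efficiency = {ccr_efficiency(j, inputs, outputs):.2f}")
```

With one input and one output, the score reduces to each program’s O/R ratio relative to the best observed ratio; with many incommensurable inputs and outputs, the linear program finds the weights most favorable to each program, which is what makes DEA attractive when outcomes cannot all be expressed in dollars. Note that the evaluation uses only each program’s R and O: the process P stays a black box.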

A program’s processes may be transparent, but input-output based evaluations treat them as black-box.  Naturally, an organization evaluating its own operations is in a white-box situation. It can use the top-level input-output analysis as a trigger for “drilling down” to evaluate lower-level processes, following diagnostics, fixing problems, and identifying areas for innovation.  However, an evaluation using DEA and other methods of comparative evaluation is unlikely to obtain process data from all the comparative organizations.  This evaluation will remain “black-box.”

The “boards” that accredit professional people – the APA for licensed psychologists, hospital medical review boards for physicians, state bar associations for lawyers, etc. – base their decisions partly on the ethical behavior of the individual professional.  Philosophers have not yet determined whether ethics is a matter of process or of outcome, so the evaluation of professionals cannot be cleanly slotted into one of the above categories.

In 2007, the Pentagon delivered to the US Congress a self-evaluation of its performance in Iraq.  The evaluation was intended to influence the direction of funding for America’s continuing military venture in that country.  The Pentagon report, however, did not specify what methodology was used to arrive at the evaluation scores, and military spokespeople would not respond to questions about the methodology.  Scientists need not feel outrage over this charade, but should understand that it is a political negotiation and not a true evaluation exercise, and different methods of analysis are called for.

The Iraq war of the George W. Bush administration also featured ever-changing benchmarks.  As time passed and it became clearer that goals of nation-building, insurgency suppression, and control of neighborhoods would not be met, the White House revised the goals in a downward direction.  This maneuver allowed the President to use the words “success” and “victory” more freely.  The outcomes-orientation of the Bush strategy also steered public attention away from the lives lost and money expended in the war. An evaluation in which the program names (and even worse, revises) its own criteria is meaningless at best – unless it is a prelude to confirmation by an independent external evaluator.

Most recently (October 7, 2009), OMB Director Peter Orszag issued a MEMORANDUM FOR THE HEADS OF EXECUTIVE DEPARTMENTS AND AGENCIES. The memo says, “Many important [Federal] programs have never been formally evaluated -- and the evaluations that have been done have not sufficiently shaped Federal budget priorities or agency management practices.” Recognizing the value of comparative evaluations, Orszag adds, “And Federal programs have rarely evaluated multiple approaches to the same problem with the goal of identifying which ones are most effective.”

Among the remedies he suggests are “a new inter-agency working group to promote stronger evaluation across the Federal government,” and “new, voluntary [and competitively funded] evaluation initiative[s].” The bad news is that “most activities related to procurement, construction, taxation, and national defense are beyond the initial scope of this initiative.  In addition, because drug and clinical medical evaluations have independently received extensive discussions, they are also excluded.” One wonders what politics led to these exclusions.

A contributor to the SciSIP listserv used the phrase "culture of evaluation" – which sounds nice, but rather less attractive than “culture of accomplishment.” Her expanded idea, however, shows the value of an evaluation culture: “Over time, [Orszag’s proposals] may help make evaluation more of a core value across agencies instead of something that is seen largely as a threat and/or an imposition….  Hopefully, the S&T policy agencies will be able to… assist other agencies to build more scientific approaches to evaluation.”

Her further suggestion, “It would be interesting to set-up an evaluation of this new policy on evaluation,” sounds like terrible bureaucratism, as does Orszag’s revelation that “Many agencies lack an office of evaluation.” Institutionalizing permanent evaluation offices sounds dangerous; who will evaluate the evaluators? 

When you commence an evaluation project:

  • Be sure you know what kind of evaluation is allowable. Understand any political constraints on the evaluation, and its ethical dimensions.  

  • Strive for a common vocabulary with your client.  

  • If possible, push for an evaluation that advances the program’s mission.  Usually, this will be a comparative, input-output based evaluation.