The Evaluation of Organization Development Interventions: An Empirical Study


A survey of Fortune 500 HR executives was used to test seven hypotheses concerning the evaluation of organization development interventions. Among the findings: uncertainty in the intervention creates the client’s expectation for multiple-level evaluations, and idiosyncratic investment in an intervention affects the desired level of evaluation rigor.

Organization Development (OD) is one of the rare exceptions for which businesses consistently pay out large sums of money for services but fail to determine if the services obtained are satisfactory (Jones, Spier, Goodstein & Sashkin, 1980). One survey of OD consultants, albeit dated, indicated that only 30% conduct evaluations more rigorous than a simple “gut instinct” determination that the intervention worked (Kegan, 1982). A recent critique of evaluation practices (Head, in press) supports the belief that very little has changed with regard to the frequency of rigorous evaluations.

Evaluation is a critical step in the OD process, appearing in almost all the process models (French, 1989). Cummings and Worley (1997, p. 30) make the case very succinctly:

Managers investing resources in OD efforts are increasingly being held accountable for results. They are being asked to justify the expenditures in terms of hard, bottom line outcomes more and more, and are using the results to make important resource allocation decisions about OD.

Evaluation provides the evidence on which to base decisions about maintaining, institutionalizing, and expanding successful programs, and modifying or abandoning unsuccessful ones (Weiss, 1972; Weisbord, 1973). Rossi and Freeman (1982) are more succinct regarding its purpose: Evaluation assures that the client “got what it paid for” (p. 16). There are two additional reasons to evaluate one’s work (Head, 2004). First, it is important to determine whether unforeseen problems arose as a consequence of the intervention; without appropriate evaluation, an organization cannot be sure that an intervention was trouble-free. Second, until OD begins to gather solid evidence for its “results,” it will not be taken seriously as either a science or a body of knowledge. Science grows upon established facts, and no amount of rhetoric, or “gut instinct speculation,” however well-intentioned, will change this.

While failure to evaluate is a significant issue, there also appears to be an opposite problem. Rutman and Mowbray (1983) believe that some clients might demand, and subsequently pay for, much more evaluation than is logically required. The problems that arise from this approach are fairly significant. First, evaluation is costly, and any unnecessary evaluation represents unwarranted expense. Second, processing evaluation data takes time, which can delay communicating critical information to the client. Finally, too much evaluation could represent an over-dependence on data in subsequent decision-making. In light of these consequences of overevaluating, it behooves both client and consultant to have some decision rules to guide them in establishing the needed level of rigor.

Head and Sorensen (1990) offered eight propositions to act as such a guide for establishing how much and what type of evaluation should be conducted. These propositions emerged through a review of various literatures tangentially related to OD, such as transaction cost economics and community and social agency program evaluation. Seven of these untested propositions serve as the hypotheses for the current study.

Head and Sorensen (1990) used Kirkpatrick’s (1960) four-level hierarchical training evaluation model as a base for their work. Kirkpatrick posits four evaluation levels arranged by increasing rigor (and cost). The selection of one level does not preclude, or require, the selection of any other; evaluation may take many forms, depending upon which levels the client desires to monitor and how much it wishes to spend. Evaluation’s lowest level is reaction, measuring how the employees involved in the intervention perceived the experience (for example, through a satisfaction survey). The second level is learning, where the consultant determines whether or not the employees gained the new system’s requisite knowledge and understanding, often through a test or “dry run.” Behavior is the third level, and involves establishing whether or not the intervention altered the employees’ actions and behaviors in the desired direction. Typically, this level requires direct observation and pre/post measurements. The most rigorous evaluation level is that of results. At this level the consultant establishes the intervention’s impact on the client’s bottom-line performance (cost/benefit analysis, long-term gains, and the like).

The original work by Head and Sorensen (1990) offers a complete discussion of how the current hypotheses were developed. Presented here, as research hypotheses, are seven of the eight propositions along with a brief summary of the logic for each.

The first proposition addresses uncertainty in the consulting situation. The greater the uncertainty in terms of problem identification and subsequent solution parameters, the more flexibility for action is required by the consultant, and the contract will have to be vague with regard to processes. However, this freedom to act creates a greater need to establish accountability. Therefore:

Hypothesis 1: The greater the uncertainty around the intervention, the more rigor will be demanded by the client in evaluations.

If a consultant has been frequently used by a client, a high degree of trust and confidence is likely to have been established between both parties, and the client will require less evaluation:

Hypothesis 2: The more frequently a client organization has used a specific consultant, the less rigor will be required in evaluation.

By using a consultant, clients are often obligated to incur expenses for items that cannot be used for any other purpose, or by any other consultant. The more the client is expected to make such idiosyncratic investments, the more rigorous an evaluation will be required:

Hypothesis 3: The more the organization is required to invest in the intervention, the greater the likelihood that rigorous and multiple level evaluations are to be required.

With a “single-shot” intervention, one not intended to be repeated, the client’s interest focuses on whether or not the project worked. However, if the intervention under review is a pilot project to be replicated throughout the organization, then both parties will be interested in the intervention’s implementation process as well as its results. Therefore:

Hypothesis 4: If the intervention is intended to be used frequently by the organization, the initial evaluation will stress reaction and learning. But if the intervention is not intended for replication, the evaluation will stress behavior and results levels.

Typically, variables at the behavior and results level require a substantial period of time to lapse after the intervention to show any significant change. At the same time, for various reasons, particularly political, the client often requires quick evaluations. Therefore:

Hypothesis 5: The faster evaluation information is needed, regardless of the reason, the less rigorous the evaluation will be.

On the other hand, if an intervention has been performed several times in the client’s organization, it can be assumed to be reliable, with all the process “bugs” resolved. At the same time, it should be relatively easy to determine long-term effects, as it is no longer an isolated incident:

Hypothesis 6: If an intervention has been performed frequently in an organization, then a results level evaluation will be the only evaluation requirement.

Finally, most behavior and results level variables require classical pre-test and post-test comparisons. Therefore, they require extensive planning and must be identified prior to any intervention’s implementation.

Hypothesis 7: If the intervention is initiated prior to the finalization of evaluation plans, it is less likely that behavior and results levels of evaluation will be required.


Survey Development

Two one-paragraph cases were written for each hypothesis. The cases represent opposite ends of the issue under study. For example, hypothesis 3 examines the impact of client investment. The first case involves a client investment of $300 and the second involves the expenditure of $7500 for materials. Each case was then randomly assigned to one of two survey forms.
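The random assignment of cases to forms can be sketched as follows. This is only an illustration: the paper does not describe the randomization mechanism, and the function name, the form labels "A" and "B", and the case labels are hypothetical.

```python
import random


def assign_cases_to_forms(n_hypotheses=7, seed=None):
    """Randomly split each hypothesis's two opposing cases across two forms.

    Hypothetical sketch: for every hypothesis, one case goes to form "A"
    and its opposite goes to form "B", with the direction chosen at random.
    """
    rng = random.Random(seed)
    forms = {"A": {}, "B": {}}
    for hypothesis in range(1, n_hypotheses + 1):
        cases = ["case_1", "case_2"]  # the two opposite-end cases
        rng.shuffle(cases)            # random direction of assignment
        forms["A"][hypothesis], forms["B"][hypothesis] = cases
    return forms


forms = assign_cases_to_forms(seed=2005)
for hypothesis in sorted(forms["A"]):
    print(hypothesis, forms["A"][hypothesis], forms["B"][hypothesis])
```

Under this scheme each form carries exactly one case per hypothesis, so every respondent sees seven cases, and no respondent sees both ends of the same issue.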

The survey cover page included a description of the four levels of evaluation (reaction, learning, behavior, results), some examples of how each might be measured, and a clear statement that the levels are not mutually exclusive. Next, the seven cases were presented. Following each case were questions in which respondents were asked to identify the types of evaluation they would require from the OD consultant. Respondents were also asked what percentage of the consultant’s fee they would like to see dedicated to the evaluation process. Because only 30 percent of the respondents completed the fee percentage questions, these data were excluded from the analysis.

The two survey forms were reviewed independently by five OD consultants. These individuals examined the survey for clarity, realism, and certainty that the cases were significant opposites. Based upon the reviewers’ feedback, the cases were revised. The final versions of the cases are located in the appendix.


Two hundred and fifty of each survey form, accompanied by a postage-paid return envelope, were mailed to Fortune 500 Human Resource Executives. The highest level HR individual from each company’s corporate office, or a large division, that could be identified was selected to receive the survey. Generally the surveys were mailed to the corporate vice president. The specific survey form each subject received was randomly assigned.


The survey resulted in two related sets of data. The first set involved the specific levels of evaluation selected for each case by the respondents. The second involved the number of evaluation levels the respondents selected for each case. The hypotheses were evaluated using Z tests on the transformed percentages.
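A minimal sketch of such a test follows. The paper does not say which transformation was applied; the arcsine square-root transformation, a common choice when comparing two proportions, is assumed here. The function name and the counts in the usage lines are hypothetical; only the group sizes of 52 and 42 come from the returns reported for the two forms.

```python
import math


def arcsine_z_test(selected_a, n_a, selected_b, n_b):
    """Z test on arcsine-transformed proportions (assumed transformation).

    Each proportion p is mapped to phi = 2 * asin(sqrt(p)), whose sampling
    variance is approximately 1 / n, so the difference in phi values is
    divided by sqrt(1/n_a + 1/n_b).
    """
    phi_a = 2 * math.asin(math.sqrt(selected_a / n_a))
    phi_b = 2 * math.asin(math.sqrt(selected_b / n_b))
    z = (phi_a - phi_b) / math.sqrt(1 / n_a + 1 / n_b)
    # Two-sided p-value from the standard normal distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical counts: 30 of 52 respondents on one form and 12 of 42 on the
# other selected a given evaluation level for the same hypothesis.
z, p = arcsine_z_test(30, 52, 12, 42)
print(f"Z = {z:.3f}, p = {p:.4f}")
```

The same function applies to both sets of data: the share of respondents selecting a given level, and the share selecting a given number of levels.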


Of the 500 mailed, 94 usable surveys were received, for a return rate of 18.8 percent. One form resulted in 52 returns (20.8%) and the other form had 42 responses (16.8%). It is important to remember in interpreting the results that the frequencies reflect what kinds of evaluations the respondents believed they would require from a consultant, and not what they actually did require in the past.
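The return rates can be recomputed from the reported counts. In the short sketch below, the labels "form_1" and "form_2" are hypothetical, since the paper does not name the forms; the counts of 250 mailed per form and 52 and 42 returned are taken from the text.

```python
MAILED_PER_FORM = 250
returned = {"form_1": 52, "form_2": 42}  # hypothetical form labels


def return_rate(n_returned, n_mailed):
    """Return rate as a percentage of surveys mailed."""
    return 100 * n_returned / n_mailed


per_form = {form: return_rate(n, MAILED_PER_FORM) for form, n in returned.items()}
overall = return_rate(sum(returned.values()), MAILED_PER_FORM * len(returned))

for form, rate in per_form.items():
    print(f"{form}: {rate:.1f}%")
print(f"overall: {overall:.1f}%")
```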

Hypothesis one stated that the more uncertainty surrounds a consulting situation, the greater the rigor that will be required from the evaluation. Therefore, in an ambiguous situation, one should expect multiple-level evaluations, while clear situations would require only one or two levels. There should also be a greater emphasis on results and behavior in ambiguous situations than in clear situations.

The survey results for hypothesis one (Table 1) show general support for the hypothesis. Significantly more executives required evaluation at all four levels (Z = 3.463, p

Contrary to what was expected, no significant difference was found at the behavior level. However, as predicted, evaluation at the results level was required significantly (Z = 2.146, p

Hypothesis two revolved around the belief that a client will, over time, develop a high level of trust in a consultant and as a result will require less in terms of evaluations from him or her. This trust would manifest itself in fewer levels being expected and an emphasis on the more basic levels of reaction and learning.

Results of the analysis for hypothesis two are found in Table 2. The table shows mixed results. As predicted, having a great deal of experience with a consultant resulted in fewer, or no, levels of evaluation being required (1 : Z = 2.406, p

Evaluation at the reaction level was required significantly more often (Z = 3.272, p

Hypothesis three predicts that the greater the idiosyncratic investment required from a client for a given consultant’s intervention, the more rigorous the evaluation (behavior and results levels), and the more levels of evaluation, that will be required. The data, located in Table 3, strongly support hypothesis three.

Table 1. Results for Hypothesis One: Uncertainty in Consulting Situation (Type and Number of Evaluation Levels Selected)

When clients had to make a large idiosyncratic ($7500 in material) investment they did tend to require multiple level evaluations (4: Z = 1.776, p

While there were no differences between the two conditions at the results and learning levels, the other two levels did support the hypothesis. Reaction (Z = 4.48, p

Table 2. Results for Hypothesis Two: Client Experience with the Consultant (Type and Number of Evaluation Levels Selected)

Table 3. Results for Hypothesis Three: Client Investment in the Intervention (Type and Number of Evaluation Levels Selected)

Hypothesis four has two dimensions. With a “single shot” intervention, one that is not intended to be replicated, the client will be interested only in the outcomes (results and behavior levels). Clients will be more interested in the intervention’s process (reaction and learning levels) for a pilot intervention that is intended to be used frequently by the organization.

The results for hypothesis four, located in Table 4, strongly contradict the hypothesis. It was clear that evaluation was much more important for the pilot project than for the single shot intervention. In fact, significantly more subjects reported not requiring any evaluation (Z = 2.666, p

Table 4. Results for Hypothesis Four: Intended Frequency of the Intervention (Type and Number of Evaluation Levels Selected)

Hypothesis five addresses the impracticality of requiring rigorous evaluations soon after an intervention’s implementation, as changes at the behavior and results levels require additional time to emerge. It is believed that evaluations needed immediately will stress reaction and learning, and evaluations with no such rush will tend to involve more levels and specifically include behavior and results.

Table 5. Results for Hypothesis Five: Speed with Which Evaluation Is Required (Type and Number of Evaluation Levels Selected)

Table 5 presents the results for hypothesis five. There appears to be mixed support. The type of evaluation selected is completely in line with expectations. Results (Z = 4.962, p

However, only one significant trend emerged with regard to the number of evaluation levels clients expected, and this was contrary to the prediction. When the evaluation was not needed immediately, significantly fewer respondents (Z = 2.664, p

The results level, according to hypothesis six, will be the only concern for clients who have frequently used a particular intervention. This assumption is based upon the belief that any process bugs will have been resolved and the client is comfortable with the reaction, learning, and behavior levels (otherwise the intervention would not have been repeatedly implemented).

Table 6 contains the results for hypothesis six. There was no significant difference for the results level between the conditions. In fact, the one significant difference that occurred was completely opposite to the hypothesized prediction. When the client was experienced with the intervention, significantly more subjects required evaluation at the reaction level.

Hypothesis seven reflects the need for behavior and results-level evaluations to have pre-intervention baseline data available for comparison purposes. If an intervention is initiated prior to the finalization of evaluation plans, any evaluation at these two most rigorous levels may be suspect, and as a result not desired.

Table 6. Results for Hypothesis Six: Frequency of Intervention Implementation (Type and Number of Evaluation Levels Selected)

Hypothesis seven received partial support, as can be seen from the results located in Table 7. There was no significant difference at the behavior level based upon whether the intervention began before the evaluation plan was established. However, significantly more (Z = 4.046, p

Table 7. Results for Hypothesis Seven: Intervention Implemented Prior to Evaluation Plan (Type and Number of Evaluation Levels Selected)


It is clear that organization development consultants must embrace evaluation as a necessary step in the action research process. The field cannot progress in knowledge and reputation without practitioners being able to “back up” claims of success with clear and hard data. In order to make these claims, we, as a field, must first gain insight into what criteria should drive our selection of evaluation data. The key to an effective evaluation is to provide the client with relevant information. Relevance, here, refers to data on the specific dimensions required by the client for future decision-making. It is critical to avoid both information underload and overload. Too little information and the client makes decisions “in the dark.” Too much information and not only is money wasted, but the client might be distracted by irrelevant information.

Seven theoretical propositions, based upon Kirkpatrick’s four levels of training evaluation, were tested using surveys received from 94 Fortune 500 HR executives. The first observation is that executives clearly want to see OD interventions evaluated in some manner. It also appears that straightforward logic, such as that used in developing these propositions, is a (but not the only) driving force behind deciding on the exact form the evaluation takes. The OD consultant can also be comforted in knowing that, in general, clients do recognize that specific intervention conditions will often place constraints on the evaluation process.

Regarding the specific hypotheses the results indicate:

1. Uncertainty in the consulting situation leads to the desire for multiple-level evaluation with an emphasis at the results (bottom-line) level.

2. A frequently used consultant will either not be required to evaluate his or her work by the client, or at most will have to evaluate at only one or two levels.

3. The larger an idiosyncratic investment a client incurs in an intervention, the more likely it will require multiple-level evaluations, particularly emphasizing the “hard results” areas of reaction and behavior. Consultants who require employees to take expensive “off-the-shelf” diagnostic surveys and training materials will find this result particularly important. This observation is not a criticism of such tools, but rather an acknowledgement that they do represent idiosyncratic investments.

4. One time, or “single shot,” interventions are more likely not to be evaluated or evaluated at only one level, but “pilot” interventions that are anticipated to be replicated throughout the organization will typically require evaluations at all levels as the client seeks assurance that the intervention’s process was appropriate. In other words, the “means to the end” are just as important as the end results themselves.

5. When evaluation data are not required soon after program implementation, clients will expect data on both the results and behavior levels. When the evaluation is needed in a hurried fashion, the clients will tend to expect less rigorous data focusing upon reaction and learning levels. This finding supports the notion that clients recognize time is needed “to do it right” and that consultants will not be expected to do the impossible. However, it does clearly establish that when given time, the consultant is expected to perform a rigorous evaluation.

6. A frequently used intervention in the client’s organization appears to require only reaction-level evaluations. The clients are confident that the intervention produces positive results, and therefore do not wish to pay for obtaining knowledge they already possess. At the same time, obtaining the inexpensive and easy reaction data can simply be their way of keeping an eye on the process to detect any anomalies or issues.

7. When an evaluation plan has been established prior to an intervention’s implementation, the client will expect rigorous, multiple-level assessments with specific interest in the results level. This finding acknowledges that, for various reasons, OD projects do not always follow the action research model. However, it also establishes that when the consultant is able to conform to action research, it includes the essential evaluation step.

An interesting final issue that was raised by several of the respondents must be noted. This study, and in fact the majority of the evaluation literature, focuses upon evaluations conducted by the consultant. The fact that the client may conduct its own evaluation, or the only evaluation, was not considered, but appears to be a viable, and understandable, option.

This oversight raises many questions for future organization development model building. While evaluation is an essential element of the action research model, there is no requirement as to who must perform it. It has always been assumed that the consultant is in the best position to judge success, much like a physician treating the ill. Is this a valid analogy for organization development? When one purchases a product or service, does one typically leave it to the salesperson/provider to determine one’s satisfaction with the purchase?

There are several questions that OD’s future models need to examine. When and why would a client choose to conduct a self-evaluation? When and why would a client choose to rely on a consultant to conduct an evaluation? When and why would a client have both parties evaluate? Would this decision involve duplication or separate dimensions?

This paper has provided some direction and answered some questions, but it is possible that more issues have been raised than resolved, highlighting the need for organization development, as a field, to evaluate its models and activities, just as intervention evaluation is required. After all, a field that focuses entirely upon implementing change in others must itself be willing to embrace change as well.


Cummings, T. & Worley, C. (1997). Organization development and change, 6th ed. Cincinnati: Southwestern Publishing.

French, W. (1989). A checklist for organizing and implementing an OD effort. In W. French, C. Bell & R. Zawacki (Eds.), Organization Development: Theory, Practice & Research, 3rd ed. (pp. 522-532). Homewood, IL: Irwin Publishing.

Head, T. (2004). The current state of evaluation in organization development practice: A harsh critique. Pro Change International Journal, 1, in press.

Head, T. & Sorensen, P. (1990). Critical contingencies for the evaluation of OD interventions. Organization Development Journal, 8, 58-63.

Jones, J., Spier, M., Goodstein, L. & Sashkin, M. (1980). OD in the 80s: Preliminary projections and comparisons. Group and Organization Studies, 5, 5-17.

Kegan, D. (1982). Organization development as OD network members see it. Group and Organization Studies, 7, 5-11.

Kirkpatrick, D. (1960). Techniques for evaluating training programs. Training and Development Journal, 13, 3-9.

Rossi, P. & Freeman, H. (1982). Evaluation: A systematic approach. Beverly Hills, CA: Sage Publications.

Rutman, L. & Mowbray, G. (1983). Understanding program evaluation. Beverly Hills, CA: Sage Publications.

Weisbord, M. (1973). The OD contract. OD Practitioner, 5, 1-4.

Weiss, C. (1972). Evaluating action programs. Englewood Cliffs, NJ: Prentice Hall.

Woodman, R. (1989). Evaluation research on organizational change: Arguments for a “combined paradigm” approach. Research in Organizational Change and Development, 3, 161-180.

Thomas C. Head, PhD

Peter F. Sorensen, Jr., PhD

Thomas C. Head, PhD

Thomas C. Head, PhD, is Professor of Management in the Heller College of Business, Roosevelt University. Tom has worked on projects for a wide range of clients from Fortune 15 to small start-ups. He has authored 13 books and over 50 scholarly articles. Tom is currently on the board of Pro-Change International and is the Region 4 President Elect for the ACBSP. His PhD is in Business Administration from Texas A&M University.

Contact Information

Walter E. Heller College of Business Administration

Roosevelt University

1400 N. Roosevelt Blvd.

Schaumburg, IL 60173

(847) 619-4866

Peter F. Sorensen, Jr., PhD

Peter F. Sorensen, Jr., PhD, is Director of the PhD program in OD and the MOB program at Benedictine University. In 2002 he was Chair of the ODC Division of Academy of Management. Peter was an invited, distinguished scholar to the first Academy of Management Conference on Global Change, and has received the “Outstanding OD Consultant of the Year Award” from the OD Institute.

Contact Information

Organizational Development Doctoral Program

Benedictine University

5700 College Rd.

Lisle, IL 60532

(630) 829-6222


Cases for Hypothesis 1

“You have recently hired a consultant to conduct employee workshops in order to solve a problem. It was fairly easy to discover and completely agree on the workshop’s content.”

“You have recently hired a consultant to conduct employee workshops in order to solve a problem. You are aware there is a problem; however, you are not sure exactly what it is, or what form the workshops will take. You have given the consultant great discretion in terms of the workshop’s content and duration.”

Cases for Hypothesis 2

“You have just hired a consultant for a department turnaround project. You have frequently used this consultant and she has proven very successful in the past.”

“You have just hired a consultant for a department turnaround project. While you have heard good things about this consultant, she has not worked for you before.”

Cases for Hypothesis 3

“The consultant that you have been working with on a fairly large change project has required you to purchase $7500 in materials that you’re sure can’t be used for anything other than the consultant’s project.”

“The consultant that you have been working with on a fairly large change project has required you to purchase $300 in materials that you’re sure can’t be used for anything other than the consultant’s project.”

Cases for Hypothesis 4

“You are just finishing a one-shot project with a consultant. Regardless of how well the project works there is no intention to repeat it anywhere else.”

“The project that you and a consultant have been working on is a pilot study. If all goes well it will serve as a model for similar projects in other plants.”

Cases for Hypothesis 5

“You have been told by the consultant that while a ‘current state of affairs’ is possible now, it would require at least four months for any accurate long-term trends to emerge from a just-completed project. The president tells you that while he does want a report, time is not of the essence.”

“You have been told by the president that a report on a project’s outcome will be needed immediately after the project’s completion. The consultant has told you that while a ‘current state of affairs’ is possible, long-term predictions are not practical in so short a time.”

Cases for Hypothesis 6

“Your consultant is just finishing another department’s communication workshop. Almost all of the departments have now received the workshop.”

“Your consultant is just finishing the second department’s communication workshop. The plan is that each department will eventually receive the workshop.”

Cases for Hypothesis 7

“Because of the immediacy of the situation, you had the organization development consultant begin making changes quickly. In fact changes were implemented before the consultant could obtain baseline data making before/after comparisons impossible.”

“Because of the immediacy of the situation, you had the organization development consultant begin making changes quickly. Fortunately the consultant was able to obtain baseline data so (if you wish) before/after comparisons are possible.”

Copyright O D Institute Spring 2005