Texas TAAS Scores Revisited

By Lorence, Jon

Data based on all eligible Texas public school students reveal that scores from the Texas Assessment of Academic Skills (TAAS) reading and mathematics tests are more valid than Klein, Hamilton, McCaffrey, and Stecher (2000) implied. New analyses based on both individual student and school-level scores support the concurrent validity of the TAAS. TAAS scores are moderately to highly correlated with scores from the Stanford-9 reading and mathematics tests. In addition, analyses based on individual students and aggregated school scores support the nomological validity of TAAS scores. In contrast to the findings reported by Klein et al., economically disadvantaged students in the present study obtain lower reading and mathematics TAAS scores than students of higher socioeconomic status. Schools with a greater percentage of students qualifying for federal lunch assistance also evidence lower average performance on the TAAS. The present results indicate that the nonrandom sample of 20 schools available to Klein et al. (2000) led to findings unrepresentative of Texas students and schools.

Background

The Texas Assessment of Academic Skills (TAAS), a criterion-referenced educational achievement test, has received considerable attention in recent years. Results from the TAAS tests became the basis for evaluating the performance of individual students, teachers, schools, and school districts in Texas. Beginning in the spring of 1994, Texas students in grades three through eight and grade ten were required to take tests in reading and mathematics. In 2003 the Texas Assessment of Knowledge and Skills (TAKS) tests replaced the TAAS as the foundation of the state's accountability system. The new state-mandated tests required additional testing in grades nine and eleven, which had previously been excluded from the TAAS. Despite the fact that Texas now uses a different criterion-referenced test to measure what students have learned in school, there are several reasons to investigate the validity of scores from the TAAS. First, the Texas high-stakes accountability system was the foundation for the No Child Left Behind (2002) legislation, which attempted to implement educational evaluation practices similar to those in Texas throughout all states. Supporters of this legislation argued that educational reforms in Texas were successful because TAAS scores had increased during the latter half of the 1990s. However, critics of the Texas educational model argued that TAAS scores were invalid because school administrators and teachers focused only on a narrow curriculum devoted solely to TAAS questions (e.g., McNeil & Valenzuela, 2001). The observed increases in TAAS scores over time were attributed largely to teaching only to the tests, rather than to an increase in “real” student learning. A RAND report by Klein, Hamilton, McCaffrey, and Stecher (2000) also argued that TAAS test scores were invalid.

Insofar as the Texas Education Agency has provided TAAS data to many researchers, analysts of TAAS data may find that their results will be challenged if the validity of TAAS scores is viewed as highly suspect. To illustrate, Allensworth (2005) and Shepard (2004) cited the Klein et al. (2000) study to indicate that educational findings based on TAAS scores are likely distorted and not believable. Given the prominence of the Klein et al. (2000) critique and the degree to which it is mentioned as challenging the validity of TAAS scores, it is worthwhile to reevaluate this RAND study’s findings.

Klein et al. (2000) Study

Although Klein et al. (2000) did not state the explicit strategies used to gauge the validity of TAAS scores, they utilized the general methods associated with concurrent and nomological validation. To assess the concurrent validity of the TAAS reading and mathematics tests, Klein et al. administered the Stanford-9 open-ended math test, the Stanford-9 multiple-choice science test, and a hands-on science test developed at RAND to approximately 2,000 fifth grade students in 20 public schools from one geographic region in Texas. These tests were given to students in the spring of 1997, a few weeks after the children had taken the TAAS reading and mathematics tests. Scores from the TAAS and non-TAAS tests were found to be moderately correlated when individual students were the units of analysis. Pearson correlations ranged from .42 to .53. These results were consistent with prior research, which often finds that students have comparable scores on different tests measuring similar concepts of academic achievement. However, when school mean scores on the TAAS and non-TAAS tests were correlated, the associations were very small, ranging from -.07 to .21. Because these correlations were much smaller than usually reported, the authors questioned the validity of the TAAS results.

Klein et al. (2000) further investigated two general hypotheses to assess the nomological validity of the Texas reading and mathematics scores. First, the researchers presumed that changes in TAAS scores over time should be consistent with changes in other educational achievement tests. Klein et al. compared changes in average state TAAS scores between 1994 and 1998 with the trends in the mean performance of Texas students who took the National Assessment of Educational Progress (NAEP) reading and mathematics tests in 1992 or in 1996. Large increases in average TAAS scores suggested considerable improvement in reading and mathematics. However, the researchers found that mean NAEP scores revealed little growth in educational achievement among Texas fourth and eighth graders; the slight increases in the NAEP scores of Texas students were similar to those observed nationally. These divergent trends in educational achievement from the TAAS and NAEP tests led Klein et al. to conclude that various practices by Texas teachers and administrators likely inflated TAAS scores. True levels of student learning were assumed to be much lower than indicated by the TAAS examinations.

Klein et al. (2000) also attempted to replicate a common finding in educational research, i.e., that students of lower socioeconomic status (SES) obtain lower test scores than do more affluent pupils. The authors assumed that the fifth graders of lower socioeconomic status in their sample would have lower scores on both the TAAS and non-TAAS tests. Although Klein et al. observed correlations between economic status and test scores in the hypothesized direction on the non-TAAS examinations when using individual pupils as the units of analysis, the correlations based on the TAAS tests were virtually zero. Moreover, when analyses were aggregated to the school level, the mean number of TAAS questions correctly answered was not associated with school average SES in the anticipated direction. The authors reported a curvilinear relationship between average school SES and mean school TAAS math scores. Conversely, greater percentages of low SES students were significantly associated with lower average scores on the non-TAAS tests. Klein et al. acknowledged that their findings pertaining to the associations between student and school economic standing and student TAAS scores could be attributable to their nonrandom sample of schools. They also stated that “We are therefore reluctant to draw conclusions from our findings with these schools or to imply that these findings are likely to occur elsewhere in Texas” (Klein et al., 2000, p. 15). Despite the authors' admonition, however, readers of their paper will likely infer that the findings reported between student socioeconomic background and test scores demonstrate the invalidity of TAAS scores.

Purpose of the Present Study

This paper attempts to replicate two sets of findings Klein et al. (2000) reported in their evaluation of TAAS test results:

1. The present analyses attempt to assess the concurrent validity of TAAS scores in a manner parallel to that of the RAND researchers. The concurrent validity of TAAS scores is ascertained by examining the relationship between student responses to the state's criterion-referenced TAAS tests and the Stanford-9 tests.

2. Next examined are a limited number of hypotheses pertaining to the nomological validity of TAAS scores. The relationship between student socioeconomic status and academic achievement scores is investigated using both individual-level and school-level analyses. Unlike Klein et al. (2000) who examined data only from fifth graders, the current analyses disaggregate data by all tested grade levels. The present analyses also examine the association between student economic standing and performance on non-TAAS tests.

Data

To assess the impact of certain educational practices throughout the state, the Texas Education Agency (TEA) provided to the Sociology of Education Research Group at the University of Houston annual individual-level demographic and TAAS information from all Texas students enrolled in public schools from 1994 through 2000. Data from the spring of 1999 are examined to investigate the association between student economic status and performance on the TAAS. Reading and mathematics scores from the 1999 TAAS are selected because another data set available to gauge the validity of TAAS scores originated in 1999.

The TAAS tests were the only examinations of academic achievement administered to all public school students in Texas. However, some school districts in the state also required their pupils to take one other standardized test so that school officials could compare the performance of their students with others throughout the nation who had taken the same test. In the fall of 1997, a Texas metropolitan school district required that all its students take the Stanford-9 reading and Stanford-9 mathematics tests. The district did not administer the Stanford-9 again until the spring of 1999, a timing that made comparisons between the TAAS and the Stanford-9 tests more meaningful. Because the TAAS and Stanford-9 were administered within about a month of each other, students were at similar stages of academic instruction during the school year. Discussions with principals and teachers revealed that they viewed Stanford-9 results as having few consequences for school personnel. Instead, performance on the TAAS carried much greater importance for rewards and sanctions than did Stanford-9 scores. Consequently, principals and teachers did not emphasize to their students that the Stanford-9 tests were of the same importance as the TAAS tests. Unlike the TAAS, the Stanford-9 was a low-stakes test. Another advantage of comparing scores on the TAAS and Stanford-9 tests administered in the spring of 1999 is that teachers were relatively unfamiliar with the Stanford-9 questions, as the test had been given previously only in the fall of 1997. Teachers in the spring of 1999 would have found it relatively difficult to teach directly to the Stanford-9 or incorporate specific items from the Stanford-9 into their lessons, a common criticism of the TAAS.

Not all students took the TAAS; for example, students with severe learning disabilities were excluded. In addition, non-English speaking children (predominantly Hispanic in Texas) were given a standardized test in Spanish instead of the TAAS.1 Despite the exclusion of these students from taking the TAAS, it must be emphasized that many students with limited English, as well as pupils classified as being in special education, completed the TAAS. The percentages of students taking the spring 1999 TAAS tests throughout the state, as well as in the metropolitan school district providing Stanford-9 data, are shown in Table 1. Approximately 89% of students took the TAAS exams. About seven percent of the students were excluded because of serious mental impairment or acute emotional problems. The second major reason for test exemption was insufficient comprehension of English among about two percent of the student population. The participation rate of students in the metropolitan school district providing Stanford-9 data is almost identical to the state participation rate. Moreover, the reasons for exempting the metropolitan students from the TAAS were similar to those throughout the state.

Given that most of the students who did not take the TAAS probably had extremely low scores, their exemption likely reduces the number of observations in the left-hand tail of the distribution of test scores. Their omission from the analyses may slightly reduce the magnitude of computed correlations.
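Although the present paper does not quantify this attenuation, a minimal simulation with fabricated scores (not TAAS data) illustrates the restriction-of-range logic behind this caution:

    import numpy as np

    # Simulated, hypothetical scores only -- used to illustrate restriction of
    # range, not to reproduce any TAAS result.
    rng = np.random.default_rng(0)
    n = 100_000
    ability = rng.normal(size=n)
    test_a = ability + rng.normal(scale=0.6, size=n)   # e.g., a TAAS-like score
    test_b = ability + rng.normal(scale=0.6, size=n)   # e.g., a Stanford-9-like score

    full_r = np.corrcoef(test_a, test_b)[0, 1]

    # Remove the lowest-scoring 10% on test_a, mimicking the exemption of students
    # who would likely have fallen in the left tail of the score distribution.
    keep = test_a > np.quantile(test_a, 0.10)
    truncated_r = np.corrcoef(test_a[keep], test_b[keep])[0, 1]

    print(f"correlation, all simulated students: {full_r:.3f}")
    print(f"correlation, lowest 10% removed:     {truncated_r:.3f}")
    # The second value is modestly smaller, consistent with the expectation that
    # exemptions slightly reduce, but do not reverse, the observed association.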

To further assess whether students with Stanford-9 scores from the metropolitan district are similar to the population of Texas TAAS takers, the social and demographic characteristics of the two groups of students are shown in Table 2. Over 80% of the metropolitan students with test results are either Hispanic or African American. Only about 13% of the students in the metropolitan school district are non-Hispanic white compared to 48% of all students with scored tests in Texas.2 The large percentage of metropolitan Hispanic students results in over 17% of the tested pupils being classified with a limited command of English, in contrast to about seven percent of all students in the state with test results. Students in the metropolitan district are also more economically disadvantaged than students throughout the state. Approximately six percent of the students with eligible test scores are labeled special education. The vast majority of students in this category are not severely mentally challenged, but are listed as having visual, auditory, or mobility impairments. As is evident in Table 2, the demographic characteristics of the metropolitan students answering both the TAAS and Stanford-9 differ from other students in the state with scored TAAS tests. However, should the analyses yield findings from the metropolitan students similar to those based on the population of Texas students, the demographic differences in Table 2 will be of less concern.

Findings

Descriptive Statistics of TAAS and Stanford-9 Tests: Individual Students

Like Klein et al. (2000), the present study assesses whether students who performed well on the TAAS also obtained high scores on another test. However, the Stanford-9 tests given in the metropolitan school district differed from those Klein and his colleagues used. As previously mentioned, the RAND researchers administered the Stanford-9 science test, the Stanford-9 open-ended mathematics test, and a hands-on science test developed by two of the authors. Students in the metropolitan school district were required to take only the closed-ended Stanford-9 reading and mathematics tests.

The maximum numbers of correct answers (raw scores) for each grade level's tests are given in Table 3. The third grade TAAS reading test contained only 36 questions, while eighth and tenth grade students could obtain a maximum score of 48. For the TAAS mathematics test, the maximum possible score in eighth and tenth grade was 60, while third graders could obtain a score of only 44. The Stanford-9 reading and mathematics tests required students to answer more questions. The maximum possible raw score for the Stanford-9 reading test in each grade was 84. Among primary and middle school students, the maximum number of correct answers for the Stanford-9 mathematics tests ranged from 76 for third graders to 82 for eighth graders. The highest possible mathematics score among sophomores was 48; high school students were asked fewer mathematics questions, but higher-order mathematical concepts pertaining to algebra and geometry were tested.

Means and standard deviations for both the TAAS and Stanford-9 tests among individual students are shown in Table 4. Summary statistics for the examinations are given for the seven grades in which students took both the TAAS and Stanford-9 reading and mathematics exams. The mean values indicate the average number of questions correctly answered. In general, the average number of questions correctly answered on the reading and mathematics TAAS examinations increases with grade progression, partially because the number of items tested increases with grade level. The number of pupils in each grade with answers for both the TAAS and Stanford-9 tests is also given.

Although not shown, histograms of the distributions and skewness statistics of TAAS scores and Stanford-9 scores reveal that the test results evidence considerable negative skew. In each grade, TAAS scores are somewhat more negatively skewed than the Stanford-9 scores. Some readers may be concerned that the absence of normal distributions will result in attenuated estimates of the correlations between the two types of tests. However, Nunnally (1978, p. 141) states that if the general shapes of two sets of scores are similar (whether normal or not), calculated correlations between the two variables will not be adversely affected. Because both TAAS and Stanford-9 test scores are negatively skewed, their calculated correlation should not be distorted.
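To make Nunnally's point concrete, the following sketch uses simulated number-correct scores (an assumed binomial item model, not TAAS or Stanford-9 data) in which easy items push most examinees toward the ceiling; both simulated tests are clearly negatively skewed, yet the Pearson correlation still recovers their shared association:

    import numpy as np
    from scipy.stats import pearsonr, skew

    # Simulated illustration of Nunnally's argument: two similarly (negatively)
    # skewed score distributions can still yield a meaningful Pearson correlation.
    rng = np.random.default_rng(1)
    n = 50_000
    ability = rng.normal(size=n)

    def number_correct(ability, n_items, easiness):
        # Easy items push most students near the ceiling, producing negative
        # skew, roughly analogous to raw scores on a minimum-skills test.
        p_correct = 1.0 / (1.0 + np.exp(-(ability + easiness)))
        return rng.binomial(n_items, p_correct)

    taas_like = number_correct(ability, n_items=40, easiness=1.5)
    stanford_like = number_correct(ability, n_items=80, easiness=1.5)

    print("skewness:", round(skew(taas_like), 2), round(skew(stanford_like), 2))
    print("Pearson r:", round(pearsonr(taas_like, stanford_like)[0], 2))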

Descriptive Statistics of TAAS and Stanford-9 Tests: School Analyses

In addition to examining relationships derived from individual students, aggregate school-level correlations are calculated in an attempt to replicate the Klein et al. (2000) research. The mean school scores and standard deviations for the TAAS and Stanford-9 tests across grade levels appear in Table 5. These data are derived from the individual school averages, i.e., they are the means of the school averages. Also presented are the numbers of schools on which the correlations between school-average TAAS and Stanford-9 scores are computed. In 1999 there were 280 schools in the metropolitan district; however, 20 schools were excluded from the analyses. A few elementary schools were deleted because fewer than 10 students in a grade took the TAAS and Stanford-9 tests. Middle schools and high schools which enrolled only students with severe behavioral and emotional problems were also excluded from the analyses. Even with these deletions, the analyses are based on a total of 258 schools and over 74,000 students. As seen in Table 5, there were considerably more elementary schools than middle schools or high schools. Some elementary schools differed in the grades taught. For example, of the sixth graders located in 81 schools, some attended elementary schools while others were assigned to middle schools.

Concurrent Validity: Individual Student Analyses

Research Question 1: Do students performing well on the TAAS tests also obtain higher scores on Stanford-9 Tests?

Pearson correlation coefficients based on individual student responses from the TAAS and Stanford-9 examinations are given in Table 6. The linear associations between the TAAS reading and mathematics test scores are positive and fairly consistent across the seven grade levels. Correlations range from .67 to .75. Students with more correct answers on the TAAS reading test also obtain higher scores on the TAAS mathematics test. A similar pattern occurs between the number of correct answers on the Stanford-9 reading and mathematics tests. The correlations between the reading and mathematics subtests of the TAAS and Stanford-9 are fairly comparable within each grade level. Students who do well on the reading test also obtain more correct answers on the same-grade mathematics test. Further, the correlations between the TAAS and Stanford-9 tests measuring the same construct are also moderately positive. To illustrate, the association between TAAS and Stanford-9 reading performance in fifth grade is .75. Fifth graders who score high on the TAAS mathematics test are also more apt to correctly answer a larger number of questions on the Stanford-9 mathematics test (r = .78). Although the reading and mathematics tests measure different content, students with higher scores on the TAAS reading test evidence higher scores on the Stanford-9 mathematics test. Likewise, pupils with more correct answers on the Stanford-9 reading test also have higher scores on the TAAS mathematics test. The fact that the same-subject correlations (i.e., TAAS reading with Stanford-9 reading, or TAAS mathematics with Stanford-9 mathematics) are larger in magnitude than the cross-subject correlations (i.e., reading scores correlated with mathematics scores) implies that the TAAS and Stanford-9 examinations are measuring similar skills. In general, students who perform well on the TAAS are also likely to have better scores on the Stanford-9 regardless of the content area examined. TAAS scores have also been found to be highly related to performance on the Iowa Test of Basic Skills, the Metropolitan Achievement Test 7, and the Otis-Lennon norm-referenced tests (Dworkin et al., 1999).

The correlations observed in Table 6 between the TAAS and non-TAAS measures are higher than those reported by Klein et al. (2000). One reason for the difference may be that the present study analyzes more students from more schools than did the RAND study. The variability (as measured by standard deviations) in non-TAAS test scores reported by Klein et al. is somewhat smaller than that shown in Table 5. The larger measures of association obtained in the present study may also arise because the subject matter covered by the TAAS and Stanford-9 tests administered to the metropolitan students is more similar. Whereas the RAND research examined two science examinations, the present study focuses only on relationships between reading and mathematics tests. Observed correlations may also be higher among the metropolitan students because both the TAAS and Stanford-9 tests use a multiple-choice format. The Klein et al. analyses utilized an open-ended mathematics test and a hands-on science test.3 Similarity in response formats between tests may result in higher correlations than observed across tests using different answering procedures to assess learning achievement. Perhaps a more important explanation for the weaker correlations between TAAS scores and the RAND-administered science tests is that elementary and middle grade schools in Texas (as well as in many other states) have not emphasized science instruction to the same degree as reading and mathematics.
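The individual-level correlations discussed above can be expressed compactly as a grouped correlation matrix. The sketch below assumes a hypothetical student-level file and column names (student_scores_1999.csv, taas_reading, and so on); the actual TEA and district files are organized differently:

    import pandas as pd

    # Hypothetical layout: one row per student, raw (number-correct) scores for
    # each test, plus the tested grade. File and column names are illustrative.
    students = pd.read_csv("student_scores_1999.csv")
    score_cols = ["taas_reading", "taas_math", "stan9_reading", "stan9_math"]

    # Within each grade, the 4 x 4 Pearson correlation matrix contains both the
    # same-subject correlations (e.g., TAAS reading with Stanford-9 reading) and
    # the cross-subject correlations (e.g., TAAS reading with Stanford-9 math).
    for grade, grp in students.groupby("grade"):
        r = grp[score_cols].corr(method="pearson")
        print(f"Grade {grade}")
        print(r.round(2))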

Concurrent Validity: School Analyses

Research Question 2: Do schools with higher average TAAS scores also obtain higher average scores on the Stanford-9?

Like the RAND researchers, the present study calculated aggregate-level correlations in which the units of analysis were schools. The average numbers of correctly answered TAAS and Stanford-9 questions by grade were calculated for each school. Pearson correlation coefficients across schools are presented in Table 7. The correlation between mean school performance on the fifth grade TAAS reading test and mean school performance on the mathematics section of the fifth grade TAAS is .82, almost identical to the .85 that Klein et al. (2000) reported for the fifth graders they tested in 1997. Examining the same-subject test correlations across all the grades reveals that the schools in the metropolitan district with higher scores on the TAAS or Stanford-9 reading tests also reported a greater number of correct answers on the corresponding mathematics test. The average school-level correlation across the seven grades between the TAAS reading and mathematics tests is .88, while the mean correlation across grades between performance on the Stanford-9 reading and mathematics tests is .87. As seen in Table 7, for each grade, the correlation between the TAAS reading and math examinations is very similar to the correlation between performance on the Stanford-9 reading and math tests.

The present analyses, however, differ from those reported by Klein et al. (2000) when examining the school-level correlations between performance on the TAAS and Stanford-9 tests. Whereas the RAND researchers found no meaningful associations between mean school performance on the TAAS and mean school scores on the Stanford-9 (see their Table 4), the school-level correlations shown here in Table 7 are positive and substantial. The metropolitan schools with larger average TAAS reading scores also report higher average correct answers on the Stanford-9 reading test. Among the fifth graders, the correlation is .81. An identical correlation is observed among fifth graders taking the TAAS and Stanford mathematics tests. Across the seven grades in Table 7, the mean correlation between school-level scores on the TAAS reading test and the Stanford-9 mathematics test is .82 (i.e., the average of the seven correlations in the first row). Although somewhat smaller, the mean correlation across the seven grades showing the linear association between average TAAS mathematics scores and average Stanford-9 reading scores is .74. Schools with higher mean scores on the TAAS also have higher average scores on the Stanford-9, regardless of the subject matter tested. These aggregate level analyses further support the concurrent validity of TAAS scores.
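A parallel sketch shows how the school-level version of the analysis can be carried out: average each school's raw scores within a grade, then correlate the school means. As before, the file and column names are hypothetical:

    import pandas as pd

    # Hypothetical student-level file; column names are illustrative only.
    students = pd.read_csv("student_scores_1999.csv")
    score_cols = ["taas_reading", "taas_math", "stan9_reading", "stan9_math"]

    # Fifth grade example: compute each school's mean raw score on every test.
    grade5 = students[students["grade"] == 5]
    school_means = grade5.groupby("school_id")[score_cols].mean()

    # Drop very small schools, as described in the text (fewer than 10 tested pupils).
    counts = grade5.groupby("school_id").size()
    school_means = school_means[counts >= 10]

    # School-level Pearson correlations, analogous to one grade's panel of Table 7.
    print(school_means.corr(method="pearson").round(2))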

Nomological Validation

Support for the nomological validity of test scores occurs when student responses are associated in a theoretically consistent manner with other concepts. Like Klein et al. (2000), the present research investigates the general proposition that economically disadvantaged students will have lower TAAS scores and fewer correct answers on the non-TAAS tests. Two levels of analysis, one based on individual students and another on schools, are again presented. One hypothesis is that students of lower economic status should obtain fewer correct test answers. A second hypothesis is that greater percentages of economically disadvantaged students attending a school will be associated with lower mean school test performance.

Individual student socioeconomic status (SES) is measured by participation in the federal lunch program. Students enrolled for a free or reduced-price lunch are given a score of 1 while non-participants are coded 0. Although relying solely on participation in the federal school lunch program is a problematic indicator of a student's economic standing, it is often the only measure available from student school records. For school-level analyses, the percentage of students in the school enrolled in the free or reduced-price lunch program is the measure of school economic status most often reported. These indicators of student and school economic status are also the ones used by Klein et al. (2000). However, the current analyses differ from those of the RAND researchers because, in addition to the data from the metropolitan school district which provided data on both the TAAS and Stanford-9, individual-level data based on all Texas public school students are utilized. Further, all schools in Texas are available for examination.
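As an illustration of this coding (again with hypothetical file and column names), the individual-level measure used in the analyses that follow is simply the Pearson correlation between the 0/1 lunch-program indicator and the raw test score, and the school-level measure is the within-school mean of that indicator:

    import pandas as pd

    # Hypothetical student-level file; column names are illustrative only.
    students = pd.read_csv("student_scores_1999.csv")

    # Individual SES indicator: 1 = enrolled in free or reduced-price lunch, 0 = not.
    students["low_ses"] = students["lunch_program"].isin(["free", "reduced"]).astype(int)

    # Individual level: correlating a 0/1 indicator with a raw score yields the
    # point-biserial correlations reported by grade.
    for grade, grp in students.groupby("grade"):
        r = grp["low_ses"].corr(grp["taas_reading"])
        print(f"Grade {grade}: r(low SES, TAAS reading) = {r:.2f}")

    # School level: percent economically disadvantaged vs. mean TAAS reading score.
    schools = students.groupby("school_id").agg(
        pct_low_ses=("low_ses", "mean"),
        mean_taas_reading=("taas_reading", "mean"))
    print(round(schools["pct_low_ses"].corr(schools["mean_taas_reading"]), 2))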

Summary statistics of the individual students and school-level socioeconomic indicators from the metropolitan area studied are shown, respectively, in Tables 4 and 5. The means and standard deviations of individual TAAS scores in reading and mathematics derived from all public school students with scored tests are presented in Panel A of Table 8. TAAS scores are somewhat higher when computed from all students in the state than only from those in the metropolitan area under study. Variation in test scores is also greater when all students in the state are considered. The proportion of students throughout the state participating in the free or reduced-price school lunch program is smaller than among the metropolitan students. The percentage of economically disadvantaged students remains fairly similar across the elementary grades but decreases somewhat in higher grades.4

The averages of the mean school TAAS reading and mathematics scores, along with the percentage of economically disadvantaged students, are shown in Panel B of Table 8. Schools with fewer than 20 students in a grade were deleted from the analyses to help ensure that small schools did not distort the findings. The mean school reading and mathematics TAAS scores are slightly higher than those based on the metropolitan district schools. The percentage of students classified as economically disadvantaged is again lower across state schools overall than in the metropolitan schools previously analyzed. Nonetheless, the standard deviations based on all state schools are fairly similar to those computed from only the schools in the metropolitan school district with available Stanford-9 data.

Research Question 3: Do individual students from economically disadvantaged families obtain lower TAAS and Stanford-9 scores?

The calculated correlations between individual student social standing and individual test scores are given in Table 9. Panel A shows the correlations based on all Texas students with scored TAAS tests. Correlations computed from the metropolitan students with both TAAS and Stanford-9 scores are presented in Panel B. Although it may seem redundant to show the correlations from the metropolitan district, the latter correlations are presented as a check to determine whether the findings from the metropolitan district are consistent with those based on the state population. Similar relationships between student socioeconomic status and student academic achievement in both the metropolitan district and the full population of eligible Texas students would support the generalizability of the TAAS-Stanford-9 associations from the single school district reported earlier in the paper.

In contrast to the results Klein et al. (2000) reported, the correlations between individual economic status and individual TAAS performance shown in Table 9 are in the hypothesized direction. For the correlations derived from all Texas students, and for those in the metropolitan school district examined, economically disadvantaged students obtain fewer correct answers on the Texas reading and mathematics exams. The magnitudes of the TAAS correlations are fairly similar across grades. Although not large, the negative correlations in Table 9 based on the TAAS examinations are two to three times larger than those shown in Table 4 of the Klein et al. paper. The correlations between individual student economic standing and Stanford-9 scores presented in Table 9 are also larger than the measures of association Klein et al. reported. Further, the negative linear relationships between student SES and TAAS performance are fairly similar when comparing the correlations based on all Texas students with those based on pupils with scored tests in the metropolitan school district. The similarity of the correlations based on the population of Texas students to those from the single metropolitan district suggests that the earlier findings from the metropolitan students may generalize to the state as a whole. Moreover, the negative correlations between student socioeconomic background and the number of correct answers on the two Stanford-9 examinations given to the metropolitan school district students are also larger than those reported in the RAND analyses of fifth graders from 20 schools.

Research Question 4: Do schools with a greater percentage of economically disadvantaged students evidence lower average TAAS and Stanford-9 scores?

Aggregate-level analyses using correlations based on schools are presented in Table 10. One of the most anomalous findings Klein et al. (2000) reported, which they believed challenged the validity of TAAS scores, was the absence of a significant negative association between the percentage of economically disadvantaged students in a school and the average number of items correctly answered by the school's students on the required state examinations. Whereas the aggregate-level correlations between student SES and performance on the non-TAAS tests Klein et al. reported were negative and substantial, their school-level correlations between SES and TAAS reading and math scores were .13 and -.21, respectively. In contrast to the findings presented by the RAND researchers, the correlations shown in Table 10 support the nomological validity of the TAAS scores. For both the reading and math sections of the TAAS, greater percentages of students from low SES backgrounds attending a school were associated with lower average numbers of correctly answered TAAS questions. This pattern persists across all grades. Correlations based on almost all schools in the state range from -.43 to -.74.

Examinations of bivariate scatterplots (not shown) between school SES and mean school TAAS scores revealed no curvilinear relationships of the kind Klein et al. (2000) reported. Instead, straight lines best fit the data points in each grade level, indicating a linear relationship between school economic composition and test scores. Aggregate correlations based only on the metropolitan schools were also negative. Panel B in Table 10 shows that all but one of the correlations can be considered at least moderate in size. The smallest correlation is -.27, between school SES and the mean score on the sixth grade TAAS math exam. Similar to the linear associations presented in Table 4 of the Klein et al. paper, schools with less affluent students exhibit lower reading and mathematics scores on the Stanford-9 tests. Correlations ranging from -.45 to -.90 support the nomological validity of the Stanford-9 scores because they are consistent with the proposition that lower SES is associated with lower academic achievement.
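A simple way to perform the curvilinearity check described above is to compare a linear fit with one that adds a squared term; a negligible improvement in fit is consistent with the straight-line pattern reported here. The sketch below assumes a hypothetical file of school means:

    import numpy as np
    import pandas as pd

    # Hypothetical file of school-level means for one grade; column names illustrative.
    schools = pd.read_csv("school_means_grade5.csv")
    x = schools["pct_low_ses"].to_numpy(dtype=float)
    y = schools["mean_taas_math"].to_numpy(dtype=float)

    linear_coefs = np.polynomial.polynomial.polyfit(x, y, deg=1)
    quadratic_coefs = np.polynomial.polynomial.polyfit(x, y, deg=2)

    def r_squared(coefs):
        fitted = np.polynomial.polynomial.polyval(x, coefs)
        resid = y - fitted
        return 1.0 - resid.var() / y.var()

    print(f"R^2, linear fit:    {r_squared(linear_coefs):.3f}")
    print(f"R^2, quadratic fit: {r_squared(quadratic_coefs):.3f}")
    # If the quadratic term adds essentially nothing, the relationship between
    # school economic composition and mean TAAS scores is adequately linear.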

Aggregate-level correlations between mean school socioeconomic status and average scores on the required TAAS tests among the metropolitan schools are similar to, although somewhat smaller than, those observed with scores from the non-TAAS tests. Examining the magnitude of the standard deviations from the two kinds of tests, however, reveals that there is greater variation in mean scores based on the Stanford-9 tests than on the TAAS tests. It seems plausible that the larger number of items asked on the Stanford-9 allows for greater dispersion in mean school-level responses. The more restricted variation of the TAAS tests may result in somewhat lower correlations with the economic status of schools than observed for the Stanford-9 tests. Also, some of the substantive areas tested by the Stanford-9 are not aligned with the Texas curriculum.

Summary and Discussion

These analyses attempted to replicate a number of findings pertaining to TAAS scores that Klein et al. (2000) reported. At the individual level, the results based on data from a major Texas metropolitan area are consistent with those of the RAND researchers. Urban students with high scores on the state's mandatory accountability test also obtain high scores on the Stanford-9 reading and mathematics tests. Contrary to the school-level analyses of Klein et al., which found no linear association between mean responses on the TAAS and non-TAAS tests, the present data revealed large positive linear correlations between school-average TAAS scores and average Stanford-9 scores. Schools with a greater average number of correct answers on the TAAS tests also evidenced larger mean scores on the two Stanford-9 examinations. Analyses based on both individual students and schools support the concurrent validity of TAAS responses. These positive findings were observed across seven different grade levels. The TAAS reading and mathematics tests and their Stanford-9 counterparts apparently measure similar kinds of academic skills.

Neither the Klein et al. (2000) publication nor the present paper attempts an extensive nomological analysis of the Texas test scores. Rather than investigate many possible relationships which would support the validity of TAAS scores, the present paper examined two limited hypotheses: (1) the association between student socioeconomic status and test performance and (2) the relationship between average school economic standing and average test performance. Data based on school means from the single metropolitan school district used in the present study revealed that schools with a greater percentage of economically disadvantaged pupils also evidenced lower average scores on the Stanford-9 tests. The remaining findings reported here contradict those of Klein et al. Whereas the RAND researchers found no significant linear association between school economic standing and average school TAAS scores, data from all eligible schools in the state indicated that schools with a greater percentage of students from lower socioeconomic families obtained lower mean TAAS scores in both reading and mathematics. When using students as the units of analysis, Klein et al. found no significant negative correlation between student family status and student TAAS scores. However, the present analyses, based on all TAAS takers in Texas, demonstrated that economically disadvantaged students provided fewer correct answers on the TAAS reading and mathematics exams. The hypothesized negative relationships were replicated across all TAAS-tested grades in the state. Both school-level and individual-level analyses reported here support the nomological validity of TAAS scores.

A major reason for the difference between the findings reported here and those of Klein et al. (2000) is that they did not have access to the population of state data. Instead they used data from a sample of 20 schools and about 2,000 fifth graders from a single geographic region of Texas. It must be emphasized that the RAND researchers did not explicitly attempt to generalize their findings to the state; however, readers of their study may conclude that findings based on a nonrandom sample of schools and students apply to all Texas public schools. The major rationale of the present study was to investigate whether some of the negative findings in the RAND study were representative of Texas public school students. The current results are fairly similar to those of other studies which attempt to gauge the concurrent and nomological validity of test scores. Although the RAND researchers did not explicitly claim that student answers to the TAAS were invalid, readers of the Klein et al. (2000) paper may incorrectly assume that the results from their limited, nonrepresentative sample apply to schools throughout the state. The major conclusion derived from the present findings is that TAAS scores are more valid than the Klein et al. study implied: they meet conventional standards of concurrent validity, and the analyses based on all students and schools in the state provide partial support for their nomological validity. In short, TAAS scores are consistent with conventional definitions of measurement validity.

Using NAEP Results to Assess the Validity of State Accountability Scores

Critics of Texas TAAS results will likely contend that findings based on data from the 20 schools Klein et al. (2000) examined are irrelevant when set against their comparison of trends in TAAS scores with those from the National Assessment of Educational Progress (NAEP) examinations. An important question Klein et al. (2000) raise is whether the upward trends in TAAS scores observed during the 1990s demonstrate a true increase in student academic achievement. The present paper does not investigate this issue. Nonetheless, a few comments are offered addressing the usefulness of the NAEP for verifying trends in student performance within specific states.

Critics of educational accountability systems argue that results from standardized tests provide an inflated estimate of how much pupils have actually learned. Student responses on state-required tests are presumed to be corrupted because school personnel narrow instruction to the content of the required examination, rather than focus on the general concepts and skills such tests attempt to measure. Teachers are accused of emphasizing test-taking skills and how to answer the specific kinds of questions likely to appear on the required examination, instead of concentrating on the subject matter itself.

To obtain a more accurate measure of student learning trends, Linn (2000) recommended that more than one test be used to monitor student learning gains. Similar patterns of improvement in scores across two tests would imply that the findings from the test mandated by the accountability system are valid. He suggested comparing examinations used by states with results from the National Assessment of Educational Progress (NAEP) tests. NAEP scores are considered good benchmarks of student learning because the content of the exams was agreed upon by nationally recognized experts in their respective areas. Rigorous sampling procedures are also followed to help ensure that the pupils who agree to be tested are representative of the state student population. The fact that no one ever learns how well individual students or schools performed is also considered an important feature of the NAEP because no pressure exists to generate biased test results.

In addition to Klein et al. (2000), other studies have attempted to confirm the validity of increased TAAS scores by comparing state test results with the NAEP. Holland (2002) examined changes in NAEP scores from 1994 to 2000 alongside changes in TAAS scores over the same time span.5 He noted discrepancies between results on the state test and the NAEP among African American students. Whereas more African American students passed the TAAS reading exams over time, there was no change in the proportion of African American students who scored only at the basic level of learning defined on the NAEP. However, the mathematics scores of African American children improved in the lower tail of the NAEP distribution. Holland also concluded that NAEP data indicated a slight decrease in the mathematics gap between African American and white students.

Unlike Holland, who had access to the NAEP scores of individual students, Linton and Kester (2003) analyzed aggregate changes in TAAS and NAEP mathematics scores from 1996 to 2000. Substantial increases on the Texas examination were not observed on the NAEP. Only among Hispanic students were increases on the TAAS consistent with higher NAEP scores. Even though African American and non-Hispanic white students showed substantial gains on the TAAS mathematics test, increases in their NAEP scores were very slight. The growth in NAEP scores among Texas Hispanic students over the four years was also greater than that observed among African American and non-Hispanic white pupils. The small gains in NAEP mathematics scores among African American and white test takers in Texas were similar to those observed among their respective racial groups throughout the nation. Comparable to Klein and his colleagues, Linton and Kester concluded that alleged improvements in TAAS scores were illusory.

Differences in test results between the NAEP and TAAS are assumed to demonstrate that TAAS scores are invalid measures of change. However, critics of TAAS scores give little attention to the extent to which NAEP scores can actually be used to confirm the validity of changes in state test results. It is very likely that certain features of the NAEP may underestimate the degree to which academic achievement has changed within states. The National Assessment Governing Board (NAGB), which oversees the NAEP, has recognized the importance of this issue. Although NAEP data can certainly be useful in assessing changes in student educational achievement, the NAGB (2002) released a report listing a number of factors which potentially limit the degree of convergence between state test results and those from the NAEP:

Potential differences between NAEP and state testing programs include: content coverage in the subjects, definitions of subgroups, changes in the demography within a state over time, sampling procedures, standard-setting approaches, reporting metrics, student motivation in taking the state test versus the NAEP, mix of item formats, test difficulty, etc. (NAGB, p. 16)

The NAGB report suggested that the more state assessment tests differ from the NAEP on these dimensions, the greater the likelihood that trends in student performance on state tests will diverge from those indicated by the NAEP. Consequently, the NAGB report recommended that comparisons between NAEP data and state tests be made with caution. Researchers have thus far paid little attention to differences between state accountability tests and the NAEP which might partially explain the different trends in student learning gains observed on the two kinds of tests.

Given that the federal government continues to emphasize the use of standardized tests to increase accountability in education, many states have developed testing procedures to gauge growth in student academic achievement. It is highly probable that trends in student scores from state accountability tests will be compared with NAEP results. Thus far there has been little research investigating the extent to which differences in the kinds of tests administered by states and the NAEP may produce discrepant findings. Although many educational researchers will likely favor NAEP results as a better measure of student learning gains, a worthwhile research endeavor for the future would be to investigate the degree to which differences between state accountability tests and the NAEP result in contradictory findings. For example, variation in test content could be examined. The social and demographic characteristics of students taking the NAEP could be compared with those of the population of students taking a state assessment test. Levels of student motivation on mandatory and voluntary tests could also be investigated. This line of research may enable administrators, educators, officials, and other interested parties to better evaluate student academic progress.

References

Allensworth, E. M. (2005). Dropout rates after high-stakes testing in elementary school: A study of the contradictory effects of Chicago's efforts to end social promotion. Educational Evaluation and Policy Analysis, 27(4), 341-346.

Dworkin, A. G., Lorence, J., Toenjes, L. A., Hill, A. N., Perez, N., & Thomas, M. (1999). Comparisons between the TAAS and norm- referenced tests: Issues of criterion-related validity. Houston, TX: University of Houston, Sociology of Education Research Group.

Holland, P. W. (2002). Using NAEP data to confirm progress in “State C.” Appendix B in National Assessment Governing Board (2002, March 1), Using the National Assessment of Educational Progress to confirm state test results. Retrieved December 1, 2005 from http://www.nagb.org/pubs/color_document.pdf.

Klein, S., Hamilton, L., McCaffrey, D., & Stecher, B. (2000). What do test scores in Texas tell us? Education Policy Analysis Archives, 8(49). Retrieved May 19, 2005, from http://epaa.asu.edu/epaa/v8n49/.

Linn, R. L. (2000). Assessments and accountability. Educational Researcher, 29 (2), 4-16.

Linton, T. H., & Kester, D. (2003). Exploring the achievement gap between white and minority students in Texas: A comparison of the 1996 and 2000 NAEP and TAAS eighth grade mathematics test results. Education Policy Analysis Archives, 11(10). Retrieved May 19, 2005, from http://epaa.asu.edu/epaa/v11n10/

McNeil, L., & Valenzuela, A. (2001). The harmful effect of the TAAS system of testing in Texas: Beneath the accountability rhetoric. In G. Orfield & M. L. Kornhaber (Eds.), Raising standards or raising barriers? (pp. 127-150). New York: Century Foundation Press.

National Assessment Governing Board (2002, March 1). Using the National Assessment of Educational Progress to confirm state test results. Retrieved December 1, 2005 from http://www.nagb.org/pubs/color_document.pdf.

No Child Left Behind Act of 2001, 20 U.S.C. § 6301 (2002).

Nunnally, J. C. (1978). Psychometric theory (2nd ed.). New York: McGraw-Hill.

Shepard, L. (2004). Understanding research on the consequences of retention. In H. J. Walberg, A. J. Reynolds, and M. C. Chang (Eds.), Can unlike students learn together? (pp. 183-202). Greenwich CT: Information Age Publishing.

Texas Education Agency (n.d.). TAAS participation rates by student characteristics. Retrieved December 30, 2005 from http://www.tea.state.tx.us/perfreport/aeis/99/state.html

Texas Education Agency (2005a). Percent meeting minimum expectations on the TAAS, 4th grade. Retrieved December 30, 2005 from http://www.tea.state.tx.us/student.assessment/reporting/results/swresults/august/g4all_au.pdf

Texas Education Agency (2005b). Percent meeting minimum expectations on the TAAS, 8th grade. Retrieved December 30, 2005 from http://www.tea.state.tx.us/student.assessment/reporting/results/swresults/august/g8all_au.pdf

Jon Lorence

University of Houston

Copyright Educational Research Quarterly Jun 2008
