Most countries in the world have some form of accountability system to measure school quality and student outcomes, and to provide transparency over how taxpayers’ money is spent on schooling. Examples include high-stakes tests, school inspections, and report cards and monitoring systems. These systems often have an element of control and compliance, while also aiming to improve student outcomes. Ofsted, England’s education inspectorate, captures this in its famous strapline of aiming to ‘improve lives by raising standards in education and children’s social care’; similar wording can be found in other parts of the world.
It is no surprise, then, that many academic studies have tried to establish whether these systems meet their promise of benefiting the schools and education systems they test or inspect. Measuring the effects of high-stakes tests, inspections or other types of accountability is, however, inherently complex and requires a systems approach to data collection and analysis. Here is why:
In 1975, Donald Campbell introduced his famous law[1]:
“The more any quantitative social indicator is used for social decision making, the more subject it will be to corruption pressures and the more apt it will be to distort and corrupt the social processes it is intended to monitor.”
His law has proven relevant for many contemporary accountability systems, particularly high-stakes ones. Daniel Koretz (2017)[2], for example, has presented mounting evidence that high-stakes tests in the United States lead to score inflation, so that rising scores no longer represent actual improvements in learning: teachers learn about the types of items and content in the test and reorient their instruction accordingly. A similar strategy of ‘teaching to the inspection’ (Ehren, 2016) can be found when inspectors observe lessons to measure teaching quality. A teacher once told me that staff would agree to have their students sing a particular song to alert colleagues further down the corridor to the unannounced arrival of an inspector in the school. This would give those colleagues time to switch their teaching to the subject scheduled for that slot on the school timetable.
And who would blame them for doing so? Monitoring and inspection, even with the best intentions of providing only an evaluative assessment of strengths and weaknesses, are often high stakes and put someone’s work under a magnifying glass at a specific point in time. Over time, teachers and school leaders learn what is expected of them and align their work to ensure a good outcome against the standards by which they are assessed. As a result, measures lose their ability to distinguish between those who perform poorly and those who perform well. Understanding these responses, and how measures (tests, inspections or other types of monitoring) change teaching and learning for better or worse, requires a research design that looks at change over time and at what causes such changes.
Second, accountability systems are by nature composed of multiple actors (students, teachers, principals, school inspectors, district supervisors, etc.) who all interact in measuring school quality and in acting on these measures. In doing so they change and affect each other’s behaviour. The sum of all these interactions can produce an entirely different outcome than we would expect from each individual actor’s response to an accountability measure. An example is the reputation loop described by Honingh et al (…) for the Dutch school inspection system. This study found that a school in a highly competitive environment, whose neighbouring schools have positive inspection reports, would enter a downward spiral after receiving a negative inspection report: parents (particularly relatively well-educated and middle- to high-income ones) would move their children to another, higher-performing school, which would lead to a decline in school funding, difficulty in recruiting new, highly qualified staff, and further difficulty in improving the standard of education.
Such a reputation loop would not, however, occur for a school in a rural area where there is no school choice, or where parents are less preoccupied with good test or inspection outcomes.
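The contrast between these two contexts can be illustrated with a toy simulation. The sketch below is purely hypothetical: the inspection rule, the 10% enrolment loss per negative verdict, and the funding-to-quality dynamics are invented parameters for illustration, not taken from the Honingh et al study. It simply shows how the same negative report can trigger a downward spiral where parents can switch schools, but not where they cannot.

```python
# Hypothetical toy model of the 'reputation loop' (illustrative only;
# all parameters are invented, not drawn from the cited study).
# Each year: a quality threshold stands in for the inspection verdict;
# where parents have school choice, a negative verdict costs enrolment;
# funding follows enrolment; quality drifts toward what funding sustains.

def simulate(years, school_choice, quality=0.45):
    """Return the school's quality trajectory after a weak starting point."""
    enrolment = 1.0  # share of the local intake, normalised
    trajectory = [quality]
    for _ in range(years):
        verdict_negative = quality < 0.5        # crude inspection rule
        if school_choice and verdict_negative:
            enrolment *= 0.9                    # some parents move away
        funding = enrolment                     # funding follows pupils
        # quality drifts toward the level this funding can sustain
        quality += 0.3 * (funding * 0.6 - quality)
        trajectory.append(round(quality, 3))
    return trajectory

urban = simulate(10, school_choice=True)    # competitive environment
rural = simulate(10, school_choice=False)   # no school choice
print(urban[-1] < rural[-1])                # the loop drags the urban school down
```

The point of the sketch is not the numbers but the structure: the effect of an identical inspection outcome depends on the feedback channels available in the school’s context, which is exactly why no single average effect is meaningful.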
What this example shows is that the effects of accountability systems are highly contextual and the result of specific interactions: there is no average effect that can be interpreted meaningfully. Responses of individual actors are embedded in, and constrained by, the choice options available and the responses of other actors. We can only understand these by asking how, under what conditions and through which mechanisms accountability systems lead to change over time. Such questions cannot be answered by the type of experimental designs that tend to be seen as the gold standard for measuring the effects of interventions. Instead we need research designs that look at the functioning of systems and the interactions between actors that produce change over time.
The world is not simple; let’s not assume it is through our choice of research design.
[1] Campbell, D. T. (1975). “Assessing the impact of planned social change.” In G. M. Lyons (Ed.), Social Research and Public Policies: The Dartmouth/OECD Conference.
[2] Koretz, D. (2017). The Testing Charade: Pretending to Make Schools Better. Chicago: University of Chicago Press.