A study of how specific principal behaviors affect teacher and student performance
Several instruments were used to collect measures in this study. Principal-completed and teacher-completed QIRs were used as measures of instructional practices. Classroom grades and discipline referrals were used as measures of student performance. Teacher and student surveys were used as measures of the frequency and focus of teacher conversations.
QIR (Quality Instruction Rubric)
The quality instruction rubric (QIR) was used as a measure of teachers' instructional practices and as part of the treatment to structure teacher self-assessments and guide instructional conversations between individual teachers and principals, as referenced in Appendix C (Quality Instruction Rubric).
Procedures. The QIR was completed individually by each teacher and by the team of principals in August and May of the year of full implementation (2008-2009). As a measure of teacher instructional practices (independent variable) and as part of the treatment, each teacher completed a self-evaluation of his or her instructional practices using the QIR document in the fall of 2008 (pretest) and then again at the conclusion of the school year in May 2009 (posttest).
Validity. Development of the QIR. The QIR, based on the work of Charlotte Danielson (1996), as discussed in chapter two, was developed and adapted by teachers, principals, central office personnel, and teacher union personnel in the Kenton County School District. The process began with a committee of four principals (two elementary, one middle, one high), four central office personnel (superintendent, deputy superintendent, assistant superintendent, special education coordinator), and eight teachers (all members of the teacher union, including the president and assistant president) who indicated a need for a better evaluation tool to use with teachers. The high school principal on this committee was one of the two researchers who conducted this study. The committee began to investigate work on teacher quality and agreed that the works of Danielson (1996) and Halverson et al. (2004) were most helpful for improving teaching in the Kenton County School District.
The committee cited a general dissatisfaction with the current evaluation system among many employees due to seemingly vague descriptors and ambiguous ranking of instructional practices. This made Danielson's work particularly interesting to the committee because of the core beliefs associated with this specific body of research. First, the committee felt a common language of instructional practices between administrators and teachers would provide a much more valid evaluation of teacher instructional practices. Second, the committee agreed that a common language of instructional practices would provide the opportunity for coaching teachers to proficiency. Finally, the committee believed that a continuum of quality of instructional practices would provide teachers specific feedback on how instructional practices could be improved year to year.
Next, the committee laid out a plan for developing an instructional rubric which could be used to evaluate teachers. The first drafts were developed by large groups of teachers and building level administrators and led by members of the original committee working together in groups of four to five to define good teaching. The rubric was then reviewed by a group of teachers throughout the district chosen by their principals for this committee because they were perceived as good teachers.
Next, the rubric was field tested by groups consisting of principals, central office staff, and teachers in several hundred classrooms throughout the school district in an attempt to establish how the instrument would perform in practice. Field tests were conducted at each school by a team of four observers observing three to five classrooms once each month throughout the 2007-2008 school year. Included in the field tests were three high schools, four middle schools, and eleven elementary schools, for an estimated 700 field tests. Each observer team using the rubric during a classroom observation included two building principals and two central office personnel. Occasionally (less than 10% of the time), a teacher would accompany this team as a fifth observer. Each observation lasted approximately ten minutes. After a team of observers left a room, they would debrief for four to five minutes to discuss what was witnessed during the observation (based on the QIR) and to discuss good coaching tips for the teacher. While principals only participated in field tests at their own schools, the central office staff of five people involved in these field tests of the QIR were consistently observing in multiple buildings. The cross-building perspective of the five central office staff contributed to norming the use of the rubric across various school contexts in the district. When teachers participated in the field tests, they did so in a school different from their own.
From these field tests, many strands of the QIR were found to be redundant and thus combined or eliminated. With hundreds of field tests completed, the original committee (four principals, four central office personnel, and eight teachers) began making adjustments to drafts of the quality instruction rubric (QIR). Other slight changes were made to improve word consistency. For example, some elements used the word "consistently" under the proficient indicator while others used the words "most of the time." It was agreed by the committee that using a common language across indicators would be more beneficial for teachers and observers, and changes were made accordingly.
QIR training. To increase the validity of data from the teacher self-assessment on the QIR, training was provided to the faculty of Dixie Heights High School. Before completing the QIR as a self-assessment, the faculty at Dixie Heights High School received whole-group preliminary training on the intent and meaning (a self-reflection tool) of this instrument. While this may have been the first time some teachers actually examined the QIR, all teachers had received emails and drafts of the tool from the district office during its development the previous year. Additionally, three Dixie teachers had served on committees that developed the document. During the training, teachers were also informed that the principals would complete a separate evaluation of each teacher using the same instrument. Finally, the data were compiled to aid the principals in analyzing how the perceptions of the principals and teachers differed, as well as to provide input for professional development and individual principal-teacher interactions. Because the principals of Dixie Heights High School were involved in the development of the QIR, multiple snapshot walks, and periodic calibration meetings, no additional training with the QIR was conducted with them.
The QIR used in this study was more complex than traditional instruments used for teacher evaluation. Each indicator included five descriptors, and the evaluator had to identify each indicator of performance as Unsatisfactory, Beginning, Developing, Proficient, or Exemplary. The complexity and newness of the QIR brought into question the validity of the teachers' original assessment of their instructional practices (on the pretest). While all principals conducted multiple group snapshot visits and engaged in discussions about the QIR ratings different principals assigned based on common observations, only a few teachers were afforded this same exposure to the QIR. However, after using the document and experiencing a number of principal-teacher interactions during the school year, the validity of teacher data gathered from this instrument on the posttest increased.
Convergent validity (district calibration of the QIR). Kenton County School District central office personnel used the QIR in multiple schools within the county. District personnel accompanied the principals (the treatment providers) on classroom snapshot visits on a monthly basis to aid in the calibration of the QIR. Although measures of teachers at other schools in the district were not within the scope of this research, the cross-district calibration of QIR ratings enhanced confidence in the generalizability of this particular instrument beyond one particular school.
Each month district personnel (usually an assistant superintendent and a district-level curriculum coach) conducted snapshot visits with at least two of the building principals for at least three different teachers. After each snapshot visit, the group discussed what each observer noted while on the classroom visit regarding instructional practices demonstrated by the teacher. The group also discussed coaching tips with the principals for each teacher in order to improve principal-teacher interactions. A final version of coaching notes was then sent to each teacher visited by the building principal. Each observer in the snapshot visits kept individual notes of what they learned while observing teachers for reference in future committee meetings. Through these multiple calibration observations, staff members developed an operational definition of the ratings of the various QIR components.
Reliability. As an interrater reliability check of principals' completion of the QIR, in August of 2009 the QIR instrument was completed separately by the four principals at Dixie Heights High School for fifty-two individual teachers. A review of the results yielded a 92% overall agreement on the individual components from all four principals. A 100% agreement was observed on twenty of the twenty-four components for all 52 teachers, and only one of the twenty-four components had more than one principal's rating differing from the group. No differently rated component was rated more than one level higher or lower than the group. This level of reliability on such a complex instrument is difficult to achieve; it was reached through periodic calibration meetings and calibration observations with district personnel.
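The overall agreement figure reported above can be illustrated with a minimal sketch (the ratings below are hypothetical illustrations, not the study's actual data): a teacher-component cell counts as full agreement only when all four principals assign the same rubric level, and the overall percentage is the share of cells with full agreement.

```python
# Sketch: percent agreement across four raters on rubric components.
# Ratings are hypothetical illustrations, not the study's actual data.
ratings = {
    # (teacher, component): [principal1, principal2, principal3, principal4]
    ("Teacher1", "2a"): [4, 4, 4, 4],
    ("Teacher1", "2b"): [3, 4, 4, 4],
    ("Teacher2", "2a"): [5, 5, 5, 5],
    ("Teacher2", "2b"): [4, 4, 4, 4],
}

# Full agreement means all four ratings collapse to a single value.
agree = sum(1 for r in ratings.values() if len(set(r)) == 1)
pct_agreement = 100 * agree / len(ratings)
print(f"{pct_agreement:.0f}% overall agreement")
```

A finer-grained check, like the one implied by the study's component-level figures, would also tally how far disagreeing ratings fall from the group (here, at most one level apart).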
Cronbach's alpha was calculated within each domain (Planning & Preparation, Learning Environment, Instruction, Assessment) of the QIR. The QIR items required an ordinal judgment of teacher instructional practices for each item. Because of the ordinal nature of the instrument, a Likert scale was imposed upon the categories of Unsatisfactory, Beginning, Developing, Proficient, and Distinguished (as defined in the QIR rubric itself) by assigning the values of 1, 2, 3, 4, and 5, respectively. It is important to note that these numbers were assigned to the domains of the QIR for data analysis related to this research only. Assigning numbers to a level of performance was not part of the principal-teacher interactions used in this study. Danielson (1996) advised against assigning numbers to a teacher's performance level, describing such practices as detrimental to the evaluation process and to the growth of the teacher as a professional.
Classroom Grade Distributions
As a measure of student performance, classroom grade distributions were collected every twelve weeks and monitored for trends. As evidenced in the literature review, improved teacher instructional practices lead to improved student performance (Connors, 2000; Felner & Angela, 1988; Haycock, 1998; Lezotte, 2001; Price et al., 1988; Raudenbush et al., 1992). Classroom grade distributions were analyzed in reference to past distributions of the same classes taught by the same teacher, school distributions, department distributions, grade-level distributions, and statistical analyses of distribution shape (e.g., normal versus flat distributions). Classroom grade distributions were collected for six school years: four years prior to the pilot year, the pilot year, and the year of full implementation.
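A minimal sketch of this kind of monitoring (with hypothetical counts, not the study's data) tallies one classroom's letter grades into proportions that can be compared across terms or against a flat (uniform) benchmark:

```python
# Sketch: summarizing one classroom's grade distribution for one
# twelve-week term. The grade list below is hypothetical.
from collections import Counter

grades = ["A"] * 8 + ["B"] * 10 + ["C"] * 6 + ["D"] * 3 + ["F"] * 1
n = len(grades)
dist = Counter(grades)
proportions = {g: dist[g] / n for g in "ABCDF"}

# Deviation from a perfectly flat distribution (each grade at 1/5):
flat = 1 / 5
max_dev = max(abs(p - flat) for p in proportions.values())
```

Tracking such proportions term over term, alongside departmental and school-wide distributions, is one simple way to surface the trends the study monitored.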
Dixie Heights High School, the setting for this research, had a limited number of practices and policies which could have affected the validity and reliability of classroom grade distributions. The grading scale, which had been stable for ten years prior to this study, was established by district policy as referenced in Table 10.
Grading Scale for Classroom Grades at Dixie Heights High School
(Dixie Heights High School, 2009)
Although not an official policy, final exams traditionally counted for no more than twenty percent of the overall grade.
Policies which may have affected the validity and reliability of classroom grade distributions included a policy requirement for grades to be updated every two weeks into an electronic online grading program accessible to parents (see Appendix D for the Dixie Heights High School Instructional Practices Grading Policy) and a policy which established procedures for placing failing students on academic probation (see Appendix E, Enhancing Achievement Treatment Plan).
In general, Dixie Heights High School grading practices and policies were not dissimilar from those of many public high schools. Additionally, none of the grading practices or policies changed significantly during the years investigated in this study. Thus, data collected and analyzed in this study were not expected to be affected by a change in grading practices or policies.
Discipline Referrals
Discipline reports were collected as a measure of student performance. As evidenced in the literature review, improved instructional practices lead to improved student behavior (Cushman & Delpit, 2003; Felner, Seitsinger, Brand, Burns, & Bolton, 2007; Rowan et al., 1997). Discipline referrals were collected for six school years: four years prior to the pilot year, the pilot year, and the year of full implementation.
In order to enhance the validity of interpretations drawn from analyses of discipline referrals, all teachers received training at the beginning of each school year on the appropriate procedures and behaviors to include on discipline referrals. The principals also received training at the beginning of each year, and periodically conferenced on the proper handling and recording of student discipline referrals. The Kenton County School District defined different types of behaviors as well as acceptable consequences for each action (see Appendix D for a copy of the Kenton County School District Code of Acceptable Behavior and Discipline).
Teacher and Student Surveys
Teacher and student surveys were used as measures of the frequency and focus of teacher conversations. The same teacher and student surveys were administered electronically and anonymously to teachers and students in May of 2006, 2007, and 2008. (See Appendices A and B for copies of the student and teacher surveys.)
Validity. The teacher and student surveys were developed by an administrative team (four principals and three counselors) at the participating high school, which included the principals (treatment providers) of the high school. Topics for survey questions were decided by the administrative team reflecting on their perception of the instructional needs of the school. Question format was modeled after professional surveys which had recently been used by members of this team within the educational setting (e.g., the We Teach and We Learn Surveys published by the International Center for Leadership in Education, 2006, and the My Voice Survey published by NCS Pearson, Inc., 2006).
Initial development of teacher survey. Survey questions one through three of the teacher survey were written to measure teacher perception of the frequency and focus of teacher-teacher conversations. Survey questions four through six were written to measure teacher perception of the frequency and focus of principal-teacher conversations. Questions seven and eight of the teacher survey were written to measure teachers' perceptions of the frequency and length of principal classroom visits. Questions nine through fourteen on the teacher survey were written to obtain information related to district initiatives not directly related to this research; results from those questions were not analyzed as part of this study. For specific details regarding how each question from the teacher survey aligns with the constructs of this study, see Table 11.
Teacher Survey Questions Aligned with Constructs of the Study
Initial development of student survey. Questions one, two, three, six, seven, and eleven of the student survey were written to measure student perception of the frequency and focus of teacher-student conversations. It was expected that by increasing the number of quality principal-teacher interactions, teachers would engage in more individual instructional conversations with students. Other questions on the student survey were not directly connected to this study and thus were not analyzed. For specific details regarding how each question on the student survey aligns with the constructs of this study, see Table 12.
Student Survey Questions Aligned with Constructs of the Study
Expert review of the surveys. To further enhance the construct validity of teacher and student surveys, the survey questions were reviewed and wording was adjusted by district level personnel with experience in writing surveys and working in schools. The three central office members who reviewed these surveys were:
1. A deputy superintendent with a doctorate in educational leadership, more than twenty years' experience in public education, and a background in school law.
2. An assistant superintendent with more than twenty years' experience in public education and a background in English and writing.
3. A content curriculum specialist with more than twenty years' experience in public education and a background in counseling and social studies.
This group was given the surveys and a description of the original intent of specific survey questions decided by the school administrative team. They were asked to give feedback in any form to improve the surveys.
After expert review of the surveys, small adjustments were made to the wording of questions to make the language audience friendly and consistent. See Appendices A and B for copies of the final teacher and student surveys.
Reliability. As described earlier in this section, the teacher and student surveys were developed by an administrative team at the participating high school and had not been tested for reliability prior to this study. Utilizing Cronbach's alpha, each set of questions related to the frequency and focus of principal-teacher conversations, teacher-teacher conversations, and teacher-student conversations was analyzed for internal reliability. Results of these analyses did not support combining sets of questions into a single measure. As a result, each question from the teacher and student surveys was analyzed separately.
Fidelity of Implementation of Snapshots
Teacher, date, observer, and number of snapshots per classroom by principals were recorded in an Excel spreadsheet as they took place (see Appendix E for an example of the snapshot tracker). This tracker provided an unambiguous count of visits, with immediate calculations of the number of visits per teacher, the number of visits per principal, and the average number of visits and standard deviation per teacher and per day. These data were self-reported by the treatment providers. Visitations were also discussed at periodic calibration meetings, and questions related to this principal-teacher interaction were present on both the teacher and student surveys.
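The tracker calculations described above can be sketched as follows (the visit records are hypothetical illustrations, not the study's data): tally visits per teacher and per principal, then summarize the per-teacher counts with a mean and standard deviation.

```python
# Sketch: tallying snapshot visits as in the tracker described above.
# Records are hypothetical (teacher, date, observing principal).
import statistics
from collections import Counter

snapshots = [
    ("Smith", "2008-09-03", "PrincipalA"),
    ("Smith", "2008-09-17", "PrincipalB"),
    ("Jones", "2008-09-03", "PrincipalA"),
    ("Jones", "2008-10-01", "PrincipalC"),
    ("Lee",   "2008-09-10", "PrincipalA"),
]

visits_per_teacher = Counter(t for t, _, _ in snapshots)
visits_per_principal = Counter(p for _, _, p in snapshots)

counts = list(visits_per_teacher.values())
mean_visits = statistics.mean(counts)   # average visits per teacher
sd_visits = statistics.pstdev(counts)   # spread of visits across teachers
```

The same grouping by date rather than teacher would yield the per-day averages the tracker also computed.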