A study of how specific principal behaviors affect teacher and student performance
No personal information revealing the names or identities of the teachers involved in the study will be reported.
Research Question One: How will the treatment of principal-teacher interactions affect teachers’ instructional practices?
The principal-completed and teacher-completed QIRs, together with a synthesis of snapshot observations, were used as measures of teacher instructional practices. To screen the shape of the data from the principal-completed and teacher-completed QIRs, pretest and posttest scores for principals and teachers were analyzed for normality by running skewness and kurtosis tests while examining the mean and standard deviation of each. According to the University of Surrey Psychology Department (2007), a distribution with skew and kurtosis values in the range −2 to +2 is near enough to be considered normally distributed for most purposes. Distributions were assessed for normality using these two statistics in order to establish whether the data met the underlying assumptions of the ANOVA analyses.
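The normality screen above can be sketched in Python. This is a minimal illustration with hypothetical scores; the sample-moment formulas shown are the standard definitions, and in practice the statistics would typically come from a statistics package.

```python
def central_moment(xs, k):
    """k-th sample central moment of a list of scores."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** k for x in xs) / len(xs)

def skewness(xs):
    # g1 = m3 / m2^(3/2)
    return central_moment(xs, 3) / central_moment(xs, 2) ** 1.5

def excess_kurtosis(xs):
    # g2 = m4 / m2^2 - 3 (0 for a normal distribution)
    return central_moment(xs, 4) / central_moment(xs, 2) ** 2 - 3

def near_normal(xs, bound=2.0):
    # Screening rule used in the study: skew and kurtosis within +/-2.
    return abs(skewness(xs)) <= bound and abs(excess_kurtosis(xs)) <= bound

# Hypothetical QIR domain means for a group of teachers
scores = [3.2, 3.5, 2.8, 3.9, 3.1, 3.4, 2.9, 3.6, 3.3, 3.0]
print(near_normal(scores))
```

A distribution passing this screen would then be considered acceptable for the parametric comparisons that follow.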
Related sets of data were compared using t-tests in each of the four domains and overall, before and after the year of full implementation, following the single-group pretest-posttest research design modeled in Figure 2. Teacher-completed pretest data were compared to teacher-completed posttest data to discover teachers’ perceptions of changes in the quality of instructional practices. Principal-completed pretest data were compared to principal-completed posttest data to discover principals’ perceptions of changes in the quality of teacher instructional practices. Teacher-completed data were compared to principal-completed data for both the pretest and posttest to discover differences in teacher and principal perceptions of instructional practices. Next, the teachers were parsed into three groups (high performing, medium performing, and low performing) based on principal-completed posttest results, and the same analyses were then performed on each group of high, medium, and low performing teachers.
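One such pre-post comparison can be sketched as a paired-samples t statistic (hypothetical pretest and posttest domain means for six teachers; a statistics package would also supply the p-value from the t distribution):

```python
from statistics import mean, stdev
from math import sqrt

def paired_t(pre, post):
    """Paired-samples t statistic and degrees of freedom for matched scores."""
    diffs = [b - a for a, b in zip(pre, post)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1

# Hypothetical teacher-completed QIR means, pretest vs. posttest
pre  = [2.1, 2.4, 2.0, 2.8, 2.5, 2.2]
post = [2.6, 2.9, 2.3, 3.1, 2.8, 2.7]
t, df = paired_t(pre, post)
print(round(t, 2), df)
```

The resulting t statistic is compared against the critical value for the given degrees of freedom to judge significance.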
Quantifying the QIR. To quantify the data from the QIR, a five-point Likert scale was imposed on the five categories of Unsatisfactory, Beginning, Developing, Proficient, and Exemplary. The responses within each domain of the QIR (Planning & Preparation, Learning Environment, Instruction, and Assessment) were averaged to achieve a mean score for each teacher in each of the domains. The ratings from each response across all four domains were averaged to obtain an overall mean for each teacher.
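The scoring procedure can be sketched as follows. The category labels and domains are from the study; the numeric coding is the standard 1-5 Likert convention, and the sample responses are hypothetical:

```python
# Map the five QIR categories onto a five-point Likert scale
SCALE = {"Unsatisfactory": 1, "Beginning": 2, "Developing": 3,
         "Proficient": 4, "Exemplary": 5}

def score_teacher(responses):
    """responses: {domain: [category labels]} -> per-domain and overall means."""
    domain_means = {d: sum(SCALE[r] for r in rs) / len(rs)
                    for d, rs in responses.items()}
    # Overall mean pools every rating across all four domains
    all_ratings = [SCALE[r] for rs in responses.values() for r in rs]
    domain_means["Overall"] = sum(all_ratings) / len(all_ratings)
    return domain_means

# Hypothetical QIR responses for one teacher
example = {
    "Planning & Preparation": ["Developing", "Proficient"],
    "Learning Environment": ["Proficient", "Proficient"],
    "Instruction": ["Developing", "Beginning"],
    "Assessment": ["Developing", "Developing"],
}
print(score_teacher(example))
```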
Comparing pretest and posttest QIRs. The process described above yielded four sets (teacher-rated pretest, teacher-rated posttest, principal-rated pretest, and principal-rated posttest) of five datum points (Planning & Preparation, Learning Environment, Instruction, Assessment, and Overall) for each teacher in the study. The teacher-completed QIR data were analyzed from pretest to posttest in each domain and overall using t-tests, to discover any changes which occurred during the year of full implementation. However, as noted earlier, many teachers may have had an unrealistic sense of their teaching as characterized by the pre-QIR because they had not yet had the opportunity either to use it themselves or to receive feedback based on it. If that were the case, the pre-post comparison of teacher-completed QIRs may not have yielded valid and interpretable results. The principal-completed QIR data were likewise analyzed from pretest to posttest in each domain and overall using t-tests to discover any changes which may have occurred during the year of full implementation. Because all the principals had been deeply involved in developing and calibrating the QIR across many teachers, their pretest ratings were likely to be much more closely aligned with the intent of the QIR than were the teachers’.
Comparing principal and teacher QIRs. Although not directly related to research question one, compelling data were discovered when comparing teachers’ ratings of their own instructional practices to the principals’ ratings of teachers’ instructional practices on the pretest and posttest. Teacher and principal results from the QIR on the pretest and posttest were compared using t-tests to discover any significant differences which may have existed between teacher and principal perceptions of teacher instructional practices.
Research Question Two: How will changes in teachers’ instructional practices, initiated by the set of principal-teacher interactions, affect student performance?
For this study, student performance was defined as classroom grade distributions and discipline referrals. To use the single cross-sectional interrupted time series design modeled in Figure 3, classroom grade distributions and discipline reports were collected for the six school years 2003-2004 through 2008-2009, spanning the years prior to the pilot year, the pilot year, and the year of full implementation. Linear regression modeling was used to analyze any changes which may have taken place after the initial set of principal-teacher interactions in this study was introduced. For each set of data, linear regression on the pretreatment data was used to predict expected values for the pilot year and the year of full implementation; the expected value for each datum point was then compared to the observed value.
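The forecast-and-compare step can be sketched as an ordinary least-squares fit over the pretreatment years, extrapolated into the treatment years. The values below are hypothetical (a percent-A series), and the four-point pretreatment window is an assumption for illustration:

```python
def ols_line(xs, ys):
    """Least-squares slope and intercept for simple linear regression."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Hypothetical pretreatment values (year indices 0..3)
pre_years  = [0, 1, 2, 3]
pre_values = [22.0, 23.1, 23.9, 25.2]

m, b = ols_line(pre_years, pre_values)
# Expected values for the pilot year (index 4) and full-implementation year (5)
expected = [m * x + b for x in (4, 5)]
observed = [28.4, 29.0]                       # hypothetical observed values
gaps = [o - e for o, e in zip(observed, expected)]
print([round(e, 2) for e in expected], [round(g, 2) for g in gaps])
```

A positive gap between observed and expected values in the treatment years would suggest a change beyond the pretreatment trend.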
Classroom grade distributions. Final classroom grade distributions for each school year were collected for all grades assigned to students in grades 9-12 within the school, categorized by the traditional high school grading scale: A, B, C, D, and F. For each year and each letter grade, the percentage of assigned grades was used as the datum point, yielding the same five datum points each year: the percentages of As, Bs, Cs, Ds, and Fs.
Classroom discipline referrals. Classroom discipline referrals for each school year were collected and analyzed for changes using linear regression. The school district categorized classroom discipline referrals as aggressive to school employee, defiant, failure to comply with discipline, fights, harassment, profanity, tardies and skipping, tobacco, disorderly conduct, or repeated violations (see Appendix D for detailed descriptions of these offenses). For this study, referrals across all of these categories were summed to produce a number identified as the total discipline infractions. A second measure (aggressive discipline), the sum of the categories aggressive to school employee, defiant, failure to comply with discipline, fights, harassment, profanity, disorderly conduct, and repeated violations, was analyzed as well. Discipline referral data for the nested groups of males, females, freshmen, sophomores, juniors, and seniors were each analyzed in the same fashion.
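The two discipline totals can be sketched as simple sums over the district's categories (the category names are from the study; the counts below are hypothetical):

```python
# Hypothetical referral counts by district category for one school year
referrals = {
    "aggressive to school employee": 4, "defiant": 31,
    "failure to comply with discipline": 22, "fights": 9,
    "harassment": 7, "profanity": 12, "tardies and skipping": 140,
    "tobacco": 5, "disorderly conduct": 18, "repeated violations": 11,
}

# Categories counted toward the "aggressive discipline" total in the study
AGGRESSIVE = {"aggressive to school employee", "defiant",
              "failure to comply with discipline", "fights", "harassment",
              "profanity", "disorderly conduct", "repeated violations"}

total_infractions = sum(referrals.values())
aggressive_total = sum(v for k, v in referrals.items() if k in AGGRESSIVE)
print(total_infractions, aggressive_total)
```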
Classroom grade distributions and discipline referrals of high, medium, and low performing teachers. Classroom grade distributions (percent As, Bs, Cs, Ds, and Fs) and total discipline referrals for the three groups of high, medium, and low performing teachers were averaged for each of the school years 2006-2007, 2007-2008, and 2008-2009. Grades and total discipline were then compared across the high, medium, and low performing groups using t-tests to investigate any significant differences between the three groups during those three years.
Research Question Three: How will changes in teachers’ instructional practices, initiated by the set of principal-teacher interactions, affect the frequency and focus of teacher conversations?
Specifically, how did the changes in teachers’ instructional practices change the frequency and focus of principal-teacher conversations, teacher-teacher conversations, and teacher-student conversations? As noted in Tables 11 and 12, questions four through eight of the teacher survey were used as measures of the frequency and focus of principal-teacher conversations, questions one through three of the teacher survey were used as measures of the frequency and focus of teacher-teacher conversations, and questions one, two, three, six, seven, and eleven of the student survey were used as measures of the frequency and focus of teacher-student conversations. Pretest and posttest distributions of responses on the teacher and student surveys were analyzed using chi-square tests to detect changes in response patterns which may have occurred. In particular, chi-square tests compared the survey results from before and after the pilot year and from before and after the year of full implementation to discover any significant differences in the distributions.
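A single pre-post comparison of this kind can be sketched as a chi-square statistic on a 2 x k contingency table of response counts (hypothetical counts for one question's three combined categories; a statistics package would also supply the p-value):

```python
def chi_square_2xk(pre_counts, post_counts):
    """Chi-square statistic for a 2 x k table of response counts."""
    k = len(pre_counts)
    col = [pre_counts[j] + post_counts[j] for j in range(k)]
    rows = [sum(pre_counts), sum(post_counts)]
    grand = sum(col)
    chi2 = 0.0
    for i, obs_row in enumerate((pre_counts, post_counts)):
        for j in range(k):
            # Expected count under independence of time (pre/post) and response
            exp = rows[i] * col[j] / grand
            chi2 += (obs_row[j] - exp) ** 2 / exp
    return chi2, k - 1  # degrees of freedom = (2-1)*(k-1)

# Hypothetical pretest and posttest response counts for one survey question
pre  = [10, 25, 15]
post = [22, 20, 8]
stat, df = chi_square_2xk(pre, post)
print(round(stat, 2), df)
```

The statistic is compared against the chi-square critical value for the given degrees of freedom to judge whether the response pattern shifted.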
Combining response categories of survey questions. All survey questions used as measures of the frequency and focus of teacher conversations consisted of a Likert scale with five possible responses. The purpose of using a pretest-posttest research design was to discover any significant changes in the frequency and focus of teacher conversations from the administration of the pretest to the administration of the posttest. In retrospect, after the survey data had been collected, the distinctions between adjacent points on the Likert scale seemed too small to be meaningful; statistically significant changes across fewer, grouped response options would be a better indication of real change. So, each survey question and its corresponding responses were examined by two independent teams to determine whether any response categories could or should be combined.
One of the examining teams was made up of the four principals, two of whom were the primary researchers of this study, responsible for providing the set of principal-teacher interactions. A second team was made up of the three people who aided in the original expert review of the surveys (see Reviewing the Surveys under the Measurements and Instruments section of this chapter for a description of this group).
Each group was given copies of the survey questions of interest (questions one through eight of the teacher survey, and questions one, two, three, six, seven, and eleven of the student survey) and the following directions:
Please examine the following question responses and see if it makes sense to group any of the responses. That is, there are five responses for each survey question. Are there response choices from teachers which are basically no different? For example, if one of the options for a response was ‘never’ and another option was ‘almost never’, would these two responses likely indicate the same frequency/focus of teacher conversations? You may combine none or as many responses as make sense. Thank you for your time.
Each group worked independently as a team to reach a consensus regarding which response categories, if any, could logically be combined.
At the conclusion of the activity, the two teams’ resultant combinations were remarkably similar. Each set of five responses was combined into three groups, except the responses to question seven of the teacher survey, which were combined into four categories. See Tables 13 and 14 for the resultant groupings of each question’s responses in the teacher and student surveys. The two teams disagreed on the groupings for only one question: question seven of the student survey. The first team (the four principals responsible for providing the set of principal-teacher interactions for this study) grouped the original responses of “Daily”, “Weekly”, “Monthly”, “Yearly”, and “Never” into the three groups “Daily”, “Weekly or Monthly”, and “Yearly or Never.” The second team, made up of the three people who aided in the original review of the surveys, grouped the same responses into the three groups “Daily or Weekly”, “Monthly”, and “Yearly or Never.” The two teams discussed the reasons for the groupings they chose and came to a consensus that the grouping “Daily”, “Weekly or Monthly”, “Yearly or Never” made more sense for this question. In the end it was agreed that a teacher who is perceived as daily motivating and inspiring a student (question seven) would engage in notably different teacher conversations than a teacher who is perceived as doing these things only weekly or monthly. Although all members agreed that monthly and weekly were good frequencies for these indicators, leaving students with the perception that they occur daily is exceptional.
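The regrouping step can be sketched as a mapping from original to combined categories. The grouping shown is the consensus grouping for student-survey question seven described above; the response counts are hypothetical:

```python
# Consensus regrouping for student-survey question seven:
# five original responses collapsed into three combined categories
GROUPS = {"Daily": "Daily",
          "Weekly": "Weekly or Monthly", "Monthly": "Weekly or Monthly",
          "Yearly": "Yearly or Never", "Never": "Yearly or Never"}

def combine(counts):
    """counts: {original response: frequency} -> combined-category frequencies."""
    out = {}
    for response, n in counts.items():
        key = GROUPS[response]
        out[key] = out.get(key, 0) + n
    return out

# Hypothetical pretest response counts
pretest = {"Daily": 6, "Weekly": 14, "Monthly": 18, "Yearly": 9, "Never": 3}
print(combine(pretest))
```

The combined frequencies, rather than the original five, then feed the chi-square comparisons.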
Data analysis plan for survey results. Questions from the teacher and student surveys yielded measures of the frequency and focus of teacher conversations. Response frequencies for each option in the combined response groupings shown in Tables 13 and 14 were analyzed for the pilot year and the year of full implementation using chi-square tests.
Table 13
Regrouping Response Categories on Teacher Survey

Table 14
Regrouping Response Categories on Student Survey