A study of how specific principal behaviors affect teacher and student performance

Results in this chapter are organized around the research questions of this study:

1. How will the treatment of principal-teacher interactions affect teachers' instructional practices?
2. How will changes in teachers' instructional practices, initiated by the set of principal-teacher interactions, affect student performance?
3. How will changes in principal-teacher interactions affect the frequency and focus of teacher conversations with principals, students, and other teachers?

The research design used in this study was quasi-experimental with multiple quantitative analytic techniques. Three time frames were relevant to the measures and the principal-teacher interactions:

1. Prior to the pilot year (before fall 2007)
2. The pilot year (2007-2008 school year)
3. The year of full implementation (2008-2009 school year)

Two principal-teacher interactions, snapshots and data reviews, were implemented during the pilot year. The full set of four principal-teacher interactions (one-on-one summer meetings, snapshots, data reviews, and teacher self-assessment) was implemented in the year of full implementation. Classroom grade distributions and student discipline referral data were collected from all three time frames. Teacher and student survey data were collected from two of these time frames, the pilot year and the year of full implementation. Principal-completed and teacher-completed QIR data were collected only in the year of full implementation.

Research Question One: How will the treatment of principal-teacher interactions affect teachers' instructional practices?

QIR data were used in a single-group pretest-posttest research design in order to explore any effect the introduction of a set of principal-teacher interactions had on the quality of teacher instructional practices, as defined by the Quality Instruction Rubric (QIR), during the year of full implementation, the 2008-2009 school year.
Data from the QIRs completed by the principals and the QIRs completed by teachers, from both the pretest and posttest, were analyzed using paired-samples t-tests. One assumption of t-tests is that the data follow a normal distribution. The normality of the QIR score distributions was assessed by computing kurtosis and skewness for each of the four subscales and for the overall QIR score, for both the pretest and posttest data. The ten kurtosis values ranged from 0.54 to 1.69, and the ten skewness values ranged from 0.01 to 1.03. A common interpretation is that absolute values of kurtosis and skewness less than 2 are approximately normal enough for most statistical assumption purposes (Minium, King, & Bear, 1993). These results support that the normality assumption was upheld for these data.

Changes in the Quality of Teacher Instructional Practices During the Year of Full Implementation

The results of a comparison of pretest and posttest QIR data are presented in Table 15. According to analyses of QIR ratings completed by teachers, the quality of teacher instructional practices improved in the two domains of Planning & Preparation and Learning Environment at a significance level of p<0.01, with a small effect size in each of these two domains. Analyses of QIR ratings completed by the principals did not detect a change in those same two domains. Conversely, analyses of QIR ratings completed by teachers did not indicate a change in the two domains of Instruction and Assessment, while analyses of QIR ratings completed by the principals indicated improvement in those two domains at a significance level of p<0.001, with a small effect size in each.
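The analytic sequence used here (a normality screen via skewness and kurtosis, a paired-samples t-test, and an effect size) can be sketched as follows. This is a minimal illustration, not the study's analysis: the QIR scores below are simulated, and the specific means and spreads are assumptions chosen only to make the sketch run.

```python
# Illustrative sketch: normality screen, paired t-test, and Cohen's d.
# The scores are simulated stand-ins, NOT the study's QIR data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)

# Hypothetical pretest/posttest overall QIR scores for the same 50 teachers.
pretest = rng.normal(loc=2.8, scale=0.4, size=50)
posttest = pretest + rng.normal(loc=0.1, scale=0.3, size=50)

# Normality screen: |skewness| and |kurtosis| below 2 are treated as
# approximately normal (Minium, King, & Bear, 1993).
for name, scores in [("pretest", pretest), ("posttest", posttest)]:
    skew = stats.skew(scores)
    kurt = stats.kurtosis(scores)  # excess kurtosis (normal = 0)
    assert abs(skew) < 2 and abs(kurt) < 2, f"{name} departs from normality"

# Paired-samples t-test on the pre/post scores.
t_stat, p_value = stats.ttest_rel(pretest, posttest)

# Cohen's d for paired data: mean difference over the SD of the differences.
diff = posttest - pretest
cohens_d = diff.mean() / diff.std(ddof=1)
print(f"t = {t_stat:.2f}, p = {p_value:.4f}, d = {cohens_d:.2f}")
```

The same screen-then-test pattern applies to each of the ten pre/post comparisons (four subscales plus the overall score, for both rating sources).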
Overall, according to analyses of QIR ratings completed by the principals, the quality of teacher instructional practices did improve, at a significance level of p<0.05 and with a small effect size. Analyses of QIR ratings completed by teachers also indicated that the quality of teacher instructional practices improved overall, at a significance level of p<0.05 and with a small effect size. However, as noted in Table 15, the specific domains in which significant changes occurred according to teachers' ratings were exactly the opposite of those indicated by principals' ratings.

Table 15
Comparison of QIR Pre-Post Mean Scores (Standard Deviation) for Year of Full Implementation
*Indicates a small effect size (0.2 < d < 0.5); **indicates a medium effect size (0.5 < d < 0.8); ***indicates a large effect size (d > 0.8) (Cohen, 1988).

A Comparison of the Differences in Perceptions of the Quality of Teacher Instructional Practices between Teachers and Principals

The mean scores of QIR ratings completed by teachers and of QIR ratings completed by the principals, presented in Table 15, appear to differ systematically. Results of a comparison of pretest and posttest QIR ratings completed by the principals to those completed by teachers are presented in Table 16. Teachers rated the quality of their instructional practices higher than did principals in each domain and overall. The differences between principal-completed and teacher-completed QIR ratings were significant at a level of p<0.001 and indicated large effect sizes in each domain and overall.

Table 16
Comparison of Teacher-completed to Principal-completed QIR Mean Scores (Standard Deviation) for Year of Full Implementation
*Indicates a small effect size (0.2 < d < 0.5); **indicates a medium effect size (0.5 < d < 0.8); ***indicates a large effect size (d > 0.8) (Cohen, 1988).

Analyses of Systematic Differences in Teachers' Self-Ratings

Teachers with differing depths of quality of instructional practices may have differed systematically in their self-ratings, and the prior analysis of all teachers as one group may mask any such differences. Several grouping methods seemed logical for investigating systematic differences in these data. Teachers' QIR self-ratings were analyzed separately for high, medium, and low performing groups based on the overall posttest QIR ratings completed by the principals. Other options for generating teacher groups would have been the overall pretest QIR ratings completed by the principals, the overall posttest QIR ratings completed by teachers, or the overall pretest QIR ratings completed by teachers. Consideration was given to grouping teachers based on the overall pretest QIR ratings completed by the principals. In anticipation of this question, a correlation coefficient was calculated between the overall pretest and overall posttest QIR ratings completed by the principals and found to be 0.873. Such a high correlation between pretest and posttest indicates that using either set of data for grouping purposes would result in similar groupings and similar results. Consideration was also given to grouping teachers based on the overall QIR ratings completed by teachers. However, as established in chapter two, principal ratings of instructional practices are likely to be more valid than teacher ratings of instructional practices, and, as discussed in chapter three, several procedures were implemented during the course of this study, such as field tests, norming, and calibration procedures, to increase the validity and reliability of the QIR ratings completed by the principals.
Thus, for discussion purposes, it seemed most logical to group teachers according to the overall QIR ratings completed by the principals.

Comparisons Among High, Medium, and Low Performing Teachers According to Posttest QIR Ratings Completed by the Principals

Teachers' QIR self-ratings were analyzed separately in groups defined by their depth of quality instructional practices, as determined by their placement on the QIR. The total sample (N=50) was split into three nearly equal sized groups based on the overall posttest QIR ratings completed by the principals:

Group One: High Performing Teachers (n=16)
Group Two: Medium Performing Teachers (n=17)
Group Three: Low Performing Teachers (n=17)

The purpose of splitting the teachers into groups was to obtain as much discrimination between groups as possible; in principle, more than three groups would be preferable. However, because means were to be computed for each group in order to make potentially generalizable claims, separating the original sample of 50 teachers into more than three groups would likely have produced sample sizes too small for this purpose. The results of an ANOVA of the overall posttest QIR ratings completed by the principals for these three groups indicated that the ratings for each group were statistically different at a significance level of p<0.0001. The results of an ANOVA of the overall pretest QIR ratings completed by teachers indicated that high, medium, and low performing teachers' ratings were equivalent, as did an ANOVA of the overall posttest QIR ratings completed by teachers. Table 17 reports results of a comparison of QIR ratings completed by the principals and QIR ratings completed by teachers for high, medium, and low performing teachers.
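The three-way split and the ANOVA check that the groups are well separated can be sketched as follows. The ratings are simulated stand-ins for the principal-completed posttest QIR scores; the group sizes (16/17/17) match the study, but everything else is illustrative.

```python
# Illustrative sketch: split N=50 teachers into nearly equal thirds by
# principal-completed posttest QIR score, then confirm separation with a
# one-way ANOVA. Scores are simulated, NOT the study's data.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)

# Hypothetical overall posttest QIR ratings completed by the principals.
principal_post = rng.normal(loc=2.7, scale=0.5, size=50)

# Rank teachers and split into high / medium / low thirds (16, 17, 17).
order = np.argsort(principal_post)[::-1]  # highest-rated first
high, medium, low = order[:16], order[16:33], order[33:]

# One-way ANOVA across the three groups on the grouping variable itself;
# by construction the groups differ strongly.
f_stat, p_value = stats.f_oneway(
    principal_post[high], principal_post[medium], principal_post[low]
)
print(f"F = {f_stat:.1f}, p = {p_value:.2e}")
```

The study's second and third ANOVAs apply the same `f_oneway` call to the teacher-completed pretest and posttest scores for these same three groups, where no significant separation was found.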
Table 17 shows that high performing teachers' ratings of the quality of their instructional practices were statistically equivalent to the principals' ratings. By contrast, medium performing teachers' ratings of their instructional practices were significantly higher than the principals' ratings in each domain and overall, with medium to large effect sizes. Likewise, low performing teachers' ratings of their instructional practices were significantly higher than the principals' ratings in each domain and overall. The effect sizes between the low performing teachers' and the principals' ratings were consistently larger than those between the medium performing teachers' and the principals' ratings.

Table 17
Comparison of Teacher-completed to Principal-completed QIR Mean Scores (Standard Deviation) for High, Medium, and Low Performing Teachers
*Indicates a small effect size (0.2 < d < 0.5); **indicates a medium effect size (0.5 < d < 0.8); ***indicates a large effect size (d > 0.8) (Cohen, 1988).

Changes in the Quality of Teacher Instructional Practices During the Year of Full Implementation for High, Medium, and Low Performing Teachers

Results of a comparison of pretest and posttest QIR ratings for high performing teachers are presented in Table 18. According to analyses of pretest and posttest QIR ratings completed by teachers, the quality of instructional practices of high performing teachers improved in the domain of Learning Environment at a significance level of p<0.001, with a medium effect size. Analyses of QIR ratings completed by teachers did not indicate a change in the quality of instructional practices of high performing teachers in the three domains of Planning & Preparation, Instruction, or Assessment. Analyses of QIR ratings completed by teachers indicated the quality of instructional practices of high performing teachers increased overall at a significance level of p<0.05, with a small effect size.

Table 18
Comparison of QIR Pre-Post Mean Scores (Standard Deviation) for Year of Full Implementation for High Performing Teachers
*Indicates a small effect size (0.2 < d < 0.5); **indicates a medium effect size (0.5 < d < 0.8); ***indicates a large effect size (d > 0.8) (Cohen, 1988).

Analyses of QIR ratings completed by the principals did not detect a change in the quality of instructional practices of high performing teachers in the domain of Planning & Preparation. According to analyses of QIR ratings completed by the principals, instructional practices of high performing teachers improved in the three domains of Learning Environment, Instruction, and Assessment, at a significance level of p<0.05, with small effect sizes in the two domains of Learning Environment and Assessment and a large effect size in the domain of Instruction. Overall, according to analyses of QIR ratings completed by the principals, instructional practices of high performing teachers improved at a significance level of p<0.01, with a medium effect size.

Results of a comparison of pretest and posttest QIR ratings for medium performing teachers are presented in Table 19. According to analyses of the pretest and posttest QIR ratings completed by teachers, the quality of instructional practices of medium performing teachers did not change during the year of full implementation. According to analyses of the pretest and posttest QIR ratings completed by the principals, the quality of instructional practices of medium performing teachers improved in the domain of Instruction at a significance level of p<0.05, with a medium effect size, and did not change in any other domain or overall. Given that, of the ten possible indicators of a change in the quality of instructional practices for medium performing teachers, only one (Instruction; principal-completed) indicated a change, it is likely that the quality of instructional practices of medium performing teachers was impacted significantly less by the set of principal-teacher interactions during the year of full implementation than that of other teachers.

Table 19
Comparison of QIR Pre-Post Mean Scores (Standard Deviation) for Year of Full Implementation for Medium Performing Teachers
*Indicates a small effect size (0.2 < d < 0.5); **indicates a medium effect size (0.5 < d < 0.8); ***indicates a large effect size (d > 0.8) (Cohen, 1988).

Results of a comparison of pretest and posttest QIR ratings for low performing teachers are presented in Table 20. According to analyses of pretest and posttest QIR ratings completed by teachers, the quality of instructional practices of low performing teachers improved in the domain of Planning & Preparation at a significance level of p=0.05, with a medium effect size. Analyses of pretest and posttest QIR ratings completed by teachers did not indicate a change in the quality of instructional practices of low performing teachers in the other three domains of Learning Environment, Instruction, or Assessment. Analyses of pretest and posttest QIR ratings completed by teachers indicated the quality of instructional practices of low performing teachers increased overall at a significance level of p<0.05, with a medium effect size.

Table 20
Comparison of QIR Pre-Post Mean Scores (Standard Deviation) for Year of Full Implementation for Low Performing Teachers
*Indicates a small effect size (0.2 < d < 0.5); **indicates a medium effect size (0.5 < d < 0.8); ***indicates a large effect size (d > 0.8) (Cohen, 1988).

Analyses of pretest and posttest QIR ratings completed by the principals did not detect a change in the quality of instructional practices of low performing teachers in the two domains of Planning & Preparation and Learning Environment. According to analyses of pretest and posttest QIR ratings completed by the principals, the quality of instructional practices of low performing teachers improved in the two domains of Instruction and Assessment, at significance levels of p<0.01 and p<0.001 respectively, with medium and large effect sizes respectively. Overall, according to analyses of pretest and posttest QIR ratings completed by the principals, the quality of instructional practices of low performing teachers increased at a significance level of p<0.05, with a small effect size.

Research Question Two: How will changes in teachers' instructional practices, initiated by the set of principal-teacher interactions, affect student performance?

Classroom grade distributions and student discipline referrals were used in a single-group, cross-sectional interrupted time series research design in order to explore any effect that changes in teacher instructional practices, initiated by the set of principal-teacher interactions, may have had on student performance during the pilot year and the year of full implementation. Data from classroom grade distributions and student discipline referrals from the four years prior to the pilot year were analyzed using linear regression in order to predict expected levels of student performance during the pilot year (2007-2008) and the year of full implementation (2008-2009).
Actual levels of student performance, operationalized as grade distributions and discipline referrals, from the pilot year and the year of full implementation were then compared to the levels of student performance predicted by the regression analysis.

Classroom Grade Distributions

Classroom grade distributions are presented in Table 21 for the four years preceding the pilot year (pre-treatment), the pilot year, and the year of full implementation.

Table 21
Actual Classroom Grade Distributions for all Students (n=approximately 1400)
^1 Two of this study's principal-teacher interactions, snapshots and data reviews, were in place for this school year. ^2 This study's treatment, the set of four principal-teacher interactions, was in place for this school year.

Using grade distribution data from the four years preceding the pilot year, expected grade distributions were calculated using linear regression. Figure 6 depicts the actual distributions of As, Bs, Cs, Ds, and Fs for school years 2003-2004 through 2008-2009. A line of best fit for each grade distribution has been placed on the graph based on data collected in the years prior to the pilot year; dashed lines on Figure 6 indicate expected grade distributions according to pre-treatment data. The Greek letter delta (Δ) is used to indicate the difference between expected and actual values for each grade distribution in the pilot year and the year of full implementation. The differences between expected and actual values are indicated within parentheses on Figure 6 and are reported in Table 22.

Table 22
Gap between Actual and Projected Classroom Grade Distributions for all Students (n=approximately 1400)
Percentages reported are the differences from the projected values based on linear regression of pre-treatment data (school years 2003-2004 through 2006-2007).

The percent As and percent Fs produced the differences of the largest magnitudes from expected values. The higher than expected percentage of Ds may have been due to a portion of the Fs becoming Ds, and the higher than expected percentage of As may have been due to a portion of the Bs becoming As.
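The projection method behind Tables 21 and 22 (fit a line to the four pre-treatment years, extrapolate to the pilot and full-implementation years, and report the actual-minus-expected gap Δ) can be sketched as follows. The percentages below are illustrative placeholders, not the study's grade data.

```python
# Illustrative sketch of the gap calculation: linear regression on
# pre-treatment years, extrapolated forward. Values are hypothetical,
# NOT the study's grade-distribution data.
import numpy as np

years = np.array([2003, 2004, 2005, 2006])   # pre-treatment school years
pct_a = np.array([24.0, 25.0, 25.5, 26.0])   # hypothetical % As per year

# Fit a line of best fit to the pre-treatment data only.
slope, intercept = np.polyfit(years, pct_a, deg=1)

# Extrapolate to the pilot year (2007) and full implementation (2008),
# then compute the gap (Δ) between actual and expected values.
for year, actual in [(2007, 29.5), (2008, 31.0)]:  # hypothetical actuals
    expected = slope * year + intercept
    delta = actual - expected  # the Δ reported in parentheses on Figure 6
    print(f"{year}: expected {expected:.1f}%, actual {actual}%, Δ = {delta:+.1f}")
```

The same fit-and-extrapolate step is repeated for each grade category (and later for each discipline referral category), with one regression per series.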
Figure 6. Predicted and Actual Classroom Grade Distributions for all Students. Dashed lines represent predicted values based on pre-treatment data from school years 2003-2004 through 2006-2007. Differences between expected and actual values are represented within parentheses.

Classroom Discipline Referrals

The number of reported classroom discipline referrals, aggressive discipline referrals (aggressive to school employee, defiance, failure to comply with discipline, fights, harassment, profanity, disorderly conduct, and repeated violations), and discipline referrals for several disaggregated groups are presented in Table 23 for the four years preceding the pilot year (pre-treatment), the pilot year, and the year of full implementation.

Table 23
Discipline Referrals for School Years 2003-2004 through 2008-2009 (n=approximately 1400)
^1 Two of this study's principal-teacher interactions, snapshots and data reviews, were in place for this school year. ^2 This study's treatment, the set of four principal-teacher interactions, was in place for this school year.

Using discipline referral data from the four years preceding the pilot year, expected levels of discipline referrals were calculated using linear regression. The differences between expected and actual levels of discipline referrals in each category, for the pilot year and the year of full implementation, are presented in Table 24. The differences between expected and actual values are also indicated within parentheses on Figure 7.

Table 24
Differences of Discipline Referrals from Projected Frequencies (n=approximately 1400)
Numbers reported are the differences in frequencies from the projected values based on linear regression of pre-treatment data (school years 2003-2004 through 2006-2007).

Figures 7, 8, and 9 depict the actual discipline referral frequencies for school years 2003-2004 through 2008-2009. A line of best fit for each referral category has been placed on each graph based on data collected in the years prior to the pilot year; dashed lines indicate the expected frequency of discipline referrals according to pre-treatment data. The Greek letter delta (Δ) is used on each graph to indicate the difference between expected and actual values in each category for the pilot year and the year of full implementation, and these differences are indicated within parentheses on each figure. As indicated on Figure 7, the actual frequency of total discipline referrals was 12% lower than expected in the pilot year and 38% lower than expected in the year of full implementation. Additionally, the actual frequency of aggressive discipline referrals was 35% lower than expected in the pilot year and 61% lower than expected in the year of full implementation. This pattern seems to indicate that essentially all of the difference between actual and expected discipline referrals is due to actual aggressive discipline referrals being much lower than expected.
Figure 7. Total Discipline and Aggressive Discipline for all Students. Dashed lines represent predicted values based on pre-treatment data from school years 2003-2004 through 2006-2007. Differences between expected and actual values are represented within parentheses.

As indicated on Figure 8, the actual frequency of male discipline referrals was 18% lower than expected in the pilot year and 51% lower than expected in the year of full implementation. However, the actual frequency of female discipline referrals for both the pilot year and the year of full implementation was essentially equivalent to the expected value, 3% and 4% higher respectively.
Figure 8. Total Discipline by Gender. Dashed lines represent predicted values based on pre-treatment data from school years 2003-2004 through 2006-2007. Differences between expected and actual values are represented within parentheses.

Figure 9 presents discipline referrals for individual grade levels during the school years 2003-2004 through 2008-2009. The actual frequency of freshman discipline referrals was 22% lower than expected in the pilot year and 36% lower than expected in the year of full implementation. The actual frequency of sophomore discipline referrals was 10% lower than expected in the pilot year and 28% lower than expected in the year of full implementation. The actual frequency of junior discipline referrals was only 7% lower than expected in the pilot year, but 27% lower than expected in the year of full implementation. The actual frequency of senior discipline referrals was essentially as expected in the pilot year, 4% lower than expected, but 61% lower than expected in the year of full implementation.
Figure 9. Total Discipline for Freshmen, Sophomores, Juniors, and Seniors. Dashed lines represent predicted values based on pre-treatment data from school years 2003-2004 through 2006-2007. Differences between expected and actual values are represented within parentheses.
Classroom Grade Distributions and Student Discipline Referrals for High, Medium, and Low Performing Teachers

Analyses of QIR results indicated that ratings of teacher instructional practices completed by teachers and by the principals were divergent for high, medium, and low performing teachers. Thus it was of interest to investigate whether there were differences in student outcomes across these three teacher groups. Mean classroom grade distributions and student discipline referrals disaggregated by high, medium, and low performing teachers for 2006-2007 through 2008-2009 are reported in Table 25. A comparison of these data indicated no statistically significant differences in classroom grade distributions or student discipline referrals among high, medium, and low performing teachers from 2006-2007 through 2008-2009.

Table 25
Comparison of Classroom Grade Distributions and Discipline Referral Mean Scores (Standard Deviation) for High, Medium, and Low Performing Teachers for 2006-2007 through 2008-2009 (n=approximately 1400)
Research Question Three: How will changes in principal-teacher interactions affect the frequency and focus of teacher conversations with principals, students, and other teachers?

Teacher and student survey data were used in a single-group pretest-midtest-posttest design in order to explore any effect that changes in principal-teacher interactions, coupled with changes in instructional practices, had on the frequency and focus of teacher conversations with principals, students, and other teachers during the pilot year and the year of full implementation. Data from teacher and student surveys were compared using chi-square tests from the spring of 2007, prior to the introduction of the set of principal-teacher interactions, to the spring of 2008, the end of the pilot year, and from the spring of 2008, before the year of full implementation, to the spring of 2009, after the year of full implementation. Some of the questions on the teacher and student surveys are conceptually related to research question three, the frequency and focus of teacher conversations. However, according to analysis using Cronbach's alpha, as reported in chapter three, there was a lack of internal consistency among the responses to similar questions on the teacher and student surveys. Therefore, each question on the teacher and student surveys was analyzed individually. Although chi-square is an acceptable tool for comparing the distributions of the survey data in this study, two of its assumptions were occasionally violated during analysis: first, that no cell contains a zero frequency count, and second, that no more than 20% of cells contain a frequency count of less than five. However, these are guidelines rather than rules, and researchers support chi-square analyses even when these assumptions are violated (Levin, 1999).
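The per-question chi-square comparison and the two assumption checks just described can be sketched as follows. The response counts are hypothetical placeholders, not the study's survey data.

```python
# Illustrative sketch: chi-square comparison of one survey question's
# response distribution across two administrations, with the two
# expected-frequency checks described above. Counts are hypothetical,
# NOT the study's survey data.
import numpy as np
from scipy import stats

# Rows: survey administration (spring 2007 vs spring 2008);
# columns: response categories (e.g., never / sometimes / often).
observed = np.array([
    [30, 45, 25],   # spring 2007
    [18, 40, 42],   # spring 2008
])

chi2, p_value, dof, expected = stats.chi2_contingency(observed)

# Guideline 1: no expected cell frequency of zero.
assert (expected > 0).all(), "zero expected frequency"

# Guideline 2: no more than 20% of expected cells below five.
# These are guidelines, not hard rules (Levin, 1999).
low_cells = (expected < 5).mean()
if low_cells > 0.20:
    print(f"warning: {low_cells:.0%} of expected cells are below 5")

print(f"chi2 = {chi2:.2f}, dof = {dof}, p = {p_value:.4f}")
```

One such comparison is run per survey question and per interval (2007 vs 2008, then 2008 vs 2009), since the Cronbach's alpha results ruled out pooling related questions into scales.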
Teacher Survey Data

Results of teacher surveys are presented in Table 26 for spring 2007 (pretest), spring 2008 (posttest for pilot year/pretest for year of full implementation), and spring 2009 (posttest for year of full implementation).

Table 26
^1 Data are frequency counts of teacher responses in this category. *p<0.05, **p<0.01, ***p<0.001
The first three teacher survey questions in Table 26 relate specifically to the frequency and focus of teacher-teacher interactions. Data analyses indicated a significant difference in the distribution of teachers' responses to questions about conversations with other teachers before and after the pilot year. There were few significant differences in the distribution of teachers' responses to the same questions from the pilot year to the year of full implementation; the one exception was teachers' responses to how often they discussed curriculum issues with other teachers. The results of data analyses indicated that teachers did perceive an increase in the frequency of teacher-teacher conversations related to curriculum and discipline, as well as overall, and that this level of teacher-teacher conversation was sustained during the year of full implementation.

The next three teacher survey questions in Table 26 relate specifically to the frequency and focus of principal-teacher conversations. With the exception of teachers' responses to how often they discussed discipline issues with a principal, which were significantly different from before to after the pilot year, data analyses indicated a lack of significant differences in the distributions of teachers' responses to questions concerning the frequency and focus of principal-teacher conversations during either the pilot year or the year of full implementation. Aside from that exception, the results of data analyses indicated that teachers did not perceive a change in the frequency of principal-teacher conversations related to curriculum, discipline, or teaching strategies during the pilot year or the year of full implementation.

The last two teacher survey questions in Table 26 relate to the length and frequency of principal classroom snapshots.
Data analyses indicated a significant difference in the distribution of teachers' responses to questions about the length and frequency of principal classroom visits before and after the pilot year. There were no significant differences in the distribution of teachers' responses to the same questions from the pilot year to the year of full implementation. The results of data analyses indicated that teachers did perceive an increase in the frequency and duration of principal classroom visits during the pilot year and that this change was sustained during the year of full implementation.

Student Survey Data

Results of student surveys are presented in Table 27 for spring 2007 (pretest), spring 2008 (posttest for pilot year/pretest for year of full implementation), and spring 2009 (posttest for year of full implementation).

Table 27
Student Survey of the Frequency and Focus of Teacher-Student Conversations
^1 Data are frequency counts of student responses in this category. *p<0.05, **p<0.01, ***p<0.001

The first student survey question in Table 27 relates specifically to the daily frequency of teacher-student conversations. Data analyses indicated that students did not perceive a change in the frequency of teacher-student conversations during the pilot year. Analyses of the student responses indicated that students did perceive a decrease in the daily frequency of teacher-student conversations, at a p<0.05 significance level, during the year of full implementation: fewer students reported talking to a teacher eight or more times a day, while more reported talking to a teacher between two and seven times per day. This difference was likely trivial, especially since results from the other survey questions failed to indicate significant shifts in the focus of teacher-student conversations.

The next five student survey questions in Table 27 relate to the frequency and focus of teacher-student conversations. Data analyses indicated that, according to students, there were essentially no perceived differences in the frequency and focus of teacher-student conversations related to personal issues, discipline issues, learning strategies, motivation, or class performance during the pilot year or the year of full implementation. The one exception was the distribution of student responses to how often they discuss learning strategies with their teachers, which differed significantly during the year of full implementation, at a significance level of p<0.05, indicating a slight decrease. It is logical to think that, since the data clearly indicate an improvement in the quality of teacher instructional practices, the frequency and focus of teacher-student conversations would also have improved, but no such improvement was indicated.
