I kind of handed this sucker in two weeks late, so I probably got a 0/2 for commitment. Still, I scoured the net for examples and couldn’t find anything that our teacher didn’t give us, so I thought a sample project would be appreciated. I got 14/20, by the way, and my topic is "First Period Leads in National Hockey League Games and the Effect on the Final Score."
Plan of Investigation
While watching National Hockey League (NHL) games, I often heard the play-by-play announcer mention at the start of the third and final period how it would be tough for a team to come back from a one-goal deficit. This led me to wonder just how difficult it was mathematically, and how much previous periods affected the final one. In this project, I will investigate whether the scores at the end of the first period affect the final score of NHL games.
I will gather the scores of 200 hockey games between 2005-2008 from the nhl.com website. I chose these years because the type of hockey before and after the new Collective Bargaining Agreement is different in terms of goals scored per game, with more goals scored per game on average before than after.
After gathering this data, I will analyze and compare it. I will make scatterplots, plotting the scores of both losing and winning teams at the end of the first period on the Y-axis and the final score on the X-axis. This will let me see visually whether previous scores affect the number of goals scored by the end of the third period. I will also use Pearson’s correlation coefficient to see whether there appears to be a correlation between the score at the end of the first period and the final score, and I will try to describe the strength of association by using the coefficient of determination. I will then create a bar graph showing teams who won at the end of the first period and won at the end of the third and teams who lost at the end of the first period and won at the end of the third. Using information from the bar graph, I will use the chi-squared test of independence to see whether the score at the end of the first period and the score at the end of the third period are independent.
1. For the dates of the 200 games, I placed slips of paper into three boxes. In one box were four slips of paper, one for each year between 2005-2008; in another box were 31 slips with the numbers 1-31; and in the final box were 9 slips with the months January-June and October-December
2. Took one slip of paper from each box and noted the scores of all games played that day. I put the slips of paper back into the boxes after I noted the scores
3. Repeated this process until I reached 200 games
4. Made two scatterplots, one of the score of the winning team at the end of the first period and the score at the end of the third, and one of the score of the losing team at the end of the first period and at the end of the third
5. Found the line of best fit
6. Used Pearson’s correlation coefficient on the numbers used in the scatterplots
7. Found the coefficient of determination from the r values in step 6
8. Made bar graph of teams who won at the end of the first period and won at the end of the third versus teams who lost at the end of the first period and won at the end of the third
9. Concluded whether there appears to be a correlation between the score at the end of the first and the score at the end of the third
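Steps 1 to 3 above can be sketched in Python as a rough illustration of the slip-drawing procedure. The `games_on` lookup is hypothetical (the project read scores off nhl.com by hand), and skipping a date that has already been drawn is an assumption, since the write-up does not say how repeated dates were handled.

```python
import random

YEARS = [2005, 2006, 2007, 2008]
DAYS = list(range(1, 32))
# Months on the slips: January-June and October-December
MONTHS = [1, 2, 3, 4, 5, 6, 10, 11, 12]

def draw_date(rng):
    """Draw one slip from each box; slips go back in, so draws are independent."""
    return rng.choice(YEARS), rng.choice(MONTHS), rng.choice(DAYS)

def sample_games(games_on, target, rng):
    """Keep drawing dates until `target` games have been collected.

    `games_on(year, month, day)` is a hypothetical lookup returning the list
    of games played that day (an empty list for off days or impossible dates
    such as 31 June). It must be able to supply at least `target` games in
    total, otherwise this loop would never finish.
    """
    seen_dates = set()  # assumption: a repeated date is simply drawn again
    collected = []
    while len(collected) < target:
        date = draw_date(rng)
        if date in seen_dates:
            continue
        seen_dates.add(date)
        collected.extend(games_on(*date))
    return collected[:target]
```

Passing a seeded `random.Random` as `rng` makes a particular draw repeatable, which paper slips cannot do.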
Uh, I had 100+ trials, I put the number into tables, you guys don't need to see this.
After collecting the data, two scatter plots were created using Microsoft Excel, one of the scores of Team A, the winning team at the end of the first period, and one of the scores of Team B, the losing team at the end of the first period.
[GRAPH] [GRAPH...if you really want to see them, send me an email/leave a comment or something and I can send it over]
To make this graph, I placed the data into Microsoft Excel, used XY (scatter) using the Chart Wizard. I made sure the maximums, minimums and scales were the same for the two graphs so a better comparison could be made between them. I next added a linear trend line. Under the Add Trendline Option, I elected to display the equation on the chart. --> NOTE: I don't think you're supposed to use Microsoft Excel, but I had lost my calculator, I was across the continent, I was really, really desperate
The scatter plot is a graph that shows the correlation between the two variables. The trendline, known as the line of best fit or the least squares regression line, shows the linear equation which best sums up the data’s trend. The formula on the right is the formula of the line of best fit.
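For anyone without Excel, the least squares line can be worked out directly. This is a minimal sketch of the standard formulas, not the trendline output from the project; the sample numbers are made up, and putting the first-period score on the horizontal axis is an assumption.

```python
def least_squares_line(xs, ys):
    """Return (slope, intercept) of the least squares regression line y = slope*x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of products of deviations, and sum of squared deviations in x
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept

# Made-up illustration values, NOT the project's data
first_period = [0, 1, 1, 2, 3]
final_score = [2, 3, 4, 4, 6]
slope, intercept = least_squares_line(first_period, final_score)
print(f"y = {slope:.3f}x + {intercept:.3f}")
```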
As can be observed from these plots, teams who win at the end of the first period have a much larger range of data, with 26 separate points versus 14 separate points on Team B’s graph. Since the scatter plot does not use a thicker dot when numerous teams share one place on the plot, this suggests that teams who lose at the end of the first period tend to have similar scores, as they have twelve fewer points than teams who win.
If a team loses in the first period, this means that its X-range must be smaller than that of the winning team. Since 9/26, or almost 35%, of Team A’s points are on 3 or 4 on the X-axis, it is logical for Team A to have a larger range. However, even disregarding 3 and 4 on the X-axis, Team A’s range is still larger than that of Team B’s. There is a greater spread of data on the Y-axis of Team A, as well.
Note that when Team A scores one goal in the first period, there are instances of them ending the third period with five or six goals. However, when Team B scores one goal in the first period, the most goals they ended up with is only four. This pattern is repeated when the teams score two goals in the first period. With Team A, there is at least one instance when they end up with seven; with Team B, the most they have ended up with is six, one less than with Team A. This suggests that having a lead in the first does increase scoring slightly in the later periods.
I then used Pearson’s correlation coefficient. When necessary, I rounded numbers to three significant digits. Again, I used Microsoft Excel to find r, by placing data from Team A into A1 to A110 and B1 to B110. Next, I entered =CORREL(A1:A110,B1:B110). This gave me r, +0.452. I repeated this process with data from Team B, although I entered the data into the C and D columns and used =CORREL(C1:C110,D1:D110). The r for Team B was +0.436.
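As a rough equivalent of the CORREL step, Pearson's r can be computed by hand from two lists of equal length. This is only a sketch of the textbook formula; the function name is made up for illustration and the project's actual data is not reproduced here.

```python
import math

def pearson_r(xs, ys):
    """Pearson's correlation coefficient for two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)
```

With one list of first-period scores and one of final scores for the same games, `pearson_r(first_period, final_score)` plays the role of `=CORREL(A1:A110,B1:B110)`.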
The purpose of r is to find the degree of linearity between the two variables. This allows me to evaluate the strength of the correlation between goals scored in the first period and third period by each team.
The Pearson’s correlation coefficient for Teams A and B suggests a weak positive correlation, with a slightly weaker correlation for Team B. This means that generally, the higher the score in the first period for both teams, the higher the score at the end of the third period. Team A is more likely than Team B to have a higher score in the third period given that it has a higher score in the first. Thus, Team A appears more likely to win because it either widens the difference in goals scored between it and Team B, or scores more goals in response to Team B trying to launch a comeback. However, this connection is quite weak: the trend exists, but it does not hold in many individual games. Since a higher score at the end of the first period does not reliably result in a higher score at the end of the third, having a lead in the first period aids the leading team, but only slightly.
Next, I squared r to find the coefficient of determination, r².
Team A: r = 0.452, r² = 0.204
Team B: r = 0.436, r² = 0.190
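The squaring step is easy to check. The short sketch below reuses the two r values reported above and also prints r squared as a percentage, since the coefficient of determination is usually read as the share of the variation in one variable explained by a linear relationship with the other. The function name is just for illustration.

```python
def coefficient_of_determination(r):
    """Square Pearson's r to get the coefficient of determination."""
    return r ** 2

# r values as reported in the text above
r_values = {"Team A": 0.452, "Team B": 0.436}
for team, r in r_values.items():
    r_squared = coefficient_of_determination(r)
    print(f"{team}: r = {r:.3f}, r^2 = {r_squared:.3f} ({r_squared:.1%} of variation)")
# Team A: r = 0.452, r^2 = 0.204 (20.4% of variation)
# Team B: r = 0.436, r^2 = 0.190 (19.0% of variation)
```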
The coefficient of determination allows me to better describe the strength of association. It shows a very weak correlation between goals scored in the first period and the third period with both Team A and Team B. This means that although goals scored in the first period may help both teams, they do not do so significantly. This corroborates r, which showed a weak link between goals scored in the first and third periods. As discussed with the results of r, this lead helps Team A more than Team B.
My final step was to make a bar graph of when Team A won at the end of the third versus when Team B won at the end of the third.
This shows clearly that Team A is more likely than Team B to win at the end of the third period. Team A won 88 times, while Team B won only 22 times, meaning that Team A won 80% of the time. This is quite a significant difference and shows that while the goals scored during the first period may not make a large difference in the goals scored at the end of the third period, a win at the end of the first period usually results in a win at the end of the third.
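The 80% figure, and how unlikely an 88 to 22 split would be by chance, can be checked with a few lines. Note that this is a chi-squared goodness-of-fit check against a 50/50 split, which is an assumption made purely for illustration; it is not the test of independence named in the plan, and it was not part of the original project.

```python
def chi_squared_gof(observed, expected):
    """Chi-squared goodness-of-fit statistic: sum of (O - E)^2 / E."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

wins = [88, 22]  # Team A wins, Team B wins (from the text above)
total = sum(wins)
expected = [total / 2, total / 2]  # assumption: null hypothesis of an even split
stat = chi_squared_gof(wins, expected)

# Critical value for 1 degree of freedom at the 5% significance level
CRITICAL_5PCT_DF1 = 3.841

print(f"Team A win share: {wins[0] / total:.0%}")   # Team A win share: 80%
print(f"chi-squared = {stat:.1f}")                  # chi-squared = 39.6
print("significant at 5%" if stat > CRITICAL_5PCT_DF1 else "not significant at 5%")
```

Under that assumption the statistic is far above the critical value, which fits the observation that the first-period leader usually wins.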
Discussion and Evaluation of Results
From the data collected for this experiment, as well as analysis of this data, it appears that a team winning at the end of the first period is generally more likely to win at the end of the third period. Note that the bar graph shows that teams who win at the end of the first period go on to win at the end of the third period 80% of the time, which strongly supports this idea. However, goals scored in the first period only slightly affect the score at the end of the third period. This suggests that many games are very close, with one team winning by one or two goals.
This naturally results in something I noticed during the investigation: a great many ties occurred. It makes sense that with so many close games, many end up going into overtime. This partially explains the introduction of the shootout in the NHL beginning with the 2005-06 season, which eliminated ties and ensured that one team won every game, giving the audience a clear outcome.
Something interesting I noticed in the data was that on the occasions Team B won, they tended to score more goals than needed for them to win. Team B often scored up to four goals during the second and third periods, even if only one was needed to catch up to Team A. Another trend I noticed was that Team A either played defensively and did not score another goal, or they scored up to four or five more goals. Both of these trends are results of individual coaching styles and team motivation. This was not reflected in my scatter plot because as mentioned, my scatter plot did not have a thicker dot for an event happening more frequently. Thus, even if Team A winning by three goals was more common than Team A winning by one, both are represented the same way on the scatter plot. This misrepresentation is one of the fallibilities in my experiment. It could perhaps be remedied by creating another graph which shows how often each score occurred.
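The remedy suggested above, a graph showing how often each score occurred, starts with counting repeated points. A minimal sketch, using made-up scores rather than the project's data, might look like this; the resulting counts could then set the dot size in a bubble chart.

```python
from collections import Counter

def point_frequencies(first_period, final_score):
    """Count how many games land on each (first-period score, final score) point."""
    return Counter(zip(first_period, final_score))

# Made-up illustration values, NOT the project's data
first = [1, 1, 2, 1, 0]
final = [3, 3, 4, 2, 1]
for (fp, fs), count in sorted(point_frequencies(first, final).items()):
    print(f"first period {fp}, final {fs}: {count} game(s)")
```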
I also noticed that it is uncommon for teams to score more than 4 goals in a period. In most games, teams score between 0 and 3 goals per period. Going by this generalization, a team which leads by one or two goals in the first period is not advantaged by much, as Team B should be able to catch up by scoring one or two goals in the second and third periods. However, once a team leads by four or more goals in the first period, it should be harder for Team B to catch up, as it will be forced to score more goals than it usually does.
This generalization is supported by my scatter plot and data. Teams who score three or four goals in the first period can end up with seven or eight goals by the third period, while there are no instances of teams losing in the first period scoring more than six goals. This suggests that more goals are scored overall when a team scores more goals in the first period. My data shows that in almost every instance when a team scores three or four goals in the first period, this team wins the hockey game. Thus, although Team A usually wins, it has a higher chance of doing so when it scores more goals in the first period.
Several factors prevent this study from being perfectly valid. Although I deliberately chose the dates randomly, games are affected by a myriad of factors beyond the scores themselves. Scheduling can affect scores, as teams playing three games in four nights, or back-to-back games, are liable to be more fatigued and may score fewer goals. Scheduling could also affect scores during late-season games, as some teams who will not make the playoffs stop trying to win, while others who wish to improve their position try harder. It is difficult to judge whether the former score more or fewer goals in the first period: when they score, they are not as motivated to hold onto their lead, and when they do not score, they are not as motivated to catch up. Meanwhile, the latter are eager to score as much as possible, since ranking between similar teams is sometimes based on goals scored for and against.
I think that this problem could be helped by collecting more data. Although I randomly chose 110 games, it may not be enough to cover all the factors that can affect games throughout two and a half seasons, considering that 1,230 games are played each regular season (30 teams playing 82 games each, with two teams in every game). Doubling or even tripling the number of games would better encompass these changes.
The reliability of the analysis stems from the fact that since the data is gathered throughout the regular season, it can be compared accurately, as opposed to data collected during regular season, preseason and the playoffs, when different types of hockey are played. It was also randomly gathered, reducing the chance of a bias resulting in games chosen from one circumstance.
Although goals scored in the first period do not affect goals scored in the third period in a major fashion, my investigation shows a strong correlation between winning in the first period and winning in the third period.