3 The coefficient of correlation (r)

3.1 The correlation coefficient (r) and its characteristics

Once the least squares line has been estimated, the significance (or strength) of the relationship must be evaluated. If the relationship between the variables is weak, there is little reason to use the estimated regression equation any further. Although a scatter plot gives some information about the strength of the relationship, it is useful to have a single number that measures how strong or weak the linear relationship between the two variables is. In statistics, this number is called the Pearson product moment coefficient of correlation (in short, the correlation coefficient), and when it is calculated from sample data it is denoted by the symbol r.

So, how do we calculate the value of the correlation coefficient? You can use the following formula:

\[r = \frac{SS_{xy}}{\sqrt{SS_{xx}SS_{yy}}}\]

where:

\(SS_{yy} = \displaystyle\sum_{i=1}^{n} y_{i}^{2} - n(\bar{y})^{2}\)

\(SS_{xx} = \displaystyle\sum_{i=1}^{n} x_{i}^{2} - n(\bar{x})^{2}\)

\(SS_{xy} = \displaystyle\sum_{i=1}^{n} x_{i}y_{i} - n(\bar{x})(\bar{y})\)
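
As a minimal sketch, the formula above can be written as a small R function. The function name `pearson_r` and the toy vectors are assumptions made for illustration only; in practice, base R's `cor()` is the usual way to obtain r.

```r
# Hypothetical helper: Pearson's r from the sums-of-squares formula above
pearson_r <- function(x, y) {
  stopifnot(length(x) == length(y), length(x) >= 2)
  n <- length(x)
  ss_xx <- sum(x^2) - n * mean(x)^2              # SS_xx
  ss_yy <- sum(y^2) - n * mean(y)^2              # SS_yy
  ss_xy <- sum(x * y) - n * mean(x) * mean(y)    # SS_xy
  ss_xy / sqrt(ss_xx * ss_yy)
}

# Toy data (made up for illustration only)
x <- c(1, 2, 3, 4, 5)
y <- c(2, 4, 5, 4, 5)

pearson_r(x, y)                        # value from the formula
all.equal(pearson_r(x, y), cor(x, y))  # compare with base R's cor()
```

The comparison with `cor()` is only a sanity check that the formula has been typed in correctly.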

Before we look at an example that illustrates how to calculate r, here are some facts about correlation that you need to know in order to interpret and make sense of the value of r.

A correlation coefficient can only measure the strength of a linear relationship. It does not measure the strength of curved relationships, no matter how strong they are.
The value of r is always between -1 and 1. A positive value indicates a positive relationship between the variables and a negative value indicates a negative relationship.
r = -1 indicates a perfect negative linear correlation (which rarely occurs in real data).
r = 1 indicates a perfect positive linear correlation (which also rarely occurs in real data).
r = 0 indicates no linear correlation. It tells us that there is no linear relationship, but there might or might not be a non-linear relationship.
According to the above, we can say that a value of r close to 0 implies little or no linear relationship between x and y. In contrast, the closer r is to -1 or 1, the stronger the linear relationship between x and y.
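
To illustrate the first fact, here is a small sketch with made-up data in which y is completely determined by x, yet r is essentially zero because the relationship is curved rather than linear. The vectors are assumptions chosen for illustration.

```r
# Made-up data: y depends exactly on x, but through a curve (y = x^2)
x <- -3:3
y <- x^2

r_curved <- cor(x, y)
round(r_curved, 10)    # essentially zero: no linear relationship

# For comparison, an exact straight-line relationship
r_line <- cor(x, 2 * x + 1)
r_line                 # perfect positive linear correlation
```

A value of r near 0 therefore says nothing about whether a non-linear relationship is present; a scatter plot remains the quickest way to check.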

Example 3.1 Calculate the correlation coefficient for the income-savings data.

Solution

The quantities needed to calculate r are \(SS_{xx}\), \(SS_{xy}\) and \(SS_{yy}\). The first two quantities have been calculated previously and are repeated here for convenience. Let’s have a look at the data again.

Individual	Income (in R1000)	Savings (in R100)
A	24	12.0
B	26	14.0
C	12	1.5
D	22	9.0
E	20	6.0
F	18	2.0

\(\displaystyle\sum_{i=1}^{n} x_{i} = 122, \qquad \displaystyle\sum_{i=1}^{n} y_{i} = 44.5, \qquad \displaystyle\sum_{i=1}^{n} x_{i}y_{i} = 1024, \qquad \displaystyle\sum_{i=1}^{n} x_{i}^{2} = 2604, \qquad \displaystyle\sum_{i=1}^{n} y_{i}^{2} = 463.25\)

\(\bar{x} = \frac{\displaystyle\sum_{i=1}^{n} x_{i}}{n} = 20.3333 \qquad \bar{y} = \frac{\displaystyle\sum_{i=1}^{n} y_{i}}{n} = 7.4167\)

\(SS_{xx} = \displaystyle\sum_{i=1}^{n} x_{i}^{2} - n(\bar{x})^{2} = 2604 - 6(20.3333)^{2} = 2604 - 2480.6585 = 123.3415\)

\(SS_{xy} = \displaystyle\sum_{i=1}^{n} x_{i}y_{i} - n(\bar{x})(\bar{y}) = 1024 - 6(20.3333)(7.4167) = 1024 - 904.8359 = 119.1641\)

\(SS_{yy} = \displaystyle\sum_{i=1}^{n} y_{i}^{2} - n(\bar{y})^{2} = 463.25 - 6(7.4167)^{2} = 463.25 - 330.0446 = 133.2054\)

\(r = \frac{SS_{xy}}{\sqrt{SS_{xx} SS_{yy}}} = \frac{119.1641}{\sqrt{123.3415 \times 133.2054}} = \frac{119.1641}{128.1786} = 0.9297\)

The correlation coefficient is very close to 1, which indicates a very strong, positive linear correlation between “monthly income” and “savings”. This means that income (x) and savings (y) are strongly linearly associated. The positive sign indicates that as monthly income (x) increases, savings (y) tend to increase, and as monthly income decreases, savings tend to decrease.

3.2 Using R to calculate the coefficient of correlation


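# Monthly income (in R1000) for individuals A to F
income <- c(24, 26, 12, 22, 20, 18)

# Savings (in R100) for the same individuals
savings <- c(12, 14, 1.5, 9, 6, 2)

# Pearson correlation coefficient r
cor(income, savings)
## [1] 0.9297129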
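
As a rough cross-check, the sketch below recomputes the quantities from Example 3.1 with the sums-of-squares formula, using unrounded means. The variable names (`ss_xx`, `ss_yy`, `ss_xy`, `r_manual`) are chosen here for illustration. Because the means are not rounded to four decimals, the intermediate values can differ slightly from a hand calculation, while the final result should agree with `cor()`.

```r
income  <- c(24, 26, 12, 22, 20, 18)
savings <- c(12, 14, 1.5, 9, 6, 2)
n <- length(income)

# Sums of squares, computed with unrounded means
ss_xx <- sum(income^2) - n * mean(income)^2
ss_yy <- sum(savings^2) - n * mean(savings)^2
ss_xy <- sum(income * savings) - n * mean(income) * mean(savings)

# Correlation coefficient from the formula r = SS_xy / sqrt(SS_xx * SS_yy)
r_manual <- ss_xy / sqrt(ss_xx * ss_yy)

round(c(SS_xx = ss_xx, SS_yy = ss_yy, SS_xy = ss_xy, r = r_manual), 4)
all.equal(r_manual, cor(income, savings))
```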

3.3 Hypothesis test for correlation (r)

The previous hypothesis tests were performed to determine statistically whether a linear relationship (regression) exists between the variables. In this section we discuss another hypothesis test, which determines whether a correlation between the variables exists. Keep in mind that the correlation coefficient r measures the strength of the correlation between the x-values and y-values of the sample data. A similar coefficient of correlation exists for the whole population from which the sample was selected, and we want to know whether there is evidence of a statistically significant correlation between the x-values and y-values in that population. As before, to learn more about the population we have to perform a hypothesis test. While r denotes the correlation for the sample data, the population correlation coefficient is denoted by the symbol \(\rho\). The procedure to test for correlation between two variables is outlined in the figure below.

Example 3.2 Test at a 5% level of significance whether a linear correlation exists for the income-savings data. Recall from Section 3.2 that r = 0.9297 for these data, so that \(r^{2} = 0.8644\).

Solution

Hypothesis

\(H_{0}: \rho = 0\)

\(H_{a}: \rho \neq 0\)

\(\alpha = 5\%\)

\(t_{calc} = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} = \frac{0.9297\sqrt{6-2}}{\sqrt{1-0.8644}} = \frac{1.8594}{0.3682} \approx 5.05\)

\(t_{tab}(n-2; \frac{\alpha}{2}) = t_{tab}(6-2; \frac{0.05}{2}) = t_{tab}(4; 0.025) = 2.776\)

Reject \(H_{0}\) if \(|t_{calc}| > t_{tab}\)

Conclusion

Reject \(H_{0}\), because \(|t_{calc}| = 5.05 > 2.776\). There is a significant linear correlation between monthly income and savings in the population at the 5% level of significance.
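
The steps of this test can be sketched in R as follows. The variable names are illustrative; `qt()` and `pt()` are base R functions for the t distribution. Because R works with the unrounded value of r, the test statistic may differ slightly in the last decimals from a hand calculation.

```r
income  <- c(24, 26, 12, 22, 20, 18)
savings <- c(12, 14, 1.5, 9, 6, 2)

n     <- length(income)
alpha <- 0.05
r     <- cor(income, savings)

# Test statistic: t = r * sqrt(n - 2) / sqrt(1 - r^2)
t_calc <- r * sqrt(n - 2) / sqrt(1 - r^2)

# Two-sided critical value with n - 2 degrees of freedom
t_tab <- qt(1 - alpha / 2, df = n - 2)

# Decision rule: reject H0 if |t_calc| > t_tab
reject_h0 <- abs(t_calc) > t_tab

# Two-sided p-value, an alternative route to the same decision
p_value <- 2 * pt(abs(t_calc), df = n - 2, lower.tail = FALSE)

c(t_calc = t_calc, t_tab = t_tab, p_value = p_value)
reject_h0
```

Comparing `p_value` with `alpha` leads to the same decision as comparing `|t_calc|` with `t_tab`.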

Example 3.3 Underinflated or overinflated tires can increase tire wear and fuel usage. A manufacturer of a new tire tested the tire for wear at different pressures. The correlation coefficient r was calculated as -0.1138. The dataset is given below.

Interpret the value of r.
Determine the value of the coefficient of determination and interpret the value.
Test at a 10% level of significance whether a linear correlation exists between pressure and the number of kilometers before the tire is worn out.

Pressure (kg per square cm)	Kilometers (in thousands)
30	29.5
30	30.2
31	32.1
31	34.5
32	36.3
32	35.0
33	38.2
33	37.6
34	37.7
34	36.1
35	33.6
35	34.2
36	26.8
36	27.4

Solution

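The value of r is negative and very close to zero, so there is a very weak, negative linear correlation between pressure and number of kilometers.
\(R^{2} = r^{2} = (-0.1138)^{2} = 0.013\), or 1.3%. Only 1.3% of the variation in the number of kilometers is explained by a linear relationship with pressure; the remaining 98.7% is not explained by the linear model. Keep in mind that this describes the linear relationship only: in the data, the kilometers first rise and then fall as the pressure increases, which suggests a curved relationship that r cannot measure.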

Hypothesis

\(H_{0}: \rho = 0\)

\(H_{a}: \rho \neq 0\)

\(\alpha = 10\%\)

\(t_{calc} = \frac{r\sqrt{n-2}}{\sqrt{1-r^{2}}} = \frac{-0.1138\sqrt{14-2}}{\sqrt{1-(-0.1138)^{2}}} = \frac{-0.3942}{0.9935} = -0.3968\)

\(t_{tab}(n-2; \frac{\alpha}{2}) = t_{tab}(14-2; \frac{0.10}{2}) = t_{tab}(12; 0.05) = 1.782\)

Reject \(H_{0}\) if \(|t_{calc}| > t_{tab}\)

Conclusion

Do not reject \(H_{0}\), because \(|t_{calc}| = 0.3968 < 1.782\). There is no significant linear correlation between tire pressure and wear at the 10% level of significance.
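
The same analysis can be sketched with base R's `cor.test()`, which reports r, the t statistic, the degrees of freedom and a p-value in one call. The vector names `pressure` and `km` are chosen here for illustration, and small differences from the hand calculation are due to rounding.

```r
pressure <- c(30, 30, 31, 31, 32, 32, 33, 33, 34, 34, 35, 35, 36, 36)
km <- c(29.5, 30.2, 32.1, 34.5, 36.3, 35.0, 38.2, 37.6,
        37.7, 36.1, 33.6, 34.2, 26.8, 27.4)

# Two-sided test of H0: rho = 0 based on Pearson's r
test <- cor.test(pressure, km, method = "pearson",
                 alternative = "two.sided")

r         <- unname(test$estimate)   # sample correlation coefficient
r_squared <- r^2                     # coefficient of determination
t_calc    <- unname(test$statistic)  # test statistic with n - 2 df

alpha <- 0.10
reject_h0 <- test$p.value < alpha    # same decision as |t_calc| > t_tab

round(c(r = r, r_squared = r_squared, t_calc = t_calc,
        p_value = test$p.value), 4)
reject_h0
```

Calling `plot(pressure, km)` afterwards is a quick way to see the shape of the relationship that the single number r summarises.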