SIMPLE LINEAR REGRESSION

Influence, or effect, is a special case of association and another notable concept in statistics. Instead of testing whether there is a relationship between communication quality and customers’ satisfaction, one may be more interested in investigating whether communication quality has an influence on customers’ satisfaction. Similarly, rather than asking whether public perception and the destination image of Jakarta are correlated, a researcher may enquire whether public perception has an effect on the destination image of Jakarta. One of many statistical tools for addressing such questions is regression analysis. Regression analysis has many types and uses; what we discuss here is its simplest form, called simple linear regression.

To perform the regression analysis, we first have to classify the variables of interest into a dependent variable and an independent (explanatory) variable. The variable that is hypothesized to influence another variable is called the explanatory variable; the variable being influenced is the dependent variable. Thus, in simple linear regression we test whether the explanatory variable has a significant effect on the dependent variable. From now on, the dependent variable is denoted by the capital letter Y and the explanatory variable by X. Consequently, when investigating whether communication quality has an influence on customers’ satisfaction, Y is customers’ satisfaction and X is communication quality. When determining whether public perception affects the destination image of Jakarta, Y is the destination image of Jakarta and X is public perception.

The mathematical model of simple linear regression is $Y_i = {\beta}_1 + {\beta}_2 X_i + u_i$. Here, $u_i$ (the stochastic disturbance) represents all the neglected variables which may affect Y but are not included in the model. Let’s see an example of testing whether salesmen’s salary (X) has an influence on target accomplishment (Y). The regression model is $\text{accomplishment}_i = {\beta}_1 + {\beta}_2 \, \text{salary}_i + u_i$. Many factors other than salesmen’s salary may affect sales target accomplishment, for example the products’ quality and price, and advertising intensity. These variables are represented by the stochastic disturbance $u_i$. Using this model, to test whether X affects Y, we apply the hypothesis testing procedure with H0: ${\beta}_2 = 0$ and H1: ${\beta}_2 \neq 0$. The test statistic is $t = \frac{{\hat{\beta}}_2}{s.e.({\hat{\beta}}_2)}$, where $s.e.({\hat{\beta}}_2)$ is the standard error of ${\hat{\beta}}_2$, and ${\hat{\beta}}_2$ is the estimator of ${\beta}_2$ calculated from the sample.

Here is a set of formulae required to calculate the test statistic t.
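
In the notation of this post, the standard least-squares expressions for these quantities can be written as follows. This is a summary under the usual textbook conventions, with $x_i = X_i - \bar{X}$ and $y_i = Y_i - \bar{Y}$ denoting deviations from the means; all sums run from i = 1 to n.

$${\hat{\beta}}_2 = \frac{\sum x_i y_i}{\sum {x_i}^2}$$

$$\sum {{\hat{u}}_i}^2 = \sum {y_i}^2 - {{\hat{\beta}}_2}^2 \sum {x_i}^2$$

$$\hat{\sigma} = \sqrt{\frac{\sum {{\hat{u}}_i}^2}{n - 2}}$$

$$s.e.({\hat{\beta}}_2) = \frac{\hat{\sigma}}{\sqrt{\sum {x_i}^2}}$$

$$t = \frac{{\hat{\beta}}_2}{s.e.({\hat{\beta}}_2)}$$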

For those who prefer the fewest possible number of formulae, just master the following two:


PROCEDURE TO FIND THE TEST STATISTIC t
Given n pairs of data (X1,Y1), (X2,Y2), (X3,Y3), ..., (Xn,Yn), calculate $\bar{X}$ and $\bar{Y}$, the arithmetic means of the Xi and the Yi, respectively. Then, for each Xi, find xi using the formula $x_i = X_i - \bar{X}$. Similarly, find $y_i = Y_i - \bar{Y}$ for every Yi. As a result, we have n new pairs (x1,y1), (x2,y2), ... , (xn,yn). From these new pairs, we can find, in this order, ${\hat{\beta}}_2$, $\sum_{i=1}^n {{\hat{u}}_i}^2$, $\hat{\sigma}$, $s.e.({\hat{\beta}}_2)$, and finally the t statistic.
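
The procedure can be sketched in a few lines of Python. This is only a minimal illustration, assuming the usual least-squares expressions for the residual sum of squares and the standard error. The function name slope_t_statistic and the five data pairs are made up for the example and do not come from the sample problem.

```python
import math

def slope_t_statistic(xs, ys):
    """Return (beta2_hat, se_beta2, t) for the regression of ys on xs."""
    n = len(xs)
    if n != len(ys) or n < 3:
        raise ValueError("need at least 3 paired observations")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Deviations from the means: x_i = X_i - X_bar, y_i = Y_i - Y_bar
    x = [X - x_bar for X in xs]
    y = [Y - y_bar for Y in ys]
    sum_x2 = sum(a * a for a in x)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_y2 = sum(b * b for b in y)
    beta2_hat = sum_xy / sum_x2
    # Sum of squared residuals; max() guards against tiny negative rounding
    rss = max(sum_y2 - beta2_hat ** 2 * sum_x2, 0.0)
    sigma_hat = math.sqrt(rss / (n - 2))
    se_beta2 = sigma_hat / math.sqrt(sum_x2)
    # Note: a perfect fit (rss == 0) makes se_beta2 zero and t undefined
    return beta2_hat, se_beta2, beta2_hat / se_beta2

# Hypothetical data, for illustration only
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
beta2_hat, se_beta2, t = slope_t_statistic(xs, ys)
print(f"beta2_hat = {beta2_hat:.3f}, s.e. = {se_beta2:.3f}, t = {t:.3f}")
```

The function follows the same order as the text: means, deviations, the three sums, and then the estimator, its standard error, and t.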

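The procedure is probably best performed with the aid of a table with seven columns: (1) $X_i$, (2) $Y_i$, (3) $x_i$, (4) $y_i$, (5) ${x_i}^2$, (6) $x_i y_i$, and (7) ${y_i}^2$, plus a bottom row that holds the totals of columns (5), (6), and (7) in cells (R1), (R2), and (R3).

The cells (R1), (R2), and (R3) are filled with $\sum_{i=1}^n {x_i}^2$, $\sum_{i=1}^n x_i y_i$, and $ \sum_{i=1}^n {y_i}^2$, respectively.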

To illustrate how the “test for influence” is conducted, let’s see a sample problem:

A communication researcher is conducting a study to test whether communication by Public Relations personnel affects the companies’ customer satisfaction. To do that, he has selected a random sample of 10 cases; the results are summarized in the following table.


At the .05 significance level, does communication by PR personnel have an influence on the companies’ customer satisfaction?

To answer this, follow the 5-step procedure of hypothesis testing.

Step 1: State null and alternate hypotheses
H0: Communication by PR personnel does not affect customer satisfaction
H1: Communication by PR personnel affects customer satisfaction

Assuming that the regression model is $Y_i = {\beta}_1 + {\beta}_2 X_i + u_i$, the hypotheses can be stated mathematically as
H0: ${\beta}_2 = 0$
H1: ${\beta}_2 \neq 0$

Notes on the hypotheses
There are three possibilities for H1. If we are testing solely whether the influence exists, the H1 is $ {\beta}_2 \neq 0$. (We call such an influence non-directional.) In other cases, we may hypothesize that the explanatory variable has some “directional” influence on the dependent variable, in a sense similar to the “directional” association discussed in the post on Spearman’s Rank Correlation Coefficient. If the influence is directional, the H1 is ${\beta}_2 > 0$ or ${\beta}_2 < 0$, depending on the hypothesized direction, and as a consequence a one-sided test applies. For instance, we may suggest that better communication quality of the PR personnel leads to higher customer satisfaction. In this case, the H1 must be ${\beta}_2 > 0$. Consider another situation where a researcher hypothesizes that juveniles’ exposure to social media has a negative influence on their capability to socialize in real life, meaning that more exposure to social media contributes to less capability to socialize in real life. In this case, the H1 must be $\beta_2 < 0$.
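
The three forms of H1 lead to three different rejection rules, which the following sketch makes explicit. The function name reject_h0 and the labels "two-sided", "greater" and "less" are our own choices for the illustration. The value t = 2.0 is hypothetical, and the critical values are the usual t-table entries for df = 8.

```python
def reject_h0(t_stat, critical_value, alternative="two-sided"):
    """Decide whether H0: beta2 = 0 is rejected.

    critical_value is the positive t-table value for the chosen alpha:
    upper-tail area alpha/2 for "two-sided", alpha for "greater" or "less".
    """
    if alternative == "two-sided":   # H1: beta2 != 0
        return abs(t_stat) > critical_value
    if alternative == "greater":     # H1: beta2 > 0
        return t_stat > critical_value
    if alternative == "less":        # H1: beta2 < 0
        return t_stat < -critical_value
    raise ValueError("alternative must be 'two-sided', 'greater' or 'less'")

# Hypothetical t = 2.0 with df = 8 and alpha = .05.
# t-table values for df = 8: 2.306 (upper tail .025), 1.860 (upper tail .05).
print(reject_h0(2.0, 2.306, "two-sided"))  # False: 2.0 lies inside (-2.306, 2.306)
print(reject_h0(2.0, 1.860, "greater"))    # True: 2.0 exceeds 1.860
print(reject_h0(2.0, 1.860, "less"))       # False: 2.0 is not below -1.860
```

The same computed t can therefore lead to different decisions depending on whether the hypothesized influence is directional, which is why H1 must be fixed before looking at the data.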

Step 2: Select a level of significance
In this sample problem, the level of significance is $\alpha = .05$.

Notes on the significance level
In practice, the level of significance is chosen by the researcher. Because this number is the probability of rejecting the null hypothesis when it is actually true, researchers usually select a small value such as .01, .05, or .10.

Step 3: Identify the test statistic
To test whether the explanatory variable affects the dependent variable, the test statistic is $t = \frac{{\hat{\beta}}_2}{s.e.({\hat{\beta}}_2)}$.

A note on the test statistic
If the stochastic disturbance $u_i$ is normally distributed with mean 0 and variance ${\sigma}^2$, then for any value of ${\beta}_2$ the random variable $T = \frac{({\hat{\beta}}_2 - {\beta}_2) \sqrt{\sum_{i=1}^n {x_i}^2}}{\hat{\sigma}}$ follows the t distribution with df = n - 2.
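
To see how this connects to the test statistic of Step 3, set ${\beta}_2 = 0$, as H0 claims. Assuming the usual expression for the standard error in this model, $s.e.({\hat{\beta}}_2) = \hat{\sigma} / \sqrt{\sum {x_i}^2}$, the variable T reduces to t:

$$T = \frac{({\hat{\beta}}_2 - 0) \sqrt{\sum {x_i}^2}}{\hat{\sigma}} = \frac{{\hat{\beta}}_2}{\hat{\sigma} / \sqrt{\sum {x_i}^2}} = \frac{{\hat{\beta}}_2}{s.e.({\hat{\beta}}_2)} = t$$

So, when H0 is true, the computed t is a draw from the t distribution with n - 2 degrees of freedom, which is why the critical values are taken from that distribution.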

Step 4: Formulate a decision rule
As the hypothesized influence is not directional, there should be two rejection regions each of which has the area of $\frac{\alpha}{2} = \frac{.05}{2} = .025$. Referring to the t distribution critical values with upper tail probability = .025 and df = 8 (df = n - 2 = 10 - 2 = 8), we obtain the critical value 2.306. Consequently, the rejection regions are t < -2.306 or t > 2.306. It means that if the computed t value is less than -2.306 or greater than 2.306, the null hypothesis is rejected. The regions with red shading below depict the rejection areas.
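
If no t table is at hand, the critical value can be approximated numerically. The sketch below is one possible way to do it with the Python standard library only: it integrates the t density with Simpson's rule and searches for the cut-off by bisection. The helper names upper_tail and critical_value are our own, and the approach is meant for moderate df (math.gamma overflows for very large df).

```python
import math

def upper_tail(t, df, steps=2000):
    """Approximate P(T > t) for t >= 0 using Simpson's rule on [0, t]."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))

    def pdf(u):
        # Density of Student's t distribution with df degrees of freedom
        return coef * (1 + u * u / df) ** (-(df + 1) / 2)

    h = t / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t)
    for k in range(1, steps):
        total += (4 if k % 2 else 2) * pdf(k * h)
    # By symmetry P(T > 0) = 0.5; subtract the area between 0 and t
    return 0.5 - total * h / 3

def critical_value(tail_area, df):
    """Find c >= 0 such that P(T > c) = tail_area, by bisection."""
    lo, hi = 0.0, 100.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if upper_tail(mid, df) > tail_area:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

alpha, n = 0.05, 10
df = n - 2
two_sided = critical_value(alpha / 2, df)  # about 2.306, as in the t table
one_sided = critical_value(alpha, df)      # about 1.860, for a directional H1
print(f"two-sided: {two_sided:.3f}, one-sided: {one_sided:.3f}")
```

The two-sided value agrees with the table value 2.306 used in the decision rule; the one-sided value is what would apply if H1 were directional.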



Step 5: Decide whether the null hypothesis is rejected or not
To make a decision about the null hypothesis, we have to calculate the test statistic $t = \frac{{\hat{\beta}}_2}{s.e.({\hat{\beta}}_2)}$ from the sample data.

First of all, copy the sample data to the table using the format suggested above. The utmost caution must be exercised when categorizing the data into explanatory and dependent variables. In this example, the researcher is going to test whether communication by Public Relations personnel affects the companies’ customer satisfaction. The variable which is hypothesized to affect another variable must be treated as the explanatory variable. Therefore, the explanatory variable (X) in this case is communication (quality). The other one, customer satisfaction, is the dependent variable (Y).

Then, find the averages of the Xi and the Yi. The average of the Xi is $\bar{X} = \frac{75+60+82+...+68}{10} = 75$. Similarly, $\bar{Y} = \frac{72+65+80+...+76}{10}=76$. To fill in column (3), use the formula $x_i = X_i - \bar{X}$, that is, subtract 75 from each entry in column (1). For instance, x1 = X1 - 75 = 75 - 75 = 0. Then, x2 = X2 - 75 = 60 - 75 = -15, and so on. The entries in column (4) are determined similarly, using $ y_i = Y_i - \bar{Y}$: subtract 76 from each entry in column (2). This leads to y1 = 72 - 76 = -4, y2 = 65 - 76 = -11, y3 = 80 - 76 = 4, etc. The complete entries of columns (3) and (4) are shown below.

Next, to fill in columns (5) and (7), calculate the square of each entry in columns (3) and (4), respectively. Subsequently, fill in column (6) by multiplying the entries in column (3) by the ones in column (4). Following this, add all the entries in column (5) together, as well as in columns (6) and (7). These result in the following table.
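
As a small illustration of how columns (3) to (7) are filled, the sketch below computes the worksheet entries for the three observations quoted in the text, (75, 72), (60, 65) and (82, 80), taking the means 75 and 76 as given. The function name worksheet_row is made up for this example. The bottom-row totals need all ten observations, so they are not recomputed here.

```python
X_BAR, Y_BAR = 75, 76  # means of X and Y reported in the text

# The first three (X_i, Y_i) pairs quoted in the text
observed = [(75, 72), (60, 65), (82, 80)]

def worksheet_row(X, Y, x_bar, y_bar):
    """Return columns (1)-(7): X, Y, x, y, x^2, x*y, y^2."""
    x = X - x_bar  # column (3)
    y = Y - y_bar  # column (4)
    return (X, Y, x, y, x * x, x * y, y * y)

rows = [worksheet_row(X, Y, X_BAR, Y_BAR) for X, Y in observed]
for row in rows:
    print(row)
```

For the second observation, for example, this gives x = -15, y = -11, x squared = 225, xy = 165 and y squared = 121.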

Referring to the bottom row of the table, we get $\sum_{i=1}^{10} {x_i}^2 = 1058$, $\sum_{i=1}^{10} x_i y_i = 762$, and $ \sum_{i=1}^{10} {y_i}^2 = 652$.

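To find the standard error of ${\hat{\beta}}_2$, use the following formula, written in terms of the three sums:
$s.e.({\hat{\beta}}_2) = \sqrt{\frac{\sum_{i=1}^n {y_i}^2 - \frac{\left(\sum_{i=1}^n x_i y_i\right)^2}{\sum_{i=1}^n {x_i}^2}}{(n-2) \sum_{i=1}^n {x_i}^2}}$
Substituting the sums available at the bottom row of the table for the terms and factors on the right side of the equation, we get:
$s.e.({\hat{\beta}}_2) = \sqrt{\frac{652 - \frac{762^2}{1058}}{8 \times 1058}} \approx \sqrt{\frac{103.19}{8464}} \approx .11$
After this, calculate ${\hat{\beta}}_2 = \frac{\sum_{i=1}^n x_i y_i}{\sum_{i=1}^n {x_i}^2} = \frac{762}{1058} \approx .72$.
Finally, calculate the test statistic $t = \frac{{\hat{\beta}}_2}{s.e.({\hat{\beta}}_2)} = \frac{.72}{.11} \approx 6.545$. (Dividing the unrounded values instead gives $t \approx 6.52$; the decision is the same either way.)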
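
The arithmetic of this step can be checked with a short script that starts from the three sums in the bottom row of the table. It is a sketch that assumes the usual least-squares expressions for the residual sum of squares and the standard error. Because it carries full precision, it yields t of about 6.52 rather than 6.545, which comes from rounding ${\hat{\beta}}_2$ and the standard error to two decimals before dividing.

```python
import math

n = 10
sum_x2, sum_xy, sum_y2 = 1058, 762, 652  # bottom row of the table

beta2_hat = sum_xy / sum_x2
rss = sum_y2 - sum_xy ** 2 / sum_x2      # sum of squared residuals
sigma_hat = math.sqrt(rss / (n - 2))
se_beta2 = sigma_hat / math.sqrt(sum_x2)
t = beta2_hat / se_beta2

print(f"beta2_hat = {beta2_hat:.4f}")    # rounds to .72
print(f"s.e.      = {se_beta2:.4f}")     # rounds to .11
print(f"t         = {t:.3f}")            # roughly 6.52

# Two-sided decision at alpha = .05 with df = 8 (critical value 2.306)
print("reject H0" if abs(t) > 2.306 else "do not reject H0")
```

Either way, the computed t is far beyond the critical value 2.306.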

According to the decision rule stipulated in Step 4, we have to reject the null hypothesis (as the computed t value 6.545 is greater than the critical value 2.306). [The following picture may illustrate the reason for the rejection.]


The conclusion is: Communication by PR personnel has a significant influence on the companies’ customer satisfaction.


PROBLEM

A researcher majoring in communication science is conducting a study of a causal relationship between teenagers’ exposure to Instagram and the quality of their real social life. The hypothesized relationship is: “Teenagers’ exposure to Instagram affects the quality of their real social life.” To test the hypothesis, he took a sample of 12 teenagers, measured the related variables, and obtained the following data:

Using .05 level of significance, what conclusion must he draw?
