Linear Regression

Statistics Column — Gilbert Berdine MD, Shengping Yang PhD

I am analyzing data from a height and age study for children under 10 years old. I am assuming that height and age have a linear relationship. Should I use a linear regression to analyze these data?

In previous articles, statistical methods were presented which characterize data by a group mean and variance. The physical interpretation of this methodology is that the dependent variable has some average expected value (norm) and that deviations from the norm are due to random effects. Statistical methods were also discussed to compare two data sets and decide the likelihood that any differences were due to the random effects rather than systematic differences.

Some data have expected differences between two points. For example, it should surprise nobody that two children of different ages would have different heights. Suppose we wish to examine the nature of the effect of one variable, such as age, on another variable, such as height. We are not attributing the differences in height to some unknown random effect, such as imprecision in the birthdate, but we are expecting a difference in the dependent variable height due to an expected effect of the independent variable age.

Analysis starts with the examination of a scatter plot of the dependent variable on the y-axis and the independent variable on the x-axis. Figure 1 is adapted from a scatter plot in the Wikimedia Commons (1). The data points are the blue dots. Each point represents a pair of an independent variable x value with its dependent variable y value. The red line is the regression line using slope-intercept form:

y = m*x + b

Every line can be defined by a slope (m) and y-intercept (b). Linear regression fits a "best" line to the set of data. What defines "best"? The most common method used to define "best" is the method of least squares.
The "best" line is the line that minimizes the sum of the squares of the difference between the observed values for y and the predicted values m*x + b. The squares of the differences are used so that deviations below the line do not cancel deviations above the line. The sum of the variances (S) between the predicted values and observed values can be expressed as:

S = Σ_{i=1..n} [y_i − (m*x_i + b)]²

S is a function of the choices for both m and b. If a minimum value for a curve exists, the slope of the curve at that minimum is zero. The minimum value of S is determined by taking the derivative of S with respect to m and the derivative of S with respect to b and setting both expressions to zero. Note that during the calculation of the regression coefficients, the total variance is a function of the coefficients and the data values are treated as constants rather than variables.

dS/dm = −2 Σ x_i*y_i + 2b Σ x_i + 2m Σ x_i² = 0,
dS/db = −2 Σ y_i + 2m Σ x_i + 2nb = 0.

Note that Σ_{i=1..n} 2b = 2nb, where n is the number of data points.

This is a system of two equations with two unknowns, so a unique solution can be found. The solution is usually shown in the following form:

slope m = [n Σ x_i*y_i − (Σ x_i)(Σ y_i)] / [n Σ x_i² − (Σ x_i)²], and
intercept b = [Σ y_i − m Σ x_i] / n.

The slope m is calculated first and is then used to calculate the intercept b. Note the symmetry in the formula for the slope m: both the numerator and denominator are the product of n and the sum of a product of individual values, minus the product of the sum of the first value and the sum of the second value. This form is common to all types of moment analysis. A complete discussion of moments is beyond the scope of this article.

Correlation

How good is the fit between the observed data and the parameterized linear model? The usual approach to answering this question is known as the Pearson correlation coefficient r. The Pearson r is a moment analysis known as covariance. For the population Pearson correlation coefficient, the general formula is:

ρ_{X,Y} = cov(X, Y) / (σ_X * σ_Y)

where σ is the standard deviation of a variable. The Pearson r is usually calculated from the same intermediate values used to calculate the regression coefficients:

r_xy = [n Σ x_i*y_i − (Σ x_i)(Σ y_i)] / √{[n Σ x_i² − (Σ x_i)²] * [n Σ y_i² − (Σ y_i)²]}

The Pearson r can have values from −1 to +1, with a value of 0 meaning no correlation at all, a value of +1 meaning a perfect fit to a positive slope line, and a value of −1 meaning a perfect fit to a negative slope line. The special case of a perfect fit to a horizontal line also has r = 0, but this is because the dependent variable does not vary at all.

Example

Consider a simple set of 4 data points {(0, 1), (1, 3), (2, 5), (3, 7)}.

Σ x_i = 0 + 1 + 2 + 3 = 6.
Σ y_i = 1 + 3 + 5 + 7 = 16.
Σ x_i² = 0 + 1 + 4 + 9 = 14.
Σ y_i² = 1 + 9 + 25 + 49 = 84.
Σ x_i*y_i = 0 + 3 + 10 + 21 = 34.

Slope m = [4*34 − 6*16] / [4*14 − 6*6] = [136 − 96] / [56 − 36] = 40/20 = 2.
Intercept b = [16 − 2*6] / 4 = [16 − 12] / 4 = 4/4 = 1.
Pearson r = [4*34 − 6*16] / √{[4*14 − 6*6] * [4*84 − 16*16]} = [136 − 96] / √{[56 − 36] * [336 − 256]} = 40 / √[20*80] = 40 / √1600 = 40/40 = 1.

Thus, we see a perfect fit to a line with positive slope.

Adaptations of Linear Regression

The main advantage of ordinary least squares is simplicity. The next advantage is that the math is well understood. This method can be easily adapted to non-linear functions. The exponential function y = A*e^(B*x) can be adapted by taking the logarithm of both sides: ln(y) = ln(A) + B*x. By transforming y' = ln(y), one can fit ln(y) to obtain A and B. This is, in effect, drawing the data on semi-log graph paper and fitting the best line to the graph.

The power function y = A*x^B can be analyzed in the same way: ln(y) = ln(A) + B*ln(x). The graphical equivalent would be to plot the data on log-log paper and fit the best line to the result.

Multiple regression adds parameters that need to be solved for best fit. Each new parameter adds an additional derivative expression that is set to zero and is part of a larger system of equations, with the number of equations equal to the number of parameters. Generalized solutions of systems of equations are readily done with matrix notation that can be easily adapted to automated computing. This allows software packages to handle arbitrary numbers of parameters. An important caveat is that the parameters cannot be degenerate: that is, parameters cannot be linear combinations of other parameters.

The method of ordinary least squares gives equal weight to all points based on the square of the deviation from best fit (variance). This method will tolerate small errors (residuals) in many points rather than larger errors (residuals) for a single point. Other best-fit models may work better for data sets where most points are good fits and a few points are outliers.

The generalized linear model method can be adapted to other types of data, such as categorical data. The method of logistic regression will be presented in the next article.

Author affiliation: Gilbert Berdine is a pulmonary physician in the Department of Internal Medicine at TTUHSC. Shengping Yang is a biostatistician in the Department of Pathology at TTUHSC.
Corresponding author: Gilbert Berdine
Contact Information: gilbert.berdine@ttuhsc.edu
DOI: 10.12746/swrccc2014.0206.077
Published electronically: 4/15/2014

References

1. http://en.wikipedia.org/wiki/Linear_regression
2. http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
3. http://www.statisticshowto.com/how-to-find-a-linear-regression-equation/
4. http://www.statisticshowto.com/how-to-compute-pearsons-correlation-coefficients/

The Southwest Respiratory and Critical Care Chronicles 2014;2(6)
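The closed-form formulas and the worked example above can be checked with a short script. This is an illustrative sketch, not part of the original column; it uses only the Python standard library:

```python
import math

def fit_line(points):
    """Least-squares slope, intercept, and Pearson r for (x, y) pairs,
    computed from the closed-form sums described in the text."""
    n = len(points)
    sum_x = sum(x for x, _ in points)
    sum_y = sum(y for _, y in points)
    sum_xx = sum(x * x for x, _ in points)
    sum_yy = sum(y * y for _, y in points)
    sum_xy = sum(x * y for x, y in points)
    # slope m = [n*Σxy − Σx*Σy] / [n*Σx² − (Σx)²]
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_xx - sum_x ** 2)
    # intercept b = [Σy − m*Σx] / n
    b = (sum_y - m * sum_x) / n
    # Pearson r shares its numerator with the slope
    r = (n * sum_xy - sum_x * sum_y) / math.sqrt(
        (n * sum_xx - sum_x ** 2) * (n * sum_yy - sum_y ** 2)
    )
    return m, b, r

# The example data set from the text: a perfect fit with m = 2, b = 1, r = 1.
m, b, r = fit_line([(0, 1), (1, 3), (2, 5), (3, 7)])
print(m, b, r)  # 2.0 1.0 1.0
```

The intermediate sums (6, 16, 14, 84, 34) match the hand calculation in the Example section.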


ISSN: 2325-9205

