Statistics Column

Gilbert Berdine MD, Shengping Yang PhD

I am analyzing data from a height and age study for children under 10 years old. I am assuming that height and age have a linear relationship. Should I use a linear regression to analyze these data?

In previous articles, statistical methods were presented which characterize data by a group mean and variance. The physical interpretation of this methodology is that the dependent variable has some average expected value (norm) and that deviations from the norm are due to random effects. Statistical methods were also discussed to compare two data sets and decide the likelihood that any differences were due to random effects rather than systematic differences.

Some data have expected differences between two points. For example, it should surprise nobody that two children of different ages would have different heights. Suppose we wish to examine the nature of the effect of one variable, such as age, on another variable, such as height. We are not attributing the differences in height to some unknown random effect, such as imprecision in the birthdate; rather, we are expecting a difference in the dependent variable height due to an expected effect of the independent variable age.

Analysis starts with the examination of a scatter plot of the dependent variable on the y-axis and the independent variable on the x-axis. Figure 1 is adapted from a scatter plot in the Wikimedia Commons (1). The data points are the blue dots. Each point represents a pair of an independent variable x value with its dependent variable y value. The red line is the regression line using slope-intercept form:

$$y = mx + b$$

Every line can be defined by a slope (m) and y-intercept (b). Linear regression fits a "best" line to the set of data. What defines "best"? The most common method used to define "best" is the method of least squares.
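As a sketch of that first step, hypothetical age/height pairs can be generated and inspected before any line is fit. The ages, baseline height, growth rate, and noise level below are invented for illustration; they are not data from the study in the question.

```python
import random

random.seed(0)  # reproducible noise

# Hypothetical data: children aged 1 to 9.5 years, assuming roughly
# 6 cm of growth per year on a 75 cm baseline, plus random variation.
ages = [a / 2 for a in range(2, 20)]
heights = [75 + 6 * a + random.gauss(0, 2) for a in ages]

# A scatter plot of (age, height) is the usual first step; printing
# the pairs keeps this sketch free of plotting dependencies.
for a, h in zip(ages, heights):
    print(f"age {a:4.1f} y -> height {h:6.1f} cm")
```

Plotting these pairs would show points scattered around a rising straight line, which is the visual cue that a linear model is a reasonable starting choice.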
Corresponding author: Gilbert Berdine
Contact Information: gilbert.berdine@ttuhsc.edu
DOI: 10.12746/swrccc2014.0206.077
The Southwest Respiratory and Critical Care Chronicles 2014;2(6):61-62

The "best" line is the line that minimizes the sum of the squares of the differences between the observed values for y and the predicted values mx + b. The squares of the differences are used so that deviations below the line do not cancel deviations above the line. The sum of the squared differences (S) between the predicted values and observed values can be expressed as:

$$S = \sum_{i=1}^{n} \left( y_i - (m x_i + b) \right)^2$$

S is a function of the choices for both m and b. If a minimum value for a curve exists, the slope of the curve at that minimum is zero. The minimum value of S is determined by taking the derivative of S with respect to m and the derivative of S with respect to b and setting both expressions to zero. Note that during the calculation of the regression coefficients, the total variance is a function of the coefficients and the data values are treated as constants rather than variables.

$$\frac{dS}{dm} = -2\sum_{i=1}^{n} x_i y_i + 2b\sum_{i=1}^{n} x_i + 2m\sum_{i=1}^{n} x_i^2 = 0,$$

$$\frac{dS}{db} = -2\sum_{i=1}^{n} y_i + 2m\sum_{i=1}^{n} x_i + 2nb = 0.$$

Note that $\sum_{i=1}^{n} b = nb$, where n is the number of data points.
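The two normal equations above can be solved directly from running sums. The sketch below does so for hypothetical points that lie exactly on the line y = 2x + 1; the function name and data are illustrative, not from the article.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = m*x + b, solving the two
    normal equations obtained from dS/dm = 0 and dS/db = 0."""
    n = len(xs)
    sx = sum(xs)                              # sum of x_i
    sy = sum(ys)                              # sum of y_i
    sxx = sum(x * x for x in xs)              # sum of x_i^2
    sxy = sum(x * y for x, y in zip(xs, ys))  # sum of x_i * y_i
    m = (n * sxy - sx * sy) / (n * sxx - sx * sx)
    b = (sy - m * sx) / n
    return m, b

# Hypothetical points on the line y = 2x + 1:
m, b = fit_line([1, 2, 3, 4], [3, 5, 7, 9])
print(m, b)  # -> 2.0 1.0
```

Because the four points fall exactly on a line, the fitted slope and intercept recover it with zero residual error.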
This is a system of two equations with two unknowns, so a unique solution can be found. The solution is usually shown in the following form:

$$m = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2},$$

$$b = \frac{\sum_{i=1}^{n} y_i - m\sum_{i=1}^{n} x_i}{n}.$$

The slope m is calculated first and is then used to calculate the intercept b. Note the symmetry in the formula for the slope m: both the numerator and denominator are the product of n and the sum of a product of individual values, minus the product of the sum of the first value and the sum of the second value. This form is common to all types of moment analysis. A complete discussion of moments is beyond the scope of this article.

Correlation

How good is the fit between the observed data and the parameterized linear model? The usual approach to answering this question is known as the Pearson correlation coefficient r. The Pearson r is a moment analysis known as covariance. For the population Pearson correlation coefficient, the general formula is:

$$\rho_{X,Y} = \frac{\operatorname{cov}(X,Y)}{\sigma_X \sigma_Y},$$

where σ is the standard deviation of a variable. The Pearson r is usually calculated from the same intermediate values used to calculate the regression coefficients:

$$r_{xy} = \frac{n\sum_{i=1}^{n} x_i y_i - \sum_{i=1}^{n} x_i \sum_{i=1}^{n} y_i}{\sqrt{n\sum_{i=1}^{n} x_i^2 - \left(\sum_{i=1}^{n} x_i\right)^2}\,\sqrt{n\sum_{i=1}^{n} y_i^2 - \left(\sum_{i=1}^{n} y_i\right)^2}}.$$

The Pearson r can have values from -1 to +1, with a value of 0 meaning no correlation at all, a value of +1 meaning a perfect fit to a positive slope line, and a value of -1 meaning a perfect fit to a negative slope line. The special case of a perfect fit to a horizontal line also has r = 0, but this is because the dependent variable does not vary at all.
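Continuing the sketch, the sample Pearson r can be computed from the same intermediate sums used for the slope and intercept. The data below are illustrative.

```python
import math

def pearson_r(xs, ys):
    """Sample Pearson correlation coefficient, computed from the same
    intermediate sums used for the regression coefficients."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    syy = sum(y * y for y in ys)
    sxy = sum(x * y for x, y in zip(xs, ys))
    num = n * sxy - sx * sy
    den = math.sqrt(n * sxx - sx * sx) * math.sqrt(n * syy - sy * sy)
    return num / den

print(pearson_r([1, 2, 3, 4], [2, 4, 6, 8]))  # perfect positive fit, ≈ 1.0
print(pearson_r([1, 2, 3, 4], [8, 6, 4, 2]))  # perfect negative fit, ≈ -1.0
```

Note that a constant y (a perfect horizontal line) makes the second square root zero, so this formula divides by zero there; that is the r = 0 special case the text describes, and real code should guard for it.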
Example

Consider a simple set of 4 data points {(0, 1), (1, 3), (2, 5), (3, 7)}.

$$\sum_{i=1}^{4} x_i = 0 + 1 + 2 + 3 = 6.$$

$$\sum_{i=1}^{4} y_i = 1 + 3 + 5 + 7 = 16.$$

$$\sum_{i=1}^{4} x_i^2 = 0 + 1 + 4 + 9 = 14.$$

$$\sum_{i=1}^{4} y_i^2 = 1 + 9 + 25 + 49 = 84.$$

$$\sum_{i=1}^{4} x_i y_i = 0 + 3 + 10 + 21 = 34.$$

Slope m = [4 × 34 − 6 × 16] / [4 × 14 − 6 × 6] = [136 − 96] / [56 − 36] = 40 / 20 = 2.
Intercept b = [16 − 2 × 6] / 4 = [16 − 12] / 4 = 4 / 4 = 1.
Pearson r = [4 × 34 − 6 × 16] / (√[4 × 14 − 6 × 6] × √[4 × 84 − 16 × 16]) = [136 − 96] / (√[56 − 36] × √[336 − 256]) = 40 / √[20 × 80] = 40 / √1600 = 40 / 40 = 1.

Thus, we see a perfect fit to a line with positive slope.
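The arithmetic in this example can be checked numerically. The sketch below recomputes the intermediate sums and the three results for the same four points.

```python
pts = [(0, 1), (1, 3), (2, 5), (3, 7)]
n = len(pts)
sx  = sum(x for x, _ in pts)      # 0 + 1 + 2  + 3  = 6
sy  = sum(y for _, y in pts)      # 1 + 3 + 5  + 7  = 16
sxx = sum(x * x for x, _ in pts)  # 0 + 1 + 4  + 9  = 14
syy = sum(y * y for _, y in pts)  # 1 + 9 + 25 + 49 = 84
sxy = sum(x * y for x, y in pts)  # 0 + 3 + 10 + 21 = 34

m = (n * sxy - sx * sy) / (n * sxx - sx ** 2)  # 40 / 20 = 2.0
b = (sy - m * sx) / n                          # 4 / 4   = 1.0
r = (n * sxy - sx * sy) / ((n * sxx - sx ** 2) * (n * syy - sy ** 2)) ** 0.5
print(m, b, r)  # -> 2.0 1.0 1.0
```

The computed slope, intercept, and correlation match the hand calculation exactly, since the points lie on the line y = 2x + 1.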
Adaptations of Linear Regression

The main advantage of ordinary least squares is simplicity. The next advantage is that the math is well understood. This method can be easily adapted to non-linear functions.

The exponential function y = Ae^(Bx) can be adapted by taking the logarithm of both sides: ln(y) = ln(A) + Bx. By transforming y′ = ln(y), one can fit a straight line to obtain ln(A) and B. This is, in effect, drawing the data on semi-log graph paper and fitting the best line to the graph.

The power function y = Ax^B can be analyzed in the same way: ln(y) = ln(A) + B ln(x). The graphical equivalence would be to plot the data on log-log paper and fit the best line to the result.

Multiple regression adds parameters that need to be solved for best fit. Each new parameter adds an additional derivative expression that is set to zero and is part of a larger system of equations, with the number of equations equal to the number of parameters. Generalized solutions of systems of equations are readily done with matrix notation that can be easily adapted to automated computing. This allows software packages to handle arbitrary numbers of parameters. An important caveat is that the parameters cannot be degenerate: that is, parameters cannot be linear combinations of other parameters.

The method of ordinary least squares gives equal weight to all points based on the square of the deviation from best fit (variance). This method will tolerate small errors (residuals) in many points rather than larger errors (residuals) for a single point. Other best-fit models may work better for data sets where most points are good fits and a few points are outliers.

The generalized linear model method can be adapted to other types of data, such as categorical data. The method of logistic regression will be presented in the next article.

Author affiliation: Gilbert Berdine is a pulmonary physician in the Department of Internal Medicine at TTUHSC. Shengping Yang is a biostatistician in the Department of Pathology at TTUHSC.
Published electronically: 4/15/2014

References

1. http://en.wikipedia.org/wiki/Linear_regression
2. http://en.wikipedia.org/wiki/Pearson_product-moment_correlation_coefficient
3. http://www.statisticshowto.com/how-to-find-a-linear-regression-equation/
4. http://www.statisticshowto.com/how-to-compute-pearsons-correlation-coefficients/