This is a post that I found very useful from R-bloggers, thought it might be interesting to share with others.
Often times, a scatterplot reveals a pattern that seems not so linear. Polynomial regression can be used to explore a predictor at different levels of curvilinearity. This tutorial will demonstrate how polynomial regression can be used in a hierarchical fashion to best represent a dataset in R.
Tutorial Files
Before we begin, you may want to download the sample data (.csv) used in this tutorial. Be sure to right-click and save the file to your R working directory. Note that all code samples in this tutorial assume that this data has already been read into an R variable and has been attached. This dataset contains hypothetical student data that uses practice exam scores to predict final exam scores.
Scatterplot
The preceding scatterplot demonstrates that these data may not be linear. Notably, no one scored lower than 50 on the practice exam and at approximately the 85 and above practice mark, final exam scores taper off. These suggest that the data is curvilinear. Furthermore, since exam scores range between 0 to 100, it is not possible to observe nor appropriate to predict that an individual with a 150 practice score would have a certain final exam score.
Creating The Higher Order Variables
A two step process, identical to the one used to create interaction variables, can be followed to create higher order variables in R. First, the variables must be centered to mitigate multicollinearity. Second, the predictor must be multiplied by itself a certain number of times to create each higher order variable. In this tutorial, we will explore the a linear, quadratic, and cubic model. Therefore, the predictor will need to be squared to create the quadratic model and cubed to create the cubic model.
Step 1: Centering
To center a variable, simply subtract its mean from each data point and save the result into a new R variable, as demonstrated below.
> #center the independent variable
> FinalC <- Final - mean(Final)
> #center the predictor
> PracticeC <- Practice - mean(Practice)
Step 2: Multiplication
Once the input variable has been centered, the higher order terms can be created. Since a higher order variable is formed by the product of a predictor with itself, we can simply multiply our centered term from step one and save the result into a new R variable, as demonstrated below.
> #create the quadratic variable
> PracticeC2 <- PracticeC * PracticeC
> #create the cubic variable
> PracticeC3 <- PracticeC * PracticeC * PracticeC
Creating The Models
Now we have all of the pieces necessary to assemble our linear and curvilinear models.
> #create the models using lm(FORMULA, DATAVAR)
> #linear model
> linearModel <- lm(FinalC ~ PracticeC, datavar)
> #quadratic model
> quadraticModel <- lm(FinalC ~ PracticeC + PracticeC2, datavar)
> #cubic model
> cubicModel <- lm(FinalC ~ PracticeC + PracticeC2 + PracticeC3, datavar)
Evaluating The Models
As is the case in other forms of regression, it can be helpful to summarize and compare our potential models using the summary(MODEL) and anova(MODEL1, MODEL2,… MODELi) functions.
> #display summary information about the models
> summary(linearModel)
> summary(quadraticModel)
> summary(cubicModel)
#compare the models using ANOVA
anova(linearModel, quadraticModel, cubicModel)
The model summaries and ANOVA comparison chart are displayed below.
At this point we can compare the models. In this case, the quadratic and cubic terms are not statistically significant themselves nor are their models statistically significant beyond the linear model. However, in a real research study, there would be other practical considerations to make before deciding on a final model.
More On Interactions, Polynomials, and HLR
Certainly, much more can be done with these topics than I have covered in my tutorials. What I have provided is a basic discussion with guided examples. The regression topics covered in these tutorials can be mixed and matched to create exceedingly complex models. For example, multiple interactions and higher order variables could be contained in a single model. The good news is that more complex models can be created using the same techniques covered here. The basic principles remain the same.
Complete Polynomial Regression Example
To see a complete example of how polynomial regression models can be created in R, please download the polynomial regression example (.txt) file.
Tuesday, July 19, 2011
Monday, July 18, 2011
Fitting Polynomial Models to Data
http://www.facstaff.bucknell.edu/maneval/help211/fitting.html#evaluating
Computing statistical summaries of the fit
There are any number of statistical measures of the "quality" and "appropriateness" of a model fit to a set of data. This section shows how to compute some of the more common measures (coefficient of determination, observed f-value for the fit and the observed t-values/confidence intervals for the coefficients). Note that all of these measures are less informative than the by-eye views discussed above.
For what follows, it is assumed that you have a set of x-y data pairs and that you have used polyfit to compute the coefficients for a polynomial of given order (and have stored in a vector called coeff).
The coefficient of determination (also referred to as the R2 value) for the fit indicates the percent of the variation in the data that is explained by the model. This coefficient can be computed via the commands
ypred = polyval(coeff,x); % predictions
dev = y - mean(y); % deviations - measure of spread
SST = sum(dev.^2); % total variation to be accounted for
resid = y - ypred; % residuals - measure of mismatch
SSE = sum(resid.^2); % variation NOT accounted for
Rsq = 1 - SSE/SST; % percent of error explained
The closer that Rsq is to 1, the more completely the fitted model "explains" the data.
The observed f-statistic for the fit compares the "size" of the fraction of the data variation explained by the model to the "size of the variation unexplained by the model. The basis for this comparison is the ratio of the variances for the model and the error (residuals).
f = MSR/MSE
where MSR = SSR/dfr and MSE = SSE/dfe. The statistic is computed via the following commands (which assume the commands given above for the R2 computation have been executed)
SSR = SST - SSE; % the "ANOVA identity"
dfr =
dfe = length(x) - 1 - dfr; % degrees of freedom for error
MSE = SSE/dfe; % mean-square error of residuals
MSR = SSR/dfr; % mean-square error for regression
f = MSR/MSE; % f-statistic for regression
"Large" values of this f-statistic (typically > 6 but check an F(dfr,dfe)-table to be sure) indicate that the fit is significant.
The observed t-statistics for the coefficients indicate the level of significance for any one of the coefficients. This t-statistic is defined as the ratio of the value of the coefficient to its standard error. Hence, computation of these statistics requires computation of the standard errors.
The standard errors may be obtained from an alternate use of polyfit. Specifically, if polyfit is used to provide two outputs,
[coef,S] = polyfit(x,y,n)
the second output is a structure that contains three fields (as of version 5.2, at any rate):
S.R: an n-by-n matrix from the Q-R decomposition of the matrix used in computing the fit
S.df: the degrees of freedom for the residuals
S.normr: the 2-norm of the vector of the residuals for the fit
To compute the vector of standard errors of the coefficients and the observed t-values, use the commands
R = S.R; % The "R" from A = QR
d = (R'*R)\eye(n+1); % The covariance matrix
d = diag(d)'; % ROW vector of the diagonal elements
MSE = (S.normr^2)/S.df; % variance of the residuals
se = sqrt(MSE*d); % the standard errors
t = coef./se; % observed T-values
Note that a transpose is used when the diagonal elements are extracted from the covariance matrix. If this step is omitted, there will be a mismatch of dimension in the ./ step that follows because polyfit returns a row vector while diag returns a column vector.
A coefficient is usually significant if its t-value is 2 or better. To be specific, check a t-table at a selected level of significance and S.df degrees of freedom.
The confidence intervals of the coefficients are an alternative way of expressing the accuracy of results. The idea is to specify a level of confidence (a) and then use that to compute a t-statistic that will scale the standard error of the coefficient (see the discussion of the observed t-statistics, above) to define an interval about the predicted value of any of the coefficients predicted by polyfit.
The half-width of an a-%, 2-sided confidence interval is computed according to
wj = ta,dfe sej
where ta,dfe is the t-value determined using
P(T > t; dfe) = 1 - (a/100)/2
In words, this means find a value t such that the chance of finding ofther t-values greater than this value is (100 - a)/2 %. The computation reqires either a good T-table or a root find using an expression for the cumulative T-distribution. sej is the standard error for the jth coefficient and its computation is given above.
The confidence interval is then, [cj - wj, cj + wj].
Continuing the code from above, the half-widths can be computed simply from the commmands
tval =
width = tval*se; % half widths
ci = [coeff-width ; coef+width]; % confidence intervals
The lower and upper limits of the confidence intervals are stored in a 2-row vector, ci with the lower limits in the first row and the upper limits in the second row.
Computing statistical summaries of the fit
There are any number of statistical measures of the "quality" and "appropriateness" of a model fit to a set of data. This section shows how to compute some of the more common measures (coefficient of determination, observed f-value for the fit and the observed t-values/confidence intervals for the coefficients). Note that all of these measures are less informative than the by-eye views discussed above.
For what follows, it is assumed that you have a set of x-y data pairs and that you have used polyfit to compute the coefficients for a polynomial of given order (and have stored in a vector called coeff).
The coefficient of determination (also referred to as the R2 value) for the fit indicates the percent of the variation in the data that is explained by the model. This coefficient can be computed via the commands
ypred = polyval(coeff,x); % predictions
dev = y - mean(y); % deviations - measure of spread
SST = sum(dev.^2); % total variation to be accounted for
resid = y - ypred; % residuals - measure of mismatch
SSE = sum(resid.^2); % variation NOT accounted for
Rsq = 1 - SSE/SST; % percent of error explained
The closer that Rsq is to 1, the more completely the fitted model "explains" the data.
The observed f-statistic for the fit compares the "size" of the fraction of the data variation explained by the model to the "size of the variation unexplained by the model. The basis for this comparison is the ratio of the variances for the model and the error (residuals).
f = MSR/MSE
where MSR = SSR/dfr and MSE = SSE/dfe. The statistic is computed via the following commands (which assume the commands given above for the R2 computation have been executed)
SSR = SST - SSE; % the "ANOVA identity"
dfr =
dfe = length(x) - 1 - dfr; % degrees of freedom for error
MSE = SSE/dfe; % mean-square error of residuals
MSR = SSR/dfr; % mean-square error for regression
f = MSR/MSE; % f-statistic for regression
"Large" values of this f-statistic (typically > 6 but check an F(dfr,dfe)-table to be sure) indicate that the fit is significant.
The observed t-statistics for the coefficients indicate the level of significance for any one of the coefficients. This t-statistic is defined as the ratio of the value of the coefficient to its standard error. Hence, computation of these statistics requires computation of the standard errors.
The standard errors may be obtained from an alternate use of polyfit. Specifically, if polyfit is used to provide two outputs,
[coef,S] = polyfit(x,y,n)
the second output is a structure that contains three fields (as of version 5.2, at any rate):
S.R: an n-by-n matrix from the Q-R decomposition of the matrix used in computing the fit
S.df: the degrees of freedom for the residuals
S.normr: the 2-norm of the vector of the residuals for the fit
To compute the vector of standard errors of the coefficients and the observed t-values, use the commands
R = S.R; % The "R" from A = QR
d = (R'*R)\eye(n+1); % The covariance matrix
d = diag(d)'; % ROW vector of the diagonal elements
MSE = (S.normr^2)/S.df; % variance of the residuals
se = sqrt(MSE*d); % the standard errors
t = coef./se; % observed T-values
Note that a transpose is used when the diagonal elements are extracted from the covariance matrix. If this step is omitted, there will be a mismatch of dimension in the ./ step that follows because polyfit returns a row vector while diag returns a column vector.
A coefficient is usually significant if its t-value is 2 or better. To be specific, check a t-table at a selected level of significance and S.df degrees of freedom.
The confidence intervals of the coefficients are an alternative way of expressing the accuracy of results. The idea is to specify a level of confidence (a) and then use that to compute a t-statistic that will scale the standard error of the coefficient (see the discussion of the observed t-statistics, above) to define an interval about the predicted value of any of the coefficients predicted by polyfit.
The half-width of an a-%, 2-sided confidence interval is computed according to
wj = ta,dfe sej
where ta,dfe is the t-value determined using
P(T > t; dfe) = 1 - (a/100)/2
In words, this means find a value t such that the chance of finding ofther t-values greater than this value is (100 - a)/2 %. The computation reqires either a good T-table or a root find using an expression for the cumulative T-distribution. sej is the standard error for the jth coefficient and its computation is given above.
The confidence interval is then, [cj - wj, cj + wj].
Continuing the code from above, the half-widths can be computed simply from the commmands
tval =
width = tval*se; % half widths
ci = [coeff-width ; coef+width]; % confidence intervals
The lower and upper limits of the confidence intervals are stored in a 2-row vector, ci with the lower limits in the first row and the upper limits in the second row.
Subscribe to:
Posts (Atom)