What is simple linear regression?

Simple linear regression finds the best-fit straight line through a set of (x,y) data points. It calculates the slope (m) and intercept (b) of the line y = mx + b that minimizes the sum of squared vertical distances from the data points to the line.

How do I calculate the slope in linear regression?

The slope m = (n*sum(xy) - sum(x)*sum(y)) / (n*sum(x^2) - (sum(x))^2), where n is the number of data pairs. The slope represents the change in Y for each one-unit increase in X.

What does R-squared mean in linear regression?

R-squared (R²) measures the proportion of variance in Y that is explained by X. R² = 1 is a perfect fit. R² = 0 means X has no linear predictive power for Y. For example, R² = 0.75 means 75% of the variation in Y is explained by the regression line.

What is a good R-squared value?

It depends on the field. In physical sciences, R² above 0.95 is often expected. In social sciences and economics, R² of 0.3 to 0.6 is considered meaningful. There is no universal threshold; context and purpose determine what is acceptable.

What is the difference between R and R-squared?

Pearson r is the correlation coefficient ranging from -1 to +1, indicating strength and direction of the linear relationship. In simple linear regression, R-squared is r^2, ranging from 0 to 1, indicating the proportion of variance explained. R-squared is never negative, regardless of whether the relationship is positive or negative.

Can I use linear regression for prediction?

Yes. Once you have the equation y = mx + b, enter any X value to predict the corresponding Y. However, predicting outside the range of your data (extrapolation) is risky because the linear relationship may not hold beyond the observed range.

What are residuals in linear regression?

A residual is the difference between the actual Y value and the predicted Y value from the regression line: residual = y - (mx + b). Small residuals indicate a good fit. Examining residuals helps verify the assumptions of linear regression.

Does linear regression require normally distributed data?

Linear regression assumes the residuals are approximately normally distributed, not the raw data. If residuals show clear patterns or non-normality, the regression assumptions may be violated and results should be interpreted cautiously.

Linear Regression Calculator | Slope, Intercept, R² and Equation

X	Y (actual)	ŷ (predicted)	Residual
1	45	47.5833	-2.5833
2	55	54.0595	+0.9405
3	60	60.5357	-0.5357
4	70	67.0119	+2.9881
5	75	73.4881	+1.5119
6	80	79.9643	+0.0357
7	85	86.4405	-1.4405
8	92	92.9167	-0.9167

What Is Linear Regression?

Simple linear regression finds the straight line that best fits a set of data points. It models the relationship between one independent variable (X) and one dependent variable (Y) as a linear equation: y = mx + b. This line minimizes the sum of squared vertical distances from each data point to the line, known as the least squares criterion.

Linear regression is one of the most widely used statistical methods in science, economics, engineering, medicine, and machine learning. It answers questions like: Does study time predict exam scores? Does advertising spend drive sales? How does temperature affect crop yield?

The Linear Regression Formulas

Given n pairs of (x, y) data points, the slope and intercept are calculated as:

\[ m = \frac{n \sum xy - \sum x \sum y}{n \sum x^2 - (\sum x)^2} \]

\[ b = \frac{\sum y - m \sum x}{n} \]

The resulting line y = mx + b passes through the point (x̄, ȳ), the centroid of the data. The slope m represents the change in Y for each one-unit increase in X. The intercept b is the predicted value of Y when X equals zero.
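As a minimal sketch, the two formulas above translate almost directly into Python. The function name `fit_line` is an illustrative choice, not something defined on this page, and the data are the eight (X, Y) pairs from the table near the top of the page.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through (xs, ys)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need at least two (x, y) pairs of equal length")
    sum_x = sum(xs)
    sum_y = sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    denom = n * sum_x2 - sum_x ** 2
    if denom == 0:
        raise ValueError("all x values are identical; slope is undefined")
    m = (n * sum_xy - sum_x * sum_y) / denom
    b = (sum_y - m * sum_x) / n
    return m, b


# Data from the table near the top of this page.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [45, 55, 60, 70, 75, 80, 85, 92]

m, b = fit_line(xs, ys)
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

print(f"slope m = {m:.4f}, intercept b = {b:.4f}")
# The fitted line should pass through the centroid (x̄, ȳ).
print(f"line at x̄: {m * x_bar + b:.4f}, ȳ: {y_bar:.4f}")
```

For this data the sketch gives a slope of about 6.4762 and an intercept of about 41.1071, which agrees with the predicted values in that table (for example, 41.1071 + 6.4762 × 1 ≈ 47.5833).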

R-Squared: How Well Does the Line Fit?

The Process Modeling chapter of the NIST/SEMATECH e-Handbook of Statistical Methods is a widely used reference for regression diagnostics. In line with its guidance, R² should always be interpreted alongside residual plots and coefficient significance, not used in isolation.

The coefficient of determination (R²) measures what proportion of the variance in Y is explained by X:

\[ R^2 = 1 - \frac{\text{SS}_{res}}{\text{SS}_{tot}} \]

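where SS_res is the sum of squared residuals (each actual value minus its predicted value, squared and summed) and SS_tot is the total sum of squares (each actual value minus the mean of Y, squared and summed). For a least-squares line with an intercept, R² ranges from 0 to 1: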

R² = 0: X explains none of the variation in Y
R² = 0.5: X explains 50% of the variation in Y
R² = 1: Perfect linear fit, X explains all variation in Y

Interpreting R² depends on the field. In physics, R² below 0.99 might indicate a poor fit. In social sciences, R² of 0.3 might represent a meaningful relationship.
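The R² formula above can be sketched in a few lines of Python. The function name `r_squared` is hypothetical, and the actual and predicted values are copied from the table near the top of this page, so the result carries that table's four-decimal rounding.

```python
def r_squared(ys, predictions):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    y_bar = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, predictions))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    if ss_tot == 0:
        raise ValueError("all y values are identical; R² is undefined")
    return 1 - ss_res / ss_tot


# Actual and predicted Y values from the table near the top of this page.
ys = [45, 55, 60, 70, 75, 80, 85, 92]
predictions = [47.5833, 54.0595, 60.5357, 67.0119,
               73.4881, 79.9643, 86.4405, 92.9167]

r2 = r_squared(ys, predictions)
print(f"R² = {r2:.4f}")
```

For that table the sketch gives an R² of roughly 0.988, meaning the line accounts for nearly all of the variation in Y.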

Pearson Correlation Coefficient

In simple linear regression, the Pearson r is the square root of R², with its sign matching the sign of the slope. It measures the strength and direction of the linear relationship, ranging from -1 (perfect negative) to +1 (perfect positive). Values near 0 indicate little or no linear relationship.
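A short sketch can make the sign behaviour concrete. The function name `pearson_r` is an illustrative choice. The rising series is the data from the table near the top of this page, while the falling series is simply that data reversed, constructed here for illustration only.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_yy = sum((y - y_bar) ** 2 for y in ys)
    return s_xy / math.sqrt(s_xx * s_yy)


xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys_up = [45, 55, 60, 70, 75, 80, 85, 92]  # table data: Y rises with X
ys_down = list(reversed(ys_up))           # constructed: Y falls with X

r_up = pearson_r(xs, ys_up)
r_down = pearson_r(xs, ys_down)

print(f"r (rising)  = {r_up:+.4f}, r squared = {r_up ** 2:.4f}")
print(f"r (falling) = {r_down:+.4f}, r squared = {r_down ** 2:.4f}")
```

The two series give r values of equal size and opposite sign, while r squared is the same for both, which is why R² alone cannot tell you the direction of a relationship.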

R-Squared Interpretation by Field

What counts as a "good" R² depends entirely on the subject area. The table below shows typical R² ranges and what they indicate across common disciplines.

R² Value	General Interpretation	Typical in These Fields
0.90 – 1.00	Excellent fit	Physics, engineering, controlled lab experiments
0.70 – 0.89	Strong fit	Economics (macro models), environmental science
0.50 – 0.69	Moderate fit	Business forecasting, epidemiology
0.30 – 0.49	Weak but meaningful	Psychology, social sciences, behavioral research
0.10 – 0.29	Low, but may be useful	Genetics, large-scale human population studies
0.00 – 0.09	Very weak or no linear relationship	No field; revisit the model

A low R² does not mean the regression is worthless. If the slope is statistically significant and the coefficient is meaningful in context, a regression with R² = 0.2 can still provide valuable insights, particularly in social science where human behavior is inherently variable.

Worked Example: Linear Regression Step by Step

A student wants to know whether hours studied predicts exam score. Data from 5 students:

Hours studied (X)	Exam score (Y)	X²	XY
1	50	1	50
2	60	4	120
3	65	9	195
4	75	16	300
5	80	25	400
ΣX = 15	ΣY = 330	ΣX² = 55	ΣXY = 1065

Step 1: Calculate the slope: m = (n·ΣXY − ΣX·ΣY) / (n·ΣX² − (ΣX)²) = (5×1065 − 15×330) / (5×55 − 15²) = (5325 − 4950) / (275 − 225) = 375 / 50 = 7.5

Step 2: Calculate the intercept: b = (ΣY − m·ΣX) / n = (330 − 7.5×15) / 5 = (330 − 112.5) / 5 = 217.5 / 5 = 43.5

Regression equation: y = 7.5x + 43.5

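Interpretation: Each additional hour of study is associated with a 7.5-point increase in exam score. A student who studies 0 hours is predicted to score 43.5 (the intercept). A student studying 6 hours: y = 7.5(6) + 43.5 = 88.5. Both 0 and 6 hours fall just outside the observed range of 1 to 5 hours, so these two predictions are mild extrapolations.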
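The worked example stops at the equation, but the R² formula from earlier on this page can be applied to the same five students as a check on fit. The predicted scores and sums of squares below are computed here from y = 7.5x + 43.5 and the mean score ȳ = 330 / 5 = 66. They are not figures from the original example.

\[ \hat{y} = 51,\ 58.5,\ 66,\ 73.5,\ 81 \qquad y - \hat{y} = -1,\ 1.5,\ -1,\ 1.5,\ -1 \]

\[ \text{SS}_{res} = 1 + 2.25 + 1 + 2.25 + 1 = 7.5 \]

\[ \text{SS}_{tot} = (50-66)^2 + (60-66)^2 + (65-66)^2 + (75-66)^2 + (80-66)^2 = 256 + 36 + 1 + 81 + 196 = 570 \]

\[ R^2 = 1 - \frac{7.5}{570} \approx 0.987 \]

On this small data set, hours studied accounts for roughly 98.7% of the variation in exam score.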

Assumptions and When Linear Regression Applies

The Khan Academy linear regression guide covers the four core assumptions: linearity, independence, homoscedasticity, and normality of residuals. In practice, violating these assumptions is the most common reason a regression model produces misleading results despite a high R².

Linear regression produces reliable results only when its assumptions are met. Violating them does not always ruin the analysis, but it can make the results misleading. After fitting the regression line, assess model precision with our margin of error calculator to quantify the uncertainty around each predicted value. The table below lists each assumption and how to check it:

Assumption	What it means	How to check
Linearity	The relationship between X and Y is actually linear	Plot X vs Y; look for a straight-line pattern, not a curve
Independence	Each observation is independent of the others	Consider the data collection process; time series data is often not independent
Homoscedasticity	Residuals have constant variance across all X values	Plot residuals vs fitted values; look for a random scatter, not a funnel shape
Normality of residuals	Residuals are approximately normally distributed	Q-Q plot of residuals, or Shapiro-Wilk test
No extreme outliers	No single point dominates the regression line	Check leverage and Cook's distance for influential points
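The last row of the table mentions leverage and Cook's distance. As a rough sketch, both can be computed for a simple linear regression with the standard textbook formulas. The function name `influence_measures` is hypothetical, and the slope is computed in its centered form, which is algebraically equivalent to the summation formula given earlier.

```python
def influence_measures(xs, ys):
    """Return (leverages, cooks_distances) for a simple linear regression."""
    n = len(xs)
    p = 2  # fitted parameters: slope and intercept
    if n <= p:
        raise ValueError("need more than two data points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    s_xx = sum((x - x_bar) ** 2 for x in xs)
    s_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    m = s_xy / s_xx
    b = y_bar - m * x_bar

    residuals = [y - (m * x + b) for x, y in zip(xs, ys)]
    # Residual variance estimate with n - p degrees of freedom.
    s2 = sum(e ** 2 for e in residuals) / (n - p)
    if s2 == 0:
        raise ValueError("perfect fit; Cook's distance is undefined")

    # Leverage: how far each x sits from the mean of x.
    leverages = [1 / n + (x - x_bar) ** 2 / s_xx for x in xs]
    # Cook's distance combines residual size and leverage.
    cooks = [
        (e ** 2 / (p * s2)) * (h / (1 - h) ** 2)
        for e, h in zip(residuals, leverages)
    ]
    return leverages, cooks


# Data from the table near the top of this page.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [45, 55, 60, 70, 75, 80, 85, 92]

leverages, cooks = influence_measures(xs, ys)
for x, h, d in zip(xs, leverages, cooks):
    print(f"x={x}: leverage={h:.3f}, Cook's D={d:.3f}")
```

In this sketch the first point (X = 1) comes out with the largest Cook's distance, because it combines the highest leverage with a fairly large residual. A common rule of thumb flags points with Cook's distance above 1 (or above 4/n) for a closer look. These cutoffs are conventions, not formal tests.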

The Most Common Linear Regression Mistakes

A review of the most upvoted Quora and r/statistics threads on linear regression shows the same errors appearing repeatedly. The Statistics By Jim guide to OLS assumptions covers each of these in detail. With that in mind, the most consequential mistake is not checking assumptions before trusting regression output.

Confusing correlation with causation. A high R² and a significant slope only show that X and Y are linearly associated, not that X causes Y. Ice cream sales correlate strongly with drowning rates (both peak in summer), but ice cream does not cause drowning. Always consider confounding variables before drawing causal conclusions.

Extrapolating beyond the data range. A regression line is only valid within the range of X values used to build it. Predicting a student's exam score for 20 hours of study using a model built on 1–5 hours of data produces unreliable results. The relationship may not remain linear at extreme values.
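One way to guard against this mistake is to have the prediction step report when an X value falls outside the observed range. The sketch below is a minimal illustration. The function name `predict_with_range_check` is hypothetical, and the equation and range come from the worked example (y = 7.5x + 43.5, X from 1 to 5).

```python
def predict_with_range_check(x, m, b, x_min, x_max):
    """Return (predicted_y, is_extrapolation) for the line y = m*x + b."""
    y_hat = m * x + b
    is_extrapolation = x < x_min or x > x_max
    return y_hat, is_extrapolation


# Equation and observed X range from the worked example.
M, B = 7.5, 43.5
X_MIN, X_MAX = 1, 5

for hours in (3, 20):
    score, outside = predict_with_range_check(hours, M, B, X_MIN, X_MAX)
    note = "extrapolation, treat with caution" if outside else "within observed range"
    print(f"{hours} hours -> predicted score {score} ({note})")
```

For 20 hours the line predicts 193.5, far above the highest score in the data (80) and, if the exam is scored out of 100 (an assumption), impossible. That is the kind of result the range check is meant to flag.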

Ignoring non-linear relationships. A low R² sometimes means the relationship is real but not linear. Always plot the data first. A U-shaped or exponential pattern requires a different model; forcing a straight line through curved data produces a misleading regression. For a single-number measure of overall model fit, combine regression output with our mean squared error calculator to put a precise error score on prediction accuracy.
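To illustrate the point about curved data, the sketch below fits a straight line to a made-up U-shaped data set in which Y is exactly X squared. The data and the function name `fit_and_r_squared` are invented for illustration and do not come from a real study.

```python
def fit_and_r_squared(xs, ys):
    """Fit y = m*x + b by least squares and return (m, b, r_squared)."""
    n = len(xs)
    sum_x, sum_y = sum(xs), sum(ys)
    sum_xy = sum(x * y for x, y in zip(xs, ys))
    sum_x2 = sum(x * x for x in xs)
    m = (n * sum_xy - sum_x * sum_y) / (n * sum_x2 - sum_x ** 2)
    b = (sum_y - m * sum_x) / n

    y_bar = sum_y / n
    ss_res = sum((y - (m * x + b)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - y_bar) ** 2 for y in ys)
    return m, b, 1 - ss_res / ss_tot


# Made-up U-shaped data: Y is fully determined by X, but not linearly.
xs = [-3, -2, -1, 0, 1, 2, 3]
ys = [x ** 2 for x in xs]

m, b, r2 = fit_and_r_squared(xs, ys)
print(f"slope = {m:.3f}, intercept = {b:.3f}, R² = {r2:.3f}")
```

The fitted slope and R² both come out as zero even though Y depends entirely on X. A scatter plot would reveal the curve immediately, which is why plotting comes before fitting.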
