What’s the Difference Between Linear Regression and Logistic Regression?


Linear regression and logistic regression are machine learning techniques that make predictions by analyzing historical data. For example, by looking at past customer purchase trends, regression analysis estimates future sales, so you can make more informed inventory purchases. Linear regression techniques mathematically model the relationship between an unknown factor and multiple known factors to estimate the exact unknown value. Similarly, logistic regression uses mathematics to find the relationships between two data factors. It then uses this relationship to predict the value of one of those factors based on the other. The prediction usually has a finite number of outcomes, like yes or no.

Read about linear regression »

Read about logistic regression »

Making predictions: linear regression vs. logistic regression

Both linear regression and logistic regression use mathematical modeling to predict the value of an output variable from one or more input variables. Output variables are dependent variables and input variables are independent variables.

Linear regression

Each independent variable is assumed to have a straight-line (linear) relationship with the dependent variable and, ideally, little relationship to the other independent variables. The dependent variable is typically a value from a range of continuous values.

This is the formula, or linear function, to create a linear regression model:

y = β0 + β1X1 + β2X2 + … + βnXn + ε

Here’s what each variable means:

  • y is the predicted dependent variable
  • β0 is the y-intercept, the value of y when all independent input variables equal 0
  • β1 is the regression coefficient of the first independent variable (X1), measuring that variable's impact on the dependent variable
  • βn is the regression coefficient of the last independent variable (Xn), when there are multiple input variables
  • ε is the model error

An example of linear regression is predicting a house price (dependent variable) based on the number of rooms, neighborhood, and age (independent variables).
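As a sketch of how such a model can be fitted, the following uses NumPy's least-squares solver on a small hypothetical dataset (the prices, column choices, and values are illustrative, not real housing data; the neighborhood variable is omitted to keep the sketch numeric):

```python
import numpy as np

# Hypothetical training data: [rooms, age in years] -> sale price in dollars.
X = np.array([
    [3, 20],
    [4, 15],
    [2, 30],
    [5, 5],
], dtype=float)
y = np.array([250_000, 320_000, 180_000, 410_000], dtype=float)

# Prepend a column of ones so the solver also estimates the intercept β0.
X1 = np.hstack([np.ones((X.shape[0], 1)), X])

# Least-squares estimate of [β0, β1, β2].
beta, *_ = np.linalg.lstsq(X1, y, rcond=None)

# Predict the price of a 4-room, 10-year-old house.
pred = beta[0] + beta[1] * 4 + beta[2] * 10
print(round(pred))
```

The solver finds the coefficients that minimize the squared error ε across the training rows, which is the standard way the linear formula above is fitted in practice.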

Logistic regression  

The value of the dependent variable is one of a finite set of categories, known as categorical variables. In its basic form, logistic regression performs binary classification, where the outcome is one of two categories, such as yes or no. Extensions like multinomial logistic regression handle more than two categories, such as the outcome from the roll of a six-sided die. This relationship is known as a logistic relationship.

The formula for logistic regression applies a logit transformation, or the natural logarithm of odds, to the probability of success or failure of a particular categorical variable.

y = e^(β0 + β1X1 + β2X2 + … + βnXn + ε) / (1 + e^(β0 + β1X1 + β2X2 + … + βnXn + ε))

Here’s what each variable means:

  • y gives the probability of success for the categorical dependent variable
  • e is Euler’s number, the base of the natural logarithm function ln(x); the exponential function e^x is the inverse of ln(x)
  • β0, β1X1, … βnXn have the same meanings as in the linear regression formula in the previous section

An example of logistic regression is predicting the chance of a house price being over $500,000 (dependent variable) based on the number of rooms, neighborhood, and age (independent variables).
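A minimal sketch of this calculation, using hand-picked, purely illustrative coefficients for the house example (the equivalent form 1 / (1 + e^(−z)) is used, which equals the e^z / (1 + e^z) expression in the formula above):

```python
import math

def logistic(z: float) -> float:
    """Map a linear combination z = β0 + β1x1 + ... to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical coefficients: β0 (intercept), β1 (rooms), β2 (age in years).
b0, b1, b2 = -6.0, 1.5, -0.05

# Probability that a 4-room, 10-year-old house sells for over $500,000.
z = b0 + b1 * 4 + b2 * 10
p = logistic(z)
print(f"P(price > $500k) = {p:.2f}")
```

Whatever the value of z, the output stays strictly between 0 and 1, which is what makes the function suitable for expressing probabilities.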

What are the similarities between linear regression and logistic regression?

Linear regression and logistic regression share some commonalities and have similar broad-ranging application spaces.

Statistical analysis

Logistic and linear regression are both forms of statistical or data analysis, and come under the field of data science. Both use mathematical modeling to relate a set of independent or known variables with dependent variables. You can represent both logistic regression and linear regression as mathematical equations. You can also represent the model on a graph.

Machine learning techniques

Both linear regression and logistic regression models find use in supervised machine learning.

Supervised machine learning involves training a model on labeled datasets, where both the dependent and independent variables are known and gathered by human researchers. By fitting the model to this known historical data, the training process estimates the coefficients of the mathematical equation. The trained model can then accurately predict unknown dependent variables from known independent variables.

Supervised learning differs from unsupervised learning, where the data isn’t labeled.
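The training process above can be sketched in a few lines: gradient descent on a labeled dataset recovers the unknown coefficients of a linear equation (here the synthetic rule y = 2x + 1, chosen only for illustration):

```python
# Labeled dataset: each x is paired with its known y value.
data = [(x, 2 * x + 1) for x in range(10)]

w, b = 0.0, 0.0   # unknown coefficients, to be learned
lr = 0.01         # learning rate

for _ in range(5000):
    # Gradients of the mean squared error with respect to w and b.
    grad_w = sum(2 * (w * x + b - y) * x for x, y in data) / len(data)
    grad_b = sum(2 * (w * x + b - y) for x, y in data) / len(data)
    w -= lr * grad_w
    b -= lr * grad_b

print(round(w, 2), round(b, 2))  # close to the true values 2 and 1
```

This is the "reverse-engineering" described above: the equation's form is fixed in advance, and training only finds the coefficient values that best explain the labeled examples.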

Read about machine learning »

Training difficulty

Both logistic regression and linear regression require a significant amount of labeled data for the models to become accurate in predictions. This can be an arduous task for humans. For example, if you want to label whether an image contains a car, all images must have tags of variables like car sizes, photo angles, and obstructions. 

Limited prediction accuracy

A statistical model that fits the input data to the output data does not necessarily imply a causal relationship between the dependent and independent variable. For both logistic regression and linear regression, correlation is not causation.

To use the house pricing example from the previous section, suppose the homeowner’s name joins the list of independent variables, and in the historical data the name John Doe happens to correlate with lower sale prices. Both linear regression and logistic regression would then predict lower house prices whenever an owner is named John Doe, even though logic says the owner’s name has no real effect on the price.

Key differences: linear regression vs. logistic regression

Logistic regression and linear regression are most different in their mathematical approaches.

Output value

The linear regression output is a value on a continuous scale, such as a distance, price, or weight.

In contrast, the logistic regression model output value is the probability of a fixed categorical event occurring. For example, 0.76 might mean a 76% chance of wearing a blue shirt, and 0.22 might mean a 22% chance of voting yes.

Variable relationship

In regression analysis, a regression line is the shape of the graph line representing the relationship between each independent variable and the dependent variable.

In linear regression, the regression line is straight. Any changes to an independent variable have a direct effect on the dependent variable.

In logistic regression, the regression line is an S-shaped curve, also known as a sigmoid curve.

Mathematical distribution type

Linear regression assumes that the dependent variable (more precisely, the model error) follows a normal or Gaussian distribution. A normal distribution is depicted as a continuous bell-shaped curve on a graph.

Logistic regression follows a binomial distribution, which is typically depicted as a bar graph.

When to use linear regression vs. logistic regression

You can use linear regression when you want to predict a continuous dependent variable from a scale of values. Use logistic regression when you expect a binary outcome (for example, yes or no).
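The distinction can be seen side by side in a small sketch with illustrative, hand-picked coefficients: the linear model returns a continuous value, while the logistic model returns a probability that a threshold turns into a yes/no answer:

```python
import math

def linear(x: float, beta0: float = 50_000.0, beta1: float = 75_000.0) -> float:
    """Linear regression: continuous output (e.g. a price in dollars)."""
    return beta0 + beta1 * x

def logistic(x: float, beta0: float = -3.0, beta1: float = 1.0) -> float:
    """Logistic regression: probability of a binary outcome."""
    return 1.0 / (1.0 + math.exp(-(beta0 + beta1 * x)))

rooms = 4
price = linear(rooms)                # a value on a continuous scale
p = logistic(rooms)                  # a probability in (0, 1)
label = "yes" if p >= 0.5 else "no"  # binary decision via a 0.5 threshold
print(price, round(p, 2), label)
```

The same input produces two very different kinds of answer, which is the practical basis for choosing one technique over the other.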

Here are examples of linear regression: 

  • Predicting the height of an adult based on the mother’s and father’s height
  • Predicting pumpkin sales volume based on the price, time of year, and store location
  • Predicting the price of an airline ticket based on origin, destination, time of year, and airline
  • Predicting the number of social media likes based on the poster, their number of organic followers, the post’s content, and the time of day posted

Here are examples of logistic regression:

  • Predicting if a person will get heart disease based on BMI, smoking status, and genetic predisposition
  • Predicting which retail clothing items will be most popular based on color, size, type, and price
  • Predicting if an employee will quit in that year based on pay rate, days in the office, number of meetings, number of emails sent, team, and tenure
  • Predicting which sales team members will have more than $1 million in contracts in a year based on previous year sales, tenure, and commission rate

Summary of differences: linear regression vs. logistic regression

|  | Linear regression | Logistic regression |
| --- | --- | --- |
| What is it? | A statistical method to predict an output value from a set of input values. | A statistical method to predict the probability that an output value belongs to a certain category, based on a set of input values. |
| Relationship | Linear relationship, represented by a straight line. | Logistic (sigmoidal) relationship, represented by an S-shaped curve. |
| Equation | Linear. | Logistic, based on a logit (log-odds) transformation. |
| Type of supervised learning | Regression. | Classification. |
| Distribution type | Normal/Gaussian. | Binomial. |
| Best suited for | Tasks requiring a predicted continuous dependent variable from a scale. | Tasks requiring the predicted likelihood of a categorical dependent variable from a fixed set of categories. |

How can you run linear regression and logistic regression analysis on AWS?

You can run linear and logistic regression analysis on Amazon Web Services (AWS) using Amazon SageMaker.

SageMaker is a fully managed machine learning service with built-in algorithms for both linear regression and logistic regression, alongside many other statistical and machine learning tools. You can implement linear regression with as many input variables as you need, or solve classification problems with logistic regression's probability models.

For example, here’s how you can benefit when you use SageMaker:

  • Prepare, build, train, and deploy regression models quickly
  • Remove the heavy lifting from each step of the linear and logistic regression process and develop high-quality regression models
  • Access all the components required for regression analysis in a single tool set to get models to production faster, easier, and more affordably

Get started with regression analysis on AWS by creating an account today.